ISCApad #290 |
Saturday, August 06, 2022 by Chris Wellekens |
8-1 | Omid Ghahabi, 'Deep Learning for i-Vector Speaker and Language Recognition'
Email: omid.ghahabi@eml.org
Omid Ghahabi completed his PhD thesis, entitled 'Deep Learning for i-Vector Speaker and Language Recognition', at the Universitat Politecnica de Catalunya (UPC), Barcelona, Spain. The thesis was supervised by Prof. Javier Hernando at the TALP Research Center, Department of Signal Theory and Communications.
Link to the document: https://theses.eurasip.org/theses/798/deep-learning-for-i-vector-speaker-and-language/
8-2 | Neeraj Kumar Sharma, 'Information-rich Sampling of Time-varying Signals'
Thesis Author: Neeraj Kumar Sharma
Current Affiliation:
Post-Doctoral Fellow
Carnegie Mellon University
Pittsburgh 15213, USA
E-mail: neerajww@gmail.com
URL: neerajww.github.io
PhD Granting Institution:
Dept. of Electrical Communication
Engineering (ECE)
Indian Institute of Science
Bangalore 560012, India
Thesis Advisor:
Dr. Thippur V. Sreenivas
Professor, Dept. ECE
Indian Institute of Science
Bangalore 560012, India
E-mail: tvsree@iisc.ac.in
Thesis title: Information-rich Sampling of Time-varying Signals
Abstract: This thesis investigates three fundamental concepts of interest in time-varying signal analysis: sampling, modulations, and modeling.
The underlying goal is speech/audio signal processing, and the motivation is drawn from how these information-rich signals are represented in the human auditory system. The rich information content of speech naturally requires the signals to be highly time-varying, as is evident in joint time-frequency representations such as the short-time Fourier transform. Although the theoretical bandwidth of such time-varying signals is infinite, the auditory nerves are known to carry only low-rate sampled information of these signals to the cortex, which nevertheless obtains a rich representation of them. Thus, it may be unnecessary to sample the signals at a uniform Nyquist rate, as is done in all present-day technology applications. Further, the present-day quasi-stationary models of speech/audio, based on a linear time-invariant system, may be inadequate. Instead of these models, the thesis explores signal decomposition using time-varying signal components, namely the amplitude and frequency modulations (AM-FM). The contributions are presented in three parts, which together suggest alternative techniques for fine spectro-temporal analysis of time-varying signals.
In part 1, the thesis analyzes non-uniform event-triggered samples, namely zero-crossings (ZCs) and extrema of the signal. The extrema are the ZCs of the signal's first derivative; similarly, the ZCs of the d-th derivative of the signal are denoted HoZC-d. Using the sparse signal reconstruction approach, the HoZCs for different 'd' are compared for their efficiency in reconstructing the signal under different signal models. It is found that HoZC-1 outperforms the others, and that a combination of HoZC-1 and HoZC-2 provides acceptable reconstruction.
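To make the event-triggered samples concrete, the sketch below locates ZCs and HoZC-d samples of a discrete signal using finite differences as a stand-in for the derivatives. The function names and the finite-difference approximation are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np

def zero_crossings(x):
    """Indices where the signal changes sign (event-triggered samples)."""
    return np.flatnonzero(np.diff(np.signbit(x)))

def hozc(x, d):
    """ZCs of the d-th finite-difference derivative (HoZC-d, illustrative).

    d = 0 gives the plain ZCs of the signal; d = 1 gives the extrema
    locations (ZCs of the first derivative)."""
    return zero_crossings(np.diff(x, n=d))

# Example: one period of a sine sampled at 100 points.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
x = np.sin(t)
print(len(hozc(x, 0)))  # sign changes of the signal itself
print(len(hozc(x, 1)))  # extrema: ZCs of the first derivative
```

For one sine period the counts match intuition: one interior sign change of the signal, and two extrema (the peak and the trough).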
In part 2, analyzing an AM-FM signal, it is shown that extrema samples (HoZC-1) are better than ZCs or LCs for estimating the AM and FM components through local polynomial regression. Similarly, HoZC-1 can provide better AM-FM estimation of sub-band speech, moving-source Doppler signals, etc., compared to DESA-1 and the analytic signal approach, with the additional benefit of sub-sampling. Extending the analysis to arbitrary multi-component AM-FM signals, it is shown that the successive derivative operation aids in separating the highest-FM component as the dominant AM-FM component among the multiple components. This is referred to as the 'dominant instantaneous frequency principle' and is used for sequential estimation of the individual mono-component AM-FM signals in the multi-component mixture.
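For reference, the analytic-signal approach that the thesis compares against can be sketched as follows: the Hilbert transform yields a complex analytic signal whose magnitude estimates the AM and whose unwrapped phase derivative estimates the FM. The test signal and all parameter values here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000.0
t = np.arange(0, 0.5, 1 / fs)
# Illustrative mono-component AM-FM signal: slow envelope, vibrato-like FM.
am = 1.0 + 0.3 * np.cos(2 * np.pi * 3 * t)
phase = 2 * np.pi * 500 * t + 4.0 * np.sin(2 * np.pi * 5 * t)
x = am * np.cos(phase)

z = hilbert(x)                       # analytic signal
am_hat = np.abs(z)                   # instantaneous amplitude (AM estimate)
if_hat = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)  # inst. freq., Hz

# Away from the edges, the estimates track the true components closely.
mid = slice(200, -200)
print(np.max(np.abs(am_hat[mid] - am[mid])))
```

Note that this baseline uses all the uniform Nyquist-rate samples, whereas the HoZC-1 approach of the thesis works from the sparser extrema samples alone.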
Part 3, focusing on speech signals, revisits time-varying sinusoidal modeling of speech and proposes an alternative model estimation approach. The estimation operates on the whole signal without any short-time analysis. The approach proceeds by extracting the fundamental frequency sinusoid (FFS) from the speech signal. The instantaneous amplitude (IA) of the FFS is used for voiced/unvoiced stream segregation. The voiced stream is then demodulated using a variant of in-phase and quadrature-phase demodulation carried out at harmonics of the FFS. The result is a non-parametric time-varying sinusoidal representation, specifically an additive mixture of quasi-harmonic sinusoids for the voiced stream and a wideband mono-component sinusoid for the unvoiced stream. The representation is evaluated for analysis-synthesis, and the bandwidths of the IA and IF signals are found to be crucial for preserving quality. The obtained IA and IF signals are also found to be carriers of perceived speech attributes, such as speaker characteristics and intelligibility. Compared with the existing approaches, which operate on short-time segments, the proposed modeling framework improves on simplicity of implementation, objective scores, and computation time. The listening test scores suggest that the synthesis preserves naturalness but does not yet beat the state-of-the-art short-time analysis methods. In summary, the proposed representation lends itself to high-resolution temporal analysis of non-stationary speech signals, and also allows quality-preserving modification and synthesis.
URL: https://drive.google.com/open?id=17Olne0RBkVHRd2HcmJc0f44e17m47NZB
8-3 | Daniel Michelsanti, 'Audio-Visual Speech Enhancement Based on Deep Learning'
Title of the thesis: Audio-Visual Speech Enhancement Based on Deep Learning
Date and place of the defense: 18 December 2020, Aalborg, Denmark
Advisors: Prof. Zheng-Hua Tan and Prof. Jesper Jensen
University: Aalborg University