ISCA - International Speech
Communication Association


ISCApad #194

Monday, August 04, 2014 by Chris Wellekens

3-1-14 (2014-09-14) INTERSPEECH 2014 Tutorials
--- September 14-18, 2014

The INTERSPEECH 2014 Organising Committee is pleased to announce
the following 8 tutorials, presented by distinguished speakers
at the conference. They will be offered on Sunday, 14 September 2014.
All tutorials will be three (3) hours long and require
an additional registration fee (separate from the conference registration fee).

    • Non-speech acoustic event detection and classification
    • Contribution of MRI to Exploring and Modeling Speech Production
    • Computational Models for Audiovisual Emotion Perception
    • The Art and Science of Speech Feature Engineering
    • Recent Advances in Speaker Diarization
    • Multimodal Speech Recognition with the AusTalk 3D Audio-Visual Corpus
    • Semantic Web and Linked Big Data Resources for Spoken Language Processing
    • Speech and Audio for Multimedia Semantics


Additionally, the ISCSLP 2014 Organising Committee welcomes 
the INTERSPEECH 2014 delegates to join the 4 ISCSLP tutorials 
which will be offered on Saturday, 13 September 2014.

    • Adaptation Techniques for Statistical Speech Recognition
    • Emotion and Mental State Recognition: Features, Models, System Applications and Beyond
    • Unsupervised Speech and Language Processing via Topic Models
    • Deep Learning for Speech Generation and Synthesis

More information available at:

Tutorials Description

T1: Non-speech acoustic event detection and classification

    The research in audio signal processing has been dominated by speech research, 
    but most of the sounds in our real-life environments are actually non-speech 
    events such as cars passing by, wind, warning beeps, and animal sounds. 
    These acoustic events contain much information about the environment and physical
    events that take place in it, enabling novel application areas such as safety,
    health monitoring and investigation of biodiversity. But while recent years 
    have seen wide-spread adoption of applications such as speech recognition and 
    song recognition, generic computer audition is still in its infancy.

    Non-speech acoustic events have several fundamental differences to speech, 
    but many of the core algorithms used by speech researchers can be leveraged 
    for generic audio analysis. The tutorial is a comprehensive review of the field 
    of acoustic event detection as it currently stands. The goal of the tutorial is to
    foster interest in the community, highlight the challenges and opportunities,
    and provide a starting point for new researchers. We will discuss what acoustic
    event detection entails and its commonalities and differences with speech processing,
    such as the large variation in sounds and the possible overlap with other sounds.
    We will then discuss basic experimental and algorithm design, including descriptions 
    of available databases and machine learning methods. We will then discuss more 
    advanced topics such as methods to deal with temporally overlapping sounds and 
    modelling the relations between sounds. We will finish with a discussion of 
    avenues for future research.

    Organizers: Tuomas Virtanen and Jort F. Gemmeke

T2: Contribution of MRI to Exploring and Modeling Speech Production

    Magnetic resonance imaging (MRI) provides us with a magic vision to look into
    the human body in various ways, not only with static imaging but also with
    motion imaging. MRI has been a powerful technique for speech research to 
    study finer anatomy of the speech organs or to visualize true vocal tracts 
    in three dimensions. Inherent problems of slow image acquisition for speech 
    tasks or insufficient signal-to-noise ratio for microscopic observation have
    driven researchers to search for task-specific imaging techniques.
    The recent advances of the 3-Tesla technology suggest more practical solutions 
    to broader applications of MRI by overcoming previous technical limitations. 
    In this joint tutorial in two parts, we summarize our previous effort to accumulate 
    scientific knowledge with MRI and to advance speech modeling studies for future 
    development. Part I, given by Kiyoshi Honda, introduces how to visualize the 
    speech organs and vocal tracts by presenting techniques and data for finer static 
    imaging, synchronized motion imaging, surface marker tracking, real-time imaging, 
    and vocal-tract mechanical modeling. Part II, presented by Jianwu Dang, focuses on
    applications of MRI for phonetics of Mandarin vowels, acoustics of the vocal tracts 
    with side branches, analysis and simulation in search of talker characteristics, 
    physiological modeling of the articulatory system, and motor control paradigm 
    for speech articulation.

    Organizers: Kiyoshi HONDA and Jianwu DANG

T3: Computational Models for Audiovisual Emotion Perception

    In this tutorial we will explore engineering approaches to understanding human 
    emotion perception, focusing both on modeling and application. We will highlight 
    both current and historical trends in emotion perception modeling, focusing on 
    both psychological and engineering-driven theories of perception 
    (statistical analyses, data-driven computational modeling, and implicit sensing). 
    The importance of this topic can be appreciated from both an engineering viewpoint
    and a psychological perspective. Any system that either models human behavior or
    interacts with human partners must understand emotion perception, as it fundamentally
    underlies and modulates our communication. Emotion perception is also used
    in the diagnosis of many mental health conditions and is tracked in therapeutic
    interventions. Research in emotion perception seeks to identify models that describe
    the felt sense of ‘typical’ emotion expression – i.e., an observer/evaluator’s attribution 
    of the emotional state of the speaker. This felt sense is a function of the methods through 
    which individuals integrate the presented multimodal emotional information. 
    We will cover psychological theories of emotion, engineering models of emotion, 
    and experimental approaches to measure emotion. We will demonstrate how these modeling 
    strategies can be used as a component of emotion classification frameworks and how 
    they can be used to inform the design of emotional behaviors.

    Organizers: Emily Mower Provost and Carlos Busso

T4: The Art and Science of Speech Feature Engineering

    With significant advances in mobile technology and audio sensing devices, 
    there is a fundamental need to describe vast amounts of audio data in terms 
    of representative lower-dimensional descriptors for efficient automatic
    processing. The extraction of these signal representations, also called features, 
    constitutes the first step in processing a speech signal. The art and science of 
    feature engineering lies in addressing two inherent challenges: extracting
    sufficient information from the speech signal for the task at hand and suppressing 
    the unwanted redundancies for computational efficiency and robustness. 
    The area of speech feature extraction combines a wide variety of disciplines like 
    signal processing, machine learning, psychophysics, information theory, linguistics and physiology. 
    It has a rich history spanning more than five decades and has seen tremendous advances 
    in the last few years. This has propelled the transition of speech technology from
    controlled environments to millions of end user applications.

    In this tutorial, we review the evolution of speech feature processing methods, 
    summarize the recent advances of the last two decades and provide insights into the 
    future of feature engineering. This will include discussions of the spectral
    representation methods developed in the past, human auditory motivated techniques
    for robust speech processing, data-driven unsupervised features like i-vectors, and
    recent advances in deep neural network based techniques. With experimental results,
    we will also illustrate the impact of these features on various state-of-the-art
    speech processing systems. The future of speech signal processing will need to address 
    various robustness issues in complex acoustic environments while being able 
    to derive useful information from big data.

    Organizers: Sriram Ganapathy and Samuel Thomas

T5: Recent Advances in Speaker Diarization

    The tutorial will start with an introduction to speaker diarization giving a general 
    overview of the subject. Afterwards, we will cover the basic background including 
    feature extraction, and common modeling techniques such as GMMs and HMMs. 
    Then, we will discuss the first processing step usually done in speaker diarization,
    which is voice activity detection. We will subsequently describe the classic approaches
    for speaker diarization which are widely used today. We will then introduce state-of-the-art
    techniques in speaker recognition required to understand modern speaker diarization techniques.
    Following this, we will describe approaches for speaker diarization using advanced representation
    methods (supervectors, speaker factors, i-vectors) and we will describe supervised and 
    unsupervised learning techniques used for speaker diarization. We will also discuss issues 
    such as coping with an unknown number of speakers, detecting and dealing with overlapping speech,
    diarization confidence estimation, and online speaker diarization. Finally, we will discuss
    two recent works. The first exploits a priori acoustic information (such as processing a meeting
    when some of the participants are known in advance to the system and training data is available for them).
    The second models speaker-turn dynamics. If time permits, we will also discuss concepts
    such as multi-modal diarization and using TDOA (time difference of arrival) for diarization of meetings.

    Organizers: Hagai Aronowitz

T6: Multimodal Speech Recognition with the AusTalk 3D Audio-Visual Corpus

    This tutorial will provide attendees with a brief overview of 3D-based audio-visual
    speech recognition (AVSR) research. Attendees will learn how to use the newly developed
    3D-based audio-visual data corpus we derived from the AusTalk corpus
    for audio-visual speech/speaker recognition. In addition, we also plan to introduce
    some results using this newly developed 3D audio-visual data corpus, which show that
    there is a significant increase in speech recognition accuracy from integrating both depth-level and grey-level
    visual features. In the first part of the tutorial, we will review some recent works published
    in the last decade, so that attendees can obtain an overview of the fundamental concepts 
    and challenges in this field. In the second part of the tutorial, we will briefly describe 
    the recording protocol and contents of the 3D data corpus, and show attendees how to use 
    this corpus for their own research. In the third part of this tutorial, we will present our 
    results using the 3D data corpus. The experimental results show that, compared with the 
    conventional AVSR based on the audio and grey-level visual features, the integration of grey 
    and depth visual information can boost the AVSR accuracy significantly. Moreover, 
    we will also experimentally explain why adding depth information can benefit the standard AVSR systems. 
    Ultimately, through our tutorial, we hope to inspire more researchers in the community
    to contribute to this exciting research.

    Organizers: Roberto Togneri, Mohammed Bennamoun and Chao (Luke) Sui

T7: Semantic Web and Linked Big Data Resources for Spoken Language Processing

    State-of-the-art statistical spoken language processing typically requires 
    significant manual effort to construct domain-specific schemas (ontologies) 
    as well as manual effort to annotate training data against these schemas. 
    At the same time, a recent surge of activity and progress on semantic web-related 
    concepts from the large search-engine companies represents a potential alternative 
    to the manually intensive design of spoken language processing systems. 
    Standards such as have been established for schemas (ontologies) that 
    webmasters can use to semantically and uniformly markup their web pages. 
    Search engines like Bing, Google, and Yandex have adopted these standards and are 
    leveraging them to create semantic search engines at the scale of the web. 
    As a result, the open linked data resources and semantic graphs covering various 
    domains (such as Freebase [3]) have grown massively every year and contain far more
    information than any single resource anywhere on the Web. Furthermore, these resources 
    contain links to text data (such as Wikipedia pages) related to the knowledge in the graph.

    Recently, several studies on spoken language processing have started exploiting these massive
    linked data resources for language modeling and spoken language understanding. 
    This tutorial will include a brief introduction to the semantic web and the linked 
    data structure, available resources, and query languages.
    An overview of related work on information extraction and language processing will 
    be presented, where the main focus will be on methods for learning spoken language 
    understanding models from these resources.

    Organizers: Dilek Hakkani-Tür and Larry Heck

T8: Speech and Audio for Multimedia Semantics

    Internet media sharing sites and the one-click upload capability of smartphones 
    are producing a deluge of multimedia content. While visual features are often dominant 
    in such material, acoustic and speech information in particular often complements it. 
    By facilitating access to large amounts of data, the text-based Internet gave a huge 
    boost to the field of natural language processing. The vast amount of consumer-produced 
    video becoming available now will do the same for video processing, eventually enabling 
    semantic understanding of multimedia material, with implications for human computer interaction, robotics, etc.

    Large-scale multi-modal analysis of audio-visual material is now central to a number of 
    multi-site research projects around the world. While each of these has slightly different
    targets, they are facing largely the same challenges: how to robustly and efficiently process 
    large amounts of data, how to represent and then fuse information across modalities, 
    how to train classifiers and segmenters on unlabeled data, how to include human feedback, etc.

    In this tutorial, we will present the state of the art in large-scale video, speech, 
    and non-speech audio processing, and show how these approaches are being applied to tasks 
    such as content based video retrieval (CBVR) and multimedia event detection (MED). 
    We will introduce the most important tools and techniques, and show how the combination of 
    information across modalities can be used to induce semantics on multimedia material 
    through ranking of information and fusion. Finally, we will discuss opportunities 
    for research that the INTERSPEECH community specifically will find interesting and fertile. 

    Organizers: Florian Metze and Koichi Shinoda

ISCSLP Tutorials @ INTERSPEECH 2014 Description

ISCSLP-T1: Adaptation Techniques for Statistical Speech Recognition

    Adaptation is a technique to make better use of existing models for test data 
    from new acoustic or linguistic conditions. It is an important and challenging 
    research area of statistical speech recognition. This tutorial gives a systematic
    review of fundamental theories as well as an introduction to state-of-the-art adaptation
    techniques. It includes both acoustic and language model adaptation. Following a simple example
    of acoustic model adaptation, basic concepts, procedures and categories of adaptation will
    be introduced. Then, a number of advanced adaptation techniques will be discussed,
    such as discriminative adaptation, Deep Neural Network adaptation, adaptive training,
    relationship to noise robustness, etc. After the detailed review of acoustic model adaptation,
    an introduction to language model adaptation, such as topic adaptation, will also be given.
    The whole tutorial is then summarised and future research directions will be discussed.

    Organizers: Kai Yu

ISCSLP-T2: Emotion and Mental State Recognition: Features, Models, System Applications and Beyond

    Emotion recognition is the ability to identify what you are feeling from moment 
    to moment and to understand the connection between your feelings and your expressions. 
    In today’s world, human-computer interaction (HCI) interfaces undoubtedly play an
    important role in our daily life. Toward harmonious HCI interfaces, automated analysis
    and recognition of human emotion has attracted increasing attention from researchers 
    in multidisciplinary research fields. A specific area of current interest that also has key 
    implications for HCI is the estimation of cognitive load (mental workload), research into 
    which is still at an early stage. Technologies for processing daily activities including speech, 
    text and music have expanded the interaction modalities between humans and computer-supported 
    communicational artifacts.

    In this tutorial, we will present theoretical and practical work offering new and broad views 
    of the latest research in emotional awareness from audio and speech. We discuss several parts 
    spanning a variety of theoretical background and applications ranging from salient emotional features, 
    emotional-cognitive models, compensation methods for variability due to speaker and linguistic content, 
    to machine learning approaches applicable to emotion recognition. In each topic, we will review 
    the state of the art by introducing current methods and presenting several applications. 
    In particular, the application to cognitive load estimation will be discussed, 
    from its psychophysiological origins to system design considerations. Eventually, 
    technologies developed in different areas will be combined for future applications, 
    so in addition to a survey of future research challenges, 
    we will envision a few scenarios in which affective computing can make a difference.

    Organizers: Chung-Hsien Wu, Hsin-Min Wang, Julien Epps and Vidhyasaharan Sethu

ISCSLP-T3: Unsupervised Speech and Language Processing via Topic Models

    In this tutorial, we will present state-of-the-art machine learning approaches
    for speech and language processing, with emphasis on unsupervised methods
    for structural learning from unlabeled sequential patterns. In general,
    speech and language processing involves extensive knowledge of statistical models.
    We need to design a flexible, scalable and robust system to meet heterogeneous
    and nonstationary environments in the era of big data. This tutorial starts from an
    introduction to unsupervised speech and language processing based on factor analysis
    and independent component analysis. The unsupervised learning is generalized to a latent
    variable model which is known as the topic model. The evolution of topic models from 
    latent semantic analysis to hierarchical Dirichlet process, from non-Bayesian parametric 
    models to Bayesian nonparametric models, and from single-layer model to hierarchical 
    tree model will be surveyed in an organized fashion. The inference approaches based on
    variational Bayes and Gibbs sampling are introduced. We will also present several
    case studies on topic modeling for speech and language applications including language model,
    document model, retrieval model, segmentation model and summarization model.
    Finally, we will point out new trends of topic models for speech and language processing.

    Organizers: Jen-Tzung Chien

ISCSLP-T4: Deep Learning for Speech Generation and Synthesis

    Deep learning, which can represent high-level abstractions in data with an architecture of 
    multiple non-linear transformations, has made a huge impact on automatic speech recognition (ASR)
    research, products and services. However, deep learning for speech generation and synthesis 
    (i.e., text-to-speech), which is an inverse process of speech recognition (i.e., speech-to-text), 
    has not yet generated similar momentum to that seen in ASR. Recently, motivated by the success
    of Deep Neural Networks in speech recognition, some neural network based approaches have
    been applied successfully to improving the performance of statistical parametric
    speech generation/synthesis. In this tutorial, we focus on deep learning approaches to the 
    problems in speech generation and synthesis, especially on Text-to-Speech (TTS) synthesis and voice conversion.

    First, we review the current mainstream of statistical parametric speech generation
    and synthesis, namely GMM-HMM based speech synthesis and GMM-based voice conversion, with emphasis
    on analyzing the major factors responsible for the quality problems in GMM-based voice
    synthesis/conversion and the intrinsic limitations of decision-tree based contextual state
    clustering and state-based statistical distribution modeling. We then present the latest deep
    learning algorithms for feature parameter trajectory generation, in contrast to deep learning for 
    recognition or classification. We cover common technologies in Deep Neural Network (DNN) and improved 
    DNN: Mixture Density Networks (MDN), Recurrent Neural Networks (RNN) with Bidirectional Long Short 
    Term Memory (BLSTM) and Conditional RBM (CRBM). Finally, we share our research insights and hands-on
    experience on building speech generation and synthesis systems based upon deep learning algorithms.

    Organizers: Yao Qian and Frank K. Soong
