ISCA - International Speech
Communication Association

ISCApad #194

Monday, August 04, 2014 by Chris Wellekens

3-1-6 (2014-09-14) Special sessions at Interspeech 2014: call for submissions


--- September 14-18, 2014


INTERSPEECH is the world's largest and most comprehensive conference on issues surrounding

the science and technology of spoken language processing, both in humans and in machines.

The theme of INTERSPEECH 2014 is

--- Celebrating the Diversity of Spoken Languages ---

INTERSPEECH 2014 includes a number of special sessions covering interdisciplinary topics

and/or important new emerging areas of interest related to the main conference topics.

Special sessions proposed for the forthcoming edition are:

• A Re-evaluation of Robustness

• Deep Neural Networks for Speech Generation and Synthesis

• Exploring the Rich Information of Speech Across Multiple Languages

• INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)

• Multichannel Processing for Distant Speech Recognition

• Open Domain Situated Conversational Interaction

• Phase Importance in Speech Processing Applications

• Speaker Comparison for Forensic and Investigative Applications

• Text-dependent for Short-duration Speaker Verification

• Tutorial Dialogues and Spoken Dialogue Systems

• Visual Speech Decoding

A description of each special session is given below.

For paper submission, please follow the main conference procedure and choose the Special Session track when selecting

your paper area.

Paper submission procedure is described at:

For more information, feel free to contact the Special Session Chair,

Dr. Tomi H. Kinnunen, at email tkinnu [at]


Special Session Description


A Re-evaluation of Robustness

The goal of the session is to facilitate a re-evaluation of robust speech

recognition in the light of recent developments. It’s a re-evaluation at two levels:

• A re-evaluation in perspective, brought about by breakthroughs in performance obtained

with Deep Neural Networks, which leads to a fresh questioning of the role and

contribution of robust feature extraction.

• A literal re-evaluation on common databases to be able to present and compare

performances of different algorithms and system approaches to robustness.

Paper submissions are invited on the theme of noise-robust speech recognition

and are required to report results on the Aurora 4 database to facilitate cross-comparison

of performance between different techniques.

Recent developments raise interesting research questions that the session aims to help

advance by bringing focus to, and exploration of, these issues. For example:

1. What role is there for signal processing to create feature representations to use as

inputs to Deep Learning or can deep learning do all the work?

2. What feature representations can be automatically learnt in a deep learning architecture?

3. What other techniques can give great improvement in robustness?

4. What techniques don’t work and why?

The session organizers wish to encourage submissions that bring insight and understanding to

the issues highlighted above. Authors are requested not only to present absolute performance

of the whole system but also to highlight the contribution made by various components in a

complex system.

Authors of papers accepted for the session are also encouraged to evaluate their techniques on new test

data sets (available in July) and submit their results at the end of August.

Session organization

The session will be structured as a combination of

1. Invited talks

2. Oral paper presentations

3. Poster presentations

4. Summary of contributions and results on newly released test sets

5. Discussion


David Pearce, Audience, dpearce [at]

Hans-Guenter Hirsch, Niederrhein University of Applied Sciences, hans-guenter.hirsch [at]

Reinhold Haeb-Umbach, University of Paderborn, haeb [at]

Michael Seltzer, Microsoft, mseltzer [at]

Keikichi Hirose, The University of Tokyo, hirose [at]

Steve Renals, University of Edinburgh, s.renals [at]

Sim Khe Chai, National University of Singapore, simkc [at]

Niko Moritz, Fraunhofer IDMT, Oldenburg, niko.moritz [at]

K K Chin, Google, kkchin [at]

Deep Neural Networks for Speech Generation and Synthesis

This special session aims to bring together researchers who work actively on deep neural

networks for speech research, particularly, in generation and synthesis, to promote and

to better understand state-of-the-art DNN research in statistical learning and compare

results with the parametric HMM-GMM model which has been well-established for speech synthesis,

generation, and conversion. A DNN, with its neuron-like structure, can simulate the human speech

production system in a layered, hierarchical, nonlinear and self-organized network.

It can transform linguistic text information into intermediate semantic, phonetic and prosodic

content and finally generate speech waveforms. Many possible neural network architectures or

topologies exist, e.g. feed-forward NN with multiple hidden layers, stacked RBM or CRBM,

and Recurrent Neural Nets (RNN), which have been applied to speech/image recognition and other applications.

We would like to use this special session as a forum to present updated results in the research frontiers,

algorithm development and application scenarios. Particular focus areas will be

parametric TTS synthesis, voice conversion, speech compression, de-noising and speech enhancement.


Yao Qian, Microsoft Research Asia, yaoqian [at]

Frank K. Soong, Microsoft Research Asia, frankkps [at]

Exploring the Rich Information of Speech Across Multiple Languages

Spoken language is the most direct means of communication between human beings. However,

speech communication often demonstrates its language-specific characteristics because of,

for instance, the linguistic difference (e.g., tonal vs. non-tonal, monosyllabic vs. multisyllabic)

across languages. Our knowledge on the diversities of speech science across languages is still limited,

including speech perception, linguistic and non-linguistic (e.g., emotion) information, etc.

This knowledge is of great significance to facilitate our design of language-specific application of

speech techniques (e.g., automatic speech recognition, assistive hearing devices) in the future.

This special session will provide an opportunity for researchers from various communities

(including speech science, medicine, linguistics and signal processing) to stimulate further discussion

and new research in the broad cross-language area, and present their latest research on understanding

the language-specific features of speech science and their applications in the speech communication of

machines and human beings. This special session encourages contributions from all fields of speech science,

e.g., production and perception, but with a focus on presenting the language-specific characteristics

and discussing their implications to improve our knowledge on the diversities of speech science across

multiple languages. Topics of interest include, but are not limited to:

1. characteristics of acoustic, linguistic and language information in speech communication across

multiple languages;

2. diversity of linguistic and non-linguistic (e.g., emotion) information among multiple spoken languages;

3. language-specific speech intelligibility enhancement and automatic speech recognition techniques; and

4. comparative cross-language assessment of speech perception in challenging environments.


Junfeng Li, Institute of Acoustics, Chinese Academy of Sciences, [at]

Fei Chen, The University of Hong Kong, feichen1 [at]

INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)

The INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE) is an open Challenge

dealing with speaker characteristics as manifested in their speech signal's acoustic properties.

This year, it introduces new tasks with the Cognitive Load Sub-Challenge, the Physical Load

Sub-Challenge, and a Multitask Sub-Challenge: For these Challenge tasks,


and the ANXIETY-DEPRESSION-EMOTION-SLEEPINESS audio corpus (ADES) with high diversity of

speakers and different languages covered (Australian English and German) are provided by the organizers.

All corpora provide fully realistic data in challenging acoustic conditions and feature rich

annotation such as speaker meta-data. They are given with distinct definitions of test,

development, and training partitions, incorporating speaker independence as needed in most

real-life settings. Benchmark results of the most popular approaches are provided as in the years before.

Transcriptions of the training and development sets will be provided. All Sub-Challenges allow contributors

to find their own features with their own machine learning algorithm. However, a standard feature set

will be provided per corpus that may be used. Participants will have to stick to the definition of

training, development, and test sets. They may report on results obtained on the development set,

but have only five trials to upload their results on the test sets, whose labels are unknown to them.

Each participation will be accompanied by a paper presenting the results that undergoes peer-review

and has to be accepted for the conference in order to participate in the Challenge.

The results of the Challenge will be presented in a Special Session at INTERSPEECH 2014 in Singapore.

Further, contributions using the Challenge data or related to the Challenge but not competing within

the Challenge are also welcome.

More information is given also on the Challenge homepage:


Björn Schuller, Imperial College London / Technische Universität München, schuller [at]

Stefan Steidl, Friedrich-Alexander-University, stefan.steidl [at]

Anton Batliner, Technische Universität München / Friedrich-Alexander-University, batliner [at]

Jarek Krajewski, Bergische Universität Wuppertal, krajewsk [at]

Julien Epps, The University of New South Wales / National ICT Australia, j.epps [at]

Multichannel Processing for Distant Speech Recognition

Distant speech recognition in real-world environments is still a challenging problem: reverberation

and dynamic background noise represent major sources of acoustic mismatch that heavily decrease ASR

performance, which, on the contrary, can be very good in close-talking microphone setups.

In this context, a particularly interesting topic is the adoption of distributed microphones for

the development of voice-enabled automated home environments based on distant-speech interaction:

microphones are installed in different rooms and the resulting multichannel audio recordings capture

multiple audio events, including voice commands or spontaneous speech, generated in various locations

and characterized by a variable amount of reverberation as well as possible background noise.

The focus of the proposed special session will be on multichannel processing for automatic speech recognition (ASR)

in such a setting. Unlike other robust ASR tasks, where static adaptation or training with noisy data noticeably

ameliorates performance, the distributed microphone scenario requires full exploitation of multichannel

information to reduce the highly variable dynamic mismatch. To facilitate better evaluation of the proposed

algorithms the organizers will provide a set of multichannel recordings in a domestic environment.

The recordings will include spoken commands mixed with other acoustic events occurring in different

rooms of a real apartment.

The data is being created in the frame of the EC project DIRHA (Distant speech Interaction for Robust

Home Applications)

which addresses the challenges of speech interaction for home automation.

The organizers will release the evaluation package (datasets and scripts) on February 17;

the participants are asked to submit a regular paper reporting speech recognition results

on the evaluation set and comparing their performance with the provided reference baseline.

Further details are available at:


Marco Matassoni, Fondazione Bruno Kessler, matasso [at]

Ramon Fernandez Astudillo, Instituto de Engenharia de Sistemas e Computadores, ramon.astudillo [at]

Athanasios Katsamanis, National Technical University of Athens, nkatsam [at]

Open Domain Situated Conversational Interaction

Robust conversational systems have the potential to revolutionize our interactions with computers.

Building on decades of academic and industrial research, we now talk to our computers, phones,

and entertainment systems on a daily basis. However, current technology typically limits conversational

interactions to a few narrow domains/topics (e.g., weather, traffic, restaurants). Users increasingly want

the ability to converse with their devices over broad web-scale content. Finding something on your PC or

the web should be as simple as having a conversation.

A promising approach to address this problem is situated conversational interaction. The approach leverages

the situation and/or context of the conversation to improve system accuracy and effectiveness.

Sources of context include visual content being displayed to the user, geo-location, prior interactions,

multi-modal interactions (e.g., gesture, eye gaze), and the conversation itself. For example, while a user

is reading a news article on their tablet PC, they initiate a conversation to dig deeper on a particular topic.

Or a user is reading a map and wants to learn more about the history of events at mile marker 121.

Or a gamer wants to interact with a game’s characters to find the next clue in their quest.

All of these interactions are situated – rich context is available to the system as a source of priors/constraints

on what the user is likely to say.

This special session will provide a forum to discuss research progress in open domain situated

conversational interactions.

Topics of the session will include:

• Situated context in spoken dialog systems

• Visual/dialog/personal/geo situated context

• Inferred context through interpretation and reasoning

• Open domain spoken dialog systems

• Open domain spoken/natural language understanding and generation

• Open domain semantic interpretation

• Open domain dialog management (large-scale belief state/policy)

• Conversational Interactions

• Multi-modal inputs in situated open domains (speech/text + gesture, touch, eye gaze)

• Multi-human situated interactions


Larry Heck, Microsoft Research, larry [at]

Dilek Hakkani-Tür, Microsoft Research, dilek [at]

Gokhan Tur, Microsoft Research, gokhan [at]

Steve Young, Cambridge University, sjy [at]

Phase Importance in Speech Processing Applications

Over the past decades, the amplitude of the speech spectrum has been considered the most important feature in

different speech processing applications, while the phase of the speech signal has received less

attention. Recently, several findings in the speech and audio processing communities have underlined the importance of phase.

The importance of phase estimation along with amplitude estimation in speech enhancement,

complementary phase-based features in speech and speaker recognition, and phase-aware acoustic

modeling of the environment are the most prominent reported works, scattered across different

communities of speech and audio processing. These examples suggest

that incorporating the phase information can push the limits of state-of-the-art phase-independent solutions

employed for long in different aspects of audio and speech signal processing. This Special Session aims

to explore the recent advances and methodologies to exploit the knowledge of signal phase information in different

aspects of speech processing. Without a dedicated effort to bring together researchers from different communities,

rapid progress in investigating the usefulness of phase in speech processing applications

is difficult to achieve. Therefore, as the first step in this direction, we aim to promote the 'phase-aware

speech and audio signal processing' to form a community of researchers to organize the next steps.

Our initiative is to unify these efforts to better understand the pros and cons of using phase and the degree

of feasibility for phase estimation/enhancement in different areas of speech processing including: speech

enhancement, speech separation, speech quality estimation, speech and speaker recognition,

voice transformation and speech analysis and synthesis. The goal is to promote

phase-based signal processing, to study its importance, and to share interesting findings from different

speech processing applications.


Pejman Mowlaee, Graz University of Technology, pejman.mowlaee [at]

Rahim Saeidi, University of Eastern Finland, rahim.saeidi [at]

Yannis Stylianou, Toshiba Labs Cambridge UK / University of Crete, yannis [at]

Speaker Comparison for Forensic and Investigative Applications

In speaker comparison, speech/voice samples are compared by humans and/or machines

for use in investigation or in court to address questions that are of interest to the legal system.

Speaker comparison is a high-stakes application that can change people’s lives and it demands the best

that science has to offer; however, methods, processes, and practices vary widely.

These variations are not necessarily for the better and, though recognized, are not generally appreciated

and acted upon. Methods, processes, and practices grounded in science are critical for the proper application

(and non-application) of speaker comparison to a variety of international investigative and forensic applications.

This special session will contribute to scientific progress through 1) understanding speaker comparison

for investigative and forensic application (e.g., describe what is currently being done and critically

analyze performance and lessons learned); 2) improving speaker comparison for investigative and forensic

applications (e.g., propose new approaches/techniques, understand the limitations, and identify challenges

and opportunities); 3) improving communications between communities of researchers, legal scholars,

and practitioners internationally (e.g., directly address some central legal, policy, and societal questions

such as allowing speaker comparisons in court, requirements for expert witnesses, and requirements for specific

automatic or human-based methods to be considered scientific); 4) using best practices (e.g., reduction of bias

and presentation of evidence); 5) developing a roadmap for progress in this session and future sessions; and 6)

producing a documented contribution to the field. Some of these objectives will need multiple sessions

to fully achieve and some are complicated due to differing legal systems and cultures.

This special session builds on previous successful special sessions and tutorials in forensic applications

of speaker comparison at INTERSPEECH beginning in 2003. Wide international participation is planned,

including researchers from the ISCA SIGs for the Association Francophone de la Communication Parlée (AFCP)

and the Speaker and Language Characterization (SpLC).


Joseph P. Campbell, PhD, MIT Lincoln Laboratory, jpc [at]

Jean-François Bonastre, l'Université d'Avignon, jean-francois.bonastre [at]

Text-dependent for Short-duration Speaker Verification

In recent years, speaker verification engines have reached maturity and have been deployed in

commercial applications. Ergonomics of such applications is especially demanding and imposes

a drastic limitation in terms of speech duration during authentication. A well known tactic to address

the problem of lack of data, due to short duration, is using text-dependency. However, recent breakthroughs

achieved in the context of text-independent speaker verification in terms of accuracy and robustness

do not benefit text-dependent applications. Indeed, large development data required by the recent

approaches is not available in the text-dependent context. The purpose of this special session is

to gather the research efforts from both academia and industry toward a common goal of establishing

a new baseline and explore new directions for text-dependent speaker verification.

The focus of the session is on robustness with respect to duration and modeling of lexical information.

To support the development and evaluation of text-dependent speaker verification technologies,

the Institute for Infocomm Research (I2R) has recently released the RSR2015 database,

including 150 hours of data recorded from 300 speakers. Papers submitted to the special

session are encouraged to provide results based on, but not limited to, the RSR2015 database

in order to enable comparison of algorithms and methods. For this purpose, the organizers strongly

encourage the participants to report performance on the protocol delivered with the database

in terms of EER and minimum cost (in the sense of NIST 2008 Speaker Recognition evaluation).

To get the database, please contact the organizers.
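
As a rough illustration of the two metrics named above, the following Python sketch computes the EER and a minimum detection cost from two lists of trial scores. It is not part of the RSR2015 protocol or of any NIST scoring tool: the function names and the toy scores are hypothetical, and the cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are assumed here to correspond to the NIST 2008 evaluation and should be checked against the official evaluation plan.

```python
def error_rates(target_scores, impostor_scores):
    """Return (threshold, P_miss, P_fa) for every candidate threshold.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(float("inf"))  # operating point that rejects every trial
    points = []
    for t in thresholds:
        # Miss: a target trial scored below the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: an impostor trial scored at or above the threshold.
        p_fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        points.append((t, p_miss, p_fa))
    return points


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER at the threshold where P_miss and P_fa are closest."""
    points = error_rates(target_scores, impostor_scores)
    _, p_miss, p_fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2


def min_detection_cost(target_scores, impostor_scores,
                       c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum over thresholds of the (unnormalized) detection cost.

    The default cost parameters are an assumption, not taken from the call.
    """
    points = error_rates(target_scores, impostor_scores)
    return min(c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
               for _, p_miss, p_fa in points)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    targets = [2.0, 1.5, 0.3, 1.1]
    impostors = [-1.0, 0.5, -0.2, 0.1]
    print(equal_error_rate(targets, impostors))              # 0.25
    print(round(min_detection_cost(targets, impostors), 4))  # 0.025
```

With real score files, interpolating between operating points usually gives a smoother EER estimate than the nearest-point approximation used in this sketch.
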

Further details are available at:


Anthony LARCHER (alarcher [at] Institute for Infocomm Research

Hagai ARONOWITZ (hagaia [at] IBM Research – Haifa

Kong Aik LEE (kalee [at] Institute for Infocomm Research

Patrick KENNY (patrick.kenny [at] CRIM – Montréal

Tutorial Dialogues and Spoken Dialogue Systems

The growing interest in educational applications that use spoken interaction and dialogue technology has boosted

research and development of interactive tutorial systems, and over recent years, advances have been achieved

in both the spoken dialogue and education research communities, with sophisticated speech and multi-modal

technology which allows functionally suitable and reasonably robust applications to be built.

The special session combines spoken dialogue research, interaction modeling, and educational applications,

and brings together the two INTERSPEECH SIG communities: SLaTE and SIGdial. The session focuses

on methods, problems and challenges that are shared by both communities, such as sophistication

of speech processing and dialogue management for educational interaction, integration of the models

with theories of emotion, rapport, and mutual understanding, as well as application of the techniques

to novel learning environments, robot interaction, etc. The session aims to survey issues related

to the processing of spoken language in various learning situations, modeling of the teacher-student

interaction in MOOC-like environments, as well as evaluating tutorial dialogue systems from

the point of view of natural interaction, technological robustness, and learning outcome.

The session encourages interdisciplinary research and submissions related to the special focus

of the conference, 'Celebrating the Diversity of Spoken Languages'.


Maxine Eskenazi, max+ [at]

Kristiina Jokinen, kristiina.jokinen [at]

Diane Litman, litman [at]

Martin Russell, M.J.RUSSELL [at]

Visual Speech Decoding

Speech perception is a bi-modal process that takes into account both the acoustic (what we hear)

and visual (what we see) speech information. It has been widely acknowledged that visual cues play

a critical role in automatic speech recognition (ASR) especially when audio is corrupted by,

for example, background noise or voices from untargeted speakers, or is even inaccessible.

Decoding visual speech is critically important if ASR technologies are to be widely implemented

to realize truly natural human-computer interactions. Despite the advances in acoustic ASR,

visual speech decoding remains a challenging problem.

The special session aims to attract more effort to tackle this important problem. In particular,

we would like to encourage researchers to focus on some critical questions in the area.

As a starting point, we propose the following four questions:

1. How to deal with the speaker dependency in visual speech data?

2. How to cope with the head-pose variation?

3. How to encode temporal information in visual features?

4. How to automatically adapt the fusion rule when the quality of the two individual (audio and visual)

modalities varies?

Researchers and participants are encouraged to raise more questions related to visual speech decoding.

We expect the session to draw a wide range of attention from both the speech recognition and machine vision

communities to the problem of visual speech decoding.


Ziheng Zhou, University of Oulu, ziheng.zhou [at]

Matti Pietikäinen, University of Oulu, matti.pietikainen [at]

Guoying Zhao, University of Oulu, gyzhao [at]
