(2014-09-14) Special sessions at Interspeech 2014: call for submissions
 --- INTERSPEECH 2014 - SINGAPORE 
--- September 14-18, 2014 
--- http://www.INTERSPEECH2014.org 
INTERSPEECH is the world's largest and most comprehensive conference on issues surrounding 
the science and technology of spoken language processing, both in humans and in machines. 
The theme of INTERSPEECH 2014 is 
--- Celebrating the Diversity of Spoken Languages --- 
INTERSPEECH 2014 includes a number of special sessions covering interdisciplinary topics 
and/or important new emerging areas of interest related to the main conference topics. 
Special sessions proposed for the forthcoming edition are: 
• A Re-evaluation of Robustness 
• Deep Neural Networks for Speech Generation and Synthesis 
• Exploring the Rich Information of Speech Across Multiple Languages 
• INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE) 
• Multichannel Processing for Distant Speech Recognition 
• Open Domain Situated Conversational Interaction 
• Phase Importance in Speech Processing Applications 
• Speaker Comparison for Forensic and Investigative Applications 
• Text-dependent for Short-duration Speaker Verification 
• Tutorial Dialogues and Spoken Dialogue Systems 
• Visual Speech Decoding 
A description of each special session is given below. 
For paper submission, please follow the main conference procedure and choose the Special Session track when selecting
your paper area.
Paper submission procedure is described at: 
http://www.INTERSPEECH2014.org/public.php?page=submission_procedure.html 
For more information, feel free to contact the Special Session Chair, 
Dr. Tomi H. Kinnunen, at email tkinnu [at]cs.uef.fi 
---------------------------------------------------------------------------------------------------- 
Special Session Description 
---------------------------------------------------------------------------------------------------- 
A Re-evaluation of Robustness 
The goal of the session is to facilitate a re-evaluation of robust speech 
recognition in the light of recent developments. It is a re-evaluation at two levels:
• a re-evaluation in perspective, brought about by the breakthroughs in performance obtained
with Deep Neural Networks, which lead to a fresh questioning of the role and
contribution of robust feature extraction;
• a literal re-evaluation on common databases, to present and compare
the performance of different algorithms and system approaches to robustness.
Paper submissions are invited on the theme of noise robust speech recognition,
and authors are required to report results on the Aurora 4 database to facilitate cross comparison
of the performance of different techniques.
Recent developments raise interesting research questions that the session aims to help
progress by bringing focus to, and exploration of, these issues. For example:
1. What role is there for signal processing in creating feature representations to use as
inputs to deep learning, or can deep learning do all the work?
2. What feature representations can be automatically learnt in a deep learning architecture? 
3. What other techniques can give great improvement in robustness? 
4. What techniques don’t work and why? 
The session organizers wish to encourage submissions that bring insight and understanding to 
the issues highlighted above. Authors are requested not only to present absolute performance 
of the whole system but also to highlight the contribution made by various components in a 
complex system. 
Authors of papers accepted for the session are encouraged to also evaluate their techniques on new test
data sets (available in July) and submit their results at the end of August.
Session organization 
The session will be structured as a combination of 
1. Invited talks 
2. Oral paper presentations 
3. Poster presentations 
4. Summary of contributions and results on newly released test sets 
5. Discussion 
Organizers: 
David Pearce, Audience dpearce [at]audience.com 
Hans-Guenter Hirsch, Niederrhein University of Applied Sciences, hans-guenter.hirsch [at]hs-niederrhein.de 
Reinhold Haeb-Umbach, University of Paderborn, haeb [at]nt.uni-paderborn.de 
Michael Seltzer, Microsoft, mseltzer [at]microsoft.com 
Keikichi Hirose, The University of Tokyo, hirose [at]gavo.t.u-tokyo.ac.jp 
Steve Renals, University of Edinburgh, s.renals [at]ed.ac.uk 
Sim Khe Chai, National University of Singapore, simkc [at]comp.nus.edu.sg 
Niko Moritz, Fraunhofer IDMT, Oldenburg, niko.moritz [at]idmt.fraunhofer.de 
K K Chin, Google, kkchin [at]google.com 
Deep Neural Networks for Speech Generation and Synthesis 
This special session aims to bring together researchers who work actively on deep neural
networks for speech research, particularly in generation and synthesis, to promote and
better understand state-of-the-art DNN research in statistical learning and to compare
results with the parametric HMM-GMM model, which is well established for speech synthesis,
generation, and conversion. A DNN, with its neuron-like structure, can simulate the human speech
production system in a layered, hierarchical, nonlinear and self-organized network.
It can transform linguistic text information into intermediate semantic, phonetic and prosodic
content and finally generate speech waveforms. Many possible neural network architectures or
topologies exist, e.g. feed-forward NNs with multiple hidden layers, stacked RBMs or CRBMs, and
Recurrent Neural Nets (RNNs), which have been applied to speech/image recognition and other applications.
We would like to use this special session as a forum to present up-to-date results at the research frontiers,
in algorithm development and in application scenarios. Particular areas of focus will be
parametric TTS synthesis, voice conversion, speech compression, de-noising and speech enhancement.
Organizers: 
Yao Qian, Microsoft Research Asia, yaoqian [at]microsoft.com 
Frank K. Soong, Microsoft Research Asia, frankkps [at]microsoft.com 
Exploring the Rich Information of Speech Across Multiple Languages 
Spoken language is the most direct means of communication between human beings. However, 
speech communication often demonstrates its language-specific characteristics because of, 
for instance, the linguistic difference (e.g., tonal vs. non-tonal, monosyllabic vs. multisyllabic) 
across languages. Our knowledge of the diversity of speech across languages, including speech perception
and linguistic and non-linguistic (e.g., emotion) information, is still limited.
Such knowledge is of great significance for the future design of language-specific applications of
speech techniques (e.g., automatic speech recognition, assistive hearing devices).
This special session will provide an opportunity for researchers from various communities 
(including speech science, medicine, linguistics and signal processing) to stimulate further discussion 
and new research in the broad cross-language area, and present their latest research on understanding 
the language-specific features of speech science and their applications in the speech communication of 
machines and human beings. This special session encourages contributions from all fields of speech science,
e.g., production and perception, but with a focus on presenting language-specific characteristics
and discussing their implications, to improve our knowledge of the diversity of speech science across
multiple languages. Topics of interest include, but are not limited to:
1. characteristics of acoustic, linguistic and language information in speech communication across 
multiple languages; 
2. diversity of linguistic and non-linguistic (e.g., emotion) information among multiple spoken languages; 
3. language-specific speech intelligibility enhancement and automatic speech recognition techniques; and 
4. comparative cross-language assessment of speech perception in challenging environments. 
Organizers: 
Junfeng Li, Institute of Acoustics, Chinese Academy of Sciences, junfeng.li.1979 [at]gmail.com 
Fei Chen, The University of Hong Kong, feichen1 [at]hku.hk 
INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE) 
The INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE) is an open Challenge 
dealing with speaker characteristics as manifested in their speech signal's acoustic properties. 
This year, it introduces new tasks with the Cognitive Load Sub-Challenge, the Physical Load
Sub-Challenge, and a Multitask Sub-Challenge. For these Challenge tasks,
the COGNITIVE-LOAD WITH SPEECH AND EGG database (CLSE), the MUNICH BIOVOICE CORPUS (MBC), 
and the ANXIETY-DEPRESSION-EMOTION-SLEEPINESS audio corpus (ADES) with high diversity of 
speakers and different languages covered (Australian English and German) are provided by the organizers. 
All corpora provide fully realistic data in challenging acoustic conditions and feature rich 
annotation such as speaker meta-data. They are given with distinct definitions of test, 
development, and training partitions, incorporating speaker independence as needed in most 
real-life settings. Benchmark results of the most popular approaches are provided as in the years before. 
Transcriptions of the training and development sets will be provided. All Sub-Challenges allow contributors
to find their own features with their own machine learning algorithm. However, a standard feature set 
will be provided per corpus that may be used. Participants will have to stick to the definition of 
training, development, and test sets. They may report on results obtained on the development set, 
but have only five trials to upload their results on the test sets, whose labels are unknown to them. 
Each participation will be accompanied by a paper presenting the results that undergoes peer-review 
and has to be accepted for the conference in order to participate in the Challenge. 
The results of the Challenge will be presented in a Special Session at INTERSPEECH 2014 in Singapore. 
Further, contributions using the Challenge data or related to the Challenge but not competing within 
the Challenge are also welcome. 
More information is also given on the Challenge homepage:
http://emotion-research.net/sigs/speech-sig/is14-compare 
Organizers: 
Björn Schuller, Imperial College London / Technische Universität München, schuller [at]IEEE.org
Stefan Steidl, Friedrich-Alexander-University, stefan.steidl [at]fau.de 
Anton Batliner, Technische Universität München / Friedrich-Alexander-University, 
batliner [at]cs.fau.de 
Jarek Krajewski, Bergische Universität Wuppertal, krajewsk [at]uni-wuppertal.de
Julien Epps, The University of New South Wales / National ICT Australia, j.epps [at]unsw.edu.au 
Multichannel Processing for Distant Speech Recognition 
Distant speech recognition in real-world environments is still a challenging problem: reverberation
and dynamic background noise represent major sources of acoustic mismatch that heavily degrade ASR
performance, which, by contrast, can be very good in close-talking microphone setups.
In this context, a particularly interesting topic is the adoption of distributed microphones for 
the development of voice-enabled automated home environments based on distant-speech interaction: 
microphones are installed in different rooms and the resulting multichannel audio recordings capture 
multiple audio events, including voice commands or spontaneous speech, generated in various locations 
and characterized by a variable amount of reverberation as well as possible background noise. 
The focus of the proposed special session will be on multichannel processing for automatic speech recognition (ASR) 
in such a setting. Unlike other robust ASR tasks, where static adaptation or training with noisy data noticeably
improves performance, the distributed microphone scenario requires full exploitation of multichannel
information to reduce the highly variable dynamic mismatch. To facilitate better evaluation of the proposed
algorithms, the organizers will provide a set of multichannel recordings in a domestic environment.
The recordings will include spoken commands mixed with other acoustic events occurring in different 
rooms of a real apartment. 
The data is being created within the framework of the EC project DIRHA (Distant speech Interaction for Robust
Home Applications), which addresses the challenges of speech interaction for home automation.
The organizers will release the evaluation package (datasets and scripts) on February 17; 
the participants are asked to submit a regular paper reporting speech recognition results 
on the evaluation set and comparing their performance with the provided reference baseline. 
Further details are available at: http://dirha.fbk.eu/INTERSPEECH2014 
Organizers: 
Marco Matassoni, Fondazione Bruno Kessler, matasso [at]fbk.eu 
Ramon Fernandez Astudillo, Instituto de Engenharia de Sistemas e Computadores, ramon.astudillo [at]inesc-id.pt 
Athanasios Katsamanis, National Technical University of Athens, nkatsam [at]cs.ntua.gr 
Open Domain Situated Conversational Interaction 
Robust conversational systems have the potential to revolutionize our interactions with computers. 
Building on decades of academic and industrial research, we now talk to our computers, phones, 
and entertainment systems on a daily basis. However, current technology typically limits conversational 
interactions to a few narrow domains/topics (e.g., weather, traffic, restaurants). Users increasingly want 
the ability to converse with their devices over broad web-scale content. Finding something on your PC or 
the web should be as simple as having a conversation. 
A promising approach to address this problem is situated conversational interaction. The approach leverages 
the situation and/or context of the conversation to improve system accuracy and effectiveness. 
Sources of context include visual content being displayed to the user, geo-location, prior interactions,
multi-modal interactions (e.g., gesture, eye gaze), and the conversation itself. For example, while a user
is reading a news article on their tablet PC, they initiate a conversation to dig deeper into a particular topic.
Or a user is reading a map and wants to learn more about the history of events at mile marker 121. 
Or a gamer wants to interact with a game’s characters to find the next clue in their quest. 
All of these interactions are situated – rich context is available to the system as a source of priors/constraints 
on what the user is likely to say. 
This special session will provide a forum to discuss research progress in open domain situated 
conversational interactions. 
Topics of the session will include: 
• Situated context in spoken dialog systems 
• Visual/dialog/personal/geo situated context 
• Inferred context through interpretation and reasoning 
• Open domain spoken dialog systems 
• Open domain spoken/natural language understanding and generation 
• Open domain semantic interpretation 
• Open domain dialog management (large-scale belief state/policy) 
• Conversational Interactions 
• Multi-modal inputs in situated open domains (speech/text + gesture, touch, eye gaze) 
• Multi-human situated interactions 
Organizers: 
Larry Heck, Microsoft Research, larry [at]ieee.org 
Dilek Hakkani-Tür, Microsoft Research, dilek [at]ieee.org 
Gokhan Tur, Microsoft Research, gokhan [at]ieee.org 
Steve Young, Cambridge University, sjy [at]eng.cam.ac.uk 
Phase Importance in Speech Processing Applications 
Over the past decades, the amplitude of the speech spectrum has been considered the most important feature in
different speech processing applications, and the phase of the speech signal has received less
attention. Recently, several findings have demonstrated the importance of phase in the speech and audio processing communities.
The most prominent reported works, scattered across these communities, concern the importance of phase estimation
along with amplitude estimation in speech enhancement, complementary phase-based features in speech and
speaker recognition, and phase-aware acoustic modeling of the environment. These examples suggest
that incorporating phase information can push the limits of the state-of-the-art phase-independent solutions
long employed in different aspects of audio and speech signal processing. This Special Session aims
to explore recent advances and methodologies that exploit knowledge of signal phase information in different
aspects of speech processing. Without a dedicated effort to bring together researchers from different communities,
rapid progress in investigating the usefulness of phase in speech processing applications
is difficult to achieve. Therefore, as a first step in this direction, we aim to promote 'phase-aware
speech and audio signal processing' and to form a community of researchers to organize the next steps.
Our initiative is to unify these efforts to better understand the pros and cons of using phase and the
feasibility of phase estimation/enhancement in different areas of speech processing, including speech
enhancement, speech separation, speech quality estimation, speech and speaker recognition,
voice transformation, and speech analysis and synthesis. The goal is to promote phase-based signal
processing, to study its importance, and to share interesting findings from different
speech processing applications.
Organizers: 
Pejman Mowlaee, Graz University of Technology, pejman.mowlaee [at]tugraz.at 
Rahim Saeidi, University of Eastern Finland, rahim.saeidi [at]uef.fi 
Yannis Stylianou, Toshiba Labs Cambridge UK / University of Crete, yannis [at]csd.uoc.gr
Speaker Comparison for Forensic and Investigative Applications 
In speaker comparison, speech/voice samples are compared by humans and/or machines 
for use in investigation or in court to address questions that are of interest to the legal system. 
Speaker comparison is a high-stakes application that can change people’s lives and it demands the best 
that science has to offer; however, methods, processes, and practices vary widely. 
These variations are not necessarily for the better and, though recognized, are not generally appreciated
and acted upon. Methods, processes, and practices grounded in science are critical for the proper application 
(and non-application) of speaker comparison to a variety of international investigative and forensic applications. 
This special session will contribute to scientific progress through 1) understanding speaker comparison 
for investigative and forensic application (e.g., describe what is currently being done and critically 
analyze performance and lessons learned); 2) improving speaker comparison for investigative and forensic 
applications (e.g., propose new approaches/techniques, understand the limitations, and identify challenges 
and opportunities); 3) improving communications between communities of researchers, legal scholars, 
and practitioners internationally (e.g., directly address some central legal, policy, and societal questions 
such as allowing speaker comparisons in court, requirements for expert witnesses, and requirements for specific 
automatic or human-based methods to be considered scientific); 4) using best practices (e.g., reduction of bias 
and presentation of evidence); 5) developing a roadmap for progress in this session and future sessions; and 6) 
producing a documented contribution to the field. Some of these objectives will need multiple sessions 
to fully achieve and some are complicated due to differing legal systems and cultures. 
This special session builds on previous successful special sessions and tutorials in forensic applications 
of speaker comparison at INTERSPEECH beginning in 2003. Wide international participation is planned, 
including researchers from the ISCA SIGs for the Association Francophone de la Communication Parlée (AFCP) 
and the Speaker and Language Characterization (SpLC). 
Organizers: 
Joseph P. Campbell, PhD, MIT Lincoln Laboratory, jpc [at]ll.mit.edu 
Jean-François Bonastre, l'Université d'Avignon, jean-francois.bonastre [at]univ-avignon.fr 
Text-dependent for Short-duration Speaker Verification 
In recent years, speaker verification engines have reached maturity and have been deployed in 
commercial applications. Ergonomics of such applications is especially demanding and imposes 
a drastic limitation in terms of speech duration during authentication. A well-known tactic to address
the lack of data caused by short duration is to use text-dependency. However, recent breakthroughs
achieved in the context of text-independent speaker verification in terms of accuracy and robustness
do not benefit text-dependent applications. Indeed, the large development data required by the recent
approaches is not available in the text-dependent context. The purpose of this special session is
to gather the research efforts from both academia and industry toward the common goal of establishing
a new baseline and exploring new directions for text-dependent speaker verification.
The focus of the session is on robustness with respect to duration and modeling of lexical information. 
To support the development and evaluation of text-dependent speaker verification technologies, 
the Institute for Infocomm Research (I2R) has recently released the RSR2015 database, 
including 150 hours of data recorded from 300 speakers. Papers submitted to the special
session are encouraged, but not required, to provide results based on the RSR2015 database
in order to enable comparison of algorithms and methods. For this purpose, the organizers strongly
encourage the participants to report performance on the protocol delivered with the database
in terms of EER and minimum cost (in the sense of the NIST 2008 Speaker Recognition Evaluation).
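As an illustration of the two metrics named above, the following minimal Python sketch computes an EER and a minimum detection cost from lists of target and impostor trial scores. It is not the RSR2015 protocol or any official scoring tool: the function names are hypothetical, the threshold sweep is a simple exhaustive one, and the cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are assumed here to correspond to the NIST 2008 setting and should be checked against the evaluation plan.

```python
def error_rates(target_scores, impostor_scores):
    """Return (threshold, miss_rate, false_alarm_rate) for each candidate threshold.

    A trial is accepted when its score is >= threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(thresholds[-1] + 1.0)  # extra threshold that rejects every trial
    points = []
    for t in thresholds:
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        points.append((t, miss, false_alarm))
    return points


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER as the mean of miss and false-alarm rates
    at the threshold where the two rates are closest."""
    _, miss, false_alarm = min(
        error_rates(target_scores, impostor_scores),
        key=lambda p: abs(p[1] - p[2]),
    )
    return (miss + false_alarm) / 2.0


def min_detection_cost(target_scores, impostor_scores,
                       p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Minimum over thresholds of the (unnormalized) detection cost.

    Default parameters are an assumption, see the lead-in text.
    """
    return min(
        c_miss * p_target * miss + c_fa * (1.0 - p_target) * false_alarm
        for _, miss, false_alarm in error_rates(target_scores, impostor_scores)
    )


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    targets = [2.0, 1.5, 0.3, 1.8]
    impostors = [0.1, 0.4, -0.5, 0.2]
    print("EER:", equal_error_rate(targets, impostors))
    print("min cost:", min_detection_cost(targets, impostors))
```

The exhaustive sweep is quadratic in the number of trials, which is acceptable for a sketch; for full-size trial lists a sorted, cumulative-count implementation would be preferable.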
To get the database, please contact the organizers. 
Further details are available at: http://www1.i2r.a-star.edu.sg/~kalee/is2014/tdspk.html 
Organizers: 
Anthony LARCHER (alarcher [at]i2r.a-star.edu.sg) Institute for Infocomm Research 
Hagai ARONOWITZ (hagaia [at]il.ibm.com) IBM Research – Haifa 
Kong Aik LEE (kalee [at]i2r.a-star.edu.sg) Institute for Infocomm Research 
Patrick KENNY (patrick.kenny [at]crim.ca) CRIM – Montréal 
Tutorial Dialogues and Spoken Dialogue Systems 
The growing interest in educational applications that use spoken interaction and dialogue technology has boosted
research and development of interactive tutorial systems. In recent years, advances have been achieved
in both the spoken dialogue and education research communities, with sophisticated speech and multi-modal
technology that allows functionally suitable and reasonably robust applications to be built.
The special session combines spoken dialogue research, interaction modeling, and educational applications, 
and brings together the two INTERSPEECH SIG communities: SLaTE and SIGdial. The session focuses 
on methods, problems and challenges that are shared by both communities, such as sophistication 
of speech processing and dialogue management for educational interaction, integration of the models 
with theories of emotion, rapport, and mutual understanding, as well as application of the techniques 
to novel learning environments, robot interaction, etc. The session aims to survey issues related 
to the processing of spoken language in various learning situations, modeling of the teacher-student 
interaction in MOOC-like environments, as well as evaluating tutorial dialogue systems from 
the point of view of natural interaction, technological robustness, and learning outcome. 
The session encourages interdisciplinary research and submissions related to the special focus 
of the conference, 'Celebrating the Diversity of Spoken Languages'. 
For further information, see http://junionsjlee.wix.com/INTERSPEECH
Organizers: 
Maxine Eskenazi, max+ [at]cs.cmu.edu 
Kristiina Jokinen, kristiina.jokinen [at]helsinki.fi 
Diane Litman, litman [at]cs.pitt.edu 
Martin Russell, M.J.RUSSELL [at]bham.ac.uk
Visual Speech Decoding 
Speech perception is a bi-modal process that takes into account both the acoustic (what we hear) 
and visual (what we see) speech information. It has been widely acknowledged that visual cues play
a critical role in automatic speech recognition (ASR), especially when audio is corrupted by,
for example, background noise or voices from untargeted speakers, or is even inaccessible.
Decoding visual speech is crucial if ASR technologies are to be widely deployed
to realize truly natural human-computer interactions. Despite the advances in acoustic ASR,
visual speech decoding remains a challenging problem. 
The special session aims to attract more effort to tackle this important problem. In particular, 
we would like to encourage researchers to focus on some critical questions in the area. 
As a starting point, we propose the following four questions:
1. How to deal with the speaker dependency in visual speech data? 
2. How to cope with the head-pose variation? 
3. How to encode temporal information in visual features? 
4. How to automatically adapt the fusion rule when the quality of the two individual (audio and visual) 
modalities varies? 
Researchers and participants are encouraged to raise more questions related to visual speech decoding. 
We expect the session to draw a wide range of attention from both the speech recognition and machine vision 
communities to the problem of visual speech decoding. 
Organizers: 
Ziheng Zhou, University of Oulu, ziheng.zhou [at]ee.oulu.fi 
Matti Pietikäinen, University of Oulu, matti.pietikainen [at]ee.oulu.fi 
Guoying Zhao, University of Oulu, gyzhao [at]ee.oulu.fi 