(2014-09-14) Special sessions at Interspeech 2014: call for submissions
--- INTERSPEECH 2014 - SINGAPORE
--- September 14-18, 2014
--- http://www.INTERSPEECH2014.org
INTERSPEECH is the world's largest and most comprehensive conference on issues surrounding
the science and technology of spoken language processing, both in humans and in machines.
The theme of INTERSPEECH 2014 is
--- Celebrating the Diversity of Spoken Languages ---
INTERSPEECH 2014 includes a number of special sessions covering interdisciplinary topics
and/or important new emerging areas of interest related to the main conference topics.
Special sessions proposed for the forthcoming edition are:
• A Re-evaluation of Robustness
• Deep Neural Networks for Speech Generation and Synthesis
• Exploring the Rich Information of Speech Across Multiple Languages
• INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)
• Multichannel Processing for Distant Speech Recognition
• Open Domain Situated Conversational Interaction
• Phase Importance in Speech Processing Applications
• Speaker Comparison for Forensic and Investigative Applications
• Text-dependent for Short-duration Speaker Verification
• Tutorial Dialogues and Spoken Dialogue Systems
• Visual Speech Decoding
A description of each special session is given below.
For paper submission, please follow the main conference procedure and choose the Special Session track when selecting
your paper area.
Paper submission procedure is described at:
http://www.INTERSPEECH2014.org/public.php?page=submission_procedure.html
For more information, feel free to contact the Special Session Chair,
Dr. Tomi H. Kinnunen, at email tkinnu [at]cs.uef.fi
----------------------------------------------------------------------------------------------------
Special Session Description
----------------------------------------------------------------------------------------------------
A Re-evaluation of Robustness
The goal of the session is to facilitate a re-evaluation of robust speech
recognition in the light of recent developments. It’s a re-evaluation at two levels:
• A re-evaluation in perspective, prompted by the performance breakthroughs obtained
with Deep Neural Networks, which lead to a fresh questioning of the role and
contribution of robust feature extraction.
• A literal re-evaluation on common databases to be able to present and compare
performances of different algorithms and system approaches to robustness.
Paper submissions are invited on the theme of noise-robust speech recognition.
Authors are required to report results on the Aurora 4 database to facilitate cross-comparison
of performance between different techniques.
Recent developments raise interesting research questions, and the session aims to help
make progress on them by bringing focus to and exploring these issues. For example:
1. What role is there for signal processing in creating feature representations to use as
inputs to deep learning, or can deep learning do all the work?
2. What feature representations can be automatically learnt in a deep learning architecture?
3. What other techniques can give great improvement in robustness?
4. What techniques don’t work and why?
The session organizers wish to encourage submissions that bring insight and understanding to
the issues highlighted above. Authors are requested not only to present absolute performance
of the whole system but also to highlight the contribution made by various components in a
complex system.
Authors of papers accepted for the session are encouraged to also evaluate their techniques on new test
data sets (available in July) and submit their results at the end of August.
Session organization
The session will be structured as a combination of
1. Invited talks
2. Oral paper presentations
3. Poster presentations
4. Summary of contributions and results on newly released test sets
5. Discussion
Organizers:
David Pearce, Audience, dpearce [at]audience.com
Hans-Guenter Hirsch, Niederrhein University of Applied Sciences, hans-guenter.hirsch [at]hs-niederrhein.de
Reinhold Haeb-Umbach, University of Paderborn, haeb [at]nt.uni-paderborn.de
Michael Seltzer, Microsoft, mseltzer [at]microsoft.com
Keikichi Hirose, The University of Tokyo, hirose [at]gavo.t.u-tokyo.ac.jp
Steve Renals, University of Edinburgh, s.renals [at]ed.ac.uk
Sim Khe Chai, National University of Singapore, simkc [at]comp.nus.edu.sg
Niko Moritz, Fraunhofer IDMT, Oldenburg, niko.moritz [at]idmt.fraunhofer.de
K K Chin, Google, kkchin [at]google.com
Deep Neural Networks for Speech Generation and Synthesis
This special session aims to bring together researchers who work actively on deep neural
networks for speech research, particularly in generation and synthesis, to promote and
better understand state-of-the-art DNN research in statistical learning and to compare
results with the parametric HMM-GMM model, which is well established for speech synthesis,
generation, and conversion. A DNN, with its neuron-like structure, can simulate the human speech
production system in a layered, hierarchical, nonlinear and self-organized network.
It can transform linguistic text information into intermediate semantic, phonetic and prosodic
content and finally generate speech waveforms. Many possible neural network architectures or
topologies exist, e.g., feed-forward NNs with multiple hidden layers, stacked RBMs or CRBMs, and
Recurrent Neural Nets (RNNs), which have been used for speech/image recognition and other applications.
We would like to use this special session as a forum to present updated results from the research frontiers,
algorithm development and application scenarios. Particular focus areas will be
parametric TTS synthesis, voice conversion, speech compression, de-noising and speech enhancement.
Organizers:
Yao Qian, Microsoft Research Asia, yaoqian [at]microsoft.com
Frank K. Soong, Microsoft Research Asia, frankkps [at]microsoft.com
Exploring the Rich Information of Speech Across Multiple Languages
Spoken language is the most direct means of communication between human beings. However,
speech communication often demonstrates its language-specific characteristics because of,
for instance, the linguistic difference (e.g., tonal vs. non-tonal, monosyllabic vs. multisyllabic)
across languages. Our knowledge of the diversities of speech science across languages is still limited,
including speech perception, linguistic and non-linguistic (e.g., emotion) information, etc.
This knowledge is of great significance for the design of language-specific applications of
speech techniques (e.g., automatic speech recognition, assistive hearing devices) in the future.
This special session will provide an opportunity for researchers from various communities
(including speech science, medicine, linguistics and signal processing) to stimulate further discussion
and new research in the broad cross-language area, and present their latest research on understanding
the language-specific features of speech science and their applications in the speech communication of
machines and human beings. This special session encourages contributions from all fields of speech science,
e.g., production and perception, but with a focus on presenting the language-specific characteristics
and discussing their implications to improve our knowledge on the diversities of speech science across
multiple languages. Topics of interest include, but are not limited to:
1. characteristics of acoustic, linguistic and language information in speech communication across
multiple languages;
2. diversity of linguistic and non-linguistic (e.g., emotion) information among multiple spoken languages;
3. language-specific speech intelligibility enhancement and automatic speech recognition techniques; and
4. comparative cross-language assessment of speech perception in challenging environments.
Organizers:
Junfeng Li, Institute of Acoustics, Chinese Academy of Sciences, junfeng.li.1979 [at]gmail.com
Fei Chen, The University of Hong Kong, feichen1 [at]hku.hk
INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)
The INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE) is an open Challenge
dealing with speaker characteristics as manifested in their speech signal's acoustic properties.
This year, it introduces new tasks with the Cognitive Load Sub-Challenge, the Physical Load
Sub-Challenge, and a Multitask Sub-Challenge. For these Challenge tasks,
the COGNITIVE-LOAD WITH SPEECH AND EGG database (CLSE), the MUNICH BIOVOICE CORPUS (MBC),
and the ANXIETY-DEPRESSION-EMOTION-SLEEPINESS audio corpus (ADES) with high diversity of
speakers and different languages covered (Australian English and German) are provided by the organizers.
All corpora provide fully realistic data in challenging acoustic conditions and feature rich
annotation such as speaker meta-data. They are given with distinct definitions of test,
development, and training partitions, incorporating speaker independence as needed in most
real-life settings. Benchmark results of the most popular approaches are provided as in the years before.
Transcription of the train and development sets will be known. All Sub-Challenges allow contributors
to find their own features with their own machine learning algorithm. However, a standard feature set
will be provided per corpus that may be used. Participants will have to stick to the definition of
training, development, and test sets. They may report on results obtained on the development set,
but have only five trials to upload their results on the test sets, whose labels are unknown to them.
Each participation will be accompanied by a paper presenting the results that undergoes peer-review
and has to be accepted for the conference in order to participate in the Challenge.
The results of the Challenge will be presented in a Special Session at INTERSPEECH 2014 in Singapore.
Further, contributions using the Challenge data or related to the Challenge but not competing within
the Challenge are also welcome.
More information is also given on the Challenge homepage:
http://emotion-research.net/sigs/speech-sig/is14-compare
Organizers:
Björn Schuller, Imperial College London / Technische Universität München, schuller [at]IEEE.org
Stefan Steidl, Friedrich-Alexander-University, stefan.steidl [at]fau.de
Anton Batliner, Technische Universität München / Friedrich-Alexander-University,
batliner [at]cs.fau.de
Jarek Krajewski, Bergische Universität Wuppertal, krajewsk [at]uni-wuppertal.de
Julien Epps, The University of New South Wales / National ICT Australia, j.epps [at]unsw.edu.au
Multichannel Processing for Distant Speech Recognition
Distant speech recognition in real-world environments is still a challenging problem: reverberation
and dynamic background noise represent major sources of acoustic mismatch that heavily decrease ASR
performance, which, by contrast, can be very good in close-talking microphone setups.
In this context, a particularly interesting topic is the adoption of distributed microphones for
the development of voice-enabled automated home environments based on distant-speech interaction:
microphones are installed in different rooms and the resulting multichannel audio recordings capture
multiple audio events, including voice commands or spontaneous speech, generated in various locations
and characterized by a variable amount of reverberation as well as possible background noise.
The focus of the proposed special session will be on multichannel processing for automatic speech recognition (ASR)
in such a setting. Unlike other robust ASR tasks, where static adaptation or training with noisy data noticeably
improves performance, the distributed microphone scenario requires full exploitation of multichannel
information to reduce the highly variable dynamic mismatch. To facilitate better evaluation of the proposed
algorithms the organizers will provide a set of multichannel recordings in a domestic environment.
The recordings will include spoken commands mixed with other acoustic events occurring in different
rooms of a real apartment.
The data is being created within the framework of the EC project DIRHA (Distant speech Interaction for Robust
Home Applications), which addresses the challenges of speech interaction for home automation.
The organizers will release the evaluation package (datasets and scripts) on February 17;
the participants are asked to submit a regular paper reporting speech recognition results
on the evaluation set and comparing their performance with the provided reference baseline.
Further details are available at: http://dirha.fbk.eu/INTERSPEECH2014
Organizers:
Marco Matassoni, Fondazione Bruno Kessler, matasso [at]fbk.eu
Ramon Fernandez Astudillo, Instituto de Engenharia de Sistemas e Computadores, ramon.astudillo [at]inesc-id.pt
Athanasios Katsamanis, National Technical University of Athens, nkatsam [at]cs.ntua.gr
Open Domain Situated Conversational Interaction
Robust conversational systems have the potential to revolutionize our interactions with computers.
Building on decades of academic and industrial research, we now talk to our computers, phones,
and entertainment systems on a daily basis. However, current technology typically limits conversational
interactions to a few narrow domains/topics (e.g., weather, traffic, restaurants). Users increasingly want
the ability to converse with their devices over broad web-scale content. Finding something on your PC or
the web should be as simple as having a conversation.
A promising approach to address this problem is situated conversational interaction. The approach leverages
the situation and/or context of the conversation to improve system accuracy and effectiveness.
Sources of context include visual content being displayed to the user, geo-location, prior interactions,
multi-modal interactions (e.g., gesture, eye gaze), and the conversation itself. For example, while a user
is reading a news article on their tablet PC, they initiate a conversation to dig deeper on a particular topic.
Or a user is reading a map and wants to learn more about the history of events at mile marker 121.
Or a gamer wants to interact with a game’s characters to find the next clue in their quest.
All of these interactions are situated – rich context is available to the system as a source of priors/constraints
on what the user is likely to say.
This special session will provide a forum to discuss research progress in open domain situated
conversational interactions.
Topics of the session will include:
• Situated context in spoken dialog systems
• Visual/dialog/personal/geo situated context
• Inferred context through interpretation and reasoning
• Open domain spoken dialog systems
• Open domain spoken/natural language understanding and generation
• Open domain semantic interpretation
• Open domain dialog management (large-scale belief state/policy)
• Conversational Interactions
• Multi-modal inputs in situated open domains (speech/text + gesture, touch, eye gaze)
• Multi-human situated interactions
Organizers:
Larry Heck, Microsoft Research, larry [at]ieee.org
Dilek Hakkani-Tür, Microsoft Research, dilek [at]ieee.org
Gokhan Tur, Microsoft Research, gokhan [at]ieee.org
Steve Young, Cambridge University, sjy [at]eng.cam.ac.uk
Phase Importance in Speech Processing Applications
Over the past decades, the amplitude of the speech spectrum has been considered the most important feature in
different speech processing applications, and the phase of the speech signal has received less
attention. Recently, several findings have demonstrated the importance of phase to the speech and audio processing communities.
The importance of phase estimation along with amplitude estimation in speech enhancement,
complementary phase-based features in speech and speaker recognition, and phase-aware acoustic
modeling of the environment are the most prominent reported works,
scattered across different communities of speech and audio processing. These examples suggest
that incorporating the phase information can push the limits of state-of-the-art phase-independent solutions
employed for long in different aspects of audio and speech signal processing. This Special Session aims
to explore the recent advances and methodologies to exploit the knowledge of signal phase information in different
aspects of speech processing. Without a dedicated effort to bring together researchers from different communities,
rapid progress in investigating the usefulness of phase in speech processing applications
is difficult to achieve. Therefore, as a first step in this direction, we aim to promote 'phase-aware
speech and audio signal processing' and to form a community of researchers to organize the next steps.
Our initiative is to unify these efforts to better understand the pros and cons of using phase and the degree
of feasibility for phase estimation/enhancement in different areas of speech processing including: speech
enhancement, speech separation, speech quality estimation, speech and speaker recognition,
voice transformation, and speech analysis and synthesis. The goal is to promote
phase-based signal processing, to study its importance, and to share interesting findings from different
speech processing applications.
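As a small illustration of the amplitude/phase distinction discussed above, the following sketch splits the spectrum of a synthetic frame into magnitude and phase and compares resynthesis with and without the phase. It is only a toy example under stated assumptions: the 16-sample sinusoidal frame, the naive DFT, and the helper names `dft` and `idft` are all hypothetical and are not taken from any session material.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


# Hypothetical test frame: a sinusoid with a non-zero phase offset.
frame = [math.sin(2 * math.pi * 2 * t / 16 + 0.5) for t in range(16)]

spectrum = dft(frame)
magnitude = [abs(c) for c in spectrum]       # amplitude spectrum
phase = [cmath.phase(c) for c in spectrum]   # phase spectrum in radians

# Resynthesis using both magnitude and phase recovers the frame.
with_phase = idft([cmath.rect(m, p) for m, p in zip(magnitude, phase)])
# Resynthesis from magnitude alone (phase set to zero) does not.
zero_phase = idft([complex(m, 0.0) for m in magnitude])

err_with_phase = max(abs(a - b) for a, b in zip(frame, with_phase))
err_zero_phase = max(abs(a - b) for a, b in zip(frame, zero_phase))
print(err_with_phase, err_zero_phase)
```

The magnitude-only reconstruction keeps the same amplitude spectrum but yields a visibly different waveform, which is the kind of information a phase-independent method discards.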
Organizers:
Pejman Mowlaee, Graz University of Technology, pejman.mowlaee [at]tugraz.at
Rahim Saeidi, University of Eastern Finland, rahim.saeidi [at]uef.fi
Yannis Stylianou, Toshiba Labs Cambridge UK / University of Crete, yannis [at]csd.uoc.gr
Speaker Comparison for Forensic and Investigative Applications
In speaker comparison, speech/voice samples are compared by humans and/or machines
for use in investigation or in court to address questions that are of interest to the legal system.
Speaker comparison is a high-stakes application that can change people’s lives and it demands the best
that science has to offer; however, methods, processes, and practices vary widely.
These variations are not necessarily for the better and, though recognized, are not generally appreciated
and acted upon. Methods, processes, and practices grounded in science are critical for the proper application
(and non-application) of speaker comparison to a variety of international investigative and forensic applications.
This special session will contribute to scientific progress through 1) understanding speaker comparison
for investigative and forensic application (e.g., describe what is currently being done and critically
analyze performance and lessons learned); 2) improving speaker comparison for investigative and forensic
applications (e.g., propose new approaches/techniques, understand the limitations, and identify challenges
and opportunities); 3) improving communications between communities of researchers, legal scholars,
and practitioners internationally (e.g., directly address some central legal, policy, and societal questions
such as allowing speaker comparisons in court, requirements for expert witnesses, and requirements for specific
automatic or human-based methods to be considered scientific); 4) using best practices (e.g., reduction of bias
and presentation of evidence); 5) developing a roadmap for progress in this session and future sessions; and 6)
producing a documented contribution to the field. Some of these objectives will need multiple sessions
to fully achieve and some are complicated due to differing legal systems and cultures.
This special session builds on previous successful special sessions and tutorials in forensic applications
of speaker comparison at INTERSPEECH beginning in 2003. Wide international participation is planned,
including researchers from the ISCA SIGs for the Association Francophone de la Communication Parlée (AFCP)
and the Speaker and Language Characterization (SpLC).
Organizers:
Joseph P. Campbell, PhD, MIT Lincoln Laboratory, jpc [at]ll.mit.edu
Jean-François Bonastre, l'Université d'Avignon, jean-francois.bonastre [at]univ-avignon.fr
Text-dependent for Short-duration Speaker Verification
In recent years, speaker verification engines have reached maturity and have been deployed in
commercial applications. Ergonomics of such applications is especially demanding and imposes
a drastic limitation in terms of speech duration during authentication. A well-known tactic to address
the problem of lack of data, due to short duration, is using text-dependency. However, recent breakthroughs
achieved in the context of text-independent speaker verification in terms of accuracy and robustness
do not benefit text-dependent applications. Indeed, the large amounts of development data required by the recent
approaches are not available in the text-dependent context. The purpose of this special session is
to gather the research efforts from both academia and industry toward a common goal of establishing
a new baseline and exploring new directions for text-dependent speaker verification.
The focus of the session is on robustness with respect to duration and modeling of lexical information.
To support the development and evaluation of text-dependent speaker verification technologies,
the Institute for Infocomm Research (I2R) has recently released the RSR2015 database,
including 150 hours of data recorded from 300 speakers. Papers submitted to the special
session are encouraged, but not required, to provide results based on the RSR2015 database
in order to enable comparison of algorithms and methods. For this purpose, the organizers strongly
encourage the participants to report performance on the protocol delivered with the database
in terms of EER and minimum cost (in the sense of the NIST 2008 Speaker Recognition Evaluation).
To get the database, please contact the organizers.
Further details are available at: http://www1.i2r.a-star.edu.sg/~kalee/is2014/tdspk.html
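For readers less familiar with the EER metric mentioned above, the following is a minimal sketch of how an equal error rate can be estimated from a list of target and impostor trial scores. It is not the RSR2015 protocol or an official scoring tool; the function name `equal_error_rate` and the example scores are hypothetical, and the minimum-cost metric is not shown.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= threshold. The EER is taken at
    the threshold where the false rejection rate (targets rejected) and the
    false acceptance rate (impostors accepted) are closest, and is reported
    as the mean of the two rates at that point.
    """
    best = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical verification scores (higher means "more likely the same speaker").
targets = [2.1, 1.7, 0.9, 1.2, 0.2]
impostors = [-1.0, 0.1, 0.4, -0.3, 1.0]

eer, threshold = equal_error_rate(targets, impostors)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With so few trials the estimate is coarse; evaluation tools typically interpolate between operating points rather than picking the nearest one.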
Organizers:
Anthony LARCHER (alarcher [at]i2r.a-star.edu.sg) Institute for Infocomm Research
Hagai ARONOWITZ (hagaia [at]il.ibm.com) IBM Research – Haifa
Kong Aik LEE (kalee [at]i2r.a-star.edu.sg) Institute for Infocomm Research
Patrick KENNY (patrick.kenny [at]crim.ca) CRIM – Montréal
Tutorial Dialogues and Spoken Dialogue Systems
The growing interest in educational applications that use spoken interaction and dialogue technology has boosted
research and development of interactive tutorial systems. In recent years, advances have been achieved
in both the spoken dialogue and education research communities, with sophisticated speech and multi-modal
technology that allows functionally suitable and reasonably robust applications to be built.
The special session combines spoken dialogue research, interaction modeling, and educational applications,
and brings together the two INTERSPEECH SIG communities: SLaTE and SIGdial. The session focuses
on methods, problems and challenges that are shared by both communities, such as sophistication
of speech processing and dialogue management for educational interaction, integration of the models
with theories of emotion, rapport, and mutual understanding, as well as application of the techniques
to novel learning environments, robot interaction, etc. The session aims to survey issues related
to the processing of spoken language in various learning situations, modeling of the teacher-student
interaction in MOOC-like environments, as well as evaluating tutorial dialogue systems from
the point of view of natural interaction, technological robustness, and learning outcome.
The session encourages interdisciplinary research and submissions related to the special focus
of the conference, 'Celebrating the Diversity of Spoken Languages'.
For further information, see http://junionsjlee.wix.com/INTERSPEECH
Organizers:
Maxine Eskenazi, max+ [at]cs.cmu.edu
Kristiina Jokinen, kristiina.jokinen [at]helsinki.fi
Diane Litman, litman [at]cs.pitt.edu
Martin Russell, M.J.RUSSELL [at]bham.ac.uk
Visual Speech Decoding
Speech perception is a bi-modal process that takes into account both the acoustic (what we hear)
and visual (what we see) speech information. It has been widely acknowledged that visual cues play
a critical role in automatic speech recognition (ASR), especially when audio is corrupted by,
for example, background noise or voices from untargeted speakers, or is even inaccessible.
Decoding visual speech is critically important for ASR technologies to be widely implemented
to realize truly natural human-computer interactions. Despite the advances in acoustic ASR,
visual speech decoding remains a challenging problem.
The special session aims to attract more effort to tackle this important problem. In particular,
we would like to encourage researchers to focus on some critical questions in the area.
As a starting point, we propose the following four questions:
1. How to deal with the speaker dependency in visual speech data?
2. How to cope with the head-pose variation?
3. How to encode temporal information in visual features?
4. How to automatically adapt the fusion rule when the quality of the two individual (audio and visual)
modalities varies?
Researchers and participants are encouraged to raise more questions related to visual speech decoding.
We expect the session to draw a wide range of attention from both the speech recognition and machine vision
communities to the problem of visual speech decoding.
Organizers:
Ziheng Zhou, University of Oulu, ziheng.zhou [at]ee.oulu.fi
Matti Pietikäinen, University of Oulu, matti.pietikainen [at]ee.oulu.fi
Guoying Zhao, University of Oulu, gyzhao [at]ee.oulu.fi