(2014-09-14) Special sessions at Interspeech 2014: call for submissions
ISCApad #190
Thursday, April 10, 2014 by Chris Wellekens
--- INTERSPEECH 2014 - SINGAPORE --- September 14-18, 2014 ---
http://www.INTERSPEECH2014.org

INTERSPEECH is the world's largest and most comprehensive conference on issues surrounding the science and technology of spoken language processing, both in humans and in machines. The theme of INTERSPEECH 2014 is

--- Celebrating the Diversity of Spoken Languages ---

INTERSPEECH 2014 includes a number of special sessions covering interdisciplinary topics and/or important new emerging areas of interest related to the main conference topics. Special sessions proposed for the forthcoming edition are:

• A Re-evaluation of Robustness
• Deep Neural Networks for Speech Generation and Synthesis
• Exploring the Rich Information of Speech Across Multiple Languages
• INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)
• Multichannel Processing for Distant Speech Recognition
• Open Domain Situated Conversational Interaction
• Phase Importance in Speech Processing Applications
• Speaker Comparison for Forensic and Investigative Applications
• Text-dependent for Short-duration Speaker Verification
• Tutorial Dialogues and Spoken Dialogue Systems
• Visual Speech Decoding

A description of each special session is given below.

For paper submission, please follow the main conference procedure and choose the Special Session track when selecting your paper area. The paper submission procedure is described at:
http://www.INTERSPEECH2014.org/public.php?page=submission_procedure.html

For more information, feel free to contact the Special Session Chair, Dr. Tomi H. Kinnunen, at tkinnu [at]cs.uef.fi

----------------------------------------------------------------------------------------------------
Special Session Description
----------------------------------------------------------------------------------------------------

A Re-evaluation of Robustness

The goal of the session is to facilitate a re-evaluation of robust speech recognition in the light of recent developments.
It is a re-evaluation at two levels:

• A re-evaluation in perspective, brought about by the breakthroughs in performance obtained by Deep Neural Networks, which lead to a fresh questioning of the role and contribution of robust feature extraction.
• A literal re-evaluation on common databases, to be able to present and compare the performance of different algorithms and system approaches to robustness.

Paper submissions are invited on the theme of noise robust speech recognition and are required to report results on the Aurora 4 database to facilitate cross comparison of performance between different techniques.

Recent developments raise interesting research questions that the session aims to help progress by bringing focus to, and exploration of, these issues. For example:

1. What role is there for signal processing to create feature representations to use as inputs to deep learning, or can deep learning do all the work?
2. What feature representations can be automatically learnt in a deep learning architecture?
3. What other techniques can give great improvement in robustness?
4. What techniques don’t work and why?

The session organizers wish to encourage submissions that bring insight and understanding to the issues highlighted above. Authors are requested not only to present the absolute performance of the whole system but also to highlight the contribution made by various components in a complex system. Authors of papers accepted for the session are encouraged to also evaluate their techniques on new test data sets (available in July) and submit their results at the end of August.

Session organization

The session will be structured as a combination of:

1. Invited talks
2. Oral paper presentations
3. Poster presentations
4. Summary of contributions and results on newly released test sets
5. Discussion

Organizers:
David Pearce, Audience, dpearce [at]audience.com
Hans-Guenter Hirsch, Niederrhein University of Applied Sciences, hans-guenter.hirsch [at]hs-niederrhein.de
Reinhold Haeb-Umbach, University of Paderborn, haeb [at]nt.uni-paderborn.de
Michael Seltzer, Microsoft, mseltzer [at]microsoft.com
Keikichi Hirose, The University of Tokyo, hirose [at]gavo.t.u-tokyo.ac.jp
Steve Renals, University of Edinburgh, s.renals [at]ed.ac.uk
Sim Khe Chai, National University of Singapore, simkc [at]comp.nus.edu.sg
Niko Moritz, Fraunhofer IDMT, Oldenburg, niko.moritz [at]idmt.fraunhofer.de
K K Chin, Google, kkchin [at]google.com

Deep Neural Networks for Speech Generation and Synthesis

This special session aims to bring together researchers who work actively on deep neural networks for speech research, particularly in generation and synthesis, to promote and better understand state-of-the-art DNN research in statistical learning, and to compare results with the parametric HMM-GMM model, which is well established for speech synthesis, generation, and conversion. A DNN, with its neuron-like structure, can simulate the human speech production system in a layered, hierarchical, nonlinear and self-organized network. It can transform linguistic text information into intermediate semantic, phonetic and prosodic content and finally generate speech waveforms. Many possible neural network architectures or topologies exist, e.g. feed-forward NNs with multiple hidden layers, stacked RBMs or CRBMs, and Recurrent Neural Nets (RNNs), which have been applied to speech/image recognition and other applications. We would like to use this special session as a forum to present updated results in the research frontiers, algorithm development and application scenarios. Particular focus areas will be parametric TTS synthesis, voice conversion, speech compression, de-noising and speech enhancement.

Organizers:
Yao Qian, Microsoft Research Asia, yaoqian [at]microsoft.com
Frank K. Soong, Microsoft Research Asia, frankkps [at]microsoft.com

Exploring the Rich Information of Speech Across Multiple Languages

Spoken language is the most direct means of communication between human beings. However, speech communication often demonstrates language-specific characteristics because of, for instance, linguistic differences (e.g., tonal vs. non-tonal, monosyllabic vs. multisyllabic) across languages. Our knowledge of the diversities of speech science across languages is still limited, including speech perception, linguistic and non-linguistic (e.g., emotion) information, etc. This knowledge is of great significance for the future design of language-specific applications of speech techniques (e.g., automatic speech recognition, assistive hearing devices).

This special session will provide an opportunity for researchers from various communities (including speech science, medicine, linguistics and signal processing) to stimulate further discussion and new research in the broad cross-language area, and to present their latest research on understanding the language-specific features of speech science and their applications in the speech communication of machines and human beings. This special session encourages contributions from all fields of speech science, e.g., production and perception, but with a focus on presenting language-specific characteristics and discussing their implications, in order to improve our knowledge of the diversities of speech science across multiple languages. Topics of interest include, but are not limited to:

1. characteristics of acoustic, linguistic and language information in speech communication across multiple languages;
2. diversity of linguistic and non-linguistic (e.g., emotion) information among multiple spoken languages;
3. language-specific speech intelligibility enhancement and automatic speech recognition techniques; and
4. comparative cross-language assessment of speech perception in challenging environments.

Organizers:
Junfeng Li, Institute of Acoustics, Chinese Academy of Sciences, junfeng.li.1979 [at]gmail.com
Fei Chen, The University of Hong Kong, feichen1 [at]hku.hk

INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)

The INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE) is an open Challenge dealing with speaker characteristics as manifested in the acoustic properties of their speech signal. This year, it introduces new tasks with the Cognitive Load Sub-Challenge, the Physical Load Sub-Challenge, and a Multitask Sub-Challenge. For these Challenge tasks, the COGNITIVE-LOAD WITH SPEECH AND EGG database (CLSE), the MUNICH BIOVOICE CORPUS (MBC), and the ANXIETY-DEPRESSION-EMOTION-SLEEPINESS audio corpus (ADES), with a high diversity of speakers and different languages covered (Australian English and German), are provided by the organizers. All corpora provide fully realistic data in challenging acoustic conditions and feature rich annotation such as speaker meta-data. They are given with distinct definitions of test, development, and training partitions, incorporating speaker independence as needed in most real-life settings. Benchmark results of the most popular approaches are provided, as in previous years. Transcription of the train and development sets will be known.

All Sub-Challenges allow contributors to find their own features with their own machine learning algorithm. However, a standard feature set that may be used will be provided per corpus. Participants will have to stick to the definition of training, development, and test sets. They may report on results obtained on the development set, but have only five trials to upload their results on the test sets, whose labels are unknown to them.
Each participation will be accompanied by a paper presenting the results, which undergoes peer review and has to be accepted for the conference in order to participate in the Challenge. The results of the Challenge will be presented in a Special Session at INTERSPEECH 2014 in Singapore. Further, contributions using the Challenge data or related to the Challenge but not competing within the Challenge are also welcome.

More information is also given on the Challenge homepage:
http://emotion-research.net/sigs/speech-sig/is14-compare

Organizers:
Björn Schuller, Imperial College London / Technische Universität München, schuller [at]IEEE.org
Stefan Steidl, Friedrich-Alexander-University, stefan.steidl [at]fau.de
Anton Batliner, Technische Universität München / Friedrich-Alexander-University, batliner [at]cs.fau.de
Jarek Krajewski, Bergische Universität Wuppertal, krajewsk [at]uni-wuppertal.de
Julien Epps, The University of New South Wales / National ICT Australia, j.epps [at]unsw.edu.au

Multichannel Processing for Distant Speech Recognition

Distant speech recognition in real-world environments is still a challenging problem: reverberation and dynamic background noise represent major sources of acoustic mismatch that heavily decrease ASR performance, which, by contrast, can be very good in close-talking microphone setups. In this context, a particularly interesting topic is the adoption of distributed microphones for the development of voice-enabled automated home environments based on distant-speech interaction: microphones are installed in different rooms, and the resulting multichannel audio recordings capture multiple audio events, including voice commands or spontaneous speech, generated in various locations and characterized by a variable amount of reverberation as well as possible background noise.

The focus of the proposed special session will be on multichannel processing for automatic speech recognition (ASR) in such a setting.
Unlike other robust ASR tasks, where static adaptation or training with noisy data considerably improves performance, the distributed microphone scenario requires full exploitation of multichannel information to reduce the highly variable dynamic mismatch. To facilitate better evaluation of the proposed algorithms, the organizers will provide a set of multichannel recordings in a domestic environment. The recordings will include spoken commands mixed with other acoustic events occurring in different rooms of a real apartment. The data is being created within the framework of the EC project DIRHA (Distant speech Interaction for Robust Home Applications), which addresses the challenges of speech interaction for home automation.

The organizers will release the evaluation package (datasets and scripts) on February 17; participants are asked to submit a regular paper reporting speech recognition results on the evaluation set and comparing their performance with the provided reference baseline.

Further details are available at: http://dirha.fbk.eu/INTERSPEECH2014

Organizers:
Marco Matassoni, Fondazione Bruno Kessler, matasso [at]fbk.eu
Ramon Fernandez Astudillo, Instituto de Engenharia de Sistemas e Computadores, ramon.astudillo [at]inesc-id.pt
Athanasios Katsamanis, National Technical University of Athens, nkatsam [at]cs.ntua.gr

Open Domain Situated Conversational Interaction

Robust conversational systems have the potential to revolutionize our interactions with computers. Building on decades of academic and industrial research, we now talk to our computers, phones, and entertainment systems on a daily basis. However, current technology typically limits conversational interactions to a few narrow domains/topics (e.g., weather, traffic, restaurants). Users increasingly want the ability to converse with their devices over broad web-scale content. Finding something on your PC or the web should be as simple as having a conversation.
A promising approach to address this problem is situated conversational interaction. The approach leverages the situation and/or context of the conversation to improve system accuracy and effectiveness. Sources of context include visual content being displayed to the user, geo-location, prior interactions, multi-modal interactions (e.g., gesture, eye gaze), and the conversation itself. For example, while a user is reading a news article on their tablet PC, they initiate a conversation to dig deeper on a particular topic. Or a user is reading a map and wants to learn more about the history of events at mile marker 121. Or a gamer wants to interact with a game’s characters to find the next clue in their quest. All of these interactions are situated: rich context is available to the system as a source of priors/constraints on what the user is likely to say.

This special session will provide a forum to discuss research progress in open domain situated conversational interactions. Topics of the session will include:

• Situated context in spoken dialog systems
• Visual/dialog/personal/geo situated context
• Inferred context through interpretation and reasoning
• Open domain spoken dialog systems
• Open domain spoken/natural language understanding and generation
• Open domain semantic interpretation
• Open domain dialog management (large-scale belief state/policy)
• Conversational Interactions
• Multi-modal inputs in situated open domains (speech/text + gesture, touch, eye gaze)
• Multi-human situated interactions

Organizers:
Larry Heck, Microsoft Research, larry [at]ieee.org
Dilek Hakkani-Tür, Microsoft Research, dilek [at]ieee.org
Gokhan Tur, Microsoft Research, gokhan [at]ieee.org
Steve Young, Cambridge University, sjy [at]eng.cam.ac.uk

Phase Importance in Speech Processing Applications

Over the past decades, the amplitude of the speech spectrum has been considered the most important feature in different speech processing applications, and the phase of the speech signal has received less attention. Recently, several findings have demonstrated the importance of phase in the speech and audio processing communities. The most prominent reported works, scattered across different communities of speech and audio processing, concern the importance of phase estimation along with amplitude estimation in speech enhancement, complementary phase-based features in speech and speaker recognition, and phase-aware acoustic modeling of the environment. These examples suggest that incorporating phase information can push the limits of the state-of-the-art phase-independent solutions long employed in different aspects of audio and speech signal processing.

This Special Session aims to explore recent advances and methodologies that exploit knowledge of signal phase information in different aspects of speech processing. Without a dedicated effort to bring together researchers from different communities, rapid progress in investigating the usefulness of phase in speech processing applications is difficult to achieve. Therefore, as the first step in this direction, we aim to promote 'phase-aware speech and audio signal processing' and to form a community of researchers to organize the next steps. Our initiative is to unify these efforts to better understand the pros and cons of using phase and the degree of feasibility of phase estimation/enhancement in different areas of speech processing, including: speech enhancement, speech separation, speech quality estimation, speech and speaker recognition, voice transformation, and speech analysis and synthesis. The goal is to promote phase-based signal processing, study its importance, and share interesting findings from different speech processing applications.
Organizers:
Pejman Mowlaee, Graz University of Technology, pejman.mowlaee [at]tugraz.at
Rahim Saeidi, University of Eastern Finland, rahim.saeidi [at]uef.fi
Yannis Stylianou, Toshiba Labs Cambridge UK / University of Crete, yannis [at]csd.uoc.gr

Speaker Comparison for Forensic and Investigative Applications

In speaker comparison, speech/voice samples are compared by humans and/or machines for use in investigation or in court to address questions that are of interest to the legal system. Speaker comparison is a high-stakes application that can change people’s lives, and it demands the best that science has to offer; however, methods, processes, and practices vary widely. These variations are not necessarily for the better and, though recognized, are not generally appreciated and acted upon. Methods, processes, and practices grounded in science are critical for the proper application (and non-application) of speaker comparison to a variety of international investigative and forensic applications.
This special session will contribute to scientific progress through:

1) understanding speaker comparison for investigative and forensic application (e.g., describe what is currently being done and critically analyze performance and lessons learned);
2) improving speaker comparison for investigative and forensic applications (e.g., propose new approaches/techniques, understand the limitations, and identify challenges and opportunities);
3) improving communications between communities of researchers, legal scholars, and practitioners internationally (e.g., directly address some central legal, policy, and societal questions such as allowing speaker comparisons in court, requirements for expert witnesses, and requirements for specific automatic or human-based methods to be considered scientific);
4) using best practices (e.g., reduction of bias and presentation of evidence);
5) developing a roadmap for progress in this session and future sessions; and
6) producing a documented contribution to the field.

Some of these objectives will need multiple sessions to fully achieve and some are complicated due to differing legal systems and cultures.

This special session builds on previous successful special sessions and tutorials in forensic applications of speaker comparison at INTERSPEECH beginning in 2003. Wide international participation is planned, including researchers from the ISCA SIGs for the Association Francophone de la Communication Parlée (AFCP) and the Speaker and Language Characterization (SpLC).

Organizers:
Joseph P. Campbell, PhD, MIT Lincoln Laboratory, jpc [at]ll.mit.edu
Jean-François Bonastre, l'Université d'Avignon, jean-francois.bonastre [at]univ-avignon.fr

Text-dependent for Short-duration Speaker Verification

In recent years, speaker verification engines have reached maturity and have been deployed in commercial applications.
The ergonomics of such applications are especially demanding and impose a drastic limitation on speech duration during authentication. A well-known tactic to address the lack of data caused by short duration is to use text-dependency. However, recent breakthroughs in accuracy and robustness achieved in the context of text-independent speaker verification do not benefit text-dependent applications. Indeed, the large development data required by the recent approaches is not available in the text-dependent context.

The purpose of this special session is to gather research efforts from both academia and industry toward the common goal of establishing a new baseline and exploring new directions for text-dependent speaker verification. The focus of the session is on robustness with respect to duration and on modeling of lexical information.

To support the development and evaluation of text-dependent speaker verification technologies, the Institute for Infocomm Research (I2R) has recently released the RSR2015 database, including 150 hours of data recorded from 300 speakers. Papers submitted to the special session are encouraged, but not required, to provide results based on the RSR2015 database in order to enable comparison of algorithms and methods. For this purpose, the organizers strongly encourage participants to report performance on the protocol delivered with the database in terms of EER and minimum cost (in the sense of the NIST 2008 Speaker Recognition Evaluation). To get the database, please contact the organizers.
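For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how an equal error rate (EER) and a minimum detection cost can be computed from verification scores. The function names, the toy scores and the cost parameters are assumptions made for illustration only; they are not part of the RSR2015 protocol, and the official cost parameters should be taken from the NIST evaluation plan.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (miss rate, false-alarm rate) when accepting scores >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return p_miss, p_fa


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping every observed score as a threshold
    and averaging the two error rates where they are closest."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        p_miss, p_fa = error_rates(target_scores, impostor_scores, t)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer


def min_detection_cost(target_scores, impostor_scores, c_miss, c_fa, p_target):
    """Minimum over observed thresholds of the (unnormalized) detection cost
    c_miss * P_miss * p_target + c_fa * P_fa * (1 - p_target)."""
    costs = []
    for t in sorted(set(target_scores) | set(impostor_scores)):
        p_miss, p_fa = error_rates(target_scores, impostor_scores, t)
        costs.append(c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target))
    return min(costs)


# Toy scores (hypothetical): higher means "more likely the claimed speaker".
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]

eer = equal_error_rate(targets, impostors)
# Illustrative cost parameters only; check the evaluation plan for official values.
dcf = min_detection_cost(targets, impostors, c_miss=10, c_fa=1, p_target=0.01)
print(f"EER = {eer:.3f}, min cost = {dcf:.4f}")
```

The sketch only considers thresholds that coincide with observed scores, which is enough for small score lists; evaluation toolkits typically interpolate between operating points instead.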
Further details are available at: http://www1.i2r.a-star.edu.sg/~kalee/is2014/tdspk.html

Organizers:
Anthony LARCHER (alarcher [at]i2r.a-star.edu.sg), Institute for Infocomm Research
Hagai ARONOWITZ (hagaia [at]il.ibm.com), IBM Research – Haifa
Kong Aik LEE (kalee [at]i2r.a-star.edu.sg), Institute for Infocomm Research
Patrick KENNY (patrick.kenny [at]crim.ca), CRIM – Montréal

Tutorial Dialogues and Spoken Dialogue Systems

The growing interest in educational applications that use spoken interaction and dialogue technology has boosted research and development of interactive tutorial systems. In recent years, advances have been achieved in both the spoken dialogue and education research communities, with sophisticated speech and multi-modal technology that allows functionally suitable and reasonably robust applications to be built.

The special session combines spoken dialogue research, interaction modeling, and educational applications, and brings together the two INTERSPEECH SIG communities: SLaTE and SIGdial. The session focuses on methods, problems and challenges that are shared by both communities, such as sophistication of speech processing and dialogue management for educational interaction, integration of the models with theories of emotion, rapport, and mutual understanding, as well as application of the techniques to novel learning environments, robot interaction, etc. The session aims to survey issues related to the processing of spoken language in various learning situations and the modeling of teacher-student interaction in MOOC-like environments, as well as the evaluation of tutorial dialogue systems from the point of view of natural interaction, technological robustness, and learning outcome. The session encourages interdisciplinary research and submissions related to the special focus of the conference, 'Celebrating the Diversity of Spoken Languages'.
For further information, see http://junionsjlee.wix.com/INTERSPEECH

Organizers:
Maxine Eskenazi, max+ [at]cs.cmu.edu
Kristiina Jokinen, kristiina.jokinen [at]helsinki.fi
Diane Litman, litman [at]cs.pitt.edu
Martin Russell, M.J.RUSSELL [at]bham.ac.uk

Visual Speech Decoding

Speech perception is a bi-modal process that takes into account both the acoustic (what we hear) and visual (what we see) speech information. It has been widely acknowledged that visual cues play a critical role in automatic speech recognition (ASR), especially when the audio is corrupted by, for example, background noise or voices from untargeted speakers, or is even inaccessible. Decoding visual speech is essential if ASR technologies are to be widely deployed to realize truly natural human-computer interactions. Despite the advances in acoustic ASR, visual speech decoding remains a challenging problem. The special session aims to attract more effort to tackle this important problem. In particular, we would like to encourage researchers to focus on some critical questions in the area. As a starting point, we propose four questions:

1. How to deal with the speaker dependency in visual speech data?
2. How to cope with the head-pose variation?
3. How to encode temporal information in visual features?
4. How to automatically adapt the fusion rule when the quality of the two individual (audio and visual) modalities varies?

Researchers and participants are encouraged to raise more questions related to visual speech decoding. We expect the session to draw a wide range of attention from both the speech recognition and machine vision communities to the problem of visual speech decoding.

Organizers:
Ziheng Zhou, University of Oulu, ziheng.zhou [at]ee.oulu.fi
Matti Pietikäinen, University of Oulu, matti.pietikainen [at]ee.oulu.fi
Guoying Zhao, University of Oulu, gyzhao [at]ee.oulu.fi