ISCA - International Speech Communication Association

ISCApad #250

Friday, April 12, 2019 by Chris Wellekens

5 Resources

5-1 Books

5-1-1

Pejman Mowlaee et al., 'Phase-Aware Signal Processing in Speech Communication: Theory and Practice', Wiley 2016

Phase-Aware Signal Processing in Speech Communication: Theory and Practice

Pejman Mowlaee, Johannes Stahl, Josef Kulmer, Florian Mayer

http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1119238811.html

An overview on the challenging new topic of phase-aware signal processing

Speech communication technology is a key factor in human-machine interaction, digital hearing aids, mobile telephony, and automatic speech/speaker recognition. With the proliferation of these applications, there is a growing requirement for advanced methodologies that can push the limits of the conventional solutions relying on processing the signal magnitude spectrum.

Single-Channel Phase-Aware Signal Processing in Speech Communication provides a comprehensive guide to phase signal processing and reviews the history of phase importance in the literature, basic problems in phase processing, fundamentals of phase estimation together with several applications to demonstrate the usefulness of phase processing.

Key features:

Analysis of recent advances demonstrating the positive impact of phase-based processing in pushing the limits of conventional methods.
Offers unique coverage of the historical context, fundamentals of phase processing and provides several examples in speech communication.
Provides a detailed review of many references and discusses the existing signal processing techniques required to deal with phase information in different applications involved with speech.
The book supplies various examples and MATLAB® implementations delivered within the PhaseLab toolbox.

Single-Channel Phase-Aware Signal Processing in Speech Communication is a valuable single-source for students, non-expert DSP engineers, academics and graduate students.

5-1-2

Jean Caelen, Anne Xuereb, 'Dialogue : altérité, interaction, énaction'

Jean Caelen, Anne Xuereb

Dialogue : altérité, interaction, énaction

Editions universitaires européennes

5-1-3

Bäckström, Tom (with Guillaume Fuchs, Sascha Disch, Christian Uhle and Jeremie Lecomte), 'Speech Coding with Code-Excited Linear Prediction', Springer

Speech Coding with Code-Excited Linear Prediction

Author: Bäckström, Tom

Invited chapters from: Guillaume Fuchs, Sascha Disch, Christian Uhle and Jeremie Lecomte

Publisher: Springer

http://www.springer.com/gp/book/9783319502021

5-1-4

Shinji Watanabe, Marc Delcroix, Florian Metze, John R. Hershey (Eds), 'New Era for Robust Speech Recognition', Springer.

Shinji Watanabe, Marc Delcroix, Florian Metze, John R. Hershey (Eds), 'New Era for Robust Speech Recognition', Springer.

https://link.springer.com/book/10.1007%2F978-3-319-64680-0

5-1-5

Fabrice Marsac, Rudolph Sock, CONSÉCUTIVITÉ ET SIMULTANÉITÉ en Linguistique, Langues et Parole, L'Harmattan, France

We are pleased to announce the publication of the thematic volume « CONSÉCUTIVITÉ ET SIMULTANÉITÉ en Linguistique, Langues et Parole » in the Dixit Grammatica collection (L’Harmattan, France):

- CONSÉCUTIVITÉ ET SIMULTANÉITÉ en Linguistique, Langues et Parole – 1. Phonétique, Phonologie (edited by Camille Fauth, Jean-Paul Meyer, Fabrice Marsac & Rudolph Sock) • ISBN: 978-2-343-14277-7 • March 5, 2018 • 172 pages http://www.editionsharmattan.fr/index.asp?navig=catalogue&obj=livre&no=59200&razSqlClone=1

- CONSÉCUTIVITÉ ET SIMULTANÉITÉ en Linguistique, Langues et Parole – 2. Syntaxe, Sémantique (edited by Angelina Aleksandrova, Céline Benninger, Anne Theissen, Fabrice Marsac & Jean-Paul Meyer) • ISBN: 978-2-343-14278-4 • March 5, 2018 • 300 pages http://www.editionsharmattan.fr/index.asp?navig=catalogue&obj=livre&no=59201&razSqlClone=1

- CONSÉCUTIVITÉ ET SIMULTANÉITÉ en Linguistique, Langues et Parole – 3. Didactique, Traductologie-Interprétation (edited by Jean-Paul Meyer, Mária Pal'ová & Fabrice Marsac) • ISBN: 978-2-343-14279-1 • March 5, 2018 • 200 pages http://www.editionsharmattan.fr/index.asp?navig=catalogue&obj=livre&no=59202&razSqlClone=1

This collective work, comprising three complementary volumes, gathers studies that form the written record of papers presented at the international conference of the same name, held at the University of Strasbourg (France) in July 2015. The volumes contain original and innovative work on the complex dynamics of the consecutiveness-simultaneity pair as approached in the language sciences. The content, deliberately interdisciplinary, concerns not only all the disciplines of the language sciences but also other, related scientific disciplines that address resolutely linguistic questions. The editors of this thematic volume hope that the various linguistic viewpoints adopted will give readers an up-to-date state of knowledge on the issues treated. It goes without saying that the authors and editors alike will appreciate any constructive feedback from readers.

Fabrice Marsac and Rudolph Sock, Directors of Dixit Grammatica

5-1-6

Emmanuel Vincent (Editor), Tuomas Virtanen (Editor), Sharon Gannot (Editor), 'Audio Source Separation and Speech Enhancement', Wiley

Emmanuel Vincent (Editor), Tuomas Virtanen (Editor), Sharon Gannot (Editor),

Audio Source Separation and Speech Enhancement:

https://www.wiley.com/en-us/Audio+Source+Separation+and+Speech+Enhancement-p-9781119279891

ISBN: 978-1-119-27989-1

October 2018

504 pages

This 500-page book provides a unifying view of source separation and enhancement,
including but not limited to array processing, matrix factorization, and deep learning
based methods, and speech and music applications, with consistent notation and
terminology across all chapters.

5-1-7

Jen-Tzung Chien, 'Source Separation and Machine Learning', Academic Press

Jen-Tzung Chien, 'Source Separation and Machine Learning', Academic Press

Source Separation and Machine Learning presents the fundamentals in adaptive learning
algorithms for Blind Source Separation (BSS) and emphasizes the importance of machine
learning perspectives. It illustrates how BSS problems are tackled through adaptive
learning algorithms and model-based approaches using the latest information on mixture
signals to build a BSS model that is seen as a statistical model for a whole system.
Looking at different models, including independent component analysis (ICA), nonnegative
matrix factorization (NMF), nonnegative tensor factorization (NTF), and deep neural
network (DNN), the book addresses how they have evolved to deal with multichannel and
single-channel source separation.

Key features:
- Emphasizes the modern model-based Blind Source Separation (BSS) which closely connects
the latest research topics of BSS and Machine Learning
- Includes coverage of Bayesian learning, sparse learning, online learning,
discriminative learning and deep learning
- Presents a number of case studies of model-based BSS, using a variety of learning
algorithms that provide solutions for the construction of BSS systems

https://www.elsevier.com/books/source-separation-and-machine-learning/chien/978-0-12-804566-4

5-1-8

Ingo Feldhausen, « Methods in prosody: A Romance language perspective », Language Science Press (open access)

We are pleased to announce the publication of a peer-reviewed volume devoted to research methods in prosody, entitled 'Methods in prosody: A Romance language perspective'.

It is published by Language Science Press, an open-access publisher. The book can be downloaded free of charge from the following link:

http://langsci-press.org/catalog/book/183

The table of contents is as follows:

---------------------------------------------------------------------------------------------------------

Introduction
Ingo Feldhausen, Jan Fliessbach & Maria del Mar Vanrell

Foreword
Pilar Prieto

I Large corpora and spontaneous speech

1) Using large corpora and computational tools to describe prosody: An exciting challenge for the future with some (important) pending problems to solve
Juan María Garrido Almiñana

2) Intonation of pronominal subjects in Porteño Spanish: Analysis of spontaneous speech
Andrea Pešková

II Approaches to prosodic analysis

3) Multimodal analyses of audio-visual information: Some methods and issues in prosody research
Barbara Gili Fivela

4) The realizational coefficient: Devising a method for empirically determining prominent positions in Conchucos Quechua
Timo Buchholz & Uli Reich

5) On the role of prosody in disambiguating wh-exclamatives and wh-interrogatives in Cosenza Italian
Olga Kellert, Daniele Panizza & Caterina Petrone

III Elicitation methods

6) The Discourse Completion Task in Romance prosody research: Status quo and outlook
Maria del Mar Vanrell, Ingo Feldhausen & Lluïsa Astruc

7) Describing the intonation of speech acts in Brazilian Portuguese: Methodological aspects
João Antônio de Moraes & Albert Rilliard

Indexes

---------------------------------------------------------------------------------------------------------

Please feel free to share news of this publication with colleagues who may be interested.

Best regards,

Ingo Feldhausen
(Volume co-editor)

5-1-9

Nigel Ward, 'Prosodic Patterns in English Conversation', Cambridge University Press, 2019

Prosodic Patterns in English Conversation

Nigel G. Ward, Professor of Computer Science, University of Texas at El Paso

Cambridge University Press, 2019.

Spoken language is more than words: it includes the prosodic features and patterns that speakers use, subconsciously, to frame meanings and achieve interactional goals. Thanks to the application of simple processing techniques to spoken dialog corpora, this book goes beyond intonation to describe how pitch, timing, intensity and voicing properties combine to form meaningful temporal configurations: prosodic constructions. Combining new findings with hitherto-scattered observations from diverse research traditions, this book enumerates twenty of the principal prosodic constructions of English.

http://www.cambridge.org/ward/

nigel@utep.edu http://www.cs.utep.edu/nigel/

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (March 2019)

March 2019 Newsletter

In this newsletter:

Call for Papers – LTC 2019, LREC 2020

New Publications:

CALLFRIEND Egyptian Arabic Second Edition

Penn Discourse Treebank Version 3.0

VAST Chinese Speech and Transcripts

Call for Papers

The 9th Language & Technology Conference (LTC 2019) will take place on May 17-19, 2019, at the Adam Mickiewicz University in Poznań, Poland. LTC addresses Human Language Technologies as a challenge for computer science, linguistics, and related fields. Conference papers are due next week on Wednesday, March 20, 2019 (midnight, any time zone). For more information, visit the conference webpage.

The 12th Conference on Language Resources and Evaluation (LREC 2020) will take place on May 13-15, 2020, at the Palais du Pharo in Marseille, France. LREC aims to provide an overview of the state-of-the-art, explore new R&D directions and emerging trends, and exchange information regarding language resources and their applications, evaluation methodologies, and tools. Conference papers are due by November 25, 2019. For more information, including conference topics, visit the conference webpage.

New publications:

(1) CALLFRIEND Egyptian Arabic Second Edition was developed by LDC and consists of approximately 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic. This second edition updates the audio files to wav format, simplifies the directory structure, and adds documentation and metadata. The first edition is available as CALLFRIEND Egyptian Arabic (LDC96S49).

All data were collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Egyptian Arabic Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1,000.

*

(2) Penn Discourse Treebank Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains over 40,600 tokens of annotated relations. In Version 3, an additional 13,000 tokens were annotated, certain pairwise annotations were standardized, new senses were included, and the corpus was subject to a series of consistency checks.

This corpus contains two tools: (1) The Annotator, used for annotation and adjudication, and which can also be used for viewing the corpus; and (2) The Conversion Tool for converting Version 2 annotation files into the Version 3 format.

The documentation directory contains a manual describing what is new in Version 3 and how Version 3 differs from Version 2; the methods and guidelines used in annotating PDTB Version 3; and a range of statistics on the tokens, including the frequency of each connective, its sense labels, and its modifiers. More information about the corpus and research carried out by the developers and others using the corpus can be found on the PDTB website.

Penn Discourse Treebank Version 3.0 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1,000.

*

(3) VAST Chinese Speech and Transcripts was developed by LDC for the VAST (Video Annotation for Speech Technologies) project and is comprised of approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the web and corresponding time-aligned transcripts.

Audio files were transcribed using XTrans, which supports manual transcription across multiple channels, languages, and platforms. Transcribers followed a Quick-Rich Transcription style; transcription guidelines are included in this release.

The aim of the VAST project was to collect and annotate data in several languages to support the development of speech technologies such as speech activity detection, language identification, speaker identification, and speech recognition.

VAST Chinese Speech and Transcripts is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1,000.

Membership Office

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104

5-2-2

ELRA - Language Resources Catalogue - Update (October 2018)

ELRA - Language Resources Catalogue - Update

-------------------------------------------------------

We are happy to announce that 2 new Written Corpora and 4 new Speech resources are now available in our catalogue.

ELRA-W0126 Training and test data for Arabizi detection and transliteration
ISLRN: 986-364-744-303-9
The dataset is composed of: a collection of mixed English and Arabizi text intended to train and test a system for the automatic detection of code-switching in mixed English and Arabizi texts; and a set of 3,452 Arabizi tokens manually transliterated into Arabic, intended to train and test a system that performs Arabizi-to-Arabic transliteration.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0126/

ELRA-W0127 Normalized Arabic Fragments for Inestimable Stemming (NAFIS)
ISLRN: 305-450-745-774-1
This is an Arabic stemming gold-standard corpus composed of a collection of 37 sentences, selected to be representative of Arabic stemming tasks and manually annotated. The compiled sentences come from various sources (poems, the Holy Quran, books, and periodicals) of diverse kinds (proverb and dictum, article commentary, religious text, literature, historical fiction). NAFIS is represented according to the TEI standard.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0127/

ELRA-S0396 Mbochi speech corpus
ISLRN: 747-055-093-447-8
This corpus consists of 5,131 sentences recorded in Mbochi, together with their transcriptions and French translations, as well as the results of work carried out during the JSALT workshop: alignments at the phonetic level and various results of unsupervised word segmentation from audio. The audio corpus comprises 4.5 hours, downsampled to 16 kHz, 16 bits, with linear PCM encoding. The data is split into 2 parts: one for training, consisting of 4,617 sentences, and one for development, consisting of 514 sentences.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0396/

ELRA-S0397 Chinese Mandarin (South) database
ISLRN: 503-886-852-083-2
This database contains the recordings of 1,000 Chinese Mandarin speakers from Southern China (500 males and 500 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16 bits, linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0397/

ELRA-S0398 Chinese Mandarin (North) database
ISLRN: 353-548-770-894-7
This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16 bits, linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0398/

ELRA-S0401 Persian Audio Dictionary
ISLRN: 133-181-128-420-9
This dictionary consists of more than 50,000 entries (along with almost all wordforms and proper names) with corresponding audio files in MP3 and English transliterations. The words have been recorded with standard Persian (Farsi) pronunciation (all by a single speaker). This dictionary is provided with its software.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0401/
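
Several of the speech resources above state their audio format (for example 48 kHz mono, 16 bits, linear PCM in .WAV files). As a minimal sketch, and not part of any ELRA tooling, such a stated format could be checked with Python's standard `wave` module. The function names and the default values below are illustrative assumptions, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave


def wav_format(path):
    """Return (channels, sample_rate_hz, bits_per_sample, duration_s) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        duration = wav.getnframes() / sample_rate
    return channels, sample_rate, bits, duration


def matches_format(path, channels=1, sample_rate=48000, bits=16):
    """Check a file against an expected format (defaults: mono, 48 kHz, 16 bits)."""
    return wav_format(path)[:3] == (channels, sample_rate, bits)
```

For a corpus stated as 16 kHz, one would call `matches_format(path, sample_rate=16000)` on each file, where `path` is whatever file name the distribution actually uses.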

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/

5-2-3

Speechocean – update (April 2019)

Cantonese Speech Recognition Corpus --- Speechocean

Speechocean: A.I. Data Resource & Service Supplier

At present, we can provide around 8,000 hours of Cantonese speech recognition corpora, including Mainland Cantonese and Hong Kong Cantonese. Please see the table below (http://kingline.speechocean.com):

Language	Content	Speakers	Hours
Mainland Cantonese	Sentences	4,590	5,220
Mainland Cantonese	Conversation	450	390
Hong Kong Cantonese	Sentences	960	670
Hong Kong Cantonese	Conversation	770	1,580

More Information

Speaker Information: Selected native speakers, with balanced coverage of age, gender and regional accent.
Recording Environment: Quiet or noisy environment.
Recording Platform: Desktop, mobile or telephone
Post Processing: Proofreading, transcription, annotation and quality control.
Lexicon: Included

If you have any further inquiries, please do not hesitate to contact us.

Web: http://en.speechocean.com/

Email: marketing@speechocean.com

5-2-4

Google's Language Model benchmark

An LM benchmark is available at: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

unpruned Katz (1.1B n-grams),
pruned Katz (~15M n-grams),
unpruned Interpolated Kneser-Ney (1.1B n-grams),
pruned Interpolated Kneser-Ney (~15M n-grams)

ArXiv paper: http://arxiv.org/abs/1312.3005

Happy benchmarking!'
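
Per-word log-probability values of the kind described above are usually aggregated into a perplexity figure when comparing language models. The following is a minimal sketch in Python. The function name is hypothetical, and the log base is an assumption passed as a parameter, since the description here does not state which base the benchmark files use; check the repository before relying on it.

```python
import math


def perplexity(log_probs, base=10.0):
    """Return perplexity from per-word log-probabilities in the given base.

    Assumption: one log-probability per predicted word. The log base of the
    benchmark files is not stated here, so it is left as a parameter.
    """
    if not log_probs:
        raise ValueError("need at least one log-probability")
    # Average negative log-probability per word (cross-entropy in `base` units).
    avg_neg_log_prob = -sum(log_probs) / len(log_probs)
    # Perplexity is the base raised to the cross-entropy.
    return base ** avg_neg_log_prob


# Hypothetical values: four words, each assigned probability 0.01.
example = [math.log10(0.01)] * 4
print(perplexity(example))
```

A model that assigns every word probability 0.01 is as uncertain as a uniform choice among 100 words, which is the intuition behind reporting perplexity rather than raw log-probabilities.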

5-2-5

Forensic database of voice recordings of 500+ Australian English speakers

Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings were made under high-quality conditions, and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/

5-2-6

Audio and Electroglottographic speech recordings

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)

5-2-7

EEG, face tracking and audio: 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

Best regards,

Frank

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane Australia

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

5-2-8

TORGO database free for academic use.

In the spirit of the season, I would like to announce the immediate availability of the TORGO database free, in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition, for example. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.

5-2-9

Datatang

Datatang is a leading global data provider specializing in customized data solutions, focusing on a variety of speech, image, and text data collection, annotation, and crowdsourcing services.

Summary of the new datasets (2018) and a brief plan for 2019.

- Speech data (with annotation) that we finished in 2018

Language	Dataset Length (Hours)
French	794
British English	800
Spanish	435
Italian	1,440
German	1,800
Spanish (Mexico/Colombia)	700
Brazilian Portuguese	1,000
European Portuguese	1,000
Russian	1,000

- 2019 ongoing speech projects

Type	Project Name
Europeans speak English	1000 Hours-Spanish Speak English
	1000 Hours-French Speak English
	1000 Hours-German Speak English
Call Center Speech	1000 Hours-Call Center Speech
off-the-shelf data expansion	1000 Hours-Chinese Speak English
off-the-shelf data expansion	1500 Hours-Mixed Chinese and English Speech Data

In addition to the above, more speech data collections are planned, such as Japanese speech data, children's speech data, dialect speech data and so on.

We will continue to provide these data at a competitive price while maintaining a high accuracy rate.

If you have any questions or need more details, do not hesitate to contact us at jessy@datatang.com.

We can send you a sample or a specification of the data.

5-2-10

Fearless Steps Corpus (University of Texas, Dallas)

Fearless Steps Corpus

John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu
Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.

NASA’s Apollo program was one of the great achievements of mankind in the 20th century. CRSS, UT-Dallas has undertaken an enormous Apollo data digitization initiative in which we proposed to digitize Apollo mission speech data (~100,000 hours) and develop Spoken Language Technology (SLT) based algorithms to analyze and understand various aspects of conversational speech. Towards this goal, a new 30-track analog audio decoder was designed to decode 30-track Apollo analog tapes and mounted on the NASA Soundscriber analog audio decoder (in place of the single-channel decoder). Using the new decoder, all 30 channels of data can be decoded simultaneously, thereby reducing the digitization time significantly.
We have digitized 19,000 hours of data from Apollo missions (including the entire Apollo-11, most of Apollo-13, Apollo-1, and Gemini-8 missions). This audio archive is named the “Fearless Steps Corpus”. It is one of the most unique and largest naturalistic audio corpora. Automated transcripts are generated by building an Apollo-mission-specific custom Deep Neural Network (DNN) based Automatic Speech Recognition (ASR) system along with Apollo-mission-specific language models. A Speaker Identification (SID) system has been designed to identify the speakers. A complete diarization pipeline has been established to study and develop various SLT tasks.
We will release this corpus for public use as part of our public outreach and encourage the SLT community to use this opportunity to build naturalistic spoken language technology systems. The data provides ample opportunity to set up challenging tasks in various SLT areas. As part of this outreach we will be setting up the “Fearless Challenge” at the upcoming INTERSPEECH 2018. We will define and propose 5 tasks as part of this challenge. The guidelines and challenge data will be released in Spring 2018 and will be available for download for free. The five challenges are: (1) Automatic Speech Recognition, (2) Speaker Identification, (3) Speech Activity Detection, (4) Speaker Diarization, and (5) Keyword Spotting and Joint Topic/Sentiment Detection.
We look forward to your participation (John.Hansen@utdallas.edu).

5-2-11

SIWIS French Speech Synthesis Database

The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis. A total of 9,750 utterances from various sources, such as parliament debates and novels, were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.

http://datashare.is.ed.ac.uk/handle/10283/2353

5-2-12

JLCorpus - Emotional Speech corpus with primary and secondary emotions

JLCorpus - Emotional Speech corpus with primary and secondary emotions:

To further the understanding of the wide array of emotions embedded in human speech, we are introducing an emotional speech corpus. In contrast to existing speech corpora, this corpus was constructed by maintaining an equal distribution of 4 long vowels in New Zealand English. This balance is intended to facilitate emotion-related formant and glottal source feature comparison studies. The corpus also has 5 secondary emotions along with 5 primary emotions. Secondary emotions are important in Human-Robot Interaction (HRI), where the aim is to model natural conversations between humans and robots. However, there are very few existing speech resources for studying these emotions, and this work adds a speech corpus containing some secondary emotions.

Please use the corpus for emotional speech related studies. When you use it please include the citation as:

Jesin James, Li Tian, Catherine Watson, 'An Open Source Emotional Speech Corpus for Human Robot Interaction Applications', in Proc. Interspeech, 2018.

To access the whole corpus, including the supporting files for the recordings, use the following link: https://www.kaggle.com/tli725/jl-corpus (if you have already installed the Kaggle API, you can download it with the following command: kaggle datasets download -d tli725/jl-corpus)

Or, if you simply want the raw audio and text files, use the following link: https://www.kaggle.com/tli725/jl-corpus/downloads/Raw%20JL%20corpus%20(unchecked%20and%20unannotated).rar/4

The corpus was evaluated in a large-scale human perception test with 120 participants. The surveys are available at the following links.

For the primary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_8ewmOCgOFCHpAj3

For the secondary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_eVDINp8WkKpsPsh

These surveys give an overall idea of the type of recordings in the corpus.

The perceptually verified and annotated JL corpus will be made publicly available soon.

5-3 Software

5-3-1

Release of the version 2 of FASST (Flexible Audio Source Separation Toolbox).

http://bass-db.gforge.inria.fr/fasst/

This toolbox is intended to speed up the design and automate the implementation of new model-based audio source separation algorithms.

It has the following additions compared to version 1:

* Core in C++
* User scripts in MATLAB or Python
* Speedup
* Multichannel audio input

We provide 2 examples:

1. two-channel instantaneous NMF
2. real-world speech enhancement (2nd CHiME Challenge, Track 1)

5-3-2

Cantor Digitalis, an open-source real-time singing synthesizer controlled by hand gestures.

We are glad to announce the public release of Cantor Digitalis, an open-source real-time singing synthesizer controlled by hand gestures.

It can be used e.g. for making music or for singing voice pedagogy.

A wide variety of voices are available, from the classic vocal quartet (soprano, alto, tenor, bass) to the extreme colors of childish, breathy, roaring, etc. voices. All the features of vocal sounds are entirely under control, as the synthesis method is based on a mathematical model of voice production, without prerecorded segments.

The instrument is controlled using chironomy, i.e. hand gestures, with the help of interfaces such as a stylus or fingers on a graphic tablet, or a computer mouse. Vocal dimensions such as the melody, vocal effort, vowel, voice tension, vocal tract size, breathiness etc. can easily and continuously be controlled during performance, and special voices can be prepared in advance or using presets.

Check out the capabilities of Cantor Digitalis through excerpts of performances by the ensemble Chorus Digitalis:
http://youtu.be/_LTjM3Lihis?t=13s.

In practice, this release provides:

the synthesizer application
the source code in the form of a Max package (GPL-like license)
documentation for musicians and separate documentation for developers

What do you need?

a Mac running OS X
ideally a Wacom graphic tablet, but it also works with your computer mouse
for the developers, the Max software

Interested?

To download the Cantor Digitalis, click here
To subscribe to the Cantor Digitalis newsletter and/or the forum list, or to contact the developers, click here
To learn about the Chorus Digitalis, ensemble of Cantor Digitalis, and watch videos of performances, click here
For more details about the Cantor Digitalis, click here

Regards,

The Cantor Digitalis team (who loves feedback — cantordigitalis@limsi.fr)
Christophe d'Alessandro, Lionel Feugère, Olivier Perrotin
http://cantordigitalis.limsi.fr/

5-3-3

MultiVec: a Multilingual and MultiLevel Representation Learning Toolkit for NLP

We are happy to announce the release of our new toolkit “MultiVec” for computing continuous representations for text at different granularity levels (word-level or sequences of words). MultiVec includes Mikolov et al. [2013b]’s word2vec features, Le and Mikolov [2014]’s paragraph vector (batch and online) and Luong et al. [2015]’s model for bilingual distributed representations. MultiVec also includes different distance measures between words and sequences of words. The toolkit is written in C++ and is designed to be fast (of the same order of magnitude as word2vec), easy to use, and easy to extend. It has been evaluated on several NLP tasks: the analogical reasoning task, sentiment analysis, and crosslingual document classification. The toolkit also includes C++ and Python libraries that you can use to query bilingual and monolingual models.
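To make the idea of a distance measure between words and between sequences of words concrete, here is a minimal Python sketch. It does not use the MultiVec library or its API: the function names, the toy three-dimensional vectors, and the choice of averaging word vectors to represent a sequence are all assumptions made for illustration, and MultiVec's own measures may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def sequence_vector(words, embeddings):
    """Represent a word sequence as the average of its known word vectors.

    Averaging is a simple stand-in, not necessarily what MultiVec does.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sequence")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "chat": [0.9, 0.1, 0.0],
    "car": [0.0, 1.0, 0.0],
}
```

With these toy vectors, cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["chat"]) is higher than the similarity between "cat" and "car", which is the behaviour one would hope for from a bilingual model that places translations close together.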

The project is fully open to future contributions. The code is provided on the project webpage (https://github.com/eske/multivec) with installation instructions and command-line usage examples.

When you use this toolkit, please cite:

@InProceedings{MultiVecLREC2016,
  Title = {{MultiVec: a Multilingual and MultiLevel Representation Learning Toolkit for NLP}},
  Author = {Alexandre Bérard and Christophe Servan and Olivier Pietquin and Laurent Besacier},
  Booktitle = {The 10th edition of the Language Resources and Evaluation Conference (LREC 2016)},
  Year = {2016},
  Month = {May}
}

The paper is available here: https://github.com/eske/multivec/raw/master/docs/Berard_and_al-MultiVec_a_Multilingual_and_Multilevel_Representation_Learning_Toolkit_for_NLP-LREC2016.pdf

Best regards,

Alexandre Bérard, Christophe Servan, Olivier Pietquin and Laurent Besacier

5-3-4

An android application for speech data collection LIG_AIKUMA

We are pleased to announce the release of LIG_AIKUMA, an Android application for speech data collection, specially dedicated to language documentation. LIG_AIKUMA is an improved version of the Android application (AIKUMA) initially developed by Steven Bird and colleagues. Features were added to the app in order to facilitate the collection of parallel speech data in line with the requirements of a French-German project (ANR/DFG BULB - Breaking the Unwritten Language Barrier).

The resulting app, called LIG-AIKUMA, runs on various mobile phones and tablets and offers a range of different speech collection modes (recording, respeaking, translation and elicitation). It was used for field data collections in Congo-Brazzaville, resulting in a total of over 80 hours of speech.

Users who just want to use the app without access to the code can download it directly from the forge via this direct link: https://forge.imag.fr/frs/download.php/706/MainActivity.apk

Code is also available on request (contact elodie.gauthier@imag.fr and laurent.besacier@imag.fr).

More details on LIG_AIKUMA can be found in the following paper: http://www.sciencedirect.com/science/article/pii/S1877050916300448

5-3-5

Web services via ALL GO from IRISA-CNRS

It is our pleasure to introduce A||GO (https://allgo.inria.fr/ or http://allgo.irisa.fr/), a platform providing a collection of web services for the automatic analysis of various data, including multimedia content across modalities. The platform builds on the back-end web service deployment infrastructure developed and maintained by Inria's Service for Experimentation and Development (SED). Originally dedicated to multimedia content, A||GO has progressively broadened to other fields such as computational biology, networks and telecommunications, computational graphics or computational physics.

As part of the CNRS PlaSciDo initiative [1], the Linkmedia team at IRISA / Inria Rennes is making available via A||GO a number of web services devoted to multimedia content analysis across modalities (language, audio, image, video). The web services currently provided include research results from the Linkmedia team as well as contributions from a number of partners. A list of the services available to date is given below, and the current state is available at https://www-linkmedia.irisa.fr/software along with demo videos. Most web services are interoperable, facilitating the implementation of a multimedia content analysis processing chain, and are free to use for trial, prototyping or lab work. A brief and free account creation step will allow you to execute the web services using either the graphical interface or a command line via a dedicated API.

We expect the number of web services to grow over time and invite interested parties to contact us should they wish to contribute to the multimedia web service offering of A||GO.

List of multimedia content analysis tools currently available on A||GO:
- Audio Processing
        SaMuSa: music/speech segmentation
        SilAD: silence detection
        Radi.sh: repeated audio motif discovery
        LORIA STS v2: speech transcription for the French language from LORIA
        Multi channel BSS locate: audio source localization toolbox from IRISA-PANAMA
        A-spade: audio declipper from IRISA-PANAMA
        Transvox: voice faker from LORIA
- Natural Language Processing
        NERO: named entity recognition
        TermEx: keywords/indexing terms detection
        Otis!: topic segmentation
        Hi-tost: hierarchical topic structuring
- Video Processing
        Vidseg: video shot segmentation
        HUFA: face detection and tracking
Shortcuts to Linkmedia services are also available here: https://www-linkmedia.irisa.fr/software/

For more information don't hesitate to contact us (contact-multimedia-allgo@irisa.fr).

Gabriel Sargent and Guillaume Gravier
--
Linkmedia
IRISA - CNRS
Rennes, France

5-3-6

Clickable map - Illustrations of the IPA

We have produced a clickable map showing the Illustrations of the International Phonetic
Alphabet.

The map is being updated with each new issue of the Journal of the International Phonetic
Association.

https://richardbeare.github.io/marijatabain/ipa_illustrations_all.html

Marija Tabain - La Trobe University, Australia
Richard Beare - Monash University & MCRI, Australia

5-3-7

LIG-Aikuma running on mobile phones and tablets

Dear all,

LIG is pleased to inform you that the website for the app Lig-Aikuma is online: https://lig-aikuma.imag.fr/

At the same time, an update of Lig-Aikuma (V3) has been made available (see the website).

LIG-AIKUMA is a free Android app running on various mobile phones and tablets. The app offers a range of different speech collection modes (recording, respeaking, translation and elicitation) and makes it possible to share recordings between users. LIG-AIKUMA is built upon the initial AIKUMA app developed by S. Bird & F. Hanke (see https://en.wikipedia.org/wiki/Aikuma for more information).

Improvements of the app:

Visual upgrade:
+ Waveform visualizer on the Respeaking and Translation modes (possibility to zoom in/out the audio signal)
+ File explorer included in all modes, to facilitate the navigation between files
+ New Share mode to share recordings between devices (by Bluetooth, Mail, NFC if available)
+ French and German languages available. In addition to English, the application now supports French and German. By default, Lig-Aikuma uses the language of the phone/tablet.
+ New, more consistent icons to distinguish all types of files (audio, text, image, video)

Conceptual upgrade:
+ New name for the root project: ligaikuma. Warning: henceforth, all data will be stored in this directory instead of “aikuma” (as in previous versions of the app). This change does not cause compatibility issues. In the file explorer of each mode, the default location is this root directory. Just go back once with the left grey arrow (at the lower left of the screen) and select the “aikuma” directory to access your old recordings
+ Generation of a PDF consent form (from the information entered in the metadata form) that can be signed by the linguist and the speaker using a PDF annotation tool (such as the Adobe Fill & Sign mobile app)
+ Generation of a CSV file which can be imported into the Elan software: it will automatically create a segmented tier matching the segmentation made during a respeaking or translation session. Segments that contain no speech are marked with a “non-speech” label.
+ Geolocation of the recordings
+ Respeak an elicited file: an audio file initially recorded in Elicitation mode can now be used in Respeaking or Translation mode

Structural upgrade:
+ Undo button in Elicitation mode to erase/redo the current recording
+ Improved session backup in Elicitation mode
+ Non-speech button in Respeaking and Translation modes to indicate with a comment that the segment does not contain speech (but noise or silence, for instance)
+ Automatic speaker profile creation to quickly fill in the metadata when several sessions involve the same speaker

Best regards,

Elodie Gauthier & Laurent Besacier

5-3-8

Python Library

We are pleased to announce the public release of the first Python library for converting numbers written out in French into their digit representation.

The parser is robust and can segment and substitute number expressions in a stream of words, such as a conversation. It recognizes the different variants of the language (quatre-vingt-dix / nonante, etc.) and handles ordinals as well as integers, decimal numbers and formal sequences (phone numbers, credit card numbers, etc.).

We hope this tool will be useful to those who, like us, do natural language processing in French.

The library is distributed under the MIT license, which allows very permissive use.

PyPI: https://pypi.org/project/text2num/

Source code: https://github.com/allo-media/text2num

Documentation: http://text2num.readthedocs.io/
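The kind of conversion described above, including the regional variants (quatre-vingt-dix / nonante), can be illustrated with a small self-contained sketch. This is not the text2num implementation and does not use its API: the function name french_to_int and the word tables are hypothetical, the sketch only handles whole numbers below one million, and it does not attempt ordinals, decimals, or segmentation of numbers inside running text.

```python
# Toy illustration only: NOT the text2num implementation or its API.
UNITS = {
    "zéro": 0, "un": 1, "une": 1, "deux": 2, "trois": 3, "quatre": 4,
    "cinq": 5, "six": 6, "sept": 7, "huit": 8, "neuf": 9, "dix": 10,
    "onze": 11, "douze": 12, "treize": 13, "quatorze": 14, "quinze": 15,
    "seize": 16,
}
TENS = {
    "vingt": 20, "trente": 30, "quarante": 40, "cinquante": 50,
    "soixante": 60,
    # Belgian and Swiss variants
    "septante": 70, "huitante": 80, "octante": 80, "nonante": 90,
}
PLURALS = {"vingts": "vingt", "cents": "cent"}


def french_to_int(text):
    """Convert a French number phrase (below one million) to an int.

    The function is permissive: it does not reject every ill-formed phrase.
    """
    tokens = text.lower().replace("-", " ").split()
    if not tokens:
        raise ValueError("empty input")
    total = 0    # completed thousands
    current = 0  # group below one thousand that is being built
    for raw in tokens:
        token = PLURALS.get(raw, raw)
        if token == "et":
            continue  # "vingt et un": the conjunction carries no value
        if token == "vingt" and current % 100 == 4:
            current += 76  # "quatre-vingt": the 4 becomes 4 * 20 = 80
        elif token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "cent":
            current = (current or 1) * 100
        elif token == "mille":
            total += (current or 1) * 1000
            current = 0
        else:
            raise ValueError(f"unknown number word: {raw!r}")
    return total + current
```

For example, french_to_int("quatre-vingt-dix") and french_to_int("nonante") both return 90. For real use, the library itself and its documentation linked above remain the reference.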

--

Romuald Texier-Marcadé

http://www.allo-media.fr

© Copyright 2026 - ISCA International Speech Communication Association - All right reserved.