
ISCApad #252

Tuesday, June 11, 2019 by Chris Wellekens

5-2 Database
5-2-1 Linguistic Data Consortium (LDC) update (May 2019)

In this newsletter:

New Publications:

Multi-Language Conversational Telephone Speech 2011 -- English Group

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014

CIEMPIESS Experimentation

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c


New publications:

(1) Multi-Language Conversational Telephone Speech 2011 -- English Group was developed by LDC and comprises approximately 18 hours of telephone speech in two general varieties of English: American and South Asian.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Calls are labeled by human auditors for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Multi-Language Conversational Telephone Speech 2011 -- English Group is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1500.

 

*

 

(2) TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014. This release includes queries, the 'manual runs' (human-produced responses to the queries), the final rounds of assessment results, and the complete set of Chinese source documents.

The regular Chinese Slot Filling evaluation track involved mining information about entities from text. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection.

 

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000. 

 

*

 

(3) CIEMPIESS Experimentation (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Facultad de Ingeniería at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

 

CIEMPIESS Experimentation is a set of three different data sets, specifically Complementary, Fem, and Test. Complementary is a phonetically-balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance by gender the number of recordings from male speakers in other CIEMPIESS collections. Test consists of 10 hours of broadcast speech and transcripts and is intended for use as a standard test data set alongside other CIEMPIESS corpora.

 

Most of the speech recordings in Fem and Test were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels feature videos with speech around legal issues and topics related to UNAM. The Complementary recordings consist of read speech collected for that corpus.

 

LDC has released the following data sets in the CIEMPIESS series:

CIEMPIESS Experimentation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

 

*

 

(4) IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. This corpus contains approximately 198 hours of Guarani conversational and scripted telephone speech collected in 2014 and 2015, along with corresponding transcripts.

 

The Guarani speech in this release represents that spoken in Paraguay. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

 

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c is distributed via web download.

 

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $25.

 

*

 

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
   Philadelphia, PA 19104

Top

5-2-2 ELRA - Language Resources Catalogue - Update (April 2019)
ELRA is happy to announce that 4 new Speech resources, 1 new Written Corpus and 1 new Multilingual Lexicon are now available in our catalogue.

ELRA-S0399 GlobalPhone Multilingual Model Package
ISLRN: 204-945-263-927-6
The GlobalPhone Multilingual Model Package contains about 22 hours of transcribed read speech spoken by native speakers in 22 languages (Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese). The GlobalPhone Multilingual Model Package covers about 1 hour of transcribed speech from 10 speakers (5 male, 5 female) from each of the above listed 22 languages.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0399/

ELRA-S0400 GlobalPhone 2000 Speaker Package
ISLRN: 331-592-378-424-7
The GlobalPhone 2000 Speaker Package contains transcribed read speech spoken by 2000 native speakers in 22 languages (Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese). The GlobalPhone 2000 Speaker Package covers about 9,000 randomly selected utterances read by 2000 native speakers in 22 languages, i.e. on average 4.5 utterances corresponding to 40 seconds of speech per speaker amounting to a total of 22 hours of speech.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0400/
 
ELRA-S0402 Speaking atlas of the regional languages of France
ISLRN: 112-393-061-014-3
The Speaking atlas of the regional languages of France offers the same Aesop's fable read in French and in a number of varieties of the languages of France. This work, which has both a scientific and a heritage dimension, highlights the linguistic diversity of Metropolitan France and the Overseas Territories through recordings collected in the field and presented via an interactive map, together with their orthographic transcriptions. As far as Occitan is concerned, about sixty varieties were collected in Gascony, Languedoc, Provence, northern Occitania and the Linguistic Crescent. Varieties of Basque, Breton, Franconian, West Flemish, Alsatian, Corsican, Catalan, Francoprovençal and Oïl language(s) are also provided, as well as about fifty languages of the French Overseas Territories and non-territorial languages such as Rromani and French Sign Language.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0402/
 
ELRA-S0403 CLE Pakistan Urdu Speech Corpus
ISLRN: 572-070-066-634-8
This corpus consists of phonetically rich Urdu sentences and additional sentences covering telephone numbers, addresses and personal names. The speech corpus was recorded with a variety of microphone types. The sampling rate of the speech files is 16 kHz. Each utterance is stored in a separate file and is accompanied by its orthographic transcription file in Unicode.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0403/

ELRA-W0128 ECPC Corpus (European Comparable and Parallel Corpora of Parliamentary Speeches Archive) - set 1
ISLRN: 036-939-425-010-1
This corpus is a collection of XML metatextually tagged corpora containing speeches from European chambers. It is a bilingual, bidirectional written corpus in English and Spanish. This first set (ECPC_EP-05) consists of (1) a 'clean' version in XML of the European Parliament's 2005 daily sessions; (2) a POS-tagged version of the 2005 daily sessions; and (3) a sentence-aligned version of the 2005 daily sessions. In its raw format, ECPC_EP-05 contains 3,668,476 tokens/words (excluding tagging) in English distributed over 60 UTF-8 files and 3,993,867 tokens/words (excluding tagging) in Spanish distributed over 60 UTF-8 files.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0128/

ELRA-M0051 EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages
ISLRN: 690-348-503-270-1
This lexicon consists of 26,155 parallel named entities in seven languages: English and six South Slavic languages (Bosnian, Bulgarian, Croatian, Macedonian, Serbian and Slovenian). The lexicon contains multiword entries which are not strictly named entities but contain a word which is. Slovenian, Croatian and Bosnian are written in Latin script, Macedonian and Bulgarian in Cyrillic. Serbian is specific in that it may be written in two scripts (Cyrillic and Latin) and two dialects (ekavica and ijekavica); this lexicon uses the Serbian ekavica variant in its Cyrillic script. The lexicon comes in two formats: CSV and XML.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0051/


For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).
If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.


Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/





Top

5-2-3 Speechocean – update (May 2019)

 

Latest Speech Recognition Corpora in 2019 --- Speechocean

Speechocean: A.I. Data Resource & Service Supplier

We have produced several new speech recognition corpora in the last five months. Please see the table below:
Serial Number   Language                    Platform    Speakers   Hours     Gender Distribution
King-ASR-097    Russian                     Desktop     50         103.4     25 male and 25 female
King-ASR-334    Mexican Spanish             Telephone   62         51.86     29 male and 33 female
King-ASR-335    Argentina Spanish           Telephone   26         21.81     13 male and 13 female
King-ASR-439    Russian                     Telephone   232        201       106 male and 126 female
King-ASR-462    Chinese and English Mixed   Mobile      1043       3023.73   497 male and 546 female
King-ASR-463    Tamil                       Mobile      118        98.5      50 male and 68 female
King-ASR-465    Telugu                      Desktop     130        118       67 male and 63 female
King-ASR-467    Gujarati                    Desktop     141        116       94 male and 47 female
King-ASR-623    Chinese Mandarin            Mobile      66         54        34 male and 32 female

If you have any further inquiries, please do not hesitate to contact us.

 

Web: http://www.speechocean.com/

 

Email: contact@speechocean.com

Top

5-2-4 Google's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, the project also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
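
The per-word log-probability files lend themselves to a quick corpus-level perplexity computation. Below is a minimal Python sketch of that aggregation; the file name and one-value-per-line layout are assumptions for illustration, and the log base should be verified against the distributed scripts.

    def perplexity_from_logprobs(path, log_base=10.0):
        """Aggregate per-word log-probabilities into corpus perplexity.
        Assumes a hypothetical text file with one log-probability per line;
        check the benchmark's own scripts for the actual layout and log base."""
        total, count = 0.0, 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                total += float(line)
                count += 1
        # perplexity = log_base ** (-(1/N) * sum of per-word log-probabilities)
        return log_base ** (-total / count)

    # Hypothetical file name:
    # print(perplexity_from_logprobs("heldout-00000.logprobs"))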

Top

5-2-5 Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police,  New South Wales Police,  Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings were made under high-quality conditions, and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.
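
For readers who want to emulate casework-style conditions themselves, here is a minimal Python sketch (not the laboratory's own processing chain) that band-limits a high-quality recording to a telephone-like 300-3400 Hz band and downsamples it to 8 kHz; the file names are hypothetical.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfilt, resample_poly

    def telephone_degrade(in_path, out_path):
        """Crude landline-style degradation: band-pass 300-3400 Hz, resample to 8 kHz.
        A rough sketch only, not the FVC Laboratory's own processing chain."""
        rate, audio = wavfile.read(in_path)
        audio = audio.astype(np.float64)
        if audio.ndim > 1:                       # mix down to mono if needed
            audio = audio.mean(axis=1)
        sos = butter(4, [300.0, 3400.0], btype="bandpass", fs=rate, output="sos")
        band = sosfilt(sos, audio)
        out = resample_poly(band, 8000, rate)    # downsample to 8 kHz
        out = out / (np.abs(out).max() + 1e-9)   # normalise to avoid clipping
        wavfile.write(out_path, 8000, (out * 32767).astype(np.int16))

    # Hypothetical file names:
    # telephone_degrade("speaker001_studio.wav", "speaker001_telephone.wav")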

The database can be accessed at: http://databases.forensic-voice-comparison.net/

Top

5-2-6 Audio and Electroglottographic speech recordings

 

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/ San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.
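
As one illustration of how those spreadsheets might be queried, the short Python sketch below loads one with pandas and summarizes a single measure by phonation category; the file name and column names are hypothetical placeholders, so check the project's documentation for the real headers.

    import pandas as pd

    # Hypothetical file and column names; consult the project's documentation
    # for the real spreadsheet layout and measure names.
    df = pd.read_csv("zapotec_voicesauce_measures.csv")

    # Example: summarise one spectral-tilt measure (H1-H2) per phonation label.
    summary = df.groupby("phonation")["H1H2c"].agg(["mean", "std", "count"])
    print(summary)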

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)

Top

5-2-7 EEG-face tracking-audio 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

 

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.
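
For orientation, the sketch below shows a far simpler baseline than the deep-belief networks used in the paper: a cross-validated linear SVM on precomputed EEG feature vectors. The feature and label files are hypothetical placeholders, not part of the released dataset.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    # Hypothetical feature/label files: one row of EEG features per imagined
    # prompt, with binary labels (0 = consonant prompt, 1 = vowel prompt).
    X = np.load("eeg_features.npy")
    y = np.load("labels_consonant_vs_vowel.npy")

    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
    scores = cross_val_score(clf, X, y, cv=5)
    print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))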

 

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

 

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

 

Best regards,

Frank

 

 

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane Australia

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

 

Top

5-2-8 TORGO database free for academic use

In the spirit of the season, I would like to announce the immediate availability of the TORGO database free, in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition, for example. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.

Top

5-2-9 Datatang

Datatang is a leading global data provider specializing in customized data solutions, focusing on the collection, annotation, and crowdsourcing of a variety of speech, image, and text data.

 

Summary of the new datasets (2018) and a brief plan for 2019.

• Speech data (with annotation) that we finished in 2018:

Language                     Dataset Length (Hours)
French                       794
British English              800
Spanish                      435
Italian                      1,440
German                       1,800
Spanish (Mexico/Colombia)    700
Brazilian Portuguese         1,000
European Portuguese          1,000
Russian                      1,000

 

• 2019 ongoing speech projects:

Type                            Project Name
Europeans speak English         1000 Hours - Spanish Speak English
                                1000 Hours - French Speak English
                                1000 Hours - German Speak English
Call Center Speech              1000 Hours - Call Center Speech
Off-the-shelf data expansion    1000 Hours - Chinese Speak English
                                1500 Hours - Mixed Chinese and English Speech Data

On top of the above, there are more planned speech data collections, such as Japanese speech data, children's speech data, dialect speech data and so on.

 

What is more, we will continue to provide these data at a competitive price while maintaining a high accuracy rate.

If you have any questions or need more details, do not hesitate to contact us at jessy@datatang.com.

 

We would be happy to send you a sample or a specification of the data.

Top

5-2-10 Fearless Steps Corpus (University of Texas, Dallas)

Fearless Steps Corpus

John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu, Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.


NASA's Apollo program is one of mankind's great achievements of the 20th century. CRSS, UT-Dallas has undertaken an enormous Apollo data digitization initiative, in which we proposed to digitize the Apollo mission speech data (~100,000 hours) and develop spoken language technology algorithms to analyze and understand various aspects of the conversational speech. Towards this goal, a new 30-track analog audio decoder was designed to decode the 30-track Apollo analog tapes; it is mounted on the NASA Soundscriber (in place of the single-channel decoder) so that all 30 channels can be decoded simultaneously, reducing digitization time significantly.
We have digitized 19,000 hours of data from the Apollo missions (including the entire Apollo-11 mission, most of Apollo-13, Apollo-1, and Gemini-8). This audio archive is named the “Fearless Steps Corpus” and is a singularly large naturalistic audio corpus. Automated transcripts are generated by Apollo-mission-specific custom deep neural network (DNN) based automatic speech recognition (ASR) systems together with mission-specific language models. A speaker identification (SID) system was designed to identify the speakers, and a complete diarization pipeline was established to study and develop various SLT tasks.
We will release this corpus for public use as part of our outreach and encourage the SLT community to use this opportunity to build naturalistic spoken language technology systems. The data provide ample opportunity to set up challenging tasks in various SLT areas. As part of this outreach we will organize a “Fearless Steps” challenge at the upcoming INTERSPEECH 2018; we will define and propose five tasks, with guidelines and challenge data released in Spring 2018 and freely available for download. The five challenges are: (1) automatic speech recognition, (2) speaker identification, (3) speech activity detection, (4) speaker diarization, and (5) keyword spotting and joint topic/sentiment detection.
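
To give a flavour of the speech activity detection task, here is a minimal energy-based SAD sketch in Python; it is purely illustrative, uses a hypothetical file name, and is unrelated to the official challenge baselines.

    import numpy as np
    from scipy.io import wavfile

    def energy_sad(path, frame_ms=25, hop_ms=10, threshold_db=-35.0):
        """Mark frames as speech when their energy (in dB, relative to the
        loudest frame) exceeds a fixed threshold. Illustrative only; not the
        Fearless Steps challenge baseline."""
        rate, audio = wavfile.read(path)
        audio = audio.astype(np.float64)
        if audio.ndim > 1:
            audio = audio.mean(axis=1)
        frame = int(rate * frame_ms / 1000)
        hop = int(rate * hop_ms / 1000)
        n_frames = max(1, 1 + (len(audio) - frame) // hop)
        energy = np.array([np.sum(audio[i * hop:i * hop + frame] ** 2)
                           for i in range(n_frames)])
        rel_db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
        return rel_db > threshold_db   # boolean speech mask, one value per frame

    # Hypothetical file name:
    # speech_mask = energy_sad("apollo11_air_to_ground_excerpt.wav")
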
Looking forward to your participation (John.Hansen@utdallas.edu).

Top

5-2-11 SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple speaking styles and emphasis. A total of 9750 utterances from various sources, such as parliament debates and novels, were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
 
Top

5-2-12 JL Corpus - Emotional Speech corpus with primary and secondary emotions

To further understand the wide array of emotions embedded in human speech, we are introducing an emotional speech corpus. In contrast to existing speech corpora, this corpus was constructed by maintaining an equal distribution of 4 long vowels in New Zealand English. This balance is intended to facilitate emotion-related formant and glottal source feature comparison studies. The corpus also has 5 secondary emotions along with 5 primary emotions. Secondary emotions are important in Human-Robot Interaction (HRI), where the aim is to model natural conversations among humans and robots, but there are very few existing speech resources to study them; this work adds a speech corpus containing some secondary emotions.
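
As an illustration of the formant comparisons the corpus is designed to support, the Python sketch below derives rough formant estimates from a short vowel recording via LPC root-finding; the file name is hypothetical, and serious studies would rely on dedicated tools such as Praat or VoiceSauce.

    import numpy as np
    import librosa

    def rough_formants(path, order=12, n_formants=3):
        """Very rough LPC-based formant estimates (Hz) for a short vowel clip.
        Illustrative only; serious work would use Praat or similar tools."""
        y, sr = librosa.load(path, sr=16000)
        y = y * np.hamming(len(y))                  # simple analysis window
        a = librosa.lpc(y, order=order)             # all-pole model coefficients
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]           # one of each conjugate pair
        freqs = np.sort(np.angle(roots) * sr / (2.0 * np.pi))
        freqs = freqs[freqs > 90]                   # drop near-DC artefacts
        return freqs[:n_formants]

    # Hypothetical file name:
    # print(rough_formants("female1_angry_a.wav"))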

Please use the corpus for emotional speech related studies. When you use it please include the citation as:

Jesin James, Li Tian, Catherine Watson, 'An Open Source Emotional Speech Corpus for Human Robot Interaction Applications', in Proc. Interspeech, 2018.

To access the whole corpus including the recording supporting files, click the following link: https://www.kaggle.com/tli725/jl-corpus, (if you have already installed the Kaggle API, you can type the following command to download: kaggle datasets download -d tli725/jl-corpus)
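
If you prefer to script the download, the same dataset can also be fetched with the Kaggle Python package, assuming it is installed and an API token is configured; a brief sketch:

    # Equivalent of `kaggle datasets download -d tli725/jl-corpus` using the
    # Kaggle Python package; assumes ~/.kaggle/kaggle.json credentials exist.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files("tli725/jl-corpus", path="jl_corpus", unzip=True)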

Or if you simply want the raw audio+txt files, click the following link: https://www.kaggle.com/tli725/jl-corpus/downloads/Raw%20JL%20corpus%20(unchecked%20and%20unannotated).rar/4

The corpus was evaluated by a large-scale human perception test with 120 participants. The links to the surveys are given below.

For the primary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_8ewmOCgOFCHpAj3

For the secondary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_eVDINp8WkKpsPsh

These surveys will give an overall idea about the type of recordings in the corpus.

The perceptually verified and annotated JL corpus will be given public access soon.

Top

5-2-13 OPENGLOT – An open environment for the evaluation of glottal inverse filtering


OPENGLOT is a publicly available database that was designed primarily for the evaluation of glottal inverse filtering algorithms. In addition, the database can be used to evaluate formant estimation methods. OPENGLOT consists of four repositories. Repository I contains synthetic glottal flow waveforms and speech signals generated by using the Liljencrants–Fant (LF) waveform as an excitation and an all-pole vocal tract model. Repository II contains glottal flow and speech pressure signals generated using physical modelling of human speech production. Repository III contains pairs of glottal excitation and speech pressure signals generated by exciting a 3D-printed plastic vocal tract replica with LF excitations via a loudspeaker. Finally, Repository IV contains multichannel recordings (speech pressure signal, EGG, and high-speed video of the vocal folds) of natural speech production.
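
To illustrate the source-filter idea behind Repository I, the Python sketch below drives a hand-built all-pole 'vocal tract' with a crude impulse-train source; OPENGLOT itself uses a proper LF excitation, and the formant values here are illustrative assumptions only.

    import numpy as np
    from scipy.signal import lfilter
    from scipy.io import wavfile

    fs = 8000                                    # sample rate (Hz)
    f0 = 120                                     # fundamental frequency (Hz)
    dur = 0.5                                    # duration in seconds

    # Crude glottal source: an impulse train (OPENGLOT itself uses proper
    # LF pulses; this only illustrates the source-filter structure).
    excitation = np.zeros(int(fs * dur))
    excitation[::fs // f0] = 1.0

    # All-pole "vocal tract" for an /a/-like vowel: each (frequency, bandwidth)
    # pair becomes a conjugate pole pair in the filter denominator.
    formants = [(700, 80), (1200, 90), (2600, 120)]   # illustrative values, Hz
    a = np.array([1.0])
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)
        theta = 2.0 * np.pi * freq / fs
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])

    speech = lfilter([1.0], a, excitation)
    speech = speech / np.abs(speech).max()
    wavfile.write("synthetic_vowel.wav", fs, (speech * 32767).astype(np.int16))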

 

OPENGLOT is available at:

http://research.spa.aalto.fi/projects/openglot/

Top


