ISCA - International Speech
Communication Association



ISCApad #249

Monday, March 11, 2019 by Chris Wellekens

5-2 Database
5-2-1 Linguistic Data Consortium (LDC) update (February 2019)

 February 2019 Newsletter

 In this newsletter:

  • Only two weeks left to enjoy 2019 membership discounts
  • Spring 2019 LDC Data Scholarship recipients
  • LDC’s new language game
  • New publications:
      • DEFT Chinese Committed Belief Annotation
      • IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b
      • Multi-Language Conversational Telephone Speech 2011 -- Arabic Group
      • Multilingual ATIS

_____________________________________________________________________________

 

Only two weeks left to enjoy 2019 membership discounts

There is still time to save on 2019 membership fees. Through March 1, all organizations receive a discount on the 2019 membership fee (up to 10%) when they choose to join or renew. For more information on membership benefits, visit Join LDC.

 

Spring 2019 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2019 Data Scholarships:

Colin Annand: University of Cincinnati (USA); PhD. Psychology. Colin is awarded a copy of Switchboard-1 Release 2 for his research involving the relationship between speech patterns and conversation content.

Si Chen: Huazhong University of Science and Technology (China); B.S. Communication Engineering. Si is awarded a copy of ACE 2005 Multilingual Training Corpus for his work on event extraction.

Noor-e-Hira: Fatima Jinnah Women University (Pakistan); MSc. Computer Sciences. Noor is awarded a copy of NIST 2008 Open Machine Translation (OpenMT) Evaluation for her research in machine translation.

Matthew Roddy: Trinity College Dublin (Ireland); Ph.D. Electrical Engineering. Matthew is awarded copies of 2000 HUB5 English Evaluation Speech and Transcripts for his work in spoken dialogue systems.

Ammara Zafar: Fatima Jinnah Women University (Pakistan); MSc. Computer Sciences. Ammara is awarded a copy of NIST 2009 Open Machine Translation (OpenMT) Evaluation for her research in machine translation.

For information about the program, visit the Data Scholarship page.

LDC’s new language game

LDC’s new language game, NameThatLanguage, tests your skill at recognizing the language spoken in short audio clips. The game includes thousands of clips to prevent memorization and offers a real challenge that increases as you progress. In addition to being fun, the game provides useful data on language confusability and linguistic diversity. Game results will be shared freely for research. New clips and more languages continue to be added, providing ongoing challenges and new research data. Help support language research by playing! https://namethatlanguage.org

_____________________________________________________________________________

 

New publications:

 

(1) DEFT Chinese Committed Belief Annotation was developed by LDC and consists of approximately 83,000 tokens of Chinese discussion forum text annotated for 'committed belief,' which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

 

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships, and anomaly detection. LDC supported the DEFT program by collecting, creating, and annotating a variety of data sources.

 

DEFT Chinese Committed Belief Annotation is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000. 

 

*

 

(2) IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 210 hours of Lithuanian conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

 

The Lithuanian speech in this release represents that spoken in the Aukštaitian and Samogitian dialect regions of Lithuania. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 71 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

 

IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b is distributed via web download.

 

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $25.

 

*

 

(3) Multi-Language Conversational Telephone Speech 2011 -- Arabic Group was developed by LDC and comprises approximately 117 hours of telephone speech in distinct dialects of colloquial Arabic: Iraqi, Levantine and Maghrebi.

 

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.

 

Multi-Language Conversational Telephone Speech 2011 -- Arabic Group is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $2500.

 

*

 

(4) Multilingual ATIS was developed by Google Inc. and consists of 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish.

 

The ATIS (Air Travel Information Services) collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology, and SRI International.

 

The original English utterances were manually translated into Hindi and Turkish. This release also includes the original English utterance and the machine translation back into English of the manual target language utterance translation. Each utterance is annotated with named entities via table lookup; markers include city, airline, airport names, and dates.

 

Multilingual ATIS is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost. 

 

 

Membership Office

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104

5-2-2 ELRA - Language Resources Catalogue - Update (October 2018)
ELRA - Language Resources Catalogue - Update
-------------------------------------------------------
We are happy to announce that 2 new Written Corpora and 4 new Speech resources are now available in our catalogue.

ELRA-W0126 Training and test data for Arabizi detection and transliteration
ISLRN: 986-364-744-303-9
The dataset is composed of a collection of mixed English and Arabizi text, intended to train and test a system for the automatic detection of code-switching in mixed English and Arabizi texts, and a set of 3,452 Arabizi tokens manually transliterated into Arabic, intended to train and test a system that performs Arabizi to Arabic transliteration.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0126/

ELRA-W0127 Normalized Arabic Fragments for Inestimable Stemming (NAFIS)
ISLRN: 305-450-745-774-1
This is an Arabic stemming gold standard corpus composed of a collection of 37 sentences, selected to be representative of Arabic stemming tasks and manually annotated. The compiled sentences come from various sources (poems, holy Quran, books, and periodicals) of diverse kinds (proverb and dictum, article commentary, religious text, literature, historical fiction). NAFIS is represented according to the TEI standard.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0127/

ELRA-S0396 Mbochi speech corpus

ISLRN: 747-055-093-447-8
This corpus consists of 5131 sentences recorded in Mbochi, together with their transcription and French translation, as well as the results of the work done during the JSALT workshop: alignments at the phonetic level and various results of unsupervised word segmentation from audio. The audio corpus is made up of 4.5 hours of speech, downsampled to 16 kHz, 16 bits, with Linear PCM encoding. The data is divided into 2 parts: one for training, consisting of 4617 sentences, and one for development, consisting of 514 sentences.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0396/

ELRA-S0397 Chinese Mandarin (South) database

ISLRN: 503-886-852-083-2
This database contains the recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz Mono, 16 bits, Linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0397/

ELRA-S0398 Chinese Mandarin (North) database
ISLRN: 353-548-770-894-7
This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz Mono, 16 bits, Linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0398/

ELRA-S0401 Persian Audio Dictionary
ISLRN: 133-181-128-420-9
This dictionary consists of more than 50,000 entries (along with almost all wordforms and proper names) with corresponding audio files in MP3 and English transliterations. The words have been recorded with standard Persian (Farsi) pronunciation (all by a single speaker). This dictionary is provided with its software.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0401/
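
The audio format stated above for the two Chinese Mandarin databases (48 kHz, mono, 16-bit linear PCM in .WAV files) can be checked on any delivered file with Python's standard `wave` module. The following is a minimal sketch only: the helper names `matches_format` and `duration_seconds` and the default values are illustrative assumptions drawn from the catalogue descriptions, not part of any ELRA tooling.

```python
import wave


def matches_format(path, sample_rate=48000, channels=1, sample_width_bytes=2):
    """Return True if the WAV header matches the expected PCM format.

    Defaults follow the catalogue description (48 kHz, mono, 16 bits).
    This is a hypothetical helper, not an ELRA-provided tool.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == sample_rate
            and wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width_bytes
        )


def duration_seconds(path):
    """Return the duration of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing `duration_seconds` over all files of a release gives a quick cross-check against the total number of hours announced for it.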

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/

5-2-3 Speechocean – update (March 2019)

 

 

 

 

Accented English Speech Recognition Corpus --- Speechocean

 

 

Speechocean: The World’s Leading A.I. Data Resource & Service Supplier

 

At present, we can provide data services in 110+ languages and dialects across the world. For more detailed information, please visit our website:

http://kingline.speechocean.com

 

Language              Speakers   Total Hours
Chinese English       3478       1513
Hong Kong English     200        378
Japanese English      1005       902
Korean English        116        207
Singaporean English   203        366
Indonesian English    804        378
Indian English        2394       3540
French English        130        191
German English        200        314
Italian English       120        189
Portuguese English    100        154
Spanish English       200        326

 

 

More Information

 

  • Speaker Information: Selected non-native English speakers, with balanced coverage of age, gender and regional accent.

  • Recording Environment: Quiet or noisy environment.

  • Recording Platform: Desktop or mobile.

  • Recording Content: Sentences, conversation, words…

  • Post Processing: Proofreading, transcription, annotation and quality control.

  • Lexicon: Included

 

Contact Us:

Email: contact@speechocean.com

5-2-4 Google's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
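
As an illustration of how per-word log-probability values such as those mentioned above can be used, the sketch below turns a list of such values into a perplexity figure. The logarithm base and the in-memory list format are assumptions (the description does not specify the file format), and `perplexity` is a hypothetical helper rather than one of the distributed scripts.

```python
def perplexity(log_probs, base=10.0):
    """Compute perplexity from per-word log-probabilities.

    `base` is the logarithm base the values were computed in; 10 is only
    an assumed default, so check the benchmark files before relying on it.
    """
    if not log_probs:
        raise ValueError("need at least one log-probability")
    # Average negative log-probability per word, then undo the logarithm.
    avg_neg_log_prob = -sum(log_probs) / len(log_probs)
    return base ** avg_neg_log_prob
```

For instance, two words that each have a base-10 log-probability of -1 yield a perplexity of 10. Comparing this figure across the baseline models on the same held-out set is the usual way to rank them.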


5-2-5 Forensic database of voice recordings of 500+ Australian English speakers

Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales, as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings were made under high-quality conditions, and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/


5-2-6 Audio and Electroglottographic speech recordings

 

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan / San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)


5-2-7 EEG-face tracking-audio 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

 

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

 

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

 

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

 

Best regards,

Frank

 

 

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane, Australia.

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

 


5-2-8 TORGO database free for academic use

In the spirit of the season, I would like to announce the immediate availability of the TORGO database, free in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used, for example, for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.


5-2-9 Datatang

Datatang is a leading global data provider that specializes in customized data solutions, focusing on a variety of speech, image, and text data collection, annotation, and crowdsourcing services.

 

Summary of the new datasets (2018) and a brief plan for 2019.

 

 

 

  • Speech data (with annotation) that we finished in 2018

Language                    Dataset Length (Hours)
French                      794
British English             800
Spanish                     435
Italian                     1,440
German                      1,800
Spanish (Mexico/Colombia)   700
Brazilian Portuguese        1,000
European Portuguese         1,000
Russian                     1,000

 

  • Ongoing speech projects in 2019

Type                           Project Name
Europeans speak English        1000 Hours-Spanish Speak English
                               1000 Hours-French Speak English
                               1000 Hours-German Speak English
Call Center Speech             1000 Hours-Call Center Speech
Off-the-shelf data expansion   1000 Hours-Chinese Speak English
                               1500 Hours-Mixed Chinese and English Speech Data

 

 

 

In addition to the above, more speech data collections are planned, such as Japanese speech data, children's speech data, dialect speech data and so on.

Moreover, we will continue to provide these data at a competitive price while maintaining a high accuracy rate.

If you have any questions or need more details, do not hesitate to contact us at jessy@datatang.com.

We can send you a sample or a specification of the data.

 

 

 



5-2-10 Fearless Steps Corpus (University of Texas, Dallas)

Fearless Steps Corpus

John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu
Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.


NASA’s Apollo program is a great achievement of mankind in the 20th century. CRSS, UT-Dallas has undertaken an enormous Apollo data digitization initiative in which we proposed to digitize Apollo mission speech data (~100,000 hours) and develop Spoken Language Technology (SLT) based algorithms to analyze and understand various aspects of conversational speech. Towards this goal, a new 30-track analog audio decoder was designed to decode 30-track Apollo analog tapes and mounted on the NASA Soundscriber analog audio decoder (in place of the single-channel decoder). Using the new decoder, all 30 channels of data can be decoded simultaneously, thereby reducing the digitization time significantly.
We have digitized 19,000 hours of data from Apollo missions (including the entire Apollo-11, most of Apollo-13, Apollo-1, and Gemini-8 missions). This audio archive is named the “Fearless Steps Corpus”. It is a unique naturalistic audio corpus of singularly large magnitude. Automated transcripts are generated with a custom, Apollo-mission-specific Deep Neural Network (DNN) based Automatic Speech Recognition (ASR) system together with Apollo-mission-specific language models. A Speaker Identification (SID) system was designed to identify the speakers. A complete diarization pipeline has been established to study and develop various SLT tasks.
We will release this corpus for public use as part of our public outreach, and we encourage the SLT community to use this opportunity to build naturalistic spoken language technology systems. The data provides ample opportunity to set up challenging tasks in various SLT areas. As part of this outreach we will be organizing the “Fearless Challenge” at the upcoming INTERSPEECH 2018. We will define and propose 5 tasks as part of this challenge. The guidelines and challenge data will be released in Spring 2018 and will be available for download for free. The five challenge tasks are: (1) Automatic Speech Recognition, (2) Speaker Identification, (3) Speech Activity Detection, (4) Speaker Diarization, and (5) Keyword Spotting and Joint Topic/Sentiment Detection.
We look forward to your participation (John.Hansen@utdallas.edu).


5-2-11 SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis. A total of 9750 utterances from various sources such as parliament debates and novels were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
 

5-2-12 JLCorpus - Emotional Speech corpus with primary and secondary emotions
JLCorpus - Emotional Speech corpus with primary and secondary emotions:
 

To further the understanding of the wide array of emotions embedded in human speech, we are introducing an emotional speech corpus. In contrast to the existing speech corpora, this corpus was constructed by maintaining an equal distribution of 4 long vowels in New Zealand English. This balance is to facilitate emotion related formant and glottal source feature comparison studies. Also, the corpus has 5 secondary emotions along with 5 primary emotions. Secondary emotions are important in Human-Robot Interaction (HRI), where the aim is to model natural conversations among humans and robots. But there are very few existing speech resources to study these emotions, and this work adds a speech corpus containing some secondary emotions.

Please use the corpus for emotional speech related studies. When you use it please include the citation as:

Jesin James, Li Tian, Catherine Watson, 'An Open Source Emotional Speech Corpus for Human Robot Interaction Applications', in Proc. Interspeech, 2018.

To access the whole corpus including the recording supporting files, click the following link: https://www.kaggle.com/tli725/jl-corpus, (if you have already installed the Kaggle API, you can type the following command to download: kaggle datasets download -d tli725/jl-corpus)

Or if you simply want the raw audio+txt files, click the following link: https://www.kaggle.com/tli725/jl-corpus/downloads/Raw%20JL%20corpus%20(unchecked%20and%20unannotated).rar/4

The corpus was evaluated by a large scale human perception test with 120 participants. The links to the surveys are given here. For the primary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_8ewmOCgOFCHpAj3

For the secondary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_eVDINp8WkKpsPsh

These surveys will give an overall idea about the type of recordings in the corpus.

The perceptually verified and annotated JL corpus will be made publicly available soon.




© Copyright 2024 - ISCA International Speech Communication Association - All rights reserved.