ISCA - International Speech
Communication Association



ISCApad #257

Tuesday, November 12, 2019 by Chris Wellekens

5-2 Database
5-2-1Linguistic Data Consortium (LDC) update (October 2019)

 

In this newsletter:

Membership Year 2020 Publication Preview
LDC Data and Commercial Technology Development

New Publications:
BOLT English Treebank - Discussion Forum
Polish Speech Database
2016 NIST Speaker Recognition Evaluation Test Set


Membership Year 2020 Publication Preview

 

The 2020 Membership Year is just around the corner and plans for next year's publications are in progress. Among the expected releases are:

  • Abstract Meaning Representation (AMR) Annotation Release 3.0: a semantic treebank of over 59,000 English natural language sentences from broadcast conversations, newswire, weblogs, and web discussion forums; updates the second version (LDC2017T10) with new annotations.

  • TAC KBP: English sentiment slot filling, surprise slot filling, nugget detection and coreference, and event argument data in all three program languages (English, Chinese, and Spanish).

  • DEFT Chinese ERE: Chinese discussion forum data annotated for entities, relations, and events.

  • LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts.

  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Dholuo, Javanese, and Mongolian.

  • HAVIC MED Training Data: web video, metadata, and annotations for developing multimedia systems.

  • RATS Speaker Identification: conversational telephone speech in Levantine Arabic, Pashto, Urdu, Farsi, and Dari on degraded audio signals, with annotation of speech segments for speaker identification.

  • BOLT: word-aligned, tagged, and coreference-annotated discussion forum, SMS/chat, and conversational telephone speech data in all three program languages (Chinese, Egyptian Arabic, and English).

Check your inbox in the coming weeks for more information about membership renewal.

 

 

 

LDC Data and Commercial Technology Development

 

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

 


 


New publications:

 

 

 

(1) BOLT English Treebank - Discussion Forum was developed by LDC and consists of 268,907 tokens of English web discussion forum data with part-of-speech and syntactic structure annotations collected for the DARPA BOLT (Broad Operational Language Translation) program.

 

 

 

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.

 

 

 

The source data is English discussion forum web text collected by LDC in 2011 and 2012. A subset of that data -- 702 files representing 268,907 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. The unannotated English source data is released as BOLT English Discussion Forums (LDC2017T11).

 

 

 

BOLT English Treebank - Discussion Forum is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $2000.

 

*

 

(2) Polish Speech Database was developed by VoiceLab and consists of 263,424 utterances of Polish speech data from 200 speakers, totaling approximately 280 hours, and corresponding transcripts.

 

 

 

Data collection was performed in Poland. Speakers were asked to record themselves reading text on a website for at least 60 minutes from their home computers while using a headset. The read text comprised sentences covering most speech sounds in Polish.

 

 

 

This release includes speaker metadata. There were 103 male speakers and 97 female speakers, ranging from 15 to 60 years of age; most speakers were in the 15 to 30 age range.

 

 

 

Polish Speech Database is distributed via web download.

 

 

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $3000.

 

*

 

(3) 2016 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology) and contains approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano, and Mandarin telephone speech used as development and test data in the NIST-sponsored 2016 Speaker Recognition Evaluation (SRE).

As in previous evaluations, SRE16 focused on telephone speech recorded over a variety of handset types for the training and test conditions. In addition to development and evaluation data, this corpus also contains trial lists, their associated keys, tables containing metadata information, and evaluation documentation.

 

 

 

The telephone speech data was drawn from the Call My Net 2015 Corpus collected by LDC. Native speakers of Tagalog, Cantonese, Cebuano, or Mandarin (220 unique speakers) made a total of ten telephone calls each to people within their existing social networks. Speakers were encouraged to use different telephone instruments in a variety of acoustic settings and were instructed to talk for 8 - 10 minutes per call on a topic of their choice. All conversations were collected outside North America.
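Because the corpus ships with trial lists and their associated keys, a system's detection scores can be summarized as an equal error rate (EER), the point where the miss and false-alarm rates coincide. The sketch below uses made-up scores and a hypothetical (score, is_target) trial format purely for illustration; the real trials come from the corpus's key files.

```python
# Hypothetical detection scores for a handful of trials: (score, is_target).
# These stand in for a system's output on SRE16-style trials.
trials = [(2.1, True), (1.9, True), (1.7, True), (1.2, True),
          (1.5, False), (0.9, False), (0.4, False), (0.1, False)]

targets = [s for s, is_tgt in trials if is_tgt]
nontargets = [s for s, is_tgt in trials if not is_tgt]

def error_rates(threshold):
    """Miss rate (targets rejected) and false-alarm rate (non-targets accepted)."""
    miss = sum(s < threshold for s in targets) / len(targets)
    fa = sum(s >= threshold for s in nontargets) / len(nontargets)
    return miss, fa

# Sweep the observed scores as thresholds; the EER lies where the two rates meet.
eer_thr = min((s for s, _ in trials),
              key=lambda t: abs(error_rates(t)[0] - error_rates(t)[1]))
miss, fa = error_rates(eer_thr)
print(eer_thr, miss, fa)  # → 1.5 0.25 0.25
```

On real trial lists one would interpolate between thresholds rather than sweep raw scores, but the principle is the same.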

 

 

 

2016 NIST Speaker Recognition Evaluation Test Set is distributed via web download.

 

 

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $750.

 

*

 

 

 

Membership Office

 

Linguistic Data Consortium

 

University of Pennsylvania

 

T: +1-215-573-1275

 

E: ldc@ldc.upenn.edu

 

M: 3600 Market St. Suite 810

 

      Philadelphia, PA 19104

 

 


 

 

 

 

 

 

 

 

 

 

 

 


5-2-2ELRA - Language Resources Catalogue - Update (October 2019)
In the framework of a distribution agreement between ELRA and the CJK Dictionary Institute, Inc., ELRA is happy to announce the distribution of 29 Monolingual Lexicons and 20 Multilingual Lexicons, suitable for a large variety of natural language processing applications. Monolingual Lexicons are available for Arabic, Cantonese, Simplified and Traditional Chinese, Japanese, Korean, Persian and Spanish and Multilingual lexicons include those languages as well as some other European languages (English, German, French, Italian, Portuguese and Russian) and Asian languages (Vietnamese, Indonesian, Thai). Possible applications include information retrieval, morphological analysis, word segmentation, named entity recognition, machine translation, etc. All lexicons are made available in tab-delimited, UTF-8 encoded text files.
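Since every lexicon is delivered as a tab-delimited, UTF-8 encoded text file, loading one programmatically is straightforward. A minimal Python sketch, assuming a hypothetical two-column layout (headword and reading); the actual column inventory varies per lexicon:

```python
import csv
import os
import tempfile

# Hypothetical two-column entries (headword <TAB> reading), written out only
# so this sketch is self-contained; a real CJKI lexicon file would be used here.
sample = "中国\tZhōngguó\n日本\tRìběn\n"
path = os.path.join(tempfile.mkdtemp(), "lexicon.tsv")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read the tab-delimited, UTF-8 file exactly as distributed.
with open(path, encoding="utf-8", newline="") as f:
    entries = {row[0]: row[1] for row in csv.reader(f, delimiter="\t")}

print(entries["中国"])  # → Zhōngguó
```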

The following list of lexicons is available:

1) Monolingual Lexicons:

Cantonese Readings Database, ELRA ID: ELRA-L0101, ISLRN: 634-690-317-631-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0101
Chinese Phonological Database, ELRA ID: ELRA-L0102, ISLRN: 968-547-869-011-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0102
Simplified to Traditional Chinese Conversion, ELRA ID: ELRA-L0103, ISLRN: 151-342-562-705-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0103
Hanzi Pinyin Database for Simplified Chinese, ELRA ID: ELRA-L0104, ISLRN: 292-895-602-975-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0104
Database of Chinese Name Variants, ELRA ID: ELRA-L0105, ISLRN: 379-237-021-386-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0105
Database of Chinese Full Names, ELRA ID: ELRA-L0106, ISLRN: 356-835-468-182-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0106
Chinese Lexical Database, ELRA ID: ELRA-L0107, ISLRN: 500-068-723-953-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0107
Chinese Morphological Database, ELRA ID: ELRA-L0108, ISLRN: 279-636-746-963-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0108
Comprehensive Wordlist of Simplified Chinese, ELRA ID: ELRA-L0109, ISLRN: 159-767-888-341-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0109
Comprehensive Word List of Traditional Chinese, ELRA ID: ELRA-L0110, ISLRN: 378-715-589-213-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0110
Japanese Phonological Database, ELRA ID: ELRA-L0111, ISLRN: 169-903-096-259-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0111
Japanese Lexical Database, ELRA ID: ELRA-L0112, ISLRN: 162-212-767-492-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0112
Japanese Morphological Database, ELRA ID: ELRA-L0113, ISLRN: 212-935-180-069-7
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0113
Japanese Orthographical Database, ELRA ID: ELRA-L0114, ISLRN: 261-356-756-593-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0114
Japanese Companies and Organizations, ELRA ID: ELRA-L0115, ISLRN: 570-674-242-221-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0115
Database of Japanese Name Variants, ELRA ID: ELRA-L0116, ISLRN: 850-674-726-461-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0116
Comprehensive Word List of Japanese, ELRA ID: ELRA-L0117, ISLRN: 145-375-006-102-6
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0117
Korean Lexical Database, ELRA ID: ELRA-L0118, ISLRN: 702-121-344-159-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0118
Comprehensive Word List of Korean, ELRA ID: ELRA-L0119, ISLRN: 652-932-407-045-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0119
Arabic Full Form Lexicon, ELRA ID: ELRA-L0120, ISLRN: 968-827-909-119-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0120
Database of Arabic Plurals, ELRA ID: ELRA-L0121, ISLRN: 414-072-749-098-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0121
Database of Arab Names, ELRA ID: ELRA-L0122, ISLRN: 998-153-793-831-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0122
Database of Arab Names in Arabic, ELRA ID: ELRA-L0123, ISLRN: 126-981-976-765-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0123
Database of Foreign Names in Arabic, ELRA ID: ELRA-L0124, ISLRN: 130-493-475-689-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0124
Database of Arabic Place Names, ELRA ID: ELRA-L0125, ISLRN: 916-541-123-321-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0125
Comprehensive Database of Chinese Personal Names, ELRA ID: ELRA-L0126, ISLRN: 797-857-604-135-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0126
Database of Persian Names, ELRA ID: ELRA-L0127, ISLRN: 739-878-734-567-6
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0127
Spanish Full-form Lexicon (Monolingual), ELRA ID: ELRA-L0128, ISLRN: 866-578-477-474-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0128
Database of Chinese Names, ELRA ID: ELRA-L0129, ISLRN: 792-499-131-789-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0129
 
2) Bilingual/Multilingual Lexicons:

Simplified Chinese-English Technical Terms, ELRA ID: ELRA-M0053, ISLRN: 418-191-947-016-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0053
Simplified Chinese-to-English Dictionary, ELRA ID: ELRA-M0054, ISLRN: 694-156-385-534-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0054
English-to-Simplified Chinese Dictionary, ELRA ID: ELRA-M0055, ISLRN: 407-348-028-638-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0055
Chinese-English Database of Proverbs and Idioms (Chengyu), ELRA ID: ELRA-M0056, ISLRN: 506-728-933-717-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0056
Chinese-Japanese Technical Terms Dictionary, ELRA ID: ELRA-M0057, ISLRN: 079-503-057-574-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0057
Chinese-English Database of Proper Nouns, ELRA ID: ELRA-M0058, ISLRN: 638-295-493-483-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0058
Chinese-Japanese Database of Proper Nouns, ELRA ID: ELRA-M0059, ISLRN: 951-838-928-664-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0059
Spanish Full-form Lexicon (Bilingual), ELRA ID: ELRA-M0060, ISLRN: 942-238-032-826-7
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0060
Japanese-English Dictionary, ELRA ID: ELRA-M0061, ISLRN: 854-879-959-652-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0061
English-Japanese Dictionary, ELRA ID: ELRA-M0062, ISLRN: 233-968-157-290-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0062
Multilingual Database of Japanese Points-of-Interest 1, ELRA ID: ELRA-M0063, ISLRN: 902-666-654-661-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0063
Multilingual Database of Japanese Points-of-Interest 2, ELRA ID: ELRA-M0064, ISLRN: 268-160-514-957-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0064
Japanese-English Database of Proper Nouns, ELRA ID: ELRA-M0065, ISLRN: 104-268-721-502-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0065
Japanese - English Dictionary of Technical Terms, ELRA ID: ELRA-M0066, ISLRN: 499-497-806-398-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0066
Korean-Japanese Dictionary of Technical Terms, ELRA ID: ELRA-M0067, ISLRN: 584-164-296-035-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0067
Korean-English Database of Proper Nouns, ELRA ID: ELRA-M0068, ISLRN: 408-409-094-493-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0068
Korean-Japanese Database of Proper Nouns, ELRA ID: ELRA-M0069, ISLRN: 265-620-933-123-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0069
Korean-Chinese Database of Proper Nouns, ELRA ID: ELRA-M0070, ISLRN: 207-127-841-003-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0070
Comprehensive Word Lists for Chinese, Japanese, Korean and Arabic, ELRA ID: ELRA-M0071, ISLRN: 476-146-877-598-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0071
Multilingual Proper Noun Database, ELRA ID: ELRA-M0072, ISLRN: 340-315-642-771-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0072
 
About CJK Dictionary Institute, Inc.
The CJK Dictionary Institute, Inc. (CJKI) specializes in CJK lexicography.  The principal activity of CJKI is the development and continuous expansion of lexical databases of general vocabulary, proper nouns and technical terms for CJK languages (Chinese, Japanese, Korean), including Chinese dialects such as Cantonese and Hakka, containing millions of entries. CJKI also developed databases and romanization systems of Arabic proper nouns, a comprehensive Spanish-English dictionary, a Chinese-Vietnamese names dictionary, and various others. In addition, CJKI offers a full range of professional consulting services on CJK linguistics and lexicography.
To find out more about CJKI, please visit the website: http://www.cjk.org/cjk/index.htm

About ELRA
The European Language Resources Association (ELRA) is a non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for Language Resources and promoting Human Language Technologies. Language Resources covering various fields of HLT (including Multimodal, Speech, Written, Terminology) and a great number of languages are available from the ELRA catalogue. ELRA's strong involvement in the fields of Language Resources  and Language Technologies is also emphasized at the LREC conference, organized every other year since 1998.
To find out more about ELRA, please visit the website: http://www.elra.info


For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements









5-2-3Speechocean – update (August 2019)

 

English Speech Recognition Corpus - Speechocean

 

At present, Speechocean has produced more than 24,000 hours of English speech recognition corpora, including some rare corpora recorded by children. These corpora were recorded by 23,000 speakers in total. Please see the table below:

 

Name                            Speakers    Hours
American English                8,441       8,029
Indian English                  2,394       3,540
British English                 2,381       3,029
Australian English              1,286       1,954
Chinese (Mainland) English      3,478       1,513
Canadian English                1,607       1,309
Japanese English                1,005         902
Singapore English                 404         710
Russian English                   230         492
Romanian English                  201         389
French English                    225         378
Chinese (Hong Kong) English       200         378
Italian English                   213         366
Portuguese English                201         341
Spanish English                   200         326
German English                    196         306
Korean English                    116         207
Indonesian English                402         126
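As a quick sanity check, the per-corpus figures sum to the headline totals quoted above (roughly 23,000 speakers and more than 24,000 hours). A small sketch, with the (speakers, hours) pairs transcribed from the table:

```python
# (speakers, hours) per accent corpus, transcribed from the Speechocean table.
corpora = {
    "American English": (8441, 8029), "Indian English": (2394, 3540),
    "British English": (2381, 3029), "Australian English": (1286, 1954),
    "Chinese (Mainland) English": (3478, 1513), "Canadian English": (1607, 1309),
    "Japanese English": (1005, 902), "Singapore English": (404, 710),
    "Russian English": (230, 492), "Romanian English": (201, 389),
    "French English": (225, 378), "Chinese (Hong Kong) English": (200, 378),
    "Italian English": (213, 366), "Portuguese English": (201, 341),
    "Spanish English": (200, 326), "German English": (196, 306),
    "Korean English": (116, 207), "Indonesian English": (402, 126),
}

total_speakers = sum(spk for spk, _ in corpora.values())
total_hours = sum(hrs for _, hrs in corpora.values())
print(total_speakers, total_hours)  # → 23180 24295
```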

 

 

If you have any further inquiries, please do not hesitate to contact us.

Web: en.speechocean.com

Email: marketing@speechocean.com

 

 

 

 

 

 


 


 

 


5-2-4Google 's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
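The distributed per-word log-probability values make it easy to compare a new model against these baselines: for example, perplexity over a held-out set can be recovered directly from them. A small sketch with hypothetical log10 values (the benchmark's actual files hold one value per word per held-out set):

```python
# Hypothetical per-word log10 probabilities for one held-out sentence, standing
# in for the per-word values the benchmark distributes for its baseline models.
logprobs = [-1.2, -0.8, -2.5, -0.3, -1.7]

# Perplexity is the inverse geometric mean of the word probabilities:
#   PPL = 10 ** (-(sum of log10 probs) / N)
ppl = 10 ** (-sum(logprobs) / len(logprobs))
print(round(ppl, 2))  # → 19.95
```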


5-2-5Forensic database of voice recordings of 500+ Australian English speakers

Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police,  New South Wales Police,  Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings are recorded under high-quality conditions and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/


5-2-6Audio and Electroglottographic speech recordings

 

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/ San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)


5-2-7EEG-face tracking- audio 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

 

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

 

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

 

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

 

Best regards,

Frank

 

 

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane, Australia.

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

 


5-2-8TORGO data base free for academic use.

In the spirit of the season, I would like to announce the immediate availability of the TORGO database free, in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition, for example. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.


5-2-9Datatang

Datatang is a leading global data provider specializing in customized data solutions, focusing on a wide variety of speech, image, and text data collection, annotation, and crowdsourcing services.

 

Summary of the new datasets (2018) and a brief plan for 2019.

 

 

 

  • Speech data (with annotation) that we finished in 2018

 

Language                       Hours
French                           794
British English                  800
Spanish                          435
Italian                        1,440
German                         1,800
Spanish (Mexico/Colombia)        700
Brazilian Portuguese           1,000
European Portuguese            1,000
Russian                        1,000

 

  • 2019 ongoing speech projects

 

Type                            Project Name
Europeans speak English         1000 Hours - Spanish Speak English
                                1000 Hours - French Speak English
                                1000 Hours - German Speak English
Call Center Speech              1000 Hours - Call Center Speech
Off-the-shelf data expansion    1000 Hours - Chinese Speak English
                                1500 Hours - Mixed Chinese and English Speech Data

 

 

 

On top of the above, more speech data collections are planned, such as Japanese speech data, children's speech data, and dialect speech data.

 

Moreover, we will continue to provide these data at a competitive price while maintaining a high accuracy rate.

 

 

 

If you have any questions or need more details, please do not hesitate to contact us at jessy@datatang.com. We would be happy to send you a sample or a specification of the data.

 

 

 


Back  Top

5-2-10Fearless Steps Corpus (University of Texas, Dallas)

Fearless Steps Corpus

John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.


NASA’s Apollo program is one of the great achievements of mankind in the 20th century. CRSS at UT-Dallas has undertaken an enormous Apollo data digitization initiative, in which we proposed to digitize the Apollo mission speech data (~100,000 hours) and to develop spoken language technology (SLT) algorithms to analyze and understand various aspects of the conversational speech. Toward this goal, a new 30-track analog audio decoder was designed to decode the 30-track Apollo analog tapes; it is mounted on the NASA Soundscriber analog audio decoder in place of the single-channel decoder. With the new decoder, all 30 channels of data can be decoded simultaneously, reducing digitization time significantly.
We have digitized 19,000 hours of data from the Apollo missions (including the entire Apollo-11 mission and most of the Apollo-13, Apollo-1, and Gemini-8 missions). This audio archive is named the “Fearless Steps Corpus”; it is among the largest naturalistic audio corpora of its kind. Automated transcripts are generated by custom Apollo-specific deep neural network (DNN) based automatic speech recognition (ASR) systems together with Apollo-specific language models. A speaker identification (SID) system was designed to identify the speakers, and a complete diarization pipeline has been established to study and develop various SLT tasks.
We will release this corpus for public use as part of our outreach, and we encourage the SLT community to use this opportunity to build naturalistic spoken language technology systems. The data provide ample opportunity to set up challenging tasks in various SLT areas. As part of this outreach, we will organize the “Fearless Steps Challenge” at the upcoming INTERSPEECH 2018, for which we will define and propose five tasks. The guidelines and challenge data will be released in Spring 2018 and will be available for download free of charge. The five tasks are: (1) automatic speech recognition, (2) speaker identification, (3) speech activity detection, (4) speaker diarization, and (5) keyword spotting and joint topic/sentiment detection.
We look forward to your participation (John.Hansen@utdallas.edu).


5-2-11SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, aimed at building TTS systems and at investigating multiple speaking styles and emphasis. A total of 9,750 utterances from various sources, such as parliament debates and novels, were read by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
 

5-2-12JLCorpus - Emotional Speech corpus with primary and secondary emotions
JLCorpus - Emotional Speech corpus with primary and secondary emotions:
 

To further the understanding of the wide array of emotions embedded in human speech, we are introducing an emotional speech corpus. In contrast to existing speech corpora, this corpus was constructed by maintaining an equal distribution of 4 long vowels in New Zealand English. This balance is intended to facilitate studies comparing emotion-related formant and glottal source features. The corpus also has 5 secondary emotions along with 5 primary emotions. Secondary emotions are important in human-robot interaction (HRI), where the aim is to model natural conversations between humans and robots, but very few existing speech resources cover them; this work adds a speech corpus containing some secondary emotions.

Please use the corpus for emotional speech related studies. When you use it please include the citation as:

Jesin James, Li Tian, Catherine Watson, 'An Open Source Emotional Speech Corpus for Human Robot Interaction Applications', in Proc. Interspeech, 2018.

To access the whole corpus including the recording supporting files, click the following link: https://www.kaggle.com/tli725/jl-corpus, (if you have already installed the Kaggle API, you can type the following command to download: kaggle datasets download -d tli725/jl-corpus)

Or if you simply want the raw audio+txt files, click the following link: https://www.kaggle.com/tli725/jl-corpus/downloads/Raw%20JL%20corpus%20(unchecked%20and%20unannotated).rar/4

The corpus was evaluated in a large-scale human perception test with 120 participants. The links to the surveys are given below. For the primary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_8ewmOCgOFCHpAj3

For the secondary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_eVDINp8WkKpsPsh

These surveys will give an overall idea about the type of recordings in the corpus.

The perceptually verified and annotated JL corpus will be given public access soon.


5-2-13OPENGLOT –An open environment for the evaluation of glottal inverse filtering

OPENGLOT –An open environment for the evaluation of glottal inverse filtering

 

OPENGLOT is a publicly available database that was designed primarily for the evaluation of glottal inverse filtering algorithms. In addition, the database can be used in evaluating formant estimation methods. OPENGLOT consists of four repositories. Repository I contains synthetic glottal flow waveforms, and speech signals generated by using the Liljencrants–Fant (LF) waveform as an excitation and an all-pole vocal tract model. Repository II contains glottal flow and speech pressure signals generated using physical modelling of human speech production. Repository III contains pairs of glottal excitation and speech pressure signals generated by exciting a 3D-printed plastic vocal tract replica with LF excitations via a loudspeaker. Finally, Repository IV contains multichannel recordings (speech pressure signal, EGG, high-speed video of the vocal folds) from natural production of speech.

 

OPENGLOT is available at:

http://research.spa.aalto.fi/projects/openglot/


5-2-14Corpus Rhapsodie

We are pleased to announce the publication of a book devoted to the Rhapsodie treebank, a 33,000-word corpus of spoken French finely annotated for prosody and syntax.

Access to the publication: https://benjamins.com/catalog/scl.89 (see attached flyer)

Access to the treebank: https://www.projet-rhapsodie.fr/
The freely accessible data are distributed under a Creative Commons licence.
The site also provides access to the annotation guides.


5-2-15The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus), Boulder Learning Inc.

The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus) is the world's largest English children's speech corpus. It is freely available to the research community for research use. Companies can acquire the corpus for $10,000. The MyST Corpus was collected over a 10-year period, with support from over $9 million in grants from the US National Science Foundation and Department of Education, awarded to Boulder Learning Inc. (Wayne Ward, Principal Investigator).

The MyST corpus contains speech collected from 1,374 third, fourth and fifth grade students. The students engaged in spoken dialogs with a virtual science tutor in 8 areas of science. A total of 11,398 student sessions of 15 to 20 minutes produced a total of 244,069 utterances. 42% of the utterances have been transcribed at the word level. The corpus is partitioned into training and test sets to support comparison of research results across labs. All parents and students signed consent forms, approved by the University of Colorado's Institutional Review Board, that authorize distribution of the corpus for research and commercial use.
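As a rough back-of-the-envelope check on the figures above (a sketch only; the announcement does not state the exact transcribed count):

```python
# Headline figures from the corpus description above.
sessions = 11398
utterances = 244069
transcribed_fraction = 0.42

per_session = utterances / sessions                      # average utterances per session
transcribed = round(utterances * transcribed_fraction)   # approx. word-level transcribed utterances

print(round(per_session, 1), transcribed)  # → 21.4 102509
```

That is roughly 21 utterances per 15-20 minute session and about 102,500 transcribed utterances.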

The MyST children's speech corpus contains approximately ten times as many spoken utterances as all other English children's speech corpora combined (see https://en.wikipedia.org/wiki/List_of_children%27s_speech_corpora).

Additional information about the corpus, and instructions for how to acquire the corpus (and samples of the speech data) can be found on the Boulder Learning Web site at http://boulderlearning.com/request-the-myst-corpus/.   


5-2-16HARVARD speech corpus - native British English speaker
  • HARVARD speech corpus - native British English speaker, digital re-recording
 