ISCA - International Speech
Communication Association



ISCApad #247

Friday, January 18, 2019 by Chris Wellekens

5-2 Database
5-2-1Linguistic Data Consortium (LDC) update (December 2018)

In this newsletter:

  • LDC Membership Discounts for MY2019 Still Available
  • Spring 2019 LDC Data Scholarship Program - deadline approaching
  • New publications:
      HUB5 Mandarin Telephone Speech and Transcripts Second Edition
      Nautilus Speaker Characterization
      TAC Relation Extraction Dataset

_______________________________________________________________

 

LDC Membership Discounts for MY2019 Still Available

 

Join LDC while membership savings are still available. Now through March 1, 2019, renewing MY2018 members will receive a 10% discount off the membership fee. New or non-consecutive member organizations will receive a 5% discount. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

 

Spring 2019 LDC Data Scholarship Program - deadline approaching

 

Students can apply for the Spring 2019 Data Scholarship Program now through January 15, 2019. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

_______________________________________________________________

 

New publications:

 

(1) HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by LDC in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC in two data sets, HUB5 Mandarin Telephone Speech Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second edition merges the speech and transcript releases, updates the audio format, and adds Pinyin transcripts, forced alignment, and updated documentation and metadata.

 

This corpus contains (1) approximately 19 hours of Mandarin speech from 42 unscripted telephone conversations between native speakers of Mandarin, drawn from CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been released in a second, updated edition (LDC2018S09), and (2) associated transcripts of contiguous 5-30 minute segments from those telephone conversations.

 

Participants could speak with a person of their choice on any topic; most called family members and friends. The recorded conversations lasted up to 30 minutes. Transcripts were created manually by native Mandarin speakers in the GB2312 encoding scheme. This release includes Pinyin transcripts and the original transcripts, both in UTF-8 format.
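Transcoding such legacy GB2312 transcripts to UTF-8 is a one-step decode/re-encode. The sketch below is generic and illustrative only (the strings are invented, not taken from the corpus):

```python
def gb2312_to_utf8(data: bytes) -> bytes:
    """Decode GB2312-encoded bytes and re-encode them as UTF-8."""
    return data.decode("gb2312").encode("utf-8")

# Round-trip check on a short Mandarin string.
original = "你好".encode("gb2312")   # two 2-byte GB2312 codes
converted = gb2312_to_utf8(original)
assert converted.decode("utf-8") == "你好"
```

Batch conversion of whole transcript files would read each file in binary mode, apply the same conversion, and write the result back out.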

 

HUB5 Mandarin Telephone Speech and Transcripts Second Edition is available via web download.

 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $2500.

 

*

 

(2) Nautilus Speaker Characterization was developed at the Technical University of Berlin and is comprised of approximately 155 hours of conversational speech from 300 German speakers aged 18 to 35 years (126 males and 174 females) with no marked dialect or accent, recorded in an acoustically-isolated room. The corpus was designed to support research on the detection of speaker social characteristics, such as personality, charisma, and voice attractiveness.

 

Four scripted and four semi-spontaneous dialogs simulating telephone call inquiries were elicited from the speakers. Additionally, spontaneous neutral and emotional speech utterances (predominantly excitement or frustration) and questions were produced.

 

Speech corresponding to one of the semi-spontaneous dialogs was evaluated with respect to 34 continuous numeric labels of perceived interpersonal speaker characteristics (such as likable, attractive, competent, childish). For a set of 20 selected 'extreme' speakers evaluated for their warmth-attractiveness, 34 naive voice descriptions (such as bright, creaky, articulate, melodious) were also evaluated. The corpus contains all labels, together with the speech recordings and the speakers' metadata (e.g., age, gender, place of birth, chronological places of residence and duration of stay, parents' place of birth, self-assessed personality).

 

Nautilus Speaker Characterization is available via web download.

 

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

 

 

 

*

 

 

 

(3) TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The annotations were derived from TAC KBP relation types (see the guidelines), from human annotations developed by LDC and from crowdsourcing using Mechanical Turk.

 

Source corpora used for this dataset were TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03) and TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 (LDC2018T22). For detailed information about the dataset and benchmark results, please refer to the TACRED paper.

 

TAC Relation Extraction Dataset is available via web download.

 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $25.

 


5-2-2ELRA - Language Resources Catalogue - Update (October 2018)
We are happy to announce that 2 new Written Corpora and 4 new Speech resources are now available in our catalogue.

ELRA-W0126 Training and test data for Arabizi detection and transliteration
ISLRN: 986-364-744-303-9
The dataset is composed of: a collection of mixed English and Arabizi text, intended to train and test a system for automatic detection of code-switching in mixed English and Arabizi texts; and a set of 3,452 Arabizi tokens manually transliterated into Arabic, intended to train and test a system that performs Arabizi-to-Arabic transliteration.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0126/

ELRA-W0127 Normalized Arabic Fragments for Inestimable Stemming (NAFIS)
ISLRN: 305-450-745-774-1
This is an Arabic stemming gold-standard corpus composed of a collection of 37 sentences, selected to be representative of Arabic stemming tasks and manually annotated. The sentences come from various sources (poems, the holy Quran, books, and periodicals) of diverse kinds (proverbs and dicta, article commentary, religious text, literature, historical fiction). NAFIS is represented according to the TEI standard.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0127/

ELRA-S0396 Mbochi speech corpus

ISLRN: 747-055-093-447-8
This corpus consists of 5,131 sentences recorded in Mbochi, together with their transcriptions and French translations, as well as the results of work done during the JSALT workshop: alignments at the phonetic level and various results of unsupervised word segmentation from audio. The audio amounts to 4.5 hours, downsampled to 16 kHz, 16-bit, linear PCM encoding. The data is distributed in two parts: a training set of 4,617 sentences and a development set of 514 sentences.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0396/
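A stated format like "16 kHz, 16-bit, linear PCM" is easy to verify programmatically with Python's standard `wave` module. The sketch below runs on a synthetic file it creates itself, since no actual corpus file is assumed here:

```python
import os
import tempfile
import wave

def check_format(path, rate=16000, width=2, channels=1):
    """True if a WAV file matches the expected format
    (defaults: 16 kHz, 16-bit linear PCM, mono)."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getsampwidth() == width
                and w.getnchannels() == channels)

# Demo on a synthetic file (real corpus files are not bundled here).
path = os.path.join(tempfile.gettempdir(), "demo_16k.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                 # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)  # 10 ms of silence
assert check_format(path)
```

Running such a check over a downloaded corpus catches files that were not resampled as documented before they skew feature extraction.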

ELRA-S0397 Chinese Mandarin (South) database

ISLRN: 503-886-852-083-2
This database contains the recordings of 1,000 Chinese Mandarin speakers from Southern China (500 males and 500 females), aged 18 to 60, recorded in quiet studios. Recordings were made through microphone headsets and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as 48 kHz mono, 16-bit, linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0397/

ELRA-S0398 Chinese Mandarin (North) database
ISLRN: 353-548-770-894-7
This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), aged 18 to 60, recorded in quiet studios. Recordings were made through microphone headsets and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as 48 kHz mono, 16-bit, linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0398/

ELRA-S0401 Persian Audio Dictionary
ISLRN: 133-181-128-420-9
This dictionary consists of more than 50,000 entries (along with almost all wordforms and proper names) with corresponding audio files in MP3 format and English transliterations. The words were recorded with standard Persian (Farsi) pronunciation, all by a single speaker. The dictionary is provided with its software.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0401/

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/













5-2-3Speechocean – update (November 2018)

 

 

Spanish ASR & TTS Corpus --- Speechocean

 

 

Speechocean: The World’s Leading A.I. Data Resource & Service Supplier

 

At present, we can provide data services with 110+ languages and dialects across the world. For more detailed information, please visit our website: http://kingline.speechocean.com

 

Spanish ASR & TTS Corpus

ASR (Multiple Corpora)

Language             Number of Speakers   Recording Hours
Spain Spanish        1596                 1814
Mexican Spanish      1298                 1876
American Spanish     994                  1178
Argentine Spanish    526                  950
Chilean Spanish      500                  930

TTS

Language             Number of Utterances   Recording Hours
Spain Spanish        7035                   10.44

 

More Information

 

About ASR Corpus…

  • Speaker information: native speakers, balanced across age, gender, and regional accent
  • Recording environment: quiet or noisy
  • Recording platform: desktop, mobile, telephone…
  • Recording content: sentences, conversation, digits, contact names, SMS…
  • Post-processing: proofreading, transcription, annotation, and quality control
  • Lexicon: included

 

About TTS Corpus…

  • Parameters: 48 kHz, 16-bit
  • Channel: mono
  • Speaker information: a professional native female voice talent
  • Recording environment: professional studio
  • Labeling: phone boundary, prosody, and POS labeling
  • Post-processing: proofreading and quality control
  • Lexicon: included

 

 

Contact Information:

Name: Xianfeng Cheng

Position: Vice President

Tel: +86-10-62660928;

+86-10-62660053 ext.8080

Mobile: +86-13681432590

Skype: xianfeng.cheng1

Email: chengxianfeng@speechocean.com

cxfxy0cxfxy0@gmail.com

Website: www.speechocean.com

http://kingline.speechocean.com

 

 

 

 

 

 


 


 

 


5-2-4Google's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
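Given those per-word log-probability files, corpus perplexity can be recovered directly. The sketch below assumes base-10 log-probabilities (the usual convention of ARPA-style LM tools); if the files use natural logs, pass `base=math.e`:

```python
import math

def perplexity(logprobs, base=10.0):
    """Corpus perplexity from one log-probability per word token.

    Perplexity = base ** (-average log-probability). The base must
    match whatever the log-probability files actually use.
    """
    avg = sum(logprobs) / len(logprobs)
    return base ** (-avg)

# A uniform model over a 100-word vocabulary assigns log10(1/100) = -2
# to every token, giving perplexity 100 regardless of corpus length.
assert round(perplexity([-2.0] * 5)) == 100
```

Summing the released per-word values for a held-out set and dividing by its token count reproduces the published baseline perplexities, which is the point of distributing them.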


5-2-5Forensic database of voice recordings of 500+ Australian English speakers

Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police,  New South Wales Police,  Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings were made under high-quality conditions, and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/


5-2-6Audio and Electroglottographic speech recordings

 

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, and Santiago Matatlan / San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.
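Praat TextGrids are plain text, so target intervals can be pulled out with a few lines of Python. The sketch below is a deliberately minimal, regex-based reader for interval tiers; the sample content is invented, and for real work a proper TextGrid parser that tracks tier boundaries is preferable:

```python
import re

# Matches one interval: its start time, end time, and label.
# Caution: tier-level xmin/xmax lines could also match in a full
# file; a robust parser should walk the tier structure instead.
INTERVAL = re.compile(
    r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"')

def intervals(textgrid_text):
    """Return (xmin, xmax, label) triples from a TextGrid's text."""
    return [(float(a), float(b), t)
            for a, b, t in INTERVAL.findall(textgrid_text)]

# Invented fragment in the TextGrid interval-tier layout.
sample = '''        intervals [1]:
            xmin = 0.0
            xmax = 0.35
            text = "a"
'''
assert intervals(sample) == [(0.0, 0.35, "a")]
```

With the triples in hand, the labeled time spans can be used to slice the corresponding audio or EGG signal for analysis of the target segments.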

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)


5-2-7EEG-face tracking- audio 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

 

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

 

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

 

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

 

Best regards,

Frank

 

 

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane Australia

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

 


5-2-8TORGO data base free for academic use.

In the spirit of the season, I would like to announce the immediate availability of the TORGO database, free in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition, for example. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.


5-2-9Datatang

Datatang is a leading global data provider specializing in customized data solutions, focusing on a variety of speech, image, and text data collection, annotation, and crowdsourcing services.

 

1. Speech data collection
2. Speech data synthesis
3. Speech data transcription

I've attached our company introduction for reference; our available speech datasets are as follows:

  • US English Speech Data: 300 people, about 200 hours
  • Uyghur Speech Data: 2,500 people, about 1,000 hours
  • German Speech Data: 100 people, about 40 hours
  • French Speech Data: 100 people, about 40 hours
  • Spanish Speech Data: 100 people, about 40 hours
  • Korean Speech Data: 100 people, about 40 hours
  • Italian Speech Data: 100 people, about 40 hours
  • Thai Speech Data: 100 people, about 40 hours
  • Portuguese Speech Data: 300 people, about 100 hours
  • Chinese Mandarin Speech Data: 4,000 people, about 1,200 hours
  • Chinese Speaking English Speech Data: 3,700 people, 720 hours
  • Cantonese Speech Data: 5,000 people, about 1,400 hours
  • Japanese Speech Data: 800 people, about 270 hours
  • Chinese Mandarin In-car Speech Data: 690 people, about 245 hours
  • Shanghai Dialect Speech Data: 2,500 people, about 1,000 hours
  • Southern Fujian Dialect Speech Data: 2,500 people, about 1,000 hours
  • Sichuan Dialect Speech Data: 2,500 people, about 860 hours
  • Henan Dialect Speech Data: 400 people, about 150 hours
  • Northeastern Dialect Speech Data: 300 people, 80 hours
  • Suzhou Dialect Speech Data: 270 people, about 110 hours
  • Hangzhou Dialect Speech Data: 400 people, about 170 hours
  • Non-Native Speaking Chinese Speech Data: 1,100 people, about 73 hours
  • Real-world Call Center Chinese Speech Data: more than 5,000 people, 650 hours
  • Mobile-end Real-world Voice Assistant Chinese Speech Data: more than 2,000,000 people, 4,000 hours
  • Heavy Accent Chinese Speech Data: 2,000 people, more than 1,000 hours

 

If any datasets are of particular interest to you, we can also provide samples (costs apply).

 

Regards

 

Runze Zhao

zhaorunze@datatang.com 

Oversea Sales Manager | Datatang Technology 

China

M: +86 185 1698 2583

18 Zhongguancun St.

Kemao Building Tower B 18F

Beijing 100190

 

US

M: +1 617 763 4722 

640 W California Ave, Suite 210

Sunnyvale, CA 94086



5-2-10Fearless Steps Corpus (University of Texas, Dallas)

Fearless Steps Corpus

John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.


NASA's Apollo program is one of mankind's great achievements of the 20th century. CRSS at UT-Dallas has undertaken an enormous Apollo data digitization initiative, in which we proposed to digitize the Apollo mission speech data (~100,000 hours) and develop spoken language technology (SLT) algorithms to analyze and understand various aspects of the conversational speech. Toward this goal, a new 30-track analog audio decoder was designed to decode the 30-track Apollo analog tapes; it is mounted on the NASA Soundscriber (in place of the single-channel decoder), so that all 30 channels of data can be decoded simultaneously, reducing digitization time significantly.
We have digitized 19,000 hours of data from the Apollo missions (including the entire Apollo-11 mission and most of Apollo-13, Apollo-1, and Gemini-8). This audio archive is named the 'Fearless Steps Corpus' and is one of the largest naturalistic audio corpora of its kind. Automated transcripts are generated with Apollo-mission-specific Deep Neural Network (DNN) based Automatic Speech Recognition (ASR) systems along with mission-specific language models. A Speaker Identification (SID) system has been designed to identify the speakers, and a complete diarization pipeline has been established to study and develop various SLT tasks.
We will release this corpus for public use as part of our outreach, and we encourage the SLT community to use this opportunity to build naturalistic spoken language technology systems. The data provides ample opportunity to set up challenging tasks in various SLT areas. As part of this outreach we will be running the 'Fearless Steps' challenge at the upcoming INTERSPEECH 2018, with five tasks: (1) Automatic Speech Recognition, (2) Speaker Identification, (3) Speech Activity Detection, (4) Speaker Diarization, and (5) Keyword Spotting and Joint Topic/Sentiment Detection. The guidelines and challenge data will be released in Spring 2018 and will be freely available for download.
Looking forward to your participation (John.Hansen@utdallas.edu).


5-2-11SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, aimed at building TTS systems and at investigating multiple speaking styles and emphasis. A total of 9,750 utterances from various sources, such as parliamentary debates and novels, were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
 


