ISCA Services

ISCA - International Speech Communication Association

ISCApad #289

Sunday, July 10, 2022 by Chris Wellekens

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (June 2022)

In this newsletter:
LDC at LREC 2022
LDC data and commercial technology development
30th Anniversary Highlight: TIMIT

New publication:
Second DIHARD Challenge Evaluation - Eleven Sources

LDC at LREC 2022
LDC will attend the 13th Language Resources and Evaluation Conference (LREC 2022), hosted by the European Language Resources Association (ELRA), in Marseille, France, June 20-25, 2022. Several LDC staff members will present current work on topics including WeCanTalk: A New Multi-language, Multi-modal Resource for Speaker Recognition; Reflections on 30 Years of Language Resource Development and Sharing; A Study in Contradiction: Data and Annotation for AIDA Focusing on Informational Conflict in Russia-Ukraine Relations; Data Protection, Privacy and US Regulation; BeSt: The Belief and Sentiment Corpus; and more.

Stay tuned for specific announcements on LDC’s social media pages regarding presentation times and locations. Following the conference, LDC’s presented papers and posters will be available on the Papers Page.

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

30th Anniversary Highlight: TIMIT
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is another of the classic releases in LDC’s Catalog. Designed for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems, it contains recordings of 630 American English speakers each reading 10 phonetically rich sentences, for a total of 6300 utterances comprising 2342 distinct sentences. Data collection and annotation were a joint effort by Texas Instruments, the Massachusetts Institute of Technology, and SRI International, and the data release was prepared by NIST (National Institute of Standards and Technology).

TIMIT was among the first publications that appeared with the launch of LDC’s catalog in 1993. It remains one of the Consortium’s top ten distributed corpora and may be the single most widely used speech database. Despite its age and small size relative to modern data sets, TIMIT’s wide range of phonetically representative inputs, its time-aligned lexical and phonemic transcripts, and its easy availability through the LDC Catalog have contributed to its widespread use and continued popularity. Thousands of researchers remember its famous first sentence: “she had your dark suit in greasy wash water all year”.

LDC continues the TIMIT series with its Global TIMIT project, which aims to create a series of corpora in a variety of languages with TIMIT-like features (Chanchaochai et al., 2018). Data sets published from that project include: Global TIMIT Learner Treebank English, Global TIMIT Learner Simple English, Global TIMIT Mandarin Chinese – Guanzhong Dialect, and Global TIMIT Mandarin Chinese.

The LDC Catalog features over 900 holdings in more than 90 languages and more data is added each year. All TIMIT corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publication:
Second DIHARD Challenge Evaluation - Eleven Sources was developed by LDC and contains approximately 20 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge.

The Second DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and web videos. Annotations include diarization and segmentation.

Second DIHARD Challenge Evaluation - Eleven Sources is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104