ISCA - International Speech
Communication Association

ISCApad #269

Thursday, November 12, 2020 by Chris Wellekens

5-2 Database

5-2-1 Linguistic Data Consortium (LDC) update (October 2020)

In this newsletter:
Fall 2020 Data Scholarship Recipients
Membership Year 2021 Publication Preview
LDC data and commercial technology development

New Publications:
Global TIMIT Learner Treebank English
Corpus of Law, Academic, and News
IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b

Fall 2020 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2020 data scholarships:

Nicole Dodd: University of California, Davis (USA); MA, Linguistics. Nicole is awarded a copy of Arabic Treebank Part 3 v. 3.2 LDC2010T08 for her research in relative clause processing in Standard Arabic.

Satwik Dutta: University of Texas at Dallas (USA); PhD, Electrical Engineering. Satwik is awarded copies of The CMU Kids Corpus LDC97S63 and CSLU: Kids’ Speech Version 1.1 LDC2007S18 for his work in speech activity detection.

Pedram Hosseini: George Washington University (USA); PhD, Computer Science. Pedram is awarded copies of Penn Discourse Treebank Version 3.0 LDC2019T05 and The New York Times Annotated Corpus LDC2008T19 for his research in automatic detection of causal relations in text.

Mariano Maisonnave: Universidad Nacional del Sur (Argentina); PhD, Computer Science. Mariano is awarded a copy of ACE 2005 Multilingual Training Corpus LDC2006T06 for his work in event extraction.

Mark Sullivan: California State University, Los Angeles (USA); Masters, Applied and Advanced Studies in Education. Mark is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for his research in sentence boundary problems of Chinese L1 speakers in English compositions.

For information about the program, visit the Data Scholarships page.

Membership Year 2021 publication preview
The 2021 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:

Global TIMIT Mandarin Chinese: 6,000 linguistically rich utterances featuring time-aligned lexical and phonetic transcription
Columbia Games Corpus: 12 spontaneous task-oriented dyadic conversations elicited from native Standard American English speakers playing computer games, transcribed and annotated for discourse/pragmatic phenomena
My Science Tutor Children’s Conversational Speech: 400+ hours of speech from 1,371 US third, fourth, and fifth grade students participating in sessions with a virtual science tutor, transcripts included
The SSNCE Database of Tamil Dysarthric Speech: Tamil speech from 20 dysarthric speakers aged 12-40 years and a control group (10 speakers) with time-aligned phonetic transcripts
Icelandic Parliamentary Speech: 6,493 Icelandic Parliament recordings from 2005-2016 with 196 speakers, aligned and segmented and divided into training, development, and evaluation sets for ASR development
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools (Akan, Kinyarwanda, and Wolof)
BOLT: co-reference, treebank, propbank, and translation resources for discussion forum, SMS/Chat, and conversational telephone speech data in all languages (Chinese, Egyptian Arabic, and English)
TAC KBP: training and evaluation data for English surprise slot filling (2010) and English sentiment slot filling (2013-2014) tasks

Check your inbox in the coming weeks for more information about membership renewal.

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
(1) Global TIMIT Learner Treebank English was developed by LDC and LAIX Inc. and consists of approximately 24 hours of L1 and L2 English read speech and transcripts. It comprises two separate data sets, each with 50 speakers reading 120 sentences from Treebank-3 (LDC99T42). Of the 120 sentences, 20 were read by all speakers, 40 were read by 10 speakers, and 60 were read by one speaker, for a total of 3,220 sentence types.
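The sentence-type total can be checked with a little arithmetic. The sketch below is only an illustration, not part of the LDC release: it assumes the 120 sentences are counted per speaker and that the sentences "read by 10 speakers" are shared within disjoint groups of 10, so 50 speakers form 5 groups. All names in the code are hypothetical.

```python
# Sketch: reproduce the sentence-type count for one 50-speaker data set.
# Assumption (not stated in the newsletter): sentences "read by 10 speakers"
# are shared within disjoint groups of 10 speakers.

SPEAKERS = 50
GROUP_SIZE = 10
READ_BY_ALL = 20   # sentences that every speaker reads
READ_BY_TEN = 40   # per speaker, shared with the 9 others in the group
READ_BY_ONE = 60   # per speaker, read by nobody else


def sentences_per_speaker() -> int:
    """Number of sentences each speaker reads."""
    return READ_BY_ALL + READ_BY_TEN + READ_BY_ONE


def sentence_types(speakers: int = SPEAKERS) -> int:
    """Number of distinct sentences across all speakers."""
    groups = speakers // GROUP_SIZE
    return READ_BY_ALL + groups * READ_BY_TEN + speakers * READ_BY_ONE


if __name__ == "__main__":
    print(sentences_per_speaker())  # 120
    print(sentence_types())         # 20 + 5*40 + 50*60 = 3220
```

Under these assumptions the count matches the stated total: 20 common sentences, 200 group-shared sentences, and 3,000 single-speaker sentences.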

L1 English Treebank was recorded at the University of Pennsylvania, USA; participants were 25 female and 25 male native American English speakers. L2 English Treebank was recorded at LAIX Inc., Shanghai, China. L2 speakers (25 female, 25 male) were Chinese learners of English who were considered fluent and had met specified standards on English assessment tests.

The Global TIMIT project aimed to create a series of corpora in a variety of languages that share the key features of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

A large number of fluently-read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
A relatively large number of speakers
Time-aligned lexical and phonetic transcription of all utterances
Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker

Global TIMIT Learner Treebank English is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $750.

(2) Corpus of Law, Academic and News consists of 400 Persian documents divided into three genres: legal, academic, and news. The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus comprises published academic abstracts in various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.

Each document contains metadata in the file's header with information such as specific text type, dates, and source, and also contains annotations marking title and body paragraphs.

Corpus of Law, Academic and News is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $100.

(3) IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Halh Mongolian conversational and scripted telephone speech collected in 2014 along with corresponding transcripts. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 61 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

The Babel program focused on underserved languages and sought to develop speech recognition technology that could be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

This is the last release in the IARPA Babel series, which consists of 25 language packs in total.

IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $25.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104