ISCA - International Speech Communication Association

ISCApad #270

Friday, December 11, 2020 by Chris Wellekens

5-2 Database

5-2-1 Linguistic Data Consortium (LDC) update (November 2020)

In this newsletter:

Join LDC for Membership Year 2021
Spring 2021 Data Scholarship Application Deadline

New Publications:
Global TIMIT Learner Simple English
LORELEI Ukrainian Representative Language Pack
TAC KBP Event Argument – Comprehensive Training and Evaluation Data 2016-2017

Join LDC for Membership Year 2021
Membership Year 2021 (MY2021) is open and discounts are available for those who keep their membership current and join early. Current MY2020 members who renew their LDC membership before March 1, 2021 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount when joining by March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 850 holdings. Current-year for-profit members may use most data for commercial applications.

Plans for MY2021 publications are in progress. Among the expected releases are:

Global TIMIT Mandarin Chinese: 6,000 linguistically rich utterances featuring time-aligned lexical and phonetic transcription
Columbia Games Corpus: 12 spontaneous task-oriented dyadic conversations elicited from native Standard American English speakers playing computer games, transcribed and annotated for discourse/pragmatic phenomena
My Science Tutor Children’s Conversational Speech: 400+ hours of speech from 1,371 US third, fourth, and fifth grade students participating in sessions with a virtual science tutor, transcripts included
The SSNCE Database of Tamil Dysarthric Speech: Tamil speech from 20 dysarthric speakers aged 12-40 years and a control group (10 speakers) with time-aligned phonetic transcripts
Icelandic Parliamentary Speech: 6,493 Icelandic Parliament recordings from 2005-2016 with 196 speakers, aligned and segmented and divided into training, development, and evaluation sets for ASR development
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools (Akan, Kinyarwanda, and Wolof)
BOLT: co-reference, treebank, propbank, and translation resources for discussion forum, SMS/Chat, and conversational telephone speech data in all languages (Chinese, Egyptian Arabic, and English)
TAC KBP: training and evaluation data for English surprise slot filling (2010) and English sentiment slot filling (2013-2014) tasks

It’s also not too late to join for MY2019 (through December 31, 2020) and MY2020 (through December 31, 2021). Data sets from those years include Penn Discourse Treebank Version 3.0, DEFT Committed Belief Annotation (Chinese, English, Spanish), 2018 NIST Speaker Recognition Evaluation Test Set, Mixer 4 and 5 Speech, AMR Annotation Release 3.0, and Penn Parsed Corpora of Historical English.

For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2021 Data Scholarship Application Deadline
Applications for the Spring 2021 LDC Data Scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2021. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

New publications:
(1) Global TIMIT Learner Simple English was developed by LDC and Shanghai Jiao Tong University and consists of approximately 12 hours of L1 and L2 English read speech and transcripts. It comprises two separate data sets, each of 50 speakers reading 120 sentences from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) that were deemed “simple” to read by Chinese learners of English. Among the 120 sentences, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 820 sentence types.

L1 Simple English was recorded at the University of Pennsylvania, USA; participants were 25 female and 25 male native American English speakers. L2 Simple English was recorded at Shanghai Jiao Tong University, China. L2 speakers (25 female, 25 male) were Chinese learners of English considered fluent.

The Global TIMIT project aimed to create a series of corpora in a variety of languages that share the key features of the original TIMIT data set, which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

A large number of fluently read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
A relatively large number of speakers
Time-aligned lexical and phonetic transcription of all utterances
Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker

Global TIMIT Learner Simple English is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $750.

(2) LORELEI Ukrainian Representative Language Pack consists of Ukrainian monolingual text, Ukrainian-English parallel and comparable text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons, and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data was collected from discussion forum, news, reference, social network, and weblog sources. Data volumes are as follows:

111 million words of Ukrainian monolingual text, approximately 700,000 words of which were translated into English
86,000 Ukrainian words translated from English data
174,000 words of found parallel text
over 2,000,000 words of comparable text

Approximately 75,000 words were annotated for named entities and up to 50,000 words contain additional annotation, including situation frames (identifying entities, needs and issues) and entity linking and detection.

LORELEI Ukrainian Representative Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

(3) TAC KBP Event Argument – Comprehensive Training and Evaluation Data 2016-2017 was developed by LDC and contains training and evaluation data produced in support of the 2016 TAC KBP Event Argument Linking Pilot and Evaluation tasks and the 2017 Event Argument Linking Training Evaluation task.

The Event Argument Extraction and Linking task required systems to extract event arguments (entities or attributes playing a role in an event) from unstructured text, indicate the role they play in an event, and link the arguments appearing in the same event to each other. Since the extracted information must be suitable as input to a knowledge base, systems constructed tuples indicating the event type, the role played by the entity in the event, and the most canonical mention of the entity from the source document. The event types and roles were drawn from an externally specified ontology of 31 event types, which included financial transactions, communication events, and attacks.
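As a rough illustration of the tuple structure described above, the following Python sketch models one event argument and groups arguments by event. It is not the TAC KBP submission format: the class name, field names, event identifiers, and the example event types and roles are all hypothetical and chosen only for readability.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EventArgument:
    """One extracted argument (hypothetical layout, not the official format)."""
    event_type: str         # type drawn from the task ontology
    role: str               # role the entity plays in the event
    canonical_mention: str  # most canonical mention in the source document


def link_by_event(pairs):
    """Group (event_id, argument) pairs so that arguments of the same
    event end up in one list. The event_id is an assumed identifier."""
    linked = defaultdict(list)
    for event_id, argument in pairs:
        linked[event_id].append(argument)
    return dict(linked)


# Illustrative values only; they are not taken from the corpus.
example = [
    ("e1", EventArgument("attack", "attacker", "the armed group")),
    ("e1", EventArgument("attack", "place", "the northern district")),
    ("e2", EventArgument("financial transaction", "recipient", "the charity")),
]

linked = link_by_event(example)
for event_id, arguments in linked.items():
    print(event_id, [(a.role, a.canonical_mention) for a in arguments])
```

Grouping by an event identifier is only one simple way to represent linking; the actual evaluation defines its own linking output.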

This corpus includes source documents, manual runs, assessments, and event hoppers (a form of identity coreference for events). Source data is Chinese, English, and Spanish newswire and discussion forum text collected by LDC.

TAC KBP Event Argument – Comprehensive Training and Evaluation Data 2016-2017 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104