ISCA - International Speech Communication Association

ISCApad #284

Thursday, February 10, 2022 by Chris Wellekens

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (January 2022)

In this newsletter:
Renew your LDC Membership today

New Publications:
2017 NIST OpenSAT Pilot - SSSF
LORELEI Kinyarwanda Incident Language Pack

Renew your LDC Membership today
The importance of curated resources for language-related education, research, and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe, and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 900 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Through March 1, 2022, organizations that were members in 2021 receive a 10% discount on 2022 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

(1) 2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts, and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation. The source audio consists of radio and telephone dispatches during the Sofa Super Store fire (Charleston, South Carolina) in June 2007 (SSSF).

The OpenSAT evaluation series was designed to bring together researchers developing different types of technologies to address speech analytic challenges present in some of the most difficult acoustic conditions. The 2017 pilot focused on the public safety communications domain. The SSSF audio represents real-world operational data from a fire response, with multiple challenges for system analytics, such as land-mobile-radio transmission effects, significant background noise, speech under stress, and variable decibel levels.

2017 NIST OpenSAT Pilot - SSSF is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

(2) LORELEI Kinyarwanda Incident Language Pack was developed by LDC and comprises approximately 11.9 million words of Kinyarwanda monolingual text, 35,000 words of English monolingual text, 3.4 million words of parallel and comparable Kinyarwanda-English text, and 50,000 words each of English and Kinyarwanda data annotated for Entity Discovery and Linking and Situation Frames. It constitutes all of the text data, annotations, supplemental resources, and related software tools for the Kinyarwanda language that were used in the DARPA LORELEI / LoReHLT 2018 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social networks, weblogs, newsgroups, discussion forums, and reference material. Entity detection and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Kinyarwanda Incident Language Pack is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104