ISCA - International Speech Communication Association

ISCApad #293

Tuesday, November 08, 2022 by Chris Wellekens

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (October 2022)

In this newsletter:
Membership Year 2023 publication preview
LDC data and commercial technology development
30th Anniversary Highlight: ACE

New publications:
Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon
2017 NIST Language Recognition Evaluation Training and Development Sets
LORELEI Bengali Representative Language Pack

Membership Year 2023 publication preview

The 2023 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts
2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively
DEFT English ERE: English text from assorted genres annotated for entities, relations, and events
Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)
CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts, and a lexicon, released in separate speech and text data sets
REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu)

Check your inbox in the coming weeks for more information about membership renewal. 

LDC data and commercial technology development
For-profit organizations are reminded that LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

30th Anniversary Highlight: ACE
The objective of the Automatic Content Extraction (ACE) program was to develop the capability to extract meaning (entities, relations and events) from multimedia sources (Doddington, et al., 2004). LDC supported ACE by creating annotation guidelines, corpora and other linguistic resources, including training and test data for the common task research evaluations (Strassel, et al., 2003; Huang, et al., 2004).

There are multiple data sets in LDC’s Catalog from the program. One that regularly makes the list of LDC’s top ten most licensed corpora is ACE 2005 Multilingual Training Corpus (LDC2006T06). This data set contains 1,800 files of mixed genre text in English, Arabic, and Chinese annotated for entities, relations, and events. The genres include newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech.

Another popular data set, ACE 2004 Multilingual Training Corpus (LDC2005T09), consists of varied genre text in English (158,000 words), Chinese (307,000 characters, 154,000 words), and Arabic (151,000 words) annotated for entities and relations.

ACE 2007 Multilingual Training Corpus (LDC2014T18) has the complete set of Arabic and Spanish training data for the 2007 ACE technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions.

Other ACE corpora in the Catalog include ACE 2005 SpatialML Annotations in English and Mandarin (LDC2008T03, LDC2010T09, and LDC2011T02), Datasets for Generic Relation Extraction (reACE), TIDES Extraction (ACE) 2003 Multilingual Training Data, ACE-2 Version 1.0, ACE Time Normalization (TERN) 2004 English Training Data v 1.0, and more.

For the full list of available ACE data, visit LDC’s Catalog and select the ACE research project in the search menu. For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, visit LDC's ACE webpage.

New publications:
Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon was developed by the Cantonese Computational Linguistics Infrastructure Working Group. It contains approximately 130,000 Cantonese character, word, and phrase entries paired with their corresponding romanized pronunciations in Jyutping, a scheme created by The Linguistic Society of Hong Kong.

Data was collected from a variety of physical and online sources. The character collection was subjected to a normalization process for differences between traditional and simplified Chinese, regional differences and other variants in Chinese characters, and differences in orthography.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for $100.

2017 NIST Language Recognition Evaluation Training and Development Sets contains training and development material for the 2017 NIST Language Recognition Evaluation. It consists of 2,100 hours of conversational telephone speech, broadcast conversation, broadcast narrow band speech, and speech from video in the following 14 languages, dialects, and varieties: Arabic (Iraqi, Levantine, Maghrebi, Egyptian), English (British, American), Polish, Russian, Portuguese (Brazilian), Spanish (Caribbean, European, Latin American Continental), and Chinese (Mandarin, Min Nan). The 2017 evaluation focused on differentiating closely related language pairs. Source data is from LDC's CALLFRIEND and Fisher telephone collections, the VAST video collection, various broadcast sources, and earlier NIST LRE test sets.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for $1,500.

LORELEI Bengali Representative Language Pack was developed by LDC and comprises approximately 144 million words of Bengali monolingual text, 96,000 Bengali words translated from English data, and over 2 million words of found Bengali-English parallel text. Approximately 86,000 words were annotated for named entities, and up to 25,000 words were annotated for entity discovery and linking and for situation frames (identifying entities, needs, and issues). Data was collected from news, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2022 members can access this corpus through their LDC accounts. Non-members may license this data for $250.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104