ISCA - International Speech Communication Association

ISCApad #268

Saturday, October 10, 2020 by Chris Wellekens

5-2 Database

5-2-1 Linguistic Data Consortium (LDC) update (September 2020)

In this newsletter:

New publications:
BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech
LORELEI Tigrinya Incident Language Pack
Chinese Lexical Resources for Gender, Number, Animacy

New publications:
(1) BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech was developed by the University of Colorado, Boulder – CLEAR (Computational Language and Education Research) and consists of PropBank and verb sense disambiguation annotation on English discussion forum (DF), SMS/Chat, and conversational telephone speech data. Annotation was applied to each predicate verb tree in LDC’s BOLT phrase structure treebanks. PropBank annotation, which adds a layer of semantic annotation over a treebank, was performed on all three genres. DF and SMS/Chat data were also annotated for verb sense disambiguation using VerbNet 3.2 classes.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.

(2) LORELEI Tigrinya Incident Language Pack was developed by LDC and comprises approximately 4.5 million words of Tigrinya monolingual text, 25,000 words of English monolingual text, 235,000 words of parallel and comparable Tigrinya-English text, and 50,000 words of data annotated for Entity Discovery and Linking and for Situation Frames. It contains all of the text data, annotations, supplemental resources, and related software tools for the Tigrinya language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation.

The evaluation protocol was based on a scenario in which an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity Detection and Linking and Situation Frame annotations identified “entities,” “needs” (such as a need for food), and “issues” (such as civil unrest) to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information that would be useful for planning a disaster response effort.

The knowledge base for the entity linking annotation in this corpus is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Tigrinya Incident Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

(3) Chinese Lexical Resources for Gender, Number, Animacy was developed by LDC and consists of gender, number, and animacy lexicons produced in support of the DARPA DEFT program. Gender, number, and animacy are lexical indicators useful for named entity tagging, including the detection of person mentions in text.

This corpus was created by extracting information from newswire texts in Chinese Gigaword Fifth Edition (LDC2011T13) in the following steps: (1) segmenting source documents into sentences; (2) converting any traditional Chinese script to simplified Chinese; (3) tagging all sentences for parts-of-speech; (4) developing queries to detect patterns; and (5) building lexicons based on frequency counts and entity types.
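The steps above can be illustrated with a minimal Python sketch. It is not LDC's actual tooling: the function names, the toy traditional-to-simplified character map, the cue words 先生/女士, and the use of the "NR" (proper noun) tag are all assumptions made for illustration, and the sketch expects input that has already been tokenized and tagged for parts-of-speech.

```python
import re
from collections import Counter, defaultdict

# Toy traditional-to-simplified map (assumption): a real pipeline
# would rely on a full conversion table, not two characters.
TRAD_TO_SIMP = str.maketrans({"們": "们", "師": "师"})

# Hypothetical gender cue words: a title that directly follows a name.
GENDER_CUES = {"先生": "male", "女士": "female"}


def split_sentences(text):
    """Step 1: split after Chinese sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]


def to_simplified(text):
    """Step 2: map traditional characters to simplified ones."""
    return text.translate(TRAD_TO_SIMP)


def count_gender_evidence(tagged_sentences):
    """Step 4: count how often each proper noun precedes a cue word.

    Each sentence is a list of (token, pos_tag) pairs; "NR" is assumed
    to mark proper nouns.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for (token, pos), (next_token, _) in zip(sentence, sentence[1:]):
            if pos == "NR" and next_token in GENDER_CUES:
                counts[token][GENDER_CUES[next_token]] += 1
    return counts


def build_lexicon(counts, min_count=1):
    """Step 5: keep the most frequent label per entry, with its counts."""
    lexicon = {}
    for name, label_counts in counts.items():
        label, freq = label_counts.most_common(1)[0]
        if freq >= min_count:
            lexicon[name] = {
                "gender": label,
                "count": freq,
                "total": sum(label_counts.values()),
            }
    return lexicon
```

Raising `min_count` in `build_lexicon` trades coverage for precision, since entries supported by only a few pattern matches are dropped.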

The resulting resources include dictionaries of Chinese animate nominals and names; Chinese nominals and names with predicted gender and number; and other dictionaries of Chinese nominals, names, verbs, and pronouns. Each dictionary contains frequency information as well as the features in question.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

Chinese Lexical Resources for Gender, Number, Animacy is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $750.