ISCA - International Speech
Communication Association

ISCApad #261

Saturday, March 14, 2020 by Chris Wellekens

5-2 Database

5-2-1 Linguistic Data Consortium (LDC) update (February 2020)

In this newsletter:
Only Two Weeks Left to Enjoy 2020 Membership Discounts
LREC Workshop on Citizen Linguistics - Deadline Extended

New Publications:
TAC KBP English Event Argument - Training and Evaluation Data 2014-2015
Chinese CogBank
Machine Reading Phase 1 IC Training Data
IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b

Only Two Weeks Left to Enjoy 2020 Membership Discounts
There is still time to save on 2020 Membership fees. Through March 2, all organizations receive a discount on the 2020 Membership fee (up to 10%) when they choose to join or renew. For more information on membership benefits, visit Join LDC.

LREC Workshop on Citizen Linguistics - Deadline Extended
LDC researchers and their colleagues are organizing a workshop on Citizen Linguistics and Language Resource Development at LREC 2020 (Language Resources and Evaluation Conference), to take place on May 16, 2020. The workshop includes an open call for papers in language-related citizen science, a tutorial on using the new LanguageARC.org citizen linguistics portal, and a special session on best papers using LanguageARC. The call for papers deadline has been extended to February 24, 2020.

New Publications:

(1) TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the 2014 TAC KBP English Event Argument Extraction Pilot and Evaluation tasks and the 2015 English Event Argument Extraction and Linking Training and Evaluation tasks.

The Event Argument Extraction and Linking task required systems to extract event arguments (entities or attributes playing a role in an event) from unstructured text, indicate the role they play in an event, and link the arguments appearing in the same event to each other. Since the extracted information must be suitable as input to a knowledge base, systems constructed tuples indicating the event type, the role played by the entity in the event, and the most canonical mention of the entity from the source document. The event types and roles were drawn from an externally-specified ontology of 31 event types, which included financial transactions, communication events, and attacks.
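As a rough illustration of the tuples described above, the following Python sketch models one extracted argument and groups arguments by event. It is only a sketch under stated assumptions: the class and function names, the `event_id` linking key, and the sample labels and mentions are hypothetical and are not taken from the task specification or the corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EventArgument:
    event_type: str         # drawn from the externally specified ontology
    role: str               # role the entity plays in the event
    canonical_mention: str  # most canonical mention in the source document
    event_id: str           # hypothetical key used here to link arguments


def link_arguments(arguments):
    """Group the arguments that appear in the same event."""
    linked = defaultdict(list)
    for arg in arguments:
        linked[arg.event_id].append(arg)
    return dict(linked)


# Invented sample tuples; labels and mentions are illustrative only.
args = [
    EventArgument("Attack", "Attacker", "the rebels", "e1"),
    EventArgument("Attack", "Place", "the capital", "e1"),
    EventArgument("Communication", "Speaker", "the minister", "e2"),
]
events = link_arguments(args)
```

Here `events["e1"]` holds the two arguments of the same attack event, which is the kind of linking the 2015 task asked systems to produce.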

This corpus includes source documents, manual runs, assessments, and, for 2015 only, event hoppers (a form of identity coreference for events). Source data is English newswire and discussion forum text collected by LDC.

TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

(2) Chinese CogBank is a database of cognitive properties of Chinese words intended for use in metaphor understanding and generation. It consists of 232,497 'word-property' pairs covering 83,104 words and 100,195 properties. Each 'word-property' pair also has an associated frequency, which can serve as a functional measure of the importance of a property.
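To show how the frequency of a 'word-property' pair could serve as an importance measure, here is a minimal Python sketch that ranks the properties of a word by frequency. The row layout, the function names, and the English placeholder rows are assumptions made for illustration; they do not reflect the actual file format or contents of Chinese CogBank.

```python
from collections import defaultdict

# Invented (word, property, frequency) rows; not real CogBank data.
rows = [
    ("fox", "cunning", 120),
    ("fox", "red", 15),
    ("snow", "white", 300),
    ("snow", "cold", 210),
]


def build_index(rows):
    """Map each word to its properties, most frequent first."""
    index = defaultdict(list)
    for word, prop, freq in rows:
        index[word].append((prop, freq))
    for props in index.values():
        props.sort(key=lambda pf: pf[1], reverse=True)
    return dict(index)


def top_properties(index, word, n=1):
    """Return the n most frequent properties of a word."""
    return [prop for prop, _ in index.get(word, [])[:n]]


index = build_index(rows)
```

With these toy rows, `top_properties(index, "fox", 2)` puts "cunning" ahead of "red", so the most frequent property is treated as the most salient one.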

The data was collected via the Chinese search engine Baidu.com. The original collection consisted of 1,258,430 types (5,637,500 tokens) of 'word-adjective' pairs that were reduced in Chinese CogBank to 232,497 'word-property' pairs after a series of manual checks.

Chinese CogBank is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

(3) Machine Reading Phase 1 IC Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 248 English source documents and 116 standoff annotation files, annotated with instances of explicit relations and their arguments, as well as some non-explicit relations.

The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the IC (Core Domain) task, which tested the core domain by extracting information about Entities (people, organizations, geopolitical entities) and their involvement in four types of Relations (Attack Relations, Biographical Relations, Affiliation Relations and Family Relations), as described in newswire text. This information was then aligned with an IC Use Cases ontology that would allow automated reasoning about the extracted Entities and Relations.

Machine Reading Phase 1 IC Training Data is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

(4) IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Dholuo conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Dholuo speech in this release represents the South Nyanza and Trans-Yala dialect regions of Kenya. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $25.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St., Suite 810
Philadelphia, PA 19104