ISCApad #299

Monday, May 08, 2023 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (April 2023)

In this newsletter:

In memoriam: Christopher Cieri 1963-2023

New publications:
Penn Korean Universal Dependency Treebank
DEFT English Light and Rich ERE Annotation

In memoriam: Christopher Cieri 1963-2023
With deep sadness, LDC announces the passing of Christopher Cieri, our Executive Director. Chris led the Consortium for over 25 years, guiding its evolution from a small data repository and research hub to a prominent global data center.

An accomplished linguist, computer scientist, and well-read humanist, Chris embodied the qualities needed to carry out the wide range of duties his leadership role demanded. He was a valued colleague and friend and will be sorely missed.

All are welcome to visit our remembrance page for Chris.

New publications:
Penn Korean Universal Dependency Treebank contains 5,010 sentences and 132,041 tokens annotated in dependency format under the Universal Dependencies framework. It is a conversion of Korean Treebank Annotations Version 2.0 (LDC2006T09), which was produced in constituency format.

The source text consists of newswire stories from LDC’s Korean Press Agency collection contained in Korean Newswire (LDC2000T45). Sentences were automatically converted for dependency annotation, and the output was manually checked. The corpus contains 112 files in CoNLL-U format, the Universal Dependencies standard, each mapped to its counterpart in LDC2006T09.
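For readers unfamiliar with CoNLL-U, the following is a minimal sketch of how sentence and token counts can be tallied from a file in that format. It assumes only the general CoNLL-U conventions (comment lines starting with "#", ten tab-separated fields per token line, a blank line after each sentence). The function name count_conllu and the two-sentence English sample are made up for illustration and are not taken from the corpus, and the counting convention behind the figures quoted above may differ.

```python
import io


def count_conllu(stream):
    """Return (sentences, tokens) for a CoNLL-U text stream.

    Only lines whose ID field is a plain integer are counted as tokens;
    multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
    are skipped. This is one possible convention, assumed here.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for raw in stream:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment or metadata line.
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError("expected 10 tab-separated fields: %r" % line)
        in_sentence = True
        if fields[0].isdigit():
            tokens += 1
    if in_sentence:
        # Handle a final sentence with no trailing blank line.
        sentences += 1
    return sentences, tokens


# Hypothetical sample, invented for this sketch (not corpus data).
SAMPLE = (
    "# sent_id = 1\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tCats\tcat\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleep\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(count_conllu(io.StringIO(SAMPLE)))
```

The same function could be applied to each of the corpus files in turn by passing an opened file object in place of the in-memory sample.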

2023 members can access this corpus through their LDC accounts. Non-members may license this data for $500.

DEFT English Light and Rich ERE Annotation was developed by LDC and consists of 1,190 English discussion forum, newswire, and proxy documents annotated for entities, relations, and events (ERE). Light ERE annotation labels entity mentions for a target set of entity types, together with the relations and events between and among those entities, including coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation.

902 documents were annotated following Light ERE annotation guidelines. 288 documents were labeled with Rich ERE annotation in a second pass after being annotated for Light ERE. The source data consists of English discussion forum web text collected by LDC for the DARPA BOLT program and contained in BOLT English Discussion Forums (LDC2017T11); newswire documents published in various data sets released in the TAC KBP project (Text Analysis Conference Knowledge Base Population); and proxy documents intended to mimic government analysis reports of newswire content published in DEFT Narrative Text (LDC2016T07).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for $2000.

To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options or contact LDC for assistance.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104
