ISCA - International Speech
Communication Association

ISCApad #278

Monday, August 09, 2021 by Chris Wellekens

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (July 2021)

In this newsletter:
LDC Submissions: a new platform for sharing data through LDC
Fall 2021 LDC Data Scholarship Program

New Publications:
Ethnobotanical Research and Language Documentation of Nahuatl
Chinese Abstract Meaning Representation 2.0
BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC Submissions: a new platform for sharing data through LDC
LDC is pleased to announce the launch of LDC Submissions, a platform that provides infrastructure and resources for sharing data through the Catalog. After registering for a user account, corpus submitters can create a submission, upload files, and communicate with LDC’s publications team during the review process. After all reviews are complete, the final, release-ready version of your data set is uploaded to the platform and enters the publications queue.

Sharing your corpus through LDC ensures access to the global research community and the permanent preservation of your data according to best practices for archiving digital language resources. Get started and register for an LDC Submissions user account today.

Fall 2021 LDC Data Scholarship Program
Student applications for the Fall 2021 LDC Data Scholarship program are being accepted now through September 15, 2021. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, visit the LDC Data Scholarship page.

New Publications:
(1) Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nahuatl speakers during the collection of particular plants; partial transcripts (Nahuatl and Spanish); a Highland Puebla Nahuat dictionary; botanical and ethnobotanical data; and speaker metadata.

Nahuatl is one of the most widely spoken indigenous languages in the Americas, with approximately 1.5 million speakers in Mexico. Many distinct and sometimes mutually unintelligible varieties have been recognized. The recordings in this release were collected between 2008 and 2019 in two municipalities: Cuetzalan del Progreso and Tepetzintla. Speech from Cuetzalan represents Highland Puebla Nahuat, and speech from Tepetzintla represents Zacatlán-Ahuacatlán-Tepetzintla Nahuatl.

Each recording features a speaker talking about a plant's nomenclature, classification, and use. Transcripts are included for the Cuetzalan recordings; these transcripts have been partially translated into Spanish. A Highland Puebla Nahuat dictionary is included in both text and Toolbox XML formats. Botanical and ethnobotanical information is presented as a collection of PDF files, and images are provided as JPEG files.

Ethnobotanical Research and Language Documentation of Nahuatl is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

(2) Chinese Abstract Meaning Representation 2.0 was developed by Brandeis University and Nanjing Normal University and consists of semantic representations of approximately 20,000 Chinese sentences from Chinese Treebank (CTB) 8.0 (LDC2013T21). CAMR 2.0 includes the content of Chinese Abstract Meaning Representation 1.0 (LDC2019T07) (CTB 8.0 weblog and discussion forum sentences), plus an additional 9,933 sentences from the newswire portion of CTB 8.0.

Abstract Meaning Representation (AMR) captures 'who is doing what to whom' in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. Chinese AMR is constructed following the basic principles developed for English (a compact, readable, whole-sentence semantic representation), with adaptations where necessary to handle Chinese-specific phenomena.
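As a rough illustration of what 'who is doing what to whom' means in practice, the sketch below encodes the familiar English AMR example "The boy wants to go" as a small set of labeled edges. It is not drawn from CAMR 2.0 and does not reflect the corpus file format; the names instances, edges, and arguments are hypothetical and chosen only for this example.

```python
# Illustrative sketch only: the well-known English AMR example
# "The boy wants to go", NOT a sentence from CAMR 2.0.
# In PENMAN notation the graph is usually written as:
#   (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))

# Each variable names one node and maps to the concept it instantiates.
instances = {"w": "want-01", "b": "boy", "g": "go-01"}

# Labeled edges: (source variable, role, target variable).
# The variable "b" is reused, so the boy is both the wanter and the goer.
edges = [
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "g"),
    ("g", ":ARG0", "b"),
]


def arguments(var):
    """Return a {role: concept} mapping for the outgoing edges of one node."""
    return {role: instances[tgt] for src, role, tgt in edges if src == var}


if __name__ == "__main__":
    for var, concept in instances.items():
        print(concept, arguments(var))
```

Here arguments("w") returns {':ARG0': 'boy', ':ARG1': 'go-01'}: the boy is the one who wants, and what is wanted is the going event.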

The corpus contains 20,078 sentences from the weblog, discussion forum, and newswire portions of CTB 8.0. Three sets of files are included: the original Chinese AMR data with concept-to-word and relation-to-word alignments, a converted English AMR format, and a Chinese syntactic dependency tree format. Each set is divided into training, development and test sets.

Chinese Abstract Meaning Representation 2.0 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.

(3) BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies. Co-reference annotation aims to fill in the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers, and verbs.

The source discussion forum and SMS/Chat data were collected by LDC for the DARPA BOLT program. The telephone data were taken from LDC's Egyptian Arabic CALLHOME and CALLFRIEND telephone collections.

BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1,250.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St., Suite 810, Philadelphia, PA 19104