ISCA - International Speech
Communication Association


ISCApad #313

Saturday, July 06, 2024 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (June 2024)

In this newsletter:
LDC data and commercial technology development

New publications:
Diaspora Tibetan Speech
AIDA Scenario 2 Practice Topic Annotation

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Diaspora Tibetan Speech was developed at Yale University. It contains 28 hours of Tibetan elicited speech by 73 speakers from the diaspora Tibetan community in Kathmandu, Nepal, along with transcripts, elicitation materials, and speaker metadata.

Recordings were collected in 2016. All speakers were adults and varied in age as well as age of diaspora. A substantial number of speakers were born in Nepal. Each speaker contributed one recording comprising a series of elicitation tasks: some demographic information; a word list and numbers; some sentences in isolation; a scripted story; and free speech based on 'frog story' type illustrations. Annotation and metadata formats include PDF and Word (some transcripts), Excel (some transcripts, speaker metadata), and Praat TextGrids (word and number lists).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for $100.


AIDA Scenario 2 Practice Topic Annotation was developed by LDC and comprises annotations for 29 English, Russian, and Spanish documents (text, image, and video) from AIDA Scenario 2 Practice Topic Source Data (LDC2024T04), specifically the set of practice documents designated for annotation in Phase 2.

Annotations are presented as tab-separated files in the following categories for each topic:

  • Mentions: single references in source data to a real-world entity or filler, event, or relation.
  • Slots: pre-defined roles in an event or relation filled by an argument (entity mention).
  • Linking: entity mentions linked to entries in the knowledge base, indicating the real-world entity to which each mention refers.
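
For readers who want to work with tab-separated annotation files of this kind, the following is a minimal sketch of loading one and grouping entity mentions by knowledge base entry. The column names (mention_id, kb_id) and the sample rows are hypothetical placeholders chosen for illustration; the actual column layout is defined in the corpus documentation and is not described in this newsletter.

```python
import csv
import io
from collections import defaultdict


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by(rows, key):
    """Group rows (dicts) by the value found in column `key`."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


# Hypothetical sample standing in for a linking file; the real column
# names and identifiers are assumptions and must be taken from the
# corpus documentation.
SAMPLE_LINKING = (
    "mention_id\tkb_id\n"
    "M001\tKB100\n"
    "M002\tKB100\n"
    "M003\tKB200\n"
)

linking_rows = read_tsv(io.StringIO(SAMPLE_LINKING))
mentions_by_kb = group_by(linking_rows, "kb_id")

for kb_id, rows in mentions_by_kb.items():
    print(kb_id, [row["mention_id"] for row in rows])
```

To read a file from disk instead of the in-memory sample, pass a handle opened with open(path, newline="", encoding="utf-8") to read_tsv.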

2024 members can access this corpus through their LDC accounts. Non-members may license this data for $500.



Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
M: 3600 Market St. Suite 810, Philadelphia, PA 19104







