ISCA - International Speech Communication Association



ISCApad #277

Saturday, July 10, 2021 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (June 2021)
  

 

In this newsletter:

LDC data and commercial technology development

New Publications:
MyST Children’s Conversational Speech
BOLT Egyptian Arabic Treebank – Conversational Telephone Speech


 

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

 


 


New Publications:
(1) MyST Children’s Conversational Speech was developed by Boulder Learning Inc. It contains 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data was collected in two phases between 2008 and 2017. Spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System, a research-based science curriculum for grades K-8. Students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.

Data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. Data is divided into development, test, and train partitions for use with ASR systems.

MyST Children’s Conversational Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $500.

 

*

 

(2) BOLT Egyptian Arabic Treebank – Conversational Telephone Speech was developed by LDC and consists of Egyptian Arabic conversational telephone speech data with part-of-speech annotation, morphology, gloss, and syntactic tree annotation.

This release contains 153,171 tokens before clitics were split and 182,965 tree tokens after clitics were split for treebank annotation. The source data was selected from conversational telephone speech that LDC collected for the CALLHOME project; it was transcribed and segmented into sentence units.

Annotations follow the Penn Arabic Treebank guidelines, which consist of: (a) part-of-speech tagging that divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss; and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, and so on.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT Egyptian Arabic Treebank – Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1,750.

 

 

 

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104


 

 

 



 

 


 




