ISCA - International Speech Communication Association



ISCApad #316

Thursday, October 10, 2024 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (September 2024)
  


 

In this newsletter:
LDC data and commercial technology development

 

 


New publications:
L2-KSU Native and Non-Native Arabic Speech
MATERIAL Somali-English Language Pack

 

 


LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

 

 



New publications:
L2-KSU Native and Non-Native Arabic Speech was developed by King Saud University (KSU) and contains approximately six hours of Modern Standard Arabic read speech from 80 subjects, along with transcripts and speaker metadata.

The speech data was collected in 2022 from 40 native and 40 non-native speakers. Native speakers were from Saudi Arabia, Egypt, and Palestine, and provided audio recordings through the crowdsourcing platform Khamsat. Non-native speakers were Central and West African students enrolled in KSU's Arabic Linguistics Institute; they provided speech recordings on site. All subjects read a series of ten sentences, repeating each sentence multiple times.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for $250.

 *

MATERIAL Somali-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains 80 hours of Somali conversational telephone speech, transcripts, English translations, annotations, and queries.

Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 10% of the speech files, and approximately 4% of the speech files were translated into English. This release also includes domain annotations, English queries, and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for $250.

To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

 

 

 

 



© Copyright 2024 - ISCA International Speech Communication Association - All rights reserved.
