ISCA - International Speech Communication Association



ISCApad #319

Friday, January 10, 2025 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (December 2024)
  

In this newsletter:

LDC 2025 membership discounts now available 
Approaching deadline for Spring 2025 data scholarship applications
LDC closed for Winter Break December 25-January 1 

New publications:
MATERIAL Farsi-English Language Pack
Abstract Meaning Representation 3.0 – Machine Translations

LDC 2025 membership discounts now available 
Now through March 3, 2025, current 2024 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching deadline for Spring 2025 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2025 data scholarships are due January 15, 2025. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break December 25-January 1 
LDC will be closed from Wednesday, December 25, 2024, through Wednesday, January 1, 2025, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2025. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:
MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, and approximately 3% of the speech files were translated into English. This release also includes English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for $250.

*

Abstract Meaning Representation 3.0 - Machine Translations was developed by the Center for Computational Linguistics at KU Leuven in the HORIZON2020 project SignON. It is an automatic translation of a subset of sentences from Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) into Spanish, Irish Gaelic, and Dutch.

AMR 3.0 training, development, and test splits were translated using Google Translate. 'Unsplit' directories were not translated and are not included in this release. Translations were not manually verified, but formal issues (such as unexpected line breaks) were corrected, and special tokens and encoding issues were fixed with the fix_text function of the Python library ftfy.
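
For readers curious what this kind of cleanup looks like in practice, here is a minimal sketch using only the Python standard library. It is not the release's actual pipeline: the function names `repair_mojibake` and `clean_translation` are hypothetical, and the encoding repair handles only the single common case of UTF-8 text misread as Latin-1, as a narrow stand-in for what ftfy does far more thoroughly.

```python
import re
import unicodedata


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was misread as Latin-1 (hypothetical helper).

    If the round trip fails, the text is assumed to be fine and is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_translation(text: str) -> str:
    """Apply the formal fixes sketched here to one translated sentence."""
    # Repair encoding first, so whitespace handling sees the real characters.
    text = repair_mojibake(text)
    # Collapse unexpected line breaks and repeated spaces into single spaces.
    text = re.sub(r"[ \t\r\n]+", " ", text).strip()
    # Normalize to composed Unicode form so accented letters compare equal.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(clean_translation("El niÃ±o\nquiere ir."))
```

With the ftfy library installed, the encoding step would instead be a call to `ftfy.fix_text(text)`, which detects and repairs many more kinds of damage than this sketch attempts.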

AMR 3.0 is a semantic treebank of over 59,000 English natural language sentences drawn from material collected by LDC: discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations, and weblog data from the DARPA GALE program.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for $100.

To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
