ISCA - International Speech
Communication Association


ISCApad #317

Sunday, November 10, 2024 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (October 2024)

In this newsletter:
LDC/Penn receives US Dept of Education research grant 
Membership year 2025 publication preview 
Fall 2024 data scholarship recipients

New publications:
RST Continuity Corpus
MultiTACRED

LDC/Penn receives US Dept of Education research grant 
LDC and Penn’s Graduate School of Education and Department of Computer and Information Science are part of a team that was recently awarded a $10 million grant from the US Department of Education to develop the Using Generative Artificial Intelligence for Reading R&D Center (U-GAIN Reading). The center will explore using generative AI to improve elementary school reading instruction for English learners. Led by the education nonprofit Digital Promise, U-GAIN Reading will build on an existing research-based tutoring platform, Amira Learning, that is used by more than 1 million students each year. The LDC/Penn team will contribute expertise in computational linguistics, computer science, and learning analytics. An evaluation team at MDRC will measure learner outcomes both to improve the R&D and to benchmark its eventual impacts. Additional experts in the science of reading, ethics, and strategies for national impact will support the project’s work. Data developed in the project will be shared with the community through the LDC Catalog.

Membership year 2025 publication preview 
The 2025 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are: 

  • Iraqi Arabic – English Lexical Database: a set of six interrelated tables (roots, lemmas, wordforms, multi-word expressions, English definitions, example phrases) presenting each Iraqi Arabic word in Arabic script and IPA format, a result of LDC’s collaboration with Georgetown University Press to enhance and update three dialectal Arabic dictionaries
  • AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction
  • 2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST’s 2015 language recognition evaluation
  • BOLT CALLFRIEND CALLHOME CTS Audio, Transcripts and Translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program
  • Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University
  • IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations, and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali)

Check your inbox for more information about membership renewal.

Fall 2024 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2024 data scholarships:

Yomma Gamaleldin: Alexandria University (Egypt): Master’s student, Computer and Systems Engineering Department. Yomma is awarded a copy of Qatari Corpus of Argumentative Writing (LDC2022T04) for her work in Arabic automated essay scoring.

Ahrane Mahaganapathy: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Ahrane is awarded copies of IARPA Babel Tamil Language Pack (LDC2017S13) and Multi-Language Telephone Speech 2011 – South Asian (LDC2017S14) for her work in Tamil speech-to-text systems.

Sivashanth Suthakar: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Sivashanth is awarded copies of CAMIO Transcription Languages (LDC2022T07) and LORELEI Tamil Representative Language Pack (LDC2023T03) for his work in Tamil OCR systems.

Oshan Yalegama: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Oshan is awarded copies of CSR-I (WSJ0) Complete (LDC93S6A) and TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) for his work in audio signal processing.

Samer Mohammed Yaseen: Sana’a University (Yemen): PhD candidate, Faculty of Computer and Information Technology. Samer is awarded a copy of Arabic Newswire Part 1 (LDC2001T55) for his work in Arabic information retrieval.

New publications:

RST Continuity Corpus was developed at Åbo Akademi University and Humboldt-Universität zu Berlin and contains annotations for continuity dimensions added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank is a collection of English news texts from the Penn Treebank annotated for rhetorical relations under the RST (Rhetorical Structure Theory) framework. In RST Continuity Corpus, the relations are annotated for the seven continuity dimensions: time, space, reference, action, perspective, modality, and speech act. The relations are also annotated for polarity, order of segments, nuclearity, and context.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for $500.

*

MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab and is a machine translation of TAC Relation Extraction Dataset (LDC2018T24) (TACRED) into twelve languages with projected entity annotations. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The training and evaluation data for the TAC KBP slot filling tasks was developed by the Linguistic Data Consortium.

TACRED training, development, and test splits were translated into Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish using DeepL or Google Translate. The test split was back-translated into English to generate machine-translated English test data. TACRED annotations are specified by token offsets. For translation, tokens were concatenated with white space, and the entity offsets were converted into XML-style markers to denote the relation arguments.
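
The conversion from token offsets to XML-style markers can be illustrated with a minimal Python sketch. It is not the DFKI pipeline: the function names, the marker tags `H` and `T`, and the use of inclusive, non-overlapping (start, end) token spans are assumptions made here for illustration. The second function shows how spans might be read back from marked text after translation.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices; an assumption


def mark_entities(tokens: List[str], subj: Span, obj: Span,
                  subj_tag: str = "H", obj_tag: str = "T") -> str:
    """Join tokens with white space, wrapping two non-overlapping
    spans in XML-style markers (tag names are hypothetical)."""
    out = []
    for i, tok in enumerate(tokens):
        if i == subj[0]:
            out.append(f"<{subj_tag}>")
        if i == obj[0]:
            out.append(f"<{obj_tag}>")
        out.append(tok)
        if i == subj[1]:
            out.append(f"</{subj_tag}>")
        if i == obj[1]:
            out.append(f"</{obj_tag}>")
    return " ".join(out)


def recover_spans(marked: str, subj_tag: str = "H", obj_tag: str = "T"
                  ) -> Tuple[List[str], Optional[Span], Optional[Span]]:
    """Strip white-space-separated markers and return the tokens plus
    the inclusive spans the markers enclosed (None if a tag is absent)."""
    tokens: List[str] = []
    open_at = {}
    spans = {}
    for piece in marked.split():
        m = re.fullmatch(r"<(/?)(\w+)>", piece)
        if m is None:
            tokens.append(piece)                 # ordinary token
        elif m.group(1) == "":
            open_at[m.group(2)] = len(tokens)    # opening marker
        else:
            # closing marker; raises KeyError if it was never opened
            spans[m.group(2)] = (open_at.pop(m.group(2)), len(tokens) - 1)
    return tokens, spans.get(subj_tag), spans.get(obj_tag)


tokens = ["Bill", "Gates", "founded", "Microsoft", "."]
marked = mark_entities(tokens, subj=(0, 1), obj=(3, 3))
print(marked)
print(recover_spans(marked))
```

In this sketch the markers travel with the text, so the argument positions can be recovered even when translation changes word order or token count. Real machine translation output may drop or attach markers to neighbouring words, which this sketch does not handle.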

2024 members can access this corpus through their LDC accounts. Non-members may license this data for $25.