ISCA - International Speech
Communication Association


ISCApad #236

Saturday, February 10, 2018 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (January 2018)

In this newsletter:

Membership Discounts for MY2018 Still Available

New Publications:

DEFT Spanish Treebank

DIRHA English WSJ Audio

TRAD Chinese-French Parallel Text – Blog

_______________________________________________________________________________

Membership Discounts for MY2018 Still Available

Join LDC while membership savings are still available. Now through March 1, 2018, renewing MY2017 members will receive a 10% discount on the membership fee. New or non-consecutive member organizations will receive a 5% discount. Membership remains the most economical way to access LDC releases. This year’s planned publications include Multilanguage Conversational Telephone Speech, IARPA Babel Language Packs (telephone speech and transcripts), DIRHA (Distant-speech Interaction for Robust Home Applications), TRAD (Chinese-French and Arabic-French parallel text), data from BOLT, DEFT, LORELEI, RATS and TAC KBP, and more. Browse the Members pages for details on membership options and benefits.

_______________________________________________________________________________

New Publications:

(1) DEFT Spanish Treebank was developed by LDC and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text and Latin American Spanish discussion forum data created for the DARPA Deep Exploration and Filtering of Text (DEFT) program. DEFT Spanish Treebank supported the program's goal of deep natural language understanding.

Newswire source files were selected from Spanish Gigaword Third Edition (LDC2011T12) and were manually sentence-segmented for DEFT. Discussion forum source files were selected from Spanish discussion forum source data collected by LDC, consisting of continuous multi-posts of 100-1000 words.

This release contains 114 files (54,394 tokens) of newswire data and 60 files (55,307 tokens) of discussion forum data, all of which were annotated with constituents and syntactic functions.

DEFT Spanish Treebank is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1000.

*

(2) DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. It comprises approximately 85 hours of real and simulated read speech by six native American English speakers. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically the 5,000-word subset of read speech from Wall Street Journal news text.

Speech was collected in a real apartment setting with typical domestic background noise and inter/intra-room reverberation effects. Annotations, speaker metadata and images of the apartment setting are also included.

DIRHA English WSJ Audio is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $250.

*

(3) TRAD Chinese-French Parallel Text -- Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06).

The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

The source data for TRAD Chinese-French Parallel Text is Chinese blog text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.

TRAD Chinese-French Parallel Text -- Blog is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $250.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104