ISCA - International Speech Communication Association



ISCApad #253

Tuesday, July 09, 2019 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (June 2019)
  

 

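In this newsletter:

New publications:

  • DEFT Spanish Committed Belief Annotation
  • USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition
  • First DIHARD Challenge Development - Eight Sources
  • First DIHARD Challenge Development - SEEDLingS

New publications: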

(1) DEFT Spanish Committed Belief Annotation was developed by LDC and consists of approximately 67,000 tokens of Spanish discussion forum text annotated for 'committed belief,' which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

 

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships, and anomaly detection. LDC supported the DEFT program by collecting, creating, and annotating a variety of data sources.

 

DEFT Spanish Committed Belief Annotation is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

 

*

 

(2) USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project and contains approximately 168 hours of interviews from 682 Holocaust witnesses along with transcripts, a lexicon and other documentation. This release augments USC-SFI MALACH Interviews and Transcripts English (LDC2012S05) by modifying and updating a subset of the original corpus for use with speech recognition systems, such as the Kaldi toolkit.

 

Specifically, the audio data has been converted from unsegmented MPEG files to segmented, FLAC-compressed files. The speaker-turn, time-stamped transcripts have been converted to an utterance-by-utterance format. A lexicon mapping words to phonemes is provided, and the data is divided into development and training sets.
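
As a rough illustration of how a word-to-phoneme lexicon of this kind might be read, the Python sketch below assumes a plain-text layout with one entry per line, the word first and its phonemes after it, separated by whitespace. That layout is common in Kaldi recipes, but it is an assumption here and not the documented format of this release; the function name `parse_lexicon` and the sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_lexicon(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each word to its pronunciations, each a list of phonemes.

    Assumes one entry per line: WORD PH1 PH2 ... (whitespace separated).
    This layout is an assumption, not the documented format of the corpus.
    """
    lexicon: Dict[str, List[List[str]]] = defaultdict(list)
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            # Skip blank lines and entries that list no phonemes.
            continue
        word, phonemes = fields[0], fields[1:]
        # Keep alternative pronunciations, but drop exact duplicates.
        if phonemes not in lexicon[word]:
            lexicon[word].append(phonemes)
    return dict(lexicon)


# Hypothetical entries, for illustration only.
sample = [
    "HELLO HH AH L OW",
    "HELLO HH EH L OW",
    "WORLD W ER L D",
    "",
]
lexicon = parse_lexicon(sample)
```

Storing a list of pronunciations per word keeps pronunciation variants, which a lexicon for accented, spontaneous speech is likely to contain; the corpus documentation should be checked for the actual file name and format.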

 

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives in order to advance the state of the art of automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching, and emotional speech -- were considered well-suited for that task.

 

USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

 

*

 

(3) First DIHARD Challenge Development - Eight Sources was developed by LDC and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.

 

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on 'hard' diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):

 

  • Autism Diagnostic Observation Schedule (ADOS) interviews
  • DCIEM/HCRC map task (LDC96S38)
  • Audiobook recordings from LibriVox
  • Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development (LDC2007S11) and Evaluation (LDC2007S12) releases
  • 2001 U.S. Supreme Court oral arguments
  • Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15)
  • Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
  • YouthPoint radio interviews

First DIHARD Challenge Development - Eight Sources is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.

 

*

 

(4) First DIHARD Challenge Development - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - Eight Sources (LDC2019S09), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.

 

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6-18 months of age in the Rochester, New York, area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

 

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on 'hard' diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions.

 

First DIHARD Challenge Development - SEEDLingS is distributed via web download.

 

2019 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $50.

 

*

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
