ISCA Services

ISCA - International Speech
Communication Association

ISCApad Archive » 2019 » ISCApad #254 » Resources » Database

ISCApad #254

Saturday, August 10, 2019 by Chris Wellekens

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (July 2019)

In this newsletter:

Fall 2019 LDC Data Scholarship Program

LDC data and commercial technology development

New Publications:

The DKU-JNU-EMA Electromagnetic Articulography Database
Phrase Detectives Corpus Version 2

First DIHARD Challenge Evaluation - Nine Sources

First DIHARD Challenge Evaluation – SEEDLingS

Fall 2019 LDC Data Scholarship Program

Student applications for the Fall 2019 LDC Data Scholarship program are being accepted now through September 15, 2019. This scholarship program provides eligible students with access to LDC data at no cost. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for each dialect.

Articulatory measurements were made using the NDI electromagnetic articulography wave research system to capture real-time vocal tract variable trajectories. Subjects had six sensors placed in various locations in their mouth and one reference sensor was placed on the bridge of their nose. For simultaneous recording of speech signals, subjects also wore a head-mounted close-talk microphone.

Speakers engaged in four different types of recording sessions: one in which they read complete sentences or short texts, and three sessions in which they read related words of a specific common consonant, vowel, or tone.

DKU-JNU-EMA Electromagnetic Articulography Database is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

(2) Phrase Detectives Corpus Version 2 was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 407,000 tokens across 537 documents anaphorically-annotated by the Phrase Detectives Game, an online interactive 'game-with-a-purpose' (GWAP) designed to collect data about English anaphoric coreference.

This release constitutes a new version of the Phrase Detectives Corpus (LDC2017T08), adding significantly more annotated tokens to the data set and supplying players’ judgments and a silver label annotation based on the probabilistic aggregation method for anaphoric information for each markable.

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. The annotation is a simplified form of the coding scheme used in The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Phrase Detectives Corpus Version 2 is distributed via web download.

(3) First DIHARD Challenge Evaluation - Nine Sources was developed by LDC and contains approximately 18 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on 'hard' diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):

Autism Diagnostic Observation Schedule (ADOS) interviews
Conversations in Restaurants
DCIEM/HCRC map task (LDC96S38)
Audiobook recordings from LibriVox
Meeting speech collected by LDC in 2001 for the ROAR project (see, e.g., ISL Meeting Speech Part 1 (LDC2004S05))
2001 U.S. Supreme Court oral arguments
Mixer 6 Speech (LDC2013S02)
Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
YouthPoint radio interviews

This release, when combined with First DIHARD Challenge Evaluation - SEEDLingS (LDC2019S13), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation - Nine Sources is distributed via web download.

(4) First DIHARD Challenge Evaluation – SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge.

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6-18 months of age in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

This release, when combined with First DIHARD Challenge Evaluation - Nine Sources (LDC2019S12), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation – SEEDLingS is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $50.

Membership Office

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104