ISCA - International Speech Communication Association

ISCApad #231

Sunday, September 10, 2017 by Chris Wellekens

5-2-17 Linguistic Data Consortium (LDC) update (August 2017)

In this newsletter:

Fall 2017 LDC Data Scholarship program

LDC at Interspeech 2017

New Publications:

Multi-Language Conversational Telephone Speech 2011 -- South Asian

GALE Phase 4 Arabic Broadcast Conversation Speech

GALE Phase 4 Arabic Broadcast Conversation Transcripts

Fall 2017 LDC Data Scholarship program - September 15 deadline approaching

There is still time to apply to the Fall 2017 LDC Data Scholarship program. Applications will be accepted through Friday, September 15, 2017. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application consisting of a data use proposal and a letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

LDC at Interspeech 2017

LDC will once again be exhibiting at Interspeech, held this year August 20-24 in Stockholm, Sweden. Stop by booth 17 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following oral presentation by LDC:

Call My Net Corpus: A Multilingual Corpus for Evaluation of Speaker Recognition Technology

Karen Jones, Stephanie Strassel, Kevin Walker, David Graff, Jonathan Wright

Wednesday, August 23, 17:40-18:00 in the Aula Magna room

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 -- South Asian was developed by LDC and is comprised of approximately 118 hours of telephone speech in five distinct language varieties of South Asia (i.e. the Indian sub-continent): Bengali, Hindi, Punjabi, Tamil and Urdu. The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.

Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call of up to 15 minutes to each acquaintance. The data were collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type, and noise. Demographic information about the participants was not collected.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series: Slavic Group (LDC2016S11) and Turkish (LDC2017S09).

Multi-Language Conversational Telephone Speech 2011 -- South Asian is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $2,500.

(2) GALE Phase 4 Arabic Broadcast Conversation Speech was developed by LDC and is comprised of approximately 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast Conversation Transcripts (LDC2017T12).

This release contains 83 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release.

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Al Alam News Channel, based in Iran; Al Fayhaa, an Iraqi television channel; Al Hiwar, a regional broadcast station based in the United Kingdom; Alhurra, a U.S. government-funded regional broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordoniyah, a national broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting Corporation, a Lebanese television station; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Tunisian National TV, a national television station in Tunisia.

GALE Phase 4 Arabic Broadcast Conversation Speech is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $2,000.

(3) GALE Phase 4 Arabic Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast Conversation Speech (LDC2017S15).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 475,211 tokens. The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release.

GALE Phase 4 Arabic Broadcast Conversation Transcripts is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1,500.

Membership Office

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104
