Linguistic Data Consortium (LDC) update (August 2017)
ISCApad #231
Sunday, September 10, 2017 by Chris Wellekens
In this newsletter:
Multi-Language Conversational Telephone Speech 2011 -- South Asian
Fall 2017 LDC Data Scholarship program - September 15 deadline approaching
There is still time to apply to the Fall 2017 LDC Data Scholarship program. Applications will be accepted through Friday, September 15, 2017. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application, which consists of a data use proposal and a letter of support from their advisor.
Applicants can email their materials to the LDC Data Scholarship program.
LDC at Interspeech 2017
LDC will once again be exhibiting at Interspeech, held this year August 20-24 in Stockholm, Sweden. Stop by booth 17 to learn more about recent developments at the Consortium and new publications.
Also, be on the lookout for the following oral presentation by LDC:
Karen Jones, Stephanie Strassel, Kevin Walker, David Graff, Jonathan Wright
Wednesday, August 23, 17:40-18:00 in the Aula Magna room
LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!
(1) Multi-Language Conversational Telephone Speech 2011 -- South Asian was developed by LDC and comprises approximately 118 hours of telephone speech in five distinct language varieties of South Asia (i.e., the Indian subcontinent): Bengali, Hindi, Punjabi, Tamil and Urdu. The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $2,500.
(2) GALE Phase 4 Arabic Broadcast Conversation Speech was developed by LDC and comprises approximately 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast Conversation Transcripts (LDC2017T12).
This release contains 83 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release.
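As a rough guide to storage needs, the stated audio parameters imply a fixed data rate once the FLAC files are decoded. The sketch below works through that arithmetic using only the figures given above (16000 Hz, 16-bit, single channel, approximately 75 hours); the function name `pcm_bytes` is illustrative, and the result ignores file headers and is only as precise as the "approximately 75 hours" figure.

```python
# Back-of-the-envelope size of the decoded audio, from the figures stated above.
SAMPLE_RATE_HZ = 16_000   # "16000 Hz"
BYTES_PER_SAMPLE = 2      # "16-bit PCM"
CHANNELS = 1              # "single-channel"
APPROX_HOURS = 75         # "approximately 75 hours"


def pcm_bytes(hours, sample_rate_hz=SAMPLE_RATE_HZ,
              bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the uncompressed PCM payload size in bytes (headers ignored)."""
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


if __name__ == "__main__":
    total = pcm_bytes(APPROX_HOURS)
    print(f"{total / 1e9:.2f} GB uncompressed")
```

At 32,000 bytes per second, 75 hours comes to about 8.64 GB of uncompressed PCM; the FLAC-compressed files as distributed will be smaller.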
The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Al Alam News Channel, based in Iran; Al Fayhaa, an Iraqi television channel; Al Hiwar, a regional broadcast station based in the United Kingdom; Alhurra, a U.S. government-funded regional broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordoniyah, a national broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting Corporation, a Lebanese television station; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Tunisian National TV, a national television station in Tunisia.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $2,000.
(3) GALE Phase 4 Arabic Broadcast Conversation Transcripts (LDC2017T12)
The corresponding audio data is released as GALE Phase 4 Arabic Broadcast Conversation Speech (LDC2017S15).
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 475,211 tokens. The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.
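For readers planning to process the transcripts, a tab-delimited UTF-8 file can be read with Python's standard `csv` module. The sketch below is a minimal illustration, not LDC's tooling: the column position of the transcript text (`TRANSCRIPT_COLUMN`) and the function names are assumptions made for the example, and the actual column layout should be taken from the documentation included with the release.

```python
import csv

# ASSUMPTION: zero-based index of the column holding the transcript text.
# This value is hypothetical; check the release documentation for the real layout.
TRANSCRIPT_COLUMN = 7


def read_tdf_rows(path):
    """Read a tab-delimited UTF-8 file and return its non-empty rows as lists."""
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcript text literal.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:  # skip blank lines
                rows.append(row)
    return rows


def count_tokens(rows, column=TRANSCRIPT_COLUMN):
    """Count whitespace-separated tokens in the given column across all rows."""
    return sum(len(row[column].split()) for row in rows if len(row) > column)
```

A whitespace split is only one possible tokenization, so a count produced this way should not be expected to match the 475,211 tokens reported for the corpus.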
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1,500.
Membership Office
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810, Philadelphia, PA 19104