Linguistic Data Consortium (LDC) update (December 2017)
ISCApad #235
Wednesday, January 10, 2018 by Chris Wellekens
In this newsletter:
Lingo Boingo: a web portal to language games
Renew your LDC membership today
Spring 2018 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2018 Data Scholarship Program now through January 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.
Lingo Boingo: a web portal to language games
LDC is pleased to announce a new collaborative project, Lingo Boingo (https://lingoboingo.org), a web portal that brings together new and existing language games that are fun to play and that provide useful annotations and judgments for linguistic research. Gamers and grammar lovers can choose from a list of challenging games, which will continue to expand through the efforts of LDC and external collaborators. For more information, contact jfiumara@ldc.upenn.edu. Start playing today!
Renew your LDC membership today
Membership Year 2018 (MY2018) is open for joining, and discounts are available for those who keep their membership current and join early in the year. Current MY2017 members who renew before March 1, 2018 will receive a 10% discount on the membership fee. New or returning organizations will receive a 5% discount through March 1.
In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 700 holdings; current year for-profit members may use most data for commercial applications. Visit Join LDC for details on membership, user accounts and payment.
Plans for MY2018 publications are in progress. Among the expected releases are:
New publications:
(1) CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition Challenge and contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments. CHiME3 involved two types of data: speech data recorded in very noisy environments (on a bus, in a cafe, in a pedestrian area, and at a street junction) and noisy utterances generated by artificially mixing clean speech data with noisy backgrounds.
Data is divided into training, development, and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The audio data consists of the background noises, speech data enhanced with the baseline speech enhancement technique, unsegmented noisy speech data, and segmented noisy speech data.
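As an illustration of the audio format described above, the following is a minimal Python sketch that reads the header of a WAV file and checks that it is 16-bit audio sampled at 16 kHz. It uses only the standard-library wave module. The function names are hypothetical and are not part of the CHiME3 release or any LDC tool.

```python
import wave


def wav_properties(source):
    """Return basic header properties of a WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            # getsampwidth() reports bytes per sample, so convert to bits.
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_announced_format(source):
    """Check for 16-bit samples at 16 kHz, as stated in the announcement."""
    props = wav_properties(source)
    return props["sample_rate_hz"] == 16000 and props["bits_per_sample"] == 16
```

A check of this kind could be run over a directory of files before training, so that a file with an unexpected sample rate is caught early rather than silently resampled.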
LDC has also released two CHiME2 corpora -- CHiME2 Grid and CHiME2 WSJ0.
CHiME3 is distributed via USB drive.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $75.
(2) GALE Phase 4 Chinese Broadcast News Speech was developed by LDC and comprises approximately 134 hours of Mandarin Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast News Transcripts (LDC2017T18).
The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: China Central TV (CCTV), a national and international broadcaster in Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA), a U.S. government-funded broadcast programmer.
This release contains 256 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release.
GALE Phase 4 Chinese Broadcast News Speech is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $2,000.
(3) GALE Phase 4 Chinese Broadcast News Transcripts (LDC2017T18)
Corresponding audio data is released as GALE Phase 4 Chinese Broadcast News Speech (LDC2017S25).
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,696,879 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
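Tab-delimited UTF-8 text of this kind can be read with standard tools. The following is a minimal Python sketch using the standard-library csv module. The function names are hypothetical, and the sketch makes no assumption about which columns the TDF files contain or whether they include header or comment lines; consult the documentation shipped with the release for the actual layout.

```python
import csv


def read_tdf_rows(stream):
    """Yield each non-empty row of a tab-delimited text stream as a list of fields."""
    # QUOTE_NONE treats quotation marks in transcript text as ordinary characters
    # instead of as field delimiters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


def load_tdf(path):
    """Read a whole tab-delimited UTF-8 file into a list of rows."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-8", newline="") as handle:
        return list(read_tdf_rows(handle))
```

Decoding explicitly as UTF-8 matters here because the transcripts contain Chinese text, and a platform default encoding may differ.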
GALE Phase 4 Chinese Broadcast News Transcripts is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1,500.
Membership Office
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104