ISCApad #183 |
Wednesday, September 11, 2013 by Chris Wellekens |
In this newsletter:
- Mixer 6 now available!
- LDC at Interspeech 2013, Lyon France
New publications:
- GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
- Mixer 6 Speech
Mixer 6 now available!
The release of Mixer 6 Speech this month marks the first time in close to a decade that LDC has made available a large-scale speech training data collection. Representing more than 15,000 hours of speech from over 500 speakers, Mixer 6 follows in the footsteps of the Switchboard and Fisher studies by providing a large database of rich telephone conversations with the addition of subject interviews and transcript readings. Participants were native American English speakers local to the Philadelphia area, providing further scope for a variety of research tasks. Mixer 6 Speech is a members-only release and a great reason to join the consortium. In addition to this substantial resource, members enjoy rights to other data released in 2013 and can license older publications at reduced fees. Please see the full
Fall 2013 LDC Data Scholarship Program - deadline approaching!
The deadline for the Fall 2013 LDC Data Scholarship Program is one month away! Student applications are being accepted now through September
Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.
LDC at Interspeech 2013, Lyon France
LDC will once again be exhibiting at Interspeech, held this year August 25-29 in Lyon. Please stop by LDC’s booth to learn about recent developments at the Consortium, including new publications.
Also, be on the lookout for the following presentations:
· Speech Activity Detection on YouTube Using Deep Neural Networks
· The Spectral Dynamics of Vowels in Mandarin Chinese
· Automatic Phonetic Segmentation using Boundary Models
LDC will continue to post conference updates via our Facebook page. We hope to see you there!
New publications
(1) GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
This release includes 20 source-translation document pairs, comprising 152,894 characters of Chinese source text and its English translation. Data is drawn from six distinct Chinese programs broadcast in 2005-2007 from Phoenix TV, a Hong Kong-based satellite television station. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.
The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.
GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 is distributed via web download.
2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.
*
(2)MADCAT (Multilingual
The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 3 data was collected from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic-speaking scribes copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple pages for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.
The handwritten, transcribed documents were next checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.
The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file which contains all required information. The resulting madcat.xml file contains distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer.
This release includes 4,540 annotation files in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml) along with their corresponding scanned image files in TIFF format.
*
(3) Mixer 6 Speech
The speech data in this release was collected by LDC at its Human Subjects Collection facilities in Philadelphia. The telephone collection protocol was similar to other LDC telephone studies (e.g., Switchboard-2 Phase III Audio - LDC2002S06): recruited
The multi-microphone portion of the collection utilized 14 distinct microphones installed identically in two multi-channel audio recording rooms at LDC. Each session was guided by collection staff using prompting and recording software to conduct the following activities: (1) repeat questions (less than one minute), (2) informal conversation (typically 15 minutes), (3) transcript reading (approximately 15 minutes) and (4) telephone call (generally 10 minutes). Speakers recorded up to three 45-minute sessions on distinct days. The 14 channels were recorded synchronously into separate single-channel files, using 16-bit PCM sample encoding at 16000 samples/second.
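To give a rough sense of scale, the recording parameters above imply a raw data volume that can be worked out directly. The sketch below is only a back-of-the-envelope estimate: it assumes a session runs the full 45 minutes on all 14 channels, and it ignores file headers and any compression applied to the distributed files. The function name `raw_pcm_bytes` is illustrative and not part of any LDC tooling.

```python
# Recording parameters as stated in the corpus description.
BYTES_PER_SAMPLE = 2      # 16-bit PCM
SAMPLE_RATE_HZ = 16_000   # samples per second, per channel
CHANNELS = 14             # one single-channel file per microphone
SESSION_MINUTES = 45      # assumption: the session uses its full nominal length


def raw_pcm_bytes(minutes, channels=CHANNELS,
                  sample_rate_hz=SAMPLE_RATE_HZ,
                  bytes_per_sample=BYTES_PER_SAMPLE):
    """Uncompressed PCM payload in bytes (no headers, no compression)."""
    seconds = minutes * 60
    return seconds * sample_rate_hz * bytes_per_sample * channels


if __name__ == "__main__":
    per_channel = raw_pcm_bytes(SESSION_MINUTES, channels=1)
    per_session = raw_pcm_bytes(SESSION_MINUTES)
    print(f"one channel, one session: {per_channel / 1e6:.1f} MB")
    print(f"all channels, one session: {per_session / 1e9:.2f} GB")
```

Under these assumptions a single channel comes to about 86 MB per session and the full 14-channel set to roughly 1.2 GB, before any lossless compression.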
The recordings in this corpus were used in NIST Speaker Recognition Evaluation (SRE) test sets for 2010 and 2012. Researchers interested in applying those benchmark test sets should consult the respective NIST Evaluation Plans for guidelines on allowable training data for those tests.
The collection contains 4,410 recordings made via the public telephone network and 1,425 sessions of multiple microphone recordings in office-room settings. The telephone recordings are presented as 8 kHz 2-channel NIST SPHERE files, and the microphone recordings are 16 kHz 1-channel flac/ms-wav files.
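For readers who want to confirm that a microphone recording matches the stated format before processing it, a minimal sketch using Python's standard `wave` module follows. It covers only PCM WAV files; the FLAC and NIST SPHERE files would need other tools. The helper names (`describe_wav`, `matches_expected`, `write_silence`) and the demo file name are hypothetical, and the demo writes its own silent test file rather than reading anything from the corpus.

```python
import os
import tempfile
import wave

# Format stated for the microphone recordings: 16 kHz, 1 channel, 16-bit PCM.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16_000}


def describe_wav(path):
    """Read basic format parameters from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_expected(info, expected=EXPECTED):
    """True if every expected parameter matches the file's header."""
    return all(info[key] == value for key, value in expected.items())


def write_silence(path, seconds, sample_rate_hz=16_000,
                  channels=1, sample_width_bytes=2):
    """Write a silent WAV file; used here only to make the demo runnable."""
    n_bytes = seconds * sample_rate_hz * channels * sample_width_bytes
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * n_bytes)


if __name__ == "__main__":
    # Hypothetical file name; a real corpus path would go here instead.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo_mic_channel.wav")
    write_silence(demo_path, seconds=2)
    info = describe_wav(demo_path)
    print(info, "matches expected format:", matches_expected(info))
```

Checking the header up front is a cheap way to catch a file that was resampled or converted along the way.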
Mixer 6 Speech is distributed on one hard drive.