ISCApad #185 |
Tuesday, November 12, 2013 by Chris Wellekens |
In this newsletter:
Fall 2013 LDC Data Scholarship Recipients
New publications:
GALE Phase 2 Chinese Broadcast News Speech
GALE Phase 2 Chinese Broadcast News Transcripts
Fall 2013 LDC Data Scholarship Recipients
LDC is pleased to announce the student recipients of the Fall 2013 LDC
New publications
(1) GALE Phase 2 Chinese Broadcast News Speech was developed by LDC and is comprised of approximately 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by the Linguistic Data Consortium (LDC) and Hong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding
Broadcast audio for the GALE program was collected at LDC's Philadelphia, PA USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.
The broadcast conversation recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province; China Central TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based satellite television station.
This release contains 248 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded, and as a guide for data selection by retaining information about a program's genre, data type and topic.
GALE Phase 2 Chinese Broadcast News Speech is distributed on 2 DVD-ROM.
2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
*
(2) GALE Phase 2 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 110 hours of Chinese broadcast news speech collected in 2006 and 2007 by LDC and Hong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,593,049 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript.
GALE Phase 2 Chinese Broadcast News Transcripts is distributed via web download.
2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
*
OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04, OntoNotes Release 3.0 LDC2009T24 and OntoNotes Release 4.0 LDC2011T03 -- and adds source data from and/or additional annotations for, newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words
The OntoNotes project built on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation includes word sense disambiguation for nouns and verbs, with some word senses connected to an ontology, and coreference.
Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with a Python API to provide convenient cross-layer access.
OntoNotes Release 5.0 is distributed on 1 DVD-ROM.
2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members. The agreement can be faxed +1 215 573 2175 or scanned and emailed to this address. This data is available at no charge, but is subject to non-member shipping and handling fees.
|
Back | Top |