ISCA - International Speech
Communication Association



ISCApad #185

Tuesday, November 12, 2013 by Chris Wellekens

5-2-3 LDC Newsletter (October 2013)
  

 

In this newsletter:

Fall 2013 LDC Data Scholarship Recipients

New publications:

GALE Phase 2 Chinese Broadcast News Speech

GALE Phase 2 Chinese Broadcast News Transcripts

OntoNotes Release 5.0


Fall 2013 LDC Data Scholarship Recipients

   

LDC is pleased to announce the student recipients of the Fall 2013 LDC Data Scholarship program! This program provides university and college students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:

   

     

Shamama Afnan - Clemson University (USA), MS candidate, Electrical Engineering. Shamama has been awarded a copy of 2008 NIST Speaker Recognition Training and Test data for her work in speaker recognition.

     

Seyedeh Firoozabadi - University of Connecticut (USA), PhD candidate, Biomedical Engineering. Seyedeh has been awarded a copy of TIDIGITS and TI-46 Word for her work in speech recognition.

Lei Liu - Beijing Foreign Studies University (China), PhD candidate, Foreign Language Education. Lei has been awarded a copy of Treebank-3 and Prague Czech-English Dependency Treebank 2.0 for his work in parsing.

Monisankha Pal - Indian Institute of Technology, Kharagpur (India), PhD candidate, Electronics and Electrical Communication Engineering. Monisankha has been awarded a copy of CSR-I (WSJ0) and CSR-II (WSJ1) for his work in speaker recognition.

Sachin Pawar - Indian Institute of Technology, Bombay (India), PhD candidate, Computer Science and Engineering. Sachin has been awarded a copy of ACE 2004 Multilingual Training Corpus for his work in named-entity recognition.

Sergio Silva - Federal University of Rio Grande do Sul (Brazil), MS candidate, Computer Science. Sergio has been awarded a copy of 2004 and 2005 Spring NIST Rich Transcription data for his work in diarization.

   

   

 

   

New publications

   

(1) GALE Phase 2 Chinese Broadcast News Speech was developed by LDC and comprises approximately 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by the Linguistic Data Consortium (LDC) and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

   

Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast News Transcripts (LDC2013T20).

   

Broadcast audio for the GALE program was collected at LDC's Philadelphia, PA USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

   

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based satellite television station.

   

This release contains 248 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type and topic.
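
The stated audio format can be checked against a file's own header. The following is a minimal sketch, not part of the LDC release: it decodes the STREAMINFO metadata block that the FLAC format places at the start of every stream and compares it with the 16000 Hz, single-channel, 16-bit parameters given above. The names `StreamInfo`, `parse_flac_streaminfo` and `matches_stated_format` are hypothetical.

```python
from typing import NamedTuple


class StreamInfo(NamedTuple):
    sample_rate: int
    channels: int
    bits_per_sample: int
    total_samples: int


def parse_flac_streaminfo(data: bytes) -> StreamInfo:
    """Decode the STREAMINFO block from the first 42 bytes of a FLAC stream."""
    if len(data) < 42 or data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    block_type = data[4] & 0x7F  # low 7 bits; the high bit is the last-block flag
    block_length = int.from_bytes(data[5:8], "big")
    if block_type != 0 or block_length != 34:
        raise ValueError("first metadata block is not STREAMINFO")
    # Skip 10 bytes of block-size and frame-size fields, then read 64 bits that
    # pack: sample rate (20 bits), channels - 1 (3 bits),
    # bits per sample - 1 (5 bits), total samples (36 bits).
    packed = int.from_bytes(data[18:26], "big")
    return StreamInfo(
        sample_rate=packed >> 44,
        channels=((packed >> 41) & 0x7) + 1,
        bits_per_sample=((packed >> 36) & 0x1F) + 1,
        total_samples=packed & ((1 << 36) - 1),
    )


def matches_stated_format(info: StreamInfo) -> bool:
    """True if the header agrees with the format described in the newsletter."""
    return (info.sample_rate, info.channels, info.bits_per_sample) == (16000, 1, 16)
```

For a real file, reading the first 42 bytes in binary mode and passing them to `parse_flac_streaminfo` should be sufficient, provided nothing (such as a tag block) precedes the `fLaC` marker.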

   

GALE Phase 2 Chinese Broadcast News Speech is distributed on two DVD-ROMs.

   

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

   

*
     

   

(2) GALE Phase 2 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 110 hours of Chinese broadcast news speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

   

Corresponding audio data is released as GALE Phase 2 Chinese Broadcast News Speech (LDC2013S08).

   

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,593,049 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.
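
Because the transcripts are tab-delimited UTF-8 text, they can be read with standard tools. The sketch below is illustrative only: the newsletter does not describe the column layout, so the reader simply returns the raw fields of each row, and skipping lines that begin with `;;` is an assumed convention that should be checked against the release documentation. The function name `read_tdf_rows` is hypothetical.

```python
import csv
import io
from typing import Iterator, List


def read_tdf_rows(text: str, comment_prefix: str = ";;") -> Iterator[List[str]]:
    """Yield the tab-separated fields of each non-empty, non-comment line.

    `comment_prefix` is an assumed convention, not taken from the newsletter.
    """
    # QUOTE_NONE keeps quotation marks in the transcript text as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith(comment_prefix):
            continue
        yield fields
```

To apply this to a file, one option is to open it with `encoding="utf-8"` and `newline=""`, read its contents, and pass the resulting string to `read_tdf_rows`.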

   

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript.

   

GALE Phase 2 Chinese Broadcast News Transcripts is distributed via web download.

   

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

   

*
     

   


(3) OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

   

OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04, OntoNotes Release 3.0 LDC2009T24 and OntoNotes Release 4.0 LDC2011T03 -- and adds source data from, and/or additional annotations for, newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words.

   

The OntoNotes project built on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation includes word sense disambiguation for nouns and verbs, with some word senses connected to an ontology, and coreference.

   

Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with a Python API to provide convenient cross-layer access.
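
Before loading the integrated database, it can help to see which tables the dump defines. The following sketch does not use the release's Python API, whose interface is not described here. It only scans a gzip-compressed SQL file for `CREATE TABLE` statements, on the assumption that the dump uses that standard syntax with each statement starting its own line. The function name `list_tables` is hypothetical.

```python
import gzip
import re
from typing import List

# Matches the table name after CREATE TABLE, allowing an optional
# IF NOT EXISTS clause and an optional opening backtick or double quote.
CREATE_TABLE = re.compile(
    r"^\s*CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[`\"]?(\w+)", re.IGNORECASE
)


def list_tables(path: str) -> List[str]:
    """Return table names, in order of appearance, from a gzipped SQL dump."""
    tables = []
    # Read line by line so a large dump is never held in memory at once.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            match = CREATE_TABLE.match(line)
            if match:
                tables.append(match.group(1))
    return tables
```

Under those assumptions, calling `list_tables("ontonotes-v5.0.sql.gz")` would list the tables defined in the release's database file.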

   

OntoNotes Release 5.0 is distributed on one DVD-ROM.

   

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. This data is available at no charge, but is subject to non-member shipping and handling fees.

   

 


   

      

