ISCA - International Speech Communication Association



ISCApad #184

Friday, October 11, 2013 by Chris Wellekens

5-2-3 LDC Newsletter (September 2013)
  

In this newsletter:

New LDC Website Coming Soon
LDC Spoken Language Sampler - 2nd Release

New publications:

GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
Semantic Textual Similarity (STS) 2013 Machine Translation

New LDC Website Coming Soon

Look for LDC's new website in the coming weeks. We've revamped the design and site plan to make it easier than ever to find what you're looking for. The features you use the most (the catalog, new corpus releases and user login) will be a short click away. We expect the LDC website to be occasionally unavailable for a few days at the end of September as we make the switch, and we thank you in advance for your understanding.
     
   

   

LDC Spoken Language Sampler - 2nd Release

The LDC Spoken Language Sampler - 2nd Release is now available. It contains speech and transcript samples from recent releases and is available at no cost. Follow the link above to the catalog page, download and browse.

   
New publications

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 was developed by LDC and comprises approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program. The data was collected at LDC’s Philadelphia, PA, USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

   

LDC's local broadcast collection system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high-bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output.

   

The broadcast conversation recordings in this release feature interviews, call-in programs and round table discussions focusing principally on current events from several sources. This release contains 141 audio files in .wav format (16000 Hz, single-channel, 16-bit PCM). Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release.
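As an illustration of the audio format described above, the following minimal Python sketch reads a .wav header and checks it against the stated parameters (16000 Hz, single-channel, 16-bit PCM). It assumes the files are standard RIFF WAVE files readable by Python's standard `wave` module; the names `WavFormat`, `read_wav_format` and `matches_release_format` are hypothetical helpers introduced here and are not part of the release.

```python
import wave
from typing import NamedTuple


class WavFormat(NamedTuple):
    """Header fields of interest (hypothetical helper type)."""
    sample_rate: int
    channels: int
    sample_width_bytes: int
    duration_seconds: float


def read_wav_format(path: str) -> WavFormat:
    """Read the header of a PCM .wav file without loading the samples."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return WavFormat(
            sample_rate=rate,
            channels=wav.getnchannels(),
            sample_width_bytes=wav.getsampwidth(),
            duration_seconds=wav.getnframes() / rate,
        )


def matches_release_format(fmt: WavFormat) -> bool:
    """True if the header is 16000 Hz, single-channel, 16-bit (2-byte) PCM."""
    return (
        fmt.sample_rate == 16000
        and fmt.channels == 1
        and fmt.sample_width_bytes == 2
    )
```

Summing `duration_seconds` over all files in a directory would give a rough check against the approximately 128 hours stated for the release.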

   

GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 is distributed on two DVD-ROMs.

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
   

   

     

   

(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. The source broadcast conversation recordings feature interviews, call-in programs and round table discussions focusing principally on current events from several sources.

   

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 763,945 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.
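A tab-delimited, UTF-8 transcript file of this kind can be read with a few lines of Python. The sketch below is only an illustration: the newsletter does not describe the column layout, so the position of the transcript column (`text_col`), the comment prefix (`";;"`) and the presence of a header row are assumptions to be checked against the documentation included with the release. The function names are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_tab_delimited(
    lines: Iterable[str],
    comment_prefix: str = ";;",   # assumed comment marker, not confirmed here
    skip_header: bool = False,    # assumed: first data line may be a header
) -> Iterator[List[str]]:
    """Yield one list of fields per non-empty, non-comment line."""
    header_pending = skip_header
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(comment_prefix):
            continue
        if header_pending:
            header_pending = False
            continue
        yield line.split("\t")


def count_tokens(rows: Iterable[List[str]], text_col: int) -> int:
    """Count whitespace-separated tokens in the assumed transcript column."""
    return sum(len(row[text_col].split()) for row in rows if len(row) > text_col)
```

A file would be opened with `open(path, encoding="utf-8")` and passed to `read_tab_delimited`. Whitespace tokenization is a simplification, so a count produced this way need not match the 763,945 tokens reported for the corpus.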

   

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript.

   

GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 is distributed via web download.

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.
   

   

   

   

(3) Semantic Textual Similarity (STS) 2013 Machine Translation was developed as part of the STS 2013 Shared Task, which was held in conjunction with *SEM 2013, the second joint conference on lexical and computational semantics organized by the ACL (Association for Computational Linguistics) interest groups SIGLEX and SIGSEM. It consists of one text file containing 750 English sentence pairs translated from Arabic and Chinese newswire and web data sources.

   

The goal of the Semantic Textual Similarity (STS) task was to create a unified framework for the evaluation of semantic textual similarity modules and to characterize their impact on natural language processing (NLP) applications. STS measures the degree of semantic equivalence between two texts. The task was proposed as a way to allow extrinsic evaluation of multiple semantic components that have historically tended to be evaluated independently and without characterization of their impact on NLP applications. More information is available at the STS 2013 Shared Task homepage.

   

The source data is Arabic and Chinese newswire and web data collected by LDC that was translated and used in the DARPA GALE (Global Autonomous Language Exploitation) program and in several NIST Open Machine Translation evaluations. Of the 750 sentence pairs, 150 pairs are from the GALE Phase 5 collection and 600 pairs are from NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07).

   

The data was built to identify semantic textual similarity between two short text passages. Each line of the corpus contains two tab-delimited sentences: the first is a translation and the second is a post-edited translation. Post-editing is a process to improve machine translation with a minimum of manual labor. The gold standard similarity values and other STS datasets can be obtained from the STS 2013 Shared Task homepage.
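The two-sentences-per-line layout described above can be parsed as in the following Python sketch. The function names are hypothetical, and the word-overlap (Jaccard) score is only a simple stand-in used to show how a pair might be compared; it is not the STS task's scoring method and is no substitute for the gold standard similarity values.

```python
from typing import Iterable, List, Tuple


def parse_sentence_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each non-empty line into (translation, post-edited translation)."""
    pairs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 fields, got {len(fields)}")
        pairs.append((fields[0], fields[1]))
    return pairs


def jaccard_overlap(first: str, second: str) -> float:
    """Illustrative baseline: shared lowercase words over all distinct words."""
    a = set(first.lower().split())
    b = set(second.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For the corpus file itself, the lines would come from `open(path, encoding="utf-8")` (the encoding is an assumption), and `len(parse_sentence_pairs(...))` should equal the 750 pairs stated above.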

   

Semantic Textual Similarity (STS) 2013 Machine Translation is distributed via web download.

   

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by submitting a signed copy of the LDC User Agreement for Non-members. This data is available at no cost.

      

