ISCA - International Speech
Communication Association



ISCApad #172

Sunday, October 07, 2012 by Chris Wellekens

5-2-2 LDC Newsletter (September 2012)
  

In this newsletter:

- The Future of Language Resources: LDC 20th Anniversary Workshop Summary
- English Treebanking at LDC

New publications

- GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
- MADCAT Phase 1 Training Set

   


   

The Future of Language Resources: LDC 20th Anniversary Workshop Summary

   

Thanks to the members, friends and staff who made our 20th Anniversary Workshop (September 6-7) a fruitful and fun experience. The speakers, from academia, industry and government, engaged participants and provoked discussion with their talks on how language resources contribute to research in language-related fields and other disciplines, and with their insights into the future. The result was much food for thought as we enter our third decade.
   

   

Visit the workshop page for the proceedings and to learn more about the event.

   

English Treebanking at LDC
     

   

 

   

                     

   

As part of our 20th anniversary celebration, the coming newsletters will include features that provide an overview of the broad range of LDC’s activities. This month, we'll examine English treebanking efforts at LDC. The English treebanking team is led by Ann Bies, Senior Research Coordinator. The association of treebanks with LDC began with the publication of the original Penn English Treebank (Treebank-2) in 1995. Since that time the need for new varieties of English treebank data has continued to grow, and LDC has expanded its expertise to address new research challenges. This includes the development of treebanked data for additional domains, including conversational speech and web text, as well as the creation of parallel treebank data.

   

Speech data presents unique challenges not inherent in edited text, such as disfluencies and hesitations. Penn Treebank contains conversational speech data from the Switchboard telephone collection which has been tagged, dysfluency-annotated, and parsed. LDC’s more recent publication, English CTS Treebank with Structural Metadata, builds on that annotation and includes new data. The development of that corpus was motivated by the need to have both structural metadata and syntactic structure annotated in order to support work on speech parsing and structural event detection. The annotation used a two-pass approach to annotating metadata, speech effects and syntactic structure in transcribed conversational speech: one pass annotated structural metadata (structural events) and a separate pass annotated syntactic structure. The two annotations were then combined into a single aligned representation.

   

Also recently, LDC has undertaken complex syntactic annotation of data collected over the web. Since most parsers are trained using newswire, they achieve better accuracy on similar heavily edited texts. LDC, through a gift from Google Inc., developed English Web Treebank to improve parsing, translation and information extraction on unedited domains, such as blogs, newsgroups, and consumer reviews. LDC’s annotation guidelines were adapted to handle unique features of web text such as inconsistent punctuation and capitalization as well as the increased use of slang, technical jargon and ungrammatical sentences.

   

LDC and its research partners are also involved in the creation of parallel treebanks used for word alignment tasks. Parallel treebanks are annotated morphological and syntactic structures that are aligned at sentence as well as sub-sentence levels. These resources are used for improving machine translation quality. To create such treebanks, English files (translated from the source Arabic or Chinese) are first automatically part-of-speech tagged and parsed and then hand-corrected at each stage. The quality control process consists of a series of specific searches for over 100 types of potential inconsistency and parser or annotation error. Parallel treebank data in the LDC catalog includes the English Translation Treebank: An Nahar Newswire, whose files are parallel with those in Arabic Treebank: Part 3 v 3.2.

   

English treebanking at LDC is ongoing; new titles are in progress and will be added to our catalog.

   

     
     
   

   
       

New Publications
     

   

(1) GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web was developed by LDC and contains 150,068 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. This release consists of Chinese source newswire and web data (newsgroup, weblog) collected by LDC in 2008.

   

Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

   

The Chinese word alignment tasks consisted of the following components:

-Identifying, aligning, and tagging 8 different types of links

-Identifying, attaching, and tagging local-level unmatched words

-Identifying and tagging sentence/discourse-level unmatched words

-Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link.
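As a rough illustration of the kind of record these tasks produce, the sketch below models tagged alignment links and finds target-side words left unmatched. It is not the corpus file format: the class names, field names and the tag value "EXAMPLE_LINK" are hypothetical, chosen only for this example, and the toy sentence pair is not taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class AlignmentLink:
    """One translation unit: source token indices linked to target token indices."""
    source: Tuple[int, ...]  # indices into the Chinese token list
    target: Tuple[int, ...]  # indices into the English token list
    tag: str                 # hypothetical link-type label, not an LDC tag name


@dataclass
class AlignedSentence:
    chinese: List[str]
    english: List[str]
    links: List[AlignmentLink] = field(default_factory=list)

    def unmatched_english(self) -> List[str]:
        """Return English tokens that are not covered by any link."""
        covered = {i for link in self.links for i in link.target}
        return [tok for i, tok in enumerate(self.english) if i not in covered]


# Toy sentence pair (illustrative only).
sentence = AlignedSentence(
    chinese=["红", "书"],
    english=["the", "red", "book"],
    links=[
        AlignmentLink(source=(0,), target=(1,), tag="EXAMPLE_LINK"),
        AlignmentLink(source=(1,), target=(2,), tag="EXAMPLE_LINK"),
    ],
)

print(sentence.unmatched_english())
```

In this toy pair the English article has no Chinese counterpart, so it is the only token reported as unmatched; in the annotation scheme described above, such words would then be attached or tagged.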

   

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web is distributed via web download.

   

2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

   

*

   

(2) MADCAT Phase 1 Training Set contains all training data created by LDC to support Phase 1 of the DARPA MADCAT Program. The data in this release consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.

   

The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 1 data was collected by LDC from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic-speaking 'scribes' copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple 'pages' for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.

   

The handwritten, transcribed documents were checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.
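For a sense of scale, pixel coordinates in a 600 dpi scan can be converted to physical size on the page. The sketch below assumes a line or token is described by a pixel bounding box given as (left, top, right, bottom); that layout, the function names and the sample coordinates are illustrative assumptions, not the MADCAT XML schema.

```python
DPI = 600           # scan resolution stated in the newsletter
MM_PER_INCH = 25.4  # exact by definition of the inch


def pixels_to_mm(pixels: float, dpi: int = DPI) -> float:
    """Convert a length in scan pixels to millimetres on the page."""
    return pixels / dpi * MM_PER_INCH


def box_size_mm(left: int, top: int, right: int, bottom: int, dpi: int = DPI):
    """Return (width_mm, height_mm) of a pixel bounding box."""
    return pixels_to_mm(right - left, dpi), pixels_to_mm(bottom - top, dpi)


# Hypothetical token box: 600 px wide and 150 px tall.
width_mm, height_mm = box_size_mm(1200, 600, 1800, 750)
print(f"{width_mm:.2f} mm x {height_mm:.2f} mm")
```

At 600 dpi, 600 pixels span one inch, so the hypothetical token above is about 25.4 mm wide and 6.35 mm tall.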

   

The final step was to produce a unified data format that takes multiple data streams and generates a single XML output file which contains all required information. The resulting XML file has these distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer. This release includes 9,693 annotation files in MADCAT XML format (.madcat.xml) along with their corresponding scanned image files in TIFF format.

   

MADCAT Phase 1 Training Set is distributed on two DVD-ROMs.

   

2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

   

 

 
            


