ISCA - International Speech
Communication Association



ISCApad #169

Tuesday, July 10, 2012 by Chris Wellekens

5-2-2 LDC Newsletter (June 2012)
  

In this newsletter:

LDC at LREC 2012

New publications:

LDC2012T09 - Arabic-Dialect/English Parallel Text
LDC2012T08 - Prague Czech-English Dependency Treebank 2.0

 


LDC at LREC 2012

LDC attended the 8th Language Resources and Evaluation Conference (LREC 2012), hosted by ELRA, the European Language Resources Association. The conference was held in Istanbul, Turkey, and featured a broad range of sessions on language resources and human language technologies research. Fourteen LDC staff members presented current work on a wide range of topics, including handwriting recognition, word alignment, treebanks, machine translation and information retrieval, as well as initiatives for synchronizing metadata practices in sociolinguistic data collection.
The LDC Papers page now includes research papers presented at LREC 2012. Most papers are available for download in PDF format; presentation slides and posters are available for several papers as well. On the Papers page, you can read about LDC's role in resource creation to support handwriting recognition and translation technology (Song et al., 2012). LDC is developing resources to support two research programs: Multilingual Automatic Document Classification, Analysis and Translation (MADCAT) and Open Handwriting Recognition and Translation (OpenHaRT). To support these programs, LDC is collecting handwritten samples of pre-processed Arabic and Chinese data that had previously been translated into English. To date, LDC has collected and annotated over 225,000 handwriting images.
Additionally, you can learn about LDC's efforts to collect and annotate very large corpora of user-contributed content in multiple languages (Garland et al., 2012). For the Broad Operational Language Translation (BOLT) program, LDC is developing resources to support genre-independent machine translation and information retrieval systems. In the current phase of BOLT, LDC is collecting and annotating threaded posts from online discussion forums, targeting at least 500 million words each in three languages: English, Chinese, and Egyptian Arabic. A portion of the data undergoes manual, multi-layered linguistic annotation.
As we mark LDC's 20th anniversary, we will feature the work behind these LREC papers as well as other ongoing research in upcoming newsletters.

New publications

(1) Arabic-Dialect/English Parallel Text was developed by Raytheon BBN Technologies (BBN), LDC and Sakhr Software and contains approximately 3.5 million tokens of Arabic dialect sentences and their English translations.

The data in this corpus consists of Arabic web text obtained in one of two ways:

1. Filtered automatically from large Arabic text corpora harvested from the web by LDC. The LDC corpora consisted largely of weblogs and online user groups and amounted to around 350 million Arabic words. Documents that contained a large percentage of non-Arabic or Modern Standard Arabic (MSA) words were eliminated. A list of dialect words was manually selected by culling through the Levantine Fisher (LDC2005S07, LDC2005T03, LDC2007S02 and LDC2007T04) and Egyptian CALLHOME (LDC97S45, LDC2002S37, LDC97T19 and LDC2002T38) speech corpora distributed by LDC. That list was then used to retain documents that contained a certain number of matches. The resulting subset of the web corpora contained around four million words. Documents were automatically segmented into passages using formatting information from the raw data.

2. Manually harvested by Sakhr Software from Arabic dialect web sites.
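The automatic filtering described in item 1 can be illustrated with a minimal Python sketch. It is not the procedure BBN or LDC actually ran: the newsletter gives neither the thresholds nor the way MSA words were detected, so the function names, the `msa_words` lookup set, both threshold values and the toy word lists in the usage note are assumptions made for illustration only.

```python
from typing import List, Set


def is_arabic_script(token: str) -> bool:
    """True if the token has at least one character in the basic Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in token)


def keep_document(
    tokens: List[str],
    dialect_words: Set[str],
    msa_words: Set[str],
    max_excluded_ratio: float = 0.5,  # assumed; the source only says "a large percentage"
    min_dialect_matches: int = 2,     # assumed; the source only says "a certain number"
) -> bool:
    """Decide whether a tokenized document survives both filtering steps."""
    if not tokens:
        return False

    # Step 1: eliminate documents dominated by non-Arabic or MSA words.
    # Using a word list to spot MSA tokens is an assumption of this sketch.
    excluded = sum(
        1 for t in tokens if not is_arabic_script(t) or t in msa_words
    )
    if excluded / len(tokens) > max_excluded_ratio:
        return False

    # Step 2: retain the document only if enough tokens match the dialect list.
    matches = sum(1 for t in tokens if t in dialect_words)
    return matches >= min_dialect_matches
```

For example, with toy lists such as `dialect_words = {"مش", "ليش"}` and `msa_words = {"لماذا"}` (illustrative entries, not taken from the corpus), `keep_document(["ليش", "مش", "hello"], dialect_words, msa_words)` keeps the document, while a document made up mostly of Latin-script tokens is dropped at the first step.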

Dialect classification, sentence segmentation (as needed) and translation into English were performed by BBN through Amazon's Mechanical Turk. Arabic annotators from Mechanical Turk classified filtered passages as being either MSA or one of four regional dialects: Egyptian, Levantine, Gulf/Iraqi or Maghrebi. An additional 'General' dialect option was allowed for ambiguous passages. The classification was applied to whole passages rather than individual sentences. Only the passages labeled Levantine and Egyptian were further processed. The segmented Levantine and Egyptian sentences were then translated. Annotators were instructed to translate completely and accurately and to transliterate Arabic names. They were also provided with examples. All segments of a passage were presented in the same translation task to provide context.

Arabic-Dialect/English Parallel Text is distributed via web download.

2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2250.

*

(2) Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute of Formal and Applied Linguistics at Charles University in Prague, Czech Republic. It is a corpus of Czech-English parallel resources translated, aligned and manually annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two million words in close to 100,000 sentences.

The principal new material in PCEDT 2.0 is the entire Wall Street Journal data from Treebank-3 (LDC99T42). The Reader's Digest material, the Czech monolingual corpus and the English-Czech dictionary from PCEDT 1.0 are not included.

Each section is enhanced with a comprehensive manual linguistic annotation in the Prague Dependency Treebank style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:

-dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)

-semantic labeling of content words and types of coordinating structures

-argument structure, including an argument structure ('valency') lexicon for both languages

-ellipsis and anaphora resolution

This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer in the corpus. Please consult the PCEDT website for more information and documentation.

Prague Czech-English Dependency Treebank (PCEDT) 2.0 is distributed on one DVD-ROM.

2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$100.


