ISCA - International Speech Communication Association



ISCApad #178

Wednesday, April 10, 2013 by Chris Wellekens

5-2-3 LDC Newsletter (March 2013)
  

 

 

Reprint of March 2013 Newsletter

   

LDC's March 2013 newsletter may not have reached all intended recipients and is being reprinted below.

   

     

   

LDC’s 20th Anniversary: Concluding a Year of Celebration

   

We’ve enjoyed celebrating our 20th Anniversary over the past year (April 2012 - March 2013) and would like to review some highlights before it closes.
     
Our 2012 User Survey, circulated early in 2012, included a special Anniversary section in which respondents were asked to reflect on their opinions of, and dealings with, LDC over the years. We were humbled by the response. Multiple users mentioned that they would not be able to conduct their research without LDC and its data. A full list of survey testimonials is available online.
     
LDC also developed its first-ever timeline (initially published in the April 2012 Newsletter) marking significant milestones in the consortium’s founding and growth.
     
In September, we hosted a 20th Anniversary Workshop that brought together many friends and collaborators to discuss the present and future of language resources.
     
Throughout the year, we conducted several interviews with long-time LDC staff members to document their unique recollections of LDC history and to solicit their opinions on the future of the Consortium. These interviews are available as podcasts on the LDC Blog.
     
As our Anniversary year draws to a close, one task remains: to thank all of LDC’s past, present and future members and other friends of the Consortium for their loyalty and for their contributions to the community. LDC would not exist without its supporters. The variety of relationships that LDC has built over the years is a direct reflection of the vitality, strength and diversity of the community. We thank you all and hope to continue serving your needs in our third decade and beyond.
     
For a last treat, please visit LDC’s newly launched YouTube channel to enjoy a video montage of the LDC staff interviews featured in the podcast series.
     
      Thank you again for your continued support!

   

New publications

   

(1) 1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish.

   

UN parliamentary documents are available from the UN Official Document System (UN ODS). UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Documents exist in one or more of the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. UN ODS also contains a large number of German documents, marked with the language "other", but these are not included in this dataset.

   

LDC has also released parallel UN parliamentary documents in English, French and Spanish spanning the period 1988-1993, as UN Parallel Text (Complete) (LDC94T4A).

   

The data is presented as raw text and word-aligned text. There are 673,670 raw text documents and 520,283 word-aligned documents. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair.
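As a rough illustration of what "each individual language pair" implies, the sketch below enumerates the pairs that can be formed from the six official languages. Treating a pair as unordered, and the variable names used here, are assumptions made for illustration; the newsletter does not say how the pairs are organized in the release.

```python
from itertools import combinations

# The six official UN languages named in the corpus description.
LANGUAGES = ["Arabic", "Chinese", "English", "French", "Russian", "Spanish"]

# Assumption: a language pair is unordered, so (English, French) and
# (French, English) count as the same pair.
language_pairs = list(combinations(LANGUAGES, 2))

for first, second in language_pairs:
    print(f"{first}-{second}")

print(len(language_pairs), "unordered language pairs")
```

Under that assumption, the alignment steps would be run once per pair rather than once over all six languages jointly.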

   

1993-2007 United Nations Parallel Text is distributed on three DVD-ROMs.

   

2013 Subscription Members will automatically receive two copies of this data provided they have completed the UN Parallel Text Corpus User Agreement. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$175.

   

*

   

(2) GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed by LDC and contains 158,387 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

   

Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags is designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.
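One way to picture a tagged alignment is as a list of links, each joining some source tokens to some target tokens and carrying a tag. The sketch below is purely illustrative: the class name, field names, example sentence, and the tag label EXAMPLE_LINK are all hypothetical, and it does not reproduce the actual annotation file format or tag inventory of this release.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlignmentLink:
    """Hypothetical container for one translation unit."""
    source_indices: List[int]   # positions of Chinese character tokens
    target_indices: List[int]   # positions of English tokens
    link_tag: str               # hypothetical label for the translation relation
    # Hypothetical per-token word tags, keyed by source index; left empty here.
    word_tags: Dict[int, str] = field(default_factory=dict)


# Made-up example sentence pair; Chinese is tokenized by character.
source_tokens = ["我", "喜", "欢", "书"]
target_tokens = ["I", "like", "books"]

links = [
    AlignmentLink([0], [0], "EXAMPLE_LINK"),
    AlignmentLink([1, 2], [1], "EXAMPLE_LINK"),   # two characters, one word
    AlignmentLink([3], [2], "EXAMPLE_LINK"),
]


def render(link: AlignmentLink) -> str:
    """Format one link as 'source <-> target [tag]'."""
    src = "".join(source_tokens[i] for i in link.source_indices)
    tgt = " ".join(target_tokens[i] for i in link.target_indices)
    return f"{src} <-> {tgt} [{link.link_tag}]"


for link in links:
    print(render(link))
```

Allowing several indices on either side lets a single link cover a multi-character Chinese word, which is one plausible reading of a "minimum translation unit".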

   

This release consists of Chinese source web data (newsgroup, weblog) collected by LDC between 2005 and 2010. The distribution by words, character tokens and segments appears below:

   

                                                                                                                                                   
           

Language   Files   Words     CharTokens   Segments
Chinese    1,224   105,591   158,387      4,836

         

   

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
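The stated ratio can be checked against the counts reported above. This is a small sketch; the variable names are illustrative only.

```python
# Counts reported for the Chinese data in this release.
words = 105_591
char_tokens = 158_387

# One token is one character, so this is the characters-per-word ratio.
chars_per_word = char_tokens / words
print(round(chars_per_word, 2))
```

The result is consistent with the stated equivalence of one word to 1.5 characters.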

   

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging 8 different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

   

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web is distributed via web download.

   

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1,750.

   

     

 

 

