ISCA - International Speech
Communication Association



ISCApad #182

Saturday, August 10, 2013 by Chris Wellekens

5-2-3 LDC Newsletter (July 2013)
  

 

In this newsletter:

Fall 2013 Data Scholarship Program

New publications:

- Chinese Proposition Bank 3.0
- GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1

   


   

 

   

Fall 2013 Data Scholarship Program

   

Applications are now being accepted through September 16, 2013, 11:59PM EST for the Fall 2013 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost.

   


This program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project, and should include information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. A handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select at most one or two databases.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Fall 2013 program is Monday, September 16, 2013, 11:59PM EST.
       
     

   


New publications
     

   

     

   

(1) Chinese Proposition Bank 3.0 is a continuation of the Chinese Proposition Bank project, which aims to create a corpus of text annotated with information about basic semantic propositions. Chinese Proposition Bank 3.0 adds predicate-argument annotation on 187,731 words from Chinese Treebank 7.0 (LDC2010T07). The data sources comprise newswire, magazine articles, various broadcast news and broadcast conversation programming, web newsgroups and weblogs.

   

LDC has also released Chinese Proposition Bank 1.0 (LDC2005T23) and Chinese Proposition Bank 2.0 (LDC2008T07).

   

This release contains the predicate-argument annotation of 173,206 verb instances and 14,525 noun instances. The annotation of nouns is limited to nominalizations that have a corresponding verb. The general annotation guidelines and the lexical guidelines (called frame files) for each verbal and nominal predicate are also included in this release. Below are some statistics about the corpus.

   

         
  • Total propositions for verbs - 173,206
  • Total propositions for nouns - 14,525
  • Total verbs framed - 24,642
  • Total framesets - 26,467
  • Verbs with multiple framesets - 1,337
  • Average framesets per verb - 1.07
  • Total nouns framed - 1,421
  • Total noun framesets - 1,528
  • Nouns with multiple framesets - 48
  • Average framesets per noun - 1.08
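The two averages, and the 187,731 annotated words mentioned earlier, can be recomputed from the counts in the list. The following is a minimal Python sketch of that check; the dictionary keys and the helper name `average_framesets` are illustrative and are not part of the LDC release.

```python
# Counts copied from the Chinese Proposition Bank 3.0 statistics.
# Key names are illustrative only; they do not come from the corpus files.
STATS = {
    "verb_propositions": 173_206,
    "noun_propositions": 14_525,
    "verbs_framed": 24_642,
    "verb_framesets": 26_467,
    "nouns_framed": 1_421,
    "noun_framesets": 1_528,
}


def average_framesets(framesets: int, predicates: int) -> float:
    """Return framesets per framed predicate, rounded to two decimals."""
    return round(framesets / predicates, 2)


verb_avg = average_framesets(STATS["verb_framesets"], STATS["verbs_framed"])
noun_avg = average_framesets(STATS["noun_framesets"], STATS["nouns_framed"])

# Verb and noun instances together give the total annotated predicate count.
total_annotated = STATS["verb_propositions"] + STATS["noun_propositions"]

print(f"Average framesets per verb: {verb_avg}")
print(f"Average framesets per noun: {noun_avg}")
print(f"Total annotated instances:  {total_annotated:,}")
```

The sum of verb and noun instances matches the word count quoted for the annotation, which suggests that figure counts annotated predicate instances.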

   

Chinese Proposition Bank 3.0 is distributed via web download.

   

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.

   

*

   

(2) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 was developed by LDC and contains 115,826 tokens of word-aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

   

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures and aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction, and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance through enhanced syntactic parsers, better rules and knowledge about language pairs, and reduced word error rate.

   

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast News v1.0 (LDC2012T07).

   

The source data consists of Arabic broadcast news programming collected by LDC in 2005 and 2006 from Alhurra, Aljazeera and Dubai TV. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

                                                                                                                                                       

           

Language   Files   Words    Tokens    Segments
Arabic     28      89,213   115,826   4,824

         

   

Note: Word count is based on the untokenized Arabic source. Token count is based on the ATB-tokenized Arabic source.
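The counts above also give a rough sense of how much ATB tokenization expands the text and how long the segments are. The short Python sketch below derives those ratios; the variable names are illustrative, and the ratios are computed here from the published counts rather than reported by LDC.

```python
# Counts copied from the Arabic row of the corpus statistics.
# Variable names are illustrative; the derived ratios are not LDC figures.
files = 28
words = 89_213      # untokenized Arabic source words
tokens = 115_826    # ATB-tokenized Arabic tokens
segments = 4_824

# Average number of ATB tokens produced per source word.
tokens_per_word = round(tokens / words, 2)

# Average segment length, measured in words and in tokens.
words_per_segment = round(words / segments, 2)
tokens_per_segment = round(tokens / segments, 2)

print(f"Tokens per word:    {tokens_per_word}")
print(f"Words per segment:  {words_per_segment}")
print(f"Tokens per segment: {tokens_per_segment}")
```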

   

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:

   

         
  • Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
  • Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
  • Tagging unmatched words attached to other words or phrases

   

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 is distributed via web download.

   

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

 

 

