ISCA - International Speech
Communication Association



ISCApad #174

Sunday, December 09, 2012 by Chris Wellekens

5-2-2 LDC Newsletter (November 2012)
  

In this newsletter:

  • Spring 2013 LDC Data Scholarship Program
  • Invitation to Join for Membership Year 2013
  • Why become an LDC member?
  • 2012 User Survey Results
  • LDC to Close for Thanksgiving Break

New publications:

  • Annotated English Gigaword
  • Chinese-English Semiconductor Parallel Text
  • GALE Phase 2 Arabic Newswire Parallel Text


Spring 2013 LDC Data Scholarship Program

Applications are now being accepted through January 15, 2013, 11:59PM EST for the Spring 2013 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 25 individual students and student research groups.

This program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

   

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project, and should include information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select a maximum of one to two datasets; students may apply for additional datasets during the following cycle once they have completed processing of the initial datasets and have published or presented work in a juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full Non-member Fee for the data or to join the consortium.

   

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2013 program cycle is January 15, 2013, 11:59PM EST.

   

 

   

Invitation to Join for Membership Year 2013

Membership Year (MY) 2013 is open for joining! We would like to invite all current and previous members of LDC to renew their membership and to welcome new organizations to the consortium. For MY2013, LDC is pleased to hold membership fees at last year's rates. Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2013 are as follows:

   

  • Organizations who joined for MY2012 will receive a 5% discount when renewing. This discount will apply throughout 2013, regardless of time of renewal. MY2012 members renewing before March 1, 2013 will receive an additional 5% discount, for a total 10% discount off the membership fee.

  • New members as well as organizations who did not join for MY2012, but who held membership in any of the previous MYs (1993-2011), will also be eligible for a 5% discount provided that they join/renew before March 1, 2013.

The following table provides exact pricing information.

   

 

   

                                                                                                                                                                                                                                                                                                                                                                                                                                                               
           

                 

         
           

                                MY2013 Fee    MY2013 Fee with    MY2013 Fee with
                                              5% Discount*       10% Discount**

Not-for-Profit/US Government
  Standard                      US$2400       US$2280            US$2160
  Subscription                  US$3850       US$3658            US$3465

For-Profit
  Standard                      US$24000      US$22800           US$21600
  Subscription                  US$27500      US$26125           US$24750
               

         

   


*  For new members, MY2012 members renewing for MY2013, and any previous-year member who renews before March 1, 2013

** For MY2012 members renewing before March 1, 2013
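
The discount rules and the table prices can be cross-checked with a short Python sketch. The function name `my2013_fee` and its parameters are illustrative only and are not part of any LDC tool. The sketch makes two assumptions: discounted fees are rounded half up to whole dollars (this matches US$3658 in the table, since 95% of US$3850 is US$3657.50), and new or returning members who join on or after March 1, 2013 pay the undiscounted fee, as the bullet list states.

```python
from datetime import date

# List prices in US dollars, copied from the MY2013 fee table.
BASE_FEES = {
    ("not-for-profit", "standard"): 2400,
    ("not-for-profit", "subscription"): 3850,
    ("for-profit", "standard"): 24000,
    ("for-profit", "subscription"): 27500,
}

# Joining or renewing strictly before this date counts as "early".
EARLY_DEADLINE = date(2013, 3, 1)


def my2013_fee(org_type, level, was_my2012_member, join_date):
    """Return the MY2013 fee in whole US dollars after any discount."""
    base = BASE_FEES[(org_type, level)]
    early = join_date < EARLY_DEADLINE
    if was_my2012_member:
        # 5% for renewing at any time in 2013, plus 5% more if early.
        percent = 10 if early else 5
    else:
        # New and returning (pre-MY2012) members: 5% only if early.
        percent = 5 if early else 0
    # Integer arithmetic with round-half-up (assumption, see lead-in).
    return (base * (100 - percent) + 50) // 100


if __name__ == "__main__":
    for (org_type, level), base in BASE_FEES.items():
        five = my2013_fee(org_type, level, True, date(2013, 6, 1))
        ten = my2013_fee(org_type, level, True, date(2013, 2, 1))
        print(org_type, level, base, five, ten)
```

Running the script prints one row per membership type, which can be compared against the three price columns of the table.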
       
       
Publications for MY2013 are still being planned; here are the working titles of data sets we intend to provide:

  • Arabic Treebank - Weblog
  • Hispanic-English Speech
  • Chinese-English Biomedical Parallel Text
  • Maninkakan Lexicon
  • GALE data – all phases and tasks
  • OpenMT 2008-2012 Progress Set

         

   

         
In addition to receiving new publications, current year members of the LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.

This past year, LDC members who joined early or kept their membership current saved almost US$70,000 collectively on membership fees. Be sure to keep an eye on your mail: all previous and current LDC members will be sent an invitation-to-join letter and a renewal invoice for MY2013. Renew early for MY2013 to save today!

   

 

   

Why become an LDC member?

LDC is offering early renewal discounts on membership fees for Membership Year 2013, making now a good time to consider joining or renewing membership. LDC membership has the following advantages:

  • LDC membership provides cost-effective access to an extensive and growing catalog that spans 20 years and includes over 500 multilingual speech, text, and video resources. Even if your organization only needs a few datasets from a given membership year, membership is often the most economical way to obtain current corpora. Additionally, the generous discounts that member organizations receive on older corpora reduce the cost of acquiring such datasets.

  • All members enjoy unlimited use of LDC data within their organizations. For universities, there is no difference in cost between a departmental membership and one that is university-wide. Departments can therefore combine resources and establish one LDC membership for use by the entire university community. Likewise, for-profit members with multiple branches can maintain one membership for use by their entire organizations.

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations, including commercial restrictions, on the use of certain corpora. In the case of a small group of corpora, commercial licenses must be obtained separately from the owners of the data.

   

     

   

2012 User Survey Results

Earlier this year, LDC sent a survey to its user communities. Like previous iterations in 2006 and 2007, the survey solicited community input and suggestions on key LDC-related topics, including:

  • Satisfaction levels with LDC’s data, homepage and Catalog
  • Reflections on LDC’s 20th Anniversary year
  • Suggestions for future publications
  • Speculations on the future of HLT-related fields, specifically on mobile technologies, cloud computing, social networking and open data

Survey respondents were generally satisfied with LDC’s data, membership options, homepage and Catalog, though there were requests for additional data options and data acquisition methods. Some of the data respondents requested are already in our pipeline for the end of 2012 or for Membership Year (MY) 2013, so please be on the lookout for Publications updates. Respondents were also very supportive of LDC’s 20th Anniversary, posting testimonials and well-wishes in the 20th Anniversary section.

LDC would like to thank all survey participants. Survey participants will receive access to full survey results shortly.

   

 

   

LDC to Close for Thanksgiving Break

LDC will be closed on Thursday, November 22, 2012 and Friday, November 23, 2012 in observance of the US Thanksgiving Holiday. Our offices will reopen on Monday, November 26, 2012.

   

 

   


     
       
New publications

(1) Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics that enables broader involvement by researchers in large-scale knowledge-acquisition efforts.

Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:

   

         
  • Agence France-Presse, English Service (afp_eng)
  • Associated Press Worldstream, English Service (apw_eng)
  • Central News Agency of Taiwan, English Service (cna_eng)
  • Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  • Washington Post/Bloomberg Newswire Service (wpb_eng)
  • New York Times Newswire Service (nyt_eng)
  • Xinhua News Agency, English Service (xin_eng)

   

The following layers of annotation were added:

  • Tokenized and segmented sentences
  • Treebank-style constituent parse trees
  • Syntactic dependency trees
  • Named entities
  • In-document coreference chains

   

The annotation was performed in a three-step process: (1) the data was preprocessed and sentences were selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed.
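
The length cutoff in step (1) can be illustrated with a minimal Python sketch. This is not the corpus developers' code: the function name `select_sentences` is hypothetical, and splitting on whitespace is a stand-in assumption for whatever tokenizer the actual pipeline used, which this newsletter does not describe.

```python
MAX_TOKENS = 100  # sentences with more than 100 tokens are excluded


def select_sentences(sentences, max_tokens=MAX_TOKENS):
    """Keep only sentences with at most max_tokens tokens.

    Tokens are approximated by whitespace splitting (an assumption).
    """
    return [s for s in sentences if len(s.split()) <= max_tokens]


if __name__ == "__main__":
    short = "The agency reported record output ."
    too_long = " ".join(["word"] * 101)
    kept = select_sentences([short, too_long])
    print(len(kept), "of 2 sentences kept")
```

A sentence of exactly 100 tokens is kept under this reading, since only sentences with more than 100 tokens are excluded.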

   

Annotated English Gigaword is distributed on one hard drive.

2012 Subscription Members will automatically receive one copy of this data on hard drive. 2012 Standard Members may request a copy as part of their 16 free membership corpora. 2011 Members who licensed English Gigaword Fifth Edition (LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Non-member organizations who licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for the US$200 media fee. Non-member organizations without a license to English Gigaword Fifth Edition may obtain this data for US$6000.

   

*

   

(2) Chinese-English Semiconductor Parallel Text was developed by The MITRE Corporation. It consists of parallel sentences from a collection of abstracts from scientific articles on semiconductors published in Mandarin and translated into English by translators with particular expertise in the technical area. Translators were instructed to err on the side of literal translation if required, but to maintain the technical writing style of the source and to make the resulting English as natural as possible. The translators followed specific guidelines for translation, and those are included in this distribution.

There are 2,169 lines of parallel Mandarin and English, with a total of 125,302 characters of Mandarin and 64,851 words of English, presented in a separate UTF-8 plain text file for each language. The sentences were translated in sequential order but are presented in scrambled order, with both files scrambled the same way, so that sentences at identical line numbers are translations of each other. For example, the 31st line of the English file is a translation of the 31st line of the Mandarin file. The original line sequence is not provided.
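
Because alignment is purely by line number, sentence pairs can be recovered by reading the two files in lockstep. The Python sketch below shows one way to do this; the function name `read_parallel` and the file names in the usage example are hypothetical, since the newsletter does not give the actual file names in the distribution.

```python
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file and return its lines without line endings."""
    with Path(path).open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(mandarin_path, english_path):
    """Pair line i of the Mandarin file with line i of the English file."""
    mandarin = read_lines(mandarin_path)
    english = read_lines(english_path)
    if len(mandarin) != len(english):
        # A mismatch means the line-number alignment cannot be trusted.
        raise ValueError(
            f"line counts differ: {len(mandarin)} vs {len(english)}"
        )
    return list(zip(mandarin, english))


if __name__ == "__main__":
    import sys

    # Usage: python pairs.py mandarin.txt english.txt  (hypothetical names)
    pairs = read_parallel(sys.argv[1], sys.argv[2])
    print(len(pairs), "sentence pairs")
```

The explicit line-count check is a design choice: `zip` alone would silently truncate to the shorter file and hide a corrupted or incomplete download.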

   

Chinese-English Semiconductor Parallel Text is distributed via web download.

2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.

   

*
     

   

     

   

(3) GALE Phase 2 Arabic Newswire Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected in 2007 by LDC and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Newswire Parallel Text includes 400 source-translation pairs, comprising 181,704 tokens of Arabic source text and its English translation. Data is drawn from six distinct Arabic newswire sources: Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

   

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Arabic Newswire Parallel Text is distributed via web download.

2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.
     

   

 

   


   


 

