ISCA - International Speech
Communication Association


ISCApad #173

Sunday, November 11, 2012 by Chris Wellekens

5-2-2 LDC Newsletter (October 2012)

In this newsletter:


-  Fall 2012 LDC Data Scholarship Recipients  -


-  LDC Exhibiting at NWAV 41  - 


-  LDC 20th Anniversary Workshop Wrap-up  -


-  LDC 20th Anniversary Podcasts  -


-  Language Resource Wiki  -


New publications:


-  GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire  -


-  GALE Phase 2 Arabic Broadcast News Parallel Text  -





Fall 2012 LDC Data Scholarship Recipients


LDC is pleased to announce the student recipients of the Fall 2012 LDC Data Scholarship program! This program provides university and college students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:



Jaffar Atwan - National University of Malaysia (Malaysia), PhD candidate, Information Science and Technology. Jaffar has been awarded a copy of Arabic Newswire Part 1 (LDC2001T55) for his work in information retrieval.


Sarath Chandar - Indian Institute of Technology, Madras (India), MS candidate, Computer Science and Engineering. Sarath has been awarded a copy of Treebank-3 (LDC99T42) for his work in grammar induction.


Kuruvachan K. George - Amrita Vishwa Vidyapeetham (India), PhD candidate, Electrical and Computer Engineering. Kuruvachan has been awarded a copy of Fisher English Part 2 (LDC2005S13/T19) and 2008 NIST Speaker Recognition Evaluation data (LDC2011S05/07/08/11) for his work in speaker recognition.


Eduardo Motta - Pontifícia Universidade Católica do Rio de Janeiro (Brazil), PhD candidate, Information Sciences. Eduardo has been awarded a copy of English Web Treebank (LDC2012T13) for his work in machine learning.


Genevieve Sapijaszko - University of Central Florida (USA), PhD candidate, Electrical and Computer Engineering. Genevieve has been awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) and YOHO Speaker Verification (LDC94S16) for her work in digital signal processing.


John Steinberg - Temple University (USA), MS candidate, Electrical and Computer Engineering. John has been awarded a copy of CALLHOME Mandarin Chinese Lexicon (LDC96L15) and CALLHOME Mandarin Chinese Transcripts (LDC96T16) for his work in speech recognition.



LDC Exhibiting at NWAV 41



The conference runs October 25-28, 2012, and the exhibition hall will be open October 26-28. Please stop by to say hello!




LDC 20th Anniversary Workshop Wrap-up


In early September, LDC hosted a workshop entitled “The Future of Language Resources” in celebration of our 20th anniversary. Visit the Program page to browse speaker abstracts and to access PDFs of the presentations. Thanks to the speakers and attendees for making the workshop a success!




LDC 20th Anniversary Podcasts


To further celebrate our 20th anniversary, LDC is interviewing long-time staff members for their unique perspectives on the Consortium’s growth and evolution over the past two decades. The first interview podcast debuts this month and features Dave Graff, LDC’s Lead Programmer. Visit the LDC blog to access the podcast.


Other podcasts will be published via the LDC blog, so stay tuned to that space.


Language Resource Wiki


The Language Resource Wiki catalogs data, software, descriptive grammars and other resources for a variety of languages, especially those with a paucity of generally available resources for research. LDC is actively seeking editors knowledgeable in these and other languages to develop and maintain the pages, which are readable by anyone but writable only by editors. The wiki currently has resource listings for: Bengali, Berber, Breton, Ewe, Greek (Ancient), Indonesian, Hindi, Latin, Panjabi, Pashto, Sorani (Central Kurdish), Russian, Tagalog, Tamil, and Urdu, and for the following Sign Languages: American, British, Catalan, Dutch, Flemish, German, Japanese, New Zealand, Polish, Spanish, and Swiss German.





New publications


(1) GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire was developed by LDC and contains 169,080 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.



The Chinese word alignment tasks consisted of the following components:


Identifying, aligning, and tagging eight different types of links


Identifying, attaching, and tagging local-level unmatched words


Identifying and tagging sentence/discourse-level unmatched words


Identifying and tagging all instances of Chinese 的 (DE) except when they were part of a semantic link



GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire is distributed via web download.


2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.




(2) GALE Phase 2 Arabic Broadcast News Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast news (BN) data collected by LDC between 2005 and 2007 and transcribed by LDC or under its direction.


GALE Phase 2 Arabic Broadcast News Parallel Text includes seven source-translation pairs, comprising 29,210 words of Arabic source text and its English translation. Data is drawn from six distinct Arabic programs broadcast between 2005 and 2007 from Abu Dhabi TV, based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Aljazeera, a regional broadcast programmer based in Doha, Qatar; Dubai TV, based in Dubai, United Arab Emirates; and Kuwait TV, a national television station based in Kuwait. The BN programming in this release focuses on current events topics.


The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.


GALE Phase 2 Arabic Broadcast News Parallel Text is distributed via web download.


2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.


