ISCA - International Speech
Communication Association



ISCApad #148

Sunday, October 10, 2010 by Chris Wellekens

5-2-3 LDC Newsletter (September 2010)
  

-  LDC Data Scholarship Program Update  -

-  LDC at Interspeech 2010, Makuhari, Japan, September 27-30, 2010  -


LDC2010T16
-  Indian Language Part-of-Speech Tagset: Bengali  -

LDC2010T15
-  Message Understanding Conference 7 Timed (MUC7_T)  -

 


LDC Data Scholarship Program Update

LDC is excited to announce that we've received many strong applications for our Fall 2010 LDC Data Scholarship program!  The LDC Data Scholarship program provides university students with access to LDC data at no cost.  Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser.  LDC will provide information on our scholarship winners in our October newsletter.  The next program cycle is scheduled for the Spring 2011 semester.

LDC at Interspeech 2010, Makuhari, Japan, September 27-30, 2010


LDC will soon be traveling to the Far East to exhibit at Interspeech 2010 in Makuhari, Japan. We are very enthusiastic about this opportunity to mingle with members of the speech research community in a far-away setting. Please stop by booth #27 to say hi and to try your luck at scoring an exciting giveaway! We hope to see you there!

Interspeech 2010’s central theme is ‘Spoken Language Processing for All’. For more information on the conference, please visit the conference website.

New Publications

 

(1) Indian Language Part-of-Speech Tagset: Bengali is a corpus developed by Microsoft Research (MSR) India to support Part-of-Speech (POS) tagging and other data-driven linguistic research on Indian languages in general. It was created as part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among linguists and computer scientists from MSR India, Anna University, Chennai (AU-KBC), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).

The goal of the IL-POST project is to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across those languages. It supports a three-level hierarchy of Categories, Types and Attributes. The corpus therefore consists mainly of two levels of information for each lexical token: (a) the lexical Category and Type, and (b) a set of morphological attributes and their associated values in the context.
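As a rough illustration of the three-level hierarchy described above, the following Python sketch models one annotated token. The class names, field names, tag labels and attribute values are all hypothetical placeholders chosen for this example; they are not taken from the IL-POST tagset or from the corpus files.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IlPostTag:
    """One tag in a three-level Category / Type / Attribute scheme (hypothetical model)."""
    category: str                                              # level 1: lexical Category
    type: str                                                  # level 2: Type within the Category
    attributes: Dict[str, str] = field(default_factory=dict)   # level 3: attribute -> value


@dataclass
class AnnotatedToken:
    """A lexical token paired with its tag."""
    form: str
    tag: IlPostTag

    def render(self) -> str:
        # Sort attributes so the rendered line is stable across runs.
        attrs = ",".join(f"{k}={v}" for k, v in sorted(self.tag.attributes.items()))
        return f"{self.form}\t{self.tag.category}.{self.tag.type}\t{attrs}"


# Placeholder token and labels, for illustration only.
token = AnnotatedToken(
    form="example-word",
    tag=IlPostTag("Noun", "Common", {"number": "singular", "case": "direct"}),
)
print(token.render())
```

Keeping Category and Type as fixed fields and the attributes as an open mapping mirrors the split between information levels (a) and (b); the actual file layout is described in the corpus documentation.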

Bengali (also referred to as Bangla) is a member of the Eastern Indo-Aryan language group. It is native to the region of Bengal which consists of Bangladesh, the Indian state of West Bengal, and parts of the Indian states of Tripura and Assam. It is spoken by more than 210 million people as a first or a second language with around 100 million speakers in Bangladesh, about 85 million speakers in India, and others in immigrant communities in the United Kingdom, USA and the Middle East.

This corpus contains 7168 sentences (102933 words) of manually annotated text from modern standard Bengali sources including blogs, Wikipedia, Multikulti and a portion of the EMILLE/CIIL corpus. The annotated data is structured into two folders, Bangla1 (3684 sentences, 51091 words) and Bangla2 (3484 sentences, 51842 words), which represent the two stages in which the data was annotated. All annotated data is provided in both XML and text files. Each data file contains between 3,000 and 5,000 words. The XML file contains metadata about the material, such as language, encoding and data size.
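The folder figures quoted above can be checked against the stated totals with a few lines of Python. This is only a consistency check on the numbers in this announcement; the dictionary layout is an assumption made for the example and does not read the corpus itself.

```python
# Sentence and word counts as quoted in the announcement.
folders = {
    "Bangla1": {"sentences": 3684, "words": 51091},
    "Bangla2": {"sentences": 3484, "words": 51842},
}

total_sentences = sum(f["sentences"] for f in folders.values())
total_words = sum(f["words"] for f in folders.values())

# Both sums should match the corpus-level totals stated in the text.
print(total_sentences == 7168, total_words == 102933)
```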

The Annotation Guidelines for Bangla contain a detailed description of the annotation methodology. The Annotation Tool Guideline 1.0 describes the annotation interface developed for the IL-POST framework; the tool is not included in this release.

2010 Subscription Members will automatically receive two copies of this corpus provided that they have submitted a completed copy of the Microsoft Research India License Agreement.  2010 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by submitting a completed copy of the Microsoft Research India License Agreement.  The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address.  This data is available at no charge.

*

(2) Message Understanding Conference 7 Timed (MUC7_T) was developed by researchers at Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany. It is a re-annotation of a portion of the MUC7 corpus (Linguistic Data Consortium, LDC2001T02), which consists of New York Times news stories annotated for use in the Message Understanding Conference 7 (MUC7) evaluation.  The series of MUC evaluations in the 1990s focused on emerging information extraction technologies. Further information about the MUC7 evaluation can be found here.

MUC7_T consists of 100 articles from the MUC7 corpus training set reannotated for named entities (persons, locations and organizations), with a time stamp indicating the time measured for the linguistic decision-making process. The corpus was developed for two principal purposes: for use in evaluations of selective sampling strategies, such as Active Learning, and to create predictive models for annotation costs. The annotation was performed by two advanced students of linguistics with good English language skills who followed the original guidelines of the MUC7 named entity task (which can be found in the online documentation for the MUC7 corpus).

The data is stored in XML format. Each annotation example is stored in an anno_example element, which carries the original MUC7 document as its text context. The MUC7 document was tokenized using the Stanford Tokenizer, with white spaces marking token boundaries. The tokenizer is part of the Stanford Parser package, which can be obtained from The Stanford Natural Language Processing Group.
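A minimal Python sketch of reading such a file might look like the following. Only the element name anno_example and the whitespace token boundaries come from the description above; the root element, the inline sample document and the function name are hypothetical, and the real files will contain additional structure (entity and timing information) that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: only the anno_example element name is
# taken from the corpus description; everything else is a placeholder.
SAMPLE = """<corpus>
  <anno_example>Mr. Smith visited New York .</anno_example>
  <anno_example>The company is based in Jena .</anno_example>
</corpus>"""


def iter_token_lists(xml_string):
    """Yield the whitespace-separated tokens of each anno_example element."""
    root = ET.fromstring(xml_string)
    for example in root.iter("anno_example"):
        # Collect all text inside the element, then split on white space,
        # since white spaces mark the token boundaries.
        text = "".join(example.itertext())
        yield text.split()


for tokens in iter_token_lists(SAMPLE):
    print(len(tokens), tokens)
```

Using itertext() rather than a named child element avoids assuming how the text context is nested inside each anno_example.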

2010 Subscription Members will automatically receive two copies of this corpus on disc.  2010 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$150.

