ISCA - International Speech
Communication Association

ISCApad #153

Saturday, March 05, 2011 by Chris Wellekens

5-2-3 LDC Newsletter (February 2011)

In this newsletter:

- Publications Pipeline for MY2011 -

- LDC Data Scholarship Update -

New publications:

- Indian Language Part-of-Speech Tagset: Sanskrit -

- OntoNotes 4.0 -

Publications Pipeline for MY2011

LDC is pleased to provide the following information on our planned releases for Membership Year 2011 (MY2011) and would like to remind our data users that there is still time to save on membership fees for MY2011, but time is running out! Any organization which joins or renews membership for 2011 through Tuesday, March 1, 2011, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2010 can receive a 10% discount on fees provided they renew prior to March 1, 2011.

Many publications for MY2011 are still in development, but we plan to release updates to some of our popular Gigaword corpora as well as new speech corpora. Please note that the list is tentative and subject to modifications. Our planned publications for this year include:

2005 NIST Speaker Recognition Evaluation ~ the 2005 data from the ongoing series of yearly evaluations conducted by NIST (National Institute of Standards and Technology). These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition.

Arabic Gigaword Fifth Edition ~ LDC’s Arabic newswire collection from 2009 and 2010 as well as the contents of Arabic Gigaword Fourth Edition (LDC2009T30). The news sources represented include Agence France Presse, An Nahar, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, Assabah, Al-Ahram, Ummah Press and Xinhua News Agency.

Chinese Gigaword Fifth Edition ~ LDC’s Chinese newswire collection from 2009 and 2010 as well as the contents of Chinese Gigaword Fourth Edition (LDC2009T27). The news sources represented include Agence France Presse, Central News Agency (Taiwan), Xinhua News Agency, Zaobao, People's Liberation Army Daily, People’s Daily, Guangming Daily and China News Service.

Digital Archive of Southern Speech ~ a geographical sampling of colloquial speech in the Southern United States. Samples of speech were collected through interviews of single subjects speaking on a variety of common topics like family, the weather, household articles and activities, agriculture, and social connections. Speakers range in age from 15 to 90, with an average age of 61.

English Gigaword Fifth Edition ~ LDC’s English newswire collection from 2009 and 2010 as well as the contents of English Gigaword Fourth Edition (LDC2009T13). The news sources represented include Agence France Presse, Associated Press, Central News Agency (Taiwan), NY Times, Washington Post, Los Angeles Times and Xinhua News Agency.

MALACH English ~ over 300 hours of English audio recordings of interviews conducted under the auspices of the USC Shoah Foundation Institute for Visual History and Education and associated transcripts produced as part of the Multilingual Access to Large Spoken ArCHives (MALACH) project. The data was collected using table microphones. Recordings are 2-channel, 128 kbps, 44.1 kHz mp2 files, with a different speaker generally predominant in each channel.

2011 Subscription Members are automatically sent all MY2011 data as it is released. 2011 Standard Members are entitled to request 16 corpora for free from MY2011. Non-members may license most data for research use.

LDC Data Scholarship Update

LDC received many solid applications for the second installment of the LDC Data Scholarship Program! The program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. Data use proposals covered a range of research interests, from entity tagging to parsing to automatic speech recognition, which made for a competitive selection process.

We are reviewing applications and will announce our winners soon.

New Publications

(1) Indian Language Part-of-Speech Tagset: Sanskrit was developed by Microsoft Research (MSR) India to support Part-of-Speech (POS) tagging and other data-driven linguistic research on Indian languages in general. It was created as part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).

The goal of the IL-POST project is to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across those languages. It supports a three-level hierarchy of Categories, Types and Attributes. The corpus therefore consists mainly of two levels of information for each lexical token: (a) the lexical Category and Type, and (b) a set of morphological attributes and their associated values in the context.

This corpus contains 3,703 sentences (57,218 words) of manually annotated Sanskrit text selected from the Panchatantra stories, a collection of animal fables in verse and prose dating from the third century BCE. All annotated data is provided in both XML and text files. The XML files are contained in the 'XML_files' folder and the text files in the 'text_files' folder. Each data file contains between 12,000 and 45,000 words. The XML file contains metadata about the material, such as language, encoding and data size.

Indian Language Part-of-Speech Tagset: Sanskrit is distributed via web download.

2011 Subscription Members will automatically receive two copies of this corpus on disc provided that they have submitted a completed copy of the Microsoft Research India License Agreement. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by submitting a completed copy of the Microsoft Research India License Agreement. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. This data is available at no charge.

(2) OntoNotes Release 4.0 was developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese and newswire data in Arabic. This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.
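
As a quick check on the breakdown above, the per-genre figures can be tallied by language and in total. The following is only an illustrative sketch: the dictionary and function names are hypothetical and are not part of the OntoNotes distribution or its Python API; the numbers are the approximate counts (in thousands of words) quoted in this newsletter.

```python
# Approximate word counts in thousands, copied from the newsletter text.
# The structure and names here are illustrative assumptions only.
ONTONOTES_4_WORDS_K = {
    "Arabic": {"newswire": 300},
    "Chinese": {
        "newswire": 250,
        "broadcast news": 250,
        "broadcast conversation": 150,
        "web text": 150,
    },
    "English": {
        "newswire": 600,
        "broadcast news": 200,
        "broadcast conversation": 200,
        "web text": 300,
    },
}


def totals_by_language(counts):
    """Sum the per-genre counts for each language."""
    return {lang: sum(genres.values()) for lang, genres in counts.items()}


def grand_total(counts):
    """Sum the counts over all languages and genres."""
    return sum(totals_by_language(counts).values())


if __name__ == "__main__":
    for lang, total in totals_by_language(ONTONOTES_4_WORDS_K).items():
        print(f"{lang}: {total}k words")
    print(f"Total: {grand_total(ONTONOTES_4_WORDS_K) / 1000} million words")
```

The listed figures add up to 2,400k words, which agrees with the stated total of 2.4 million.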

The OntoNotes project builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference.

Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database with a Python API to provide convenient cross-layer access.

OntoNotes 4.0 is distributed on 1 DVD-ROM.

2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. This data is available at no charge, but is subject to shipping and handling fees for non-members.
