ISCApad #153
Saturday, March 05, 2011 by Chris Wellekens
In this newsletter:
- Publications Pipeline for MY2011
- LDC Data Scholarship Update
New publications:
- Indian Language Part-of-Speech Tagset: Sanskrit
- OntoNotes 4.0

Publications Pipeline for MY2011
LDC is pleased to provide the following information on our planned releases for Membership Year 2011 (MY2011) and would like to remind our data users that there is still time to save on membership fees for MY2011, but time is running out! Any organization that joins or renews membership for 2011 through Tuesday, March 1, 2011, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2010 can receive a 10% discount on fees provided they renew prior to March 1, 2011.
Many publications for MY2011 are still in development, but we plan to release updates to some of our popular Gigaword corpora as well as new speech corpora. Please note that the list is tentative and subject to modifications. Our planned publications for this year include:
LDC Data Scholarship Update
New Publications
(1) Indian Language Part-of-Speech Tagset: Sanskrit was developed by Microsoft Research (MSR) India to support Part-of-Speech (POS) tagging and other data-driven linguistic research on Indian languages in general. It was created as part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).
The goal of the IL-POST project is to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across those languages. It supports a three-level hierarchy of Categories, Types and Attributes. The corpus therefore provides two main levels of information for each lexical token: (a) the lexical Category and Type, and (b) a set of morphological attributes and their associated values in context.
This corpus contains 3,703 sentences (57,218 words) of manually annotated Sanskrit text selected from the Panchatantra stories, a collection of animal fables in verse and prose dating from the third century BCE. All annotated data is provided in both XML and text files. The XML files are contained in the 'XML_files' folder and the text files in the 'text_files' folder. Each data file contains between 12,000 and 45,000 words. The XML files contain metadata about the material, such as language, encoding and data size.

(2) OntoNotes Release 4.0 was developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute.
The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese and newswire data in Arabic. This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.
The OntoNotes project builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database with a Python API to provide convenient cross-layer access.
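As a quick check on the word counts quoted for OntoNotes Release 4.0, the per-genre figures can be tallied by language and in total. The sketch below is only an illustration: the dictionary and function names are made up for this example and are not part of the OntoNotes Python API or the release itself; the numbers are the ones listed in the announcement, in thousands of words.

```python
# Word counts in thousands, copied from the release description.
# The names below are illustrative only (not from the OntoNotes distribution).
ONTONOTES_4_WORDS_K = {
    ("Arabic", "newswire"): 300,
    ("Chinese", "newswire"): 250,
    ("Chinese", "broadcast news"): 250,
    ("Chinese", "broadcast conversation"): 150,
    ("Chinese", "web text"): 150,
    ("English", "newswire"): 600,
    ("English", "broadcast news"): 200,
    ("English", "broadcast conversation"): 200,
    ("English", "web text"): 300,
}


def totals_by_language(counts):
    """Sum the per-genre counts for each language."""
    totals = {}
    for (language, _genre), thousands in counts.items():
        totals[language] = totals.get(language, 0) + thousands
    return totals


per_language = totals_by_language(ONTONOTES_4_WORDS_K)
grand_total = sum(per_language.values())

print(per_language)
print(grand_total, "thousand words in total")
```

The per-language sums come to 300k for Arabic, 800k for Chinese and 1,300k for English, which together match the stated 2.4 million words.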