ISCA Services

ISCA - International Speech Communication Association

ISCApad #182

Saturday, August 10, 2013 by Chris Wellekens

5-2-3 LDC Newsletter (July 2013)

In this newsletter:

- Fall 2013 Data Scholarship Program -

New publications:

- Chinese Proposition Bank 3.0 -

- GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 -

Fall 2013 Data Scholarship Program

Applications are now being accepted through September 16, 2013, 11:59PM EST for the Fall 2013 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost.

This program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project, and should include information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select at most one or two databases.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Fall 2013 program is Monday, September 16, 2013, 11:59PM EST.

New publications

(1) Chinese Proposition Bank 3.0 is a continuation of the Chinese Proposition Bank project, which aims to create a corpus of text annotated with information about basic semantic propositions. Chinese Proposition Bank 3.0 adds predicate-argument annotation on 187,731 words from Chinese Treebank 7.0 (LDC2010T07). The data sources comprise newswire, magazine articles, various broadcast news and broadcast conversation programming, web newsgroups and weblogs.

LDC has also released Chinese Proposition Bank 1.0 (LDC2005T23) and Chinese Proposition Bank 2.0 (LDC2008T07).

This release contains the predicate-argument annotation of 173,206 verb instances and 14,525 noun instances. The annotation of nouns is limited to nominalizations that have a corresponding verb. The general annotation guidelines and the lexical guidelines (called frame files) for each verbal and nominal predicate are also included in this release. Below are some statistics about the corpus.

Total propositions for verbs - 173,206
Total propositions for nouns - 14,525
Total verbs framed - 24,642
Total framesets - 26,467
Verbs with multiple framesets - 1,337
Average framesets per verb - 1.07
Total nouns framed - 1,421
Total noun framesets - 1,528
Nouns with multiple framesets - 48
Average framesets per noun - 1.08
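
The figures above can be cross-checked with a little arithmetic. The short Python sketch below is not part of the LDC release; the variable names are illustrative, and it assumes that "Total framesets" refers to verb framesets, which is consistent with the reported average of 1.07. It recomputes the two averages and confirms that the verb and noun instance counts sum to the 187,731 figure quoted for this release.

```python
# Figures copied from the corpus statistics listed above.
verb_propositions = 173_206
noun_propositions = 14_525
verbs_framed = 24_642
verb_framesets = 26_467   # assumption: "Total framesets" counts verb framesets
nouns_framed = 1_421
noun_framesets = 1_528

# Verb and noun instances together give the total annotated count.
total_annotated = verb_propositions + noun_propositions

# Average framesets per predicate, rounded to two decimals as in the list.
avg_verb = round(verb_framesets / verbs_framed, 2)
avg_noun = round(noun_framesets / nouns_framed, 2)

print(total_annotated)      # 187731
print(avg_verb, avg_noun)   # 1.07 1.08
```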

Chinese Proposition Bank 3.0 is distributed via web download.

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.

(2) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 was developed by LDC and contains 115,826 tokens of word-aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction, and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance through enhanced syntactic parsers, better rules and knowledge about language pairs, and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast News v1.0 (LDC2012T07).

The source data consists of Arabic broadcast news programming collected by LDC in 2005 and 2006 from Alhurra, Aljazeera and Dubai TV. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language	Files	Words	Tokens	Segments
Arabic	28	89,213	115,826	4,824

Note: Word count is based on the untokenized Arabic source. Token count is based on the ATB-tokenized Arabic source.
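
As a rough illustration of how the word and token counts relate, the Python sketch below derives two ratios from the table. These ratios are not published by LDC and the variable names are illustrative; they simply follow from the reported counts.

```python
# Counts copied from the Arabic row of the table above.
files, words, tokens, segments = 28, 89_213, 115_826, 4_824

# ATB tokenization splits some words into several tokens, so the token
# count exceeds the untokenized word count.
tokens_per_word = tokens / words
tokens_per_segment = tokens / segments

print(f"{tokens_per_word:.2f} tokens per word")        # 1.30 tokens per word
print(f"{tokens_per_segment:.1f} tokens per segment")  # 24.0 tokens per segment
```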

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:

Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
Tagging unmatched words attached to other words or phrases

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 is distributed via web download.
