ISCApad #189
Saturday, March 15, 2014 by Chris Wellekens
In this newsletter:
Spring 2014 LDC Data Scholarship recipients!
Membership fee savings and publications pipeline
New LDC website enhancements coming soon
New publications:
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2
King Saud University Arabic Speech Database
NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source

Spring 2014 LDC Data Scholarship recipients!

LDC is pleased to announce the student recipients of the Spring 2014 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen two proposals to support. The following students will receive no-cost copies of LDC data:
Membership fee savings and publications pipeline

Members can still save on 2014 membership fees, but time is running out. Any organization that joins or renews membership for 2014 through Monday, March 3, 2014, is entitled to a 5% discount. Organizations that held membership for MY2013 can receive a 10% discount on fees provided they renew prior to March 3, 2014. Planned publications for this year include:
New LDC website enhancements coming soon
New publications

(1) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 was developed by LDC and contains 141,058 tokens of word-aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction, and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs, and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast News v1.0 (LDC2012T07).

The source data consists of Arabic broadcast news programming collected by LDC in 2007 and 2008. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.
The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.
(2) King Saud University Arabic Speech Database was designed principally for speaker recognition research. The speech sources are sentences, word lists, prose and question-and-answer sessions. Read speech text includes the following:
Spontaneous speech was captured through question-and-answer sessions between participants and project team members. Speakers responded to questions on general topics such as the weather and food. Each speaker was recorded in three different environments: a soundproof room, an office, and a cafeteria. The recordings were collected via microphone and mobile phone and averaged between 16 and 19 minutes. The data was verified for missing recordings, problems with the recording system or errors in the recording process.

King Saud University Arabic Speech Database is distributed on one hard disk. 2014 Subscription Members will receive a copy of this data provided that they have completed the User License Agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
(3) NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source was developed by the NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT 2012 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. The set is based on a subset of the Arabic-to-English and Chinese-to-English progress tests from the OpenMT 2008, 2009 and 2012 evaluations, with new source data created by humans based on the English reference translation. The package was compiled, and scoring software was developed, at NIST, making use of newswire and web data and reference translations developed by the Linguistic Data Consortium and the Defense Language Institute Foreign Language Center.

The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The 2012 task included the evaluation of five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English in two source data styles. For general information about the NIST OpenMT evaluations, refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.

This release consists of 20 files, four for each of the five languages, presented in XML with an included DTD.
The four files are source and reference data in the following two styles:
NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$150.
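The word-sequence comparison performed by the scoring script can be illustrated with a small sketch. The Python below is not mteval-v13a.pl and will not reproduce its scores; it only shows, under simplifying assumptions (whitespace tokenization, no length penalty, a single sentence), how word sequences in a system output might be matched against several reference translations. The function names `ngrams` and `clipped_precision` are hypothetical and chosen for this illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous word sequence of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in a reference.

    Illustrative assumption: each hypothesis n-gram is credited at most as
    many times as it appears in the reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0

    # For each n-gram, record the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())
```

Calling `clipped_precision(system_output, reference_list, 1)` measures single-word overlap, and larger values of `n` reward longer matching word sequences. A full metric would combine several values of `n` and account for output length, which this sketch leaves out.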