ISCApad #192 |
Thursday, June 12, 2014 by Chris Wellekens |
In this newsletter: LDC at LREC 2014 Following the conference LDC’s presented papers and posters will be available on LDC’s Papers Page.
New publications (1) GALE Arabic-English Word Alignment Training Part 2 -- Newswire was developed by LDC and contains 162,359 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation. This release consists of Arabic source newswire collected by LDC in 2004 - 2006 and 2008. The distribution by genre, words, character tokens and segments appears below:
Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source. The Arabic word alignment tasks consisted of the following components:
GALE Arabic-English Word Alignment Training Part 2 -- Newswire is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.
* (2) Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc, a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999. Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English second language instruction and designed to engage the speakers in collaborative, problem-solving activities. Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved under the Entropic Audio (ESPS) format using a 16kHz sampling rate and 16 bit samples. Audio files were converted to flac compressed .wav files from the ESPS format. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data. Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension. Hispanic-English Database is distributed on 1 DVD-ROM. 2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.
*
The source material is comprised of Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations under three annotation protocols. In the first protocol, foreign language native speakers built English networks starting from foreign language sentences. In the second, English native speakers built English networks from the best translation of a foreign language sentence as identified by NIST (National Institute of Standards and Technology). In the third protocol, English native speakers built English networks starting from the best translation, but those annotators also had access to three additional, independently produced human translations. Networks created by different annotators for each sentence were combined and evaluated. This release includes the source sentences and four human reference translations produced by LDC in XML format, along with five machine translation system outputs representing a variety of system architectures and performance, and the human post-edited output of those systems also presented in XML. HyTER Networks of Selected OpenMT08/09 Progress Set Sentences is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$150.
|
Back | Top |