ISCApad #187
Saturday, January 11, 2014, by Chris Wellekens
In this newsletter:
- Spring 2014 LDC Data Scholarship Program - deadline approaching!
- LDC to close for Winter Break
- New publications:
  - GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1
  - Maninkakan Lexicon
  - The ARRAU Corpus of Anaphoric Information

Spring 2014 LDC Data Scholarship Program - deadline approaching!

The deadline for the Spring 2014 LDC Data Scholarship Program is right around the corner! Student applications are being accepted now through January 15, 2014, 11:59 PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.
LDC to close for Winter Break

LDC will be closed from Wednesday, December 25, 2013 through Wednesday, January 1, 2014 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2014. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.
New publications

(1) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 was developed by LDC and contains 179,842 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2005 - 2007. The distribution by genre, words, character tokens and segments appears below:
Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters. The Chinese word alignment tasks consisted of the following components:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

*

(2) Maninkakan Lexicon was developed by LDC and contains 5,834 entries of the Maninkakan language presented as a Maninkakan-English lexicon and a Maninkakan-French lexicon. It is the second publication in an ongoing LDC project to build an electronic dictionary of four Mandekan languages: Mawukakan, Maninkakan, Bambara and Jula. These are Eastern Manding languages in the Mande Group of the Niger-Congo language family. LDC released a Mawukakan Lexicon (LDC2005L01) in 2005. More information about LDC’s work in the languages of West Africa and the challenges those languages present for language resource development can be found here.

Maninkakan is written using Latin script, Arabic script and the NKo alphabet. This lexicon is presented using a Latin-based transcription system because the Latin alphabet is familiar to the majority of Mandekan language speakers and because it is expected to facilitate the work of researchers interested in this resource.

The dictionary is provided in two formats, Toolbox and XML. Toolbox is a version of the widely used SIL Shoebox program adapted to display Unicode. The Toolbox files are provided in two fonts, Arial and Doulos SIL. The Arial files should display using the Arial font, which is standard on most operating systems. Doulos SIL, available as a free download, is a robust font that should display all characters without issue. Users should launch Toolbox using the *.prj files in the Arial or Doulous_SIL folders.

Maninkakan Lexicon is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$800.

*

(3) The ARRAU (Anaphora Resolution and Underspecification) Corpus of Anaphoric Information was developed by the University of Essex and the University of Trento. It contains annotations of multi-genre English texts for anaphoric relations, with information about agreement, explicit representation of multiple antecedents for ambiguous anaphoric expressions, and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans.

The source texts in this release include task-oriented dialogues from the TRAINS-91 and TRAINS-93 corpora (the latter released through LDC as TRAINS Spoken Dialog Corpus, LDC95S25), narratives from the English Pear Stories, articles from the Wall Street Journal portions of the Penn Treebank (Treebank-2, LDC95T7) and the RST Discourse Treebank (LDC2002T07), and the Vieira/Poesio Corpus, which consists of training and test files from Treebank-2 and RST Discourse Treebank.

The texts were annotated using the ARRAU guidelines, which treat all noun phrases (NPs) as markables. Different semantic roles are recognized by distinguishing between referring expressions (that update or refer to a discourse model) and non-referring ones (including expletives, predicative expressions, quantifiers, and coordination). A variety of linguistic features were also annotated, including morphosyntactic agreement, grammatical function, semantic type (person, animate, concrete, action, time, other abstract) and genericity.

The annotation was carried out using the MMAX2 annotation tool, which allows text units to be marked at different levels. The files in MMAX format have been organized so that they can be visualized using the MMAX2 tool or used directly as input/output for the BART toolkit, which performs automatic coreference resolution including all necessary preprocessing steps.

The ARRAU Corpus of Anaphoric Information is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.