ISCApad #175 |
Thursday, January 10, 2013 by Chris Wellekens |
In this newsletter:
- Spring 2013 LDC Data Scholarship Program - deadline approaching!
- Two New LDC Podcasts for your Listening Pleasure
- Penn Discourse Treebank Version 2.0 Update
New publications:
- GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
- Russian-English Computer Security Parallel Text
Spring 2013 LDC Data Scholarship Program - deadline approaching!

The deadline for the Spring 2013 LDC Data Scholarship Program is one month away! Student applications are being accepted now through January 15, 2013, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.
Two New LDC Podcasts for your Listening Pleasure

Two new podcasts are available on the LDC blog, continuing the 20th Anniversary series. The first features Natalia Bragilevskaya, LDC’s Business Administrator; Ilya Ahtaridis, Membership Coordinator; and Marian Reed, Marketing Coordinator. They recall the early days of LDC and describe the growth of sponsored projects work and LDC’s interactions with its membership. Click here for Natalia, Ilya and Marian’s podcast.

The third podcast in the series introduces the community to two LDC researchers, Yiwola Awoyale and Moussa Bamba, whose work focuses on West African languages. Yiwola has been teaching Linguistics, Yoruba language studies and various aspects of African linguistics since 1975. At LDC, he developed the Global Yoruba Lexical Database, a set of related dictionaries based on Yoruba and its diaspora. Moussa’s work in the Manding languages of the Niger-Congo family has resulted in the release of the Mawukakan Lexicon, to be followed by similar resources for Maninkakan, Bambara, and Jula. In their podcast, Yiwola and Moussa discuss how they came to LDC, their current research and how it benefits multiple communities. Click here for Yiwola and Moussa’s podcast.

Other podcasts will be published via the LDC blog, so stay tuned to that space.

Penn Discourse Treebank Version 2.0 Update

The developers of the Penn Discourse Treebank Version 2.0 LDC2008T05 (PDTB) have updated this release to add metadata to the Wall Street Journal (WSJ) news stories in the corpus. The goal is to help users understand PDTB files as texts and to distinguish texts from different genres within the WSJ. The metadata includes the following fields:
All new downloads of PDTB will contain the complete updated corpus. Current PDTB licensees can re-download the file to obtain the updated data.
LDC will be closed from Monday, December 24, 2012 through Tuesday, January 1, 2013 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 2, 2013. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

New publications

(1) GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web was developed by LDC and contains 154,541 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16) and GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T20) are also available through LDC.
Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters. The Chinese word alignment tasks consisted of the following components:
GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web is distributed via web download. 2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

(2) Russian-English Computer Security Parallel Text was developed by The MITRE Corporation. It consists of parallel sentences from a set of computer security reports published in Russian and translated into English by translators with particular expertise in the technical area. Translators were instructed to err on the side of literal translation if required, but to maintain the technical writing style of the source and to make the resulting English as natural as possible. The translators followed specific guidelines for translation, and those are included in this distribution.

There are 6,276 lines of parallel Russian and English, with a total of 60,059 words of Russian and 76,437 words of English, presented in a separate UTF-8 plain text file for each language. The sentences were translated in sequential order and are presented in a scrambled order, such that sentences at identical line numbers in the two files are translations of each other. For example, the 31st line of the English file is a translation of the 31st line of the Russian file. The original line sequence is not provided. 1,694 untranslated lines (such as code snippets) are included as a separate file.

Russian-English Computer Security Parallel Text is distributed via web download. 2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.
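For readers planning to work with the Russian-English Computer Security Parallel Text, the line-aligned layout described above (one UTF-8 plain text file per language, with translations at identical line numbers) can be read with a few lines of Python. This is only a minimal sketch: the function name and the file names in the usage note are hypothetical and are not taken from the corpus documentation.

```python
from typing import List, Tuple


def read_parallel(ru_path: str, en_path: str) -> List[Tuple[str, str]]:
    """Pair line i of the Russian file with line i of the English file.

    Both paths are supplied by the caller; the actual file names in the
    distribution are not stated in this announcement.
    """
    with open(ru_path, encoding="utf-8") as ru_file:
        ru_lines = [line.rstrip("\n") for line in ru_file]
    with open(en_path, encoding="utf-8") as en_file:
        en_lines = [line.rstrip("\n") for line in en_file]

    # Alignment is by line number only, so a length mismatch means the
    # two files cannot be paired reliably.
    if len(ru_lines) != len(en_lines):
        raise ValueError(
            f"line counts differ: {len(ru_lines)} Russian vs {len(en_lines)} English"
        )
    return list(zip(ru_lines, en_lines))
```

With hypothetical file names, `pairs = read_parallel("russian.txt", "english.txt")` returns a list in which `pairs[30]` holds the 31st Russian line and its English translation. Because the sentences are distributed in scrambled order, adjacent pairs should not be assumed to come from the same report.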