ISCApad #193
Friday, July 11, 2014 by Chris Wellekens
In this newsletter:
- LDC at ACL 2014: June 23-25, Baltimore, MD
- Commercial use and LDC data
New publications:
- Abstract Meaning Representation (AMR) Annotation Release 1.0
- ETS Corpus of Non-Native Written English
- GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
- MADCAT Chinese Pilot Training Set
LDC at ACL 2014: June 23-25, Baltimore, MD
LDC staff will also participate in the post-conference 2nd Workshop on EVENTS: Definition, Detection, Coreference and Representation on Friday, June 27 (https://sites.google.com/site/wsevents2014/home), with presentations at the poster session:
Early renewing members save on fees
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information: https://www.ldc.upenn.edu/data-management/using/licensing.
(1) Abstract Meaning Representation (AMR) Annotation Release 1.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Center for Computational Language and Educational Research and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 4,500 English natural language sentences from newswire, weblogs and web discussion forums.
AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax.
The source data includes discussion forums collected for the DARPA BOLT program, Wall Street Journal and translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:
* (2) ETS Corpus of Non-Native Written English was developed by Educational Testing Service and comprises 12,100 English essays written by speakers of 11 non-English native languages as part of an international test of academic English proficiency, TOEFL (Test of English as a Foreign Language). The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages, sampled from eight topics, with information about the score level (low/medium/high) for each essay.
The corpus was developed with the specific task of native language identification in mind, but it is also likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, in addition to a broad range of research studies in the fields of natural language processing and corpus linguistics. For the task of native language identification, the following division is recommended: 82% as training data, 9% as development data and 9% as test data, split according to the file IDs accompanying the data set.
The data is sampled from essays written in 2006 and 2007 by test takers whose native languages were Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. Original raw files for 11,000 of the 12,100 tokenized files are included in this release along with prompts (topics) for the essays and metadata about the test takers’ proficiency level. The data is presented in UTF-8 formatted text files.
ETS Corpus of Non-Native Written English is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc provided they have completed the user license agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1000.
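The corpus figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the release: the names `approximate_split_sizes`, `SPLIT_PERCENT` and the other identifiers are hypothetical, and because the recommended percentages are rounded in this announcement, the computed partition sizes are only approximate. The actual partition is the one fixed by the file IDs that accompany the data set.

```python
# Figures taken from the corpus description in this newsletter.
ESSAYS_PER_LANGUAGE = 1_100
NATIVE_LANGUAGES = [
    "Arabic", "Chinese", "French", "German", "Hindi", "Italian",
    "Japanese", "Korean", "Spanish", "Telugu", "Turkish",
]

# Recommended division for native language identification (rounded percentages).
SPLIT_PERCENT = {"train": 82, "dev": 9, "test": 9}


def approximate_split_sizes(total, percents):
    """Return approximate partition sizes, using integer arithmetic only.

    This is an estimate from rounded percentages, not the official split.
    """
    if sum(percents.values()) != 100:
        raise ValueError("split percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percents.items()}


# 11 languages x 1,100 essays each should match the stated corpus size.
total_essays = ESSAYS_PER_LANGUAGE * len(NATIVE_LANGUAGES)
split_sizes = approximate_split_sizes(total_essays, SPLIT_PERCENT)

for name, size in split_sizes.items():
    print(f"{name}: about {size} essays")
```

Anyone reproducing native language identification experiments should read the partition from the distributed file IDs rather than recompute it this way.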
* (3) GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast news (BN) data collected by LDC between 2005 and 2007 and transcribed by LDC or under its direction.
This release includes 30 source-translation document pairs, comprising 206,737 characters of translated material. Data is drawn from 12 distinct Chinese BN programs broadcast by China Central TV, a national and international broadcaster in Mainland China; New Tang Dynasty TV, a broadcaster based in the United States; and Phoenix TV, a Hong Kong-based satellite television station. The broadcast news recordings in this release focus principally on current events.
The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese-to-English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.
GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.
* (4) MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese Pilot Training Set contains all training data created by LDC to support a Chinese pilot collection in the DARPA MADCAT Program. The data in this release consists of handwritten Chinese documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.
The goal of the MADCAT program was to automatically convert foreign text images into English transcripts. MADCAT Chinese pilot data was collected from Chinese source documents in three genres: newswire, weblog and newsgroup text. Chinese-speaking 'scribes' copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple 'pages' for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.
The handwritten, transcribed documents were next checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text. The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file which contains all required information.
The resulting madcat.xml file contains four distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer.
This release includes 22,284 annotation files in both GEDI XML and MADCAT XML formats (.gedi.xml and .madcat.xml) along with their corresponding scanned image files in TIFF format. The annotation results in GEDI XML files include ground truth annotations and source transcripts.
MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese Pilot Training Set is distributed on five DVD-ROMs.
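To make the four layers of the madcat.xml file easier to picture, here is a minimal in-memory sketch of how they might be held after parsing. It is an illustration only: every class and field name below (`MadcatPage`, `TextLayer`, `token_boxes`, and so on) is hypothetical and does not reflect the actual MADCAT XML schema, and the one-box-per-token layout is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
BoundingBox = Tuple[int, int, int, int]


@dataclass
class TextLayer:
    """Source text with its tokenization and sentence segmentation."""
    source_text: str
    tokens: List[str]
    sentence_starts: List[int]  # index of the first token of each sentence


@dataclass
class ImageLayer:
    """Bounding boxes on the scanned page (assumed: one per token)."""
    token_boxes: List[BoundingBox]


@dataclass
class ScribeLayer:
    """Scribe ID and data partition."""
    scribe_id: str
    partition: str  # "train" or "test"


@dataclass
class MadcatPage:
    """Hypothetical container tying the four layers together."""
    text: TextLayer
    image: ImageLayer
    scribe: ScribeLayer
    metadata: Dict[str, str] = field(default_factory=dict)

    def is_consistent(self) -> bool:
        """Check that tokens and boxes line up and the partition is known."""
        same_length = len(self.text.tokens) == len(self.image.token_boxes)
        valid_partition = self.scribe.partition in ("train", "test")
        return same_length and valid_partition


# Invented example values, not taken from the corpus.
example_page = MadcatPage(
    text=TextLayer(source_text="你好世界", tokens=["你好", "世界"], sentence_starts=[0]),
    image=ImageLayer(token_boxes=[(10, 10, 120, 60), (130, 10, 240, 60)]),
    scribe=ScribeLayer(scribe_id="scribe-001", partition="train"),
    metadata={"genre": "newswire"},
)
print(example_page.is_consistent())
```

Keeping the layers as separate objects mirrors the description above, where text, image, scribe and document metadata are distinct components of one output file.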