ISCApad #206
Thursday, August 20, 2015 by Chris Wellekens
In this newsletter: Fall 2015 Data Scholarship Program
New publications:
English News Text Treebank: Penn Treebank Revised
TS Wikipedia
The Walking Around Corpus
Fall 2015 Data Scholarship Program
Applications are now being accepted through Tuesday, September 15, 2015 for the Fall 2015 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.
(1) English News Text Treebank: Penn Treebank Revised was developed by LDC with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the Penn Treebank annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.
This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Corpora annotated under those specifications include English Web Treebank (LDC2012T13), OntoNotes (LDC2013T19), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire (LDC2012T02). English Treebank Supplemental Guidelines are included in this release.
English News Text Treebank: Penn Treebank Revised is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$175.
*
(2) TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.
The data is in a word-per-line format with five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma and corrected token spelling if needed. All data is presented in UTF-8 XML files and was selected and filtered to reduce non-Turkish characters, mathematical formulas and non-Turkish entries.
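The five-column layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the announcement does not say how the word-per-line rows sit inside the XML files, so the function below handles a single already-extracted row; the names `TokenRow` and `parse_row` are hypothetical; the fifth column is assumed to be empty or absent when no spelling correction was needed; and the sample row is invented for illustration and does not reflect the corpus's actual tag set.

```python
from typing import NamedTuple, Optional


class TokenRow(NamedTuple):
    """One word-per-line row (field names are illustrative, not from the corpus)."""
    token: str
    pos: str
    morph: str
    lemma: str
    corrected: Optional[str]  # None when no corrected spelling is given


def parse_row(line: str) -> TokenRow:
    """Split one tab-separated row into its five columns.

    Assumption: a row has four or five fields, and the fifth
    (corrected spelling) may be missing or empty.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 tab-separated fields, got {len(fields)}")
    if len(fields) == 4:
        fields.append("")  # treat a missing fifth column as "no correction"
    token, pos, morph, lemma, corrected = fields
    return TokenRow(token, pos, morph, lemma, corrected or None)


if __name__ == "__main__":
    # Invented sample row; the tag and analysis strings are placeholders.
    sample = "kitaplar\tNoun\tkitap+Noun+A3pl\tkitap\t\n"
    print(parse_row(sample))
```

Keeping the corrected spelling as an optional field lets downstream code fall back to the original token with `row.corrected or row.token`.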
TS Wikipedia is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$100. TS Wikipedia is made available to for-profit members under the LDC For-Profit Membership Agreement and to not-for-profit members and non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license.
*
(3) The Walking Around Corpus was developed by Stony Brook University and comprises approximately 33 hours of navigational telephone dialogues from 72 speakers (36 speaker pairs). Participants were Stony Brook University students who identified themselves as native English speakers.
This corpus was elicited using a navigation task in which one person directed another to walk to 18 unique destinations on Stony Brook University’s West campus. The direction-giver remained inside the lab and gave directions on a landline telephone to the pedestrian who used a mobile phone. As they visited each location, the pedestrians took a picture of each of the 18 destinations using the mobile phone. Pairs conversed spontaneously as they completed the task. The pedestrians' locations were tracked using their cell phones' GPS systems. The pedestrians did not have any maps or pictures of the target destinations and therefore relied on the direction-giver's verbal directions and descriptions to locate and photograph the target destinations.
Each digital audio file was transcribed with time stamps. The corpus material also includes the visual materials (pictures and maps) used to elicit the dialogues, data about the speakers' relationship, spatial abilities and memory performance, and other information.
All audio is presented as 8000 Hz, 16-bit FLAC-compressed WAV. Transcripts are presented as XLS spreadsheets.
The Walking Around Corpus is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1000.