Linguistic Data Consortium (LDC) update (October 2017)
ISCApad #233
Friday, November 10, 2017 by Chris Wellekens
In this newsletter:
Membership Year 2018 Publication Preview
MWE-Aware English Dependency Corpus Version 2.0
_______________________________________________________________________________
LDC is pleased to award fifteen data scholarships to students this fall. Recipients are from eight countries and a variety of academic disciplines. Twenty unique data sets are awarded to the students for their work in diverse applications including machine translation, abstractive text summarization using recurrent neural networks, speech recognition for multiple languages, semantic role labeling for social data, text summarization, speaker recognition for forensic applications, and more. Please look to LDC’s social media pages for upcoming announcements highlighting each recipient and their intended research. Congratulations to all of our recipients!
Membership Year 2018 Publication Preview
Check your inbox in the coming weeks for more information about membership renewal.
New publications:
(1) RATS Keyword Spotting was developed by LDC and comprises approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts, and keywords generated from transcript content. The corpus was created to provide training, development, and initial test sets for the keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic and Farsi speakers; (2) material from Levantine Arabic QT Training Data Set 5, Speech (LDC2006S29); and (3) CALLFRIEND Farsi Second Edition Speech (LDC2014S01). Transcripts of calls were either produced or available from the source corpora. Potential target keywords were selected from the transcripts based on word frequencies to fall within a range of target-word likelihood per hour of speech. The selected words were manually reviewed to confirm that each was a regular word or multi-word expression of more than three syllables.
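The frequency-based selection step can be pictured with a small sketch. This is only an illustration, not the procedure used for RATS: the function name, the whitespace tokenization, and the rate thresholds are all assumptions, and the manual syllable review is not modeled.

```python
from collections import Counter


def candidate_keywords(transcripts, total_hours, min_per_hour, max_per_hour):
    """Return words whose occurrence rate per hour of speech lies in the given range.

    All parameters are hypothetical; the newsletter does not state the actual range.
    """
    # Count every whitespace-separated token across all transcripts.
    counts = Counter(word for text in transcripts for word in text.split())
    # Convert raw counts into occurrences per hour of speech.
    rates = {word: n / total_hours for word, n in counts.items()}
    return sorted(w for w, r in rates.items() if min_per_hour <= r <= max_per_hour)


# Toy data standing in for real transcripts.
transcripts = ["alpha beta alpha gamma", "alpha beta delta"]
selected = candidate_keywords(
    transcripts, total_hours=2.0, min_per_hour=0.5, max_per_hour=1.0
)
print(selected)  # ['beta', 'delta', 'gamma']
```

In the toy data, "alpha" occurs 1.5 times per hour and is dropped as too frequent, while the remaining words fall inside the assumed range.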
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $7,500.
(2) English Web Treebank Propbank was developed by University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T13).
The goal of Propbank (or proposition bank) annotation is to develop annotations with information about basic semantic propositions. English Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses, and all nouns considered to be predicative. Mark-up is in the 'unified' propbank annotation format, which combines representations for nouns, verbs, and adjectives. The source data consists of weblogs, newsgroups, email, reviews, and question-answers.
English Web Treebank Propbank is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $175.
(4) MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC2013T19).
Version 2.0 adds annotations of named entities (persons, locations, organizations) into dependency trees that are aware of compound function words. Version 1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01).
MWEs (multiword expressions) were identified in OntoNotes' phrase structure trees, and each MWE was established as a single subtree. Those phrase structure subtrees were then converted to a dependency structure (Stanford dependencies) in CoNLL format. The data is split into 1,728 phrase structure trees as *.parse files and a single 14-column, tab-separated dependency file (*.conll). Both file types are encoded as UTF-8.
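For readers who want to load the dependency file, a minimal reader might look like the sketch below. It assumes the usual CoNLL convention of one token per line with blank lines between sentences, which the description above does not confirm. The function name is hypothetical, and the meaning of the 14 columns is left uninterpreted because it is not given here.

```python
import io

EXPECTED_COLUMNS = 14  # column count stated in the corpus description


def read_conll(stream, n_columns=EXPECTED_COLUMNS):
    """Yield sentences as lists of token rows (each row is a list of column strings).

    Assumes blank lines separate sentences; this is an assumption, not a documented fact.
    """
    sentence = []
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        sentence.append(fields)
    if sentence:
        yield sentence


# Placeholder rows standing in for real corpus lines.
row = "\t".join(["_"] * EXPECTED_COLUMNS)
sample = io.StringIO(f"{row}\n{row}\n\n{row}\n")
sentences = list(read_conll(sample))
print([len(s) for s in sentences])  # [2, 1]
```

With the real data, the in-memory sample would be replaced by a file opened with encoding="utf-8", matching the encoding stated above.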
MWE-Aware English Dependency Corpus Version 2.0 is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $50.
Membership Office
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104