Linguistic Data Consortium (LDC) update (November 2017)
ISCApad #234
Monday, December 11, 2017 by Chris Wellekens
In this newsletter:
Spring 2018 Data Scholarship Program
Commercial use and LDC data
IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a
TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training & Evaluation Data 2011-2014
_______________________________________________________________________________
Plans for MY2018 publications are in progress. Among the expected releases are:
And don’t forget, MY2017 and MY2016 are still open for joining. MY2016 can be joined through December 31, 2017 and includes data such as BOLT Chinese Discussion Forums, IARPA Babel Language Packs in multiple languages and Multi-Language Conversational Telephone Speech – Slavic Group. MY2017 will remain open through December 31, 2018; among the year’s releases are 2010 NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting, Noisy TIMIT Speech and BOLT Egyptian Arabic SMS/Chat and Transliteration. For full descriptions of these data sets, browse our Catalog.
Visit Join LDC for details on membership, user accounts and payment.
Spring 2018 Data Scholarship Program
Applications are being accepted through January 15, 2018 for the Spring 2018 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
New publications:
(1) ASpIRE Development and Development Test Sets was developed for the Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of English speech with transcripts and scoring files.
The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of interviews, transcript readings and conversational telephone speech collected by LDC in 2009 and 2010 from native English speakers local to the Philadelphia area. The transcripts were developed by Appen for the ASpIRE challenge.
Data is divided into development and development test sets.
2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $25.
(2) CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Light was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish radio and television speech and associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.
CIEMPIESS Light is an updated version of CIEMPIESS, released by LDC as LDC2015S07. This 'light' version contains speech and transcripts presented in a revised directory structure that allows for use with the Kaldi toolkit.
The speech recordings were collected from Podcast UNAM, a program created by Radio-IUS, and Mirador Universitario, a TV program broadcast by UNAM. They consist of spontaneous conversations in Mexican Spanish between a moderator and guests.
The audio files are in 16 kHz, 16-bit PCM FLAC format, and transcripts are presented as UTF-8 encoded plain text.
CIEMPIESS Light is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $0.
(3) IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a
The Kurmanji Kurdish speech in this release represents that spoken in the southeastern and eastern Anatolian regions of Turkey. The gender distribution among speakers is approximately 37% female and 63% male; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $25.
(4) TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training & Evaluation Data 2011-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity Linking tasks in 2011, 2012, 2013 and 2014. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Chinese newswire, discussion forum and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16).
The goal of TAC KBP’s entity linking track is to measure systems’ ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base and if so, to create a link between the two. If there is no matching node, entity linking systems are required to cluster the mention together with others referencing the same entity. More information about the TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.
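As a rough illustration of the link-or-cluster decision described above, the following Python sketch assigns each query either a knowledge base node or a NIL cluster. Everything in it is an assumption made for illustration: the `Query` class, the `link_queries` function, the node and cluster ID formats, and the toy matching by normalized name are hypothetical and do not reflect the corpus file formats or any actual TAC KBP system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    query_id: str
    mention: str  # name string as it appears in a source document


def normalize(name: str) -> str:
    # Toy normalization (assumption): case-fold and collapse whitespace.
    return " ".join(name.casefold().split())


def link_queries(queries, kb):
    """Map each query ID to a KB node ID or to a NIL cluster ID.

    `kb` is a hypothetical dict of node ID -> canonical entity name.
    """
    name_to_node = {normalize(name): node_id for node_id, name in kb.items()}
    nil_clusters = {}  # normalized mention -> NIL cluster ID
    result = {}
    for q in queries:
        key = normalize(q.mention)
        if key in name_to_node:
            # A matching node exists: link the query to it.
            result[q.query_id] = name_to_node[key]
        else:
            # No matching node: group with other mentions of the same entity.
            if key not in nil_clusters:
                nil_clusters[key] = f"NIL{len(nil_clusters) + 1:04d}"
            result[q.query_id] = nil_clusters[key]
    return result


kb = {"E001": "Barack Obama"}
queries = [
    Query("Q1", "barack obama"),
    Query("Q2", "Jane Roe"),
    Query("Q3", "jane  roe"),
    Query("Q4", "John Doe"),
]
links = link_queries(queries, kb)
```

In this toy run, Q1 is linked to the knowledge base node, Q2 and Q3 share one NIL cluster, and Q4 gets a NIL cluster of its own. Real systems would replace the exact-name match with far richer cross-lingual evidence.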
TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1000.