ISCA - International Speech
Communication Association



ISCApad #257

Tuesday, November 12, 2019 by Chris Wellekens

5-2 Database
5-2-1Linguistic Data Consortium (LDC) update (October 2019)

 

In this newsletter:

Membership Year 2020 Publication Preview
LDC Data and Commercial Technology Development

New Publications:
BOLT English Treebank - Discussion Forum
Polish Speech Database
2016 NIST Speaker Recognition Evaluation Test Set


Membership Year 2020 Publication Preview

 

The 2020 Membership Year is just around the corner and plans for next year's publications are in progress. Among the expected releases are:

  • Abstract Meaning Representation (AMR) Annotation Release 3.0: a semantic treebank of over 59,000 English natural language sentences from broadcast conversations, newswire, weblogs, and web discussion forums; updates the second version (LDC2017T10) with new annotations.

  • TAC KBP: English sentiment slot filling, surprise slot filling, nugget detection and coreference, and event argument data in all three program languages (English, Chinese, and Spanish).

  • DEFT Chinese ERE: Chinese discussion forum data annotated for entities, relations, and events.

  • LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts.

  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Dholuo, Javanese, and Mongolian.

  • HAVIC MED Training Data: web video, metadata, and annotations for developing multimedia systems.

  • RATS Speaker Identification: conversational telephone speech in Levantine Arabic, Pashto, Urdu, Farsi, and Dari on degraded audio signals, with annotation of speech segments for speaker identification.

  • BOLT: word-aligned, tagged, and coreference-annotated discussion forum, SMS/chat, and conversational telephone speech data in all three program languages (Chinese, Egyptian Arabic, and English).

Check your inbox in the coming weeks for more information about membership renewal.

 

 

 

LDC Data and Commercial Technology Development

 

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

 


 


New publications:

 

 

 

(1) BOLT English Treebank - Discussion Forum was developed by LDC and consists of 268,907 tokens of English web discussion forum data with part-of-speech and syntactic structure annotations collected for the DARPA BOLT (Broad Operational Language Translation) program.

 

 

 

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.

 

 

 

The source data is English discussion forum web text collected by LDC in 2011 and 2012. A subset of that data -- 702 files representing 268,907 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. The unannotated English source data is released as BOLT English Discussion Forums (LDC2017T11).

 

 

 

BOLT English Treebank - Discussion Forum is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $2000.

 

*

 

(2) Polish Speech Database was developed by VoiceLab and consists of 263,424 utterances of Polish speech data from 200 speakers, totaling approximately 280 hours, and corresponding transcripts.

 

 

 

Data collection was performed in Poland. Speakers were asked to record themselves reading text on a website for at least 60 minutes from their home computers while using a headset. The read text comprised sentences covering most speech sounds in Polish.

 

 

 

This release includes speaker metadata. There were 103 male speakers and 97 female speakers, ranging from 15 to 60 years of age; most speakers were in the 15 to 30 age range.

 

 

 

Polish Speech Database is distributed via web download.

 

 

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $3000.

 

*

 

(3) 2016 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology) and contains approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano, and Mandarin telephone speech used as development and test data in the NIST-sponsored 2016 Speaker Recognition Evaluation (SRE).

As in previous evaluations, SRE16 focused on telephone speech recorded over a variety of handset types for the training and test conditions. In addition to development and evaluation data, this corpus also contains trial lists, their associated keys, tables containing metadata information, and evaluation documentation.

 

 

 

The telephone speech data was drawn from the Call My Net 2015 Corpus collected by LDC. Native speakers of Tagalog, Cantonese, Cebuano, or Mandarin (220 unique speakers) made a total of ten telephone calls each to people within their existing social networks. Speakers were encouraged to use different telephone instruments in a variety of acoustic settings and were instructed to talk for 8 - 10 minutes per call on a topic of their choice. All conversations were collected outside North America.
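Because the corpus ships with trial lists and their associated keys, a system's detection scores can be summarized as an equal error rate (EER), the point where the miss and false-alarm rates coincide. The sketch below uses made-up scores and a hypothetical (score, is_target) trial format purely for illustration; the real trials come from the corpus's key files.

```python
# Hypothetical detection scores for a handful of trials: (score, is_target).
# These stand in for a system's output on SRE16-style trials.
trials = [(2.1, True), (1.9, True), (1.7, True), (1.2, True),
          (1.5, False), (0.9, False), (0.4, False), (0.1, False)]

targets = [s for s, is_tgt in trials if is_tgt]
nontargets = [s for s, is_tgt in trials if not is_tgt]

def error_rates(threshold):
    """Miss rate (targets rejected) and false-alarm rate (non-targets accepted)."""
    miss = sum(s < threshold for s in targets) / len(targets)
    fa = sum(s >= threshold for s in nontargets) / len(nontargets)
    return miss, fa

# Sweep the observed scores as thresholds; the EER lies where the two rates meet.
eer_thr = min((s for s, _ in trials),
              key=lambda t: abs(error_rates(t)[0] - error_rates(t)[1]))
miss, fa = error_rates(eer_thr)
print(eer_thr, miss, fa)  # → 1.5 0.25 0.25
```

On real trial lists one would interpolate between thresholds rather than sweep raw scores, but the principle is the same.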

 

 

 

2016 NIST Speaker Recognition Evaluation Test Set is distributed via web download.

 

 

 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $750.

 

*

 

 

 

Membership Office

 

Linguistic Data Consortium

 

University of Pennsylvania

 

T: +1-215-573-1275

 

E: ldc@ldc.upenn.edu

 

M: 3600 Market St. Suite 810

 

      Philadelphia, PA 19104

 

 


 

 

 

 

 

 

 

 

 

 

 

 


5-2-2ELRA - Language Resources Catalogue - Update (October 2019)
In the framework of a distribution agreement between ELRA and the CJK Dictionary Institute, Inc., ELRA is happy to announce the distribution of 29 Monolingual Lexicons and 20 Multilingual Lexicons, suitable for a large variety of natural language processing applications. Monolingual Lexicons are available for Arabic, Cantonese, Simplified and Traditional Chinese, Japanese, Korean, Persian and Spanish and Multilingual lexicons include those languages as well as some other European languages (English, German, French, Italian, Portuguese and Russian) and Asian languages (Vietnamese, Indonesian, Thai). Possible applications include information retrieval, morphological analysis, word segmentation, named entity recognition, machine translation, etc. All lexicons are made available in tab-delimited, UTF-8 encoded text files.
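Since every lexicon is delivered as a tab-delimited, UTF-8 encoded text file, loading one programmatically is straightforward. A minimal Python sketch, assuming a hypothetical two-column layout (headword and reading); the actual column inventory varies per lexicon:

```python
import csv
import os
import tempfile

# Hypothetical two-column entries (headword <TAB> reading), written out only
# so this sketch is self-contained; a real CJKI lexicon file would be used here.
sample = "中国\tZhōngguó\n日本\tRìběn\n"
path = os.path.join(tempfile.mkdtemp(), "lexicon.tsv")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read the tab-delimited, UTF-8 file exactly as distributed.
with open(path, encoding="utf-8", newline="") as f:
    entries = {row[0]: row[1] for row in csv.reader(f, delimiter="\t")}

print(entries["中国"])  # → Zhōngguó
```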

The following list of lexicons is available:

1) Monolingual Lexicons:

Cantonese Readings Database, ELRA ID: ELRA-L0101, ISLRN: 634-690-317-631-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0101
Chinese Phonological Database, ELRA ID: ELRA-L0102, ISLRN: 968-547-869-011-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0102
Simplified to Traditional Chinese Conversion, ELRA ID: ELRA-L0103, ISLRN: 151-342-562-705-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0103
Hanzi Pinyin Database for Simplified Chinese, ELRA ID: ELRA-L0104, ISLRN: 292-895-602-975-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0104
Database of Chinese Name Variants, ELRA ID: ELRA-L0105, ISLRN: 379-237-021-386-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0105
Database of Chinese Full Names, ELRA ID: ELRA-L0106, ISLRN: 356-835-468-182-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0106
Chinese Lexical Database, ELRA ID: ELRA-L0107, ISLRN: 500-068-723-953-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0107
Chinese Morphological Database, ELRA ID: ELRA-L0108, ISLRN: 279-636-746-963-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0108
Comprehensive Wordlist of Simplified Chinese, ELRA ID: ELRA-L0109, ISLRN: 159-767-888-341-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0109
Comprehensive Word List of Traditional Chinese, ELRA ID: ELRA-L0110, ISLRN: 378-715-589-213-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0110
Japanese Phonological Database, ELRA ID: ELRA-L0111, ISLRN: 169-903-096-259-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0111
Japanese Lexical Database, ELRA ID: ELRA-L0112, ISLRN: 162-212-767-492-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0112
Japanese Morphological Database, ELRA ID: ELRA-L0113, ISLRN: 212-935-180-069-7
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0113
Japanese Orthographical Database, ELRA ID: ELRA-L0114, ISLRN: 261-356-756-593-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0114
Japanese Companies and Organizations, ELRA ID: ELRA-L0115, ISLRN: 570-674-242-221-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0115
Database of Japanese Name Variants, ELRA ID: ELRA-L0116, ISLRN: 850-674-726-461-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0116
Comprehensive Word List of Japanese, ELRA ID: ELRA-L0117, ISLRN: 145-375-006-102-6
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0117
Korean Lexical Database, ELRA ID: ELRA-L0118, ISLRN: 702-121-344-159-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0118
Comprehensive Word List of Korean, ELRA ID: ELRA-L0119, ISLRN: 652-932-407-045-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0119
Arabic Full Form Lexicon, ELRA ID: ELRA-L0120, ISLRN: 968-827-909-119-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0120
Database of Arabic Plurals, ELRA ID: ELRA-L0121, ISLRN: 414-072-749-098-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0121
Database of Arab Names, ELRA ID: ELRA-L0122, ISLRN: 998-153-793-831-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0122
Database of Arab Names in Arabic, ELRA ID: ELRA-L0123, ISLRN: 126-981-976-765-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0123
Database of Foreign Names in Arabic, ELRA ID: ELRA-L0124, ISLRN: 130-493-475-689-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0124
Database of Arabic Place Names, ELRA ID: ELRA-L0125, ISLRN: 916-541-123-321-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0125
Comprehensive Database of Chinese Personal Names, ELRA ID: ELRA-L0126, ISLRN: 797-857-604-135-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0126
Database of Persian Names, ELRA ID: ELRA-L0127, ISLRN: 739-878-734-567-6
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0127
Spanish Full-form Lexicon (Monolingual), ELRA ID: ELRA-L0128, ISLRN: 866-578-477-474-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0128
Database of Chinese Names, ELRA ID: ELRA-L0129, ISLRN: 792-499-131-789-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0129
 
2) Bilingual/Multilingual Lexicons:

Simplified Chinese-English Technical Terms, ELRA ID: ELRA-M0053, ISLRN: 418-191-947-016-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0053
Simplified Chinese-to-English Dictionary, ELRA ID: ELRA-M0054, ISLRN: 694-156-385-534-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0054
English-to-Simplified Chinese Dictionary, ELRA ID: ELRA-M0055, ISLRN: 407-348-028-638-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0055
Chinese-English Database of Proverbs and Idioms (Chengyu), ELRA ID: ELRA-M0056, ISLRN: 506-728-933-717-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0056
Chinese-Japanese Technical Terms Dictionary, ELRA ID: ELRA-M0057, ISLRN: 079-503-057-574-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0057
Chinese-English Database of Proper Nouns, ELRA ID: ELRA-M0058, ISLRN: 638-295-493-483-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0058
Chinese-Japanese Database of Proper Nouns, ELRA ID: ELRA-M0059, ISLRN: 951-838-928-664-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0059
Spanish Full-form Lexicon (Bilingual), ELRA ID: ELRA-M0060, ISLRN: 942-238-032-826-7
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0060
Japanese-English Dictionary, ELRA ID: ELRA-M0061, ISLRN: 854-879-959-652-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0061
English-Japanese Dictionary, ELRA ID: ELRA-M0062, ISLRN: 233-968-157-290-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0062
Multilingual Database of Japanese Points-of-Interest 1, ELRA ID: ELRA-M0063, ISLRN: 902-666-654-661-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0063
Multilingual Database of Japanese Points-of-Interest 2, ELRA ID: ELRA-M0064, ISLRN: 268-160-514-957-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0064
Japanese-English Database of Proper Nouns, ELRA ID: ELRA-M0065, ISLRN: 104-268-721-502-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0065
Japanese - English Dictionary of Technical Terms, ELRA ID: ELRA-M0066, ISLRN: 499-497-806-398-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0066
Korean-Japanese Dictionary of Technical Terms, ELRA ID: ELRA-M0067, ISLRN: 584-164-296-035-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0067
Korean-English Database of Proper Nouns, ELRA ID: ELRA-M0068, ISLRN: 408-409-094-493-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0068
Korean-Japanese Database of Proper Nouns, ELRA ID: ELRA-M0069, ISLRN: 265-620-933-123-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0069
Korean-Chinese Database of Proper Nouns, ELRA ID: ELRA-M0070, ISLRN: 207-127-841-003-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0070
Comprehensive Word Lists for Chinese, Japanese, Korean and Arabic, ELRA ID: ELRA-M0071, ISLRN: 476-146-877-598-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0071
Multilingual Proper Noun Database, ELRA ID: ELRA-M0072, ISLRN: 340-315-642-771-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0072
 
About CJK Dictionary Institute, Inc.
The CJK Dictionary Institute, Inc. (CJKI) specializes in CJK lexicography.  The principal activity of CJKI is the development and continuous expansion of lexical databases of general vocabulary, proper nouns and technical terms for CJK languages (Chinese, Japanese, Korean), including Chinese dialects such as Cantonese and Hakka, containing millions of entries. CJKI also developed databases and romanization systems of Arabic proper nouns, a comprehensive Spanish-English dictionary, a Chinese-Vietnamese names dictionary, and various others. In addition, CJKI offers a full range of professional consulting services on CJK linguistics and lexicography.
To find out more about CJKI, please visit the website: http://www.cjk.org/cjk/index.htm

About ELRA
The European Language Resources Association (ELRA) is a non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for Language Resources and promoting Human Language Technologies. Language Resources covering various fields of HLT (including Multimodal, Speech, Written, Terminology) and a great number of languages are available from the ELRA catalogue. ELRA's strong involvement in the fields of Language Resources  and Language Technologies is also emphasized at the LREC conference, organized every other year since 1998.
To find out more about ELRA, please visit the website: http://www.elra.info


For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements









5-2-3Speechocean – update (August 2019)

 

English Speech Recognition Corpus - Speechocean

 

At present, Speechocean has produced more than 24,000 hours of English speech recognition corpora, including some rare corpora recorded by children. These corpora were recorded by 23,000 speakers in total. Please see the table below:

 

Name                            Speakers    Hours
American English                8,441       8,029
Indian English                  2,394       3,540
British English                 2,381       3,029
Australian English              1,286       1,954
Chinese (Mainland) English      3,478       1,513
Canadian English                1,607       1,309
Japanese English                1,005         902
Singapore English                 404         710
Russian English                   230         492
Romanian English                  201         389
French English                    225         378
Chinese (Hong Kong) English       200         378
Italian English                   213         366
Portuguese English                201         341
Spanish English                   200         326
German English                    196         306
Korean English                    116         207
Indonesian English                402         126
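As a quick sanity check, the per-corpus figures sum to the headline totals quoted above (roughly 23,000 speakers and more than 24,000 hours). A small sketch, with the (speakers, hours) pairs transcribed from the table:

```python
# (speakers, hours) per accent corpus, transcribed from the Speechocean table.
corpora = {
    "American English": (8441, 8029), "Indian English": (2394, 3540),
    "British English": (2381, 3029), "Australian English": (1286, 1954),
    "Chinese (Mainland) English": (3478, 1513), "Canadian English": (1607, 1309),
    "Japanese English": (1005, 902), "Singapore English": (404, 710),
    "Russian English": (230, 492), "Romanian English": (201, 389),
    "French English": (225, 378), "Chinese (Hong Kong) English": (200, 378),
    "Italian English": (213, 366), "Portuguese English": (201, 341),
    "Spanish English": (200, 326), "German English": (196, 306),
    "Korean English": (116, 207), "Indonesian English": (402, 126),
}

total_speakers = sum(spk for spk, _ in corpora.values())
total_hours = sum(hrs for _, hrs in corpora.values())
print(total_speakers, total_hours)  # → 23180 24295
```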

 

 

If you have any further inquiries, please do not hesitate to contact us.

Web: en.speechocean.com

Email: marketing@speechocean.com

 

 

 

 

 

 


 


 

 


5-2-4Google 's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
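The distributed per-word log-probability values make it easy to compare a new model against these baselines: for example, perplexity over a held-out set can be recovered directly from them. A small sketch with hypothetical log10 values (the benchmark's actual files hold one value per word per held-out set):

```python
# Hypothetical per-word log10 probabilities for one held-out sentence, standing
# in for the per-word values the benchmark distributes for its baseline models.
logprobs = [-1.2, -0.8, -2.5, -0.3, -1.7]

# Perplexity is the inverse geometric mean of the word probabilities:
#   PPL = 10 ** (-(sum of log10 probs) / N)
ppl = 10 ** (-sum(logprobs) / len(logprobs))
print(round(ppl, 2))  # → 19.95
```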


5-2-5Forensic database of voice recordings of 500+ Australian English speakers

Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police,  New South Wales Police,  Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings are recorded under high-quality conditions and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/


5-2-6Audio and Electroglottographic speech recordings

 

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/ San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)


5-2-7EEG-face tracking- audio 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

 

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

 

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

 

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

 

Best regards,

Frank

 

 

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane, Australia.

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

 


5-2-8TORGO data base free for academic use.

In the spirit of the season, I would like to announce the immediate availability of the TORGO database free, in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition, for example. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.


5-2-9Datatang

Datatang is a leading global data provider specializing in customized data solutions, focusing on a wide variety of speech, image, and text data collection, annotation, and crowdsourcing services.

 

Summary of the new datasets (2018) and a brief plan for 2019.

 

 

 

  • Speech data (with annotation) that we finished in 2018

 

Language                       Hours
French                           794
British English                  800
Spanish                          435
Italian                        1,440
German                         1,800
Spanish (Mexico/Colombia)        700
Brazilian Portuguese           1,000
European Portuguese            1,000
Russian                        1,000

 

  • 2019 ongoing speech projects

 

Type                            Project Name
Europeans speak English         1000 Hours - Spanish Speak English
                                1000 Hours - French Speak English
                                1000 Hours - German Speak English
Call Center Speech              1000 Hours - Call Center Speech
Off-the-shelf data expansion    1000 Hours - Chinese Speak English
                                1500 Hours - Mixed Chinese and English Speech Data

 

 

 

On top of the above, more speech data collections are planned, such as Japanese speech data, children's speech data, and dialect speech data.

 

Moreover, we will continue to provide these data at a competitive price while maintaining a high accuracy rate.

 

 

 

If you have any questions or need more details, please do not hesitate to contact us at jessy@datatang.com. We would be happy to send you a sample or a specification of the data.

 

 

 


Back  Top

5-2-10Fearless Steps Corpus (University of Texas, Dallas)

Fearless Steps Corpus

John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.


NASA’s Apollo program is one of the great achievements of mankind in the 20th century. CRSS at UT-Dallas has undertaken an enormous Apollo data digitization initiative, in which we proposed to digitize the Apollo mission speech data (~100,000 hours) and to develop spoken language technology (SLT) algorithms to analyze and understand various aspects of the conversational speech. Toward this goal, a new 30-track analog audio decoder was designed to decode the 30-track Apollo analog tapes; it is mounted on the NASA Soundscriber analog audio decoder in place of the single-channel decoder. With the new decoder, all 30 channels of data can be decoded simultaneously, reducing digitization time significantly.
We have digitized 19,000 hours of data from the Apollo missions (including the entire Apollo-11 mission and most of the Apollo-13, Apollo-1, and Gemini-8 missions). This audio archive is named the “Fearless Steps Corpus”; it is among the largest naturalistic audio corpora of its kind. Automated transcripts are generated by custom Apollo-specific deep neural network (DNN) based automatic speech recognition (ASR) systems together with Apollo-specific language models. A speaker identification (SID) system was designed to identify the speakers, and a complete diarization pipeline has been established to study and develop various SLT tasks.
We will release this corpus for public use as part of our outreach, and we encourage the SLT community to use this opportunity to build naturalistic spoken language technology systems. The data provide ample opportunity to set up challenging tasks in various SLT areas. As part of this outreach, we will organize the “Fearless Steps Challenge” at the upcoming INTERSPEECH 2018, for which we will define and propose five tasks. The guidelines and challenge data will be released in Spring 2018 and will be available for download free of charge. The five tasks are: (1) automatic speech recognition, (2) speaker identification, (3) speech activity detection, (4) speaker diarization, and (5) keyword spotting and joint topic/sentiment detection.
We look forward to your participation (John.Hansen@utdallas.edu).


5-2-11SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, aimed at building TTS systems and at investigating multiple speaking styles and emphasis. A total of 9,750 utterances from various sources, such as parliament debates and novels, were read by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
 

5-2-12JLCorpus - Emotional Speech corpus with primary and secondary emotions
JLCorpus - Emotional Speech corpus with primary and secondary emotions:
 

To further the understanding of the wide array of emotions embedded in human speech, we are introducing an emotional speech corpus. In contrast to existing speech corpora, this corpus was constructed by maintaining an equal distribution of 4 long vowels in New Zealand English. This balance is intended to facilitate studies comparing emotion-related formant and glottal source features. The corpus also has 5 secondary emotions along with 5 primary emotions. Secondary emotions are important in human-robot interaction (HRI), where the aim is to model natural conversations between humans and robots, but very few existing speech resources cover them; this work adds a speech corpus containing some secondary emotions.

Please use the corpus for emotional speech related studies. When you use it please include the citation as:

Jesin James, Li Tian, Catherine Watson, 'An Open Source Emotional Speech Corpus for Human Robot Interaction Applications', in Proc. Interspeech, 2018.

To access the whole corpus including the recording supporting files, click the following link: https://www.kaggle.com/tli725/jl-corpus, (if you have already installed the Kaggle API, you can type the following command to download: kaggle datasets download -d tli725/jl-corpus)

Or if you simply want the raw audio+txt files, click the following link: https://www.kaggle.com/tli725/jl-corpus/downloads/Raw%20JL%20corpus%20(unchecked%20and%20unannotated).rar/4

The corpus was evaluated in a large-scale human perception test with 120 participants. The links to the surveys are given below. For the primary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_8ewmOCgOFCHpAj3

For the secondary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_eVDINp8WkKpsPsh

These surveys will give an overall idea about the type of recordings in the corpus.

The perceptually verified and annotated JL corpus will be given public access soon.


5-2-13OPENGLOT –An open environment for the evaluation of glottal inverse filtering

OPENGLOT –An open environment for the evaluation of glottal inverse filtering

 

OPENGLOT is a publicly available database that was designed primarily for the evaluation of glottal inverse filtering algorithms. In addition, the database can be used in evaluating formant estimation methods. OPENGLOT consists of four repositories. Repository I contains synthetic glottal flow waveforms, and speech signals generated by using the Liljencrants–Fant (LF) waveform as an excitation and an all-pole vocal tract model. Repository II contains glottal flow and speech pressure signals generated using physical modelling of human speech production. Repository III contains pairs of glottal excitation and speech pressure signals generated by exciting a 3D-printed plastic vocal tract replica with LF excitations via a loudspeaker. Finally, Repository IV contains multichannel recordings (speech pressure signal, EGG, high-speed video of the vocal folds) from natural production of speech.

 

OPENGLOT is available at:

http://research.spa.aalto.fi/projects/openglot/


5-2-14Corpus Rhapsodie

We are pleased to announce the publication of a book devoted to the Rhapsodie treebank, a 33,000-word corpus of spoken French finely annotated for prosody and syntax.

Access to the publication: https://benjamins.com/catalog/scl.89 (see attached flyer)

Access to the treebank: https://www.projet-rhapsodie.fr/
The freely accessible data are distributed under a Creative Commons licence.
The site also provides access to the annotation guides.


5-2-15The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus), Boulder Learning Inc.

The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus) is the world's largest English children's speech corpus. It is freely available to the research community for research use. Companies can acquire the corpus for $10,000. The MyST Corpus was collected over a 10-year period, with support from over $9 million in grants from the US National Science Foundation and Department of Education, awarded to Boulder Learning Inc. (Wayne Ward, Principal Investigator).

The MyST corpus contains speech collected from 1,374 third, fourth and fifth grade students. The students engaged in spoken dialogs with a virtual science tutor in 8 areas of science. A total of 11,398 student sessions of 15 to 20 minutes produced a total of 244,069 utterances. 42% of the utterances have been transcribed at the word level. The corpus is partitioned into training and test sets to support comparison of research results across labs. All parents and students signed consent forms, approved by the University of Colorado's Institutional Review Board, that authorize distribution of the corpus for research and commercial use.
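As a rough back-of-the-envelope check on the figures above (a sketch only; the announcement does not state the exact transcribed count):

```python
# Headline figures from the corpus description above.
sessions = 11398
utterances = 244069
transcribed_fraction = 0.42

per_session = utterances / sessions                      # average utterances per session
transcribed = round(utterances * transcribed_fraction)   # approx. word-level transcribed utterances

print(round(per_session, 1), transcribed)  # → 21.4 102509
```

That is roughly 21 utterances per 15-20 minute session and about 102,500 transcribed utterances.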

The MyST children's speech corpus contains approximately ten times as many spoken utterances as all other English children's speech corpora combined (see https://en.wikipedia.org/wiki/List_of_children%27s_speech_corpora).

Additional information about the corpus, and instructions for how to acquire the corpus (and samples of the speech data) can be found on the Boulder Learning Web site at http://boulderlearning.com/request-the-myst-corpus/.   


5-2-16HARVARD speech corpus - native British English speaker
  • HARVARD speech corpus - native British English speaker, digital re-recording
 