ISCA - International Speech Communication Association

ISCApad #271

Monday, January 11, 2021 by Chris Wellekens

5-2 Database

5-2-1 Linguistic Data Consortium (LDC) update (December 2020)

In this newsletter:
LDC 2021 Membership Discounts Now Available
Approaching Deadline for Spring 2021 Data Scholarship Applications
LDC Closed for Winter Break Dec. 24-Jan. 5

New Publications:
BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech
Phonemes of Arabic
Global TIMIT Mandarin Chinese – Guanzhong Dialect

LDC 2021 Membership Discounts Now Available
Now through March 1, 2021, current 2020 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching Deadline for Spring 2021 Data Scholarship Applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2021 data scholarships are due January 15, 2021. For more information on requirements and program rules, see LDC Data Scholarships.

LDC Closed for Winter Break Dec. 24-Jan. 5
LDC will be closed from Thursday, December 24, 2020, through Tuesday, January 5, 2021, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 6, 2021. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New Publications:
(1) BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies for the BOLT co-reference task and consists of co-reference annotation on English discussion forum, SMS/Chat, and conversational telephone speech.

Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources (discussion forums, text messaging, and chat) in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1250.

(2) Phonemes of Arabic was developed at the Florida Institute of Technology. It contains approximately one hour of speech from native Arabic speakers that includes all Arabic sounds (consonants and vowels) and 24 words with specific consonant-vowel patterns.

Arabic has three short vowels, three long vowels, and 28 consonants. Speakers recorded all sounds, repeating each sound three times. Each speaker also recorded 24 Arabic words with a specified consonant-vowel pattern and repeated each word three times. The speakers (19 male) were from the following countries: Egypt, Iraq, Lebanon, Libya, Morocco, Saudi Arabia, and Syria.

Phonemes of Arabic is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

(3) Global TIMIT Mandarin Chinese – Guanzhong Dialect was developed by LDC and Xi’an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province. It comprises 50 speakers, each reading 120 sentences from Chinese Gigaword Fifth Edition (LDC2011T13). Of each speaker's 120 sentences, 20 were read by all speakers, 40 were read by 10 speakers, and 60 were read by that speaker alone, for a total of 3220 sentence types.
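The total of 3220 sentence types can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the corpus or of any LDC tooling. The function name and the design table are hypothetical, and it assumes that every speaker reads the full 120 sentences and that each of the 40 partially shared sentences is read by exactly 10 speakers.

```python
def count_sentence_types(num_speakers, design):
    """Count distinct sentences implied by a TIMIT-style reading design.

    design is a list of (sentences_per_speaker, readers_per_sentence) pairs.
    Hypothetical helper for illustration, not an LDC utility.
    """
    total = 0
    for per_speaker, readers_per_sentence in design:
        readings = num_speakers * per_speaker
        # Each distinct sentence accounts for `readers_per_sentence` readings,
        # so the readings must divide evenly among the sentences.
        if readings % readers_per_sentence != 0:
            raise ValueError("design does not divide evenly among speakers")
        total += readings // readers_per_sentence
    return total


NUM_SPEAKERS = 50
# (sentences each speaker reads, number of speakers reading each such sentence)
DESIGN = [
    (20, 50),  # read by all 50 speakers -> 20 distinct sentences
    (40, 10),  # each read by 10 speakers -> 200 distinct sentences
    (60, 1),   # each read by one speaker -> 3000 distinct sentences
]

total_types = count_sentence_types(NUM_SPEAKERS, DESIGN)
total_readings = NUM_SPEAKERS * sum(per_speaker for per_speaker, _ in DESIGN)

print(total_types)     # 3220
print(total_readings)  # 6000
```

Under these assumptions the design yields 20 + 200 + 3000 = 3220 distinct sentences across 6000 individual readings, which matches the figure quoted in the announcement.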

The corpus was recorded at Xi’an Jiaotong University, Xi’an, China. Speakers (25 female, 25 male) were born in Weinan, Shaanxi and spoke the Guanzhong dialect.

The Global TIMIT project aimed to create a series of corpora in a variety of languages that share the key features of the original TIMIT data set, which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

A large number of fluently read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
A relatively large number of speakers
Time-aligned lexical and phonetic transcription of all utterances
Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker

Global TIMIT Mandarin Chinese – Guanzhong Dialect is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $500.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104