
ISCA - International Speech Communication Association

ISCApad #294

Saturday, December 10, 2022 by Chris Wellekens

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (November 2022)

In this newsletter:

Join LDC for membership year 2023
Fall 2022 data scholarship recipients
Spring 2023 data scholarship application deadline
30th Anniversary Highlight: CALLFRIEND

New publications:
BOLT English Translation Treebank – Egyptian Arabic SMS/Chat
Samrómur Children Icelandic Speech 1.0
Third DIHARD Challenge Development

Join LDC for membership year 2023
It’s time to renew your LDC membership for 2023. Current (2022) members who renew their membership before March 1, 2023 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 900+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for 2023 publications are in progress. Among the expected releases are:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts
2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively
DEFT English ERE: English text from assorted genres annotated for entities, relations and events
Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)
CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts and a lexicon, released in separate speech and text data sets
REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Fall 2022 LDC data scholarship recipients
LDC congratulates the following Fall 2022 data scholarship recipients:

Nelson Filipe Costa: Concordia University (Canada); PhD, Machine Learning. Nelson is awarded a copy of Penn Discourse Treebank Version 3.0 (LDC2019T05) for his work in discourse relationships and mapping.
Paul Pope: University of Eastern Finland (Finland); MA, Linguistic Data Sciences. Paul is awarded a copy of ETS Corpus of Non-Native Written English (LDC2014T06) for his research on text classification.
Abhinav Singh: Sharda University (India); PhD, Forensic Science. Abhinav is awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) for his research on forensic speech recognition.
Lucas Zheng: Deerfield Academy (USA); High School Scholar. Lucas is awarded copies of Arabic Treebank Part 1 v. 4.1 (LDC2010T13) and Arabic Treebank Part 2 v. 3.1 (LDC2011T09) for his work on analyzing syntactic and lexical similarities across MSA genres and POS-tagging for MSA.

Students can learn more about the LDC data scholarship program on the Data Scholarships page.

Spring 2023 data scholarship application deadline
Applications for the Spring 2023 LDC data scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2023. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

30th Anniversary Highlight: CALLFRIEND
The CALLFRIEND series is a multi-language collection of unscripted telephone conversations collected by LDC in the 1990s to support language identification technology development (Liberman & Cieri, 1998). Covered languages are American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. For English, Mandarin and Spanish, the collection includes two distinct dialects. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America.

This speech data was the foundation for NIST’s Language Recognition Evaluations conducted from 1996 to 2007. The first editions of the CALLFRIEND series, published in LDC’s Catalog in 1996, each contain 60 calls split evenly into three 20-call partitions: a training partition to develop language models, a development partition for parameter tuning, and an evaluation partition to test performance (Torres-Carrasquillo et al., 2004).
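As an illustration of the even three-way partitioning described above, here is a minimal Python sketch. It rests on assumptions: the call identifiers are hypothetical, and the random assignment is chosen only for the example, since the newsletter does not say how calls were actually assigned to partitions.

```python
import random


def split_calls(call_ids, n_parts=3, seed=0):
    """Shuffle call IDs and cut them into n_parts equal-sized partitions."""
    if len(call_ids) % n_parts != 0:
        raise ValueError("number of calls must divide evenly into partitions")
    shuffled = list(call_ids)
    # A seeded generator keeps the illustrative split reproducible.
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // n_parts
    return [shuffled[i * size:(i + 1) * size] for i in range(n_parts)]


# Hypothetical identifiers; the published corpora use their own file naming.
calls = [f"call_{i:02d}" for i in range(60)]
train, development, evaluation = split_calls(calls)
print(len(train), len(development), len(evaluation))  # prints "20 20 20"
```

Every call lands in exactly one partition, so the three sets can be used for model training, parameter tuning and testing without overlap.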

Beginning in 2014, LDC released second editions for American English (LDC2019S21, LDC2020S08), Canadian French (LDC2019S18), Egyptian Arabic (LDC2019S04), Farsi (LDC2014S01), and Mandarin Chinese (LDC2018S09, LDC2020S06). The goal of the second editions is to facilitate continued widespread use of the data, specifically, by updating the audio files to .wav format, simplifying the directory structure, adding documentation and metadata, and combining the training, development and evaluation splits. CALLFRIEND Farsi Second Edition also includes additional telephone recordings and a separate transcripts release (LDC2014T01).

In addition to work on language identification, CALLFRIEND corpora have been used in a variety of research tasks, including subject omission in Korean (Lee, 2012), contemporary Persian vowels in casual speech (Jones, 2019), Mandarin telephone closings among familiars (Huang, 2020), and adjective constructions in English conversation (Bybee & Thompson, 2021).

To learn more about the CALLFRIEND collection or about other LDC corpora used for language identification research, search the Catalog by “recommended application” and select “language identification” from the list.

New publications:
BOLT English Translation Treebank – Egyptian Arabic SMS/Chat was developed by LDC and consists of SMS and chat text data (472 files representing 98,206 tokens) translated from Egyptian Arabic to English and annotated for part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release. Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included in the corpus documentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for $1750.

Samrómur Children Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers (children, aged 4-17 years) representing 137,597 utterances.

Speech data was collected between October 2019 and September 2021 using the Samrómur website, which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science, and others were created by combining a name with a question or a demand. Prompts and speaker metadata are included in the corpus.

2022 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for $250.

Third DIHARD Challenge Development was developed by LDC and contains approximately 34 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.

The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for $300.
