ISCApad #300

Saturday, June 10, 2023 by Chris Wellekens

5-2-1 Linguistic Data Consortium (LDC) update (May 2023)

In this newsletter:
LDC at ICASSP 2023

New publications:
2019 NIST Speaker Recognition Evaluation Test Set – CTS Challenge
LORELEI Zulu Representative Language Pack

LDC at ICASSP 2023
LDC will be exhibiting at ICASSP 2023, held this year June 4-10 in Rhodes, Greece. Stop by booth 15 to learn more about recent developments at the Consortium and the latest publications.

LDC will post conference updates via Twitter and Facebook. We look forward to seeing you there!

New publications:
2019 NIST Speaker Recognition Evaluation Test Set – CTS Challenge, developed by LDC and NIST, contains 635 hours of Tunisian Arabic telephone recordings for development and test, answer keys, enrollment, trial files, and documentation from the CTS Challenge portion of the NIST-sponsored 2019 Speaker Recognition Evaluation. The 2019 evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech from LDC's Call My Net 2 (CMN2) corpus; and (2) a separate evaluation using audio-visual material collected by LDC for the VAST (Video Annotation for Speech Technology) project (released as LDC2023V01).

The telephone speech data for the CTS Challenge was drawn from the CMN2 collection, conducted by LDC in Tunisia, in which Tunisian Arabic speakers called friends or relatives who agreed to record their telephone conversations, each lasting 8-10 minutes. The speech segments include PSTN (public switched telephone network) and VOIP (voice over IP) data.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for $400.

LORELEI Zulu Representative Language Pack comprises over 5 million words of Zulu monolingual text, 2.7 million words of found Zulu-English parallel text, and 71,000 Zulu words translated from English data. Approximately 100,000 words were annotated for named entities, and over 23,000 words were annotated for entity discovery and linking and for situation frames (identifying entities, needs, and issues). Data was collected from discussion forums, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for $250.

To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options or contact LDC for assistance.

Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

Philadelphia, PA 19104
