ISCA - International Speech Communication Association

ISCApad #301

Thursday, July 06, 2023 by Chris Wellekens

5-2 Database

5-2-1

Linguistic Data Consortium (LDC) update (June 2023)

In this newsletter:
LDC at ACL 2023
LDC data and commercial technology development

New publications:
Moroccan Arabic – English Lexical Database
LORELEI Indonesian Representative Language Pack

LDC at ACL 2023
LDC will be exhibiting at ACL 2023, held this year July 9-14 in Toronto, Canada. Stop by our booth to learn more about recent developments at the Consortium and the latest publications. LDC will post conference updates via Twitter and Facebook. We look forward to seeing you there!

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
Moroccan Arabic – English Lexical Database was developed by LDC. It contains a set of five interrelated tables presenting each Moroccan Arabic word as an orthographic form in Arabic script and a pronunciation form in International Phonetic Alphabet (IPA) format. This release contains over 21,000 Moroccan Arabic words in Arabic script and IPA notation, and more than 33,000 English tokens.

This lexical database is the result of a collaboration with Georgetown University Press (GUP) to enhance and update three dialectal Arabic dictionaries -- Iraqi, Moroccan, and Syrian -- originally published in paper form in the 1960s by GUP. LDC also undertook to develop a lexical database for each dialect. The Georgetown Dictionary of Moroccan Arabic was published in 2019; this work was based on, and expanded, A Dictionary of Moroccan Arabic.

LDC's enhancements included facilitating comparisons across Arabic dialects and Modern Standard Arabic by providing Arabic script spellings and IPA pronunciations for Moroccan words and phrases; promoting ease of use for language learners and researchers by developing reasonable orthographic conventions for applying the Arabic alphabet to the dialect; and facilitating a user's understanding of morphological and lexical relations by adding information on the linguistic structures of Moroccan Arabic.

2023 members can access this corpus through their LDC accounts provided they have submitted a signed copy of the special license agreement. Non-members may license this data for $1500.

LORELEI Indonesian Representative Language Pack consists of over 17 million words of Indonesian monolingual text, 950,000 words of found Indonesian-English parallel text, and 92,000 Indonesian words translated from English data. Over 113,000 words were annotated for named entities, and more than 24,000 words were annotated for entity discovery and linking and for situation frames (identifying entities, needs, and issues). Data was collected from discussion forums, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for $250.