ISCA - International Speech Communication Association


ISCApad #288

Friday, June 10, 2022 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (May 2022)
We are happy to announce that 1 new written corpus, 4 new bilingual lexicons and 1 new monolingual lexicon are now available in our catalogue.

Annotated tweet corpus in Arabizi, French and English

ISLRN: 482-848-308-105-6

This corpus, completed in 2020, was built to collect and annotate tweets in 3 languages (Arabizi, French and English) on 3 predefined themes (Hooliganism, Racism, Terrorism). It consists of 17,103 sequences annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes “Others” and “Incomprehensible”. Among these, 4,578 sequences with at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism) were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful” is also provided; these tweets were selected on the basis of the topic/opinion annotations and the presence of insults.

A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia

ISLRN: 110-617-195-245-4

The bilingual English-Ukrainian lexicon of named entities uses Wikipedia metadata as a source. The extracted named entity pairs are classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The lexicon consists of 624,168 pairs and comes in two formats: csv and xml.

ArabLEX set of data

The ArabLEX set of data consists of 4 databases dedicated to the Arabic language:

ArabLEX: Database of Arabic General Vocabulary (DAG)

ISLRN: 879-334-992-724-8

A comprehensive full-form lexicon of Arabic general vocabulary including all inflected, conjugated and cliticized forms. Each entry is accompanied by a rich set of morphological, grammatical, and phonological attributes. Ideally suited for NLP applications, DAG provides precise phonemic transcriptions and full vowel diacritics designed to enhance Arabic speech technology. Quantity and size: 87,930,738 lines / 24,399 MB (23.8 GB)

ArabLEX: Database of Arabic Place Names (DAP)

ISLRN: 161-842-321-771-2

This full-form Arabic-English place name database provides worldwide coverage of common place names, given in standard MSA orthography, and includes all inflected and cliticized forms for each place name. In addition, precise phonemic transcriptions and full vowel diacritics are designed to enhance Arabic speech technology. Quantity and size: 6,455,201 lines / 812 MB

ArabLEX: Database of Foreign Names in Arabic (DAF)

ISLRN: 943-592-129-040-2

This full-form database covers non-Arab personal names in both Arabic and English, some Arabic script variants, vocalized or unvocalized formats, as well as inflected and cliticized forms. The precise phonemic transcriptions and full vowel diacritics are designed to enhance Arabic speech technology. Quantity and size: 226,784,907 lines / 32,181 MB (31.4 GB)

ArabLEX: Database of Arab Names (DAN)

ISLRN: 773-974-582-139-4

This full-form database covers Arab personal names (both given names and surnames) in both Arabic and English and contains a rich set of romanized name variants for each name, with a variety of supplementary information such as gender, name type and frequency statistics. This comprehensive lexicon (over 6.4 million variants) contains precise phonemic transcriptions and vocalized Arabic for all inflected and cliticized forms for each name. Quantity and size: 218,215,875 lines / 32,659 MB (31.9 GB)

For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please contact us.

Visit the ELRA Catalogue of Language Resources
Visit the Universal Catalogue 
Archives of ELRA Language Resources Catalogue Updates

