ELRA - Language Resources Catalogue - Update (May 2022)
ISCApad #288
Friday, June 10, 2022 by Chris Wellekens
We are happy to announce that 1 new written corpus, 4 new bilingual lexicons and 1 new monolingual lexicon are now available in our catalogue.

Annotated tweet corpus in Arabizi, French and English
ISLRN: 482-848-308-105-6
This corpus, completed in 2020, was built to collect and annotate tweets in 3 languages (Arabizi, French and English) on 3 predefined themes (Hooliganism, Racism, Terrorism). It consists of 17,103 sequences annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes “Others” and “Incomprehensible”. Among these, 4,578 sequences with at least 20 tweets annotated with the 3 predefined themes were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). Also provided is a sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful”, selected according to the topic/opinion annotations and the presence of insults.

A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
ISLRN: 110-617-195-245-4
The bilingual English-Ukrainian lexicon of named entities uses Wikipedia metadata as a source. The extracted named entity pairs are classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The lexicon consists of 624,168 pairs and comes in two formats: csv and xml.

ArabLEX set of data
The ArabLEX set of data consists of 4 databases dedicated to the Arabic language:

ArabLEX: Database of Arabic General Vocabulary (DAG)
ISLRN: 879-334-992-724-8
A comprehensive full-form lexicon of Arabic general vocabulary including all inflected, conjugated and cliticized forms.
Each entry is accompanied by a rich set of morphological, grammatical, and phonological attributes. Ideally suited for NLP applications, DAG provides precise phonemic transcriptions and full vowel diacritics designed to enhance Arabic speech technology.
Quantity and size: 87,930,738 lines / 24,399 MB (23.8 GB)