ISCA - International Speech Communication Association



ISCApad #298

Friday, April 07, 2023 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (February 2023)
  

We are happy to announce that 1 new written corpus, 3 new monolingual lexica and 2 new bilingual lexica are now available in our catalogue.

 
Learner Corpus of Portuguese L2 – COPLE2
ISLRN: 936-320-703-366-7
The Learner Corpus of Portuguese as Second/Foreign Language (COPLE2) is a corpus of written and oral texts produced by students of Portuguese as Foreign/Second Language courses at the Instituto de Cultura e Língua Portuguesa (Institute of Portuguese Language and Culture, ICLP – FLUL) and by applicants for examinations at the Centro de Avaliação de Português Língua Estrangeira (Center for Evaluation of Portuguese as a Foreign Language, CAPLE – FLUL). The corpus contains texts from learners with 15 different native languages (L1s) and proficiency levels from A1 to C1, and covers a range of topics and task types. It is encoded in TEI format through the TEITOK environment. The corpus currently comprises 978 texts and 182,474 tokens, classified according to the CEFR scales. It is annotated for part of speech, lemma and learner errors. All encoded information is searchable through the CQP query language.

CALEM (Comprehensive Arabic LEMmas)
ISLRN: 462-532-124-988-8
Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic inflected word forms (stems) and their corresponding lemmas. It is composed of 164,272 lemmas representing 7,151,106 stems, detailed as follows: 720 Arabic particles, 15,291 broken plurals, 2,464,239 verbs, 4,675,856 nouns. The lexicon is provided as plain text in UTF-8 encoding and represents about 284 MB of data.
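
As a quick sanity check on the figures quoted above, the four category counts can be summed and compared with the stated number of stems. The sketch below uses only the numbers given in the description; the variable names are illustrative and not part of the resource. As quoted, the categories add up to 5,000 more than the stated total, so the categories may overlap or one of the figures may have been misprinted. The description does not say which.

```python
# Counts exactly as quoted in the CALEM description above.
category_counts = {
    "particles": 720,
    "broken plurals": 15_291,
    "verbs": 2_464_239,
    "nouns": 4_675_856,
}
stated_total = 7_151_106  # number of stems stated in the description

# Sum the per-category counts and compare with the stated total.
computed_total = sum(category_counts.values())
difference = computed_total - stated_total

print(f"sum of categories: {computed_total:,}")
print(f"stated total:      {stated_total:,}")
print(f"difference:        {difference:,}")
```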
 

MADED (Moroccan Arabic Dialect Electronic Dictionary)
ISLRN: 977-057-254-691-5
Moroccan Arabic Dialect Electronic Dictionary (MADED) is an electronic lexicon containing almost 13,000 entries. The entries are written in Arabic script, and each Modern Standard Arabic (MSA) lemma is paired with its Moroccan Arabic equivalent. In addition, MADED entries are annotated with metadata such as part of speech (POS), origin and root. MADED is designed for practical use by the NLP community. The dictionary is provided as a CSV file and represents about 2 MB of data.


MORV (Moroccan Morphological vocabulary)
ISLRN: 064-194-729-767-0
The Moroccan Morphological vocabulary is a lexicon containing more than 4.6 million entries, each describing a Moroccan Arabic word with fourteen (14) morphological and semantic features: the word's orthographic form, segmentation (prefix and suffix), part of speech (POS), gender, number, tense and transitivity (for verbs), origin, dialectal lemma, Arabic lemma, root, voice, state, and affirmative/negative form. The vocabulary is provided as a CSV file and represents about 350 MB of data.

 
CroaTPAS
ISLRN: 649-554-159-147-9
CroaTPAS is a bilingual lexicon in Croatian and English. It was created by manually annotating data from the Croatian Web as Corpus and creating patterns with the Skema editor on the Sketch Engine platform. CroaTPAS is tailor-made to represent verb polysemy and currently contains a total of 683 patterns (belonging to 180 Croatian verbs) expressing different verb senses and 22,677 annotated corpus lines. Moreover, the resource includes 109 metonymic subpatterns linked to 1,112 corpus lines featuring 62 different metonymic shifts.


T-PAS
ISLRN: 432-666-503-743-8
T-PAS is a digital lexicographic resource consisting of a corpus-derived collection of Italian verb valency structures, whose argument slots have been manually annotated with a set of hierarchically organised semantic labels called Semantic Types.
As of today, T-PAS contains a total of 1,164 Italian verb entries containing 5,529 patterns expressing different verb senses, and 252,943 annotated corpus lines. Moreover, the resource includes 84 metonymic subpatterns linked to 1,218 corpus lines featuring 37 different metonymic shifts.


For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please contact us.
_________________________________________

Visit the ELRA Catalogue of Language Resources
Visit the Universal Catalogue
Archives of ELRA Language Resources Catalogue Updates
