ISCApad #306

Saturday, December 09, 2023 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (November and December 2023)

We are happy to announce that 1 new written corpus, 1 new monolingual lexicon and 2 new speech resources are now available in our catalogue.

Corpus for fine-grained analysis and automatic detection of irony on Twitter
ISLRN: 478-366-550-085-8

This corpus was annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘situational irony’, ‘other verbal irony’ and ‘not ironic’. It consists of 4791 instances with an irony label and a tweet ID.

Bitext Synonym Data - General Language
ISLRN: 470-885-612-363-1

The Bitext Synonym Data - General Language includes 31,723 entries and more than 100,000 synonyms for the English language. This dataset is a set of synonyms developed to augment the English version of WordNet, a powerful open-source lexical database, released in 2005. All synonyms can be linked to Bitext Lexical Data - English (see ELRA-L0140) for lemmatization, POS and morphological information.

Corpus of Spontaneous Japanese (CSJ)
ISLRN: 280-594-494-328-0

The 'Corpus of Spontaneous Japanese' (or CSJ) contains about 650 hours of spontaneous speech, corresponding to about 7,000k words. All of the speech material was recorded using head-worn close-talking microphones and DAT, and down-sampled to 16 kHz, 16-bit accuracy. The speech material is transcribed at both the orthographic and phonetic levels. In addition, segment labels, intonation labels, and other miscellaneous annotations are provided for a subset of CSJ, called the Core, which contains about 500k words or 45 hours of speech.

EWA-DB – Early Warning of Alzheimer speech database
ISLRN: 730-022-142-264-9

EWA-DB is a speech database that contains data from 3 clinical groups (Alzheimer's disease, Parkinson's disease and mild cognitive impairment) and a control group of healthy subjects. Speech samples for each group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object and action naming (30 objects and 30 actions), and picture description (two single pictures and three complex pictures). The total number of speakers in the database is 1649. Of these, 87 have Alzheimer's disease, 175 have Parkinson's disease, 62 have mild cognitive impairment, 2 have a mixed diagnosis of Alzheimer's and Parkinson's disease, and 1323 are healthy controls.

For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please contact us.
_________________________________________

Visit the ELRA Catalogue of Language Resources
Visit the Universal Catalogue

Archives of ELRA Language Resources Catalogue Updates

***************************************************************

We are happy to announce that 3 new monolingual lexicons are now available in our catalogue.

DiaLEX – Egyptian (DiaLEX-EA)
ISLRN: 697-328-151-668-9
A comprehensive full-form lexicon of Egyptian Arabic general vocabulary (DiaLEX-EA) including 78 million entries for 31,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms.
Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Egyptian Arabic, especially morphological analysis and speech technology.
Quantity and size: 75,204,644 lines / 11,217 MB (11.0 GB)

DiaLEX – Emirati (DiaLEX-UA)
ISLRN: 836-793-503-213-8
A comprehensive full-form lexicon of Emirati Arabic general vocabulary (DiaLEX-UA) including 28 million entries for 29,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms.
Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Emirati Arabic, especially morphological analysis and speech technology.
Quantity and size: 24,976,871 lines / 3,841 MB (3.8 GB)

DiaLEX – Saudi Arabian Hijazi (DiaLEX-HA)
ISLRN: 849-157-479-216-3
A comprehensive full-form lexicon of Hijazi Arabic general vocabulary (DiaLEX-HA) including 21 million entries for 30,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms.
Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Hijazi Arabic, especially morphological analysis and speech technology.
Quantity and size: 20,247,655 lines / 2,835 MB (2.8 GB)

For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please contact us.

