ISCA - International Speech
Communication Association

ISCApad #293

Tuesday, November 08, 2022 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (October 2022)

We are happy to announce that 6 new written corpora, 1 new monolingual lexicon, 1 new speech resource and 1 new multimodal resource are now available in our catalogue.

AnCora Spanish 2.0.0
ISLRN: 252-495-813-736-1
The AnCora Spanish Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: Lemma and Part of Speech, Syntactic constituents and functions, Argument structure and thematic roles, Semantic classes of the verb, Denotative type of deverbal nouns, Nouns related to WordNet synsets, Named Entities, Coreference relation.

AnCora Catalan 2.0.0
ISLRN: 186-654-762-852-8
The AnCora Catalan Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: Lemma and Part of Speech, Syntactic constituents and functions, Argument structure and thematic roles, Semantic classes of the verb, Denotative type of deverbal nouns, Nouns related to WordNet synsets, Named Entities, Coreference relation.

Bulgarian Treebank Corpus
ISLRN: 761-430-854-533-2
The Bulgarian Treebank Corpus is composed of 156,149 tokens (11,138 sentences) drawn from three main sources: Grammar Notebooks (1,391 sentences), News (6,698 sentences) and Other (3,049 sentences). It is available with syntactic and morphological annotation on a sentence basis in Universal Dependencies format.

Bulgarian Event Corpus
ISLRN: 832-960-876-604-2
The Bulgarian Event Corpus is composed of 324,905 tokens appropriate for training Named Entity Recognition (NER), Named Entity Linking (NEL) and Event Recognition models for Bulgarian in a multidomain context within the Humanities. The texts are domain-related. They include documents from the area of Social Sciences and Humanities – scientific papers, archive documents, popular documents, and Wikipedia articles in the relevant areas.

Bulgarian Valency Frame Lexicon
ISLRN: 188-702-981-369-5
The Bulgarian Valency Frame Lexicon is composed of 9,547 lexical entries organized by frames, with 960 mappings to Princeton WordNet, and is available in XML format. It is a treebank-driven resource of valency frames extracted from BulTreeBank. The frames were manually curated. The structure of the frames follows the BulTreeBank syntactic structure.

How2Sign Dataset
ISLRN: 583-408-694-292-6
The How2Sign dataset consists of a parallel corpus of speech and transcriptions of instructional videos and their corresponding American Sign Language (ASL) translation videos and annotations. It was produced by recording 11 persons (6 males and 5 females) with varying hearing status (5 self-identified as hearing, 4 as deaf, 2 as hard of hearing). The videos were recorded at 30 fps in MPEG format. A total of 80 hours of multiview American Sign Language videos were collected, as well as gloss annotations and a coarse video categorization.

Persian Speech Corpus
ISLRN: 058-406-130-314-1
This dataset contains more than 31 hours and 30 minutes of Persian scripted monologue and dialogue data, recorded from 89 Persian speakers (39 males and 50 females) aged between 17 and 80 in Iran (Tehrani dialect). Data consists of read and spontaneous speech recordings: books read by a person, recorded podcasts, articles in the newspapers, radio conversations, phone dialogues. Domains are labelled and include Accounting, Banking, Economics, Finance, Insurance, Literature, Marketing, Medicine, Psychology, Science, Technology, Telecommunication, and Law.

Venice Italian Treebank (VIT) – version 2
ISLRN: 942-234-530-020-7
This is a new release of the Venice Italian Treebank (VIT). It consists of the Written and Spoken VIT subsets. A Penn Treebank version of both subsets, using parentheses, is also made available, along with a slightly modified version using brackets that allows web-based visualization tools to build a tree of the structure. The Written VIT consists of 223,292 tokens excluding punctuation, but 280,641 single tokens including enclitics and punctuation. It contains a totally revised constituency-based representation of the corpus as well as three new files. As for the Spoken VIT, 425 new fully parsed turns were added for a total of 3,973. The total count of sentences is now 5,851.
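
For readers unfamiliar with parenthesised treebank notation, the following is a minimal Python sketch of how such a string can be turned into a tree structure. It is an illustration only: the function name parse_bracketed, the example sentence and the node labels (S, NP, VP, D, N, V) are hypothetical and are not taken from the VIT files, whose exact tagset and file layout are not described in this announcement.

```python
def parse_bracketed(text):
    """Parse a parenthesised tree string into nested (label, children) tuples.

    Leaves are plain strings. The labels used in the example below are
    hypothetical and do not reflect the actual VIT tagset.
    """
    # Pad parentheses with spaces so a plain split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1                      # consume the node label
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())   # nested constituent
            else:
                children.append(tokens[pos])    # leaf word
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


# Hypothetical example sentence, not taken from the corpus.
example_tree = parse_bracketed("(S (NP (D il) (N gatto)) (VP (V dorme)))")
print(example_tree)
```

The sketch assumes well-formed input with balanced parentheses; a version that uses square brackets instead could be handled the same way by changing the two delimiter characters.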

Wojood - A corpus for nested Arabic Named Entity Recognition
ISLRN: 688-718-284-176-0
Wojood consists of about 550,000 tokens (Modern Standard Arabic and dialect) that are manually annotated with 21 entity types (person, group of people, occupation, organization, geopolitical entity, location, facility, event, date, time, language, website, law, product, cardinal number, ordinal number, percent, quantity, unit, money, currency). It covers multiple domains (Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, social media) and was annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. The corpus was annotated using the IOB2 tagging scheme and is available in CSV format.
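
As a brief illustration of the IOB2 scheme mentioned above, the following Python sketch converts a flat sequence of IOB2 tags into entity spans. In IOB2, every entity starts with a B- tag, continues with I- tags of the same label, and tokens outside any entity are tagged O. The function name iob2_to_spans and the labels in the example (PERS, GPE, ORG) are hypothetical; how Wojood lays out its nested tags in the CSV files is not described here, so the sketch handles a single flat tag sequence only.

```python
def iob2_to_spans(tags):
    """Convert a flat IOB2 tag sequence into (label, start, end) spans.

    `end` is exclusive. An I- tag that does not continue an open entity
    of the same label is ignored, since IOB2 requires entities to start
    with B-.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues:
            # Any tag that does not continue the open entity closes it.
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            # Open a new entity at this token.
            start, label = i, tag[2:]
    if label is not None:
        # Close an entity that runs to the end of the sequence.
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tag sequence, not taken from the corpus.
example_tags = ["B-PERS", "I-PERS", "O", "B-GPE", "B-ORG", "I-ORG"]
print(iob2_to_spans(example_tags))
# prints [('PERS', 0, 2), ('GPE', 3, 4), ('ORG', 4, 6)]
```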

For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please contact us.