ISCA - International Speech Communication Association

ISCApad #212

Friday, February 05, 2016 by Chris Wellekens

5-2-1 ELRA - Language Resources Catalogue - Update (January 2016)

*****************************************************************
ELRA - Language Resources Catalogue - Update
*****************************************************************

ELRA-W0084 Arboretum treebank
ISLRN: 025-729-182-451-2

The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens, with ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final treebank comprises two independent annotations, constituent trees and dependency trees, and is distributed in the following formats:
1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT XML format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER XML
For more information, see: http://catalog.elra.info/product_info.php?products_id=1248

ELRA-W0085 ROCO Romanian journalistic corpus
ISLRN: 312-617-089-348-7

ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS annotated following the Multext-East morphosyntactic specifications, and it is XML encoded.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1249

ELRA-S0375 GlobalPhone Swahili
ISLRN: 200-331-212-512-8

The GlobalPhone Swahili corpus contains 7,728 utterances spoken by 70 speakers. Native speakers of Swahili were asked to read prompted sentences of newspaper articles. The entire collection took place in Nairobi, Kenya.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1258

ELRA-S0376 GlobalPhone Swahili Pronunciation Dictionary
ISLRN: 010-360-238-702-2

The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Swahili dictionary contains 10,664 entries.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1259

ELRA-S0377 GlobalPhone Ukrainian
ISLRN: 456-398-378-806-1

The GlobalPhone Ukrainian corpus contains 12,814 utterances spoken by 119 speakers. Native speakers of Ukrainian were asked to read prompted sentences of newspaper articles. The entire collection took place in Donetsk, Ukraine.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1260

ELRA-S0378 GlobalPhone Ukrainian Pronunciation Dictionary
ISLRN: 022-652-862-222-7

The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Ukrainian dictionary contains 7,748 entries for 7,740 words.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1261

ELRA-S0379 JV_TDM Corpus
ISLRN: 371-240-320-910-4

This corpus provides a phonetic annotation of 37 chapters of the original French version of "Around the World in 80 Days" by Jules Verne, read by a single speaker. Each chapter has been annotated in a separate .TextGrid file. The total audio duration is 6h 41min 36s, of which 5h 2min 41s is speech. The .TextGrid files contain several annotation tiers: phoneme, number of characters, syllable, transcription, PoS, paragraph break, sentence break, prosodic annotations, breathing pauses.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1252

ELRA-W0088 ROMBAC - Romanian balanced corpus
ISLRN: 162-192-982-061-0

ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus contains around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is XML encoded.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1253

ELRA-W0089 NPChunks
ISLRN: 412-883-442-173-8

NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun phrases were annotated with specific tags. It was automatically PoS-tagged with the MBT tagger and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1256

ELRA-W0090 EUROPARL Corpus Parallel Corpora: Portuguese-English
ISLRN: 435-502-922-727-2

The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1257

ELRA-L0096 MCL - Multifunctional Computational Lexicon of Contemporary Portuguese
ISLRN: 489-956-642-755-8

MCL is a frequency lexicon of 26,443 lemmas with 140,315 tokens, extracted from CORLEX, a contemporary Portuguese corpus (16,210,438 words). To extract the lexicon, all the different lexical forms occurring in the corpus were indexed, then morphosyntactically tagged and lemmatised by PALAVROSO. Each lemma in MCL is followed by morphosyntactic and quantitative information.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1254

ELRA-L0097 LEX-MWE-PT - Word Combination in Portuguese
ISLRN: 353-430-176-260-6

LEX-MWE-PT is a lexicon of European Portuguese containing multiword expressions (MWEs) extracted from a balanced 50.8M-word written corpus. The lexicon covers 1,198 lemmas (single words from different PoS categories: nouns, adjectives, verbs and adverbs); 12,753 MWE lemmas (which include inflectional variants of the MWE lemmas); and 242,233 concordances of those MWEs, manually verified.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1255

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/
