ELRA - Language Resources Catalogue - Update (January 2016)
ISCApad #212
Friday, February 05, 2016 by Chris Wellekens
The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens, with ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final version of the treebank comprises two independent versions, constituent trees and dependency trees, and is distributed in the following formats:
1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT XML format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER XML
For more information, see: http://catalog.elra.info/product_info.php?products_id=1248
ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens and 231,626 types. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS-annotated following the Multext-East morphosyntactic specifications, and it is XML encoded.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1249
The GlobalPhone Swahili corpus contains 7,728 utterances spoken by 70 speakers. Native speakers of Swahili were asked to read prompted sentences of newspaper articles. The entire collection took place in Nairobi, Kenya.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1258
The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Swahili dictionary contains 10,664 entries.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1259
The GlobalPhone Ukrainian corpus contains 12,814 utterances spoken by 119 speakers. Native speakers of Ukrainian were asked to read prompted sentences of newspaper articles. The entire collection took place in Donezk, Ukraine.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1260
The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Ukrainian dictionary contains 7,748 entries/7,740 words.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1261
ELRA-S0379 JV_TDM Corpus
ISLRN: 371-240-320-910-4
This corpus provides a phonetic annotation of 37 chapters of the original French version of "Around the World in 80 Days" by Jules Verne, read by a single speaker. Each chapter has been annotated in a separate .TextGrid file. The total audio size is 6h 41mn 36s with 5h 2mn 41s of speech. The .TextGrid files contain several annotation tiers: phoneme, number of characters, syllable, transcription, PoS, paragraph break, sentence break, prosodic annotations, breathing pauses.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1252
ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is XML encoded.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1253
NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun phrases were annotated with specific tags. It was automatically PoS-tagged with the MBT tagger and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1256
The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1257
ELRA-L0096 MCL - Multifunctional Computational Lexicon of Contemporary Portuguese
ISLRN: 489-956-642-755-8
MCL is a 26,443-lemma frequency lexicon with 140,315 tokens extracted from CORLEX, a contemporary Portuguese corpus (16,210,438 words). In order to extract the lexicon, all the different lexical forms occurring in the corpus were indexed and subsequently tagged morphosyntactically and lemmatised by PALAVROSO. Each lemma in MCL is followed by morphosyntactic and quantitative information.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1254
LEX-MWE-PT is a lexicon of European Portuguese containing multiword expressions (MWE) extracted from a balanced 50.8M-word written corpus. The lexicon covers 1,198 lemmas (composed of single words from different PoS categories: nouns, adjectives, verbs and adverbs); 12,753 MWE lemmas (which include inflectional variants of the MWE lemmas); and 242,233 concordances of those MWE manually verified.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1255
For more information on the catalogue, please contact Valérie Mapelli: mapelli@elda.org
If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/