ELRA-W0084 Arboretum treebank
ISLRN: 025-729-182-451-2


The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens, with ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final version of the treebank consists of two independent annotations, constituent trees and dependency trees, and is distributed in the following formats:
1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT XML format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER XML
ELRA-W0085 ROCO Romanian journalistic corpus
ISLRN: 312-617-089-348-7
ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS annotated following the Multext-East morphosyntactic specifications, and it is XML encoded.
ELRA-S0375 GlobalPhone Swahili
ISLRN: 200-331-212-512-8
The GlobalPhone Swahili corpus contains 7,728 utterances spoken by 70 speakers. Native speakers of Swahili were asked to read prompted sentences of newspaper articles. The entire collection took place in Nairobi, Kenya.
ELRA-S0376 GlobalPhone Swahili Pronunciation Dictionary
ISLRN: 010-360-238-702-2
The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Swahili dictionary contains 10,664 entries.
ELRA-S0377 GlobalPhone Ukrainian
ISLRN: 456-398-378-806-1
The GlobalPhone Ukrainian corpus contains 12,814 utterances spoken by 119 speakers. Native speakers of Ukrainian were asked to read prompted sentences of newspaper articles. The entire collection took place in Donetsk, Ukraine.
ELRA-S0378 GlobalPhone Ukrainian Pronunciation Dictionary
ISLRN: 022-652-862-222-7
The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Ukrainian dictionary contains 7,748 entries/7,740 words.
ELRA-S0379 JV_TDM Corpus
ISLRN: 371-240-320-910-4
This corpus provides a phonetic annotation of 37 chapters of the original French version of "Around the World in 80 Days" by Jules Verne, read by a single speaker. Each chapter has been annotated in a separate .TextGrid file. The total audio duration is 6 h 41 min 36 s, of which 5 h 2 min 41 s is speech. The .TextGrid files contain several annotation tiers: phoneme, number of characters, syllable, transcription, PoS, paragraph break, sentence break, prosodic annotations, breathing pauses.
ELRA-W0088 ROMBAC - Romanian balanced corpus
ISLRN: 162-192-982-061-0
ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is XML encoded.
ELRA-W0089 NPChunks
ISLRN: 412-883-442-173-8
NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun phrases were annotated with specific tags. It was automatically PoS-tagged with the MBT tagger and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
ELRA-W0090 EUROPARL Corpus Parallel Corpora: Portuguese-English
ISLRN: 435-502-922-727-2
The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token.
ELRA-L0096 MCL - Multifunctional Computational Lexicon of Contemporary Portuguese
ISLRN: 489-956-642-755-8
MCL is a 26,443-lemma frequency lexicon with 140,315 tokens extracted from CORLEX, a contemporary Portuguese corpus (16,210,438 words). In order to extract the lexicon, all the different lexical forms occurring in the corpus were indexed and subsequently tagged morphosyntactically and lemmatised by PALAVROSO. Each lemma in MCL is followed by morphosyntactic and quantitative information.
ELRA-L0097 LEX-MWE-PT - Word Combination in Portuguese
ISLRN: 353-430-176-260-6
LEX-MWE-PT is a lexicon of European Portuguese containing multiword expressions (MWE) extracted from a balanced 50.8M-word written corpus. The lexicon covers 1,198 lemmas (composed of single words from different PoS categories: nouns, adjectives, verbs and adverbs); 12,753 MWE lemmas (which include inflectional variants of the MWE lemmas); and 242,233 manually verified concordances of those MWEs.
