ISCApad #252

Tuesday, June 11, 2019 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (April 2019)

ELRA is happy to announce that 4 new Speech resources, 1 new Written Corpus and 1 new Multilingual Lexicon are now available in our catalogue.

ELRA-S0399 GlobalPhone Multilingual Model Package
ISLRN: 204-945-263-927-6
The GlobalPhone Multilingual Model Package contains about 22 hours of transcribed read speech spoken by native speakers in 22 languages (Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese). The GlobalPhone Multilingual Model Package covers about 1 hour of transcribed speech from 10 speakers (5 male, 5 female) from each of the above listed 22 languages.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0399/

ELRA-S0400 GlobalPhone 2000 Speaker Package
ISLRN: 331-592-378-424-7
The GlobalPhone 2000 Speaker Package contains transcribed read speech spoken by 2000 native speakers in 22 languages (Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese). The GlobalPhone 2000 Speaker Package covers about 9,000 randomly selected utterances read by 2000 native speakers in 22 languages, i.e. on average 4.5 utterances corresponding to 40 seconds of speech per speaker amounting to a total of 22 hours of speech.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0400/
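
The averages quoted for the GlobalPhone 2000 Speaker Package can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the figures stated above (2000 speakers, about 9,000 utterances, about 40 seconds per speaker); the variable names are illustrative and are not part of the ELRA distribution.

```python
# Figures quoted in the ELRA-S0400 description (all approximate).
speakers = 2000
utterances = 9000
seconds_per_speaker = 40  # stated average amount of speech per speaker

# Average number of utterances read by each speaker.
utterances_per_speaker = utterances / speakers

# Total duration implied by the per-speaker average, in hours.
total_hours = speakers * seconds_per_speaker / 3600

print(f"{utterances_per_speaker:.1f} utterances per speaker")
print(f"{total_hours:.1f} hours of speech in total")
```

Both results agree with the description: 4.5 utterances per speaker and roughly 22 hours of speech overall.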

ELRA-S0402 Speaking atlas of the regional languages of France
ISLRN: 112-393-061-014-3
The Speaking atlas of the regional languages of France offers the same Aesop's fable read in French and in a number of varieties of the languages of France. This work, which has both a scientific and a heritage dimension, highlights the linguistic diversity of Metropolitan France and the Overseas Territories through recordings collected in the field and presented via an interactive map, together with their orthographic transcription. For Occitan, about sixty varieties were collected in Gascony, Languedoc, Provence, northern Occitania and the Linguistic Crescent. Varieties of Basque, Breton, Franconian, West Flemish, Alsatian, Corsican, Catalan, Francoprovençal and Oïl language(s) are also provided, as well as about fifty languages of the French Overseas Territories and non-territorial languages such as Rromani and French Sign Language.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0402/

ELRA-S0403 CLE Pakistan Urdu Speech Corpus
ISLRN: 572-070-066-634-8
This corpus consists of phonetically rich Urdu sentences and additional sentences covering telephone numbers, addresses and personal names. The speech was recorded with a variety of microphone types, and the speech files are sampled at 16 kHz. Each utterance is stored in a separate file and is accompanied by its orthographic transcription file in Unicode.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0403/

ELRA-W0128 ECPC Corpus (European Comparable and Parallel Corpora of Parliamentary Speeches Archive) - set 1
ISLRN: 036-939-425-010-1
This corpus is a collection of XML metatextually tagged corpora containing speeches from European chambers. It is a bilingual, bidirectional written corpus in English and Spanish. This first set (ECPC_EP-05) consists of (1) a 'clean' version in XML of the European Parliament's 2005 daily sessions; (2) a POS-tagged version of the 2005 daily sessions; and (3) a sentence-based aligned version of the 2005 daily sessions. In its raw format, ECPC_EP-05 contains 3,668,476 tokens/words (excluding tagging) in English distributed over 60 UTF-8 files and 3,993,867 tokens/words (excluding tagging) in Spanish distributed over 60 UTF-8 files.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0128/

ELRA-M0051 EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages
ISLRN: 690-348-503-270-1
This lexicon consists of 26,155 parallel named entities in seven languages: English and six South Slavic languages (Bosnian, Bulgarian, Croatian, Macedonian, Serbian and Slovenian). The lexicon also contains multiword entries that are not strictly named entities but contain a word that is. Slovenian, Croatian and Bosnian are written in Latin script, Macedonian and Bulgarian in Cyrillic. Serbian is a special case, since it can be written in two scripts (Cyrillic and Latin) and has two dialects (ekavica and ijekavica). This lexicon uses the Serbian ekavica variant in Cyrillic script. The lexicon comes in two formats: CSV and XML.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0051/

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/
