ISCA - International Speech
Communication Association



ISCApad #161

Monday, November 07, 2011 by Chris Wellekens

5-2-1 ELRA - Language Resources Catalogue - Update (2011-09)
  

*****************************************************************
ELRA - Language Resources Catalogue - Update
*****************************************************************

ELRA is happy to announce that four new speech resources from the GlobalPhone corpus are now available in its catalogue.
An updated version of the Venice Italian Treebank (VIT) has also been released.

1) New Language Resources:

The GlobalPhone Corpus:
The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 19 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322). In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available on the Internet to provide a large vocabulary (up to 65,000 words). The articles cover national and international political news as well as economic news.

Special prices are offered for a combined purchase of several GlobalPhone languages (5 languages, 10 languages, 15 languages or 19 languages).

Four new languages are available from the GlobalPhone corpus:
ELRA-S0319 GlobalPhone Bulgarian
For more information, see: http://catalog.elra.info/product_info.php?products_id=1141
ELRA-S0320 GlobalPhone Polish
For more information, see: http://catalog.elra.info/product_info.php?products_id=1142
ELRA-S0321 GlobalPhone Thai
For more information, see: http://catalog.elra.info/product_info.php?products_id=1143
ELRA-S0322 GlobalPhone Vietnamese
For more information, see: http://catalog.elra.info/product_info.php?products_id=1144


2) Update of ELRA-W0040 Venice Italian Treebank (VIT):
The new version of VIT has a fully revised constituent-based representation and a completely new dependency-based representation, which was produced by semi-automatic procedures.

The Venice Italian Treebank (VIT) contains about 272,000 words distributed over six different domains: bureaucratic, political, economic and financial, literary, scientific, and news. In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated.
The annotation follows general X-bar criteria with 29 constituency labels and 102 PoS tags. VIT is also made available in a broad annotation version with 10 constituency labels and 22 PoS tags for machine learning purposes. The format is plain text with square bracketing. A UPenn-style version, readable by the open-source query language CorpusSearch, is also provided.
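
For readers unfamiliar with bracketed treebank formats, the following is a minimal sketch of how a UPenn-style bracketed tree might be read into a nested structure. It is not part of the VIT distribution: the function names, the example sentence, and the labels (S, NP, ART, N, VP, V) are assumptions chosen for illustration and are not taken from the VIT tagset or files.

```python
def parse_bracketed(text):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples.

    Hypothetical helper for illustration; the real VIT files may differ in detail.
    """
    # Put spaces around parentheses so a plain split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        if pos >= len(tokens):
            raise ValueError("missing label after '('")
        label = tokens[pos]
        pos += 1                      # consume the label
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())   # nested constituent
            else:
                children.append(tokens[pos])    # word (leaf)
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1                      # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Illustrative sentence with assumed labels, not an actual VIT entry.
example = "(S (NP (ART il) (N gatto)) (VP (V dorme)))"
tree = parse_bracketed(example)
words = leaves(tree)
```

A reader built along these lines keeps the constituent labels and the words separate, so the same traversal could be reused for counting labels or extracting sentences, whatever the actual label inventory turns out to be.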

For more information, see: http://catalog.elra.info/product_info.php?products_id=831

