ISCA - International Speech
Communication Association



ISCApad #285

Tuesday, March 08, 2022 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (February 2022)
  
 We are happy to announce that 3 new written corpora are now available in our catalogue.

Danish Gigaword Corpus
ISLRN: 024-504-318-388-3
This corpus consists of over a billion words of Danish text collected from various websites. Domains are distributed as follows: Legal (308.8 million words), Social Media (261.4 million words), Subtitles (130.1 million words), Debates (108.4 million words), Conversations (0.7 million words), Web (101.02 million words), Encyclopedia (55.6 million words), Literature (31.3 million words), Manuals (2.6 million words), Books (2.1 million words), Religion (0.6 million words), News (40 million words), Other (1.2 million words).


English-Punjabi Code-Mixed Social Media Content
ISLRN: 695-759-706-170-8
The English-Punjabi Code-Mixed Social Media Content corpus is composed of 893,615 English-Punjabi parallel sentences in the following domains: Agriculture, Culture, Entertainment, Health, Religion, Sports, Technology, Tourism, and Education.

Parallel Corpora for 6 Indian Languages
ISLRN: 657-350-757-058-6
The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words; 20,000 parallel sentences), Hindi (1,200,000 words; 37,000 parallel sentences), Malayalam (660,000 words; 29,000 parallel sentences), Tamil (747,000 words; 35,000 parallel sentences), Telugu (951,000 words; 43,000 parallel sentences), and Urdu (1,200,000 words; 33,000 parallel sentences), translated into English. Each data set was created by taking around 100 Indian-language Wikipedia pages and obtaining four independent English translations of each sentence in those documents from non-professional translators hired through crowdsourcing on Amazon Mechanical Turk.


For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please contact us.

_________________________________________
Visit the ELRA Catalogue of Language Resources
Visit the Universal Catalogue 
Archives of ELRA Language Resources Catalogue Updates

 

