ELRA - Language Resources Catalogue - Update (February 2022)
ISCApad #285
Tuesday, March 08, 2022 by Chris Wellekens
We are happy to announce that 3 new written corpora are now available in our catalogue.
Danish Gigaword Corpus ISLRN: 024-504-318-388-3
This corpus consists of over a billion words of Danish text collected from various websites. Domains are distributed as follows: Legal (308.8 million words), Social Media (261.4 million words), Subtitles (130.1 million words), Debates (108.4 million words), Conversations (0.7 million words), Web (101.02 million words), Encyclopedia (55.6 million words), Literature (31.3 million words), Manuals (2.6 million words), Books (2.1 million words), Religion (0.6 million words), News (40 million words), Other (1.2 million words).

English-Punjabi Code-Mixed Social Media Content ISLRN: 695-759-706-170-8
The English-Punjabi Code-Mixed Social Media Content corpus is composed of 893,615 English-Punjabi parallel sentences in the following domains: Agriculture, Culture, Entertainment, Health, Religion, Sports, Technology, Tourism, and Education.

Parallel Corpora for 6 Indian Languages ISLRN: 657-350-757-058-6
The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words, 20,000 parallel sentences), Hindi (1,200,000 words, 37,000 parallel sentences), Malayalam (660,000 words, 29,000 parallel sentences), Tamil (747,000 words, 35,000 parallel sentences), Telugu (951,000 words, 43,000 parallel sentences), and Urdu (1,200,000 words, 33,000 parallel sentences), translated into English. Each data set was created by taking around 100 Indian-language Wikipedia pages and obtaining four independent English translations of each sentence in those documents from non-professional translators recruited through crowdsourcing on Amazon Mechanical Turk.

For more information on the catalogue, or if you would like to enquire about having your resources distributed by ELRA, please contact us.
_________________________________________
Visit the Universal Catalogue
Archives of ELRA Language Resources Catalogue Updates