ISCA - International Speech
Communication Association



ISCApad #176

Saturday, February 09, 2013 by Chris Wellekens

5-2-1 ELRA - Language Resources Catalogue - Update (2013-02)
  

 

ELRA - Language Resources Catalogue - Update
*****************************************************************

ELRA is happy to announce that 4 new Written Corpora are now available in its catalogue.

 

ELRA-W0059 LT Corpus
The LT Corpus is composed of 70 fiction texts by renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by its POS tag and lemma, and is annotated for NP chunks.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1178

ELRA-W0060 PTPARL Corpus
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by its POS tag and lemma, and is annotated for NP chunks.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1179
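For readers planning to work with the one-token-per-line files described for the two corpora above, the following is a minimal Python sketch of a reader. It rests on assumptions that the announcement does not confirm: that columns are tab-separated, that they appear in the order word form, POS tag, lemma, and that lines with fewer than three columns (blank lines or structural markup) can be skipped. The names `Token` and `read_vertical`, the sample lines and the tags in them are hypothetical. How the NP chunk annotation is encoded is not stated, so the sketch ignores it.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One corpus token (hypothetical record type for this sketch)."""
    form: str
    pos: str
    lemma: str


def read_vertical(lines: Iterable[str], sep: str = "\t") -> List[Token]:
    """Parse one-token-per-line text into Token records.

    Assumption: columns are separated by `sep` and ordered as
    word form, POS tag, lemma. Lines with fewer than three columns
    (blank lines, structural markup) are skipped.
    """
    tokens: List[Token] = []
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) < 3:
            continue  # not a token line under the assumed layout
        tokens.append(Token(parts[0], parts[1], parts[2]))
    return tokens


# Made-up sample lines, not taken from either corpus.
sample = ["O\tDA\to\n", "gato\tCN\tgato\n", "<s>\n", "dorme\tV\tdormir\n"]
tokens = read_vertical(sample)
```

With a real file, the same function could be applied to an open file handle, for example `read_vertical(open(path, encoding="utf-8"))`, once the actual separator and column order have been checked against the delivered documentation.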

ELRA-W0061 CINTIL-DependencyBank
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags. It is composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens) and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1180

ELRA-W0062 CINTIL-DeepBank
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1181

For more information on the catalogue, please contact Valérie Mapelli: mailto:mapelli@elda.org
Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html

 

