ISCApad #278
Monday, August 09, 2021 by Chris Wellekens
5-2-1 | Linguistic Data Consortium (LDC) update (July 2021)
In this newsletter: LDC Submissions: a new platform for sharing data through LDC
* (2) Chinese Abstract Meaning Representation 2.0 was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of approximately 20,000 Chinese sentences from Chinese Treebank (CTB) 8.0 (LDC2013T21). CAMR 2.0 includes the content of Chinese Abstract Meaning Representation 1.0 (LDC2019T07) (CTB 8.0 weblog and discussion forum sentences), plus an additional 9,933 sentences from the newswire portion of CTB 8.0.
* (3) BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies. Co-reference annotation aims to fill in the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers, and verbs.
Membership Coordinator
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810 Philadelphia, PA 19104
5-2-2 | ELRA - Language Resources Catalogue - Update (June 2021)

We are happy to announce that 6 new written corpora and 8 new bilingual dictionaries are now available in our catalogue.

ELRA-W0310 Monolingual Vietnamese Annotated Corpus
ISLRN: 004-081-406-421-7
The Monolingual Vietnamese Annotated Corpus consists of 100,000 sentences, manually annotated with word boundaries, POS, and named entities, with an average length of 20 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0310/

ELRA-W0311 English-Vietnamese Parallel Corpus
ISLRN: 893-470-491-825-6
The English-Vietnamese Parallel Corpus consists of 1,000,000 sentence pairs, with an average length of 20 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0311/

ELRA-W0312 Chinese-Vietnamese Parallel Corpus
ISLRN: 128-772-037-486-0
The Chinese-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0312/

ELRA-W0313 Korean-Vietnamese Parallel Corpus
ISLRN: 365-128-449-700-7
The Korean-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0313/

ELRA-W0314 English-Chinese-Vietnamese Trilingual Parallel Corpus
ISLRN: 637-630-726-817-9
The English-Chinese-Vietnamese Trilingual Parallel Corpus consists of 20,046 trilingual sets of sentence pairs. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0314/

ELRA-W0315 Persian Ezafe Construction Dataset
ISLRN: 663-014-610-121-2
This database includes gold Ezafe tags in almost 30 thousand Persian sentences. The sentences were manually annotated by six annotators who were all native Persian speakers and linguists.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0315/

ELRA-M0078 English-Vietnamese Dictionary
ISLRN: 853-782-057-600-0
The English-Vietnamese Dictionary consists of 125,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0078/

ELRA-M0079 Vietnamese-English Dictionary
ISLRN: 747-175-261-587-4
The Vietnamese-English Dictionary consists of 156,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0079/

ELRA-M0080 Chinese-Vietnamese Dictionary
ISLRN: 120-577-487-890-2
The Chinese-Vietnamese Dictionary consists of 52,470 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples. The dictionary is provided in XML format.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0080/

ELRA-M0081 Vietnamese-Chinese Dictionary
ISLRN: 481-792-486-258-2
The Vietnamese-Chinese Dictionary consists of 50,911 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0081/