5-2-1 | Linguistic Data Consortium (LDC) update (May 2022)
In this newsletter: 30th Anniversary Highlight: Penn Treebank
New publications: NUBUC; Samrómur Icelandic Speech 1.0
30th Anniversary Highlight: Penn Treebank
LDC’s Catalog features classic corpora responsible for critical advances in human language technology that continue to influence researchers. Among them are the Penn Treebank releases, Treebank-2 (LDC96T7) and Treebank-3 (LDC99T42).
The Penn Treebank project (1989-1996) produced seven million words tagged for part-of-speech, three million words of parsed text, over two million words annotated for predicate-argument structure and 1.6 million words of transcribed speech annotated for speech disfluencies (Taylor et al., 2003). Source material represents a diverse range of data, including Wall Street Journal (WSJ) articles, the Brown Corpus and Switchboard telephone conversations.
Penn Treebanks are used for a wide range of purposes, including the creation and training of parsers and taggers, work on machine translation and speech recognition, and research concerning joint syntactic and semantic role labeling. Their ongoing influence is evidenced by the popularity of Treebank-3 (LDC99T42), which continues to be one of LDC’s top ten most distributed corpora in the Catalog. In addition, the WSJ section has served as a model for treebanks across many languages (Nivre, 2008).
The Penn Treebank has inspired related annotation schemes, such as Proposition Bank, the Penn Discourse Treebank project, and word alignment annotation. In addition, LDC has developed revised English treebank guidelines resulting in the re-issue of the WSJ section (English News Text Treebank: Penn Treebank Revised (LDC2015T13)) and treebanked web text (e.g., English Web Treebank (LDC2012T13) and BOLT English Translation Treebank – Chinese Discussion Forum (LDC2020T09)).
Penn Treebank corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.
New publications:
(1) NUBUC (NyU-BU contextually controlled stories Corpus) was developed by New York University, Max Planck Institute for Empirical Aesthetics and Boston University. It contains approximately three hours of English read speech from eight stories focused on linguistic keywords that were created specifically for this corpus, along with transcripts, syntactic annotations, and corpus metadata.
Stories are centered on a protagonist and resemble modern fairy tales. Each story consists of approximately 2,000 words organized around critical keywords matched along multiple linguistic dimensions. The story texts comprise a total of 1,024 sentences and 16,472 words. Each story was read by two different voice actors, one male and one female, in a neutral American English accent.
Recordings are 11-12 minutes in duration, for a total of about 90 minutes of continuous speech per speaker.
NUBUC is distributed via web download.
2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.
(2) Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances.
Speech data was collected between October 2019 and May 2021 using the Samrómur website, which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science, and others were created by combining a name with a following question or demand. Prompts and speaker metadata are included in the corpus.
Samrómur Icelandic Speech 1.0 is distributed via web download.
2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
5-2-2 | ELRA - Language Resources Catalogue - Update (May 2022)
We are happy to announce that 1 new written corpus, 4 new bilingual lexicons and 1 new monolingual lexicon are now available in our catalogue.
Annotated tweet corpus in Arabizi, French and English
ISLRN: 482-848-308-105-6
The annotated tweet corpus in Arabizi, French and English, completed in 2020, was built to collect and annotate tweets in 3 languages (Arabizi, French and English) for 3 predefined themes (Hooliganism, Racism, Terrorism). It consists of 17,103 sequences annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes “Others” and “Incomprehensible”. Among these, 4,578 sequences with at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism) were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful”, selected according to the topic/opinion annotations and the presence of insults, is also provided.
A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
ISLRN: 110-617-195-245-4
The bilingual English-Ukrainian lexicon of named entities uses Wikipedia metadata as a source. The extracted named entity pairs are classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The lexicon consists of 624,168 pairs and comes in two formats: CSV and XML.
ArabLEX set of data
The ArabLEX set of data consists of 4 databases dedicated to the Arabic language:
ArabLEX: Database of Arabic General Vocabulary (DAG)
ISLRN: 879-334-992-724-8
A comprehensive full-form lexicon of Arabic general vocabulary including all inflected, conjugated and cliticized forms. Each entry is accompanied by a rich set of morphological, grammatical, and phonological attributes. Ideally suited for NLP applications, DAG provides precise phonemic transcriptions and full vowel diacritics designed to enhance Arabic speech technology.
Quantity and size: 87,930,738 lines / 24,399 MB (23.8 GB)