Linguistic Data Consortium (LDC) update (April 2018)
ISCApad #239
Friday, May 11, 2018 by Chris Wellekens
In this newsletter:

LDC at ICASSP 2018
LDC at the Philadelphia Science Carnival
Concretely Annotated New York Times
H2, E2, ERK1 Children's Writing
TRAD Arabic-French Parallel Text -- Newsgroup

LDC at ICASSP 2018

LDC will be exhibiting at ICASSP 2018, held this year April 15-20 in Calgary, Canada. Stop by booth B2 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Enhancement and Analysis of Conversational Speech: JSALT 2017
Leveraging LSTM Models for Overlap Detection in Multi-Party Meetings
A Novel LSTM-based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

LDC at the Philadelphia Science Carnival

LDC will share the fun of language with the community on Saturday, April 28, with a booth at the Philadelphia Science Carnival. Visitors will enjoy three language-oriented educational activities that include a language identification game and Chinese character recognition.

The Philadelphia Science Carnival is an annual event organized by Philadelphia’s Franklin Institute to acquaint children and adults with the joys of science.
_______________________________________________________________________________
(1) Concretely Annotated New York Times was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to The New York Times Annotated Corpus (LDC2008T19). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release provides output from multiple tools that produce the same annotation types, represented as different annotation theories under a shared tokenization. Concretely Annotated New York Times contains all of the 1.8 million articles in The New York Times Annotated Corpus. Concretely Annotated New York Times is distributed via hard drive.
2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed The New York Times Annotated Corpus (LDC2008T19) may request a copy of Concretely Annotated New York Times (LDC2018T12) for a $250 media fee. Non-members may license this data for $300.
*
(2) H2, E2, ERK1 Children's Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German school children aged six through eleven. The data in this corpus was collected by elementary schools in Baden-Württemberg, Germany, and digitized at the Cooperative State University during the 2016/2017 school year. Three second, third, and fourth grade classrooms participated in the collection. Texts were written within regular class settings. The students were presented with a picture and were asked to write a story describing it or, if unable to write a text, to list what they saw in the picture. Of the 173 participants, 100 were multilingual, and further metadata is available for 166 of them. The following is included for each text in the database: school week of collection; school type; age; gender; grade/classroom; language spoken at home; and school materials used. H2, E2, ERK1 Children's Writing is distributed via web download.
2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $750.
*
(3) TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. This release consists of 398 segments (translation units) from 17 documents. The source data is Arabic newsgroup text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. LDC has also released TRAD Chinese-French Parallel Text -- Blog (LDC2018T02). TRAD Arabic-French Parallel Text -- Newsgroup is distributed via web download.
2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.