ISCApad Archive » 2020 » ISCApad #264 » Resources » Database » Linguistic Data Consortium (LDC) update (May 2020) |
ISCApad #264 |
Wednesday, June 10, 2020 by Chris Wellekens |
In this newsletter:
New publications:
*
(2) LORELEI Entity Detection and Linking Knowledge Base was developed by LDC and contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base (KB) used for all LORELEI Representative Language and Incident Language Pack entity linking annotation. The LORELEI (Low Resource Languages for Emergent Incidents) Program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.
The KB in this release supported the EDL task in LORELEI for four entity types -- geo-political entities (GPE), locations (LOC), persons (PER), and organizations (ORG) -- and contains a total of 10,216,832 entities. There are four inputs to the KB, each designated by a unique 'origin' code in the KB, as follows: GPE and LOC entities from a snapshot of GeoNames, PER entities from the CIA World Leaders List, ORG entities from Appendix B of the CIA World Factbook, and additional entities manually created by LDC for each of the representative and incident languages in the LORELEI Program.
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.
*
(3) BOLT English Translation Treebank - Chinese Discussion Forum was developed by LDC and consists of 147,432 tokens of web discussion forum data translated from Chinese to English and annotated for part-of-speech and syntactic structure.
The source data is Chinese discussion forum web text collected by LDC in 2011 and 2012, translated into English and released in BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05). A subset of the translated text -- 148 files representing 147,432 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech, and syntactic structure. Only the translated English text is included in the source data for this release.
Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.
BOLT English Translation Treebank - Chinese Discussion Forum is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1750.
*
(4) Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese was developed by LDC and is comprised of approximately 25 hours of telephone speech in Mandarin Chinese.
The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type, and noise.
LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:
Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese is distributed via web download. 2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1500.
*
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
|
Back | Top |