Linguistic Data Consortium (LDC) update (July 2020)
ISCApad #266
Monday, August 10, 2020 by Chris Wellekens
In this newsletter: Penn Parsed Corpora of Historical English Now Available From LDC
Penn Parsed Corpora of Historical English Now Available From LDC
This release also includes annotation guidelines and philological information for each corpus, as well as the CorpusSearch 2 program, which allows users to search the data for words, word sequences, and syntactic structure.
Fall 2020 LDC Data Scholarship Program
Student applications for the Fall 2020 LDC Data Scholarship program are being accepted now through September 15, 2020. This scholarship program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and a letter of support from their advisor.
For application requirements and program rules, please visit the LDC Data Scholarship page.
New publications:
(1) Speech Sentiment Annotations was developed by Google Inc. and consists of sentiment labels (positive, negative, neutral) for approximately 49,500 utterances covering 140 hours of audio from Switchboard-1 Release 2 (LDC97S62).
Switchboard speech files were segmented based on the start and end time of transcript turns. Annotators listened to the audio corresponding to each segment (utterance) and classified each into positive, negative, or neutral categories based on the emotion and attitude of the speaker. Annotators provided a justification for positive and negative classifications using a flow chart. Further information about the methodology and annotation process is contained in the documentation accompanying this release.
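As a rough illustration of segmenting audio by transcript turn times, the following minimal Python sketch cuts a sequence of samples into one segment per turn. The `Turn` record, the `slice_turns` function, and the sample values are hypothetical and are not taken from the release; the actual segmentation procedure is described in the documentation accompanying the corpus.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical transcript turn with start/end times in seconds."""
    speaker: str
    start: float
    end: float


def slice_turns(samples, sample_rate, turns):
    """Return one slice of `samples` per transcript turn.

    `samples` is any sliceable sequence of audio samples and
    `sample_rate` is the number of samples per second.
    """
    segments = []
    for turn in turns:
        first = int(round(turn.start * sample_rate))
        last = int(round(turn.end * sample_rate))
        segments.append(samples[first:last])
    return segments


# Toy input: 16 "samples" at 4 samples per second (4 seconds of audio).
samples = list(range(16))
turns = [Turn("A", 0.0, 1.0), Turn("B", 1.0, 2.5)]
segments = slice_turns(samples, 4, turns)
```

Each resulting segment corresponds to one utterance that an annotator would listen to and label.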
Switchboard-1 Release 2 (LDC97S62) consists of 260 hours of telephone speech from 543 speakers across the United States (302 male speakers, 241 female speakers). A computer-driven telephone collection platform paired two subjects for each conversation and provided a discussion topic, ensuring that no two speakers conversed together more than once and no one speaker talked more than once on a given topic.
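The two pairing constraints described above can be expressed as a small check over a call log. This is only a sketch under stated assumptions: the function name `first_violation`, the `(speaker_a, speaker_b, topic)` tuple layout, and the identifiers in the example are hypothetical, and nothing here reflects how the original collection platform was implemented.

```python
def first_violation(calls):
    """Return the index of the first call that breaks a constraint, or None.

    `calls` is a sequence of (speaker_a, speaker_b, topic) tuples.
    Constraint 1: no two speakers converse together more than once.
    Constraint 2: no speaker talks more than once on a given topic.
    """
    seen_pairs = set()
    seen_speaker_topics = set()
    for index, (a, b, topic) in enumerate(calls):
        pair = frozenset((a, b))  # order of the two speakers is irrelevant
        if pair in seen_pairs:
            return index
        if (a, topic) in seen_speaker_topics or (b, topic) in seen_speaker_topics:
            return index
        seen_pairs.add(pair)
        seen_speaker_topics.add((a, topic))
        seen_speaker_topics.add((b, topic))
    return None


# Hypothetical call log: the third call repeats the pair from the first call.
calls = [
    ("s1", "s2", "music"),
    ("s1", "s3", "travel"),
    ("s2", "s1", "cooking"),
]
violation = first_violation(calls)
```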
Speech Sentiment Annotations is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.
*
(2) Penn Parsed Corpora of Historical English was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This data set contains three corpora covering traditionally recognized periods of English:
Penn Parsed Corpora of Historical English is distributed via web download.
*
(3) IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Javanese conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.
The Javanese speech in this release represents the Central, Western, and Eastern Javanese dialect regions of Indonesia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $25.
*
(4) BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training: The source data in this release consists of transcripts of Chinese conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC96S34, LDC96T16, LDC96S55) that were translated into English by professional translation agencies and annotated for the word alignment task.
The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
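The character-level tokenization step, inserting white space between characters of word-segmented text, can be sketched in a few lines of Python. The function name `to_char_tokens` is hypothetical, and keeping tokens that begin with `*` whole (as a stand-in for empty categories/traces) is an assumption made for this illustration; the source does not say how such tokens were treated.

```python
def to_char_tokens(tokens, atomic_prefix="*"):
    """Join word-segmented tokens into a space-separated character string.

    Tokens starting with `atomic_prefix` are kept whole. This is an
    assumption for illustration, standing in for trace-like tokens.
    """
    pieces = []
    for token in tokens:
        if atomic_prefix and token.startswith(atomic_prefix):
            pieces.append(token)   # keep trace-like token as one unit
        else:
            pieces.extend(token)   # one entry per character
    return " ".join(pieces)


# Hypothetical word-segmented input with one trace-like token.
words = ["我们", "去", "*pro*", "北京"]
char_line = to_char_tokens(words)
```

With this input, `char_line` is the string `我 们 去 *pro* 北 京`, in which every character of an ordinary word is its own token for character alignment.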
BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1750.