Linguistic Data Consortium (LDC) update (November 2018)
ISCApad #246
Thursday, December 13, 2018 by Chris Wellekens
In this newsletter:
Join LDC for Membership Year 2019
Spring 2019 Data Scholarship Program
Commercial use and LDC data
New publications:
BOLT Egyptian Arabic Treebank - Discussion Forum
IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a
_________________________________________________________________________________
Join LDC for Membership Year 2019
Membership Year 2019 (MY2019) is open, and discounts are available for those who keep their membership current and join early in the year. Current MY2018 members who renew their LDC membership before March 1, 2019 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 1.
In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 750 holdings. Current-year for-profit members may use most data for commercial applications.
Plans for MY2019 publications are in progress. Among the expected releases are:
And, it’s not too late to join for MY2017 (through December 31, 2018) and MY2018 (through December 31, 2019). Data sets from those years include 2010 NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting and Language Identification releases, CHiME, Noisy TIMIT Speech, Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ Audio, LORELEI Amharic and Somali Language Packs and DEFT Spanish Treebank. For full descriptions of all LDC data sets, browse our Catalog.
Visit Join LDC for details on membership, user accounts and payment.
Spring 2019 Data Scholarship Program
Applications are being accepted through January 15, 2019 for the Spring 2019 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
New publications:
(1) AISHELL-1 was developed by Beijing Shell Shell Technology Co., Ltd. It contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts.
The goal of the collection was to support speech recognition system development in 11 domains, including smart homes, autonomous driving, entertainment, finance, and science and technology. Participants read 500 sentences covering the domains; sentences were chosen for their speech and phonetic characteristics. The speech was recorded in a quiet indoor environment on a high-fidelity microphone and two mobile phones (Android and iOS).
Speakers were recruited from different accent areas across China, including North, South, and Yue-Gui-Min regions. There were 214 female speakers and 186 male speakers. Additional demographic information about the participants is included in this release.
AISHELL-1 is distributed via hard drive.
2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $3,000.
(2) Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.
The corpus contains 1,400 speakers (700 male, 700 female) who generated 1,400 utterances from read and spontaneous speech. Utterances were transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels).
Avatar Education Portuguese is distributed via web download.
2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $100.
(3) BOLT Egyptian Arabic Treebank - Discussion Forum was developed by LDC and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation collected for the DARPA Broad Operational Language Translation (BOLT) Program.
The annotations in this release follow Penn Arabic Treebank (PATB) annotation guidelines. Two kinds of morphological analysis are synchronized in the corpus: LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01) was used for Modern Standard Arabic tokens, and CALIMA (Columbia Arabic Language and dIalect Morphological Analyzer) was used for Egyptian Arabic tokens.
This release contains 440,448 tokens before clitics were split and 508,548 tree tokens after clitics were split for treebank annotation. The source material is web discussion forums collected by LDC from various sources.
The unannotated Egyptian Arabic source data is released as BOLT Arabic Discussion Forums (LDC2018T10).
BOLT Egyptian Arabic Treebank - Discussion Forum is distributed via web download.
2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $4,500.
(4) IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.
The Telugu speech in this release represents that spoken in the Central, East, South, and North Telugu dialect regions of India. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a is available via web download.
2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $25.