ISCApad Archive » 2024 » ISCApad #318 » Resources » Database » Linguistic Data Consortium (LDC) update (November 2024) |
ISCApad #318 |
Wednesday, December 11, 2024 by Chris Wellekens |
In this newsletter: Join LDC for membership year 2025
Spring 2025 data scholarship application deadline New publications: LORELEI Yoruba Representative Language Pack Samrómur Synthetic Join LDC for membership year 2025
It’s time to renew your LDC membership for 2025. Current (2024) members who renew their membership before March 3, 2025, will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 3. In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 950+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications. Plans for next year’s publications are in progress. Among the expected releases are:
For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.
Spring 2025 data scholarship application deadline Applications are now being accepted through January 15, 2025, for the Spring 2025 LDC data scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements. New publications: LORELEI Yoruba Representative Language Pack was developed by LDC and is comprised of approximately 7.2 million words of Yoruba monolingual text, 127,000 Yoruba words translated from English data, and 810,000 words of Yoruba-English parallel text. Approximately 77,000 words were annotated for named entities, over 25,000 words were annotated for full entity (including nominals and pronouns) and simple semantic annotation, and around 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forum, news, reference, social network, and weblogs. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage. The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10). 2024 members can access this corpus through their LDC accounts. Non-members may license this data for $250. *
Samrómur Synthetic was developed by the Language and Voice Lab, Reykjavik University and contains 72 hours of Icelandic synthetic speech, transcripts and metadata. Source sentences were extracted from the Samrómur platform, comprised of texts and transcripts covering various genres. Text was processed through a text-to-speech system developed by Reykjavik University's Language and Voice Lab to generate speech files. Synthesized speech was created with 44 voices (22 male, 22 female) at four different speed rates for a total of 220 speakers and 62,700 utterances (with 285 sentences/speaker).
2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for $250. To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options or contact LDC for assistance. Membership Coordinator University of Pennsylvania
T: +1-215-573-1275
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
*
MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab and is a machine translation of TAC Relation Extraction Dataset (LDC2018T24) (TACRED) into twelve languages with projected entity annotations. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The training and evaluation data for the TAC KBP slot filling tasks was developed by the Linguistic Data Consortium.
TACRED training, development, and test splits were translated into Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish using DeepL or Google Translate. The test split was back-translated into English to generate machine-translated English test data. TACRED annotations are specified by token offsets. For translation, tokens were concatenated with white space, and the entity offsets were converted into XML-style markers to denote argument.
|
Back | Top |