5-2-1 | ELRA - Language Resources Catalogue - Update (2013-03)
ELRA is happy to announce that QUAERO Structured Named Entity Language Resources are now available in its catalogue. A Written Corpus and a Broadcast Resource annotated with Structured Named Entities from the QUAERO Programme are now being released (free for academic research):
ELRA-W0073 Quaero Old Press Extended Named Entity corpus
This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages. The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1194&language=en
ELRA-S0349 Quaero Broadcast News Extended Named Entity corpus
This corpus consists of the manual annotation of (i) the ESTER 2 (see also ELRA-S0338) manual transcription corpus and (ii) the Quaero Speech Recognition Evaluation corpus (manual and automatic transcriptions coming from 3 different ASR systems). The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1195&language=en
These two corpora are described in: S. Rosset, C. Grouin, K. Fort, O. Galibert, J. Kahn, P. Zweigenbaum. Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers. In Proc. of LAW VI, 2012.
QUAERO is a research and innovation program addressing the automatic processing of multimedia and multilingual content, aiming at the development of new tools for navigating large volumes of text and audiovisual content.
The research and development undertaken covers automatic information retrieval, analysis, segmentation and classification of text, speech, music, image and video. The program, supported by OSEO, gathers 32 French and German partners: large groups, small and medium-sized enterprises, research laboratories and public organizations. The program consists of a number of application projects aiming at industrial targets and markets that are supported by a common shared research structure. Real-world data sets (corpora) are used to define the evaluation tasks and to conduct the research challenges between partners. Systematic periodic technology evaluation makes it possible to assess the progress made and to select the most promising technical and scientific approaches. After nearly five years of existence, Quaero is a very active ecosystem that has produced in excess of 700 scientific publications, more than 25 awards, numerous top 3 rankings in technology evaluation campaigns, 31 patent applications and many innovative prototypes.
To find out more about QUAERO, please visit the following website: http://www.quaero.org
For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html
|
5-2-2 | ELRA releases free Language Resources
Anticipating users’ expectations, ELRA has decided to offer a large number of resources free of charge for academic research use. The offer consists of several sets of speech, text and multimodal resources that are released regularly, for free, as soon as legal aspects are cleared. A first set was released in May 2012 on the occasion of LREC 2012. A second set is now being released.
Whenever this is permitted by our licences, please feel free to use these resources for deriving new resources and depositing them with the ELRA catalogue for community re-use.
Over the last decade, ELRA has compiled a large list of resources into its Catalogue of LRs. ELRA has negotiated distribution rights with the LR owners and made such resources available under fair conditions and within a clear legal framework. Following this initiative, ELRA has also worked on LR discovery and identification with a dedicated team which investigated and listed existing and valuable resources in its 'Universal Catalogue', a list of resources that could be negotiated on a case-by-case basis. At LREC 2010, ELRA introduced the LRE Map, an inventory of LRs, whether developed or used, that were described in LREC papers. This huge inventory listed by the authors themselves constitutes the first 'community-built' catalogue of existing or emerging resources, constantly enriched and updated at major conferences.
Considering the latest trends on easing the sharing of LRs, from both legal and commercial points of view, ELRA is taking a major role in META-SHARE, a large European open infrastructure for sharing LRs. This infrastructure will allow LR owners, providers and distributors to distribute their LRs through an additional and cost-effective channel.
To obtain the available sets of LRs, please visit the web page below and follow the instructions given online: http://www.elra.info/Free-LRs,26.html
|
5-2-3 | Appen ButlerHill
A global leader in linguistic technology solutions
RECENT CATALOG ADDITIONS—MARCH 2012
1. Speech Databases
1.1 Telephony
Language | Database Type | Catalogue Code | Speakers | Status
Bahasa Indonesia | Conversational | BAH_ASR001 | 1,002 | Available
Bengali | Conversational | BEN_ASR001 | 1,000 | Available
Bulgarian | Conversational | BUL_ASR001 | 217 | Available shortly
Croatian | Conversational | CRO_ASR001 | 200 | Available shortly
Dari | Conversational | DAR_ASR001 | 500 | Available
Dutch | Conversational | NLD_ASR001 | 200 | Available
Eastern Algerian Arabic | Conversational | EAR_ASR001 | 496 | Available
English (UK) | Conversational | UKE_ASR001 | 1,150 | Available
Farsi/Persian | Scripted | FAR_ASR001 | 789 | Available
Farsi/Persian | Conversational | FAR_ASR002 | 1,000 | Available
French (EU) | Conversational | FRF_ASR001 | 563 | Available
French (EU) | Voicemail | FRF_ASR002 | 550 | Available
German | Voicemail | DEU_ASR002 | 890 | Available
Hebrew | Conversational | HEB_ASR001 | 200 | Available shortly
Italian | Conversational | ITA_ASR003 | 200 | Available shortly
Italian | Voicemail | ITA_ASR004 | 550 | Available
Kannada | Conversational | KAN_ASR001 | 1,000 | In development
Pashto | Conversational | PAS_ASR001 | 967 | Available
Portuguese (EU) | Conversational | PTP_ASR001 | 200 | Available shortly
Romanian | Conversational | ROM_ASR001 | 200 | Available shortly
Russian | Conversational | RUS_ASR001 | 200 | Available
Somali | Conversational | SOM_ASR001 | 1,000 | Available
Spanish (EU) | Voicemail | ESO_ASR002 | 500 | Available
Turkish | Conversational | TUR_ASR001 | 200 | Available
Urdu | Conversational | URD_ASR001 | 1,000 | Available
1.2 Wideband
Language | Database Type | Catalogue Code | Speakers | Status
English (US) | Studio | USE_ASR001 | 200 | Available
French (Canadian) | Home/Office | FRC_ASR002 | 120 | Available
German | Studio | DEU_ASR001 | 127 | Available
Thai | Home/Office | THA_ASR001 | 100 | Available
Korean | Home/Office | KOR_ASR001 | 100 | Available
2. Pronunciation Lexica
Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:
Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)
Part-of-speech tagged Lexica providing grammatical and semantic labels
Other reference text-based materials, including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalization lists.
Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com
3. Named Entity Corpora
Language | Catalogue Code | Words
Arabic | ARB_NER001 | 500,000
English | ENI_NER001 | 500,000
Farsi/Persian | FAR_NER001 | 500,000
Korean | KOR_NER001 | 500,000
Japanese | JPY_NER001 | 500,000
Russian | RUS_NER001 | 500,000
Mandarin | MAN_NER001 | 500,000
Urdu | URD_NER001 | 500,000
Description (applies to all corpora listed): These NER Corpora contain text material from a variety of sources and are tagged for the following Named Entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.
4. Other Language Resources
Morphological Analyzers – Farsi/Persian & Urdu
Arabic Thesaurus
Language Analysis Documentation – multiple languages
For additional information on these resources, please contact: sales@appenbutlerhill.com
5. Customized Requests and Package Configurations
Appen Butler Hill is committed to providing a low-risk, high-quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and government organizations.
We would be glad to discuss any customized requests or package configurations and to prepare a customized proposal to meet your needs.
6. Contact Information
Prithivi Pradeep
Business Development Manager
ppradeep@appenbutlerhill.com
+61 2 9468 6370
|
Tom Dibert
Vice President, Business Development, North America
tdibert@appenbutlerhill.com
+1-315-339-6165
|
www.appenbutlerhill.com
|