ISCApad #204 |
Tuesday, June 16, 2015 by Chris Wellekens |
5-2-1 | ELRA - Language Resources Catalogue - Update (2015-05) dedicated to the Nepali people.
In response to the devastating April 2015 earthquake in Nepal, ELRA would like to make Nepali corpora available for free. Originally available for research purposes only in the ELRA Catalogue, these Language Resources (2 Nepali Written Corpora and 1 Speech Corpus) will be provided at no cost to those working on the development of systems and applications to be used during the reconstruction phase in Nepal, for not-for-profit purposes. If you feel that ELRA can help in other ways, please let us know.
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) is a collection of Nepali written texts of 2000 words each, drawn from 15 different genres and published between 1990 and 1992. It is based on the FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words. For more information, see: http://catalog.elra.info/product_info.php?products_id=1216
The second written corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali). For more information, see: http://catalog.elra.info/product_info.php?products_id=1217
The Nepali Spoken Corpus contains audio recordings of different social activities, made in their natural settings as far as possible, with phonologically transcribed and annotated texts and information about the participants. A total of 17 types of activity were recorded. The total duration of the recorded material is 31 hours and 26 minutes. For more information, see: http://catalog.elra.info/product_info.php?products_id=1219
For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/
5-2-2 | LDC Newsletter (May 2015)
Early renewing members save again
New publications:
Coordination Annotation for the Penn Treebank
Early renewing members save again
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information.
New publications
(1) Coordination Annotation for the Penn Treebank is a stand-off annotation for the Wall Street Journal portion of Treebank-3 (PTB3) (LDC99T42) developed by researchers at the University of Düsseldorf and Indiana University. It marks all tokens that have a coordinating function (potentially among other functions).
Coordination is a syntactic structure that links together two or more elements known as conjuncts or conjoins. The presence of coordination is often signaled by the appearance of a coordinator (coordinating conjunction), such as and, or, but in English.
This annotation is presented in a single UTF-8 plain text tsv file with columns as follows:
Coordination Annotation for the Penn Treebank is available at no cost to all licensees of PTB3 and appears in their download queue associated with LDC99T42 as penn_coordination_anno_LDC2015T08.tgz.
*
(2) GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by LDC and comprises approximately 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 (LDC2015T09).
Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.
The broadcast conversation recordings in this release feature interviews, call-in programs, and roundtable discussions focusing principally on current events from the following sources: Beijing TV, China Central TV, Hubei TV, Phoenix TV and Voice of America.
This release contains 209 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded, and as a guide for data selection by retaining information about a program’s genre, data type and topic.
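Recipients who want to confirm that a downloaded file matches the announced audio format (16000 Hz, single channel, 16-bit) can read those values from the FLAC header without decoding the audio. The following is a minimal sketch, not an LDC tool: the function names are hypothetical, and it assumes the file starts directly with the `fLaC` marker followed by the STREAMINFO block (no prepended tag data).

```python
def parse_flac_streaminfo(header: bytes) -> dict:
    """Extract basic stream properties from the first 26+ bytes of a FLAC file."""
    if len(header) < 26 or header[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    # Byte 4: 1-bit "last block" flag + 7-bit block type (0 = STREAMINFO).
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 18..25 pack: 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(header[18:26], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x07) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_announced_format(info: dict) -> bool:
    """True if the stream is 16000 Hz, mono, 16-bit, as stated in the announcement."""
    return (info["sample_rate"], info["channels"], info["bits_per_sample"]) == (16000, 1, 16)


def check_file(path: str) -> bool:
    """Read only the header of the file at `path` (a hypothetical local path)."""
    with open(path, "rb") as handle:
        return matches_announced_format(parse_flac_streaminfo(handle.read(42)))
```

Dividing `total_samples` by `sample_rate` gives the duration of a file in seconds, which can be summed across files and compared with the announced total of roughly 112 hours.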
GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 is distributed on DVD. 2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
*
(3) GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 112 hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding audio data is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 (LDC2015S06).
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,388,236 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.
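Because the transcripts are tab-delimited UTF-8 text, they can be read with a few lines of Python. The sketch below is illustrative only: the column positions (`START_COL`, `END_COL`, `TRANSCRIPT_COL`), the `;;` comment prefix and the sample rows are assumptions that are not stated in this announcement, so they should be checked against the documentation shipped with the corpus.

```python
# Assumed 0-based column positions; verify against the corpus documentation.
START_COL, END_COL, TRANSCRIPT_COL = 2, 3, 7


def read_tdf_segments(text: str):
    """Yield (start, end, transcript) for each data row of tab-delimited text."""
    for line in text.splitlines():
        # Skip blank lines and lines assumed to be comments.
        if not line.strip() or line.startswith(";;"):
            continue
        fields = line.split("\t")
        try:
            start = float(fields[START_COL])
            end = float(fields[END_COL])
        except (IndexError, ValueError):
            continue  # header row or malformed line
        transcript = fields[TRANSCRIPT_COL] if len(fields) > TRANSCRIPT_COL else ""
        yield start, end, transcript


def total_duration(text: str) -> float:
    """Sum the durations (in seconds) of all transcribed segments."""
    return sum(end - start for start, end, _ in read_tdf_segments(text))


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = (
    "file\tchannel\tstart\tend\tspeaker\tspeakerType\tspeakerDialect\ttranscript\n"
    ";; an assumed comment line\n"
    "show.flac\t0\t0.0\t2.5\tspk1\tmale\tnative\t你好\n"
    "show.flac\t0\t2.5\t6.0\tspk2\tfemale\tnative\t谢谢\n"
)
```

Reading a real file would use `open(path, encoding="utf-8").read()` with a path of your own; the header row is skipped automatically because its start and end fields are not numeric.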
The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 is distributed via web download. 2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.
*
(4) SenSem (Sentence Semantics) Lexicons was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida and the Universitat Oberta de Catalunya. It contains feature descriptions for approximately 1,300 Spanish verbs and 1,300 Catalan verbs in the SenSem Databank (LDC2015T02). GRIAL's work focuses on resources for applied linguistics, including lexicography, translation and natural language processing.
The verb features for each language consist of two groups: those codified manually, including definition, WordNet synset, Aktionsart, arguments and semantic functions; and those extracted automatically from the SenSem Databank. Among the latter are verb frequency, semantic construction, syntactic categories and constituent order. The verbs analyzed correspond to the 250 most frequent verbs in Spanish and 320 lemmas in Catalan. Further information about the SenSem project can be obtained from the GRIAL website. Data is presented in a single XML file per language.
SenSem Lexicons is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$200. This data is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license and to LDC for-profit members under the terms of the For-Profit Membership Agreement.
5-2-3 | Appen ButlerHill
Appen ButlerHill, a global leader in linguistic technology solutions
RECENT CATALOG ADDITIONS - MARCH 2012
1. Speech Databases
1.1 Telephony
2. Pronunciation Lexica
Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:
Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)
Part-of-speech tagged Lexica providing grammatical and semantic labels
Other reference text-based materials including spelling/misspelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalization lists.
Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com
4. Other Language Resources
Morphological Analyzers: Farsi/Persian & Urdu
Arabic Thesaurus
Language Analysis Documentation: multiple languages
For additional information on these resources, please contact: sales@appenbutlerhill.com
5. Customized Requests and Package Configurations
Appen Butler Hill is committed to providing a low-risk, high-quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and government organizations. We would be glad to discuss any customized requests or package configurations and prepare a customized proposal to meet your needs.
5-2-4 | OFROM, the first corpus of French from French-speaking Switzerland
We would like to announce that OFROM, the first corpus of French spoken in French-speaking Switzerland (Suisse romande), is now online. In its current version, the archive contains about 15 hours of recordings. It is transcribed in standard orthography using the Praat software. A concordancer lets users search the corpus and download the sound excerpts associated with the transcriptions.
To access the data and read a fuller description of the corpus, please visit: http://www.unine.ch/ofrom.
5-2-5 | Real-world 16-channel noise recordings We are happy to announce the release of DEMAND, a set of real-world
5-2-6 | Support for finalizing oral or multimodal corpora for distribution, promotion and long-term archiving
The IRCOM consortium of the TGIR Corpus and the EquipEx ORTOLANG are joining forces to offer technical and financial support for finalizing oral or multimodal data corpora, with a view to their distribution and long-term preservation through the EquipEx ORTOLANG. This call does not concern the creation of new corpora, but the finalization of existing corpora that are not available electronically. By finalization, we mean deposit in a public digital repository and entry into a long-term archiving workflow. In this way, the speech data enriched by your research can be reused, cited and further enriched cumulatively to enable the development of new knowledge, under the terms of use that you choose (selection of usage licences for each corpus deposited).
This call for proposals is subject to several conditions (see below), and financial support is limited to 3000 euros per project. Requests will be processed in the order in which IRCOM receives them. Requests from EAs (équipes d'accueil) or from small teams without "corpus" technical support will be given priority. Requests may be submitted from 1 September 2013 to 31 October 2013. Funding decisions rest with the IRCOM steering committee. Requests not processed in 2013 may be processed in 2014. If you have doubts about your project's eligibility, do not hesitate to contact us so that we can study your request and adapt our future offers.
To compensate for the wide disparity in computing skills among the people and working groups producing corpora, IRCOM offers personalized support for corpus finalization. It will be carried out by an IRCOM engineer according to the requests made, and adapted to the type of need, whether technical or financial.
The conditions required to propose a corpus for finalization and obtain support from IRCOM are:
Requests may concern any type of processing: processing of near-final corpora (conversion, anonymization), alignment of already transcribed corpora, conversion from word-processor formats, digitization of old media. For any request requiring substantial manual work, applicants must commit human or financial resources commensurate with those provided by IRCOM and ORTOLANG.
IRCOM is aware of the exceptional and exploratory nature of this initiative. It should also be recalled that this funding is reserved for corpora that are already largely constituted and cannot be used for creation from scratch. Because resources are limited, proposals for the corpora that are furthest along may be given priority, in agreement with the IRCOM steering committee (CP). There is, however, no "theoretical" limit on the requests that can be made, as IRCOM can redirect requests outside its remit to other contacts.
Responses to this call should be sent to ircom.appel.corpus@gmail.com. Proposals must use the two-page form below. In all cases, IRCOM will send a personalized reply.
Proposals must describe the corpora proposed, the information on usage and ownership rights, and the nature of the formats or media used.
This call is organized under the responsibility of IRCOM, with joint funding from IRCOM and the EquipEx ORTOLANG.
For further information, note that the IRCOM website (http://ircom.corpus-ir.fr) is open and offers resources to the community: glossary, inventory of research units and corpora, software resources (tutorials, comparisons, conversion tools), working-group activities, training news, etc. IRCOM invites research units to inventory their oral and multimodal corpora (70 projects already listed) to improve the visibility of resources already available, even if not all of them are finalized.
The IRCOM steering committee
Please use this form to respond to the call. Thank you.
Response to the call for finalization of an oral or multimodal corpus
Name of the corpus:
Name of contact person: Email address: Telephone number:
Nature of the corpus data:
Are there recordings? Which medium (audio, video, other)? What is the total length of the recordings (number of tapes, number of hours, etc.)? What type of storage medium? What format (if known)?
Are there transcriptions? In what format (paper, word processor, transcription software)? What quantity (in hours, number of words, or number of transcriptions)?
Do you have metadata (statement of copyright and usage rights)?
Do you have a precise description of the people recorded?
Do you have a statement of informed consent for the people recorded? In what year (approximately) were the recordings made?
What is the language of the recordings?
Does the corpus include recordings of children or of people with a language disorder or pathology? If so, which population?
For the sake of efficiency, and so that we can advise you as quickly as possible, we need examples of the transcriptions or recordings in your possession. We will contact you about this, but you can already email us a sample of the data you have (transcriptions, metadata, address of a web page containing the recordings).
Thank you in advance for your interest in our proposal. For any further information, please contact Martine Toda at martine.toda@ling.cnrs.fr or write to ircom.appel.corpus@gmail.com.
5-2-7 | Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French
We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank of spoken French created with the aim of modeling the interface between prosody, syntax and discourse in spoken French, is now available at http://www.projet-rhapsodie.fr/
The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and a 33,000-word corpus), each with orthographic and phonetic transcriptions aligned to the audio.
The corpus is representative of different genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive and more or less planned speech; descriptive, argumentative, oratorical and procedural samples). The samples were drawn mainly from existing corpora of spoken French, with the agreement of their original designers, and partly created within the Rhapsodie project. We would especially like to thank the coordinators of the CFPP2000, PFC, ESLO and C-Prom projects, as well as Mathieu Avanzi, Anne Lacheret, Piet Mertens and Nicolas Obin.
The sound samples (WAV and MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations (xml, textgrid) and the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France. Microsyntactic annotations will be available soon. The metadata can be browsed online. The prosodic annotation can be queried online through the Rhapsodie Query Language (Rhapsodie QL). Tutorials for transcription, annotation and queries are available on the Rhapsodie site.
The Rhapsodie team (Modyco, Université Paris Ouest Nanterre): Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.
Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).
5-2-8 | Annotation of “Hannah and her sisters” by Woody Allen.
We have created and made publicly available a dense audio-visual person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and her sisters” by Woody Allen.
Jean-Ronan Vigouroux, Louis Chevallier, Patrick Pérez
Technicolor Research & Innovation
5-2-9 | French TTS Text to Speech Synthesis:
5-2-10 | Google's Language Model benchmark
An LM benchmark is available at: https://code.google.com/p/1-billion-word-language-modeling-benchmark/.
Here is a brief description of the project.
'The purpose of the project is to make available a standard training and test setup for language modeling experiments. The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here. This also means that your results on this data set are reproducible by the research community at large. Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:
ArXiv paper: http://arxiv.org/abs/1312.3005
Happy benchmarking!'
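Per-word log-probability values of the kind described above are typically turned into a single perplexity figure when comparing language models. The sketch below shows that standard computation; the function name is hypothetical, and the logarithm base (here defaulting to 10) is an assumption that must be matched to the base actually used in the distributed files.

```python
import math


def perplexity(log_probs, base=10.0):
    """Perplexity of a held-out set from per-word log-probabilities.

    Computes base ** (-(1/N) * sum(log_base p(w_i))). The `base` argument
    must match the base of the supplied log-probabilities (an assumption here).
    """
    if not log_probs:
        raise ValueError("at least one log-probability is required")
    average = sum(log_probs) / len(log_probs)
    return base ** (-average)


# Illustration with made-up values: four words, each assigned probability 0.25.
example = [math.log10(0.25)] * 4
example_perplexity = perplexity(example)  # close to 4.0
```

A model that assigns every word probability 1/4 has perplexity 4, which is what the illustration reproduces; lower values indicate a better fit to the held-out data.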
5-2-11 | International Standard Language Resource Number (ISLRN) (ELRA Press release)
Press Release - Immediate - Paris, France, December 13, 2013
Establishing the International Standard Language Resource Number (ISLRN)
12 major NLP organisations announce the establishment of the ISLRN, a Persistent Unique Identifier, to be assigned to each Language Resource.
On November 18, 2013, 12 NLP organisations agreed to announce the establishment of the International Standard Language Resource Number (ISLRN), a Persistent Unique Identifier to be assigned to each Language Resource. Experiment replicability, an essential feature of scientific work, would be enhanced by such a unique identifier. Set up by ELRA, LDC and AFNLP/Oriental-COCOSDA, the ISLRN Portal will provide unique identifiers using a standardised nomenclature, as a service free of charge for all Language Resource providers. It will be supervised by a steering committee composed of representatives of participating organisations and enlarged whenever necessary.
For more information on ELRA and the ISLRN, please contact: Khalid Choukri choukri@elda.org
For more information on ELDA, please contact: Hélène Mazo mazo@elda.org
ELRA, 55-57, rue Brillat Savarin, 75013 Paris (France)
Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30
5-2-12 | ISLRN new portal Opening of the ISLRN Portal
ELRA, LDC, and AFNLP/Oriental-COCOSDA announce the opening of the ISLRN Portal @ www.islrn.org.
5-2-13 | Speechocean – update (May 2015)
Speechocean: A global language resources and data services supplier
Speechocean has over 500 large-scale databases available in 110+ languages and accents with the platform of desktop, in-car, telephony and tablet PC. Our data repository is enormous and diversified, which includes ASR Databases, TTS Databases, Lexica, Text Corpora, etc.
Speechocean is glad to announce that more resources are available in its catalogue:
ID: King-ASR-173
This database collection is a desktop 2-channel speech database collected and owned by Speechocean (http://www.speechocean.com).
The corpus contains recordings of Canadian French speech from 200 gender-balanced speakers from different regions of Quebec, Canada. Each speaker recorded about 300 different utterances naturally in an office environment. The total number of utterances is 115964, and the total recording time is 175.48 hours, including the leading and trailing silence (about 500 ms). The total size of this database is 51.9 GB.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included. All the data was transcribed and labeled.
ID: King-ASR-202
This database collection is a desktop 1-channel speech database collected and owned by Speechocean (www.speechocean.com).
The corpus contains recordings of Spanish speech from 210 gender-balanced speakers from different regions of Spain; 500 prompts per speaker were recorded in a quiet office environment. The total number of utterances is 105000, and the total recording time is 92.9 hours. The total size of this database is 9.96 GB. All the data has been transcribed and annotated precisely. The database was made for tuning and testing speech recognition systems and for language study.
Contact Information
Xianfeng Cheng
Business Manager of Commercial Department
Tel: +86-10-62660928; +86-10-62660053 ext.8080
Mobile: +86 13681432590
Skype: xianfeng.cheng1
Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com
Website: www.speechocean.com
5-2-14 | kidLUCID: London UCL Children’s Clear Speech in Interaction Database
We are delighted to announce the availability of a new corpus of spontaneous speech for children aged 9 to 14 years inclusive, produced as part of the ESRC-funded project on ‘Speaker-controlled Variability in Children's Speech in Interaction’ (PI: Valerie Hazan). Speech recordings (a total of 288 conversations) are available for 96 child participants (46M, 50F, range 9;0 to 15;0 years), all native southern British English speakers. Participants were recorded in pairs while completing the diapix spot-the-difference picture task in which the pair verbally compared two scenes, only one of which was visible to each talker. High-quality digital recordings were made in sound-treated rooms. For each conversation, a stereo audio recording is provided with each speaker on a separate channel together with a Praat Textgrid containing separate word- and phoneme-level segmentations for each speaker. There are six recordings per speaker pair made in the following conditions:
The kidLUCID corpus is available online within the OSCAAR (Online Speech/Corpora Archive and Analysis Resource) archive (https://oscaar.ci.northwestern.edu/). Free access can be requested for research purposes. Further information about the project can be found at: http://www.ucl.ac.uk/pals/research/shaps/research/shaps/research/clear-speech-strategies
This work was supported by Economic and Social Research Council Grant No. RES-062-23-3106.
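Since each kidLUCID conversation is a stereo recording with one speaker per channel, a common first step is to split it into two mono signals, one per talker. The sketch below does this with Python's standard `wave` module. It assumes the recordings are uncompressed PCM WAV files, which this announcement does not state, and the function name is hypothetical.

```python
import io
import wave


def split_stereo_wav(wav_bytes: bytes):
    """Split a stereo PCM WAV (given as bytes) into two mono WAVs (as bytes)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as source:
        if source.getnchannels() != 2:
            raise ValueError("expected a stereo recording")
        width = source.getsampwidth()  # bytes per sample
        rate = source.getframerate()
        frames = source.readframes(source.getnframes())

    frame_size = 2 * width  # one interleaved frame = left sample + right sample
    outputs = []
    for channel in range(2):
        offset = channel * width
        # Pick this channel's sample out of every interleaved frame.
        mono = b"".join(
            frames[i + offset:i + offset + width]
            for i in range(0, len(frames), frame_size)
        )
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as target:
            target.setnchannels(1)
            target.setsampwidth(width)
            target.setframerate(rate)
            target.writeframes(mono)
        outputs.append(buffer.getvalue())
    return outputs[0], outputs[1]
```

To use it on a file, read the bytes with `open(path, "rb").read()` and write each returned value to its own `.wav` file; the paths are yours to choose. Which channel corresponds to which participant is not specified here and should be taken from the corpus metadata.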
5-2-15 | Robust speech datasets and ASR software tools
5-2-16 | International Standard Language Resource Number (ISLRN) implemented by ELRA and LDC
ELRA and LDC partner to implement ISLRN process and assign identifiers to all the Language Resources in their catalogues.
Following the meeting of the largest NLP organizations, the NLP12, and their endorsement of the International Standard Language Resource Number (ISLRN), ELRA and LDC partnered to implement the ISLRN process and to assign identifiers to all the Language Resources (LRs) in their catalogues. The ISLRN web portal was designed to enable the assignment of unique identifiers as a service free of charge for all Language Resource providers. To enhance the use of ISLRN, ELRA and LDC have collaborated to provide the ISLRN 13-digit ID to all the Language Resources distributed in their respective catalogues. Anyone who is searching the ELRA and LDC catalogues can see that each Language Resource is now identified by both the data centre ID and the ISLRN number. All providers and users of such LRs should refer to the latter in their own publications and whenever referring to the LR.
ELRA and LDC will continue their joint involvement in ISLRN through active participation in this web service.
Visit the ELRA and LDC catalogues, respectively at http://catalogue.elra.info and https://catalog.ldc.upenn.edu
Background
The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.
***About NLP12*** Representatives of the major Natural Language Processing and Computational Linguistics organizations met in Paris on 18 November 2013 to harmonize and coordinate their activities within the field.
*** About ELRA ***
*** About LDC *** The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and research laboratories that creates and distributes linguistic resources for language-related education, research and technology development. To find out more about LDC, please visit our web site: https://www.ldc.upenn.edu
5-2-17 | ELRA News
We are happy to announce that 1 new Written Corpus and 1 new Terminological Resource are now available in our catalogue.
ELRA-W0081 Khresmoi manually annotated reference corpus
ELRA-T0375 ACL RD-TEC: A Reference Dataset for Terminology Extraction and Classification Research in Computational Linguistics
For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
Visit our On-line Catalogue: http://catalog.elra.info
5-2-18 | ELRA - Language Resources Catalogue - Update (May 2015)
We are happy to announce that 1 new Speech resource is now available in our catalogue.
The GVLEX tales corpus consists of 89 written tales, manually annotated for structure, speech turns, speakers and phrases, 7 of which were annotated by 2 human annotators (96 annotated texts in total); 12 tales read by a professional, transcribed and manually annotated, including audio files; and annotation and viewing software developed within the GV-LEX project.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1240
For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
Visit our On-line Catalogue: http://catalog.elra.info
5-2-19 | ISLRN adopted by Joint Research Centre (JRC) of the European Commission
JRC, the EC's Joint Research Centre, an important LR player: First to adopt the ISLRN initiative
The Joint Research Centre (JRC), the European Commission's in-house science service, is the first organisation to use the International Standard Language Resource Number (ISLRN) initiative and has requested ISLRN 13-digit unique identifiers for its Language Resources (LRs).
The current JRC LRs (downloadable from https://ec.europa.eu/jrc/en/language-technologies) with an ISLRN ID are:
Background
The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.
*** About the JRC *** As the Commission's in-house science service, the Joint Research Centre's mission is to provide EU policies with independent, evidence-based scientific and technical support throughout the whole policy cycle.
*** About ELRA ***
5-2-20 | Forensic database of voice recordings of 500+ Australian English speakers
5-2-21 | Audio and Electroglottographic speech recordings
Audio and Electroglottographic speech recordings from several languages
We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.
http://www.phonetics.ucla.edu/voiceproject/voice.html
Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.
Analysis software developed as part of the project (VoiceSauce for audio analysis and EggWorks for EGG analysis) and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.
All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.
This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.
Pat Keating (UCLA)
5-2-22 | Press release: Opening of the ELRA License Wizard
Press Release - Immediate - Paris, France, April 2, 2015
Currently, the License Wizard allows the user to choose among several licenses that exist for the use of Language Resources: ELRA, Creative Commons and META-SHARE.
|