5-2-1 | ELRA - Language Resources Catalogue - Update (2013-11)
*****************************************************************
ELRA - Language Resources Catalogue - Update November 2013
*****************************************************************

We are happy to announce that 2 new pronunciation dictionaries from the GlobalPhone database (Chinese-Mandarin and Korean) are now available in our catalogue.

The GlobalPhone Pronunciation Dictionaries:

GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 17 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), and Korean (3500 syllables).

Special prices are offered for a combined purchase of several GlobalPhone languages.
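The entries/words distinction in the counts above reflects pronunciation variants: one word can have several entries. A minimal sketch of loading such a lexicon, assuming a simple one-entry-per-line "word phone phone ..." format; the format and the sample French-like entries are illustrative assumptions, not the actual GlobalPhone file layout.

```python
# Hypothetical lexicon lines: one entry per line, "word ph1 ph2 ...".
# Several lines may share the same word (pronunciation variants), which is
# why entry counts exceed word counts in the catalogue figures above.
from collections import defaultdict

def load_lexicon(lines):
    """Map each word to its list of pronunciation variants."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip malformed lines
        word, phones = parts[0], parts[1:]
        lexicon[word].append(phones)
    return dict(lexicon)

sample = [
    "bonjour b o~ Z u R",
    "bonjour b o~ Z u",   # second variant of the same word
    "merci m E R s i",
]
lex = load_lexicon(sample)
entries = sum(len(v) for v in lex.values())
print(entries, len(lex))  # 3 entries, 2 distinct words
```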
Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):

ELRA-S0340 GlobalPhone French Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1197
ELRA-S0341 GlobalPhone German Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1198
ELRA-S0348 GlobalPhone Japanese Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1199
ELRA-S0350 GlobalPhone Arabic Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1200
ELRA-S0351 GlobalPhone Bulgarian Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1201
ELRA-S0352 GlobalPhone Czech Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1202
ELRA-S0353 GlobalPhone Hausa Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1203
ELRA-S0354 GlobalPhone Polish Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1204
ELRA-S0355 GlobalPhone Portuguese (Brazilian) Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1205
ELRA-S0356 GlobalPhone Swedish Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1206
ELRA-S0358 GlobalPhone Croatian Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1207
ELRA-S0359 GlobalPhone Russian Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1208
ELRA-S0360 GlobalPhone Spanish (Latin American) Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1209
ELRA-S0361 GlobalPhone Turkish Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1210
ELRA-S0362 GlobalPhone Vietnamese Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1211
*** NEW *** ELRA-S0363 GlobalPhone Chinese-Mandarin Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1212
*** NEW *** ELRA-S0364 GlobalPhone Korean Pronunciation Dictionary
http://catalog.elra.info/product_info.php?products_id=1213

For more information on the catalogue, please contact Valérie Mapelli: mapelli@elda.org

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html
|
5-2-2 | ELRA releases free Language Resources
ELRA releases free Language Resources
***************************************************
Anticipating users’ expectations, ELRA has decided to offer a large number of resources for free for academic research use. This offer consists of several sets of speech, text and multimodal resources that are regularly released, for free, as soon as legal aspects are cleared. A first set was released in May 2012 on the occasion of LREC 2012. A second set is now being released.
Whenever this is permitted by our licences, please feel free to use these resources for deriving new resources and depositing them with the ELRA catalogue for community re-use.
Over the last decade, ELRA has compiled a large list of resources into its Catalogue of LRs. ELRA has negotiated distribution rights with the LR owners and made such resources available under fair conditions and within a clear legal framework. Following this initiative, ELRA has also worked on LR discovery and identification with a dedicated team which investigated and listed existing and valuable resources in its 'Universal Catalogue', a list of resources that could be negotiated on a case-by-case basis. At LREC 2010, ELRA introduced the LRE Map, an inventory of LRs, whether developed or used, that were described in LREC papers. This huge inventory listed by the authors themselves constitutes the first 'community-built' catalogue of existing or emerging resources, constantly enriched and updated at major conferences.
Considering the latest trends on easing the sharing of LRs, from both legal and commercial points of view, ELRA is taking a major role in META-SHARE, a large European open infrastructure for sharing LRs. This infrastructure will allow LR owners, providers and distributors to distribute their LRs through an additional and cost-effective channel.
To obtain the available sets of LRs, please visit the web page below and follow the instructions given online: http://www.elra.info/Free-LRs,26.html
|
5-2-3 | LDC Newsletter (November 2013)
Invitation to Join for Membership Year (MY) 2014

Membership Year (MY) 2014 is open for joining! We would like to invite all current and previous members of LDC to renew their membership, and we welcome new organizations to the Consortium. For MY2014, LDC is pleased to maintain membership fees at last year’s rates: membership fees will not increase. Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2014 are as follows:

· Organizations that joined for MY2013 will receive a 5% discount when renewing. This discount applies throughout 2014, regardless of the time of renewal. MY2013 members renewing before Monday, March 3, 2014 will receive an additional 5% discount, for a total 10% discount off the membership fee.

· New members, as well as organizations that did not join for MY2013 but held membership in any previous MY (1993-2012), are also eligible for a 5% discount provided that they join or renew before March 3, 2014.

The following table provides exact pricing information.
| Membership Type | MY2014 Fee | MY2014 Fee with 5% Discount* | MY2014 Fee with 10% Discount** |
| Not-for-Profit / US Government, Standard | US$2400 | US$2280 | US$2160 |
| Not-for-Profit / US Government, Subscription | US$3850 | US$3658 | US$3465 |
| For-Profit, Standard | US$24000 | US$22800 | US$21600 |
| For-Profit, Subscription | US$27500 | US$26125 | US$24750 |
* For new members, MY2013 members renewing for MY2014, and any previous-year member who renews before March 3, 2014
** For MY2013 members renewing before March 3, 2014

Publications for MY2014 are still being planned; here are the working titles of data sets we intend to provide:

2009 NIST Language Recognition Evaluation
MADCAT Phase 4 Training
Callfriend Farsi Speech and Transcripts
MALACH Czech ASR
GALE data – all phases and tasks
NIST OpenMT Five Language Progress Set
Hispanic-English Speech
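The discounted fees in the membership table above follow directly from the base fees. A short check of the arithmetic, assuming rounding to the nearest dollar (which matches the listed US$3658 for 5% off US$3850):

```python
# Verify the MY2014 discount figures: a 5% discount multiplies the base fee
# by 0.95, a 10% discount by 0.90, rounded to the nearest dollar
# (3850 * 0.95 = 3657.5, listed as 3658).
def discounted(fee, percent):
    return round(fee * (1 - percent / 100))

base_fees = {
    "Not-for-Profit / US Government, Standard": 2400,
    "Not-for-Profit / US Government, Subscription": 3850,
    "For-Profit, Standard": 24000,
    "For-Profit, Subscription": 27500,
}

for name, fee in base_fees.items():
    print(name, discounted(fee, 5), discounted(fee, 10))
```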
In addition to receiving new publications, current year members of LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.
Spring 2014 LDC Data Scholarship Program

Applications are now being accepted through Wednesday, January 15, 2014, 11:59PM EST for the Spring 2014 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 35 individual students and student research groups. This program is open to students pursuing both undergraduate and graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project, as well as information on the proposed methodology or algorithm. Applicants should consult the LDC Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than one or two datasets; students may apply for additional datasets in a following cycle once they have completed processing of the initial datasets and have published or presented the work in a juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for the data and confirm that the department or university lacks the funding to pay the full non-member fee for the data or to join the Consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page. Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address. The deadline for the Spring 2014 program cycle is January 15, 2014, 11:59PM EST.

LDC to Close for Thanksgiving Break

LDC will be closed on Thursday, November 28, 2013 and Friday, November 29, 2013 in observance of the US Thanksgiving holiday. Our offices will reopen on Monday, December 2, 2013.
New publications (1) Chinese Treebank 8.0 consists of approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups and weblogs.
The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and then moved to Brandeis University. The project’s goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T08), released in 2010, added new annotated newswire data, broadcast material and web text to the approximate total of one million words. Chinese Treebank 8.0 adds new annotated data from newswire, magazine articles and government documents.
This release contains 3,007 text files comprising 71,369 sentences, 1,620,561 words and 2,589,848 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the segmentation, POS-tagging and bracketing guidelines included in the release. The data is provided in four formats: raw text, word-segmented, POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked.
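Penn Treebank-style labeled brackets, as used in the Chinese Treebank annotation, can be read with a small recursive parser. This is an illustrative sketch only; the parser and the tiny example tree are assumptions for demonstration, not code or data from the release.

```python
# Minimal recursive parser for "(LABEL child ...)" labeled-bracket trees,
# the general format of Penn Treebank-style annotation. The example tree
# is invented, not drawn from the Chinese Treebank itself.
def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Consume and return one bracketed constituent as (label, children)."""
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))  # nested constituent
        else:
            children.append(tokens.pop(0))  # terminal word
    tokens.pop(0)  # consume the closing ")"
    return (label, children)

tree = parse(tokenize("(IP (NP (NN 中国)) (VP (VV 发展)))"))
print(tree[0])  # root label: IP
```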
Chinese Treebank 8.0 is distributed via web download.
2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.
*
(2) CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on features extracted from the corpus.
The participants were told that they were participating in a communication experiment which sought to identify people who fit the profile of the top entrepreneurs in America. To this end, the participants performed tasks and answered questions in six areas. They were later told that they had received low scores in some of those areas and did not fit the profile. The subjects then participated in an interview where they were told to convince the interviewer that they had actually achieved high scores in all areas and that they did indeed fit the profile. The task of the interviewer was to determine how he thought the subjects had actually performed, and he was allowed to ask them any questions other than those that were part of the performed tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true or contained any false information by pressing one of two pedals hidden from the interviewer under a table.
Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled to 16kHz before processing.
The interviews were orthographically transcribed by hand using the NIST EARS transcription guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected for alignment, and labels for global lies were annotated during transcription based on the subjects' known scores versus their reported scores. The orthographic transcription was force-aligned using the SRI telephone speech recognizer adapted for full-bandwidth recordings. Several segmentations are associated with the corpus: the implicit segmentation of the pedal presses, semi-automatically derived sentence-like units (EARS SLASH-UNITS or SUs) that were hand-labeled, intonational phrase units, and the units corresponding to each topic of the interview.
CSC Deceptive Speech is distributed on 1 DVD-ROM.
2013 Subscription Members will automatically receive two copies of this data provided they have completed and returned the User License Agreement for CSC Deceptive Speech (LDC2013S09). 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1000.
|
5-2-4 | Appen Butler Hill
Appen Butler Hill
A global leader in linguistic technology solutions
RECENT CATALOG ADDITIONS—MARCH 2012
1. Speech Databases
1.1 Telephony
| Language | Database Type | Catalogue Code | Speakers | Status |
| Bahasa Indonesia | Conversational | BAH_ASR001 | 1,002 | Available |
| Bengali | Conversational | BEN_ASR001 | 1,000 | Available |
| Bulgarian | Conversational | BUL_ASR001 | 217 | Available shortly |
| Croatian | Conversational | CRO_ASR001 | 200 | Available shortly |
| Dari | Conversational | DAR_ASR001 | 500 | Available |
| Dutch | Conversational | NLD_ASR001 | 200 | Available |
| Eastern Algerian Arabic | Conversational | EAR_ASR001 | 496 | Available |
| English (UK) | Conversational | UKE_ASR001 | 1,150 | Available |
| Farsi/Persian | Scripted | FAR_ASR001 | 789 | Available |
| Farsi/Persian | Conversational | FAR_ASR002 | 1,000 | Available |
| French (EU) | Conversational | FRF_ASR001 | 563 | Available |
| French (EU) | Voicemail | FRF_ASR002 | 550 | Available |
| German | Voicemail | DEU_ASR002 | 890 | Available |
| Hebrew | Conversational | HEB_ASR001 | 200 | Available shortly |
| Italian | Conversational | ITA_ASR003 | 200 | Available shortly |
| Italian | Voicemail | ITA_ASR004 | 550 | Available |
| Kannada | Conversational | KAN_ASR001 | 1,000 | In development |
| Pashto | Conversational | PAS_ASR001 | 967 | Available |
| Portuguese (EU) | Conversational | PTP_ASR001 | 200 | Available shortly |
| Romanian | Conversational | ROM_ASR001 | 200 | Available shortly |
| Russian | Conversational | RUS_ASR001 | 200 | Available |
| Somali | Conversational | SOM_ASR001 | 1,000 | Available |
| Spanish (EU) | Voicemail | ESO_ASR002 | 500 | Available |
| Turkish | Conversational | TUR_ASR001 | 200 | Available |
| Urdu | Conversational | URD_ASR001 | 1,000 | Available |
1.2 Wideband
| Language | Database Type | Catalogue Code | Speakers | Status |
| English (US) | Studio | USE_ASR001 | 200 | Available |
| French (Canadian) | Home/Office | FRC_ASR002 | 120 | Available |
| German | Studio | DEU_ASR001 | 127 | Available |
| Thai | Home/Office | THA_ASR001 | 100 | Available |
| Korean | Home/Office | KOR_ASR001 | 100 | Available |
2. Pronunciation Lexica
Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:
Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)
Part-of-speech tagged Lexica providing grammatical and semantic labels
Other reference text-based materials, including spelling/misspelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalization lists.
Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com
3. Named Entity Corpora
These NER corpora contain text material from a variety of sources and are tagged for the following named entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.

| Language | Catalogue Code | Words |
| Arabic | ARB_NER001 | 500,000 |
| English | ENI_NER001 | 500,000 |
| Farsi/Persian | FAR_NER001 | 500,000 |
| Korean | KOR_NER001 | 500,000 |
| Japanese | JPY_NER001 | 500,000 |
| Russian | RUS_NER001 | 500,000 |
| Mandarin | MAN_NER001 | 500,000 |
| Urdu | URD_NER001 | 500,000 |
4. Other Language Resources
Morphological Analyzers – Farsi/Persian & Urdu
Arabic Thesaurus
Language Analysis Documentation – multiple languages
For additional information on these resources, please contact: sales@appenbutlerhill.com
5. Customized Requests and Package Configurations
Appen Butler Hill is committed to providing a low-risk, high-quality, reliable solution and has worked in more than 130 languages to date, supporting both large global corporations and government organizations.
We would be glad to discuss any customized requests or package configurations and to prepare a customized proposal to meet your needs.
6. Contact Information
Prithivi Pradeep
Business Development Manager
ppradeep@appenbutlerhill.com
+61 2 9468 6370

Tom Dibert
Vice President, Business Development, North America
tdibert@appenbutlerhill.com
+1-315-339-6165

www.appenbutlerhill.com
|
5-2-5 | OFROM: first corpus of French spoken in French-speaking Switzerland
We would like to announce the online release of OFROM, the first corpus of French spoken in French-speaking Switzerland. In its current version, the archive comprises about 15 hours of speech, transcribed in standard orthography in the Praat software. A concordancer allows the data to be searched and the sound excerpts associated with the transcriptions to be downloaded.
To access the data and consult a fuller description of the corpus, please visit: http://www.unine.ch/ofrom.
|
5-2-6 | Real-world 16-channel noise recordings
We are happy to announce the release of DEMAND, a set of real-world 16-channel noise recordings designed for the evaluation of microphone array processing techniques.
http://www.irisa.fr/metiss/DEMAND/
1.5 h of noise data were recorded in 18 different indoor and outdoor environments and are available under the terms of the Creative Commons Attribution-ShareAlike License.
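Multichannel recordings such as these are commonly stored as WAV files with samples interleaved frame by frame. A minimal sketch of splitting interleaved samples back into 16 per-channel streams, using synthetic data rather than actual DEMAND recordings:

```python
# Multichannel WAV data is typically interleaved: each frame holds one
# sample per channel in order. Splitting it back into per-channel streams
# is the first step for most microphone array processing.
def deinterleave(samples, n_channels):
    """Split a flat interleaved sample list into one list per channel."""
    return [samples[c::n_channels] for c in range(n_channels)]

n_channels = 16
n_frames = 4
# Synthetic interleaved data: frame i, channel c holds the value 100*i + c.
interleaved = [100 * i + c for i in range(n_frames) for c in range(n_channels)]

channels = deinterleave(interleaved, n_channels)
print(channels[3])  # samples of channel 3: [3, 103, 203, 303]
```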
Joachim Thiemann (CNRS - IRISA) Nobutaka Ito (University of Tokyo) Emmanuel Vincent (Inria Nancy - Grand Est)
|
5-2-7 | Support for finalising oral or multimodal corpora for distribution, promotion and long-term deposit
Support for finalising oral or multimodal corpora for distribution, promotion and long-term deposit
The IRCOM consortium of the TGIR Corpus and the EquipEx ORTOLANG are joining forces to offer technical and financial support for finalising oral or multimodal corpora, with a view to their distribution and long-term preservation through the EquipEx ORTOLANG. This call does not concern the creation of new corpora but the finalisation of existing corpora that are not available in electronic form. By finalisation we mean deposit in a public digital repository and entry into a long-term archiving workflow. In this way, the speech data enriched by your research can be reused, cited and further enriched in a cumulative way, enabling the development of new knowledge under the conditions of use that you choose (a selection of usage licences corresponding to each deposited corpus).
This call is subject to several conditions (see below), and financial support is limited to 3000 euros per project. Requests will be processed in the order in which they are received by IRCOM. Requests from EA units or small teams without dedicated corpus technical support will be given priority. Requests may be submitted from 1 September 2013 to 31 October 2013. Funding decisions rest with the IRCOM steering committee. Requests not processed in 2013 may be processed in 2014. If you have doubts about the eligibility of your project, do not hesitate to contact us so that we can examine your request and adapt our future offers.
To compensate for the wide disparity in the computing skills of the people and working groups producing corpora, IRCOM offers personalised support for corpus finalisation. This support will be provided by an IRCOM engineer according to the requests made, adapted to the type of need, whether technical or financial.
The conditions required to propose a corpus for finalisation and obtain IRCOM support are:
-
Being able to take all decisions concerning the use and distribution of the corpus (intellectual property in particular).
-
Having all the information concerning the sources of the corpus and the consent of the persons recorded or filmed.
-
Granting a right of free use of the data, or at a minimum free access for scientific research.
Requests may concern any type of processing: processing of nearly finalised corpora (conversion, anonymisation), alignment of corpora already transcribed, conversion from word-processor formats, digitisation of older media. For any request requiring substantial manual work, applicants must commit human or financial resources commensurate with the resources provided by IRCOM and ORTOLANG.
IRCOM is aware of the exceptional and exploratory nature of this initiative. It should also be recalled that this funding is reserved for corpora that are already largely complete and cannot be applied to creations ex nihilo. For these reasons of limited resources, proposals for the corpora most advanced in their realisation may be processed as a priority, in agreement with the IRCOM steering committee. There is, however, no theoretical limit to the requests that can be made, since IRCOM can redirect requests outside its competence to other parties.
Proposals in response to this call should be sent to ircom.appel.corpus@gmail.com, using the two-page form below. In all cases, a personalised reply will be sent by IRCOM.
Proposals must present the corpora concerned, information on usage and ownership rights, and the nature of the formats or media used.
This call is organised under the responsibility of IRCOM, with the joint financial participation of IRCOM and the EquipEx ORTOLANG.
For further information, we remind you that the IRCOM website (http://ircom.corpus-ir.fr) is open and offers resources to the community: a glossary, an inventory of units and corpora, software resources (tutorials, comparisons, conversion tools), working-group activities, training news, and more.
IRCOM invites units to inventory their oral and multimodal corpora (70 projects already listed) in order to give better visibility to the resources already available, even if they are not all finalised.
The IRCOM steering committee
Please use this form to respond to the call. Thank you.
Response to the call for finalisation of an oral or multimodal corpus
Corpus name:
Contact person:
Email address:
Telephone number:
Nature of the corpus data:
Are there recordings?
Which medium? Audio, video, other…
What is the total length of the recordings? Number of tapes, number of hours, etc.
What type of storage medium?
What format (if known)?
Are there transcriptions?
What format? (paper, word processor, transcription software)
What quantity (in hours, number of words, or number of transcriptions)?
Do you have metadata (statement of copyright and usage rights)?
Do you have a precise description of the persons recorded?
Do you have informed-consent attestations for the persons recorded? In (approximately) what year were the recordings made?
What is the language of the recordings?
Does the corpus include recordings of children or of persons with a language disorder or pathology?
If so, which population?
For efficiency and to advise you as quickly as possible, we need examples of the transcriptions or recordings in your possession. We will contact you about this, but you may already send us by email an example of the data you hold (transcriptions, metadata, address of a web page containing the recordings).
Thank you in advance for your interest in our proposal. For any further information, please contact Martine Toda at martine.toda@ling.cnrs.fr or ircom.appel.corpus@gmail.com.
|
5-2-8 | Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French
Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French
We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank of spoken French created with the aim of modeling the interface between prosody, syntax and discourse in spoken French is now available at http://www.projet-rhapsodie.fr/
The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and a 33,000-word corpus) endowed with orthographic and phonetic transcriptions aligned to the sound.
The corpus is representative of different genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive discourse; descriptive, argumentative and procedural samples, variations in planning type).
The corpus samples have been mainly drawn from existing corpora of spoken French and partially created within the framework of the Rhapsodie project. We would especially like to thank the coordinators of the CFPP2000, PFC, ESLO and C-Prom projects, as well as Piet Mertens, Mathieu Avanzi, Anne Lacheret and Nicolas Obin.
The sound samples (waves, MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations (xml, textgrid) as well as the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France.
Microsyntactic annotation will be available soon.
The metadata are searchable online through a browser.
The prosodic annotation can be explored online through the Rhapsodie Query Language.
Tutorials on transcription, annotation and the Rhapsodie Query Language are available on the site.
The Rhapsodie team (Modyco, Université Paris Ouest Nanterre):
Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.
Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).
|
5-2-9 | COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies
======================
CALL for contributions
======================
We are pleased to announce the creation of an open-source repository of advanced speech processing algorithms called COVAREP (A Cooperative Voice Analysis Repository for Speech Technologies). COVAREP has been created as a GitHub project (https://github.com/covarep/covarep) where researchers in speech processing can store original implementations of published algorithms.
Over the past few decades a vast array of advanced speech processing algorithms has been developed, often offering significant improvements over the existing state of the art. Such algorithms can have a reasonably high degree of complexity and can therefore be difficult to re-implement accurately from article descriptions alone. Another issue is the so-called 'bug magnet effect', with re-implementations frequently differing significantly from the original. The consequence has been that many promising developments are under-exploited or discarded, with researchers tending to stick to conventional analysis methods.
By developing the COVAREP repository we are hoping to address this by encouraging authors to include original implementations of their algorithms, thus resulting in a single de facto version for the speech community to refer to.
We envisage a range of benefits to the repository:
1) Reproducible research: COVAREP will allow fairer comparison of algorithms in published articles.
2) Encouraged usage: the free availability of these algorithms will encourage researchers from a wide range of speech-related disciplines (both in academia and industry) to exploit them for their own applications.
3) Feedback: as a GitHub project users will be able to offer comments on algorithms, report bugs, suggest improvements etc.
SCOPE
We welcome contributions from a wide range of speech processing areas, including (but not limited to): speech analysis, synthesis, conversion, transformation, enhancement, speech quality, glottal source/voice quality analysis, etc.
REQUIREMENTS
To achieve a reasonable standard of consistency and homogeneity across algorithms, we have compiled a list of requirements for prospective contributors to the repository. However, the requirements are not intended to be so strict as to discourage contributions.
- Only published work can be added to the repository
- The code must be available as open source
- Algorithms should be coded in Matlab; however, we strongly encourage authors to make the code compatible with Octave in order to maximize usability
- Contributions have to comply with a coding convention (see the GitHub site for the convention and a template). The convention only normalizes the inputs/outputs and the documentation; there is no restriction on the content of the functions (though comments are, of course, encouraged).
LICENCE
Getting contributing institutions to agree to a homogenous IP policy would be close to impossible. As a result COVAREP is a repository and not a toolbox, and each algorithm will have its own licence associated with it. Though flexible to different licence types, contributions will need to have a licence which is compatible with the repository, i.e. {GPL, LGPL, X11, Apache, MIT} or similar. We would encourage contributors to try to obtain LGPL licences from their institutions in order to be more industry friendly.
CONTRIBUTE!
We believe that the COVAREP repository has a great potential benefit to the speech research community, and we hope that you will consider contributing your published algorithms to it. If you have any questions, comments, issues, etc. regarding COVAREP, please contact us at one of the email addresses below. Please forward this email to others who may be interested.
Existing contributions include: algorithms for spectral envelope modelling, adaptive sinusoidal modelling, fundamental frequency/voicing decision/glottal closure instant detection algorithms, methods for detecting non-modal phonation types, etc.
|
5-2-10 | Annotation of “Hannah and Her Sisters” by Woody Allen
We have created and made publicly available a dense audio-visual person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and Her Sisters” by Woody Allen.
The annotation includes
• Face tracks in video (densely annotated, i.e., in each frame, and person-labeled)
• Speech segments in audio (person-labeled)
• Shot boundaries in video
The annotation can be useful for evaluating
• Person-oriented video-based tasks (e.g., face tracking, automatic character naming, etc.)
• Person-oriented audio-based tasks (e.g., speaker diarization or recognition)
• Person-oriented multimodal-based tasks (e.g., audio-visual character naming)
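For the person-labeled tasks, evaluation typically amounts to comparing a system's hypothesized segments against the ground-truth segments. A minimal Python sketch of one such comparison; the function names and the (start, end, person) tuple layout are illustrative assumptions, not the dataset's actual file format:

```python
def overlap(a, b):
    """Temporal overlap in seconds between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def correct_speaker_time(reference, hypothesis):
    """Total duration on which a hypothesis agrees with the ground
    truth about who is speaking. Both arguments are lists of
    (start, end, person) tuples; assumes no overlapping segments
    within a single list."""
    return sum(
        overlap((r_start, r_end), (h_start, h_end))
        for r_start, r_end, r_person in reference
        for h_start, h_end, h_person in hypothesis
        if r_person == h_person)
```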
Details on the Hannah dataset, and access to it, can be found here:
https://research.technicolor.com/rennes/hannah-home/
https://research.technicolor.com/rennes/hannah-download/
Acknowledgments:
This work is supported by the AXES EU project: http://www.axes-project.eu/
Alexey Ozerov (Alexey.Ozerov@technicolor.com),
Jean-Ronan Vigouroux,
Louis Chevallier
Patrick Pérez
Technicolor Research & Innovation
|
5-2-11 | French TTS
Text-to-speech synthesis: over an hour of speech synthesis samples from 1968 to 2001, produced by 25 French, Canadian, US, Belgian, Swedish and Swiss systems.
"33 ans de synthèse de la parole à partir du texte : une promenade sonore (1968-2001)" (33 years of text-to-speech synthesis in French: an audio tour, 1968-2001), by Christophe d'Alessandro.
Article published in Traitement Automatique des Langues (TAL, Editions Hermes), Vol. 42, No. 1/2001, pp. 297-321.
Posted at: http://groupeaa.limsi.fr/corpus:synthese:start
|
5-2-12 | Google's Language Model benchmark
Here is a brief description of the project.
'The purpose of the project is to make available a standard training and test setup for language modeling experiments.
The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.
This also means that your results on this data set are reproducible by the research community at large.
Besides the scripts needed to rebuild the training/held-out data, the distribution also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:
- unpruned Katz (1.1B n-grams),
- pruned Katz (~15M n-grams),
- unpruned Interpolated Kneser-Ney (1.1B n-grams),
- pruned Interpolated Kneser-Ney (~15M n-grams)
Happy benchmarking!'
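Per-word log-probabilities like those distributed with the benchmark translate directly into a perplexity score. A minimal Python sketch; the log base is an assumption here, so check whether the benchmark files use natural logs or log10 before comparing numbers:

```python
import math

def perplexity(logprobs, base=math.e):
    """Perplexity from a sequence of per-word log-probabilities:
    base ** (-(sum of logprobs) / (number of words)).
    `base` must match the base the logs were taken in."""
    return base ** (-sum(logprobs) / len(logprobs))
```

For example, a model assigning probability 1/4 to every word in a four-word sequence has perplexity 4.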
|