ISCA - International Speech
Communication Association

ISCApad #192

Thursday, June 12, 2014 by Chris Wellekens

5-2 Database

5-2-1

ELRA - Language Resources Catalogue - Update (2014-05)

We are happy to announce that two new speech resources are now available in our catalogue.

ELRA-S0368 Nepali Spoken Corpus
The Nepali Spoken Corpus contains audio recordings of various social activities, captured in their natural settings as far as possible, together with phonologically transcribed and annotated texts and information about the participants. A total of 17 types of activity were recorded, and the recorded material runs to 31 hours and 26 minutes.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1219

ELRA-S0369 CLIPS_MT_MANUAL
CLIPS_MT_MANUAL is a sub-corpus of the original Italian CLIPS corpus (Corpora e Lessici dell'Italiano Parlato e Scritto). This corpus contains:

3228 inspected and partially repaired WAV signal files, each containing one dialogue turn (*.wav)
3228 corrected original CLIPS annotation files (*.acs, *.phn, *.std, *.wrd)
3228 BAS Partitur files containing the annotation tiers ORT, KAN and SAP (*.par)
3228 EMU database annotation files (*.vot, *.hlb)

The material covers 30 map task dialogues performed by 30 speakers (each speaker pair performing two different map tasks), recorded in 15 different locations in Italy between 2000 and 2004.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1220

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html

5-2-2

ELRA releases free Language Resources

Anticipating users’ expectations, ELRA has decided to offer a large number of resources free of charge for academic research use. The offer consists of several sets of speech, text and multimodal resources, each released as soon as the legal aspects are cleared. A first set was released in May 2012 on the occasion of LREC 2012. A second set is now being released.

Whenever this is permitted by our licences, please feel free to use these resources for deriving new resources and depositing them with the ELRA catalogue for community re-use.

Over the last decade, ELRA has compiled a large list of resources into its Catalogue of LRs. ELRA has negotiated distribution rights with the LR owners and made such resources available under fair conditions and within a clear legal framework. Following this initiative, ELRA has also worked on LR discovery and identification with a dedicated team which investigated and listed existing and valuable resources in its 'Universal Catalogue', a list of resources that could be negotiated on a case-by-case basis. At LREC 2010, ELRA introduced the LRE Map, an inventory of LRs, whether developed or used, that were described in LREC papers. This huge inventory listed by the authors themselves constitutes the first 'community-built' catalogue of existing or emerging resources, constantly enriched and updated at major conferences.

In view of the latest trends toward easier sharing of LRs, from both legal and commercial points of view, ELRA is taking a major role in META-SHARE, a large European open infrastructure for sharing LRs. This infrastructure will allow LR owners, providers and distributors to distribute their LRs through an additional and cost-effective channel.

To obtain the available sets of LRs, please visit the web page below and follow the instructions given online:
http://www.elra.info/Free-LRs,26.html

5-2-3

LDC Newsletter (May 2014)

In this newsletter:

LDC at LREC 2014

New publications:

GALE Arabic-English Word Alignment Training Part 2 -- Newswire
Hispanic-English Database
HyTER Networks of Selected OpenMT08/09 Progress Set Sentences

LDC at LREC 2014

LDC will attend the 9th Language Resources and Evaluation Conference (LREC 2014), hosted by ELRA, the European Language Resources Association. The conference will be held in Reykjavik, Iceland, from May 26-31 and features a broad range of sessions on language resources and human language technologies research. Ten LDC staff members will be presenting current work on topics including the language application grid project, collecting natural SMS and chat conversations in multiple languages, incorporating alternate translations into English translation treebanks, supporting HLT research with degraded audio data, developing an Egyptian Arabic Treebank and more.

Following the conference, LDC’s presented papers and posters will be available on LDC’s Papers Page.

New publications

(1) GALE Arabic-English Word Alignment Training Part 2 -- Newswire was developed by LDC and contains 162,359 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source newswire collected by LDC in 2004 - 2006 and 2008. The distribution by genre, words, character tokens and segments appears below:

Language	Genre	Files	Words	CharTokens	Segments
Arabic	NW	1,126	112,318	162,359	5,349

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.
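
As a rough illustration of how the word and token counts relate, the figures in the table can be turned into simple ratios. The snippet below is a minimal sketch: the row string is copied from the table above, while the function name `summarize_row` and the derived ratios are illustrative assumptions and not part of the LDC release.

```python
def to_int(text: str) -> int:
    """Convert a count such as '112,318' (comma thousands separator) to int."""
    return int(text.replace(",", ""))


def summarize_row(row: str) -> dict:
    """Parse one tab-separated table row and derive two ratios.

    Hypothetical helper for illustration only.
    """
    language, genre, files, words, tokens, segments = row.split("\t")
    n_words, n_tokens, n_segments = to_int(words), to_int(tokens), to_int(segments)
    return {
        "language": language,
        "genre": genre,
        "files": to_int(files),
        # Tokenization splits some words, so this ratio is above 1.
        "tokens_per_word": n_tokens / n_words,
        "words_per_segment": n_words / n_segments,
    }


row = "Arabic\tNW\t1,126\t112,318\t162,359\t5,349"
stats = summarize_row(row)
print(f"{stats['tokens_per_word']:.2f} tokens per word")
print(f"{stats['words_per_segment']:.1f} words per segment")
```

On these figures, tokenization yields roughly 1.45 tokens per untokenized word, and segments average about 21 words.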

The Arabic word alignment tasks consisted of the following components:

Identifying and correcting incorrectly tokenized tokens
Identifying different types of links
Identifying sentence segments not suitable for annotation, such as those that were blank, incorrectly-segmented or containing other languages
Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment Training Part 2 -- Newswire is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

(2) Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc, a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999.

Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English as a second language instruction and designed to engage the speakers in collaborative, problem-solving activities.

Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved in the Entropic Audio (ESPS) format using a 16 kHz sampling rate and 16-bit samples. Audio files were converted from the ESPS format to flac compressed .wav files. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data.
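
To get a feel for the data volume these recording parameters imply, the uncompressed size of linear PCM audio can be estimated from the sampling rate and sample width. This is a back-of-the-envelope sketch: the 16 kHz rate and 16-bit samples come from the description above, whereas the function name `pcm_bytes`, the single-channel default and the use of the approximate 30-hour total are illustrative assumptions. The flac compressed files in the release will be smaller.

```python
def pcm_bytes(hours: float, sample_rate_hz: int = 16_000,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Estimate uncompressed linear PCM size in bytes (file headers ignored)."""
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(seconds * bytes_per_second)


size = pcm_bytes(30)  # assumed: about 30 hours on a single channel
print(f"{size / 1e9:.2f} GB per channel for 30 hours")
```

At 16 kHz and 16 bits, one channel consumes 32,000 bytes per second, so multi-channel recordings scale the estimate linearly with the number of channels.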

Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension.

Hispanic-English Database is distributed on 1 DVD-ROM.

2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.

(3) HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected source Arabic and Chinese sentences from OpenMT08 and OpenMT09 Progress Set data. HyTER is an evaluation metric based on large reference networks created by an annotation tool that allows users to develop an exponential number of correct translations for a given sentence. Reference networks can be used as a foundation for developing improved machine translation evaluation metrics and for automating the evaluation of human translation efficiency.
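
The phrase "an exponential number of correct translations" can be made concrete with a toy example. The sketch below is not the HyTER format or annotation tool; it assumes a hypothetical, much simplified network in which a sentence is a sequence of slots, each offering a few interchangeable phrases, so that the number of distinct paths is the product of the slot sizes.

```python
from itertools import product
from math import prod

# Hypothetical toy network: each slot lists interchangeable phrases.
toy_network = [
    ["the minister", "the secretary"],
    ["said", "stated", "announced"],
    ["that talks", "that negotiations"],
    ["will resume", "will restart", "are to resume"],
]


def count_paths(network):
    """Number of distinct sentences encoded by the slot network."""
    return prod(len(slot) for slot in network)


def enumerate_paths(network):
    """Yield every sentence by taking one phrase from each slot."""
    for choice in product(*network):
        yield " ".join(choice)


print(count_paths(toy_network))
print(next(enumerate_paths(toy_network)))
```

With k alternatives in each of n slots the count is k to the power n, which is why a compact network can stand for far more reference translations than could be written out by hand.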

The source material comprises Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations under three annotation protocols. In the first protocol, foreign language native speakers built English networks starting from foreign language sentences. In the second, English native speakers built English networks from the best translation of a foreign language sentence as identified by NIST (National Institute of Standards and Technology). In the third protocol, English native speakers built English networks starting from the best translation, but those annotators also had access to three additional, independently produced human translations. Networks created by different annotators for each sentence were combined and evaluated.

This release includes the source sentences and four human reference translations produced by LDC in XML format, along with five machine translation system outputs representing a variety of system architectures and performance, and the human post-edited output of those systems also presented in XML.

HyTER Networks of Selected OpenMT08/09 Progress Set Sentences is distributed via web download.

5-2-4

Appen ButlerHill

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

Language	Database Type	Catalogue Code	Speakers	Status
Bahasa Indonesia	Conversational	BAH_ASR001	1,002	Available
Bengali	Conversational	BEN_ASR001	1,000	Available
Bulgarian	Conversational	BUL_ASR001	217	Available shortly
Croatian	Conversational	CRO_ASR001	200	Available shortly
Dari	Conversational	DAR_ASR001	500	Available
Dutch	Conversational	NLD_ASR001	200	Available
Eastern Algerian Arabic	Conversational	EAR_ASR001	496	Available
English (UK)	Conversational	UKE_ASR001	1,150	Available
Farsi/Persian	Scripted	FAR_ASR001	789	Available
Farsi/Persian	Conversational	FAR_ASR002	1,000	Available
French (EU)	Conversational	FRF_ASR001	563	Available
French (EU)	Voicemail	FRF_ASR002	550	Available
German	Voicemail	DEU_ASR002	890	Available
Hebrew	Conversational	HEB_ASR001	200	Available shortly
Italian	Conversational	ITA_ASR003	200	Available shortly
Italian	Voicemail	ITA_ASR004	550	Available
Kannada	Conversational	KAN_ASR001	1,000	In development
Pashto	Conversational	PAS_ASR001	967	Available
Portuguese (EU)	Conversational	PTP_ASR001	200	Available shortly
Romanian	Conversational	ROM_ASR001	200	Available shortly
Russian	Conversational	RUS_ASR001	200	Available
Somali	Conversational	SOM_ASR001	1,000	Available
Spanish (EU)	Voicemail	ESO_ASR002	500	Available
Turkish	Conversational	TUR_ASR001	200	Available
Urdu	Conversational	URD_ASR001	1,000	Available

1.2 Wideband

Language	Database Type	Catalogue Code	Speakers	Status
English (US)	Studio	USE_ASR001	200	Available
French (Canadian)	Home/Office	FRC_ASR002	120	Available
German	Studio	DEU_ASR001	127	Available
Thai	Home/Office	THA_ASR001	100	Available
Korean	Home/Office	KOR_ASR001	100	Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

Part-of-speech tagged Lexica providing grammatical and semantic labels

Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

These NER Corpora contain text material from a variety of sources and are tagged for the following Named Entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.

Language	Catalogue Code	Words
Arabic	ARB_NER001	500,000
English	ENI_NER001	500,000
Farsi/Persian	FAR_NER001	500,000
Korean	KOR_NER001	500,000
Japanese	JPY_NER001	500,000
Russian	RUS_NER001	500,000
Mandarin	MAN_NER001	500,000
Urdu	URD_NER001	500,000

4. Other Language Resources

Morphological Analyzers – Farsi/Persian & Urdu

Arabic Thesaurus

Language Analysis Documentation – multiple languages

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low risk, high quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and Government organizations.

We would be glad to discuss any customized requests or package configurations and prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

www.appenbutlerhill.com

5-2-5

OFROM: first corpus of French spoken in French-speaking Switzerland

We would like to announce that OFROM, the first corpus of French spoken in French-speaking Switzerland (Suisse romande), is now online. In its current version, the archive contains about 15 hours of recordings. It is transcribed in standard orthography using the Praat software. A concordancer lets users search the corpus and download the sound excerpts associated with the transcriptions.

To access the data and read a fuller description of the corpus, please visit: http://www.unine.ch/ofrom.

5-2-6

Real-world 16-channel noise recordings

We are happy to announce the release of DEMAND, a set of real-world 16-channel noise recordings designed for the evaluation of microphone array processing techniques.

http://www.irisa.fr/metiss/DEMAND/

1.5 h of noise data were recorded in 18 different indoor and outdoor environments and are available under the terms of the Creative Commons Attribution-ShareAlike License.

Joachim Thiemann (CNRS - IRISA)
Nobutaka Ito (University of Tokyo)
Emmanuel Vincent (Inria Nancy - Grand Est)

5-2-7

Support for finalizing spoken or multimodal corpora for distribution, promotion and long-term archiving

The IRCOM consortium of the TGIR Corpus and the EquipEx ORTOLANG are joining forces to offer technical and financial support for finalizing corpora of spoken or multimodal data, with a view to their distribution and long-term preservation through the EquipEx ORTOLANG. This call does not cover the creation of new corpora, only the finalization of existing corpora that are not available in electronic form. By finalization, we mean deposit in a public digital repository and entry into a long-term archiving workflow. In this way, the speech data that your research has enriched can be reused, cited and further enriched in turn, cumulatively, to support the development of new knowledge, under the conditions of use that you choose (selection of usage licences for each corpus deposited).

This call is subject to several conditions (see below), and financial support is limited to 3000 euros per project. Requests will be processed in the order in which IRCOM receives them. Requests from EAs (équipes d'accueil) or from small teams without technical "corpus" support will be given priority. Requests are to be submitted between 1 September 2013 and 31 October 2013. Funding decisions rest with the IRCOM steering committee. Requests not processed in 2013 may be processed in 2014. If you have doubts about your project's eligibility, do not hesitate to contact us so that we can review your request and adapt our future offers.

To compensate for the wide disparity in computing skills among the people and working groups producing corpora, IRCOM offers personalized support for corpus finalization. This will be carried out by an IRCOM engineer according to the requests made, and adapted to the type of need, whether technical or financial.

The conditions required to submit a corpus for finalization and obtain support from IRCOM are:

Being able to take all decisions concerning the use and distribution of the corpus (intellectual property in particular).
Having all the information concerning the sources of the corpora and the consent of the people recorded or filmed.
Granting a free right of use of the data or, at a minimum, free access for scientific research.

Requests may concern any type of processing: processing of nearly finalized corpora (conversion, anonymization), alignment of already transcribed corpora, conversion from word-processor formats, digitization of old media. For any request requiring substantial manual work, applicants will have to commit human or financial resources matching those provided by IRCOM and ORTOLANG.

IRCOM is aware of the exceptional and exploratory nature of this initiative. It should also be noted that this funding is reserved for corpora that are already largely constituted and cannot be applied to creations from scratch. Because of these limited resources, proposals for the corpora that are furthest along may be processed first, in agreement with the IRCOM steering committee. There is, however, no "theoretical" limit on the requests that can be made, as IRCOM can redirect requests that fall outside its remit to other contacts.

Responses to this call should be sent to ircom.appel.corpus@gmail.com. Proposals must use the two-page form below. In all cases, IRCOM will send a personalized reply.

Proposals must describe the corpora put forward, the information on usage and ownership rights, and the nature of the formats or media used.

This call is organized under the responsibility of IRCOM, with joint funding from IRCOM and the EquipEx ORTOLANG.

For further information, note that the IRCOM website (http://ircom.corpus-ir.fr) is open and offers resources to the community: glossary, inventory of research units and corpora, software resources (tutorials, comparisons, conversion tools), working-group activities, training news, ...

IRCOM invites research units to inventory their spoken and multimodal corpora (70 projects already listed) to give better visibility to the resources already available, even if they are not all finalized.

The IRCOM steering committee

Use this form to respond to the call. Thank you.

Response to the call for finalization of a spoken or multimodal corpus

Corpus name:

Name of contact person:

Email address:

Telephone number:

Nature of the corpus data:

Are there recordings?

Which medium? Audio, video, other…

What is the total length of the recordings? Number of cassettes, number of hours, etc.

What type of storage medium?

Which format (if known)?

Are there transcriptions?

Which format? (paper, word processor, transcription software)

What quantity (in hours, number of words, or number of transcriptions)?

Do you have metadata (description of copyright and usage rights)?

Do you have a precise description of the people recorded?

Do you have a statement of informed consent for the people who were recorded? In (approximately) which year were the recordings made?

What is the language of the recordings?

Does the corpus include recordings of children or of people with a language disorder or a pathology?

If so, which population?

For the sake of efficiency and so that we can advise you as quickly as possible, we need examples of the transcriptions or recordings in your possession. We will contact you about this, but you can already email us a sample of the data you have (transcriptions, metadata, address of a web page containing the recordings).

Thank you in advance for your interest in our proposal. For any further information, please contact Martine Toda (martine.toda@ling.cnrs.fr) or write to ircom.appel.corpus@gmail.com.

5-2-8

Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French

We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank of spoken French created with the aim of modeling the interface between prosody, syntax and discourse in spoken French, is now available at http://www.projet-rhapsodie.fr/

The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and 33,000 words), each with an orthographic and a phonetic transcription aligned to the sound.

The corpus covers a range of genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive and more or less planned speech; descriptive, argumentative, oratorical and procedural sequences).

The samples were mainly drawn from existing corpora of spoken French (recordings taken from earlier projects, with the agreement of their original designers), and some were created within the Rhapsodie project. We would especially like to thank the coordinators of the CFPP2000, PFC, ESLO and C-Prom projects, as well as Mathieu Avanzi, Anne Lacheret, Piet Mertens and Nicolas Obin.

The sound samples (wave, MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations (xml, textgrid) as well as the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France.

Microsyntactic annotations will be available soon.

The metadata can also be explored online through a browser.

The prosodic annotation can be queried online using the Rhapsodie Query Language (Rhapsodie QL).

Tutorials on transcription, annotation and queries are available on the Rhapsodie site.

The Rhapsodie team (Modyco, Université Paris Ouest Nanterre):

Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.

Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).

5-2-9

COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies

======================

CALL for contributions

======================

We are pleased to announce the creation of an open-source repository of advanced speech processing algorithms called COVAREP (A Cooperative Voice Analysis Repository for Speech Technologies). COVAREP has been created as a GitHub project (https://github.com/covarep/covarep) where researchers in speech processing can store original implementations of published algorithms.

Over the past few decades a vast array of advanced speech processing algorithms have been developed, often offering significant improvements over the existing state-of-the-art. Such algorithms can have a reasonably high degree of complexity and, hence, can be difficult to accurately re-implement based on article descriptions. Another issue is the so-called 'bug magnet effect' with re-implementations frequently having significant differences from the original. The consequence of all this has been that many promising developments have been under-exploited or discarded, with researchers tending to stick to conventional analysis methods.

By developing the COVAREP repository we are hoping to address this by encouraging authors to include original implementations of their algorithms, thus resulting in a single de facto version for the speech community to refer to.

We envisage a range of benefits to the repository:

1) Reproducible research: COVAREP will allow fairer comparison of algorithms in published articles.

2) Encouraged usage: the free availability of these algorithms will encourage researchers from a wide range of speech-related disciplines (both in academia and industry) to exploit them for their own applications.

3) Feedback: because COVAREP is a GitHub project, users will be able to offer comments on algorithms, report bugs, suggest improvements, etc.

SCOPE

We welcome contributions from a wide range of speech processing areas, including (but not limited to): Speech analysis, synthesis, conversion, transformation, enhancement, speech quality, glottal source/voice quality analysis, etc.

REQUIREMENTS

In order to achieve a reasonable standard of consistency and homogeneity across algorithms we have compiled a list of requirements for prospective contributors to the repository. However, we intend the list of the requirements not to be so strict as to discourage contributions.

Only published work can be added to the repository
The code must be available as open source
Algorithms should be coded in Matlab; however, we strongly encourage authors to make the code compatible with Octave in order to maximize usability
Contributions have to comply with a Coding Convention (see GitHub site for coding convention and template). The convention only normalizes the inputs/outputs and the documentation; there is no restriction on the content of the functions (though comments are obviously encouraged).

LICENCE

Getting contributing institutions to agree to a homogeneous IP policy would be close to impossible. As a result COVAREP is a repository and not a toolbox, and each algorithm will have its own licence associated with it. Though flexible to different licence types, contributions will need to have a licence which is compatible with the repository, i.e. {GPL, LGPL, X11, Apache, MIT} or similar. We would encourage contributors to try to obtain LGPL licences from their institutions in order to be more industry friendly.
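
A contributor might want a quick pre-submission check of a licence name against the families listed above. The snippet below is only a sketch under stated assumptions: the function name `licence_status` and the name-normalization rule are hypothetical, it is written in Python rather than Matlab purely for illustration, and it is not part of COVAREP. Because the call also accepts "similar" licences, unknown names are flagged for manual review rather than rejected.

```python
# Licence families named in the call; "or similar" is not modelled here.
NAMED_LICENCES = {"GPL", "LGPL", "X11", "APACHE", "MIT"}


def licence_status(name: str) -> str:
    """Return 'named' if the licence family is listed in the call,
    otherwise 'needs review' (it may still qualify as 'similar')."""
    # Take the first word of the name, e.g. 'LGPL-2.1' -> 'LGPL'.
    # Fused forms such as 'GPLv3' are not split and fall to review.
    parts = name.strip().upper().replace("-", " ").split()
    family = parts[0] if parts else ""
    return "named" if family in NAMED_LICENCES else "needs review"


for candidate in ["LGPL-2.1", "Apache 2.0", "BSD-3-Clause"]:
    print(candidate, "->", licence_status(candidate))
```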

CONTRIBUTE!

We believe that the COVAREP repository has a great potential benefit to the speech research community and we hope that you will consider contributing your published algorithms to it. If you have any questions, comments, issues, etc. regarding COVAREP, please contact us at one of the email addresses below. Please forward this email to others who may be interested.

Existing contributions include: algorithms for spectral envelope modelling, adaptive sinusoidal modelling, fundamental frequency/voicing decision/glottal closure instant detection algorithms, methods for detecting non-modal phonation types, etc.

Gilles Degottex <degottex@csd.uoc.gr>, John Kane <kanejo@tcd.ie>, Thomas Drugman <thomas.drugman@umons.ac.be>, Tuomo Raitio <tuomo.raitio@aalto.fi>, Stefan Scherer <scherer@ict.usc.edu>

Website - http://covarep.github.io/covarep

GitHub - https://github.com/covarep/covarep

5-2-10

Annotation of “Hannah and her sisters” by Woody Allen.

We have created and made publicly available a dense audio-visual, person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and her sisters” by Woody Allen.

The annotation includes:

• Face tracks in video (densely annotated, i.e., in each frame, and person-labeled)
• Speech segments in audio (person-labeled)
• Shot boundaries in video

The annotation can be useful for evaluating:

• Person-oriented video-based tasks (e.g., face tracking, automatic character naming)
• Person-oriented audio-based tasks (e.g., speaker diarization or recognition)
• Person-oriented multimodal tasks (e.g., audio-visual character naming)

Details on the Hannah dataset and access to it are available here:

https://research.technicolor.com/rennes/hannah-home/

https://research.technicolor.com/rennes/hannah-download/

Acknowledgments:

This work is supported by the AXES EU project: http://www.axes-project.eu/

Alexey Ozerov Alexey.Ozerov@technicolor.com

Jean-Ronan Vigouroux,

Louis Chevallier

Patrick Pérez

Technicolor Research & Innovation

5-2-11

French TTS

Text to Speech Synthesis: over an hour of speech synthesis samples from 1968 to 2001 by 25 French, Canadian, US, Belgian, Swedish and Swiss systems.

33 ans de synthèse de la parole à partir du texte: une promenade sonore (1968-2001)
(33 years of Text to Speech Synthesis in French: an audio tour (1968-2001))
Christophe d'Alessandro
Article published in Volume 42 - No. 1/2001 issue of Traitement Automatique des Langues (TAL, Editions Hermes), pp. 297-321.

Posted to:
http://groupeaa.limsi.fr/corpus:synthese:start

5-2-12

Google's Language Model benchmark

An LM benchmark is available at: https://code.google.com/p/1-billion-word-language-modeling-benchmark/.

Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

unpruned Katz (1.1B n-grams),
pruned Katz (~15M n-grams),
unpruned Interpolated Kneser-Ney (1.1B n-grams),
pruned Interpolated Kneser-Ney (~15M n-grams)

ArXiv paper: http://arxiv.org/abs/1312.3005

Happy benchmarking!'
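
As a rough illustration of how per-word log-probability values such as those mentioned above are typically turned into a single figure for comparing models, the sketch below computes perplexity. It is not taken from the benchmark's own scripts: the function name `perplexity` and the `base` parameter are hypothetical, and the logarithm base of the distributed values is not stated here, so base 10 is only an assumed default.

```python
import math
from typing import Iterable


def perplexity(log_probs: Iterable[float], base: float = 10.0) -> float:
    """Perplexity of a word sequence from its per-word log-probabilities.

    `base` is the logarithm base of the input values (an assumption here;
    check how the values you are using were actually produced).
    """
    values = list(log_probs)
    if not values:
        raise ValueError("need at least one word log-probability")
    # Perplexity is the inverse geometric mean of the word probabilities:
    # base ** ( -(1/N) * sum_i log_base p(w_i) ).
    mean_log_prob = sum(values) / len(values)
    return base ** (-mean_log_prob)


if __name__ == "__main__":
    # Made-up values for illustration only, not benchmark results.
    print(perplexity([-1.0, -2.0, -1.5]))
    # The same function works with natural-log values if base=math.e is passed.
    print(perplexity([math.log(0.25)] * 3, base=math.e))
```

A lower perplexity on the same held-out set indicates a better fit, which is why fixing the training and test data makes published numbers directly comparable.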

5-2-13

International Standard Language Resource Number (ISLRN) (ELRA Press release)

Press Release - Immediate - Paris, France, December 13, 2013

Establishing the International Standard Language Resource Number (ISLRN)

12 major NLP organisations announce the establishment of the ISLRN, a Persistent Unique Identifier, to be assigned to each Language Resource.

On November 18, 2013, 12 NLP organisations agreed to announce the establishment of the International Standard Language Resource Number (ISLRN), a Persistent Unique Identifier, to be assigned to each Language Resource. Experiment replicability, an essential feature of scientific work, would be enhanced by such a unique identifier. Set up by ELRA, LDC and AFNLP/Oriental-COCOSDA, the ISLRN Portal will provide unique identifiers using a standardised nomenclature, as a service free of charge for all Language Resource providers. It will be supervised by a steering committee composed of representatives of participating organisations and enlarged whenever necessary.

For more information on ELRA and the ISLRN, please contact: Khalid Choukri choukri@elda.org

For more information on ELDA, please contact: Hélène Mazo mazo@elda.org

ELRA

55-57, rue Brillat Savarin

75013 Paris (France)

Tel.: +33 1 43 13 33 33

Fax: +33 1 43 13 33 30

5-2-14

ISLRN new portal

Opening of the ISLRN Portal
ELRA, LDC, and AFNLP/Oriental-COCOSDA announce the opening of the ISLRN Portal at www.islrn.org.

Further to the establishment of the International Standard Language Resource Number (ISLRN) as a unique and universal identification schema for Language Resources on November 18, 2013, ELRA, LDC and AFNLP/Oriental-COCOSDA now announce the opening of the ISLRN Portal (www.islrn.org). As a service free of charge for all Language Resource providers and under the supervision of a steering committee composed of representatives of participating organisations, the ISLRN Portal provides unique identifiers using a standardised nomenclature.

Overview
The 13-digit ISLRN format is: XXX-XXX-XXX-XXX-X. It can be allocated to any Language Resource; its composition is neutral and does not include any semantics in reference to the type or nature of the Language Resource. The ISLRN is a randomly generated number whose check digit is validated with the Verhoeff algorithm.
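
For readers unfamiliar with the Verhoeff scheme, here is a minimal sketch of how such a check digit can be computed and verified. The lookup tables are the standard published Verhoeff tables. How the ISLRN Portal applies them is not described here, so the layout used below (12 payload digits followed by one check digit as the final digit) is an assumption, and the helper names `format_islrn` and `looks_like_valid_islrn` are hypothetical.

```python
import re

# Multiplication table of the dihedral group D5 (standard Verhoeff table).
_D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6),
    (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8),
    (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2),
    (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4),
    (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
# Position-dependent permutation table (period 8).
_P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
    (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
    (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)
# Inverse of each element in D5.
_INV = (0, 4, 3, 2, 1, 5, 6, 7, 8, 9)


def verhoeff_check_digit(payload: str) -> int:
    """Return the Verhoeff check digit to append to a string of digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_is_valid(number: str) -> bool:
    """Return True if the last digit of `number` is a correct check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def format_islrn(payload: str) -> str:
    """Hypothetical helper: 12 digits -> 'XXX-XXX-XXX-XXX-X' with check digit."""
    if not re.fullmatch(r"[0-9]{12}", payload):
        raise ValueError("payload must be exactly 12 decimal digits")
    digits = payload + str(verhoeff_check_digit(payload))
    return "-".join([digits[0:3], digits[3:6], digits[6:9], digits[9:12], digits[12]])


def looks_like_valid_islrn(text: str) -> bool:
    """Hypothetical helper: check the printed shape, then the check digit."""
    if not re.fullmatch(r"[0-9]{3}-[0-9]{3}-[0-9]{3}-[0-9]{3}-[0-9]", text):
        return False
    return verhoeff_is_valid(text.replace("-", ""))


if __name__ == "__main__":
    print(verhoeff_check_digit("236"))  # 3, the usual textbook example
    candidate = format_islrn("123456789012")  # arbitrary digits, not a real ISLRN
    print(candidate, looks_like_valid_islrn(candidate))
```

The Verhoeff scheme is designed to catch every single-digit error, which suits identifiers that are copied by hand into papers and catalogues.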

Two types of external players may interact with the ISLRN Portal: Visitors and Providers. Visitors may browse the web site and search for the ISLRN of a given Language Resource by its name or by its number if it exists. Providers are registered and own credentials. They can request a new ISLRN for a given Language Resource. A provider has the possibility to become certified, after moderation, in order to be able to import metadata in XML format.

The functionalities that can be accessed by Visitors are:

- Identify a language resource according to its ISLRN
- Identify an ISLRN by the name of a language resource
- Get information about ISLRN, FAQ, Basic Metadata, Legal Information
- View last 5 accepted resources (“What’s new” block on home page)
- Sign up to become a provider

The functionalities that can be accessed by Providers, once they have signed up, are:

- Log in
- Request an ISLRN according to the metadata of a given resource
- Request to become a certified provider so as to import XML files containing metadata
- Import one or more metadata descriptions in XML to request ISLRN(s) (only for certified providers)
- Edit pending requests
- Access previous requests
- Contact a Moderator or an Administrator
- Edit Providers’ own profile

Each ISLRN request is handled by moderators within 5 working days.
Contact: islrn@elda.org

Background
The International Standard Language Resource Number (ISLRN) is a unique and universal identification schema for Language Resources which provides Language Resources with unique identifiers using a standardised nomenclature. It also ensures that Language Resources are correctly identified and, consequently, recognised with proper references for their usage in applications in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, it is a major step in the interconnected world that Human Language Technologies (HLT) has become: unique resources must be identified as such, and meta-catalogues need a common identification format to manage data correctly.

The ISLRN is not intended to replace local and specific identifiers. It is not meant to be a legal deposit or an obligation, but rather an essential best practice. For instance, a resource that is distributed by several data centres will still have each “local” data-centre identifier but will have a single, unique ISLRN.

********************************************************************
About ELRA
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT). To find out more about ELRA, please visit www.elra.info.

About LDC
Founded in 1992, the Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. To find out more about LDC, please visit www.ldc.upenn.edu.

About AFNLP
The mission of the Asian Federation of Natural Language Processing (AFNLP) is to promote and enhance R&D relating to the computational analysis and the automatic processing of all languages of importance to the Asian region by assisting and supporting like-minded organizations and institutions through information sharing, conference organization, research and publication co-ordination, and other forms of support. To find out more about AFNLP, please visit www.afnlp.org.

About Oriental-COCOSDA
The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques, Oriental-COCOSDA, has been established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing, especially for Speech Input/Output. To find out more about Oriental-COCOSDA, please visit our web site: www.cocosda.org

5-2-15

Speechocean – update (June 2014)

Speechocean: A global language resources and data services supplier

Speechocean has over 500 large-scale databases available in 110+ languages and accents, covering desktop, in-car, telephony and tablet PC platforms. Our large and diverse data repository includes ASR Databases, TTS Databases, Lexica, Text Corpora, etc.

Speechocean is glad to announce more resources that have been released:

ASR Databases

Speechocean provides corpora in 110+ regional languages, available in a variety of formats, speaking styles, recording environments and platforms, including in-car, mobile phone, fixed-line and desktop speech recognition corpora. This month we released more European language databases (Part One), built for tuning and testing speech recognition systems in ASR applications.

In-Car

Serial Number	Kingline Data Name	Sound Parameters	Utterances
King-ASR-129	Canadian French Speech Recognition Corpus (In-car) - Sentence (328 Speakers)	16 kHz, 16-bit, four channels	361,560
King-ASR-132	France French Speech Recognition Corpus (In-car) (300 Speakers)	16 kHz, 16-bit, four channels	360,000
King-ASR-134	Turkish Speech Recognition Corpus (In-car) - Sentence (316 Speakers)	16 kHz, 16-bit, four channels	398,692
King-ASR-141	Spain Spanish Speech Recognition Corpus (In-car) (300 Speakers)	16 kHz, 16-bit, four channels	360,000

Telephony

Serial Number	Kingline Data Name	Sound Parameters	Utterances
King-ASR-220	German Speech Recognition Corpus (Telephone) - Conversational (1000 Speakers)	8 kHz, 16-bit, one channel	150,000

Mobile

Serial Number	Kingline Data Name	Sound Parameters	Utterances
King-ASR-106	Catalan Speech Recognition Corpus (Mobile) (200 Speakers)	16 kHz, 16-bit, one channel	60,000
King-ASR-116	Polish Speech Recognition Corpus (Mobile) (600 Speakers)	16 kHz, 16-bit, one channel	180,000
King-ASR-124	Russian Speech Recognition Corpus (Mobile) - Sentence (604 Speakers)	16 kHz, 16-bit, one channel	180,542
King-ASR-128	Romanian Speech Recognition Corpus (Mobile) (600 Speakers)	16 kHz, 16-bit, one channel	180,000
King-ASR-133	Swedish Speech Recognition Corpus (Mobile) (300 Speakers)	16 kHz, 16-bit, one channel	45,000

Desktop

Serial Number	Kingline Data Name	Sound Parameters	Utterances
King-ASR-207	Brazilian Portuguese Speech Recognition Corpus (Desktop) (203 Speakers)	44.1 kHz, 16-bit, two channels	121,780
King-ASR-075	European Portuguese Speech Recognition Corpus (Desktop) (200 Speakers)	44.1 kHz, 16-bit, four channels	319,908
King-ASR-171	France French Speech Recognition Corpus (Desktop) - Sentence (203 Speakers)	44.1 kHz, 16-bit, two channels	121,642
King-ASR-182	German Speech Recognition Corpus (Desktop) - Sentence (200 Speakers)	44.1 kHz, 16-bit, four channels	239,940

TTS Databases

Speechocean licenses a variety of databases in more than 40 languages for speech synthesis, broadcast speech, emotional speech, etc., which can be used with different algorithms.

Serial No.	Kingline Data Names	Sound Parameter	Utterances	Recording Hours
King-TTS-004	Arabic Speech Synthesis Database 1 (Male)	16K,16bit Two Channels	8055	11.7
King-TTS-005	Arabic Speech Synthesis Database 2 (Male)	16K,16bit Two Channels	8039	12.01
King-TTS-008	Spain Spanish Speech Synthesis Database (Female)	44.1K,16bit Two Channels	5000	Under Building
King-TTS-009	France French Speech Synthesis Database (Female)	44.1K,16bit Two Channels	5000	Under Building
King-TTS-010	German Speech Synthesis Database (Female)	44.1K,16bit Two Channels	5000	Under Building
King-TTS-015	Italian Speech Synthesis Database (Female)	44.1K,16bit Two Channels	10300	13.13

Text Corpora

Speechocean licenses many kinds of text corpora in many languages, which are well suited to language model training.

ID	Kingline Data Names	Languages	Size
King-NLP-017	Spain Spanish Personal Names Corpus	Spain Spanish	Under Building
King-NLP-018	Spain Spanish Address Corpus	Spain Spanish	Under Building
King-NLP-021	Polish Address Corpus	Polish	Under Building
King-NLP-025	Turkish Personal Names Corpus	Turkish	Under Building
King-NLP-026	Turkish Address Corpus	Turkish	Under Building

Lexica

Speechocean builds pronunciation lexica in many languages which can be licensed to customers.

No.	Name	Phoneme Set
King-Lexicon-019	Italian Pronunciation Lexicon	SAMPA
King-Lexicon-020	Polish Pronunciation Lexicon	SAMPA
King-Lexicon-021	Dutch Pronunciation Lexicon	SAMPA
King-Lexicon-022	Swedish Pronunciation Lexicon	X-SAMPA
King-Lexicon-024	Finnish Pronunciation Lexicon	Under Building
King-Lexicon-025	Romanian Pronunciation Lexicon	Under Building