ISCA - International Speech
Communication Association



ISCApad #189

Saturday, March 15, 2014 by Chris Wellekens

5-2 Database
5-2-1 ELRA - Language Resources Catalogue - Update (2013-12)

*****************************************************************
    ELRA - Language Resources Catalogue - Update December 2013
*****************************************************************

We are happy to announce that 1 new Speech Language Resource and 1 new Written Corpus are now available in our catalogue. 

ELRA-S0365 aGender
aGender contains speech recorded over public telephone lines, comprising read and (semi-)spontaneous speech. Native German speakers called a voice portal from their private phones, read prompted text, and answered open questions. The corpus contains the voices of 945 German speakers (a minimum of approximately 100 speakers per class), each delivering 18 speech items in up to six sessions.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1214

ELRA-W0074 Amharic-English bilingual corpus
The Amharic-English bilingual corpus contains parallel text from the legal and news domains in Amharic script, in transliterated form and in English. The corpus comprises 232,653 words in Amharic and 291,701 words in English.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1215

For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html


5-2-2 ELRA releases free Language Resources

ELRA releases free Language Resources
***************************************************

Anticipating users' expectations, ELRA has decided to offer a large number of resources free of charge for academic research use. The offer consists of several sets of speech, text and multimodal resources that are released free of charge as soon as legal aspects are cleared. A first set was released in May 2012 on the occasion of LREC 2012. A second set is now being released.

Wherever our licences permit it, please feel free to use these resources to derive new resources and to deposit them with the ELRA catalogue for community re-use.

Over the last decade, ELRA has compiled a large list of resources into its Catalogue of LRs. ELRA has negotiated distribution rights with the LR owners and made such resources available under fair conditions and within a clear legal framework. Following this initiative, ELRA has also worked on LR discovery and identification with a dedicated team which investigated and listed existing and valuable resources in its 'Universal Catalogue', a list of resources that could be negotiated on a case-by-case basis. At LREC 2010, ELRA introduced the LRE Map, an inventory of LRs, whether developed or used, that were described in LREC papers. This huge inventory listed by the authors themselves constitutes the first 'community-built' catalogue of existing or emerging resources, constantly enriched and updated at major conferences.

Considering the latest trends on easing the sharing of LRs, from both legal and commercial points of view, ELRA is taking a major role in META-SHARE, a large European open infrastructure for sharing LRs. This infrastructure will allow LR owners, providers and distributors to distribute their LRs through an additional and cost-effective channel.

To obtain the available sets of LRs, please visit the web page below and follow the instructions given online:
http://www.elra.info/Free-LRs,26.html


5-2-3 LDC Newsletter (February 2014)

 

In this newsletter:

Spring 2014 LDC Data Scholarship recipients!

Membership fee savings and publications pipeline

New LDC website enhancements coming soon

New publications:

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2

King Saud University Arabic Speech Database

NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source



Spring 2014 LDC Data Scholarship recipients!

LDC is pleased to announce the student recipients of the Spring 2014 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen two proposals to support. The following students will receive no-cost copies of LDC data:

  • Skye Anderson ~ Tulane University (USA), BA candidate, Linguistics.  Skye has been awarded a copy of LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 for her work in author profiling.
  • Hao Liu ~ University College London (UK), PhD candidate, Speech, Hearing and Phonetic Sciences.  Hao has been awarded a copy of Switchboard-1 Release 2, and NXT Switchboard Annotations for his work in prosody modeling.

 

Membership fee savings and publications pipeline

Members can still save on 2014 membership fees, but time is running out. Any organization which joins or renews membership for 2014 through Monday, March 3, 2014, is entitled to a 5% discount. Organizations which held membership for MY2013 can receive a 10% discount on fees provided they renew prior to March 3, 2014.

Planned publications for this year include:

  • 2009 NIST Language Recognition Evaluation ~  development data from VOA broadcast and CTS telephone speech in target and non-target languages.
  • ETS Corpus of Non-Native Written English ~ contains 1100 essays written for a college-entrance test sampled from eight prompts (i.e., topics)  with score levels (low/medium/high) for each essay.
  • GALE data ~ including Word Alignment, Broadcast Speech & Transcripts, Parallel Text, Parallel Aligned Treebanks in Arabic, Chinese, and English.
  • Hispanic Accented English ~ contains approximately 30 hours of spontaneous speech and read utterances from non-native speakers of English with corresponding transcripts.
  • Multi-Channel Wall Street Journal Audio-Visual Corpus (MC-WSJ-AV) ~ a re-recording of parts of WSJCAM0 using a number of microphones and three recording conditions, resulting in 18-20 channels of audio per recording.
  • TAC KBP Reference Knowledge Base ~ TAC KBP aims to develop and evaluate technologies for building and populating knowledge bases (KBs) about named entities from unstructured text. KBP systems must either populate an existing reference KB or build a KB from scratch. The reference KB is based on a snapshot of English Wikipedia from October 2008 and contains a set of entities, each with a canonical name and title for the Wikipedia page, an entity type, an automatically parsed version of the data from the infobox in the entity's Wikipedia article, and a stripped version of the text of the article.
  • USC-SFI MALACH Interviews and Transcripts Czech ~ developed by The University of Southern California's Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 143 hours of interviews from 420 interviewees along with transcripts and other documentation.
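The TAC KBP reference-KB entries described in the list above can be modeled, very roughly, as follows. This is a speculative Python sketch: the field names, the entity ID, and the example entry are illustrative, not the released schema.

```python
# Rough sketch of a TAC KBP reference-KB entry (canonical name,
# Wikipedia page title, entity type, parsed infobox facts, stripped
# article text). Field names and IDs are illustrative only.
from dataclasses import dataclass, field

@dataclass
class KBEntry:
    entity_id: str
    name: str                 # canonical name
    wiki_title: str           # title of the Wikipedia page
    entity_type: str          # e.g. 'PER', 'ORG', 'GPE'
    facts: dict = field(default_factory=dict)   # parsed infobox slots
    wiki_text: str = ''       # stripped article text

kb = {}
e = KBEntry('E0000001', 'Philadelphia', 'Philadelphia', 'GPE',
            facts={'country': 'United States', 'state': 'Pennsylvania'})
kb[e.entity_id] = e
print(kb['E0000001'].facts['state'])  # Pennsylvania
```

A KBP system populating such a KB would add or link entities against this dictionary keyed by entity ID.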

 

New LDC website enhancements coming soon

Look for LDC’s new website enhancements in the coming weeks. We've revamped our membership services to make it easier than ever for you to manage your membership and access data more quickly.

 

New publications

(1) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 was developed by LDC and contains 141,058 tokens of word aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast News v1.0 (LDC2012T07).

The source data consists of Arabic broadcast news programming collected by LDC in 2007 and 2008. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language   Files   Words     Tokens    Segments
Arabic     31      110,690   141,058   7,102

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:

  • Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
  • Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
  • Tagging unmatched words attached to other words or phrases
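As a rough illustration of the annotation tasks just listed, alignment links of this kind can be represented as pairs of token-index sets plus a link type. This is a hypothetical data structure, not the actual GALE file format; the token strings and type labels are invented for the example.

```python
# Hypothetical representation of sentence-level word-alignment links,
# not the actual GALE annotation format. Each link joins a set of
# source-token indices to a set of target-token indices and carries a
# type such as 'translated-correct' or 'not-translated'.
from dataclasses import dataclass, field

@dataclass
class AlignmentLink:
    src: frozenset      # indices into the source-token list
    tgt: frozenset      # indices into the target-token list
    link_type: str      # e.g. 'translated-correct', 'not-translated'

@dataclass
class AlignedSentence:
    src_tokens: list
    tgt_tokens: list
    links: list = field(default_factory=list)

    def unaligned_tgt(self):
        """Target tokens not covered by any link (candidates for the
        'unmatched words' tagging task)."""
        covered = set().union(*(l.tgt for l in self.links)) if self.links else set()
        return [i for i in range(len(self.tgt_tokens)) if i not in covered]

sent = AlignedSentence(
    src_tokens=['qAl', 'AlmtHdv'],
    tgt_tokens=['the', 'spokesman', 'said'],
    links=[AlignmentLink(frozenset({0}), frozenset({2}), 'translated-correct'),
           AlignmentLink(frozenset({1}), frozenset({1}), 'translated-correct')])
print(sent.unaligned_tgt())  # [0]: 'the' is unmatched in this toy example
```

Many-to-many links (a phrase on one side aligned to a phrase on the other) are expressed by putting several indices in `src` or `tgt`.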

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$1750.


(2) King Saud University Arabic Speech Database was developed by King Saud University and contains 590 hours of recorded Arabic speech from male and female speakers. The utterances include read and spontaneous speech. The recordings were conducted in varied environments representing quiet and noisy settings.

The corpus was designed principally for speaker recognition research. The speech sources are sentences, word lists, prose and question and answer sessions. Read speech text includes the following:

  • Sets of sentences devised to cover allophones of each phoneme, phonetic balance, and differentiation of accents.
  • Word lists developed to minimize missing phonemes and to represent nasals, fricatives, commonly used words, and numbers.
  • Two paragraphs, one from the Quran and another from a book, selected because they included all letters of the alphabet and were easy to read.

Spontaneous speech was captured through question and answer sessions between participants and project team members. Speakers responded to questions on general topics such as the weather and food.

Each speaker was recorded in three different environments: a soundproof room, an office, and a cafeteria. The recordings were collected via microphone and mobile phone and averaged 16 to 19 minutes per speaker. The data was checked for missing recordings, recording-system problems and errors in the recording process.

King Saud University Arabic Speech Database is distributed on one hard disk.

2014 Subscription Members will receive a copy of this data provided that they have completed the User License Agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$2000.

 

 

(3) NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source was developed by the NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT 2012 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. The set is based on a subset of the Arabic-to-English and Chinese-to-English progress tests from the OpenMT 2008, 2009 and 2012 evaluations, with new source data created by humans from the English reference translations. The package was compiled, and the scoring software developed, at NIST, making use of newswire and web data and reference translations developed by the Linguistic Data Consortium and the Defense Language Institute Foreign Language Center.

The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original. The 2012 task included the evaluation of five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English in two source data styles. For general information about the NIST OpenMT evaluations, refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
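The matching idea behind the script can be sketched as a clipped n-gram precision, the core of BLEU-style MT scoring. The Python sketch below is a simplification for illustration only: it handles a single n-gram order and omits mteval-v13a.pl's actual bookkeeping (combining n-gram orders, brevity penalty, tokenization).

```python
# Simplified sketch of reference-based n-gram matching, the idea behind
# BLEU-style MT scoring. This is NOT the exact mteval-v13a.pl metric.
from collections import Counter

def ngram_precision(candidate, references, n=2):
    """Fraction of candidate n-grams that also occur in some reference,
    with each n-gram's count clipped to its maximum reference count."""
    cand = candidate.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    if not cand_ngrams:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        toks = ref.split()
        ref_ngrams = Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        for g, c in ref_ngrams.items():
            max_ref[g] = max(max_ref[g], c)
    matched = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
    return matched / sum(cand_ngrams.values())

print(ngram_precision('the cat sat on the mat',
                      ['the cat sat on a mat'], n=2))  # 0.6
```

Here three of the five candidate bigrams ('the cat', 'cat sat', 'sat on') appear in the reference, giving 3/5.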

This release consists of 20 files, four for each of the five languages, presented in XML with an included DTD. The four files are source and reference data in the following two styles:

  • English-true: an English-oriented translation; the text must read well in English and must not carry over idiomatic expressions from the foreign language to convey meaning, unless absolutely necessary.
  • Foreign-true: a translation as close as possible to the foreign language, as if the text had originated in that language.

NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$150.


5-2-4 Appen ButlerHill

 

Appen ButlerHill 

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony


Language                  Database Type   Catalogue Code  Speakers  Status
Bahasa Indonesia          Conversational  BAH_ASR001      1,002     Available
Bengali                   Conversational  BEN_ASR001      1,000     Available
Bulgarian                 Conversational  BUL_ASR001      217       Available shortly
Croatian                  Conversational  CRO_ASR001      200       Available shortly
Dari                      Conversational  DAR_ASR001      500       Available
Dutch                     Conversational  NLD_ASR001      200       Available
Eastern Algerian Arabic   Conversational  EAR_ASR001      496       Available
English (UK)              Conversational  UKE_ASR001      1,150     Available
Farsi/Persian             Scripted        FAR_ASR001      789       Available
Farsi/Persian             Conversational  FAR_ASR002      1,000     Available
French (EU)               Conversational  FRF_ASR001      563       Available
French (EU)               Voicemail       FRF_ASR002      550       Available
German                    Voicemail       DEU_ASR002      890       Available
Hebrew                    Conversational  HEB_ASR001      200       Available shortly
Italian                   Conversational  ITA_ASR003      200       Available shortly
Italian                   Voicemail       ITA_ASR004      550       Available
Kannada                   Conversational  KAN_ASR001      1,000     In development
Pashto                    Conversational  PAS_ASR001      967       Available
Portuguese (EU)           Conversational  PTP_ASR001      200       Available shortly
Romanian                  Conversational  ROM_ASR001      200       Available shortly
Russian                   Conversational  RUS_ASR001      200       Available
Somali                    Conversational  SOM_ASR001      1,000     Available
Spanish (EU)              Voicemail       ESO_ASR002      500       Available
Turkish                   Conversational  TUR_ASR001      200       Available
Urdu                      Conversational  URD_ASR001      1,000     Available

1.2 Wideband

Language            Database Type  Catalogue Code  Speakers  Status
English (US)        Studio         USE_ASR001      200       Available
French (Canadian)   Home/Office    FRC_ASR002      120       Available
German              Studio         DEU_ASR001      127       Available
Thai                Home/Office    THA_ASR001      100       Available
Korean              Home/Office    KOR_ASR001      100       Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

  • Pronunciation lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

  • Part-of-speech-tagged lexica providing grammatical and semantic labels

  • Other reference text-based materials, including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalization lists

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

Language        Catalogue Code  Words
Arabic          ARB_NER001      500,000
English         ENI_NER001      500,000
Farsi/Persian   FAR_NER001      500,000
Korean          KOR_NER001      500,000
Japanese        JPY_NER001      500,000
Russian         RUS_NER001      500,000
Mandarin        MAN_NER001      500,000
Urdu            URD_NER001      500,000

These NER corpora contain text material from a variety of sources and are tagged for the following named entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.


4. Other Language Resources

Morphological Analyzers – Farsi/Persian & Urdu

Arabic Thesaurus

Language Analysis Documentation – multiple languages

 

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low-risk, high-quality, reliable solution and has worked in more than 130 languages to date, supporting both large global corporations and government organizations.

We would be glad to discuss any customized requests or package configurations and to prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

                                                         www.appenbutlerhill.com


5-2-5 OFROM: the first corpus of French spoken in French-speaking Switzerland
We would like to announce the release of OFROM, the first corpus of French spoken in French-speaking Switzerland. In its current version, the archive comprises about 15 hours of speech, transcribed in standard orthography using the Praat software. A concordancer makes it possible to search the corpus and to download the sound excerpts associated with the transcriptions.

To access the data and read a fuller description of the corpus, please visit: http://www.unine.ch/ofrom

5-2-6 Real-world 16-channel noise recordings

We are happy to announce the release of DEMAND, a set of real-world
16-channel noise recordings designed for the evaluation of microphone
array processing techniques.

http://www.irisa.fr/metiss/DEMAND/

1.5 h of noise data were recorded in 18 different indoor and outdoor
environments and are available under the terms of the Creative Commons Attribution-ShareAlike License.
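As a sketch of how such multichannel recordings might be handled (illustrative only; DEMAND's own file layout and tooling are described on its website), per-channel RMS energy can be computed from a multichannel WAV using only the Python standard library. A tiny two-channel file is synthesized here so the example is self-contained; a 16-channel DEMAND file would be read the same way.

```python
# Sketch: reading a multichannel WAV and computing per-channel RMS
# energy with the standard library only. A small 2-channel file is
# synthesized first so the example runs without external data.
import math, struct, wave

def write_sine_wav(path, n_channels=2, rate=16000, secs=0.1):
    """Write a 16-bit PCM WAV with a 440 Hz tone at a distinct
    amplitude on each channel."""
    with wave.open(path, 'wb') as w:
        w.setnchannels(n_channels)
        w.setsampwidth(2)          # 16-bit PCM
        w.setframerate(rate)
        frames = bytearray()
        for i in range(int(rate * secs)):
            for ch in range(n_channels):
                amp = 8000 * (ch + 1)          # channel 1 is twice channel 0
                sample = int(amp * math.sin(2 * math.pi * 440 * i / rate))
                frames += struct.pack('<h', sample)
        w.writeframes(bytes(frames))

def channel_rms(path):
    """Per-channel RMS of a 16-bit PCM WAV; samples are interleaved
    frame by frame, so channel ch is data[ch::n_ch]."""
    with wave.open(path, 'rb') as w:
        n_ch, n_frames = w.getnchannels(), w.getnframes()
        data = struct.unpack('<%dh' % (n_ch * n_frames), w.readframes(n_frames))
    return [math.sqrt(sum(s * s for s in data[ch::n_ch]) / n_frames)
            for ch in range(n_ch)]

write_sine_wav('demo.wav')
print(channel_rms('demo.wav'))  # channel 1 has roughly twice the RMS of channel 0
```

Comparing per-channel energies like this is a typical first sanity check before running microphone-array processing on such data.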

Joachim Thiemann (CNRS - IRISA)
Nobutaka Ito (University of Tokyo)
Emmanuel Vincent (Inria Nancy - Grand Est)


5-2-7 Support for finalizing oral or multimodal corpora for distribution, promotion and long-term archiving

Support for finalizing oral or multimodal corpora for distribution, promotion and long-term archiving

The IRCOM consortium of the TGIR Corpus and the ORTOLANG EquipEx are joining forces to offer technical and financial support for finalizing corpora of oral or multimodal data, with a view to their distribution and long-term preservation through the ORTOLANG EquipEx. This call does not concern the creation of new corpora, but the finalization of existing corpora that are not available in electronic form. By finalization, we mean deposit with a public digital repository and entry into a long-term archiving pipeline. In this way, the speech data that your research has enriched can in turn be reused, cited and enriched cumulatively, enabling the development of new knowledge, under the conditions of use that you choose (a selection of licences corresponding to each deposited corpus).

This call is subject to several conditions (see below), and financial support is limited to 3,000 euros per project. Requests will be handled in the order in which IRCOM receives them. Requests from EA units or small teams without dedicated 'corpus' technical support will be treated as a priority. Requests may be submitted from 1 September 2013 to 31 October 2013. Funding decisions rest with the IRCOM steering committee. Requests not handled in 2013 may be handled in 2014. If you have doubts about the eligibility of your project, do not hesitate to contact us so that we can examine your request and adapt our future offers.

To compensate for the wide disparity in the computing skills of the people and working groups producing corpora, IRCOM offers personalized support for corpus finalization. This support will be provided by an IRCOM engineer according to the requests made, adapted to the type of need, whether technical or financial.

The conditions required to propose a corpus for finalization and obtain IRCOM support are:

  • Being able to take all decisions concerning the use and distribution of the corpus (intellectual property in particular).

  • Having all the information concerning the sources of the corpus and the consent of the people recorded or filmed.

  • Granting a right of free use of the data, or at a minimum free access for scientific research.

Requests may concern any type of processing: processing of nearly finalized corpora (conversion, anonymization), alignment of already transcribed corpora, conversion from word-processor formats, digitization of old media. For any request requiring substantial manual work, applicants must commit human or financial resources commensurate with those provided by IRCOM and ORTOLANG.

IRCOM is aware of the exceptional and exploratory nature of this initiative. It should also be recalled that this funding is reserved for corpora that are already largely assembled and cannot be applied to creations from scratch. Because resources are limited, the corpus proposals that are furthest along may be treated as a priority, in agreement with the IRCOM steering committee. There is, however, no 'theoretical' limit on the requests that can be made, as IRCOM can redirect requests outside its competence to other parties.

Responses to this call should be sent to ircom.appel.corpus@gmail.com using the two-page form below. In all cases, IRCOM will send a personalized reply.

Proposals must present the corpora concerned, information on usage and ownership rights, and the nature of the formats or media used.

This call is organized under the responsibility of IRCOM, with joint financial participation from IRCOM and the ORTOLANG EquipEx.

For further information, note that the IRCOM website (http://ircom.corpus-ir.fr) is open and offers resources to the community: a glossary, an inventory of units and corpora, software resources (tutorials, comparisons, conversion tools), working-group activities, training news, and more.

IRCOM invites units to inventory their oral and multimodal corpora (70 projects already listed) to give better visibility to the resources already available, even if they are not all finalized.

The IRCOM steering committee

 

 

Please use this form to respond to the call. Thank you.

Response to the call for finalizing an oral or multimodal corpus

Name of the corpus:

Name of the contact person:

Email address:

Telephone number:

Nature of the corpus data:

Are there recordings?

What medium? Audio, video, other…

What is the total length of the recordings? Number of tapes, number of hours, etc.

What type of medium?

What format (if known)?

Are there transcriptions?

What format? (paper, word processor, transcription software)

What quantity (in hours, number of words, or number of transcriptions)?

Do you have metadata (statement of copyright and usage rights)?

Do you have a precise description of the people recorded?

Do you have informed-consent statements from the people recorded? In (approximately) what year were the recordings made?

What is the language of the recordings?

Does the corpus include recordings of children or of people with a language disorder or pathology?

If so, what population is concerned?

To work efficiently and advise you as quickly as possible, we need examples of the transcriptions or recordings in your possession. We will contact you about this, but you can already send us by email a sample of the data you have (transcriptions, metadata, the address of a web page containing the recordings).

Thank you in advance for your interest in this proposal. For any further information, please contact Martine Toda (martine.toda@ling.cnrs.fr) or ircom.appel.corpus@gmail.com.


5-2-8 Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French

Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French

We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank of spoken French created with the aim of modeling the interface between prosody, syntax and discourse in spoken French is now available at   http://www.projet-rhapsodie.fr/

The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and a 33,000-word corpus), endowed with orthographic and phonetic transcriptions aligned to the sound.

The corpus is representative of different genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive discourse; descriptive, argumentative and procedural samples, variations in planning type).

The corpus samples have mainly been drawn from existing corpora of spoken French and were partially created within the frame of the Rhapsodie project. We would especially like to thank the coordinators of the CFPP2000, PFC, ESLO and C-Prom projects, as well as Piet Mertens, Mathieu Avanzi, Anne Lacheret and Nicolas Obin.

The sound samples (wave, MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations (xml, textgrid) and the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons Attribution - Noncommercial - Share Alike 3.0 France licence.

Microsyntactic annotation will be available soon.

The metadata are also searchable online through a browser.

The prosodic annotation can be explored online through the Rhapsodie Query Language.

Tutorials for transcription, annotation and the Rhapsodie Query Language are available on the site.

 

The Rhapsodie team (Modyco, Université Paris Ouest Nanterre):

Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.

Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).


5-2-9 COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies
======================
CALL for contributions
======================
 
We are pleased to announce the creation of an open-source repository of advanced speech processing algorithms called COVAREP (A Cooperative Voice Analysis Repository for Speech Technologies). COVAREP has been created as a GitHub project (https://github.com/covarep/covarep) where researchers in speech processing can store original implementations of published algorithms.
 
Over the past few decades a vast array of advanced speech processing algorithms has been developed, often offering significant improvements over the existing state of the art. Such algorithms can be quite complex and, hence, difficult to re-implement accurately from article descriptions alone. Another issue is the so-called 'bug magnet effect', with re-implementations frequently differing significantly from the original. The consequence has been that many promising developments are under-exploited or discarded, with researchers tending to stick to conventional analysis methods.
 
By developing the COVAREP repository we are hoping to address this by encouraging authors to include original implementations of their algorithms, thus resulting in a single de facto version for the speech community to refer to.
 
We envisage a range of benefits to the repository:
1) Reproducible research: COVAREP will allow fairer comparison of algorithms in published articles.
2) Encouraged usage: the free availability of these algorithms will encourage researchers from a wide range of speech-related disciplines (both in academia and industry) to exploit them for their own applications.
3) Feedback: as a GitHub project users will be able to offer comments on algorithms, report bugs, suggest improvements etc.
 
SCOPE
We welcome contributions from a wide range of speech processing areas, including (but not limited to): Speech analysis, synthesis, conversion, transformation, enhancement, speech quality, glottal source/voice quality analysis, etc.
 
REQUIREMENTS
In order to achieve a reasonable standard of consistency and homogeneity across algorithms, we have compiled a list of requirements for prospective contributors to the repository. The list is not intended to be so strict as to discourage contributions.
  • Only published work can be added to the repository.
  • The code must be available as open source.
  • Algorithms should be coded in Matlab; however, we strongly encourage authors to make the code compatible with Octave in order to maximize usability.
  • Contributions have to comply with a coding convention (see the GitHub site for the convention and a template), which only normalizes the inputs/outputs and the documentation. There is no restriction on the content of the functions (though comments are obviously encouraged).
 
LICENCE
Getting contributing institutions to agree to a homogeneous IP policy would be close to impossible. As a result, COVAREP is a repository and not a toolbox, and each algorithm has its own licence associated with it. Though flexible to different licence types, contributions need a licence compatible with the repository, i.e. {GPL, LGPL, X11, Apache, MIT} or similar. We encourage contributors to try to obtain LGPL licences from their institutions in order to be more industry friendly.
 
CONTRIBUTE!
We believe that the COVAREP repository has great potential benefit to the speech research community, and we hope that you will consider contributing your published algorithms to it. If you have any questions, comments, issues, etc. regarding COVAREP, please contact us at one of the email addresses below. Please forward this email to others who may be interested.
 
Existing contributions include: algorithms for spectral envelope modelling, adaptive sinusoidal modelling, fundamental frequency / voicing decision / glottal closure instant detection, and methods for detecting non-modal phonation types, etc.
 
Gilles Degottex <degottex@csd.uoc.gr>, John Kane <kanejo@tcd.ie>, Thomas Drugman <thomas.drugman@umons.ac.be>, Tuomo Raitio <tuomo.raitio@aalto.fi>, Stefan Scherer <scherer@ict.usc.edu>
 
 
Back  Top

5-2-10Annotation of “Hannah and her sisters” by Woody Allen.

We have created and made publicly available a dense audio-visual person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and her sisters” by Woody Allen.

The annotation includes

• Face tracks in video (densely annotated, i.e., in each frame, and person-labeled)

• Speech segments in audio (person-labeled)

• Shot boundaries in video



The annotation can be useful for evaluating



• Person-oriented video-based tasks (e.g., face tracking, automatic character naming)

• Person-oriented audio-based tasks (e.g., speaker diarization or recognition)

• Person-oriented multimodal-based tasks (e.g., audio-visual character naming)
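As an illustration of how the person-labeled speech segments might feed such evaluations, here is a minimal sketch of temporal-overlap scoring between reference and hypothesis segments. The scoring protocol and the (start, end) pair representation are assumptions for illustration, not part of the dataset's specification:

```python
def overlap_seconds(ref, hyp):
    """Total temporal overlap (in seconds) between two lists of
    (start, end) segments, e.g. a speaker's annotated speech vs. a
    diarization system's output for the same speaker."""
    total = 0.0
    for r0, r1 in ref:
        for h0, h1 in hyp:
            # Overlap of two intervals is max(0, min(ends) - max(starts)).
            total += max(0.0, min(r1, h1) - max(r0, h0))
    return total

# Reference speech from 0-2 s, hypothesis from 1-3 s: 1 s of overlap.
print(overlap_seconds([(0.0, 2.0)], [(1.0, 3.0)]))  # -> 1.0
```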



Details on the Hannah dataset, and access to it, can be obtained here:

https://research.technicolor.com/rennes/hannah-home/

https://research.technicolor.com/rennes/hannah-download/



Acknowledgments:

This work is supported by the AXES EU project: http://www.axes-project.eu/










Alexey Ozerov Alexey.Ozerov@technicolor.com

Jean-Ronan Vigouroux,

Louis Chevallier

Patrick Pérez

Technicolor Research & Innovation



 

Back  Top

5-2-11French TTS

Text to Speech Synthesis: over an hour of speech synthesis samples from 1968 to 2001 by 25 French, Canadian, US, Belgian, Swedish and Swiss systems.

'33 ans de synthèse de la parole à partir du texte: une promenade sonore (1968-2001)' (33 years of text-to-speech synthesis in French: an audio tour, 1968-2001), by Christophe d'Alessandro. Article published in Traitement Automatique des Langues (TAL, Editions Hermes), Vol. 42, No. 1/2001, pp. 297-321.

Posted at:
http://groupeaa.limsi.fr/corpus:synthese:start

Back  Top

5-2-12Google 's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
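Since the benchmark distributes per-word log-probability values for each baseline model, comparing models on a held-out set reduces to averaging those values. A minimal sketch, assuming base-10 log-probabilities (the actual file layout should be checked against the distributed scripts):

```python
def perplexity(logprobs, base=10.0):
    """Corpus perplexity from per-word log-probabilities.

    logprobs: one log-probability per word of the held-out data.
    Perplexity is base ** (-average log-probability per word).
    """
    avg = sum(logprobs) / len(logprobs)
    return base ** (-avg)

# Example: two words, each with probability 0.1 -> perplexity 10.
print(perplexity([-1.0, -1.0]))  # -> 10.0
```

Lower perplexity means the model assigns higher probability to the held-out text, which is how the pruned and unpruned Katz and Kneser-Ney baselines above would be ranked.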

Back  Top

5-2-13International Standard Language Resource Number (ISLRN) (ELRA Press release)

Press Release - Immediate - Paris, France, December 13, 2013

Establishing the International Standard Language Resource Number (ISLRN)

12 major NLP organisations announce the establishment of the ISLRN, a Persistent Unique Identifier, to be assigned to each Language Resource.

On November 18, 2013, 12 NLP organisations agreed to announce the establishment of the International Standard Language Resource Number (ISLRN), a persistent unique identifier to be assigned to each language resource. Experiment replicability, an essential feature of scientific work, would be enhanced by such a unique identifier. Set up by ELRA, LDC and AFNLP/Oriental-COCOSDA, the ISLRN Portal will provide unique identifiers using a standardised nomenclature, as a free-of-charge service for all language resource providers. It will be supervised by a steering committee composed of representatives of the participating organisations, to be enlarged whenever necessary.

For more information on ELRA and the ISLRN, please contact: Khalid Choukri choukri@elda.org

For more information on ELDA, please contact: Hélène Mazo mazo@elda.org

ELRA

55-57, rue Brillat Savarin

75013 Paris (France)

Tel.: +33 1 43 13 33 33

Fax: +33 1 43 13 33 30

Back  Top

5-2-14Speechocean March 2014 update

Speechocean: a global language resources and data services supplier

Speechocean has over 500 large-scale databases available in more than 110 languages and accents, across desktop, in-car, telephony and tablet PC platforms. Our data repository is large and diversified, including ASR databases, TTS databases, lexica, text corpora, etc.

 

Speechocean is glad to announce that more resources have been released:

ASR Databases

Speechocean provides corpora in more than 110 regional languages, available in a variety of formats, speaking styles, acoustic environments and platforms, covering in-car, mobile phone, fixed-line and desktop speech recognition corpora. This month we released more Asian-language databases intended for tuning and testing speech recognition systems.

1.1 In-Car

Chinese Mandarin Speech Recognition Database ---- (In-Car)-100 Speakers

ID: King-ASR-122

This database was collected in Mainland China. It contains the voices of 100 different native speakers (50 males, 50 females) balanced according to age (18-30: 62 speakers, 31-45: 28, 46-60: 10), gender (50% male, 50% female) and regional accent (Northern 60%, Wu 10%, Xiang 5%, Gan 5%, Kejia 5%, Min 5%, Cantonese 10%).


The script was specially designed to provide material for both training and testing of many classes of speech recognizers; it contains 320 utterances covering 15 categories and 35 sub-categories per speaker (for the detailed script structure, please see the technical document).
Each speaker was recorded in two of three driving conditions (parked, city driving and highway driving), under various recording conditions such as motor running, fan on/off and window up/down. With 320 utterances per speaker across the two environments, 200,796 utterances were recorded in total.

 

Each utterance is stored in a separate file, and each signal file is accompanied by an ASCII SAM label file containing the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included. All the data was transcribed and labeled.


Japanese Speech Recognition Database ---- (In-Car)-800 Speakers

ID: King-ASR-125

This Japanese in-car speech database was collected in Japan and contains the voices of 800 different native speakers, demographically balanced according to age (16-30, 31-45, 46-60), gender (400±5% males, 400±5% females) and dialect region. The script was specially designed to provide material for both training and testing of many classes of speech recognizers; it contains 16 general categories and more than 50 specific sub-categories. Each speaker was recorded in two of three driving environments (parked, city driving and highway driving), under recording conditions such as fan on/off and window up/down. A total of 300 utterances were recorded for each speaker (150 utterances and 10 spontaneous utterances per environment).


Four high quality audio channels (C1: SHURE SM10A, C2: SENNHEISER ME104, F1: AKG Q400, F2: AKG Q400) and three popular cars in the country were used in this recording.


The speech data is stored as uncompressed 16 kHz, 16-bit sequences. Each prompted utterance is stored in a separate file, and each signal file is accompanied by an ASCII SAM label file containing the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included. All the data was transcribed and labeled.
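The uncompressed 16 kHz, 16-bit storage described above can be read with a few lines of standard-library Python. A minimal sketch, assuming headerless little-endian PCM files (the accompanying SAM label file, not parsed here, carries the descriptive metadata):

```python
import array

def read_pcm16(path, sample_rate=16000):
    """Read a headerless little-endian 16-bit PCM file and return
    (samples, duration in seconds)."""
    with open(path, "rb") as f:
        raw = f.read()
    samples = array.array("h")      # signed 16-bit integers
    samples.frombytes(raw)
    # array uses native byte order; swap on big-endian hosts.
    if array.array("h", [1]).tobytes() != b"\x01\x00":
        samples.byteswap()
    return samples, len(samples) / sample_rate
```

For example, a one-second file at 16 kHz would yield 16,000 samples and a duration of 1.0 seconds.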

1.2 Telephony

Japanese Speech Recognition Database ---- Conversation (Telephony)-201 Speakers

ID: King-ASR-055

This Japanese speech recognition database was collected in Japan and contains the voices of 201 different native speakers, demographically balanced according to age (16-28, 29-60), gender and dialect region. The corpus contains 100 pairs of spontaneous dialog speech from the 201 speakers. Each dialog consists of three audio files: one per individual speaker and one from the mixed channel, all recorded simultaneously. The pure recording time of the mixed channel is about 104.8 hours, covering 33 topics.

 

There are 7,009 audio files, saved as uncompressed PCM. All the speech data was transcribed and labeled.

1.3 Mobile

Korean Speech Recognition Database ---- (Mobile) - 1023 Speakers

ID: King-ASR-137

This Korean mobile speech recognition database, collected in Korea, contains the voices of 1023 different native speakers (510±5% males, 513±5% females), balanced according to age (16-30, 31-45, 46-60), gender and regional accent (for details, please see the technical document).


The script was specially designed to provide material for both training and testing of many classes of speech recognizers; it contains 15 categories and 35 sub-categories per speaker (for the detailed script structure, please see the technical document).

Each speaker recorded 300 utterances across two sessions: one quiet (office/home) and one noisy (garden/roadside/restaurant/bus), with 150 utterances, including spontaneous sentences, per session.

Popular mobile phones such as Samsung, Nokia and HTC were used for the collection. The speech data is stored as uncompressed 16 kHz, 16-bit sequences.
Each utterance is stored in a separate file, and each signal file is accompanied by an ASCII SAM label file containing the relevant descriptive information.


A pronunciation lexicon with a phonemic transcription is also included. All the data was transcribed and labeled.

 

Chinese Mandarin Speech Recognition Database ---- Sentences (Mobile) - 5048 Speakers

ID: King-ASR-216

This mobile speech database was collected by Speechocean in a quiet environment in China. It is part of our Speech Data ---- Mobile (SDM) project, which presently covers collections in 30 languages.


The database contains 1,514,028 sentences of Chinese Mandarin speech data from the 5048 speakers, recorded in a quiet environment. The pure recording length is about 2,268 hours. All speakers are native speakers from 14 typical dialect cities covering the seven main dialect regions of China, demographically balanced according to age (16-30, 31-45, 46-60), gender (2,584 males and 2,464 females) and regional accent.

The script was specially designed to provide material for both training and testing of many classes of speech recognizers. Each speaker's script contains 300 sentences randomly selected from a specially designed sentence pool. Each speaker was recorded as naturally as possible in a quiet environment on popular mobile phones (iPhone, HTC, Samsung, Motorola, etc.) covering the iOS, Android and Windows Mobile platforms.

The speech data is stored as uncompressed 16 kHz, 16-bit PCM. All the speech was manually transcribed and labeled. A pronunciation lexicon with a phonemic transcription in Pinyin is also included.

1.4 Desktop

Indonesian Speech Recognition Database ---- Sentences (Desktop)-200 Speakers

ID: King-ASR-061

This Indonesian speech recognition database was collected in Indonesia and contains the voices of 200 different native speakers, demographically balanced according to age (16-30, 31-45, 46-60) and gender. It contains 239,267 audio files with about 460.94 hours of recording.

Each speaker uttered 300 sentences in a quiet office room. All data has been manually proofread and precisely labeled.

Urdu Speech Recognition Database ----Sentences (Desktop)-200 Speakers

ID: King-ASR-063

This Urdu speech recognition database, collected in Pakistan, contains the voices of 200 native speakers, demographically balanced according to age (16-60), gender and dialect region. It contains 241,354 audio files saved as uncompressed PCM. All the speech data was transcribed and labeled.

Vietnamese Speech Recognition Database ----Sentences (Desktop)-200 Speakers

ID: King-ASR-074

This Vietnamese speech recognition database, collected in Vietnam, contains the voices of 200 native speakers, demographically balanced according to age (16-60), gender and dialect region. It contains 263,204 audio files saved as uncompressed PCM. All the speech data was transcribed and labeled.

2. TTS Databases

Speechocean licenses a variety of TTS databases in more than 40 languages, including broadcast speech, emotional speech, etc., which can be used with different synthesis algorithms.

 

European Portuguese Speech Corpus for TTS (Female)

ID: King-TTS-017

The European Portuguese (pt-PT) speech corpus consists of recordings of a native Portuguese female professional broadcaster (32 years old), made in a studio with high SNR (>35 dB) over two channels (a Shure SM15 microphone and an electroglottography (EGG) sensor).

 

The Corpus includes the following sub-corpora:

  1. Sentence sub-corpus: including 3000 short sentences (7-12 words) and 2000 sentences of normal length (13-20 words). The sentences are extracted from daily Portuguese articles (national and international news, lifestyle, travel, and so on), chosen to cover a wide range of linguistic phenomena. Sentences containing political, religious, obscene or pornographic words that might evoke negative emotions are carefully excluded.

  2. Emotional sub-corpus: including 100 exclamatory sentences and 100 interrogative sentences which can be used for emotional TTS study;

  3. Digit sub-corpus: including many kinds of digit data, such as isolated digits, connected digit blocks, and natural and ordinal number readings;

  4. Expression sub-corpus: consisting of general expressions, such as date, time, money and measure expressions;

  5. Spell sub-corpus: including alphabet letters, Greek characters and general abbreviations.

 

All reading prompts were manually revised, and prosody annotations were made according to the real speech. All speech data are segmented and labeled at the phone level. A pronunciation lexicon and pitch extracted from the EGG signal can also be provided on demand.

3. Text Corpora

Speechocean licenses many kinds of text corpora in many languages, well suited to language model training.

ID            Kingline Data Name                                Languages                        Size
King-MT-001   Chinese-English-Korean-Japanese Parallel Corpus   Chinese-English-Korean-Japanese  200,000 sentence pairs
King-MT-005   English-to-Simplified Chinese Dictionary          English-Chinese                  80,000 words
King-MT-010   Japanese-English Place Names                      Japanese-English                 80,000 words
King-NLP-019  SC and TC Chinese Pinyin Database                 Chinese                          2,600,000 words
King-NLP-020  Japanese Phonological Database                    Japanese                         35,000 words
King-NLP-022  Database of Japanese Name Variants                Japanese                         4,000,000 words
King-NLP-023  Japanese Lexical Database                         Japanese                         290,000 words
King-NLP-024  Japanese-English Personal Names                   Japanese                         580,000 words

4. Lexica

Speechocean builds pronunciation lexica in many languages which can be licensed to customers.

ID                Name                                       Entries  Phoneme Set
King-Lexicon-001  Chinese Mandarin Pronunciation Lexicon     211,444  Pinyin
King-Lexicon-002  Canadian French Pronunciation Lexicon      23,000   SAMPA
King-Lexicon-003  Russian Pronunciation Lexicon              139,032  SAMPA
King-Lexicon-004  US English Pronunciation Lexicon           36,000   CMU
King-Lexicon-005  UK English Pronunciation Lexicon           23,000   SAMPA
King-Lexicon-006  Argentina Spanish Pronunciation Lexicon    14,636   SAMPA
King-Lexicon-007  European Spanish Pronunciation Lexicon     31,388   SAMPA
King-Lexicon-008  German Pronunciation Lexicon               80,745   SAMPA
King-Lexicon-009  Cantonese Pronunciation Lexicon            86,364   Jyutping
King-Lexicon-010  Turkish Pronunciation Lexicon              101,950  SAMPA
King-Lexicon-011  European Portuguese Pronunciation Lexicon  23,033   SAMPA
King-Lexicon-012  European French Pronunciation Lexicon      53,000   SAMPA
King-Lexicon-013  Chile Spanish Pronunciation Lexicon        21,884   SAMPA
King-Lexicon-014  Ukrainian Pronunciation Lexicon            37,000   SAMPA
King-Lexicon-015  Danish Pronunciation Lexicon               6,983    SAMPA
King-Lexicon-016  Japanese Pronunciation Lexicon             72,968   Hepburn
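Pronunciation lexica like these are commonly distributed as one word per line with its phoneme string. A minimal loader sketch, assuming a tab-separated 'word<TAB>phonemes' layout (the actual Speechocean file format is not specified here and would need to be checked):

```python
def load_lexicon(path):
    """Load a pronunciation lexicon into {word: [pronunciations]},
    keeping multiple pronunciations per word when present.
    Each pronunciation is a list of phoneme symbols."""
    lex = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, _, pron = line.rstrip("\n").partition("\t")
            if word and pron:
                lex.setdefault(word, []).append(pron.split())
    return lex
```

The resulting dictionary can be used directly to map recognizer or synthesizer word lists to phoneme sequences in whatever phoneme set the lexicon uses (SAMPA, CMU, Pinyin, etc.).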

 

 

Contact Information

Xianfeng Cheng

Business Manager of Commercial Department

Tel: +86-10-62660928; +86-10-62660053 ext.8080

Cell phone: +86 13681432590

Skype: xianfeng.cheng1

Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com

Website: www.speechocean.com

 

Back  Top



