ISCA - International Speech Communication Association



ISCApad #241

Tuesday, July 10, 2018 by Chris Wellekens

5-2 Database
5-2-1 Linguistic Data Consortium (LDC) update (June 2018)

 

In this newsletter:

 

LDC Catalog certified as CoreTrustSeal data repository

 

LDC data and commercial technology development

New Publications:

 

BOLT Chinese SMS/Chat

 

Multi-Language Conversational Telephone Speech 2011 -- Central European

 

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013

 

IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
______________________________________________________________________________

 

LDC Catalog certified as CoreTrustSeal data repository

 

LDC is pleased to announce that the Catalog has been awarded the CoreTrustSeal in recognition of its status as a trustworthy data repository. This means that the Catalog meets a series of standards covering data access, rights management, curation, and storage developed by the ICSU World Data System and the Data Seal of Approval. LDC joins the 136 other certified repositories around the globe in the commitment to promote sustainable and trustworthy data infrastructures.

 

LDC data and commercial technology development

 

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

 

_______________________________________________________________________________

 


New publications:

 

(1) BOLT Chinese SMS/Chat was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Chinese. The corpus contains 14,877 conversations totaling 3,005,810 words across 497,543 messages.

 

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources – discussion forums, text messaging, and chat – in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference. The data in this release was collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants.

 

BOLT Chinese SMS/Chat is distributed via web download.

 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1750.

 

 

 

*

 

 

 

(2) Multi-Language Conversational Telephone Speech 2011 -- Central European was developed by LDC and comprises approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak.

 

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

 

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

 

 

Multi-Language Conversational Telephone Speech 2011 -- Central European is distributed via web download.

 

 

 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $2000.

 

 

 

*

 

 

 

(3) TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010, 2011, 2012, and 2013. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities. Also included are the source documents for the queries, specifically, English newswire, discussion forum, and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16). Also included in this package are the results of an Entity Linking IAA (Inter-Annotator Agreement) study conducted in 2010.

 

TAC KBP encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. English Entity Linking was first conducted as part of the 2009 TAC KBP evaluations. Its goal is to measure systems' ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base (KB) and, if so, to create a link between the two. If there is no matching node for a query entity in the KB, EL systems are required to cluster the mention together with others referencing the same entity.

 

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 is distributed via web download.

 

 

 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1000.

 

*

 

 

 

(4) IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 191 hours of Cebuano conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

 

 

 

The Cebuano speech in this release represents that spoken in the Cebu-North Kana, Sialo, and Mindanao dialect regions of the Philippines. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

 

 

 

IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b is available via web download.

 

 

 

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $25.

 

 


 

 



5-2-2 ELRA - Language Resources Catalogue - Update (February 2018)
ELRA - Language Resources Catalogue - Update
-------------------------------------------------------
We are happy to announce that 1 new Monolingual Lexicon, 1 new Written Corpus and 2 new Speech resources are now available in our catalogue.

ELRA-L0100 French dictionary of definitions (SYNAPSE)
ISLRN: 357-949-964-163-0
The French dictionary of definitions (SYNAPSE) consists of 216,835 entries (147,378 nouns, 80,552 adjectives, 24,001 verbs, 4,677 adverbs, 1,560 prefixes, 107 prepositions, 614 interjections, 147 pronouns, 42 conjunctions, 27 articles), 309,078 definitions and 7,395 phraseological units (phrases). Grammatical information for each entry consists of: grammatical category, gender, number, inflected forms. This dictionary is provided in XML format together with its DTD.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1315

ELRA-W0124 English-Vietnamese Parallel Corpus
ISLRN: 838-483-738-912-8
This is a corpus of 500,000 English-Vietnamese sentence pairs. The parallel corpus contains English documents translated by professional translators into Vietnamese. The source texts include books, dictionaries, newspapers, and online news. The texts are provided in TEI format.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1316

ELRA-S0394 Metalogue Multi-Issue Bargaining Dialogue
ISLRN: 217-906-813-531-9
This corpus consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts. Six unique subjects (undergraduates between 19 and 25 years of age) participated in the collection. The dialogue speech was captured with two headset microphones and saved in 16 kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction. All text is presented in UTF-8 as either plain text or XML.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1317

ELRA-S0395 Nautilus Speaker Characterization (NSC) Corpus

ISLRN: 157-037-166-491-1
This corpus comprises clean microphone recordings of conversational speech from 300 German speakers (126 males and 174 females) aged 18 to 35 years, with no marked dialect/accent. The recordings were performed in an acoustically-isolated room in 2016/2017. Four scripted and four semi-spontaneous dialogs were elicited from the speakers, simulating telephone call inquiries. Additionally, spontaneous neutral and emotional speech utterances and questions were produced. All labels are provided, together with the speech recordings and the speakers' metadata.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1318

 
For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/




 



 


5-2-3 Appen ButlerHill

 

Appen ButlerHill 

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

Language                 Database Type   Catalogue Code  Speakers  Status
Bahasa Indonesia         Conversational  BAH_ASR001      1,002     Available
Bengali                  Conversational  BEN_ASR001      1,000     Available
Bulgarian                Conversational  BUL_ASR001      217       Available shortly
Croatian                 Conversational  CRO_ASR001      200       Available shortly
Dari                     Conversational  DAR_ASR001      500       Available
Dutch                    Conversational  NLD_ASR001      200       Available
Eastern Algerian Arabic  Conversational  EAR_ASR001      496       Available
English (UK)             Conversational  UKE_ASR001      1,150     Available
Farsi/Persian            Scripted        FAR_ASR001      789       Available
Farsi/Persian            Conversational  FAR_ASR002      1,000     Available
French (EU)              Conversational  FRF_ASR001      563       Available
French (EU)              Voicemail       FRF_ASR002      550       Available
German                   Voicemail       DEU_ASR002      890       Available
Hebrew                   Conversational  HEB_ASR001      200       Available shortly
Italian                  Conversational  ITA_ASR003      200       Available shortly
Italian                  Voicemail       ITA_ASR004      550       Available
Kannada                  Conversational  KAN_ASR001      1,000     In development
Pashto                   Conversational  PAS_ASR001      967       Available
Portuguese (EU)          Conversational  PTP_ASR001      200       Available shortly
Romanian                 Conversational  ROM_ASR001      200       Available shortly
Russian                  Conversational  RUS_ASR001      200       Available
Somali                   Conversational  SOM_ASR001      1,000     Available
Spanish (EU)             Voicemail       ESO_ASR002      500       Available
Turkish                  Conversational  TUR_ASR001      200       Available
Urdu                     Conversational  URD_ASR001      1,000     Available

1.2 Wideband

Language           Database Type  Catalogue Code  Speakers  Status
English (US)       Studio         USE_ASR001      200       Available
French (Canadian)  Home/Office    FRC_ASR002      120       Available
German             Studio         DEU_ASR001      127       Available
Thai               Home/Office    THA_ASR001      100       Available
Korean             Home/Office    KOR_ASR001      100       Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

Part-of-speech tagged Lexica providing grammatical and semantic labels

Other reference text-based materials, including spelling/misspelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

Language       Catalogue Code  Words
Arabic         ARB_NER001      500,000
English        ENI_NER001      500,000
Farsi/Persian  FAR_NER001      500,000
Korean         KOR_NER001      500,000
Japanese       JPY_NER001      500,000
Russian        RUS_NER001      500,000
Mandarin       MAN_NER001      500,000
Urdu           URD_NER001      500,000

Description (all corpora): These NER Corpora contain text material from a variety of sources and are tagged for the following Named Entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.

4. Other Language Resources

Morphological Analyzers – Farsi/Persian & Urdu

Arabic Thesaurus

Language Analysis Documentation – multiple languages

 

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low-risk, high-quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and Government organizations.

We would be glad to discuss any customized requests or package configurations and to prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

www.appenbutlerhill.com


5-2-4 OFROM, the first corpus of French from French-speaking Switzerland
We would like to announce the online release of OFROM, the first corpus of French spoken in French-speaking Switzerland (Suisse romande). In its current version, the archive is about 15 hours long. It is transcribed in standard orthography using the Praat software. A concordancer makes it possible to search the corpus and to download the audio excerpts associated with the transcriptions.

To access the data and read a fuller description of the corpus, please visit: http://www.unine.ch/ofrom

5-2-5 Real-world 16-channel noise recordings

We are happy to announce the release of DEMAND, a set of real-world
16-channel noise recordings designed for the evaluation of microphone
array processing techniques.

http://www.irisa.fr/metiss/DEMAND/

1.5 h of noise data were recorded in 18 different indoor and outdoor
environments and are available under the terms of the Creative Commons Attribution-ShareAlike License.

Joachim Thiemann (CNRS - IRISA)
Nobutaka Ito (University of Tokyo)
Emmanuel Vincent (Inria Nancy - Grand Est)


5-2-6 Support for finalising oral or multimodal corpora for distribution, promotion and long-term archiving

 

 

The IRCOM consortium of the TGIR Corpus and the EquipEx ORTOLANG are joining forces to offer technical and financial support for finalising corpora of oral or multimodal data, with a view to their distribution and long-term preservation through the EquipEx ORTOLANG. This call does not cover the creation of new corpora, but the finalisation of existing corpora that are not yet available electronically. By finalisation we mean deposit in a public digital repository and entry into a long-term archiving workflow. In this way, the speech data enriched by your research can be reused, cited and further enriched cumulatively, supporting the development of new knowledge, under the terms of use that you choose (a licence is selected for each corpus deposited).

This call is subject to several conditions (see below), and financial support is limited to 3000 euros per project. Requests will be processed in the order in which IRCOM receives them. Requests from EA research units or from small teams without technical "corpus" support will be given priority. Requests may be submitted from 1 September 2013 to 31 October 2013. Funding decisions will be made by the IRCOM steering committee. Requests not processed in 2013 may be processed in 2014. If you are unsure whether your project is eligible, do not hesitate to contact us so that we can examine your request and adapt our future offers.

To compensate for the wide disparity in computing skills among the individuals and working groups that produce corpora, IRCOM offers personalised support for corpus finalisation. This support will be provided by an IRCOM engineer according to the requests made, and adapted to the type of need, whether technical or financial.

The conditions for proposing a corpus for finalisation and obtaining support from IRCOM are:

  • Being able to take all decisions concerning the use and distribution of the corpus (intellectual property in particular).

  • Having all the information concerning the sources of the corpus and the consent of the people recorded or filmed.

  • Granting free use of the data or, at a minimum, free access for scientific research.

Requests may concern any type of processing: processing of nearly finalised corpora (conversion, anonymisation), alignment of already transcribed corpora, conversion from word-processor formats, digitisation of old media. For any request requiring substantial manual work, applicants must commit human or financial resources matching those provided by IRCOM and ORTOLANG.

IRCOM is aware of the exceptional and exploratory nature of this initiative. It should also be recalled that this funding is reserved for corpora that are already largely constituted and cannot be used for creation from scratch. Because resources are limited, proposals for the corpora that are most advanced may be processed first, in agreement with the IRCOM steering committee. There is, however, no "theoretical" limit on the requests that can be made, since IRCOM can redirect requests outside its remit to other contacts.

Proposals in response to this call should be sent to ircom.appel.corpus@gmail.com. Proposals must use the two-page form below. In all cases, IRCOM will send a personalised reply.

Proposals must describe the corpora submitted, the usage and ownership rights attached to them, and the nature of the formats or media used.

This call is organised under the responsibility of IRCOM, with joint funding from IRCOM and the EquipEx ORTOLANG.

For further information, note that the IRCOM website (http://ircom.corpus-ir.fr) is open and offers resources to the community: glossary, inventory of research units and corpora, software resources (tutorials, comparisons, conversion tools), working group activities, training news, ...

IRCOM invites research units to inventory their oral and multimodal corpora (70 projects already listed) to improve the visibility of the resources already available, even if not all of them are finalised.

The IRCOM steering committee

 

 

Please use this form to respond to the call. Thank you.

Response to the call for the finalisation of oral or multimodal corpora

Corpus name:

Contact person:

Email address:

Telephone number:

Nature of the corpus data:

Are there recordings?

Which medium? Audio, video, other…

What is the total length of the recordings? Number of cassettes, number of hours, etc.

What type of storage medium?

What format (if known)?

Are there transcriptions?

What format? (paper, word processor, transcription software)

How much (in hours, number of words, or number of transcriptions)?

Do you have metadata (statement of copyright and usage rights)?

Do you have a precise description of the people recorded?

Do you have a statement of informed consent for the people who were recorded? In (approximately) which year were the recordings made?

What is the language of the recordings?

Does the corpus include recordings of children or of people with a language disorder or a pathology?

If so, which population is concerned?

 

 

For the sake of efficiency, and so that we can advise you as quickly as possible, we need examples of the transcriptions or recordings in your possession. We will contact you about this, but you may already send us by email a sample of the data you have (transcriptions, metadata, address of a web page containing the recordings).

Thank you in advance for your interest in our proposal. For further information, please contact Martine Toda (martine.toda@ling.cnrs.fr) or write to ircom.appel.corpus@gmail.com.


5-2-7 Rhapsodie: a prosodic and syntactic treebank of spoken French

Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French

We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank of spoken French created to model the interface between prosody, syntax and discourse, is now available at http://www.projet-rhapsodie.fr/

The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and 33,000 words), each provided with orthographic and phonetic transcriptions aligned to the sound.

The corpus is representative of different genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive discourse; descriptive, argumentative and procedural samples, variations in planning type).

The corpus samples have mainly been drawn from existing corpora of spoken French, and some were created within the framework of the Rhapsodie project. We would especially like to thank the coordinators of the CFPP2000, PFC, ESLO and C-Prom projects, as well as Piet Mertens, Mathieu Avanzi, Anne Lacheret and Nicolas Obin.

The sound samples (wave, MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations (xml, textgrid) as well as the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France.

Microsyntactic annotation will be available soon.

The metadata are searchable online through a browser.

The prosodic annotation can be explored online through the Rhapsodie Query Language.

Tutorials on transcription, annotation and the Rhapsodie Query Language are available on the site.

 

The Rhapsodie team (Modyco, Université Paris Ouest Nanterre):

Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.

Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).


5-2-8 Annotation of “Hannah and her sisters” by Woody Allen

We have created and made publicly available a dense audio-visual person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and her sisters” by Woody Allen.

The annotation includes

• Face tracks in video (densely annotated, i.e., in each frame, and person-labeled)

• Speech segments in audio (person-labeled)

• Shot boundaries in video



The annotation can be useful for evaluating



• Person-oriented video-based tasks (e.g., face tracking, automatic character naming, etc.)

• Person-oriented audio-based tasks (e.g., speaker diarization or recognition)

• Person-oriented multimodal-based tasks (e.g., audio-visual character naming)



Details on the Hannah dataset and access to it are available here:

https://research.technicolor.com/rennes/hannah-home/

https://research.technicolor.com/rennes/hannah-download/



Acknowledgments:

This work is supported by the AXES EU project: http://www.axes-project.eu/

Alexey Ozerov (Alexey.Ozerov@technicolor.com)
Jean-Ronan Vigouroux
Louis Chevallier
Patrick Pérez
Technicolor Research & Innovation



 


5-2-9 French TTS

Text to Speech Synthesis: over an hour of speech synthesis samples from 1968 to 2001 by 25 French, Canadian, US, Belgian, Swedish and Swiss systems.

33 ans de synthèse de la parole à partir du texte: une promenade sonore (1968-2001)
(33 years of Text to Speech Synthesis in French: an audio tour (1968-2001))
Christophe d'Alessandro
Article published in Volume 42 - No. 1/2001 issue of Traitement Automatique des Langues (TAL, Editions Hermes), pp. 297-321.

Posted to:
http://groupeaa.limsi.fr/corpus:synthese:start


5-2-10 Google's Language Model benchmark
Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
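
As a rough illustration of how per-word log-probability values like those mentioned in the description might be used, the sketch below converts a list of them into perplexity. The function name `perplexity` and the sample values are hypothetical, and the logarithm base is an assumption: check the benchmark's own documentation for the base its files actually use.

```python
import math
from typing import Iterable


def perplexity(log_probs: Iterable[float], base: float = math.e) -> float:
    """Perplexity of a word sequence from its per-word log-probabilities.

    `base` is the base of the logarithms in `log_probs`. It is an assumption
    here, not something taken from the benchmark files.
    """
    values = list(log_probs)
    if not values:
        raise ValueError("need at least one log-probability")
    # Perplexity is the base raised to the negative average log-probability.
    average = sum(values) / len(values)
    return base ** (-average)


if __name__ == "__main__":
    # Hypothetical values: four words, each assigned probability 0.25.
    sample = [math.log(0.25)] * 4
    print(perplexity(sample))
```

Pooling the values from several held-out sets into one list before calling the function gives a single corpus-level figure, while calling it per set lets the sets be compared.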


5-2-11 ISLRN new portal
Opening of the ISLRN Portal
ELRA, LDC, and AFNLP/Oriental-COCOSDA announce the opening of the ISLRN Portal at www.islrn.org.


Further to the establishment of the International Standard Language Resource Number (ISLRN) as a unique and universal identification schema for Language Resources on November 18, 2013, ELRA, LDC and AFNLP/Oriental-COCOSDA now announce the opening of the ISLRN Portal (www.islrn.org). As a service free of charge for all Language Resource providers and under the supervision of a steering committee composed of representatives of participating organisations, the ISLRN Portal provides unique identifiers using a standardised nomenclature.

Overview
The 13-digit ISLRN format is: XXX-XXX-XXX-XXX-X. It can be allocated to any Language Resource; its composition is neutral and does not include any semantics in reference to the type or nature of the Language Resource. The ISLRN is a randomly generated number with a check digit validated by the Verhoeff algorithm.
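
To make the check-digit idea concrete, here is a minimal sketch of the standard Verhoeff scheme applied to a string in the XXX-XXX-XXX-XXX-X shape. It assumes the check digit is the last digit and covers the twelve preceding ones; the portal's exact procedure is not described here, so identifiers it issues may not validate with this sketch. All function names are hypothetical.

```python
import re

# Multiplication table of the dihedral group D5 (standard Verhoeff table).
_D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6),
    (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8),
    (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2),
    (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4),
    (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
# Position-dependent permutation table (standard Verhoeff table).
_P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
    (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
    (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)
# Inverse of each element in D5.
_INV = (0, 4, 3, 2, 1, 5, 6, 7, 8, 9)

# Shape given in the text: four groups of three digits plus one digit.
_SHAPE = re.compile(r"\d{3}-\d{3}-\d{3}-\d{3}-\d")


def verhoeff_check_digit(payload: str) -> int:
    """Return the Verhoeff check digit to append to a string of digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_is_valid(digits: str) -> bool:
    """True if the digits, check digit last, pass the Verhoeff test."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def format_islrn_like(payload: str) -> str:
    """Append a check digit to 12 digits and format as XXX-XXX-XXX-XXX-X."""
    if not re.fullmatch(r"\d{12}", payload):
        raise ValueError("payload must be exactly 12 digits")
    digits = payload + str(verhoeff_check_digit(payload))
    groups = [digits[0:3], digits[3:6], digits[6:9], digits[9:12], digits[12]]
    return "-".join(groups)


def is_valid_islrn_like(identifier: str) -> bool:
    """Check the shape, then the check digit (assumed to be the last one)."""
    if not _SHAPE.fullmatch(identifier):
        return False
    return verhoeff_is_valid(identifier.replace("-", ""))
```

An identifier built with `format_islrn_like` passes `is_valid_islrn_like`, and changing any single digit makes it fail, since the Verhoeff scheme detects all single-digit errors.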

Two types of external players may interact with the ISLRN Portal: Visitors and Providers. Visitors may browse the web site and search for the ISLRN of a given Language Resource by its name or by its number if it exists. Providers are registered and own credentials. They can request a new ISLRN for a given Language Resource. A provider has the possibility to become certified, after moderation, in order to be able to import metadata in XML format.

The functionalities that can be accessed by Visitors are:

- Identify a language resource according to its ISLRN
- Identify an ISLRN by the name of a language resource
- Get information about ISLRN, FAQ, Basic Metadata, Legal Information
- View last 5 accepted resources (“What’s new” block on home page)
- Sign up to become a provider

The functionalities that can be accessed by Providers, once they have signed up, are:

- Log in
- Request an ISLRN according to the metadata of a given resource
- Request to become a certified provider so as to import XML files containing metadata
- Import one or more metadata descriptions in XML to request ISLRN(s) (only for certified providers)
- Edit pending requests
- Access previous requests
- Contact a Moderator or an Administrator
- Edit the Provider’s own profile

ISLRN requests are handled by moderators within 5 working days.
Contact: islrn@elda.org

Background
The International Standard Language Resource Number (ISLRN) is a unique and universal identification schema for Language Resources which provides each Language Resource with a unique identifier using a standardised nomenclature. It also ensures that Language Resources are correctly identified and, consequently, recognised with proper references for their usage in applications in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, it is a major step in the interconnected world that Human Language Technologies (HLT) has become: unique resources must be identified as such, and meta-catalogues need a common identification format to manage data correctly.

The ISLRN is not intended to replace local and specific identifiers. It is neither a legal deposit nor an obligation, but rather an essential best practice. For instance, a resource that is distributed by several data centres will still have the “local” data-centre identifier but will also have a unique ISLRN.

********************************************************************
About ELRA
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT). To find out more about ELRA, please visit www.elra.info.

About LDC
Founded in 1992, the Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. To find out more about LDC, please visit www.ldc.upenn.edu.

About AFNLP
The mission of the Asian Federation of Natural Language Processing (AFNLP) is to promote and enhance R&D relating to the computational analysis and the automatic processing of all languages of importance to the Asian region by assisting and supporting like-minded organizations and institutions through information sharing, conference organization, research and publication co-ordination, and other forms of support. To find out more about AFNLP, please visit www.afnlp.org.

About Oriental-COCOSDA
The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques, Oriental-COCOSDA, has been established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing, especially for Speech Input/Output. To find out more about Oriental-COCOSDA, please visit our web site: www.cocosda.org


5-2-12 Speechocean – update (July 2018)

Speechocean: A global language resources and data services supplier

About Speechocean

Speechocean is a well-known provider of language resources and services in the fields of Human-Computer Interaction and Human Language Technology. At present, we provide data services in 110+ languages and dialects across the world.

KingLine Data Center: Data Sharing Platform

KingLine Data Center is operated and supervised by Speechocean and focuses on creating and providing language resources for research and development in human language technology.

These diverse corpora are widely used for research and development in Speech Recognition, Speech Synthesis, Natural Language Processing, Machine Translation, Web Search, etc. All corpora are openly accessible to users all over the world, including scientific research institutions, enterprises and individuals.

For more detailed information, please visit our website: http://kingline.speechocean.com

Newly released data:

1. Chinese Mandarin Speech Recognition Corpus (Mobile)-Conversation-1250 Speakers

S.N: King-ASR-408

Details:

The Chinese Mandarin Speech Recognition Corpus was collected in China.

The script contains 625 pairs of spontaneous everyday conversational speech recordings in total, specially designed to provide material for both training and testing of speech recognizers.

This corpus contains the voices of 1,250 different speakers (571 males, 679 females), balanced across age groups (16-30, 31-45, 46-60), gender and regional accents. Each speaker was recorded in a quiet office or home environment.

A mobile platform (Android) was used for speech collection. A pronunciation lexicon with a phonemic transcription in zh-cn_pinyin is available. All audio files were manually transcribed and annotated by native transcribers, and all data were manually checked.

2. Guangdong Cantonese Speech Recognition Corpus (Mobile)-Sentences-1014 Speakers

S.N: King-ASR-241

Details:

The Guangdong Cantonese Speech Recognition Corpus was collected in Guangdong.

The script contains approximately 2,339,859 utterances in total, specially designed to provide material for both training and testing of speech recognizers. Each utterance is stored as a separate, uncompressed waveform file.

This corpus contains the voices of 1,014 different speakers (450 males, 564 females), balanced across age groups (16-30, 31-45, >45), gender and regional accents. Each speaker was recorded in both quiet (office/home) and noisy (restaurant/street) environments.

A mobile platform (Android) was used for speech collection. A pronunciation lexicon with a phonemic transcription in Jyutping is available. All audio files were manually transcribed and annotated by native transcribers, and all data were manually checked.

3. Russian Speech Synthesis Corpus - Male

S.N: King-TTS-020

Details:

Size: 8.12 GB

Recording Hours: 13.69 hours

Parameters: 44.1k, 16bit; 2 Channels
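The three figures above can be cross-checked with simple arithmetic. The sketch below assumes that "44.1k" means a 44.1 kHz sampling rate and that the size is dominated by raw, uncompressed PCM audio; it ignores file headers, label files and the dictionary, and the function name is hypothetical.

```python
def pcm_size_bytes(hours, sample_rate_hz, bits_per_sample, channels):
    """Raw size of uncompressed PCM audio, ignoring file headers."""
    seconds = hours * 3600
    bytes_per_frame = (bits_per_sample // 8) * channels
    return seconds * sample_rate_hz * bytes_per_frame


if __name__ == "__main__":
    raw = pcm_size_bytes(13.69, 44_100, 16, 2)
    # Report in both decimal gigabytes and binary gibibytes.
    print(f"{raw / 1e9:.2f} GB (decimal), {raw / 2**30:.2f} GiB (binary)")
```

Under these assumptions the raw audio comes to roughly 8.7 decimal gigabytes, or about 8.1 when counted in binary units, which is in the same range as the quoted 8.12 GB.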

 

 

 

The Russian Speech Synthesis Corpus contains recordings of one male voice talent, a broadcaster who was 34 years old at the time of recording and who was born and grew up in Moscow.

The corpus contains 9,212 utterances. It was recorded in a professional studio over two channels: waveform and electroglottography (EGG) signal. Speech rate, energy and timbre were strictly controlled during the recording process.

Each utterance was carefully proofread by linguists and stored in Windows uncompressed PCM format. Prosody labeling and phone boundary labeling are included. A pronunciation dictionary is available. All data were manually checked.

4. Pronunciation Lexicon of Loanwords of US English

S.N: King-Lexicon-079

Details:

Entries: 350,000

Phoneme Inventory: Computer Readable IPA (can be converted to other phone sets such as SAMPA or X-SAMPA on demand)

Stress: Included

Syllable Boundary: Included

Contact Information

Xianfeng Cheng
VP
Tel: +86-10-62660928; +86-10-62660053 ext.8080
Mobile: +86 13681432590
Skype: xianfeng.cheng1
Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com
Website: www.speechocean.com


5-2-13 kidLUCID: London UCL Children’s Clear Speech in Interaction Database

We are delighted to announce the availability of a new corpus of spontaneous speech for children aged 9 to 14 years inclusive, produced as part of the ESRC-funded project on ‘Speaker-controlled Variability in Children's Speech in Interaction’ (PI: Valerie Hazan).

Speech recordings (a total of 288 conversations) are available for 96 child participants (46M, 50F, range 9;0 to 15;0 years), all native southern British English speakers. Participants were recorded in pairs while completing the diapix spot-the-difference picture task in which the pair verbally compared two scenes, only one of which was visible to each talker. High-quality digital recordings were made in sound-treated rooms. For each conversation, a stereo audio recording is provided with each speaker on a separate channel together with a Praat Textgrid containing separate word- and phoneme-level segmentations for each speaker.
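Because each talker sits on a separate channel, per-speaker analysis usually starts by de-interleaving the stereo file. The following is a minimal sketch using only the Python standard library. It assumes 16-bit PCM WAV files, which is not confirmed above; the function names are hypothetical; and the demo file is synthetic, with an arbitrary sampling rate, not corpus data.

```python
import array
import io
import struct
import sys
import wave


def split_stereo(wav_source):
    """Return (sample_rate, left, right) for a 16-bit stereo PCM WAV.

    wav_source may be a path or a binary file-like object.
    """
    with wave.open(wav_source, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        rate = wav.getframerate()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV samples are stored little-endian
    # Frames are interleaved: left, right, left, right, ...
    return rate, samples[0::2], samples[1::2]


def make_demo_wav():
    """Build a tiny synthetic stereo WAV in memory (illustration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(44100)  # arbitrary rate chosen for the demo
        wav.writeframes(struct.pack("<6h", 1, -1, 2, -2, 3, -3))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    rate, left, right = split_stereo(make_demo_wav())
    print(rate, left.tolist(), right.tolist())
```

Which channel corresponds to which speaker has to be taken from the corpus documentation; the sketch only separates the two signals.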

There are six recordings per speaker pair made in the following conditions:

  • NOB (No barrier): both speakers heard each other normally

  • VOC (Vocoder): one conversational partner heard the other's speech after it had been processed in real time through a noise-excited three channel vocoder

  • BAB (Babble): one conversational partner heard the other's speech in a background of adult multi-talker babble at an approximate SNR of 0 dB.
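To make the "SNR of 0 dB" in the BAB condition concrete: a signal-to-noise ratio of 0 dB means the speech and the babble have equal average power. The sketch below shows how a noise signal can be scaled to reach a target SNR. It is a generic illustration with hypothetical function names and toy sample values, not the procedure used when recording the corpus.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


def scale_noise_for_snr(speech, noise, target_snr_db):
    """Scale the noise so that speech versus noise has the target SNR."""
    wanted_noise_power = mean_power(speech) / (10 ** (target_snr_db / 10))
    gain = math.sqrt(wanted_noise_power / mean_power(noise))
    return [gain * n for n in noise]


if __name__ == "__main__":
    speech = [1.0, -1.0, 1.0, -1.0]   # toy samples, not real audio
    babble = [0.5, -0.5, 0.5, -0.5]
    scaled = scale_noise_for_snr(speech, babble, 0.0)
    print(snr_db(speech, scaled))
```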

The kidLUCID corpus is available online within the OSCAAR (Online Speech/Corpora Archive and Analysis Resource) archive (https://oscaar.ci.northwestern.edu/). Free access can be requested for research purposes. Further information about the project can be found at: http://www.ucl.ac.uk/pals/research/shaps/research/shaps/research/clear-speech-strategies

This work was supported by Economic and Social Research Council Grant No. RES-062-23-3106.


5-2-14 Robust speech datasets and ASR software tools


We are happy to announce the release of a table of 44 publicly available robust speech processing datasets and a table of 4 ASR software tools on the wiki of ISCA's Robust Speech Processing SIG:
https://wiki.inria.fr/rosp/Datasets#Speech_datasets
https://wiki.inria.fr/rosp/Software#Automatic_speech_recognition

We hope that these tables will promote wider dissemination of the datasets and software tools available in our community and help newcomers select the most suitable dataset or software for a given experiment. We plan to provide additional tables on, e.g., room impulse response datasets or speaker recognition software in the future.

We highly welcome your input, especially additional tables/entries and reproducible baselines for each dataset. It just takes a few minutes thanks to the simple wiki interface.

For more information about joining the SIG and contributing, see
https://wiki.inria.fr/rosp/

Jonathan Le Roux, Emmanuel Vincent, and Ramon Astudillo


5-2-15 International Standard Language Resource Number (ISLRN) implemented by ELRA and LDC

ELRA and LDC partner to implement ISLRN process and assign identifiers to all the Language Resources in their catalogues.

Following the meeting of the largest NLP organizations, the NLP12, and their endorsement of the International Standard Language Resource Number (ISLRN), ELRA and LDC partnered to implement the ISLRN process and to assign identifiers to all the Language Resources (LRs) in their catalogues. The ISLRN web portal was designed to enable the assignment of unique identifiers as a service free of charge for all Language Resource providers. To enhance the use of ISLRN, ELRA and LDC have collaborated to provide the ISLRN 13-digit ID to all the Language Resources distributed in their respective catalogues. Anyone who is searching the ELRA and LDC catalogues can see that each Language Resource is now identified by both the data centre ID and the ISLRN number. All providers and users of such LRs should refer to the latter in their own publications and whenever referring to the LR.

 

ELRA and LDC will continue their joint involvement in ISLRN through active participation in this web service.

 

Visit the ELRA and LDC catalogues, respectively at http://catalogue.elra.info and https://catalog.ldc.upenn.edu

 

Background

The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.

The ISLRN portal can be accessed from http://www.islrn.org.

 

*** About NLP12 ***

Representatives of the major Natural Language Processing and Computational Linguistics organizations met in Paris on 18 November 2013 to harmonize and coordinate their activities within the field.
The results of this coordination are expressed in the Paris Declaration: http://www.elra.info/NLP12-Paris-Declaration.html.

 

*** About ELRA ***
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).
To find out more about ELRA, please visit our web site: http://www.elra.info

*** About LDC ***

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and research laboratories that creates and distributes linguistic resources for language-related education, research and technology development.

To find out more about LDC, please visit our web site: https://www.ldc.upenn.edu


For more information, please contact: admin@islrn.org


5-2-16 OPEN and FREE database for speaker recognition
I am writing to ask you to contribute to the creation
of an OPEN and FREE database
for speaker recognition.

More details and the procedure to follow are given below.

Many thanks,
Anthony Larcher

Recently, a number of laboratories specializing in text-dependent speaker recognition launched the RedDots project.

It is a voluntary initiative funded by the laboratories themselves.
Through a Google Group, the project encourages discussion on speaker recognition,
corpus collection and the use cases specific to this technology.

As part of the RedDots project, the Institute for Infocomm Research (Singapore) has developed an Android application
for recording data on a mobile phone.

This database aims to address some shortcomings of existing corpora:
- cost (some standard databases sell for several thousand euros)
- limited size (the limited number of speakers no longer allows recognition systems to be evaluated in a statistically meaningful way)
- limited variability (data are currently being recorded in more than 5 countries around the world)

We are asking for your help so that we can distribute a database
that is freely available to the whole research community.

How do I take part, and how long does it take?
- register in 2 minutes at the following address
- install the Android application on your phone in 2 minutes, then enter the ID and password that will be sent to you by email
- record a 3-minute session on your phone

Everything takes less than 10 minutes.
One of the main limitations of existing corpora is the small number of sessions
recorded per speaker and the short time span over which those sessions are recorded.
To fill this gap, we hope that each participant will agree to record
several sessions over the coming months.
Ideally, each participant would record 3 or 4 minutes per week for one year.

Where do my data go and what are they used for?
The data are currently sent to a server at the Institute for Infocomm Research
in Singapore, a public research institute.
By registering, you agree that these data may be used for research purposes
only. The data will be made available online free of charge throughout the project.

Thank you for your contribution. Please feel free to circulate this email.
More details will be given shortly in a paper submitted to INTERSPEECH 2015.

Anthony Larcher

5-2-17 ISLRN adopted by Joint Research Center (JRC) of the European Commission

JRC, the EC's Joint Research Centre, an important LR player: First to adopt the ISLRN initiative

 

The Joint Research Centre (JRC), the European Commission's in-house science service, is the first organisation to adopt the International Standard Language Resource Number (ISLRN) initiative and has requested 13-digit ISLRN unique identifiers for its Language Resources (LRs).
Thus, anyone who is using JRC LRs may now refer to this number in their own publications.

 

The current JRC LRs (downloadable from https://ec.europa.eu/jrc/en/language-technologies) with an ISLRN ID are:

 

 

Background

The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.
The ISLRN portal can be accessed from http://www.islrn.org.

 

*** About the JRC ***

As the Commission's in-house science service, the Joint Research Centre's mission is to provide EU policies with independent, evidence-based scientific and technical support throughout the whole policy cycle.
Within its research in the field of global security and crisis management, the JRC develops open source intelligence and analysis systems that can automatically harvest and analyse a huge amount of multi-lingual information from the internet-based sources. In this context, the JRC has developed Language Technology resources and tools that can be used for highly multilingual text analysis and cross-lingual applications.
To find out more about JRC's research in open source information monitoring, please visit https://ec.europa.eu/jrc/en/research-topic/internet-surveillance-systems. To access media monitoring applications directly, go to http://emm.newsbrief.eu/overview.html.

 

*** About ELRA ***
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).
To find out more about ELRA, please visit our web site: http://www.elra.info

For more information, contact admin@islrn.org
 


5-2-18 Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, and Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings were made under high-quality conditions, and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/


5-2-19 Audio and Electroglottographic speech recordings

 

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/ San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)


5-2-20 Press release: Opening of the ELRA License Wizard

Press Release - Immediate - Paris, France, April 2, 2015

Opening of the ELRA License Wizard

ELRA announces the opening of the License Wizard @ http://wizard.elda.org.

ELRA  is deploying a License Wizard to:

  • support the right-holders in finding the appropriate licenses under which to share/distribute their Language Resources, and
  • clarify the legal obligations applicable in various licensing situations.

Currently, the License Wizard allows the user to choose among several licenses that exist for the use of Language Resources: ELRA, Creative Commons and META-SHARE.
More will be added.

The License Wizard works as a web configurator that helps Right Holders/Users:

- to select a number of legal features and obtain the user license adapted to their selection.
- to define which user licenses they would like to select in order to distribute their Language Resources.
- to integrate the user license terms into a Distribution Agreement that could be proposed to ELRA or META-SHARE for further distribution through the ELRA Catalogue of Language Resources (http://catalogue.elra.info) or META-SHARE (www.meta-share.eu).

Background
From the very beginning, ELRA has come across all types of legal issues that arise when exchanging and sharing Language Resources. The association has devoted huge efforts to streamline the licensing processes while continuously monitoring the impacts of regulation changes on the HLT community activities. The first major step was to come up with a few licenses for both the research and the industrial sectors to use the resources available within the ELRA catalogue. Recently, its strong involvement in the META-SHARE infrastructure led to designing and drafting a small set of licenses, inspired by the ELRA licenses but also accounting for the new trends of permissive licenses and free resources, represented in particular by the Creative Commons.


5-2-21 EEG, face tracking and audio: 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

 

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

 

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

 

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

 

Best regards,

Frank

 

 

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane, Australia.

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

 


5-2-22 TORGO database free for academic use

In the spirit of the season, I would like to announce the immediate availability of the TORGO database free, in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition, for example. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.


5-2-23 Datatang

Datatang is a leading global data provider specializing in customized data solutions, with a focus on collection, annotation and crowdsourcing services for a variety of speech, image and text data.

1. Speech data collection

2. Speech data synthesis

3. Speech data transcription

I have attached our company introduction for reference. Our available speech data sets are as follows:

  • US English Speech Data: 300 people, about 200 hours
  • Uyghur Speech Data: 2,500 people, about 1,000 hours
  • German Speech Data: 100 people, about 40 hours
  • French Speech Data: 100 people, about 40 hours
  • Spanish Speech Data: 100 people, about 40 hours
  • Korean Speech Data: 100 people, about 40 hours
  • Italian Speech Data: 100 people, about 40 hours
  • Thai Speech Data: 100 people, about 40 hours
  • Portuguese Speech Data: 300 people, about 100 hours
  • Chinese Mandarin Speech Data: 4,000 people, about 1,200 hours
  • Chinese Speaking English Speech Data: 3,700 people, 720 hours
  • Cantonese Speech Data: 5,000 people, about 1,400 hours
  • Japanese Speech Data: 800 people, about 270 hours
  • Chinese Mandarin In-car Speech Data: 690 people, about 245 hours
  • Shanghai Dialect Speech Data: 2,500 people, about 1,000 hours
  • Southern Fujian Dialect Speech Data: 2,500 people, about 1,000 hours
  • Sichuan Dialect Speech Data: 2,500 people, about 860 hours
  • Henan Dialect Speech Data: 400 people, about 150 hours
  • Northeastern Dialect Speech Data: 300 people, 80 hours
  • Suzhou Dialect Speech Data: 270 people, about 110 hours
  • Hangzhou Dialect Speech Data: 400 people, about 170 hours
  • Non-Native Speaking Chinese Speech Data: 1,100 people, about 73 hours
  • Real-world Call Center Chinese Speech Data: more than 5,000 people, 650 hours
  • Mobile-end Real-world Voice Assistant Chinese Speech Data: more than 2,000,000 people, 4,000 hours
  • Heavy Accent Chinese Speech Data: 2,000 people, more than 1,000 hours

If any of these datasets are of particular interest to you, we can also provide samples and pricing information.

Regards,

Runze Zhao
zhaorunze@datatang.com
Overseas Sales Manager | Datatang Technology

China
M: +86 185 1698 2583
18 Zhongguancun St.
Kemao Building Tower B 18F
Beijing 100190

US
M: +1 617 763 4722
640 W California Ave, Suite 210
Sunnyvale, CA 94086



5-2-24 The International Standard Language Resource Number (ISLRN) assigned to LRE Map 2016 Language Resources

Press Release - Immediate - Paris, France, June 8, 2017

As a follow-up to the LRE Map 2016 initiative, ELRA has processed the information on existing and newly-created Language Resources provided by the authors who submitted papers to the LREC 2016 Conference (http://lrec2016.lrec-conf.org). In order to increase the visibility of these resources, ELRA has allocated ISLRNs to 106 of the submitted language resources. They are distributed as follows:

  • 73 Corpora (written, spoken, multimodal)
  • 30 Lexicons (including ontologies)
  • 3 Evaluation Data

The meta-information for these language resources is also available on the ISLRN website, which has a broad international audience.

Background

As part of an international effort to document and archive the various language resource development efforts around the world, a system of assigning ISLRNs was established in November 2013. The ISLRN is a unique persistent identifier to be assigned to each language resource. The establishment of ISLRNs was a major step in the networked and shared world of human language technologies. Unique resources must be identified as such, and meta-catalogues require a common identification format to manage data correctly. Therefore, language resources should carry identical identification schemes independent of their representations, whatever their types and wherever their physical locations (on hard drives, internet or intranet). Visit: http://islrn.org/.

About LRE Map

Initiated by ELRA and FlareNet at LREC 2010 (http://www.lrec-conf.org), the LRE Map is a mechanism intended to monitor the use and creation of language resources by collecting information on both existing and newly-created resources during the submission process. Apart from providing a portrait of the resources behind the community, of their uses and usability, the LRE Map is intended to be a measuring instrument for monitoring the field of language resources. The feature has been so successful that it has also been implemented at other major conferences such as COLING, IJCNLP, Interspeech, LTC, ACLHT, O-COCOSDA and RANLP, in addition to the LRE Journal. Visit: http://lremap.elra.info

About ELRA

The European Language Resources Association (ELRA) is a non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting human language technologies.

To find out more about ELRA, please visit the website: http://www.elra.info

Contact: info@elda.org

The International Standard Language Resource Number (ISLRN) is becoming an increasingly widespread persistent identifier

 

Since the ISLRN was deployed three years ago, the number of Language Resources (LRs) that have been allocated an ISLRN has grown significantly, to more than 2,500. These LRs include raw and annotated corpora, lexicons and dictionaries, speech resources (conversational, synthesis, etc.), evaluation sets and multimodal resources, and cover 219 distinct languages (including sign languages).

 

First of all, the ISLRN system has been endorsed by two large data centers, namely ELRA (European Language Resources Association) and LDC (Linguistic Data Consortium), which have teamed up to jointly maintain the assignment process. Other significant contributions come from institutions such as the Joint Research Centre (JRC), the Resource Management Agency (RMA), and the Institute for Applied Linguistics (IULA) at the Universitat Pompeu Fabra (UPF).

 

Moreover, authors are invited to cite the ISLRN of each Language Resource they refer to in the papers they submit to LREC Conferences, which makes the persistent identifier a key element of the LR citation process.

 

 

 

Background

 

As part of an international effort to document and archive the various Language Resource development efforts around the world, a system for assigning ISLRNs was established in November 2013 and deployed in April 2014. The ISLRN is a unique persistent identifier to be assigned to each Language Resource. The establishment of ISLRNs was a major step in the networked and shared world of Human Language Technologies. Unique resources must be identified as they are, and meta-catalogues require a common identification format to manage data correctly. Therefore, Language Resources should carry identical identification schemes regardless of their representations, their types and their storage location (hard drives, internet or intranet) (http://islrn.org/).

 

About ELRA

 

The European Language Resources Association (ELRA) is a non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for Language Resources and promoting Human Language Technologies.

 

To find out more about ELRA, please visit the website: http://www.elra.info

 

References

 

LDC: https://www.ldc.upenn.edu

 

JRC: https://ec.europa.eu/jrc/en/research-topic/internet-surveillance-systems

 

RMA: http://rma.nwu.ac.za

 

UPF: http://www.iula.upf.edu  and https://www.upf.edu/web/universitat

 

LREC Conferences: www.lrec-conf.org

 

 

 

Read also: Valérie Mapelli, Vladimir Popescu, Lin Liu and Khalid Choukri, Language Resource Citation: the ISLRN Dissemination and Further Developments, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia: http://www.lrec-conf.org/proceedings/lrec2016/summaries/1251.html

 

 

 

Contact: mapelli@elda.org




5-2-25Fearless Steps Corpus (University of Texas, Dallas)

Fearless Steps Corpus

John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu
Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.


NASA’s Apollo program was one of mankind's great achievements of the 20th century. CRSS, UT-Dallas has undertaken an enormous Apollo data digitization initiative, which aims to digitize Apollo mission speech data (~100,000 hours) and to develop Spoken Language Technology (SLT) algorithms to analyze and understand various aspects of conversational speech. To this end, a new 30-track analog audio decoder was designed to decode the 30-track Apollo analog tapes and was mounted on the NASA Soundscriber analog audio decoder (in place of the single-channel decoder). With the new decoder, all 30 channels of data can be decoded simultaneously, which reduces the digitization time significantly.
We have digitized 19,000 hours of data from Apollo missions (including the entire Apollo-11 mission and most of the Apollo-13, Apollo-1, and Gemini-8 missions). This audio archive is named the “Fearless Steps Corpus”. It is one of the largest and most distinctive naturalistic audio corpora of its kind. Automated transcripts are generated with a custom, Apollo-mission-specific Deep Neural Network (DNN) based Automatic Speech Recognition (ASR) system, together with Apollo-mission-specific language models. A Speaker Identification (SID) system has been designed to identify the speakers. A complete diarization pipeline has been established to study and develop various SLT tasks.
We will release this corpus for public use as part of our public outreach, and we encourage the SLT community to use this opportunity to build naturalistic spoken language technology systems. The data provides ample opportunity to set up challenging tasks in various SLT areas. As part of this outreach, we will be organizing the “Fearless Challenge” at the upcoming INTERSPEECH 2018. We will define and propose five tasks as part of this challenge. The guidelines and challenge data will be released in Spring 2018 and will be available to download free of charge. The five tasks are: (1) Automatic Speech Recognition; (2) Speaker Identification; (3) Speech Activity Detection; (4) Speaker Diarization; (5) Keyword Spotting and Joint Topic/Sentiment Detection.
We look forward to your participation (John.Hansen@utdallas.edu).


5-2-26SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, intended for building TTS systems and for investigating multiple speaking styles and emphasis. A total of 9750 utterances from various sources, such as parliament debates and novels, were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
 

5-2-27The new ELRA Catalogue of Language Resources (February 2018)

Press Release - Immediate
Paris, France, February 14, 2018


The new Catalogue of Language Resources is now open.

ELRA is happy to publicly release a new version of its Catalogue of Language Resources.

Completely redesigned, with a new interface and improved navigation, the new Catalogue gives visitors easier access to the 1075 Language Resources (LRs) and their corresponding descriptions. Among the new features, the Catalogue now offers extended metadata to describe the LRs and a refined search over the Catalogue data for finding more specific information using criteria such as language, resource or media type, and license.

Currently, LRs can be selected and placed in a cart, from which the user can send a request for quotation to initiate the order. Once logged in, the user selects LRs and obtains distribution details (licensing information, prices) depending on his/her user status: ELRA member or non-member, research or commercial organization. The full e-commerce integration will be completed at a later stage.

More functionalities pertaining to the ELRA Catalogue, including the ISLRN automatic submission and the e-licensing module (automatic filling in and electronic signature), will be developed and integrated.

Please visit this new version of the Catalogue here: http://catalogue.elra.info

*** About ELRA ***
The European Language Resources Association (ELRA) is a non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).
To find out more about ELRA, please visit: http://www.elra.info.





© Copyright 2024 - ISCA International Speech Communication Association - All rights reserved.
