ISCA Services

ISCA - International Speech Communication Association


ISCApad #209

Wednesday, November 11, 2015 by Chris Wellekens

5-2 Database

5-2-1

ELRA - Language Resources Catalogue - Update (2015-05) dedicated to the Nepali people.

In response to the devastating April 2015 earthquake in Nepal, ELRA would like to make its Nepali corpora available free of charge.

Originally available for research purposes only in the ELRA Catalogue, these Language Resources (2 Nepali Written Corpora and 1 Speech Corpus) will be provided at no cost to those working, on a not-for-profit basis, on the development of systems and applications to be used during the reconstruction phase in Nepal.

If you feel that ELRA can help in other ways please let us know.

ELRA-W0076 Nepali Monolingual written corpus
ISLRN: 325-796-965-405-9

The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) is a collection of Nepali written texts from 15 different genres, in samples of 2,000 words each, published between 1990 and 1992. It is based on the FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as the web, newspapers, books, publishers and authors. It contains 1,400,000 words.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1216

ELRA-W0077 English-Nepali Parallel Corpus
ISLRN: 853-487-663-161-6

This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali).
For more information, see http://catalog.elra.info/product_info.php?products_id=1217

ELRA-S0368 Nepali Spoken Corpus
ISLRN: 688-800-566-571-0

The Nepali Spoken Corpus contains audio recordings of different social activities, made as far as possible in their natural settings, with phonologically transcribed and annotated texts and information about the participants. A total of 17 types of activity were recorded. The total duration of the recorded material is 31 hours and 26 minutes.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1219

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/


5-2-2

LDC Newsletter (October 2015)

In this newsletter:

Fall 2015 LDC Data Scholarship recipients

New publications:

ACE 2007 Spanish DevTest - Pilot Evaluation

GALE Phase 4 Chinese Broadcast News Parallel Sentences

Karlsruhe Children's Text

Fall 2015 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Fall 2015 data scholarships:

Anthony Beylerian - Keio University (Japan), MSc, Informatics and Computer Science. Anthony has been awarded a copy of OntoNotes for his work in word sense disambiguation.

Siti Binte Faizal - Newcastle University (UK), PhD candidate, Speech and Language Sciences. Siti has been awarded a copy of Levantine Arabic QT Training Speech and Text for her work in psycholinguistics.

Sara El-Kafrawy - Ain Shams University (Egypt), MSc candidate, Computer and Information Sciences. Sara has been awarded a copy of GALE Arabic English Word Alignment and Arabic Gigaword for her work in machine translation.

Marwa Hadj Salah - University of Sfax (Tunisia), PhD candidate, Computer Science. Marwa has been awarded a copy of Arabic English Parallel News and Arabic News Translation Text for her work in machine translation.

Tomoaki Goto - University of Tokyo (Japan), PhD candidate, Linguistics. Tomoaki has been awarded a copy of Arabic Newswire English Translation for his work in syntax.

Richard Metzger - Pennsylvania State University (USA), PhD candidate, Electrical Engineering. Richard has been awarded a copy of 2008 NIST Speaker Recognition Training Part 2 and Test for his work in speaker recognition.

Jun Ren - Massey University (New Zealand), PhD, Engineering. Jun has been awarded a copy of TORGO Dysarthric Articulation for his work in speaker recognition.

Gozde Sahin - Istanbul Technical University (Turkey), PhD candidate, Computer Engineering and Informatics. Gozde has been awarded a copy of 2009 CoNLL Parts 1 and 2 for her work in semantic role labeling.

Alexey Sholokhov - University of Eastern Finland (Finland), PhD candidate, Computer Sciences. Alexey has been awarded a copy of RATS Speech Activity Detection for his work in speaker verification.

Stefan Watson - University of the West Indies (Jamaica), PhD candidate, Physics. Stefan has been awarded a copy of CMU Kids for his work in phonology and speech recognition.

For program information visit the Data Scholarship page.

New publications

(1) ACE 2007 Spanish DevTest - Pilot Evaluation was developed by LDC. This publication contains the complete set of Spanish development and test data to support the 2007 Automatic Content Extraction (ACE) technology evaluation, namely, newswire data annotated for entities and temporal expressions.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

LDC has also released ACE 2007 Multilingual Training Corpus (LDC2014T18) which contains the Arabic and Spanish training data used in the 2007 evaluation.

The data consists of newswire material published in May 2005 from the following sources: Agence France-Presse, The Associated Press and Xinhua News Agency.

All files were annotated by two human annotators working independently. Discrepancies between the two annotations were adjudicated by a senior team member resulting in a gold standard file.

There are three annotation directories for each newswire story that contain an identical copy of the source text in SGML format and two associated annotated versions in XML format and tab delimited format. All text is UTF-8 encoded.

ACE 2007 Spanish DevTest - Pilot Evaluation is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1000.

(2) GALE Phase 4 Chinese Broadcast News Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from broadcast news data collected by LDC in 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 4 Chinese Broadcast News Parallel Sentences includes 40 source-translation document pairs, comprising 156,429 tokens of Chinese source text and its English translation. Data is drawn from eight distinct Chinese programs broadcast in 2008 from China Central TV, a national and international broadcaster in Mainland China; and Voice of America, a U.S. government-funded broadcast programmer. The programs in this release feature news programs on current events topics.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
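Because TDF files are plain tab-delimited text, they can be read with standard tools. The following is a minimal Python sketch, not LDC tooling: the function name read_tdf_rows and the sample rows are hypothetical, and the columns are left unnamed because their meaning is defined in TDF_format.txt, which is not reproduced here. It assumes one segment per line and treats lines starting with ";;" as comments; that prefix is an assumption to check against the corpus documentation.

```python
import csv
import io
from typing import Iterator, List, TextIO


def read_tdf_rows(handle: TextIO, comment_prefix: str = ";;") -> Iterator[List[str]]:
    """Yield one list of fields per non-empty, non-comment line.

    The ";;" comment prefix is an assumption, not taken from the announcement.
    """
    # QUOTE_NONE keeps quotation marks in the text as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or (comment_prefix and row[0].startswith(comment_prefix)):
            continue
        yield row


# Hypothetical two-segment sample; the real field layout is in TDF_format.txt.
sample = io.StringIO(
    ";; made-up header line\n"
    "file_a\t0.00\t2.50\tfirst segment text\n"
    "file_a\t2.50\t5.00\tsecond segment text\n"
)
rows = list(read_tdf_rows(sample))
print(len(rows), "segments;", len(rows[0]), "fields in the first one")
```

For a file on disk, the handle would typically be opened with open(path, encoding="utf-8", newline=""), since the release states that all data are UTF-8 encoded.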

GALE Phase 4 Chinese Broadcast News Parallel Sentences is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

(3) Karlsruhe Children's Text was developed by the Cooperative State University Baden-Württemberg, University of Education and Karlsruhe Institute of Technology. It consists of over 14,000 freely written German sentences from more than 1,700 school children in grades one through eight.

The data collection was conducted in 2011-2013 at elementary and secondary schools in and around Karlsruhe, Germany. Students were asked to write as verbose a text as possible. Those in grades one to four were read two stories and were then asked to write their own stories. Students in grades five through eight were instructed to write on a specific theme, such as "Imagine the world in 20 years. What has changed?". The goal of the collection was to use the data to develop a spelling error classification system.

Annotators converted the handwritten text into digital form with all errors committed by the writers; they also created an orthographically correct version of every sentence. Metadata about the text was gathered, including the circumstances under which it was collected, information about the student writer and background about spelling lessons in the particular class. In a second step, the students' spelling errors were annotated into general groupings: grapheme level, syllable level, morphology and syntax. The files were anonymized in a third step.

This release also contains metadata regarding the writers’ language biography, teaching methodology, age, gender and school year. The average age of the participants was 11 years, and the gender distribution was nearly equal. Original handwriting is presented as JPEG format image files and the converted annotated text as UTF-8 plain text. Metadata is contained within each text file.

Karlsruhe Children's Text is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$750.


5-2-3

Appen ButlerHill

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

Language	Database Type	Catalogue Code	Speakers	Status
Bahasa Indonesia	Conversational	BAH_ASR001	1,002	Available
Bengali	Conversational	BEN_ASR001	1,000	Available
Bulgarian	Conversational	BUL_ASR001	217	Available shortly
Croatian	Conversational	CRO_ASR001	200	Available shortly
Dari	Conversational	DAR_ASR001	500	Available
Dutch	Conversational	NLD_ASR001	200	Available
Eastern Algerian Arabic	Conversational	EAR_ASR001	496	Available
English (UK)	Conversational	UKE_ASR001	1,150	Available
Farsi/Persian	Scripted	FAR_ASR001	789	Available
Farsi/Persian	Conversational	FAR_ASR002	1,000	Available
French (EU)	Conversational	FRF_ASR001	563	Available
French (EU)	Voicemail	FRF_ASR002	550	Available
German	Voicemail	DEU_ASR002	890	Available
Hebrew	Conversational	HEB_ASR001	200	Available shortly
Italian	Conversational	ITA_ASR003	200	Available shortly
Italian	Voicemail	ITA_ASR004	550	Available
Kannada	Conversational	KAN_ASR001	1,000	In development
Pashto	Conversational	PAS_ASR001	967	Available
Portuguese (EU)	Conversational	PTP_ASR001	200	Available shortly
Romanian	Conversational	ROM_ASR001	200	Available shortly
Russian	Conversational	RUS_ASR001	200	Available
Somali	Conversational	SOM_ASR001	1,000	Available
Spanish (EU)	Voicemail	ESO_ASR002	500	Available
Turkish	Conversational	TUR_ASR001	200	Available
Urdu	Conversational	URD_ASR001	1,000	Available

1.2 Wideband

Language	Database Type	Catalogue Code	Speakers	Status
English (US)	Studio	USE_ASR001	200	Available
French (Canadian)	Home/Office	FRC_ASR002	120	Available
German	Studio	DEU_ASR001	127	Available
Thai	Home/Office	THA_ASR001	100	Available
Korean	Home/Office	KOR_ASR001	100	Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

- Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

- Part-of-speech tagged Lexica providing grammatical and semantic labels

- Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

These NER Corpora contain text material from a variety of sources and are tagged for the following Named Entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.

Language	Catalogue Code	Words
Arabic	ARB_NER001	500,000
English	ENI_NER001	500,000
Farsi/Persian	FAR_NER001	500,000
Korean	KOR_NER001	500,000
Japanese	JPY_NER001	500,000
Russian	RUS_NER001	500,000
Mandarin	MAN_NER001	500,000
Urdu	URD_NER001	500,000

4. Other Language Resources

- Morphological Analyzers: Farsi/Persian & Urdu

- Arabic Thesaurus

- Language Analysis Documentation: multiple languages

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low risk, high quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and Government organizations.

We would be glad to discuss any customized requests or package configurations and prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

www.appenbutlerhill.com


5-2-4

OFROM: the first corpus of French from French-speaking Switzerland

We would like to announce the online release of OFROM, the first corpus of French spoken in French-speaking Switzerland (Suisse romande). In its current version, the archive is about 15 hours long. It is transcribed in standard orthography using the Praat software. A concordancer makes it possible to search the corpus and to download the audio excerpts associated with the transcriptions.

To access the data and read a fuller description of the corpus, please visit: http://www.unine.ch/ofrom.


5-2-5

Real-world 16-channel noise recordings

We are happy to announce the release of DEMAND, a set of real-world
16-channel noise recordings designed for the evaluation of microphone
array processing techniques.

http://www.irisa.fr/metiss/DEMAND/

1.5 h of noise data were recorded in 18 different indoor and outdoor
environments and are available under the terms of the Creative Commons Attribution-ShareAlike License.

Joachim Thiemann (CNRS - IRISA)
Nobutaka Ito (University of Tokyo)
Emmanuel Vincent (Inria Nancy - Grand Est)


5-2-6

Support for finalizing spoken or multimodal corpora for distribution, promotion and long-term deposit

The IRCOM consortium of the TGIR Corpus and the EquipEx ORTOLANG are joining forces to offer technical and financial support for finalizing corpora of spoken or multimodal data, with a view to distributing and preserving them through the EquipEx ORTOLANG. This call does not cover the creation of new corpora, only the finalization of existing corpora that are not yet available electronically. By finalization, we mean deposit in a public digital repository and entry into a long-term archiving workflow. In this way, the speech data enriched by your research can be reused, cited and further enriched cumulatively, supporting the development of new knowledge under the terms of use that you choose (a usage licence is selected for each deposited corpus).

This call is subject to several conditions (see below), and financial support is limited to 3000 euros per project. Applications will be processed in the order in which IRCOM receives them. Applications from EA research units or from small teams without dedicated corpus technical support will be given priority. Applications must be submitted between 1 September 2013 and 31 October 2013. Funding decisions rest with the IRCOM steering committee. Applications not processed in 2013 may be processed in 2014. If you are unsure whether your project is eligible, please contact us so that we can review your request and adapt our future offers.

To compensate for the wide disparity in computing skills among the people and working groups that produce corpora, IRCOM offers personalized support for corpus finalization. An IRCOM engineer will provide this support according to the requests made, adapted to the type of need, whether technical or financial.

To propose a corpus for finalization and obtain support from IRCOM, applicants must:

- Be able to make all decisions concerning the use and distribution of the corpus (intellectual property in particular).
- Have all information concerning the sources of the corpora and the consent of the people recorded or filmed.
- Grant free use of the data or, at a minimum, free access for scientific research.

Applications may concern any type of processing: processing of nearly finalized corpora (conversion, anonymization), alignment of already transcribed corpora, conversion from word-processor formats, digitization of old media. For any request requiring substantial manual work, applicants will have to commit human or financial resources matching those provided by IRCOM and ORTOLANG.

IRCOM is aware that this initiative is exceptional and exploratory. Note also that this funding is reserved for corpora that are already largely built and cannot be used to create corpora from scratch. Because resources are limited, proposals for the corpora that are furthest along may be handled first, in agreement with the IRCOM steering committee. There is, however, no "theoretical" limit on the requests that can be made, since IRCOM can redirect requests that fall outside its remit to other contacts.

Responses to this call should be sent to ircom.appel.corpus@gmail.com. Proposals must use the two-page form below. In all cases, IRCOM will send a personalized reply.

Proposals must describe the corpora concerned, the usage and ownership rights, and the formats or media used.

This call is organized under the responsibility of IRCOM, with joint funding from IRCOM and the EquipEx ORTOLANG.

For further information, note that the IRCOM website (http://ircom.corpus-ir.fr) is open and offers resources to the community: a glossary, an inventory of research units and corpora, software resources (tutorials, comparisons, conversion tools), working-group activities, training news, and more.

IRCOM invites research units to inventory their spoken and multimodal corpora (70 projects have already been listed) to improve the visibility of resources that are already available, even if not all of them are finalized.

The IRCOM steering committee

Please use this form to respond to the call. Thank you.

Response to the call for finalization of a spoken or multimodal corpus

Corpus name:

Contact person:

Email address:

Telephone number:

Nature of the corpus data:

Are there recordings?

Which medium? Audio, video, other…

What is the total length of the recordings? Number of cassettes, number of hours, etc.

What type of storage medium?

What format (if known)?

Are there transcriptions?

What format? (paper, word processor, transcription software)

What quantity (in hours, number of words, or number of transcriptions)?

Do you have metadata (statement of copyright and usage rights)?

Do you have a precise description of the people recorded?

Do you have a statement of informed consent from the people recorded? In approximately what year were the recordings made?

What is the language of the recordings?

Does the corpus include recordings of children or of people with a language disorder or other pathology?

If so, which population?

To work efficiently and advise you as quickly as possible, we need examples of the transcriptions or recordings in your possession. We will contact you about this, but you may already email us a sample of your data (transcriptions, metadata, address of a web page containing the recordings).

Thank you in advance for your interest in our proposal. For further information, please contact Martine Toda (martine.toda@ling.cnrs.fr) or write to ircom.appel.corpus@gmail.com.


5-2-7

Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French

We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank of spoken French created with the aim of modeling the interface between prosody, syntax and discourse in spoken French, is now available at http://www.projet-rhapsodie.fr/

The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and a 33,000-word corpus), each with an orthographic, phoneme-aligned transcription.

The corpus is representative of different genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive discourse; descriptive, argumentative and procedural samples, variations in planning type).

The corpus samples have been mainly drawn from existing corpora of spoken French and partially created within the framework of the Rhapsodie project. We would especially like to thank the coordinators of the CFPP2000, PFC, ESLO and C-Prom projects as well as Piet Mertens, Mathieu Avanzi, Anne Lacheret and Nicolas Obin.

The sound samples (waves, MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations (xml, textgrid) as well as the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France.

Microsyntactic annotation will be available soon.

The metadata are searchable on line through a browser.

The prosodic annotation can be explored on line through the Rhapsodie Query Language.

Tutorials on transcription, annotation and the Rhapsodie Query Language are available on the site.

The Rhapsodie team (Modyco, Université Paris Ouest Nanterre):

Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.

Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).


5-2-8

Annotation of “Hannah and her sisters” by Woody Allen.

We have created and made publicly available a dense audio-visual person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and her sisters” by Woody Allen.

The annotation includes

• Face tracks in video (densely annotated, i.e., in each frame, and person-labeled)

• Speech segments in audio (person-labeled)

• Shot boundaries in video

The annotation can be useful for evaluating

• Person-oriented video-based tasks (e.g., face tracking, automatic character naming, etc.)

• Person-oriented audio-based tasks (e.g., speaker diarization or recognition)

• Person-oriented multimodal-based tasks (e.g., audio-visual character naming)

Details on the Hannah dataset and access to it are available here:

https://research.technicolor.com/rennes/hannah-home/

https://research.technicolor.com/rennes/hannah-download/

Acknowledgments:

This work is supported by the AXES EU project: http://www.axes-project.eu/

Alexey Ozerov (Alexey.Ozerov@technicolor.com)

Jean-Ronan Vigouroux

Louis Chevallier

Patrick Pérez

Technicolor Research & Innovation


5-2-9

French TTS

Text-to-Speech Synthesis: over an hour of speech synthesis samples from 1968 to 2001 by 25 French, Canadian, US, Belgian, Swedish, Swiss systems

33 ans de synthèse de la parole à partir du texte: une promenade sonore (1968-2001)
(33 years of Text-to-Speech Synthesis in French: an audio tour (1968-2001))
Christophe d'Alessandro
Article published in Volume 42, No. 1/2001 of Traitement Automatique des Langues (TAL, Editions Hermes), pp. 297-321.

Posted to:
http://groupeaa.limsi.fr/corpus:synthese:start


5-2-10

Google's Language Model benchmark

An LM benchmark is available at: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

unpruned Katz (1.1B n-grams),
pruned Katz (~15M n-grams),
unpruned Interpolated Kneser-Ney (1.1B n-grams),
pruned Interpolated Kneser-Ney (~15M n-grams)

ArXiv paper: http://arxiv.org/abs/1312.3005

Happy benchmarking!'
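Per-word log-probability values like those mentioned in the description are what one needs to compute perplexity, the usual figure of merit in language modeling. The following Python sketch shows the arithmetic only; it is not part of the benchmark scripts. The function name perplexity is hypothetical, and the logarithm base is a parameter because the description above does not say which base the published values use.

```python
import math
from typing import Iterable


def perplexity(log_probs: Iterable[float], log_base: float = 10.0) -> float:
    """Perplexity = base ** (-(1/N) * sum of per-word log-probabilities).

    log_base must match the base of the supplied values; the default of 10
    is an assumption, not something stated in the benchmark description.
    """
    values = list(log_probs)
    if not values:
        raise ValueError("need at least one log-probability")
    mean_log_prob = sum(values) / len(values)
    return log_base ** (-mean_log_prob)


# Made-up numbers: a model that gives every word probability 1/4
# has perplexity 4, whatever the length of the text.
uniform = [math.log10(0.25)] * 8
print(round(perplexity(uniform), 6))
```

Summing log-probabilities before exponentiating, as done here, avoids the numerical underflow that multiplying many small probabilities directly would cause.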


5-2-11

International Standard Language Resource Number (ISLRN) (ELRA Press release)

Press Release - Immediate - Paris, France, December 13, 2013

Establishing the International Standard Language Resource Number (ISLRN)

12 major NLP organisations announce the establishment of the ISLRN, a Persistent Unique Identifier, to be assigned to each Language Resource.

On November 18, 2013, 12 NLP organisations agreed to announce the establishment of the International Standard Language Resource Number (ISLRN), a Persistent Unique Identifier, to be assigned to each Language Resource. Experiment replicability, an essential feature of scientific work, would be enhanced by such a unique identifier. Set up by ELRA, LDC and AFNLP/Oriental-COCOSDA, the ISLRN Portal will provide unique identifiers using a standardised nomenclature, as a service free of charge for all Language Resource providers. It will be supervised by a steering committee composed of representatives of participating organisations and enlarged whenever necessary.

For more information on ELRA and the ISLRN, please contact: Khalid Choukri (choukri@elda.org)

For more information on ELDA, please contact: Hélène Mazo (mazo@elda.org)

ELRA

55-57, rue Brillat Savarin

75013 Paris (France)

Tel.: +33 1 43 13 33 33

Fax: +33 1 43 13 33 30

5-2-12

ISLRN new portal

Opening of the ISLRN Portal
ELRA, LDC, and AFNLP/Oriental-COCOSDA announce the opening of the ISLRN Portal @ www.islrn.org.

Further to the establishment of the International Standard Language Resource Number (ISLRN) as a unique and universal identification schema for Language Resources on November 18, 2013, ELRA, LDC and AFNLP/Oriental-COCOSDA now announce the opening of the ISLRN Portal (www.islrn.org). As a service free of charge for all Language Resource providers and under the supervision of a steering committee composed of representatives of participating organisations, the ISLRN Portal provides unique identifiers using a standardised nomenclature.

Overview
The 13-digit ISLRN format is: XXX-XXX-XXX-XXX-X. It can be allocated to any Language Resource; its composition is neutral and does not include any semantics in reference to the type or nature of the Language Resource. The ISLRN is a randomly generated number with a check digit validated by the Verhoeff algorithm.
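
To make the format and the check-digit idea concrete, here is a minimal sketch that checks the XXX-XXX-XXX-XXX-X layout and implements the generic Verhoeff checksum. The function names are hypothetical, and how the portal applies the algorithm to the 13 digits is not specified in this announcement, so the Verhoeff functions are shown on generic digit strings and are not guaranteed to reproduce the portal's own validation.

```python
import re

# Layout only: four groups of three digits and a final single digit.
ISLRN_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{3}-\d{3}-\d$")

# Standard Verhoeff tables: multiplication in the dihedral group D5,
# position-dependent permutations, and group inverses.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def has_islrn_format(text: str) -> bool:
    """True if `text` has the XXX-XXX-XXX-XXX-X layout (no checksum test)."""
    return ISLRN_PATTERN.match(text) is not None


def verhoeff_check_digit(digits: str) -> int:
    """Check digit to append to `digits` under the Verhoeff scheme."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_is_valid(digits: str) -> bool:
    """True if `digits` (payload plus trailing check digit) is consistent."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


print(has_islrn_format("433-270-888-230-5"))
print(verhoeff_check_digit("236"), verhoeff_is_valid("2363"))
```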

Two types of external players may interact with the ISLRN Portal: Visitors and Providers. Visitors may browse the web site and search for the ISLRN of a given Language Resource by its name or by its number if it exists. Providers are registered and own credentials. They can request a new ISLRN for a given Language Resource. A provider has the possibility to become certified, after moderation, in order to be able to import metadata in XML format.

The functionalities that can be accessed by Visitors are:

- Identify a language resource according to its ISLRN
- Identify an ISLRN by the name of a language resource
- Get information about ISLRN, FAQ, Basic Metadata, Legal Information
- View last 5 accepted resources (“What’s new” block on home page)
- Sign up to become a provider

The functionalities that can be accessed by Providers, once they have signed up, are:

- Log in
- Request an ISLRN according to the metadata of a given resource
- Request to become a certified provider so as to import XML files containing metadata
- Import one or more metadata descriptions in XML to request ISLRN(s) (only for certified providers)
- Edit pending requests
- Access previous requests
- Contact a Moderator or an Administrator
- Edit Providers’ own profile

ISLRN requests are handled by moderators within 5 working days.
Contact: islrn@elda.org

Background
The International Standard Language Resource Number (ISLRN) is a unique and universal identification schema for Language Resources, which provides each Language Resource with a unique identifier using a standardised nomenclature. It also ensures that Language Resources are correctly identified and, consequently, recognised with proper references for their usage in applications in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, it is a major step in the interconnected world that Human Language Technologies (HLT) has become: unique resources must be identified as such, and meta-catalogues need a common identification format to manage data correctly.

The ISLRN is not intended to replace local and specific identifiers; it is not meant to be a legal deposit or an obligation, but rather an essential best practice. For instance, a resource that is distributed by several data centres will still have the “local” data-centre identifier but will have a unique ISLRN.

********************************************************************
About ELRA
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT). To find out more about ELRA, please visit www.elra.info.

About LDC
Founded in 1992, the Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. To find out more about LDC, please visit www.ldc.upenn.edu.

About AFNLP
The mission of the Asian Federation of Natural Language Processing (AFNLP) is to promote and enhance R&D relating to the computational analysis and the automatic processing of all languages of importance to the Asian region by assisting and supporting like-minded organizations and institutions through information sharing, conference organization, research and publication co-ordination, and other forms of support. To find out more about AFNLP, please visit www.afnlp.org.

About Oriental-COCOSDA
The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques, Oriental-COCOSDA, has been established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing, especially for Speech Input/Output. To find out more about Oriental-COCOSDA, please visit our web site: www.cocosda.org

5-2-13

Speechocean – update (August 2015)

Speechocean: A global language resources and data services supplier

About Speechocean

Speechocean is one of the world's well-known providers of language-related resources and services in the fields of Human Computer Interaction and Human Language Technology. At present, we provide data services in 110+ languages and dialects across the world.

KingLine Data Center: Data Sharing Platform

KingLine Data Center is operated and supervised by Speechocean and focuses mainly on creating and providing language resources for research and development in human language technology.

These diversified corpora are widely used for the research and development in the fields of Speech Recognition, Speech Synthesis, Natural Language Processing, Machine Translation, Web Search, etc. All corpora are openly accessible for users all over the world, including users from scientific research institutions, enterprises or individuals.

For more detailed information, please visit our website: http://kingline.speechocean.com

Newly released corpora:

Mexican Spanish Speech Recognition Database (Mobile)-300 Speakers

ID: King-ASR-143

This is a 3-channel Mexican Spanish mobile speech database, collected simultaneously over three mobile phones (Android, iPhone and Windows Phone) in Mexico. The recordings were made in a quiet environment.
The prompts were phonetically rich sentences. Because reading these sentences places a potential cognitive load on the subjects, we took care to choose natural sentences between 5 and 25 words long. The raw sentences were selected from different domains: conversations, news, etc. We removed a number of sentences that included offensive or negative words or phrases. To achieve a good phone balance, we selected sentences from the sentence list to reach the required number. Finally, we had around 29,995 unique sentences in our list, each repeated fewer than 4 times across speakers.
The corpus contains the recordings of 270,711 utterances from 303 speakers. The recording time is about 477 hours (3-channel), including the leading silence (about 500 ms) and the trailing silence (about 500 ms). The total size of this database is about 51 GB.

Argentinean Spanish Speech Recognition Database (Desktop)-200 Speakers

ID: King-ASR-281

This is a 4-channel Spanish desktop speech database, collected over 4 different microphones simultaneously. The project was carried out in Argentina and covered cities across the country, for example: Buenos Aires, Cordoba, Lanus...

Each speaker recorded around 300 sentences, selected from a pool of phonetically rich sentences, in approximately 80 minutes, speaking as naturally as possible. The recording was performed in a quiet office environment.

The corpus contains the recordings of 236,232 utterances of Spanish speech data from 200 speakers. The pure recording time is about 358 hours (4-channel), including the leading silence (about 500 ms) and the trailing silence (about 500 ms). The total size of this database is 141 GB.

A pronunciation lexicon with phonemic transcriptions in SAMPA was carefully built to cover all the words in the transcription files.

Chilean Spanish Speech Recognition Database (Mobile)-300 Speakers

ID: King-ASR-290

This is a 3-channel Chilean Spanish speech database, collected over 3 different mobile operating systems: iOS, Android and Windows Phone. The project was carried out in Chile and covered all the main cities, for example: Santiago, Rancagua, Antofagasta and Viña.
300 speakers were recorded in total, each in a quiet environment.
The prompts were phonetically rich sentences. The raw sentences were all selected from the news, Twitter/forum and SMS domains. We removed a number of sentences that included offensive or negative words or phrases. Finally, we had 108,055 unique sentences in our list, from which we generated the prompt sheets, using each sentence no more than 3 times.
After discarding some unqualified utterances, the whole corpus contains the recordings of 268,704 utterances; the pure recording time is about 519 hours (including leading and trailing silence). The total size of this database is about 55.8 GB.

Contact Information

Xianfeng Cheng

Tel: +86-10-62660928; +86-10-62660053 ext.8080

Mobile: +86 13681432590

Skype: xianfeng.cheng1

Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com

Website: www.speechocean.com

5-2-14

kidLUCID: London UCL Children’s Clear Speech in Interaction Database

We are delighted to announce the availability of a new corpus of spontaneous speech for children aged 9 to 14 years inclusive, produced as part of the ESRC-funded project on ‘Speaker-controlled Variability in Children's Speech in Interaction’ (PI: Valerie Hazan).

Speech recordings (a total of 288 conversations) are available for 96 child participants (46M, 50F, range 9;0 to 15;0 years), all native southern British English speakers. Participants were recorded in pairs while completing the diapix spot-the-difference picture task in which the pair verbally compared two scenes, only one of which was visible to each talker. High-quality digital recordings were made in sound-treated rooms. For each conversation, a stereo audio recording is provided with each speaker on a separate channel together with a Praat Textgrid containing separate word- and phoneme-level segmentations for each speaker.
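
Since each conversation is distributed as a stereo recording with one speaker per channel, a common first processing step is to separate the two channels. The sketch below de-interleaves stereo PCM frames using only the Python standard library. The 16-bit sample width used in the self-check and the helper names are assumptions for illustration, not properties stated for the corpus.

```python
import struct
import wave


def split_stereo_frames(frames: bytes, sample_width: int = 2):
    """De-interleave raw stereo PCM bytes into (left, right) byte strings."""
    frame_size = 2 * sample_width
    if len(frames) % frame_size != 0:
        raise ValueError("byte count is not a whole number of stereo frames")
    left = bytearray()
    right = bytearray()
    for start in range(0, len(frames), frame_size):
        left += frames[start:start + sample_width]
        right += frames[start + sample_width:start + frame_size]
    return bytes(left), bytes(right)


def split_stereo_wav(path: str):
    """Read a stereo WAV file and return (params, left_bytes, right_bytes)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        params = wav.getparams()
        frames = wav.readframes(wav.getnframes())
    left, right = split_stereo_frames(frames, params.sampwidth)
    return params, left, right


# Self-check on synthetic interleaved 16-bit samples (L, R, L, R).
demo = struct.pack("<4h", 1, -1, 2, -2)
left, right = split_stereo_frames(demo)
print(struct.unpack("<2h", left), struct.unpack("<2h", right))
```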

There are six recordings per speaker pair made in the following conditions:

NOB (No barrier): both speakers heard each other normally
VOC (Vocoder): one conversational partner heard the other's speech after it had been processed in real time through a noise-excited three channel vocoder
BAB (Babble): one conversational partner heard the other's speech in a background of adult multi-talker babble at an approximate SNR of 0 dB.
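
To illustrate what a signal-to-noise ratio of 0 dB means in the BAB condition, the sketch below scales a noise signal so that the speech-to-noise power ratio matches a target value, using SNR(dB) = 10 log10(P_speech / P_noise). The sample values and function names are hypothetical, and this is not the procedure used to build the corpus, which is not described here.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def measure_snr_db(speech, noise):
    """Speech-to-noise power ratio in decibels."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


def noise_gain_for_snr(speech, noise, target_snr_db):
    """Gain to apply to `noise` so that the SNR equals `target_snr_db`."""
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    target_noise_power = mean_power(speech) / (10 ** (target_snr_db / 10))
    return math.sqrt(target_noise_power / p_noise)


# Toy signals: at 0 dB the scaled noise has the same power as the speech.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
gain = noise_gain_for_snr(speech, noise, target_snr_db=0.0)
scaled_noise = [gain * x for x in noise]
mixture = [s + n for s, n in zip(speech, scaled_noise)]
print(gain, measure_snr_db(speech, scaled_noise))
```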

The kidLUCID corpus is available online within the OSCAAR (Online Speech/Corpora Archive and Analysis Resource) archive (https://oscaar.ci.northwestern.edu/). Free access can be requested for research purposes. Further information about the project can be found at: http://www.ucl.ac.uk/pals/research/shaps/research/shaps/research/clear-speech-strategies

This work was supported by Economic and Social Research Council Grant No. RES-062-23-3106.

5-2-15

Robust speech datasets and ASR software tools

We are happy to announce the release of a table of 44 publicly available robust speech processing datasets and a table of 4 ASR software tools on the wiki of ISCA's Robust Speech Processing SIG:
https://wiki.inria.fr/rosp/Datasets#Speech_datasets
https://wiki.inria.fr/rosp/Software#Automatic_speech_recognition

We hope that these tables will promote wider dissemination of the datasets and software tools available in our community and help newcomers select the most suitable dataset or software for a given experiment. We plan to provide additional tables on, e.g., room impulse response datasets or speaker recognition software in the future.

We highly welcome your input, especially additional tables/entries and reproducible baselines for each dataset. It just takes a few minutes thanks to the simple wiki interface.

For more information about joining the SIG and contributing, see
https://wiki.inria.fr/rosp/

Jonathan Le Roux, Emmanuel Vincent, and Ramon Astudillo

5-2-16

International Standard Language Resource Number (ISLRN) implemented by ELRA and LDC

ELRA and LDC partner to implement ISLRN process and assign identifiers to all the Language Resources in their catalogues.

Following the meeting of the largest NLP organizations, the NLP12, and their endorsement of the International Standard Language Resource Number (ISLRN), ELRA and LDC partnered to implement the ISLRN process and to assign identifiers to all the Language Resources (LRs) in their catalogues. The ISLRN web portal was designed to enable the assignment of unique identifiers as a service free of charge for all Language Resource providers. To enhance the use of ISLRN, ELRA and LDC have collaborated to provide the ISLRN 13-digit ID to all the Language Resources distributed in their respective catalogues. Anyone who is searching the ELRA and LDC catalogues can see that each Language Resource is now identified by both the data centre ID and the ISLRN number. All providers and users of such LRs should refer to the latter in their own publications and whenever referring to the LR.

ELRA and LDC will continue their joint involvement in ISLRN through active participation in this web service.

Visit the ELRA and LDC catalogues, respectively at http://catalogue.elra.info and https://catalog.ldc.upenn.edu

Background

The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.

The ISLRN portal can be accessed from http://www.islrn.org.

***About NLP12***

Representatives of the major Natural Language Processing and Computational Linguistics organizations met in Paris on 18 November 2013 to harmonize and coordinate their activities within the field.
The results of this coordination are expressed in the Paris Declaration: http://www.elra.info/NLP12-Paris-Declaration.html.

*** About ELRA ***
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).
To find out more about ELRA, please visit our web site: http://www.elra.info

*** About LDC ***

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and research laboratories that creates and distributes linguistic resources for language-related education, research and technology development.

To find out more about LDC, please visit our web site: https://www.ldc.upenn.edu

For more information, please contact: admin@islrn.org

5-2-17

ELRA News

We are happy to announce that 1 new Written Corpus and 1 new Terminological Resource are now available in our catalogue.

ELRA-W0081 Khresmoi manually annotated reference corpus
ISLRN: 764-036-829-417-7
This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The corpus is divided into two parts:
1. The initial corpus: 625 documents from the Genetics Home Reference data set, automatically annotated with anatomical locations and diseases, and manually corrected by 3-4 annotators. Size of documents: between 26 and 8,306 tokens each.
2. The main corpus: 6,950 English documents from the Khresmoi crawl and 5,518 English Wikipedia pages, automatically annotated through the GATE Platform for Anatomy, Disease, Drug and Investigation. Size of documents: between 200 and 2,000 tokens each.
The corpus uses the GATE XML format.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1237

ELRA-T0375 ACL RD-TEC: A Reference Dataset for Terminology Extraction and Classification Research in Computational Linguistics
ISLRN: 699-305-362-089-6
This is a reference dataset for terminology extraction and classification research in computational linguistics. It is a set of manually annotated English terms extracted from the ACL Anthology Reference Corpus (ACL ARC). This dataset, called ACL RD-TEC, comprises more than 69,000 candidate terms that are manually annotated as valid or invalid terms. Furthermore, valid terms are classified as technology or non-technology terms.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1236

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org)

5-2-18

ELRA - Language Resources Catalogue - Update (May 2015)

We are happy to announce that 1 new Speech resource is now available in our catalogue.

ELRA-S0373 GVLEX tales corpus
ISLRN: 433-270-888-230-5

The GVLEX tales corpus consists of: 89 written tales, manually annotated for structure, speech turns, speakers and phrases, 7 of which were annotated by 2 human annotators (96 annotated texts in total); 12 tales read by a professional, transcribed and manually annotated, including audio files; and annotation and viewing software developed within the GV-LEX project.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1240

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org)

5-2-19

FREE and OPEN database for speaker recognition

I would like to ask for your help in building a FREE and OPEN database for speaker recognition.

More details and instructions are given below.

Many thanks,

Anthony Larcher

Recently, a number of laboratories specialising in text-dependent speaker recognition launched the RedDots project.

It is a voluntary initiative, funded by the laboratories themselves.

The project encourages discussion, through a Google Group, of topics related to speaker recognition, corpus collection and the use cases specific to this technology.

As part of the RedDots project, the Institute for Infocomm Research (Singapore) has developed an Android application for recording data on a mobile phone.

The database aims to address some shortcomings of existing corpora:

- cost (some standard databases sell for several thousand euros)

- limited size (the limited number of speakers no longer allows recognition systems to be evaluated in a meaningful way)

- limited variability (data are currently being recorded in more than 5 countries around the world)

We are asking for your help so that we can distribute a database that the whole research community can use freely.

How do I take part, and how long does it take?

- sign up in 2 minutes at the following address:

https://docs.google.com/forms/d/1124qRtQALqXb11y1rsbrXDu8ELBRHWQdXwq9L00uX7Y/viewform

- install the Android application on your phone in 2 minutes, then enter the ID and password that will be sent to you by email

- record a 3-minute session on your phone

The whole process takes less than 10 minutes.

One of the main limitations of existing corpora is the small number of sessions recorded per speaker and the short time span over which those sessions are recorded.

To fill this gap, we hope that each participant will agree to record several sessions over the coming months.

Ideally, each participant will record 3 or 4 minutes per week for one year.

Where do my data go, and what are they used for?

The data are currently sent to a server at the Institute for Infocomm Research in Singapore, a public research institute.

By registering, you agree that these data may be used for research purposes only. The data will be made available online free of charge throughout the project.

Thank you for your contribution, and please feel free to circulate this email.

More details will be given soon in a paper submitted to INTERSPEECH 2015.

Anthony Larcher

5-2-20

ISLRN adopted by Joint Research Center (JRC) of the European Commission

JRC, the EC's Joint Research Centre, an important LR player: First to adopt the ISLRN initiative

The Joint Research Centre (JRC), the European Commission's in-house science service, is the first organisation to adopt the International Standard Language Resource Number (ISLRN) initiative and has requested 13-digit ISLRN unique identifiers for its Language Resources (LRs).
Thus, anyone who is using JRC LRs may now refer to this number in their own publications.

The current JRC LRs (downloadable from https://ec.europa.eu/jrc/en/language-technologies) with an ISLRN ID are:

DGT-Acquis, ISLRN: 393-866-130-658-2
DGT-TM, ISLRN: 710-653-952-884-4
EAC-TM, ISLRN: 589-927-543-547-4
ECDC-TM, ISLRN: 476-596-396-497-8
English sentiment quotes, ISLRN: 574-735-957-886-6
JRC-Acquis, ISLRN: 821-325-977-001-1
JRC-Names, ISLRN: 328-863-023-410-2
Multilingual summary evaluation data, ISLRN: 762-292-165-648-8
Turkish Tweet Named Entity annotation, ISLRN: 764-177-227-350-7

Background

The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.
The ISLRN portal can be accessed from http://www.islrn.org.

*** About the JRC ***

As the Commission's in-house science service, the Joint Research Centre's mission is to provide EU policies with independent, evidence-based scientific and technical support throughout the whole policy cycle.
Within its research in the field of global security and crisis management, the JRC develops open source intelligence and analysis systems that can automatically harvest and analyse a huge amount of multilingual information from internet-based sources. In this context, the JRC has developed Language Technology resources and tools that can be used for highly multilingual text analysis and cross-lingual applications.
To find out more about JRC's research in open source information monitoring, please visit https://ec.europa.eu/jrc/en/research-topic/internet-surveillance-systems. To access media monitoring applications directly, go to http://emm.newsbrief.eu/overview.html.

5-2-21

Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with: Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Sciences, Australasian Speech Sciences and Technology Association, Guardia Civil, Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings were made under high-quality conditions, and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/

5-2-22

Audio and Electroglottographic speech recordings

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/ San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)

5-2-23

Press release: Opening of the ELRA License Wizard

Press Release - Immediate - Paris, France, April 2, 2015

Opening of the ELRA License Wizard

ELRA announces the opening of the License Wizard @ http://wizard.elda.org.

ELRA is deploying a License Wizard to:

- support right-holders in finding the appropriate licenses under which to share/distribute their Language Resources, and
- clarify the legal obligations applicable in various licensing situations.

Currently, the License Wizard allows the user to choose among several licenses that exist for the use of Language Resources: ELRA, Creative Commons and META-SHARE.
More will be added.

The License Wizard works as a web configurator that helps Right Holders/Users:

- to select a number of legal features and obtain the user license adapted to their selection.
- to define which user licenses they would like to select in order to distribute their Language Resources.
- to integrate the user license terms into a Distribution Agreement that could be proposed to ELRA or META-SHARE for further distribution through the ELRA Catalogue of Language Resources (http://catalogue.elra.info, www.meta-share.eu).

Background
From the very beginning, ELRA has come across all types of legal issues that arise when exchanging and sharing Language Resources. The association has devoted huge efforts to streamline the licensing processes while continuously monitoring the impacts of regulation changes on the HLT community activities. The first major step was to come up with a few licenses for both the research and the industrial sectors to use the resources available within the ELRA catalogue. Recently, its strong involvement in the META-SHARE infrastructure led to designing and drafting a small set of licenses, inspired by the ELRA licenses but also accounting for the new trends of permissive licenses and free resources, represented in particular by the Creative Commons.

5-2-24

EEG - face tracking - audio 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

Best regards,

Frank

PAPER Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane, Australia.

ABSTRACT This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.
