ISCA - International Speech
Communication Association


ISCApad Archive  »  2015  »  ISCApad #200  »  Resources  »  Database

ISCApad #200

Friday, February 13, 2015 by Chris Wellekens

5-2 Database
5-2-1ELRA - Language Resources Catalogue - Update (2014-09)
ELRA - Language Resources Catalogue - Update
*****************************************************************
 
We are happy to announce that 1 new Speech Resource and 3 new Written Corpora are now available in our catalogue.  
 
ELRA-S0371 PortMedia French and Italian corpus
This corpus contains 700 transcribed dialogues from about 140 French speakers and 604 transcribed dialogues from about 150 Italian speakers (several dialogues per speaker). The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of touristic information and reservation. A manual transcription and semantic annotation of the corpus are provided with corresponding wave files.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1224&language=en
 
ELRA-W0078 NE3L named entities Arabic corpus
The Arabic corpus contains 103,363 words coming from articles extracted from “Le Monde Diplomatique” newspaper, and published in 2004. 2 named entity categories were taken into account: Time and Amount.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1226&language=en
 
ELRA-W0079 NE3L named entities Chinese corpus
The Chinese corpus contains 79,302 words coming from articles extracted from “Le Monde Diplomatique” newspaper, and published in 2001. 3 named entity categories were taken into account: Person, Place and Organisation.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1227&language=en
 
ELRA-W0080 NE3L named entities Russian corpus
The Russian corpus contains 75,784 words coming from articles extracted from “Izvestia” newspaper, and published in 1995. 2 named entity categories were taken into account: Time and Amount.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1228&language=en

For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
 

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html
 
 

Back  Top

5-2-2ELRA releases free Language Resources

ELRA releases free Language Resources
***************************************************

Anticipating users’ expectations, ELRA has decided to offer a large number of resources for free for Academic research use. Such an offer consists of several sets of speech, text and multimodal resources that are regularly released, for free, as soon as legal aspects are cleared. A first set was released in May 2012 at the occasion of LREC 2012. A second set is now being released.

Whenever this is permitted by our licences, please feel free to use these resources for deriving new resources and depositing them with the ELRA catalogue for community re-use.

Over the last decade, ELRA has compiled a large list of resources into its Catalogue of LRs. ELRA has negotiated distribution rights with the LR owners and made such resources available under fair conditions and within a clear legal framework. Following this initiative, ELRA has also worked on LR discovery and identification with a dedicated team which investigated and listed existing and valuable resources in its 'Universal Catalogue', a list of resources that could be negotiated on a case-by-case basis. At LREC 2010, ELRA introduced the LRE Map, an inventory of LRs, whether developed or used, that were described in LREC papers. This huge inventory listed by the authors themselves constitutes the first 'community-built' catalogue of existing or emerging resources, constantly enriched and updated at major conferences.

Considering the latest trends on easing the sharing of LRs, from both legal and commercial points of view, ELRA is taking a major role in META-SHARE, a large European open infrastructure for sharing LRs. This infrastructure will allow LR owners, providers and distributors to distribute their LRs through an additional and cost-effective channel.

To obtain the available sets of LRs, please visit the web page below and follow the instructions given online:
http://www.elra.info/Free-LRs,26.html

Back  Top

5-2-3LDC Newsletter (January 2015)

 

In this newsletter:

LDC Membership Discounts for MY 2015 Still Available

New publications:

GALE Phase 2 Arabic Broadcast News Speech Part 2

GALE Phase 2 Arabic Broadcast News Transcripts Part 2

SenSem Databank


LDC Membership Discounts for MY 2015 Still Available

If you are considering joining LDC for Membership Year 2015 (MY2015), there is still time to save on membership fees.   Any organization which joins or renews membership for 2015 through Monday, March 2, 2015, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2014 can receive a 10% discount on fees provided they renew prior to March 2, 2015.  For further information on planned publications for MY2015, please visit or contact LDC.


New publications

(1) GALE Phase 2 Arabic Broadcast News Speech Part 2 was developed by LDC and is comprised of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News Transcripts Part 1 (LDC2015T01).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong King (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast recordings in this release feature news programs focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Aljazeera , a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, based in Dubai, United Arab Emirates; Al Iraqiyah, a television network based in Iraq; Kuwait TV, a national television station based in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

This release contains 204 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 2 Arabic Broadcast News Speech Part 2 is distributed on 3 DVD-ROM.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$2000.

*

(2) GALE Phase 2 Arabic Broadcast News Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. Corresponding audio data is released as GALE Phase 2 Arabic Broadcast News Speech Part 2 (LDC2015S01).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 920,730 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.

GALE Phase 2 Arabic Broadcast News Transcripts Part 2 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$1500.

*

(3) SenSem (Sentence Semantics) Databank was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida and the Universitat Oberta de Catalunya. It contains syntactic and semantic annotation for over 35,000 sentences, approximately one million words of Spanish and approximately 700,000 words of Catalan translated from the Spanish. GRIAL's work focuses on resources for applied linguistics, including lexicography, translation and natural language processing.

Each sentence in SenSem Databank was labeled according to the verb sense it exemplifies, the type of complement it takes (arguments or adjuncts) and the syntactic category and function. Each argument was also labeled with a semantic role. Further information about the SenSem project can be obtained from the GRIAL website.

The Spanish source data includes texts from news journals (30,000 sentences) and novels (5,299 sentences). Those sentences represent around 1,000 different verb meanings that correspond to the 250 most frequent Spanish verbs. Verb frequencies were retrieved from a quantitative analysis of around 13 million words.

The Catalan corpus was developed by translating the news journal portion of the Spanish data set, resulting in a resource of over 700,000 sentences from which 391,267 sentences were annotated. Sentences were automatically translated and manually post-edited; some were re-annotated for sentence complements. Semantic information was the same for both languages. The Catalan sentences represent close to 1,300 different verbs.

SenSem Databank is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$200.  This data is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license and to LDC for-profit members under the terms of the For-Profit Membership Agreement.


 

Back  Top

5-2-4Appen ButlerHill

 

Appen ButlerHill 

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

1.1 Telephony

Language

Database Type

Catalogue Code

Speakers

Status

Bahasa Indonesia

Conversational

BAH_ASR001

1,002

Available

Bengali

Conversational

BEN_ASR001

1,000

Available

Bulgarian

Conversational

BUL_ASR001

217

Available shortly

Croatian

Conversational

CRO_ASR001

200

Available shortly

Dari

Conversational

DAR_ASR001

500

Available

Dutch

Conversational

NLD_ASR001

200

Available

Eastern Algerian Arabic

Conversational

EAR_ASR001

496

Available

English (UK)

Conversational

UKE_ASR001

1,150

Available

Farsi/Persian

Scripted

FAR_ASR001

789

Available

Farsi/Persian

Conversational

FAR_ASR002

1,000

Available

French (EU)

Conversational

FRF_ASR001

563

Available

French (EU)

Voicemail

FRF_ASR002

550

Available

German

Voicemail

DEU_ASR002

890

Available

Hebrew

Conversational

HEB_ASR001

200

Available shortly

Italian

Conversational

ITA_ASR003

200

Available shortly

Italian

Voicemail

ITA_ASR004

550

Available

Kannada

Conversational

KAN_ASR001

1,000

In development

Pashto

Conversational

PAS_ASR001

967

Available

Portuguese (EU)

Conversational

PTP_ASR001

200

Available shortly

Romanian

Conversational

ROM_ASR001

200

Available shortly

Russian

Conversational

RUS_ASR001

200

Available

Somali

Conversational

SOM_ASR001

1,000

Available

Spanish (EU)

Voicemail

ESO_ASR002

500

Available

Turkish

Conversational

TUR_ASR001

200

Available

Urdu

Conversational

URD_ASR001

1,000

Available

1.2 Wideband

Language

Database Type

Catalogue Code

Speakers

Status

English (US)

Studio

USE_ASR001

200

Available

French (Canadian)

Home/ Office

FRC_ASR002

120

Available

German

Studio

DEU_ASR001

127

Available

Thai

Home/Office

THA_ASR001

100

Available

Korean

Home/Office

KOR_ASR001

100

Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

Part-of-speech tagged Lexica providing grammatical and semantic labels

Other reference text based materials including spelling/mis-spelling lists, spell-check dictionar-ies, mappings of colloquial language to standard forms, orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

Language

Catalogue Code

Words

Description

Arabic

ARB_NER001

500,000

These NER Corpora contain text material from a vari-ety of sources and are tagged for the following Named Entities: Person, Organization, Location, Na-tionality, Religion, Facility, Geo-Political Entity, Titles, Quantities

English

ENI_NER001

500,000

Farsi/Persian

FAR_NER001

500,000

Korean

KOR_NER001

500,000

Japanese

JPY_NER001

500,000

Russian

RUS_NER001

500,000

Mandarin

MAN_NER001

500,000

Urdu

URD_NER001

500,000

3. Named Entity Corpora

Language

Catalogue Code

Words

Description

Arabic

ARB_NER001

500,000

These NER Corpora contain text material from a vari-ety of sources and are tagged for the following Named Entities: Person, Organization, Location, Na-tionality, Religion, Facility, Geo-Political Entity, Titles, Quantities

English

ENI_NER001

500,000

Farsi/Persian

FAR_NER001

500,000

Korean

KOR_NER001

500,000

Japanese

JPY_NER001

500,000

Russian

RUS_NER001

500,000

Mandarin

MAN_NER001

500,000

Urdu

URD_NER001

500,000

4. Other Language Resources

Morphological Analyzers – Farsi/Persian & Urdu

Arabic Thesaurus

Language Analysis Documentation – multiple languages

 

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low risk, high quality, reliable solution and has worked in 130+ languages to-date supporting both large global corporations and Government organizations.

We would be glad to discuss to any customized requests or package configurations and prepare a cus-tomized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

                                                         www.appenbutlerhill.com

Back  Top

5-2-5OFROM 1er corpus de français de Suisse romande
Nous souhaiterions vous signaler la mise en ligne d'OFROM, premier corpus de français parlé en Suisse romande. L'archive est, dans version actuelle, d'une durée d'environ 15 heures. Elle est transcrite en orthographe standard dans le logiciel Praat. Un concordancier permet d'y effectuer des recherches, et de télécharger les extraits sonores associés aux transcriptions. 
 
Pour accéder aux données et consulter une description plus complète du corpus, nous vous invitons à vous rendre à l'adresse suivante : http://www.unine.ch/ofrom
Back  Top

5-2-6Real-world 16-channel noise recordings

We are happy to announce the release of DEMAND, a set of real-world
16-channel noise recordings designed for the evaluation of microphone
array processing techniques.

http://www.irisa.fr/metiss/DEMAND/

1.5 h of noise data were recorded in 18 different indoor and outdoor
environments and are available under the terms of the Creative Commons Attribution-ShareAlike License.

Joachim Thiemann (CNRS - IRISA)
Nobutaka Ito (University of Tokyo)
Emmanuel Vincent (Inria Nancy - Grand Est)

Back  Top

5-2-7Aide à la finalisation de corpus oraux ou multimodaux pour diffusion, valorisation et dépôt pérenne

Aide à la finalisation de corpus oraux ou multimodaux pour diffusion, valorisation et dépôt pérenne

 

 

Le consortium IRCOM de la TGIR Corpus et l’EquipEx ORTOLANG s’associent pour proposer une aide technique et financière à la finalisation de corpus de données orales ou multimodales à des fins de diffusion et pérennisation par l’intermédiaire de l’EquipEx ORTOLANG. Cet appel ne concerne pas la création de nouveaux corpus mais la finalisation de corpus existants et non-disponibles de manière électronique. Par finalisation, nous entendons le dépôt auprès d’un entrepôt numérique public, et l’entrée dans un circuit d’archivage pérenne. De cette façon, les données de parole qui ont été enrichies par vos recherches vont pouvoir être réutilisées, citées et enrichies à leur tour de manière cumulative pour permettre le développement de nouvelles connaissances, selon les conditions d’utilisation que vous choisirez (sélection de licences d’utilisation correspondant à chacun des corpus déposés).

 

Cet appel d’offre est soumis à plusieurs conditions (voir ci-dessous) et l’aide financière par projet est limitée à 3000 euros. Les demandes seront traitées dans l’ordre où elles seront reçues par l’ IRCOM. Les demandes émanant d’EA ou de petites équipes ne disposant pas de support technique « corpus » seront traitées prioritairement. Les demandes sont à déposer du 1er septembre 2013 au 31 octobre 2013. La décision de financement relèvera du comité de pilotage d’IRCOM. Les demandes non traitées en 2013 sont susceptibles de l’être en 2014. Si vous avez des doutes quant à l’éligibilité de votre projet, n’hésitez pas à nous contacter pour que nous puissions étudier votre demande et adapter nos offres futures.

 

Pour palier la grande disparité dans les niveaux de compétences informatiques des personnes et groupes de travail produisant des corpus, L’ IRCOM propose une aide personnalisée à la finalisation de corpus. Celle-ci sera réalisée par un ingénieur IRCOM en fonction des demandes formulées et adaptées aux types de besoin, qu’ils soient techniques ou financiers.

 

Les conditions nécessaires pour proposer un corpus à finaliser et obtenir une aide d’IRCOM sont :

  • Pouvoir prendre toutes décisions concernant l’utilisation et la diffusion du corpus (propriété intellectuelle en particulier).

  • Disposer de toutes les informations concernant les sources des corpus et le consentement des personnes enregistrées ou filmées.

  • Accorder un droit d’utilisation libre des données ou au minimum un accès libre pour la recherche scientifique.

 

Les demandes peuvent concerner tout type de traitement : traitements de corpus quasi-finalisés (conversion, anonymisation), alignement de corpus déjà transcrits, conversion depuis des formats « traitement de textes », digitalisation de support ancien. Pour toute demande exigeant une intervention manuelle importante, les demandeurs devront s’investir en moyens humains ou financiers à la hauteur des moyens fournis par IRCOM et ORTOLANG.

 

IRCOM est conscient du caractère exceptionnel et exploratoire de cette démarche. Il convient également de rappeler que ce financement est réservé aux corpus déjà largement constitués et ne peuvent intervenir sur des créations ex-nihilo. Pour ces raisons de limitation de moyens, les propositions de corpus les plus avancés dans leur réalisation pourront être traitées en priorité, en accord avec le CP d’IRCOM. Il n’y a toutefois pas de limite « théorique » aux demandes pouvant être faites, IRCOM ayant la possibilité de rediriger les demandes qui ne relèvent pas de ses compétences vers d’autres interlocuteurs.

 

Les propositions de réponse à cet appel d’offre sont à envoyer à ircom.appel.corpus@gmail.com. Les propositions doivent utiliser le formulaire de deux pages figurant ci-dessous. Dans tous les cas, une réponse personnalisée sera renvoyée par IRCOM.

 

Ces propositions doivent présenter les corpus proposés, les données sur les droits d’utilisation et de propriétés et sur la nature des formats ou support utilisés.

 

Cet appel est organisé sous la responsabilité d’IRCOM avec la participation financière conjointe de IRCOM et l’EquipEx ORTOLANG.

 

Pour toute information complémentaire, nous rappelons que le site web de l'Ircom (http://ircom.corpus-ir.fr) est ouvert et propose des ressources à la communauté : glossaire, inventaire des unités et des corpus, ressources logicielles (tutoriaux, comparatifs, outils de conversion), activités des groupes de travail, actualités des formations, ...

L'IRCOM invite les unités à inventorier leur corpus oraux et multimodaux - 70 projets déjà recensés - pour avoir une meilleure visibilité des ressources déjà disponibles même si elles ne sont pas toutes finalisées.

 

Le comité de pilotage IRCOM

 

 

Utiliser ce formulaire pour répondre à l’appel : Merci.

 

Réponse à l’appel à la finalisation de corpus oral ou multimodal

 

Nom du corpus :

 

Nom de la personne à contacter :

Adresse email :

Numéro de téléphone :

 

Nature des données de corpus :

 

Existe-t’il des enregistrements :

Quel média ? Audio, vidéo, autre…

Quelle est la longueur totale des enregistrements ? Nombre de cassettes, nombre d’heures, etc.

Quel type de support ?

Quel format (si connu) ?

 

Existe-t’il des transcriptions :

Quel format ? (papier, traitement de texte, logiciel de transcription)

Quelle quantité (en heures, nombre de mots, ou nombre de transcriptions) ?

 

Disposez vous de métadonnées (présentation des droits d’auteurs et d’usage) ?

 

Disposez-vous d’une description précise des personnes enregistrées ?

 

Disposez-vous d’une attestation de consentement éclairé pour les personnes ayant été enregistrées ? En quelle année (environ) les enregistrements ont eu lieu ?

 

Quelle est la langue des enregistrements ?

 

Le corpus comprend-il des enregistrements d’enfants ou de personnes ayant un trouble du langage ou une pathologie ?

Si oui, de quelle population s’agit-il ?

 

 

Dans un souci d’efficacité et pour vous conseiller dans les meilleurs délais, il nous faut disposer d’exemples des transcriptions ou des enregistrements en votre possession. Nous vous contacterons à ce sujet, mais vous pouvez d’ores et déjà nous adresser par courrier électronique un exemple des données dont vous disposez (transcriptions, métadonnées, adresse de page web contenant les enregistrements).

 

Nous vous remercions par avance de l’intérêt que vous porterez à notre proposition. Pour toutes informations complémentaires veuillez contacter Martine Toda martine.toda@ling.cnrs.fr ou à ircom.appel.corpus@gmail.com.

Back  Top

5-2-8Rhapsodie: un Treebank prosodique et syntaxique de français parlé

Rhapsodie: un Treebank prosodique et syntaxique de français parlé

 

Nous avons le plaisir d'annoncer que la ressource Rhapsodie, Corpus de français parlé annoté pour la prosodie et la syntaxe, est désormais disponible sur http://www.projet-rhapsodie.fr/

 

Le treebank Rhapsodie est composé de 57 échantillons sonores (5 minutes en moyenne, au total 3h de parole, 33000 mots) dotés d’une transcription orthographique et phonétique alignées au son.

 

Il s'agit d’une ressource de français parlé multi genres (parole privée et publique ; monologues et dialogues ; entretiens en face à face vs radiodiffusion, parole plus ou moins interactive et plus ou moins planifiée, séquences descriptives, argumentatives, oratoires et procédurales) articulée autour de sources externes (enregistrements extraits de projets antérieurs, en accord avec les concepteurs initiaux) et internes. Nous tenons en particulier à remercier les responsables des projets CFPP2000, PFC, ESLO, C-Prom ainsi que Mathieu Avanzi, Anne Lacheret, Piet Mertens et Nicolas Obin.

 

Les échantillons sonores (wave & MP3, pitch nettoyé et lissé), les transcriptions orthographiques (txt), les annotations macrosyntaxiques (txt), les annotations prosodiques (xml, textgrid) ainsi que les metadonnées (xml & html) sont téléchargeables librement selon les termes de la licence Creative Commons Attribution - Pas d’utilisation commerciale - Partage dans les mêmes conditions 3.0 France.

Les annotations microsyntaxiques seront disponibles prochainement

 Les métadonnées sont également explorables en ligne grâce à un browser.

 Les tutoriels pour la transcription, les annotations et les requêtes sont disponibles sur le site Rhapsodie.

 Enfin, L’annotation prosodique est interrogeable en ligne grâce au langage de requêtes Rhapsodie QL.

 L'équipe Ressource Rhapsodie (Modyco, Université Paris Ouest Nanterre)

Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.

 Partenaires : IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).

 

********************************************************

Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French

We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank of spoken French created with the aim of modeling the interface between prosody, syntax and discourse in spoken French is now available at   http://www.projet-rhapsodie.fr/

The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and a 33 000 word corpus) endowed with an orthographical phoneme-aligned transcription . 

The corpus is representative of different genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive discourse; descriptive, argumentative and procedural samples, variations in planning type).

The corpus samples have been mainly drawn from existing corpora of spoken French and partially created within the frame of theRhapsodie project. We would especially like to thank the coordinators of the  CFPP2000, PFC, ESLO, C-Prom projects as well as Piet Mertens, Mathieu Avanzi, Anne Lacheret and Nicolas Obin.

The sound samples (waves, MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations  (xml, textgrid) as well as the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France.

Microsyntactic annotation will be available soon.

The metadata are  searchable on line through a browser.

The prosodic annotation can be explored on line through the Rhapsodie Query Language.

The tutorials of transcription, annotations and Rhapsodie Query Language  are available on the site.

 

The Rhapsodie team (Modyco, Université Paris Ouest Nanterre :

Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.

Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence),CLLE-ERSS (Toulouse).

Back  Top

5-2-9Annotation of “Hannah and her sisters” by Woody Allen.

We have created and made publicly available a dense audio-visual person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and her sisters” by Woody Allen.

The annotation includes

•          Face tracks in video (densely annotated, i.e., in each frame, and person-labeled)

•             Speech segments in audio (person-labeled)

•             Shot boundaries in video



The annotation can be useful for evaluating



•   Person-oriented video-based tasks (e.g., face tracking, automatic character naming, etc.)

•             Person-oriented audio-based tasks (e.g., speaker diarization or recognition)

•             Person-oriented multimodal-based tasks (e.g., audio-visual character naming)



Detail on Hannah dataset and access to it can be obtained there:

https://research.technicolor.com/rennes/hannah-home/

https://research.technicolor.com/rennes/hannah-download/



Acknowledgments:

This work is supported by AXES EU project: http://www.axes-project.eu/










Alexey Ozerov Alexey.Ozerov@technicolor.com

Jean-Ronan Vigouroux,

Louis Chevallier

Patrick Pérez

Technicolor Research & Innovation



 

Back  Top

5-2-10French TTS

Text to         Speech Synthesis:
      over an hour of speech       synthesis samples from         1968 to 2001 by       25 French, Canadian, US , Belgian,       Swedish, Swiss systems
     
     
33 ans de synthèse de la parole à         partir du texte: une promenade sonore (1968-2001)
        (33 years of
Text to Speech Synthesis       in French : an audio tour (1968-2001)       )
      Christophe d'Alessandro
      Article published in         Volume 42 - No. 1/2001 issue of 
Traitement       Automatique des Langues  (TAL,       Editions Hermes),         pp. 297-321.
     
      posted to:
      http://groupeaa.limsi.fr/corpus:synthese:start

Back  Top

5-2-11Google 's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'

Back  Top

5-2-12International Standard Language Resource Number (ISLRN) (ELRA Press release)

Press Release - Immediate - Paris, France, December 13, 2013

Establishing the International Standard Language Resource Number (ISLRN)

12 major NLP organisations announce the establishment of the ISLRN, a Persistent Unique Identifier, to be assigned to each Language Resource.

On November 18, 2013, 12 NLP organisations have agreed to announce the establishment of the International Standard Language Resource Number (ISLRN), a Persistent Unique Identifier, to be assigned to each Language Resource. Experiment replicability, an essential feature of scientific work, would be enhanced by such unique identifier. Set up by ELRA, LDC and AFNLP/Oriental-COCOSDA, the ISLRN Portal will provide unique identifiers using a standardised nomenclature, as a service free of charge for all Language Resource providers. It will be supervised by a steering committee composed of representatives of participating organisations and enlarged whenever necessary.

More information on ELRA and the ISLRN, please contact: Khalid Choukri choukri@elda.org

More information on ELDA, please contact: Hélène Mazo mazo@elda.org

ELRA

55-57, rue Brillat Savarin

75013 Paris (France)

Tel.: +33 1 43 13 33 33

Fax: +33 1 43 13 33 30

Back  Top

5-2-13ISLRN new portal
Opening of the ISLRN Portal
ELRA, LDC,  and AFNLP/Oriental-COCOSDA announce the opening of the ISLRN Portal @ www.islrn.org.


Further to the establishment of the International Standard Language Resource Number (ISLRN) as a unique and universal identification schema for Language Resources on November 18, 2013, ELRA, LDC and AFNLP/Oriental-COCOSDA now announce the opening of the ISLRN Portal (www.islrn.org). As a service free of charge for all Language Resource providers and under the supervision of a steering committee composed of representatives of participating organisations, the ISLRN Portal provides unique identifiers using a standardised nomenclature.

Overview
The 13-digit ISLRN format is: XXX-XXX-XXX-XXX-X. It can be allocated to any Language Resource; its composition is neutral and does not include any semantics in reference to the type or nature of the Language Resource. The ISLRN is a randomly created number with a check digit that validates a Verhoeff algorithm.

Two types of external players may interact with the ISLRN Portal: Visitors and Providers. Visitors may browse the web site and search for the ISLRN of a given Language Resource by its name or by its number if it exists. Providers are registered and own credentials. They can request a new ISLRN for a given Language Resource. A provider has the possibility to become certified, after moderation, in order to be able to import metadata in XML format.

The functionalities that can be accessed by Visitors are:

-          Identify a language resource according to its ISLRN
-          Identify an ISLRN by the name of a language resource
-          Get information about ISLRN, FAQ, Basic Metadata, Legal Information
-          View last 5 accepted resources (“What’s new” block on home page)
-          Sign up to become a provider

The functionalities that can be accessed by Providers, once they have signed up, are:

-          Log in
-          Request an ISLRN according to the metadata of a given resource
-          Request to become a certified provider so as to import XML files containing metadata
-          Import one or more metadata descriptions in XML to request ISLRN(s) (only for certified providers)
-          Edit pending requests
-          Access previous requests
-          Contact a Moderator or an Administrator
-          Edit Providers’ own profile

ISLRN request is handled by moderators within 5 working days.
Contact: islrn@elda.org

Background
The International Standard Language Resource Number (ISLRN) is a unique and universal identification schema for Language Resources which provides Language Resources with unique identifier using a standardised nomenclature. It also ensures that Language Resources are correctly identified, and consequently, recognised with proper references for their usage in applications in R&D projects, products evaluation and benchmark as well as in documents and scientific papers. Moreover, it is a major step in the interconnected world that Human Language Technologies (HLT) has become: unique resources must be identified as they are and meta-catalogues need a common identification format to manage data correctly.

The ISLRN does not intend to replace local and specific identifiers, it is not meant to be a legal deposit, not an obligation, but rather an essential and best practice. For instance a resource that is distributed by several data centres will still have the “local” data-centre identifier but will have a unique ISLRN.

********************************************************************
About ELRA
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT). To find out more about ELRA, please visit www.elra.info.

About LDC
Founded in 1992, the Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. To find out more about LDC, please visit www.ldc.upenn.edu.

About AFNLP
The mission of the Asian Federation of Natural Language Processing (AFNLP) is to promote and enhance R&D relating to the computational analysis and the automatic processing of all languages of importance to the Asian region by assisting and supporting like-minded organizations and institutions through information sharing, conference organization, research and publication co-ordination, and other forms of support. To find out more about AFNLP, please visit www.afnlp.org.

About Oriental-COCOSDA
The International Committee for the Co-ordination and Standardisation of Speech Databases and Assesment Techniques, Oriental-COCOSDA, has been established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing, especially for Speech Input/Output. To find out more about Oriental-COCOSDA, please visit our web site: www.cocosda.org

Back  Top

5-2-14Speechocean – update (February 2015)

Speechocean – update (Feb 2015):

 

Speechocean: A global language resources and data services supplier

 

Speechocean has over 500 large-scale databases available in 110+ languages and accents with the platform of desktop, in-car, telephony and tablet PC. Our data repository is enormous and diversified, which includes ASR Databases, TTS Databases, Lexica, Text Corpora, etc.

 

Speechocean is glad to announce more resources that have been released:

ASR Databases

Speechocean provides 110+ regional languages corpora, available in a variety of formats, situational styles, scene environments and platform systems, covering In-car speech recognition corpora, mobile phone speech recognition corpora, fixed-line speech recognition corpora, desktop speech recognition corpora, etc. This month we are glad to introduce our most popular databases which were made for the tuning and testing purpose of speech recognition systems for speech ASR applications.

    1. In-Car

Serial Number

Kingline Data Names

Sound Parameter

King-ASR-125

Japanese Speech Recognition Database -(In car) 300 Speakers

48k,16bit

Four Channels

King-ASR-120

Chinese Mandarin Speech Recognition Database-(in car )1200 Speakers

16 K16 bit

Four Channels



    1. Mobile

Serial Number

Kingline Data Names

Sound Parameter

King-ASR-216

Chinese Mandarin Speech Recognition Database-Sentences (Mobile)--(5048 speakers)

16K,16bit

One Channel

King-ASR-044

Taiwanese Speech Recognition Database—(Mobile)--654 speakers

16K,16bit

One Channel

King-ASR-113

Chinese Mandarin Speech Recognition Database ----(Mobile)-4000 Speakers

16K,16bit

One Channel

 

    1. Telephony

Serial Number

Kingline Data Names

Sound Parameter

King-ASR-222

Japanese Speech Recognition Database ----
spontaneous dialog (Telephony)-200 Speakers

8k,16bit

King-ASR-027

Chinese Mandarin Speech Recognition Database ---- Spontaneous Speech (Telephone)-649 Speakers

8k,16bit

 

    1. Desktop

      Serial Number

      Kingline Data Names

      Sound Parameter

      King-ASR-087

      Taiwanese Speech Recognition Database ----
      Sentences (Desktop)-200 Speakers

      44.1k,16bit

      King-ASR-111

      Mandarin Speech Recognition Database ----
      spontaneous dialog (Desktop)-1013 Speakers

      44.1k,16bit

      King-ASR-175

      Japanese Speech Recognition Database ----
      Sentences (Desktop)-505 Speakers

      44.1k,16bit

  1. TTS Databases

Speechocean licenses a variety of databases in more than 40 languages for speech synthesis broadcasting speech, emotional speech, etc. which can be used in different algorithms.

Serial No.

Kingline Data Names

Sound Parameter

Recording Hours

King-TTS-020

Russian Speech Synthesis Database (Male)

44.1K,16bit
Two Channels

12.3

King-TTS-027

Taiwanese Speech Synthesis Database (Female)

44.1K,16bit
Two Channels

9.8

King-TTS-030

British English Speech Synthesis Database (Female)

44.1K,16bit
Two Channels

12.6

 

 

  1. Text Corpora

Speechocean licenses many kinds of text corpora in many languages which is superb for language model training.

ID

Kingline Data Names

Size

King-NLP-022

Japanese Name Variants Corpus

4000000 Words

King-NLP-023

Japanese Lexical Database

290000 Words

King-NLP-034

Japanese Organizations Names Corpus

580000 Words

 

  1. Lexica

Speechocean builds pronunciation lexica in many languages which can be licensed to customers.

No.

Name

Phoneset

King-Lexicon-032

Urdu Pronunciation Lexicon

SAMPA

King-Lexicon-033

Vietnamese Pronunciation Lexicon

SO

King-Lexicon-037

Brazilian Portugese Pronunciation Lexicon

SAMPA

 

Contact Information

Xianfeng Cheng

Business Manager of Commercial Department

Tel: +86-10-62660928; +86-10-62660053 ext.8080

Mobile: +86 13681432590

Skype: xianfeng.cheng1

Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com

Website: www.speechocean.com

 

 

 

 

 

 

 

 




 

 

 

 

 

 

Back  Top

5-2-15kidLUCID: London UCL Children’s Clear Speech in Interaction Database

kidLUCID: London UCL Children’s Clear Speech in Interaction Database

We are delighted to announce the availability of a new corpus of spontaneous speech for children aged 9 to 14 years inclusive, produced as part of the ESRC-funded project on ‘Speaker-controlled Variability in Children's Speech in Interaction’ (PI: Valerie Hazan).

Speech recordings (a total of 288 conversations) are available for 96 child participants (46M, 50F, range 9;0 to 15;0 years), all native southern British English speakers. Participants were recorded in pairs while completing the diapix spot-the-difference picture task in which the pair verbally compared two scenes, only one of which was visible to each talker. High-quality digital recordings were made in sound-treated rooms. For each conversation, a stereo audio recording is provided with each speaker on a separate channel together with a Praat Textgrid containing separate word- and phoneme-level segmentations for each speaker.

There are six recordings per speaker pair made in the following conditions:

  • NOB (No barrier): both speakers heard each other normally

  • VOC (Vocoder): one conversational partner heard the other's speech after it had been processed in real time through a noise-excited three channel vocoder

  • BAB (Babble): one conversational partner heard the other's speech in a background of adult multi-talker babble at an approximate SNR of 0 dB.

The kidLUCID corpus is available online within the OSCAAR (Online Speech/Corpora Archive and Analysis Resource) archive (https://oscaar.ci.northwestern.edu/). Free access can be requested for research purposes. Further information about the project can be found at: http://www.ucl.ac.uk/pals/research/shaps/research/shaps/research/clear-speech-strategies

This work was supported by Economic and Social Research Council Grant No. RES-062- 23-3106.

Back  Top

5-2-16Robust speech datasets and ASR software tools


We are happy to announce the release of a table of 44 publicly available robust speech processing datasets and a table of 4 ASR software tools on the wiki of ISCA's Robust Speech Processing SIG:
https://wiki.inria.fr/rosp/Datasets#Speech_datasets
https://wiki.inria.fr/rosp/Software#Automatic_speech_recognition

We hope that these tables will promote wider dissemination of the datasets and software tools available in our community and help newcomers select the most suitable dataset or software for a given experiment. We plan to provide additional tables on, e.g., room impulse response datasets or speaker recognition software in the future.

We highly welcome your input, especially additional tables/entries and reproducible baselines for each dataset. It just takes a few minutes thanks to the simple wiki interface.

For more information about joining the SIG and contributing, see
https://wiki.inria.fr/rosp/

Jonathan Le Roux, Emmanuel Vincent, and Ramon Astudillo

Back  Top

5-2-17International Standard Language Resource Number (ISLRN) implemented by ELRA and LDC

ELRA and LDC partner to implement ISLRN process and assign identifiers to all the Language Resources in their catalogues.

 

Following the meeting of the largest NLP organizations, the NLP12, and their endorsement of the International Standard Language Resource Number (ISLRN), ELRA and LDC partnered to implement the ISLRN process and to assign identifiers to all the Language Resources (LRs) in their catalogues. The ISLRN web portal was designed to enable the assignment of unique identifiers as a service free of charge for all Language Resource providers. To enhance the use of ISLRN, ELRA and LDC have collaborated to provide the ISLRN 13-digit ID to all the Language Resources distributed in their respective catalogues. Anyone who is searching the ELRA and LDC catalogues can see that each Language Resource is now identified by both the data centre ID and the ISLRN number. All providers and users of such LRs should refer to the latter in their own publications and whenever referring to the LR.

 

ELRA and LDC will continue their joint involvement in ISLRN through active participation in this web service.

 

Visit the ELRA and LDC catalogues, respectively at http://catalogue.elra.info and https://catalog.ldc.upenn.edu

 

Background

The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.

The ISLRN portal can be accessed from http://www.islrn.org,

 

***About NLP12***

Representatives of the major Natural Language Processing and Computational Linguistics organizations met in Paris on 18 November 2013 to harmonize and coordinate their activities within the field.
The results of this coordination are expressed in the Paris Declaration: http://www.elra.info/NLP12-Paris-Declaration.html.

 

*** About ELRA ***
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).
To find out more about ELRA, please visit our web site: http://www.elra.info

*** About LDC ***

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and research laboratories that creates and distributes linguistic resources for language-related education, research and technology development.

To find out more about LDC, please visit our web site: https://www.ldc.upenn.edu


For more information, please contact: admin@islrn.org

Back  Top

5-2-18ISLRN adopted by Joint Research Center (JRC) of the European Commission

JRC, the EC's Joint Research Centre, an important LR player: First to adopt the ISLRN initiative

 

The Joint Research Centre (JRC), the European Commission's in house science service, is the first organisation to use the International Standard Language Resource Number (ISLRN) initiative and has requested ISLRN 13-digit unique identifiers to its Language Resources (LR).
Thus, anyone who is using JRC LRs may now refer to this number in their own publications.

 

The current JRC LRs (downloadable from https://ec.europa.eu/jrc/en/language-technologies) with an ISLRN ID are:

 

 

Background

The International Standard Language Resource Number (ISLRN) aims to provide unique identifiers using a standardised nomenclature, thus ensuring that LRs are correctly identified, and consequently, recognised with proper references for their usage in applications within R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers. Moreover, this is a major step in the networked and shared world that Human Language Technologies (HLT) has become: unique resources must be identified as such and meta-catalogues need a common identification format to manage data correctly.
The ISLRN portal can be accessed from http://www.islrn.org,

 

*** About the JRC ***

As the Commission's in-house science service, the Joint Research Centre's mission is to provide EU policies with independent, evidence-based scientific and technical support throughout the whole policy cycle.
Within its research in the field of global security and crisis management, the JRC develops open source intelligence and analysis systems that can automatically harvest and analyse a huge amount of multi-lingual information from the internet-based sources. In this context, the JRC has developed Language Technology resources and tools that can be used for highly multilingual text analysis and cross-lingual applications.
To find out more about JRC's research in open source information monitoring, please visit https://ec.europa.eu/jrc/en/research-topic/internet-surveillance-systems. To access media monitoring applications directly, go to http://emm.newsbrief.eu/overview.html.

 

*** About ELRA ***
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).
To find out more about ELRA, please visit our web site: http://www.elra.info

For more information, contact admin@ilsrn.org
 

Back  Top

5-2-19ELRA News

We are happy to announce that 1 new Written Corpus and 1 new Terminological Resource are now available in our catalogue.

ELRA-W0081 Khresmoi manually annotated reference corpus
ISLRN: 764-036-829-417-7
This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The corpus is divided into two parts:
1. The initial corpus: 625 documents from the Genetics Home Reference data set, automatically annotated with anatomical locations and diseases, and manually corrected by 3-4 annotators. Size of documents: between 26 and 8,306 tokens each.
2. The main corpus: 6,950 English documents from the Khresmoi crawl and 5,518 English Wikipedia pages, automatically annotated through the GATE Platform for Anatomy, Disease, Drug and Investigation. Size of documents: between 200 and 2,000 tokens each.
The corpus is using the GATE XML format.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1237

 

ELRA-T0375 ACL RD-TEC: A Reference Dataset for Terminology Extraction and Classification Research in Computational Linguistics
ISLRN: 699-305-362-089-6
This is a reference dataset for terminology extraction and classification research in computational linguistics. It is a set of manually annotated terms in English language that are extracted from the ACL Anthology Reference Corpus (ACL ARC). This dataset, called ACL RD-TEC, is comprised of more than 69,000 candidate terms that are manually annotated as valid and invalid terms. Furthermore, valid terms are classified as technology and non-technology terms.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1236

 

 

For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org

 

 

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements/


Follow us on Twitter:
@ELRANews

Back  Top



 Organisation  Events   Membership   Help 
 > Board  > Interspeech  > Join - renew  > Sitemap
 > Legal documents  > Workshops  > Membership directory  > Contact
 > Logos      > FAQ
       > Privacy policy

© Copyright 2024 - ISCA International Speech Communication Association - All right reserved.

Powered by ISCA