ISCA - International Speech
Communication Association



ISCApad #185

Tuesday, November 12, 2013 by Chris Wellekens

5-2 Database
5-2-1ELRA - Language Resources Catalogue - Update (2013-09)

*****************************************************************
ELRA - Language Resources Catalogue - Update - September 2013
*****************************************************************

We are happy to announce that 5 new Pronunciation Dictionaries from the GlobalPhone database (Croatian, Russian, Spanish (Latin American), Turkish and Vietnamese) are now available in our catalogue.
     
The GlobalPhone Pronunciation Dictionaries: GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech and text database. The pronunciation dictionaries are currently available in 15 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), and Vietnamese (38504 entries/29974 words). Three other languages will also be released: Chinese-Mandarin, Korean and Thai.
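Since each dictionary records more entries than words (a word form may have several pronunciation variants), a lexicon loader should map each word to a list of pronunciations. The sketch below illustrates this, assuming a simple hypothetical one-entry-per-line, whitespace-separated format; the actual GlobalPhone file format is documented with each dictionary.

```python
from collections import defaultdict

def load_pron_dict(lines):
    """Parse a pronunciation lexicon into word -> list of pronunciations.

    Hypothetical plain-text format assumed here: one entry per line,
    the word form first, then its phoneme sequence, whitespace-separated.
    A word may appear on several lines (pronunciation variants), which is
    why entry counts can exceed word counts, as in the figures above.
    """
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phones = fields[0], fields[1:]
        lexicon[word].append(phones)
    return dict(lexicon)

# Tiny made-up example: two entries for one word, one for another.
entries = [
    "abend a b e n t",
    "abend a b e n d",   # pronunciation variant: 2 entries, 1 word
    "tag t a: k",
]
lex = load_pron_dict(entries)
```

With this layout, the entry/word counts quoted above fall out directly: the number of entries is the sum of variant lists, while the number of words is the number of dictionary keys.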
     
      *** NEW ***
       
 ELRA-S0358 GlobalPhone Croatian Pronunciation Dictionary

      For more information, see: http://catalog.elra.info/product_info.php?products_id=1207
      ELRA-S0359 GlobalPhone Russian Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1208
ELRA-S0360 GlobalPhone Spanish (Latin American) Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1209
      ELRA-S0361 GlobalPhone Turkish Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1210
      ELRA-S0362 GlobalPhone Vietnamese Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1211
     
Special prices are offered for a combined purchase of several GlobalPhone languages.

Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):
      ELRA-S0340 GlobalPhone French Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1197
      ELRA-S0341 GlobalPhone German Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1198
      ELRA-S0348 GlobalPhone Japanese Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1199
      ELRA-S0350 GlobalPhone Arabic Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1200
ELRA-S0351 GlobalPhone Bulgarian Pronunciation Dictionary
For more information, see: http://catalog.elra.info/product_info.php?products_id=1201

      ELRA-S0352 GlobalPhone Czech Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1202
      ELRA-S0353 GlobalPhone Hausa Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1203
      ELRA-S0354 GlobalPhone Polish Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1204
ELRA-S0355 GlobalPhone Portuguese (Brazilian) Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1205
      ELRA-S0356 GlobalPhone Swedish Pronunciation Dictionary
      For more information, see: http://catalog.elra.info/product_info.php?products_id=1206
     
For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).
     
      Visit our On-line Catalogue: http://catalog.elra.info
      Visit the Universal Catalogue: http://universal.elra.info      
      Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html
   

        
   


5-2-2ELRA releases free Language Resources

ELRA releases free Language Resources
***************************************************

Anticipating users’ expectations, ELRA has decided to offer a large number of resources free of charge for academic research use. The offer consists of several sets of speech, text and multimodal resources that are regularly released, for free, as soon as the legal aspects are cleared. A first set was released in May 2012 on the occasion of LREC 2012. A second set is now being released.

Whenever this is permitted by our licences, please feel free to use these resources for deriving new resources and depositing them with the ELRA catalogue for community re-use.

Over the last decade, ELRA has compiled a large list of resources into its Catalogue of LRs. ELRA has negotiated distribution rights with the LR owners and made such resources available under fair conditions and within a clear legal framework. Following this initiative, ELRA has also worked on LR discovery and identification with a dedicated team which investigated and listed existing and valuable resources in its 'Universal Catalogue', a list of resources that could be negotiated on a case-by-case basis. At LREC 2010, ELRA introduced the LRE Map, an inventory of LRs, whether developed or used, that were described in LREC papers. This huge inventory listed by the authors themselves constitutes the first 'community-built' catalogue of existing or emerging resources, constantly enriched and updated at major conferences.

Considering the latest trends on easing the sharing of LRs, from both legal and commercial points of view, ELRA is taking a major role in META-SHARE, a large European open infrastructure for sharing LRs. This infrastructure will allow LR owners, providers and distributors to distribute their LRs through an additional and cost-effective channel.

To obtain the available sets of LRs, please visit the web page below and follow the instructions given online:
http://www.elra.info/Free-LRs,26.html


5-2-3LDC Newsletter (October 2013)

 

In this newsletter:

Fall 2013 LDC Data Scholarship Recipients

New publications:

GALE Phase 2 Chinese Broadcast News Speech
GALE Phase 2 Chinese Broadcast News Transcripts
OntoNotes Release 5.0

Fall 2013 LDC Data Scholarship Recipients

LDC is pleased to announce the student recipients of the Fall 2013 LDC Data Scholarship program! This program provides university and college students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:

   

     

Shamama Afnan - Clemson University (USA), MS candidate, Electrical Engineering. Shamama has been awarded a copy of 2008 NIST Speaker Recognition Training and Test data for her work in speaker recognition.

Seyedeh Firoozabadi - University of Connecticut (USA), PhD candidate, Biomedical Engineering. Seyedeh has been awarded a copy of TIDIGITS and TI-46 Word for her work in speech recognition.

Lei Liu - Beijing Foreign Studies University (China), PhD candidate, Foreign Language Education. Lei has been awarded a copy of Treebank-3 and Prague Czech-English Dependency Treebank 2.0 for his work in parsing.

Monisankha Pal - Indian Institute of Technology, Kharagpur (India), PhD candidate, Electronics and Electrical Communication Engineering. Monisankha has been awarded a copy of CSR-I (WSJ0) and CSR-II (WSJ1) for his work in speaker recognition.

Sachin Pawar - Indian Institute of Technology, Bombay (India), PhD candidate, Computer Science and Engineering. Sachin has been awarded a copy of ACE 2004 Multilingual Training Corpus for his work in named-entity recognition.

Sergio Silva - Federal University of Rio Grande do Sul (Brazil), MS candidate, Computer Science. Sergio has been awarded a copy of 2004 and 2005 Spring NIST Rich Transcription data for his work in diarization.

   

   

 

   

New publications

   

(1) GALE Phase 2 Chinese Broadcast News Speech was developed by LDC and comprises approximately 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by the Linguistic Data Consortium (LDC) and the Hong Kong University of Science and Technology (HKUST) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

   

Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast News Transcripts (LDC2013T20).

   

Broadcast audio for the GALE program was collected at LDC's Philadelphia, PA, USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

   

The broadcast conversation recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based satellite television station.

   

This release contains 248 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16,000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment, by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes, by identifying instances when the incorrect program was recorded; and as a guide for data selection, by retaining information about a program's genre, data type and topic.
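From the quoted parameters (16,000 Hz, single channel, 16-bit samples) the decoded audio stream occupies a fixed 32,000 bytes per second; FLAC compresses only the storage, not the decoded stream. A small sanity-check helper, using only the figures stated above:

```python
def pcm_duration_seconds(n_bytes, sample_rate=16000, channels=1, bytes_per_sample=2):
    """Duration of decoded PCM audio with the parameters quoted above
    (16 kHz, mono, 16-bit = 2 bytes per sample)."""
    return n_bytes / (sample_rate * channels * bytes_per_sample)

# One hour of decoded audio at these settings occupies
# 16000 * 1 * 2 * 3600 = 115,200,000 bytes.
hours = pcm_duration_seconds(115_200_000) / 3600
```

At roughly 115 MB per decoded hour, the 126 hours in this release explain why the corpus ships FLAC-compressed on two DVD-ROMs.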

   

GALE Phase 2 Chinese Broadcast News Speech is distributed on two DVD-ROMs.

   

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
     

   

*
     

   

(2) GALE Phase 2 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 110 hours of Chinese broadcast news speech collected in 2006 and 2007 by LDC and the Hong Kong University of Science and Technology (HKUST) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

   

Corresponding audio data is released as GALE Phase 2 Chinese Broadcast News Speech (LDC2013S08).

   

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,593,049 tokens. The transcripts were created with the LDC-developed transcription tool XTrans, a multi-platform, multilingual, multi-channel tool that supports manual transcription and annotation of audio recordings.
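Tab-delimited UTF-8 files of this kind can be read with Python's standard csv module by setting a tab delimiter. The column names in the sample below are hypothetical placeholders, not the actual TDF field inventory, which is defined in the documentation shipped with the release.

```python
import csv
import io

# Hypothetical TDF-style sample with three illustrative columns
# (source file, start time, transcript); the real releases define
# their own, larger field inventory in the accompanying documentation.
sample = "file1\t12.34\t大家好\nfile1\t15.02\t欢迎收看\n"

def read_tdf(text):
    """Read tab-delimited UTF-8 text into rows of string fields."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

rows = read_tdf(sample)
```

Opening real files with `open(path, encoding="utf-8", newline="")` and passing the handle to `csv.reader` works the same way.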

   

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up; it does not include sentence unit annotation. QRTR annotation adds structural information, such as topic boundaries and manual sentence unit annotation, to the core components of a quick transcript.

   

GALE Phase 2 Chinese Broadcast News Transcripts is distributed via web download.

   

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
     

   

*
     

   


(3) OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

   

OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 (LDC2007T21), OntoNotes Release 2.0 (LDC2008T04), OntoNotes Release 3.0 (LDC2009T24) and OntoNotes Release 4.0 (LDC2011T03) -- and adds source data and/or additional annotations for newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese, and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words.

   

The OntoNotes project built on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation includes word sense disambiguation for nouns and verbs, with some word senses connected to an ontology, and coreference.

   

Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with a Python API to provide convenient cross-layer access.

   

OntoNotes Release 5.0 is distributed on one DVD-ROM.

   

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. This data is available at no charge, but is subject to non-member shipping and handling fees.

   

 


   

      


5-2-4Appen Butler Hill

 

Appen ButlerHill 

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

Language                  Database Type   Catalogue Code  Speakers  Status
Bahasa Indonesia          Conversational  BAH_ASR001      1,002     Available
Bengali                   Conversational  BEN_ASR001      1,000     Available
Bulgarian                 Conversational  BUL_ASR001      217       Available shortly
Croatian                  Conversational  CRO_ASR001      200       Available shortly
Dari                      Conversational  DAR_ASR001      500       Available
Dutch                     Conversational  NLD_ASR001      200       Available
Eastern Algerian Arabic   Conversational  EAR_ASR001      496       Available
English (UK)              Conversational  UKE_ASR001      1,150     Available
Farsi/Persian             Scripted        FAR_ASR001      789       Available
Farsi/Persian             Conversational  FAR_ASR002      1,000     Available
French (EU)               Conversational  FRF_ASR001      563       Available
French (EU)               Voicemail       FRF_ASR002      550       Available
German                    Voicemail       DEU_ASR002      890       Available
Hebrew                    Conversational  HEB_ASR001      200       Available shortly
Italian                   Conversational  ITA_ASR003      200       Available shortly
Italian                   Voicemail       ITA_ASR004      550       Available
Kannada                   Conversational  KAN_ASR001      1,000     In development
Pashto                    Conversational  PAS_ASR001      967       Available
Portuguese (EU)           Conversational  PTP_ASR001      200       Available shortly
Romanian                  Conversational  ROM_ASR001      200       Available shortly
Russian                   Conversational  RUS_ASR001      200       Available
Somali                    Conversational  SOM_ASR001      1,000     Available
Spanish (EU)              Voicemail       ESO_ASR002      500       Available
Turkish                   Conversational  TUR_ASR001      200       Available
Urdu                      Conversational  URD_ASR001      1,000     Available

1.2 Wideband

Language            Database Type  Catalogue Code  Speakers  Status
English (US)        Studio         USE_ASR001      200       Available
French (Canadian)   Home/Office    FRC_ASR002      120       Available
German              Studio         DEU_ASR001      127       Available
Thai                Home/Office    THA_ASR001      100       Available
Korean              Home/Office    KOR_ASR001      100       Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

Part-of-speech tagged Lexica providing grammatical and semantic labels

Other reference text-based materials, including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

Language        Catalogue Code  Words
Arabic          ARB_NER001      500,000
English         ENI_NER001      500,000
Farsi/Persian   FAR_NER001      500,000
Korean          KOR_NER001      500,000
Japanese        JPY_NER001      500,000
Russian         RUS_NER001      500,000
Mandarin        MAN_NER001      500,000
Urdu            URD_NER001      500,000

These NER corpora contain text material from a variety of sources and are tagged for the following named entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.


4. Other Language Resources

Morphological Analyzers – Farsi/Persian & Urdu

Arabic Thesaurus

Language Analysis Documentation – multiple languages

 

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low-risk, high-quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and government organizations.

We would be glad to discuss any customized requests or package configurations and to prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

www.appenbutlerhill.com


5-2-5OFROM: the first corpus of French spoken in French-speaking Switzerland
We would like to announce the online release of OFROM, the first corpus of French spoken in French-speaking Switzerland. In its current version, the archive contains about 15 hours of recordings. It is transcribed in standard orthography using the Praat software. A concordancer makes it possible to search the corpus and to download the sound excerpts associated with the transcriptions.

To access the data and consult a more complete description of the corpus, please visit: http://www.unine.ch/ofrom

5-2-6Real-world 16-channel noise recordings

We are happy to announce the release of DEMAND, a set of real-world
16-channel noise recordings designed for the evaluation of microphone
array processing techniques.

http://www.irisa.fr/metiss/DEMAND/

1.5 h of noise data were recorded in 18 different indoor and outdoor
environments and are available under the terms of the Creative Commons Attribution-ShareAlike License.

Joachim Thiemann (CNRS - IRISA)
Nobutaka Ito (University of Tokyo)
Emmanuel Vincent (Inria Nancy - Grand Est)


5-2-7Support for finalizing oral or multimodal corpora for distribution, promotion and long-term archiving

Support for finalizing oral or multimodal corpora for distribution, promotion and long-term archiving

The IRCOM consortium of the TGIR Corpus and the EquipEx ORTOLANG are joining forces to offer technical and financial support for finalizing corpora of oral or multimodal data, with a view to their distribution and long-term preservation through the EquipEx ORTOLANG. This call does not concern the creation of new corpora but the finalization of existing corpora that are not available in electronic form. By finalization we mean deposit with a public digital repository and entry into a long-term archiving workflow. In this way, the speech data enriched by your research can be reused, cited and enriched in turn, cumulatively, to enable the development of new knowledge, under the conditions of use that you choose (a choice of usage licences for each deposited corpus).

 

This call for proposals is subject to several conditions (see below), and the financial support per project is limited to 3,000 euros. Requests will be processed in the order in which they are received by IRCOM. Requests from EA units or small teams without dedicated corpus engineering support will be given priority. Requests may be submitted from 1 September 2013 to 31 October 2013. Funding decisions rest with the IRCOM steering committee. Requests not processed in 2013 may be processed in 2014. If you are unsure whether your project is eligible, do not hesitate to contact us so that we can examine your request and adapt our future offers.

 

To compensate for the wide disparity in the computing skills of the people and working groups producing corpora, IRCOM offers personalized support for corpus finalization. It will be provided by an IRCOM engineer, according to the requests made, and adapted to the type of need, whether technical or financial.

 

The conditions for proposing a corpus for finalization and obtaining IRCOM support are:

  • Being able to take all decisions concerning the use and distribution of the corpus (intellectual property in particular).

  • Having all the information concerning the sources of the corpus and the consent of the people recorded or filmed.

  • Granting a right of free use of the data, or at a minimum free access for scientific research.

 

Requests may concern any type of processing: processing of nearly finalized corpora (conversion, anonymization), alignment of already-transcribed corpora, conversion from word-processing formats, or digitization of older media. For any request requiring substantial manual work, applicants must commit human or financial resources commensurate with those provided by IRCOM and ORTOLANG.

 

IRCOM is aware of the exceptional and exploratory nature of this initiative. It should also be recalled that this funding is reserved for corpora that are already largely complete and cannot be applied to creations from scratch. Because of these resource limits, proposals for the corpora that are furthest along may be processed first, in agreement with the IRCOM steering committee. There is, however, no theoretical limit on the requests that can be made, since IRCOM can redirect requests outside its remit to other parties.

 

Responses to this call should be sent to ircom.appel.corpus@gmail.com, using the two-page form below. In all cases, IRCOM will send a personalized reply.

 

Proposals must present the corpora concerned, information on usage and ownership rights, and the nature of the formats or media used.

 

This call is organized under the responsibility of IRCOM, with joint financial participation from IRCOM and the EquipEx ORTOLANG.

 

For further information, note that the IRCOM website (http://ircom.corpus-ir.fr) is open and offers resources to the community: a glossary, an inventory of units and corpora, software resources (tutorials, comparisons, conversion tools), working-group activities, training news, and more.

IRCOM invites units to inventory their oral and multimodal corpora - 70 projects have already been listed - to give better visibility to the resources that are already available, even if not all of them are finalized.

 

The IRCOM steering committee

 

 

Please use this form to respond to the call. Thank you.

 

Response to the call for the finalization of an oral or multimodal corpus

 

Corpus name:

 

Name of the contact person:

Email address:

Telephone number:

 

Nature of the corpus data:

 

Are there recordings?

Which medium? Audio, video, other...

What is the total length of the recordings? Number of tapes, number of hours, etc.

What type of medium?

What format (if known)?

 

Are there transcriptions?

What format? (paper, word processor, transcription software)

What quantity (in hours, number of words, or number of transcriptions)?

 

Do you have metadata (a statement of copyright and usage rights)?

 

Do you have a precise description of the people recorded?

 

Do you have informed-consent statements from the people who were recorded? In (approximately) what year were the recordings made?

 

What is the language of the recordings?

 

Does the corpus include recordings of children or of people with a language disorder or pathology?

If so, which population?

 

 

For efficiency, and so that we can advise you as quickly as possible, we need examples of the transcriptions or recordings in your possession. We will contact you about this, but you can already send us a sample of your data by email (transcriptions, metadata, the address of a web page containing the recordings).

 

We thank you in advance for your interest in this proposal. For any further information, please contact Martine Toda martine.toda@ling.cnrs.fr or ircom.appel.corpus@gmail.com.


5-2-8Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French


Rhapsodie: a Prosodic and Syntactic Treebank for Spoken French

We are pleased to announce that Rhapsodie, a syntactic and prosodic treebank created with the aim of modeling the interface between prosody, syntax and discourse in spoken French, is now available at http://www.projet-rhapsodie.fr/

The Rhapsodie treebank is made up of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and a 33,000-word corpus) provided with an orthographic transcription and a phonetic transcription aligned with the sound.

The corpus is representative of different genres (private and public speech; monologues and dialogues; face-to-face interviews and broadcasts; more or less interactive discourse; descriptive, argumentative and procedural samples, variations in planning type).

The corpus samples have mainly been drawn from existing corpora of spoken French, and some were created within the framework of the Rhapsodie project itself. We would especially like to thank the coordinators of the CFPP2000, PFC, ESLO and C-Prom projects, as well as Piet Mertens, Mathieu Avanzi, Anne Lacheret and Nicolas Obin.

The sound samples (wave, MP3, cleaned and stylized pitch), the orthographic transcriptions (txt), the macrosyntactic annotations (txt), the prosodic annotations (xml, textgrid) as well as the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons Attribution - NonCommercial - ShareAlike 3.0 France licence.
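As a rough illustration of working with the downloadable files, the prosodic annotations are distributed as Praat TextGrid files, whose interval tiers can be read with a few lines of code. The snippet below is a minimal sketch over a hypothetical tier excerpt (not actual Rhapsodie data); for real files, a dedicated parser such as the `textgrid` or `praatio` Python packages would be more robust.

```python
import re

# Minimal sketch: extract (xmin, xmax, text) triples from the interval
# entries of a Praat TextGrid tier. SAMPLE is a hypothetical excerpt,
# not actual Rhapsodie data.
SAMPLE = """\
        intervals [1]:
            xmin = 0
            xmax = 0.52
            text = "bonjour"
        intervals [2]:
            xmin = 0.52
            xmax = 0.91
            text = "tout le monde"
"""

def parse_intervals(textgrid_text):
    """Return a list of (xmin, xmax, text) for each interval found."""
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*'
        r'xmax = ([\d.]+)\s*'
        r'text = "([^"]*)"')
    return [(float(a), float(b), t)
            for a, b, t in pattern.findall(textgrid_text)]

print(parse_intervals(SAMPLE))
# → [(0.0, 0.52, 'bonjour'), (0.52, 0.91, 'tout le monde')]
```

This kind of quick extraction is enough for, e.g., counting annotated units or computing durations before moving to a full-featured TextGrid library.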

Microsyntactic annotation will be available soon.

The metadata are also searchable online through a browser.

The prosodic annotation can be explored online through the Rhapsodie Query Language.

Tutorials for transcription, annotation and the Rhapsodie Query Language are available on the site.

 

The Rhapsodie team (Modyco, Université Paris Ouest Nanterre):

Sylvain Kahane, Anne Lacheret, Paola Pietrandrea, Atanas Tchobanov, Arthur Truong.

Partners: IRCAM (Paris), LATTICE (Paris), LPL (Aix-en-Provence), CLLE-ERSS (Toulouse).


5-2-9COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies
======================
CALL for contributions
======================
 
We are pleased to announce the creation of an open-source repository of advanced speech processing algorithms called COVAREP (A Cooperative Voice Analysis Repository for Speech Technologies). COVAREP has been created as a GitHub project (https://github.com/covarep/covarep) where researchers in speech processing can store original implementations of published algorithms.
 
Over the past few decades a vast array of advanced speech processing algorithms has been developed, often offering significant improvements over the existing state of the art. Such algorithms can have a reasonably high degree of complexity and can therefore be difficult to re-implement accurately from article descriptions alone. Another issue is the so-called 'bug magnet' effect, with re-implementations frequently differing significantly from the original. The consequence has been that many promising developments are under-exploited or discarded, with researchers tending to stick to conventional analysis methods.
 
By developing the COVAREP repository we are hoping to address this by encouraging authors to include original implementations of their algorithms, thus resulting in a single de facto version for the speech community to refer to.
 
We envisage a range of benefits to the repository:
1) Reproducible research: COVAREP will allow fairer comparison of algorithms in published articles.
2) Encouraged usage: the free availability of these algorithms will encourage researchers from a wide range of speech-related disciplines (both in academia and industry) to exploit them for their own applications.
3) Feedback: as a GitHub project users will be able to offer comments on algorithms, report bugs, suggest improvements etc.
 
SCOPE
We welcome contributions from a wide range of speech processing areas, including (but not limited to): speech analysis, synthesis, conversion, transformation and enhancement, speech quality, and glottal source/voice quality analysis.
 
REQUIREMENTS
In order to achieve a reasonable standard of consistency and homogeneity across algorithms, we have compiled a list of requirements for prospective contributors to the repository. However, the requirements are not intended to be so strict as to discourage contributions.
  • Only published work can be added to the repository
  • The code must be available as open source
  • Algorithms should be coded in Matlab; however, we strongly encourage authors to make the code compatible with Octave in order to maximize usability
  • Contributions have to comply with a coding convention (see the GitHub site for the convention and a template). The convention only normalizes the inputs/outputs and the documentation; there is no restriction on the content of the functions (though comments are obviously encouraged)
 
LICENCE
Getting contributing institutions to agree to a homogeneous IP policy would be close to impossible. As a result, COVAREP is a repository and not a toolbox, and each algorithm will have its own licence associated with it. Though we are flexible about licence types, contributions will need to have a licence which is compatible with the repository, i.e. {GPL, LGPL, X11, Apache, MIT} or similar. We would encourage contributors to try to obtain LGPL licences from their institutions in order to be more industry-friendly.
 
CONTRIBUTE!
We believe that the COVAREP repository has great potential benefit to the speech research community, and we hope that you will consider contributing your published algorithms to it. If you have any questions, comments, issues, etc. regarding COVAREP, please contact us at one of the email addresses below. Please forward this email to others who may be interested.
 
Existing contributions include: algorithms for spectral envelope modelling, adaptive sinusoidal modelling, fundamental frequency estimation, voicing decision and glottal closure instant detection, and methods for detecting non-modal phonation types.
 
Gilles Degottex <degottex@csd.uoc.gr>, John Kane <kanejo@tcd.ie>, Thomas Drugman <thomas.drugman@umons.ac.be>, Tuomo Raitio <tuomo.raitio@aalto.fi>, Stefan Scherer <scherer@ict.usc.edu>
 
 

5-2-10Annotation of “Hannah and her sisters” by Woody Allen.

We have created and made publicly available a dense audio-visual person-oriented ground-truth annotation of a feature movie (100 minutes long): “Hannah and her sisters” by Woody Allen.

The annotation includes

• Face tracks in video (densely annotated, i.e., in each frame, and person-labeled)

• Speech segments in audio (person-labeled)

• Shot boundaries in video



The annotation can be useful for evaluating



• Person-oriented video-based tasks (e.g., face tracking, automatic character naming)

• Person-oriented audio-based tasks (e.g., speaker diarization or recognition)

• Person-oriented multimodal tasks (e.g., audio-visual character naming)
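As a toy illustration of how such person-labeled segments might be scored, the sketch below computes a naive frame-level label accuracy between reference and hypothesis segments. The (start_sec, end_sec, person) tuple format and the character names are assumptions made for this example, not the dataset's actual file format, and real diarization evaluation would normally use the diarization error rate (DER) rather than this simplified measure.

```python
# Toy frame-level scoring of person-labeled segments, as one might run
# against annotations like Hannah's speech segments. The segment format
# (start_sec, end_sec, person) is an assumption for this sketch.
def frame_labels(segments, n_frames, fps=25):
    """Map (start, end, person) segments onto per-frame labels."""
    labels = [None] * n_frames
    for start, end, person in segments:
        for f in range(int(start * fps), min(int(end * fps), n_frames)):
            labels[f] = person
    return labels

def label_accuracy(ref_segments, hyp_segments, n_frames):
    """Fraction of frames whose reference and hypothesis labels agree."""
    ref = frame_labels(ref_segments, n_frames)
    hyp = frame_labels(hyp_segments, n_frames)
    return sum(r == h for r, h in zip(ref, hyp)) / n_frames

ref = [(0.0, 2.0, "Hannah"), (2.0, 4.0, "Elliot")]  # hypothetical ground truth
hyp = [(0.0, 2.5, "Hannah"), (2.5, 4.0, "Elliot")]  # hypothetical system output
print(label_accuracy(ref, hyp, n_frames=100))  # → 0.88
```

Here the hypothesis assigns the 0.5 s around the true speaker change to the wrong person, so 12 of the 100 frames disagree with the reference.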



Details on the Hannah dataset and how to access it can be found here:

https://research.technicolor.com/rennes/hannah-home/

https://research.technicolor.com/rennes/hannah-download/



Acknowledgments:

This work is supported by AXES EU project: http://www.axes-project.eu/










Alexey Ozerov Alexey.Ozerov@technicolor.com

Jean-Ronan Vigouroux,

Louis Chevallier

Patrick Pérez

Technicolor Research & Innovation



 


5-2-11French TTS

Text-to-Speech Synthesis: over an hour of speech synthesis samples from 1968 to 2001, by 25 French, Canadian, US, Belgian, Swedish and Swiss systems.

33 ans de synthèse de la parole à partir du texte : une promenade sonore (1968-2001)
(33 years of Text-to-Speech Synthesis in French: an audio tour (1968-2001))
Christophe d'Alessandro
Article published in Volume 42 - No. 1/2001 of Traitement Automatique des Langues (TAL, Editions Hermes), pp. 297-321.

Posted at:
http://groupeaa.limsi.fr/corpus:synthese:start



