ISCA - International Speech
Communication Association



ISCApad #292

Wednesday, October 05, 2022 by Chris Wellekens

5-2 Database
5-2-1 Linguistic Data Consortium (LDC) update (September 2022)

 

In this newsletter:
Upcoming Policy Change to LDC’s Open Memberships
LDC at Interspeech 2022
LanguageARC: Citizen Science for Language

30th Anniversary Highlight: Switchboard

New publications:
Xi’an Guanzhong Object Naming
MASRI Synthetic


Upcoming Policy Change to LDC’s Open Memberships
LDC is changing its open membership year policy beginning January 1, 2023. Only one membership year, the current one, will be open for joining. The 2022 membership year will close for joining on December 31, 2022. We expect this change to have a minimal impact on members, while allowing us to streamline our processes to serve members better. LDC’s many membership benefits will remain the same, and organizations choosing to join membership years in advance will still be able to do so. If you have any questions about this change, please don’t hesitate to contact our membership office.

LDC at Interspeech 2022
LDC is proud to sponsor the Workshop for Young Female Researchers in Speech (YFRSW) to be held in-person as an Interspeech 2022 pre-conference satellite event on September 17. Also, be sure to check out the collaborative work of LDC’s Mark Liberman, “The mapping between syntactic and prosodic phrasing in English and Mandarin”, presented during the On-Site Oral Session: Phonetics and Phonology on Wednesday, September 21, 13:30-15:30 KST. 

LanguageARC: Citizen Science for Language 
LanguageARC is a citizen science web portal for language research developed by LDC with the support of the National Science Foundation (grant #1730377). 

LanguageARC brings together researchers and participants from the general public interested in language to form a community dedicated to supporting and advancing language-related research and development. Contributors to this online community can participate in a variety of language-related tasks and activities such as reading text, answering questions, describing images or video, creating or evaluating transcriptions for audio clips, or developing translations into their native languages. LanguageARC includes projects in languages other than English, such as French, Sesotho, and Swedish. Xi’an Guanzhong Object Naming LDC2022S09, released this month in LDC’s Catalog and described below, is an example of a data set developed using LanguageARC. New projects will be added on an ongoing basis.

Sign up for a LanguageARC account today to start making real contributions to language knowledge and research. Please share this information with colleagues, students, and anyone who might be interested in participating in the language activities on this website. If you are a researcher interested in creating a project on LanguageARC, please reach out on the site’s Contact page.

Find LanguageARC on Facebook at: https://www.facebook.com/languagearc

30th Anniversary Highlight: Switchboard
Switchboard-1 Release 2 (LDC97S62) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991  (Godfrey et al., 1992). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.

Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
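The two selection constraints described above can be illustrated with a short sketch. It is not the logic of the original robot operator, which is not documented here; the function name `can_schedule` and the `(caller, callee, topic)` tuple layout are hypothetical choices made for illustration only.

```python
def can_schedule(history, caller, callee, topic):
    """Return True if a proposed call respects both selection constraints.

    history: iterable of (caller, callee, topic) tuples for completed calls
    (a hypothetical record layout, assumed for this sketch).
    """
    pair = frozenset((caller, callee))
    for past_caller, past_callee, past_topic in history:
        past_pair = {past_caller, past_callee}
        # Constraint 1: no two speakers converse together more than once.
        if past_pair == pair:
            return False
        # Constraint 2: no speaker talks about the same topic more than once.
        if past_topic == topic and pair & past_pair:
            return False
    return True


history = [("A", "B", "pets"), ("A", "C", "recycling")]
print(can_schedule(history, "B", "C", "music"))  # new pair, new topic for both
print(can_schedule(history, "B", "A", "music"))  # A and B have already talked
print(can_schedule(history, "B", "C", "pets"))   # B has already discussed pets
```

The first call is allowed and the other two are rejected, one for each constraint.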

This gold standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis.

The Switchboard series includes Switchboard Credit Card, Phase II, Phase III, the Switchboard Cellular collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard corpus.

All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members. Visit Obtaining Data for more information.


New publications:
Xi’an Guanzhong Object Naming comprises 15 hours of audio recordings from speakers of the Guanzhong dialect of Mandarin Chinese living in or near Xi’an in Shaanxi Province (China) naming objects that appeared in colored line drawings. The corpus was developed to support traditional and computer-aided language documentation.

The collection was conducted from February to May 2021 with a closed volunteer community using LanguageARC, a citizen science portal developed by LDC. Speakers were presented with images selected from the MultiPic dataset and were asked to record themselves naming the objects in the images.

Xi’an Guanzhong Object Naming is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.

*

MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and contains 99 hours of synthesized Maltese speech.

Source sentences were extracted from the Maltese Language Resource Server (MLRS) corpus, comprised of written or transcribed Maltese covering various genres, including parliamentary debates, news, law, opinion, sports, culture, academic, literature, and religious texts. Text was processed through the CrimsonWing text-to-speech system to generate speech files. Synthesized speech was created with 210 voices (105 female, 105 male).

MASRI Synthetic is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $250.


Membership Coordinator

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104

 


5-2-2 ELRA - Language Resources Catalogue - Update (June 2022)
ELRA releases the Corpus of Interactions between Seniors and an Empathic Virtual Coach in Spanish, French and Norwegian

ELRA is pleased to announce the release of the Corpus of Interactions between Seniors and an Empathic Virtual Coach in Spanish, French and Norwegian. This corpus was built within the EMPATHIC project (Empathic, Expressive, Advanced Virtual Coach to Improve Independent Healthy-Life-Years of the Elderly), funded within the European Union's Horizon 2020 Research and Innovation program. It consists of video recordings, obtained through a webcam. These recordings were made during user conversations with a Wizard of Oz (WOZ) and with an automatic dialogue system. Several topics were involved in the conversations that were carried out in three different languages: Spanish, French and Norwegian.

The corpus can be found in the ELRA Catalogue under the following links and references: Corpus of Interactions between Seniors and an Empathic Virtual Coach in Spanish, French and Norwegian ISLRN: 631-345-309-445-9

For more information and/or questions, please write to contact@elda.org.

About the EMPATHIC project
The EMPATHIC project aims to research, innovate and validate new paradigms, laying the foundation for future generations of Personalised Virtual Coaches to help elderly people live independently. The following partners collaborated on this project: the University of Basque Country UPV/EHU (Spain), OSATEK (Spain), the University of Barcelona (Spain), Oslo University Hospital (Norway), e-Seniors Association (France), Tunstall Healthcare (Spain and Sweden), Intelligent Voice (United Kingdom), Acapela Group (Belgium), Institut Mines-Télécom (France), and Università degli Studi della Campania “L. Vanvitelli” (Italy).

To find out more about the EMPATHIC project, please visit: http://www.empathic-project.eu/


About ELRA

The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).

To find out more on ELRA, please visit: http://www.elra.info/en/


 
 
5-2-3 Speechocean – update (August 2019)

 

English Speech Recognition Corpus - Speechocean

 

To date, Speechocean has produced more than 24,000 hours of English speech recognition corpora, including some rare corpora recorded by children. These corpora were recorded by 23,000 speakers in total. See the table below:

 

Name                          Speakers    Hours
American English                 8,441    8,029
Indian English                   2,394    3,540
British English                  2,381    3,029
Australian English               1,286    1,954
Chinese (Mainland) English       3,478    1,513
Canadian English                 1,607    1,309
Japanese English                 1,005      902
Singapore English                  404      710
Russian English                    230      492
Romanian English                   201      389
French English                     225      378
Chinese (Hong Kong) English        200      378
Italian English                    213      366
Portugal English                   201      341
Spanish English                    200      326
German English                     196      306
Korean English                     116      207
Indonesian English                 402      126
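As a quick consistency check, the per-variety figures in the table above can be summed and compared with the rounded totals quoted in the announcement (more than 24,000 hours and 23,000 speakers). The snippet below is only an illustrative sketch: the variable names are hypothetical and the numbers are copied from the table.

```python
# (name, speakers, hours) rows copied from the Speechocean table.
ROWS = [
    ("American English", 8441, 8029),
    ("Indian English", 2394, 3540),
    ("British English", 2381, 3029),
    ("Australian English", 1286, 1954),
    ("Chinese (Mainland) English", 3478, 1513),
    ("Canadian English", 1607, 1309),
    ("Japanese English", 1005, 902),
    ("Singapore English", 404, 710),
    ("Russian English", 230, 492),
    ("Romanian English", 201, 389),
    ("French English", 225, 378),
    ("Chinese (Hong Kong) English", 200, 378),
    ("Italian English", 213, 366),
    ("Portugal English", 201, 341),
    ("Spanish English", 200, 326),
    ("German English", 196, 306),
    ("Korean English", 116, 207),
    ("Indonesian English", 402, 126),
]

# Sum each numeric column across all 18 varieties.
total_speakers = sum(speakers for _, speakers, _ in ROWS)
total_hours = sum(hours for _, _, hours in ROWS)
print(total_speakers, total_hours)
```

The rows add up to 23,180 speakers and 24,295 hours, in line with the rounded totals.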

 

 

If you have any further inquiries, please do not hesitate to contact us.

Web: en.speechocean.com

Email: marketing@speechocean.com

 

 

 

 

 

 


 


 

 

5-2-4 Google's Language Model benchmark
 Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

  • unpruned Katz (1.1B n-grams),
  • pruned Katz (~15M n-grams),
  • unpruned Interpolated Kneser-Ney (1.1B n-grams),
  • pruned Interpolated Kneser-Ney (~15M n-grams)

 

Happy benchmarking!'
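Per-word log-probabilities such as those distributed with the benchmark are commonly summarised as perplexity. The sketch below is a minimal illustration and is not part of the benchmark's own scripts: the function name, the `base` parameter and the toy values are assumptions, and the log base used in the released files should be checked before applying it.

```python
import math


def perplexity(log_probs, base=math.e):
    """Perplexity from per-word log-probabilities given in log base `base`."""
    if not log_probs:
        raise ValueError("need at least one log-probability")
    # Perplexity is the base raised to the negative mean log-probability.
    mean_log_prob = sum(log_probs) / len(log_probs)
    return base ** (-mean_log_prob)


# Toy example: four words, each assigned probability 0.25 by the model.
example = [math.log(0.25)] * 4
print(perplexity(example))  # close to 4, up to floating-point rounding
```

Comparing two models on the same held-out set then amounts to comparing these values, with lower perplexity indicating a better fit.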

5-2-5 Forensic database of voice recordings of 500+ Australian English speakers

We are pleased to announce that the forensic database of voice recordings of 500+ Australian English speakers is now published.

The database was collected by the Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales, as part of the Australian Research Council funded Linkage Project on making demonstrably valid and reliable forensic voice comparison a practical everyday reality in Australia. The project was conducted in partnership with the Australian Federal Police, New South Wales Police, Queensland Police, the National Institute of Forensic Sciences, the Australasian Speech Sciences and Technology Association, the Guardia Civil, and the Universidad Autónoma de Madrid.

The database includes multiple non-contemporaneous recordings of most speakers. Each speaker is recorded in three different speaking styles representative of some common styles found in forensic casework. Recordings were made under high-quality conditions, and extraneous noises and crosstalk have been manually removed. The high-quality audio can be processed to reflect recording conditions found in forensic casework.

The database can be accessed at: http://databases.forensic-voice-comparison.net/

5-2-6 Audio and Electroglottographic speech recordings

 

Audio and Electroglottographic speech recordings from several languages

We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'.

http://www.phonetics.ucla.edu/voiceproject/voice.html

Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, and Santiago Matatlan/San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.

Analysis software developed as part of the project – VoiceSauce for audio analysis and EggWorks for EGG analysis – and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.

All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.

This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.

Pat Keating (UCLA)

5-2-7 EEG, face tracking and audio: 24 GB data set Kara One, Toronto, Canada

We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.

 

In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.

 

Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.

 

If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.

 

Best regards,

Frank

 

 

PAPER: Shunan Zhao and Frank Rudzicz (2015). Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane, Australia.

ABSTRACT: This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.

 

5-2-8 TORGO database free for academic use

In the spirit of the season, I would like to announce the immediate availability of the TORGO database free, in perpetuity, for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used, for example, for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.

5-2-9 Datatang

Datatang is a leading global data provider that specializes in customized data solutions, focusing on the collection, annotation and crowdsourcing of a variety of speech, image, and text data.

 

Summary of the new datasets (2018) and a brief plan for 2019.

 

 

 

Speech data (with annotation) completed in 2018:

Language                      Dataset length (hours)
French                          794
British English                 800
Spanish                         435
Italian                       1,440
German                        1,800
Spanish (Mexico/Colombia)       700
Brazilian Portuguese          1,000
European Portuguese           1,000
Russian                       1,000

 

Ongoing speech projects in 2019:

Type                            Project name
Europeans speak English         1000 Hours-Spanish Speak English
                                1000 Hours-French Speak English
                                1000 Hours-German Speak English
Call Center Speech              1000 Hours-Call Center Speech
Off-the-shelf data expansion    1000 Hours-Chinese Speak English
                                1500 Hours-Mixed Chinese and English Speech Data

 

 

 

In addition to the above, more speech data collections are planned, such as Japanese speech data, children's speech data, dialect speech data and so on.

We will also continue to provide these data at a competitive price while maintaining a high accuracy rate.

If you have any questions or need more details, do not hesitate to contact us at jessy@datatang.com.

We can send you a sample or a specification of the data.

 

 

 


5-2-10 SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis. A total of 9750 utterances from various sources such as parliament debates and novels were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
 
5-2-11 JLCorpus - Emotional Speech corpus with primary and secondary emotions
 

To further the understanding of the wide array of emotions embedded in human speech, we are introducing an emotional speech corpus. In contrast to existing speech corpora, this corpus was constructed by maintaining an equal distribution of 4 long vowels in New Zealand English. This balance is intended to facilitate comparison studies of emotion-related formant and glottal source features. The corpus also has 5 secondary emotions along with 5 primary emotions. Secondary emotions are important in Human-Robot Interaction (HRI), where the aim is to model natural conversations between humans and robots. However, there are very few existing speech resources for studying these emotions, and this work adds a speech corpus containing some secondary emotions.

Please use the corpus for emotional speech related studies. When you use it please include the citation as:

Jesin James, Li Tian, Catherine Watson, 'An Open Source Emotional Speech Corpus for Human Robot Interaction Applications', in Proc. Interspeech, 2018.

To access the whole corpus including the recording supporting files, click the following link: https://www.kaggle.com/tli725/jl-corpus, (if you have already installed the Kaggle API, you can type the following command to download: kaggle datasets download -d tli725/jl-corpus)

Or if you simply want the raw audio+txt files, click the following link: https://www.kaggle.com/tli725/jl-corpus/downloads/Raw%20JL%20corpus%20(unchecked%20and%20unannotated).rar/4

The corpus was evaluated in a large-scale human perception test with 120 participants. The surveys are available at the following links. For the primary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_8ewmOCgOFCHpAj3

For the secondary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_eVDINp8WkKpsPsh

These surveys will give an overall idea about the type of recordings in the corpus.

The perceptually verified and annotated JL corpus will be given public access soon.

5-2-12 OPENGLOT – An open environment for the evaluation of glottal inverse filtering

 

OPENGLOT is a publicly available database that was designed primarily for the evaluation of glottal inverse filtering algorithms. In addition, the database can be used in evaluating formant estimation methods. OPENGLOT consists of four repositories. Repository I contains synthetic glottal flow waveforms and speech signals generated by using the Liljencrants–Fant (LF) waveform as an excitation and an all-pole vocal tract model. Repository II contains glottal flow and speech pressure signals generated using physical modelling of human speech production. Repository III contains pairs of glottal excitation and speech pressure signals generated by exciting a 3D-printed plastic vocal tract replica with LF excitations via a loudspeaker. Finally, Repository IV contains multichannel recordings (speech pressure signal, EGG, high-speed video of the vocal folds) from natural production of speech.

 

OPENGLOT is available at:

http://research.spa.aalto.fi/projects/openglot/

5-2-13 Corpus Rhapsodie

We are pleased to announce the publication of a book devoted to the Rhapsodie treebank, a 33,000-word corpus of spoken French finely annotated for prosody and syntax.

Access to the publication: https://benjamins.com/catalog/scl.89 (see the attached flyer)

Access to the treebank: https://www.projet-rhapsodie.fr/
The freely accessible data are distributed under a Creative Commons licence.
The site also provides access to the annotation guides.

5-2-14 The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus), Boulder Learning Inc.

The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus) is the world's largest English children's speech corpus. It is freely available to the research community for research use. Companies can acquire the corpus for $10,000. The MyST Corpus was collected over a 10-year period, with support from over $9 million in grants from the US National Science Foundation and Department of Education, awarded to Boulder Learning Inc. (Wayne Ward, Principal Investigator).

The MyST corpus contains speech collected from 1,374 third, fourth and fifth grade students. The students engaged in spoken dialogs with a virtual science tutor in 8 areas of science. A total of 11,398 student sessions of 15 to 20 minutes produced a total of 244,069 utterances. 42% of the utterances have been transcribed at the word level. The corpus is partitioned into training and test sets to support comparison of research results across labs. All parents and students signed consent forms, approved by the University of Colorado's Institutional Review Board, that authorize distribution of the corpus for research and commercial use.

The MyST children's speech corpus contains approximately ten times as many spoken utterances as all other English children's speech corpora combined (see https://en.wikipedia.org/wiki/List_of_children%27s_speech_corpora).

Additional information about the corpus, and instructions for how to acquire the corpus (and samples of the speech data) can be found on the Boulder Learning Web site at http://boulderlearning.com/request-the-myst-corpus/.   

5-2-15 HARVARD speech corpus - native British English speaker
  • HARVARD speech corpus - native British English speaker, digital re-recording
 
5-2-16 Magic Data Technology Kid Voice TTS Corpus in Mandarin Chinese (November 2019)

 

Magic Data Technology is one of the leading artificial intelligence data service providers in the world. The company is committed to providing a wide range of customized data services in the fields of speech recognition, intelligent imaging and Natural Language Understanding.

This corpus was recorded by a four-year-old Chinese girl born in Beijing, China. In this release, we are publishing 15 minutes of speech data from the corpus for non-commercial use.

 

The contents and the corresponding descriptions of the corpus:

  • The corpus contains 15 minutes of speech data, recorded in an NC-20 acoustic studio.

  • The speaker is a 4-year-old girl born in Beijing.

  • Detailed information, such as speech data coding and speaker information, is preserved in the metadata file.

  • The speech is in a natural child style.

  • Annotation includes four parts: pronunciation proofreading, prosody labeling, phone boundary labeling and POS tagging.

  • The annotation accuracy is higher than 99%.

  • For phone labeling, the database annotates not only the boundaries of phonemes but also the boundaries of the silent parts.

 

The corpus aims to help researchers in the TTS field. It is part of a much bigger dataset (the 2.3-hour MAGICDATA Kid Voice TTS Corpus in Mandarin Chinese), which was recorded in the same environment. This is the first time this voice has been published!

Please note that the speaker and her parents have authorized the release of this corpus.

 

Samples are available.

Do not hesitate to contact us for any questions.

Website: http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/360

E-mail: business@magicdatatech.com

5-2-17 FlauBERT: a French LM
Here is FlauBERT: a French LM trained (on the CNRS Jean Zay supercomputer) on a large and heterogeneous corpus. Along with it comes FLUE (an evaluation setup for French NLP). FlauBERT was successfully applied to complex tasks (NLI, WSD, parsing). More at https://github.com/getalp/Flaubert
More details on this online paper: https://arxiv.org/abs/1912.05372 
5-2-18 ELRA-S0408 SpeechTera Pronunciation Dictionary

ISLRN: 645-563-102-594-8
The SpeechTera Pronunciation Dictionary is a machine-readable pronunciation dictionary for Brazilian Portuguese and comprises 737,347 entries. Its phonetic transcription is based on 13 linguistic varieties spoken in Brazil and contains the pronunciation of the frequent word forms found in the transcription data of SpeechTera's speech and text database (literary, newspaper, movies, miscellaneous). Each of the thirteen dialects comprises 56,719 entries.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0408/
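The entry counts quoted in this description are mutually consistent, as a short check shows. This is a trivial sketch with illustrative variable names; the figures are taken from the description.

```python
varieties = 13                 # linguistic varieties covered
entries_per_variety = 56_719   # entries listed for each variety
total_entries = varieties * entries_per_variety
print(total_entries)  # 737347
```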

For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements

5-2-19 Resources of ELRC Network

Paris, France, April 23, 2020

ELRA is happy to announce that Language Resources collected within the ELRC Network, funded by the European Commission, are now available from the ELRA Catalogue of Language Resources.

In total, 180 Written Corpora, 5 Multilingual Lexicons and 2 Terminological Resources, are freely available under open licences and can be downloaded directly from the catalogue. Type 'ELRC' in the catalogue search engine (http://catalog.elra.info/en-us/repository/search/?q=ELRC) to access and download resources.

All these Language Resources can be used to support the development of your Machine Translation solutions. They cover the official languages of the European Union and CEF associated countries.

More LRs coming from ELRC will be added as they become available.

*****
About ELRC
The ELRC (European Language Resources Coordination) Network raises awareness and promotes the acquisition and continued identification and collection of language resources in all official languages of the EU and CEF associated countries. These activities aim to help improve the quality, coverage and performance of automated translation solutions in the context of current and future CEF digital services.

To find out more about ELRC, please visit the website: http://lr-coordination.eu


About ELRA
The European Language Resources Association (ELRA) is a non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for Language Resources and promoting Human Language Technologies. Language Resources covering various fields of HLT (including Multimodal, Speech, Written, Terminology) and a great number of languages are available from the ELRA catalogue. ELRA's strong involvement in the fields of Language Resources  and Language Technologies is also emphasized at the LREC conference, organized every other year since 1998.

To find out more about ELRA, please visit the website: http://www.elra.info


5-2-20 ELRA announces that MEDIA data are now available for free for academic research

Further to the request of the French HLT community to foster evaluation activities for man-machine dialogue systems for the French language, ELRA has decided to provide free access to the MEDIA speech corpora and evaluation package for academic research purposes.

The MEDIA data can be found in the ELRA Catalogue under the following references:

Data available from the ELRA Catalogue can be obtained easily by contacting ELRA.  

The MEDIA project was carried out within the framework of Technolangue, the French national research programme funded by the French Ministry of Research and New Technologies (MRNT) with the objective of running a campaign for the evaluation of man-machine dialogue systems for French. The campaign was distributed over two actions: an evaluation taking into account the dialogue context and an evaluation not taking into account the dialogue context.

PortMedia was a follow up project supported by the French Research Agency (ANR). The French and Italian corpus was produced by ELDA, with the same paradigm and specifications as the MEDIA speech database but on a different domain.

For more information and/or questions, please write to contact@elda.org.

 *** About ELRA ***
The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).

To find out more about ELRA and its respective catalogue, please visit: http://www.elra.info and http://catalogue.elra.info

5-2-21 ELRA/ELDA Communication: LT4All

Out of the 7,000+ language spoken around the world, only a few have associated Language Technologies. The majority of languages can be considered as 'under-resourced' or as 'not supported'. This situation, very detrimental to many languages speakers, and specifically indigenous languages speakers, creates a digital divide,  and places many languages in danger of digital extinction, if not complete extinction.

Organized as part of the 2019 International Year of Indigenous Languages, the 1st edition of LT4All (Language Technologies for All: Enabling Linguistic Diversity and Multilingualism Worldwide) took place in Paris at the UNESCO Headquarters on December 4-6, 2019 and gathered 400 participants from various backgrounds (including language science and technology researchers, linguists, industry representatives, indigenous peoples, language policy and decision makers) from all over the world.

The LT4All Programme and Editorial Committees are very happy to announce that the set of Research Papers and Posters collected on the occasion of LT4All is now available online at: https://lt4all.elra.info/proceedings/lt4all2019/

****
LT4All has been made possible thanks to the close cooperation between UNESCO, the Government of the Khanty-Mansiysk Autonomous Okrug-Ugra (Russian Federation), the European Language Resources Association (ELRA) and its Special Interest Group on Under-resourced Languages (SIGUL), and in partnership with the UNESCO Intergovernmental Information for All Programme (IFAP) and the Interregional Library Cooperation Centre, as well as with the support of other public organizations and sponsors.

More information, including the list of all sponsors and supporters, is available at https://en.unesco.org/LT4All

5-2-22Search and Find ELRA LRs on Google Dataset Search and ELG LT Platform

ELRA is happy to announce that all the Language Resources from its Catalogue can now be searched and found on Google Dataset Search and on the ELG Language Technology platform developed within the European Language Grid project.

In order to allow the indexing by Google Dataset Search, ELRA has updated the code generating the catalogue pages. The code developed follows the schema.org standard and is publicly available in JSON format so that it can be used for other harvesting purposes.
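
As an illustration of how such schema.org metadata might be harvested, here is a minimal Python sketch. It assumes that a catalogue page embeds its description as JSON-LD inside a script element of type "application/ld+json", which is a common way of exposing schema.org metadata but is not confirmed by this announcement. The sample page, the identifier EXAMPLE-0001 and the names JsonLdExtractor and extract_datasets are hypothetical and only serve the example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return every schema.org object of type Dataset found in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    datasets = []
    for raw in parser.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the harvest
        items = doc if isinstance(doc, list) else [doc]
        datasets.extend(
            item for item in items
            if isinstance(item, dict) and item.get("@type") == "Dataset"
        )
    return datasets


# Hypothetical page content, not taken from the ELRA Catalogue.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example speech corpus", "identifier": "EXAMPLE-0001"}
</script>
</head><body></body></html>
"""

for dataset in extract_datasets(SAMPLE_PAGE):
    print(dataset["identifier"], "-", dataset["name"])
```

A real harvester would fetch the pages over HTTP and would probably need to handle nested structures such as "@graph"; both are left out here to keep the sketch self-contained.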

The ELRA Catalogue is already indexed and harvested by well-known repositories and archives such as OLAC (Open Language Archives Community), the CLARIN Virtual Language Observatory and META-SHARE.

For 25 years now, ELRA has been distributing Language Resources to support research and development in various fields of Human Language Technology. The indexing on both Google Dataset Search and the ELG LT Platform is increasing the ELRA Catalogue's visibility, making the LRs known to new visitors from the Human Language Technologies, Artificial Intelligence and other related fields.


*** About ELRA ***

The European Language Resources Association (ELRA) is a non-profit making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies (HLT).

ELRA Catalogue of Language Resources: http://catalogue.elra.info

To find out more about ELRA, please visit: http://www.elra.info


*** About Google Dataset Search ***

Google Dataset Search is a search engine for datasets that enables users to search through a list of data repositories indexed through a standardised schema.

More about Google Dataset Search: https://datasetsearch.research.google.com/


*** About European Language Grid ***

The European Language Grid (ELG) is a project funded by the European Union through the Horizon 2020 research and innovation programme. It aims to be a primary platform for Language Technology in Europe.

More about the European Language Grid project: https://www.european-language-grid.eu/

5-2-23Sharing language resources (ELRA)

ELRA recognises the importance of sharing Language Resources (LRs) and making them available to the community.

Since the 2014 edition of LREC, the Language Resources and Evaluation Conference, participants have been offered the possibility to share their LRs (data, tools, web-services, etc.) when submitting a paper, by uploading them to a special LREC repository set up by ELRA.

This effort of sharing LRs, linked to the LRE Map initiative (https://lremap.elra.info) for their description, contributes to creating a common repository where everyone can deposit and share data.

Despite the cancellation of LREC 2020 in Marseille, a large number of contributions were submitted and the LREC 'Share your LRs' initiative was successfully completed.

Repositories corresponding to each edition of the conference are available here:

For more info and questions, please write to contact@elda.org.
 
5-2-24The UCLA Variability Speaker Database

With NSF support, our interdisciplinary voice research team at UCLA recently put together a public database that we believe will be of interest to many members of the ISCA community. On behalf of my co-authors (Patricia Keating, Jody Kreiman, Abeer Alwan, Adam Chong), I'm writing to ask if we could advertise our database in the ISCA newsletter. We'd really appreciate your help with this.

 
The database, the UCLA Variability Speaker Database, is freely available through UCLA's Box cloud, which can be accessed from our lab website (http://www.seas.ucla.edu/spapl/shareware.html#Data). I should mention that the database will also be available from the Linguistic Data Consortium (LDC) as of October 2021.
 
Here's a brief description of the database.
The UCLA Variability Speaker Database comprises high-quality audio recordings from 202 speakers, 101 men and 101 women, performing 12 brief speech tasks in English over three recording sessions (total amount of speech: 300-450 sec per speaker). This public database was designed to sample variability in speaking within individual speakers and across a large number of speakers. The large set of speakers (similar in age) sampled from the current university community is gender-balanced and has a variety of language backgrounds. The database can serve as a testing ground for research questions involving between-speaker variability, within-speaker variability, and text-dependent variability.
More details about the database are available in a readme file that can be sent on request.

--Cynthia Yoonjeong Lee
Postdoctoral Scholar, Department of Linguistics, UCLA

5-2-25Free databases in Catalan, Spanish and Arabic (ELRA and UPC Spain)

We are pleased to announce that Language Resources entrusted to ELRA for distribution and shared by the Universitat Politècnica de Catalunya (UPC), in Spain, are now available for free for academic research purposes (for ELRA institutional members) and at substantially decreased costs for commercial purposes. All data have been developed to enhance speech technologies in Catalan, Spanish and Arabic.

 

The Language Resources can be found in the ELRA Catalogue under the following references:

ELRA-S0101 Spanish SpeechDat(II) FDB-1000 (ISLRN: 415-072-153-167-5)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0101/
ELRA-S0102 Spanish SpeechDat(II) FDB-4000 (ISLRN: 295-399-069-106-4)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0102/
ELRA-S0140 Spanish SpeechDat-Car database (ISLRN: 937-459-364-430-3)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0140/
ELRA-S0141 SALA Spanish Venezuelan Database (ISLRN: 894-744-522-508-8)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0141/
ELRA-S0173 SALA Spanish Mexican Database (ISLRN: 077-043-759-782-3)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0173/
ELRA-S0183 OrienTel Morocco MCA (Modern Colloquial Arabic) database (ISLRN: 613-578-868-832-2)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0183/
ELRA-S0184 OrienTel Morocco MSA (Modern Standard Arabic) database (ISLRN: 978-839-138-181-8)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0184/
ELRA-S0185 OrienTel French as spoken in Morocco database (ISLRN: 299-422-451-969-8)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0185/
ELRA-S0186 OrienTel Tunisia MCA (Modern Colloquial Arabic) database (ISLRN: 297-705-745-294-4)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0186/
ELRA-S0187 OrienTel Tunisia MSA (Modern Standard Arabic) database (ISLRN: 926-401-827-806-5)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0187/
ELRA-S0188 OrienTel French as spoken in Tunisia database (ISLRN: 085-972-271-578-3)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0188/
ELRA-S0207 LC-STAR Catalan phonetic lexicon (ISLRN: 102-856-174-704-7)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0207/
ELRA-S0208 LC-STAR Spanish phonetic lexicon (ISLRN: 826-939-678-247-5)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0208/
ELRA-S0243 SpeechDat Catalan FDB database (ISLRN: 373-541-490-506-3)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0243/
ELRA-S0306 TC-STAR Transcriptions of Spanish Parliamentary Speech (ISLRN: 972-398-693-247-4)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0306/
ELRA-S0309 TC-STAR Spanish Baseline Female Speech Database (ISLRN: 682-113-241-701-0)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0309/
ELRA-S0310 TC-STAR Spanish Baseline Male Speech Database (ISLRN: 736-021-086-598-0)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0310/
ELRA-S0311 TC-STAR Bilingual Voice-Conversion Spanish Speech Database (ISLRN: 254-311-004-570-0)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0311/
ELRA-S0312 TC-STAR Bilingual Voice-Conversion English Speech Database (ISLRN: 522-613-023-181-1)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0312/
ELRA-S0313 TC-STAR Bilingual Expressive Speech Database (ISLRN: 088-656-828-489-3)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0313/
ELRA-S0336 Spanish Festival voice male (ISLRN: 868-352-143-949-9)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0336/
ELRA-S0337 Spanish Festival voice female (ISLRN: 396-262-481-019-0)
For more information, see: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0337/



For more information on the catalogue, please contact Valérie Mapelli (mapelli@elda.org).

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements

5-2-26USC 75-Speaker Speech MRI Database. Multispeaker speech production articulatory datasets of vocal tract MRI video

The 'USC 75-Speaker Speech MRI Database', a USC multispeaker speech production articulatory dataset of vocal tract MRI video, is a new freely available speech production dataset with accompanying software tools:
https://www.nature.com/articles/s41597-021-00976-x

These data contain 2D sagittal-view RT-MRI videos along with synchronized audio for 75 subjects performing linguistically motivated speech tasks, alongside the corresponding raw RT-MRI data. The dataset also includes 3D volumetric vocal tract MRI during sustained speech sounds and high-resolution static anatomical T2-weighted upper airway MRI for each subject.

Other freely available articulatory speech production datasets include a TIMIT articulatory data corpus and emotional speech production data, all available from: https://sail.usc.edu/span/resources.html




© Copyright 2024 - ISCA International Speech Communication Association - All rights reserved.
