ISCApad #258 |
Tuesday, December 10, 2019 by Chris Wellekens |
5-2-1 | Linguistic Data Consortium (LDC) update (November 2019 and December 2019)
In this newsletter: (November 2019) Join LDC for Membership Year 2020
Join LDC for Membership Year 2020
It’s also not too late to join for MY2018 (through December 31, 2019) and MY2019 (through December 31, 2020). Data sets from those years include Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ Audio, BOLT English Treebank – Discussion Forum, First DIHARD Challenge Development and Evaluation releases, Penn Discourse Treebank Version 3.0, and 2016 NIST Speaker Recognition Evaluation Test Set. New publications:
(1) DEFT English Committed Belief Annotation was developed by LDC and consists of approximately 950,000 words of English discussion forum text annotated for 'committed belief,' which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.
(2) CALLFRIEND American English-Non-Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of non-Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure, and adds documentation and metadata. The first edition is available as CALLFRIEND American English-Non-Southern Dialect (LDC96S46).
(3) TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 was developed by LDC and contains Chinese, English, and Spanish data produced in support of the TAC KBP Cold Start evaluation track conducted from 2012 to 2017. This corpus includes source documents, queries, assessments, manual runs, and final assessments.
(4) IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Amharic conversational and scripted telephone speech collected in 2014 along with corresponding transcripts.
Membership Office
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810 Philadelphia, PA 19104
In this newsletter: (December 2019)
(2) BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by LDC and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.
(3) TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2016 and 2017. This corpus includes queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information for each of the queries. The EDL reference KB, to which EDL data are linked, is available separately in TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 (LDC2019T02).
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-2 | ELRA - Language Resources Catalogue - Update (October 2019) In the framework of a distribution agreement between ELRA and the CJK Dictionary Institute, Inc., ELRA is happy to announce the distribution of 29 Monolingual Lexicons and 20 Multilingual Lexicons, suitable for a large variety of natural language processing applications. Monolingual Lexicons are available for Arabic, Cantonese, Simplified and Traditional Chinese, Japanese, Korean, Persian and Spanish and Multilingual lexicons include those languages as well as some other European languages (English, German, French, Italian, Portuguese and Russian) and Asian languages (Vietnamese, Indonesian, Thai). Possible applications include information retrieval, morphological analysis, word segmentation, named entity recognition, machine translation, etc. All lexicons are made available in tab-delimited, UTF-8 encoded text files.
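Because the lexicons are distributed as tab-delimited, UTF-8 encoded text files, a standard CSV parser should be enough to load them. The following is a minimal Python sketch under stated assumptions: the column names (`headword`, `reading`) and the sample rows are hypothetical placeholders, not the actual schema of any ELRA lexicon, which should be checked in the documentation of each resource.

```python
import csv
import io


def read_lexicon(stream, fieldnames):
    """Parse tab-delimited rows from a text stream into dictionaries.

    `fieldnames` is supplied by the caller because the real column layout
    of each lexicon is not described here (hypothetical in this sketch).
    """
    # QUOTE_NONE treats quote characters as ordinary text, which is the
    # safer default for lexical data that may contain quotation marks.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        entries.append(dict(zip(fieldnames, row)))
    return entries


# Hypothetical two-column sample standing in for a real lexicon file.
sample = io.StringIO("東京\tとうきょう\n大阪\tおおさか\n")
entries = read_lexicon(sample, ["headword", "reading"])
print(len(entries), entries[0]["headword"])
```

For a file on disk, the same function can be given `open(path, encoding="utf-8", newline="")` in place of the in-memory stream.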
The following list of lexicons is available:
1) Monolingual Lexicons:
Cantonese Readings Database, ELRA ID: ELRA-L0101, ISLRN: 634-690-317-631-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0101
Chinese Phonological Database, ELRA ID: ELRA-L0102, ISLRN: 968-547-869-011-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0102
Simplified to Traditional Chinese Conversion, ELRA ID: ELRA-L0103, ISLRN: 151-342-562-705-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0103
Hanzi Pinyin Database for Simplified Chinese, ELRA ID: ELRA-L0104, ISLRN: 292-895-602-975-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0104
Database of Chinese Name Variants, ELRA ID: ELRA-L0105, ISLRN: 379-237-021-386-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0105
Database of Chinese Full Names, ELRA ID: ELRA-L0106, ISLRN: 356-835-468-182-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0106
Chinese Lexical Database, ELRA ID: ELRA-L0107, ISLRN: 500-068-723-953-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0107
Chinese Morphological Database, ELRA ID: ELRA-L0108, ISLRN: 279-636-746-963-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0108
Comprehensive Wordlist of Simplified Chinese, ELRA ID: ELRA-L0109, ISLRN: 159-767-888-341-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0109
Comprehensive Word List of Traditional Chinese, ELRA ID: ELRA-L0110, ISLRN: 378-715-589-213-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0110
Japanese Phonological Database, ELRA ID: ELRA-L0111, ISLRN: 169-903-096-259-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0111
Japanese Lexical Database, ELRA ID: ELRA-L0112, ISLRN: 162-212-767-492-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0112
Japanese Morphological Database, ELRA ID: ELRA-L0113, ISLRN: 212-935-180-069-7
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0113
Japanese Orthographical Database, ELRA ID: ELRA-L0114, ISLRN: 261-356-756-593-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0114
Japanese Companies and Organizations, ELRA ID: ELRA-L0115, ISLRN: 570-674-242-221-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0115
Database of Japanese Name Variants, ELRA ID: ELRA-L0116, ISLRN: 850-674-726-461-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0116
Comprehensive Word List of Japanese, ELRA ID: ELRA-L0117, ISLRN: 145-375-006-102-6
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0117
Korean Lexical Database, ELRA ID: ELRA-L0118, ISLRN: 702-121-344-159-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0118
Comprehensive Word List of Korean, ELRA ID: ELRA-L0119, ISLRN: 652-932-407-045-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0119
Arabic Full Form Lexicon, ELRA ID: ELRA-L0120, ISLRN: 968-827-909-119-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0120
Database of Arabic Plurals, ELRA ID: ELRA-L0121, ISLRN: 414-072-749-098-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0121
Database of Arab Names, ELRA ID: ELRA-L0122, ISLRN: 998-153-793-831-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0122
Database of Arab Names in Arabic, ELRA ID: ELRA-L0123, ISLRN: 126-981-976-765-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0123
Database of Foreign Names in Arabic, ELRA ID: ELRA-L0124, ISLRN: 130-493-475-689-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0124
Database of Arabic Place Names, ELRA ID: ELRA-L0125, ISLRN: 916-541-123-321-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0125
Comprehensive Database of Chinese Personal Names, ELRA ID: ELRA-L0126, ISLRN: 797-857-604-135-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0126
Database of Persian Names, ELRA ID: ELRA-L0127, ISLRN: 739-878-734-567-6
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0127
Spanish Full-form Lexicon (Monolingual), ELRA ID: ELRA-L0128, ISLRN: 866-578-477-474-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0128
Database of Chinese Names, ELRA ID: ELRA-L0129, ISLRN: 792-499-131-789-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-L0129
2) Bilingual/Multilingual Lexicons:
Simplified Chinese-English Technical Terms, ELRA ID: ELRA-M0053, ISLRN: 418-191-947-016-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0053
Simplified Chinese-to-English Dictionary, ELRA ID: ELRA-M0054, ISLRN: 694-156-385-534-4
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0054
English-to-Simplified Chinese Dictionary, ELRA ID: ELRA-M0055, ISLRN: 407-348-028-638-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0055
Chinese-English Database of Proverbs and Idioms (Chengyu), ELRA ID: ELRA-M0056, ISLRN: 506-728-933-717-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0056
Chinese-Japanese Technical Terms Dictionary, ELRA ID: ELRA-M0057, ISLRN: 079-503-057-574-0
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0057
Chinese-English Database of Proper Nouns, ELRA ID: ELRA-M0058, ISLRN: 638-295-493-483-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0058
Chinese-Japanese Database of Proper Nouns, ELRA ID: ELRA-M0059, ISLRN: 951-838-928-664-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0059
Spanish Full-form Lexicon (Bilingual), ELRA ID: ELRA-M0060, ISLRN: 942-238-032-826-7
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0060
Japanese - English Dictionary, ELRA ID: ELRA-M0061, ISLRN: 854-879-959-652-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0061
English - Japanese Dictionary, ELRA ID: ELRA-M0062, ISLRN: 233-968-157-290-2
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0062
Multilingual Database of Japanese Points-of-Interest 1, ELRA ID: ELRA-M0063, ISLRN: 902-666-654-661-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0063
Multilingual Database of Japanese Points-of-Interest 2, ELRA ID: ELRA-M0064, ISLRN: 268-160-514-957-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0064
Japanese - English Database of Proper Nouns, ELRA ID: ELRA-M0065, ISLRN: 104-268-721-502-8
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0065
Japanese - English Dictionary of Technical Terms, ELRA ID: ELRA-M0066, ISLRN: 499-497-806-398-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0066
Korean-Japanese Dictionary of Technical Terms, ELRA ID: ELRA-M0067, ISLRN: 584-164-296-035-1
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0067
Korean-English Database of Proper Nouns, ELRA ID: ELRA-M0068, ISLRN: 408-409-094-493-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0068
Korean-Japanese Database of Proper Nouns, ELRA ID: ELRA-M0069, ISLRN: 265-620-933-123-5
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0069
Korean-Chinese Database of Proper Nouns, ELRA ID: ELRA-M0070, ISLRN: 207-127-841-003-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0070
Comprehensive Word Lists for Chinese, Japanese, Korean and Arabic, ELRA ID: ELRA-M0071, ISLRN: 476-146-877-598-3
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0071
Multilingual Proper Noun Database, ELRA ID: ELRA-M0072, ISLRN: 340-315-642-771-9
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-M0072
About CJK Dictionary Institute, Inc.
The CJK Dictionary Institute, Inc. (CJKI) specializes in CJK lexicography. The principal activity of CJKI is the development and continuous expansion of lexical databases of general vocabulary, proper nouns and technical terms for CJK languages (Chinese, Japanese, Korean), including Chinese dialects such as Cantonese and Hakka, containing millions of entries. CJKI also developed databases and romanization systems of Arabic proper nouns, a comprehensive Spanish-English dictionary, a Chinese-Vietnamese names dictionary, and various others. In addition, CJKI offers a full range of professional consulting services on CJK linguistics and lexicography.
To find out more about CJKI, please visit the website: http://www.cjk.org/cjk/index.htm
About ELRA
The European Language Resources Association (ELRA) is a non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for Language Resources and promoting Human Language Technologies. Language Resources covering various fields of HLT (including Multimodal, Speech, Written, Terminology) and a great number of languages are available from the ELRA catalogue. ELRA's strong involvement in the fields of Language Resources and Language Technologies is also emphasized at the LREC conference, organized every other year since 1998.
To find out more about ELRA, please visit the website: http://www.elra.info
For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us. Visit the Universal Catalogue: http://universal.elra.info Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/en/catalogues/language-resources-announcements
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-3 | Speechocean – update (August 2019)
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-4 | Google's Language Model benchmark
An LM benchmark is available at: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark
Here is a brief description of the project.
'The purpose of the project is to make available a standard training and test setup for language modeling experiments. The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here. This also means that your results on this data set are reproducible by the research community at large. Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:
ArXiv paper: http://arxiv.org/abs/1312.3005
Happy benchmarking!'
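Per-word log-probabilities of the kind mentioned in the description can be turned into a perplexity figure for a held-out set. The sketch below is only an illustration: the function name, the in-memory list input, and the logarithm base are assumptions, since the file format and base of the published values are not stated here and should be checked in the repository.

```python
import math


def perplexity(log_probs, base=10.0):
    """Perplexity of a held-out set from per-word log-probabilities.

    PPL = base ** (-(1/N) * sum(log_base p(w_i))).
    The logarithm base of the published values is an assumption here;
    pass the base that matches the files you actually use.
    """
    if not log_probs:
        raise ValueError("need at least one log-probability")
    mean_log_prob = sum(log_probs) / len(log_probs)
    return base ** (-mean_log_prob)


# Made-up values for illustration only: four words, each with p = 0.01.
example = [math.log10(0.01)] * 4
print(perplexity(example))
```

Averaging in the log domain before exponentiating avoids the underflow that multiplying many small probabilities would cause.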
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-5 | Forensic database of voice recordings of 500+ Australian English speakers
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-6 | Audio and Electroglottographic speech recordings
Audio and Electroglottographic speech recordings from several languages
We are happy to announce the public availability of speech recordings made as part of the UCLA project 'Production and Perception of Linguistic Voice Quality'. http://www.phonetics.ucla.edu/voiceproject/voice.html
Audio and EGG recordings are available for Bo, Gujarati, Hmong, Mandarin, Black Miao, Southern Yi, Santiago Matatlan/San Juan Guelavia Zapotec; audio recordings (no EGG) are available for English and Mandarin. Recordings of Jalapa Mazatec extracted from the UCLA Phonetic Archive are also posted. All recordings are accompanied by explanatory notes and wordlists, and most are accompanied by Praat textgrids that locate target segments of interest to our project.
Analysis software developed as part of the project (VoiceSauce for audio analysis and EggWorks for EGG analysis) and all project publications are also available from this site. All preliminary analyses of the recordings using these tools (i.e. acoustic and EGG parameter values extracted from the recordings) are posted on the site in large data spreadsheets.
All of these materials are made freely available under a Creative Commons Attribution-NonCommercial-ShareAlike-3.0 Unported License.
This project was funded by NSF grant BCS-0720304 to Pat Keating, Abeer Alwan and Jody Kreiman of UCLA, and Christina Esposito of Macalester College.
Pat Keating (UCLA)
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-7 | EEG-face tracking-audio 24 GB data set Kara One, Toronto, Canada
We are making 24 GB of a new dataset, called Kara One, freely available. This database combines 3 modalities (EEG, face tracking, and audio) during imagined and articulated speech using phonologically-relevant phonemic and single-word prompts. It is the result of a collaboration between the Toronto Rehabilitation Institute (in the University Health Network) and the Department of Computer Science at the University of Toronto.
In the associated paper (abstract below), we show how to accurately classify imagined phonological categories solely from EEG data. Specifically, we obtain up to 90% accuracy in classifying imagined consonants from imagined vowels and up to 95% accuracy in classifying stimulus from active imagination states using advanced deep-belief networks.
Data from 14 participants are available here: http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html.
If you have any questions, please contact Frank Rudzicz at frank@cs.toronto.edu.
Best regards, Frank
PAPER
Shunan Zhao and Frank Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Proceedings of ICASSP 2015, Brisbane, Australia
ABSTRACT
This paper presents a new dataset combining 3 modalities (EEG, facial, and audio) during imagined and vocalized phonemic and single-word prompts. We pre-process the EEG data, compute features for all 3 modalities, and perform binary classification of phonological categories using a combination of these modalities. For example, a deep-belief network obtains accuracies over 90% on identifying consonants, which is significantly more accurate than two baseline support vector machines. We also classify between the different states (resting, stimuli, active thinking) of the recording, achieving accuracies of 95%. These data may be used to learn multimodal relationships, and to develop silent-speech and brain-computer interfaces.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-8 | TORGO database free for academic use
In the spirit of the season, I would like to announce the immediate availability of the TORGO database, free in perpetuity for academic use. This database combines acoustics and electromagnetic articulography from 8 individuals with speech disorders and 7 without, and totals over 18 GB. These data can be used, for example, for multimodal models (e.g., for acoustic-articulatory inversion), models of pathology, and augmented speech recognition. More information (and the database itself) can be found here: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-9 | Datatang
Datatang is a leading global data provider that specializes in customized data solutions, focusing on a variety of speech, image, and text data collection, annotation, and crowdsourcing services.
Summary of the new datasets (2018) and a brief plan for 2019.
Speech data (with annotation) that we finished in 2018
Ongoing speech projects in 2019
On top of the above, more speech data collections are planned, such as Japanese speech data, children's speech data, dialect speech data and so on.
We will also continue to provide these data at a competitive price while maintaining a high accuracy rate.
If you have any questions or need more details, do not hesitate to contact us at jessy@datatang.com
We can send you a sample or a specification of the data.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-10 | Fearless Steps Corpus (University of Texas, Dallas) Fearless Steps Corpus John H.L. Hansen, Abhijeet Sangwan, Lakshmish Kaushik, Chengzhu Yu Center for Robust Speech Systems (CRSS), Eric Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, U.S.A.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-11 | SIWIS French Speech Synthesis Database
The SIWIS French Speech Synthesis Database includes high quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis. A total of 9750 utterances from various sources such as parliament debates and novels were uttered by a professional French voice talent. A subset of the database contains emphasised words in many different contexts. The database includes more than ten hours of speech data and is freely available.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-12 | JLCorpus - Emotional Speech corpus with primary and secondary emotions
To further the understanding of the wide array of emotions embedded in human speech, we are introducing an emotional speech corpus. In contrast to existing speech corpora, this corpus was constructed by maintaining an equal distribution of 4 long vowels in New Zealand English. This balance facilitates comparison studies of emotion-related formant and glottal source features. The corpus also has 5 secondary emotions along with 5 primary emotions. Secondary emotions are important in Human-Robot Interaction (HRI), where the aim is to model natural conversations between humans and robots. There are very few existing speech resources for studying these emotions, and this work adds a speech corpus containing some secondary emotions.
Please use the corpus for emotional speech related studies. When you use it, please include the citation: Jesin James, Li Tian, Catherine Watson, 'An Open Source Emotional Speech Corpus for Human Robot Interaction Applications', in Proc. Interspeech, 2018.
To access the whole corpus, including the recording supporting files, use the following link: https://www.kaggle.com/tli725/jl-corpus (if you have already installed the Kaggle API, you can type the following command to download: kaggle datasets download -d tli725/jl-corpus)
If you simply want the raw audio+txt files, use the following link: https://www.kaggle.com/tli725/jl-corpus/downloads/Raw%20JL%20corpus%20(unchecked%20and%20unannotated).rar/4
The corpus was evaluated in a large-scale human perception test with 120 participants. The links to the surveys are:
For the primary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_8ewmOCgOFCHpAj3
For the secondary emotion corpus: https://auckland.au1.qualtrics.com/jfe/form/SV_eVDINp8WkKpsPsh
These surveys give an overall idea of the type of recordings in the corpus. The perceptually verified and annotated JL corpus will be made publicly available soon.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-13 | OPENGLOT – An open environment for the evaluation of glottal inverse filtering
OPENGLOT is a publicly available database that was designed primarily for the evaluation of glottal inverse filtering algorithms. In addition, the database can be used in evaluating formant estimation methods. OPENGLOT consists of four repositories. Repository I contains synthetic glottal flow waveforms and speech signals generated by using the Liljencrants–Fant (LF) waveform as an excitation and an all-pole vocal tract model. Repository II contains glottal flow and speech pressure signals generated using physical modelling of human speech production. Repository III contains pairs of glottal excitation and speech pressure signals generated by exciting 3D-printed plastic vocal tract replicas with LF excitations via a loudspeaker. Finally, Repository IV contains multichannel recordings (speech pressure signal, EGG, high-speed video of the vocal folds) from natural production of speech.
OPENGLOT is available at: http://research.spa.aalto.fi/projects/openglot/
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-14 | Corpus Rhapsodie
We are pleased to announce the publication of a book devoted
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-15 | The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus), Boulder Learning Inc.
The My Science Tutor Children's Conversational Speech Corpus (MyST Corpus) is the world's largest English children's speech corpus. It is freely available to the research community for research use. Companies can acquire the corpus for $10,000. The MyST Corpus was collected over a 10-year period, with support from over $9 million in grants from the US National Science Foundation and Department of Education, awarded to Boulder Learning Inc. (Wayne Ward, Principal Investigator).
The MyST corpus contains speech collected from 1,374 third, fourth and fifth grade students. The students engaged in spoken dialogs with a virtual science tutor in 8 areas of science. A total of 11,398 student sessions of 15 to 20 minutes produced a total of 244,069 utterances. 42% of the utterances have been transcribed at the word level. The corpus is partitioned into training and test sets to support comparison of research results across labs. All parents and students signed consent forms, approved by the University of Colorado's Institutional Review Board, that authorize distribution of the corpus for research and commercial use.
The MyST children's speech corpus contains approximately ten times as many spoken utterances as all other English children's speech corpora combined (see https://en.wikipedia.org/wiki/List_of_children%27s_speech_corpora).
Additional information about the corpus, and instructions for how to acquire the corpus (and samples of the speech data) can be found on the Boulder Learning Web site at http://boulderlearning.com/request-the-myst-corpus/.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-16 | HARVARD speech corpus - native British English speaker
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5-2-17 | Magic Data Technology Kid Voice TTS Corpus in Mandarin Chinese (November 2019)
Magic Data Technology is one of the leading artificial intelligence data service providers in the world. The company is committed to providing a wide range of customized data services in the fields of speech recognition, intelligent imaging and Natural Language Understanding.
This corpus was recorded by a four-year-old Chinese girl born in Beijing, China. For this release we are publishing 15 minutes of speech data from the corpus for non-commercial use.
The contents and the corresponding descriptions of the corpus:
The corpus aims to help researchers in the TTS field. It is part of a much bigger dataset (the 2.3-hour MAGICDATA Kid Voice TTS Corpus in Mandarin Chinese), which was recorded in the same environment. This is the first time this voice has been published.
Please note that this corpus is released with the authorization of the speaker and her parents.
Samples are available. Do not hesitate to contact us with any questions.
Website: http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/360
E-mail: business@magicdatatech.com
|