ISCApad #160 |
Saturday, October 08, 2011 by Chris Wellekens |
5-2-1 | ELRA - Language Resources Catalogue - Update (2011-09)
5-2-2 | ELRA - Language Resources Catalogue - Special Offer
5-2-3 | LDC Newsletter (September 2011)

In this newsletter:
- Cataloging the communication of Asian Elephants
- New publications:
  - 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1

Cataloging the communication of Asian Elephants

LDC distributes a broad selection of databases, the majority of which are used for human language research and technology development. Our corpus catalog also includes the vocalizations of other animal species. We'd like to highlight the intriguing work behind one such animal communication corpus, Asian Elephant Vocalizations LDC2010S05.

New Publications

(1) 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1 was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately fifteen hours of meeting room video data collected in 2005 and 2006 and annotated for the VACE (Video Analysis and Content Extraction) 2006 face and person tracking tasks.

The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences. Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007.
In 2006, the VACE program and the European Union's Computers in the Human Interaction Loop (CHIL) collaborated to hold the CLassification of Events, Activities and Relationships (CLEAR) Evaluation. This was an international effort to evaluate systems designed to analyze people, their identities, activities, interactions and relationships in human-human interaction scenarios, as well as related scenarios. The VACE program contributed the evaluation infrastructure (e.g., data, scoring, tools) for a specific set of tasks, and the CHIL consortium, coordinated by the Karlsruhe Institute of Technology, contributed a separate set of evaluation infrastructure.

The meeting room data used for the 2006 test set was collected by the following sites in 2005 and 2006: Carnegie Mellon University (USA), University of Edinburgh (Scotland), IDIAP Research Institute (Switzerland), NIST (USA), Netherlands Organization for Applied Scientific Research (Netherlands) and Virginia Polytechnic Institute and State University (USA). Each site had its own independent camera setup, illuminations, viewpoints, people and topics. Most of the datasets included High-Definition (HD) recordings, but those were subsequently formatted to MPEG-2 for the evaluation.
The 2008 evaluation was distinguished from prior evaluations, in particular those in 2005 and 2006, by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario.

The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English speakers and bilingual English speakers. The telephone speech in this corpus is predominantly English; all interview segments are in English. Telephone speech represents approximately 523 hours of the data, and microphone speech represents the other 427 hours.

The telephone speech segments include summed-channel excerpts in the range of 5 minutes from longer original conversations. The interview material includes single-channel conversation interview segments of at least 8 minutes from a longer interview session. English language transcripts were produced using an automatic speech recognition (ASR) system.

2008 NIST Speaker Recognition Evaluation Training Set Part 2 is distributed on 7 DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $2000.

(3) French Gigaword Third Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC.
This third edition updates French Gigaword Second Edition (LDC2009T28) and adds material collected from January 1, 2009 through December 31, 2010. The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:

- Agence France-Presse (afp_fre): May 1994 - Dec. 2010
- Associated Press French Service (apw_fre): Nov. 1994 - Dec. 2010

All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII, white space, and printable code points in the 'Latin1 Supplement' character table, as defined by the Unicode Standard (ISO 10646) for the 'accented' characters used in French. The Supplement/accented characters are presented in UTF-8 encoding.

The overall totals for each source are summarized below. Note that the 'Totl-MB' numbers show the amount of data when the files are uncompressed (i.e. approximately 15 gigabytes, total); the 'Gzip-MB' column shows totals for compressed file sizes as stored on the DVD-ROM; the 'K-wrds' numbers are simply the number of white space-separated tokens (of all types) after all SGML tags are eliminated.
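As a rough illustration of the 'K-wrds' definition and the character-set description above, the following Python sketch counts white space-separated tokens after stripping SGML tags and checks that a text stays within the described character repertoire. It is not LDC's tooling: the function names, the tag pattern, the choice to replace each tag with a space, and the reading of "printable Latin1 Supplement" as U+00A0 to U+00FF are all assumptions made for this example, as is the sample document.

```python
import re

# Assumption: an SGML tag is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text: str) -> int:
    """Count white space-separated tokens once SGML tags are removed.

    Each tag is replaced by a space (an assumption) so that words on
    either side of a tag are never glued into one token.
    """
    without_tags = TAG_RE.sub(" ", sgml_text)
    return len(without_tags.split())


def in_expected_charset(text: str) -> bool:
    """Check that text uses only white space, printable ASCII, or
    code points U+00A0..U+00FF (assumed printable Latin-1 Supplement)."""
    return all(
        ch.isspace() or 0x20 <= ord(ch) <= 0x7E or 0xA0 <= ord(ch) <= 0xFF
        for ch in text
    )


if __name__ == "__main__":
    # Hypothetical sample, not taken from the corpus.
    sample = '<DOC id="example">\n<TEXT>\n<P>\nLe président a parlé.\n</P>\n</TEXT>\n</DOC>'
    tokens = count_tokens(sample)
    print("tokens:", tokens)
    print("K-wrds:", tokens / 1000)
    print("charset ok:", in_expected_charset(sample))
```

On real data the function would be applied to each decompressed file and the counts summed per source; punctuation attached to a word stays part of that token, which matches the "tokens of all types" wording.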
French Gigaword Third Edition is distributed on 1 DVD-ROM. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$4500.
5-2-4 | Speechocean October 2011 update

SpeechOcean China also has more than 200 large language resources, and some of the databases can be used free of charge by our members for academic research purposes. As an ISCA member, we will also be glad to share these databases with other ISCA members.

Speechocean - Language Resource Catalogue - New Releases (2011-10)

Speechocean, as a global provider of language resources and data services, has more than 200 large-scale databases available in 80+ languages and accents, covering the fields of Text to Speech, Automatic Speech Recognition, Text, Machine Translation, Web Search, Videos, Images, etc.

Speechocean is glad to announce that more speech resources have been released:

Turkish speech recognition Database (Desktop) --- 201 speakers

This Turkish desktop speech recognition database was collected by Speechocean’s project team in Turkey. It is one of the databases in our Speech Data - Desktop Project (SDD), which currently contains database collections in 30 languages. All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/789.html
Turkish speech recognition Database (In-car) --- 316 speakers

This Turkish in-car speech recognition database was collected by Speechocean’s project team in Turkey. It is one of the databases in our Speech Data - Car (SDC) Project, which currently contains database collections in more than 30 languages. The script was specially designed to provide material for both training and testing of many classes of speech recognizers, and contains 320 utterances covering 15 categories and 35 sub-categories for each speaker. Each speaker was recorded under two environments in three variations (Parked, City Driving and Highway Driving) with various recording conditions such as motor running, fan on/off, and window up/down. A total of 320 utterances were recorded for each speaker under two environments (160 utterances and spontaneous sentences per environment).
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/793.html
France French speech recognition Database (Desktop) --- 200 speakers

This France French desktop speech recognition database was collected by Speechocean’s project team in France. It is one of the databases in our Speech Data - Desktop Project (SDD), which currently contains database collections in 30 languages. All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/796.html
Spain Spanish speech recognition Database (Desktop) --- 210 speakers

This Spain Spanish desktop speech recognition database was collected by Speechocean’s project team in Spain. It is one of the databases in our Speech Data - Desktop Project (SDD), which currently contains database collections in 30 languages.
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/795.html
UK English speech recognition Database (Desktop) --- 200 speakers

This UK English desktop speech recognition database was collected by Speechocean’s project team in the UK. It is one of the databases in our Speech Data - Desktop Project (SDD), which currently contains database collections in 30 languages.
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/792.html
Portugal Portuguese speech recognition Database (Desktop) --- 200 speakers

This Portugal Portuguese desktop speech recognition database was collected by Speechocean’s project team in Portugal. It is one of the databases in our Speech Data - Desktop Project (SDD), which currently contains database collections in 30 languages.
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/791.html
Swedish speech recognition Database (Desktop) --- 200 speakers

This Swedish desktop speech recognition database was collected by Speechocean’s project team in Sweden. It is one of the databases in our Speech Data - Desktop Project (SDD), which currently contains database collections in 30 languages.
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/790.html
Canadian French Desktop speech recognition Corpus (200 speakers) was launched in Canada

In response to urgent demand from our clients, the Canadian French desktop speech recognition database (200 speakers) was collected by Speechocean’s project team in Canada. This database belongs to Speechocean's Desktop Speech Data Project.
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/733.html
Chinese Mandarin In-car Speech Recognition Database was Successfully Released!

The Chinese Mandarin In-car Speech Recognition Database was successfully released with the catalogue serial number King-ASR-122 in our Catalogue. This database was made for tuning and testing speech recognition systems for in-car use. It belongs to SPC’s Multi-language In-car Speech Data Project. The script was specially designed to provide material for both training and testing of many classes of speech recognizers, and contains 320 utterances covering 15 categories and 35 sub-categories for each speaker.
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/781.html
The American Spanish Mobile speech Recognition database was Successfully Released!

The American Spanish Mobile speech Recognition database was successfully released with the catalogue serial number King-ASR-119. This database was made for tuning and testing speech recognition systems for IVR / mobile use. It belongs to SPC’s Multi-language Mobile Speech Data Project.
All audio files are manually transcribed and labelled. A pronunciation lexicon with a phonetic transcription in SAMPA is also included. For more information, please see the technical document at the following link: http://www.speechocean.com/en-ASR-Corpora/779.html
Visit our on-line Catalogue: http://www.speechocean.com/en-Product-Catalogue/Index.html

For more information about our databases and services, please visit our website: www.speechocean.com

If you have any inquiry regarding our databases and services, please feel free to contact us:
XiangFeng Cheng: Chengxianfeng@speechocean.com
Marta Gherardi: Marta@speechocean.com