ISCApad #241
Tuesday, July 10, 2018 by Chris Wellekens
Speechocean – update (July 2018):
Speechocean: A global language resources and data services supplier
About Speechocean
Speechocean is one of the world's well-known providers of language-related resources and services in the fields of Human-Computer Interaction and Human Language Technology. At present, we provide data services in more than 110 languages and dialects from across the world.
KingLine Data Center: Data Sharing Platform
KingLine Data Center is operated and supervised by Speechocean and focuses mainly on creating and providing language resources for research and development in human language technology.
These diverse corpora are widely used for research and development in the fields of Speech Recognition, Speech Synthesis, Natural Language Processing, Machine Translation, Web Search, etc. All corpora are openly accessible to users all over the world, including scientific research institutions, enterprises and individuals.
For more detailed information, please visit our website: http://kingline.speechocean.com
Newly released data:
1. Chinese Mandarin Speech Recognition Corpus (Mobile)-Conversation-1250 Speakers
S.N:King-ASR-408
The Chinese Mandarin Speech Recognition Corpus was collected in China.
In total, the script covers 625 pairs of spontaneous everyday conversational speech, specially designed to provide material for both training and testing speech recognizers.
This corpus contains the voices of 1,250 different speakers (571 males, 679 females), balanced across age groups (16–30, 31–45, 46–60), gender and regional accent. Each speaker was recorded in a quiet office or home environment.
Speech was collected on a mobile platform (Android). A pronunciation lexicon with phonemic transcriptions in zh-cn_pinyin is available; it was manually checked. All audio files were manually transcribed and annotated by native transcribers.
2. Guangdong Cantonese Speech Recognition Corpus (Mobile)-Sentences-1014 Speakers
Details:
The Guangdong Cantonese Speech Recognition Corpus was collected in Guangdong.
3. Russian Speech Synthesis Corpus - Male
S.N:King-TTS-020
Details:
Size: 8.12 GB
Recording Hours: 13.69 Hours
Parameters: 44.1 kHz, 16-bit, 2 channels
The Russian Speech Synthesis Corpus contains the recordings of one male voice talent. He is a broadcaster who was 34 years old at the time of recording and was born and raised in Moscow.
The corpus contains 9,212 utterances. It was recorded in a professional studio over two channels: the speech waveform and the electroglottography (EGG) signal. Speech rate, energy and timbre were strictly controlled during the recording process.
Each utterance was carefully proofread by linguists and stored in Windows uncompressed PCM format. Prosody labeling and phone boundary labeling are included. A pronouncing dictionary is available. All data were manually checked.
4. Pronunciation Lexicon of Loanwords of US English
S.N:King-Lexicon-079
Details:
Entries: 350,000
Phoneme Inventory: Computer-readable IPA (can be converted to other phone sets such as SAMPA or X-SAMPA on request)
Stress: Included
Syllable Boundary: Included
Contact Information
Xianfeng Cheng
VP
Tel: +86-10-62660928; +86-10-62660053 ext.8080
Mobile: +86 13681432590
Skype: xianfeng.cheng1
Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com
Website: www.speechocean.com