ISCA - International Speech
Communication Association



ISCApad #211

Wednesday, January 13, 2016 by Chris Wellekens

5-2-13 Speechocean – update (December 2015)
  

 

Speechocean: A global language resources and data services supplier

 

About Speechocean

Speechocean is one of the world's well-known providers of language resources and services in the fields of Human-Computer Interaction and Human Language Technology. At present, we provide data services covering 110+ languages and dialects across the world.

 

KingLine Data Center: Data Sharing Platform

KingLine Data Center is operated and supervised by Speechocean. It focuses mainly on creating and providing language resources for research and development in human language technology.

These diverse corpora are widely used for research and development in the fields of Speech Recognition, Speech Synthesis, Natural Language Processing, Machine Translation, Web Search, etc. All corpora are openly accessible to users all over the world, including scientific research institutions, enterprises and individuals.

For more detailed information, please visit our website: http://kingline.speechocean.com

 

Newly released corpora:

  1. Canadian English Speech Recognition Database (Desktop), 202 Speakers

ID: King-ASR-140

This is a 4-channel Canadian English desktop speech database, recorded simultaneously over 4 different microphones. This database is owned by Beijing Haitian Ruisheng Science Technology Ltd (SpeechOcean, www.speechocean.com).

The prompts were phonetically rich sentences. The raw sentences were all selected from the News and Twitter domains. We removed a number of sentences that included offensive or negative words or phrases. The final list contained 30,000 unique sentences, from which we generated the prompt sheets, using each sentence no more than 3 times. All audio files were manually transcribed and annotated by our native transcription team following the transcription conventions, and our QA team carried out a strict evaluation of all transcription files. SpeechOcean developed a professional transcription tool to support this work, with new shortcut functions embedded in it, such as a button for non-speech acoustic events.
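
The prompt-sheet step described above (a pool of unique sentences, each used no more than 3 times) can be sketched as follows. This is only an illustrative sketch and not SpeechOcean's actual procedure: the function name `build_prompt_sheets`, its parameters, and the round-robin assignment are all assumptions made for the example.

```python
import random


def build_prompt_sheets(sentences, num_speakers, per_speaker, max_uses=3, seed=0):
    """Hypothetical helper: build one prompt sheet per speaker.

    Sentences are shuffled once and then handed out in consecutive,
    wrapping slices, so no sentence is used more than `max_uses` times
    overall and no sheet contains the same sentence twice.
    """
    n = len(sentences)
    total = num_speakers * per_speaker
    if per_speaker > n or total > n * max_uses:
        raise ValueError("not enough unique sentences for the requested sheets")

    order = list(sentences)
    random.Random(seed).shuffle(order)  # fixed seed keeps the sketch reproducible

    sheets = []
    for speaker in range(num_speakers):
        start = speaker * per_speaker
        # Wrap around the shuffled pool; each full pass reuses every sentence once.
        sheets.append([order[(start + k) % n] for k in range(per_speaker)])
    return sheets
```

Walking through the shuffled pool in order keeps the reuse count of every sentence within one of every other, so the cap of 3 holds whenever the total number of prompts does not exceed three times the pool size.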

The recordings were made in a quiet office environment. The corpus contains recordings of 322,936 utterances of English speech from 202 speakers. The pure recording time is about 380 hours (4-channel), including the leading silence (about 500 ms) and the trailing silence (about 500 ms). The total size of this database is 121 GB.

A pronunciation lexicon with phonemic transcriptions in SAMPA was carefully prepared, covering all the words in the transcription files.

  2. Argentinean Spanish Speech Recognition Database (Desktop), 200 Speakers

ID: King-ASR-281

This is a 4-channel Spanish desktop speech database, recorded simultaneously over 4 different microphones. The project was carried out in Argentina and covers all the cities, for example: Buenos Aires, Cordoba, Lanus...

Each speaker recorded around 300 sentences, selected from a pool of phonetically rich sentences, in approximately 80 minutes, speaking as naturally as possible. The recording was performed in a quiet office environment.

The corpus contains recordings of 236,232 utterances of Spanish speech from 200 speakers. The pure recording time is about 358 hours (4-channel), including the leading silence (about 500 ms) and the trailing silence (about 500 ms). The total size of this database is 141 GB.

A pronunciation lexicon with phonemic transcriptions in SAMPA was carefully prepared, covering all the words in the transcription files.
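
As a rough consistency check on these figures, the short sketch below derives per-speaker and per-utterance averages. It assumes that both the utterance count and the 358 hours are totals across all 4 channels; the announcement does not state this explicitly, and the helper name `corpus_summary` is hypothetical.

```python
def corpus_summary(utterances, speakers, channels, hours):
    """Hypothetical helper: derive simple averages from published corpus totals.

    Assumes `utterances` and `hours` are totals over all channels.
    """
    per_speaker_all_channels = utterances / speakers
    return {
        # Distinct sentences read by each speaker, counting one channel only.
        "sentences_per_speaker": per_speaker_all_channels / channels,
        # Mean length of one utterance file, including leading/trailing silence.
        "mean_utterance_seconds": hours * 3600 / utterances,
    }


# Figures for King-ASR-281 as given in the announcement.
summary = corpus_summary(utterances=236_232, speakers=200, channels=4, hours=358)
print(round(summary["sentences_per_speaker"], 2))
print(round(summary["mean_utterance_seconds"], 2))
```

Under that assumption the result is roughly 295 sentences per speaker, which agrees with the "around 300 sentences" stated above, and a mean utterance length of about 5.5 seconds including the leading and trailing silence.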

 

  3. Chilean Spanish Speech Recognition Database (Mobile), 300 Speakers

ID: King-ASR-290

This is a 3-channel Chilean Spanish speech database, collected over 3 different mobile operating systems: iOS, Android and Windows Phone. The project was carried out in Chile and covers all the main cities, for example: Santiago, Rancagua, Antofagasta and Viña.
300 speakers were recorded in total, each in a quiet environment.
The prompts were phonetically rich sentences. The raw sentences were all selected from the News, Twitter/Forum and SMS domains. We removed a number of sentences that included offensive or negative words or phrases. The final list contained 108,055 unique sentences, from which we generated the prompt sheets, using each sentence no more than 3 times.
After discarding some unqualified utterances, the whole corpus contains recordings of 268,704 utterances; the pure recording time is about 519 hours (including leading and trailing silence). The total size of this database is about 55.8 GB.

 

Contact Information

Xianfeng Cheng

VP

Tel: +86-10-62660928; +86-10-62660053 ext.8080

Mobile: +86 13681432590

Skype: xianfeng.cheng1

Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com

Website: www.speechocean.com