ISCA - International Speech
Communication Association



ISCApad #189

Saturday, March 15, 2014 by Chris Wellekens

5-2-14 Speechocean March 2014 update
  


 

Speechocean, a global supplier of language resources and data services, has over 500 large-scale databases available in 110+ languages and accents, covering desktop, in-car, telephony and tablet PC platforms. Our large and diverse data repository includes ASR databases, TTS databases, lexica, text corpora, etc.

 

Speechocean is glad to announce that more resources have been released:

ASR Databases

Speechocean provides corpora in 110+ regional languages, available in a variety of formats, speaking styles, recording environments and platforms, including in-car, mobile phone, fixed-line and desktop speech recognition corpora. This month we released more Asian-language databases intended for tuning and testing speech recognition systems in ASR applications.

1.1 In-Car

Chinese Mandarin Speech Recognition Database ---- (In-Car)-100 Speakers

ID: King-ASR-122

This database was collected in mainland China. It contains the voices of 100 different native speakers (50 males, 50 females), balanced according to age (mainly 18–30: 62, 31–45: 28, 46–60: 10), gender (male 50%, female 50%) and regional accent (Northern 60%, Wu 10%, Xiang 5%, Gan 5%, Kejia 5%, Min 5%, Cantonese 10%).


The script was specially designed to provide material for both training and testing of many classes of speech recognizers. It contains 320 utterances per speaker, covering 15 categories and 35 sub-categories (for the detailed script structure, please see the technical document).
Each speaker was recorded in two environments out of three variations (parked, city driving and highway driving), under various recording conditions such as motor running, fan on/off and window up/down. In total, 320 utterances were recorded for each speaker across the two environments, and the database contains 200,796 recorded utterances.

 

Each utterance is stored in a separate file and each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included. All the data was transcribed and labeled.
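
As a rough illustration of how such a label file might be read, the sketch below parses a SAM-style ASCII label file into a dictionary. It assumes each line consists of a short mnemonic, a colon and a value, as in SpeechDat-style SAM labels. The field names in the sample (`LHD`, `SAM`, `SNB`, `LBO`) and their values are illustrative assumptions and are not taken from the Speechocean technical document.

```python
from typing import Dict, List


def parse_sam_label(text: str) -> Dict[str, List[str]]:
    """Parse 'KEY: value' lines into a dict mapping each key to its values.

    Assumption: one field per line, with key and value separated by the
    first colon. A key may repeat, so values are collected in a list.
    """
    fields: Dict[str, List[str]] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Hypothetical label content, for illustration only.
sample = """LHD: SAM, 6.0
SAM: 16000
SNB: 2
LBO: 0,24000,48000,turn on the radio
"""

label = parse_sam_label(sample)
sampling_rate = int(label["SAM"][0])
# Assumed LBO layout: start, centre, end, then the transcription text.
transcription = label["LBO"][0].split(",", 3)[3]
print(sampling_rate, transcription)
```

Collecting values in a list keeps repeated keys intact; the actual set of mnemonics should be taken from the label specification shipped with the database.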


Japanese Speech Recognition Database ---- (In-Car)-800 Speakers

ID: King-ASR-125

This Japanese in-car speech database was collected in Japan and contains the voices of 800 different native speakers, demographically balanced according to age (16–30, 31–45 and 46–60), gender (400±5% males, 400±5% females) and dialect region. The script was specially designed to provide material for both training and testing of many classes of speech recognizers; it contains 16 general categories and more than 50 specific sub-categories. The recordings cover three driving environments (parked, city driving and highway driving), with recording conditions such as fan on/off and window up/down. A total of 300 utterances were recorded for each speaker in two of the three driving environments (150 utterances and 10 spontaneous utterances per environment).


The recordings were made over four high-quality audio channels (C1: SHURE SM10A, C2: SENNHEISER ME104, F1: AKG Q400, F2: AKG Q400) in three cars that are popular in Japan.


The speech data is stored as uncompressed 16 kHz, 16-bit audio. Each prompted utterance is stored in a separate file, and each signal file is accompanied by an ASCII SAM label file containing the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included. All the data was transcribed and labeled.
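
For sizing purposes, the duration of a headerless PCM file follows directly from its byte count: 16 kHz at 16 bits (2 bytes) per sample comes to 32,000 bytes per second per channel. The sketch below assumes raw PCM with no file header; the function name and the example file size are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz, as stated in the description
BYTES_PER_SAMPLE = 2     # 16 bit


def pcm_duration_seconds(num_bytes: int, channels: int = 1) -> float:
    """Duration of raw (headerless) PCM audio of the given size.

    Assumption: the file holds samples only. A container format such as
    WAV adds header bytes that would have to be subtracted first.
    """
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels
    return num_bytes / bytes_per_second


# A hypothetical 160,000-byte mono file.
print(pcm_duration_seconds(160_000))
```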

1.2 Telephony

Japanese Speech Recognition Database ---- Conversation (Telephony)-201 Speakers

ID: King-ASR-055

This Japanese speech recognition database was collected in Japan and contains the voices of 201 different native speakers, demographically balanced according to age (16–28, 29–60), gender and dialect region. The corpus contains 100 pairs of spontaneous dialog recordings from the 201 speakers. Each pair consists of three audio files: one for each individual speaker and one for the mixed channel. The three files were recorded simultaneously. The pure recording time of the mixed channel is about 104.8 hours. The database covers 33 topics.

 

The database contains 7,009 audio files, saved as uncompressed PCM. All the speech data was transcribed and labeled.

1.3 Mobile

Korean Speech Recognition Database ---- (Mobile)-1023 Speakers

ID: King-ASR-137

This Korean mobile speech recognition database was collected in Korea and contains the voices of 1023 different native speakers (510±5% males, 513±5% females), balanced according to age (mainly 16–30, 31–45, 46–60), gender and regional accent (for details, please see the technical document).


The script was specially designed to provide material for both training and testing of many classes of speech recognizers. It covers 15 categories and 35 sub-categories for each speaker (for the detailed script structure, please see the technical document).

Each speaker recorded 300 utterances in two sessions: one in a quiet environment (office/home) and one in a noisy environment (garden/roadside/restaurant/bus). Each session comprises 150 utterances and spontaneous sentences.

Mobile phones popular in Korea, such as Samsung, Nokia and HTC models, were used to collect the data. The speech data is stored as uncompressed 16 kHz, 16-bit audio. Each utterance is stored in a separate file, and each signal file is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription is also included. All the data was transcribed and labeled.

 

Chinese Mandarin Speech Recognition Database ---- Sentences (Mobile)-5048 Speakers

ID: King-ASR-216

This is a mobile speech database collected by Speechocean in a quiet environment in China. It is one of the databases in our Speech Data ---- Mobile Project (SDM), which currently contains database collections in 30 languages.


This database contains 1,514,028 sentences of Chinese Mandarin speech from 5048 speakers, recorded in a quiet environment. The pure recording length is about 2,268 hours. All speakers are native speakers from 14 typical dialect cities covering the seven main dialect regions of China, demographically balanced according to age (16–30, 31–45, 46–60), gender (2,584 males and 2,464 females) and regional accent.

The script was specially designed to provide material for both training and testing of many classes of speech recognizers. The script for each speaker contains 300 sentences randomly selected from a specially designed pool of sentences. Each speaker was recorded as naturally as possible in a quiet environment using popular mobile phones such as iPhone, HTC, Samsung and MOTO handsets, covering the iOS, Android and Windows Mobile platforms.

The speech data is stored in uncompressed 16 kHz, 16-bit PCM format. All the speech was manually transcribed and labeled. A pronunciation lexicon with a phonemic transcription in Pinyin is also included.
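
The figures quoted for King-ASR-216 can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this description; the variable names are illustrative.

```python
speakers = 5048
sentences = 1_514_028
hours = 2268
males, females = 2584, 2464

# The gender counts should add up to the speaker count.
gender_total_matches = (males + females == speakers)

# Average sentences per speaker (each script nominally has 300 sentences).
sentences_per_speaker = sentences / speakers

# Average length of one recorded sentence, in seconds.
seconds_per_sentence = hours * 3600 / sentences

print(gender_total_matches)
print(round(sentences_per_speaker, 1))
print(round(seconds_per_sentence, 2))
```

The average comes out just under 300 sentences per speaker and a little over 5 seconds per sentence, which is consistent with the stated script size and recording length.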

1.4 Desktop

Indonesian Speech Recognition Database ---- Sentences (Desktop)-200 Speakers

ID: King-ASR-061

This Indonesian speech recognition database was collected in Indonesia and contains the voices of 200 different native speakers, demographically balanced according to age (16–30, 31–45, 46–60) and gender. It contains 239,267 audio files with about 460.94 hours of recordings.

Each speaker uttered 300 sentences in a quiet office room. All the data has been manually proofread and precisely labeled.

Urdu Speech Recognition Database ---- Sentences (Desktop)-200 Speakers

ID: King-ASR-063

This Urdu speech recognition database was collected in Pakistan and contains the voices of 200 native speakers, demographically balanced according to age (16–60), gender and dialect region. It contains 241,354 audio files, saved as uncompressed PCM. All the speech data was transcribed and labeled.

Vietnamese Speech Recognition Database ---- Sentences (Desktop)-200 Speakers

ID: King-ASR-074

This Vietnamese speech recognition database was collected in Vietnam and contains the voices of 200 native speakers, demographically balanced according to age (16–60), gender and dialect region. It contains 263,204 audio files, saved as uncompressed PCM. All the speech data was transcribed and labeled.

TTS Databases

Speechocean licenses a variety of speech synthesis databases in more than 40 languages, covering broadcast speech, emotional speech, etc., which can be used with different algorithms.

 

European Portuguese Speech Corpus for TTS (Female)

ID: King-TTS-017

The European Portuguese (pt-PT) Speech Corpus consists of recordings of a native Portuguese professional broadcaster (female, 32 years old), made in a studio with high SNR (>35 dB) over two channels (Shure SM15 microphone and electroglottography (EGG) sensor).

 

The Corpus includes the following sub-corpora:

  1. Sentence sub-corpus: includes 3000 short sentences (7–12 words) and 2000 sentences of normal length (13–20 words). To cover a wide range of linguistic phenomena, all sentences are extracted from everyday articles published in Portugal, such as national and international news and articles on daily life, travel, and so on. Sentences containing political, religious, obscene or pornographic words that might provoke negative emotions are carefully excluded.

  2. Emotional sub-corpus: includes 100 exclamatory sentences and 100 interrogative sentences, which can be used for emotional TTS research.

  3. Digit sub-corpus: includes many kinds of digit data, such as isolated digits, connected digits in blocks, and natural and ordinal number readings.

  4. Expression sub-corpus: consists of general expressions, such as date, time, money and measure expressions.

  5. Spell sub-corpus: includes letters of the alphabet, Greek characters and general abbreviations.

 

All reading prompts were manually revised, and prosody annotations were made according to the real speech. All speech data is segmented and labeled at phone level. A pronunciation lexicon and pitch extracted from the EGG signal can also be provided on demand.

Text Corpora

Speechocean licenses many kinds of text corpora in many languages, which are well suited to language model training.

ID            Kingline Data Names                              Languages                        Size
King-MT-001   Chinese-English-Korean-Japanese Parallel Corpus  Chinese-English-Korean-Japanese  200,000 Pairs of Sentences
King-MT-005   English-to-Simplified Chinese Dictionary         English-Chinese                  80,000 Words
King-MT-010   Japanese - English Place Names                   Japanese - English               80,000 Words
King-NLP-019  SC and TC Chinese Pinyin Database                Chinese                          2,600,000 Words
King-NLP-020  Japanese Phonological Database                   Japanese                         35,000 Words
King-NLP-022  Database of Japanese Name Variants               Japanese                         4,000,000 Words
King-NLP-023  Japanese Lexical Database                        Japanese                         290,000 Words
King-NLP-024  Japanese - English Personal Names                Japanese                         580,000 Words

Lexica

Speechocean builds pronunciation lexica in many languages which can be licensed to customers.

No                Name                                       Size             Phoneme Set
King-Lexicon-001  Chinese Mandarin Pronunciation Lexicon     211,444 Entries  Pinyin
King-Lexicon-002  Canadian French Pronunciation Lexicon      23,000 Entries   SAMPA
King-Lexicon-003  Russian Pronunciation Lexicon              139,032 Entries  SAMPA
King-Lexicon-004  US English Pronunciation Lexicon           36,000 Entries   CMU
King-Lexicon-005  UK English Pronunciation Lexicon           23,000 Entries   SAMPA
King-Lexicon-006  Argentina Spanish Pronunciation Lexicon    14,636 Entries   SAMPA
King-Lexicon-007  European Spanish Pronunciation Lexicon     31,388 Entries   SAMPA
King-Lexicon-008  German Pronunciation Lexicon               80,745 Entries   SAMPA
King-Lexicon-009  Cantonese Pronunciation Lexicon            86,364 Entries   Jyutping
King-Lexicon-010  Turkish Pronunciation Lexicon              101,950 Entries  SAMPA
King-Lexicon-011  European Portuguese Pronunciation Lexicon  23,033 Entries   SAMPA
King-Lexicon-012  European French Pronunciation Lexicon      53,000 Entries   SAMPA
King-Lexicon-013  Chile Spanish Pronunciation Lexicon        21,884 Entries   SAMPA
King-Lexicon-014  Ukrainian Pronunciation Lexicon            37,000 Entries   SAMPA
King-Lexicon-015  Danish Pronunciation Lexicon               6,983 Entries    SAMPA
King-Lexicon-016  Japanese Pronunciation Lexicon             72,968 Entries   Hepburn

 

 

Contact Information

Xianfeng Cheng

Business Manager of Commercial Department

Tel: +86-10-62660928; +86-10-62660053 ext.8080

Cell phone: +86 13681432590

Skype: xianfeng.cheng1

Email: chengxianfeng@speechocean.com; cxfxy0cxfxy0@gmail.com

Website: www.speechocean.com

 


