ISCA - International Speech
Communication Association



ISCApad #169

Tuesday, July 10, 2012 by Chris Wellekens

5-2 Database
5-2-1 ELRA - Language Resources Catalogue - Update (2012-07)

*****************************************************************
ELRA - Language Resources Catalogue - Update
*****************************************************************
ELRA is happy to announce that 2 new Speech Telephone Resources are now available in its catalogue.
Moreover, an updated version of the Bilingual Collocational Dictionary (Horst Bogatz) has also been released.
   
    1) New Language Resources:
     
      ELRA-S0343 VERIF1DE
   
The speech corpus VERIF1DE contains 20 recordings (sessions) from each of 150 German speakers over the telephone network (10 sessions over the fixed network and 10 sessions over GSM). Each session contains 40 single recordings, mainly speech read from a prompt sheet.
  
For more information, see: http://catalog.elra.info/product_info.php?products_id=1169
   
    ELRA-S0344 LILA Hindi Belt database
   
The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.
   
For more information, see: http://catalog.elra.info/product_info.php?products_id=1170
   
    2) Updated Language Resource:
     
    ELRA-M0013 Bilingual Collocational Dictionary (Horst Bogatz)
   
This new release contains 69,000 English headwords (instead of 40,000 for the previous release).
The bilingual English-German collocational dictionary consists of around 69,000 English headwords, including concepts expressed with more than one word (e.g. 'the awareness of the environment' or 'lame duck') and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword. It provides the German equivalents for the headwords as well as their English synonyms.
For more information, see: http://catalog.elra.info/product_info.php?products_id=451
    
For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
   
    Visit our On-line Catalogue: http://catalog.elra.info
    Visit the Universal Catalogue: http://universal.elra.info
    Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html

5-2-2 LDC Newsletter (June 2012)

In this newsletter:

LDC at LREC 2012

New publications:

LDC2012T09 - Arabic-Dialect/English Parallel Text
LDC2012T08 - Prague Czech-English Dependency Treebank 2.0

LDC at LREC 2012

LDC attended the 8th Language Resources and Evaluation Conference (LREC 2012), hosted by ELRA, the European Language Resources Association. The conference was held in Istanbul, Turkey and featured a broad range of sessions on language resource and human language technologies research. Fourteen LDC staff members presented current work on a wide range of topics, including handwriting recognition, word alignment, treebanks, machine translation and information retrieval as well as initiatives for synchronizing metadata practices in sociolinguistic data collection.
The LDC Papers page now includes research papers presented at LREC 2012. Most papers are available for download in PDF format; presentation slides and posters are available for several papers as well. On the Papers page, you can read about LDC's role in resource creation to support handwriting recognition and translation technology (Song et al., 2012). LDC is developing resources to support two research programs: Multilingual Automatic Document Classification, Analysis and Translation (MADCAT) and Open Handwriting Recognition and Translation (OpenHaRT). To support these programs, LDC is collecting handwritten samples of pre-processed Arabic and Chinese data that had previously been translated into English. To date, LDC has collected and annotated over 225,000 handwriting images.
Additionally, you can learn about LDC's efforts to collect and annotate very large corpora of user-contributed content in multiple languages (Garland et al., 2012). For the Broad Operational Language Translation (BOLT) program, LDC is developing resources to support genre-independent machine translation and information retrieval systems. In the current phase of BOLT, LDC is collecting and annotating threaded posts from online discussion forums, targeting at least 500 million words each in three languages: English, Chinese, and Egyptian Arabic. A portion of the data undergoes manual, multi-layered linguistic annotation.
As we mark LDC's 20th anniversary, we will feature the work behind these LREC papers as well as other ongoing research in upcoming newsletters.

New publications

(1) Arabic-Dialect/English Parallel Text was developed by Raytheon BBN Technologies (BBN), LDC and Sakhr Software and contains approximately 3.5 million tokens of Arabic dialect sentences and their English translations.

The data in this corpus consists of Arabic web text as follows:

1. Filtered automatically from large Arabic text corpora harvested from the web by LDC. The LDC corpora consisted largely of weblog and online user groups and amounted to around 350 million Arabic words. Documents that contained a large percentage of non-Arabic or Modern Standard Arabic (MSA) words were eliminated. A list of dialect words was manually selected by culling through the Levantine Fisher (LDC2005S07, LDC2005T03, LDC2007S02 and LDC2007T04) and Egyptian CALLHOME speech corpora (LDC97S45, LDC2002S37, LDC97T19 and LDC2002T38) distributed by LDC. That list was then used to retain documents that contained a certain number of matches. The resulting subset of the web corpora contained around four million words. Documents were automatically segmented into passages using formatting information from the raw data.

2. Manually harvested by Sakhr Software from Arabic dialect web sites.
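
The word-list filtering described in item 1 can be illustrated with a minimal sketch. This is not BBN's or LDC's actual implementation: the function names, the whitespace-and-punctuation tokenization, and the `min_matches` threshold are all assumptions made for illustration (the announcement only says "a certain number of matches"), and the tokens in the usage note are placeholders rather than real dialect words.

```python
import re


def count_dialect_matches(text, dialect_words):
    """Count tokens in `text` that appear in the dialect word list.

    Tokenization here is a simple split on non-word characters; this is
    an assumption, not the method used to build the corpus.
    """
    tokens = re.findall(r"\w+", text)
    return sum(1 for token in tokens if token in dialect_words)


def filter_documents(documents, dialect_words, min_matches=3):
    """Keep only documents with at least `min_matches` dialect-word hits.

    `min_matches` is a hypothetical threshold; the source does not state
    the value that was actually used.
    """
    return [
        doc
        for doc in documents
        if count_dialect_matches(doc, dialect_words) >= min_matches
    ]
```

For example, with the placeholder word list `{"aa", "bb", "cc"}` and `min_matches=2`, the document `"aa bb cc"` would be retained while `"aa xx yy"` would be dropped.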

Dialect classification and sentence segmentation, as needed, and translation into English were performed by BBN through Amazon's Mechanical Turk. Arabic annotators from Mechanical Turk classified filtered passages as being either MSA or one of four regional dialects: Egyptian, Levantine, Gulf/Iraqi or Maghrebi. An additional 'General' dialect option was allowed for ambiguous passages. The classification was applied to whole passages rather than individual sentences. Only the passages labeled Levantine and Egyptian were further processed. The segmented Levantine and Egyptian sentences were then translated. Annotators were instructed to translate completely and accurately and to transliterate Arabic names. They were also provided with examples. All segments of a passage were presented in the same translation task to provide context.

Arabic-Dialect/English Parallel Text is distributed via web download.

2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2250.

*

(2) Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute of Formal and Applied Linguistics at Charles University in Prague, Czech Republic. It is a corpus of Czech-English parallel resources translated, aligned and manually annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two million words in close to 100,000 sentences.

The principal new material in PCEDT 2.0 is the inclusion of the entire Wall Street Journal data from Treebank-3 (LDC99T42). Not included from PCEDT 1.0 are the Reader's Digest material, the Czech monolingual corpus and the English-Czech dictionary.

Each section is enhanced with a comprehensive manual linguistic annotation in the Prague Dependency Treebank style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:

- dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)

- semantic labeling of content words and types of coordinating structures

- argument structure, including an argument structure ('valency') lexicon for both languages

- ellipsis and anaphora resolution

This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer in the corpus. Please consult the PCEDT website for more information and documentation.

Prague Czech-English Dependency Treebank (PCEDT) 2.0 is distributed on one DVD-ROM.

2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$100.


5-2-3 Speechocean January 2012 update

Speechocean - Language Resource Catalogue - New Releases (01-2012)

Speechocean, as a global provider of language resources and data services, has more than 200 large-scale databases available in 80+ languages and accents covering the fields of Text to Speech, Automatic Speech Recognition, Text, Machine Translation, Web Search, Videos, Images etc.

 

Speechocean is glad to announce that more Speech Resources have been released:

 

Chinese and English Mixing Speech Synthesis Database (Female)

The Chinese Mandarin TTS Speech Corpus contains the read speech of a native Chinese female professional broadcaster, recorded in a studio with high SNR (>35 dB) over two channels (an AKG C4000B microphone and an electroglottography (EGG) sensor).
The Corpus includes the following categories:
1.    Basic Mandarin sub-corpus: 5,000 utterances carefully designed with all kinds of linguistic phenomena in mind. All sentences are declarative and were extracted from news channels such as People's Daily and China Daily. Prompts with negative words were carefully excluded. Only sentences of suitable length were accepted (7-20 words, 14 words on average). This sub-corpus can be used for R&D on HMM-based TTS, limited-domain TTS and small-scale concatenative TTS;
2.    Complementary Mandarin sub-corpus: 10,000 utterances carefully designed with all kinds of linguistic phenomena in mind. All sentences are declarative and were extracted from news channels such as People's Daily and China Daily. Prompts with negative words were carefully excluded. Only sentences of suitable length were accepted (7-20 words, 14 words on average). This sub-corpus complements the Basic Mandarin sub-corpus and can be used for R&D on large-scale concatenative TTS;
3.    Mandarin Neutral sub-corpus: 380 Chinese bi-syllable words embedded in carrier sentences;
4.    Mandarin ERHUA sub-corpus: 290 Chinese Erhua syllables embedded in carrier sentences;
5.    Mandarin Digit-String sub-corpus: 1,250 three-digit utterances covering the different pronunciations of 1, i.e. “yi1” and “yao1”;
6.    Mandarin Question sub-corpus: 300 question sentences with commonly used question markers, for example “吗”, “么”, “呢”, etc.;
7.    Mandarin exclamatory sub-corpus: 200 exclamatory sentences with commonly used exclamatory markers, for example “呀”, “啊”, “吧”, “啦”, etc.;
8.    Chinese English sentence sub-corpus: 1,000 sentences carefully designed for bi-phone coverage. All sentences were extracted from news channels such as Voice of America (VOA). Prompts with negative words were carefully excluded. Only sentences of suitable length were accepted (7-20 words, 12 words on average), and they are phonetically annotated with SAMPA. This sub-corpus can be used for R&D on HMM-based TTS, limited-domain TTS and small-scale concatenative TTS;
9.    Chinese English words sub-corpus: about 6,000 commonly used English words embedded in carrier sentences;
10.    Chinese English Abbreviation sub-corpus: about 200 utterances covering not only the alphabet but also combinations of characters and digits, such as “MP4”;
11.    Chinese English Letter sub-corpus: 26 carrier utterances with each letter embedded at the beginning, middle and end;
12.    Chinese Greek Letter sub-corpus: 24 carrier utterances with each letter embedded at the beginning, middle and end.

All speech data are segmented and labeled at the phone level. A pronunciation lexicon and pitch extracted from the EGG signal can also be provided on request.

 

France French Speech Recognition Corpus (desktop) – 50 speakers

This France French desktop speech recognition database was collected by SpeechOcean in France. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16-30, 31-45, 46-60), gender (28 males, 22 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

 

UK English Speech Recognition Corpus (desktop) – 50 speakers

This UK English desktop speech recognition database was collected by SpeechOcean in England. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16-30, 31-45, 46-60), gender (28 males, 22 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

 

US English Speech Recognition Corpus (desktop) – 50 speakers

This US English desktop speech recognition database was collected by SpeechOcean in the United States. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16-30, 31-45, 46-60), gender (25 males, 25 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

 

Italian Speech Recognition Corpus (desktop) – 50 speakers

This Italian desktop speech recognition database was collected by SpeechOcean in Italy. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16-30, 31-45, 46-60), gender (23 males, 27 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

 

For more information about our databases and services, please visit our website www.speechocean.com or our on-line catalogue at http://www.speechocean.com/en-Product-Catalogue/Index.html

If you have any inquiries regarding our databases and services, please feel free to contact us:

Xianfeng Cheng mailto: Chengxianfeng@speechocean.com

Marta Gherardi mailto: Marta@speechocean.com

 

 

5-2-4 Appen ButlerHill

 

Appen ButlerHill 

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

Language                  Database Type    Catalogue Code   Speakers   Status
Bahasa Indonesia          Conversational   BAH_ASR001       1,002      Available
Bengali                   Conversational   BEN_ASR001       1,000      Available
Bulgarian                 Conversational   BUL_ASR001       217        Available shortly
Croatian                  Conversational   CRO_ASR001       200        Available shortly
Dari                      Conversational   DAR_ASR001       500        Available
Dutch                     Conversational   NLD_ASR001       200        Available
Eastern Algerian Arabic   Conversational   EAR_ASR001       496        Available
English (UK)              Conversational   UKE_ASR001       1,150      Available
Farsi/Persian             Scripted         FAR_ASR001       789        Available
Farsi/Persian             Conversational   FAR_ASR002       1,000      Available
French (EU)               Conversational   FRF_ASR001       563        Available
French (EU)               Voicemail        FRF_ASR002       550        Available
German                    Voicemail        DEU_ASR002       890        Available
Hebrew                    Conversational   HEB_ASR001       200        Available shortly
Italian                   Conversational   ITA_ASR003       200        Available shortly
Italian                   Voicemail        ITA_ASR004       550        Available
Kannada                   Conversational   KAN_ASR001       1,000      In development
Pashto                    Conversational   PAS_ASR001       967        Available
Portuguese (EU)           Conversational   PTP_ASR001       200        Available shortly
Romanian                  Conversational   ROM_ASR001       200        Available shortly
Russian                   Conversational   RUS_ASR001       200        Available
Somali                    Conversational   SOM_ASR001       1,000      Available
Spanish (EU)              Voicemail        ESO_ASR002       500        Available
Turkish                   Conversational   TUR_ASR001       200        Available
Urdu                      Conversational   URD_ASR001       1,000      Available

1.2 Wideband

Language                  Database Type    Catalogue Code   Speakers   Status
English (US)              Studio           USE_ASR001       200        Available
French (Canadian)         Home/Office      FRC_ASR002       120        Available
German                    Studio           DEU_ASR001       127        Available
Thai                      Home/Office      THA_ASR001       100        Available
Korean                    Home/Office      KOR_ASR001       100        Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

Part-of-speech tagged Lexica providing grammatical and semantic labels

Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

These NER Corpora contain text material from a variety of sources and are tagged for the following Named Entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities

Language        Catalogue Code   Words
Arabic          ARB_NER001       500,000
English         ENI_NER001       500,000
Farsi/Persian   FAR_NER001       500,000
Korean          KOR_NER001       500,000
Japanese        JPY_NER001       500,000
Russian         RUS_NER001       500,000
Mandarin        MAN_NER001       500,000
Urdu            URD_NER001       500,000

4. Other Language Resources

Morphological Analyzers – Farsi/Persian & Urdu

Arabic Thesaurus

Language Analysis Documentation – multiple languages

 

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low risk, high quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and Government organizations.

We would be glad to discuss any customized requests or package configurations and prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

                                                         www.appenbutlerhill.com
