ISCA - International Speech
Communication Association



ISCApad #172

Sunday, October 07, 2012 by Chris Wellekens

5-2 Database
5-2-1 ELRA - Language Resources Catalogue - Update (2012-07)

*****************************************************************
ELRA - Language Resources Catalogue - Update
*****************************************************************
ELRA is happy to announce that 2 new Speech Telephone Resources are now available in its catalogue. Moreover, an updated version of the Bilingual Collocational Dictionary (Horst Bogatz) has also been released.
   
    1) New Language Resources:
     
      ELRA-S0343 VERIF1DE
   
The speech corpus VERIF1DE contains 20 recordings (sessions) of 150 German speakers each over the telephone network (10 sessions over fixed network and 10 sessions over GSM). Each session contains 40 single recordings, mainly speech read from a prompt sheet.
  
For more information, see: http://catalog.elra.info/product_info.php?products_id=1169
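
As a rough size check, the figures quoted for VERIF1DE can be multiplied out. The sketch below assumes that each of the 150 speakers completed all 20 sessions and that every session holds exactly 40 recordings; these are readings of the announcement, and the catalogue entry remains the authority on the actual counts.

```python
# Back-of-the-envelope corpus size for VERIF1DE, using only the figures
# quoted in the announcement. Assumption: every speaker completed every
# session and every session holds exactly 40 recordings.
SPEAKERS = 150
FIXED_SESSIONS = 10          # sessions per speaker over the fixed network
GSM_SESSIONS = 10            # sessions per speaker over GSM
RECORDINGS_PER_SESSION = 40

sessions_per_speaker = FIXED_SESSIONS + GSM_SESSIONS
total_sessions = SPEAKERS * sessions_per_speaker
total_recordings = total_sessions * RECORDINGS_PER_SESSION
recordings_per_network = SPEAKERS * FIXED_SESSIONS * RECORDINGS_PER_SESSION

print(f"sessions: {total_sessions}")
print(f"recordings: {total_recordings}")
print(f"recordings per network type: {recordings_per_network}")
```

Under these assumptions the fixed-network and GSM halves of the corpus are the same size.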
   
    ELRA-S0344 LILA Hindi Belt database
   
The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.
   
For more information, see: http://catalog.elra.info/product_info.php?products_id=1170
   
    2) Updated Language Resource:
     
    ELRA-M0013 Bilingual Collocational Dictionary (Horst Bogatz)
   
This new release contains 69,000 English headwords (instead of 40,000 for the previous release).
The bilingual English-German collocational dictionary consists of around 69,000 English headwords, including concepts expressed with more than one word (e.g. 'the awareness of the environment' or 'lame duck') and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword. It provides the German equivalents for the headwords as well as their English synonyms.
For more information, see: http://catalog.elra.info/product_info.php?products_id=451
    
For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org
   
    Visit our On-line Catalogue: http://catalog.elra.info
    Visit the Universal Catalogue: http://universal.elra.info
    Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html


5-2-2 LDC Newsletter (September 2012)

In this newsletter:

- The Future of Language Resources: LDC 20th Anniversary Workshop Summary
- English Treebanking at LDC

New publications

- GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
- MADCAT Phase 1 Training Set

The Future of Language Resources: LDC 20th Anniversary Workshop Summary

Thanks to the members, friends and staff who made our 20th Anniversary Workshop (September 6-7) a fruitful and fun experience. The speakers -- from academia, industry and government -- engaged participants and provoked discussion with their talks about the ways in which language resources contribute to research in language-related fields and other disciplines and with their insights into the future. The result was much food for thought as we enter our third decade.

Visit the workshop page for the proceedings and to learn more about the event.

   

English Treebanking at LDC

As part of our 20th anniversary celebration, the coming newsletters will include features that provide an overview of the broad range of LDC’s activities. This month, we'll examine English treebanking efforts at LDC. The English treebanking team is led by Ann Bies, Senior Research Coordinator. The association of treebanks with LDC began with the publication of the original Penn English Treebank (Treebank-2) in 1995. Since that time the need for new varieties of English treebank data has continued to grow, and LDC has expanded its expertise to address new research challenges. This includes the development of treebanked data for additional domains including conversational speech and web text as well as the creation of parallel treebank data.

   

Speech data presents unique challenges not inherent in edited text, such as speech disfluencies and hesitations. Penn Treebank contains conversational speech data from the Switchboard telephone collection which has been tagged, dysfluency-annotated, and parsed. LDC’s more recent publication, English CTS Treebank with Structural Metadata, builds on that annotation and includes new data. The development of that corpus was motivated by the need to have both structural metadata and syntactic structure annotated in order to support work on speech parsing and structural event detection. The annotation involved a two-pass approach to annotating metadata, speech effects and syntactic structure in transcribed conversational speech: separately annotating for structural metadata, or structural events, and for syntactic structure. The two annotations were then combined into a single aligned representation.

   

Also recently, LDC has undertaken complex syntactic annotation of data collected over the web. Since most parsers are trained using newswire, they achieve better accuracy on similar heavily edited texts. LDC, through a gift from Google Inc., developed English Web Treebank to improve parsing, translation and information extraction on unedited domains, such as blogs, newsgroups, and consumer reviews. LDC’s annotation guidelines were adapted to handle unique features of web text such as inconsistent punctuation and capitalization as well as the increased use of slang, technical jargon and ungrammatical sentences.

   

LDC and its research partners are also involved in the creation of parallel treebanks used for word alignment tasks. Parallel treebanks are annotated morphological and syntactic structures that are aligned at sentence as well as sub-sentence levels. These resources are used for improving machine translation quality. To create such treebanks, English files (translated from the source Arabic or Chinese) are first automatically part-of-speech tagged and parsed and then hand-corrected at each stage. The quality control process consists of a series of specific searches for over 100 types of potential inconsistency and parser or annotation error. Parallel treebank data in the LDC catalog includes the English Translation Treebank: An Nahar Newswire, whose files are parallel with those in Arabic Treebank: Part 3 v 3.2.

   

English treebanking at LDC is ongoing; new titles are in progress and will be added to our catalog.

   

     
     
   

   
       

New Publications
     

   

(1) GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web was developed by LDC and contains 150,068 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. This release consists of Chinese source newswire and web data (newsgroup, weblog) collected by LDC in 2008.

   

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

   

The Chinese word alignment tasks consisted of the following components:

- Identifying, aligning, and tagging 8 different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link.

   

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web is distributed via web download.

2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

   

*

   

(2) MADCAT Phase 1 Training Set contains all training data created by LDC to support Phase 1 of the DARPA MADCAT Program. The data in this release consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.

   

The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 1 data was collected by LDC from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic speaking 'scribes' copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple 'pages' for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.

The handwritten, transcribed documents were checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.

   

The final step was to produce a unified data format that takes multiple data streams and generates a single XML output file which contains all required information. The resulting XML file has these distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer. This release includes 9693 annotation files in MADCAT XML format (.madcat.xml) along with their corresponding scanned image files in TIFF format.

MADCAT Phase 1 Training Set is distributed on two DVD-ROMs.

   

2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

   

 

 
            


5-2-3 Speechocean January 2012 update

Speechocean - Language Resource Catalogue - New Releases (01-2012)

Speechocean, as a global provider of language resources and data services, has more than 200 large-scale databases available in 80+ languages and accents covering the fields of Text to Speech, Automatic Speech Recognition, Text, Machine Translation, Web Search, Videos, Images etc.

 

Speechocean is glad to announce that more Speech Resources have been released:

 

Chinese and English Mixing Speech Synthesis Database (Female)

The Chinese Mandarin TTS Speech Corpus contains the read speech of a native Chinese Female professional broadcaster recorded in a studio with high SNR (>35dB) over two channels (AKG C4000B microphone and Electroglottography (EGG) sensor). 
The Corpus includes the following categories:
1.    Basic Mandarin sub-corpus: 5,000 utterances carefully designed to take all kinds of linguistic phenomena into account. All sentences are declarative and were extracted from news channels such as People's Daily and China Daily. Prompts containing negative words were carefully excluded. Only sentences of suitable length were accepted (7~20 words, 14 words on average). This sub-corpus can be used for R&D of HMM-based TTS, limited-domain TTS and small-scale concatenative TTS;
2.    Complementary Mandarin sub-corpus: 10,000 utterances carefully designed to take all kinds of linguistic phenomena into account. All sentences are declarative and were extracted from news channels such as People's Daily and China Daily. Prompts containing negative words were carefully excluded. Only sentences of suitable length were accepted (7~20 words, 14 words on average). This sub-corpus complements the Basic Mandarin sub-corpus and can be used for R&D of large-scale concatenative TTS;
3.    Mandarin Neutral sub-corpus: 380 Chinese bi-syllable words embedded in carrier sentences;
4.    Mandarin ERHUA sub-corpus: 290 Chinese Erhua syllables embedded in carrier sentences;
5.    Mandarin Digit-String sub-corpus: 1,250 three-digit utterances covering the different pronunciations of 1, i.e. “yi1” and “yao1”;
6.    Mandarin Question sub-corpus: 300 question sentences with commonly used question markers, for example “吗”, “么”, “呢”, etc.;
7.    Mandarin exclamatory sub-corpus: 200 exclamatory sentences with commonly used exclamatory markers, for example “呀”, “啊”, “吧”, “啦”, etc.;
8.    Chinese English sentence sub-corpus: 1,000 sentences carefully designed for bi-phone coverage. All sentences were extracted from news channels such as Voice of America (VOA). Prompts containing negative words were carefully excluded. Only sentences of suitable length were accepted (7~20 words, 12 words on average), and they are phonetically annotated with SAMPA. This sub-corpus can be used for R&D of HMM-based TTS, limited-domain TTS and small-scale concatenative TTS;
9.    Chinese English words sub-corpus: about 6,000 commonly used English words embedded in carrier sentences;
10.    Chinese English Abbreviation sub-corpus: about 200 utterances covering not only the alphabet but also combinations of characters and digits, such as “MP4”;
11.    Chinese English Letter sub-corpus: 26 carrier utterances with each letter embedded at the beginning, middle and end;
12.    Chinese Greek Letter sub-corpus: 24 carrier utterances with each letter embedded at the beginning, middle and end.

All speech data are segmented and labeled at the phone level. A pronunciation lexicon and pitch extracted from the EGG signal can also be provided on demand.

 

France French Speech Recognition Corpus (desktop) – 50 speakers

This France French desktop speech recognition database was collected by SpeechOcean in France. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16 – 30, 31 – 45, 46 – 60), gender (28 males, 22 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
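
The storage implied by this recording format can be estimated from the stated parameters. The sketch below is illustrative only: the sampling rate, bit depth, speaker count and utterance count come from the description above, while the 4-second average utterance length and the single-channel assumption are placeholders, not figures from Speechocean.

```python
# Rough storage estimate for a 50-speaker desktop corpus stored as
# 44.1 kHz, 16-bit uncompressed PCM. Header and label files are ignored.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
SPEAKERS = 50
UTTERANCES_PER_SPEAKER = 500

# Assumptions for illustration only (not stated in the announcement):
ASSUMED_UTTERANCE_SECONDS = 4.0
ASSUMED_CHANNELS = 1


def pcm_bytes(duration_s: float, channels: int = 1) -> int:
    """Size in bytes of raw PCM audio of the given duration."""
    bytes_per_sample = BITS_PER_SAMPLE // 8
    samples = round(duration_s * SAMPLE_RATE_HZ)
    return samples * bytes_per_sample * channels


total_utterances = SPEAKERS * UTTERANCES_PER_SPEAKER
bytes_per_utterance = pcm_bytes(ASSUMED_UTTERANCE_SECONDS, ASSUMED_CHANNELS)
total_bytes = total_utterances * bytes_per_utterance

print(f"bytes per second per channel: {pcm_bytes(1.0)}")
print(f"utterances: {total_utterances}")
print(f"estimated audio size: {total_bytes / 1e9:.2f} GB")
```

If both microphone signals are stored, the estimate doubles; changing `ASSUMED_CHANNELS` to 2 reflects that case.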

 

UK English Speech Recognition Corpus (desktop) – 50 speakers

This UK English desktop speech recognition database was collected by SpeechOcean in England. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16 – 30, 31 – 45, 46 – 60), gender (28 males, 22 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

 

US English Speech Recognition Corpus (desktop) – 50 speakers

This US English desktop speech recognition database was collected by SpeechOcean in America. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16 – 30, 31 – 45, 46 – 60), gender (25 males, 25 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

 

Italian Speech Recognition Corpus (desktop) – 50 speakers

This Italian desktop speech recognition database was collected by SpeechOcean in Italy. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16 – 30, 31 – 45, 46 – 60), gender (23 males, 27 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

 

For more information about our databases and services, please visit our website www.speechocean.com or our on-line catalogue at http://www.speechocean.com/en-Product-Catalogue/Index.html

If you have any inquiries regarding our databases and services, please feel free to contact us:

Xianfeng Cheng mailto:Chengxianfeng@speechocean.com

Marta Gherardi mailto:Marta@speechocean.com

 

 


5-2-4 Appen ButlerHill

 

Appen ButlerHill 

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

Language                 Database Type   Catalogue Code  Speakers  Status
Bahasa Indonesia         Conversational  BAH_ASR001      1,002     Available
Bengali                  Conversational  BEN_ASR001      1,000     Available
Bulgarian                Conversational  BUL_ASR001      217       Available shortly
Croatian                 Conversational  CRO_ASR001      200       Available shortly
Dari                     Conversational  DAR_ASR001      500       Available
Dutch                    Conversational  NLD_ASR001      200       Available
Eastern Algerian Arabic  Conversational  EAR_ASR001      496       Available
English (UK)             Conversational  UKE_ASR001      1,150     Available
Farsi/Persian            Scripted        FAR_ASR001      789       Available
Farsi/Persian            Conversational  FAR_ASR002      1,000     Available
French (EU)              Conversational  FRF_ASR001      563       Available
French (EU)              Voicemail       FRF_ASR002      550       Available
German                   Voicemail       DEU_ASR002      890       Available
Hebrew                   Conversational  HEB_ASR001      200       Available shortly
Italian                  Conversational  ITA_ASR003      200       Available shortly
Italian                  Voicemail       ITA_ASR004      550       Available
Kannada                  Conversational  KAN_ASR001      1,000     In development
Pashto                   Conversational  PAS_ASR001      967       Available
Portuguese (EU)          Conversational  PTP_ASR001      200       Available shortly
Romanian                 Conversational  ROM_ASR001      200       Available shortly
Russian                  Conversational  RUS_ASR001      200       Available
Somali                   Conversational  SOM_ASR001      1,000     Available
Spanish (EU)             Voicemail       ESO_ASR002      500       Available
Turkish                  Conversational  TUR_ASR001      200       Available
Urdu                     Conversational  URD_ASR001      1,000     Available

1.2 Wideband

Language                 Database Type   Catalogue Code  Speakers  Status
English (US)             Studio          USE_ASR001      200       Available
French (Canadian)        Home/Office     FRC_ASR002      120       Available
German                   Studio          DEU_ASR001      127       Available
Thai                     Home/Office     THA_ASR001      100       Available
Korean                   Home/Office     KOR_ASR001      100       Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

- Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)
- Part-of-speech tagged Lexica providing grammatical and semantic labels
- Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

Language        Catalogue Code  Words
Arabic          ARB_NER001      500,000
English         ENI_NER001      500,000
Farsi/Persian   FAR_NER001      500,000
Korean          KOR_NER001      500,000
Japanese        JPY_NER001      500,000
Russian         RUS_NER001      500,000
Mandarin        MAN_NER001      500,000
Urdu            URD_NER001      500,000

Description: These NER Corpora contain text material from a variety of sources and are tagged for the following Named Entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities

4. Other Language Resources

Morphological Analyzers – Farsi/Persian & Urdu

Arabic Thesaurus

Language Analysis Documentation – multiple languages

 

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low risk, high quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and Government organizations.

We would be glad to discuss any customized requests or package configurations and prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

www.appenbutlerhill.com


© Copyright 2024 - ISCA International Speech Communication Association - All right reserved.

Powered by ISCA