ISCA Services

ISCA - International Speech Communication Association


ISCApad #174

Sunday, December 09, 2012 by Chris Wellekens

5-2 Database

5-2-1

ELRA - Language Resources Catalogue - Update (2012-07)

ELRA is happy to announce that 2 new Speech Telephone Resources are now available in its catalogue. Moreover, an updated version of the Bilingual Collocational Dictionary (Horst Bogatz) has also been released.

1) New Language Resources:

ELRA-S0343 VERIF1DE
The speech corpus VERIF1DE contains 20 recordings (sessions) from each of 150 German speakers, made over the telephone network (10 sessions over the fixed network and 10 sessions over GSM). Each session contains 40 single recordings, mainly speech read from a prompt sheet.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1169

ELRA-S0344 LILA Hindi Belt database
The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1170

2) Updated Language Resource:

ELRA-M0013 Bilingual Collocational Dictionary (Horst Bogatz)
This new release contains 69,000 English headwords (instead of 40,000 for the previous release).
The bilingual English-German collocational dictionary consists of around 69,000 English headwords, including concepts expressed with more than one word (e.g. 'the awareness of the environment' or 'lame duck') and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword. It provides the German equivalents for the headwords as well as their English synonyms.
For more information, see: http://catalog.elra.info/product_info.php?products_id=451

For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html


5-2-2

LDC Newsletter (November 2012)

In this newsletter:

Spring 2013 LDC Data Scholarship Program

Invitation to Join for Membership Year 2013

Why become an LDC member?

2012 User Survey Results

LDC to Close for Thanksgiving Break

New publications:

Annotated English Gigaword

Chinese-English Semiconductor Parallel Text

GALE Phase 2 Arabic Newswire Parallel Text

Spring 2013 LDC Data Scholarship Program

Applications are now being accepted through January 15, 2013, 11:59PM EST for the Spring 2013 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 25 individual students and student research groups.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are restricted to members of the Consortium. Applicants are advised to select a maximum of one to two datasets; students may apply for additional datasets during the following cycle once they have completed processing of the initial datasets and have published or presented work in some juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full Non-member Fee for the data or to join the consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2013 program cycle is January 15, 2013, 11:59PM EST.

Invitation to Join for Membership Year 2013

Membership Year (MY) 2013 is open for joining! We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the consortium. For MY2013, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase. Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2013 are as follows:

· Organizations who joined for MY2012 will receive a 5% discount when renewing. This discount will apply throughout 2013, regardless of time of renewal. MY2012 members renewing before March 1, 2013 will receive an additional 5% discount, for a total 10% discount off the membership fee.

· New members as well as organizations who did not join for MY2012, but who held membership in any of the previous MYs (1993-2011), will also be eligible for a 5% discount provided that they join/renew before March 1, 2013.

The following table provides exact pricing information.

Member Type	Membership	MY2013 Fee	MY2013 Fee with 5% Discount*	MY2013 Fee with 10% Discount**
Not-for-Profit/US Government	Standard	US$2400	US$2280	US$2160
Not-for-Profit/US Government	Subscription	US$3850	US$3658	US$3465
For-Profit	Standard	US$24000	US$22800	US$21600
For-Profit	Subscription	US$27500	US$26125	US$24750

* For new members, MY2012 Members renewing for MY2013, and any previous year Member who renews before March 1, 2013

** For MY2012 Members renewing before March 1, 2013
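
The discount rules and table entries above can be reproduced with a short calculation. The following Python sketch is only an illustration: the function and key names are hypothetical, and rounding to the nearest whole dollar is an assumption inferred from the US$3658 entry (3850 × 0.95 = 3657.50).

```python
from datetime import date

# List fees in US$, taken from the MY2013 pricing table.
LIST_FEES = {
    ("not-for-profit", "standard"): 2400,
    ("not-for-profit", "subscription"): 3850,
    ("for-profit", "standard"): 24000,
    ("for-profit", "subscription"): 27500,
}

# Early discounts apply to organizations joining "before March 1, 2013".
EARLY_DEADLINE = date(2013, 3, 1)


def discount_percent(was_my2012_member: bool, join_date: date) -> int:
    """Return the discount (in percent) described in the announcement."""
    early = join_date < EARLY_DEADLINE
    if was_my2012_member:
        # 5% for renewing at any time in 2013, plus 5% for renewing early.
        return 10 if early else 5
    # New members and pre-2012 members: 5% only when joining early.
    return 5 if early else 0


def my2013_fee(org_type: str, membership: str,
               was_my2012_member: bool, join_date: date) -> int:
    """Discounted fee in whole US$ (assumed rounding: half up)."""
    list_fee = LIST_FEES[(org_type, membership)]
    pct = discount_percent(was_my2012_member, join_date)
    # Integer arithmetic avoids floating-point surprises at the .50 boundary.
    return (list_fee * (100 - pct) + 50) // 100


# Example: a MY2012 for-profit subscription member renewing in February 2013.
fee = my2013_fee("for-profit", "subscription", True, date(2013, 2, 15))
```
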


Publications for MY2013 are still being planned; here are the working titles of data sets we intend to provide:

· Arabic Treebank - Weblog
· Chinese-English Biomedical Parallel Text
· GALE data – all phases and tasks
· Hispanic-English Speech
· Maninkakan Lexicon
· OpenMT 2008-2012 Progress Set

In addition to receiving new publications, current year members of the LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.

This past year, LDC members who joined early or kept their membership current saved almost US$70,000 collectively on membership fees. Be sure to keep an eye on your mail - all previous and current LDC members will be sent an invitation to join letter and renewal invoice for MY2013. Renew early for MY2013 to save today!

Why become an LDC member?

LDC is offering early renewal discounts on membership fees for Membership Year 2013, making now a good time to consider joining or renewing membership. LDC membership has the following advantages:

LDC membership provides cost-effective access to an extensive and growing catalog that spans 20 years and includes over 500 multilingual speech, text, and video resources. Even if your organization only needs a few datasets from a given membership year, membership is often the most economical way to obtain current corpora. Additionally, the generous discounts that member organizations receive on older corpora reduce the cost of acquiring such datasets.

All members enjoy unlimited use of LDC data within their organizations. For universities, there is no difference in cost between a departmental membership and one that is university-wide. Departments can therefore combine resources and establish one LDC membership for use by the entire university community. Likewise, for-profit members with multiple branches can maintain one membership for use by their entire organizations.

For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations, including commercial restrictions, on the use of certain corpora. In the case of a small group of corpora, commercial licenses must be obtained separately from the owners of the data.

2012 User Survey Results

Earlier this year, LDC sent a survey to its user communities. Like previous iterations in 2006 and 2007, the survey solicited community input and suggestions on key LDC-related topics, including:

· Satisfaction levels with LDC’s data, homepage and Catalog
· Reflections on LDC’s 20th Anniversary year
· Suggestions for future publications
· Speculations on the future of HLT-related fields, specifically on mobile technologies, cloud computing, social networking and open data

Survey respondents were generally satisfied with LDC’s data, membership options, homepage and Catalog, though there were requests for additional data options and data acquisition methods. Some of the data respondents requested are already in our pipeline for the end of 2012 or for Membership Year (MY) 2013, so please be on the lookout for Publications updates. Respondents were also very supportive of LDC’s 20th Anniversary, posting testimonials and well-wishes in the 20th Anniversary section.

LDC would like to thank all survey participants. Survey participants will receive access to full survey results shortly.

LDC to Close for Thanksgiving Break

LDC will be closed on Thursday, November 22, 2012 and Friday, November 23, 2012 in observance of the US Thanksgiving Holiday. Our offices will reopen on Monday, November 26, 2012.

New publications

(1) Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically-generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics which enables broader involvement in large-scale knowledge-acquisition efforts by researchers.

Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:

· Agence France-Presse, English Service (afp_eng)
· Associated Press Worldstream, English Service (apw_eng)
· Central News Agency of Taiwan, English Service (cna_eng)
· Los Angeles Times/Washington Post Newswire Service (ltw_eng)
· Washington Post/Bloomberg Newswire Service (wpb_eng)
· New York Times Newswire Service (nyt_eng)
· Xinhua News Agency, English Service (xin_eng)

The following layers of annotation were added:

· Tokenized and segmented sentences
· Treebank-style constituent parse trees
· Syntactic dependency trees
· Named entities
· In-document coreference chains

The annotation was performed in a three-step process: (1) the data was preprocessed and sentences selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed.

Annotated English Gigaword is distributed on one hard drive.

2012 Subscription Members will automatically receive one copy of this data on hard drive. 2012 Standard Members may request a copy as part of their 16 free membership corpora. 2011 Members who licensed English Gigaword Fifth Edition (LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Non-member organizations who licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for the US$200 media fee. Non-member organizations without a license to English Gigaword Fifth Edition may obtain this data for US$6000.

(2) Chinese-English Semiconductor Parallel Text was developed by The MITRE Corporation. It consists of parallel sentences from a collection of abstracts from scientific articles on semiconductors published in Mandarin and translated into English by translators with particular expertise in the technical area. Translators were instructed to err on the side of literal translation if required, but to maintain the technical writing style of the source and to make the resulting English as natural as possible. The translators followed specific guidelines for translation, and those are included in this distribution.

There are 2,169 lines of parallel Mandarin and English, with a total of 125,302 characters of Mandarin and 64,851 words of English, presented in a separate UTF-8 plain text file for each language. The sentences were translated in sequential order and presented in a scrambled order, such that parallel sentences at identical line numbers are translations. For example, the 31st line of the English file is a translation of the 31st line of the Mandarin file. The original line sequence is not provided.

Chinese-English Semiconductor Parallel Text is distributed via web download.

2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1500.
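
Because translations share a line number across the two UTF-8 files, pairing them amounts to reading both files in step. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, and the file names in the usage comment are placeholders, since the actual file names in the distribution are not given here.

```python
from typing import List, Tuple


def read_parallel(zh_path: str, en_path: str) -> List[Tuple[str, str]]:
    """Pair line N of the Mandarin file with line N of the English file."""
    with open(zh_path, encoding="utf-8") as zh_file:
        zh_lines = [line.rstrip("\n") for line in zh_file]
    with open(en_path, encoding="utf-8") as en_file:
        en_lines = [line.rstrip("\n") for line in en_file]

    # Line-aligned corpora must have the same number of lines; a mismatch
    # means the alignment cannot be trusted.
    if len(zh_lines) != len(en_lines):
        raise ValueError(
            f"line count mismatch: {len(zh_lines)} vs {len(en_lines)}"
        )
    return list(zip(zh_lines, en_lines))


# Usage (placeholder file names, not taken from the distribution):
#   pairs = read_parallel("mandarin.txt", "english.txt")
#   zh_31, en_31 = pairs[30]   # the 31st line of each file (0-based index 30)
```

Loading both files fully into memory is a reasonable choice at this size (2,169 lines); for much larger corpora, iterating over the two files in parallel would be preferable.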

(3) GALE Phase 2 Arabic Newswire Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected in 2007 by LDC and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Newswire Parallel Text includes 400 source-translation pairs, comprising 181,704 tokens of Arabic source text and its English translation. Data is drawn from six distinct Arabic newswire sources: Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Arabic Newswire Parallel Text is distributed via web download.


5-2-3

Speechocean January 2012 update

Speechocean - Language Resource Catalogue - New Releases (01-2012)

Speechocean, as a global provider of language resources and data services, has more than 200 large-scale databases available in 80+ languages and accents covering the fields of Text to Speech, Automatic Speech Recognition, Text, Machine Translation, Web Search, Videos, Images etc.

Speechocean is glad to announce that more Speech Resources have been released:

Chinese and English Mixing Speech Synthesis Database (Female)

The Chinese Mandarin TTS Speech Corpus contains the read speech of a native Chinese Female professional broadcaster recorded in a studio with high SNR (>35dB) over two channels (AKG C4000B microphone and Electroglottography (EGG) sensor).
The Corpus includes the following categories:
1.    Basic Mandarin sub-corpus: 5,000 utterances carefully designed with all kinds of linguistic phenomena in mind. All sentences are declarative and were extracted from the news channels of People's Daily, China Daily, etc. Prompts containing negative words were carefully excluded. Only sentences of suitable length were accepted (7~20 words, 14 words on average). This sub-corpus can be used for R&D of HMM-based TTS, limited-domain TTS and small-scale concatenative TTS;
2.    Complementary Mandarin sub-corpus: 10,000 utterances carefully designed with all kinds of linguistic phenomena in mind. All sentences are declarative and were extracted from the news channels of People's Daily, China Daily, etc. Prompts containing negative words were carefully excluded. Only sentences of suitable length were accepted (7~20 words, 14 words on average). This sub-corpus complements the Basic Mandarin sub-corpus and can be used for R&D of large-scale concatenative TTS;
3.    Mandarin Neutral sub-corpus: 380 Chinese bi-syllable words embedded in carrier sentences;
4.    Mandarin ERHUA sub-corpus: 290 Chinese Erhua syllables embedded in carrier sentences;
5.    Mandarin Digit-String sub-corpus: 1,250 three-digit utterances that take into account the different pronunciations of 1, i.e. “yi1” and “yao1”;
6.    Mandarin Question sub-corpus: 300 question sentences with commonly used question markers, for example “吗”, “么”, “呢”, etc.;
7.    Mandarin exclamatory sub-corpus: 200 exclamatory sentences with commonly used exclamatory markers, for example “呀”, “啊”, “吧”, “啦”, etc.;
8.    Chinese English sentence sub-corpus: 1,000 sentences carefully designed with bi-phone coverage in mind. All sentences were extracted from news channels such as Voice of America (VOA). Prompts containing negative words were carefully excluded. Only sentences of suitable length were accepted (7~20 words, 12 words on average), and they were phonetically annotated with SAMPA. This sub-corpus can be used for R&D of HMM-based TTS, limited-domain TTS and small-scale concatenative TTS;
9.    Chinese English words sub-corpus: about 6,000 commonly used English words embedded in carrier sentences;
10.    Chinese English Abbreviation sub-corpus: about 200 utterances that cover not only the alphabet but also combinations of letters and digits, such as “MP4”;
11.    Chinese English Letter sub-corpus: 26 carrier utterances with each letter embedded at the beginning, middle and end;
12.    Chinese Greek Letter sub-corpus: 24 carrier utterances with each letter embedded at the beginning, middle and end.

All speech data are segmented and labeled at the phone level. A pronunciation lexicon and pitch extracted from the EGG signal can also be provided on demand.

France French Speech Recognition Corpus (desktop) – 50 speakers

This France French desktop speech recognition database was collected by SpeechOcean in France. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16 – 30, 31 – 45, 46 – 60), gender (28 males, 22 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

UK English Speech Recognition Corpus (desktop) – 50 speakers

This UK English desktop speech recognition database was collected by SpeechOcean in England. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

US English Speech Recognition Corpus (desktop) – 50 speakers

This US English desktop speech recognition database was collected by SpeechOcean in America. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16 – 30, 31 – 45, 46 – 60), gender (25 males, 25 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

Italian Speech Recognition Corpus (desktop) – 50 speakers

This Italian desktop speech recognition database was collected by SpeechOcean in Italy. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages.

It contains the voices of 50 different native speakers, balanced by age (mainly 16 – 30, 31 – 45, 46 – 60), gender (23 males, 27 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and is accompanied by an ASCII SAM label file containing the relevant descriptive information.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

For more information about our databases and services, please visit our website www.speechocean.com or our on-line catalogue at http://www.speechocean.com/en-Product-Catalogue/Index.html

If you have any inquiries regarding our databases and services, please feel free to contact us:

Xianfeng Cheng mailto: Chengxianfeng@speechocean.com

Marta Gherardi mailto: Marta@speechocean.com


5-2-4

Appen ButlerHill

A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS—MARCH 2012

1. Speech Databases

1.1 Telephony

Language	Database Type	Catalogue Code	Speakers	Status
Bahasa Indonesia	Conversational	BAH_ASR001	1,002	Available
Bengali	Conversational	BEN_ASR001	1,000	Available
Bulgarian	Conversational	BUL_ASR001	217	Available shortly
Croatian	Conversational	CRO_ASR001	200	Available shortly
Dari	Conversational	DAR_ASR001	500	Available
Dutch	Conversational	NLD_ASR001	200	Available
Eastern Algerian Arabic	Conversational	EAR_ASR001	496	Available
English (UK)	Conversational	UKE_ASR001	1,150	Available
Farsi/Persian	Scripted	FAR_ASR001	789	Available
Farsi/Persian	Conversational	FAR_ASR002	1,000	Available
French (EU)	Conversational	FRF_ASR001	563	Available
French (EU)	Voicemail	FRF_ASR002	550	Available
German	Voicemail	DEU_ASR002	890	Available
Hebrew	Conversational	HEB_ASR001	200	Available shortly
Italian	Conversational	ITA_ASR003	200	Available shortly
Italian	Voicemail	ITA_ASR004	550	Available
Kannada	Conversational	KAN_ASR001	1,000	In development
Pashto	Conversational	PAS_ASR001	967	Available
Portuguese (EU)	Conversational	PTP_ASR001	200	Available shortly
Romanian	Conversational	ROM_ASR001	200	Available shortly
Russian	Conversational	RUS_ASR001	200	Available
Somali	Conversational	SOM_ASR001	1,000	Available
Spanish (EU)	Voicemail	ESO_ASR002	500	Available
Turkish	Conversational	TUR_ASR001	200	Available
Urdu	Conversational	URD_ASR001	1,000	Available

1.2 Wideband

Language	Database Type	Catalogue Code	Speakers	Status
English (US)	Studio	USE_ASR001	200	Available
French (Canadian)	Home/Office	FRC_ASR002	120	Available
German	Studio	DEU_ASR001	127	Available
Thai	Home/Office	THA_ASR001	100	Available
Korean	Home/Office	KOR_ASR001	100	Available

2. Pronunciation Lexica

Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:

· Pronunciation Lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)

· Part-of-speech tagged Lexica providing grammatical and semantic labels

· Other reference text based materials including spelling/mis-spelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, orthographic normalization lists.

Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com

3. Named Entity Corpora

These NER Corpora contain text material from a variety of sources and are tagged for the following Named Entities: Person, Organization, Location, Nationality, Religion, Facility, Geo-Political Entity, Titles, Quantities.

Language	Catalogue Code	Words
Arabic	ARB_NER001	500,000
English	ENI_NER001	500,000
Farsi/Persian	FAR_NER001	500,000
Korean	KOR_NER001	500,000
Japanese	JPY_NER001	500,000
Russian	RUS_NER001	500,000
Mandarin	MAN_NER001	500,000
Urdu	URD_NER001	500,000

4. Other Language Resources

· Morphological Analyzers – Farsi/Persian & Urdu

· Arabic Thesaurus

· Language Analysis Documentation – multiple languages

For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations

Appen Butler Hill is committed to providing a low risk, high quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and Government organizations.

We would be glad to discuss any customized requests or package configurations and prepare a customized proposal to meet your needs.

6. Contact Information

Prithivi Pradeep

Business Development Manager

ppradeep@appenbutlerhill.com

+61 2 9468 6370

Tom Dibert

Vice President, Business Development, North America

tdibert@appenbutlerhill.com

+1-315-339-6165

www.appenbutlerhill.com
