ISCApad #170 |
Monday, August 06, 2012 by Chris Wellekens |
5-2-1 | ELRA - Language Resources Catalogue - Update (2012-07)
5-2-2 | LDC Newsletter (July 2012)
In this newsletter:
- LDC 20th Anniversary Workshop
- New publications: LDC2012T11, LDC2012T07, LDC2012T10

LDC 20th Anniversary Workshop

LDC announces its 20th Anniversary Workshop on Language Resources, to be held in Philadelphia on September 6-7, 2012. The event will commemorate our anniversary, reflect on the beginning of language data centers and address the future of language resources. Workshop themes will include:
- the developments in human language technologies (HLT) and associated resources that have brought us to our current state;
- the language resources required by the technical approaches taken and the impact of these resources on HLT progress;
- the applications of HLT and resources to other disciplines including law, medicine, economics, the political sciences and psychology;
- the impact of HLTs and related technologies on linguistic analysis and novel approaches in fields as widespread as phonetics, semantics, language documentation, sociolinguistics and dialect geography;
- and finally, the impact of any of these developments on the ways in which language resources are created, shared and exploited and on the specific resources required.
Stay tuned for further details.

New publications

(1) American English Nickname Collection was developed by Intelius, Inc. and is a compilation of American English nickname-to-given-name mappings based on information in US government records, public web profiles and financial and property reports. This corpus is intended as a tool for the quantitative study of nickname usage in the United States, for example in demographic and sociological studies.

The American English Nickname Collection contains 331,237 distinct mappings encompassing millions of names. The data was collected and processed through a record linkage pipeline. The steps in the pipeline were (1) data cleaning, (2) blocking, (3) pairwise linkage and (4) clustering. In the cleaning step, material was categorized, processed to remove junk and spam records and normalized to an approximately common representation. The blocking process used an algorithm to group records by shared properties in order to determine which record pairs should be examined by the pairwise linker as potential duplicates. The linkage step assigned a score to record pairs using a supervised pairwise machine learning model. The clustering step combined record pairs into connected components and further partitioned each connected component to remove inconsistent pairwise links. As a result, input records were partitioned into disjoint sets called profiles, where each profile corresponded to a single person.

The material is presented in the form of a comma-delimited text file. Each line contains a first name, a nickname or alias, its conditional probability and its frequency. The conditional probability for each nickname is derived from the base data using an algorithm which calculates both the probability that any alias refers to a given name and a threshold below which the mapping is most likely an error. This threshold eliminates typographic errors and other noise from the data.

American English Nickname Collection is distributed via web download. 2012 Subscription Members will receive two copies of this data on disc provided that they have submitted a completed copy of the User License Agreement for American English Nickname Collection (LDC2012T11). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by completing the User License Agreement for American English Nickname Collection (LDC2012T11). The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. The collection is being made available at no charge.

*

(2) Arabic Treebank - Broadcast News v1.0 was developed at LDC. It consists of 120 transcribed Arabic broadcast news stories with part-of-speech, morphology, gloss and syntactic tree annotation in accordance with the Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines. The ongoing PATB project supports research in Arabic-language natural language processing and human language technology development.

This release contains 432,976 source tokens before clitics were split, and 517,080 tree tokens after clitics were separated for treebank annotation. The source materials are Arabic broadcast news stories collected by LDC during the period 2005-2008 from the following sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya TV, Al Fayha, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait TV, Lebanese Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and Syria TV. The transcripts were produced by LDC.

Arabic Treebank - Broadcast News v1.0 is distributed via web download. 2012 Subscription Members will receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$4500.

*

(3) Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language. TimeML is a schema for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the extraction, representation and exchange of temporal information.

Catalan TimeBank 1.0 is annotated in three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored to the specifics of the Catalan language. Temporal relations in Catalan present distinctions of verbal mood (e.g., indicative, subjunctive, conditional, etc.) and grammatical aspect (e.g., imperfective) which are absent in English.

Catalan TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are from the EFE news agency, the ACN Catalan news agency and the Catalan version of the El Periódico newspaper, and span the period from January to December 2000.

The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).

Catalan TimeBank 1.0 is distributed by web download. 2012 Subscription Members will receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by completing the LDC User Agreement for Non-members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. The collection is being made available at no charge.
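
For readers who plan to work with the American English Nickname Collection (item 1 above), the following is a minimal Python sketch of how its comma-delimited file might be read. It assumes the four fields appear in the order given in the announcement (first name, nickname or alias, conditional probability, frequency). The function name `load_nickname_mappings`, the `min_probability` parameter and the sample rows are hypothetical and are not taken from the corpus or its documentation.

```python
import csv
import io
from collections import defaultdict


def load_nickname_mappings(handle, min_probability=0.0):
    """Group (nickname, probability, frequency) tuples by given name.

    Assumes four comma-separated fields per line, in the order described
    in the announcement. Lines that do not fit that shape are skipped.
    """
    mappings = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) != 4:
            continue  # blank or malformed line
        given_name, nickname, probability, frequency = (f.strip() for f in row)
        try:
            prob = float(probability)
            freq = int(frequency)
        except ValueError:
            continue  # e.g. a header line, if the file has one
        if prob >= min_probability:
            mappings[given_name].append((nickname, prob, freq))
    return dict(mappings)


# Made-up rows for illustration only; they are NOT data from the corpus.
sample = io.StringIO(
    "william,bill,0.35,1200\n"
    "william,will,0.25,800\n"
    "robert,bob,0.5,1500\n"
)
mappings = load_nickname_mappings(sample, min_probability=0.3)
```

With a real copy of the data, `handle` would be the opened text file. The optional `min_probability` filter is a convenience added here; the announcement states that the distributed file has already been thresholded to remove likely errors.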
5-2-3 | Speechocean January 2012 update
Speechocean - Language Resource Catalogue - New Releases (01-2012)
Speechocean, as a global provider of language resources and data services, has more than 200 large-scale databases available in 80+ languages and accents covering the fields of Text to Speech, Automatic Speech Recognition, Text, Machine Translation, Web Search, Videos, Images, etc.
Speechocean is glad to announce that more speech resources have been released:

Chinese and English Mixing Speech Synthesis Database (Female)
The Chinese Mandarin TTS Speech Corpus contains the read speech of a native Chinese female professional broadcaster recorded in a studio with high SNR (>35 dB) over two channels (AKG C4000B microphone and Electroglottography (EGG) sensor). All speech data are segmented and labeled at the phone level. A pronunciation lexicon and pitch extracted from the EGG signal can also be provided on demand.

France French Speech Recognition Corpus (desktop) – 50 speakers
This France French desktop speech recognition database was collected by SpeechOcean in France. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages. It contains the voices of 50 different native speakers who were evenly distributed by age (mainly 16-30, 31-45, 46-60), gender (28 males, 22 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and accompanied by an ASCII SAM label file which contains the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

UK English Speech Recognition Corpus (desktop) – 50 speakers
This UK English desktop speech recognition database was collected by SpeechOcean in England. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages. It contains the voices of 50 different native speakers who were evenly distributed by age (mainly 16-30, 31-45, 46-60), gender (28 males, 22 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and accompanied by an ASCII SAM label file which contains the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

US English Speech Recognition Corpus (desktop) – 50 speakers
This US English desktop speech recognition database was collected by SpeechOcean in the United States. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages. It contains the voices of 50 different native speakers who were evenly distributed by age (mainly 16-30, 31-45, 46-60), gender (25 males, 25 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and accompanied by an ASCII SAM label file which contains the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

Italian Speech Recognition Corpus (desktop) – 50 speakers
This Italian desktop speech recognition database was collected by SpeechOcean in Italy. It is one of the databases of our Speech Data - Desktop Project (SDD), which currently contains database collections for 30 languages. It contains the voices of 50 different native speakers who were evenly distributed by age (mainly 16-30, 31-45, 46-60), gender (23 males, 27 females) and regional accent. The script was specially designed to provide material for both training and testing of many classes of speech recognition applications. Each speaker recorded 500 utterances in a quiet office environment through two professional microphones. Each utterance is stored in 44.1 kHz, 16-bit uncompressed PCM format and accompanied by an ASCII SAM label file which contains the relevant descriptive information. A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
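
The desktop corpora above are described as 44.1 kHz, 16-bit uncompressed PCM. As a rough illustration of what that implies for a user, the Python sketch below decodes such audio and computes its duration. It assumes headerless, little-endian, signed samples and a known channel count; the announcement does not specify the byte order, the channel layout or whether a file header is present, so these are assumptions to check against the corpus documentation. The function name `decode_pcm16` is hypothetical.

```python
import struct

SAMPLE_RATE_HZ = 44_100   # "44.1K" in the corpus description
BYTES_PER_SAMPLE = 2      # "16Bit"


def decode_pcm16(raw, channels=1):
    """Decode raw PCM bytes into a tuple of ints plus a duration in seconds.

    Assumes headerless, little-endian, signed 16-bit samples (an assumption,
    not something stated in the announcement). Trailing bytes that do not
    form a complete frame are ignored.
    """
    frame_bytes = BYTES_PER_SAMPLE * channels
    usable = len(raw) - len(raw) % frame_bytes
    count = usable // BYTES_PER_SAMPLE
    samples = struct.unpack(f"<{count}h", raw[:usable])
    duration_s = (count // channels) / SAMPLE_RATE_HZ
    return samples, duration_s


# Four synthetic samples, packed the same way, stand in for a real utterance.
demo = struct.pack("<4h", 0, 1, -1, 32767)
samples, duration_s = decode_pcm16(demo)
```

If the files turn out to carry a WAV header, Python's standard `wave` module would be the more appropriate reader.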
For more information about our databases and services, please visit our website www.speechocean.com or our online catalogue at http://www.speechocean.com/en-Product-Catalogue/Index.html
If you have any inquiries regarding our databases and services, please feel free to contact us:
Xianfeng Cheng: Chengxianfeng@speechocean.com
Marta Gherardi: Marta@speechocean.com
5-2-4 | Appen ButlerHill
Appen ButlerHill
A global leader in linguistic technology solutions

RECENT CATALOG ADDITIONS: MARCH 2012

1. Speech Databases
1.1 Telephony
2. Pronunciation Lexica
Appen Butler Hill has considerable experience in providing a variety of lexicon types. These include:
- Pronunciation lexica providing phonemic representation, syllabification, and stress (primary and secondary as appropriate)
- Part-of-speech tagged lexica providing grammatical and semantic labels
- Other reference text-based materials including spelling/misspelling lists, spell-check dictionaries, mappings of colloquial language to standard forms, and orthographic normalization lists
Over a period of 15 years, Appen Butler Hill has generated a significant volume of licensable material for a wide range of languages. For holdings information in a given language or to discuss any customized development efforts, please contact: sales@appenbutlerhill.com
4. Other Language Resources
- Morphological Analyzers – Farsi/Persian & Urdu
- Arabic Thesaurus
- Language Analysis Documentation – multiple languages
For additional information on these resources, please contact: sales@appenbutlerhill.com

5. Customized Requests and Package Configurations
Appen Butler Hill is committed to providing a low-risk, high-quality, reliable solution and has worked in 130+ languages to date, supporting both large global corporations and government organizations. We would be glad to discuss any customized requests or package configurations and prepare a customized proposal to meet your needs.