ISCA - International Speech
Communication Association

ISCApad #149

Friday, November 05, 2010 by Chris Wellekens

5-2 Database

5-2-1

Bell System Technical Journal (1922-1983) available

I received this very good news from Joseph P. Campbell (MIT Lincoln
Laboratory):

The entire Bell System Technical Journal from 1922--1983 is now
available on line! It is offered in PDF format sorted by year, volume, and
issue; and it is searchable:

http://bstj.bell-labs.com/

BSTJ holds a wealth of consolidated information of outstanding contributions
from Bell Labs over the years. For example, check out Fletcher's 1922
article 'The Nature of Speech and Its Interpretation', BSTJ, vol 1, no 1.
Also see Shannon's landmark paper and much, much more!

As Joe pointed out, the current generation of researchers didn't grow up
with BSTJs under their bed like we both did, but for the older generation
that remembers these great articles, this is indeed very good news.

--Isabel Trancoso

5-2-2

ELRA Language Resources Catalogue Update (June 2010)

ELRA is happy to announce that 2 new Speech Desktop/Microphone resources, 1 new Terminological Resource and 1 Written Corpus are now available in its catalogue:

ELRA-S0305 EPAC Corpus: orthographic transcriptions

This corpus consists of approx. 100 hours of manual orthographic transcriptions, which were produced from 1,677 hours of non-transcribed recordings from the ESTER Evaluation Campaign (Technolangue programme). It also includes automatic transcriptions of the full 1,677 hours.

For more information, see: http://catalog.elra.info/product_info.php?products_id=1119

ELRA-S0307 BABEL Polish database

The BABEL Polish Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). It consists of the basic 'common' set, which contains the Many Talker Set (30 males, 30 females), the Few Talker Set (5 males, 5 females) and the Very Few Talker Set (1 male, 1 female).

For more information, see: http://catalog.elra.info/product_info.php?products_id=1120

ELRA-T0374 Terminology database of natural sciences

This dictionary covers the three kingdoms: Animal, Vegetal, Mineral. It contains 50,000 species with numerous synonyms in French, English and Latin and many breeds and varieties. Minerals are given with their chemical formula. About 7,900 definitions in French are included. It also includes synonyms and linguistic variants.

For more information, see: http://catalog.elra.info/product_info.php?products_id=1121

ELRA-W0053 Catalan-Spanish Parallel Corpus

This corpus contains more than 100 million words and comprises 10 years of bilingual articles from “El Periódico de Catalunya”. The data are aligned at sentence level and stored in text files on a one-sentence-per-line basis. The data are provided in plain text, with no encoding whatsoever.

For more information, see: http://catalog.elra.info/product_info.php?products_id=1122
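As a rough illustration of what sentence-level alignment on a one-sentence-per-line basis means in practice, the sketch below pairs line i of the Catalan side with line i of the Spanish side. It is only a sketch under stated assumptions: the function name and the sample sentences are hypothetical, and the actual file names and character encoding of the corpus are not stated here.

```python
from itertools import zip_longest


def read_aligned(catalan_lines, spanish_lines):
    """Pair line i of one side with line i of the other side.

    Both arguments are iterables of lines (for example, open file objects).
    Hypothetical helper: the real corpus layout may differ.
    """
    pairs = []
    for ca, es in zip_longest(catalan_lines, spanish_lines):
        # zip_longest pads the shorter side with None, which would mean
        # the two sides are not aligned line for line.
        if ca is None or es is None:
            raise ValueError("the two sides have different numbers of lines")
        pairs.append((ca.rstrip("\n"), es.rstrip("\n")))
    return pairs


# Made-up sample sentences, not taken from the corpus.
sample = read_aligned(["Bon dia\n", "Gràcies\n"], ["Buenos días\n", "Gracias\n"])
```

Raising an error on a length mismatch is a design choice: a silently truncated pairing would shift every later sentence pair.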

******

Moreover, please note that the content of the following 3 Terminological Resources has been updated and their prices have been revised:

ELRA-T0102 Terminology database of expressions

This resource comprises about 26,000-30,000 expressions, such as sayings, proverbs, idioms, slogans, citations, exclamations, onomatopoeias and figurative expressions of French and English. Several grammatical topics that are included in some sentences are also handled. This resource contains synonyms. The DISCIPLINE field refers to the expression category: proverbs, idioms, postposition verbs.

For more information, see: http://catalog.elra.info/product_info.php?products_id=114

ELRA-T0103 Terminology database of finance

For more information, see: http://catalog.elra.info/product_info.php?products_id=115

ELRA-T0367 Terminology database of telecommunication

This resource comprises over 89,200 entries in the field of telecommunication. It also contains many synonyms and abbreviations in both languages, as well as meaning, case or applications for polysemic terms.

For more information, see: http://catalog.elra.info/product_info.php?products_id=659

For more information on the catalogue, please contact Valérie Mapelli mailto:mapelli@elda.org

Visit our On-line Catalogue: http://catalog.elra.info

Visit the Universal Catalogue: http://universal.elra.info

Archives of ELRA Language Resources Catalogue Updates: http://www.elra.info/LRs-Announcements.html

*****************************************************************

ELRA - Language Resources Catalogue - Update

*****************************************************************

In the framework of our ongoing campaign for updating and reducing the prices of the language resources distributed in the ELRA catalogue, ELRA is happy to announce that the prices for the following resources have been substantially reduced:

ELRA-S0074 British English SpeechDat(II) MDB-1000

This speech database contains the recordings of 1,000 British speakers recorded over the British mobile telephone network. Each speaker uttered around 40 read and spontaneous items.

For more information, see: http://catalog.elra.info/product_info.php?products_id=723

ELRA-S0075 Welsh SpeechDat(II) FDB-2000

This speech database contains the recordings of 2,000 Welsh speakers recorded over the British fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

For more information, see: http://catalog.elra.info/product_info.php?products_id=557

ELRA-S0101 Spanish SpeechDat(II) FDB-1000

This speech database contains the recordings of 1,000 Castilian Spanish speakers recorded over the Spanish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

This database is a subset of the Spanish SpeechDat(II) FDB-4000 (ref. ELRA-S0102).

For more information, see: http://catalog.elra.info/product_info.php?products_id=726

ELRA-S0102 Spanish SpeechDat(II) FDB-4000

This speech database contains the recordings of 4,000 Castilian Spanish speakers recorded over the Spanish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

This database includes the Spanish SpeechDat(II) FDB-1000 (ref. ELRA-S0101).

For more information, see: http://catalog.elra.info/product_info.php?products_id=727

ELRA-S0140 Spanish SpeechDat-Car database

The Spanish SpeechDat-Car database contains the recordings in a car of 306 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

For more information, see: http://catalog.elra.info/product_info.php?products_id=690

ELRA-S0141 SALA Spanish Venezuelan Database

This speech database contains the recordings of 1,000 Venezuelan speakers recorded over the Venezuelan fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

For more information, see: http://catalog.elra.info/product_info.php?products_id=736

ELRA-S0297 Hungarian Speecon database

The Hungarian Speecon database comprises the recordings of 555 adult Hungarian speakers and 50 child Hungarian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

For more information, see: http://catalog.elra.info/product_info.php?products_id=1094

ELRA-S0298 Czech Speecon database

The Czech Speecon database comprises the recordings of 550 adult Czech speakers and 50 child Czech speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

For more information, see: http://catalog.elra.info/product_info.php?products_id=1095

5-2-3

LDC Newsletter (October 2010)

In this newsletter:

- Fall 2010 LDC Data Scholarship Winners! -

- Position Openings at LDC -

New Publications:

LDC2010T18
- ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 -

LDC2010T19
- Korean Newswire Second Edition -

LDC2010T17
- NIST 2006 Open Machine Translation (OpenMT) Evaluation -

Fall 2010 LDC Data Scholarship Winners!

LDC is pleased to announce the winners in our first-ever LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. Data scholarships are offered twice a year to correspond to the Fall and Spring semesters. Students are asked to complete an application which consists of a data use proposal and a letter of support from their academic adviser.

LDC received many strong applications from both undergraduate and graduate students attending universities across the globe. The decision process was difficult, and after much deliberation, we have selected eight winners! These students will receive no-cost copies of LDC data valued at over US$10,000:

Aby Abraham - Ohio University (USA), graduate student, Electrical Engineering. Aby has been awarded a copy of 2003 NIST Speaker Recognition Evaluation (LDC2010S03) for his work in using long term memory cells for continuous speech recognition.

Ripandy Adha - Bandung Institute of Technology (Indonesia), undergraduate student, Computer Science. Ripandy has been awarded a copy of American English Spoken Lexicon (LDC99L23) to assist in the development of a voice command internet browser.

Basawaraj - Ohio University (USA), PhD candidate, Electrical Engineering and Computer Science. Basawaraj has been awarded a copy of NIST 2002 Open Machine Translation (OpenMT) Evaluation (LDC2010T10) to assist in fine tuning his machine translation system and to provide a benchmark dataset.

Zachary Brooks - University of Arizona (USA), PhD candidate, Second Language Acquisition and Teaching. Zachary and his research group have been awarded a copy of ECI Multilingual Text (LDC94T5) for research in eye movement tracking by native and non-native readers.

Marco Carmosino - Hampshire College (USA), undergraduate student, Computer Science. Marco has been awarded a copy of English Gigaword Fourth Edition (LDC2009T13) for his work in narrative chain extraction.

Xiaohui Huang - Harbin Institute of Technology (China), Shenzhen Graduate School. Xiaohui has been awarded a copy of TDT5 Topics and Annotations (LDC2006T19) for his work in topic detection and tracking for large-scale web data.

Yuhuan Zhou - PLA University of Science and Technology (China), postgraduate student, Institute of Communications Engineering. Yuhuan has been awarded a copy of 2002 NIST Speaker Recognition Evaluation (LDC2004S04) to assist in the development of a speaker recognition system which fuses support vector data description (SVDD) and Gaussian mixture model (GMM).

Speaker Recognition Group (GEDA) with members Matias Fineschi, Gonzalo Lavigna, Jorge Prendes, and Pablo Vacatello - Buenos Aires Institute of Technology (Argentina), Department of Electrical Engineering. GEDA has been awarded a copy of 2004 NIST Speaker Recognition Evaluation (LDC2006S44) to assist in the development of a flexible speaker verification platform capable of implementing different feature extraction methods, normalizations, stochastic models and outputs.

Please join us in congratulating our student winners! The next LDC Data Scholarship program is scheduled for the Spring 2011 semester. Stay tuned for further announcements.

Position Openings at LDC

Linguistic Data Consortium at the University of Pennsylvania has a number of immediate openings for full-time positions to support our corpus development projects:

PROGRAMMER ANALYST (#100528459 and #100929195)

Support linguistic data collection and annotation projects by providing software development, system integration, technical and research support, annotation tool development and/or data collection system management.

SENIOR PROJECT MANAGER (#100728923 and #100728924)

Provide complete oversight for multiple, concurrent corpus creation projects, including collection, annotation and distribution of speech, text and/or video data in a variety of languages. Create project roadmaps and direct teams of programmers, linguists and managers to execute deliverables; represent corpus creation efforts to external researchers and sponsors.

LEAD ANNOTATOR (#100728920)

Perform linguistic annotation on English text, speech and video data; recruit, train and supervise teams of annotators for multiple tasks and languages; define, test and document procedural approaches to linguistic annotation; perform quality control on annotated data.

For further information on the duties and qualifications for these positions, or to apply online please visit https://jobs.hr.upenn.edu/; search postings for the reference numbers indicated above.

Penn offers an excellent benefits package including medical/dental, retirement plans, tuition assistance and a minimum of three weeks paid vacation per year. The University of Pennsylvania is an affirmative action/equal opportunity employer. All positions contingent upon grant funding.
For more information about LDC and the programs we support, visit http://www.ldc.upenn.edu/.

New Publications

(1) ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 was developed by researchers at The MITRE Corporation. It contains the English evaluation data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the Automatic Content Extraction (ACE) program, specifically, English broadcast news and newswire data collected by LDC. The training data for this evaluation can be found in ACE Time Normalization (TERN) 2004 English Training Data v 1.0 LDC2005T07.

The purpose of the TERN evaluation is to advance the state of the art in the automatic recognition and normalization of natural language temporal expressions. In most language contexts such expressions are indexical. For example, with 'Monday,' 'last week,' or 'three months starting October 1,' one must know the narrative reference time in order to pinpoint the time interval being conveyed by the expression. In addition, for data exchange purposes, it is essential that the identified interval be rendered according to an established standard, i.e., normalized. Accurate identification and normalization of temporal expressions are in turn essential for the temporal reasoning being demanded by advanced NLP applications such as question answering, information extraction and summarization.
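To make the indexical point concrete, here is a minimal sketch that resolves two of the expressions mentioned above against a narrative reference date. It is an illustration only: the function name is hypothetical, the rule that a bare weekday means the most recent such day on or before the reference date is an assumption, and the output uses plain ISO 8601 dates as a stand-in rather than the notation of the TIDES annotation standard.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression, reference):
    """Return (start, end) ISO dates for a few simple expressions.

    `reference` is the narrative reference time as a datetime.date.
    """
    text = expression.lower()
    if text == "last week":
        # Find Monday of the reference week, then step back seven days.
        this_monday = reference - timedelta(days=reference.weekday())
        start = this_monday - timedelta(days=7)
        end = start + timedelta(days=6)
        return start.isoformat(), end.isoformat()
    if text in WEEKDAYS:
        # Assumption: a bare weekday name refers to the most recent such
        # day on or before the reference date.
        delta = (reference.weekday() - WEEKDAYS.index(text)) % 7
        day = reference - timedelta(days=delta)
        return day.isoformat(), day.isoformat()
    raise ValueError(f"unsupported expression: {expression!r}")


# The same expression yields different intervals for different reference dates.
first = normalize("Monday", date(2010, 11, 5))
second = normalize("Monday", date(2010, 11, 12))
previous_week = normalize("last week", date(2010, 11, 5))
```

A real system would also have to recognize the expressions in running text and handle far more patterns; this sketch covers only the normalization step for two fixed strings.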

The data in this release is English broadcast transcripts and newswire material from TDT4 Multilingual Text and Annotations LDC2005T16. The annotation specifications for this corpus were developed under DARPA's Translingual Information Detection Extraction and Summarization (TIDES) program, with support from ACE. All files have been doubly-annotated by two separate annotators and then reconciled, using the TIDES 2003 Standard for the Annotation of Temporal Expressions. The data directory contains the corpus which consists of 192 files (54K words).

(2) Korean Newswire Second Edition is an archive of Korean newswire text that has been acquired over several years (1994-2009) at LDC from the Korean Press Agency. This release includes all of the content of Korean Newswire (LDC2000T45) (June 1994-March 2000) as well as newly-collected data. The second edition contains all data collected by LDC from April 2000 through December 2009.

All material, including that from the first release, has been converted to UTF-8 (except for more recent data already in UTF-8 format) and processed in LDC's gigaword format. The gigaword format classifies newswire content into three types: story, multi and other where 'story' refers to an article containing information pertaining to a particular event on a day; 'multi' refers to an article that contains more than one story relating to different topics; and 'other' refers to articles containing lists, tables or numerical data, such as sports scores.
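The three document types lend themselves to a quick tally. The sketch below counts documents per type, assuming each document opens with a DOC tag that carries id and type attributes in that order; that markup is an assumption about the release rather than something stated above, and the sample document ids are invented.

```python
import re
from collections import Counter

# Assumed markup: <DOC id="..." type="story|multi|other">
DOC_TAG = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')


def count_types(text):
    """Count documents of each type found in a block of corpus text."""
    return Counter(doc_type for _doc_id, doc_type in DOC_TAG.findall(text))


# Invented sample with made-up document ids.
sample_text = """
<DOC id="SAMPLE_20090101.0001" type="story">
</DOC>
<DOC id="SAMPLE_20090101.0002" type="other">
</DOC>
<DOC id="SAMPLE_20090101.0003" type="story">
</DOC>
"""
type_counts = count_types(sample_text)
```

Filtering on the type is useful because, for many language modelling tasks, one might keep only 'story' documents and skip tables and score lists.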

A word break error in the original release and in data collected from January 2002 through February 2005 has been corrected in the second edition with the result that all Korean text should display correctly. The error involved a line break in the middle of a word with the result that an affected word appeared in segments in two lines. This problem was resolved using word histograms and a few

(3) NIST 2006 Open Machine Translation (OpenMT) Evaluation is a package containing source data, reference translations and scoring software used in the NIST 2006 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. The package was compiled and scoring software was developed by researchers at NIST, making use of broadcast, newswire and web newsgroup source data and reference translations collected and developed by LDC.

The objective of the NIST Open Machine Translation (OpenMT) evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues and to be fully supported. The 2006 task was to evaluate translation from Arabic to English and from Chinese to English. Additional information about these evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation web site.

This evaluation kit includes a single Perl script (mteval-v11b.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
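The idea of matching word sequences against several reference translations can be sketched as a clipped n-gram precision. This is a simplified illustration, not the algorithm implemented in the Perl script: the function names are hypothetical, tokenization is plain whitespace splitting, and no length penalty or averaging across n-gram orders is applied.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in some reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Toy sentences, not taken from the evaluation data.
refs = ["the cat is on the mat", "there is a cat on the mat"]
unigram = clipped_precision("the cat sat on the mat", refs, 1)  # 5 of 6 match
bigram = clipped_precision("the cat sat on the mat", refs, 2)   # 3 of 5 match
```

Clipping prevents a system from scoring well by repeating a word that happens to appear in a reference.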

The included scoring script was released with the original evaluation, intended for use with SGML-formatted data files, and is provided to ensure compatibility of user scoring results with results from the original evaluation. An updated scoring software package (mteval-v13a-20091001.tar.gz), with XML support, additional options and bug fixes, documentation, and example translations, may be downloaded from the NIST Multimodal Information Group Tools website.

This release consists of 357 documents with corresponding sets of four separate human expert reference translations. The source data comprises Arabic and Chinese newswire documents, human transcriptions of broadcast news and broadcast conversation programs and web newsgroup documents collected by LDC in 2006. The newswire and broadcast material are from Agence France-Presse (Arabic, Chinese), Xinhua News Agency (Arabic, Chinese), Lebanese Broadcasting Corp. (Arabic), Dubai TV (Arabic), China Central TV (Chinese) and New Tang Dynasty Television (Chinese). The web text was collected from Google and Yahoo newsgroups.

For each language, the test set consists of two files: a source and a reference file. Each reference file contains four independent translations of the data set. The evaluation year, source language, test set, version of the data, and source vs. reference file are reflected in the file name.

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        ldc@ldc.upenn.edu

Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
