ISCA - International Speech
Communication Association



ISCApad #196

Sunday, October 12, 2014 by Chris Wellekens

5-2-3 LDC Newsletter (September 2014)
  

 

In this newsletter:

LDC at Interspeech 2014, Singapore

New publications:

ACE 2007 Multilingual Training Corpus

GALE Arabic-English Word Alignment -- Broadcast Training Part 1

GALE Phase 2 Chinese Newswire Parallel Text Part 2


LDC at Interspeech 2014, Singapore

LDC is off to Singapore to participate in Interspeech 2014. This year’s conference will be held September 14-18 at Singapore’s Max Atria at the Expo Center. Please stop by LDC’s exhibition booth to learn more about recent developments at the Consortium and new publications. LDC will continue to post conference updates via our Facebook page. We hope to see you there!

 

New publications

(1) ACE 2007 Multilingual Training Corpus was developed by LDC and contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

The Arabic data is composed of newswire (60%) published in October 2000-December 2000 and weblogs (40%) published during the period November 2004-February 2005. The Spanish data set consists entirely of newswire material from multiple sources published in January 2005-April 2005. A document pool was established for each language based on genre and epoch requirements. Humans reviewed the pool to select individual documents suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types. One annotator completed the entity and temporal expression (TIMEX2) markup in the first pass annotation. This work was reviewed in the second pass by a senior annotator. TIMEX2 values were normalized by an annotator specifically trained for that task.

The table below describes the amount of data included in the current release and its annotation status. Corpus content for each language and data type is represented in the three stages of annotation: first pass annotation (1P), second pass annotation (2P) and TIMEX2 normalization and additional quality control (NORM).

Arabic (NW = newswire, WL = weblogs)

         Words                        Files
         1P       2P       NORM       1P     2P     NORM
NW       58,015   58,015   58,015     257    257    257
WL       40,338   40,338   40,338     121    121    121
Total    98,353   98,353   98,353     378    378    378

Spanish (NW = newswire)

         Words                          Files
         1P        2P        NORM       1P     2P     NORM
NW       100,401   100,401   100,401    352    352    352
Total    100,401   100,401   100,401    352    352    352

For a given document, there is a source .sgm file together with the .ag.xml and .apf.xml annotation files in each of the three directories '1p', '2p' and 'timex2norm'. In other words, for each newswire story or weblog entry, the three annotation directories each contain an identical copy of the source text (SGML .sgm file) along with distinct versions of the associated annotations (XML .ag.xml and .apf.xml files and plain text .tab files). All files are presented in UTF-8.
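The layout described above can be sketched as a small check that lists the files one would expect for a single document. This is only an illustration: the flat `root/<stage>/<doc_id><suffix>` path scheme, the function names and the document ID are assumptions, and the actual release may nest the stage directories under language or genre folders.

```python
from pathlib import Path

# Directory names and file suffixes taken from the description above.
STAGES = ("1p", "2p", "timex2norm")
SUFFIXES = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def expected_files(root, doc_id):
    """Map each annotation stage to the paths expected for one document.

    Assumes a hypothetical flat layout: <root>/<stage>/<doc_id><suffix>.
    """
    root = Path(root)
    return {
        stage: [root / stage / f"{doc_id}{suffix}" for suffix in SUFFIXES]
        for stage in STAGES
    }


def missing_files(root, doc_id):
    """Return the expected paths that do not exist on disk."""
    return [
        path
        for paths in expected_files(root, doc_id).values()
        for path in paths
        if not path.exists()
    ]
```

Calling `missing_files("path/to/corpus", "SOME_DOC_ID")` with a real corpus root and document ID would return an empty list when all twelve files are present under the assumed layout.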

ACE 2007 Multilingual Training Corpus is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$1000.

*

(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 1 was developed by LDC and contains 267,257 tokens of word-aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast conversation data collected by LDC from 2007-2009. The distribution by genre, words, tokens and segments appears below:

Language   Genre   Files   Words     Tokens    Segments
Arabic     BC      231     79,485    103,816   4,114
Arabic     BN      92      131,789   163,441   7,227
Totals             323     211,274   267,257   11,341

(BC = broadcast conversation, BN = broadcast news)

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:

Normalizing tokenized tokens as needed

Identifying different types of links

Identifying sentence segments not suitable for annotation

Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment -- Broadcast Training Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$1750.

 

*

(3) GALE Phase 2 Chinese Newswire Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains 117,895 tokens of Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007 and translated by LDC or under its direction.

This release includes 177 source-translation document pairs, comprising 117,895 tokens of translated data. Data is drawn from four distinct Chinese newswire sources: China News Service, Guangming Daily, People's Daily and People's Liberation Army Daily.

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
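Since TDF files are tab-delimited and UTF-8 encoded, they can be read with a generic tab-separated parser. The sketch below assumes nothing about the individual fields, which are documented in TDF_format.text; the function names are illustrative, and any header or comment lines a real file contains would need handling according to that documentation.

```python
import csv


def read_tdf(stream):
    """Yield each non-empty line of a tab-delimited stream as a list of fields.

    Quoting is disabled so that quotation marks inside the text segment
    are kept literally instead of being treated as CSV quoting.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


def read_tdf_file(path):
    """Read a whole TDF file from disk, decoding it as UTF-8."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(read_tdf(handle))
```

Each returned row is a plain list of strings, so mapping positions to field names is left to the caller once the field order has been confirmed against TDF_format.text.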

GALE Phase 2 Chinese Newswire Parallel Text Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$1750.



 

 


 

 


 

 

