ISCA - International Speech Communication Association



ISCApad #208

Saturday, October 10, 2015 by Chris Wellekens

5-2-2 LDC Newsletter (September 2015)
  

In this newsletter:

 

New publications:

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4

GALE Phase 3 and 4 Arabic Newswire Parallel Text

NewSoMe Corpus of Opinion in News Reports


(1) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 was developed by LDC and contains 243,038 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:

Language  Genre  Files  Words    CharTokens  Segments
Chinese   BC     69     67,782   101,674     2,276
Chinese   BN     29     94,242   141,364     3,152
Total            98     162,024  243,038     5,428

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
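The 1.5 characters-per-word convention can be checked against the reported counts. The following is a minimal sketch, not LDC's own procedure: the names `CHAR_TOKENS` and `estimate_words` are hypothetical, and rounding down is an assumption that happens to reproduce the published word counts.

```python
# Character-token counts per genre, copied from the figures reported above.
CHAR_TOKENS = {"BC": 101_674, "BN": 141_364}


def estimate_words(char_tokens: int) -> int:
    """Convert a character-token count to a word count.

    Uses the stated convention that one word is equivalent to 1.5 characters.
    Rounding down is an assumption made for this sketch; the newsletter does
    not say how fractional words are handled.
    """
    # Dividing by 1.5 is the same as multiplying by 2/3; integer arithmetic
    # avoids floating-point rounding.
    return char_tokens * 2 // 3


words_by_genre = {genre: estimate_words(n) for genre, n in CHAR_TOKENS.items()}
total_words = sum(words_by_genre.values())

for genre, words in words_by_genre.items():
    print(f"{genre}: {words:,} words")
print(f"Total: {total_words:,} words")
```

Under these assumptions the per-genre results are 67,782 (BC) and 94,242 (BN), and their sum is 162,024, matching the reported total. Applying the conversion directly to the 243,038 total characters would give 162,025 instead, which suggests the published total is the sum of the per-genre word counts.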

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging eight different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$1750.

 


*

(2)  GALE Phase 3 and 4 Arabic Newswire Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 551 source-translation document pairs, comprising 156,775 tokens of Arabic source text and its English translation. Data is drawn from seven distinct Arabic newswire sources: Agence France Presse, Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. The transcribed and segmented files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.  Source data and translations are distributed in TDF format.

GALE Phase 3 and 4 Arabic Newswire Parallel Text is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$1750.

 

*

(3) NewSoMe Corpus of Opinion in News Reports was compiled at Barcelona Media and consists of Spanish, Catalan and Portuguese news reports annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

The source data in this release was obtained from various newspaper websites and consists of approximately 200 documents in each of Spanish, Catalan and Portuguese. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.

NewSoMe Corpus of Opinion in News Reports is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$750.


 


 

 

 

 

