ISCA - International Speech Communication Association


ISCApad #246

Thursday, December 13, 2018 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (October 2018)
We are happy to announce that 2 new Written Corpora and 4 new Speech resources are now available in our catalogue.

ELRA-W0126 Training and test data for Arabizi detection and transliteration
ISLRN: 986-364-744-303-9
The dataset is composed of: a collection of mixed English and Arabizi text intended to train and test a system for the automatic detection of code-switching in mixed English and Arabizi texts; and a set of 3,452 Arabizi tokens manually transliterated into Arabic, intended to train and test a system that performs Arabizi-to-Arabic transliteration.
For more information, see:

ELRA-W0127 Normalized Arabic Fragments for Inestimable Stemming (NAFIS)
ISLRN: 305-450-745-774-1
This is an Arabic stemming gold standard corpus composed of a collection of 37 sentences, selected to be representative of Arabic stemming tasks and manually annotated. The compiled sentences come from various sources (poems, the Holy Quran, books, and periodicals) and are of diverse kinds (proverb and dictum, article commentary, religious text, literature, historical fiction). NAFIS is represented according to the TEI standard.
For more information, see:

ELRA-S0396 Mbochi speech corpus
ISLRN: 747-055-093-447-8
This corpus consists of 5,131 sentences recorded in Mbochi, together with their transcriptions and French translations, as well as the results of the work carried out during the JSALT workshop: alignments at the phonetic level and various results of unsupervised word segmentation from audio. The audio corpus amounts to 4.5 hours, downsampled to 16 kHz, 16 bits, with Linear PCM encoding. The data is distributed in 2 parts: one for training, consisting of 4,617 sentences, and one for development, consisting of 514 sentences.
For more information, see:

ELRA-S0397 Chinese Mandarin (South) database
ISLRN: 503-886-852-083-2
This database contains the recordings of 1,000 Chinese Mandarin speakers from Southern China (500 males and 500 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16 bits, Linear PCM.
For more information, see:

ELRA-S0398 Chinese Mandarin (North) database
ISLRN: 353-548-770-894-7
This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz mono, 16 bits, Linear PCM.
For more information, see:

ELRA-S0401 Persian Audio Dictionary
ISLRN: 133-181-128-420-9
This dictionary consists of more than 50,000 entries (along with almost all wordforms and proper names) with corresponding audio files in MP3 and English transliterations. The words have been recorded with standard Persian (Farsi) pronunciation (all by a single speaker). This dictionary is provided with its software.
For more information, see:

For more information on the catalogue, please contact Valérie Mapelli

If you would like to enquire about having your resources distributed by ELRA, please do not hesitate to contact us.

Visit our On-line Catalogue:
Visit the Universal Catalogue:
Archives of ELRA Language Resources Catalogue Updates:

