ISCA - International Speech
Communication Association



ISCApad #303

Thursday, September 07, 2023 by Chris Wellekens

5-2-2 ELRA - Language Resources Catalogue - Update (July 2023)
  

We are happy to announce that 66 new monolingual lexicons and 1 speech resource are now available in our catalogue. Moreover, 4 speech resources are now available at reduced fees.

 

1) New Language Resources:

 

Bitext Lexical Datasets

The series of Bitext Lexical Datasets for the generic vocabulary includes Lemmas, POS tagging, Frequency, Named Entities and Offensive features. Depending on the dataset and language, other syntactic and morphological features are also provided. Datasets are available for 15 languages and, as a complement, 11 datasets of Language Variants can also be obtained. The datasets are listed below by language:

 

  1. Arabic (MSA) dataset and Arabic Language Variants dataset consisting of Arabic Gulf, Arabic Najdi, Arabic Egypt and Arabic MSA variants,
  2. Chinese (Simplified) dataset, Chinese (Traditional) dataset, and Chinese Language Variants dataset (Simplified + Traditional),
  3. Dutch dataset and Dutch Language Variants dataset consisting of Netherlands and Belgium variants,
  4. English dataset and English Language Variants dataset consisting of United States, United Kingdom and India variants,
  5. Finnish dataset and Finnish Language Variants dataset consisting of Standard and Colloquial Finnish variants,
  6. French dataset and French Language Variants dataset consisting of France, Canada and Switzerland variants,
  7. German dataset and German Language Variants dataset consisting of Germany and Switzerland variants,
  8. Indonesian dataset,
  9. Italian dataset and Italian Language Variants dataset consisting of Italy and Switzerland variants,
  10. Malay dataset,
  11. Norwegian (Bokmal) dataset and Norwegian Language Variants dataset consisting of Bokmal and Nynorsk variants,
  12. Portuguese dataset and Portuguese Language Variants dataset consisting of Portugal and Brazil variants,
  13. Spanish dataset and Spanish Language Variants dataset consisting of Spain, North America, Central America, Andes and Southern Cone variants.

 

Bitext Synthetic Data

The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals in English and Spanish. They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. The data are distributed as models or open text files.

For each language, the following verticals are available:

  1. Automotive: 52 intents (English, Spanish)
  2. Retail banking: 26 intents (English, Spanish)
  3. Education: 37 intents (English, Spanish)
  4. Event and ticketing: 25 intents (English, Spanish)
  5. Field Service: 27 intents (English, Spanish)
  6. Healthcare: 40 intents (English, Spanish)
  7. Hospitality: 24 intents (English, Spanish)
  8. Insurance: 38 intents (English, Spanish)
  9. Legal: 29 intents (English, Spanish)
  10. Manufacturing: 34 intents (English, Spanish)
  11. Media Streaming: 24 intents (English, Spanish)
  12. Mortgage and loans: 39 intents (English, Spanish)
  13. Moving and storage: 29 intents (English, Spanish)
  14. Real estate and construction: 28 intents (English, Spanish)
  15. Restaurant/bar chains: 30 intents (English, Spanish)
  16. Retail Ecomm: 34 intents (English, Spanish)
  17. Telecommunication: 26 intents (English, Spanish)
  18. Travel: 33 intents (English, Spanish)
  19. Utilities: 21 intents (English, Spanish)
  20. Wealth management: 24 intents (English, Spanish)

 

Persian Kids’ Speech Corpus

The Persian Kids’ Speech Corpus consists of speech signals recorded from 286 children (141 girls, 145 boys) aged 6 to 9, using an Andreas Mic Anti-Noise microphone and a Premium Speechmike headphone. The recorded data were manually checked and labeled, yielding a corpus of 162,395 samples with a total duration of 33 hours and 44 minutes. The samples are distributed as follows:

  1. 29,057 Words (478 minutes),
  2. 17,429 SubWords (260 minutes),
  3. 43,838 Syllables (485 minutes),
  4. 70,078 Phonemes (765 minutes),
  5.  1,993 Extra Vocabulary (36 minutes).

The corpus contains all 29 Persian phonemes, along with 118 syllables, 56 sub-words, and 711 words, and is particularly suited to speech recognition and linguistics studies.
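
As a quick consistency check, the per-category figures listed above can be summed and compared with the stated corpus totals. The snippet below is only an illustrative sketch: the dictionary layout and variable names are assumptions made for this example and are not part of the ELRA distribution.

```python
# Sample counts and durations (in minutes) per unit type,
# copied from the distribution list in the announcement.
# The dictionary layout is an assumption made for this sketch.
distribution = {
    "Words": (29_057, 478),
    "SubWords": (17_429, 260),
    "Syllables": (43_838, 485),
    "Phonemes": (70_078, 765),
    "Extra Vocabulary": (1_993, 36),
}

# Sum the sample counts and the durations across all unit types.
total_samples = sum(count for count, _ in distribution.values())
total_minutes = sum(minutes for _, minutes in distribution.values())

# Convert the total duration from minutes to hours and minutes.
hours, minutes = divmod(total_minutes, 60)

print(f"{total_samples:,} samples, {hours} h {minutes} min")
# prints "162,395 samples, 33 h 44 min"
```

Both sums agree with the totals quoted for the corpus (162,395 samples; 33 hours and 44 minutes).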

 

2) Reduced fees for the following speech resources:


For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please contact us.
_________________________________________

Visit the ELRA Catalogue of Language Resources
Visit the Universal Catalogue 
Archives of ELRA Language Resources Catalogue Updates

 

 

 
 
