ISCA - International Speech Communication Association

ISCApad #295

Monday, January 09, 2023 by Chris Wellekens

5-2-4 Google's Language Model benchmark

An LM benchmark is available at: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

Here is a brief description of the project.

'The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

unpruned Katz (1.1B n-grams),
pruned Katz (~15M n-grams),
unpruned Interpolated Kneser-Ney (1.1B n-grams),
pruned Interpolated Kneser-Ney (~15M n-grams)

ArXiv paper: http://arxiv.org/abs/1312.3005

Happy benchmarking!'
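
The per-word log-probability values mentioned in the description are the usual starting point for computing perplexity, the standard figure of merit in language modeling. The sketch below is only an illustration of that calculation. The description does not state the logarithm base or the file layout of the released values, so the function takes an already-parsed list of numbers and treats the base as a parameter (base 10 is an assumed default here, not a documented fact). The name `perplexity` and the sample values are hypothetical.

```python
import math
from typing import Iterable


def perplexity(log_probs: Iterable[float], base: float = 10.0) -> float:
    """Return perplexity from per-word log-probabilities.

    `base` is the logarithm base of the inputs. The benchmark description
    does not specify it, so the default of 10 is an assumption.
    """
    values = list(log_probs)
    if not values:
        raise ValueError("need at least one log-probability")
    # Perplexity = base ** (-(1/N) * sum of log_base p(w_i | history))
    avg_neg_log_prob = -sum(values) / len(values)
    return base ** avg_neg_log_prob


# Hypothetical log10 probabilities for a three-word held-out sample.
sample = [-1.0, -2.0, -3.0]
print(perplexity(sample))

# The same function works for natural-log inputs by passing base=math.e.
print(perplexity([math.log(0.5)] * 4, base=math.e))
```

Computing this quantity for each of the ten held-out sets, under each baseline model, would give numbers that can be compared directly with a new model evaluated on the same data.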
