Abstract
We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed, with an emphasis on the problems that arise when working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5–1.4 BLEU by using large language models to translate into various languages.
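To make the deduplicate-then-count idea concrete, the following is a minimal, hypothetical Python sketch: it drops exact-duplicate lines (a rough stand-in for the boilerplate reduction described above) and counts 5-grams while keeping singletons, which Kneser-Ney smoothing requires. The function names and structure are illustrative assumptions; the actual pipeline in the paper runs at web scale with very different engineering.

```python
from collections import Counter

def deduplicated_lines(lines):
    """Drop exact-duplicate lines; a toy version of boilerplate reduction."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key

def ngrams(tokens, n=5):
    """Yield all n-grams of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def count_ngrams(lines, n=5):
    """Count n-grams over deduplicated text, keeping singleton entries."""
    counts = Counter()
    for line in deduplicated_lines(lines):
        counts.update(ngrams(line.split(), n))
    return counts

if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",  # duplicate, dropped
        "a different sentence that is kept and counted in full",
    ]
    for gram, count in count_ngrams(corpus).most_common(3):
        print(count, " ".join(gram))
```

Keeping count-1 n-grams matters here because Kneser-Ney discounts and continuation counts are computed from the full, unpruned distribution; pruning singletons before smoothing would change the estimates.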
Publication Info
- Year: 2014
- Type: article
- Pages: 3579-3584
- Citations: 145
- Access: Closed