Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
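
The abstract names a segment-level recurrence mechanism without spelling it out. As a rough illustration only, the sketch below assumes the mechanism amounts to caching the previous segment's hidden states and letting the current segment attend over them in addition to its own positions. The function names `attend_with_memory` and `run_segments` and the parameter `mem_len` are hypothetical, and the sketch omits the positional encoding scheme, learned projections, multiple heads and layers, and the gradient handling a real training setup would need.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(segment, memory):
    """Single-head dot-product attention for one segment.

    Each position attends over the cached `memory` states plus the
    current and earlier positions of `segment` (causal masking).
    Assumption: queries, keys and values are the raw vectors, with no
    learned projections and no positional encoding.
    """
    context = memory + segment  # cached states first, then the current segment
    d = len(segment[0])
    outputs = []
    for i, query in enumerate(segment):
        visible = context[: len(memory) + i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        )
    return outputs


def run_segments(segments, mem_len):
    """Process segments in order, carrying up to `mem_len` cached states."""
    memory = []
    outputs = []
    for segment in segments:
        outputs.append(attend_with_memory(segment, memory))
        # Cache this layer's inputs for the next segment. In a trained
        # model these cached states would be treated as constants.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []
    return outputs


segments = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, -0.5]],
]
no_mem = run_segments(segments, mem_len=0)    # fixed-length context per segment
with_mem = run_segments(segments, mem_len=2)  # context extends into the previous segment
```

With `mem_len=0`, the first position of the second segment can only attend to itself, which is a toy version of the context fragmentation the abstract mentions. With `mem_len=2`, the same position also sees the two cached states from the first segment.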

Keywords

Perplexity, Computer science, Language model, Transformer, Treebank, Artificial intelligence, Hyperparameter, Natural language processing, Dependency (UML), Engineering, Electrical engineering

Publication Info

Year: 2019
Type: preprint
Citations: 3018
Access: Closed

Cite This

APA Style

Zihang Dai, Zhilin Yang, Yiming Yang, et al. (2019). Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. https://doi.org/10.18653/v1/p19-1285

Identifiers

DOI: 10.18653/v1/p19-1285