Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Yunpeng Chen; Tao Wang; Weihao Yu; Yujun Shi; Zihang Jiang; Francis E. H. Tay; Jiashi Feng; Shuicheng Yan

doi:10.1109/iccv48922.2021.00060

Abstract

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-to-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top-1 accuracy at image resolution 384x384 on ImageNet.
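The aggregation step described in the abstract (neighboring tokens merged into one token, so that the sequence gets shorter) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function name `soft_split`, the toy 4x4 grid, the window parameters (kernel 3, stride 2, padding 1), and the choice of plain concatenation with zero padding. The Transformer layer that the full model would apply between successive aggregation steps is omitted.

```python
from typing import List

Token = List[float]
Grid = List[List[Token]]  # grid[row][col] holds one token vector


def soft_split(grid: Grid, kernel: int, stride: int, padding: int) -> Grid:
    """Merge each kernel x kernel window of neighbouring tokens into one token.

    Hypothetical helper: windows overlap whenever stride < kernel, and
    positions outside the grid are filled with zero tokens.
    """
    height, width = len(grid), len(grid[0])
    dim = len(grid[0][0])
    zero = [0.0] * dim  # stand-in token for padded positions

    # Standard sliding-window output size.
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1

    out: Grid = []
    for i in range(out_h):
        row: List[Token] = []
        for j in range(out_w):
            merged: Token = []
            for di in range(kernel):
                for dj in range(kernel):
                    r = i * stride + di - padding
                    c = j * stride + dj - padding
                    inside = 0 <= r < height and 0 <= c < width
                    # Concatenate the neighbour's features onto the new token.
                    merged.extend(grid[r][c] if inside else zero)
            row.append(merged)
        out.append(row)
    return out


# Toy input: a 4x4 grid of 1-dimensional tokens holding the values 0..15.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
merged = soft_split(grid, kernel=3, stride=2, padding=1)

print(len(merged), len(merged[0]))  # grid shrinks from 4x4 to 2x2
print(len(merged[0][0]))            # each new token has 3*3*1 = 9 features
```

In this toy run the 16 input tokens become 4 output tokens, and each output token carries the features of its 3x3 neighborhood. Applying the function repeatedly shortens the sequence further, which is the sense in which the abstract calls the aggregation recursive.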

Keywords

Security token; Computer science; Transformer; Artificial intelligence; Pixel; Scratch; Pattern recognition (psychology); Lexical analysis; Speech recognition; Programming language; Computer network; Engineering

Affiliated Institutions

National University of Singapore SG

Related Publications

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Xiaoyi Dong, Jianmin Bao, Dongdong Chen +5 more

We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global se...

2022 2022 IEEE/CVF Conference on Computer ... 1121 citations

A ConvNet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu +3 more

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification...

2022 2022 IEEE/CVF Conference on Computer ... 5683 citations

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang +3 more

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural archit...

2019 3018 citations

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Chun-Fu Richard Chen, Quanfu Fan, Rameswar Panda

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper...

2021 2021 IEEE/CVF International Conferenc... 1692 citations

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang, Enze Xie, Xiang Li +6 more

Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a simpler, convolution-free backbone network useful for man...

2021 2021 IEEE/CVF International Conferenc... 4221 citations

Publication Info

Year: 2021
Type: article
Pages: 538-547
Citations: 2067
Access: Closed

Citation Metrics

OpenAlex: 2067
Influential: 226
CrossRef: 1704

Cite This

APA Style

Yunpeng Chen, Tao Wang, Weihao Yu et al. (2021). Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 538-547. https://doi.org/10.1109/iccv48922.2021.00060

Identifiers

DOI: 10.1109/iccv48922.2021.00060
arXiv: 2101.11986
