MIT License


An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale


Vision Transformers

Implementation of the Vision Transformer (ViT) in PyTorch, a model that achieves state-of-the-art results in image classification using transformer-style encoders. Associated blog article.
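The "16x16 words" in the paper title refers to cutting an image into fixed-size patches and treating each patch as one token. The sketch below works out the resulting token shapes in plain Python. The function name `vit_token_shapes` and its parameters are hypothetical and are not part of this repository's API.

```python
def vit_token_shapes(image_size, patch_size, channels=3, use_class_token=True):
    """Return (num_patches, patch_dim, sequence_length) for a square image.

    Hypothetical helper for illustration only.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2          # non-overlapping grid of patches
    patch_dim = channels * patch_size ** 2       # values in one flattened patch
    # A learnable class token adds one position; GAP pooling does not need it.
    sequence_length = num_patches + (1 if use_class_token else 0)
    return num_patches, patch_dim, sequence_length


print(vit_token_shapes(224, 16))  # (196, 768, 197)
```

For a 224x224 RGB image with 16x16 patches, the encoder sees 196 patch tokens of 768 raw values each, plus one class token when that pooling style is used.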

Credits to Phil Wang for the GIF.


  • [x] ViT
  • [x] ViT with convolutional patches
  • [x] ViT with convolutional stems
    • [x] Early Convolutional Stem
    • [x] Scaled ReLU Stem
  • [x] Global average pooling (GAP)
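As a rough illustration of two items in the list above, the following sketch builds patch embeddings with a strided convolution and pools the encoder output with global average pooling instead of a class token. It is a minimal sketch and not the code in this repository: the class names `ConvPatchEmbedding` and `TinyViT`, along with all default sizes, are assumptions chosen to keep the example small, and the encoder is PyTorch's stock `nn.TransformerEncoder`.

```python
import torch
from torch import nn


class ConvPatchEmbedding(nn.Module):
    """Hypothetical helper: embed non-overlapping patches with one Conv2d."""

    def __init__(self, in_channels=3, patch_size=16, dim=64):
        super().__init__()
        # kernel_size == stride yields one output position per patch.
        self.proj = nn.Conv2d(in_channels, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        x = self.proj(images)                  # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)


class TinyViT(nn.Module):
    """Hypothetical small ViT that uses GAP pooling (no class token)."""

    def __init__(self, image_size=32, patch_size=16, dim=64,
                 depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.embed = ConvPatchEmbedding(3, patch_size, dim)
        # Learnable position embedding, one vector per patch.
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):
        tokens = self.embed(images) + self.pos
        tokens = self.encoder(tokens)
        pooled = tokens.mean(dim=1)            # global average over patch tokens
        return self.head(pooled)               # (B, num_classes)


model = TinyViT()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)
```

Averaging over patch tokens removes the need for an extra class token, so the sequence length equals the number of patches.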


Citations

  • Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; and others. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Xiao, Tete; Singh, Mannat; Mintun, Eric; Darrell, Trevor; Dollár, Piotr; Girshick, Ross. Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881.
  • Wang, Pichao; Wang, Xue; Luo, Hao; Zhou, Jingkai; Zhou, Zhipeng; Wang, Fan; Li, Hao; Jin, Rong. Scaled ReLU Matters for Training Vision Transformers. arXiv preprint arXiv:2109.03810.
  • Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas. Scaling vision transformers. arXiv preprint arXiv:2106.04560.
