// the find

lucidrains/vit-pytorch

★ 25,355 · Python · MIT · updated Jun 2026

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

A PyTorch collection of Vision Transformer implementations, covering the original ViT paper and roughly 40 variants: MAE, NaViT, MaxViT, MobileViT, video transformers, self-supervised pretraining methods, and more. It's a research reference repo, not a production library — the value is having clean, readable implementations of papers in one place.

The breadth is the main draw: when a paper references CrossFormer or CaiT, you can import the model and run it rather than hunting through the author's messy training repo. The base ViT implementation is clean and short enough to actually read. The NaViT variant built on nested tensors (PyTorch 2.5+) is a nice forward-looking addition. The self-supervised pretraining wrappers (MAE, SimMIM, DINO) take a base ViT as their encoder with minimal boilerplate, which makes prototyping fast.

No pretrained weights are distributed, so for production use you still need timm or the official repos. Constructor signatures balloon across variants (CvT alone takes eight stage-prefixed arguments, such as s1_emb_dim, for each of its three stages), which makes configs tedious and error-prone to write. Test coverage is thin: there's one test file for a library with 80+ modules, so breakage in less-used variants is plausible. The repo accumulates new variants without pruning old ones, so it's unclear which implementations are still maintained or worth trusting for current research.
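One way to tame stage-prefixed configs is to describe each stage once and expand the prefixes programmatically. The helper below is a hypothetical sketch, not part of vit-pytorch; the key names and values are illustrative, so check them against the constructor you are actually calling.

```python
from typing import Any


def stage_kwargs(stages: list[dict[str, Any]], prefix: str = "s") -> dict[str, Any]:
    """Expand per-stage settings into flat keyword arguments like s1_depth, s2_depth."""
    if not stages:
        return {}
    # Require identical keys in every stage: a key omitted from one stage
    # would otherwise fall back to whatever default the model defines.
    expected = set(stages[0])
    flat: dict[str, Any] = {}
    for index, stage in enumerate(stages, start=1):
        if set(stage) != expected:
            mismatch = sorted(set(stage) ^ expected)
            raise ValueError(f"stage {index} has mismatched keys: {mismatch}")
        for name, value in stage.items():
            flat[f"{prefix}{index}_{name}"] = value
    return flat


# Illustrative three-stage config (a subset of what a real model would need).
stages = [
    {"emb_dim": 64, "heads": 1, "depth": 1},
    {"emb_dim": 192, "heads": 3, "depth": 2},
    {"emb_dim": 384, "heads": 6, "depth": 10},
]
kwargs = stage_kwargs(stages)
```

The resulting dictionary can be splatted into a constructor with `**kwargs`; a mistyped key still surfaces as a `TypeError` there, while the key-consistency check catches a setting that was left out of one stage.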
