// the find

lucidrains/DALLE-pytorch

★ 5,628 · Python · MIT · updated Feb 2024

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

A community replication of the original DALL-E paper: the discrete VAE plus transformer approach OpenAI used before moving to diffusion models. The design predates DALL-E 2 and Stable Diffusion (both 2022), so it's an autoregressive text-to-image model, not a diffusion one. Useful if you want to understand how the 2021-era architecture worked, or if you're researching autoregressive image generation specifically.

The implementation is unusually complete for a replication: discrete VAE training, the full DALL-E transformer, CLIP ranking, sparse attention variants (axial row/col, conv-like), reversible networks for depth scaling, and DeepSpeed/Horovod distributed training are all included. The VAE swap story is well thought out: you can use OpenAI's pretrained dVAE, the Taming Transformers VQGAN, or train your own, and the choice meaningfully changes training cost because it sets the image sequence length (256 tokens with the VQGAN versus 1024 with OpenAI's dVAE). Phil Wang (lucidrains) has a track record of clean, readable transformer code, and this follows that pattern.
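To make the 256 versus 1024 point concrete, here is a rough back-of-the-envelope sketch. It is not code from the repo: the helper names are hypothetical, and it assumes 256-pixel square images, a downsampling factor of 8 per side for OpenAI's dVAE and 16 for the VQGAN, and full self-attention whose cost grows with the square of the sequence length. It also ignores text tokens, which share the sequence and make the real-world gap smaller.

```python
# Back-of-the-envelope sketch, not code from DALLE-pytorch.
# Assumptions (hypothetical helper names, illustrative settings):
#   - square 256px images
#   - OpenAI dVAE downsamples each side by 8, the VQGAN by 16
#   - full self-attention, so cost scales with sequence length squared
#   - text tokens are ignored


def image_tokens(image_size: int, downsample: int) -> int:
    """Number of discrete image tokens for a square image."""
    side = image_size // downsample
    return side * side


def relative_attention_cost(seq_a: int, seq_b: int) -> float:
    """How many times more expensive quadratic attention is at seq_a than at seq_b."""
    return (seq_a / seq_b) ** 2


dvae_tokens = image_tokens(256, 8)    # 32 x 32 token grid
vqgan_tokens = image_tokens(256, 16)  # 16 x 16 token grid
ratio = relative_attention_cost(dvae_tokens, vqgan_tokens)

print(dvae_tokens, vqgan_tokens, ratio)  # 1024 256 16.0
```

Under those assumptions the image portion of the attention computation is about 16 times cheaper with the shorter sequence, which is why the choice of VAE matters so much for anyone training from scratch.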

It's a historical artifact at this point: the last push was in February 2024, and the author himself linked to DALL-E 2 in the README as the project to move on to back in April 2022. The image quality ceiling is far below modern diffusion models, and you'd need enormous compute to get anything close to impressive results. No pretrained weights are distributed in the repo itself (just community checkpoints of dubious provenance), so you're looking at training from scratch. The DeepSpeed sparse attention path requires triton pinned below 1.0, which is ancient and likely broken on current CUDA toolchains.

View on GitHub →