// the find

lucidrains/DALLE2-pytorch

★ 11,310 · Python · MIT · updated May 2024

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch

A PyTorch reimplementation of DALL-E 2's architecture: a CLIP-conditioned diffusion prior plus a cascading decoder. It's a research replication from 2022, not a production image-generation tool. The target audience is people who want to train or study the pipeline from scratch rather than use a pretrained model.

The three-stage pipeline (CLIP → diffusion prior → cascading decoder) is faithfully implemented, and the trainer wrappers correctly handle the fiddly parts, such as per-unet optimizers and EMA. OpenAI CLIP and Open CLIP adapters are both supported, so you're not forced to train CLIP from scratch. Latent diffusion is available as an optional layer on the decoder cascade, which is a meaningful extension beyond the original paper. Inpainting via the Repaint resampling formulation is built into the Decoder.sample path without requiring a separate model.
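
To make "per-unet optimizers and EMA" concrete, here is a toy sketch in plain Python. It is not the repository's code: ToyUnet, ToyCascadeTrainer, and the plain SGD step are hypothetical stand-ins, chosen only to show the bookkeeping a cascade trainer has to get right. Each unet keeps its own learning rate and its own EMA shadow weights, and a step on one unet leaves the others untouched.

```python
from dataclasses import dataclass


@dataclass
class ToyUnet:
    """Hypothetical stand-in for one unet in the cascade: named scalar weights."""
    params: dict


class ToyCascadeTrainer:
    """Hypothetical trainer: one learning rate and one EMA shadow copy per unet."""

    def __init__(self, unets, lrs, ema_decay=0.99):
        assert len(unets) == len(lrs)
        self.unets = unets
        self.lrs = lrs
        self.ema_decay = ema_decay
        # EMA shadows start as copies of the online weights.
        self.ema = [dict(u.params) for u in unets]

    def update(self, unet_index, grads):
        """Plain SGD step on one unet, then refresh only that unet's EMA weights."""
        unet = self.unets[unet_index]
        lr = self.lrs[unet_index]
        shadow = self.ema[unet_index]
        for name, grad in grads.items():
            unet.params[name] -= lr * grad
            shadow[name] = (
                self.ema_decay * shadow[name]
                + (1 - self.ema_decay) * unet.params[name]
            )


base = ToyUnet({"w": 1.0})
upsampler = ToyUnet({"w": 1.0})
trainer = ToyCascadeTrainer([base, upsampler], lrs=[0.1, 0.01], ema_decay=0.9)

# Step only the base unet; the upsampler and its EMA copy stay as they were.
trainer.update(0, {"w": 1.0})
```

Sampling would then read from the EMA copies rather than the online weights, which is the usual reason for keeping them.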

This is effectively abandoned: the last push was in May 2024, and lucidrains himself noted in the README that Imagen surpassed it in May 2022. The README still says 'SOTA for text-to-image' in one line and 'no longer SOTA' three lines later, which tells you how little maintenance it gets. Training the full pipeline requires you to bring your own massive datasets, your own CLIP training or pretrained weights, and GPU clusters. The promised 'CLI tool for small scale training' was mostly delivered, but the end-to-end experience is underdocumented. No pretrained decoder checkpoint from LAION actually completed training; the wandb links in the README go to in-progress or abandoned runs.
