// the find
lucidrains/imagen-pytorch
Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch
A PyTorch reimplementation of Google's Imagen text-to-image diffusion model: a cascade of DDPMs conditioned on T5 text embeddings. It's for researchers and practitioners who want to train their own text-to-image or text-to-video models from scratch, without waiting for Google to open-source theirs. The ElucidatedImagen variant, which incorporates Karras et al.'s improved sampler, is the version worth using.
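For a feel of what the Karras et al. sampler changes, here is a minimal sketch of the noise-level schedule from that paper, in plain Python. It is not code from the repo. The function name `karras_sigmas` is made up for illustration, and the defaults (`sigma_min=0.002`, `sigma_max=80`, `rho=7`) are the paper's suggested values, which ElucidatedImagen may or may not use unchanged.

```python
def karras_sigmas(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from Karras et al. (2022), highest first, ending at 0.

    Hypothetical helper for illustration only; not part of imagen-pytorch.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    inv_rho = 1.0 / rho
    hi = sigma_max ** inv_rho
    lo = sigma_min ** inv_rho
    # Interpolate linearly in sigma^(1/rho) space, then raise back to rho.
    # A larger rho packs more of the steps near sigma_min.
    sigmas = [
        (hi + i / (num_steps - 1) * (lo - hi)) ** rho
        for i in range(num_steps)
    ]
    # The sampler's final step lands on sigma = 0 (a clean sample).
    return sigmas + [0.0]


if __name__ == "__main__":
    for s in karras_sigmas(8):
        print(f"{s:.4f}")
```

The point of the warped spacing is that the sampler spends most of its steps at low noise levels, where fine detail is decided, instead of spreading them evenly.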
The ImagenTrainer wrapper handles EMA, gradient accumulation, multi-GPU via Accelerate, and checkpoint management in one place, which is the kind of training infrastructure that's annoying to write yourself. ElucidatedImagen is a genuine improvement over the base DDPM formulation and was added quickly after the Karras paper dropped. The 3D UNet for video has been verified working by an external researcher on medical data, so it is not just a theoretical addition. Config-driven training and a CLI make it possible to hand a checkpoint to someone else and have them fine-tune without touching Python.
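To make concrete what "EMA plus gradient accumulation" means, here is a toy sketch in plain Python on a single scalar parameter. It is not how ImagenTrainer is implemented. The function name, the squared-error loss, and all hyperparameters are assumptions chosen for illustration.

```python
def train_with_accumulation(data, lr=0.1, accum_steps=4, ema_decay=0.9):
    """Fit a scalar w to minimise (w - x)^2 over a stream of samples.

    Illustrative only: shows gradient accumulation and an EMA copy of
    the weight, not the internals of imagen-pytorch's ImagenTrainer.
    """
    w = 0.0        # the "online" weight updated by the optimizer
    ema_w = w      # smoothed copy, typically the one used for sampling
    grad_sum = 0.0
    for step, x in enumerate(data, start=1):
        # Each micro-batch adds its gradient, scaled so the accumulated
        # value is the mean over accum_steps micro-batches.
        grad_sum += 2.0 * (w - x) / accum_steps
        if step % accum_steps == 0:
            w -= lr * grad_sum   # one optimizer step per accum_steps samples
            grad_sum = 0.0
            # EMA update: move the smoothed weight a little toward w.
            ema_w = ema_decay * ema_w + (1.0 - ema_decay) * w
    return w, ema_w


if __name__ == "__main__":
    w, ema_w = train_with_accumulation([1.0] * 400)
    print(w, ema_w)
```

Accumulation lets a small-memory GPU emulate a larger batch, and the EMA copy smooths out step-to-step noise in the weights. Getting both right alongside multi-GPU and checkpointing is the boilerplate the wrapper spares you.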
No pretrained weights exist, so you're on your own for compute: training a real Imagen-scale model is a multi-GPU, multi-week job that this repo doesn't help you budget or plan for. The last commit was in late 2024, and the todo list has had the same open items (flash attention, DreamBooth, consistency distillation) sitting unchecked for over a year; this is research-pace maintenance, not production-pace. The test suite is minimal, just one trainer test file, so breaking changes in dependencies can go undetected. Flash attention is still missing, which matters at the resolutions you'd actually want to train at.