// the find

rasbt/LLMs-from-scratch

★ 97,742 · Jupyter Notebook · NOASSERTION · updated Jun 2026

Implement a ChatGPT-like LLM in PyTorch from scratch, step by step

A companion code repository for Sebastian Raschka's Manning book on building GPT-style language models in PyTorch from scratch. It walks through tokenization, attention, pretraining, and instruction finetuning chapter by chapter, then goes further with bonus notebooks covering modern architectures like Llama 3, Qwen3, Gemma 3/4, MoE, and several attention variants. Aimed at ML practitioners who want to understand what's actually happening inside these models rather than just calling a library.

- The bonus material has kept up with the field remarkably well — standalone notebooks for Qwen3, Gemma 3/4, OLMo 3, and DPO mean this is genuinely useful reference material in 2026, not just a GPT-2 explainer that aged out

- CI runs on Linux, macOS, and Windows against both current and older PyTorch versions, which is unusual for an educational repo and means the notebooks actually execute rather than slowly rotting

- Each chapter has isolated exercise notebooks with solutions, and there's a separate installable package (`llms_from_scratch`) that exposes the chapter code as importable modules, which is useful once you want to build on top of it instead of copying and pasting

- The attention alternatives section (GQA, MLA, sliding window, DeltaNet, DSA, KV sharing) covers the actual design decisions that differentiate modern models, not just the textbook scaled dot-product version

- Because it mirrors a print book, contributions extending the main chapter code are explicitly rejected, so the core chapters stay frozen at the choices made at publication time and won't absorb community corrections or improvements

- The primary format is Jupyter notebooks, which are JSON files with embedded outputs: raw diffs are hard to read, the Git history is of little use for tracking logic changes, and running things non-interactively requires the separate `.py` script variants, which aren't always kept in sync

- The model scale is toy-level by design (fits on a laptop CPU), which is honest but means any intuitions built here about training dynamics, loss curves, or hyperparameter sensitivity don't transfer cleanly to real pretraining runs where batch size, learning rate warmup, and gradient stability behave very differently

- The reasoning/RL content lives in a separate sibling repo (`reasoning-from-scratch`) that's referenced but not included, so the story stops abruptly after instruction finetuning, and readers who came for the full pipeline have to context-switch to a different codebase
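
To make the attention-variants bullet above concrete, here is a rough sketch of why grouped-query attention (GQA) matters in practice: it shrinks the KV cache by letting several query heads share one key/value head. The function name `kv_cache_bytes` and the configuration numbers are hypothetical, chosen only for illustration; they are not taken from the repo or from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_value=2 assumes 16-bit storage (an assumption for this sketch).
    """
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, for illustration only.
cfg = dict(n_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # each query head has its own K/V head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # four query heads share each K/V head

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio: {mha // gqa}x")
# prints "MHA: 4.00 GiB, GQA: 1.00 GiB, ratio: 4x"
```

The cache scales linearly with the number of key/value heads, so cutting them from 32 to 8 cuts memory by the same factor of four. Variants such as MLA and sliding-window attention attack the same cost through other terms (the per-token representation size and the effective sequence length, respectively).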

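On the notebook-format complaint above: an `.ipynb` file is plain JSON, so the code cells can be pulled out with the standard library alone, which gives something diffable and runnable. The sketch below is an assumption about how one might do this, not tooling that ships with the repo; the helper name `notebook_code` and the inline demo notebook are made up for illustration.

```python
import json


def notebook_code(nb_json: str) -> str:
    """Return the concatenated source of all code cells in a notebook."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        src = cell.get("source", "")
        if isinstance(src, list):  # source may be a list of lines or one string
            src = "".join(src)
        if src.strip():
            chunks.append(src.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# Tiny hypothetical notebook standing in for a real chapter file.
demo = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n"]},
        {"cell_type": "code", "source": ["import math\n", "x = math.sqrt(16)\n"],
         "outputs": [], "execution_count": 3},
        {"cell_type": "code", "source": "print(x)",
         "outputs": [], "execution_count": 4},
    ],
})

script = notebook_code(demo)
print(script)
```

Cells that use IPython magics (lines starting with `%` or `!`) will not be valid plain Python after extraction, so this is a starting point for reading and diffing rather than a guaranteed runnable script. `jupyter nbconvert --to script` handles the same job more thoroughly if Jupyter is installed.
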