// the find

skyzh/tiny-llm

★ 4,272 · Python · Apache-2.0 · updated Jun 2026

A course of learning LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.

A structured 3-week course where you implement LLM inference from scratch using Apple's MLX framework — no high-level nn.Module wrappers, just raw matrix ops. You build attention, KV cache, quantized matmul, flash attention, continuous batching, and paged attention step by step, targeting Qwen3 on Apple Silicon. Aimed at systems engineers who want to understand what vLLM actually does under the hood.
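To make "raw matrix ops" and "KV cache" concrete, here is a minimal sketch of single-head scaled dot-product attention with an append-only cache. It is not the course's code and does not use MLX: it is plain standard-library Python so it runs anywhere, and the names `attend` and `KVCache` are hypothetical.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for one query vector over all cached positions.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    """Hypothetical single-head cache: one key/value vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append the new token's key/value, then attend over everything seen so far.
        # Earlier keys and values are reused instead of being recomputed.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [1.0, 2.0])  # first token sees only itself
out2 = cache.step([0.0, 1.0], [0.0, 1.0], [3.0, 4.0])  # second token sees both
```

The point of the cache is that each decoding step only computes the new token's key and value; a real implementation would do the same thing with batched tensors and multiple heads.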

The dual src/tiny_llm and src/tiny_llm_ref layout is smart: you write the exercise code and the tests compare it against the reference solution, so you get immediate feedback without spoiling the answer. The course goes deeper than most, with quantized matmul in Metal shaders, flash attention with CPU and GPU paths, and paged attention with actual Metal kernels rather than Python toy implementations. Tests exist for every completed chapter and CI runs on macOS, which is rare for a course repo and is decent evidence that the code actually works. Choosing MLX over PyTorch removes the CUDA dependency entirely, making this practical for anyone with an Apple Silicon Mac.
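The exercise-versus-reference idea can be sketched in a few lines. This is an illustration of the pattern only, not the repo's test harness: `softmax_ref` and `softmax_exercise` are hypothetical stand-ins for a reference solution and a learner's attempt, and the tolerance values are assumptions.

```python
import math

def softmax_ref(xs):
    # Stand-in for a reference solution (the role src/tiny_llm_ref plays).
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_exercise(xs):
    # Stand-in for a learner's implementation (the role src/tiny_llm plays).
    # Mathematically equivalent, written via log-sum-exp, so results differ only by rounding.
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - log_total) for x in xs]

def assert_allclose(actual, expected, rel_tol=1e-9, abs_tol=1e-12):
    # Element-wise comparison with a tolerance, since floats rarely match bit for bit.
    assert len(actual) == len(expected), "length mismatch"
    for i, (a, e) in enumerate(zip(actual, expected)):
        assert math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol), f"index {i}: {a} != {e}"

def test_softmax_matches_reference():
    # Include an extreme case so an unstable implementation would be caught.
    for case in ([0.0, 1.0, 2.0], [-1000.0, 0.0, 1000.0], [3.5]):
        assert_allclose(softmax_exercise(case), softmax_ref(case))

test_softmax_matches_reference()
```

The learner only ever sees a pass or a failing index, never the reference source, which is what keeps the feedback loop tight without giving the answer away.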

Week 3 documentation is still largely unwritten: the paged attention and MoE chapters have code and tests but no book pages, so you're reading source without explanation. The repo is Mac-only by design, with no path to run it on Linux or Windows, which locks out a large chunk of systems engineers working in cloud VMs. Speculative decoding, RAG, and agent chapters are scaffolded but not implemented, so the course isn't complete yet, and the roadmap doesn't say when they will be. There is also no discussion of multi-GPU or tensor parallelism, which is where production serving systems carry much of their complexity.
