// the find
Lightning-AI/lit-llama
Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
A clean Apache 2.0 reimplementation of LLaMA built on nanoGPT, supporting LoRA/Adapter fine-tuning, int8/GPTQ-int4 quantization, and pretraining on RedPajama. It was created specifically to escape the GPL license on Meta's original LLaMA code, so the code itself is safe to use in commercial or permissively licensed projects; the original LLaMA weights remain under Meta's own license. The repo is explicitly unmaintained; its successor is LitGPT.
Single-file model implementation (lit_llama/model.py) keeps the architecture readable without framework abstractions getting in the way. GPTQ int4 quantization brings a 7B model down to ~5 GB VRAM, making it actually runnable on consumer hardware without heroics. Both LoRA and Adapter v1/v2 variants are implemented and tested separately, not tangled together. The Apache 2.0 license on the code side was the whole point — it solved a real problem when it was released.
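The memory figures quoted in this review can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weights only and assumes a parameter count of roughly 6.74 billion for the 7B model (an assumption, not a number from the repo); `weight_memory_gb` and the constants are illustrative names. Real usage is higher than these numbers because of activations, the KV cache, and quantization metadata, which is consistent with int4 landing near 5 GB rather than the raw ~3.4 GB.

```python
# Rough weight-only memory estimate; activations, KV cache and
# quantization metadata (scales, zero points) are not counted.

# Assumption: approximate parameter count for LLaMA 7B, not taken from the review.
PARAMS_7B = 6.74e9

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(PARAMS_7B, dtype):5.1f} GB")
```

Under these assumptions the float32 footprint comes out close to the 26 GB baseline mentioned for the 7B model, and each halving of precision halves the weight storage.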
Abandoned. The README says it directly, and LLaMA 1 weights were the last thing it tracked: no LLaMA 2, 3, or any model released after early 2023. Adopting this means maintaining a fork yourself from day one. Flash attention support is conditional and fragile enough that the docs tell you to disable it for some GPU models via a backend flag. There is no model parallelism or multi-GPU inference, and the 7B model already needs about 26 GB on a single A100 at baseline, so anything larger is a dead end here.
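For the flash attention workaround, a minimal sketch follows. It assumes the backend flag in question is PyTorch's `torch.backends.cuda.enable_flash_sdp` switch (available in PyTorch 2.x); the helper name `disable_flash_attention` is hypothetical and not part of the repo. The call would need to run before any model forward pass.

```python
# Sketch: turn off the flash kernel for scaled dot-product attention.
# torch.backends.cuda.enable_flash_sdp is a PyTorch 2.x call; the helper
# name disable_flash_attention is hypothetical.


def disable_flash_attention() -> bool:
    """Disable the flash SDP backend; return True if the switch was applied."""
    try:
        import torch
    except ImportError:
        return False  # PyTorch not installed, nothing to configure
    enable = getattr(torch.backends.cuda, "enable_flash_sdp", None)
    if enable is None:
        return False  # PyTorch older than 2.0 has no such switch
    enable(False)
    return True


if __name__ == "__main__":
    print("flash SDP disabled:", disable_flash_attention())
```

With the flash kernel off, PyTorch falls back to its other scaled dot-product attention implementations, which are typically slower but run on more hardware.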