// the find

vllm-project/vllm

★ 82,723 · Python · Apache-2.0 · updated Jun 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

vLLM is the de facto standard for self-hosted LLM inference — PagedAttention was the original insight, and the project has snowballed from there into a full serving stack. If you're running Llama, Qwen, DeepSeek, or most other popular models on your own GPUs and want OpenAI-compatible endpoints, this is where you start.

PagedAttention and continuous batching genuinely work — throughput numbers hold up in production, not just benchmarks. The quantization support is unusually complete: FP8, INT4, GPTQ, AWQ, GGUF, and several others, all in one place. Multi-LoRA support lets you serve dozens of adapters from a single model load, which is a real cost saver. Hardware coverage is surprisingly broad — NVIDIA, AMD ROCm, Google TPU, Intel Gaudi, Apple Silicon, and others, with separate CI pipelines for each.

The codebase is sprawling — 2000+ contributors means inconsistent abstractions and some dark corners that bite you when you leave the happy path. Cold-start time is slow; loading a 70B model and initializing CUDA graphs takes minutes, which makes it a poor fit for serverless or burst workloads. The V1 engine migration is still in progress, so you'll occasionally hit flags like `--use-v2-block-manager` and wonder which generation of code you're actually running. Windows support is effectively nonexistent — CUDA on Linux only for serious use.

View on GitHub → Homepage ↗