// the find
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
vLLM is the de facto standard for self-hosted LLM inference — PagedAttention was the original insight, and the project has snowballed from there into a full serving stack. If you're running Llama, Qwen, DeepSeek, or most other popular models on your own GPUs and want OpenAI-compatible endpoints, this is where you start.
PagedAttention and continuous batching genuinely work — throughput numbers hold up in production, not just benchmarks. The quantization support is unusually complete: FP8, INT4, GPTQ, AWQ, GGUF, and several others, all in one place. Multi-LoRA support lets you serve dozens of adapters from a single model load, which is a real cost saver. Hardware coverage is surprisingly broad — NVIDIA, AMD ROCm, Google TPU, Intel Gaudi, Apple Silicon, and others, with separate CI pipelines for each.
The codebase is sprawling — 2000+ contributors means inconsistent abstractions and some dark corners that bite you when you leave the happy path. Cold-start time is slow; loading a 70B model and initializing CUDA graphs takes minutes, which makes it a poor fit for serverless or burst workloads. The V1 engine migration is still in progress, so you'll occasionally hit flags like `--use-v2-block-manager` and wonder which generation of code you're actually running. Windows support is effectively nonexistent — CUDA on Linux only for serious use.