// the find
Lightning-AI/LitServe
A minimal Python framework for building custom AI inference servers with full control over logic, batching, and scaling.
LitServe is a Python serving framework built on top of FastAPI that handles the boilerplate of AI inference APIs: worker pools, batching, streaming, and multi-model pipelines. It targets ML engineers who want more control than vLLM or TorchServe offer but don't want to wire up concurrency and request queuing themselves. The abstraction is thin enough that you can understand the whole thing, which is its main advantage over heavier alternatives.
The worker/loop separation is well thought out: request handling and model inference run in separate processes connected by a ZMQ transport layer, which is why the 2x speedup over plain FastAPI is real and not just benchmark tuning. The `setup(device)` + `predict(request)` split cleanly handles the expensive load-the-model-once problem that trips up naive FastAPI deployments. OpenAI-compatible spec support means you can drop it behind any client that already speaks the OpenAI chat/embeddings API without adapter code. The test suite has genuine end-to-end coverage across the batching, streaming, and async paths, not just unit tests against mocked internals.
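To make the load-once idea concrete, here is a minimal stdlib-only sketch of the `setup(device)` / `predict(request)` pattern. It does not import LitServe: `EchoAPI`, `serve`, `_load_model`, and `load_count` are hypothetical names invented for illustration, and the "model" is a stand-in function rather than real weights.

```python
import time


class EchoAPI:
    """Hypothetical stand-in for a LitServe-style API class (not the real base class)."""

    def __init__(self):
        self.load_count = 0  # counts how often the expensive load runs

    def setup(self, device):
        # Runs once per worker: the expensive model load belongs here.
        self.device = device
        self.model = self._load_model()
        self.load_count += 1

    def _load_model(self):
        # Placeholder for loading weights; a real load can take seconds.
        time.sleep(0.01)
        return lambda text: text.upper()

    def predict(self, request):
        # Runs per request and reuses the model loaded in setup().
        return {"output": self.model(request["input"]), "device": self.device}


def serve(api, requests, device="cpu"):
    """Toy worker loop: call setup() once, then predict() for every request."""
    api.setup(device)
    return [api.predict(r) for r in requests]


api = EchoAPI()
results = serve(api, [{"input": "hello"}, {"input": "world"}])
print(results)
print("model loads:", api.load_count)
```

The naive alternative is loading the model inside the request handler, which pays the load cost on every call; keeping it in `setup` means each worker pays it once regardless of request volume.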
The benchmarks compare against plain FastAPI, not against production-tuned alternatives like Ray Serve or BentoML; the 2x claim holds in the project's own test setup but will vary significantly with your workload. Authentication is marked DIY for self-hosted deployments, which means you're rolling your own token middleware for anything not on Lightning Cloud, a gap that will bite anyone deploying internally. The ZMQ transport adds an operational dependency that isn't obvious from the README; if you're containerizing this, ZMQ socket file paths and process coordination need care. Multi-node inference is cloud-only, so horizontal scaling for self-hosted deployments is left entirely to you.
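For the DIY authentication gap, the core of a token middleware is small. The sketch below shows only the bearer-token check, using the standard library; how it gets wired into the server (middleware, dependency, or reverse proxy) is left open and is not taken from LitServe's documentation. `EXPECTED_TOKEN`, `extract_bearer`, and `is_authorized` are hypothetical names.

```python
import hmac

# Hypothetical placeholder; in practice read this from an env var or secret store.
EXPECTED_TOKEN = "change-me"


def extract_bearer(headers):
    """Return the token from an 'Authorization: Bearer <token>' header, or None."""
    # Assumes a plain dict with this exact key casing; real frameworks
    # usually expose case-insensitive header mappings.
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def is_authorized(headers, expected=EXPECTED_TOKEN):
    """Check the bearer token with a constant-time comparison."""
    token = extract_bearer(headers)
    if token is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.encode(), expected.encode())


print(is_authorized({"Authorization": "Bearer change-me"}))
print(is_authorized({"Authorization": "Bearer wrong"}))
```

A single shared token like this is the minimum viable version; per-client tokens, rotation, and rate limiting would still need to be added on top.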