// the find

EricLBuehler/candle-vllm

★ 676 · Rust · MIT · updated Jun 2026

Efficent platform for inference and serving local LLMs including an OpenAI compatible API server.

candle-vllm is a Rust implementation of a vLLM-style inference server built on Hugging Face's candle library. It serves local LLMs with an OpenAI-compatible API, PagedAttention, continuous batching, and runs on both CUDA and Apple Silicon. It targets developers who want vLLM's throughput characteristics without a Python process in the serving path.

PagedAttention and continuous batching are the two mechanisms that actually matter for high-throughput serving, and this project has both — ported to Rust rather than bolted onto a Python runtime. Quantization coverage is genuinely broad: GGUF, GPTQ/Marlin, AWQ, block-wise FP8, MXFP4/NVFP4, and their own TurboQuant KV cache (Walsh-Hadamard transform for 3–4x KV compression) — more formats than most single-binary servers support. Multi-GPU via NCCL multi-process is real tensor parallelism, not just naive model sharding, and multi-node via MPI is an unusual capability at this project size. Apple Silicon support is a genuine differentiator — most production-grade inference servers drop Metal as a second-class concern.

The build surface is punishing: CUDA toolkit version pinning, NCCL, flashinfer, cutlass, and optional MPI all need to align with your host driver, and a version mismatch surfaces as a link error not a helpful message. The throughput numbers in the README are all single-request benchmarks on Hopper H100s with no methodology or comparison to upstream vLLM — you cannot tell from these numbers whether the PagedAttention implementation is actually competitive at realistic batch sizes. The README documents `record_conversation` as 'not yet implemented', which signals the project has incomplete edges that haven't been prioritized. At 676 stars and 81 forks, the model implementations are lightly tested in production — if your architecture isn't Qwen or Llama, expect to debug issues yourself.

View on GitHub →