// the find

PaddlePaddle/FastDeploy

★ 3,693 · Python · Apache-2.0 · updated Jun 2026

High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

FastDeploy is Baidu's production LLM/VLM inference serving toolkit built on PaddlePaddle, targeting teams running ERNIE or other large models at scale on Chinese AI hardware (Kunlun XPU, Hygon DCU, etc.) as well as NVIDIA GPUs. It offers prefill/decode disaggregation, speculative decoding, and a vLLM-compatible API. The primary audience is infrastructure engineers at Chinese enterprises deploying Baidu-ecosystem models.

PD disaggregation (prefill/decode separation) with dynamic instance role switching is genuinely production-grade — this is hard to get right and most OSS serving stacks don't have it. The quantization breadth is real: W8A8, W4A8, W2A16, FP8 blockwise and tensorwise, not just the usual W4A16. Multi-hardware support goes beyond NVIDIA to XPU, DCU, Gaudi, and GCU with actual CI workflows for each, not just README claims. The benchmark YAML library is extensive and suggests these configs are actually run, not aspirational.

The entire README, docs, and most commit history are in Chinese — if you don't read Chinese, you're second-class. The English README exists but is clearly translated and lags behind. Hard Linux-only constraint (Python 3.10–3.12, no macOS, no Windows) with no containerized quick-start makes local evaluation painful. The ecosystem lock-in is real: optimal performance requires PaddlePaddle, and the first-party model priority is ERNIE 4.5; Qwen and other models are supported but feel like afterthoughts in the documentation. Despite vLLM-compatibility claims, there's no published comparison showing where FastDeploy actually beats vLLM for non-Baidu models.

View on GitHub → Homepage ↗