// the find
vllm-project/vllm-omni
A framework for efficient model inference with omni-modality models
vLLM-Omni extends the vLLM inference engine to handle omni-modal models — text, image, video, audio — including both autoregressive models (Qwen3-Omni, BAGEL) and diffusion-based generators (FLUX, Wan2.2, HunyuanImage). It targets teams running production inference for any-to-any multimodal models who want a single serving stack instead of stitching together separate pipelines per modality. Backed by the vLLM project community, so it's not a random fork.
The fully disaggregated pipeline via OmniConnector is the real architectural bet here — separating AR stages from DiT stages lets you allocate GPU resources independently, which matters when your text encoder and diffusion backbone have wildly different compute profiles. Hardware coverage is genuinely broad: CUDA, ROCm, NPU (Ascend), XPU, and MUSA, with CI pipelines for all of them visible in the repo. The included `.claude/skills/` tree is an unusual and practical touch — structured guides for adding new model types that contributors can feed directly to AI coding assistants. The OpenAI-compatible API server means you can swap it behind existing tooling without rewriting clients.
It's young and moving fast — v0.14 was the first 'stable' release in February 2026 and they're already at v0.22 four months later, which means API surface is still shifting under you. The disaggregated serving architecture adds meaningful operational complexity: you're now managing stage assignment and resource allocation across a heterogeneous pipeline, not just spinning up one vLLM instance. Diffusion model support (DiT/non-autoregressive) is a fundamentally different execution model bolted onto a scheduler originally designed for AR — the docs mention 'diffusion continuous batching' as a feature, which suggests this is still being worked out. No mention of Windows or macOS support anywhere; this is a Linux-on-GPU-cluster tool.