finds.dev← search

// the find

vllm-project/vllm-omni

★ 5,131 · Python · Apache-2.0 · updated Jun 2026

A framework for efficient model inference with omni-modality models

vLLM-Omni extends the vLLM inference engine to handle omni-modal models — text, image, video, audio — including both autoregressive models (Qwen3-Omni, BAGEL) and diffusion-based generators (FLUX, Wan2.2, HunyuanImage). It targets teams running production inference for any-to-any multimodal models who want a single serving stack instead of stitching together separate pipelines per modality. Backed by the vLLM project community, so it's not a random fork.

The fully disaggregated pipeline via OmniConnector is the real architectural bet here — separating AR stages from DiT stages lets you allocate GPU resources independently, which matters when your text encoder and diffusion backbone have wildly different compute profiles. Hardware coverage is genuinely broad: CUDA, ROCm, NPU (Ascend), XPU, and MUSA, with CI pipelines for all of them visible in the repo. The included `.claude/skills/` tree is an unusual and practical touch — structured guides for adding new model types that contributors can feed directly to AI coding assistants. The OpenAI-compatible API server means you can swap it behind existing tooling without rewriting clients.

It's young and moving fast — v0.14 was the first 'stable' release in February 2026 and they're already at v0.22 four months later, which means API surface is still shifting under you. The disaggregated serving architecture adds meaningful operational complexity: you're now managing stage assignment and resource allocation across a heterogeneous pipeline, not just spinning up one vLLM instance. Diffusion model support (DiT/non-autoregressive) is a fundamentally different execution model bolted onto a scheduler originally designed for AR — the docs mention 'diffusion continuous batching' as a feature, which suggests this is still being worked out. No mention of Windows or macOS support anywhere; this is a Linux-on-GPU-cluster tool.

View on GitHub → Homepage ↗

// want more like this?

We dig through GitHub every week and send a few repos picked for what you actually care about — each with an honest take like this one.

Get finds in your inbox → Search again →