// the find

predibase/lorax

★ 3,793 · Python · Apache-2.0 · updated May 2026

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

LoRAX is an inference server that lets you serve a single base LLM (Llama, Mistral, Qwen, etc.) and dynamically swap LoRA adapters per request, so you're not spinning up a separate GPU instance for each fine-tuned variant. The target audience is teams that run many fine-tuned versions of the same base model in production and are paying too much for it. It's a fork of Hugging Face's TGI with multi-adapter scheduling bolted on.
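To make "swap adapters per request" concrete, here is a minimal sketch of what two requests against the same base model might look like. It assumes the REST shape shown in the project README as I recall it (a `/generate` endpoint taking `inputs` plus a `parameters.adapter_id` field); the base URL and both adapter IDs are hypothetical placeholders, and nothing is sent over the network, so check the current docs before relying on the field names.

```python
import json
from urllib import request


def build_generate_request(base_url, prompt, adapter_id=None, max_new_tokens=64):
    """Build (but do not send) a POST request for a LoRAX-style /generate endpoint.

    Assumption: the server accepts {"inputs": ..., "parameters": {...}} and
    picks the LoRA adapter from parameters["adapter_id"].
    """
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Omitting adapter_id is assumed to mean "use the plain base model".
        parameters["adapter_id"] = adapter_id
    payload = {"inputs": prompt, "parameters": parameters}
    return request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical adapter IDs: same server, same base model, different fine-tunes.
BASE_URL = "http://127.0.0.1:8080"
support_req = build_generate_request(BASE_URL, "Summarize this ticket.", "acme/support-lora")
sql_req = build_generate_request(BASE_URL, "Write a query for ...", "acme/sql-lora")

# To actually send one: request.urlopen(support_req).read()
```

The point of the sketch is that the adapter is just a field on the request, so routing between fine-tunes becomes an application-level decision instead of a deployment-level one.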

- The SGMV/BGMV kernel integration (from Punica) is the real technical differentiator: batching requests that target different LoRA adapters, without falling back to sequential processing, is hard to get right, and they've shipped it.

- Production deployment story is solid: prebuilt Docker images, Helm charts, Prometheus metrics, OpenTelemetry tracing, and an OpenAI-compatible API endpoint. You can drop this behind existing tooling without much plumbing.

- Heterogeneous continuous batching with async adapter prefetch/offload between GPU and CPU means hot adapters stay on-GPU while cold ones sit in CPU memory. It's a practical solution to the 'hundreds of adapters but 24GB VRAM' problem.

- The router is written in Rust (separate from the Python server process), which keeps scheduling overhead low and avoids GIL contention on the hot path.

- Hard dependency on NVIDIA Ampere or newer and CUDA 11.8+ means no AMD GPUs and no older datacenter cards like the V100. The README doesn't mention any roadmap for ROCm or CPU fallback.

- Building from source is painful: there are separate Makefiles for flash-attn, flash-attn-v2, vllm, awq, eetq, megablocks, and custom kernels, each with its own dependency chain. The Docker image hides this, but debugging kernel compilation failures locally is a time sink.

- The roadmap is tracked in a single GitHub issue (#57), which is not a great signal for long-term planning visibility, and the project is tied to a commercial company (Predibase) whose incentives may not always align with the open-source use case.

- No support for non-LoRA PEFT methods (prefix tuning, IA3, etc.), and adapter merging is limited to a subset of strategies. If your fine-tuning pipeline uses anything besides standard LoRA, you'll hit a wall quickly.
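For readers wondering what the SGMV/BGMV bullet above is actually buying, here is a plain-Python reference of the per-row math a multi-adapter batch has to produce: every request shares the base weight `W`, and each adds its own low-rank update `scale * B(Ax)`. This is a toy sketch with hypothetical names, not LoRAX or Punica code; it loops over rows where a GPU kernel would gather each row's adapter weights inside a single launch.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def multi_lora_forward(W, adapters, batch):
    """Reference output for a batch where each row may use a different adapter.

    W        : base weight, shape (d_out, d_in), shared by every request
    adapters : adapter_id -> (A, B, scale), A is (r, d_in), B is (d_out, r)
    batch    : list of (adapter_id or None, input vector of length d_in)
    """
    outputs = []
    for adapter_id, x in batch:
        y = matvec(W, x)  # base model path, identical for all rows
        if adapter_id is not None:
            A, B, scale = adapters[adapter_id]
            delta = matvec(B, matvec(A, x))  # low-rank path: B @ (A @ x)
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy data: 2x2 identity base weight and two rank-1 adapters.
W = [[1, 0], [0, 1]]
adapters = {
    "a": ([[1, 1]], [[2], [0]], 0.5),
    "b": ([[0, 1]], [[0], [1]], 1.0),
}
batch = [("a", [1, 2]), ("b", [1, 2]), (None, [1, 2])]
result = multi_lora_forward(W, adapters, batch)
```

The same input yields three different outputs in one batch, which is exactly the case a naive server handles by running each adapter's requests separately.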
