// the find

lm-sys/RouteLLM

★ 5,016 · Python · Apache-2.0 · updated Aug 2024

A framework for serving and evaluating LLM routers - save LLM costs without compromising quality

RouteLLM is a routing layer that sits in front of your LLM calls and decides whether each query needs the expensive model or a cheaper one. It ships four trained routers — the recommended matrix factorization one is genuinely lightweight. Aimed at teams already paying GPT-4-level bills who want to cut costs without rewriting their application.

The drop-in OpenAI client replacement is the real selling point — you change two lines of code and get routing for free. The matrix factorization router is small and fast, not another LLM call to decide if you need an LLM call. Threshold calibration against a dataset you provide is a practical touch; it gives you a principled way to set the cost/quality tradeoff rather than guessing. The evaluation framework with cached benchmark results means you can actually measure router performance against MMLU and MT-Bench without re-running everything.
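
To make the calibration idea concrete, here is a minimal Python sketch. It is not RouteLLM's implementation. It assumes a router that emits a score in [0, 1] estimating how likely the strong model is to give the better answer, and that queries scoring at or above the threshold go to the strong model. The function names `calibrate_threshold` and `route`, the model labels, and the sample scores are all hypothetical.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Pick a threshold so that roughly `strong_fraction` of the sample
    queries would be sent to the strong model.

    `scores` are router scores computed on a calibration dataset
    (hypothetical values here; a real router would produce them).
    """
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")

    ranked = sorted(scores, reverse=True)
    k = int(round(strong_fraction * len(ranked)))
    if k == 0:
        # No query should reach the strong model.
        return float("inf")
    # The k-th highest score becomes the cutoff. Tied scores at the cutoff
    # can push slightly more than k queries to the strong model.
    return ranked[k - 1]


def route(score: float, threshold: float,
          strong: str = "strong-model", weak: str = "weak-model") -> str:
    """Assumed routing rule: strong model at or above the threshold."""
    return strong if score >= threshold else weak


if __name__ == "__main__":
    # Hypothetical router scores for ten calibration queries.
    sample = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95, 0.99]
    threshold = calibrate_threshold(sample, strong_fraction=0.3)
    for s in sample:
        print(f"{s:.2f} -> {route(s, threshold)}")
```

Feeding in scores computed on your own production queries, rather than a public benchmark sample, is what makes the resulting threshold reflect your workload.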

The trained routers were built on GPT-4 vs Mixtral-8x7B preference data from 2024 and haven't been updated since August 2024; the model landscape has shifted significantly since then, and the routing decisions may not transfer well to newer model pairs. Requiring an OpenAI API key just for embeddings, even when you've replaced both the strong and weak models with non-OpenAI providers, is a real annoyance. The threshold calibration is based on Chatbot Arena data by default, which doesn't resemble most production query distributions; domain-specific workloads will need their own calibration data, and the tooling for that is thin. Streaming support is undocumented, and confirming that it works correctly would mean digging into the server implementation.
