// the find

cvs-health/uqlm

★ 1,166 · Python · Apache-2.0 · updated Jun 2026

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

UQLM is a Python library from CVS Health for detecting LLM hallucinations with uncertainty quantification: sampling multiple responses and measuring their consistency, reading token probabilities, or routing through judge LLMs. It covers the spectrum from cheap white-box token-probability methods to expensive black-box multi-sampling approaches, plus claim-level scoring for long-form text. It's aimed at ML engineers who need to ship LLM applications with some confidence signal about when the model is making things up.
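To make the "sample multiple responses and measure consistency" idea concrete, here is a minimal sketch in plain Python. It is not UQLM's implementation or API: the function names are hypothetical, and token-level Jaccard overlap stands in as a crude proxy for the semantic consistency measures a real scorer would use.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0  # two empty responses count as identical
    return len(ta & tb) / len(ta | tb)


def consistency_score(original: str, samples: list[str]) -> float:
    """Mean similarity between the original response and re-sampled ones.

    Hypothetical helper: a low value suggests the model gives different
    answers when asked again, which is the black-box uncertainty signal.
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    return sum(jaccard(original, s) for s in samples) / len(samples)
```

In practice you would generate `samples` by re-asking the same prompt at a non-zero temperature, then flag responses whose score falls below a threshold you pick on your own data.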

Published in JMLR and TMLR, with papers backing the specific scorer implementations, so this isn't vibes-based; it's grounded in the literature. The scorer taxonomy (black-box vs white-box vs judge vs ensemble) is genuinely useful and maps directly to the latency/cost tradeoffs you'll face in production. LangChain integration means you can swap in any supported model without changing the scoring code. The ensemble tuner that learns weights from ground truth is a practical touch: most UQ libraries stop at off-the-shelf scorers and leave calibration as your problem.
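For intuition about what "learning ensemble weights from ground truth" can look like, here is a small sketch. It assumes a brute-force grid search and a fixed 0.5 threshold, both chosen for illustration; UQLM's actual tuner may use a different optimizer and objective, and every name below is hypothetical.

```python
from itertools import product


def combine(scores, weights):
    """Weighted sum of per-scorer confidence scores for one response."""
    return sum(w * s for w, s in zip(weights, scores))


def accuracy(rows, labels, weights, threshold=0.5):
    """Fraction of responses where (combined score >= threshold) matches the label."""
    hits = 0
    for scores, label in zip(rows, labels):
        predicted_correct = combine(scores, weights) >= threshold
        hits += predicted_correct == label
    return hits / len(labels)


def tune_weights(rows, labels, steps=10):
    """Grid-search weights that sum to 1 and maximize accuracy on labeled data.

    rows:   one tuple of scorer outputs per response, each in [0, 1]
    labels: True if the response was actually correct, else False
    """
    n_scorers = len(rows[0])
    grid = [i / steps for i in range(steps + 1)]
    best_weights, best_acc = None, -1.0
    for combo in product(grid, repeat=n_scorers):
        if abs(sum(combo) - 1.0) > 1e-9:
            continue  # keep only weight vectors that sum to 1
        acc = accuracy(rows, labels, combo)
        if acc > best_acc:
            best_weights, best_acc = combo, acc
    return best_weights, best_acc
```

The grid grows exponentially with the number of scorers, so this only works for a handful of them. The point is the shape of the problem: a small labeled set lets you weight reliable scorers up and noisy ones down.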

Everything is async-first, which is fine until you try to integrate it into a sync codebase or a notebook where async is a pain. The LangChain dependency is both the strength and the trap: you're tied to LangChain's API churn, and if LangChain deprecates a model interface you'll be waiting on UQLM to catch up. Black-box methods that require 5+ LLM calls per query multiply your inference costs roughly fivefold or more, which is hard to explain to a budget owner. No streaming support is mentioned, so you're waiting for the full response before scoring starts, which hurts perceived latency.
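If the async-first design is the sticking point, one common workaround is a thin synchronous wrapper. The sketch below uses only the standard library; `fake_score_async` is a hypothetical stand-in for whatever async scoring call you end up using, not a UQLM function.

```python
import asyncio


async def fake_score_async(prompt: str) -> float:
    """Hypothetical stand-in for an async scoring call."""
    await asyncio.sleep(0)  # simulate yielding to the event loop
    return 0.5


def score_sync(prompt: str) -> float:
    """Run the async scorer from plain synchronous code.

    asyncio.run() cannot be called while an event loop is already running
    (the usual situation in Jupyter), so fail with a clear message there.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        loop_running = False
    else:
        loop_running = True

    if loop_running:
        raise RuntimeError(
            "event loop already running; use `await fake_score_async(...)` instead"
        )
    return asyncio.run(fake_score_async(prompt))
```

From a script or a sync web handler, `score_sync("...")` just works. In a notebook, where a loop is already running, you would `await` the coroutine directly in the cell.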
