// the find

evidentlyai/evidently

★ 7,598 · Jupyter Notebook · Apache-2.0 · updated May 2026

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.

Evidently is a Python library for evaluating and monitoring ML models and LLM outputs — data drift, classification metrics, RAG quality, text descriptors, the works. It targets ML engineers who need to move beyond "did the model deploy" to "is the model still working" in production. The scope is genuinely wide: offline batch evals, CI/CD test suites, and a live monitoring dashboard all in one package.

Shipping 20+ statistical drift tests (PSI, KS, Jensen-Shannon, Wasserstein, etc.) with sensible defaults means you can plug in your reference dataset and get real signal without configuring statistics from scratch. The descriptor system for LLM evals is well-designed: row-level scoring composes cleanly with the Report/TestSuite abstraction, so you can go from exploratory notebook to CI gate with minimal code changes. Export targets (HTML, JSON, dict, Grafana, Prometheus) are first-class, not afterthoughts; the Grafana integration examples are complete and runnable. The self-hostable monitoring UI is a real differentiator, since you're not forced into their cloud for the core functionality.
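
To make the drift-test idea concrete, here is a minimal sketch of what a PSI check does, written in plain Python rather than against Evidently's API. The function names `psi` and `passes_drift_gate`, the quantile binning, and the 0.2 threshold (a commonly cited rule of thumb) are all assumptions for illustration, not Evidently's implementation or defaults.

```python
import bisect
import math
import random


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples (illustrative)."""
    ref_sorted = sorted(reference)
    # Interior bin edges at reference quantiles, so each bin holds
    # roughly equal reference mass.
    edges = [ref_sorted[len(ref_sorted) * i // bins] for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # bisect_right returns the index of the bin that v falls into.
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so an empty bin cannot cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(reference)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def passes_drift_gate(score, threshold=0.2):
    """True when the score is below the assumed drift threshold."""
    return score < threshold


# Synthetic data: a reference sample and a current sample shifted by one sigma.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

print("same data:", psi(reference, reference))
print("shifted:  ", psi(reference, shifted))
print("gate ok?  ", passes_drift_gate(psi(reference, shifted)))
```

In a CI job, the same boolean could decide the exit code, which is roughly the "notebook to CI gate" path described above, minus everything the library handles for you (per-column test selection, categorical features, reporting).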

The codebase is split between a `legacy/` tree and a new API (`core`, `descriptors`, `future/`), and the migration is clearly mid-flight: you will hit naming collisions and doc examples that reference the old classes if you wander off the happy path. LLM-as-a-judge evals ship with OpenAI wiring, but multi-provider support is limited; if you're not on OpenAI you'll be writing adapters. The self-hosted monitoring service requires a separate backend process and its own storage. It's not embeddable, which makes it heavy for teams that just want dashboards without running another service. "100+ metrics" is true, but the long tail is shallow: recsys and ranking metrics exist, yet the docs and examples thin out fast outside of drift and classification.
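
For a sense of what "writing adapters" involves, here is a rough, library-agnostic sketch of a provider seam for judge-style evals. Everything in it is hypothetical: `JudgeProvider`, `CallableProvider`, `judge`, the prompt wording, and the labels are illustrative names, not part of Evidently's API, and the fake provider stands in for a real client call.

```python
import string
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


class JudgeProvider(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class CallableProvider:
    """Wraps any prompt -> text function, e.g. a thin wrapper around a vendor SDK."""

    fn: Callable[[str], str]

    def complete(self, prompt: str) -> str:
        return self.fn(prompt)


def judge(provider: JudgeProvider, question: str, answer: str,
          labels: Sequence[str] = ("GOOD", "BAD")) -> str:
    """Ask the provider for a one-word verdict and normalize it."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        f"Reply with exactly one word: {' or '.join(labels)}."
    )
    words = provider.complete(prompt).split()
    # Take the first word only, dropping punctuation and case differences.
    first = words[0].strip(string.punctuation).upper() if words else ""
    return first if first in labels else "UNKNOWN"


# A fake provider standing in for a real model call (assumption for the demo).
fake = CallableProvider(lambda prompt: " good.\n")
print(judge(fake, "What is 2 + 2?", "4"))
```

The design choice is to keep the provider behind one narrow method so swapping vendors touches a single wrapper, and to normalize the verdict defensively because model output formatting varies.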
