// the find

confident-ai/deepeval

★ 16,139 · Python · Apache-2.0 · updated Jun 2026

The LLM Evaluation Framework

DeepEval is a pytest-style evaluation framework for LLM applications — RAG pipelines, agents, chatbots. You write test cases with inputs and expected outputs, attach metrics (hallucination, faithfulness, G-Eval, etc.), and run them in CI or notebooks. It's aimed at teams who want structured, repeatable evals rather than vibes-based prompting.

- The metric library is genuinely broad and well-organized: RAG-specific metrics (contextual precision/recall/relevancy), agentic metrics (tool correctness, step efficiency, plan adherence), multi-turn conversation metrics, and multimodal metrics. Most projects will find what they need without writing custom scorers.

- G-Eval is the right default: it uses chain-of-thought LLM-as-a-judge with evaluation criteria you define, which beats fixed-rubric approaches for anything domain-specific. Each metric also returns a `reason`, so a failure comes with an explanation rather than just a number.

- Framework integrations cover the realistic landscape (LangChain, LangGraph, OpenAI Agents, CrewAI, PydanticAI, LlamaIndex) via thin callback/instrumentation wrappers rather than requiring you to rewrite your app.

- The pytest integration is first-class: `deepeval test run` slots into existing CI pipelines without ceremony, and the `evals_iterator()` pattern for tracing eval runs through your actual app code is a clean design.

- Almost every metric ultimately calls an LLM judge, which means eval costs can exceed inference costs on large datasets — there's no built-in cost estimation or budget guard, so you'll discover this the hard way.

- The cloud platform (Confident AI) is heavily pushed throughout the README and CLI, and some features like dataset management and result sharing are intentionally kept there. The open-source version is functional but clearly designed to funnel you toward a paid SaaS.

- LLM-as-a-judge metrics inherit the judge's biases and inconsistencies — scores on the same test case can vary between runs, which makes threshold-based pass/fail fragile. The docs don't address this variance problem or suggest mitigation strategies like repeated sampling.

- The benchmark suite (MMLU, GSM8K, etc.) is a nice addition, but each benchmark downloads its dataset at runtime with little caching support, which makes CI runs slow and brittle if the upstream Hugging Face datasets move.
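
The judge-cost and score-variance caveats in the list can be partly worked around in user code. Here is a minimal sketch of one way to do it, assuming a hypothetical `judge` callable that takes a prompt and an output and returns a score in [0, 1]. `BudgetedJudge`, `BudgetExceeded`, and `fake_judge` are illustrative names, not DeepEval APIs. The wrapper caps the total number of judge calls and takes the median of several samples per test case, so a single noisy score is less likely to flip a threshold check.

```python
import statistics
from itertools import cycle
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when scoring another case would exceed the judge-call budget."""


class BudgetedJudge:
    """Hypothetical wrapper: caps judge calls and aggregates repeated samples."""

    def __init__(self, judge: Callable[[str, str], float],
                 max_calls: int, samples: int = 3) -> None:
        if samples < 1:
            raise ValueError("samples must be >= 1")
        self.judge = judge
        self.max_calls = max_calls
        self.samples = samples
        self.calls_made = 0

    def estimate_calls(self, n_cases: int) -> int:
        """Number of judge calls needed to score n_cases test cases."""
        return n_cases * self.samples

    def score(self, prompt: str, output: str) -> float:
        """Return the median of `samples` judge scores for one test case."""
        if self.calls_made + self.samples > self.max_calls:
            raise BudgetExceeded(
                f"{self.calls_made} calls made, budget is {self.max_calls}"
            )
        scores = []
        for _ in range(self.samples):
            scores.append(self.judge(prompt, output))
            self.calls_made += 1
        return statistics.median(scores)

    def passes(self, prompt: str, output: str, threshold: float) -> bool:
        """Threshold check on the aggregated score, not on a single sample."""
        return self.score(prompt, output) >= threshold


# Stand-in for a real LLM judge: returns a repeating sequence of noisy scores.
_noisy_scores = cycle([0.4, 0.8, 0.7])


def fake_judge(prompt: str, output: str) -> float:
    return next(_noisy_scores)


if __name__ == "__main__":
    guard = BudgetedJudge(fake_judge, max_calls=6, samples=3)
    print(guard.estimate_calls(10))          # prints 30
    print(guard.score("question", "answer"))  # prints 0.7
```

With a threshold of 0.5, the first raw sample (0.4) would fail on its own, while the median of the three samples (0.7) passes. The trade-off is that every extra sample multiplies judge cost, which is why the sketch pairs sampling with an explicit call budget.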
