// the find

vibrantlabsai/ragas

★ 14,355 · Python · Apache-2.0 · updated Feb 2026

Supercharge Your LLM Application Evaluations 🚀

Ragas is a Python library for evaluating RAG pipelines and LLM applications, covering both reference-based metrics (faithfulness, context precision, answer correctness) and LLM-as-judge approaches. It also generates synthetic test datasets from your documents. It is aimed at teams building production RAG systems who need something more principled than vibes-based testing.

The metric library is genuinely broad (faithfulness, context recall, noise sensitivity, SQL correctness, multimodal), covering cases that most teams have to hand-roll themselves. Test dataset generation from your own docs is the killer feature: you get query/context/answer triples without manually labeling hundreds of examples. The async-first API (`ascore`, `aevaluate`) means a batch evaluation can keep many scoring calls in flight instead of blocking on each one. Integration coverage is wide (LangChain, LlamaIndex, LangSmith, Arize, Langfuse), so it drops into most existing observability stacks without glue code.
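
To make the non-blocking point concrete, here is a minimal sketch in plain `asyncio`. It does not use the Ragas API: `fake_ascore` is a hypothetical stand-in for a metric's async scoring call, the word-overlap score is a toy, and `max_concurrency` is an assumed knob for capping in-flight judge calls.

```python
import asyncio


async def fake_ascore(sample: dict) -> float:
    """Hypothetical stand-in for an async metric call (NOT the Ragas API)."""
    await asyncio.sleep(0.01)  # simulates waiting on a judge-model response
    # Toy score: fraction of answer words that also appear in the context.
    answer_words = sample["answer"].lower().split()
    context_words = set(sample["context"].lower().split())
    if not answer_words:
        return 0.0
    return sum(w in context_words for w in answer_words) / len(answer_words)


async def score_batch(samples: list[dict], max_concurrency: int = 8) -> list[float]:
    """Score all samples concurrently, capping the number of in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(sample: dict) -> float:
        async with semaphore:
            return await fake_ascore(sample)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(s) for s in samples))


samples = [
    {"context": "paris is the capital of france", "answer": "paris is the capital"},
    {"context": "paris is the capital of france", "answer": "berlin"},
]
scores = asyncio.run(score_batch(samples))
```

With a real judge model behind the scoring call, the semaphore is what keeps a large batch from tripping provider rate limits.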

Every LLM-based metric is itself an LLM call, so evaluation cost scales with dataset size and with the judge model you choose. This is unavoidable, but the docs underplay it. The metrics assume a specific RAG shape (query + retrieved context + answer); agentic or multi-turn flows need significant custom work, and the 'agent evals' templates are still listed as 'coming soon'. The repo is a fork of the original `explodinggradients/ragas` under a new org (`vibrantlabsai`), which raises questions about long-term maintenance and whether upstream fixes are being merged. Migration guides exist for 0.1→0.2 and 0.3→0.4, which points to a history of breaking API changes.
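
The cost point is easy to put numbers on before you commit to a judge model. The sketch below is a back-of-envelope estimate, not anything Ragas provides: the function name, the per-metric call count, the token count, and the price are all hypothetical placeholders to replace with your own measurements.

```python
def estimate_judge_cost(
    n_samples: int,
    n_metrics: int,
    calls_per_metric: int,
    tokens_per_call: int,
    usd_per_1k_tokens: float,
) -> float:
    """Rough judge-model spend for one evaluation run, in USD."""
    total_calls = n_samples * n_metrics * calls_per_metric
    total_tokens = total_calls * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens


# Every value here is an assumed placeholder, not a measured figure.
cost = estimate_judge_cost(
    n_samples=500,
    n_metrics=4,
    calls_per_metric=2,      # assumption: some metrics need several judge calls
    tokens_per_call=1500,    # assumption: prompt plus completion
    usd_per_1k_tokens=0.01,  # assumption: blended input/output price
)
```

Under these made-up inputs the run issues 4,000 judge calls and lands at roughly 60 USD. The estimate is linear in every argument, so doubling the dataset or the metric count doubles the bill.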
