// the find

beir-cellar/beir

★ 2,212 · Python · Apache-2.0 · updated Oct 2025

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

BEIR is a standardized benchmark for evaluating information retrieval models across 18 diverse datasets in a zero-shot setting: a model is trained on one domain and evaluated on others it has not been tuned for. It's aimed at NLP researchers and ML engineers who need a fair, apples-to-apples comparison of retrieval approaches (BM25, dense, sparse, reranking) without cherry-picking a friendly dataset. The NeurIPS 2021 paper that introduced it has become a standard citation in the retrieval literature.

1. The dataset variety is genuinely useful: you get everything from medical (NFCorpus, BioASQ) to fact-checking (FEVER) to financial QA (FiQA), so a model that scores well across BEIR is actually generalizing.
2. The pluggable model interface is clean: wrapping a custom embedding model requires implementing two methods, and the evaluation harness handles the rest.
3. Recent additions (vLLM with LoRA, Cohere API, HuggingFace models with flash_attention_2) show the project is keeping pace with how people actually deploy models in 2025.
4. Datasets are hosted on Hugging Face, so `load_dataset` works without hunting down custom download scripts.
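
To make the two-method interface concrete, here is a minimal sketch of what a wrapper might look like. It assumes the two methods are `encode_queries` and `encode_corpus`, the names used in BEIR's dense-retrieval examples, and that corpus entries are dicts with `title` and `text` keys; check both against the version you install. `HashingEncoder` is a hypothetical toy encoder used only for illustration, not part of BEIR, and it returns plain lists where a real wrapper would normally return a NumPy array or tensor.

```python
import hashlib
import math
from typing import Dict, List


class HashingEncoder:
    """Hypothetical toy encoder: hashed bag-of-words, L2-normalized.

    It stands in for a real embedding model so the shape of the
    two-method interface is visible without any ML dependency.
    """

    def __init__(self, dim: int = 256):
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Count tokens into hash buckets, then scale to unit length.
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def encode_queries(self, queries: List[str], batch_size: int = 16, **kwargs) -> List[List[float]]:
        # One vector per query string.
        return [self._embed(q) for q in queries]

    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int = 16, **kwargs) -> List[List[float]]:
        # One vector per document; title and body are concatenated.
        return [
            self._embed((doc.get("title", "") + " " + doc.get("text", "")).strip())
            for doc in corpus
        ]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    model = HashingEncoder()
    docs = [
        {"title": "Aspirin", "text": "aspirin reduces fever and pain"},
        {"title": "Bonds", "text": "bond yields rise with interest rates"},
    ]
    query_vec = model.encode_queries(["does aspirin reduce fever"])[0]
    for doc, doc_vec in zip(docs, model.encode_corpus(docs)):
        print(doc["title"], round(dot(query_vec, doc_vec), 3))
```

In a real setup the object would be handed to BEIR's retrieval and evaluation classes rather than scored by hand; the `batch_size` and `**kwargs` parameters are kept here only so the signatures resemble what a harness is likely to call.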

1. Four of the 18 datasets (BioASQ, Signal-1M, TREC-NEWS, Robust04) are not redistributable. You have to jump through hoops to reproduce those numbers, which undercuts the 'easy benchmark' pitch for anyone doing serious comparison work.
2. The library's BM25 search runs on Elasticsearch, but there's no Docker Compose file or setup script; getting BM25 running requires you to stand up ES yourself, which is a real friction point.
3. The evaluation assumes a single-vector-per-document model; multi-vector approaches like ColBERT are handled via a git submodule pointing to a separate repo, not a first-class integration.
4. There is no built-in statistical significance testing. The 2024 SIGIR paper added meta-analysis tooling, but it's not in the pip package, so most people just compare raw NDCG@10 numbers without knowing whether the differences are meaningful.
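
For readers who want a quick sanity check on whether an NDCG@10 gap is meaningful, the sketch below computes per-query NDCG@10 with linear gain and runs a paired sign-flip randomization test on two systems' per-query scores. It is a generic illustration, not BEIR's API and not the SIGIR tooling mentioned above; `ndcg_at_k` and `paired_randomization_test` are hypothetical helper names, and scores from BEIR's own evaluation may differ in details such as tie handling.

```python
import math
import random
from typing import Dict, List


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query, using linear gain (rel / log2(rank + 1))."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def paired_randomization_test(
    scores_a: List[float],
    scores_b: List[float],
    trials: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean per-query difference between two systems.

    Under the null hypothesis the sign of each per-query difference is
    arbitrary, so we flip signs at random and count how often the
    shuffled mean is at least as extreme as the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (extreme + 1) / (trials + 1)
```

A typical use would be to compute `ndcg_at_k` for every query under both systems, pass the two score lists to `paired_randomization_test`, and treat a small p-value as evidence that the gap is not just query-level noise. With many datasets in play, some correction for multiple comparisons is also worth considering.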
