// the find

embeddings-benchmark/mteb

★ 3,302 · Python · Apache-2.0 · updated Jun 2026

MTEB: Massive Text Embedding Benchmark

MTEB is the standard benchmark for evaluating text embedding models across tasks like retrieval, classification, clustering, and semantic similarity, now expanded to cover multilingual and multimodal (audio, image) embeddings. If you're picking or fine-tuning an embedding model, this is where you check whether your scores hold up on the same tasks everyone else reports. It's what the Hugging Face leaderboard is built on.

1. Task coverage is genuinely broad: 500+ datasets across retrieval, clustering, bitext mining, reranking, STS, classification, and now audio/image tasks. Not a toy benchmark.
2. The model abstraction is clean: wrap any encoder in a simple interface and it runs against all tasks without boilerplate. Falls back to SentenceTransformer automatically for unregistered models.
3. Embedding caching is built in, so re-running with different tasks doesn't re-encode your corpus from scratch, which is critical when working with large models.
4. Results are stored as structured JSON per model/task/split, making it easy to load and compare across runs programmatically rather than parsing log files.
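To make the last point concrete, here is a minimal sketch of comparing two runs from the JSON files, using only the standard library. The directory layout (`<results_dir>/<model>/<revision>/<task>.json`) and the field names (`task_name`, `scores`, `main_score`) are assumptions about the result format, and `load_main_scores` and `compare` are hypothetical helpers, not part of MTEB. Check a real result file before relying on it.

```python
import json
from pathlib import Path


def load_main_scores(results_dir, split="test"):
    """Return {(model, task): mean main_score} for every result file found."""
    scores = {}
    for path in Path(results_dir).rglob("*.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Skip files that are not per-task results (e.g. model metadata).
        if not isinstance(data, dict) or "scores" not in data:
            continue
        # Assumed schema: scores -> split -> list of dicts with "main_score".
        entries = data["scores"].get(split, [])
        values = [e["main_score"] for e in entries if "main_score" in e]
        if not values:
            continue
        # Assumed layout: <results_dir>/<model>/<revision>/<task>.json
        model = path.relative_to(results_dir).parts[0]
        task = data.get("task_name", path.stem)
        # Average over subsets (e.g. languages) within the split.
        scores[(model, task)] = sum(values) / len(values)
    return scores


def compare(scores, model_a, model_b):
    """Yield (task, score_a, score_b, delta) for tasks both models have run."""
    tasks_a = {t for (m, t) in scores if m == model_a}
    tasks_b = {t for (m, t) in scores if m == model_b}
    for task in sorted(tasks_a & tasks_b):
        a, b = scores[(model_a, task)], scores[(model_b, task)]
        yield task, a, b, b - a
```

Calling `list(compare(load_main_scores("results"), "model-a", "model-b"))` would give one row per task that both models have results for, with the delta computed as the second model minus the first.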

1. Running a full MTEB evaluation is a multi-day job on consumer hardware, so the benchmark's breadth is also its practical barrier. There's no lightweight 'smoke test' subset that's officially blessed for fast iteration.
2. The multilingual expansion (MMTEB) added hundreds of datasets but coverage is uneven: some languages have one low-quality dataset, giving false confidence on those language scores.
3. No built-in support for evaluating retrieval with hybrid (sparse + dense) pipelines out of the box; BM25 is bolted on as an advanced use case and the integration is fragile.
4. The leaderboard scores are self-reported by model authors submitting results, not independently re-run, so there's no guarantee of reproducibility or that the model checkpoint matches what was evaluated.
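One way to guard against the uneven-coverage problem is to count how many datasets back each language before trusting a per-language score. The sketch below is an illustration only: `thin_languages` is a hypothetical helper, the task names and language codes are made up, and the threshold of 3 is an arbitrary assumption. In practice the mapping would come from whatever task metadata you have on hand.

```python
from collections import Counter


def thin_languages(task_languages, min_datasets=3):
    """Return {language: dataset_count} for languages below min_datasets."""
    counts = Counter(
        lang for langs in task_languages.values() for lang in set(langs)
    )
    return {lang: n for lang, n in sorted(counts.items()) if n < min_datasets}


# Hypothetical task -> language mapping, for illustration only.
example = {
    "TaskA": ["eng", "deu"],
    "TaskB": ["eng"],
    "TaskC": ["eng", "swa"],
}
print(thin_languages(example))  # {'deu': 1, 'swa': 1}
```

A score for a language flagged here rests on very few datasets, so it is worth reading the underlying datasets before quoting it.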
