// the find
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
EvalScope is a unified LLM/VLM evaluation harness from the ModelScope team (Alibaba). It wraps OpenCompass, VLMEvalKit, and RAGAS (through its RAGEval backend) behind a single CLI, adds its own stress-testing and arena modes, and ships a React dashboard for results. If you're running model evaluations in the ModelScope/Qwen ecosystem, this is the natural starting point.
The perf testing module is genuinely useful: it measures TTFT (time to first token), TPOT (time per output token), throughput, and multi-turn load in one tool with wandb/swanlab export, which most eval frameworks skip entirely. The External Agent Bridge, which evaluates Claude Code and OpenAI Codex by transparently intercepting their LLM traffic, is a clever design that sidesteps the 'how do you wrap an opinionated CLI' problem. YAML/dict/Python config parity means it fits in notebooks, CI pipelines, and ad-hoc scripts without rewriting setup code. The release cadence is active, with meaningful features shipping weekly, not just benchmark additions.
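For readers new to the perf metrics named above, here is a minimal sketch of how TTFT, TPOT, and throughput are commonly derived from per-token arrival times. It is not EvalScope's implementation: the `RequestTrace` type and the function names are hypothetical, and EvalScope's exact definitions may differ (for example, in whether the first token counts toward TPOT).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (not an EvalScope type)."""
    sent_at: float            # time the request was sent, in seconds
    token_times: List[float]  # arrival time of each output token, in seconds


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    if not trace.token_times:
        raise ValueError("trace has no output tokens")
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # assumption: undefined for a single token, reported as 0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second across all requests, over the wall-clock span."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


if __name__ == "__main__":
    trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
    print(ttft(trace), tpot(trace), throughput([trace]))
```

TTFT mostly reflects queueing and prefill cost, while TPOT reflects decode speed, which is why a tool that reports both under load is more informative than one that reports a single latency number.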
Being built primarily for the Qwen/ModelScope ecosystem is a real constraint: local model evaluation routes through ModelScope's hub by default, and documentation examples lean heavily on Qwen models, so HuggingFace-first shops will hit friction. The multi-backend approach (wrapping OpenCompass, VLMEvalKit, RAGEval) means you're often debugging through two layers of abstraction when something breaks: EvalScope's adapter plus the upstream framework. The benchmark catalog has grown to 150+ entries, but coverage is uneven; some are solid integrations, while others are thin wrappers around a single dataset with no documented few-shot or subset controls.