// the find

THUDM/AgentBench

★ 3,492 · Python · Apache-2.0 · updated Feb 2026

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

AgentBench is a benchmark suite for evaluating LLMs on agentic tasks across 8 environments: OS shell commands, SQL databases, knowledge graphs, web shopping, web browsing, card games, ALFWorld household tasks, and lateral thinking puzzles. It was published at ICLR'24 and has since evolved to include a function-calling variant (AgentBench FC) built on top of AgentRL. It is primarily useful for researchers comparing LLM agent capabilities, less so for practitioners building agents.

The environment diversity is genuine: OS interaction, DB querying, KG traversal, and web shopping test meaningfully different agent skills rather than rewording the same task. Docker-based isolation for each task worker is the right call; it keeps environments reproducible and prevents state bleed between runs. The FC variant's one-command Docker Compose setup is a real improvement over the manual setup in v0.2. The leaderboard tracks a wide range of models, including open-weight ones, which makes it useful for apples-to-apples comparisons.

The webshop task requires ~16 GB of RAM just to start, and the README openly admits that alfworld leaks memory until the worker is restarted. Both are known bugs that haven't been fixed despite the repo being two years old. Pinning Python 3.9 and numpy~=1.23.x means you're fighting dependency hell the moment you try to integrate this with any modern stack. The KnowledgeGraph task depends on an online SPARQL endpoint described as 'not stable', and the self-hosting path requires a Freebase dump that's increasingly hard to obtain. The split between v0.1/v0.2 and the FC rewrite is messy: the FC version lives on main, but large parts of the README still document the old architecture.
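To make the pinning and memory concerns concrete, here is a minimal preflight sketch that checks a machine against the constraints named above before any setup is attempted. The constants come from this review, not from the repo's actual requirements files, and the helper names (`python_matches`, `numpy_matches`, `enough_ram`) are hypothetical; nothing here is part of AgentBench itself.

```python
import os
import sys
from importlib import metadata

# Assumed constraints, taken from the review text rather than from the repo's
# own requirements files: Python 3.9, numpy 1.23.x, ~16 GB RAM for webshop.
REQUIRED_PYTHON = (3, 9)
REQUIRED_NUMPY_SERIES = (1, 23)
WEBSHOP_MIN_RAM_GB = 16


def python_matches(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True if the interpreter's major.minor equals the pinned pair."""
    return tuple(version_info[:2]) == tuple(required)


def numpy_matches(version, required=REQUIRED_NUMPY_SERIES):
    """Return True if a version string such as '1.23.5' is in the 1.23.x series."""
    parts = version.split(".")
    try:
        return (int(parts[0]), int(parts[1])) == tuple(required)
    except (IndexError, ValueError):
        # Unparseable version strings are treated as a mismatch.
        return False


def enough_ram(total_bytes, min_gb=WEBSHOP_MIN_RAM_GB):
    """Return True if total physical memory meets the assumed webshop minimum."""
    return total_bytes >= min_gb * 1024**3


if __name__ == "__main__":
    print("python ok:", python_matches())
    try:
        print("numpy ok:", numpy_matches(metadata.version("numpy")))
    except metadata.PackageNotFoundError:
        print("numpy ok: not installed")
    try:
        # These sysconf names are available on Linux; other platforms may lack them.
        total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        print("webshop RAM ok:", enough_ram(total))
    except (AttributeError, ValueError, OSError):
        print("webshop RAM ok: unknown on this platform")
```

Running a check like this inside a dedicated Python 3.9 virtual environment, rather than in the environment of the project you want to integrate with, is one way to keep the old pins from leaking into a modern stack.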
