// the find
ShishirPatil/gorilla
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
Gorilla is a Berkeley research project for training and evaluating LLMs on function/tool calling. The main product at this point is the Berkeley Function Calling Leaderboard (BFCL), which has become the de facto benchmark for comparing how well models handle real-world API calls across single-turn, multi-turn, parallel, and agentic scenarios. If you're building systems that rely on tool use and want data on which model to pick, this is where you look.
The BFCL benchmark is well constructed and covers the cases that matter in production: parallel function calls, irrelevance detection (knowing when NOT to call a function), multi-turn state, and now agentic web search with error recovery. The leaderboard reports cost and latency alongside accuracy, which is the right set of metrics if you are choosing a model to deploy rather than one to cite. As of V2, the dataset is live-updated with enterprise-contributed prompts, not just synthetic academic ones. The GoEx execution engine's 'undo' and 'damage confinement' abstractions are a real contribution to safe agentic systems, and one that most people aren't thinking about yet.
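To make two of those categories concrete, here is a toy sketch of what checking a parallel-call case and an irrelevance case could look like. It is not BFCL's actual checker: the `Call` type and the `score` function are hypothetical names invented for illustration, and real evaluation handles type coercion, optional parameters, and execution-based checks that this ignores.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Call:
    """One function call: a name plus its keyword arguments (hypothetical type)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so argument order never matters

    @staticmethod
    def of(name: str, **kwargs: Any) -> "Call":
        return Call(name, tuple(sorted(kwargs.items())))


def score(predicted: list[Call], expected: list[Call]) -> bool:
    """Return True when the model's calls match the expected calls.

    Irrelevance detection: `expected` is empty, so the only correct
    answer is to make no call at all.
    Parallel calls: the order of calls is ignored, but every expected
    call must appear exactly once.
    """
    if not expected:
        return not predicted
    return sorted(predicted, key=repr) == sorted(expected, key=repr)


if __name__ == "__main__":
    expected = [Call.of("get_weather", city="Paris"),
                Call.of("get_weather", city="Tokyo")]
    reordered = [Call.of("get_weather", city="Tokyo"),
                 Call.of("get_weather", city="Paris")]
    print(score(reordered, expected))   # same calls, different order
    print(score(reordered[:1], []))     # a call where none was wanted
```

The irrelevance branch is what simple accuracy metrics tend to miss: a model that always emits some call can do well on ordinary cases and still fail every prompt where the right answer is to call nothing.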
The repo has sprawled into a monorepo of loosely connected subprojects (RAFT, GoEx, Agent Arena, OpenFunctions, APIZoo, gorilla-cli) with varying maintenance levels, and the original Gorilla fine-tuned model is essentially abandoned in favor of the leaderboard. OpenFunctions-v2 still points to a Berkeley server endpoint that you cannot assume will stay up, which makes it a poor dependency for anything in production. The 'APIZoo' community contribution angle never really took off, and the data feels stale. Self-hosting BFCL to evaluate a new model requires non-trivial setup, and the documentation for it is scattered across multiple READMEs.