
// the find

ShishirPatil/gorilla

★ 12,933 · Python · Apache-2.0 · updated Apr 2026

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)

Gorilla is a Berkeley research project for training and evaluating LLMs on function/tool calling. The main product at this point is the Berkeley Function Calling Leaderboard (BFCL), which has become the de facto benchmark for comparing how well models handle real-world API calls across single-turn, multi-turn, parallel, and agentic scenarios. If you're building systems that rely on tool use and want data on which model to pick, this is where you look.

The BFCL benchmark is well constructed. It covers cases that matter in production: parallel function calls, irrelevance detection (knowing when not to call a function), multi-turn state, and now agentic web search with error recovery. The leaderboard reports cost and latency alongside accuracy, which matters because the most accurate model is not always the one you can afford to run. The V2 dataset is live-updated with enterprise-contributed prompts, not just synthetic academic ones. The GoEx execution engine's 'undo' and 'damage confinement' abstractions are a real contribution to safe agentic systems that most people aren't thinking about yet.
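
To make the irrelevance and parallel-call categories concrete, here is a minimal sketch of how such cases might be scored. This is not BFCL's actual evaluation code: the `score_case` and `accuracy` helpers, the shape of the call dictionaries, and the example function names are all assumptions made for illustration.

```python
import json
from collections import Counter


def _key(call):
    # Canonical, hashable form of a call: name plus arguments as sorted JSON.
    return (call["name"], json.dumps(call["arguments"], sort_keys=True))


def score_case(predicted, expected):
    """Return True if the predicted calls match the expected calls.

    An empty `expected` list encodes an irrelevance case: the correct
    behaviour is to make no call at all. Parallel calls are compared as
    a multiset, so the order in which the model emits them is ignored.
    """
    return Counter(map(_key, predicted)) == Counter(map(_key, expected))


def accuracy(cases):
    # `cases` is a list of (predicted, expected) pairs (hypothetical format).
    return sum(score_case(p, e) for p, e in cases) / len(cases)


# Hypothetical calls, invented for this example.
paris = {"name": "get_weather", "arguments": {"city": "Paris"}}
tokyo = {"name": "get_weather", "arguments": {"city": "Tokyo"}}

cases = [
    ([tokyo, paris], [paris, tokyo]),  # parallel calls, order swapped: pass
    ([], []),                          # irrelevant prompt, no call made: pass
    ([paris], []),                     # irrelevant prompt, spurious call: fail
]
print(accuracy(cases))
```

Exact matching on arguments is a deliberate simplification here; a real harness would likely need looser comparison for things like optional parameters or equivalent values.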

The repo has sprawled into a monorepo of loosely connected subprojects (RAFT, GoEx, Agent Arena, OpenFunctions, APIZoo, gorilla-cli) with varying maintenance levels, and the original Gorilla fine-tuned model is essentially abandoned in favor of the leaderboard. OpenFunctions-v2 still points to a Berkeley server endpoint that you cannot assume will stay up, making it a poor dependency for anything in production. The 'APIZoo' community contribution angle never really took off, and the data feels stale. Self-hosting BFCL to evaluate a new model requires non-trivial setup, and the documentation for it is scattered across multiple READMEs.
