// the find

facebookresearch/meta-agents-research-environments

★ 523 · Python · MIT · updated Jun 2026

Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, this platform introduces evolving environments where agents must adapt their strategies as new information becomes available, mirroring real-world challenges.

Meta's research platform for evaluating AI agents against dynamic, evolving task environments rather than static benchmarks. The centerpiece is Gaia2: 800 scenarios across 10 simulated domains (email, calendar, file system, shopping, etc.) where the environment state changes during the task. Aimed at researchers building or measuring autonomous agents.

The dynamic environment model is the real contribution here: scenarios inject new emails, calendar events, and notifications mid-task, which static benchmarks like MMLU or even the original Gaia can't test. The simulated app layer (email, calendar, contacts, cab, shopping) is detailed enough to support genuinely multi-step tasks without hitting real APIs. MCP server integration means you can wire in external agents without forking the framework. The GUI's DAG visualization for scenario authoring is useful in practice: building evaluation scenarios usually involves painful trial and error, and the visualization makes that process inspectable.
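To make the mid-task injection idea concrete, here is a minimal sketch in plain Python. It does not use the project's actual API: `Inbox`, `ScheduledEvent`, `run_episode`, and `simple_agent` are hypothetical names invented for illustration, and the tick-based loop is an assumption about how such an environment could be modelled, not a description of how this framework does it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Inbox:
    """Hypothetical stand-in for a simulated email app."""
    messages: list[str] = field(default_factory=list)


@dataclass(order=True)
class ScheduledEvent:
    """Hypothetical environment change that fires at a given tick."""
    tick: int
    apply: Callable[[Inbox], None] = field(compare=False)


def run_episode(agent, inbox, events, max_ticks=5):
    """Step the agent, injecting due events before each of its turns."""
    pending = sorted(events)  # ordered by tick only
    log = []
    for tick in range(max_ticks):
        # Fire every event that is due at or before this tick.
        while pending and pending[0].tick <= tick:
            pending.pop(0).apply(inbox)
        log.append(agent(tick, inbox))
    return log


def simple_agent(tick, inbox):
    """Toy policy: change plan once a cancellation shows up."""
    if any("cancelled" in m for m in inbox.messages):
        return "rebook"
    return "proceed"


inbox = Inbox(messages=["Meeting at 3pm confirmed"])
events = [
    ScheduledEvent(
        tick=2,
        apply=lambda box: box.messages.append("Meeting cancelled, please rebook"),
    )
]
log = run_episode(simple_agent, inbox, events, max_ticks=4)
print(log)  # ['proceed', 'proceed', 'rebook', 'rebook']
```

A static benchmark would only ever show the agent the initial inbox. Here the correct action changes at tick 2, so an agent that plans once and never re-reads the environment keeps returning "proceed" and fails.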

The leaderboard is self-reported and hosted by the same org that built the benchmark, which is a serious credibility problem for a framework positioned as an objective evaluation tool. The simulated apps are bespoke Python classes rather than real software, so an agent that aces Gaia2 email tasks may still fail on real Gmail; the sim-to-real gap is structural, not something a patch can fix. Adoption is still early: 523 stars and no external adopters visible in the README. The scenario library is thin beyond the 800 Gaia2 tasks, and writing custom scenarios requires understanding a non-trivial event/app abstraction. The TypeScript GUI is bundled into the Python package, which makes the install surface larger than headless benchmark runs need.
