// the find
simular-ai/Agent-S
Agent S: an open agentic framework that uses computers like a human
Agent S is a framework for building GUI agents that control a computer by taking screenshots and executing mouse/keyboard actions. It's backed by academic papers with peer-reviewed results on OSWorld, a benchmark for measuring how well agents complete real desktop tasks. Target audience is researchers and developers who want to run or extend a state-of-the-art computer-use agent.
The S1→S2→S3 versioning is actually useful here — older versions are preserved as separate modules rather than deleted, so you can reproduce benchmark results from the papers without git archaeology. The grounding/planning split is architecturally sound: a small specialized model (UI-TARS) handles pixel-level element locating, while the larger LLM handles reasoning, which is cheaper and more accurate than asking one model to do both. Behavior Best-of-N is a concrete inference-time scaling idea that gets 72.6% on OSWorld — they've actually published how it works, not just the number. The `gui_agents` PyPI package means you can import it in three lines without cloning the repo.
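The grounding/planning split can be sketched in a few lines. This is a minimal illustration, not Agent S's actual API: `Step`, `Planner`, `Grounder`, `next_action`, and the two fake models are all hypothetical names, and it assumes the final action is a pyautogui call string.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Step:
    verb: str    # pyautogui function name, e.g. "click"
    target: str  # natural-language description of the UI element


# Hypothetical interfaces: each takes text plus raw screenshot bytes.
Planner = Callable[[str, bytes], Step]
Grounder = Callable[[str, bytes], Tuple[int, int]]


def next_action(instruction: str, screenshot: bytes,
                plan: Planner, ground: Grounder) -> str:
    """Ask the planner what to do, then the grounder where to do it."""
    step = plan(instruction, screenshot)
    x, y = ground(step.target, screenshot)
    return f"pyautogui.{step.verb}({x}, {y})"


# Stand-ins for the real models so the sketch runs offline.
def fake_plan(instruction: str, screenshot: bytes) -> Step:
    return Step(verb="click", target="the Save button")


def fake_ground(target: str, screenshot: bytes) -> Tuple[int, int]:
    return (640, 12)


print(next_action("Save the document", b"", fake_plan, fake_ground))
# prints: pyautogui.click(640, 12)
```

The planner never sees coordinates and the grounder never sees the task, so either model can be swapped without touching the other.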
The `exec(action[0])` pattern in the quickstart is a hard pill to swallow: the agent returns a string of Python code and you just run it, so a confused or adversarially prompted agent can delete files or exfiltrate data, and there's no sandboxing by default. The mandatory external grounding model (UI-TARS-1.5-7B hosted on Hugging Face Inference Endpoints) adds significant cost and latency before you even get a working agent; there's no lightweight fallback for experimentation. Four co-existing agent versions (s1, s2, s2_5, s3) with almost identical directory structures mean a lot of duplicated code and no shared base where a bug can be fixed once. The benchmark focus is both its strength and its weakness: OSWorld tasks are curated lab conditions, and there's almost no guidance on what happens when you point this at an actual messy desktop workflow.
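If you do run the quickstart, one cheap mitigation is to vet the action string before it reaches `exec`. The sketch below is an assumption-heavy illustration, not something the repo ships: `is_safe_action`, `run_action`, and the `ALLOWED_CALLS` set are hypothetical, and it assumes actions are plain `pyautogui.<function>(...)` calls with literal arguments. It will reject legitimate actions that use anything else, and it is an allowlist, not a sandbox.

```python
import ast

ALLOWED_MODULE = "pyautogui"
# Illustrative allowlist; extend it to match the actions you expect.
ALLOWED_CALLS = {"click", "doubleClick", "moveTo", "write",
                 "press", "hotkey", "scroll"}


def is_safe_action(code: str) -> bool:
    """True only if code is `import pyautogui` and/or allowed pyautogui calls."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in tree.body:
        # Permit a bare `import pyautogui`, nothing else.
        if isinstance(node, ast.Import):
            if all(a.name == ALLOWED_MODULE and a.asname is None
                   for a in node.names):
                continue
            return False
        # Every other statement must be a call expression.
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            return False
        call = node.value
        func = call.func
        if not (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == ALLOWED_MODULE
                and func.attr in ALLOWED_CALLS):
            return False
        # Arguments must be literals: no nested calls, names, or unpacking.
        for arg in list(call.args) + [kw.value for kw in call.keywords]:
            try:
                ast.literal_eval(arg)
            except (ValueError, TypeError, SyntaxError, RecursionError):
                return False
    return True


def run_action(code: str) -> None:
    """Execute an agent action only after it passes the allowlist check."""
    if not is_safe_action(code):
        raise PermissionError(f"refusing to exec: {code!r}")
    exec(code, {})
```

In the quickstart loop this would replace the bare call, e.g. `run_action(action[0])`. For anything beyond experimentation, running the agent inside a disposable VM or container is the stronger option.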