// the find

garrytan/gbrain-evals

★ 280 · HTML · MIT · updated Jun 2026

A benchmark suite for gbrain, an agent long-term memory system. It tests retrieval quality across the scenarios that actually matter for agent memory (relational queries, temporal reasoning, provenance tracking, contradiction detection) rather than reporting a single headline number. It's aimed at anyone evaluating or building on gbrain, and at anyone who wants to see what an honest memory system benchmark looks like.

The anti-gaming design is genuinely good: sealed answer keys that never touch the system under test, pinned judge versions, randomized question order, and tolerance bands from repeated runs. Publishing the bad 0.076 precision number alongside the good results, explaining exactly why it's bad, and then shipping a real fix is exactly the kind of transparency you want from a vendor's own test suite. The corpus design (240 pages of synthetic life with planted contradictions, stale facts, and deliberate junk) tests the failure modes that actually bite you in production rather than the clean cases academic benchmarks favor. Zero regressions across 20 releases, with committed pass/fail thresholds in CI, is a credible stability claim.
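To make "tolerance bands from repeated runs" concrete, here is a minimal sketch of how such a band could gate a CI check. The function names, the choice of mean plus or minus k standard deviations, and the sample scores are all assumptions made for illustration; the repo may compute its bands differently.

```python
import statistics


def tolerance_band(scores, k=2.0):
    """Return (low, high) as mean +/- k sample standard deviations.

    Hypothetical helper: the band definition is an assumption, not the repo's.
    """
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean - k * spread, mean + k * spread


def is_regression(new_score, baseline_scores, k=2.0):
    """Flag a run only if it falls below the lower edge of the band."""
    low, _ = tolerance_band(baseline_scores, k)
    return new_score < low


# Made-up scores from five repeated runs of the same release.
baseline = [0.90, 0.92, 0.91, 0.89, 0.93]

print(is_regression(0.85, baseline))  # well below the band
print(is_regression(0.90, baseline))  # inside normal run-to-run noise
```

The point of the band is that a single noisy run does not fail the build, while a drop larger than the observed run-to-run variation does.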

The harness is tightly coupled to gbrain's own internal adapter interface, so running a competing system against it requires writing a non-trivial adapter with no guarantee the abstraction fits cleanly. The corpora are fully synthetic and generated by Claude (Opus), which means gbrain is being tested on data that shares distributional properties with its training signal — that's a soft conflict of interest that isn't acknowledged. The 97.6% recall@5 headline is on LongMemEval-S, the smaller split; results on the full LongMemEval dataset aren't prominently featured, which is a gap worth noticing before you treat this as a definitive benchmark. There's no latency test that runs against the actual corpora used for retrieval accuracy, so the p95 < 200ms number and the accuracy numbers are measured separately and may not reflect the same operating point.
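The last gap could be closed by timing each query in the same pass that scores it, so accuracy and latency describe the same operating point. A rough sketch follows; `retrieve`, `evaluate`, and the case format are hypothetical and not part of the gbrain-evals harness, and the recall definition here is one common choice that may not match the one the suite uses.

```python
import math
import time


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def evaluate(retrieve, cases, k=5):
    """Score and time every query in one pass.

    `retrieve` is a hypothetical callable: query -> ranked list of ids.
    `cases` is a list of (query, relevant_ids) pairs.
    """
    if not cases:
        raise ValueError("need at least one case")
    recalls, latencies_ms = [], []
    for query, relevant in cases:
        start = time.perf_counter()
        retrieved = retrieve(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        recalls.append(recall_at_k(retrieved, relevant, k))
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "p95_ms": p95(latencies_ms),
    }
```

Reporting both numbers from one run makes it harder to quote a fast configuration for latency and a slow, thorough one for accuracy.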
