// the find

HolmesGPT/holmesgpt

★ 2,623 · Python · Apache-2.0 · updated Jun 2026

SRE Agent - CNCF Sandbox Project

HolmesGPT is an AI agent for SRE incident investigation — you point it at a Prometheus alert, a Jira ticket, or a Kubernetes cluster and it queries your observability stack to find root causes. It's a CNCF sandbox project with Microsoft contributing, aimed at platform/SRE teams running Kubernetes who are drowning in alert noise.

The context management is genuinely thoughtful: server-side filtering, per-tool memory limits, and streaming large results to disk so it doesn't choke the LLM context window on a 50MB log dump. The integration breadth is real — Prometheus, Datadog, Loki, Tempo, Elasticsearch, and two dozen more — and each has its own toolset definition rather than a generic REST shim. Operator mode (24/7 background health checks that can open GitHub PRs for fixes) is a meaningful step beyond 'chatbot that answers questions when asked'. LLM provider is pluggable from the start, so you're not locked to OpenAI.
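To make the "stream large results to disk" idea concrete, here is a minimal sketch of the general pattern. It is not HolmesGPT's code: `cap_tool_output`, `MAX_INLINE_CHARS`, and the returned dict shape are hypothetical names chosen for illustration. The idea is that a tool result over a budget is written to a temp file and only a bounded preview goes back to the model.

```python
import os
import tempfile

# Hypothetical per-tool budget; a real agent would likely make this configurable.
MAX_INLINE_CHARS = 2_000


def cap_tool_output(output: str, max_chars: int = MAX_INLINE_CHARS) -> dict:
    """Return small outputs inline; spill large ones to disk and return a preview."""
    if len(output) <= max_chars:
        return {
            "inline": output,
            "truncated": False,
            "path": None,
            "total_chars": len(output),
        }

    # Write the full result to a temp file so nothing is lost.
    fd, path = tempfile.mkstemp(prefix="tool_output_", suffix=".log")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(output)

    # Only the bounded preview (plus a pointer to the file) reaches the LLM.
    return {
        "inline": output[:max_chars],
        "truncated": True,
        "path": path,
        "total_chars": len(output),
    }
```

With this shape, the agent can tell the model that the result was truncated and offer a follow-up tool (say, a grep over `path`) instead of pasting a 50MB log into the prompt.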

The project is backed by Robusta.dev, and the Slack/Teams integrations require Robusta's commercial product. That boundary isn't clearly flagged and will bite anyone who assumes full open-source parity. There are no published latency or cost benchmarks for the agentic loop on a realistic incident with 10+ tool calls, which matters when you're on-call at 2am. No static types are visible from the tree, so the Python tool dispatch layer is probably held together by dicts and duck typing, which makes adding custom toolsets riskier than the docs suggest. The 'read-only by design' claim is undermined by the Kubernetes Remediation MCP toolset living in the same repo; the safety story needs a clearer separation.
