// the find

bregman-arie/sre-checklist

★ 2,547 · Apache-2.0 · updated Mar 2024

A checklist for anyone practicing Site Reliability Engineering

A collection of opinionated checklists covering SRE team setup, production readiness, Kubernetes, Terraform, GitOps, and monitoring. Useful for SRE teams doing a gap analysis or onboarding new members. Not a tutorial — it tells you what to think about, not how to do it.

The Terraform section is unusually specific: it covers state management, module reuse, for_each vs. count, dynamic image lookups, and secrets handling across local and CI/CD contexts. The Kubernetes section correctly calls out the CPU-limit antipattern (set limits for memory, not for CPU). The ArgoCD section is honest about tradeoffs (one ArgoCD instance vs. one per cluster, hosted vs. self-managed) instead of just listing options. The SRE maturity model (Ops → Automation → Product) is a practical framing for where a team stands.
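To make the "limits for memory, not CPU" item concrete, here is a minimal sketch of how a team might turn it into an automated check. It is not from the repo: the function name `check_container_resources` is hypothetical, and it assumes each container is already parsed into a plain dict shaped like an entry in a Pod spec's `containers` list.

```python
def check_container_resources(container: dict) -> list[str]:
    """Return warnings for one container dict (shaped like a Pod spec containers[] entry).

    Hypothetical helper: flags a missing memory limit and a present CPU limit,
    following the checklist's "limits for memory, not CPU" guidance.
    """
    resources = container.get("resources") or {}
    limits = resources.get("limits") or {}
    name = container.get("name", "<unnamed>")

    warnings = []
    if "memory" not in limits:
        warnings.append(f"{name}: no memory limit set")
    if "cpu" in limits:
        warnings.append(f"{name}: CPU limit set; the checklist advises against it")
    return warnings


# Example containers (made-up values for illustration).
good = {
    "name": "api",
    "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"memory": "256Mi"},
    },
}
bad = {"name": "worker", "resources": {"limits": {"cpu": "1"}}}

for c in (good, bad):
    for warning in check_container_resources(c):
        print(warning)
```

The check is deliberately narrow. A real policy would more likely live in an admission controller or a linter, and whether CPU limits are acceptable depends on the workload.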

The repo hasn't been touched since March 2024 and several sections are placeholders ('TODO: add some items'). Coverage is uneven — Terraform is deep, monitoring and chaos engineering are almost empty. There's nothing on SLO/SLI/error budget implementation beyond 'define your SLO', which is where most teams actually struggle. No tooling recommendations for incident management beyond vague mentions of alerts and dashboards.
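For readers wondering what the missing SLO material might look like, here is a minimal sketch of the basic error budget arithmetic for a request-based availability SLO. It is not from the repo: the function name, parameters, and returned keys are assumptions chosen for illustration, and real implementations usually work over rolling time windows rather than a single pair of counters.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute SLI and error budget figures for a request-based SLO.

    slo_target is a fraction strictly between 0 and 1 (e.g. 0.999 for 99.9%).
    All names here are illustrative, not taken from the checklist.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    sli = 1.0 - failed_requests / total_requests            # measured success ratio
    allowed_failures = (1.0 - slo_target) * total_requests  # the error budget
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,  # 1.0 means fully spent
        "slo_met": sli >= slo_target,
    }


# Made-up numbers: a 99.9% target over one million requests with 250 failures.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(report)
```

With a 99.9% target over 1,000,000 requests, the budget is about 1,000 failed requests, so 250 failures would consume roughly a quarter of it. The hard parts the checklist skips (choosing the SLI, picking the window, deciding what happens when the budget runs out) are policy questions this arithmetic does not answer.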
