// the find
kubernetes/kube-state-metrics
Add-on agent to generate and expose cluster-level metrics.
kube-state-metrics watches the Kubernetes API server and exposes object-state metrics (deployments, pods, nodes, etc.) on an HTTP endpoint that Prometheus scrapes. It deliberately exposes raw API data without heuristics, which is the right call: you get the ground truth, not kubectl's interpretation of it. This is the standard way to get Kubernetes resource health into Prometheus; it's in virtually every kube-prometheus-stack deployment already.
The Custom Resource State feature lets you define metrics from any CRD via a YAML config, which means you're not blocked waiting for upstream support when your team adds custom resources. Horizontal and DaemonSet sharding are genuinely thought through: MD5 of the object UID for hash distribution, StatefulSet auto-discovery, and the newer Deployment-based sharding option that avoids StatefulSet rollout gaps. The self-metrics (list/watch error counts, config hash, shard config) make it operationally debuggable without log diving. Allow/deny lists on both metrics and labels give you real control over cardinality before it blows up your Prometheus storage.
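To make the sharding idea concrete, here is a minimal Python sketch of hash-mod assignment: hash the object UID with MD5, reduce it modulo the shard count, and keep the object only if the result matches this shard's index. The function names (`shard_for`, `keep`) and the choice to read the first 8 digest bytes as an integer are assumptions for illustration, not the project's actual code, which is written in Go and may map hashes to shards differently.

```python
import hashlib


def shard_for(uid: str, total_shards: int) -> int:
    """Map an object UID to a shard index in [0, total_shards)."""
    if total_shards < 1:
        raise ValueError("total_shards must be >= 1")
    digest = hashlib.md5(uid.encode("utf-8")).digest()
    # Assumption: interpret the first 8 digest bytes as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % total_shards


def keep(uid: str, shard: int, total_shards: int) -> bool:
    """Client-side filter: True if this shard owns the object."""
    return shard_for(uid, total_shards) == shard


if __name__ == "__main__":
    # Made-up UIDs, purely for demonstration.
    uids = [
        "3f1c2a9e-0000-4000-8000-000000000001",
        "3f1c2a9e-0000-4000-8000-000000000002",
        "3f1c2a9e-0000-4000-8000-000000000003",
    ]
    for uid in uids:
        print(uid, "-> shard", shard_for(uid, 3))
```

Every shard runs the same filter over the same full object list, so each object ends up owned by exactly one shard, but every shard has to receive every object before it can decide.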
The sharding model still downloads all objects to every shard and filters client-side. The maintainers acknowledge this is waiting on Kubernetes API server support, but it means memory doesn't scale down as cleanly as you'd hope on very large clusters. Only one release is supported at a time; if you're running older Kubernetes you're on your own with community support, which is a real constraint in environments that lag K8s upgrades. The ECMAScript regex support for allow/deny lists is a workaround for Go's lack of lookaheads and introduces a separate JS regex engine as a dependency, with a 1-minute cap on eval time. That's a footgun waiting to happen if someone writes a pathological pattern. Config hot-reload exists but the CRS config hash metrics suggest it's not atomic; a bad config reload in production will log errors but keep running on stale config, which can be confusing.
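For a sense of what lookaheads buy you in an allowlist, the sketch below uses Python's `re` module as a stand-in: it supports lookaheads, whereas Go's standard `regexp` package follows RE2 syntax and rejects them. The pattern and the `allowed` helper are hypothetical, and the exact syntax kube-state-metrics accepts in its allow/deny lists may differ.

```python
import re

# Hypothetical allowlist: every kube_pod_* metric except the two
# high-cardinality ones. The (?!...) group is a negative lookahead.
ALLOW = re.compile(r"^kube_pod_(?!labels$|annotations$).+$")


def allowed(metric_name: str) -> bool:
    """Return True if the metric name passes the allowlist pattern."""
    return ALLOW.match(metric_name) is not None


if __name__ == "__main__":
    for name in ("kube_pod_info", "kube_pod_labels", "kube_node_info"):
        print(name, allowed(name))
```

Without lookaheads, the same rule needs an allowlist plus a separate denylist. The trade-off is that backtracking engines that support this syntax can also be driven into very slow matches by a badly written pattern, which is what the eval-time cap guards against.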