// the find
aws/aws-node-termination-handler
Gracefully handle EC2 instance shutdown within Kubernetes
aws-node-termination-handler intercepts EC2 shutdown signals (spot interruptions, scheduled maintenance, ASG scale-in, AZ rebalance) and translates them into Kubernetes cordon/drain operations before the node actually goes away. It runs either as a DaemonSet that polls IMDS directly on each node, or as a central Deployment that consumes an SQS queue fed by EventBridge. If you run self-managed EC2 nodes with spot instances on Kubernetes, at any scale, this is the piece that keeps your workloads from dying uncleanly.
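To make the IMDS path concrete, here is a minimal Python sketch of what "polling IMDS for a spot interruption" amounts to. It is not NTH's code (NTH is written in Go), and the function names are made up for illustration. It assumes IMDSv2 and the documented spot/instance-action metadata path, which returns 404 until an interruption is scheduled. It only detects the notice; the cordon/drain step is left out.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

IMDS = "http://169.254.169.254"  # link-local metadata endpoint, reachable only from the instance


def parse_instance_action(body: str) -> tuple[str, datetime]:
    """Parse the JSON body served at spot/instance-action into (action, UTC time)."""
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return doc["action"], when


def poll_spot_interruption(timeout: float = 2.0) -> Optional[tuple[str, datetime]]:
    """Return (action, time) if an interruption is scheduled, else None.

    Hypothetical helper: a real agent would call this in a loop every few seconds.
    """
    # IMDSv2 requires a short-lived session token obtained with a PUT request.
    token_req = urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise
```

The parsing is split from the network call so the logic can be exercised off-instance, for example with parse_instance_action('{"action": "terminate", "time": "2017-09-18T08:22:00Z"}').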
The two-mode design (IMDS vs SQS/EventBridge) is genuinely useful: IMDS requires no AWS infrastructure setup and works fine for small clusters, while the Queue Processor supports lifecycle hooks that give you real control over termination timing. The heartbeat feature for ASG lifecycle hooks is particularly good: it lets you extend the drain window up to 48 hours while a drain is still in progress, instead of configuring one long hook timeout that stalls every termination, which matters for stateful workloads. The e2e test coverage is extensive, with separate test scripts for every event type and mode combination, not just happy-path smoke tests. The Helm chart covers the common operational knobs without burying them.
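The 48-hour figure comes with a catch worth working through. As I read the EC2 Auto Scaling documentation, an instance can stay in the wait state for 48 hours or 100 times the hook's heartbeat timeout, whichever is smaller, and the heartbeat timeout itself must be between 30 and 7200 seconds. The sketch below is plain arithmetic under those assumptions; the function names are illustrative and none of this is NTH configuration.

```python
import math

MAX_WAIT_S = 48 * 3600           # assumed hard ceiling on the wait state
MAX_MULTIPLE = 100               # assumed cap: 100 x heartbeat timeout
MIN_HEARTBEAT_TIMEOUT_S = 30     # assumed allowed range for the hook's heartbeat timeout
MAX_HEARTBEAT_TIMEOUT_S = 7200


def max_drain_window(heartbeat_timeout_s: int) -> int:
    """Longest total wait (seconds) reachable by sending heartbeats."""
    if not MIN_HEARTBEAT_TIMEOUT_S <= heartbeat_timeout_s <= MAX_HEARTBEAT_TIMEOUT_S:
        raise ValueError("heartbeat timeout outside the assumed 30..7200 s range")
    return min(MAX_WAIT_S, MAX_MULTIPLE * heartbeat_timeout_s)


def min_heartbeat_timeout_for(drain_s: int) -> int:
    """Smallest heartbeat timeout (seconds) that still allows a drain of drain_s."""
    if drain_s > MAX_WAIT_S:
        raise ValueError("drain window exceeds the 48-hour ceiling")
    return max(MIN_HEARTBEAT_TIMEOUT_S, math.ceil(drain_s / MAX_MULTIPLE))


print(max_drain_window(300))              # 5-minute hook timeout
print(min_heartbeat_timeout_for(48 * 3600))
```

With a 5-minute hook timeout, heartbeats can stretch the window to 30000 seconds (8 h 20 min), not 48 hours. Reaching the full 48 hours needs a heartbeat timeout of at least 1728 seconds.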
The Queue Processor setup is a five-step AWS infrastructure exercise (SQS, EventBridge rules, IAM role, ASG lifecycle hooks, instance tagging) that you have to finish before you can even install the thing. A CloudFormation template exists, but Terraform users are on their own, and the README just shows raw CLI commands rather than linking to maintained IaC. IMDS mode and Queue Processor mode are mutually exclusive at runtime, so if you want to handle both ASG lifecycle hooks and spot interruptions on the same cluster, you have to pick one path and accept its gaps. The compatibility matrix shows NTH v1.25.x tested only against K8s 1.29–1.32, so anyone on an older cluster has to match releases manually, with no guidance on what actually breaks. Multiple replicas in Queue Processor mode don't actually load-balance: they all greedily consume from the same SQS queue and may process the same event redundantly. The docs note this but don't spell out the implications.
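For a feel of what the EventBridge step involves, the sketch below writes out the event patterns as Python dicts, using the sources and detail-types as I understand them from the AWS event documentation. The rule names are hypothetical, and matches() is a deliberately simplified exact-value matcher, not EventBridge's real matching engine (no prefix, numeric, or anything-but operators). Treat it as a way to sanity-check patterns locally before porting them to Terraform or another IaC tool.

```python
from typing import Any

# Hypothetical rule names mapped to EventBridge-style event patterns.
RULES: dict[str, dict[str, Any]] = {
    "asg-terminate": {
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
    },
    "spot-interruption": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    },
    "rebalance": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance Rebalance Recommendation"],
    },
    "state-change": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
    },
    "scheduled-change": {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": {"service": ["EC2"], "eventTypeCategory": ["scheduledChange"]},
    },
}


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Simplified matching: every pattern key must exist in the event, and the
    event's value must be one of the listed values (nested dicts recurse)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def matching_rules(event: dict[str, Any]) -> list[str]:
    """Names of all rules whose pattern matches the event."""
    return [name for name, pattern in RULES.items() if matches(pattern, event)]
```

Feeding it a spot warning such as {"source": "aws.ec2", "detail-type": "EC2 Spot Instance Interruption Warning", "detail": {"instance-id": "i-0123456789abcdef0"}} selects only the spot-interruption rule. Every rule would then target the same SQS queue, which is why the replicas all see the same stream of messages.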