// the find

skypilot-org/skypilot

★ 10,112 · Python · Apache-2.0 · updated Jun 2026

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).

SkyPilot is a job scheduler and resource manager for ML workloads that abstracts over 20+ clouds, Kubernetes, and Slurm behind a unified YAML/Python interface. It handles provisioning, spot instance recovery, and cost-aware placement automatically. It's aimed at ML engineers who need to run distributed training or inference across heterogeneous infrastructure without writing cloud-specific glue code.

The spot instance auto-recovery is the real payoff: it relaunches interrupted jobs without manual intervention, which matters when you're running multi-day training runs on cheap preemptible hardware. The job itself still has to write and reload its own checkpoints, typically to mounted storage. The cost-aware scheduler queries pricing and availability across clouds before provisioning, so you're not just picking a cloud by habit. Slurm support (added in v0.12) means teams with on-prem HPC clusters can fold them into the same control plane as cloud GPUs, which is hard to do any other way. The task YAML spec is minimal and portable. For ordinary jobs it carries no cloud-specific fields, so moving a job from AWS to GCP is just changing a flag.
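Relaunching only saves work if the job can pick up where it left off. Here is a minimal sketch of that resume pattern in plain Python, not SkyPilot's own API: `load_step`, `save_step`, `train`, the `state.json` filename, and the `preempt_at` knob are all hypothetical names for illustration, and `ckpt_dir` stands in for whatever persistent path (for example a mounted bucket) the job writes to.

```python
import json
import os
import tempfile


def load_step(ckpt_dir):
    """Return the last completed step recorded in ckpt_dir, or 0 if none."""
    path = os.path.join(ckpt_dir, "state.json")  # hypothetical filename
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(ckpt_dir, step):
    """Record progress via write-then-rename.

    The rename is atomic on a local POSIX filesystem; a mounted object
    store may not give the same guarantee (assumption to verify).
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, os.path.join(ckpt_dir, "state.json"))


def train(ckpt_dir, total_steps, preempt_at=None):
    """Resume from the last checkpoint and run until total_steps.

    preempt_at simulates a spot interruption by raising at that step.
    """
    step = load_step(ckpt_dir)
    while step < total_steps:
        if preempt_at is not None and step == preempt_at:
            raise RuntimeError("simulated preemption")
        step += 1  # stand-in for one real training step
        save_step(ckpt_dir, step)
    return step
```

If the first run dies at step 4, a second call to `train` with the same `ckpt_dir` starts from step 4 rather than from zero, which is the property a relaunched spot job relies on.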

The abstraction leaks badly at the edges: anything touching networking (custom VPC peering, on-prem data egress, InfiniBand topology for large multi-node runs) requires you to drop into cloud-specific config anyway. The Python SDK is large and the install surface is wide. Pulling in every cloud SDK at once invites dependency conflicts, and the optional extras don't fully isolate them. Debugging failed provisioning is opaque: when a job silently falls back through failover zones, tracing why it ended up on a different instance type means digging through log files rather than reading a clean event trail. State management is local by default (cluster metadata lives in `~/.sky/`), which creates problems for teams. The API server mode helps, but it's still relatively new and the operational story around HA and backup isn't mature.
