// the find
Netflix/metaflow
Build, Manage and Deploy AI/ML Systems
Metaflow is an ML workflow framework that originated at Netflix and is now supported by Outerbounds. It treats ML pipelines as Python classes whose methods are decorated as steps. It handles the gap between notebook experimentation and production deployment by managing compute (cloud burst to AWS/GCP/Azure), data versioning, and orchestration in one package. It's for ML engineers who want their code to run locally and scale to Kubernetes or AWS Batch without rewriting it.
The step-based DAG model is well designed: `@step` methods pass state by assigning to `self.`, and each assignment is persisted as a versioned artifact, which gives you versioning and resumability without wiring up a separate artifact store. The `foreach` parallelism primitive is practical: fan out hyperparameter sweeps or data partitions with one argument to `self.next`, then collect the results in a join step. It is battle-tested at Netflix scale (3000+ projects, petabytes of artifacts), which means the failure and retry handling isn't theoretical. Multi-cloud support is real, not aspirational: the same flow can target AWS Batch, Kubernetes, or local execution with a flag change.
The hosted metadata service and full production setup require Outerbounds (the commercial company) or significant self-hosting effort; the open-source path to a real production deployment is hard and poorly documented compared to just paying for their SaaS. The artifact storage model serializes Python objects with pickle, which means Python version pinning becomes load-bearing and cross-language artifact access is painful despite the R client existing. The card/visualization system feels bolted on rather than designed in: it works, but you'll find yourself fighting it once you want anything beyond the built-in chart types. Debugging failed remote tasks still means digging through cloud logs manually; the local stack trace experience doesn't carry over to Kubernetes runs.
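The pickle point is easier to see with a small standard-library sketch. It does not use Metaflow's datastore code; the `artifact` dictionary is a hypothetical stand-in for something a step assigned to `self`, and the JSON comparison is only there to show what a language-neutral format looks like next to it.

```python
import json
import pickle
import sys

# Hypothetical artifact, standing in for a value assigned to self in a step.
artifact = {"best_lr": 0.01, "scores": [0.71, 0.84, 0.79]}

# pickle produces a Python-specific byte stream. Reading it back needs a
# Python interpreter (or a pickle-compatible reader) and, for custom classes,
# the same class definitions importable on the reading side.
blob = pickle.dumps(artifact)
restored = pickle.loads(blob)

# Report the interpreter that wrote the bytes; this is the kind of detail
# that has to stay compatible between the writer and every later reader.
print(
    f"written by Python {sys.version_info.major}.{sys.version_info.minor}, "
    f"{len(blob)} bytes"
)

# Plain data like this can also be expressed as JSON, which any language reads.
as_json = json.dumps(artifact)
print(as_json)
```

For plain dictionaries and lists the round trip is uneventful. The pain described above tends to show up with pickled library objects such as models or data frames, where the reader also needs compatible library versions.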