// the find
argoproj/argo-workflows
Workflow Engine for Kubernetes
Argo Workflows is a CNCF-graduated Kubernetes-native workflow engine that defines each step as a container, supporting DAGs, loops, conditionals, and artifact passing between steps. It's the de facto standard for ML pipelines and batch processing on Kubernetes, with Kubeflow Pipelines, Netflix Metaflow, and 200+ orgs running it in production. If you're orchestrating anything more complex than a cron job on Kubernetes, this is where serious teams land.
1. The DAG and Steps execution models are genuinely well-designed — dependencies are declared explicitly in YAML, the controller handles scheduling, and you get per-step retry, timeout, and resource limits without writing orchestration code. 2. Artifact passing between steps is first-class: S3, GCS, Azure Blob, and more are all supported, so you're not hacking tmp volumes or writing glue scripts to move data between containers. 3. The Hera Python SDK (argoproj-labs/hera) lets you define workflows in typed Python instead of raw YAML, which is a significant quality-of-life improvement for ML teams who don't want to write YAML for a living. 4. The workflow archive + Prometheus metrics + structured logging combination means you can actually debug production failures — the audit trail is there, not bolted on.
1. The YAML surface area is enormous and the spec is not discoverable — writing a non-trivial workflow from scratch without the docs open is painful, and IDE support only helps so much when the schema has hundreds of optional fields. 2. DinD (docker-in-docker) support is still listed as a feature, which means teams are still doing this in production despite it being a security anti-pattern that most cluster admins will push back on. 3. The controller's architecture means every workflow runs through a single controller deployment per namespace — at high concurrency you'll hit queue depth issues and need to tune parallelism limits carefully, which isn't obvious until you're already in trouble. 4. Multi-cluster workflows require third-party tooling (e.g., Admiralty) and there's no native cross-cluster DAG primitive, so anything spanning clusters is a DIY integration problem.