// the find

kedro-org/kedro

★ 10,887 · Python · Apache-2.0 · updated Jun 2026

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

Kedro is a Python framework for structuring data science and ML projects as proper software — dependency-resolved pipelines, a declarative data catalog, and a project scaffold that enforces separation of concerns. It's aimed at teams where data scientists write notebooks and engineers spend weeks trying to productionize them. The framework tries to close that gap by making the 'correct' structure the default.

The Data Catalog is the strongest part: declare inputs and outputs in YAML, and Kedro handles loading and saving across local files, S3, Azure Blob, and databases, so your pipeline functions just receive Python objects. Automatic DAG resolution means you write pure functions that name their inputs and outputs, and Kedro works out the execution order and parallelism from those names, with no manual dependency wiring. The deployment matrix is broad: adapters for Airflow, Prefect, Kubeflow, AWS Batch, and Databricks exist and are maintained. The project also includes agentic CI tooling (.agents/), which is an interesting bet: it codifies review and security-check skills as agent scripts rather than just shell scripts.
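To make the DAG-resolution idea concrete, here is a toy sketch in plain Python. It is not Kedro's implementation and uses none of Kedro's API; the node table, the `run` helper, and the dataset names are all hypothetical. It only illustrates the general technique: if each function declares its inputs and output by name, a topological sort can recover a valid execution order regardless of the order the nodes were listed in.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


# Pure functions: they receive Python objects and return Python objects.
def clean(raw):
    return [x for x in raw if x is not None]


def total(cleaned):
    return sum(cleaned)


def report(cleaned, total_value):
    return f"{len(cleaned)} rows, total {total_value}"


# Hypothetical node table: (function, input names, output name).
# Deliberately listed out of order to show that listing order is irrelevant.
NODES = [
    (report, ["cleaned", "total_value"], "summary"),
    (total, ["cleaned"], "total_value"),
    (clean, ["raw"], "cleaned"),
]


def run(nodes, initial_data):
    """Resolve the execution order from dataset names, then run each node."""
    producers = {out: (fn, ins) for fn, ins, out in nodes}
    # An output depends on every input that another node produces.
    graph = {
        out: {name for name in ins if name in producers}
        for out, (_, ins) in producers.items()
    }
    order = list(TopologicalSorter(graph).static_order())
    data = dict(initial_data)  # stands in for a catalog of loaded datasets
    for name in order:
        fn, ins = producers[name]
        data[name] = fn(*(data[i] for i in ins))
    return order, data


order, data = run(NODES, {"raw": [3, None, 4]})
print(order)            # ['cleaned', 'total_value', 'summary']
print(data["summary"])  # 2 rows, total 7
```

The same graph is also what makes parallelism possible in principle: nodes whose inputs are all available can run at the same time.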

Adoption on an existing project is painful: the framework assumes you start from its template, and retrofitting it onto a non-Kedro codebase means significant restructuring before you see any benefit. The plugin ecosystem is fragmented. kedro-datasets, kedro-viz, and kedro-mlflow are separate packages with independent versioning, and upgrading any one of them can break compatibility with the others, which is a recurring complaint in the Slack archive. The abstraction stack is deep enough that when something fails (a dataset not loading, a pipeline run crashing), the stack trace passes through multiple framework layers before reaching your code, making debugging harder than it should be. Parallel execution uses multiprocessing, which means everything passed between nodes must be picklable. This fails on common ML objects, and the resulting errors are confusing.
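The pickling constraint is easy to reproduce without Kedro. The sketch below uses only the standard library; `is_picklable` and `ModelWrapper` are hypothetical names invented for illustration, and the wrapper stands in for any object that carries a lock, an open handle, or a lambda. A check like this, run on node outputs before switching to a multiprocessing-based runner, is one way to surface the problem early.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, the step multiprocessing relies on."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        # Depending on the object, pickle raises PicklingError, TypeError,
        # or AttributeError, so we catch broadly here.
        return False


class ModelWrapper:
    """Hypothetical stand-in for an ML object that holds unpicklable state."""

    def __init__(self):
        self.weights = [0.1, 0.2]
        self.lock = threading.Lock()  # thread locks cannot be pickled


checks = {
    "plain list": is_picklable([1, 2, 3]),
    "lambda": is_picklable(lambda x: x + 1),
    "object holding a lock": is_picklable(ModelWrapper()),
}

for label, ok in checks.items():
    print(f"{label}: {'ok' if ok else 'NOT picklable'}")
```

Only the plain list passes. The usual workarounds are to pass plain data between nodes, or to save the object through a dataset that knows how to serialize it, rather than handing the live object across process boundaries.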
