// the find

datafold/data-diff

★ 2,989 · Python · MIT · updated May 2024

Compare tables within or across databases

data-diff compares tables within or across SQL databases using a bisecting hash algorithm that avoids transferring full tables between systems. It supports a wide range of databases and has a dbt integration for diffing models after runs. Datafold officially abandoned it in May 2024.

The bisecting hash approach is genuinely clever: it checksums segments of the keyspace and recursively splits only the segments whose checksums disagree until it isolates the differing rows, so diffing a 100M-row table doesn't require pulling 100M rows. Cross-database diffing (Postgres vs Snowflake, etc.) works without an ETL step. The dbt integration is practical for catching data pipeline regressions at the model level. Database coverage is broad: 14+ adapters, including BigQuery, Redshift, Databricks, and Trino.
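To make the bisection idea concrete, here is a minimal sketch in Python. It is not data-diff's actual implementation: in-memory dicts stand in for tables with integer primary keys, and the names `checksum`, `bisect_diff`, and `threshold` are made up for illustration. In a real tool the checksum step would presumably run as an aggregate query inside each database, so that only counts and digests cross the network.

```python
import hashlib


def checksum(rows, lo, hi):
    """Return (row count, digest) for all rows with lo <= key < hi.

    `rows` is a dict {key: value} standing in for a table (an assumption
    for this sketch; a real tool would compute this in SQL).
    """
    digest = hashlib.md5()
    count = 0
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key}:{rows[key]}\n".encode())
        count += 1
    return count, digest.hexdigest()


def bisect_diff(a, b, lo, hi, threshold=4):
    """Return the sorted keys in [lo, hi) whose rows differ between a and b."""
    if checksum(a, lo, hi) == checksum(b, lo, hi):
        return []  # segment matches, so none of its rows are fetched

    count_a, _ = checksum(a, lo, hi)
    count_b, _ = checksum(b, lo, hi)
    if max(count_a, count_b) <= threshold or hi - lo <= 1:
        # Segment is small enough: compare its rows directly.
        keys = {k for k in a if lo <= k < hi} | {k for k in b if lo <= k < hi}
        return sorted(k for k in keys if a.get(k) != b.get(k))

    # Otherwise split the key range in half and recurse into each side.
    mid = (lo + hi) // 2
    return bisect_diff(a, b, lo, mid, threshold) + bisect_diff(a, b, mid, hi, threshold)


# Two 1,000-row "tables" that differ in one changed row and one missing row.
table_a = {i: f"row-{i}" for i in range(1000)}
table_b = dict(table_a)
table_b[137] = "changed"
del table_b[842]

print(bisect_diff(table_a, table_b, 0, 1000))  # [137, 842]
```

Note that this sketch splits the key range at its midpoint, so each half holds a similar number of rows only when keys are spread fairly evenly.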

Dead project: Datafold shut it down in May 2024, no PRs will be merged, and the last commit was the shutdown notice. The bisecting hash algorithm breaks down on tables with non-uniform key distributions or composite keys where one dimension dominates. joindiff mode (the alternative algorithm) requires both tables to be in the same database, so it is no help for the cross-DB story. No streaming or CDC support: this is a point-in-time snapshot comparison only, so it tells you what's different, not when or why.
