// the find

capitalone/datacompy

★ 648 · Python · Apache-2.0 · updated Jun 2026

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

DataComPy is a DataFrame diff tool that works across Pandas, Polars, Spark, and Snowflake. It originated as a replacement for SAS PROC COMPARE and has grown into a multi-backend comparison library with human-readable reports. It is most useful for data engineers who need to validate migrations, pipeline outputs, or schema changes across different compute layers.

A consistent API across very different backends (local Pandas, a Spark cluster, Snowflake) means you write comparison logic once and run it where your data already lives. The Jinja2-templated report is the standout: you get row match rates, column-level mismatch counts, and sample rows, not just a boolean. The comparator subpackage cleanly separates numeric, string, and array comparison logic, which makes it easy to adjust float tolerance thresholds without touching the rest. Maintenance is active, with v1 just released and a published roadmap.
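To make the "more than a boolean" point concrete, here is a minimal plain-Python sketch of what a keyed, tolerance-aware comparison computes. It is not DataComPy's code or API: `compare_records`, its parameters, and the sample rows are hypothetical, and the tolerance rule (`|a - b| <= abs_tol + rel_tol * |b|`, the same shape as `numpy.isclose`) is an assumption chosen for illustration.

```python
from collections import Counter
from numbers import Real


def values_match(a, b, abs_tol, rel_tol):
    """Compare two cell values; numbers use a tolerance, everything else uses ==."""
    both_numeric = (
        isinstance(a, Real) and isinstance(b, Real)
        and not isinstance(a, bool) and not isinstance(b, bool)
    )
    if both_numeric:
        # Assumed rule for this sketch, modeled on numpy.isclose.
        return abs(a - b) <= abs_tol + rel_tol * abs(b)
    return a == b


def compare_records(base, other, key, abs_tol=0.0, rel_tol=0.0):
    """Join two lists of dicts on `key` and summarize mismatches per column."""
    base_by_key = {row[key]: row for row in base}
    other_by_key = {row[key]: row for row in other}
    shared = sorted(base_by_key.keys() & other_by_key.keys())

    column_mismatches = Counter()
    matching_rows = 0
    for k in shared:
        left, right = base_by_key[k], other_by_key[k]
        row_ok = True
        # Only columns present on both sides are compared.
        for col in left.keys() & right.keys():
            if not values_match(left[col], right[col], abs_tol, rel_tol):
                column_mismatches[col] += 1
                row_ok = False
        matching_rows += row_ok

    return {
        "only_in_base": sorted(base_by_key.keys() - other_by_key.keys()),
        "only_in_other": sorted(other_by_key.keys() - base_by_key.keys()),
        "shared_rows": len(shared),
        "matching_rows": matching_rows,
        "column_mismatches": dict(column_mismatches),
    }


# Hypothetical sample data: the same table before and after a migration.
base = [
    {"id": 1, "amount": 100.00, "status": "open"},
    {"id": 2, "amount": 250.50, "status": "closed"},
    {"id": 3, "amount": 75.25, "status": "open"},
]
other = [
    {"id": 1, "amount": 100.004, "status": "open"},
    {"id": 2, "amount": 251.00, "status": "closed"},
    {"id": 4, "amount": 10.00, "status": "open"},
]

summary = compare_records(base, other, key="id", abs_tol=0.01)
strict = compare_records(base, other, key="id")
print(summary)
print(strict)
```

With a 0.01 absolute tolerance, the small drift on `id` 1 counts as a match and only `id` 2 is flagged. With the default zero tolerance, both shared rows mismatch on `amount`. That difference is why an adjustable float threshold matters when validating migrations.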

648 stars is modest for a Capital One-backed project this useful, which suggests limited adoption outside finance and data engineering shops; community plugins and third-party integrations are thin. The Snowflake backend requires Snowpark, so the comparison executes inside Snowflake, which can add latency and cost compared to pulling the data out and comparing locally. There is no streaming or incremental comparison support: if your DataFrames are too large to materialize, you're stuck with Spark. The multi-backend abstraction also means backend-specific quirks (Polars nulls vs Pandas NaN, Spark lazy evaluation) occasionally leak through and produce surprising results.
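The null-versus-NaN quirk is easy to reproduce without any DataFrame library. The sketch below is plain Python, not DataComPy's handling: it uses `None` as a stand-in for a Polars-style null and `float("nan")` for the Pandas-style missing float, and `naive_equal` and `null_aware_equal` are hypothetical helpers.

```python
import math


def is_missing(value):
    """True for None or a float NaN, the two missing-value markers in this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def naive_equal(a, b):
    """Plain equality: what you get if missing values are not special-cased."""
    return a == b


def null_aware_equal(a, b):
    """Treat None and NaN as one 'missing' marker; missing only matches missing."""
    if is_missing(a) or is_missing(b):
        return is_missing(a) and is_missing(b)
    return a == b


nan = float("nan")
cases = {
    "null vs null": (None, None),
    "NaN vs NaN": (nan, float("nan")),
    "null vs NaN": (None, nan),
    "value vs NaN": (1.0, nan),
}

for label, (a, b) in cases.items():
    print(label, naive_equal(a, b), null_aware_equal(a, b))
```

Under plain equality, two nulls match but two NaNs do not, so the same "both sides missing" cell can count as a match on one backend and a mismatch on another unless the comparison normalizes missing values first.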
