// the find
MrPowers/chispa
PySpark test helper methods with beautiful error messages
chispa is a pytest helper for PySpark that makes DataFrame and column equality assertions readable. When a test fails, you get colored terminal output showing exactly which rows or cells mismatched rather than a wall of generic assertion text. Aimed squarely at data engineers writing unit tests for PySpark transformations.
The error messages are the real product here — color-coded row diffs with mismatched cells highlighted make debugging a failing Spark test actually bearable. The `ignore_row_order`, `ignore_nullable`, `ignore_column_order`, and `ignore_metadata` flags cover the annoying edge cases that bite you in real pipelines (nullable mismatches between `createDataFrame` and production schemas are a constant source of false failures). `assert_approx_df_equality` handles floating-point columns without requiring you to round everything upstream. CI matrix covers PySpark 3.5, 4.0, and 4.1 with Python 3.10–3.12, so it's not quietly broken on newer Spark.
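To make the row-order and approximate-equality semantics concrete, here is a dependency-free sketch of the idea in plain Python. It is not chispa's implementation and does not touch Spark: rows are plain tuples, and `mismatched_cells`, `cells_match`, and the `precision` handling are hypothetical names and choices made for illustration only.

```python
import math


def cells_match(a, b, precision):
    """Compare two cells; floats may differ by at most `precision` (absolute)."""
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=0.0, abs_tol=precision)
    return a == b


def mismatched_cells(expected, actual, precision=0.0, ignore_row_order=False):
    """Return (row_index, col_index) for every cell that differs.

    Rows are plain tuples standing in for Spark rows (an assumption of this
    sketch). Sorting is naive: it assumes cells are mutually comparable
    (no None values) and that near-equal floats do not change the sort order.
    """
    if ignore_row_order:
        expected, actual = sorted(expected), sorted(actual)
    if len(expected) != len(actual):
        raise AssertionError(
            f"row count differs: {len(expected)} != {len(actual)}"
        )
    diffs = []
    for i, (exp_row, act_row) in enumerate(zip(expected, actual)):
        if len(exp_row) != len(act_row):
            raise AssertionError(f"row {i} has a different number of columns")
        for j, (e, a) in enumerate(zip(exp_row, act_row)):
            if not cells_match(e, a, precision):
                diffs.append((i, j))
    return diffs


expected = [("alice", 1.10), ("bob", 2.00)]
actual = [("bob", 2.04), ("alice", 1.10)]

# Same data in a different order, floats within tolerance: no mismatches.
loose = mismatched_cells(expected, actual, precision=0.1, ignore_row_order=True)

# Tighter tolerance: only bob's float cell is flagged.
strict = mismatched_cells(expected, actual, precision=0.01, ignore_row_order=True)
```

The list of `(row, column)` positions is the kind of information a cell-level diff needs in order to highlight the offending cells instead of dumping both DataFrames in full.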
No support for Spark Connect or remote sessions: if your team runs Spark Connect (the default in PySpark 4.x client mode), you can't use `collect()`-based comparisons without a local SparkSession, and chispa doesn't document this. The `transforms` option on `assert_df_equality` applies caller-supplied functions (a custom sort, for example) to both DataFrames before comparing, but callers have to remember to pass it on every call, whereas a plain sort only needs `ignore_row_order=True`. There's no streaming DataFrame support, so if you're testing Spark Structured Streaming jobs, you're on your own. At 771 stars it's niche enough that you may end up being the person filing the bug when something breaks on an unusual schema.