// the find

palantir/pyspark-style-guide

★ 1,248 · Python · MIT · updated Sep 2025

This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.

A Palantir-authored opinionated style guide for PySpark, covering column references, chaining, joins, window functions, and select patterns. It's aimed at data engineers who write PySpark transforms and want consistent, maintainable code across a team. Comes with actual Pylint checkers that can enforce some of the rules automatically.

The window function section is the most valuable part. The implicit frame change when you add `orderBy()` (the default frame shrinks from the whole partition to everything up to the current row) is a genuine footgun that bites experienced engineers, and the examples here are concrete and runnable. The join aliasing pattern (using `df.alias()` instead of bulk-renaming columns before joining) is the right call and saves real pain. The logical operation refactoring section goes beyond "split it up" and catches a real redundant condition in its own example code, which is rare for style guides. The Pylint checkers in `src/checkers/` mean some of this isn't just aspirational: you can actually enforce it in CI.
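To make the frame footgun concrete, here is a minimal plain-Python model of Spark's two default frames. It is a sketch of the semantics, not PySpark itself and not code from the guide; the `windowed_sum` helper and the sample rows are hypothetical. The unordered case corresponds to `Window.partitionBy("key")`, the ordered case to `Window.partitionBy("key").orderBy("seq")`.

```python
from collections import defaultdict


def windowed_sum(rows, ordered):
    """Model sum(value) over a window partitioned by key.

    rows: list of (key, seq, value) tuples.
    ordered=False models the default frame with no ordering: the whole partition.
    ordered=True models the default frame once an ordering is added:
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which also pulls in
    "peer" rows that share the current row's ordering value.
    """
    partitions = defaultdict(list)
    for key, seq, value in rows:
        partitions[key].append((seq, value))

    sums = []
    for key, seq, _ in rows:
        partition = partitions[key]
        if ordered:
            frame = [v for s, v in partition if s <= seq]
        else:
            frame = [v for _, v in partition]
        sums.append(sum(frame))
    return sums


# Hypothetical sample data: one partition, with a tie on seq == 2.
rows = [("a", 1, 10), ("a", 2, 20), ("a", 2, 5), ("a", 3, 30)]

unordered_sums = windowed_sum(rows, ordered=False)  # every row sees the full partition
ordered_sums = windowed_sum(rows, ordered=True)     # running total; tied rows share a value

print(unordered_sums)  # [65, 65, 65, 65]
print(ordered_sums)    # [10, 35, 35, 65]
```

The same aggregate quietly turns from a partition total into a running total. In real PySpark, spelling the frame out with `.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)` is one way to keep the partition total after adding an ordering.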

The chaining limit of 5 (or 3, the guide contradicts itself) is arbitrary and creates the kind of rule that teams argue about rather than follow. The guide says almost nothing about partitioning strategy or shuffle behavior, which is where PySpark code actually falls over in production — knowing to use `F.col()` instead of `df.col` won't save you from a 500-partition shuffle on a 1GB dataset. The Pylint checkers section is marked WIP and has been since the guide launched; they cover maybe 20% of what's described. Last real commit activity appears to be years old, so advice about Spark 3.0 improvements is the freshest technical content here.
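The partition complaint is easy to put numbers on. A back-of-the-envelope sketch, assuming a 128 MB per-partition target (a common rule of thumb chosen here for illustration, not something the guide states); both helper functions are hypothetical:

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def partition_size_mib(total_bytes, num_partitions):
    """Average data per partition, in MiB, if rows are spread evenly."""
    return total_bytes / num_partitions / MIB


def suggested_partitions(total_bytes, target_mib=128):
    """Partition count that lands near the assumed per-partition target."""
    return max(1, math.ceil(total_bytes / (target_mib * MIB)))


per_partition = partition_size_mib(1 * GIB, 500)  # the 500-partition shuffle from above
suggested = suggested_partitions(1 * GIB)

print(f"{per_partition:.3f} MiB per partition")   # 2.048 MiB per partition
print(f"{suggested} partitions at a 128 MiB target")  # 8 partitions at a 128 MiB target
```

At roughly 2 MiB per partition, scheduling overhead can outweigh the work each task does. For context, `spark.sql.shuffle.partitions` defaults to 200, so a 1GB job is already over-partitioned by this measure before anyone raises it to 500.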
