// the find

AlexIoannides/pyspark-example-project

★ 2,111 · Python · updated Jan 2023

Implementing best practices for PySpark ETL jobs and applications.

A template project showing one opinionated way to structure PySpark ETL jobs: keep transform functions separate from extract and load, ship config as a JSON file, and package dependencies as a zip for spark-submit. Aimed at data engineers who are new to PySpark and want a working scaffold rather than a blank screen. The example job itself is trivial (an employees report), but the structure it demonstrates is the point.
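The config-as-a-JSON-file idea can be sketched roughly as follows. This is a minimal illustration, not the repo's code: `load_job_config`, the `*config.json` naming convention, and the `steps_per_floor` key in the usage note are assumptions made for the example.

```python
import json
from pathlib import Path


def load_job_config(directory):
    """Return the parsed contents of the first *config.json file, or None.

    Hypothetical helper: the job reads its parameters from a JSON file
    shipped alongside it instead of parsing command-line arguments.
    """
    # Sort so the choice is deterministic if several files match.
    candidates = sorted(Path(directory).glob("*config.json"))
    if not candidates:
        return None
    with candidates[0].open(encoding="utf-8") as handle:
        return json.load(handle)
```

With a file such as `etl_config.json` containing `{"steps_per_floor": 21}` in the given directory, the job would read `config["steps_per_floor"]` and never touch `sys.argv`.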

The extract/transform/load separation is solid: isolating transforms into pure DataFrame-in, DataFrame-out functions makes unit testing feasible without a real cluster. The start_spark() helper that detects interactive vs. spark-submit context is a genuine quality-of-life win that most teams reinvent badly on their own. Test data is checked in as parquet, so tests run without external dependencies. The config-via-JSON-file pattern sidesteps the fragility of passing argparse-style arguments through spark-submit.
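One common way to tell an interactive session from a spark-submit run is to check whether the `__main__` module has a `__file__` attribute. The sketch below assumes that heuristic plus a hypothetical `DEBUG` environment variable as a manual override. `is_interactive` is an illustrative name, and the repo's start_spark() may well do this differently.

```python
import os
import sys


def is_interactive(main_module=None, environ=None):
    """Guess whether code runs in a REPL or debugger rather than as a script.

    Assumption: a job launched from a file has __main__.__file__ set,
    while an interactive console usually does not.
    """
    if main_module is None:
        main_module = sys.modules["__main__"]
    if environ is None:
        environ = os.environ
    in_repl = not hasattr(main_module, "__file__")
    # Hypothetical override so a debugger run can opt in explicitly.
    debug_requested = "DEBUG" in environ
    return in_repl or debug_requested
```

A helper like this lets one entry point build a local session when poked at from a console and defer to spark-submit's settings otherwise. Both arguments exist only so the check can be tested without a real REPL.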

The last commit was in January 2023 and the repo has not been touched since. Spark has moved on (Spark 3.4/3.5, Python 3.11+), and some patterns here are dated. Pipenv is the dependency tool of choice, a controversial call in 2024, when many teams have moved to uv or Poetry. The single example job is so simple (group by, rename columns) that it demonstrates nothing hard: no incremental loads, no schema evolution, no handling of bad records, no partitioning strategy. The build_dependencies.sh script is a shell one-liner with no error handling, and it breaks silently if pipenv isn't on PATH.
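For comparison, the kind of guard the build script is described as missing fits in a few lines. It is sketched in Python here rather than shell, and `require_tool` is a hypothetical name, not something from the repo.

```python
import shutil


def require_tool(name, which=shutil.which):
    """Return the full path of an executable, or fail loudly if it is absent."""
    path = which(name)
    if path is None:
        # Stop the build with a clear message instead of carrying on.
        raise RuntimeError(f"{name} not found on PATH; install it before building")
    return path
```

Calling `require_tool("pipenv")` at the top of a build step turns a silent failure into an immediate, readable error. The `which` parameter is only there so the lookup can be swapped out in tests.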
