// the find

spark-examples/pyspark-examples

★ 1,359 · Python · updated Dec 2025

Pyspark RDD, DataFrame and Dataset Examples in Python language

A flat collection of standalone PySpark scripts covering the common DataFrame, RDD, and SQL operations you'd look up when learning Spark. It's a reference repo, not a library: think cheat sheet, not framework. Aimed at developers new to PySpark who would rather copy and run working examples than read API docs.

- Each example is a single self-contained file — no project scaffolding to untangle before you can run something

- Coverage of the awkward-to-discover operations (pivot, mapPartitions, window functions, broadcast joins) where the official docs are thin on working code

- Sample data is included in resources/ so examples actually run without sourcing your own files

- Updated as recently as Dec 2025, so the examples are unlikely to rely on embarrassingly stale PySpark APIs

- Everything is dumped in the root directory with no grouping by topic; 80+ files in a flat list is hard to navigate and will only get worse as more are added

- No tests: misspelled filenames such as 'pyspark-fulter-null.py' and 'pyspark-repace-null.py' have gone uncorrected, which suggests nobody checks that the scripts actually run

- The README is mostly a link to an external tutorial site (sparkbyexamples.com) — the repo is effectively a lead-gen artifact for that blog, not a standalone learning resource

- RDD examples are the entry point, but RDDs are the wrong abstraction for most modern Spark workloads, where the DataFrame API is the usual starting point; someone learning Spark here will pick up outdated patterns first
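
The "no tests" point could be addressed cheaply. Below is a minimal sketch of a smoke check, not something the repo ships: it assumes the scripts are UTF-8 `.py` files sitting directly in the repo root, and the function name `find_broken_scripts` is hypothetical. It only confirms that each script parses as valid Python. It does not start Spark, execute anything, or notice misspelled filenames.

```python
from pathlib import Path


def find_broken_scripts(repo_root):
    """Return (filename, problem) pairs for scripts that fail to parse.

    Hypothetical helper: it checks syntax only and never runs the scripts.
    """
    failures = []
    # Assumption: every example is a .py file directly in repo_root (flat layout).
    for script in sorted(Path(repo_root).glob("*.py")):
        source = script.read_text(encoding="utf-8")
        try:
            # compile() parses the source without executing it.
            compile(source, str(script), "exec")
        except SyntaxError as exc:
            failures.append((script.name, f"line {exc.lineno}: {exc.msg}"))
    return failures


if __name__ == "__main__":
    # Report every script in the current directory that does not parse.
    for name, problem in find_broken_scripts("."):
        print(f"{name}: {problem}")
```

A check like this would catch only the crudest breakage; confirming that the examples produce sensible output would still require running them against a real Spark installation.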
