// the find

lyhue1991/eat_pyspark_in_10_days

★ 825 · Python · updated Sep 2022

pyspark🍒🥭 is delicious，just eat it!😋😋

A Chinese-language PySpark tutorial structured as a 10-day learning path, covering RDD programming, SparkSQL, performance tuning, MLlib, and Structured Streaming. It targets Python developers with pandas/SQL experience who need to get productive with Spark quickly without learning Scala.

The progression from RDD basics through SparkSQL to performance tuning is well sequenced: you're not thrown into optimization before you understand the execution model. Practice exercises after each core section (7 RDD problems, 7 SparkSQL problems) force you to write code rather than just read it. The single-machine setup using pip install pyspark keeps environment overhead minimal, which is the right call for a tutorial. Coverage of Structured Streaming on day 10 is a meaningful inclusion that most beginner materials skip entirely.
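For a sense of how little setup that is, here is a minimal sketch of a single-machine session after pip install pyspark. The helper names `local_master` and `make_local_session` and the default app name are illustrative assumptions, not taken from the tutorial. The sketch also assumes a Java runtime is installed, since PySpark starts a local JVM.

```python
def local_master(cores=None):
    """Build a Spark master URL for single-machine mode.

    "local[*]" uses every available core; "local[N]" uses N worker threads.
    """
    return "local[*]" if cores is None else f"local[{cores}]"


def make_local_session(app_name="eat_pyspark_demo", cores=None):
    """Create (or reuse) a SparkSession that runs entirely on this machine.

    Hypothetical helper for illustration. pyspark is imported lazily so the
    module still loads on a machine where it has not been installed yet.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(cores))
        .appName(app_name)
        .getOrCreate()
    )
```

Calling `make_local_session()` and later `spark.stop()` on the returned session is the whole lifecycle here; no cluster manager or extra configuration is involved.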

The repo was last updated in September 2022 and targets Spark 3.0.1. Spark has since moved on to 3.5.x, with meaningful changes to the DataFrame API and the addition of Spark Connect (introduced in 3.4), so some code will behave differently or need adjustment. The entire content is in Chinese with no English translation, which cuts out a large portion of the potential audience despite the English repo name. The repo accidentally committed hundreds of Spark output partition files (dbscan_output.csv with 100+ part files and CRC files) directly into git, which is sloppy and inflates clone size for no reason. There is no CI and the notebooks are untested, so it's entirely possible some examples are broken on current Python/PySpark versions.
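If you clone the repo and want to see how much of the tree is job output rather than tutorial content, a rough sketch like the following can list it. The function name `find_spark_output_files` and the pattern list are assumptions for illustration; the patterns reflect the files Spark's file writers typically leave behind (part files, `.crc` checksums, and a `_SUCCESS` marker), not an exhaustive rule.

```python
import fnmatch
import os

# File-name patterns typically left in a Spark output directory (assumed list).
SPARK_OUTPUT_PATTERNS = ("part-*", "*.crc", "_SUCCESS")


def find_spark_output_files(root):
    """Return a sorted list of paths under `root` that look like Spark job output."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune git's own metadata directory so it is never descended into.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            # fnmatchcase does not treat leading dots specially, so hidden
            # checksum files such as ".part-00000.csv.crc" match "*.crc".
            if any(fnmatch.fnmatchcase(name, p) for p in SPARK_OUTPUT_PATTERNS):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Running `find_spark_output_files(".")` from the repo root gives a list you could review before, say, adding the offending output directories to .gitignore.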
