
// the find

douban/dpark

★ 2,665 · Python · BSD-3-Clause · updated Dec 2020

Python clone of Spark, a MapReduce alike framework in Python

DPark is a Python reimplementation of Spark's RDD API, built to run on Mesos clusters or locally. Douban (the Chinese social network) maintained it for its internal data pipelines. If you work in a Mesos shop and want Spark-like semantics without the JVM, this was the answer, circa 2020.

The RDD API is a faithful port: flatMap, reduceByKey, groupByKey, joins all work as expected, so anyone who knows PySpark can read this immediately. Running the same script locally or on a Mesos cluster with just a flag change (-m mesos vs nothing) is genuinely useful for development. The DAG visualization UI is a real implementation, not an afterthought — it shows stage graphs and callsite graphs, which makes debugging shuffle-heavy jobs tractable. C extensions (Cython .pyx files, crc32c in plain C) in the hot path show they actually ran this under production load.
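For readers who haven't used the RDD style, here is a minimal sketch of what a flatMap followed by reduceByKey computes, written in plain Python with no DPark or Spark dependency. The helper names `flat_map` and `reduce_by_key` are illustrative stand-ins, not DPark's API, and the sketch runs eagerly in one process rather than across partitions.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and flatten the resulting iterables."""
    return chain.from_iterable(func(r) for r in records)


def reduce_by_key(func, pairs):
    """Combine all values that share a key using a binary function."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical input standing in for the lines of a text file.
lines = ["a b a", "b c"]

words = flat_map(str.split, lines)          # "a", "b", "a", "b", "c"
pairs = ((word, 1) for word in words)       # ("a", 1), ("b", 1), ...
counts = reduce_by_key(lambda x, y: x + y, pairs)

print(counts)
```

In an RDD framework the same chain is expressed as method calls on a distributed collection, and the per-key merge is what triggers a shuffle between nodes.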

Dead project — last commit was December 2020 and Mesos itself is essentially abandoned at this point, so the primary cluster backend is infrastructure nobody is deploying. Documentation is mostly in Chinese, which is a hard blocker if you don't read it. The shuffle layer depends on Nginx as a static file server for inter-node data transfer, which is an odd operational dependency that will surprise you the first time shuffle fails silently. No path to running on YARN, Kubernetes, or any cluster scheduler that's actually relevant in 2026 — you'd be forking and rewriting the scheduler module.
