// the find

lensacom/sparkit-learn

★ 1,150 · Python · Apache-2.0 · updated Dec 2020

PySpark + Scikit-learn = Sparkit-learn

Sparkit-learn wraps scikit-learn estimators to run on PySpark RDDs, exposing a near-identical API so you can swap `CountVectorizer` for `SparkCountVectorizer` and distribute training without rewriting pipelines. It targets teams with existing sklearn workflows who need to scale to datasets that don't fit on a single machine. The abstraction layer (ArrayRDD, SparseRDD, DictRDD) handles block-level operations so sklearn code runs partition-by-partition.
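
To make the block idea concrete, here is a minimal sketch in plain Python, with lists standing in for Spark partitions. It is not the library's code: `to_blocks`, `map_blocks`, and `count_tokens` are hypothetical names invented for illustration, and the real wrappers hold numpy/scipy blocks rather than lists of strings.

```python
# Illustrative only: plain Python lists stand in for Spark partitions.
# `to_blocks`, `map_blocks`, and `count_tokens` are hypothetical helpers,
# not part of the splearn API.

def to_blocks(rows, block_size):
    """Group consecutive rows into blocks; the last block may be shorter."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def map_blocks(blocks, fn):
    """Apply fn to each block independently, as a worker would per partition."""
    return [fn(block) for block in blocks]


def count_tokens(block):
    """Block-level operation: whitespace token count for each document."""
    return [len(doc.split()) for doc in block]


docs = [
    "spark runs jobs",
    "sklearn fits models",
    "blocks of rows",
    "one more doc",
    "last",
]
blocks = to_blocks(docs, block_size=2)
per_block = map_blocks(blocks, count_tokens)

# Flattening the per-block results recovers the single-machine answer.
flat = [n for block in per_block for n in block]
```

The point of the pattern is that `count_tokens` never sees more than one block, so the same function works on a laptop or on a cluster as long as the operation is independent per block.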

The API mirroring is the main value: pipelines, grid search, and individual estimators have direct Spark counterparts with matching method signatures, so the learning curve is minimal. The block-based RDD wrappers are well thought out: ArrayRDD and SparseRDD keep numpy/scipy semantics, including indexing and slicing, so you can prototype locally and distribute without changing logic. Coverage runs across the sklearn pipeline up to and including SparkGridSearchCV, which is further than most similar projects got.

Dead project: the last commit was in December 2020, and it explicitly targets Python 2.7/3.4 and Spark 1.3+, both of which are years past end-of-life, so running it on anything current requires patching. The ROADMAP items (DataFrame support, MLlib integration) were never completed, which leaves you on the RDD API. Spark still supports RDDs, but its own RDD-based MLlib API has been in maintenance mode since Spark 2.0 in favor of DataFrames and ML pipelines. The distributed training approach, shipping model weights back to the driver after each partition, doesn't scale well for large models and can become a bottleneck on wide data. There is no documentation beyond the README; the `doc/` directory is empty.
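
The driver bottleneck is easier to see in a toy version. The sketch below assumes the per-partition weights are combined by simple averaging. That is an assumption made for illustration, not a confirmed description of the library's internals, and `fit_block` and `average_weights` are hypothetical names.

```python
# Illustrative only: lists stand in for partitions. The averaging step is an
# assumed combination rule; `fit_block` and `average_weights` are hypothetical
# names, not splearn functions.

def fit_block(block):
    """Least-squares slope through the origin for one block of (x, y) pairs."""
    sxy = sum(x * y for x, y in block)
    sxx = sum(x * x for x, _ in block)
    return [sxy / sxx]  # weight vector; its length equals the feature count


def average_weights(weight_vectors):
    """Driver-side step: element-wise mean of the per-partition weights."""
    n = len(weight_vectors)
    return [sum(col) / n for col in zip(*weight_vectors)]


partitions = [
    [(1.0, 2.0), (2.0, 4.0)],   # this block fits a slope of 2.0
    [(1.0, 4.0), (3.0, 12.0)],  # this block fits a slope of 4.0
]

local_weights = [fit_block(p) for p in partitions]  # would run on workers
model = average_weights(local_weights)               # would run on the driver

# Number of values the driver has to receive: partitions x features.
driver_payload = sum(len(w) for w in local_weights)
```

With one feature and two partitions the payload is trivial, but it grows as partitions times features, and all of it lands on a single machine. That is the cost the paragraph above describes for large models and wide data.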
