
// the find

databricks/spark-sklearn

★ 1,071 · Python · Apache-2.0 · updated Dec 2019

(Deprecated) Scikit-learn integration package for Apache Spark

A deprecated Databricks library that distributed scikit-learn's GridSearchCV across a Spark cluster. Last touched in 2019, it has been superseded by joblibspark, which covers the same use case with a two-line setup (register the Spark backend, then wrap standard scikit-learn calls in a joblib context manager) and is actively maintained. There is no reason to use this today.

- The deprecation notice is upfront and honest — it tells you exactly what to use instead and shows working code for it

- The original design was sound: it only distributed the embarrassingly parallel parts (hyperparameter search) and explicitly did not try to distribute the learning algorithm itself, which would have been much harder to get right

- The Spark DataFrame to numpy/scipy sparse matrix converters were a practical utility that filled a real gap when the library was active

- Deprecated and dead: last commit in 2019, and incompatible with scikit-learn 0.20 and later, a release line that is itself ancient at this point

- The replacement (joblibspark) is objectively simpler: no subclassing GridSearchCV, just a context manager around the standard sklearn API

- No large-dataset support by design — if your data doesn't fit in memory on the driver, this library explicitly punts you to MLlib

- Scala/SBT build artifacts in a Python repo suggest this started as a JVM project and the Python layer was grafted on, which rarely ends cleanly
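
For a sense of what the joblibspark route looks like in practice, here is a minimal sketch. The function name `run_search`, the iris dataset, and the parameter grid are illustrative choices, not taken from either project. The default backend is joblib's built-in `threading` so the snippet runs without a cluster; the Spark variant is shown in comments and assumes `joblibspark` and a working Spark installation.

```python
from joblib import parallel_backend
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV


def run_search(backend="threading", n_jobs=2):
    """Grid-search an SVC on iris, running the fits on the given joblib backend."""
    iris = datasets.load_iris()
    # Illustrative grid: 3 values of C x 2 kernels = 6 candidates, each fit 5 times.
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(svm.SVC(), param_grid, cv=5)
    # Plain scikit-learn API; only the surrounding context manager picks the backend.
    with parallel_backend(backend, n_jobs=n_jobs):
        search.fit(iris.data, iris.target)
    return search


# On a cluster with joblibspark installed (assumption, not executed here):
#   from joblibspark import register_spark
#   register_spark()                      # registers the "spark" joblib backend
#   search = run_search("spark", n_jobs=3)
```

Calling `run_search()` locally returns a fitted `GridSearchCV`; its `best_params_` and `best_score_` attributes are the same whichever backend ran the fits, which is why no GridSearchCV subclass is needed.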

