// the find
databricks/spark-sklearn
(Deprecated) Scikit-learn integration package for Apache Spark
A deprecated Databricks library that distributed scikit-learn's GridSearchCV across a Spark cluster. Last touched in 2019, it's been superseded by joblibspark, which does the same thing with a two-line setup and is actively maintained. There is no reason to use this today.
- The deprecation notice is upfront and honest: it tells you exactly what to use instead and shows working code for it
- The original design was sound: it only distributed the embarrassingly parallel parts (hyperparameter search) and explicitly did not try to distribute the learning algorithm itself, which would have been much harder to get right
- The Spark DataFrame to numpy/scipy sparse matrix converters were a practical utility that filled a real gap when the library was active
- Deprecated and dead: the last commit was in 2019, and it is incompatible with scikit-learn 0.20 and later, a release that is itself ancient at this point
- The replacement (joblibspark) is objectively simpler: no subclassing GridSearchCV, just a context manager around the standard sklearn API
- No large-dataset support by design: if your data doesn't fit in memory on the driver, this library explicitly punts you to MLlib
- Scala/SBT build artifacts in a Python repo suggest this started as a JVM project and the Python layer was grafted on, which rarely ends cleanly
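
To make the "embarrassingly parallel" point concrete, here is a minimal standard-library sketch of the pattern. It is not code from spark-sklearn or joblibspark. Each hyperparameter candidate is scored independently, so the only thing handed to the worker pool is the list of candidates. The `fit_and_score` function and its toy objective are hypothetical stand-ins for a real cross-validated model fit, and a local thread pool stands in for the Spark cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_and_score(params):
    """Hypothetical stand-in for fitting a model and returning its CV score.

    The toy objective peaks at C=1.0, gamma=0.1 so the result is easy to check.
    """
    return -((params["C"] - 1.0) ** 2) - ((params["gamma"] - 0.1) ** 2)


def grid_search(param_grid, max_workers=4):
    """Score every parameter combination in parallel and return the best one."""
    names = sorted(param_grid)
    # Expand the grid into one dict per candidate, e.g. {"C": 0.1, "gamma": 0.01}.
    candidates = [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]
    # Candidates do not depend on each other, so a plain map over a pool is
    # enough. The learning step itself runs unmodified inside one worker.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(fit_and_score, candidates))
    best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best_params, best_score


best_params, best_score = grid_search(
    {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
)
```

Swapping the pool for a cluster changes where `fit_and_score` runs but not how the search is structured, which is why the original design could avoid distributing the learning algorithm itself. It also shows why each worker needs the full dataset in memory.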