// the find

databricks/koalas

★ 3,374 · Python · Apache-2.0 · updated Mar 2024

Koalas: pandas API on Apache Spark

Koalas was a pandas-compatible DataFrame API on top of Apache Spark that let you write pandas-style code and run it distributed. It's been effectively dead since 2022: the functionality was merged into PySpark 3.2 as `pyspark.pandas`, and this repo is explicitly deprecated, kept in maintenance mode only.

The idea was sound: bridging the pandas mental model to Spark meant data scientists could scale existing code without rewriting it. The `missing/` submodule was honest about what wasn't implemented yet: a dedicated directory of stub classes such as `MissingPandasLikeSeries` that raised explicit errors rather than failing silently. The `.spark` accessor let you drop into native Spark when the pandas abstraction didn't fit.
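
The "stubs instead of silent failures" idea is easy to sketch in plain Python. This is a toy illustration of the pattern, not Koalas's actual wiring: `UnsupportedPandasMethod`, `unsupported`, `MissingSeriesMethods`, and `ToySeries` are hypothetical names, and the two stubbed methods are arbitrary pandas `Series` method names picked for the example.

```python
class UnsupportedPandasMethod(NotImplementedError):
    """Hypothetical error type: a pandas method with no implementation here yet."""


def unsupported(class_name, method_name):
    """Build a stub method that fails loudly and names what is missing."""
    def stub(self, *args, **kwargs):
        raise UnsupportedPandasMethod(
            f"{class_name}.{method_name}() is not implemented yet."
        )
    stub.__name__ = method_name
    return stub


class MissingSeriesMethods:
    """One place that lists every pandas Series method the wrapper lacks."""
    asof = unsupported("Series", "asof")
    to_xarray = unsupported("Series", "to_xarray")


class ToySeries(MissingSeriesMethods):
    """Toy stand-in for a distributed Series (assumption: in-memory list)."""

    def __init__(self, values):
        self._values = list(values)

    def sum(self):
        # An implemented method works normally.
        return sum(self._values)


s = ToySeries([1, 2, 3])
print(s.sum())
try:
    s.asof(1)
except UnsupportedPandasMethod as err:
    print(err)
```

The payoff is that a missing method is a named, greppable stub with a clear error message, so callers find out immediately which pandas call has no distributed equivalent.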

This repo is a dead end; do not adopt it. The README says so in the first line. If you're on Spark 3.2+, use `pyspark.pandas` instead. The behavior gap between Koalas and real pandas was non-trivial (different index semantics, eager vs. lazy evaluation surprises, operations that combine two different DataFrames failing at runtime by default), and those rough edges are now the upstream PySpark team's problem to fix.
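
For existing Koalas code, the first mechanical step of a move to `pyspark.pandas` is the import line. Below is a rough sketch of a source rewriter that swaps the common Koalas import spellings while keeping whatever alias the code already uses. `rewrite_koalas_imports` is a hypothetical helper, not part of either library; it assumes one import per line, leaves a bare `import databricks.koalas` untouched, and does nothing about API differences between the two packages.

```python
import re

# Matches `import databricks.koalas as X`, `from databricks import koalas`,
# and `from databricks import koalas as X`, with optional indentation.
_KOALAS_IMPORT = re.compile(
    r"^(?P<indent>\s*)"
    r"(?:import\s+databricks\.koalas\s+as\s+(?P<a1>\w+)"
    r"|from\s+databricks\s+import\s+koalas(?:\s+as\s+(?P<a2>\w+))?)"
    r"\s*$"
)


def rewrite_koalas_imports(source):
    """Return `source` with Koalas imports pointed at pyspark.pandas."""
    lines = source.split("\n")
    for i, line in enumerate(lines):
        match = _KOALAS_IMPORT.match(line)
        if match:
            # Keep the existing alias so the rest of the file still resolves.
            alias = match.group("a1") or match.group("a2") or "koalas"
            lines[i] = f"{match.group('indent')}import pyspark.pandas as {alias}"
    return "\n".join(lines)


old = "import databricks.koalas as ks\nkdf = ks.DataFrame({'a': [1]})\n"
print(rewrite_koalas_imports(old))
```

Keeping the old alias means call sites stay untouched for the first pass; renaming it to something like `ps` can be a separate, purely cosmetic change once the code runs.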

View on GitHub →