// the find

deepseek-ai/smallpond

★ 4,959 · Python · MIT · updated Mar 2025

A lightweight data processing framework built on DuckDB and 3FS.

smallpond is a distributed data processing framework from DeepSeek that wraps DuckDB with a Spark-like DataFrame API and uses 3FS (their custom distributed filesystem) for storage. It targets teams that need to process terabyte-to-petabyte-scale datasets without standing up a long-running cluster like Spark. The GraySort benchmark result reported in the README (110.5TiB sorted in just over 30 minutes) is real and impressive.

No long-running services are required: jobs are ephemeral, which kills a whole class of ops headaches you get with Spark or Flink. With DuckDB as the execution engine, single-node workloads are genuinely fast and the SQL dialect is modern. The logical/physical plan separation with an optimizer layer shows this isn't just a thin wrapper; there's actual query planning happening. MPI-based execution (`platform/mpi.py`) means it scales horizontally without a custom scheduler daemon.
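To make the "ephemeral jobs over a single-node SQL engine" idea concrete, here is a minimal sketch of the general pattern: hash-partition the data, run the same query on each partition in a throwaway engine instance, then combine the results. It assumes `sqlite3` as a stand-in for DuckDB so that it runs on the standard library alone, and the names `hash_partition` and `run_on_partition` are hypothetical helpers, not smallpond APIs.

```python
import sqlite3
import zlib
from collections import defaultdict


def hash_partition(rows, key_index, num_partitions):
    """Route each row to a partition using a stable hash of its key column."""
    partitions = defaultdict(list)
    for row in rows:
        bucket = zlib.crc32(str(row[key_index]).encode()) % num_partitions
        partitions[bucket].append(row)
    return [partitions[i] for i in range(num_partitions)]


def run_on_partition(rows, sql):
    """Load one partition into a throwaway in-memory database and query it.

    Nothing outlives the call, which mirrors the 'no long-running service' idea.
    sqlite3 is a stand-in here; it is NOT the engine smallpond uses.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE part (ticker TEXT, price REAL)")
        conn.executemany("INSERT INTO part VALUES (?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


rows = [("AAA", 10.0), ("BBB", 7.5), ("AAA", 12.0), ("CCC", 3.0), ("BBB", 9.0)]
sql = "SELECT ticker, MIN(price), MAX(price) FROM part GROUP BY ticker"

# Because rows are partitioned by the GROUP BY key, every ticker lives in
# exactly one partition, so per-partition results can simply be concatenated.
results = []
for part in hash_partition(rows, key_index=0, num_partitions=3):
    results.extend(run_on_partition(part, sql))
results.sort()
print(results)
```

In a real deployment each partition would be handled by a separate worker process rather than a loop, but the shape of the computation is the same.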

The 3FS dependency is the elephant in the room. 3FS is DeepSeek's own distributed filesystem, and running it outside their environment is non-trivial; without it you're limited to local or S3-backed storage, which is not the benchmark story they're selling. The last commit was in March 2025, so active development effectively stalled about three months in; treat this as a research artifact more than a maintained product. The `partial_sql` API (passing `{0}` as a table placeholder) is awkward and will confuse anyone used to SQLAlchemy or DuckDB's native Python API. Python 3.13 support is missing even though 3.13 has been stable for months.
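For readers who haven't seen the `{0}` convention, the sketch below shows how a positional table placeholder behaves when it is filled in with Python's `str.format`. This is an illustration of the pattern only: `render_partial_sql` is a hypothetical helper, and it is an assumption, not a confirmed detail, that smallpond substitutes placeholders this way internally.

```python
def render_partial_sql(template: str, *table_names: str) -> str:
    """Hypothetical helper: fill {0}, {1}, ... with table names, in order."""
    return template.format(*table_names)


# One input: {0} refers to the first (and only) table passed in.
query = "SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker"
rendered = render_partial_sql(query, "prices_partition")
print(rendered)

# Two inputs: the placeholder index is the position of the argument,
# so swapping the arguments silently swaps the tables in the query.
join_query = "SELECT * FROM {0} JOIN {1} USING (ticker)"
joined = render_partial_sql(join_query, "prices", "volumes")
print(joined)
```

The awkwardness is visible even in this toy version: the query text no longer names its tables, so you have to read the call site to know what `{0}` and `{1}` mean. With `str.format`-style templates, any literal braces in the SQL would also need to be doubled.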
