
// the find

apache/datafusion-ballista

★ 2,060 · Rust · Apache-2.0 · updated Jun 2026

Apache DataFusion Ballista Distributed Query Engine

Ballista adds distributed execution on top of Apache DataFusion, letting you scale out existing DataFusion queries to a scheduler/executor cluster with minimal code changes. It's aimed at teams who already use DataFusion and need to run larger-than-single-node workloads without adopting Spark. The TPC-H benchmarks show a ~2.9x speedup over Spark, though that comparison is intentionally limited to a single node.

- The migration story is genuinely low-friction: swap SessionContext::new() for SessionContext::standalone() and you're done, which is a real differentiator over Spark-based migrations.

- The Adaptive Query Execution (AQE) implementation in the scheduler is fairly complete: dynamic join selection, partition coalescing, and exchange optimization are all present, not just stubs.

- Sort shuffle with spill-to-disk support is implemented as a proper subsystem with its own buffer, index, and multi-stream reader, not a naive in-memory shuffle.

- Good operational tooling: Prometheus metrics, Kubernetes KEDA autoscaler integration, graphviz plan visualization, and a terminal UI for job inspection are all available as optional features.

- The README explicitly warns of a 'gap between DataFusion and Ballista which may bring incompatibilities'. That's honest, but it also means you can't assume any given DataFusion query will just work when distributed; expect to hit edge cases in production.

- Scheduler state is in-memory only (ballista/scheduler/src/cluster/memory.rs). There's no persistent cluster state backend, so a scheduler restart loses all job history and in-flight jobs are gone.

- The benchmark comparison is single-node Spark vs single-node Ballista, which doesn't validate the distributed scaling story at all. It's measuring single-process DataFusion vs JVM overhead, not distributed throughput.

- Python bindings are listed as a topic but there's no Python crate or bindings code visible in the tree, just some benchmark Python scripts. The python tag in topics is misleading for anyone hoping to use this from Python.
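
For readers who haven't met the partition coalescing mentioned in the AQE point above, here is a rough sketch of the general idea: once a shuffle stage finishes and the real partition sizes are known, adjacent small partitions get merged so that downstream tasks each handle roughly a target amount of data. This is an illustration only, not Ballista's actual code; the function name `coalesce_partitions`, the `target_bytes` parameter, and the greedy strategy are all assumptions made for the example.

```rust
/// Illustrative only: greedily group adjacent shuffle partitions so that each
/// group holds at most `target_bytes`, unless a single partition is already
/// larger than the target (it then gets a group of its own).
/// Returns groups of original partition indices, in order.
fn coalesce_partitions(sizes: &[u64], target_bytes: u64) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_bytes: u64 = 0;

    for (idx, &size) in sizes.iter().enumerate() {
        // Close the current group if adding this partition would exceed the target.
        if !current.is_empty() && current_bytes.saturating_add(size) > target_bytes {
            groups.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(idx);
        current_bytes = current_bytes.saturating_add(size);
    }

    // Flush whatever is left over as the final group.
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

With hypothetical sizes `[10, 20, 200, 5, 5, 5, 90]` and a target of `100`, seven partitions collapse into four groups: `[0, 1]`, `[2]`, `[3, 4, 5]`, and `[6]`. The point of doing this adaptively is that the sizes are only known at runtime, so a static planner can't make the same call.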

