// the find
apache/datafusion
Apache DataFusion SQL Query Engine
Apache DataFusion is an embeddable SQL query engine written in Rust, built on Apache Arrow's columnar in-memory format. It's aimed at developers building custom database systems, query layers, or data pipelines who want a production-grade execution engine they can extend rather than write from scratch. Projects like InfluxDB, GlareDB, and Comet (Spark accelerator) are built on top of it.
- Genuinely fast: columnar vectorized execution with multi-threaded parallelism, and ClickBench numbers are competitive with dedicated OLAP systems. The Arrow-native format avoids serialization overhead throughout the pipeline.
- Extremely extensible at every layer — custom data sources, user-defined functions (scalar, aggregate, window), custom physical and logical operators, and even alternate query frontends. The trait system is well-designed for this.
- Active, high-velocity project with ~100+ commits per month and a real Apache governance process. Deprecation policy is documented and followed, which matters for a library you're embedding.
- Rich example directory and multiple real-world benchmark suites (TPC-H, ClickBench, IMDB JOB) committed to the repo, making it straightforward to validate performance for your workload.
- The optimizer is rule-based with limited cost-based join ordering. For complex multi-table queries with skewed statistics, you'll hit cases where the planner makes poor decisions and there's no easy knob to fix it without writing a custom optimizer rule.
- API churn is real despite the deprecation policy — major versions have broken extension APIs multiple times, and building on internal traits means you're reading changelogs carefully with every upgrade.
- No built-in distributed execution: DataFusion is single-node. Ballista (the distributed layer) was spun out and is much less mature. If you need to scale out, you're either implementing your own scheduler or adopting something like Ray or building on top yourself.
- Compilation times are painful — the crate is large and heavily generic. Incremental builds in a project that depends on datafusion add noticeable overhead, and the full dependency tree is substantial.