// the find
lakehq/sail
Drop-in Apache Spark replacement written in Rust, unifying batch processing, stream processing, and compute-intensive AI workloads.
Sail is a Rust-native reimplementation of the Apache Spark compute engine that speaks the Spark Connect protocol, so existing PySpark code connects to it with no changes beyond the connection string. It's built on DataFusion and Arrow, targeting teams running heavy Spark workloads who want to ditch the JVM and its associated operational pain. Early-stage but actively developed, with real benchmark numbers to back the performance claims.
The Spark Connect protocol compatibility is genuinely clever: you swap the server, keep all your existing PySpark client code, and the only change is a connection string. Zero-copy Python UDFs via Arrow array pointers are a real performance win; Spark's Python UDF overhead is one of the most complained-about things in the ecosystem. The benchmark methodology is specific enough to be credible: named instance type, named scale factor, 22 queries listed, not just '4x faster' marketing vaporware. Catalog breadth (Glue, Unity, Iceberg REST, Hive Metastore, OneLake) means it fits into existing lakehouse setups rather than requiring a greenfield deployment.
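To make the "only change is a connection string" point concrete, here is a minimal sketch of what the swap might look like from the client side. The helper names `sail_remote_url` and `connect` are hypothetical, and the host and port defaults are assumptions about a local Sail server rather than anything stated in the repo summary; `SparkSession.builder.remote(...)` is the standard PySpark entry point for Spark Connect.

```python
def sail_remote_url(host: str = "localhost", port: int = 50051) -> str:
    """Build a Spark Connect URL of the form sc://host:port.

    The default host and port are assumptions for a locally running server;
    use whatever address your deployment actually listens on.
    """
    return f"sc://{host}:{port}"


def connect(url: str):
    """Open a Spark Connect session against `url`.

    Requires PySpark with Spark Connect support installed and a server
    listening at that address. PySpark is imported lazily so the URL helper
    above stays usable without it.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


# Hypothetical usage, assuming a server is already running:
#   spark = connect(sail_remote_url())
#   spark.sql("SELECT 1 AS ok").show()
```

Everything after the session is created would be ordinary PySpark code, which is the whole appeal: the server behind the URL changes, the client code does not.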
The compatibility check script explicitly warns that it only verifies whether functions are implemented, not behavioral parity, so for anything beyond straightforward SQL and DataFrame ops, you're discovering gaps in production. Streaming support is listed in the headline, but the directory tree shows no streaming-specific crates beyond a proto stub for a 'streaming marker'; the claim of unified batch/stream/AI is ahead of the code. Distributed cluster mode exists, but the Kubernetes deployment story is manual YAML authoring with no operator or Helm chart, which is a meaningful ops burden. Python UDF support is there, but Spark's broader ecosystem (MLlib and its ML pipelines, GraphX) has no visible coverage path, so 'drop-in replacement' depends heavily on which parts of Spark you actually use.
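Since the compatibility script does not check behavioral parity, one way to find gaps before production is to run the same query on both engines and diff the collected rows. The sketch below is one possible approach, not something the project ships: `normalize` and `parity_diff` are hypothetical helpers, rows are assumed to be tuples of hashable values (as `df.collect()` returns for flat schemas), and floats are rounded on the assumption that tiny numeric differences between engines are acceptable.

```python
from collections import Counter


def normalize(rows, float_digits=9):
    """Convert rows into a multiset of tuples, ignoring row order.

    Floats are rounded to `float_digits` places so that small numeric
    differences between engines are not reported as mismatches.
    """
    def norm(value):
        return round(value, float_digits) if isinstance(value, float) else value

    return Counter(tuple(norm(v) for v in row) for row in rows)


def parity_diff(reference_rows, candidate_rows):
    """Return rows present in only one of the two result sets.

    'missing' holds rows the reference produced but the candidate did not;
    'unexpected' holds rows only the candidate produced.
    """
    ref, cand = normalize(reference_rows), normalize(candidate_rows)
    return {
        "missing": list((ref - cand).elements()),
        "unexpected": list((cand - ref).elements()),
    }


# Hypothetical usage with two sessions, one per engine:
#   diff = parity_diff(spark_jvm.sql(q).collect(), spark_sail.sql(q).collect())
```

Comparing as multisets keeps duplicate rows significant while ignoring ordering, which matters because queries without an `ORDER BY` have no guaranteed row order on either engine. Nested or array columns would need extra normalization before they can be hashed.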