// the find

delta-io/delta

★ 8,846 · Scala · Apache-2.0 · updated Jun 2026

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement, and time travel to data lakes built on Parquet files. It's the open-source core of Databricks' Lakehouse platform, and the primary target is teams running Spark-based data pipelines who want database-like guarantees without moving to a traditional warehouse. If you're not in the Spark/big-data ecosystem, this is probably not for you.

ACID transactions are implemented with optimistic concurrency over a transaction log (the `_delta_log` directory of JSON commit files plus periodic Parquet checkpoints), and the design is genuinely good: the protocol spec is public and readable. Time travel and schema evolution are first-class: you can query `VERSION AS OF` or `TIMESTAMP AS OF` without any additional infrastructure. The connector ecosystem is real and maintained; Trino, Flink, Hive, and PrestoDB all have working connectors, not just stubs. The delta-rs Rust implementation with Python bindings means you can read and write Delta tables without a JVM, which is a significant practical win for non-Spark users.
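To make the log idea concrete, here is a toy, standard-library-only sketch of how numbered commit files can give both optimistic concurrency and time travel. It is not Delta's implementation: the helper names (`try_commit`, `live_files`, `latest_version`) are hypothetical, and it ignores checkpoints, metadata and protocol actions, and the logical conflict checks a real writer performs before retrying. The only details taken from the real format are the 20-digit zero-padded commit file names and the newline-delimited JSON `add`/`remove` actions.

```python
import json
import os
import tempfile


def commit_path(log_dir, version):
    # Delta names commit files with a 20-digit zero-padded version number.
    return os.path.join(log_dir, f"{version:020d}.json")


def try_commit(log_dir, version, actions):
    """Try to write commit `version`; return False if another writer got there first."""
    # O_EXCL makes creation fail if the file already exists (put-if-absent).
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    try:
        fd = os.open(commit_path(log_dir, version), flags)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


def latest_version(log_dir):
    versions = [int(n[:-5]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def live_files(log_dir, as_of=None):
    """Replay add/remove actions up to `as_of` (inclusive) to get the table's data files."""
    last = latest_version(log_dir) if as_of is None else as_of
    files = set()
    for version in range(last + 1):
        with open(commit_path(log_dir, version)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


log_dir = tempfile.mkdtemp()
try_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
try_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
# A rewrite (for example, compaction) removes the old files and adds a new one.
try_commit(log_dir, 2, [
    {"remove": {"path": "part-0.parquet"}},
    {"remove": {"path": "part-1.parquet"}},
    {"add": {"path": "part-2.parquet"}},
])
# A second writer that still believes version 2 is free loses the race.
lost_race = not try_commit(log_dir, 2, [{"add": {"path": "part-x.parquet"}}])
```

Calling `live_files(log_dir, as_of=1)` replays only the first two commits, which is the essence of `VERSION AS OF`: old versions stay readable as long as the log entries and the data files they reference still exist.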

The Spark coupling is deep: most write-path features (OPTIMIZE, Z-ORDER, MERGE INTO with full semantics) require Spark; delta-rs covers reads and simple writes but doesn't implement the full feature set. Checkpointing (transaction log compaction) and VACUUM (deleting data files the table no longer references) are maintenance operations you have to remember to run, and forgetting VACUUM will silently eat your storage budget. The README still links a 2022 roadmap as the canonical roadmap document, which is a bad sign for project communication hygiene. The multi-engine write story has real gaps: concurrent writes from two non-Spark engines can conflict in ways that are hard to debug without understanding the protocol internals.
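The storage cost of skipping VACUUM is easier to see with a small model of the selection rule. The sketch below is a simplification under stated assumptions, not the real command: `vacuum_candidates` and its input shape (a dict of path to last-modified time) are hypothetical, and the 7-day window mirrors Delta's default retention for removed files. A file is eligible for deletion only if the current table version no longer references it and it is older than the retention cutoff.

```python
from datetime import datetime, timedelta


def vacuum_candidates(files_on_disk, live_files, now, retention=timedelta(days=7)):
    """Return paths that are unreferenced and older than the retention window.

    files_on_disk maps path -> last-modified datetime (hypothetical input shape).
    """
    cutoff = now - retention
    return sorted(
        path
        for path, modified in files_on_disk.items()
        if path not in live_files and modified < cutoff
    )


now = datetime(2026, 6, 30)
on_disk = {
    "part-0.parquet": now - timedelta(days=30),  # rewritten long ago
    "part-1.parquet": now - timedelta(days=2),   # rewritten recently, still inside retention
    "part-2.parquet": now - timedelta(days=2),   # still referenced by the table
}
candidates = vacuum_candidates(on_disk, live_files={"part-2.parquet"}, now=now)
```

Until something like this runs, every file that was ever rewritten stays in storage. The trade-off is that once those files are deleted, time travel to versions that referenced them no longer works.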
