// the find
apache/iceberg
Apache Iceberg
Apache Iceberg is an open table format for large analytic datasets, designed to let multiple engines (Spark, Flink, Trino, Hive) read and write the same tables safely and concurrently. It solves the classic data lake problems: no atomicity, schema drift, partition evolution pain, and the inability to do concurrent writes without stepping on each other. If you're running a lakehouse on S3/GCS/ADLS and you're still using Hive-style partitioning with manual partition management, this is what replaces that.
The snapshot isolation model is the real win: every write commits a new snapshot, readers always see a consistent view, and you get time-travel queries without any extra infrastructure, for as long as the snapshots you care about haven't been expired. Partition evolution works without rewriting existing data; you can change how a table is partitioned and old files stay where they are, new files use the new scheme, and queries work across both. Hidden partitioning means whoever writes the query doesn't need to know your partition scheme: you filter on the source column, and Iceberg translates the predicate into partition filters using the partition transforms recorded in table metadata, then prunes further with per-file column statistics. The REST catalog spec has become a real interoperability layer: Apache Polaris, Unity Catalog, and other catalog implementations all speak the same protocol now, so you're not locked into a single vendor's metadata store.
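To make the hidden partitioning idea concrete, here is a minimal pure-Python sketch of the mechanism, not Iceberg's actual API. The names `day_transform`, `DataFile`, and `plan_scan` are made up for illustration, and the file paths are hypothetical. It assumes a table partitioned by a day transform on a timestamp column: the writer derives the partition value, and the planner maps a timestamp range onto day values so the query never mentions the partition column.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Days since the Unix epoch (the idea behind a `day` partition transform)."""
    return (ts - EPOCH).days


@dataclass(frozen=True)
class DataFile:
    path: str           # hypothetical file path
    partition_day: int  # derived by the writer from the timestamp column


def plan_scan(files, ts_from: datetime, ts_to: datetime):
    """Return paths of files that may hold rows with ts_from <= ts < ts_to.

    The timestamp predicate is projected onto day values. The projection is
    deliberately conservative: it can keep a file with no matching rows, so a
    real engine would still apply the row-level filter afterwards.
    """
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [f.path for f in files if lo <= f.partition_day <= hi]


utc = timezone.utc
files = [
    DataFile("data/a.parquet", day_transform(datetime(2024, 3, 1, 8, tzinfo=utc))),
    DataFile("data/b.parquet", day_transform(datetime(2024, 3, 2, 23, tzinfo=utc))),
    DataFile("data/c.parquet", day_transform(datetime(2024, 3, 5, 0, tzinfo=utc))),
]

# The "query" only talks about timestamps, never about days or partitions.
selected = plan_scan(
    files,
    datetime(2024, 3, 2, tzinfo=utc),
    datetime(2024, 3, 3, 12, tzinfo=utc),
)
print(selected)
```

Only the file written for 2 March survives planning here; the other two are skipped without being opened. Partition evolution fits the same picture: files written under an older spec would just carry a different transform, and the planner would project the predicate once per spec.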
Small file accumulation is a real operational burden: Iceberg doesn't prevent you from writing millions of 10 KB files, and compaction via RewriteDataFiles is something you have to schedule and tune yourself; get it wrong and your read performance tanks. The equality delete file mechanism for row-level updates is clever on paper but becomes a read-time merge problem: until you compact those delete files into data files, every scan has to apply them, and this gets expensive fast on busy, update-heavy (OLTP-style) workloads. The Java reference implementation carries significant dependency weight: pulling in the iceberg-spark-runtime artifact for Spark 3.x brings a large shaded jar, plus version negotiation pain if your Spark and Scala versions don't exactly match a supported combination. Getting a realistic local dev environment usually means Docker and a real metastore (or REST catalog server); a filesystem-backed Hadoop catalog works for quick experiments, but there's no trivial local-only mode that resembles production and that a newcomer can stand up in five minutes.
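The read-time merge cost is easier to see in a toy model. The sketch below is plain Python, not Iceberg code; `DataFileRows`, `EqualityDelete`, `scan`, and `compact` are hypothetical names. It assumes the rule from the table spec that an equality delete applies only to data files with a strictly lower sequence number, which is what lets an upsert write a delete and the replacement row in the same commit.

```python
from dataclasses import dataclass


@dataclass
class DataFileRows:
    seq: int     # sequence number of the commit that added the file
    rows: list   # list of dict rows, standing in for file contents


@dataclass
class EqualityDelete:
    seq: int         # sequence number of the commit that added the delete
    key_column: str  # column the delete matches on
    keys: set        # values of key_column to delete


def scan(data_files, deletes):
    """Merge-on-read: every scan re-applies all outstanding deletes."""
    out = []
    for f in data_files:
        # Only deletes committed after the data file can affect it.
        applicable = [d for d in deletes if d.seq > f.seq]
        for row in f.rows:
            if any(row[d.key_column] in d.keys for d in applicable):
                continue
            out.append(row)
    return out


def compact(data_files, deletes):
    """Rewrite live rows into one new file so later scans skip the merge."""
    live = scan(data_files, deletes)
    next_seq = max((x.seq for x in [*data_files, *deletes]), default=0) + 1
    return [DataFileRows(seq=next_seq, rows=live)], []


# Commit 1 inserts two rows; commit 2 upserts id=1 (delete + new row).
data_files = [
    DataFileRows(seq=1, rows=[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]),
    DataFileRows(seq=2, rows=[{"id": 1, "v": "a2"}]),
]
deletes = [EqualityDelete(seq=2, key_column="id", keys={1})]

before = scan(data_files, deletes)
compacted_files, remaining_deletes = compact(data_files, deletes)
after = scan(compacted_files, remaining_deletes)
print(before == after, len(remaining_deletes))
```

Both scans return the same rows, but the first one pays for the key lookups on every read, and that cost grows with the number of outstanding delete files. Compaction pays it once, which is why its schedule matters so much on update-heavy tables.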