// the find

apache/paimon

★ 3,299 · Java · Apache-2.0 · updated Jun 2026

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

Paimon sits on top of a distributed file system or object store (HDFS, S3, OSS) and brings LSM-tree semantics to it: you get primary-key upserts, real-time streaming writes, and batch reads from the same table without a separate streaming store. It emerged from the Flink community as 'Flink Table Store' and has since grown Spark support and an Iceberg-compatible layer. The target audience is data engineers running Flink or Spark who want CDC ingestion directly into a data lake without routing everything through Kafka + Hive.
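To make "LSM-tree semantics" concrete, here is a toy sketch of primary-key upserts over immutable sorted runs. It is not Paimon's implementation or API: the class name `ToyLsmTable`, the in-memory runs, and the use of `None` as a delete marker are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Toy model only: every name here is hypothetical, not a Paimon class.
Row = Tuple[int, Optional[str]]  # (primary key, value); a None value marks a delete


class ToyLsmTable:
    def __init__(self) -> None:
        self.runs: List[List[Row]] = []  # sorted runs, oldest first

    def write_batch(self, rows: List[Row]) -> None:
        """Flush one batch as a new sorted run; existing runs are never rewritten."""
        latest: Dict[int, Optional[str]] = {}
        for key, value in rows:
            latest[key] = value  # a later row in the same batch wins
        self.runs.append(sorted(latest.items()))

    def read(self) -> Dict[int, str]:
        """Merge-on-read: newer runs override older ones, delete markers drop keys."""
        merged: Dict[int, Optional[str]] = {}
        for run in self.runs:  # oldest to newest
            for key, value in run:
                merged[key] = value
        return {k: v for k, v in sorted(merged.items()) if v is not None}

    def compact(self) -> None:
        """Collapse all runs into one so later reads have fewer runs to merge."""
        self.runs = [sorted(self.read().items())]
```

Writing three batches leaves three runs that every read has to merge; calling `compact()` collapses them into one without changing what `read()` returns. That is the trade the review comes back to later: cheap writes, with read cost controlled by how often compaction runs.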

LSM-tree on object storage is the genuinely interesting architectural bet here: compaction runs asynchronously so streaming writers never block readers, and the deletion vector approach (borrowed from Delta/Iceberg) gives you merge-on-write (MoW) semantics without rewriting entire files on every upsert. CDC ingestion pipeline support (MySQL, Postgres, MongoDB, Kafka, Pulsar) is first-class and documented with actual topology diagrams, not hand-waving. The Iceberg compatibility layer means you can point existing Iceberg readers at Paimon tables, which lowers the migration cost if your query layer already speaks Iceberg. The REST catalog implementation and Python API (pypaimon with Ray/Daft/PyTorch integrations) suggest it's being taken seriously beyond the Java-only Flink world.
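The deletion-vector idea can be sketched in a few lines. This is a toy model under stated assumptions, not Paimon's file format: `ToyDvTable` is a hypothetical name, data files are plain Python lists, and each deletion vector is a `set` of row positions where a real system would typically use a compressed bitmap.

```python
from typing import Dict, List, Set, Tuple


class ToyDvTable:
    """Toy model: immutable data files plus a per-file set of deleted row positions."""

    def __init__(self) -> None:
        self.files: List[List[Tuple[int, str]]] = []  # data files, never rewritten
        self.deleted: List[Set[int]] = []             # one deletion vector per file
        self.index: Dict[int, Tuple[int, int]] = {}   # key -> (file id, row position)

    def upsert_batch(self, rows: List[Tuple[int, str]]) -> None:
        """Write rows to a new file and mark any superseded rows as deleted."""
        file_id = len(self.files)
        new_file: List[Tuple[int, str]] = []
        self.files.append(new_file)
        self.deleted.append(set())
        for key, value in rows:
            old = self.index.get(key)
            if old is not None:
                old_file, old_pos = old
                # Mark the stale row at write time; the old file stays untouched.
                self.deleted[old_file].add(old_pos)
            self.index[key] = (file_id, len(new_file))
            new_file.append((key, value))

    def scan(self) -> Dict[int, str]:
        """Readers skip deleted positions instead of merging by key."""
        out: Dict[int, str] = {}
        for file_id, rows in enumerate(self.files):
            for pos, (key, value) in enumerate(rows):
                if pos not in self.deleted[file_id]:
                    out[key] = value
        return out
```

The point of the sketch is where the work lands: the writer pays to look up and mark the old row, the original file keeps all of its rows on disk, and the reader only needs a position filter rather than a key-by-key merge.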

JDK 8/11 requirement for building is a sign of how much legacy the Flink/Hadoop dependency chain drags in; you will hit classpath hell if your environment doesn't match exactly. The split between 'dedicated compaction' and inline compaction is a real operational burden: you need to run separate compaction jobs or your read latency degrades, and tuning when to compact vs when to merge-on-read is genuinely complex. Bucket count is set at table creation time and rescaling requires a dedicated procedure that rewrites data. This is a sharp edge for teams who don't size correctly upfront. The vector/multimodal table features are marked experimental and the docs are thin; don't mistake the directory tree for shipped functionality.
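Why a bucket-count change forces a rewrite is easy to see with a small sketch. It assumes rows are assigned to buckets by hashing the key modulo the bucket count; `zlib.crc32` is a stand-in chosen for determinism, not Paimon's actual hash function, and the helper names are hypothetical.

```python
import zlib
from typing import Dict, List


def bucket_for(key: str, num_buckets: int) -> int:
    """Stable hash-mod assignment; crc32 is a stand-in, not Paimon's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def layout(keys: List[str], num_buckets: int) -> Dict[str, int]:
    """Map every key to its bucket for a given bucket count."""
    return {key: bucket_for(key, num_buckets) for key in keys}


def moved_keys(keys: List[str], old_buckets: int, new_buckets: int) -> List[str]:
    """Keys whose bucket changes when the bucket count changes."""
    before = layout(keys, old_buckets)
    after = layout(keys, new_buckets)
    return [k for k in keys if before[k] != after[k]]
```

For example, `moved_keys([f"user-{i}" for i in range(1000)], 4, 8)` lists the keys that would land in a different bucket after doubling from 4 to 8. Under a roughly uniform hash that is about half of them, and every one of those rows sits in a file that belongs to the wrong bucket until the data is rewritten.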
