// the find
apache/amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
Amoro is a management layer that sits on top of open table formats (Iceberg, Hudi, Paimon) and handles the operational work those formats leave to you: compaction, file expiry, orphan file cleanup, and catalog unification across engines. It's for data platform teams running lakehouse architectures on Flink/Spark/Trino who are tired of writing their own maintenance jobs. Still in Apache incubation.
Self-optimizing compaction runs as a pluggable background service with configurable optimizer groups — you can isolate compaction resources from query resources, which is the right call. Unified catalog service means Flink, Spark, and Trino all point at one place instead of three separate metastore configs. Mixed-Hive format upgrade path is genuinely useful: promotes existing Hive tables to lake format via metadata migration only, no data rewrite. CI matrix covers both Hadoop 2.x and 3.x with separate workflows, and there's a dedicated Trino CI job — unusual level of engine compatibility testing for a project this size.
Still incubating after what appears to be several years; the graduation timeline is unclear, and that's a real risk if you're building production infrastructure on it. Mixed format's engine support matrix has obvious gaps (Spark can't do streaming writes, Flink can't do batch overwrites or ALTER TABLE), so you'll hit walls fast if your pipeline needs those. Building Trino support requires JDK 8, 11, and 17 installed side by side (via Maven toolchains), which is a painful local setup tax. Documentation and the dashboard feel like they lag behind the Java code; the web UI module exists, but getting it running outside Docker isn't clearly documented.
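If you want to sanity-check a pipeline against the mixed-format gaps before committing, a tiny lookup is enough. The sketch below is a hypothetical helper, not anything Amoro ships: the table `MIXED_FORMAT_GAPS`, the operation names, and `blocked_operations` are all made up for illustration, and it encodes only the gaps named in this review, so check the project's own support matrix for the full picture.

```python
# Hypothetical helper, not part of Amoro. It encodes only the mixed-format
# gaps mentioned in this review; the operation names are illustrative labels.
MIXED_FORMAT_GAPS = {
    "spark": {"streaming_write"},
    "flink": {"batch_overwrite", "alter_table"},
}


def blocked_operations(engine, needed):
    """Return the needed operations that this review lists as unsupported.

    An empty list means "no gap named in the review", not "fully supported".
    """
    gaps = MIXED_FORMAT_GAPS.get(engine.lower(), set())
    return sorted(gaps & set(needed))


if __name__ == "__main__":
    # A Flink pipeline that needs streaming writes plus schema changes.
    print(blocked_operations("flink", ["streaming_write", "alter_table"]))
```

Run as a script, this prints `['alter_table']`: streaming writes are fine on Flink by this review's account, but the schema change is where you'd hit the wall.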