
// the find

datazip-inc/olake

★ 1,354 · Go · Apache-2.0 · updated Jun 2026

OLake - Fastest Databases, Kafka & S3 Replication to Apache Iceberg with Table optimization (Called OLake Fusion). ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supported sources : Postgres, MongoDB, MySQL, Oracle, MSSql, DB2, Kafka, S3.

OLake is a Go-based CDC and full-load replication tool that moves data from relational databases, Kafka, and S3 into Apache Iceberg or Parquet files. It sidesteps the usual Spark/Flink/Debezium stack by shipping its own driver per source and sending Arrow writes over gRPC to a bundled Java process that handles Iceberg catalog operations. It is aimed at teams that want a data lakehouse without paying Fivetran prices or running a distributed compute cluster.

Avoiding Spark/Flink/Debezium for full load is the right call: those stacks add enormous operational overhead for what is fundamentally a read-and-write job. The per-driver Go module layout (each source is its own go.mod under drivers/) means you can build and ship only the sources you need without pulling in every other source's database driver. Sending Arrow columnar writes over gRPC to the Java Iceberg writer is a sensible boundary: it keeps the performance-critical path in Go and delegates Iceberg catalog complexity (which is genuinely gnarly) to the mature Java ecosystem. The full-load benchmark numbers (580K RPS Postgres → Iceberg) are plausible given direct COPY protocol usage; this isn't marketing fiction.

The Java subprocess dependency for Iceberg writes (olake-iceberg-java-writer is a forked Debezium component) is an operational liability: a Go binary spawns a JVM and talks to it over gRPC, so there are two runtimes to debug when writes stall or OOM. CDC for Oracle is still WIP and the Kafka connector only does bounded incremental reads (no true streaming), which limits the 'real-time' claim to Postgres and MySQL in practice. The benchmark footnote admits the results are 'preliminary' and not yet reproducible, which undercuts the headline 12.5× Fivetran comparison. Schema evolution support exists, but the behavior on destructive changes (column drops, type changes) isn't documented in the README, and that's exactly where CDC pipelines break in production.
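To make the schema evolution concern concrete, here is a minimal Go sketch of the kind of pre-flight check you might run yourself before pointing any CDC pipeline at a changing source table. It is not OLake code and does not reflect how OLake handles schema changes; `DiffSchemas`, `Change`, and the column-name-to-type maps are hypothetical names invented for illustration. It assumes you can obtain the old and new schemas as simple maps and, as a simplification, treats every type change as destructive.

```go
package main

import (
	"fmt"
	"sort"
)

// ChangeKind labels one difference between two schema versions.
type ChangeKind string

const (
	ColumnAdded   ChangeKind = "column_added"
	ColumnDropped ChangeKind = "column_dropped"
	TypeChanged   ChangeKind = "type_changed"
)

// Change describes a single column-level difference.
// Hypothetical type for this sketch; not part of any OLake API.
type Change struct {
	Column      string
	Kind        ChangeKind
	Destructive bool
}

// DiffSchemas compares two column-name -> type maps and reports every
// difference. Drops and type changes are flagged as destructive;
// additions are not. Results are sorted by column name.
func DiffSchemas(old, next map[string]string) []Change {
	var changes []Change
	for col, oldType := range old {
		newType, ok := next[col]
		switch {
		case !ok:
			changes = append(changes, Change{Column: col, Kind: ColumnDropped, Destructive: true})
		case newType != oldType:
			changes = append(changes, Change{Column: col, Kind: TypeChanged, Destructive: true})
		}
	}
	for col := range next {
		if _, ok := old[col]; !ok {
			changes = append(changes, Change{Column: col, Kind: ColumnAdded, Destructive: false})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Column < changes[j].Column })
	return changes
}

func main() {
	// Made-up example schemas, before and after a source migration.
	old := map[string]string{"id": "bigint", "email": "text", "age": "int"}
	next := map[string]string{"id": "bigint", "age": "text", "signup_ts": "timestamp"}

	for _, c := range DiffSchemas(old, next) {
		fmt.Printf("%-10s %-15s destructive=%v\n", c.Column, c.Kind, c.Destructive)
	}
}
```

Flagging every type change is deliberately conservative: Iceberg itself permits some widening promotions (int to long, for example), so a real check would want an allow-list for those. How the pipeline should react to a destructive change is exactly the part the README leaves open.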

