// the find
lakesoul-io/LakeSoul
LakeSoul is an end-to-end, realtime cloud-native Lakehouse framework for fast data ingestion, concurrent updates, incremental analytics, multimodal data processing and vector search — powering next-generation BI and AI workloads.
LakeSoul is a lakehouse table format with a Rust core that binds to Spark, Flink, Python, and Ray. It targets teams who need CDC ingestion, upserts with primary keys, and vector search on the same storage layer without stitching together separate tools. Originally from DMetaSoul, now an LF AI & Data sandbox project.
The Rust-native IO and metadata layer is the real architectural bet here: one implementation shared across all engine bindings means the ACID and upsert semantics behave consistently whether you're hitting it from PySpark or a PyTorch DataLoader. The LSM-Tree-style upsert for hash-partitioned tables with merge-on-read is well thought out, and benchmarks suggest write throughput is competitive with Delta. The automated, disaggregated compaction service runs as a separate Flink job rather than inline on the write path, which is a smart production choice because it doesn't block ingestion. PostgreSQL as the metadata store is genuinely useful: you get row-level security and RBAC for free, and you can route read-only metadata queries to a standby, which is a real operational win.
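To make the upsert design concrete, here is a minimal toy sketch of the general technique, not LakeSoul's actual code or API. It assumes string primary keys, an in-memory stand-in for files, and hypothetical names (`ToyUpsertTable`, `merge_runs`, `num_buckets`). Each write appends a sorted run to a hash bucket without touching existing data, reads do a k-way merge where the newest run wins per key, and compaction is a separate step that collapses runs.

```python
import heapq
import zlib
from collections import defaultdict


def merge_runs(runs):
    """K-way merge of key-sorted runs (oldest first); the newest run wins per key."""
    # Tag each entry with -seq so that, for equal keys, newer runs sort first.
    tagged = [[(key, -seq, row) for key, row in run] for seq, run in enumerate(runs)]
    merged = []
    for key, _, row in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, row))  # first hit for a key is the newest version
    return merged


class ToyUpsertTable:
    """Illustrative only: hash-bucketed, append-only runs with merge-on-read."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        # bucket id -> list of runs, oldest first; a run is a sorted list of (key, row)
        self.runs = defaultdict(list)

    def _bucket(self, key):
        # crc32 gives a hash that is stable across processes, unlike built-in hash()
        return zlib.crc32(key.encode("utf-8")) % self.num_buckets

    def upsert(self, rows):
        """Write path: split by bucket, sort by key, append as new runs."""
        per_bucket = defaultdict(dict)
        for key, row in rows:
            per_bucket[self._bucket(key)][key] = row  # last write in a batch wins
        for bucket, batch in per_bucket.items():
            self.runs[bucket].append(sorted(batch.items()))

    def scan(self):
        """Read path: merge the runs of every bucket at query time."""
        result = {}
        for runs in self.runs.values():
            result.update(merge_runs(runs))
        return result

    def compact(self):
        """Out-of-band step: collapse each bucket's runs into a single run."""
        for bucket, runs in self.runs.items():
            self.runs[bucket] = [merge_runs(runs)]
```

Because `upsert` only appends, a writer never waits on a merge. The cost moves to `scan`, and `compact` can run on its own schedule to keep the number of runs per bucket small, which is the same trade-off the separate compaction job is making.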
PostgreSQL as the metadata store is also the single biggest operational risk: if Postgres goes down or gets overloaded, everything stops. There's no fallback or degraded-read mode, which makes this a bad fit for teams who want catalog-level HA without running their own Postgres cluster. The vector search story is half-baked: it's on the roadmap, but ANN search on object store with upserts is still listed as a 2026 goal, so the 'vector search' badge in the description is aspirational. Support for Spark 4.0 and Flink 2.0 is also still an open roadmap item, which means you're pinned to Spark 3.5 and Flink 1.20 for now. Documentation is functional but clearly written by engineers who know the system: the operational guidance for tuning compaction, sizing the Postgres metadata tier, or recovering from a bad write is thin.