// the find
openlake-project/openlake
OpenLake is a high performance storage engine for efficient LLM inference and GPU Training
OpenLake is a distributed object store built specifically for GPU workloads — think S3-compatible storage where the hot path goes NVMe → GPU VRAM via GPUDirect and RDMA, bypassing host memory entirely. It uses io_uring with a thread-per-core model and erasure coding instead of replication. Target audience is teams running large-scale LLM training or inference who are bottlenecked on checkpoint load/save throughput.
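To make the thread-per-core idea concrete, here is a minimal sketch in plain std Rust. It is not OpenLake's code: the `Shards` type, the `Request` enum, and the key-hash routing are all hypothetical, and it leaves out the parts that matter most in a real engine (core pinning, a per-thread io_uring loop, RDMA). What it does show is the ownership rule: each key maps to exactly one thread, and that thread owns its state outright, so there is no `Mutex` or shared map anywhere.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// A request that is routed to exactly one shard thread.
/// (Hypothetical type, for illustration only.)
enum Request {
    Put { key: String, value: Vec<u8> },
    Get { key: String, reply: Sender<Option<Vec<u8>>> },
}

/// A fixed set of worker threads, each owning a private key/value map.
struct Shards {
    senders: Vec<Sender<Request>>,
    workers: Vec<JoinHandle<()>>,
}

impl Shards {
    /// Spawn `n` shard threads. A real thread-per-core design would use one
    /// per core and pin each thread; std has no pinning API, so that is omitted.
    fn new(n: usize) -> Self {
        let mut senders = Vec::with_capacity(n);
        let mut workers = Vec::with_capacity(n);
        for _ in 0..n {
            let (tx, rx) = channel::<Request>();
            senders.push(tx);
            workers.push(thread::spawn(move || {
                // This map is owned by this thread alone: no Mutex, no Arc.
                let mut store: HashMap<String, Vec<u8>> = HashMap::new();
                // The loop ends when every sender has been dropped.
                for req in rx {
                    match req {
                        Request::Put { key, value } => {
                            store.insert(key, value);
                        }
                        Request::Get { key, reply } => {
                            let _ = reply.send(store.get(&key).cloned());
                        }
                    }
                }
            }));
        }
        Shards { senders, workers }
    }

    /// Same key, same shard, every time. Work is never stolen or migrated.
    fn shard_for(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.senders.len()
    }

    fn put(&self, key: &str, value: Vec<u8>) {
        let shard = self.shard_for(key);
        self.senders[shard]
            .send(Request::Put { key: key.to_string(), value })
            .expect("shard thread exited");
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        let (reply_tx, reply_rx) = channel();
        let shard = self.shard_for(key);
        self.senders[shard]
            .send(Request::Get { key: key.to_string(), reply: reply_tx })
            .expect("shard thread exited");
        reply_rx.recv().expect("shard thread exited")
    }

    /// Close all request channels, then wait for the workers to drain and exit.
    fn shutdown(self) {
        drop(self.senders);
        for worker in self.workers {
            let _ = worker.join();
        }
    }
}
```

Calling `Shards::new(4)`, then `put("ckpt/step-100", ...)` followed by `get("ckpt/step-100")`, sends both requests to the same thread in order, so the read sees the write without any locking. The trade-off is that a hot key cannot be spread across cores, which is acceptable when the workload is many large, independent objects such as checkpoint shards.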
The thread-per-core architecture with no work stealing is the right call for this workload: a request that never crosses a core boundary means no false sharing, no lock contention, and predictable tail latencies. SIMD Reed-Solomon erasure coding is genuinely cheaper in raw capacity than 3x replication at scale, and the throughput numbers (225 MiB/s GET at sub-10ms p50 vs 75 MiB/s for MinIO, both at c=512) are plausible given the architecture. The S3-compatible API is the pragmatic choice, since existing tooling (PyTorch checkpoint saving, HuggingFace Hub, aws CLI) works without modification. The vendored h2 and cyper crates with a PATCHES.md show they're actually maintaining their own fork rather than pretending upstream is good enough.
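The capacity argument for erasure coding is simple arithmetic, sketched below. The 8 data + 4 parity layout is an assumption picked for illustration; nothing here says what shard layout OpenLake actually uses, and the function names are hypothetical.

```rust
/// Raw bytes stored per logical byte under n-way replication.
fn replication_overhead(copies: u32) -> f64 {
    copies as f64
}

/// Raw bytes stored per logical byte under Reed-Solomon with
/// `data` data shards and `parity` parity shards.
fn erasure_overhead(data: u32, parity: u32) -> f64 {
    (data + parity) as f64 / data as f64
}

/// Simultaneous losses survivable with n full copies.
fn replication_tolerance(copies: u32) -> u32 {
    copies - 1
}

/// Simultaneous shard losses survivable with Reed-Solomon:
/// any `data` of the `data + parity` shards can rebuild the object.
fn erasure_tolerance(parity: u32) -> u32 {
    parity
}
```

Under that assumed layout, `erasure_overhead(8, 4)` is 1.5 against `replication_overhead(3)` at 3.0, so 1 PiB of checkpoints costs 1.5 PiB of raw NVMe instead of 3 PiB, while tolerating four lost shards rather than two lost copies. What the arithmetic leaves out is the encode cost on writes and the reconstruction cost on degraded reads, which is where the SIMD implementation earns its keep.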
GPUDirect Storage and RDMA require NVIDIA GPUs and Mellanox/NVIDIA RDMA-capable NICs; the mlx5dv_sys.rs binding, which targets the mlx5 driver family, makes that explicit. This immediately excludes anyone running on commodity Ethernet NICs or cloud instances without SR-IOV, which is most teams. The benchmark graph only shows GET at one concurrency level; there's no PUT throughput data, no failure-mode benchmarks, and no numbers for the erasure coding overhead on the write path. Documentation is sparse: the architecture docs promise detail, but the docs/ tree is mostly RST stubs. The novel 'PacedRDMA' congestion control algorithm is described but not published or peer-reviewed, so adopters are trusting an unvalidated claim about tail latency behavior under burst conditions.