// the find
lance-format/lance
Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.
Lance is a columnar file format built for ML workloads — think Parquet but with native vector indexing, fast random access, and zero-copy versioning baked in. It sits between a raw file format and a full vector database, making it a reasonable choice if you're already managing datasets in Arrow/Parquet and want to add ANN search without standing up another service. The Rust core with Python bindings via PyO3 means it's fast where it matters.
The random access story is the real differentiator. To return a single row, Parquet generally has to read and decompress a whole page, and often a whole column chunk within a row group. Lance is designed to locate individual rows directly, so the 100x claim for sampling workloads is plausible. Hybrid search (ANN + BM25 + SQL predicates in one query) without gluing together three separate systems is genuinely useful for RAG pipelines. Zero-copy versioning via manifest files means each version is a small manifest listing the data files it includes, so you get time travel and ACID transactions without a catalog service or extra infrastructure. Ecosystem coverage is wide (Arrow, DuckDB, Polars, Spark, Ray), so you're not betting on a dead-end integration.
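To make the manifest idea concrete, here is a toy in-memory sketch. It is not Lance's on-disk layout or API: `Manifest` and `ToyDataset` are hypothetical names, and the "files" are plain Python lists. It only illustrates why appending a new version does not copy old data, and why reading an old version is just reading an old manifest.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Manifest:
    """One dataset version: a version number plus the data files it includes."""
    version: int
    fragments: Tuple[str, ...]


class ToyDataset:
    """Hypothetical stand-in for a manifest-versioned dataset (illustration only)."""

    def __init__(self) -> None:
        # Data files are written once and never modified afterwards.
        self._files: Dict[str, List[Any]] = {}
        # Version 0 is the empty dataset.
        self._manifests: List[Manifest] = [Manifest(0, ())]

    def append(self, rows: List[Any]) -> int:
        """Write one new data file and a new manifest that references it."""
        name = f"fragment-{len(self._files)}"
        self._files[name] = list(rows)
        latest = self._manifests[-1]
        # The new manifest reuses the old file names; no existing data is copied.
        self._manifests.append(
            Manifest(latest.version + 1, latest.fragments + (name,))
        )
        return self._manifests[-1].version

    def read(self, version: Optional[int] = None) -> List[Any]:
        """Read the latest version, or time-travel to an older one."""
        manifest = self._manifests[-1] if version is None else self._manifests[version]
        return [row for name in manifest.fragments for row in self._files[name]]

    def file_count(self) -> int:
        return len(self._files)


ds = ToyDataset()
ds.append(["a", "b"])   # creates version 1
ds.append(["c"])        # creates version 2
old = ds.read(version=1)  # ["a", "b"]: version 1 never sees the later append
new = ds.read()           # ["a", "b", "c"]
```

Two appends produce two data files and three manifests (including the empty version 0). A real format also has to handle deletes, schema changes, and concurrent commits, which is where the multi-writer caveats come in.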
The catalog spec is still maturing. If you need multi-writer concurrency at scale, the conflict resolution story is not as battle-tested as Delta Lake or Iceberg, which have years of production use behind them. The Python package is installed as `pylance` but imported as `lance`, and the package name collides with Microsoft's Pylance LSP extension. That is minor, but genuinely annoying for discoverability and for searching error messages. Java bindings exist but feel second-class: they're JNI over the Rust core, the JNI layer is large and carries its own Cargo.lock, and Java users don't get the same ergonomics as the Python API. Blob/large binary support is documented, but the lazy loading behavior isn't well-specified. If you're storing video or large audio files and care about partial reads, you'll need to read the source to understand what you're actually getting.