// the find
Eventual-Inc/Daft
High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
Daft is a distributed dataframe engine that treats images, audio, video, and embeddings as native column types rather than opaque Python objects. It has a Rust core with a Python API, uses Arrow internally, and runs the same code on a laptop or a Ray cluster. The target is ML data pipelines that mix structured data with large binary assets, not general analytics.
The multimodal column type system is the real differentiator: images and audio live in typed columns with lazy decoding, so filter pushdown and projection can avoid deserializing data you don't need. The AI operations layer (daft/ai/) is well-structured: OpenAI, Google, vLLM, Transformers, and LM Studio all get a consistent provider interface, so you can swap backends without rewriting pipeline code. The catalog support is unusually broad for a project this size: Iceberg, Delta Lake, Unity Catalog, Gravitino, Paimon, and S3 Tables all have dedicated implementations rather than one generic connector. Daft also has a real query optimizer and stays Arrow-native throughout, which puts it ahead of Modin and Dask for structured workloads.
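Why lazy decoding matters is easiest to see in a toy model. The sketch below is plain Python, not Daft's implementation; `LazyImage`, `fake_decode`, and `decode_log` are hypothetical names invented for illustration. It shows that filtering on a cheap structured column first means only the surviving rows ever pay the decode cost.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Records every decode so we can see which rows were actually deserialized.
decode_log: list = []


def fake_decode(raw: bytes) -> list:
    """Pretend image decoder (hypothetical): turns bytes into a list of ints."""
    decode_log.append(len(raw))
    return list(raw)


@dataclass
class LazyImage:
    """Hypothetical stand-in for an image cell: raw bytes plus a deferred decode."""
    raw: bytes
    decoder: Callable[[bytes], list]
    _decoded: Optional[list] = field(default=None, repr=False)

    def pixels(self) -> list:
        # Decode on first access only, then reuse the cached result.
        if self._decoded is None:
            self._decoded = self.decoder(self.raw)
        return self._decoded


rows = [
    {"label": "cat", "image": LazyImage(b"\x01\x02\x03", fake_decode)},
    {"label": "dog", "image": LazyImage(b"\x04\x05", fake_decode)},
    {"label": "cat", "image": LazyImage(b"\x06", fake_decode)},
]

# Filter on the structured column first; no image bytes are decoded here.
cats = [row for row in rows if row["label"] == "cat"]

# Only the rows that survived the filter are decoded.
sizes = [len(row["image"].pixels()) for row in cats]
```

The "dog" row is never decoded. A real engine gets the same effect by reordering the query plan so filters and projections run before expensive decode steps.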
The Rust core doesn't help you where it matters most in AI pipelines: model inference still runs in Python UDFs, so the performance story gets murkier the moment you load a GPU model. If you're doing pure structured analytics, Polars is faster, has 5x the community size, and doesn't need Ray. The distributed mode requires Ray, which is a substantial dependency with its own operational footprint; the Kubernetes path is documented but feels less battle-tested than the Ray path. API stability is the quiet risk: the catalog and AI modules are growing fast and the project hasn't declared a stable v1 surface, so adopters should expect occasional breaking changes in exactly the areas that are most useful.