// the find
cocoindex-io/cocoindex
Incremental engine for long horizon agents 🌟 Star if you like it!
CocoIndex is an incremental data pipeline framework for AI/RAG workloads — Rust core, Python API. The pitch is React-style declarative pipelines where you describe the target state and the engine tracks provenance to reprocess only what changed. Aimed at teams building RAG systems or agent memory stores who are tired of re-embedding entire corpora on every update.
The delta-only recomputation model is the real differentiator: memoization keyed on hash(input) + hash(code) means a function change invalidates only the rows whose output depends on that code, not the whole corpus. Putting parallelism, retries, dead-letter queues, and failure isolation in the Rust core is a serious engineering choice: Python wrappers over a Rust engine sidestep the GIL and give real production-grade throughput. Connector breadth is genuinely good: Postgres, LanceDB, Qdrant, Neo4j, Kafka, S3, Google Drive, and SurrealDB out of the box. The end-to-end lineage story, where every target vector traces back to its source byte, is practically useful for debugging why a retrieval went wrong.
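To make the hash(input) + hash(code) idea concrete, here is a minimal sketch in plain Python. It is not CocoIndex's API or implementation: `MemoCache`, `code_fingerprint`, and the toy `shout`/`whisper` transforms are hypothetical names, and hashing bytecode is only a crude stand-in for however the engine fingerprints code.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def code_fingerprint(fn) -> str:
    # Assumption: bytecode plus the constants and names it references
    # is a good-enough proxy for "hash(code)" in this toy example.
    code = fn.__code__
    return digest(code.co_code + repr((code.co_consts, code.co_names)).encode())


class MemoCache:
    """Caches fn(row) under the key (hash(input), hash(code))."""

    def __init__(self):
        self.store = {}
        self.misses = 0  # number of times a function actually ran

    def run(self, fn, row: str):
        key = (digest(row.encode()), code_fingerprint(fn))
        if key not in self.store:
            self.misses += 1
            self.store[key] = fn(row)
        return self.store[key]


def shout(text):
    return text.upper()


def whisper(text):
    return text.lower()


cache = MemoCache()
rows = ["Alpha", "Beta"]

first = [cache.run(shout, r) for r in rows]      # cold run: both rows computed
misses_cold = cache.misses

second = [cache.run(shout, r) for r in rows]     # nothing changed: all cache hits
misses_warm = cache.misses - misses_cold

rows[1] = "Gamma"                                # one input changes
third = [cache.run(shout, r) for r in rows]
misses_after_edit = cache.misses - misses_cold - misses_warm

fourth = [cache.run(whisper, r) for r in rows]   # different code: new keys
fifth = [cache.run(shout, r) for r in rows]      # shout's entries are still valid
```

In this sketch an edited row costs one recomputation, and swapping in different code recomputes only the results produced by that code while `shout`'s cached outputs survive. A real engine would also have to track dependencies between steps, which this toy ignores.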
The Python/Rust FFI boundary is opaque: when something goes wrong deep in the Rust engine, Python stack traces will be of little use and you'll need to understand both layers to debug. The declarative model works well for append-heavy workloads, but the docs are thin on complex updates: deletions, schema migrations mid-pipeline, or reprocessing a subset of rows when you change an embedding model. There is no native streaming source support for databases (CDC via Debezium etc.); the 'live mode' appears to be polling. With 10k stars and heavy README marketing, it is unclear how much of the pitch has actually been production-tested at scale; the benchmarks in the repo are for file summarization, not petabyte indexing.
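For readers unsure what polling implies in practice, here is a rough sketch of polling-style change detection: take a fingerprint snapshot of the source on each poll and diff it against the previous one. This is an illustration of the general technique only, not a description of how CocoIndex's live mode works; `snapshot` and `diff` are hypothetical helpers.

```python
import hashlib


def snapshot(rows: dict) -> dict:
    # Map each primary key to a content hash of its current value.
    return {key: hashlib.sha256(value.encode()).hexdigest()
            for key, value in rows.items()}


def diff(previous: dict, current: dict):
    # Compare two snapshots and classify keys as added, changed, or deleted.
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    changed = sorted(key for key in current.keys() & previous.keys()
                     if current[key] != previous[key])
    return added, changed, deleted


# Two consecutive polls of a hypothetical source table.
before = snapshot({"a": "1", "b": "2", "c": "3"})
after = snapshot({"a": "1", "b": "20", "d": "4"})

added, changed, deleted = diff(before, after)
```

The trade-off against log-based CDC is that a poller only sees the state at each poll: changes surface with up to one interval of delay, and a row that is updated twice or created and deleted between polls leaves no trace of the intermediate states.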