// the find

chonkie-inc/chonkie

★ 4,139 · Python · MIT · updated Jun 2026

🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines

Chonkie is a Python library focused specifically on text chunking for RAG pipelines, offering everything from simple token-based splitting to semantic and LLM-based chunking. It's designed to be a drop-in replacement for the chunking pieces of LangChain/LlamaIndex without dragging in their entire dependency trees. Aimed at anyone building document ingestion pipelines who's tired of rolling their own chunker for the fifth time.
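For a sense of what the "simple token-based splitting" end of that range involves, here is a minimal sketch of fixed-size chunking with overlap. It is not Chonkie's implementation: the function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into windows of at most chunk_size whitespace tokens.

    Consecutive windows share `overlap` tokens. Whitespace splitting is an
    assumption for illustration; a real pipeline would use a model tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_tokens("a b c d e f g h i j", chunk_size=4, overlap=1)
```

The overlap keeps a little shared context at each boundary, which is the usual reason to prefer it over hard cuts when chunks feed a retriever.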

- Genuine focus on being lightweight: 505KB wheel vs competitors at 1-12MB, and lazy imports mean you only pay for what you use. The optional dependency extras ([tiktoken], [semantic], [st], etc.) are well-thought-out.

- Unusually wide chunker variety in one place: token, sentence, recursive, semantic, late chunking, code-aware (tree-sitter presumably), neural, and LLM-based agentic chunking, all behind a consistent interface.

- Test coverage is thorough and mirrors the source structure closely — every chunker, embedding provider, handshake, and refinery has its own test file, and CI includes type checking, secret scanning, and lazy-import validation.

- The Pipeline API with method chaining (.chunk_with().refine_with()) is a clean abstraction that avoids the boilerplate of manually wiring chunkers, embedders, and vector DB writers together.

- SlumberChunker (LLM-based chunking) has no documented latency or cost characteristics. Putting LLM calls in the chunking path of a production pipeline will be brutally slow and expensive compared with the rule-based chunkers, and the docs don't warn you about this tradeoff clearly.

- The 'handshakes' for vector DBs look thin: they appear to be write-only wrappers with no support for retrieval, metadata filtering, or index management, so you'll still need the native client for anything real.

- Scope creep is starting to show: OCR via Mistral, CSV/Excel table processing, HuggingFace Hub wrappers, a REST API server, and a cloud-hosted version. This is no longer a focused chunking library, and the 'lightweight' claim starts to feel strained.

- The REST API server stores pipeline configs in SQLite with no auth, no multi-user support, and no mention of production hardening. Shipping this in anything other than a local dev setup would be a mistake.
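The method-chaining style praised above (.chunk_with().refine_with()) comes down to each builder method returning the pipeline itself. The sketch below illustrates that pattern only. `SketchPipeline`, `split_sentences`, and `drop_short` are hypothetical names, and passing plain callables is an assumption; Chonkie's actual Pipeline signatures may differ.

```python
from typing import Callable, List, Optional

Chunker = Callable[[str], List[str]]
Refinery = Callable[[List[str]], List[str]]


class SketchPipeline:
    """Hypothetical stand-in for illustration, not Chonkie's Pipeline class."""

    def __init__(self) -> None:
        self._chunker: Optional[Chunker] = None
        self._refineries: List[Refinery] = []

    def chunk_with(self, chunker: Chunker) -> "SketchPipeline":
        self._chunker = chunker
        return self  # returning self is what makes chaining possible

    def refine_with(self, refinery: Refinery) -> "SketchPipeline":
        self._refineries.append(refinery)
        return self

    def run(self, text: str) -> List[str]:
        if self._chunker is None:
            raise ValueError("call chunk_with() before run()")
        chunks = self._chunker(text)
        for refinery in self._refineries:  # apply refineries in the order added
            chunks = refinery(chunks)
        return chunks


def split_sentences(text: str) -> List[str]:
    """Naive chunker: split on periods and drop empty pieces."""
    return [part.strip() for part in text.split(".") if part.strip()]


def drop_short(chunks: List[str]) -> List[str]:
    """Toy refinery: discard chunks with fewer than two words."""
    return [c for c in chunks if len(c.split()) >= 2]


pipeline = SketchPipeline().chunk_with(split_sentences).refine_with(drop_short)
result = pipeline.run("Chunk the docs. Ok. Then index them.")
```

Because every stage is registered on one object, swapping a chunker or adding a refinery is a one-line change rather than a rewiring job, which is the boilerplate saving the review points to.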
