// the find
chonkie-inc/chonkie
🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines
Chonkie is a Python library focused specifically on text chunking for RAG pipelines, offering everything from simple token-based splitting to semantic and LLM-based chunking. It's designed to be a drop-in replacement for the chunking pieces of LangChain/LlamaIndex without dragging in their entire dependency trees. It's aimed at anyone building document ingestion pipelines who's tired of rolling their own chunker for the fifth time.
- Genuine focus on being lightweight: a 505 KB wheel vs competitors at 1-12 MB, and lazy imports mean you only pay for what you use. The optional dependency extras ([tiktoken], [semantic], [st], etc.) are well thought out.
- Unusually wide chunker variety in one place: token, sentence, recursive, semantic, late chunking, code-aware (presumably via tree-sitter), neural, and LLM-based agentic chunking, all behind a consistent interface.
- Test coverage is thorough and mirrors the source structure closely — every chunker, embedding provider, handshake, and refinery has its own test file, and CI includes type checking, secret scanning, and lazy-import validation.
- The Pipeline API with method chaining (.chunk_with().refine_with()) is a clean abstraction that avoids the boilerplate of manually wiring chunkers, embedders, and vector DB writers together.
- SlumberChunker (LLM-based chunking) has no documented latency or cost characteristics. Using an LLM per chunk in production will be brutally slow and expensive, and the docs don't clearly warn you about this tradeoff.
- The 'handshakes' for vector DBs look thin: they appear to be write-only wrappers with no support for retrieval, metadata filtering, or index management, so you'll still need the native client for anything real.
- Scope creep is starting to show: OCR via Mistral, CSV/Excel table processing, HuggingFace Hub wrappers, a REST API server, and a cloud-hosted version. This is no longer a focused chunking library, and the 'lightweight' claim starts to feel strained.
- The REST API server stores pipeline configs in SQLite with no auth, no multi-user support, and no mention of production hardening. Shipping this in anything other than a local dev setup would be a mistake.
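
To make the method-chaining point about the Pipeline API concrete, here is a minimal sketch of the pattern in plain Python. It deliberately does not import Chonkie: `SketchPipeline`, `fixed_size_chunker`, `strip_whitespace`, and `run` are hypothetical stand-ins written for illustration, and only the `chunk_with` / `refine_with` method names are taken from the review above. Chonkie's real signatures may well differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical stand-ins for illustration; these are NOT Chonkie's classes.
Chunker = Callable[[str], List[str]]
Refinery = Callable[[List[str]], List[str]]


def fixed_size_chunker(size: int) -> Chunker:
    """Build a chunker that splits text into pieces of at most `size` characters."""
    def chunk(text: str) -> List[str]:
        return [text[i:i + size] for i in range(0, len(text), size)]
    return chunk


def strip_whitespace(chunks: List[str]) -> List[str]:
    """Trim each chunk and drop any chunk that ends up empty."""
    return [c.strip() for c in chunks if c.strip()]


@dataclass
class SketchPipeline:
    chunker: Optional[Chunker] = None
    refineries: List[Refinery] = field(default_factory=list)

    def chunk_with(self, chunker: Chunker) -> "SketchPipeline":
        self.chunker = chunker
        return self  # returning self is what makes chaining possible

    def refine_with(self, refinery: Refinery) -> "SketchPipeline":
        self.refineries.append(refinery)
        return self

    def run(self, text: str) -> List[str]:
        if self.chunker is None:
            raise ValueError("call chunk_with() before run()")
        chunks = self.chunker(text)
        # Apply refineries in the order they were registered.
        for refine in self.refineries:
            chunks = refine(chunks)
        return chunks


# The whole pipeline is configured in one chained expression.
pipeline = SketchPipeline().chunk_with(fixed_size_chunker(8)).refine_with(strip_whitespace)
chunks = pipeline.run("one two  three four")
```

Each builder method mutates the pipeline and returns it, so the stages read top to bottom in the order they execute. The wiring lives in one place instead of being spread across separate chunker and refinery calls.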