// the find

llmware-ai/llmware

★ 14,844 · Python · Apache-2.0 · updated May 2026

Unified framework for building enterprise RAG pipelines with small, specialized models

llmware is a Python framework for building local-first RAG pipelines, targeting enterprise use cases where data can't leave the machine. It bundles document parsing, a vector store abstraction, and a catalog of 300+ pre-quantized small models (BLING/DRAGON/SLIM families, 1B to 7B parameters) so you can go from a PDF to an LLM answer without touching a cloud API. The sweet spot is air-gapped or privacy-sensitive environments on Windows/Mac/Linux, including NPU-capable hardware like Snapdragon.

1. The SLIM function-call models are genuinely useful: small models fine-tuned for specific extraction tasks (sentiment, NER, SQL) that run on CPU and produce structured output, which is more reliable than prompting a general model.
2. Hardware backend breadth is real: GGUF via llama.cpp, ONNX, OpenVINO, and QNN for Qualcomm NPU are all wired up behind a single ModelCatalog API, so you're not rewriting inference code when you change deployment targets.
3. The Library abstraction handles multi-format ingestion (PDF, PPTX, DOCX, XLSX, WAV, images) with actual C-backed parsers shipped as platform binaries. It's not just wrapping PyMuPDF.
4. Dual-pass retrieval (semantic + BM25-style text) with document-level filtering is built in, not bolted on.
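
To make the dual-pass retrieval point concrete, here is a minimal sketch of the general idea: filter chunks by document, then blend a keyword score with a vector-similarity score. This is not llmware's code or API. The corpus, the `dual_pass_query` function, the `alpha` weight, and the toy scoring functions (term overlap standing in for BM25, bag-of-words cosine standing in for a real embedding model) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus; doc_id values and texts are invented for this example.
CHUNKS = [
    {"doc_id": "contract_a", "text": "termination notice period is thirty days"},
    {"doc_id": "contract_a", "text": "governing law is the state of new york"},
    {"doc_id": "contract_b", "text": "termination requires ninety days written notice"},
]


def tokenize(text):
    return text.lower().split()


def keyword_score(query, text):
    """Fraction of query terms found in the chunk (a crude stand-in for BM25)."""
    q_terms = set(tokenize(query))
    t_terms = set(tokenize(text))
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def embed(text):
    """Bag-of-words count vector; stands in for a real embedding model."""
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dual_pass_query(query, chunks, doc_filter=None, alpha=0.5):
    """Score each chunk as alpha * semantic + (1 - alpha) * keyword.

    doc_filter (hypothetical parameter) restricts the search to a set of doc_ids.
    Returns (score, chunk) pairs sorted from best to worst.
    """
    pool = [c for c in chunks if doc_filter is None or c["doc_id"] in doc_filter]
    query_vec = embed(query)
    scored = [
        (
            alpha * cosine(query_vec, embed(c["text"]))
            + (1 - alpha) * keyword_score(query, c["text"]),
            c,
        )
        for c in pool
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


results = dual_pass_query("termination notice", CHUNKS, doc_filter={"contract_a"})
for score, chunk in results:
    print(f"{score:.3f}  {chunk['doc_id']}  {chunk['text']}")
```

Applying the document filter before scoring keeps results from one contract from leaking into a query about another, which is presumably why having it built in matters for the enterprise use cases described above.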

1. Shipping platform-native .so/.dll files inside the Python package is a maintenance and supply-chain liability: you're trusting pre-compiled binaries from the repo with no reproducible build. Anyone doing a security audit will flag this immediately.
2. The abstraction is wide but shallow: LLMWareConfig global state (set_active_db, set_vector_db) makes it hard to run multiple pipelines with different backends in the same process, and the singleton pattern will bite you in any async or multi-tenant context.
3. The BLING/DRAGON models cap out around 7B parameters. That's fine for extraction on well-scoped documents, but they struggle on anything requiring reasoning across long contexts or ambiguous queries. There's no clear upgrade path to larger models without leaving the local-first paradigm.
4. Documentation and examples are voluminous but inconsistent: the solutions/ directory has duplicate files across gguf/ and models/ subdirectories, and several README code snippets have syntax errors (a missing closing parenthesis in the hello world example), suggesting the docs aren't tested against the actual library.
