// the find

Maximilian-Winter/llama-cpp-agent

★ 644 · Python · NOASSERTION · updated Mar 2026

The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). It lets users chat with LLMs, execute structured function calls, and get structured output. It also works with models that are not fine-tuned for JSON output or function calls.

A Python wrapper around llama.cpp that adds structured output, function calling via GBNF grammars, and agent chains to local LLM inference. Targets developers running models locally who want function calling without fine-tuned models. The first line of the README says it is no longer maintained.

Grammar-based constrained sampling is the genuinely interesting piece: because the sampler can only pick tokens the grammar allows, even a 7B model with no tool-use fine-tuning emits syntactically valid JSON and function calls, which prompt engineering alone cannot guarantee. The provider abstraction covers llama-cpp-python, the llama.cpp server, TGI, and vLLM behind a consistent interface. The example set is thorough: RAG with ColBERT reranking, mixture-of-agents, memory managers, and knowledge graph extraction all come with runnable code. Custom chat format support (15+ predefined formatters plus a builder) means it handled the pre-template-standardization era well.
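To make the constrained-sampling idea concrete, here is a toy sketch in plain Python. It is not llama.cpp's GBNF engine or this repo's API: the vocabulary, the `GRAMMAR` state machine, and the `chatty_model` stand-in are all hypothetical and exist only to show how masking disallowed tokens forces valid output from a model that would rather say something else.

```python
import json
import math

# Toy vocabulary (assumption); a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', '"name"', ':', '"get_weather"', '"get_time"', 'hello']

# Hypothetical grammar written as a state machine:
# state -> {allowed token: next state}.
# It accepts {"name":"get_weather"} or {"name":"get_time"} and nothing else.
GRAMMAR = {
    "start": {'{': "key"},
    "key": {'"name"': "colon"},
    "colon": {':': "value"},
    "value": {'"get_weather"': "close", '"get_time"': "close"},
    "close": {'}': "done"},
}


def constrained_decode(logits_fn):
    """Greedy decoding where tokens the grammar forbids are masked out."""
    state, out = "start", []
    while state != "done":
        logits = logits_fn(out)
        allowed = GRAMMAR[state]
        # Forbidden tokens get -inf, so they can never win the argmax.
        masked = [
            score if tok in allowed else -math.inf
            for tok, score in zip(VOCAB, logits)
        ]
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        tok = VOCAB[best]
        out.append(tok)
        state = allowed[tok]
    return "".join(out)


def chatty_model(prefix):
    """Stand-in model that always prefers the free-text token 'hello'."""
    return [0.1, 0.1, 0.2, 0.1, 0.3, 0.9, 5.0]


result = constrained_decode(chatty_model)
parsed = json.loads(result)  # parses because the grammar only admits valid JSON
print(result)
```

The model still chooses among the allowed tokens (here it picks `"get_time"` over `"get_weather"`), so the grammar constrains the shape of the output without dictating its content.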

Abandoned: the README's first sentence tells you to use something else, and the last commit was March 2026 on a repo with 644 stars. GBNF grammars are a dead end now that most capable models are fine-tuned for tool use and llama.cpp has first-class tool support; on modern models the whole grammar layer becomes a maintenance burden with no benefit. There is no async support anywhere in the provider implementations, so every call blocks the event loop and the library won't compose cleanly with modern Python async agent frameworks. The RAG implementation depends on ragatouille/ColBERT, which has a complicated dependency chain of its own; you will fight install issues before writing a line of business logic.
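If you do need to call a synchronous provider from async code, the usual workaround is to push the blocking call onto a worker thread. A minimal sketch, assuming Python 3.9+ for `asyncio.to_thread`; `blocking_generate` is a hypothetical stand-in for a provider call, not a function from this repo.

```python
import asyncio
import time


def blocking_generate(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous provider call."""
    time.sleep(0.2)  # simulates time spent waiting on inference
    return f"echo: {prompt}"


async def generate_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    # to service other coroutines in the meantime.
    return await asyncio.to_thread(blocking_generate, prompt)


async def main():
    # Both calls are in flight at once instead of running back to back.
    return await asyncio.gather(generate_async("a"), generate_async("b"))


results = asyncio.run(main())
print(results)
```

This only keeps the event loop responsive. It does not make a single local model run requests in parallel, and whether the underlying provider object is safe to call from several threads is something you would have to verify yourself.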
