// the find

zilliztech/GPTCache

★ 8,067 · Python · MIT · updated Jul 2025

Semantic cache for LLMs. Fully integrated with LangChain and llama_index.

GPTCache is a semantic caching layer that sits in front of LLM APIs. Instead of exact-match caching, it embeds each query as a vector and returns a cached response when a semantically similar query has been seen before. It's for teams burning serious money on repeated or near-duplicate LLM calls in production chatbots or QA systems. The drop-in OpenAI adapter keeps adoption friction low.
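To make the embed-then-match idea concrete, here is a minimal sketch of a semantic cache in plain Python. It is not GPTCache's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector store, and `SemanticCache`, `answer`, and the 0.8 threshold are hypothetical names and values chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache: returns a stored response for similar-enough queries."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value, must be tuned per workload
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        query_vec = embed(query)
        best_score, best_response = 0.0, None
        # Linear scan; a real deployment would query a vector index instead.
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))


def answer(query, cache, llm):
    """Serve from cache when possible, otherwise call the model and store the result."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = llm(query)
    cache.put(query, response)
    return response
```

Passing any callable as `llm` (for example a function that wraps your API client) is enough to try it: a second query that differs only in casing or punctuation is served from the cache without calling the model again.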

The adapter pattern is genuinely well-designed: two lines of code replace your OpenAI import and you get caching with no other changes. The pluggable architecture lets you swap embedding models, vector stores, and scalar storage independently; pgvector, FAISS, Milvus, Chroma, and Qdrant are all supported. The temperature parameter integration is clever: a higher temperature increases the probability of bypassing the cache, which is semantically correct, since a caller asking for more varied output should not keep getting the same stored answer. A Docker server image means non-Python stacks can use it over HTTP.
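The temperature behavior can be sketched as a probabilistic gate in front of the cache lookup. The linear mapping below is an assumption made for illustration, and GPTCache's own mapping may differ; `skip_cache_probability` and `should_skip_cache` are hypothetical names, and the upper bound of 2.0 reflects the temperature range the OpenAI chat API accepts.

```python
import random

MAX_TEMPERATURE = 2.0  # upper bound of the OpenAI chat API's temperature range


def skip_cache_probability(temperature):
    """Assumed linear mapping: 0 never skips the cache, 2 always skips it."""
    clamped = min(max(temperature, 0.0), MAX_TEMPERATURE)
    return clamped / MAX_TEMPERATURE


def should_skip_cache(temperature, rng=random):
    """Decide per request whether to bypass the cache and call the model."""
    # rng.random() is in [0.0, 1.0), so probability 0 never fires
    # and probability 1 always fires.
    return rng.random() < skip_cache_probability(temperature)
```

With this gate, deterministic workloads (temperature 0) always consult the cache, while callers that ask for more variety get fresh completions more often.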

The project is explicitly in maintenance mode: the README says they're no longer adding support for new APIs or models, and meaningful commits are trailing off. That's a serious red flag if you're building on it: when OpenAI's API shape changes, you're on your own. The README itself acknowledges that the eviction policy is broken ('may cause OOM errors'), with no fix timeline. Similarity threshold tuning is a foot-gun in production: too loose and wrong answers come back confidently from cache, too tight and the hit rate collapses toward zero. There's no tooling to help you find the right value for your workload.
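Since no such tooling ships with the project, one rough way to pick a threshold is to sweep candidate values over a hand-labeled sample of query pairs. The sketch below is a hypothetical helper, not part of GPTCache: `sweep_thresholds` and the sample data are made up for illustration, and each pair is assumed to carry a similarity score plus a human judgment of whether the cached answer would have been correct.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Report hit rate and wrong-hit rate for each candidate threshold.

    scored_pairs: list of (similarity, cached_answer_is_correct) tuples.
    Returns a list of (threshold, hit_rate, wrong_hit_rate) rows.
    """
    rows = []
    total = len(scored_pairs)
    for threshold in thresholds:
        # A pair counts as a cache hit when its similarity clears the threshold.
        hits = [correct for sim, correct in scored_pairs if sim >= threshold]
        wrong = sum(1 for correct in hits if not correct)
        hit_rate = len(hits) / total if total else 0.0
        wrong_hit_rate = wrong / len(hits) if hits else 0.0
        rows.append((threshold, hit_rate, wrong_hit_rate))
    return rows


# Made-up sample: (similarity score, would the cached answer have been right?)
sample_pairs = [
    (0.95, True), (0.90, True), (0.85, False),
    (0.70, True), (0.60, False), (0.30, False),
]

for threshold, hit_rate, wrong_hit_rate in sweep_thresholds(sample_pairs, [0.5, 0.8, 0.99]):
    print(f"threshold={threshold:.2f} hit_rate={hit_rate:.2f} wrong_hit_rate={wrong_hit_rate:.2f}")
```

A loose threshold shows up as a high hit rate with many wrong hits, and a tight one as a hit rate near zero, which is the trade-off described above.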
