// the find

SkywalkerDarren/chatWeb

★ 914 · Python · MIT · updated May 2026

ChatWeb can crawl web pages, read PDF, DOCX, TXT, and extract the main content, then answer your questions based on the content, or summarize the key points.

ChatWeb is a RAG pipeline that pulls text from URLs, PDFs, DOCX, and TXT files, embeds it with OpenAI's embedding API, and lets you ask questions over the content. It targets developers and researchers who want a self-hosted alternative to ChatPDF without the subscription. It supports FAISS (in-memory) or pgvector as the vector store.

Dual vector store support: FAISS for zero-setup local use, pgvector when you want persistence without a separate service. Keyword extraction before embedding lookup is a legitimate improvement over naive question-to-vector search: stripping conversational filler from the question keeps the query vector close to the topic instead of the phrasing. Docker support with a working docker-compose means you can actually run it without wrestling with Python deps. The three-mode design (console/api/webui) is practical; you can wire it into other tools via the API mode without rewriting anything.
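To make the keyword-extraction point concrete, here is a minimal sketch of the idea, not ChatWeb's actual code. It assumes a toy bag-of-words vector in place of a real embedding API, a hand-picked stopword list in place of the LLM-based keyword step, and a brute-force scan in place of FAISS or pgvector. The names `extract_keywords`, `embed`, `search`, `CHUNKS`, and `QUESTION` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical filler-word list; a real system would use an LLM or a proper
# keyword extractor here instead of a hand-picked set.
STOPWORDS = {
    "hey", "can", "you", "please", "tell", "me", "what", "the", "a", "an",
    "is", "are", "do", "does", "i", "about", "of", "to", "in", "how", "think",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(question):
    """Drop conversational filler, keeping only topic-bearing tokens."""
    return [tok for tok in tokenize(question) if tok not in STOPWORDS]


def embed(text):
    """Toy stand-in for an embedding call: a sparse bag-of-words vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[tok] for tok, count in a.items() if tok in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(question, chunks, top_k=1, use_keywords=True):
    """Rank chunks against the question, optionally via extracted keywords."""
    keywords = extract_keywords(question) if use_keywords else []
    # Fall back to the raw question if nothing survives keyword extraction.
    query = " ".join(keywords) if keywords else question
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


CHUNKS = [
    "pgvector stores embeddings inside postgres",
    "can you tell me what you think about the weather",
    "faiss keeps the index in memory",
]
QUESTION = "Hey, can you please tell me what pgvector stores?"

if __name__ == "__main__":
    print("keywords:", extract_keywords(QUESTION))
    print("keyword query:", search(QUESTION, CHUNKS))
    print("raw question:", search(QUESTION, CHUNKS, use_keywords=False))
```

On this toy data the keyword query ranks the pgvector chunk first, while the raw question is pulled toward the chit-chat chunk that shares its filler words. Real embeddings are far less sensitive to filler than word counts, so treat this as an illustration of the mechanism rather than a measure of the benefit.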

The whole project is a flat pile of single-file modules with no packaging, no tests, and config entirely via a JSON file you copy by hand; this will rot fast as dependencies drift. Feature work looks stalled: the TODO list ends with 'other features that have not been thought of yet,' which is not a roadmap. The OpenAI embedding and GPT-3.5 chat models are hardcoded; swapping to a different model or provider requires editing source files directly. The pgvector setup instructions point to pgvector v0.4.0 from 2023; the extension is now at v0.8.x, with substantial performance improvements that anyone following those instructions verbatim will miss.
