// the find

NVIDIA/ChatRTX

★ 3,122 · Python · NOASSERTION · updated Jan 2026

A developer reference project for creating Retrieval Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM

ChatRTX is NVIDIA's reference implementation for local RAG on Windows RTX hardware, combining TensorRT-LLM, LlamaIndex, and FAISS into a desktop Electron app. It's aimed at developers who want to see how to wire up GPU-accelerated inference with document retrieval on consumer hardware. As of January 2026, NVIDIA deprecated it and stopped maintenance.

The TensorRT-LLM integration is the main draw: running quantized LLMs locally on RTX hardware is meaningfully faster than CPU or naive CUDA inference. Multi-modal coverage is real. Text, PDF, images via CLIP, and voice via Whisper/RIVA in one package is more than most local RAG demos attempt. The API/app split (ChatRTX_APIs vs ChatRTX_App) means the inference layer can be used headlessly, without the Electron wrapper. The examples directory covers the core patterns concisely (NIM inference, streaming, RAG, CLIP) and works well as a source of copy-paste starting points.
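For readers new to the pattern, the retrieve-then-prompt loop at the heart of any RAG demo can be sketched in a few lines. This is a minimal illustration, not ChatRTX's code: the bag-of-words `embed` function stands in for a real embedding model, the brute-force ranking stands in for a FAISS lookup, and every name here (`embed`, `cosine`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real pipeline uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank every document against the query and keep the top k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Stuff the retrieved passages into the prompt that would go to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Toy corpus, invented for illustration only.
docs = [
    "TensorRT-LLM runs quantized models on RTX GPUs",
    "FAISS stores document vectors for similarity search",
    "Electron wraps the desktop user interface",
]
question = "how are document vectors stored"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline the `prompt` string would be handed to the inference layer; the sketch stops there because that call depends on whichever backend is in use.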

The headline problem is the deprecation as of January 2026. Any dependency on the TensorRT-LLM or NIM versions used here will rot; TensorRT-LLM moves fast and breaks things. Hardware requirements are steep and narrow: you need a 30- or 40-series (or newer) RTX GPU, on Windows 11 only, which rules out most developer machines and effectively all CI environments. The FAISS vector store is flat-file with no persistence story, so if you want production RAG you'll be ripping it out and replacing it. Sample data is all NVIDIA GeForce marketing articles, so initial results look better than they will on real, heterogeneous document sets.
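To make the persistence complaint concrete, here is a rough sketch of the minimum a vector store needs beyond search: a way to write the index to disk and load it back. It is a pure-Python stand-in under stated assumptions, not the FAISS API and not ChatRTX's code; `FlatStore`, its methods, and the JSON file layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class FlatStore:
    """Hypothetical brute-force vector store (not ChatRTX's or FAISS's API)."""

    def __init__(self) -> None:
        self.vectors: list[list[float]] = []
        self.texts: list[str] = []

    def add(self, vector: list[float], text: str) -> None:
        self.vectors.append(vector)
        self.texts.append(text)

    def search(self, query: list[float], k: int = 1) -> list[str]:
        # Exhaustive inner-product scan: every stored vector is scored.
        scores = [sum(q * v for q, v in zip(query, vec)) for vec in self.vectors]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return [self.texts[i] for i in order[:k]]

    def save(self, path: Path) -> None:
        # Assumed on-disk format: one JSON object holding vectors and texts.
        path.write_text(json.dumps({"vectors": self.vectors, "texts": self.texts}))

    @classmethod
    def load(cls, path: Path) -> "FlatStore":
        data = json.loads(path.read_text())
        store = cls()
        store.vectors, store.texts = data["vectors"], data["texts"]
        return store


# Toy vectors and labels, invented for illustration only.
store = FlatStore()
store.add([1.0, 0.0], "driver release notes")
store.add([0.0, 1.0], "gaming laptop review")

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    store.save(index_path)
    restored = FlatStore.load(index_path)

result = restored.search([0.1, 0.9], k=1)
print(result)
```

A production setup would also need incremental updates, metadata filtering, and concurrent access, which is why swapping in a dedicated vector database is the usual route.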
