// the find

78/xiaozhi-esp32

★ 27,300 · C++ · MIT · updated Jun 2026

An MCP-based chatbot

XiaoZhi is voice-interactive AI chatbot firmware for ESP32 devices. It connects to cloud LLMs (Qwen, DeepSeek) over WebSocket or MQTT and exposes device control through MCP (the Model Context Protocol). It ships with offline wake-word detection, Opus audio, speaker recognition, and display support, targeting makers who want a real conversational AI on cheap hardware. The Chinese maker community is the primary audience, though English documentation exists.

The MCP integration is genuinely clever: instead of hardcoding skills, the project exposes both device peripherals (GPIO, servo, LED) and cloud capabilities (smart home, email) as MCP tools, so the LLM can invoke them uniformly without firmware changes. Offline wake-word detection via ESP-SR means the device isn't always streaming audio to the cloud. Supporting 70+ hardware boards, with a custom-board guide, significantly lowers the friction of bringing up new hardware. The streaming ASR+LLM+TTS pipeline is the right architecture for low-latency voice on constrained hardware, because it starts speaking without buffering the whole response first.
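To make the "uniform tools" point concrete, here is a minimal Python sketch of the idea, not XiaoZhi's actual firmware code (which is C++). It assumes MCP's JSON-RPC methods `tools/list` and `tools/call`; the `ToolRegistry` class, the tool names `led.set` and `email.send`, and both handlers are hypothetical stand-ins invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tool:
    name: str
    description: str
    input_schema: Dict[str, Any]              # JSON Schema describing the arguments
    handler: Callable[[Dict[str, Any]], str]  # returns a text result for the LLM


class ToolRegistry:
    """One dispatch path for every capability, local peripheral or cloud service."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def handle(self, raw: str) -> str:
        """Answer a single JSON-RPC request given as a JSON string."""
        req = json.loads(raw)
        reply: Dict[str, Any] = {"jsonrpc": "2.0", "id": req.get("id")}
        method = req.get("method")
        if method == "tools/list":
            reply["result"] = {"tools": [
                {"name": t.name, "description": t.description,
                 "inputSchema": t.input_schema}
                for t in self._tools.values()
            ]}
        elif method == "tools/call":
            params = req.get("params", {})
            tool = self._tools.get(params.get("name"))
            if tool is None:
                reply["error"] = {"code": -32602, "message": "unknown tool"}
            else:
                text = tool.handler(params.get("arguments", {}))
                reply["result"] = {
                    "content": [{"type": "text", "text": text}],
                    "isError": False,
                }
        else:
            reply["error"] = {"code": -32601, "message": "method not found"}
        return json.dumps(reply)


# Hypothetical stand-in for a GPIO-driven LED.
led_state = {"on": False}


def set_led(args: Dict[str, Any]) -> str:
    led_state["on"] = bool(args["on"])
    return "led on" if led_state["on"] else "led off"


def send_email(args: Dict[str, Any]) -> str:
    # Hypothetical stand-in for a cloud capability; nothing is actually sent.
    return f"queued mail to {args['to']}"


registry = ToolRegistry()
registry.register(Tool(
    "led.set", "Turn the status LED on or off",
    {"type": "object", "properties": {"on": {"type": "boolean"}}, "required": ["on"]},
    set_led,
))
registry.register(Tool(
    "email.send", "Send an email",
    {"type": "object", "properties": {"to": {"type": "string"}}, "required": ["to"]},
    send_email,
))
```

The model side sees only names, descriptions, and argument schemas, so a peripheral and a cloud service look identical to it. Adding a capability in this sketch is one more `register` call rather than a change to the dispatch logic.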

The v1-to-v2 partition table break with no OTA path is a real pain point for anyone who shipped hardware; manually reflashing 50 devices is not fun. The default setup routes everything through xiaozhi.me, so your conversations pass through someone else's server. Self-hosting means setting up one of the community server projects separately, which is underdocumented in the main repo. Speaker recognition using 3D-Speaker on an ESP32-S3 sounds impressive, but its accuracy in noisy environments with the onboard mics is almost certainly worse than the demo videos suggest. The project is heavily China-facing (Bilibili links, Feishu docs, QQ groups), so English-speaking contributors will hit documentation gaps fast.
