// the find
GetStream/Vision-Agents
Open Vision Agents by Stream. Build voice and vision agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
Vision Agents is a Python framework from Stream for building real-time voice and video AI agents. It sits on top of Stream's WebRTC infrastructure and wires together LLMs (OpenAI, Gemini, Claude), STT/TTS providers, and computer vision models (YOLO, Roboflow) into a single agent pipeline. The target is developers who want to ship a voice or video AI feature without building the WebRTC plumbing themselves.
The processor pipeline design is genuinely good: you can chain YOLO pose detection before the LLM call and custom ONNX models after it, which is the right abstraction for video AI. The plugin architecture is clean, with each provider living in its own subdirectory with its own pyproject.toml, so you only pull in what you actually use. The testing module ships a mock session and an LLM judge for agent evaluation, which most similar frameworks skip entirely. The production story is more complete than expected for a framework this young: Prometheus metrics, Helm charts, horizontal scaling docs, and a Redis-backed session registry are all present.
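To make the chaining idea concrete, here is a minimal sketch of a processor pipeline in plain Python. It is not the Vision Agents API: `Frame`, `run_chain`, `pose_detector`, `llm_step`, and `post_model` are hypothetical names invented for illustration, and the "models" are stubs that only write annotations.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


# Hypothetical types for illustration only; not Vision Agents classes.
@dataclass
class Frame:
    pixels: Any
    annotations: Dict[str, Any] = field(default_factory=dict)


# A processor takes a frame and returns it, usually with extra annotations.
Processor = Callable[[Frame], Frame]


def run_chain(frame: Frame, processors: List[Processor]) -> Frame:
    """Pass the frame through each processor in order."""
    for processor in processors:
        frame = processor(frame)
    return frame


def pose_detector(frame: Frame) -> Frame:
    # Stand-in for a pose model such as YOLO; records a fake keypoint count.
    frame.annotations["pose"] = {"keypoints": 17}
    return frame


def llm_step(frame: Frame) -> Frame:
    # Stand-in for the LLM call; it sees what earlier processors annotated.
    pose = frame.annotations.get("pose")
    if pose:
        reply = f"saw pose with {pose['keypoints']} keypoints"
    else:
        reply = "no pose"
    frame.annotations["llm_reply"] = reply
    return frame


def post_model(frame: Frame) -> Frame:
    # Stand-in for a custom model that runs after the LLM.
    frame.annotations["score"] = len(frame.annotations["llm_reply"])
    return frame


result = run_chain(Frame(pixels=None), [pose_detector, llm_step, post_model])
print(result.annotations["llm_reply"])
```

The point of the shape is that ordering is just list ordering: moving a detector before or after the LLM step is a one-line change, and each stage only depends on the annotations left by the stages before it.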
The core dependency on Stream's edge network is real even though the README says 'works with any video edge network': the low-latency numbers only hold if you're using getstream.io, and the free tier caps at 333k participant minutes. Context degradation after ~30 seconds of continuous video is a fundamental problem the README at least admits, but it offers no mitigation strategy beyond 'mix specialized models with larger LLMs'. The monorepo structure, with 20+ plugins each carrying its own pyproject.toml, means dependency management gets messy fast in practice. Memory across sessions is backed by Stream Chat, which is another Stream service dependency that costs money and adds latency.
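For a rough sense of how far 333k participant minutes goes, here is a back-of-envelope sketch. It assumes a participant minute means one participant connected for one minute and that the agent itself counts as a participant; neither assumption is confirmed here, so check Stream's pricing page before relying on the numbers.

```python
FREE_TIER_PARTICIPANT_MINUTES = 333_000  # figure quoted above


def call_hours(participants_per_call: int) -> float:
    """Hours of call time the cap covers at a given call size.

    Assumes every participant (including the agent, which is an
    assumption) consumes one participant minute per minute of call.
    """
    call_minutes = FREE_TIER_PARTICIPANT_MINUTES / participants_per_call
    return call_minutes / 60


# One user plus one agent per call, under the assumptions above.
print(call_hours(2))
```

Under those assumptions, one-on-one agent calls get about 2,775 hours before the cap, and every extra participant in a call cuts that figure proportionally.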