// the find

OpenGVLab/Ask-Anything

★ 3,341 · Python · MIT · updated Jan 2025

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

Ask-Anything is a research repo from OpenGVLab that wraps video understanding models (InternVideo, UMT) with chat interfaces, letting you ask natural-language questions about video content. It spawned two generations: VideoChat1 (2023, BLIP2+LLM pipeline) and VideoChat2 (2024, CVPR Highlight, with its own MVBench evaluation suite). It is aimed at researchers who want a working baseline for video-language tasks, not at developers building products.

VideoChat2 ships a serious benchmark (MVBench) alongside the model, so you can actually measure what you're getting rather than relying on vibes. The multi-backbone support is genuine: Mistral, Phi-3, and Vicuna variants exist with separate training scripts, not just stubs. The 2M-sample instruction dataset is released publicly, which is rare and useful if you want to fine-tune on your own domain. The January 2025 pointer to VideoChat-Flash shows the team is still publishing successor work rather than abandoning the codebase.

This is a research dump, not a library: there's no installable package, no versioned API, and setup involves manually placing checkpoint files at paths hardcoded in config JSONs. The repo mixes three generations of code (video_chat_with_ChatGPT, video_chat, video_chat2) in one tree with no clear deprecation story, so it's easy to start in the wrong directory and spend an hour debugging a dead code path. Dependency hell is real: the ChatGPT variant needs a 2023-era environment.yaml with a pinned detectron2, plus InternVideo weights hosted at non-obvious links. If you want inference rather than research replication, you'd be better served by the successor VideoChat-Flash repo they link to.
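Since checkpoint paths live in config JSONs, a quick pre-flight check can save a failed launch. The sketch below is not part of the repo: it walks any JSON file, collects string values that end in a checkpoint-like extension, and lists the ones that don't exist on disk. The function names and the extension list are assumptions for illustration, and it makes no claim about the actual key names in the VideoChat2 configs.

```python
import json
from pathlib import Path

# Extensions treated as checkpoint files. This list is an assumption;
# adjust it to whatever your config actually points at.
CHECKPOINT_SUFFIXES = (".pth", ".pt", ".bin", ".safetensors")


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def missing_checkpoints(config_path):
    """Return checkpoint-like paths in the config that do not exist on disk.

    Relative paths are resolved against the current working directory,
    so run this from the directory you would launch the demo from.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    candidates = [s for s in iter_strings(config) if s.endswith(CHECKPOINT_SUFFIXES)]
    return [p for p in candidates if not Path(p).expanduser().exists()]
```

Calling `missing_checkpoints("path/to/config.json")` returns a list of the paths you still need to download or fix; an empty list means every checkpoint-like path in that file resolves.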
