// the find

OpenGVLab/VideoChat-Flash

★ 527 · Python · MIT · updated Nov 2025

[ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

VideoChat-Flash is a multimodal large language model for video understanding. It handles long-context video by compressing each frame down to 16 tokens using Token Merging (ToMe). It is built on InternVideo2 as the visual encoder and Qwen2/2.5 as the language backbone, and it targets researchers working on video-language models who need to handle hour-long videos without the quadratic attention cost and ever-growing KV cache that uncompressed frame tokens would bring.
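To make "compressing each frame down to 16 tokens" concrete, here is a minimal sketch of the general idea of merging similar tokens until a fixed budget remains. It is a simplified stand-in, not the repo's code and not ToMe's actual matching scheme: the function names, the greedy adjacent-pair strategy, and the plain-list token representation are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_to_budget(tokens, budget=16):
    """Greedily average the most similar adjacent pair until `budget` tokens remain.

    Hypothetical helper for illustration only; real token merging operates on
    batched tensors inside the visual encoder.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    # Each entry holds (vector, number of original tokens merged into it),
    # so repeated merges stay a true weighted mean of the originals.
    merged = [(list(t), 1) for t in tokens]
    while len(merged) > budget:
        # Index of the neighbouring pair with the highest similarity.
        best = max(
            range(len(merged) - 1),
            key=lambda i: cosine(merged[i][0], merged[i + 1][0]),
        )
        (va, ca), (vb, cb) = merged[best], merged[best + 1]
        total = ca + cb
        avg = [(x * ca + y * cb) / total for x, y in zip(va, vb)]
        merged[best:best + 2] = [(avg, total)]
    return [vec for vec, _ in merged]


# Toy frame: 64 tokens of dimension 4, reduced to the 16-token budget.
frame_tokens = [[float(i), 1.0, 0.0, float(i % 3)] for i in range(64)]
compressed = merge_to_budget(frame_tokens, budget=16)
```

The point of the sketch is only that the output length is fixed by the budget, regardless of how many tokens the encoder emits per frame; which tokens get merged, and at which layer, is where the real method does its work.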

The 16-token-per-frame compression is genuinely impressive: the 5-10x speedup over prior work is real, and the 99.1% needle-in-a-haystack accuracy at 10,000 frames is a concrete, meaningful number rather than a vague benchmark win. The four-stage training curriculum (connector init → visual pretraining → video SFT → high-res post-SFT) is well-structured, and each stage's data config is checked in, so you can see exactly what was trained on. The training data is actually released on Hugging Face, not just referenced, which matters for reproducibility. The eval harness (an lmms-eval fork) is bundled alongside the model code, so running the exact benchmark configs that produced the paper numbers is straightforward.
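As a rough sense of scale, the arithmetic behind 16 tokens per frame at 10,000 frames can be sketched as below. The 196-token uncompressed baseline is purely an assumption for comparison (a hypothetical 14×14 patch grid), not a figure from the repo or the paper, and the ratios are token-count arithmetic only; they do not reproduce the 5-10x speedup mentioned above.

```python
def visual_tokens(frames, tokens_per_frame):
    """Total visual tokens the language model would see for a video."""
    return frames * tokens_per_frame


FRAMES = 10_000             # frame count from the needle-in-a-haystack result
COMPRESSED_PER_FRAME = 16   # per-frame budget described in this write-up
ASSUMED_UNCOMPRESSED = 196  # ASSUMPTION: hypothetical 14x14 patch grid, for comparison only

compressed_total = visual_tokens(FRAMES, COMPRESSED_PER_FRAME)    # 160_000
uncompressed_total = visual_tokens(FRAMES, ASSUMED_UNCOMPRESSED)  # 1_960_000

# Sequence length (and KV-cache size) scales linearly with token count,
# while full self-attention work scales with its square.
length_ratio = uncompressed_total / compressed_total  # 12.25
pairwise_ratio = length_ratio ** 2                    # 150.0625

print(f"compressed context:   {compressed_total:,} tokens")
print(f"assumed uncompressed: {uncompressed_total:,} tokens")
print(f"linear ratio: {length_ratio}x, quadratic ratio: {pairwise_ratio}x")
```

Even at 16 tokens per frame, 10,000 frames is still a 160,000-token visual context, which is why the long-context retrieval number is the more interesting half of the claim.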

The README punts inference instructions to Hugging Face rather than providing them in the repo itself; a repo with this much training code should have a working inference example. The future plans section is a public admission that vllm/lmdeploy support, LoRA finetuning, and mixed image/video training are all missing, and the README literally asks the community to submit PRs because the authors are too busy. That's honest, but not reassuring for anyone who wants to build on this. The code is split across three near-identical directory trees (llava-train_videochat, xtuner-train_internvideo2_5, xtuner-eval_niah) with duplicated model files and no shared library, so any bug fix needs to be applied in multiple places. The stars-to-forks ratio of 527:19 suggests people are watching rather than using.
