// the find

ModelTC/LightCompress

★ 729 · Python · Apache-2.0 · updated May 2026

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

LightCompress (formerly LLMC) is a post-training compression toolkit for LLMs, VLMs, and video generation models. It implements a wide range of quantization, pruning, and token reduction algorithms with direct export to production inference backends like vLLM and SGLang. It's for ML engineers who need to ship smaller models and don't want to implement AWQ or GPTQ from scratch.

The backend integration is the strongest part: you can run AWQ quantization and export vLLM-compatible INT4 weights in a single pipeline, which removes a painful manual step that most similar tools leave to you. Supporting DeepSeek-R1 671B quantization on a single A100 through careful memory management is a real engineering achievement. The YAML-driven config system and the breadth of algorithm coverage (15+ quantization methods, token pruning, sparsity) mean you can swap methods without rewriting code. The benchmarking papers behind it mean the accuracy claims have at least been measured against baselines.
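To make the "swap methods without rewriting code" point concrete, here is a minimal sketch of the general pattern a config-driven toolkit tends to follow: a registry keyed by a method name read from the config. Everything in it is hypothetical and for illustration only. The registry, the function names, and the `method`/`bits` keys are assumptions and not LightCompress's schema or API, and the quantizer is plain symmetric round-to-nearest, not AWQ or GPTQ.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a config "method" name to a function.
# None of these names come from LightCompress.
QUANTIZERS: Dict[str, Callable[[List[float], int], List[float]]] = {}


def register(name: str):
    """Decorator that files a quantizer under the given method name."""
    def wrap(fn):
        QUANTIZERS[name] = fn
        return fn
    return wrap


@register("rtn")
def round_to_nearest(weights: List[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest: quantize to signed ints, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0        # avoid dividing by zero
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


@register("identity")
def identity(weights: List[float], bits: int) -> List[float]:
    """Baseline that leaves the weights untouched."""
    return list(weights)


def compress(weights: List[float], config: dict) -> List[float]:
    """Pick the method named in the config and apply it."""
    method = QUANTIZERS[config["method"]]
    return method(weights, config["bits"])


weights = [0.7, -0.3, 0.12, 0.0]
rtn_out = compress(weights, {"method": "rtn", "bits": 4})
same_out = compress(weights, {"method": "identity", "bits": 4})
```

Changing the algorithm here means changing one string in the config rather than the calling code, which is the property the YAML setup is being credited with.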

The rename from LLMC to LightCompress is recent enough that documentation, issue references, and Docker tags still mix the two names; you'll hit the old one constantly and have to mentally translate. Python 3.11 is pinned and Python 3.12 is called 'generally less stable', a caveat that is now about a year out of date and signals the dependency set hasn't been modernized. Token reduction algorithms are almost entirely VLM-focused and were added recently; the integration with quantization pipelines looks new and likely has rough edges for anything beyond the happy-path configs provided. There are no unit tests visible in the repo structure. For a tool where the whole point is numerical accuracy, that's a significant gap.
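For a sense of what the missing tests could look like, here is a minimal pytest-style sketch that checks a quantizer's round-trip error against a known bound. The `quantize_dequantize` function is a hypothetical stand-in (symmetric round-to-nearest), not code from the repo. A real suite would call the toolkit's own quantizers and use bounds appropriate to each algorithm.

```python
import random


def quantize_dequantize(weights, bits=4):
    """Hypothetical stand-in for a real quantizer: symmetric round-to-nearest.

    Returns the dequantized weights and the scale (step size) that was used.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    restored = [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]
    return restored, scale


def test_error_is_bounded_by_half_a_step():
    # Rounding to the nearest level can move a value by at most half a step.
    rng = random.Random(0)                      # fixed seed keeps the test repeatable
    weights = [rng.uniform(-1.0, 1.0) for _ in range(256)]
    restored, scale = quantize_dequantize(weights, bits=4)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    assert worst <= scale / 2 + 1e-12           # small slack for float rounding


def test_all_zero_input_survives():
    # Edge case: an all-zero tensor must not trigger a divide-by-zero.
    restored, _ = quantize_dequantize([0.0] * 8)
    assert restored == [0.0] * 8
```

Tests of this shape are cheap to run and catch the regressions that matter most in a compression tool: a scale computed wrongly, a clamp applied at the wrong range, or an edge case that produces NaNs.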
