// the find
intel/auto-round
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.
AutoRound is a post-training quantization toolkit from Intel that uses sign-gradient descent to optimize weight rounding, pushing accuracy at 2-4 bit widths beyond what naive round-to-nearest gets you. It targets ML engineers who need to deploy LLMs on constrained hardware and want better quality-per-bit than GPTQ or AWQ without retraining the model. The sign-gradient approach has peer-reviewed backing (EMNLP 2024), and the reported numbers on DeepSeek-R1 at INT2 (97.9% retention) are credible enough to take seriously.
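To make the rounding idea concrete, here is a toy NumPy sketch, not AutoRound's actual code. It assumes a single linear layer, one per-tensor scale, a straight-through estimator for `round()`, and a step size of `1/steps`; all of these are illustrative choices. The function names `quantize` and `sign_sgd_rounding` are hypothetical. It tunes only the rounding offsets, which is a subset of what the real tool optimizes.

```python
import numpy as np


def quantize(W, scale, V, bits=4):
    """Symmetric quantize-dequantize of W with a per-weight rounding offset V."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(W / scale + V), -qmax - 1, qmax)
    return q * scale


def sign_sgd_rounding(W, X, bits=4, steps=200, lr=None):
    """Tune rounding offsets V in [-0.5, 0.5] with sign-gradient steps.

    W: (in_features, out_features) weights, X: (n, in_features) calibration inputs.
    Returns (best_V, scale, rtn_loss, tuned_loss).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax  # one scale for the whole tensor (toy choice)
    lr = lr if lr is not None else 1.0 / steps  # assumed step size
    ref = X @ W  # full-precision layer output to reconstruct

    def loss(V):
        return float(np.mean((X @ quantize(W, scale, V, bits) - ref) ** 2))

    V = np.zeros_like(W)  # V = 0 is plain round-to-nearest
    rtn_loss = loss(V)
    best_V, best_loss = V.copy(), rtn_loss
    for _ in range(steps):
        err = X @ quantize(W, scale, V, bits) - ref
        # Straight-through estimator: treat round() as identity, so the
        # gradient w.r.t. V is X^T err up to a positive constant.
        grad = X.T @ err
        # Only the sign of the gradient is used, never its magnitude.
        V = np.clip(V - lr * np.sign(grad), -0.5, 0.5)
        cur = loss(V)
        if cur < best_loss:  # keep the best offsets seen so far
            best_V, best_loss = V.copy(), cur
    return best_V, scale, rtn_loss, best_loss
```

Because the loop keeps the best offsets it has seen and starts from `V = 0`, the tuned reconstruction error can never be worse than round-to-nearest on the calibration inputs; how much better it gets depends on the data and bit width.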
The calibration pipeline is well-engineered: you can swap in custom datasets, control sequence length, and run multi-GPU quantization without patching internals. Export targets are broad and practically useful. auto_round, GPTQ, AWQ, GGUF, and LLM-Compressor are all covered by one tool, so you're not locked into a single inference stack. AutoScheme generates a mixed-precision recipe in minutes, which is the right answer to the 'which layers can tolerate 2-bit?' question that most users otherwise answer by guessing. The model-free RTN path added in 2026 means you can get a fast baseline without even loading the model weights for calibration.
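A rough sketch of what driving the calibration knobs might look like from Python. The `AutoRound(...)` constructor, its `scheme`, `dataset`, `seqlen`, `nsamples`, `iters` and `low_gpu_mem_usage` arguments, and `quantize_and_save(..., format=...)` are written from memory of the project README and should be treated as assumptions to check against the installed version. The helpers `calibration_kwargs` and `quantize_and_export`, and the default values shown, are hypothetical and exist only for this example.

```python
def calibration_kwargs(dataset="NeelNanda/pile-10k", seqlen=2048,
                       nsamples=128, iters=200, low_gpu_mem_usage=False):
    """Collect calibration settings in one place and sanity-check them."""
    if seqlen <= 0 or nsamples <= 0 or iters < 0:
        raise ValueError("seqlen and nsamples must be positive, iters non-negative")
    return {
        "dataset": dataset,            # custom calibration dataset goes here
        "seqlen": seqlen,              # calibration sequence length
        "nsamples": nsamples,          # number of calibration samples
        "iters": iters,                # tuning steps per block
        "low_gpu_mem_usage": low_gpu_mem_usage,  # trade speed for memory
    }


def quantize_and_export(model_name, output_dir, scheme="W4A16",
                        export_format="auto_round", **calib):
    """Quantize a model and write it out in one export format."""
    # Imported lazily so the settings helper works without the package installed.
    from auto_round import AutoRound  # assumed import path

    ar = AutoRound(model_name, scheme=scheme, **calib)  # assumed signature
    ar.quantize_and_save(output_dir, format=export_format)  # assumed signature
```

A call would then look like `quantize_and_export("some/model", "./out", **calibration_kwargs(seqlen=512))`, where the model id is a placeholder. The format strings for the GPTQ, AWQ, GGUF and LLM-Compressor targets are best taken from the project docs, not guessed.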
VLM quantization is still explicitly experimental, and the multi-modal calibration path uses a single default dataset, which will hurt image-understanding tasks disproportionately compared to text-only evals. The inference backend selection is automatic but opaque: when it quietly falls back to a slower backend you get a log message, not a clear error, which makes production debugging annoying. MXFP4 and NVFP4 are listed as supported schemes, but the table notes that real kernels for MXFP4 are absent in the auto_round format, meaning you can quantize but not actually run fast inference without exporting to LLM-Compressor first. The `low_gpu_mem_usage` flag saves ~20GB at the cost of ~30% speed, a meaningful tradeoff that deserves a concrete memory breakdown in the docs rather than a parenthetical.
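One way to work around the log-only fallback in a startup smoke test is to promote matching log records to exceptions using Python's standard `logging` module. This is a generic sketch: the logger name and the phrases it matches are hypothetical, since the review does not quote the actual message, so both need to be checked against what the library really emits. `FallbackAsError` and `fail_on_fallback` are names invented here.

```python
import logging


class FallbackAsError(logging.Handler):
    """Raise when a log record looks like a backend fallback notice."""

    def __init__(self, needles=("fallback", "falling back")):
        super().__init__(level=logging.INFO)
        # Hypothetical phrases; replace with the library's real wording.
        self.needles = tuple(n.lower() for n in needles)

    def emit(self, record):
        message = record.getMessage()
        if any(n in message.lower() for n in self.needles):
            # An exception raised here propagates to the code that logged.
            raise RuntimeError(f"backend fallback reported: {message}")


def fail_on_fallback(logger_name):
    """Attach the handler to a logger and return it so it can be removed."""
    handler = FallbackAsError()
    logging.getLogger(logger_name).addHandler(handler)
    return handler
```

This is better suited to CI or a pre-deploy check than to a serving process, since it turns what the library treats as a soft condition into a hard failure. Records below the logger's effective level are never created, so the logger may need its level lowered if the notice is emitted at INFO.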