// the find

neuralmagic/deepsparse

★ 3,160 · Python · NOASSERTION · updated Jun 2025

Sparsity-aware deep learning inference runtime for CPUs

DeepSparse is a CPU inference runtime that exploits weight sparsity — pruned, quantized ONNX models run significantly faster on commodity CPUs because sparse matrix ops skip the zero multiplications. It targets teams that need to run NLP and CV models in production without GPU hardware. As of June 2, 2025, it is officially end-of-life: Neural Magic was acquired by Red Hat, the team moved to vLLM, and the entire ecosystem (SparseML, SparseZoo, Sparsify) was deprecated simultaneously.

The core engineering premise is sound — wrapping AVX-512 SIMD instructions to exploit unstructured sparsity is not something you get from ONNX Runtime or PyTorch out of the box. The pipeline abstraction was well-thought-out, covering NLP, CV, and LLM inference through a consistent API with both single-stream and multi-stream scheduling modes. Deployment examples are extensive — AWS SageMaker, Lambda, GCP Cloud Run, GKE, Azure — and the OpenAI-compatible server interface meant you could swap it in without changing client code. The benchmarking tooling is genuinely useful for profiling sparse vs. dense tradeoffs on specific hardware.

This is an archived project. The deprecation notice is in the README — no updates, no support, no security fixes going forward. Adopting it now means owning a dead dependency. The value proposition also required the full Neural Magic toolchain: you need SparseML to prune, SparseZoo to get pre-sparsified models, and DeepSparse to run them — and all three are gone. The CPU-only angle was already a hard sell when most serious inference workloads run on GPUs or dedicated accelerators, and the 3k stars after four years shows it never achieved broad adoption. The legacy/ directory is a warning sign too — a large chunk of the pipeline code was mid-refactor when development stopped.

View on GitHub → Homepage ↗