// the find

ashvardanian/NumKong

★ 1,827 · C · Apache-2.0 · updated Jun 2026

SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types — from 6-bit floats to 64-bit complex — across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go 📐

NumKong is a header-heavy SIMD math library for dense and sparse linear algebra, distance metrics, and mixed-precision quantized operations across x86, Arm, RISC-V, LoongArch, Power, and WASM. It targets the gap between bloated BLAS distributions (PyTorch + MKL at 705 MB) and zero-SIMD alternatives, shipping 2000+ kernels in a 5 MB binary with bindings for 7 languages. The primary audience is people building vector search engines, LLM inference runtimes, or robotics systems where thread-pool ownership and allocator control matter.

Accumulation is correct by default: Int8 widens to Int32, and Float16 and BFloat16 widen to Float32, so i8 dot products don't silently overflow at dimension 128 the way they do in NumPy and PyTorch. The no-allocation, no-threading contract is rare and useful: BLAS thread-pool oversubscription is a real production problem, and NumKong sidesteps it entirely by exposing row-range parameters, leaving the caller to decide how work is split across threads. The three-phase pack API (size → pack → compute) separates weight repacking from inference, which is the right model for frozen-weight LLM serving; NVIDIA TurboMind and Intel MKL independently converged on the same design. The benchmark methodology is honest: single-threaded latency is measured separately from throughput, the authors call out that PyTorch and JAX are designed for throughput rather than single-call latency, and results are cross-validated against MKL, OpenBLAS, and Apple Accelerate rather than against hand-picked favorable comparisons.
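To make the accumulation point concrete, here is a minimal pure-Python sketch of the difference between an accumulator that stays at the input width and one that is widened to 32 bits. It does not call NumKong, NumPy, or PyTorch; the function names `dot_i8_narrow` and `dot_i8_widened` are hypothetical, and `ctypes` is used only to emulate fixed-width wraparound.

```python
import ctypes


def dot_i8_narrow(a, b):
    """Dot product with an 8-bit accumulator that wraps on overflow.

    Hypothetical helper: emulates what happens when the running sum
    is kept at the input width instead of being widened.
    """
    acc = 0
    for x, y in zip(a, b):
        # c_int8 performs no overflow checking, so the value wraps
        # into the range [-128, 127].
        acc = ctypes.c_int8(acc + x * y).value
    return acc


def dot_i8_widened(a, b):
    """Dot product with a 32-bit accumulator (the Int8 -> Int32 case).

    Hypothetical helper: for 128 int8 elements the largest possible
    magnitude is 128 * 128 * 128 = 2,097,152, far below the int32 limit.
    """
    acc = 0
    for x, y in zip(a, b):
        acc = ctypes.c_int32(acc + x * y).value
    return acc


if __name__ == "__main__":
    # 128-dimensional vectors whose values all fit in int8.
    a = [100] * 128
    b = [100] * 128
    print(dot_i8_narrow(a, b))   # 0: the true sum is a multiple of 256
    print(dot_i8_widened(a, b))  # 1280000
```

The narrow result is not merely imprecise: at this size it collapses to 0, which is why widening by default matters for similarity search, where such a value would look like a legitimate score.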

GPU support is absent and explicitly deferred: the library is CPU-only, which limits its utility in the inference workloads it most directly targets. The Go bindings go through cgo, so every call pays cgo's transition cost, including a switch off the goroutine stack; that is fine for batch workloads but painful in tight request loops. The 2000+ kernel count also brings build-system complexity that compounds across 7 language bindings: the CI matrix is enormous, and any ISA-specific regression will be hard to bisect without access to the target hardware. Documentation for the C++ template layer and the C dispatch API lives in per-module READMEs scattered across the tree rather than in a single reference, which makes integration work slower than it should be.
