// the find
antirez/ds4
DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm
DwarfStar is a single-file C inference engine by antirez (Redis) built specifically for DeepSeek V4 Flash and PRO, with Metal as the primary target and CUDA/ROCm as secondary. It is not a generic GGUF loader: it runs only the GGUF files published for this project, and its quantization strategy is chosen deliberately for this model architecture (asymmetric 2-bit, where only the routed MoE experts are quantized and the projections are left at full precision). The target audience is developers with high-end Apple Silicon or NVIDIA hardware who want a finished, validated local inference stack rather than a configurable toolkit.
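To make the asymmetric policy concrete, here is a minimal Python sketch of a per-tensor type chooser. The tensor names and the `_exps` marker are assumptions for illustration, loosely modeled on common GGUF naming; they are not taken from the project, and its real selection logic may differ.

```python
# Hypothetical marker for routed-expert tensors (an assumption, not ds4's naming).
ROUTED_EXPERT_MARKER = "_exps"


def pick_type(tensor_name: str, source_type: str, expert_type: str = "IQ2_XXS") -> str:
    """Return the storage type for one tensor under an asymmetric policy."""
    if ROUTED_EXPERT_MARKER in tensor_name:
        return expert_type  # routed experts: aggressive 2-bit quantization
    return source_type      # everything else keeps its original precision


# Illustrative tensor list; names and source types are made up for the example.
tensors = {
    "blk.3.attn_q.weight": "BF16",
    "blk.3.ffn_down_exps.weight": "BF16",
    "blk.3.ffn_down_shexp.weight": "BF16",
}

plan = {name: pick_type(name, src) for name, src in tensors.items()}
for name, target in plan.items():
    print(f"{name}: {target}")
```

Because the routed experts hold most of the weights, a rule this narrow can still account for most of the size reduction while leaving the tensors every token passes through at their original precision.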
The SSD streaming path is the most interesting idea here. Routed MoE experts dominate model size, and modern Mac SSDs are fast enough that cache misses are tolerable, so 'does the model fit in RAM?' stops being a binary question and becomes a throughput dial. The validation story is serious: reference logit vectors from the official implementation, a 92-item regression suite covering GPQA Diamond, AIME 2025, and C/C++ security questions, and documented imatrix methodology for the quantization. Distributed layer splitting over plain TCP with pipelined prefill shows real speedups on long prompts (1.85x at 64k tokens on two M5 Max machines over Thunderbolt), and the protocol handles worker reconnection and KV replay rather than silently producing garbage. The asymmetric quantization (routed experts at IQ2_XXS/Q2_K, everything else untouched) is the right call for MoE architectures, and the 2-bit quants reportedly hold up under tool calling, which is the real test.
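The "throughput dial" can be illustrated with a toy simulation: keep a fixed number of routed experts in RAM under an LRU policy and count how often a token needs one that has to come from SSD. This is a sketch under stated assumptions, not the project's cache. The class name, the LRU policy, the layer and expert counts, and the skewed routing distribution are all hypothetical.

```python
import random
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache of routed experts keyed by (layer, expert_id)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1                   # a real engine would read from SSD here
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used expert


def miss_rate(capacity, n_layers=4, n_experts=64, top_k=4, n_tokens=2000, seed=0):
    """Fraction of expert fetches that miss RAM for a given cache capacity."""
    rng = random.Random(seed)
    cache = ExpertCache(capacity)
    # Assumed skewed routing: expert i is picked with weight 1 / (i + 1).
    weights = [1.0 / (i + 1) for i in range(n_experts)]
    for _ in range(n_tokens):
        for layer in range(n_layers):
            # choices() samples with replacement, so a token may repeat an expert.
            for expert in rng.choices(range(n_experts), weights=weights, k=top_k):
                cache.fetch((layer, expert))
    return cache.misses / (cache.hits + cache.misses)


for capacity in (16, 64, 256):
    print(capacity, round(miss_rate(capacity), 3))
```

With the same routing trace, a larger capacity can never increase the LRU miss count, so shrinking RAM degrades throughput gradually instead of failing outright. How steep the curve is in practice depends on real routing statistics and SSD bandwidth, which this toy does not model.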
You cannot bring your own GGUF: this only works with files from the project's HuggingFace repo, which makes it dependent on antirez maintaining those uploads as the model evolves. The distributed protocol has no authentication or encryption and is explicitly not release-stable. Coordinator and workers must be on the same commit, so a distributed setup breaks whenever one node is updated without the others. The server serializes all inference through a single graph worker with no request batching, which matters if you're running multiple agents against it concurrently. The macOS CPU path will crash the kernel due to a VM bug they cannot work around. They document this openly, but 'please don't run the CPU path on macOS or you'll need to reboot' is a real gap for anyone without a GPU.
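The cost of serializing requests through one worker is easy to put into numbers. The sketch below assumes strict first-in, first-out handling with no batching and uses made-up request durations; the function name is hypothetical and nothing here is measured from the project.

```python
from itertools import accumulate


def completion_times(durations_s):
    """Finish time of each request when one worker runs them back to back."""
    return list(accumulate(durations_s))


# Three agents submit at the same moment; each request needs 10 s of inference.
print(completion_times([10, 10, 10]))  # [10, 20, 30]
```

The last agent waits three times as long as it would alone, which is the practical meaning of "no request batching" for multi-agent setups.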