// the find

Andyyyy64/whichllm

★ 4,530 · Python · MIT · updated Jun 2026

Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.

whichllm answers the question 'which local LLM should I actually run on this machine?' by detecting your hardware, pulling live model data from HuggingFace, and ranking by merged benchmark scores rather than parameter count. It's aimed at developers who want to run models locally and are tired of guessing whether a 32B Q4 fits their VRAM and runs at a usable speed. The GPU simulation mode (`--gpu 'RTX 4090'`) is genuinely useful for hardware purchase decisions.
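To make the "does a 32B Q4 fit" question concrete, here is a back-of-envelope sketch in Python. It is not whichllm's estimator: the function names, the 4.5 bits-per-weight figure for a Q4-style quant, the layer and head shape, and the flat 1 GB overhead are all assumptions chosen for illustration.

```python
def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes_per_value: int = 2,   # assumption: fp16 KV cache
    overhead_gb: float = 1.0,      # assumption: flat runtime/framework overhead
) -> float:
    """Rough VRAM need in GB (1 GB = 1e9 bytes): weights + KV cache + overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    # Keys and values (factor 2) are stored per layer, per KV head, per position.
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_value
    return weights_gb + kv_bytes / 1e9 + overhead_gb


def fits(needed_gb: float, vram_gb: float) -> bool:
    """True if the estimate fits in the given VRAM budget."""
    return needed_gb <= vram_gb


# Hypothetical 32B model, Q4-style quant, 8k context.
needed = estimate_vram_gb(32, 4.5, n_layers=64, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"{needed:.1f} GB needed; fits in 24 GB: {fits(needed, 24)}")
```

With these assumed numbers the weights alone take 18 GB and the KV cache adds roughly 2 GB more, which is why counting weight size only can make a tight fit look comfortable.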

The scoring model is unusually honest about uncertainty: five evidence tiers with explicit discounting (self-reported gets ×0.55, inherited gets ×0.78) and visible markers for inherited/interpolated scores mean you know when a number is made up vs measured. VRAM estimation accounts for KV cache, activation memory, and framework overhead rather than just weight size, which is where most naive tools get it wrong. The `whichllm plan` reverse-lookup (give it a model, get back the GPU you need) is a practical inversion that similar tools don't offer. Test coverage is solid for a CLI tool of this scope, with dedicated test files per hardware backend and per benchmark source.
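The effect of tier discounting on a ranking can be sketched in a few lines of Python. Only the ×0.55 and ×0.78 multipliers come from the project's description; the tier names, the undiscounted `measured` tier, the model names, and the raw scores are hypothetical, and the project's remaining tiers are not modelled here.

```python
# Multipliers applied to a raw benchmark score, keyed by evidence tier.
TIER_MULTIPLIER = {
    "measured": 1.00,       # assumption: directly measured scores are not discounted
    "inherited": 0.78,      # quoted multiplier for inherited scores
    "self_reported": 0.55,  # quoted multiplier for self-reported scores
}


def discounted_score(raw: float, tier: str) -> float:
    """Scale a raw score by the multiplier for its evidence tier."""
    return raw * TIER_MULTIPLIER[tier]


def rank(models: list[tuple[str, float, str]]) -> list[str]:
    """Return model names ordered by discounted score, best first."""
    ordered = sorted(models, key=lambda m: discounted_score(m[1], m[2]), reverse=True)
    return [name for name, _, _ in ordered]


# Hypothetical entries: (name, raw score, evidence tier).
models = [
    ("model-a", 80.0, "self_reported"),
    ("model-b", 60.0, "measured"),
    ("model-c", 70.0, "inherited"),
]
print(rank(models))
```

In this toy data the model with the highest raw score ranks last once its self-reported number is discounted, which is the behaviour the tiering is meant to produce.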

Benchmark data for the frozen tier (Open LLM Leaderboard v2, Chatbot Arena ELO) goes stale fast. The recency demotion logic is described, but the actual staleness depends on how often maintainers push updates, and there's no visible freshness indicator in the default output. The Ollama integration is effectively 'pipe JSON and do it yourself': the gap between HuggingFace IDs and Ollama model names is a real friction point that's acknowledged but not solved. Speed estimates are bandwidth-bound calculations with a lot of per-backend fudge factors. On machines with unified memory (Apple Silicon), the actual throughput can vary widely from the estimate depending on memory pressure from other processes, so the confidence markers help but won't save you from a bad surprise at runtime.
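For readers unfamiliar with bandwidth-bound estimates, the core idea is that generating each token requires reading roughly the whole set of weights from memory once, so memory bandwidth sets the ceiling. A minimal Python sketch follows; the function name and the 0.6 efficiency factor are assumptions standing in for the per-backend fudge factors, not values taken from whichllm.

```python
def estimate_tokens_per_sec(
    bandwidth_gb_s: float,
    model_size_gb: float,
    efficiency: float = 0.6,  # assumption: fraction of peak bandwidth actually achieved
) -> float:
    """Decode-speed ceiling: bytes available per second / bytes read per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb * efficiency


# An 18 GB quantized model on a GPU with 1008 GB/s of memory bandwidth
# (the RTX 4090's rated figure).
tps = estimate_tokens_per_sec(1008, 18)
print(f"~{tps:.1f} tokens/s")
```

The estimate is only as good as the efficiency factor, and on unified-memory machines the effective bandwidth itself moves with whatever else is competing for memory, which is why the real number can land well away from the prediction.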

