// the find

SciSharp/LLamaSharp

★ 3,713 · C# · MIT · updated Jun 2026

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.

LLamaSharp is a C# wrapper around llama.cpp that lets .NET developers run local LLMs without leaving the managed ecosystem. It targets the common case of running quantized GGUF models on CPU or GPU, with higher-level abstractions like ChatSession on top of the raw inference API. If you're building a .NET app that needs local LLM inference and don't want to shell out to Python, this is the only serious option.

- The backend packaging is genuinely well done: separate NuGet packages per accelerator (CPU, CUDA 11, CUDA 12, Vulkan, with Metal bundled in the CPU package on macOS) mean you don't ship GPU binaries to CPU-only deployments, and you don't have to compile anything yourself

- Semantic Kernel and kernel-memory integrations are first-party and kept in sync, so you can drop this into a Semantic Kernel pipeline without glue code

- The llama.cpp version pinning table at the bottom of the README is a lifesaver — GGUF format has broken backward compatibility multiple times, and knowing exactly which commit each release tracks saves hours of debugging 'why does this model crash on startup'

- The batched executor API offers proper multi-sequence inference with KV cache sharing, which most wrappers in other languages still don't expose cleanly

- The high-level ChatSession API leaks llama.cpp concepts (AntiPrompts, GpuLayerCount, context size as 'chat memory length') that make no sense to developers who just want to call a model — you have to understand the underlying runtime to tune anything correctly

- No built-in model download or registry: you're on your own to find a compatible GGUF, check the pinning table, pick the right quantization, and manage the file on disk — the experimental auto-download package exists but is explicitly not production-ready

- Multimodal (LLaVA/mtmd) support is present but thinly documented compared to text — the examples exist but the API surface changed significantly between releases and old blog posts will send you down wrong paths

- Memory management is manual and leaky-by-default in the examples — LLamaWeights and LLamaContext are IDisposable and loading a model twice without disposing the first handle will silently consume VRAM until the process exits
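
To make the tuning and disposal points above concrete, here is a minimal sketch modeled on the README-style quick start. The model path is hypothetical and the parameter values are arbitrary. Because the API surface has shifted between releases, treat the type and member names as assumptions to check against the version you install, not as a reference implementation.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Hypothetical path: point this at a GGUF file that matches the
// llama.cpp commit your LLamaSharp release tracks.
string modelPath = "models/your-model.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 2048, // token window the model can attend to ("chat memory length")
    GpuLayerCount = 0   // layers offloaded to the GPU; 0 keeps inference on the CPU
};

// Both objects hold native memory. `using var` releases them in reverse
// order when the enclosing scope ends, instead of waiting on the GC.
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

var executor = new InteractiveExecutor(context);

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise assistant.");
var session = new ChatSession(executor, history);

var inferenceParams = new InferenceParams
{
    MaxTokens = 128,                            // upper bound on generated tokens
    AntiPrompts = new List<string> { "User:" }  // stop when the model starts writing the user's turn
};

// Stream the reply piece by piece as it is generated.
await foreach (var piece in session.ChatAsync(
    new ChatHistory.Message(AuthorRole.User, "Say hello in one sentence."),
    inferenceParams))
{
    Console.Write(piece);
}
```

If the app swaps models at runtime, dispose the old context and weights (or let their `using` scope end) before calling `LoadFromFile` again, so the first copy is not left resident while the second one loads.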
