// the find
gotzmann/llama.go
llama.go is like llama.cpp in pure Golang!
A pure-Go port of llama.cpp that implements LLaMA inference without CGo or native bindings. It targets developers who want LLM inference inside a Go codebase and don't want to ship a C++ dependency. It reached v1.4 in spring 2023, then stalled.
The tensor math is implemented in pure Go, with hand-written AVX2 and ARM NEON assembly stubs for the hot paths. That's real work, not a CGo wrapper. The embedded REST server, with its pod/thread concurrency model, is a reasonable starting point for serving requests in production. Cross-platform builds (Linux, Mac, Windows) with pre-compiled binaries lower the barrier to trying it. The code is clean enough that you can actually read the llama.go inference loop and understand what it's doing.
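To make the pod/thread idea concrete, here is a minimal sketch of how such a model could look. It assumes a "pod" is a worker that handles one request at a time and "threads" is a fixed compute budget handed to each pod. `Job`, `runPods`, and the `infer` callback are hypothetical names for illustration, not the project's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical inference request (not a type from llama.go).
type Job struct {
	ID     int
	Prompt string
}

// runPods starts `pods` workers that pull jobs from a shared queue, so at most
// `pods` requests are in flight at once. infer stands in for the model's
// forward pass and receives the per-pod thread budget.
func runPods(pods, threads int, jobs []Job, infer func(j Job, threads int) string) map[int]string {
	queue := make(chan Job)
	results := make(map[int]string, len(jobs))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for p := 0; p < pods; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range queue {
				out := infer(j, threads)
				mu.Lock()
				results[j.ID] = out
				mu.Unlock()
			}
		}()
	}

	// Feed the queue; sends block until a pod is free, which gives backpressure.
	for _, j := range jobs {
		queue <- j
	}
	close(queue)
	wg.Wait()
	return results
}

func main() {
	jobs := []Job{{1, "hello"}, {2, "world"}, {3, "again"}}
	out := runPods(2, 4, jobs, func(j Job, threads int) string {
		// Placeholder for real inference.
		return fmt.Sprintf("%s (threads=%d)", j.Prompt, threads)
	})
	fmt.Println(len(out), out[1])
}
```

The point of the split is that pods cap how many requests run concurrently (and so how much memory is held for contexts), while threads cap how much CPU each request may use. Whether llama.go divides the work exactly this way is an assumption here.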
The project is effectively dead. The last push was in September 2024, but the V2 roadmap (LLaMA 2, GGUF format, quantization) was never completed, so it only runs the original LLaMA 1 GGML format, which the ecosystem has since replaced with GGUF. The README itself opens by redirecting you to two other projects. FP32-only weights cost 4 bytes per parameter, so the 7B model's weights alone come to roughly 28GB and you need a 32GB machine just to load it; llama.cpp moved past this years ago with 4-bit quants. The author has acknowledged that this is superseded by their own newer work, so adopting it now means maintaining a dead fork.
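The memory claim is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a nominal 7 billion parameters and counts weights only; it ignores the KV cache, activations, and the per-block scale overhead that real 4-bit formats carry. `weightBytes` is a helper written for this example.

```go
package main

import "fmt"

// weightBytes returns the approximate bytes needed to hold the weights alone
// for a model with `params` parameters stored at `bitsPerWeight` bits each.
func weightBytes(params, bitsPerWeight float64) float64 {
	return params * bitsPerWeight / 8
}

func main() {
	const params = 7e9 // nominal "7B"; an assumption, not the exact parameter count

	formats := []struct {
		name string
		bits float64
	}{
		{"FP32", 32},
		{"FP16", 16},
		{"4-bit", 4},
	}

	for _, f := range formats {
		gb := weightBytes(params, f.bits) / 1e9 // decimal gigabytes
		fmt.Printf("%-5s ~%.1f GB\n", f.name, gb)
	}
}
```

By this estimate FP32 needs about 28 GB for the weights, while an idealized 4-bit encoding needs about 3.5 GB, which is the gap between a workstation and an ordinary laptop.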