// the find

marella/ctransformers

★ 1,887 · C · MIT · updated Jan 2024

Python bindings for the Transformer models implemented in C/C++ using GGML library.

Python bindings for running quantized LLMs locally via GGML, built for the period before llama.cpp had mature Python bindings. Wraps CPU/GPU inference for LLaMA, Falcon, GPT-J, MPT, and a handful of other architectures behind a Hugging Face-compatible API. Aimed at developers who wanted offline inference without PyTorch overhead.

The Hugging Face-style `AutoModelForCausalLM` interface means you can swap it into existing Transformers code with minimal friction. LangChain integration is first-class, not bolted on. CUDA, ROCm, and Metal GPU offload are all supported, with `gpu_layers` controlling how many model layers are offloaded to the GPU, so you can size the offload to the VRAM you have. Pre-built shared libs for AVX2, AVX, and basic CPU tiers ship in the wheel, so most users never need to compile anything.
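As a rough illustration of how `gpu_layers` might be chosen, here is a minimal sketch. The helper names (`estimate_gpu_layers`, `build_load_kwargs`, `load_model`) and the per-layer size figures are hypothetical and not part of ctransformers. Only `AutoModelForCausalLM.from_pretrained` and its `model_type` and `gpu_layers` arguments come from the library. Real per-layer memory use depends on the model and quantization, so treat the numbers as placeholders.

```python
from typing import Any, Dict


def estimate_gpu_layers(total_layers: int, layer_mib: float, vram_budget_mib: float) -> int:
    """Hypothetical helper: how many layers fit in a VRAM budget.

    Assumes every layer costs roughly `layer_mib` MiB, which is a
    simplification; the result is clamped to [0, total_layers].
    """
    if layer_mib <= 0:
        raise ValueError("layer_mib must be positive")
    fit = int(vram_budget_mib // layer_mib)
    return max(0, min(total_layers, fit))


def build_load_kwargs(model_type: str, gpu_layers: int) -> Dict[str, Any]:
    """Collect the keyword arguments passed on to from_pretrained."""
    return {"model_type": model_type, "gpu_layers": gpu_layers}


def load_model(repo_or_path: str, **kwargs: Any):
    """Load a model via ctransformers.

    The import is deferred so the helpers above can be used (and tested)
    on machines where ctransformers is not installed.
    """
    from ctransformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(repo_or_path, **kwargs)
```

With an assumed 32-layer model at about 200 MiB per layer and a 4 GiB budget, `estimate_gpu_layers(32, 200.0, 4096.0)` gives 20. You could then call `load_model("path/to/model.bin", **build_load_kwargs("llama", 20))`, where the path is a placeholder for a local model file.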

Dead project: the last commit was in January 2024, and the underlying GGML/llama.cpp ecosystem has moved so fast since then that this is now two years behind. `llama-cpp-python`, the separately maintained Python bindings for llama.cpp, does everything this does and adds current GGUF support, function calling, and an OpenAI-compatible server. The `.bin` GGML format this targets has been superseded by GGUF, and most model authors have stopped publishing `.bin` files. Only 9 model families are supported, which excluded most models even at peak relevance. Context length is capped per model, with no clear path to extend it.
