// the find

bentoml/OpenLLM

★ 12,352 · Python · Apache-2.0 · updated Jun 2026

Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.

OpenLLM is a CLI tool that wraps BentoML and vLLM to serve open-source LLMs as OpenAI-compatible API endpoints with a single command. It abstracts the messy parts of vLLM setup and adds a model repository system for versioned, reproducible deployments. Aimed at developers who want a self-hosted LLM API without writing infrastructure code.

1. The model repository concept is genuinely useful — pinning model variants by name:version (e.g. llama3.2:1b) gives you reproducibility that raw HuggingFace downloads don't. 2. vLLM backend means production-grade throughput with continuous batching; this isn't a toy wrapper around transformers. 3. Custom repository support lets teams maintain private model catalogs with the same CLI interface, which is a real gap in most similar tools. 4. The openllm hello interactive explorer is a thoughtful onboarding touch — lowers the barrier for first-time GPU setup.

1. Heavy BentoCloud upsell throughout; the deploy path funnels you toward their paid platform, and self-hosting Kubernetes deployments get much less documentation attention. 2. No model weights storage means every fresh environment re-downloads from HuggingFace — no built-in support for a private model mirror or local cache dir configuration. 3. The custom repository workflow requires building BentoML Bentos, which is a significant learning curve if you just want to serve a fine-tuned checkpoint without learning the BentoML packaging system. 4. GPU memory requirements in the model table are rough guides, not guarantees — quantization options and actual memory behavior under load aren't documented, so you'll hit OOM errors without good diagnostics.

View on GitHub → Homepage ↗