// the find

superlinked/sie

★ 2,044 · Python · Apache-2.0 · updated Jun 2026

Open-source inference server and production cluster for all the models your agent needs.

SIE is a self-hosted inference server that wraps Hugging Face models behind a unified HTTP API for embeddings, reranking, and entity extraction. It targets teams running RAG pipelines who want to swap out OpenAI's embedding endpoint or avoid per-call API costs at scale. The production Helm chart makes it genuinely deployable, not just a demo.

The three-function API (encode/score/extract) is a good abstraction: it covers most retrieval pipeline needs without exposing model-specific quirks. The OpenAI-compatible /v1/embeddings endpoint means that, on the client side, migrating from text-embedding-ada-002 is a one-line config change. The Helm chart ships with KEDA scale-to-zero, Grafana dashboards, and Terraform for GKE/EKS, so this is production infrastructure, not a compose file with a bow on it. MTEB quality verification in CI is the right call; most inference servers ship models with no quality gate at all.
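
To make the "one-line config change" concrete, here is a minimal sketch of an OpenAI-shaped embeddings call where only the base URL differs between providers. The SIE host, port, and model name are placeholders assumed for illustration, not values taken from the README, and the request is built but never sent. Note that vectors from different models are not interchangeable, so an existing index would still need re-embedding after the switch.

```python
import json
import urllib.request

OPENAI_BASE_URL = "https://api.openai.com/v1"
# Assumption: a locally running server exposing an OpenAI-compatible /v1 prefix.
# The host and port are hypothetical.
SIE_BASE_URL = "http://localhost:8080/v1"


def build_embeddings_request(base_url, model, texts, api_key="unused"):
    """Build (but do not send) an OpenAI-shaped POST to {base_url}/embeddings."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embeddings_response(body):
    """Extract vectors from an OpenAI-shaped response, ordered by index."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Switching providers means changing the first argument only.
# "my-embedding-model" is a placeholder model name.
request = build_embeddings_request(SIE_BASE_URL, "my-embedding-model", ["hello world"])
```

Passing the request to `urllib.request.urlopen` and the response body to `parse_embeddings_response` would complete the round trip against a live server.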

LRU eviction for on-demand model loading sounds convenient until a request stalls because the model it needs was just evicted and has to be reloaded; there's no explicit model pinning visible in the README. The `/v1/embeddings` endpoint is OpenAI-shaped but the rest of the API isn't, so you're buying into a Superlinked-specific SDK for anything beyond basic dense embeddings. There is no mention of batching behavior or throughput numbers anywhere, which makes it hard to size hardware: a latency figure in 'ms' says little without knowing what it is at P99 under load. The regulatory-rag example ships a LoRA patch file (`encode_lora_routing.patch`) that suggests the server has plugin extension points, but this is completely undocumented in the main README.
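
The eviction concern is easier to see in code. The sketch below is a generic LRU model cache with optional pinning, written to illustrate the idea; it is not SIE's implementation, and the class name, `loader` callable, and model names are all hypothetical.

```python
from collections import OrderedDict


class PinnedLRUModelCache:
    """LRU cache of loaded models in which pinned names are never evicted."""

    def __init__(self, capacity, loader, pinned=()):
        self.capacity = capacity
        self.loader = loader          # callable: model name -> loaded model (slow)
        self.pinned = set(pinned)
        self._models = OrderedDict()  # least recently used first
        self.loads = 0                # cold loads, each one a potential latency spike

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        self._make_room()
        self._models[name] = self.loader(name)
        self.loads += 1
        return self._models[name]

    def _make_room(self):
        # Evict the least recently used model that is not pinned.
        while len(self._models) >= self.capacity:
            victim = next((n for n in self._models if n not in self.pinned), None)
            if victim is None:
                raise RuntimeError("cache is full of pinned models")
            del self._models[victim]

    def loaded(self):
        return list(self._models)


cache = PinnedLRUModelCache(
    capacity=2,
    loader=lambda name: f"model:{name}",  # stand-in for a slow model load
    pinned={"embedder"},
)
for name in ("embedder", "reranker", "extractor"):
    cache.get(name)
```

With plain LRU the third load would evict `embedder`, the least recently used entry, and the next embedding request would pay for a reload. With pinning, `reranker` is evicted instead and the hot path stays warm.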
