// the find

qdrant/fastembed

★ 3,053 · Python · Apache-2.0 · updated Jun 2026

Fast, Accurate, Lightweight Python library to make State of the Art Embedding

FastEmbed is a Python embedding library from the Qdrant team that runs models via ONNX Runtime instead of PyTorch, making it practical for serverless and resource-constrained environments. It covers dense text, sparse (SPLADE), late-interaction (ColBERT), image, and multimodal embeddings plus cross-encoder reranking under one API. The target user is someone building a RAG pipeline who doesn't want to pull in 2 GB of PyTorch just to embed strings.

The ONNX backend is a real practical win: cold starts on Lambda or a small VM are actually usable, which you can't say for sentence-transformers. The model surface is broader than most alternatives, covering dense, sparse, ColBERT, ColPali, and rerankers from one library with a largely consistent embed()-style interface. Custom model support via add_custom_model() is straightforward and lets you point at a HuggingFace ONNX export without forking anything. Deep integration with the Qdrant client means you can pass Document objects directly and skip the embedding step in your own code.
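As a rough sketch of what that embed() interface looks like in practice: this assumes fastembed's `TextEmbedding` class and uses `BAAI/bge-small-en-v1.5` as an example model name (check the supported list before relying on it). The helper names `build_dense_model` and `embed_all` are made up for illustration. Since embed() yields vectors lazily, the helper drains the generator into a plain list.

```python
from typing import Iterable, List


def build_dense_model(model_name: str = "BAAI/bge-small-en-v1.5"):
    """Create a dense text embedder (assumes `pip install fastembed`)."""
    # Imported lazily so this module still loads where fastembed is absent.
    from fastembed import TextEmbedding

    # The first construction downloads the ONNX weights into a local cache.
    return TextEmbedding(model_name=model_name)


def embed_all(model, texts: Iterable[str], batch_size: int = 32) -> List[List[float]]:
    """Drain model.embed(), which yields one vector per input text."""
    vectors = model.embed(list(texts), batch_size=batch_size)
    # Convert each vector to plain Python floats for easy serialization.
    return [[float(x) for x in vec] for vec in vectors]
```

Typical use would be `embed_all(build_dense_model(), ["hello world"])`. Keeping the model object separate from the helper means the same function should work with any embedder that exposes a compatible embed() method.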

It's maintained by Qdrant, and the gravity of the whole library pulls toward Qdrant as the vector store: the README examples, docs, and integration points all route there, so don't expect neutral multi-backend support to be a priority. The built-in model catalog is limited to what the team has exported to ONNX; if you need a model that isn't on the supported list, you're converting it yourself and hoping the export works cleanly. Parallel processing is available, but the data parallelism story for GPUs is thin: the fastembed-gpu variant just flips the ONNX execution provider and does nothing sophisticated with batching across devices. There is no async API, so embedding at request time in an async web service means running it in a thread pool executor, which adds friction.
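The thread pool workaround for the missing async API can be sketched roughly as follows. `SlowEmbedder` is a hypothetical stand-in for any blocking embedder with an embed() generator (it is not part of fastembed), and `embed_async` is an illustrative name. The sketch assumes Python 3.9+ for `asyncio.to_thread`. The one detail worth noting is that the generator has to be drained inside the worker thread, otherwise the actual inference would still run on the event loop when the caller iterates.

```python
import asyncio
from typing import Iterable, List


class SlowEmbedder:
    """Hypothetical stand-in for a blocking embedder with an embed() generator."""

    def embed(self, documents: Iterable[str]):
        for doc in documents:
            # Placeholder vector; a real model would run ONNX inference here.
            yield [float(len(doc)), 0.0]


async def embed_async(model, texts: List[str]) -> List[List[float]]:
    """Run the blocking embed() call in a worker thread."""

    def _work() -> List[List[float]]:
        # Drain the generator here so the work happens off the event loop.
        return [list(vec) for vec in model.embed(texts)]

    return await asyncio.to_thread(_work)


async def main() -> List[List[float]]:
    return await embed_async(SlowEmbedder(), ["hello", "hi"])


result = asyncio.run(main())
```

In a web handler, the `await embed_async(...)` call would replace a direct `model.embed(...)`, with the model created once at startup and shared across requests.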
