// the find

shibing624/text2vec

★ 4,969 · Python · Apache-2.0 · updated Feb 2026

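text2vec, text to vector. A text vector representation tool that converts text into vector matrices. It implements text representation and text similarity models including Word2Vec, RankBM25, Sentence-BERT, and CoSENT, and works out of the box.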

text2vec is a Python library for computing sentence embeddings using Word2Vec, Sentence-BERT, and CoSENT models, with a strong focus on Chinese NLP. It ships pretrained Chinese models on HuggingFace and covers the full pipeline from training to inference and deployment. The primary audience is ML engineers building Chinese semantic search or text similarity systems.

The CoSENT loss function is a genuine improvement over vanilla SBERT for ranking tasks, and the Spearman score comparisons in the README back this up with real numbers. The pretrained Chinese models (text2vec-base-chinese-paraphrase in particular) hit 63+ average Spearman across five standard benchmarks, which is competitive without requiring users to fine-tune anything. Multi-GPU inference and a CLI batch tool are included out of the box, not as afterthoughts. Models are compatible with sentence-transformers, so you can swap text2vec models into existing sentence-transformers pipelines without code changes.
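For readers unfamiliar with CoSENT, the ranking idea can be sketched in a few lines. This is a minimal pure-Python illustration of the loss as it is commonly formulated, log(1 + Σ exp(λ·(cos_neg − cos_pos))) over every pair of sentence pairs where one is labeled more similar than the other. It is not text2vec's own implementation; the function name `cosent_loss` and the default scale of 20 are assumptions made for this sketch.

```python
import math


def cosent_loss(cos_scores, labels, scale=20.0):
    """Illustrative CoSENT-style ranking loss (hypothetical helper, not the library API).

    cos_scores: predicted cosine similarity for each sentence pair
    labels:     gold similarity label for each sentence pair
    scale:      temperature-like multiplier (assumed default of 20)
    """
    # Start with 0.0 so that exp(0) = 1 supplies the "1 +" inside the log.
    terms = [0.0]
    for cos_pos, y_pos in zip(cos_scores, labels):
        for cos_neg, y_neg in zip(cos_scores, labels):
            # Only compare pairs where the first is labeled MORE similar.
            if y_pos > y_neg:
                # Penalize the model when the less-similar pair scores higher.
                terms.append(scale * (cos_neg - cos_pos))
    # Numerically stable log-sum-exp.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Correctly ordered predictions give a near-zero loss; reversed ones are penalized.
well_ordered = cosent_loss([0.9, 0.1], [1, 0])
mis_ordered = cosent_loss([0.1, 0.9], [1, 0])
print(well_ordered < mis_ordered)
```

Because the loss only depends on the relative order of cosine scores, it optimizes the same quantity that Spearman correlation measures, which is one plausible reason it does well on ranking benchmarks.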

The last meaningful release was in September 2023, and the library has stalled while the embedding model landscape has moved on significantly (BGE-M3, E5-mistral, etc.). The BGE fine-tuning support feels bolted on: it targets BAAI/bge-large-zh-noinstruct, but the README benchmark shows that text2vec's BGE fine-tune underperforms its own CoSENT models on most datasets, which raises the question of why it is there. The Jina integration example references JinaHub, which Jina AI has since deprecated in favor of their cloud platform, so that deployment path is dead. There is no async inference path and no ONNX/TorchScript export, meaning production latency optimization requires rolling your own.
