// the find

ddangelov/Top2Vec

★ 3,105 · Python · BSD-3-Clause · updated Nov 2024

Top2Vec learns jointly embedded topic, document and word vectors.

Top2Vec does unsupervised topic modeling by jointly embedding documents and words into the same vector space, then using UMAP + HDBSCAN to find dense clusters of document vectors. Each cluster's centroid becomes a topic vector, and the words nearest to it describe the topic. There is no need to specify the number of topics in advance: the algorithm derives it from the number of dense clusters it finds. The 2024 contextual variant adds per-token topic assignment and segment detection, backed by an EMNLP paper.
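To make the "same vector space" idea concrete, here is a minimal pure-Python sketch of how a topic can be labeled once a cluster of documents is known: average the document vectors, then rank words by cosine similarity to that centroid. The 2-D vectors and the helper names (`centroid`, `cosine`, `topic_words`) are invented for illustration; this is not Top2Vec's code, and real embeddings have hundreds of dimensions.

```python
import math


def centroid(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topic_words(doc_vectors, word_vectors, top_n=2):
    """Label one cluster: rank words by similarity to the cluster centroid."""
    topic_vector = centroid(doc_vectors)
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(word_vectors[word], topic_vector),
        reverse=True,
    )
    return ranked[:top_n]


# Toy vectors, made up for this example.
words = {
    "goal": [0.9, 0.1],
    "striker": [0.8, 0.3],
    "senate": [0.1, 0.9],
    "ballot": [0.2, 0.8],
}
# Document vectors that a clustering step is assumed to have grouped together.
sports_cluster = [[0.85, 0.15], [0.9, 0.2], [0.8, 0.25]]

print(topic_words(sports_cluster, words))
```

Because documents and words live in one space, the same similarity function serves for labeling topics, searching by keyword, and finding similar documents.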

The automatic topic count discovery is genuinely useful: LDA-style models require you to guess K upfront, which is painful. The UMAP+HDBSCAN pipeline is a solid choice. UMAP preserves the local neighborhood structure that linear PCA tends to blur, and it is more practical than t-SNE as a pre-clustering step because it scales to larger corpora and to more than two or three output dimensions. HDBSCAN handles variable-density clusters and marks outliers rather than forcing everything into a topic. The search API (by topic, by keyword, by document similarity) works out of the box without extra plumbing. Contextual Top2Vec's token-level topic spans are a real capability improvement for long documents with mixed themes.
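A rough sketch of what the search API looks like in use, assuming the `top2vec` package is installed and the model was trained with documents kept (the default, as far as I know). `train_model` and `best_topic_for` are hypothetical helper names; the `search_topics` and `search_documents_by_topic` calls follow the signatures in the project README, so check them against your installed version.

```python
def train_model(documents):
    """Train on a list of strings; the topic count is discovered, not passed in."""
    from top2vec import Top2Vec  # third-party: pip install top2vec

    return Top2Vec(documents, speed="learn", workers=4)


def best_topic_for(model, keyword, num_docs=3):
    """Find the topic closest to a keyword, then the documents closest to that topic."""
    topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
        keywords=[keyword], num_topics=1
    )
    topic_num = int(topic_nums[0])

    # Assumes documents were kept at training time, so three values come back.
    documents, doc_scores, doc_ids = model.search_documents_by_topic(
        topic_num=topic_num, num_docs=num_docs
    )
    return {
        "topic_num": topic_num,
        "top_words": list(topic_words[0][:5]),
        "doc_ids": list(doc_ids),
    }
```

Typical use would be `model = train_model(corpus)` followed by `best_topic_for(model, "medicine")`, where `corpus` is a list of a few thousand or more raw text strings; very small corpora tend not to produce usable clusters.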

Last push was November 2024 and the contextual mode is still marked beta, so this is not a project you'd lean on for production without owning some maintenance risk. The single `top2vec.py` file is around 3,000 lines with no internal module structure; finding or patching anything is a slog. UMAP and HDBSCAN are sensitive to their hyperparameters (`n_neighbors`, `min_cluster_size`), and Top2Vec exposes them only as pass-through `umap_args` and `hdbscan_args` dictionaries on the constructor, so if the default clustering looks wrong for your corpus you are left tuning by trial and error. Incremental training is limited: `add_documents` slots new documents into the existing topics, but picking up new topics means recomputing them over the whole corpus, which gets expensive fast on large corpora.
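For the control that does exist: as far as the current source shows, the constructor accepts `umap_args` and `hdbscan_args` dictionaries and hands them to the underlying libraries. The sketch below merges overrides into what I understand the defaults to be, on the assumption that a user-supplied dictionary replaces the defaults wholesale rather than updating them. The default values and the helper names (`clustering_kwargs`, `train_with_overrides`) are assumptions to verify against your installed version.

```python
# Defaults as I read them in the Top2Vec source; treat these as assumptions.
DEFAULT_UMAP_ARGS = {"n_neighbors": 15, "n_components": 5, "metric": "cosine"}
DEFAULT_HDBSCAN_ARGS = {
    "min_cluster_size": 15,
    "metric": "euclidean",
    "cluster_selection_method": "eom",
}


def clustering_kwargs(n_neighbors=15, min_cluster_size=15):
    """Build constructor kwargs that change two knobs and keep the other defaults."""
    umap_args = {**DEFAULT_UMAP_ARGS, "n_neighbors": n_neighbors}
    hdbscan_args = {**DEFAULT_HDBSCAN_ARGS, "min_cluster_size": min_cluster_size}
    return {"umap_args": umap_args, "hdbscan_args": hdbscan_args}


def train_with_overrides(documents, **overrides):
    """Train a model with custom clustering settings (requires top2vec)."""
    from top2vec import Top2Vec  # third-party: pip install top2vec

    return Top2Vec(documents, **clustering_kwargs(**overrides))
```

A larger `min_cluster_size` generally yields fewer, broader topics, and a larger `n_neighbors` pushes UMAP toward more global structure. Something like `train_with_overrides(corpus, n_neighbors=30, min_cluster_size=40)` is a reasonable first experiment when the defaults fragment a corpus into too many small topics.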

View on GitHub →