// the find
ddangelov/Top2Vec
Top2Vec learns jointly embedded topic, document and word vectors.
Top2Vec does unsupervised topic modeling by jointly embedding documents and words into the same vector space, then using UMAP + HDBSCAN to find dense clusters of documents that become topics. You don't specify the number of topics in advance; it falls out of how many dense clusters the algorithm finds. The 2024 contextual variant adds per-token topic assignment and segment detection, backed by an EMNLP paper.
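To make "dense clusters that become topics" concrete, here is a minimal pure-Python sketch of the last step as the Top2Vec paper describes it: each cluster's topic vector is the centroid of its document vectors, and the topic's words are the word vectors closest to that centroid. The toy vectors, the cluster labels, and the helper names (`cosine`, `centroid`, `topic_words`) are all made up for illustration; this is not the library's code, and the real pipeline gets its labels from HDBSCAN run on UMAP-reduced embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def topic_words(doc_vectors, labels, word_vectors, top_n=2):
    """Map each cluster label to the top_n words nearest its centroid."""
    clusters = {}
    for vec, label in zip(doc_vectors, labels):
        if label == -1:  # HDBSCAN uses -1 for noise points; skip them
            continue
        clusters.setdefault(label, []).append(vec)

    topics = {}
    for label, vecs in clusters.items():
        topic_vec = centroid(vecs)
        ranked = sorted(
            word_vectors,
            key=lambda word: cosine(word_vectors[word], topic_vec),
            reverse=True,
        )
        topics[label] = ranked[:top_n]
    return topics


# Toy 2-D "joint embedding": documents and words live in the same space.
doc_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.5, 0.5]]
labels = [0, 0, 1, 1, -1]  # hypothetical clustering output, last doc is noise
word_vectors = {
    "goal": [1.0, 0.0],
    "striker": [0.9, 0.1],
    "election": [0.0, 1.0],
    "senate": [0.1, 0.9],
}

topics = topic_words(doc_vectors, labels, word_vectors)
print(topics)
```

The number of topics is never passed in anywhere: it is simply the number of distinct non-noise labels the clustering step produced.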
The automatic topic count discovery is genuinely useful: LDA-style models require you to guess K upfront, which is painful. The UMAP+HDBSCAN pipeline is a solid choice. UMAP keeps the local neighborhood structure that density-based clustering depends on, which linear PCA tends to lose, and unlike t-SNE it is practical for reducing to more than two or three dimensions. HDBSCAN handles variable-density clusters and marks outliers rather than forcing everything into a topic. The search API (by topic, by keyword, by document similarity) works out of the box without extra plumbing. Contextual Top2Vec's token-level topic spans are a real capability improvement for long documents with mixed themes.
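A rough sketch of what the search API looks like in practice, chaining the three kinds of lookup. The method names and keyword arguments are the ones I understand the library to expose (`search_topics`, `search_documents_by_topic`, `search_documents_by_documents`); the `explore` helper itself is hypothetical, and the three-value returns assume a model trained with documents kept, so check the signatures against the version you install.

```python
def explore(model, keyword, num_docs=3):
    """Find the topic closest to a keyword, its top documents,
    and documents similar to the best hit.

    `model` is assumed to be a trained Top2Vec instance that kept
    its documents (so document searches return three values).
    """
    # 1. Keyword -> closest topic.
    topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
        keywords=[keyword], num_topics=1
    )
    best_topic = int(topic_nums[0])

    # 2. Topic -> most representative documents.
    docs, doc_scores, doc_ids = model.search_documents_by_topic(
        topic_num=best_topic, num_docs=num_docs
    )

    # 3. Best document -> semantically similar documents.
    similar_docs, similar_scores, similar_ids = model.search_documents_by_documents(
        doc_ids=[doc_ids[0]], num_docs=num_docs
    )
    return best_topic, list(doc_ids), list(similar_ids)
```

With a trained model in hand, `explore(model, "medicine")` would return the matching topic number plus two lists of document ids. A keyword the model has never seen is likely to raise an error, so guard for that in real use.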
Last push was November 2024 and the contextual mode is still marked beta, so this is not a project you'd lean on for production without owning some maintenance risk. The single `top2vec.py` file is around 3,000 lines with no internal module structure; finding or patching anything is a slog. UMAP and HDBSCAN are sensitive to their hyperparameters (`n_neighbors`, `min_cluster_size`), but Top2Vec exposes them only as pass-through `umap_args` and `hdbscan_args` dictionaries on the constructor, so if the default clustering looks wrong for your corpus you are tuning by trial and error. Incremental updates are limited: `add_documents` folds new documents into the existing topics, but the topics themselves are not re-learned, so picking up new themes means retraining, which gets expensive fast on large corpora.
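For the control that is available, my understanding is that the constructor accepts `umap_args` and `hdbscan_args` dictionaries that are handed straight to the underlying libraries. A hedged sketch of overriding the two sensitive knobs follows; the specific values are illustrative rather than recommendations, and the `train` wrapper is a hypothetical helper.

```python
# Illustrative overrides, not tuned values.
# A larger n_neighbors makes UMAP weigh broader structure;
# a smaller min_cluster_size lets HDBSCAN keep smaller, finer topics.
umap_args = {
    "n_neighbors": 30,
    "n_components": 5,
    "metric": "cosine",
}
hdbscan_args = {
    "min_cluster_size": 10,
    "metric": "euclidean",
    "cluster_selection_method": "eom",
}


def train(documents):
    """Train a model with the overrides above.

    `documents` is a list of strings. The import is deferred so the
    configuration can be inspected without the package installed.
    """
    from top2vec import Top2Vec

    return Top2Vec(documents, umap_args=umap_args, hdbscan_args=hdbscan_args)
```

Each change to these dictionaries means another full run over the corpus, so it is worth trying values on a sample before committing to the whole collection.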