// the find

bojone/word-discovery

★ 512 · Python · updated Mar 2024

Faster, better Chinese new word discovery

A Chinese new word discovery algorithm using statistical n-gram analysis to identify previously unseen words from raw text, without requiring labeled data. It's a research implementation by the author of the kexue.fm NLP blog, aimed at Chinese NLP practitioners who need unsupervised segmentation or vocabulary expansion.
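To make "statistical n-gram analysis" concrete, here is a minimal sketch of one common statistic used in unsupervised word discovery: character n-gram counts scored by pointwise mutual information (often called cohesion). It is an illustration only, not the repo's actual code. The function names, thresholds, and sample text are assumptions, and a real pipeline would add further filtering on top.

```python
import math
from collections import Counter


def char_ngram_counts(text, max_n=4):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cohesion(gram, counts, total_chars):
    """Minimum pointwise mutual information over all two-way splits of gram."""
    p_gram = counts[gram] / total_chars
    scores = []
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total_chars
        p_right = counts[gram[k:]] / total_chars
        scores.append(math.log(p_gram / (p_left * p_right)))
    return min(scores)


def candidate_words(text, max_n=4, min_count=2, min_cohesion=1.0):
    """Return {ngram: cohesion} for n-grams that are frequent and cohesive.

    The thresholds are illustrative defaults, not values taken from the repo.
    """
    counts = char_ngram_counts(text, max_n)
    total = len(text)
    found = {}
    for gram, count in counts.items():
        if len(gram) >= 2 and count >= min_count:
            score = cohesion(gram, counts, total)
            if score >= min_cohesion:
                found[gram] = score
    return found


if __name__ == "__main__":
    # Hypothetical toy corpus; real use needs far more raw text.
    sample = "机器学习很有趣机器学习很有用我喜欢机器学习"
    for gram, score in sorted(candidate_words(sample).items(), key=lambda kv: -kv[1]):
        print(gram, round(score, 3))
```

On a corpus this small, fragments of a frequent word score as well as the word itself, which is why practical systems combine cohesion with additional pruning and much larger counts.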

Outperforms a 2019 ICLR paper (0.765 vs 0.731 F1 on the PKU corpus) while running significantly faster; the core insight is that good statistical heuristics beat slow neural approaches here. The companion blog posts explain the algorithm in depth, so you can actually understand what it's doing rather than just running a black box. Python 2/3 compatible. The approach is corpus-agnostic, so it works on domain-specific text where pretrained tokenizers fail.

The repo is essentially a single script and a precompiled binary (`count_ngrams`) with no source; if that binary breaks on your OS or architecture, you're stuck. There is no pip package and no proper API: you modify the script directly to point at your data. The README is Chinese-only and the algorithm details live entirely in external blog posts, which makes this hard to adopt without reading those first. The last meaningful update was in 2019 (the 2024 push was likely cosmetic), and the ecosystem has since moved toward transformer-based tokenizers for most production use cases.
