// the find

lionsoul2014/friso

★ 510 · C · Apache-2.0 · updated Oct 2023

High-performance Chinese tokenizer with GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular and easy to embed in other programs such as MySQL, PostgreSQL, and PHP.

Friso is a C implementation of the MMSEG algorithm for Chinese word segmentation, supporting UTF-8 and GBK encodings. It loads a 200k-entry lexicon into ~14.5MB of memory and exposes a simple three-object API (friso/config/task). It is aimed at developers embedding Chinese tokenization into C applications, PHP extensions, or database plugins like MySQL and PostgreSQL.

The four segmentation modes (simple FMM, complex MMSEG with four disambiguation rules, detect-only, fine-grained) let you trade speed against accuracy. The modular lexicon system, categorized into CJK words, units, mixed-script words, name components, stopwords, and punctuation combos, makes custom vocabulary additions straightforward without touching source code. The mixed-script handling (卡拉ok, c语言, c++, email addresses) is genuinely useful and covers cases most tokenizers punt on. Thread safety is handled correctly: share friso and config across threads, and give each thread its own task object.
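To make the speed-versus-accuracy trade-off concrete, here is a minimal sketch of forward maximum matching (FMM), the idea behind the simple mode. It is not Friso's code and does not use Friso's API: the toy lexicon, `fmm_segment`, and the helper names are all hypothetical, and the input is assumed to be UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy lexicon for illustration only; a real tokenizer
 * would load its entries from lexicon files. */
static const char *const TOY_LEXICON[] = {
    "研究", "研究生", "生命", "命", "起源", "c语言",
};
static const size_t TOY_LEXICON_LEN =
    sizeof TOY_LEXICON / sizeof TOY_LEXICON[0];

/* Byte length of the UTF-8 character that starts with lead byte c. */
static size_t utf8_char_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1; /* invalid lead byte: consume a single byte */
}

/* Length in bytes of the longest lexicon entry that is a prefix of
 * text, or 0 if no entry matches. A linear scan keeps the sketch
 * short; a real implementation would use a hash table or trie. */
static size_t longest_match(const char *text, size_t remaining) {
    size_t best = 0;
    for (size_t i = 0; i < TOY_LEXICON_LEN; i++) {
        size_t n = strlen(TOY_LEXICON[i]);
        if (n > best && n <= remaining &&
            memcmp(text, TOY_LEXICON[i], n) == 0) {
            best = n;
        }
    }
    return best;
}

/* Segment text greedily from left to right, writing tokens separated
 * by '/' into out. Returns the token count, or -1 if out is too small. */
int fmm_segment(const char *text, char *out, size_t out_size) {
    assert(text != NULL && out != NULL && out_size > 0);
    size_t len = strlen(text), pos = 0, used = 0;
    int tokens = 0;

    while (pos < len) {
        size_t n = longest_match(text + pos, len - pos);
        if (n == 0) {
            /* No lexicon hit: emit one character as its own token. */
            n = utf8_char_len((unsigned char)text[pos]);
            if (n > len - pos) n = len - pos;
        }
        size_t need = n + (tokens > 0 ? 1 : 0);
        if (used + need + 1 > out_size) return -1;
        if (tokens > 0) out[used++] = '/';
        memcpy(out + used, text + pos, n);
        used += n;
        pos += n;
        tokens++;
    }
    out[used] = '\0';
    return tokens;
}
```

With this toy lexicon, `fmm_segment("研究生命起源", buf, sizeof buf)` yields `研究生/命/起源`: the greedy match takes 研究生 first and strands 命, where 研究/生命/起源 is the intended reading. That is the kind of ambiguity the complex mode's disambiguation rules are meant to resolve, at the cost of examining more candidate segmentations.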

The last commit was in 2023, and several items on the README's feature checklist (keyword extraction, keyphrase extraction, the 'most' segmentation mode) have sat unchecked for years, so treat this as a project in minimal maintenance rather than active development. Bindings exist only for PHP5, PHP7, OCaml, and a half-finished Sphinx plugin; with no Python, Go, or Node bindings, most developers will need to write their own FFI wrapper. The dictionary cannot be updated at runtime: you must restart the process to pick up new lexicon entries. 510 stars is low for a Chinese tokenizer in C, so you're unlikely to find much help or many community forks when you hit edge cases.
