// the find

BYVoid/uchardet

★ 656 · C++ · NOASSERTION · updated Apr 2024

An encoding detector library ported from Mozilla

uchardet is a C++ library with a plain C API for detecting the character encoding of a byte stream, ported from Mozilla's encoding detection code. It covers a wide range of legacy encodings across Asian, European, and Middle Eastern scripts. You'd reach for it when processing files of unknown provenance (log archives, user uploads, scraped text) where you can't assume UTF-8.

The coverage of legacy CJK encodings (EUC-JP, Shift-JIS, Big5, GB2312, EUC-KR, EUC-TW) is genuinely good and hard to replicate; this is the part Mozilla spent years tuning. The statistical language model approach lets it distinguish encodings that share byte ranges (e.g., Windows-1251 vs KOI8-R for Russian) instead of guessing from which byte values appear. The C API is simple: feed it bytes, get a charset name back as a string. The CMake build works without ceremony.
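To see why Windows-1251 and KOI8-R can be told apart at all: both put Cyrillic letters in 0xC0 to 0xFF, but with the cases swapped. KOI8-R keeps lowercase in 0xC0 to 0xDF, Windows-1251 keeps it in 0xE0 to 0xFF. Running text is mostly lowercase, so counting bytes per range is already a crude signal. The sketch below is a toy stand-in for that idea, not uchardet's actual model, which scores character sequences against per-language statistics. The function name `guess_cyrillic_charset` is hypothetical, and the empty string on a tie only imitates a "no answer" result.

```c
#include <assert.h>
#include <stddef.h>

/* Toy heuristic (an illustration, NOT uchardet's algorithm).
 * Counts bytes in the range each encoding uses for lowercase Cyrillic
 * and returns the name of the encoding whose lowercase range dominates.
 * Returns "" when the counts tie, including when no high bytes appear. */
const char *guess_cyrillic_charset(const unsigned char *buf, size_t len)
{
    size_t koi8_lower = 0;   /* 0xC0..0xDF: lowercase in KOI8-R       */
    size_t cp1251_lower = 0; /* 0xE0..0xFF: lowercase in Windows-1251 */

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0xC0 && buf[i] <= 0xDF) {
            koi8_lower++;
        } else if (buf[i] >= 0xE0) {
            cp1251_lower++;
        }
    }

    if (koi8_lower == cp1251_lower) {
        return ""; /* no evidence either way */
    }
    return koi8_lower > cp1251_lower ? "KOI8-R" : "WINDOWS-1251";
}
```

Called on the six bytes D0 D2 C9 D7 C5 D4 (the Russian word for "hello" in KOI8-R), every byte lands in the 0xC0 to 0xDF range, so the function picks KOI8-R. A heuristic this thin breaks on all-caps text or very short input, which is the gap a real statistical model is meant to close.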

This GitHub repo is a dead mirror: the actual project moved to freedesktop.org years ago, and the README says so. Stars and activity here mean nothing; go check the freedesktop.org GitLab. The language model data is static and built from training corpora of unknown quality and age. If your input is short or uses unusual vocabulary, detection can be unreliable, and when the library can't decide it returns an empty string. You don't have to buffer all input first, since uchardet_handle_data can be called repeatedly on successive chunks before uchardet_data_end, but the result is only meaningful once you have signaled end of data. Detection of Windows-1252 vs ISO-8859-1 for Western European languages is notoriously unreliable because the two encodings differ only in the 0x80 to 0x9F byte range.
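The Windows-1252 vs ISO-8859-1 problem is easy to see at the byte level. The two agree everywhere except 0x80 to 0x9F, where ISO-8859-1 has C1 control codes and Windows-1252 has mostly printable characters (curly quotes, the euro sign, dashes). A minimal sketch of the only cheap check available, with a hypothetical function name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any byte falls in 0x80..0x9F.
 * Such bytes are C1 control codes in ISO-8859-1 but mostly printable
 * characters in Windows-1252, so their presence hints at Windows-1252.
 * A false result proves nothing: the text decodes identically under both. */
bool has_c1_range_bytes(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F) {
            return true;
        }
    }
    return false;
}
```

Text like "café" encoded with a single 0xE9 byte contains nothing in that range, so no detector, statistical or otherwise, has anything to distinguish the two encodings with. In that case either label decodes the input to the same characters.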
