// the find

huggingface/tokenizers

★ 10,812 · Rust · Apache-2.0 · updated Jun 2026

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

The Rust core of HuggingFace's tokenizer library, with Python and Node bindings. It implements BPE, WordPiece, and Unigram tokenization — the three algorithms that cover essentially every modern language model. If you're training a model from scratch or need to load a pretrained tokenizer without pulling in all of transformers, this is what you use.

1. Genuinely fast: tokenizing a GB of text in under 20 seconds on CPU is not marketing copy. The hot paths are parallelized Rust, with no GIL contention in the Python binding.
2. Offset tracking through normalization is non-trivial, and they actually got it right: you can always map a token back to its exact span in the original string, which is critical for NER and span extraction tasks.
3. The pipeline architecture (normalizer → pre-tokenizer → model → post-processor → decoder) is composable and serializable to JSON, so custom tokenizers are reproducible across languages and runtimes.
4. Active and well maintained: pushed yesterday, CI covers Rust/Python/Node, benchmarks are automated.
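
To make the offset and pipeline points concrete, here is a minimal sketch using the Python binding. The three-word vocabulary, the sample string, and the choice of a WordLevel model are assumptions made for illustration (a real tokenizer would usually be trained or loaded from a file); the sketch strips accents and lowercases during normalization, then checks that each token's offsets still point at the original, un-normalized text.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary, assumed purely for illustration.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

# Assemble part of the pipeline: normalizer -> pre-tokenizer -> model.
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), StripAccents(), Lowercase()])
tokenizer.pre_tokenizer = Whitespace()

# Precomposed accented characters, written as escapes to avoid ambiguity.
text = "H\u00e9llo W\u00f6rld"
encoding = tokenizer.encode(text)

# Offsets are (start, end) character positions in the ORIGINAL string,
# even though the tokens themselves come from the normalized text.
spans = [text[start:end] for start, end in encoding.offsets]

# The whole pipeline serializes to a JSON string and can be rebuilt from it.
serialized = tokenizer.to_str()
restored = Tokenizer.from_str(serialized)
restored_tokens = restored.encode(text).tokens
```

Here `encoding.tokens` holds the normalized forms, while `spans` recovers the accented originals, which is the property NER and span extraction depend on. The same JSON can be written with `tokenizer.save(path)` and read back with `Tokenizer.from_file(path)`.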

1. The Node bindings are a second-class citizen: TypeScript types exist, but the API surface lags the Python binding and the documentation is sparse compared to what Python users get.
2. No Go or Java bindings, so teams not in the Python/Node/Rust ecosystem have to shell out or find a third-party wrapper.
3. Training on large datasets requires the data to fit in memory or be streamed through their iterator protocol; there's no built-in distributed training path if your corpus is truly massive.
4. WASM support is marked 'unstable' and lives under examples/ rather than being a first-class target, so don't plan a browser-side tokenizer on this without testing thoroughly.
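
For the training caveat, the streaming route looks roughly like the sketch below. The tiny in-memory corpus, the `stream_batches` helper, the batch size, and the vocabulary size are all hypothetical stand-ins; in practice the generator would read lines lazily from disk or a dataset so the full corpus never has to sit in memory at once.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Stand-in corpus; a real one would be streamed from files or a dataset.
CORPUS = [
    "low lower lowest",
    "new newer newest",
    "wide wider widest",
    "slow slower slowest",
]

def stream_batches(lines, batch_size=2):
    """Hypothetical helper: yield lists of lines without materializing all of them."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Small vocabulary and no progress bar, chosen only to keep the sketch quiet.
trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]"], show_progress=False)

# train_from_iterator accepts an iterator of strings or of lists of strings.
tokenizer.train_from_iterator(stream_batches(iter(CORPUS)), trainer=trainer)

encoding = tokenizer.encode("lower newest")
```

This runs in a single process. Splitting a truly massive corpus across machines is left to the caller, which is the gap the third point describes.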
