// the find
curiosity-ai/catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Catalyst is a pure-C# NLP library modeled after spaCy, covering tokenization, POS tagging, NER, sentence detection, and word/document embeddings. It targets .NET Standard 2.0 so it works anywhere .NET Core runs. The audience is .NET developers who need NLP without shelling out to Python.
The tokenizer claims >1M tokens/s with minimal regex, which is credible given the span-heavy implementation visible in the directory tree. Language models ship as NuGet packages: you add `Catalyst.Models.English` and you're done, with no separate manual download step. The model distribution strategy (lazy-load from a hosted repository on first use, cache to disk) is practical for a library that needs to work out of the box. The Presidio integration module for PII anonymization is a useful addition that most comparable libraries don't include.
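For readers unfamiliar with the lazy-load-and-cache pattern mentioned above, here is a minimal, language-agnostic sketch in Python. It is not Catalyst's C# API: `LazyModelStore`, `fake_fetch`, and the `.bin` file naming are hypothetical names invented for illustration, and the "download" is a stand-in function rather than a real network call.

```python
import tempfile
from pathlib import Path
from typing import Callable


class LazyModelStore:
    """Fetch a model on first request, then serve it from a local disk cache."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch  # stand-in for a download from a hosted repository
        self.fetch_count = 0  # counts cache misses, for demonstration only

    def load(self, name: str) -> bytes:
        path = self.cache_dir / f"{name}.bin"
        if path.exists():
            # Cache hit: read from disk, no remote access needed.
            return path.read_bytes()
        # Cache miss: fetch once, then persist for later calls and later runs.
        data = self.fetch(name)
        self.fetch_count += 1
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return data


def fake_fetch(name: str) -> bytes:
    # Hypothetical remote call; a real store would issue an HTTP request here.
    return f"weights-for-{name}".encode("utf-8")


store = LazyModelStore(Path(tempfile.mkdtemp()) / "models", fake_fetch)
first = store.load("english-tagger")   # miss: fetched and written to disk
second = store.load("english-tagger")  # hit: read back from the cache
```

The appeal for a library is that the first call pays the download cost and every later call, including in later processes, only touches the local disk.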
853 stars and 84 forks is thin community traction for a project this broad. spaCy has 30k+ stars for good reason, and the ecosystem gap shows in limited third-party examples and sparse issue discussion. The FastText and StarSpace embedding models are pre-transformer-era; if you need decent semantic similarity today you're using something like sentence-transformers, not word2vec variants. There's no transformer-based model support (BERT, RoBERTa, etc.): the library tops out at perceptron-based taggers and FastText, a ceiling that matters for any serious NLU task. Dependency parsing support appears to exist only for a handful of languages (Afrikaans, Bulgarian, Catalan, and Czech, judging by the directory tree), and many languages in the directory have only tagger and sentence-detector models, so cross-language parity is inconsistent.