// the find

yongzhuo/Keras-TextClassification

★ 1,815 · Python · MIT · updated Jun 2024

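Chinese long-text classification, short-sentence classification, multi-label classification, and two-sentence similarity (Chinese Text Classification of Keras NLP, multi-label classify, or sentence classify, long or short); base classes for building character-, word-, and sentence-vector embedding layers (embeddings) and network layers (graph); FastText, TextCNN, CharCNN, TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, Bert, Xlnet, Albert, Attention, DeepMoji, HAN, CapsuleNet, Transformer-encode, Seq2seq, SWEM, LEAM, TextGCN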

A Keras-based Chinese text classification library covering ~16 architectures from FastText to BERT/Albert/XLNet fine-tuning. It's a reference implementation collection for NLP practitioners who want to compare classical and transformer approaches on Chinese corpora. Primarily useful as a learning resource or baseline starter, not a production library.

1. Breadth of coverage is genuinely useful for comparison: FastText, TextCNN, DPCNN, HAN, CapsuleNet, and transformer fine-tuning all in one repo with consistent train/predict entry points per model.
2. The base class design (embedding.py + graph.py inheritance) is clean enough that adding a new model means subclassing two classes and filling in the architecture, which avoids duplicating tokenization and data pipeline logic.
3. Includes both char-level and word-level embeddings with word2vec and BERT variants, which matters for Chinese since word segmentation is a real preprocessing decision.
4. Multi-label and sentence similarity examples are included beyond plain classification, which covers the three most common NLP task shapes.
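The base-class pattern in point 2 can be sketched roughly as follows. This is an illustration of the idea only, not the repo's code: every class and method name here (`BaseEmbedding`, `BaseGraph`, `CharEmbedding`, `WordEmbedding`, `LengthGraph`, `tokenize`, `encode`, `score`) is hypothetical, and a toy scoring rule stands in for a real Keras network so the example runs without dependencies.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseEmbedding(ABC):
    """Hypothetical base: owns the vocabulary and the text-to-ids pipeline."""

    def __init__(self, max_len: int = 8) -> None:
        self.max_len = max_len
        self.vocab: Dict[str, int] = {"[PAD]": 0, "[UNK]": 1}

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        """Subclasses decide the token granularity (char or word)."""

    def fit(self, texts: List[str]) -> None:
        # Assign the next free id to every token not seen before.
        for text in texts:
            for tok in self.tokenize(text):
                self.vocab.setdefault(tok, len(self.vocab))

    def encode(self, text: str) -> List[int]:
        # Map tokens to ids, truncate, then right-pad with [PAD] (0).
        ids = [self.vocab.get(t, 1) for t in self.tokenize(text)][: self.max_len]
        return ids + [0] * (self.max_len - len(ids))


class CharEmbedding(BaseEmbedding):
    """Char-level variant: no word segmentation needed."""

    def tokenize(self, text: str) -> List[str]:
        return [ch for ch in text if not ch.isspace()]


class WordEmbedding(BaseEmbedding):
    """Word-level variant; assumes the text was already segmented with spaces."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class BaseGraph(ABC):
    """Hypothetical base: owns the shared predict path; subclasses supply the model."""

    def __init__(self, embedding: BaseEmbedding, labels: List[str]) -> None:
        self.embedding = embedding
        self.labels = labels

    @abstractmethod
    def score(self, ids: List[int]) -> List[float]:
        """Return one score per label; a real subclass would call its network here."""

    def predict(self, text: str) -> str:
        scores = self.score(self.embedding.encode(text))
        return self.labels[scores.index(max(scores))]


class LengthGraph(BaseGraph):
    """Toy stand-in for an architecture: scores by the count of non-padding ids."""

    def score(self, ids: List[int]) -> List[float]:
        n = sum(1 for i in ids if i != 0)
        return [float(n), float(len(ids) - n)]


embedding = CharEmbedding(max_len=8)
embedding.fit(["今天天气很好"])
model = LengthGraph(embedding, labels=["long", "short"])
print(model.predict("今天天气很好"))
```

The point of the split is that swapping `CharEmbedding` for `WordEmbedding`, or `LengthGraph` for another architecture, touches one subclass and leaves the encoding and prediction paths alone.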

1. The repo is effectively frozen at Keras 2 / TF 1.x era: it targets Python 3.5/3.6 and TF 1.13, which is years past EOL. Running it on a current environment will require non-trivial porting work.
2. Setup requires manually downloading model weights from Baidu Pan (a Chinese file host) and copying them into hardcoded paths inside the installed package directory; this is genuinely painful and breaks any CI or automated setup.
3. No test suite, no type hints, and the training data bundled in-repo is only 100 examples; the full corpora live behind Baidu Pan links with extraction passwords scattered through the README.
4. ELECTRA and TextGCN are listed as 'todo' and never landed, so the feature list in the description overpromises what's actually implemented.
