// the find

bojone/SimCSE

★ 605 · Python · updated Aug 2023

A simple experiment with SimCSE on Chinese tasks

A bare-bones Chinese-language benchmark for SimCSE sentence embeddings, testing unsupervised contrastive learning across five standard Chinese NLP similarity datasets. It's an experiment notebook promoted to a repo — useful if you're reproducing bojone's blog post results or comparing pooling strategies on Chinese BERT variants. Not a library.

Covers five well-known Chinese similarity benchmarks (ATEC, BQ, LCQMC, PAWSX, STS-B) in one place, which saves setup time. Supports nine model variants including SimBERT and RoFormer, so pooling strategy comparisons across architectures are straightforward. The dropout-rate parameter is exposed directly, making it easy to reproduce the unsupervised SimCSE trick without touching internals. The companion blog post (kexue.fm) is dense and worth reading even if you skip this code.
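For readers new to the unsupervised SimCSE trick mentioned above, here is a minimal NumPy sketch of the idea: the same sentences are encoded twice with different dropout masks, and a contrastive loss pulls the two views of each sentence together while pushing other sentences in the batch away. This is not the repo's code. The toy encoder, the function names `encode_with_dropout` and `simcse_unsup_loss`, and the temperature of 0.05 are assumptions made for illustration.

```python
import numpy as np


def encode_with_dropout(x, weights, dropout_rate, rng):
    """Toy stand-in for a BERT encoder (hypothetical): dropout, then a linear layer."""
    keep_mask = rng.random(x.shape) >= dropout_rate
    # Inverted dropout: rescale so the expected activation is unchanged.
    dropped = x * keep_mask / (1.0 - dropout_rate)
    return np.tanh(dropped @ weights)


def simcse_unsup_loss(z1, z2, temperature=0.05):
    """In-batch contrastive loss; z1[i] and z2[i] are two views of sentence i."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / temperature  # cosine similarities, shape (batch, batch)
    # Numerically stable row-wise log-softmax.
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(8, 32))        # pretend sentence features
    weights = rng.normal(size=(32, 16)) * 0.1  # pretend encoder weights
    view_a = encode_with_dropout(features, weights, dropout_rate=0.3, rng=rng)
    view_b = encode_with_dropout(features, weights, dropout_rate=0.3, rng=rng)
    print("contrastive loss:", simcse_unsup_loss(view_a, view_b))
```

The dropout rate is the only source of difference between the two views, which is why having it exposed as a parameter matters when reproducing the results.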

Pinned to TensorFlow 1.14 + Keras 2.3.1, a stack most teams abandoned years ago; getting this running in 2026 means fighting dependency archaeology before any actual research. Two files total: there's no standalone training pipeline, no downstream fine-tuning, and no supervised SimCSE variant. Only the unsupervised setup is covered, which is half the story the paper tells. Last touched in 2023 and clearly not maintained. Dataset download relies on Baidu Pan, which is inaccessible without a Chinese account.

View on GitHub →