// the find

mtianyan/FunpySpiderSearchEngine

★ 933 · Python · MIT · updated Feb 2023

Word2vec per-user personalized search + Scrapy2.3.0 (crawls the data) + ElasticSearch7.9.1 (stores the data and exposes a RESTful API) + Django3.1.1 search

A personal search engine demo that scrapes Zhihu (a Chinese Q&A platform) with Scrapy, indexes content into Elasticsearch, and uses Word2Vec to personalize result rankings based on query history. It's a tutorial project showing how to wire together a classic search stack, not production software.

The interesting part is the Word2Vec scoring integration with Elasticsearch's function_score: using semantic similarity to boost results based on a user's search history is a concrete example of personalization that most tutorials skip. The Docker Compose setup covers the full stack (ES, Redis, Django, Spider), so you can actually run it. The project is split into two repos (spider vs. web) with clear separation of concerns. The IK analyzer integration for Chinese tokenization is handled correctly, and tokenization is often the stumbling block for Chinese NLP projects.
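To make the function_score idea concrete, here is a minimal Python sketch of how similarity scores could be turned into per-term boosts. It is not the repository's code: the helper names, the `title` field, and the toy two-dimensional vectors are all assumptions for illustration, and a real setup would look the vectors up in a trained Word2Vec model instead.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_personalized_query(query_text, history_terms, vectors, field="title"):
    """Build a function_score request body that boosts documents matching
    past search terms, weighted by their similarity to the current query.

    `vectors` is a hypothetical {term: embedding} mapping standing in for
    a Word2Vec model; `field` is an assumed index field name.
    """
    q_vec = vectors.get(query_text)
    functions = []
    for term in history_terms:
        t_vec = vectors.get(term)
        if q_vec is None or t_vec is None:
            continue  # no embedding available, so no boost for this term
        sim = cosine(q_vec, t_vec)
        if sim > 0:  # keep weights positive so scores never go negative
            functions.append(
                {"filter": {"match": {field: term}}, "weight": round(sim, 4)}
            )
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: query_text}},
                "functions": functions,
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }


# Toy embeddings, purely illustrative.
toy_vectors = {
    "python": [1.0, 0.0],
    "scrapy": [1.0, 1.0],
    "cooking": [-1.0, 0.0],
}
body = build_personalized_query("python", ["scrapy", "cooking", "unknown"], toy_vectors)
```

With these toy vectors only `scrapy` produces a boost: `cooking` has negative similarity and `unknown` has no embedding, so both are skipped. The resulting dict can be sent as the body of a search request.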

Last touched in early 2023 and pinned to Elasticsearch 7.9 and Scrapy 2.3. ES is now at 8.x with breaking API changes, so this code will not run against a current cluster without patches. The Zhihu scraper almost certainly requires constant maintenance, since Zhihu aggressively changes its login flow and anti-bot measures; the committed cookie file is a red flag. Word2Vec is trained on a tiny dataset (45k-small.txt is visible in the tree), so the semantic similarity is probably not very useful in practice. There are no tests anywhere, and the Painless scoring script in the README is fragile: it doesn't handle missing fields.
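For the missing-field problem, one common guard in Painless is to check that the doc-values entry exists and is non-empty before reading `.value`. The sketch below is an assumption about what a hardened function could look like, not a patch of the README's script: the helper name, the `view_count` field, and the log-based formula are all hypothetical.

```python
def guarded_script_score(field):
    """Return a function_score `script_score` entry whose Painless source
    falls back to 0.0 when `field` is unmapped or empty on a document.

    The field name is passed through `params` so the script source stays
    constant across calls.
    """
    source = (
        "double base = 0.0; "
        "if (doc.containsKey(params.field) && doc[params.field].size() != 0) { "
        "base = doc[params.field].value; "
        "} "
        # Clamp at zero so the returned score is never negative.
        "return Math.log(1 + Math.max(base, 0.0));"
    )
    return {
        "script_score": {
            "script": {
                "lang": "painless",
                "source": source,
                "params": {"field": field},
            }
        }
    }


# `view_count` is an illustrative field name, not taken from the repo.
fn = guarded_script_score("view_count")
```

The returned dict would go into the `functions` list of a function_score query. A document without the field then contributes `log(1) = 0` instead of raising a script error.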
