// the find

SpiderClub/weibospider

★ 4,788 · Python · MIT · updated Jul 2020

⚡ A distributed crawler for Weibo, built with Celery and requests.

A distributed Weibo (Chinese Twitter) crawler built on Celery and requests. Handles login session management, user profile scraping, keyword search, comments, reposts, and follower graphs. Aimed at researchers and analysts who need structured Weibo data at scale.

Celery-based task distribution means you can add worker nodes by running the same command on another machine, with no special cluster setup. Login and cookie rotation are handled automatically: celery beat refreshes sessions on a 24-hour cycle, which solves the most annoying part of scraping authenticated sessions. The page parser is split by domain and user type (enterprise vs. personal vs. public), which reflects how Weibo actually behaves rather than pretending it's uniform. An optional Django admin UI lets you manage keywords and seed IDs without touching the database directly.
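To make the 24-hour refresh concrete, here is a minimal sketch of what a celery beat schedule for it could look like. It is not taken from the repository: the task path `tasks.login.refresh_cookies`, the queue name `login_queue`, and the helper `apply_schedule` are all hypothetical names chosen for illustration. The schedule is built as a plain dict so the snippet runs without Celery installed.

```python
from datetime import timedelta

# Illustrative beat schedule. Task and queue names are hypothetical,
# not the ones weibospider actually uses.
BEAT_SCHEDULE = {
    "refresh-login-cookies": {
        # Dotted path of the task beat should enqueue (hypothetical).
        "task": "tasks.login.refresh_cookies",
        # Celery accepts a timedelta here: run once every 24 hours.
        "schedule": timedelta(hours=24),
        # Passed through to apply_async; routes the job to one queue.
        "options": {"queue": "login_queue"},
    },
}


def apply_schedule(app):
    """Attach the schedule to a Celery app object and return it.

    `app` is expected to be a `celery.Celery` instance, but any object
    with a `conf` attribute works, which keeps this sketch testable.
    """
    app.conf.beat_schedule = BEAT_SCHEDULE
    return app
```

Assuming the app lives in a module called `proj` (also a placeholder), each extra machine would run `celery -A proj worker`, while exactly one machine runs `celery -A proj beat` so the refresh job is not enqueued twice.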

Last commit was July 2020. Weibo has changed its anti-scraping measures considerably since then, and the hand-analyzed request patterns the README brags about are almost certainly stale. Requires a CAPTCHA-solving service (cloud OCR) that you have to register with and pay for separately; it is an undocumented external dependency that will break silently if your balance runs out. README is entirely in Chinese with no English translation, which limits who can actually use it. Setup is genuinely complex: MySQL, Redis, Celery, optional Django admin, and manual seed data insertion. There is a Dockerfile, but no docker-compose file or one-command bootstrap.
