
// the find

DropsDevopsOrg/ECommerceCrawlers

★ 5,558 · Python · MIT · updated May 2024

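Hands-on 🐍 crawlers for a variety of websites and e-commerce platforms 🕷. Includes 🕸: Taobao product listings, WeChat official accounts, Dianping, QiChaCha, job boards, Xianyu, Alibaba tasks, cnblogs, Weibo, Baidu Tieba, Douban Movies, Baotu (包图网), Quanjing (全景网), Douban Music, a provincial drug administration, Sohu News, text collection for machine learning, FOFA asset collection, Autohome, the National Bureau of Statistics, Baidu keyword index counts, spider pan-directories, Toutiao, Douban movie reviews, Ctrip, the Xiaomi app store, Anjuke, Tujia homestays ❤️❤️❤️. WeChat crawler showcase project: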

A collection of 20+ Python web scrapers targeting Chinese platforms, including Taobao, Weibo, Douban, and job boards. It is aimed at learners who want working examples of real-world crawling against sites with active anti-bot measures; it is not a reusable library.

Covers the full anti-scraping stack in practice: cookie pools, mitmproxy JS signature bypass, Redis-backed session storage, and distributed Scrapy pipelines — all against sites that actually fight back. The Taobao (new) crawler's approach of reconstructing the sign parameter without cookies is a genuinely clever technique worth studying. Multiple storage backends (MySQL, MongoDB, Redis, CSV) are demonstrated consistently across projects. Each subdirectory has its own focused README explaining the analysis approach, which makes it a reasonable reference when you hit the same problem.
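To make the cookie-pool idea concrete, here is a minimal in-memory sketch. It is not taken from the repo: the `CookiePool` class, its method names, and the `max_failures` threshold are all hypothetical, and a real deployment would more likely keep the jars in Redis, as the projects above do. It only shows the core loop: pick a random account's cookies, report whether the request succeeded, and evict accounts that keep failing.

```python
import random
from dataclasses import dataclass, field


@dataclass
class CookiePool:
    """Hypothetical in-memory cookie pool: one cookie dict per account."""

    max_failures: int = 3  # assumed threshold before an account is evicted
    _cookies: dict = field(init=False, default_factory=dict)   # account -> cookies
    _failures: dict = field(init=False, default_factory=dict)  # account -> consecutive failures

    def add(self, account, cookies):
        """Register (or refresh) the cookies for an account."""
        self._cookies[account] = dict(cookies)
        self._failures[account] = 0

    def pick(self):
        """Return (account, cookies) for a randomly chosen live account."""
        if not self._cookies:
            raise LookupError("cookie pool is empty")
        account = random.choice(list(self._cookies))
        return account, self._cookies[account]

    def report(self, account, ok):
        """Record a request outcome; evict after too many consecutive failures."""
        if account not in self._cookies:
            return
        if ok:
            self._failures[account] = 0
            return
        self._failures[account] += 1
        if self._failures[account] >= self.max_failures:
            del self._cookies[account]
            del self._failures[account]

    def __len__(self):
        return len(self._cookies)
```

In use, a crawler would call `pick()` before each request, pass the returned dict as the request's cookies, and call `report(account, ok)` based on whether the response looked like a login wall or a block page. How to detect that is site-specific and left out here.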

The last meaningful commit was in 2024, and most of the target sites have rotated their APIs or tightened JS fingerprinting since, so expect breakage on first run. Documentation and code comments are entirely in Chinese with no translation, which is a hard barrier for anyone who doesn't read Chinese. There's no shared library, no tests, and no abstraction layer: it's 20+ standalone scripts with copy-pasted boilerplate, so there's nothing reusable to extract. Several crawlers (FOFA asset enumeration, QiChaCha company data) sit in legally and ethically grey territory, and the repo offers no guidance on that.

