// the find

istresearch/scrapy-cluster

★ 1,224 · Python · MIT · updated Nov 2023

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

Scrapy Cluster turns Scrapy into a distributed crawling system by routing jobs through Kafka and coordinating deduplication and throttling via Redis. You submit URLs to a Kafka topic, spiders across multiple machines pick them up, and results flow back out through Kafka. It's aimed at teams running large-scale scraping operations that have outgrown a single Scrapy process.
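To make the submit-a-URL flow concrete, here is a minimal sketch of building one crawl request and publishing it. The topic name `demo.incoming` and the fields `url`, `appid`, `crawlid`, and `maxdepth` are assumptions based on the project's documented defaults; check them against your own cluster settings. The `publish` helper assumes the third-party kafka-python package and a broker on `localhost:9092`.

```python
import json

# Assumed default topic; a real deployment may use a different prefix.
INCOMING_TOPIC = "demo.incoming"


def build_crawl_request(url, appid, crawlid, **extra):
    """Serialize one crawl request as UTF-8 JSON bytes.

    Field names are assumptions, not a verified schema.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    request = {"url": url, "appid": appid, "crawlid": crawlid}
    request.update(extra)  # optional fields, e.g. maxdepth=1
    return json.dumps(request, sort_keys=True).encode("utf-8")


def publish(payload, bootstrap_servers="localhost:9092", topic=INCOMING_TOPIC):
    """Send one payload using kafka-python (pip install kafka-python)."""
    # Imported lazily so build_crawl_request works without the package.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    producer.send(topic, value=payload)
    producer.flush()  # block until the message is actually handed to the broker
    producer.close()
```

With a broker running, `publish(build_crawl_request("https://example.com", "demo-app", "crawl-001", maxdepth=1))` would enqueue a single job; results would then have to be read from whatever output topic the cluster is configured to write.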

The distributed scheduler and Redis-backed dupe filter are well thought out: you can add or remove spider instances without losing queue state or causing duplicate crawls. Per-domain throttling that works across machines behind the same IP is genuinely hard to get right, and this solves it. The plugin architecture for the Kafka and Redis monitors is clean; you can add custom validation schemas or monitoring logic without touching core code. The Docker Compose setup for local testing works out of the box, and the docs on Read the Docs are unusually complete for a project like this.
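The idea behind a Redis-backed dupe filter can be sketched in a few lines. This is an illustration of the general technique, not the project's own code: the class name, the key `dupefilter:demo`, and the fingerprint format are all hypothetical. It relies only on the Redis `SADD` command returning 1 for a new member and 0 for an existing one, and it uses an in-memory stand-in so it runs without a server.

```python
import hashlib


class SeenFilter:
    """Shared 'have we crawled this already?' check backed by a Redis set."""

    def __init__(self, client, key="dupefilter:demo"):
        # client: anything with a Redis-style sadd(key, member) -> int,
        # such as redis.Redis(). The key name is hypothetical.
        self.client = client
        self.key = key

    def fingerprint(self, method, url):
        # Hypothetical fingerprint: SHA-1 of the method and URL.
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def seen(self, method, url):
        # SADD returns 1 when the member is new and 0 when it already exists,
        # so checking and marking happen in one atomic command.
        return self.client.sadd(self.key, self.fingerprint(method, url)) == 0


class FakeRedis:
    """In-memory stand-in for a Redis client, for local experiments only."""

    def __init__(self):
        self.sets = {}

    def sadd(self, key, member):
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1
```

Because the set lives in Redis rather than in a spider process, two `SeenFilter` instances on different machines that point at the same key agree on what has been crawled, which is why spiders can come and go without triggering repeat fetches.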

The project is effectively unmaintained: the last commit was in November 2023, and the 1.3 milestone has been 'in progress' for years. The README still mentions Python 2.7, which tells you how long some of this has sat. Zookeeper is a hard dependency just for distributed locking and configuration watching, which is a significant operational burden for a scraping cluster. There's no built-in JavaScript rendering support; if your targets are SPA-heavy, you're bolting on Splash or Playwright yourself with no guidance. The output is raw crawled HTML dumped to Kafka, so any structured extraction is entirely your problem downstream.
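As a rough picture of what that downstream work looks like, here is a minimal sketch of turning one crawled message into a structured record using only the standard library. It assumes each message is a JSON object with `url` and `body` (raw HTML) fields; those field names are an assumption, and `TitleParser` and `extract_record` are hypothetical helpers, not part of Scrapy Cluster.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.title += data


def extract_record(raw_message):
    """Turn one raw JSON message (str or bytes) into a small structured dict."""
    item = json.loads(raw_message)
    parser = TitleParser()
    parser.feed(item.get("body", ""))  # assumed field holding the raw HTML
    return {"url": item.get("url"), "title": parser.title.strip()}
```

In practice this function would sit inside a Kafka consumer loop, and real extraction would likely need a proper HTML library and per-site rules, which is exactly the work the project leaves to you.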
