// the find
lorien/grab
Web Scraping Framework
Grab is a Python scraping framework built around pycurl/urllib3 with an XPath-first document API and a Spider layer for parallel crawling with pluggable task queues. It targets developers who want stateful session management and form interaction without assembling those pieces from requests + lxml themselves. The project was effectively dead from 2018 to 2025, when the author reset it to the last working release and updated it for Python 3.13 compatibility.
1. The XPath document API is genuinely good: `g.doc('//h3[@class="repo-list-name"]/a')` is more expressive than CSS selectors for scraping deeply nested HTML.
2. Spider's task queue backends (in-memory, Redis, MongoDB) let you swap persistence without rewriting the crawler logic.
3. The honesty of the 2025 README is refreshing: the author explicitly says the refactor failed, explains that the code was reset to the last known-good release, and documents the one breaking change in exception imports.
4. The pycurl transport option gives you real curl semantics (connection reuse, TLS fingerprinting, proxy handling) that urllib3 alone doesn't match.
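To make the XPath point concrete, here is a minimal sketch that runs the same expression with the standard library's `xml.etree.ElementTree`. This is a stand-in, not Grab's API: ElementTree supports only a limited XPath subset and needs well-formed markup, whereas Grab's selector is described above as working on real HTML. The `PAGE` snippet and the `repo_links` helper are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched page.
PAGE = """
<html><body>
  <div class="results">
    <h3 class="repo-list-name"><a href="/lorien/grab">lorien/grab</a></h3>
    <h3 class="other"><a href="/ignored">ignored</a></h3>
    <div>
      <h3 class="repo-list-name"><a href="/example/nested">example/nested</a></h3>
    </div>
  </div>
</body></html>
"""


def repo_links(markup: str) -> list[tuple[str, str]]:
    """Return (href, text) pairs for links under matching <h3> elements."""
    root = ET.fromstring(markup.strip())
    # ElementTree requires a leading "." for a descendant search;
    # a full XPath engine would accept the bare "//h3[...]/a" form.
    nodes = root.findall('.//h3[@class="repo-list-name"]/a')
    return [(a.get("href"), a.text) for a in nodes]


if __name__ == "__main__":
    for href, text in repo_links(PAGE):
        print(href, text)
```

The single expression filters on an attribute and then steps to a child element, regardless of how deeply the `h3` is nested, which is the kind of query the review is praising.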
1. Seven years without meaningful development means no async/await: the 'asynchronous' tag refers to the threaded Spider, not asyncio, so in 2025 you scale I/O-bound crawls by adding threads rather than running an event loop, and the parsing work between requests still contends for the GIL.
2. The `.hg/` directory is committed to the GitHub repo, which is just noise and suggests a mechanical mirror rather than an actively maintained project.
3. No type annotations anywhere in the codebase. The API surface is large enough that mypy or pyright support would matter here, and there's no sign it's coming.
4. The Spider cache backends include MySQL and MongoDB, but that code hasn't been touched since 2018; anyone relying on those backends in production is betting that nothing broke during the Python version update.
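For readers unsure what "threaded, not asyncio" means in practice, the sketch below shows the thread-pool style of concurrency using only the standard library. It is not Grab's Spider code: `fake_fetch`, `crawl_threaded`, and the URLs are hypothetical, and `time.sleep` stands in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Pretend to download a page; the sleep simulates waiting on the network."""
    time.sleep(delay)
    return f"fetched {url}"


def crawl_threaded(urls: list[str], workers: int = 8) -> tuple[list[str], float]:
    """Fetch all URLs on a thread pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order even though the calls overlap in time.
        results = list(pool.map(fake_fetch, urls))
    return results, time.perf_counter() - start


URLS = [f"https://example.com/page/{i}" for i in range(8)]

if __name__ == "__main__":
    results, elapsed = crawl_threaded(URLS)
    print(f"{len(results)} pages in {elapsed:.2f}s")
```

Blocking waits release the GIL, so the simulated downloads overlap. The costs of this model are one OS thread per in-flight request and contention for the GIL during the Python-level work done between requests.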