// the find
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
ScrapeGraphAI lets you point an LLM at a URL and describe in plain English what you want extracted, instead of writing CSS selectors or XPath. It wraps Playwright for fetching, handles HTML-to-markdown conversion, and ships a graph-based pipeline model in which each processing step (fetch, parse, RAG, generate) is a node you can chain or swap. The target audience is Python developers who need structured data from the web without maintaining brittle scrapers.
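To make the "each step is a node" idea concrete, here is a minimal sketch in plain Python. It is not ScrapeGraphAI's API: every name in it (`fetch_node`, `parse_node`, `generate_node`, `word_count_node`, `run_pipeline`) is hypothetical, the fetch step returns canned HTML instead of driving a browser, and the generate step uses a regex where the real library would call an LLM.

```python
import re
from typing import Any, Callable, Dict, List

# Each node takes the shared state and returns an updated copy of it.
State = Dict[str, Any]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    # Stand-in for a browser fetch: canned HTML keeps the sketch offline.
    html = "<html><body><h1>Acme Widget</h1><p>Price: $19.99</p></body></html>"
    return {**state, "html": html}


def parse_node(state: State) -> State:
    # Crude tag stripping as a placeholder for HTML-to-markdown conversion.
    text = re.sub(r"<[^>]+>", " ", state["html"])
    return {**state, "text": " ".join(text.split())}


def generate_node(state: State) -> State:
    # Stand-in for the LLM call: a regex pretends to answer the prompt.
    match = re.search(r"\$\d+(?:\.\d+)?", state["text"])
    return {**state, "answer": {"price": match.group(0) if match else None}}


def word_count_node(state: State) -> State:
    # A custom node with the same signature, usable as a drop-in replacement.
    return {**state, "answer": {"words": len(state["text"].split())}}


def run_pipeline(nodes: List[Node], state: State) -> State:
    # Run the nodes in order, threading the state through each one.
    for node in nodes:
        state = node(state)
    return state


initial = {"url": "https://example.com/widget", "prompt": "Extract the price"}
result = run_pipeline([fetch_node, parse_node, generate_node], initial)
swapped = run_pipeline([fetch_node, parse_node, word_count_node], initial)
print(result["answer"])
print(swapped["answer"])
```

Because every node shares one signature, swapping the last step means changing one list entry. That is the property the graph model is selling, and the extra indirection is also where its overhead comes from.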
The node/graph architecture is genuinely extensible: you can drop in a custom node, change the LLM, or rewire the pipeline without touching the core. Local model support via Ollama is first-class rather than an afterthought, which matters for scraping tasks where you might not want data leaving your network. The pipeline variety (SmartScraper, SearchGraph, ScriptCreator, multi-page variants) covers more real use cases than most scraping libraries. The split between the open-source library and the managed API is cleanly documented, so you understand the tradeoff before you commit.
Token costs are the elephant in the room: every extraction call hits an LLM, so scraping 10,000 pages is financially painful compared to writing an XPath expression once. Anti-bot and JS-heavy sites are your problem to solve: the library gives you Playwright but has no built-in stealth or proxy rotation (there's an example file for proxy rotation, not a real solution). The graph abstraction adds real cognitive overhead for simple cases where `requests` + `beautifulsoup` would be five lines. Test coverage looks thin for a library with this many graph variants, and the open issue count likely reflects reliability gaps on sites with unusual rendering.
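A back-of-the-envelope estimate shows how the token bill scales with page count. All numbers below are assumptions chosen for illustration, not measured figures or any provider's actual pricing: 5,000 input tokens and 300 output tokens per page, billed at $2.50 and $10.00 per million tokens. The function name `estimate_llm_cost` is hypothetical.

```python
def estimate_llm_cost(
    pages: int,
    input_tokens_per_page: int,
    output_tokens_per_page: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    # Cost = tokens / 1e6 * price per million tokens, summed over input and output.
    input_cost = pages * input_tokens_per_page / 1_000_000 * usd_per_million_input
    output_cost = pages * output_tokens_per_page / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Assumed figures only; substitute your own page sizes and model pricing.
cost = estimate_llm_cost(
    pages=10_000,
    input_tokens_per_page=5_000,
    output_tokens_per_page=300,
    usd_per_million_input=2.50,
    usd_per_million_output=10.00,
)
print(f"Estimated cost for 10,000 pages: ${cost:.2f}")
```

Under these assumptions the run costs $155.00, and the cost recurs on every re-scrape, whereas a hand-written selector costs nothing per page once it works. Input tokens dominate, so trimming the page before it reaches the model is the obvious lever.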