// the find

ScrapeGraphAI/Scrapegraph-ai

★ 27,704 · Python · MIT · updated Jun 2026

Python scraper based on AI

ScrapeGraphAI lets you point an LLM at a URL and describe what you want extracted in plain English, instead of writing CSS selectors or XPath. It wraps Playwright for fetching, handles HTML-to-markdown conversion, and ships a graph-based pipeline model where each processing step (fetch, parse, RAG, generate) is a node you can chain or swap. The target audience is Python developers who need structured data from the web without maintaining brittle scrapers.
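To make the "URL plus plain-English prompt" idea concrete, here is a minimal sketch modeled on the usage pattern shown in the project's README. The config keys, the `ollama/llama3.2` model string, the example URL, and the helper names `build_graph_config` and `scrape` are assumptions for illustration, and the exact options may differ between versions.

```python
def build_graph_config(model: str = "ollama/llama3.2") -> dict:
    """Build a config dict in the shape the README uses (assumed; check your version)."""
    return {
        "llm": {"model": model, "model_tokens": 8192},
        "verbose": False,
        "headless": True,
    }


def scrape(url: str, prompt: str, config: dict):
    """Run one extraction: a prompt describes the data, no selectors involved."""
    # Imported lazily so this module still loads without scrapegraphai installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=url, config=config)
    return graph.run()


# Hypothetical usage (needs scrapegraphai, Playwright, and a reachable model):
# result = scrape(
#     "https://example.com",
#     "Extract the page title and the first paragraph",
#     build_graph_config(),
# )
```

The point of the sketch is the shape of the call: the only site-specific inputs are a URL and a sentence, which is what replaces the selector-writing step.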

The node/graph architecture is genuinely extensible: you can drop in a custom node, change the LLM, or rewire the pipeline without touching the core. Local model support via Ollama is first-class, not an afterthought, which matters for scraping tasks where you might not want data leaving your network. The pipeline variety (SmartScraper, SearchGraph, ScriptCreator, multi-page variants) covers more real use cases than most scraping libraries. The open-source/managed API split is cleanly documented, so you understand the tradeoff before you commit.
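The extensibility claim is easier to picture with a toy model. The following is a generic sketch of a node pipeline in plain Python, not ScrapeGraphAI's actual classes: `Node`, `run_pipeline`, `insert_after`, and the stub `fetch`/`parse`/`generate` functions are all hypothetical stand-ins (the fetch returns canned HTML and the "LLM" is a string format).

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class Node:
    name: str
    run: Callable[[State], State]  # reads the state, returns keys to merge in


def run_pipeline(nodes: List[Node], state: State) -> State:
    """Run nodes in order, merging each node's output into the shared state."""
    for node in nodes:
        state = {**state, **node.run(state)}
    return state


def insert_after(nodes: List[Node], target: str, new_node: Node) -> List[Node]:
    """Return a new pipeline with new_node placed right after the named node."""
    out: List[Node] = []
    for node in nodes:
        out.append(node)
        if node.name == target:
            out.append(new_node)
    return out


def fetch(state: State) -> State:
    # Stand-in for a real browser fetch: returns canned HTML.
    return {"html": "<h1>Widget</h1><p>Price: $9</p>"}


def parse(state: State) -> State:
    # Strip tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", state["html"])
    return {"text": " ".join(text.split())}


def generate(state: State) -> State:
    # Stand-in for an LLM call: just echoes prompt and text.
    return {"answer": f"[{state['prompt']}] {state['text']}"}


base = [Node("fetch", fetch), Node("parse", parse), Node("generate", generate)]

# Drop in a custom node without editing the existing ones.
lowercase = Node("lowercase", lambda s: {"text": s["text"].lower()})
custom = insert_after(base, "parse", lowercase)

result = run_pipeline(custom, {"prompt": "list products"})
print(result["answer"])
```

Because each node only reads and extends a shared state, swapping the model or adding a step is a change to the node list rather than to the nodes themselves, which is the property the review is praising.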

Token costs are the elephant in the room: every extraction call hits an LLM, so scraping 10,000 pages means paying for 10,000 model calls, where a hand-written XPath costs nothing per page once it exists. Anti-bot and JS-heavy sites are your problem to solve: the library gives you Playwright but no stealth and no built-in proxy rotation (there's an example file for proxy rotation, not a real solution). The graph abstraction adds real cognitive overhead for simple cases where `requests` + `beautifulsoup` would be five lines. Test coverage looks thin for a library with this many graph variants, and the open issue count likely reflects reliability gaps on sites with unusual rendering.
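A back-of-the-envelope estimate shows how the per-page cost adds up. Every number below is a placeholder assumption (tokens per page and per-million-token prices vary widely by site and provider), and `estimate_cost_usd` is a hypothetical helper, so treat this as a template for plugging in your own figures.

```python
def estimate_cost_usd(
    pages: int,
    input_tokens_per_page: int,
    output_tokens_per_page: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate total LLM spend for one extraction call per page."""
    input_cost = pages * input_tokens_per_page / 1_000_000 * usd_per_million_input
    output_cost = pages * output_tokens_per_page / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Placeholder inputs: 10,000 pages, 4,000 input and 300 output tokens per page,
# priced at $2.50 and $10.00 per million tokens (assumed, not real quotes).
cost = estimate_cost_usd(10_000, 4_000, 300, 2.50, 10.00)
print(f"${cost:.2f}")  # prints "$130.00"
```

With a local model via Ollama the dollar figure drops to zero, but the same token counts translate into compute time instead.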
