// the find

chiphuyen/lazynlp

★ 2,265 · Python · updated Nov 2020

Library to scrape and clean web pages to create massive datasets.

A Python library for crawling and cleaning web pages to build large text datasets for language model training. It is aimed at researchers who want to replicate or exceed the scale of GPT-2's WebText dataset without paying for existing crawl dumps. The Reddit URL dumps and Gutenberg scrapers come pre-wired, so you can get started quickly.

Bloom filter-based deduplication is the right tool here: it stays memory-efficient at the scale of tens of millions of documents, at the cost of a small, tunable false-positive rate. The per-file output format (URL on line 1, then cleaned text) is simple and makes resumable crawls easy to implement on top. The domain/extension blocklist for skipping scraper-unfriendly and NSFW sites saves real time when processing raw Reddit dumps. Clean separation between the crawling, cleaning, and dedup steps means you can swap in your own cleaning logic without touching the rest.
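To make the memory argument concrete, here is a minimal, standard-library sketch of Bloom filter deduplication. It is not lazynlp's implementation: the `BloomFilter` class, the `dedupe` helper, and their parameters are hypothetical names chosen for illustration, and the sketch only catches exact duplicates after whitespace and case normalization, which is simpler than any overlap measure the library may use.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def dedupe(documents, capacity=1_000_000, error_rate=0.001):
    """Yield documents whose normalized text has not been seen before."""
    seen = BloomFilter(capacity, error_rate)
    for doc in documents:
        # Normalize case and whitespace so trivial variants collapse together.
        key = " ".join(doc.lower().split())
        if key in seen:
            continue  # probably a duplicate (or, rarely, a false positive)
        seen.add(key)
        yield doc


if __name__ == "__main__":
    docs = ["Hello  world", "hello world", "something else"]
    print(list(dedupe(docs, capacity=1000)))
```

The filter stores only bits, never the documents themselves, so memory is set by `capacity` and `error_rate` rather than by text length. The trade-off is that a small fraction of unique documents may be dropped as false positives.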

The last commit was in 2020, and the pushshift.io Reddit dumps it links to are gone (Pushshift shut down public access in 2023), so the biggest advertised data source no longer works. Parallelism is left to the user, who manually runs 40 shell processes; there is no built-in concurrency, rate limiting, or retry logic, which is a significant gap for production-scale crawls. There is no support for JavaScript-rendered pages (no Playwright/Selenium integration), so a large fraction of modern URLs silently return empty content. Python packaging and dependencies haven't been touched in years, and some dependencies have since had breaking releases.
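For a sense of what the missing concurrency and retry layer might look like, here is a rough sketch using only the standard library. Nothing in it comes from lazynlp: `crawl`, `fetch_with_retry`, and the `fetch` callable are hypothetical, and `fetch` stands in for whatever download-and-clean function you already have. Capping `max_workers` bounds concurrency but is not true rate limiting, which would need per-domain throttling on top.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, retries=3, backoff=0.5):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                return None  # give up; the caller can log or requeue the URL
            time.sleep(backoff * (2 ** attempt))


def crawl(urls, fetch, max_workers=8, retries=3, backoff=0.5):
    """Fetch URLs with a bounded thread pool; return {url: result or None}."""
    urls = list(urls)  # materialize so the list can be zipped with results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda u: fetch_with_retry(fetch, u, retries, backoff), urls)
        return dict(zip(urls, results))


if __name__ == "__main__":
    # Stand-in fetcher for demonstration; replace with a real downloader.
    def fake_fetch(url):
        return f"cleaned text for {url}"

    print(crawl(["https://example.com/a", "https://example.com/b"], fake_fetch))
```

Threads suit this workload because crawling is I/O-bound. A URL that maps to `None` failed every attempt, so it can be written to a retry list instead of being silently lost.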
