// the find
buckyroberts/Spider
Python website crawler.
A minimal multi-threaded Python web crawler that collects links from a target domain. It was conceived as the crawler component of a larger search engine project from the thenewboston YouTube channel. At under 200 lines total, it's a teaching artifact, not a production tool.
Threading is handled simply with a Queue and a fixed worker pool, which is easy to follow and adequate for small crawls. The code is split sensibly across domain/link_finder/spider modules rather than stuffed into one file. Good as a learning example for how a breadth-first crawl works.
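For readers who want to see the shape of that pattern, here is a minimal sketch of a shared queue feeding a fixed pool of worker threads. It is not the repo's code: `FAKE_SITE`, `crawl`, `get_links`, and `NUM_WORKERS` are hypothetical names, and an in-memory dict stands in for fetching and parsing real pages. With several workers the visit order is only roughly breadth-first, since pages at the same depth are processed concurrently.

```python
import threading
from queue import Queue

# Hypothetical in-memory "site": page -> links found on that page.
# A real crawler would download and parse HTML here instead.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

NUM_WORKERS = 4  # fixed pool size (illustrative value)


def crawl(start, get_links, num_workers=NUM_WORKERS):
    """Crawl outward from `start` using a shared queue and a fixed thread pool."""
    frontier = Queue()       # FIFO queue of pages waiting to be processed
    seen = {start}           # pages already queued, so none is crawled twice
    lock = threading.Lock()  # guards `seen` across worker threads

    def worker():
        while True:
            page = frontier.get()
            try:
                for link in get_links(page):
                    with lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    frontier.put(link)
            finally:
                # Called after new links are queued, so join() cannot
                # return while work is still being discovered.
                frontier.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    frontier.put(start)
    frontier.join()  # blocks until every queued page has been processed
    return seen


crawled = crawl("/", lambda page: FAKE_SITE.get(page, []))
print(sorted(crawled))  # prints ['/', '/a', '/b', '/c']
```

The workers are daemon threads that block on `frontier.get()` forever, which keeps the sketch short; a longer-lived program would want an explicit shutdown signal.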
No robots.txt compliance, no rate limiting, and no politeness delays — running this against a real site is rude at best and gets you banned at worst. Last commit was January 2023 and the README still links to Google+ (shut down in 2019), which tells you everything about the maintenance cadence. No persistence layer, so a crash loses all crawl state. The 'larger search engine' this was supposed to feed was never completed.
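As a rough idea of what the missing politeness layer could look like, the sketch below checks URLs against robots.txt rules with the standard library's `urllib.robotparser` and enforces a minimum gap between requests. Everything here is an assumption for illustration: `USER_AGENT`, `MIN_DELAY_SECONDS`, `ROBOTS_TXT`, `load_rules`, and `Throttle` are hypothetical, the rules are parsed from an inline string rather than downloaded, and nothing is actually fetched.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent string
MIN_DELAY_SECONDS = 1.0         # illustrative gap between requests

# Inline stand-in for a site's robots.txt; a real crawler would download it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock   # injectable so the timing logic can be tested
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_delay - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now


rules = load_rules(ROBOTS_TXT)
throttle = Throttle(MIN_DELAY_SECONDS)

for url in ["https://example.com/", "https://example.com/private/data"]:
    if not rules.can_fetch(USER_AGENT, url):
        print("skipping (disallowed):", url)
        continue
    throttle.wait()  # pause if the previous request was too recent
    print("would fetch:", url)
```

This `Throttle` is not thread-safe as written; in a multi-threaded crawler like this one, `wait()` would need a lock around it, or each domain would need its own throttle.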