// the find
okfn-brasil/querido-diario
📰 Diários oficiais brasileiros acessíveis a todos | 📰 Brazilian government gazettes, accessible to everyone.
Querido Diário is a Scrapy-based scraper collection that harvests official government gazettes (diários oficiais) from hundreds of Brazilian municipalities. It is civic infrastructure run by Open Knowledge Brasil: it makes legally mandated public records searchable and accessible in practice, not just on paper. The audience is journalists, researchers, transparency advocates, and anyone who needs programmatic access to Brazilian public administration documents.
1. Scale is impressive: hundreds of individual spiders organized by state/city, with reusable base classes for common gazette publishing platforms (doem, dionet, sigpub, etc.), so adding a new municipality that uses a known system takes only a few lines.
2. The base spider library is the real asset here. Recognizing that dozens of municipalities share the same backend (e.g., adiarios_v1/v2, siganet) and extracting that into shared base classes keeps the codebase from devolving into a pile of copy-pasted spiders.
3. CI is well-structured: separate workflows for daily crawls, monthly crawls, and scheduled one-off runs per spider, plus a spider status updater. This is the operational complexity that usually kills civic-tech projects.
4. The territories.csv resource and IBGE code conventions anchor each spider to a canonical municipality identifier, which matters when you're trying to correlate gazette data with other government datasets.
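The shared-base-class pattern praised above can be sketched in a few lines. This is a minimal illustration in plain Python, not the repo's actual code: `BasePlatformSpider`, `listing_url`, the URL shape, and the placeholder territory code are all hypothetical, and the real spiders build on Scrapy.

```python
from datetime import date


class BasePlatformSpider:
    """Hypothetical stand-in for a shared platform base class (not the repo's API)."""

    name = ""
    territory_id = ""  # IBGE municipality code
    base_url = ""
    start_date = date(2000, 1, 1)

    def listing_url(self, day: date) -> str:
        # Assumption: every municipality on this platform shares one URL shape.
        return f"{self.base_url}/diario?data={day.isoformat()}"


class ExampleCitySpider(BasePlatformSpider):
    # A municipality on a known platform only declares its identifiers.
    name = "xx_example_city"
    territory_id = "0000000"  # placeholder, not a real IBGE code
    base_url = "https://example.org/example-city"
    start_date = date(2015, 1, 1)


if __name__ == "__main__":
    print(ExampleCitySpider().listing_url(date(2024, 1, 2)))
```

All crawling logic lives in the base class, so a fix to the platform's parsing applies to every municipality that inherits from it.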
1. No structured output schema enforcement at spider level: items.py defines the shape, but individual spiders can and do yield partial or malformed items, with no validation layer catching them before they hit the pipeline.
2. The date range handling (start_date/end_date arguments) is implemented per-spider with no shared contract, so behavior varies: some spiders silently ignore the arguments, and others handle edge cases differently. This is a real footgun for anyone building on top of the data.
3. No deduplication or idempotency story for the crawl pipeline: re-running a spider for a date range it already crawled will produce duplicates unless the downstream database handles it, and that logic lives outside this repo.
4. The split between spiders with year-suffixed filenames (ba_correntina_2007.py, ba_correntina_2025.py) and single-file spiders is confusing and inconsistent. It's unclear whether a single spider handles the full history or whether you need to run multiple spiders to get complete coverage for a given municipality.
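The first three gaps above could be narrowed with a small shared layer. The following is a sketch under stated assumptions, not something the repo provides: the required field names, `validate_item`, `in_range`, and `dedup_key` are all hypothetical, and items are modeled as plain dicts rather than Scrapy items.

```python
import hashlib
from datetime import date
from typing import List, Optional

# Assumed minimal schema; the real items.py may define more or different fields.
REQUIRED_FIELDS = ("date", "file_urls", "territory_id")


def validate_item(item: dict) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not item.get(f)]
    if item.get("date") and not isinstance(item["date"], date):
        problems.append("date is not a datetime.date")
    return problems


def in_range(day: date, start: Optional[date], end: Optional[date]) -> bool:
    """One shared meaning for the bounds: both inclusive, None means open-ended."""
    return (start is None or day >= start) and (end is None or day <= end)


def dedup_key(item: dict) -> str:
    """Stable key, so re-crawling the same range maps to the same record."""
    urls = "|".join(sorted(item["file_urls"]))
    raw = f'{item["territory_id"]}:{item["date"].isoformat()}:{urls}'
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    gazette = {
        "date": date(2024, 1, 2),
        "file_urls": ["https://example.org/a.pdf"],
        "territory_id": "0000000",  # placeholder, not a real IBGE code
    }
    print(validate_item(gazette), in_range(gazette["date"], None, None))
    print(dedup_key(gazette))
```

In a Scrapy project, checks like these would most naturally sit in an item pipeline, so every spider passes through the same validation and the downstream store can upsert on the key instead of inserting blindly.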