// the find
sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
Augraphy is a Python library for generating synthetic degraded document images — think dirty scans, faxes, and photocopies — from clean originals. It's purpose-built for training OCR and document AI models where you have clean source documents but need paired noisy versions. If you work on document understanding, denoising, or form extraction, this fills a gap that general image augmentation libraries don't touch.
The three-layer pipeline architecture (ink → paper → post-merge) is physically motivated and produces more realistic results than naively stacking filters. The augmentation catalog is wide: 50+ transforms covering everything from bleed-through and letterpress to book binding curvature and moiré patterns, with spatial augmentations that correctly propagate masks, keypoints, and bounding boxes. There's an ICDAR 2023 paper behind it, so the design isn't arbitrary. The benchmark table with per-augmentation throughput and memory numbers is the kind of honesty most libraries skip.
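To make the ink → paper → post-merge ordering concrete, here is a toy sketch in plain Python. It is not Augraphy's API: every name below (`fade_ink`, `tint_paper`, `darken_scan`, `run_pipeline`) is made up for illustration, the "image" is a nested list of grayscale floats, and merging by multiplication is an assumption about how ink and paper could be combined. It only shows why phases differ from a flat filter stack: ink and paper are degraded separately, merged, and only then hit with whole-page effects.

```python
from typing import Callable, List

# Grayscale image as nested lists: 0.0 is black ink, 1.0 is blank white.
Image = List[List[float]]
Transform = Callable[[Image], Image]


def fade_ink(img: Image) -> Image:
    """Hypothetical ink-phase transform: lighten dark pixels (low toner)."""
    return [[min(1.0, px + 0.2) if px < 0.5 else px for px in row] for row in img]


def tint_paper(img: Image) -> Image:
    """Hypothetical paper-phase transform: make the blank sheet slightly grey."""
    return [[px * 0.9 for px in row] for row in img]


def darken_scan(img: Image) -> Image:
    """Hypothetical post-phase transform: uniform exposure drop on the page."""
    return [[px * 0.95 for px in row] for row in img]


def apply_all(img: Image, transforms: List[Transform]) -> Image:
    """Apply transforms left to right."""
    for transform in transforms:
        img = transform(img)
    return img


def run_pipeline(
    ink: Image,
    ink_phase: List[Transform],
    paper_phase: List[Transform],
    post_phase: List[Transform],
) -> Image:
    """Degrade ink and paper separately, merge them, then post-process."""
    ink = apply_all(ink, ink_phase)
    # Start from a blank white sheet the same size as the ink layer.
    paper = apply_all([[1.0 for _ in row] for row in ink], paper_phase)
    # Assumed merge rule: multiply, so ink darkens whatever paper is beneath it.
    merged = [
        [i * p for i, p in zip(ink_row, paper_row)]
        for ink_row, paper_row in zip(ink, paper)
    ]
    return apply_all(merged, post_phase)


if __name__ == "__main__":
    clean = [[0.0, 1.0]]  # one inked pixel next to one blank pixel
    print(run_pipeline(clean, [fade_ink], [tint_paper], [darken_scan]))
```

Note that the blank pixel ends up grey purely from the paper and post phases, while the inked pixel carries all three effects. A flat filter stack applied to the finished page can't make that distinction.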
Performance on the slow end is punishing: BookBinding at 0.09 img/sec and LensFlare at 0.01 img/sec on a 2-core Xeon work out to roughly 11 and 100 seconds per image, so a single augmented epoch over even a few thousand images could take anywhere from hours to days, and there's no GPU path mentioned. Several spatial augmentations mark bounding box support as '✓*' with no clear explanation of what the asterisk costs you. The library is NumPy/OpenCV-heavy with no native integration into PyTorch or TensorFlow data pipelines, so you're wiring it yourself. At 547 stars and 63 forks it's lightly adopted, even allowing for how niche it is, which means fewer people have stress-tested the edge cases you'll hit in production.
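For a feel of what those throughput figures mean in practice, here is a back-of-the-envelope estimator. The two rates are the ones quoted above; the 10,000-image dataset size, the `epoch_hours` helper, and the `workers` parameter are illustrative assumptions. Treating extra worker processes as a perfect linear speedup is optimistic, so read the result as a lower bound.

```python
def epoch_hours(num_images: int, images_per_second: float, workers: int = 1) -> float:
    """Estimate wall-clock hours to augment every image once.

    Assumes throughput scales linearly with `workers`, which is an idealised
    best case rather than a measured property of the library.
    """
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    seconds = num_images / (images_per_second * workers)
    return seconds / 3600


# Per-augmentation rates quoted in the review (2-core Xeon).
THROUGHPUT = {"BookBinding": 0.09, "LensFlare": 0.01}

if __name__ == "__main__":
    dataset_size = 10_000  # hypothetical dataset size
    for name, rate in THROUGHPUT.items():
        hours = epoch_hours(dataset_size, rate)
        print(f"{name}: {hours:.1f} h for {dataset_size:,} images")
    # BookBinding: 30.9 h for 10,000 images
    # LensFlare: 277.8 h for 10,000 images
```

At those numbers, the practical move is probably to pre-generate augmented images offline once instead of augmenting on the fly each epoch, or to leave the slowest transforms out of the training-time pipeline.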