// the find
microsoft/presidio
An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.
Presidio is Microsoft's PII detection and anonymization library for Python, covering text, images, and structured data. It combines spaCy/transformer NER with pattern-based recognizers and supports redacting everything from credit card numbers to DICOM medical images. It's aimed at data engineering teams that need to sanitize data before feeding it into analytics pipelines or LLMs.
The recognizer architecture is genuinely extensible: you can plug in custom regex patterns, checksum validators, or external NLP models without forking the core. Image redaction, including DICOM, is rare among open-source PII tools and useful for healthcare use cases. PySpark integration means it scales to bulk anonymization jobs on real datasets, not just single-document demos. The anonymizer pipeline supports reversible pseudonymization (encrypt/decrypt operators), which is critical for workflows that need to de-identify before processing and re-identify results afterward.
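To make the "regex pattern plus checksum validator" idea concrete, here is a minimal standard-library sketch of the technique. It is not Presidio's API (Presidio exposes the same idea through its own recognizer classes such as `PatternRecognizer`); the names `Finding`, `CARD_PATTERN`, `luhn_valid`, and `find_card_numbers` are hypothetical, and the fixed score of 1.0 is an assumption made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """Hypothetical result record: entity label, character span, confidence."""
    entity_type: str
    start: int
    end: int
    score: float


# Loose candidate pattern: 13 to 19 digits, optionally separated by single
# spaces or hyphens. On its own this matches many non-card numbers.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> List[Finding]:
    """Regex proposes candidates; the checksum discards false positives."""
    findings = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            # Assumed score: a checksum pass is treated as full confidence.
            findings.append(Finding("CREDIT_CARD", match.start(), match.end(), 1.0))
    return findings


sample = "ok 4111 1111 1111 1111, bad 4111 1111 1111 1112"
hits = find_card_numbers(sample)
```

Here `hits` contains only the first number, since the second fails the checksum. The split matters in practice: the regex can stay deliberately loose because the validator, not the pattern, carries the precision.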
NER-based detection has a hard ceiling on recall for non-English text. Multilingual support is patchy and depends on which spaCy model you load, so anything beyond English and a few Western European languages requires significant custom work. False positive rates on names and locations are high enough that you cannot run it unsupervised on production data without a review step, which the README quietly acknowledges. The image redactor relies on OCR (Tesseract by default), so its accuracy is bounded by OCR quality: blurry scans or non-standard fonts will leak PII silently. The analyzer accepts a score threshold and can log its decision process, but there's no turnkey audit trail or enforced minimum confidence, so you have to wire those up yourself if you need to prove compliance to an auditor.
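Since the audit trail has to be wired up by hand, one possible shape for that glue is sketched below using only the standard library. It assumes detections arrive as records carrying an entity type, a character span, and a score; the `Detection` class, the `gate_and_audit` function, and the 0.6 threshold are all hypothetical, not part of Presidio. Audit records store offsets only, never the matched text, so the log does not become a second copy of the PII.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


@dataclass
class Detection:
    """Hypothetical detector output: label, character span, confidence."""
    entity_type: str
    start: int
    end: int
    score: float


def gate_and_audit(
    doc_id: str,
    detections: Iterable[Detection],
    threshold: float = 0.6,  # assumed cut-off; tune per entity type in practice
) -> Tuple[List[Detection], List[str]]:
    """Split detections by confidence and emit one JSON audit line per decision."""
    accepted: List[Detection] = []
    audit_lines: List[str] = []
    for det in detections:
        # Below-threshold hits are routed to human review, not dropped.
        decision = "redact" if det.score >= threshold else "review"
        if decision == "redact":
            accepted.append(det)
        record = {
            "doc_id": doc_id,
            "entity_type": det.entity_type,
            "start": det.start,
            "end": det.end,
            "score": det.score,
            "threshold": threshold,
            "decision": decision,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        audit_lines.append(json.dumps(record, sort_keys=True))
    return accepted, audit_lines


detections = [
    Detection("PERSON", 0, 5, 0.85),
    Detection("LOCATION", 10, 15, 0.40),
]
accepted, audit_lines = gate_and_audit("doc-1", detections, threshold=0.6)
```

Each string in `audit_lines` can be appended to a write-once log. Sending low-confidence hits to review rather than silently discarding them is a design choice that matches the review step the false positive rate already demands.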