// the find
moj-analytical-services/splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Splink is a Python library for probabilistic record linkage and deduplication — finding which rows across datasets refer to the same real-world entity when there's no shared unique ID. It implements the Fellegi-Sunter model with EM training, runs on DuckDB locally or Spark/Athena at scale, and requires no labeled training data. Built by the UK Ministry of Justice, it's production-proven on census-scale data.
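To make the Fellegi-Sunter and EM description above concrete, here is a minimal pure-Python sketch of the idea. It is not Splink's implementation. It assumes each comparison is a plain agree/disagree flag (Splink supports multiple comparison levels per column), and the simulated data, the starting values, and the names `simulate_pairs`, `posterior` and `em_estimate` are all hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical ground truth, used only to simulate unlabeled comparison vectors.
TRUE_LAMBDA = 0.1                    # share of candidate pairs that are matches
TRUE_M = [0.95, 0.90, 0.85, 0.90]    # P(field agrees | pair is a match)
TRUE_U = [0.05, 0.10, 0.20, 0.02]    # P(field agrees | pair is a non-match)


def simulate_pairs(n):
    """Return n binary agreement vectors; the match labels are thrown away."""
    pairs = []
    for _ in range(n):
        probs = TRUE_M if random.random() < TRUE_LAMBDA else TRUE_U
        pairs.append([1 if random.random() < p else 0 for p in probs])
    return pairs


def clamp(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 so ratios and logs stay finite."""
    return min(max(p, eps), 1.0 - eps)


def posterior(gamma, lam, m, u):
    """P(match | agreement vector), assuming conditionally independent fields."""
    p_match, p_non = lam, 1.0 - lam
    for j, agree in enumerate(gamma):
        p_match *= m[j] if agree else 1.0 - m[j]
        p_non *= u[j] if agree else 1.0 - u[j]
    return p_match / (p_match + p_non)


def em_estimate(pairs, n_iter=30, lam=0.2, m_init=0.8, u_init=0.2):
    """Estimate lambda, m and u from unlabeled agreement vectors with EM."""
    k = len(pairs[0])
    m, u = [m_init] * k, [u_init] * k
    for _ in range(n_iter):
        # E-step: soft match label for every pair under the current parameters.
        post = [posterior(g, lam, m, u) for g in pairs]
        # M-step: re-estimate parameters as posterior-weighted agreement rates.
        n_match = sum(post)
        n_non = len(pairs) - n_match
        lam = clamp(n_match / len(pairs))
        for j in range(k):
            agree_match = sum(p * g[j] for p, g in zip(post, pairs))
            agree_non = sum((1.0 - p) * g[j] for p, g in zip(post, pairs))
            m[j] = clamp(agree_match / n_match)
            u[j] = clamp(agree_non / n_non)
    return lam, m, u


pairs = simulate_pairs(5_000)
lam_hat, m_hat, u_hat = em_estimate(pairs)

# Match weight per field: log2 of the Bayes factor when that field agrees.
agree_weights = [math.log2(m / u) for m, u in zip(m_hat, u_hat)]

print("estimated lambda:", round(lam_hat, 3))
print("agreement weights:", [round(w, 2) for w in agree_weights])
print("P(match | all agree):", posterior([1, 1, 1, 1], lam_hat, m_hat, u_hat))
print("P(match | none agree):", posterior([0, 0, 0, 0], lam_hat, m_hat, u_hat))
```

The simulation discards the true labels before training, so the estimates come only from the pattern of agreements. That is the sense in which no labeled training data is needed.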
1. The backend abstraction is genuinely well designed: the same model code runs on DuckDB for a million records on a laptop or on Spark for 100M+ records in the cloud, with PostgreSQL and Athena also supported.
2. Unsupervised EM training means you can get reasonable match weights without any ground-truth labels, which is the realistic situation for most entity resolution problems.
3. Term frequency adjustments are built in as a first-class feature: 'John Smith' matching 'John Smith' is treated as weaker evidence than 'Zoltan Kowalczyk' matching 'Zoltan Kowalczyk', a distinction most DIY implementations miss entirely.
4. The interactive diagnostics tooling (comparison viewer, cluster studio, waterfall charts) is practical for debugging why specific pairs did or didn't match. It goes beyond accuracy metrics to per-record explainability.
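The term frequency point above can be illustrated with a small sketch. This is a simplified version of the general idea, not necessarily the exact formula Splink applies: the chance that two unrelated records agree on a value is approximated by how common that value is. The surname data, the `M_EXACT` value and the function name `tf_adjusted_weight` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical surname column: two common names, one rare name, many singletons.
surnames = (
    ["Smith"] * 200
    + ["Jones"] * 99
    + ["Kowalczyk"]
    + [f"surname_{i}" for i in range(700)]
)
term_freq = {name: n / len(surnames) for name, n in Counter(surnames).items()}

M_EXACT = 0.9  # assumed P(surnames agree exactly | records are a true match)


def tf_adjusted_weight(value):
    """Match weight for an exact match on `value`, using its own frequency as u."""
    return math.log2(M_EXACT / term_freq[value])


# Unadjusted u: probability that two random records share any surname at all.
u_average = sum(f * f for f in term_freq.values())
average_weight = math.log2(M_EXACT / u_average)

for name in ("Smith", "Jones", "Kowalczyk"):
    print(f"{name:10s} weight = {tf_adjusted_weight(name):.2f}")
print(f"{'(average)':10s} weight = {average_weight:.2f}")
```

Without the adjustment, every exact surname match would receive the single average weight, which overstates the evidence from a common name and understates it for a rare one.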
1. The Fellegi-Sunter model assumes conditional independence between comparison columns. If your name and address fields are correlated (they usually are), the shared evidence is counted more than once, your match weights are biased, and there's no automated warning about this.
2. Blocking rules are still a manual configuration step that requires domain expertise. Bad blocking silently kills recall because the missed candidate pairs are never generated, and the tooling for choosing blocking rules is diagnostic rather than prescriptive.
3. The Spark backend introduces significant operational overhead. The DuckDB path is excellent, but using Splink at true scale means owning a Spark cluster, and Spark-specific bugs and quirks get less testing attention than DuckDB.
4. The v3-to-v4 API break was significant enough that most Stack Overflow answers and older tutorials no longer work as written, and the migration path for existing models isn't smooth.
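The blocking point above is easy to see in a toy example. The sketch below is plain Python, not Splink's API: the records, the `candidate_pairs` helper and the rule format (a tuple of column names that must all agree exactly) are hypothetical, and it compares every pair directly, whereas a real backend would express each rule as a SQL join.

```python
from itertools import combinations

# Hypothetical records; ids 1 and 2 are the same person with a surname typo.
records = [
    {"id": 1, "first_name": "Maria", "surname": "Gonzalez", "dob": "1984-03-12"},
    {"id": 2, "first_name": "Maria", "surname": "Gonzales", "dob": "1984-03-12"},
    {"id": 3, "first_name": "Peter", "surname": "Gonzalez", "dob": "1990-07-01"},
    {"id": 4, "first_name": "Peter", "surname": "Nowak", "dob": "1971-11-30"},
]
true_matches = {(1, 2)}


def candidate_pairs(rows, blocking_rules):
    """Return id pairs that agree exactly on all columns of at least one rule."""
    pairs = set()
    for a, b in combinations(rows, 2):
        if any(all(a[col] == b[col] for col in rule) for rule in blocking_rules):
            pairs.add((a["id"], b["id"]))
    return pairs


def recall(pairs):
    """Share of true matches that survive blocking and can still be scored."""
    return len(true_matches & pairs) / len(true_matches)


surname_only = candidate_pairs(records, [("surname",)])
with_dob_rule = candidate_pairs(records, [("surname",), ("first_name", "dob")])

print("surname only:", sorted(surname_only), "recall =", recall(surname_only))
print("plus first_name + dob:", sorted(with_dob_rule), "recall =", recall(with_dob_rule))
```

With the surname rule alone, the true match never becomes a candidate, so no scoring model downstream can recover it. Adding a second rule on different columns restores it, which is why blocking rules are usually combined rather than tuned one at a time.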