// the find
synthetichealth/synthea
Synthetic Patient Population Simulator
Synthea generates synthetic patient health records from birth to death, modeling conditions, medications, encounters, and insurance through configurable state-machine modules. It outputs FHIR R4/STU3/DSTU2, C-CDA, CSV, and Medicare claims formats. The target audience is healthcare software developers, researchers, and anyone who needs realistic but legally safe patient data for testing or ML training.
The Generic Module Framework lets you drop in JSON-defined disease workflows without touching Java, so you can add new conditions without understanding the whole engine. The export coverage is unusually broad: FHIR Bulk Data NDJSON, C-CDA, and the CMS BB2 RIF format all come from one tool, so you're not stitching together three separate generators. Demographics are census-backed and geography-aware, so generated populations have realistic age/race/income distributions tied to actual US ZIP codes rather than uniform random noise. The physiology simulator (ECG and cardiovascular models via SBSCL) goes deeper than most test-data tools, producing time-series vitals that behave like real sensor output.
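To give a feel for what a JSON-defined workflow looks like, here is a minimal sketch that builds a toy module in Python and serializes it. The module itself ("Example Condition", the "Wait" state, the 10%/90% split) is hypothetical, and the key names (`states`, `type`, `direct_transition`, `distributed_transition`, `exact`) follow the Generic Module Framework layout as best understood here, so check them against the Synthea module documentation before relying on them.

```python
import json

# Hypothetical toy module. Key names are assumed to match the Generic
# Module Framework layout; verify against the Synthea docs before use.
module = {
    "name": "Example Condition",
    "remarks": ["Illustrative only; not a clinically validated model."],
    "states": {
        # Every module is assumed to start in a state named "Initial".
        "Initial": {"type": "Initial", "direct_transition": "Wait"},
        # Wait one simulated year, then either finish or wait again.
        "Wait": {
            "type": "Delay",
            "exact": {"quantity": 1, "unit": "years"},
            "distributed_transition": [
                {"distribution": 0.1, "transition": "Terminal"},
                {"distribution": 0.9, "transition": "Wait"},
            ],
        },
        "Terminal": {"type": "Terminal"},
    },
}

# Serialize to the JSON text that would be saved as a module file.
module_json = json.dumps(module, indent=2)
```

The point is that a workflow is plain data: states keyed by name, each naming its successors. No Java is involved until the engine loads the file.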
It's US-only out of the box: all demographics, insurance models, and cost tables are American, so international healthcare researchers have to rebuild most of the configuration data from scratch. Module authoring means writing JSON state machines, which get unwieldy fast for complex chronic disease progressions, and there's no tooling to catch logical errors until you run the full simulation. Parallelism is coarse-grained: you scale by running multiple processes, not threads per patient, which makes generating millions of records on a single machine awkward. The CMS claims export (BB2 RIF) is powerful, but the documentation is sparse relative to its complexity, so expect to read the source code to understand which fields map to which.
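Some of the logical errors mentioned above can be caught before a full run with a small script of your own. The following is a rough sketch, not part of Synthea: `transition_targets` and `lint_module` are hypothetical helpers, and they assume the module layout of `states` keyed by name, an entry state called "Initial", and the transition keys `direct_transition`, `distributed_transition`, `conditional_transition`, and `complex_transition`. It flags transitions to states that don't exist and states that can never be reached.

```python
def transition_targets(state):
    """Yield every state name that `state` can transition to."""
    if "direct_transition" in state:
        yield state["direct_transition"]
    for key in ("distributed_transition", "conditional_transition",
                "complex_transition"):
        for option in state.get(key, []):
            if "transition" in option:
                yield option["transition"]
            # complex transitions may nest a list of weighted options
            for dist in option.get("distributions", []):
                yield dist["transition"]


def lint_module(module):
    """Return a list of human-readable problems found in a module dict."""
    states = module.get("states", {})
    problems = []

    # 1. Every transition must point at a state that exists.
    for name, state in states.items():
        for target in transition_targets(state):
            if target not in states:
                problems.append(
                    f"{name}: transition to unknown state '{target}'")

    # 2. Every state should be reachable from "Initial" (assumed entry).
    reachable = set()
    frontier = ["Initial"] if "Initial" in states else []
    while frontier:
        current = frontier.pop()
        if current in reachable:
            continue
        reachable.add(current)
        frontier.extend(t for t in transition_targets(states[current])
                        if t in states)
    for name in states:
        if name not in reachable:
            problems.append(f"{name}: unreachable from Initial")

    return problems


# Hypothetical module with a misspelled target and an orphaned state.
example = {"states": {
    "Initial": {"type": "Initial", "direct_transition": "Onset"},
    "Onset": {"type": "Simple", "distributed_transition": [
        {"distribution": 0.5, "transition": "Terminal"},
        {"distribution": 0.5, "transition": "Recovry"}]},
    "Recovery": {"type": "Simple", "direct_transition": "Terminal"},
    "Terminal": {"type": "Terminal"},
}}
problems = lint_module(example)
```

A check like this only covers structure. It will not notice a clinically implausible probability, and it ignores anything a module does through submodules or attributes, so it complements a simulation run rather than replacing one.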