Synthetic Data 101

In a new blog series, Synthetic Data 101, we’ll break down everything you’ve wanted to know about synthetic data. What is it? Is it really private? Can it be used the same way as real data? And everything in between.

Today’s post kicks off the series — what, exactly, is synthetic data?


The statistics are real. The patients are not.

Synthetic data looks and acts like real data, reflecting the statistical properties of the underlying real dataset or datasets. But synthetic data is entirely artificial and does not contain any actual patient information.

While there are different approaches to creating synthetic data in healthcare and other industries, Syntegra synthetic data is generated using an out-of-the-box application of transformer-based language models, a groundbreaking machine learning approach. The result is a high level of accuracy with far stronger patient privacy protection than traditional de-identification. This approach allows us to capture longitudinality, that is, how a patient’s record unfolds over time, and to work with all types of structured data in any data format.

Our model learns the deep, underlying statistical distributions and relationships of a real dataset, in effect learning the stories of real patients and their interactions with the healthcare system. The model then uses what it has learned to create entirely new synthetic patient records. These records look like they could have come from a real person, with variables such as demographics, diagnoses, treatments and labs, but each synthetic patient record is just that: “synthetic.” This approach not only lets us replicate these learned relationships in the synthetic data, but also lets us generate entirely new, yet very realistic, synthetic patients built to fit any scenario for analytics, product development and more.
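To make the learn-then-generate idea concrete, here is a minimal sketch in Python. It is not Syntegra's model: it swaps the transformer-based language model for a much simpler first-order Markov chain, which only learns how often one event follows another. The toy records, the event labels and the function names `fit_transitions` and `sample_record` are all made up for illustration.

```python
import random
from collections import Counter, defaultdict

# Sentinel tokens marking the beginning and end of a record.
START, END = "<start>", "<end>"


def fit_transitions(records):
    """Count how often each event follows another across all records."""
    counts = defaultdict(Counter)
    for record in records:
        sequence = [START, *record, END]
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts


def sample_record(counts, rng, max_len=20):
    """Generate one synthetic record by walking the learned transitions."""
    record, current = [], START
    while len(record) < max_len:
        options = counts[current]
        # Pick the next event in proportion to how often it followed `current`.
        current = rng.choices(list(options), weights=list(options.values()))[0]
        if current == END:
            break
        record.append(current)
    return record


# Made-up toy "patient histories" (assumption: not real data, not a real schema).
real_records = [
    ["age:60-69", "dx:diabetes", "lab:hba1c_high", "rx:metformin"],
    ["age:60-69", "dx:hypertension", "rx:lisinopril"],
    ["age:40-49", "dx:diabetes", "rx:metformin", "lab:hba1c_normal"],
    ["age:40-49", "dx:asthma", "rx:albuterol"],
]

rng = random.Random(0)  # fixed seed so repeated runs give the same records
model = fit_transitions(real_records)
synthetic_records = [sample_record(model, rng) for _ in range(5)]
for record in synthetic_records:
    print(record)
```

Every generated record is assembled from transitions seen in the toy data, so it can recombine events in ways no single input record did. A chain this simple can also reproduce an input record exactly and makes no privacy guarantee, which is one reason a toy like this should not be mistaken for a production approach.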

Synthetic data holds all of the value of real data without the limitations that stem from privacy concerns, enabling access to higher-fidelity, more granular, privacy-preserving healthcare data. Longstanding privacy and administrative barriers are effectively removed. Data sharing for cross-organization collaboration is far easier and faster. Data can be augmented to increase statistical power or address bias issues. Health tech builders can access the data they need to build better products and more accurate models at a much quicker pace.

Although still relatively new in healthcare, synthetic data’s use and potential are only growing, and it promises to reshape the way the entire industry approaches the use of data.