Syntegra: Democratizing Healthcare Data to Accelerate How Treatments Reach Patients

Michael Lesh, MD

How can we efficiently share important healthcare data to fuel innovation in medicine while still protecting patient privacy?

That’s the key challenge we’re solving at Syntegra. We do this by generating synthetic healthcare data that maintains the full statistical fidelity of the underlying data while completely preserving patient privacy. This “realistic but not real” data opens up new pathways for academic and commercial stakeholders across healthcare to advance innovation both internally and through external collaborations.

Accessing healthcare data is challenging for several reasons.

There are tremendous potential opportunities for scientists to understand diseases more rapidly and develop new treatments if clinical data, gathered during the normal course of care, could be used for statistical analyses. However, our patients must have complete trust that their personal health information will never be revealed beyond their caregivers. As a result, secondary use of patient data is currently slow, expensive and limited in scope because of the regulations and data governance meant to guarantee privacy.

Privacy concerns — Traditional methods of maintaining data security center on a process of de-identification — removing or obscuring fields that could be used to re-identify an individual. However, these methods are no longer sufficient, as an attacker can match leaked, de-identified data with publicly available information, such as social media posts or census data. Sensitive information on specific individuals can be disclosed by attackers using advanced techniques, such as probabilistic linkage. Even when fully compliant with applicable regulations, healthcare organizations still face public relations challenges regarding patient privacy, especially when collaborating with large technology companies, even if the putative goal is improved patient care. Last year, Ascension’s partnership with Google received scrutiny and questions around the type and amount of health data being shared with big technology companies.
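To illustrate why removing names alone may not be enough, the following is a minimal sketch of a linkage attack on entirely made-up records. It uses exact matching on three quasi-identifiers (ZIP code, birth year, sex), which is a simpler stand-in for the probabilistic linkage mentioned above; the field names, the `link` helper and all data are hypothetical.

```python
from collections import defaultdict

# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "02139", "birth_year": 1961, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1972, "sex": "F", "diagnosis": "hypertension"},
]

# Hypothetical public records (for example, a voter roll or social media profile).
public = [
    {"name": "Person A", "zip": "02139", "birth_year": 1961, "sex": "F"},
    {"name": "Person B", "zip": "94110", "birth_year": 1972, "sex": "F"},
    {"name": "Person C", "zip": "94110", "birth_year": 1972, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(deid, pub, keys=QUASI_IDENTIFIERS):
    """Return {index of de-identified record: name} for unambiguous matches."""
    index = defaultdict(list)
    for person in pub:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for i, record in enumerate(deid):
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Exactly one candidate means the record is re-identified.
        if len(candidates) == 1:
            matches[i] = candidates[0]
    return matches


reidentified = link(deidentified, public)
for i, name in reidentified.items():
    print(name, "->", deidentified[i]["diagnosis"])
```

In this toy data only the first record is re-identified: the second has no public match, and the third matches two people, so it stays ambiguous. Real attacks relax the exact-match requirement and score partial matches, which widens the risk.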

Low-burden access — As a result of these privacy concerns, patient data is often stored in disparate, difficult-to-access silos. This makes the process of gaining access to individual-level health data slow and expensive, and it often results in subpar, incomplete or low-fidelity data. Moreover, health systems are inundated with requests for data for a variety of purposes, such as internal research, partnerships, educational use and software testing, but processing these requests is administratively burdensome due to privacy safeguards and governance procedures such as institutional review board approvals and business associate agreements.

Data quality — The lack of comprehensive, secure data stifles innovation and the ability of researchers to leverage “big-data” analytics. This is particularly evident as we move towards precision medicine, resulting in the need for data on large populations so that treatment can be individualized. The small cohorts of interest in precision medicine are difficult to work with given their limited size and are often biased, severely limiting the use of these datasets. Furthermore, missing values and data fragmentation greatly inhibit the utility of secondary use, or real-world, data, resulting in complicated study designs or limited cohort sizes to compensate.

Syntegra’s approach

At Syntegra, we use a major advance in machine learning that has never before been used in the healthcare space — AI-based language models — to generate synthetic data. Language modeling involves a deceptively simple concept: given a large body of text, the model is trained to predict the next word in a sequence given the sequence of words already present. In essence, the model is learning all of the possible patterns that can appear in a body of text. Once trained, a language model can be used to generate new text that “looks like” the original text but is completely new. You have probably seen language models such as GPT-2 and GPT-3 demonstrate an amazing ability to produce human-like text, including New Yorker articles and novel endings to Game of Thrones.

At Syntegra, we made the key insight that medical record data is similar to natural language, consisting of a sequence of events (blood test, symptom, medication, diagnosis, surgery, etc.) rather than natural language words. We transform the specialized format of healthcare data into “patient sentences” so that these powerful algorithms can be utilized to create brand new patient data that “looks like” the original but contains no actual, individual patient.
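The idea of treating a medical record as a sentence can be sketched in a few lines. The example below is a toy illustration only, not Syntegra's engine: it swaps in a simple bigram (previous-event) count model for the large neural language models described above, and the event codes, record layout and function names are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"

# Hypothetical patient histories, each an ordered list of event tokens.
records = [
    ["dx:hypertension", "lab:creatinine", "rx:lisinopril"],
    ["dx:diabetes", "lab:hba1c", "rx:metformin"],
    ["dx:hypertension", "rx:lisinopril", "lab:creatinine"],
]


def to_sentence(events):
    """Wrap one patient's events in start/end markers to form a 'patient sentence'."""
    return [START, *events, END]


def train_bigram(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=20):
    """Sample a new event sequence one token at a time from the learned counts."""
    token, out = START, []
    while len(out) < max_len:
        options = counts[token]
        token = rng.choices(list(options), weights=list(options.values()))[0]
        if token == END:
            break
        out.append(token)
    return out


model = train_bigram([to_sentence(r) for r in records])
synthetic_patient = generate(model, random.Random(0))
print(synthetic_patient)
```

A model this small can easily reproduce a training record verbatim, which is one reason any real system needs separate, quantitative privacy checks on its output.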

Through this method, we are able to create comprehensive, patient-level synthetic datasets that reveal no individual patient data and can therefore be used freely across the healthcare ecosystem, from health systems and clinical research organizations to life science companies, payers and digital health entities training new care models. Based on all types of structured healthcare data, Syntegra’s synthetic data is uniquely able to capture entire datasets while still maintaining edge cases and rare cohorts. We can also impute missing values, normalize bias, and increase cohort sizes.

Our goal is to enable low-burden access to healthcare data by increasing its usability and value. To start, we’re focusing our efforts on health systems and their struggle to meet the immense need to leverage their patient data with external partners, in a way that protects patient privacy, to strengthen patient care and encourage innovation. With access to synthetic data generation, health systems can create a new layer of data access as well as customized datasets that are built to answer specific research questions.

We’ve devoted substantial resources to developing industry-leading metrics to validate the statistical accuracy and privacy preservation of synthetic data. Proving the statistical fidelity of synthetic data is essential due to the evidence-based nature of healthcare, and quantitatively demonstrating that no patient can be re-identified from synthetic data is essential to building trust in the method. We’re working closely with a third party, Mirador Analytics, to certify the privacy techniques we use to validate that Syntegra synthetic data fully protects patient privacy. More information about these metrics can be found in our white paper as well as a recent publication assessing the application of the Syntegra synthetic data engine on a dataset from the European Prevention of Alzheimer’s Dementia study.

How can synthetic data be used across healthcare?

Synthetic data holds enormous promise to transform the way those in healthcare think about accessing and sharing healthcare information.

For health systems — Synthetic data has the flexibility to meet the large number of requests and opportunities presented to health systems. This privacy-guaranteed data can be used more broadly within a system without facing the traditional, administrative burdens created to address privacy concerns that usually prevent uncomplicated data sharing. Synthetic data can be used by health systems to:

  • Quickly and easily share data with internal research teams without facing administrative roadblocks, such as for clinical outcomes or precision medicine research.
  • Improve benchmarking analysis efforts, both internally and with other health systems.
  • Accelerate opportunities with software and analytics partners by sharing comprehensive and diverse health data to improve software development, testing and implementation while addressing privacy concerns.
  • Generate new monetization opportunities with industry partners without sharing access to the data of actual patients, avoiding conflicts with company mission and values.

For life science companies — Synthetic data has the potential to radically change how real-world evidence is used by the life sciences industry, allowing companies to create, share and access the data they need to drive treatment development and commercialization. For example, pharmaceutical companies can:

  • Create synthetic “digital twins” matched to clinical trial populations for improved trial design and planning, external or supplemental control arms, population exploration and in silico trials.
  • Gain access to European data without facing GDPR’s strict privacy requirements.
  • Increase cohort sizes, impute missing values and normalize bias to improve research and analysis efforts, especially in areas that are typically under-researched, such as rare disease.

For cross-industry collaborations — Create new opportunities for collaboration between health systems, researchers, life sciences and tech companies by enabling easier, faster and secure data sharing. Organizations can share patient data with partners at a much faster pace without facing the administrative roadblocks that often cause delays in access, or even prevent data sharing altogether.

The promise of synthetic data in medicine is truly unprecedented, and I’m excited to see its potential grow as its effectiveness continues to be validated and adoption increases.

Interested in learning more about the many uses for generating synthetic healthcare data? Connect with us at