Protecting Privacy: A Primer on HIPAA, GDPR and the Consequences of Singing in the Car

Alexander Kerman

Last week I introduced some of the most important privacy concepts in healthcare — anonymity, dimensionality and triangulation. This week, I’m taking a deeper look into how these ideas shape regulation and data use, and why you should care.

First things first: why do we need privacy regulation in healthcare? Privacy is about freedom from external consequences, and the stakes rise as data becomes richer and more complex, with the consequences depending on what the data covers. It might be humiliating if someone overheard my rendition of "Girls Just Wanna Have Fun" in the car, or if a social media app tracked my shopping addiction, but the consequences get more serious if my credit card or banking data were exposed publicly. Healthcare falls into the "more serious" bucket, as the consequences of privacy breaches can be catastrophic.

So how do we ensure our healthcare data can't be used against us? We regulate it! The two best-known regulations are HIPAA in the U.S. and GDPR in the European Union (EU). Here's a quick primer on how they treat healthcare data.

Let's start with the General Data Protection Regulation of 2016 (GDPR), which essentially prohibits sharing or analyzing healthcare data without the patient's explicit consent. Even de-identification isn't permitted without consent, since it still counts as "data processing" of patients' real data. So on the classic privacy vs. utility tradeoff, GDPR sits decidedly on the privacy side, which is great… except that if we don't use healthcare data as fully as we can, we'll miss out on insights that could save lives.

The old paradigm of healthcare data privacy is that there’s a tradeoff between utility and privacy because the only way to increase privacy is by removing potentially useful information… at least until synthetic data came along! (More on this in the next post…)


The Health Insurance Portability and Accountability Act of 1996 (aka HIPAA, not HIPPA) covered a lot of ground, but I'll keep this focused on how it affects data privacy, namely the Privacy Rule. This governs the use of protected health information (PHI) by covered entities; in other words, it says what organizations working in healthcare can do with the information they collect. Under the Privacy Rule, your healthcare data cannot be used without your permission unless it's "de-identified" first. De-identification under HIPAA means either redacting 18 types of identifiers (many of which are important!), called Safe Harbor, or partially redacting the dataset and then having an expert certify that there's minimal risk of compromising privacy, known as Expert Determination. Since you usually want to remove as little of the richness in healthcare data as possible, Expert Determination is the more common path these days. However, it comes with three big problems:

1) You still lose information to redaction, especially with small datasets.
2) There are no standards for what qualifies someone as an expert or what level of risk is acceptable.
3) Even de-identified datasets can be re-identified using advanced AI/ML techniques.
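To make the Safe Harbor idea concrete, here's a minimal sketch of what that kind of redaction looks like in code. This is purely illustrative: real Safe Harbor covers all 18 identifier categories, and the record fields and function name here are invented for the example. It does apply three real Safe Harbor rules: names are removed, dates are reduced to the year (with ages over 89 aggregated), and ZIP codes are truncated to their first three digits.

```python
# Toy sketch of HIPAA Safe Harbor-style redaction (illustrative only;
# real Safe Harbor covers 18 identifier types -- this handles just a few).

def safe_harbor_redact(record):
    """Redact a patient record dict per a few Safe Harbor rules."""
    redacted = dict(record)
    # 1. Names must be removed entirely.
    redacted.pop("name", None)
    # 2. Dates: keep only the year, and aggregate ages over 89.
    if "birth_date" in redacted:
        redacted["birth_year"] = redacted.pop("birth_date")[:4]
    if redacted.get("age", 0) > 89:
        redacted["age"] = "90+"
    # 3. ZIP codes: keep only the first three digits.
    if "zip" in redacted:
        redacted["zip"] = redacted["zip"][:3] + "XX"
    return redacted

patient = {"name": "Jane Doe", "birth_date": "1931-05-02",
           "age": 93, "zip": "94107", "diagnosis": "hypertension"}
print(safe_harbor_redact(patient))
# {'age': '90+', 'zip': '941XX', 'diagnosis': 'hypertension', 'birth_year': '1931'}
```

Notice how much analytic signal disappears even in this tiny example: exact dates, full geography, and precise ages for the elderly are gone, which is exactly the information loss problem described above.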

So we need privacy regulations like HIPAA and GDPR to protect us from the negative consequences of privacy violations, but each misses the mark in its own way: GDPR goes too far, preventing data utilization that could benefit everyone, while HIPAA doesn't go far enough, failing to fully protect privacy.

This might all sound pretty bleak if you care about patient privacy and believe that healthcare innovation needs easy access to high-quality evidence, but there is some good news: synthetic data is not subject to either HIPAA or GDPR, protects privacy more powerfully than de-identification, and can be just as useful as real data (or even more useful!).

We’ll discuss this in more depth in the next post, but the key insight is that synthetic data points do not represent real people, so you cannot backtrack to a real person from a synthetic data point by triangulating with other information. When done right, synthetic data can be used just like real data because it contains all of the same relationships between all of the same dimensions as the real data (hint: at Syntegra, we do it right!).
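To see what triangulation looks like in practice, here's a toy sketch of the classic linkage attack: joining a "de-identified" medical record back to a named public record (say, a voter roll) via shared quasi-identifiers. All of the data, field names, and the `triangulate` function are invented for this illustration; the point is that a synthetic record, which matches no real individual's combination of attributes, gives this join nothing to latch onto.

```python
# Toy illustration of triangulation: linking a "de-identified" medical
# record to a named public record via quasi-identifiers (ZIP prefix,
# birth year, sex). All data and names here are invented for the sketch.

deidentified = [
    {"zip3": "941", "birth_year": "1958", "sex": "F", "diagnosis": "diabetes"},
]

voter_roll = [
    {"name": "Alice Smith", "zip3": "941", "birth_year": "1958", "sex": "F"},
    {"name": "Bob Jones",   "zip3": "100", "birth_year": "1970", "sex": "M"},
]

def triangulate(medical_rows, public_rows, keys=("zip3", "birth_year", "sex")):
    """Link medical rows to named public rows on shared quasi-identifiers."""
    matches = []
    for med in medical_rows:
        hits = [pub for pub in public_rows
                if all(pub[k] == med[k] for k in keys)]
        if len(hits) == 1:  # a unique hit re-identifies the patient
            matches.append((hits[0]["name"], med["diagnosis"]))
    return matches

print(triangulate(deidentified, voter_roll))
# [('Alice Smith', 'diabetes')]
```

Even with heavy redaction, a record whose quasi-identifier combination is unique in the population can be re-linked this way; a well-generated synthetic record preserves the statistical relationships without corresponding to any single real person.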


Stay tuned for the final installment of this mini-series on privacy, with a deeper dive into how synthetic data protects privacy. Until then, there's no need to wait to start exploring data that truly safeguards privacy while providing all of the insights of real data. Check out Syntegra's sample datasets here to see how the utility vs. privacy tradeoff can become a thing of the past!