Is your Synthetic Data actually private?

Author

Caroline Morton

Date

May 11, 2026

Healthcare organisations face mounting pressure to demonstrate privacy protection in their data applications. This has been partially driven by ambitious new regulatory frameworks that are reshaping the European health data landscape. The European Health Data Space (EHDS) Regulation, which entered into force in March 2025, establishes unprecedented standards for health data use across EU member states. Together with the UK’s Data (Use and Access) Act 2025, these frameworks require organisations to meet “the highest standards of privacy and cybersecurity” when working with health data.

Another growing factor is public perception of how data is, and should be, treated and kept safe. I would argue that there is appropriate concern about how health data is used, particularly for secondary use. We only have to look at the recent issues with the UK Biobank, where data from thousands of participants was up for sale on Alibaba, to see how quickly public trust can be eroded when data protection fails. My friend Dr Jess Morley has a great write-up about this in the BMJ if you are interested in that particular case. The point is that healthcare organisations need to be able to demonstrate that they are taking privacy seriously, and that they have the tools to do so.

Synthetic data is often touted as a solution to this problem, enabling data access without privacy risks. By generating artificial datasets that preserve statistical patterns while containing no actual patient records, synthetic data promises to solve healthcare’s persistent tension between research needs and privacy protection. However, this promise depends entirely on how the synthetic data was generated. If we are using real data to generate synthetic data, particularly through a black box of complex machine learning models, we need to be able to verify that the generated data genuinely protects privacy, and that no real data has slipped through into the synthetic dataset. In this blog post, I am going to explore the concept of privacy metrics, and how they can be used to provide the quantitative evidence needed to move beyond general assurances to verifiable protection standards, making synthetic data defensible in practice.

What are Privacy Metrics?

Simply put, privacy metrics are quantitative measures that assess how well synthetic data protects individual confidentiality. Unlike accuracy or performance, privacy is not directly observable, as it is not possible to check whether the synthetic dataset contains private information by inspection alone. Instead, privacy metrics estimate the risk that adversaries could extract sensitive information, and quantify the level of protection provided by the generated data. This measurement challenge is particularly problematic in healthcare, where synthetic data must preserve complex clinical relationships while ensuring patient privacy. Privacy metrics provide the framework for systematically navigating this balance.

The systematic aspect is important. It is not enough to check for privacy risks in an ad hoc way; we need a structured approach and something concrete to point to when we are asked about privacy protection. This matters more than it might sound. A recent scoping review by Kaabachi and colleagues looked at 73 studies using synthetic health data and found that only 46% of the studies claiming to use synthetic data for privacy preservation actually evaluated the residual privacy risks. The rest assumed inherent privacy benefits without empirical verification. Basically, they rely on the reasoning that “we made synthetic data, therefore it is private, because it is not actually real”. I hope that in this post I can show that this assumption isn’t necessarily correct, and that we should care about privacy even with fake data.

Core Privacy Risks in Synthetic Data

Let’s think briefly about the kinds of privacy risk that arise in synthetic data. I am going to focus on three of the main ones, but the list is not exhaustive and others could reasonably be added. The examples below are drawn from health data because that is the setting I tend to work in, but the risks are not unique to health data; they apply equally to any dataset derived from individuals, households, or organisations.

Firstly, there is data leakage, which occurs when the generative method inadvertently memorises and reproduces specific training examples rather than learning the general patterns of the data. This is the most direct failure mode of the three: a real record passes through the model and lands in the synthetic dataset more or less intact. A well-known illustration outside healthcare comes from large language models, where researchers have shown that verbatim training examples, including names, addresses, and phone numbers, can be extracted from the model by prompting it carefully. The analogous problem in tabular synthetic data is a generator that, for a sparsely populated region of the feature space, has effectively no choice but to emit the original row. Data leakage is particularly insidious because it is easy to miss in aggregate utility metrics: the synthetic dataset can look statistically faithful while still containing a small number of real records hiding in plain sight.

Secondly, there is membership disclosure. This occurs when an adversary can determine whether a specific individual contributed to the training dataset, which on its own undermines the privacy guarantee that synthetic data is supposed to provide. An instructive example is a synthetic dataset generated from a cohort of patients attending a sexual health clinic. The dataset contains no real records at all, but suppose an adversary already knows a few things about their target from some other source: name, date of birth, postcode, perhaps a few coarse clinical attributes from a previous data breach or simply from what the person has shared publicly. The adversary can then run a membership inference attack, which in practical terms means examining the synthetic data and asking whether its statistical patterns are more consistent with the target having been in the training set than not. If the generative model has memorised or partially memorised the target’s record, the synthetic dataset will carry a fingerprint of them even though no actual record of theirs appears in it. The adversary doesn’t need to find the target’s row; they just need to detect that the model “knows” about them. If the attack succeeds, the adversary has learned that this person attended the clinic, which is the sensitive disclosure, and it has happened at the level of cohort membership rather than at the level of any individual row.

This is why membership disclosure is taken seriously even when the synthetic records themselves look nothing like the originals. Simply being part of a sensitive cohort, whether that is a mental health register, a substance misuse service, a pregnancy loss audit, or a rare disease registry, can be the disclosure. I would also like this to be extended outwards to include not just sensitive health conditions. The reality is that we don’t know what is sensitive for any particular individual and it is not our place to decide that, so membership disclosure should be considered in the context of any dataset that contains personal information, even if it is not obviously sensitive.

Finally, there is re-identification risk, which arises when the unique combination of attributes in a synthetic record can be linked back to a real person, particularly for rare conditions or unusual demographic profiles. Suppose a synthetic dataset contains a record describing a 34-year-old woman in a small postcode area, of a specified ethnicity, with both cystic fibrosis and a recent diagnosis of early-onset breast cancer. None of those attributes is identifying in isolation, but their combination may be unique within the underlying population, and an adversary with access to auxiliary information (a hospital admissions list, a charity newsletter, even a social media post) may be able to match the synthetic record to a real individual with high confidence. The very fact that there is a synthetic patient with this combination of attributes makes it very likely that there is a real patient with the same combination in the training data, and if the synthetic record is close enough to the real one, then the adversary can infer sensitive information about that real patient. The danger is amplified by the long tail of rare diseases and unusual comorbidity combinations in health data, where quasi-identifiers that look innocuous on their own become uniquely identifying once joined. Crucially, the synthetic record does not need to be a copy of a real one; it is enough that the generative model has preserved enough of the joint distribution of rare attributes for a plausible match to be inferred. This is quite similar to membership disclosure, but it is about finding out more than just someone’s membership of a group: the risk is that their whole medical record is exposed.

One argument that occasionally surfaces in this space is worth pushing back on directly, mainly because I really don’t agree with it! Sometimes you see organisations say that if a participant in the real dataset has put something out online that would allow you to identify them, for example a date of birth on Facebook with details of a recent hospital admission and a broad location, then the data holder is absolved of the responsibility to protect against re-identification because that person has identified themselves. The argument goes that they have put themselves at risk. I think this is a terrible argument, for three reasons. Firstly, people do not have a full understanding of the risks of sharing personal information online, and they may not be aware that they are sharing information that could be used to identify them in a downstream dataset. You can imagine that someone might have excitedly put on Facebook 10 years ago, “wow, i’m off to have my blood taken for the biobank! doing my part!” and could not have comprehended that years later, the Biobank data would be available online. Secondly, even if they are aware of the risks, it is not ethically reasonable to say that because someone has shared information online, they have consented to being identified in a dataset they did not know would exist. Thirdly, and most importantly, it is not just about the individual who has shared information online. If someone wishes someone else a “get well soon” message on social media, and that message contains information that could be used to identify them in a dataset, then the subject of the message has not self-identified at all, but they have been identified by association. Re-identification risk has to be taken seriously regardless of what individuals have voluntarily put online.

Key Approaches for Quantifying Privacy

There is not one single metric that can capture all aspects of privacy risk in synthetic data, and the choice of which to use depends on the specific context and threat model. I find it useful to look at them in the same three categories as the risks above, although there is some overlap between them.

A quick note on taxonomy before we go further. The Kaabachi scoping review I mentioned earlier splits privacy evaluation into two categories (membership inference and attribute inference) and folds re-identification under those. That is completely reasonable. For this blog post, I prefer to keep re-identification as its own category because the threat model is meaningfully different: membership inference is about whether a specific person was in the training set, whereas re-identification is about linking a synthetic record to a real person using external information. Those are conceptually distinct enough that I think they deserve separate treatment, even if the techniques used to evaluate them sometimes overlap.

Similarity-Based Metrics

The most commonly used metrics by far are similarity-based metrics. These range in complexity but essentially measure how close any given synthetic record is to a real one. The simplest form is a distance-based metric, which calculates the distance between synthetic records and their nearest real neighbours in the feature space. This is basically a vector search problem (I have a separate blog on vector search in a different context if you want to read more). These sorts of metrics are aimed at finding synthetic records that are very close to real ones, which could indicate data leakage.

The intuition behind similarity-based metrics is pretty straightforward: if a synthetic record is very close to a real one, that closeness may indicate that the generative model has memorised and reproduced specific training examples, which is a direct privacy risk. There are issues with this approach, however. A raw distance only tells you something useful once you know what distances normally look like in the data. Imagine a synthetic patient whose nearest real neighbour is another patient of the same age band, sex, and primary diagnosis of type 2 diabetes. That sounds like a near-match, but in a large diabetes cohort there will be thousands of real patients who sit just as close to each other on those same attributes. The synthetic record is not unusually close to any particular real person; it is sitting in a dense, well-populated part of the feature space, and flagging it as suspicious would be a false alarm. The same apparent closeness in a rare-disease cohort, where real patients are typically far apart from each other because no two of them share quite the same combination of attributes, would be a genuine cause for concern. What we actually want to measure, then, is not how close a synthetic record is to a real one in absolute terms, but whether it is closer than real records in that region of the data are to each other.

I talked earlier about being systematic in our approach to privacy evaluation, and similarity metrics are a good example of how we can do that. Most papers just stop at nearest neighbour distance, but that is not enough. We need to have a reference point to compare against, and that is where the holdout set comes in. The idea is to split the real data into two parts before training: a training set that the generative model gets to see, and a holdout set that it does not. We then measure two things. First, how close synthetic records are to records in the training set. Second, how close synthetic records are to records in the holdout set. If the generator has genuinely learned the underlying distribution of the data rather than memorising specific patients, these two distances should look statistically indistinguishable. A synthetic record should be no closer to the patients the model trained on than it is to patients drawn from the same population that the model has never seen. If, on the other hand, synthetic records are systematically closer to the training set than to the holdout set, that gap is the fingerprint of memorisation, and the size of the gap is a direct measure of how much the generator has leaked. I like this approach because it gives you a null hypothesis to test against, rather than leaving you to make sense of some raw distance number and an arbitrary judgement about whether it is close enough or not. The null hypothesis is that the synthetic records are no closer to the training set than they are to the holdout set, and you can use a statistical test to see if you can reject that hypothesis.
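To make the holdout comparison concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, records are assumed to be numeric tuples already scaled to comparable ranges, Euclidean distance stands in for whatever distance suits your data, and the statistical test is a simple one-sided sign test (other tests would be equally reasonable). It also assumes the training and holdout sets are the same size, because nearest-neighbour distances shrink as the reference set grows.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dcr(record, reference):
    """Distance from one record to its closest record in `reference`."""
    return min(euclidean(record, r) for r in reference)

def holdout_baselined_dcr(synthetic, train, holdout):
    """Return (share of synthetic records closer to train, one-sided p-value).

    Null hypothesis: a synthetic record is as likely to be closer to the
    holdout set as to the training set. Ties are dropped from the test.
    """
    closer_train = closer_holdout = 0
    for s in synthetic:
        d_train, d_holdout = dcr(s, train), dcr(s, holdout)
        if d_train < d_holdout:
            closer_train += 1
        elif d_holdout < d_train:
            closer_holdout += 1
    n = closer_train + closer_holdout
    if n == 0:
        return 0.5, 1.0
    # Exact binomial tail: P(X >= closer_train) for X ~ Binomial(n, 0.5).
    p_value = sum(math.comb(n, k) for k in range(closer_train, n + 1)) / 2 ** n
    return closer_train / n, p_value

# Toy data (entirely made up): 50 training and 50 holdout records.
rng = random.Random(42)

def sample(n):
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]

train, holdout = sample(50), sample(50)
# A deliberately leaky "generator": training rows plus a tiny amount of noise.
leaky = [tuple(x + rng.gauss(0, 1e-6) for x in row) for row in train]
# A stand-in for a well-behaved generator: fresh draws from the same distribution.
fresh = sample(50)

print("leaky:", holdout_baselined_dcr(leaky, train, holdout))
print("fresh:", holdout_baselined_dcr(fresh, train, holdout))
```

In the leaky case every synthetic record sits closer to the training set, so the share is 1.0 and the null hypothesis is rejected decisively. For the fresh draws the share should hover around one half, which is what a generator that has not memorised its training data ought to look like.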

Three metrics show up most commonly in this category:

  1. Distance to closest record (DCR): the most direct measure. For each synthetic record, we compute the distance to the nearest real record in the training set and to the nearest real record in the holdout set, and compare those distributions.
  2. Nearest neighbour distance ratio (NNDR): this metric divides the distance to the nearest training record by the distance to the second closest. It catches cases where a synthetic record is suspiciously close to a single training record, even if it is not close in absolute terms. A ratio close to zero means that the nearest real neighbour is much closer than the second nearest, which is a red flag for memorisation.
  3. Identical match share: a blunt measure that simply counts the proportion of synthetic records that match exactly, or fall within a very small window of, a real training record.
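The second and third metrics in the list above can be sketched in a few lines of Python. This is a hedged illustration, not a standard API: the function names, the toy records, and the `tolerance` parameter are all assumptions made for the example, and Euclidean distance is used only because the toy records are numeric.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nndr(record, train):
    """Nearest neighbour distance ratio: d(nearest) / d(second nearest)."""
    d1, d2 = sorted(euclidean(record, t) for t in train)[:2]
    # If both distances are zero the training set contains duplicates of this
    # record, so no single neighbour stands out; report 1.0 in that case.
    return 1.0 if d2 == 0 else d1 / d2

def identical_match_share(synthetic, train, tolerance=0.0):
    """Share of synthetic records within `tolerance` of some training record."""
    hits = sum(
        1 for s in synthetic
        if min(euclidean(s, t) for t in train) <= tolerance
    )
    return hits / len(synthetic)

# Toy records (made up for illustration).
train = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
synthetic = [(0.1, 0.0), (5.0, 5.0), (10.0, 0.0)]

for s in synthetic:
    print(s, round(nndr(s, train), 3))
print(identical_match_share(synthetic, train))
print(identical_match_share(synthetic, train, tolerance=0.2))
```

The first synthetic record hugs a single training record and gets a ratio close to zero, while the second is equidistant from all three training records and gets a ratio of 1. Only the third is an exact copy, so the identical match share is one third with zero tolerance and two thirds once the window is widened to 0.2.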

There is a real debate in the literature about whether similarity-based metrics are actually fit for purpose. The Kaabachi review I mentioned earlier, drawing on work by Stadler and colleagues and Ganev and De Cristofaro, argues that similarity-based metrics are inadequate for two reasons. First, you can have successful inference attacks even when synthetic data is dissimilar to the original. Second, the act of publishing similarity-based metrics can itself enable reconstruction attacks. I think both of these critiques are correct as far as they go, but I would frame the conclusion slightly differently: similarity metrics on their own are insufficient, but a properly baselined similarity comparison (training set versus holdout set, as a hypothesis test) is still a useful component of a broader privacy assessment. The mistake is treating any single similarity number as the answer. I think we have to look at multiple levels of privacy here.

Membership Inference Metrics

Membership inference metrics take a more adversarial stance than similarity-based ones. Instead of asking whether any synthetic records look suspiciously close to real ones, they ask the question that actually matters from a privacy standpoint: given the synthetic dataset, or the generator that produced it, could an attacker work out whether a specific individual was in the training set?

The standard way to quantify this risk is to simulate the attack. We take a set of real records, some of which were in the training data and some of which were not, and we ask a classifier to predict, for each record, whether it was a member of the training set based only on what it can see in the synthetic data. The accuracy of that classifier, measured against the ground truth of who was actually in the training set, is the membership inference attack (MIA) score. A perfectly private generator produces synthetic data from which membership cannot be predicted any better than chance, so an MIA accuracy at the base rate is the ideal. Anything meaningfully above that is evidence that the generator has retained information about who was in the training set in a way that an adversary could exploit.

There are a few flavours of this attack but I will cover just two here:

  • Black-box attacks assume the adversary only has access to the synthetic data itself, which is the most realistic scenario for most healthcare releases.
  • White-box attacks assume the adversary also has access to the generative model, its weights, or its outputs on arbitrary inputs, which is the worst case and the one to test against if you want a defensible upper bound on risk.

One thing worth understanding is that not all synthetic data is equally vulnerable to membership inference. A particularly clear demonstration by Zhang, Yan and Malin compared fully synthetic data (where the generative model learns the data distribution and synthetic records are sampled from it, breaking any one-to-one mapping with real patients) against partially synthetic data (where each real record is transformed into a synthetic one, preserving an implicit one-to-one mapping). The results are striking: for partially synthetic electronic health record data, 82% of patients in their Vanderbilt cohort and 44% of patients in their All of Us cohort could be inferred as members of the training set with at least 0.9 precision. For fully synthetic data generated from the same source, the maximum precision an adversary could achieve on any meaningful subpopulation was 0.55 to 0.64. The practical takeaway is that the choice of synthesis method matters enormously for membership disclosure risk, and partial synthesis, despite its utility advantages, should be approached with a lot of caution. My blog on multiple imputation and synthetic data is a good place to start when thinking about why partially synthetic datasets have huge privacy issues. The short version is that it is not enough to say we can’t tell which records are real and which are not, because an algorithm can work this out fairly easily, especially when combined with other datasets.

For the actual mechanics of doing a membership inference evaluation, the partitioning method developed by El Emam and colleagues is the most practical approach I am aware of. The idea works like this. You start with your real dataset and split it in two. One half is the training set, which is what you use to generate the synthetic data. The other half is the holdout set, which the model never sees. You then build an attack dataset by mixing records from both halves together. This attack dataset is meant to stand in for what an adversary might have: some people they want to check on, some of whom were in the training data and some of whom were not.

Now you play the role of the adversary. For each record in the attack dataset, you compare it to the synthetic data and ask: does the synthetic data look like it was influenced by this person? If yes, you claim they were in the training set. If no, you claim they weren’t. You then check how often your claims were correct, by comparing them to the truth (which you know, because you did the splitting yourself).

The result is summarised as an F1 score, which is a single number between 0 and 1 that captures how good your guesses were overall. A high score means the adversary’s guesses were mostly right, which is bad news for privacy because it means the synthetic data is leaking information about who was in the training set. A low score means the adversary couldn’t tell the difference between people who were in the training set and people who weren’t, which is what you want. The F1 score for these claims is your membership inference risk metric.
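The steps above can be sketched as follows. This is a toy illustration only: the matching rule (claim membership when an attack record is within a Hamming distance threshold of some synthetic record) is an assumption made for the example and is not necessarily the rule used by El Emam and colleagues, and the function names and records are hypothetical.

```python
def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def membership_claims(attack_records, synthetic, threshold=0):
    """Claim membership when some synthetic record is within `threshold`."""
    return [
        min(hamming(a, s) for s in synthetic) <= threshold
        for a in attack_records
    ]

def f1_score(claims, truth):
    """F1 of the adversary's membership claims against the known truth."""
    tp = sum(1 for c, t in zip(claims, truth) if c and t)
    fp = sum(1 for c, t in zip(claims, truth) if c and not t)
    fn = sum(1 for c, t in zip(claims, truth) if not c and t)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up records: (sex, age band, region code).
train = [("F", "30-39", "N1"), ("M", "60-69", "S2")]
holdout = [("F", "40-49", "E3"), ("M", "20-29", "W4")]

# The attack dataset mixes both halves; we know the truth because we split.
attack = train + holdout
truth = [True, True, False, False]

leaky_synthetic = list(train)  # a generator that copied its inputs
careful_synthetic = [("F", "50-59", "N1"), ("M", "30-39", "W4")]

print(f1_score(membership_claims(attack, leaky_synthetic), truth))
print(f1_score(membership_claims(attack, careful_synthetic), truth))
```

With the copying generator the adversary's claims are all correct and the F1 score is 1.0. With the second synthetic dataset no attack record matches exactly, the adversary makes no correct claims, and the score is 0.0.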

One important methodological detail that El Emam et al. flag is that the proportion of training records in the attack dataset (often called t) has historically been set to 0.5 in the literature, but that turns out to be wrong in most cases. The value of t that gives a faithful estimate of what an adversary would actually achieve is t = n/N, where n is the size of the real dataset and N is the size of the underlying population. The intuition is that an adversary sampling from the same population as the real data will have, in expectation, n/N of their attack records also present in the real data, and setting t to anything else gives you a biased estimate. Using the default t = 0.5 can substantially over- or under-estimate the true risk depending on the sampling fraction.
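As a rough sketch of what this parameterisation means in practice, the snippet below builds an attack dataset in which the share of training records is t = n/N. The function names, the sizes, and the integer stand-ins for records are all hypothetical. It also computes the F1 score of a naive adversary who claims that everyone is a member: that adversary has precision t and recall 1, so their F1 is 2t / (1 + t), which is a useful floor to compare any measured score against.

```python
import random

def build_attack_set(train, holdout, n_real, population_size, attack_size, seed=0):
    """Mix training and holdout records so that a share t = n/N are members."""
    t = n_real / population_size
    n_members = round(t * attack_size)
    n_non_members = attack_size - n_members
    rng = random.Random(seed)
    members = rng.sample(train, n_members)
    non_members = rng.sample(holdout, n_non_members)
    records = members + non_members
    truth = [True] * n_members + [False] * n_non_members
    return records, truth, t

def naive_f1(t):
    """F1 of an adversary who claims every attack record is a member."""
    return 2 * t / (1 + t)

# Hypothetical sizes: a real dataset of 5,000 drawn from a population of 100,000.
train = list(range(0, 1000))        # stand-ins for training records
holdout = list(range(1000, 2000))   # stand-ins for holdout records

records, truth, t = build_attack_set(
    train, holdout, n_real=5_000, population_size=100_000, attack_size=200
)
print(t, sum(truth), len(records))
print(round(naive_f1(t), 4), round(naive_f1(0.5), 4))
```

With these made-up sizes t is 0.05, so only 10 of the 200 attack records are true members. The naive baseline is about 0.095 at t = 0.05 but about 0.667 at t = 0.5, which shows how much the choice of t moves the scale on which an F1 score has to be read.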

Re-identification Risk Metrics

Re-identification risk metrics focus on the likelihood that synthetic records can be linked back to real individuals. These are an interesting group conceptually, because it is usually synthetic data combined with some outside data source that creates the risk, rather than the synthetic data on its own. We don’t know what data sources might be available, and generally shouldn’t try to guess, so re-identification risk metrics really fall under what is known as “hiding in the crowd”: how many people share the exact combination of certain identifiers, and is your target patient hidden among enough of them to be safe?

The standard metric here is k-anonymity, which can also be applied to real data that has been anonymised. The approach is to identify a set of quasi-identifiers (age, sex, region or postcode, primary diagnosis, ethnicity, that sort of thing), and then for each synthetic record, count the size of the equivalence class of real records that share the same quasi-identifier values. In plain terms, how many real people share these characteristics. A k of 1 means only one person does, and that puts them at risk of re-identification. A k of 5 is the conventional minimum for many disclosure control frameworks, and higher is better.
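A minimal sketch of that count, assuming records are dictionaries and using made-up column names and values (the function name is hypothetical too):

```python
from collections import Counter

def equivalence_class_sizes(synthetic, real, quasi_identifiers):
    """For each synthetic record, count real records with the same quasi-identifiers."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in real)
    return [counts[key(s)] for s in synthetic]

quasi_identifiers = ["age_band", "sex", "region"]

# Made-up real data: five people share one profile, one person is unique.
real = [{"age_band": "30-39", "sex": "F", "region": "North"} for _ in range(5)]
real.append({"age_band": "60-69", "sex": "M", "region": "South"})

synthetic = [
    {"age_band": "30-39", "sex": "F", "region": "North"},
    {"age_band": "60-69", "sex": "M", "region": "South"},
    {"age_band": "20-29", "sex": "F", "region": "East"},
]

sizes = equivalence_class_sizes(synthetic, real, quasi_identifiers)
print(sizes)

# Flag synthetic records that match at least one real person but fewer than k.
k = 5
at_risk = [i for i, size in enumerate(sizes) if 0 < size < k]
print(at_risk)
```

The first synthetic record matches five real people, the second matches exactly one, and the third matches nobody in the real data. Only the second is flagged at k = 5. How to treat a size of zero is a judgement call: here it is read as "no real person shares this profile", but that depends on the real data being a complete picture of the population, which it usually is not.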

There are two refinements of k-anonymity that I will just mention because I think they demonstrate some of the complexity:

l-diversity points out that hiding in a crowd only helps if the crowd is actually mixed. Suppose your synthetic record matches five real people, which sounds safe. But if all five of those people are HIV positive, then the adversary doesn’t need to know which one of the five is the target; they already know the target is HIV positive, because everyone in that group is. l-diversity asks for the sensitive attributes within the group to be varied enough that learning the group doesn’t give away the sensitive value.

t-closeness goes one step further. It points out that even a mixed group can leak information if the mix looks nothing like the wider population. Imagine the general population is 1% HIV positive, but in your group of five, three out of five are HIV positive. That group is diverse in the l-diversity sense (it contains both statuses), but it is still telling the adversary something useful, because being in this group has bumped the probability of being HIV positive from 1% to 60%. t-closeness asks that the mix inside the group looks roughly like the mix in the population at large, so that learning the group doesn’t shift the odds.
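Both refinements can be checked on the HIV examples above with a short sketch. Two assumptions are worth flagging: l-diversity is taken in its simplest "distinct values" form, and the distance used for t-closeness is the variational distance (half the sum of absolute differences between the two distributions). As far as I understand, t-closeness is usually defined with the Earth Mover's Distance, which reduces to the variational distance for a categorical attribute whose values are all treated as equally far apart. The function names are hypothetical.

```python
from collections import Counter

def distinct_l(sensitive_values):
    """Distinct l-diversity: number of different sensitive values in a group."""
    return len(set(sensitive_values))

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def variational_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Population-level distribution from the example: 1% HIV positive.
population = {"positive": 0.01, "negative": 0.99}

# A group of five that is k-anonymous but not diverse at all.
uniform_group = ["positive"] * 5
# A group of five that is diverse but skewed: three out of five positive.
skewed_group = ["positive"] * 3 + ["negative"] * 2

print(distinct_l(uniform_group), distinct_l(skewed_group))
print(round(variational_distance(distribution(skewed_group), population), 2))
```

The first group has a single sensitive value, so it fails even the weakest diversity requirement. The second passes with two distinct values, but its distance from the population distribution is 0.59, so it would fail t-closeness for any threshold below that.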

With re-identification, the key (and somewhat unknowable) risk is fundamentally about what the adversary knows, and we can never fully enumerate that. These metrics give you a defensible lower bound on risk under specified assumptions about quasi-identifiers, but a sufficiently motivated adversary with auxiliary information you didn’t think to consider may have a higher chance of success than your numbers suggest. This is why re-identification metrics, like all the others I have discussed, are best used as one input into a broader disclosure risk assessment rather than as a single number to point at.

Pulling it together

The fundamental tension in synthetic data lies in balancing privacy protection with research utility. Stronger privacy mechanisms often reduce the statistical fidelity necessary for meaningful research, creating trade-offs that vary significantly across applications. As the synthetic data becomes less like the real data, the associations between variables break down, and that reduces its utility.

If I could get you to take one thing away from this blog it would be that privacy in synthetic data is not a single number. There is no single metric that is the gold standard. The three categories of risk I have walked through (data leakage, membership disclosure, re-identification) are not interchangeable, and the metrics that quantify them sit at different points in the threat model. A synthetic dataset can score well on similarity-based metrics, look fine on a naive membership inference check, and still be vulnerable to re-identification if an adversary has the right auxiliary information available; or it can be safe from re-identification but leak training set membership through the generative model itself.

The systematic approach I have been pushing throughout this post is this: pick metrics from each of the three categories, evaluate them against an explicit threat model with explicit assumptions, and report all of them together. Use the holdout-baselined similarity comparison rather than raw distances. Use the partitioning method for membership inference with the correct t = n/N parameterisation rather than the default t = 0.5. Use k-anonymity with l-diversity or t-closeness rather than k alone. And report the numbers honestly, including the cases where they are uncomfortable, because the alternative (the 54% of “privacy-preserving” synthetic data studies in the Kaabachi review that didn’t evaluate privacy at all) is the thing that erodes public trust in synthetic data as an idea. If we want healthcare organisations to be able to use synthetic data for the genuinely valuable applications it enables, we have to be willing to do the work to demonstrate that the privacy claims are real.
