Multiple Imputation and Perturbation: Why They're Not Built for Synthetic Data

Author

Caroline Morton

Date

December 18, 2025

Synthetic data is often described as a solution to the limited data problem, and as we have discussed in previous blogs, it can be a powerful tool for creating larger datasets that model real data while protecting privacy. When it comes to generating synthetic data, there are many methods available, each with its own strengths and weaknesses. Some of these methods are designed specifically for generating synthetic data, while others are not.


Multiple imputation (MI) and perturbation are two statistical techniques that are often associated with synthetic data generation. However, they were not designed for this purpose and have limitations when applied in this context. It is something that I have seen come up in many papers and it is worth clarifying exactly what these methods do, how they differ from synthetic data generation, and why they are not appropriate for this purpose.

First of all, let’s dive into what multiple imputation and perturbation are, and what they are designed to do. If you’re new to synthetic data more broadly, I’d recommend starting with my introduction to synthetic data first, as this article assumes some baseline knowledge.

What is Multiple Imputation?

Multiple imputation is a statistical technique used to handle missing data in datasets. It creates multiple plausible versions of the missing values based on the observed data, allowing for more accurate statistical analysis. The key idea is to generate several complete datasets, each with different imputed values, and then combine the results to account for uncertainty. Let’s work through a simple example to illustrate this:

Imagine you have a dataset of patients with heart failure, but some patients are missing their blood pressure (BP) readings. We are going to use a dataset of only 5 patients to illustrate the concept, but in practice, you would have many more patients.

| Patient | Age Group | BMI  | Systolic BP  | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 27.2 | 120          | 185         |
| 2       | Young     | 26.8 |              | 178         |
| 3       | Middle    | 31.5 | 140          | 220         |
| 4       | Young     | 22.1 | 110          | 160         |
| 5       | Middle    | 32.0 |              | 235         |

As you can see, patients 2 and 5 are missing their systolic BP readings. Multiple imputation would create several versions of the dataset, each with different plausible BP values for those patients based on their other characteristics and patterns observed in patients with complete data. For example, one version might fill in patient 2’s BP with a mean value based on the other young patients (115), while another version might use a regression model to predict BP based on BMI and cholesterol levels. The results from these datasets are then pooled to provide a more accurate estimate of the overall effect of BP on heart failure outcomes.
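As a rough illustration of the mechanics (not a full MI implementation such as R’s `mice` or scikit-learn’s `IterativeImputer`), here is a minimal Python sketch that draws a plausible BP value for each missing entry from its age group’s observed distribution, repeats this to build several completed datasets, and pools the results. The number of imputations and the 5-unit spread are arbitrary choices for this toy example:

```python
import numpy as np
import pandas as pd

# Toy heart failure dataset from the table above; NaN marks missing BP.
df = pd.DataFrame({
    "patient": [1, 2, 3, 4, 5],
    "age_group": ["Young", "Young", "Middle", "Young", "Middle"],
    "bmi": [27.2, 26.8, 31.5, 22.1, 32.0],
    "systolic_bp": [120.0, np.nan, 140.0, 110.0, np.nan],
    "cholesterol": [185, 178, 220, 160, 235],
})

rng = np.random.default_rng(0)
m = 3  # number of imputed datasets to generate

completed = []
for _ in range(m):
    imputed = df.copy()
    for _, block in df.groupby("age_group"):
        missing = block.index[block["systolic_bp"].isna()]
        observed = block["systolic_bp"].dropna()
        # Draw plausible values around the group mean; each of the m
        # datasets gets a different random draw.
        imputed.loc[missing, "systolic_bp"] = rng.normal(
            observed.mean(), 5.0, size=len(missing)
        )
    completed.append(imputed)

# Analyse each completed dataset separately, then pool the estimates.
pooled_mean_bp = np.mean([d["systolic_bp"].mean() for d in completed])
print(f"pooled mean systolic BP: {pooled_mean_bp:.1f}")
```

The key point is that every imputed value is anchored to the observed records, which is exactly what makes MI good at its intended job and problematic for privacy.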

What is Perturbation?

Perturbation is a technique used to protect the privacy of individuals in a dataset by adding random noise to the data or swapping values between records. This method is often used in statistical disclosure control, where the goal is to obscure individual identities while preserving overall statistical patterns. For example, a statistical agency might release a dataset with income values that have been perturbed to protect privacy. The perturbation process deliberately introduces small errors to the data, making it harder to identify individuals while still allowing for meaningful analysis. The basic idea is that the consumer of the data can still draw valid conclusions about the overall population, but they have no way of knowing what data has been altered and how.

Let’s consider a small example to illustrate perturbation:

| Patient | Age Group | Income  |
|---------|-----------|---------|
| 1       | Young     | 45000   |
| 2       | Young     | 52000   |
| 3       | Middle    | 75000   |
| 4       | Young     | 48000   |
| 5       | Middle    | 80000   |

In this dataset, we have income values for 5 individuals. To protect their privacy, we might apply perturbation by adding random noise to each income value. For example, we could add or subtract a random amount within a certain range (e.g., ±5000). After perturbation, the dataset might look like this:

| Patient | Age Group | Income  |
|---------|-----------|---------|
| 1       | Young     | 46000   |
| 2       | Young     | 50000   |
| 3       | Middle    | 70000   |
| 4       | Young     | 52000   |
| 5       | Middle    | 81000   |

As you can see, the income values have been altered slightly, making it more difficult to identify individuals while still preserving the overall distribution of income in the dataset. Statistical agencies like the UK Office for National Statistics often use perturbation when releasing Census data to protect individual privacy.
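The ±5000 perturbation step above can be sketched in a few lines of Python with pandas and numpy; the seed and the uniform noise model are arbitrary illustrative choices:

```python
import numpy as np
import pandas as pd

# Income table from the example above.
df = pd.DataFrame({
    "patient": [1, 2, 3, 4, 5],
    "age_group": ["Young", "Young", "Middle", "Young", "Middle"],
    "income": [45000, 52000, 75000, 48000, 80000],
})

rng = np.random.default_rng(42)
# Add uniform integer noise in the +/-5000 range to each income value.
noise = rng.integers(-5000, 5001, size=len(df))
perturbed = df.assign(income=df["income"] + noise)

# Individual values change, but the overall level is roughly preserved.
print(df["income"].mean(), perturbed["income"].mean())
```

Each record changes by at most 5000, so aggregate statistics like the mean stay close to the original while individual values become unreliable.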

Why MI and Perturbation fail for Synthetic Data Generation

MI and perturbation are often grouped under the synthetic data umbrella, but they were built for entirely different problems. Multiple imputation was developed to handle missing data in existing records, not to generate entirely new records. Perturbation adds random noise to real data to protect privacy, but it is not designed for fully synthetic data generation.

MI’s Problem: Too Similar, Not Private

When you apply MI for synthetic data generation, it produces records that are too similar to the original data, because MI draws its values from real observed records. This similarity increases the risk of identity disclosure, where an attacker uses combinations of variables like age, location, and diagnosis to reveal information about the underlying record. So while MI is good at handling missing data, it is unsuitable for generating privacy-protective synthetic datasets, which is a core use case for synthetic data. We can use an illustrative example to demonstrate this point. Consider again our heart failure dataset with missing blood pressure values:

| Patient | Age Group | BMI  | Systolic BP  | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 27.2 | 120          | 185         |
| 2       | Young     | 26.8 |              | 178         |
| 3       | Middle    | 31.5 | 140          | 220         |
| 4       | Young     | 22.1 | 110          | 160         |
| 5       | Middle    | 32.0 |              | 235         |

If we use multiple imputation to fill in the missing values, we could, for example, impute patient 2’s BP as the mean of the other young patients (patients 1 and 4), giving 115, and use a model-based draw of 130 for patient 5:

| Patient | Age Group | BMI  | Systolic BP  | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 27.2 | 120          | 185         |
| 2       | Young     | 26.8 | 115          | 178         |
| 3       | Middle    | 31.5 | 140          | 220         |
| 4       | Young     | 22.1 | 110          | 160         |
| 5       | Middle    | 32.0 | 130          | 235         |

At first glance, this seems reasonable. However, the imputed value of 115 is directly derived from real patient data (patients 1 and 4), and if we had a much larger dataset, this could lead to privacy concerns. A bad actor who knows that multiple imputation was used might notice that many young patients have a BP of 115, and hypothesise that these are the values being imputed by MI. They could then extrapolate that any young patient with a blood pressure that is not 115 is likely to be a real patient, leading to potential re-identification. This is a simple example, and in practice, the risk would depend on the complexity of the data and the imputation model used. However, I hope this illustrates the core issue: MI generates values that are too closely tied to real data, undermining privacy.

In addition to this, the imputed records remain structurally identical to the original patients - same Age Group, same BMI, same Cholesterol - with only the missing BP value filled in. This makes them vulnerable to linkage attacks: if an attacker has access to an external dataset containing Age Group, BMI, and Cholesterol, they could match these combinations back to the “synthetic” records and potentially identify real individuals.
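A linkage attack of this kind is trivially easy to express in code. In this hypothetical sketch (the names Alice and Bob and the external dataset are invented for illustration), an attacker joins their external data to the “synthetic” records on the unchanged quasi-identifiers:

```python
import pandas as pd

# "Synthetic" records produced by imputation keep real quasi-identifiers.
synthetic = pd.DataFrame({
    "age_group": ["Young", "Young", "Middle", "Young", "Middle"],
    "bmi": [27.2, 26.8, 31.5, 22.1, 32.0],
    "cholesterol": [185, 178, 220, 160, 235],
})

# Hypothetical external dataset an attacker might hold, with names attached.
external = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age_group": ["Young", "Middle"],
    "bmi": [26.8, 32.0],
    "cholesterol": [178, 235],
})

# A simple inner join on the quasi-identifiers re-identifies both people.
linked = external.merge(synthetic, on=["age_group", "bmi", "cholesterol"])
print(linked)
```

Because the combination of age group, BMI, and cholesterol is unchanged from the real records, a plain `merge` is all it takes to link them back to named individuals.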

Perturbation’s Problem: Broken Correlations, Lost Utility

Perturbation has a different set of issues when applied to synthetic data generation. Perturbation adds independent random noise to each variable, and if too much noise is added, the correlations that make healthcare data meaningful are lost.

The CMS DE-SynPUF study is a real world example where a synthetic dataset was created using perturbation methods and released for public use. CMS’s own documentation highlights the issues described, reporting that the synthesising process resulted in a significant reduction in the amount of interdependence and co-variation among the variables, making it less useful for analytics. They further note it has limited inferential research value for drawing conclusions about Medicare beneficiaries.

We can create our own small example to show what actually happens when you try to use perturbation for synthetic data generation. Consider a small healthcare dataset studying cardiovascular risk factors:

| Patient | Age Group | BMI  | Hypertension | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 27.2 | Yes          | 185         |
| 2       | Young     | 26.8 | Yes          | 178         |
| 3       | Middle    | 31.5 | Yes          | 220         |
| 4       | Young     | 22.1 | No           | 160         |
| 5       | Middle    | 32.0 | Yes          | 235         |

Can you see the patterns in people with hypertension? Young patients with hypertension tend to have BMI values clustering around 27, while middle-aged patients with hypertension have higher BMI values around 31-32. There’s also a clear positive relationship between BMI and cholesterol levels.

When you apply perturbation, it adds independent noise to each variable, breaking these correlations. We could add or subtract a random amount within a certain range (e.g., ±4.0 for BMI, ±30 for cholesterol).

After perturbation, the dataset might look like this:

| Patient | Age Group | BMI  | Hypertension | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 30.1 | Yes          | 200         |
| 2       | Young     | 22.5 | Yes          | 150         |
| 3       | Middle    | 28.0 | Yes          | 250         |
| 4       | Young     | 25.0 | No           | 180         |
| 5       | Middle    | 35.2 | Yes          | 210         |

As you can see, the correlations that made the original data clinically meaningful have dissolved. Patient 1 is young with hypertension but has an unusually high BMI that doesn’t match the pattern. Patient 3 is middle-aged with hypertension but has a lower than expected BMI, and a very high cholesterol. The relationships that were present in the original data have been lost, reducing the utility of the dataset for analysis.
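We can make the broken-correlations point concrete with a small pandas sketch: compute the BMI-cholesterol correlation before and after adding independent noise in the ranges used above. The seed is arbitrary, and with only five rows the exact post-noise value will vary from run to run:

```python
import numpy as np
import pandas as pd

# Cardiovascular risk dataset from the example above.
df = pd.DataFrame({
    "bmi": [27.2, 26.8, 31.5, 22.1, 32.0],
    "cholesterol": [185, 178, 220, 160, 235],
})

corr_before = df["bmi"].corr(df["cholesterol"])

rng = np.random.default_rng(7)
perturbed = pd.DataFrame({
    # Independent noise per variable, matching the ranges in the text.
    "bmi": df["bmi"] + rng.uniform(-4.0, 4.0, len(df)),
    "cholesterol": df["cholesterol"] + rng.uniform(-30, 30, len(df)),
})
corr_after = perturbed["bmi"].corr(perturbed["cholesterol"])

# The noise is drawn independently for each column, so it carries no
# information about the BMI-cholesterol relationship and tends to wash
# the correlation out.
print(round(corr_before, 3), round(corr_after, 3))
```

In the original table the correlation is very strong (above 0.9); after independent perturbation it is typically much weaker, which is exactly the loss of interdependence the CMS DE-SynPUF documentation describes.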

Putting it together

Now that we have seen what multiple imputation and perturbation are designed to do, and why neither is well-suited to synthetic data generation on its own, we can discuss why this matters for synthetic data generation as a whole. Some papers combine elements of multiple imputation and perturbation to create synthetic datasets, so it is worth clarifying why this still does not solve the problems we have discussed. Let’s consider an approach that combines both methods:

  1. Start with a real dataset
  2. Introduce missingness to some variables (i.e. a form of perturbation)
  3. Use multiple imputation to fill in those missing values

If we take our earlier heart failure dataset now complete with blood pressure values:

| Patient | Age Group | BMI  | Systolic BP  | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 27.2 | 120          | 185         |
| 2       | Young     | 26.8 | 112          | 178         |
| 3       | Middle    | 31.5 | 140          | 220         |
| 4       | Young     | 22.1 | 110          | 160         |
| 5       | Middle    | 32.0 | 130          | 235         |

If we perturb this dataset by introducing missingness for 40% of the blood pressure values and 20% of the cholesterol values, using a random number generator to select which values to remove, the resulting pre-processed dataset might look like this:

| Patient | Age Group | BMI  | Systolic BP  | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 27.2 |              | 185         |
| 2       | Young     | 26.8 | 112          |             |
| 3       | Middle    | 31.5 | 140          | 220         |
| 4       | Young     | 22.1 | 110          | 160         |
| 5       | Middle    | 32.0 |              | 235         |

We can then fill in the missing values with multiple imputation, here using the mean of the observed values for each patient’s age group:

| Patient | Age Group | BMI  | Systolic BP  | Cholesterol |
|---------|-----------|------|--------------|-------------|
| 1       | Young     | 27.2 | 111          | 185         |
| 2       | Young     | 26.8 | 112          | 172         |
| 3       | Middle    | 31.5 | 140          | 220         |
| 4       | Young     | 22.1 | 110          | 160         |
| 5       | Middle    | 32.0 | 140          | 235         |

As you can see, while some values have been changed, a large portion of the original data remains intact. Crucially, patient 3 and patient 4’s records are completely unchanged. Even the changed records are only partially modified, so an attacker still has substantial real data to work with.
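The whole perturb-then-impute pipeline, together with a count of how many original cell values survive untouched, can be sketched as follows. The masking fractions match the example above; the group-mean imputation, the fallback to the overall mean, and the seed are illustrative choices:

```python
import numpy as np
import pandas as pd

# Complete heart failure dataset from the example above.
df = pd.DataFrame({
    "age_group": ["Young", "Young", "Middle", "Young", "Middle"],
    "bmi": [27.2, 26.8, 31.5, 22.1, 32.0],
    "systolic_bp": [120.0, 112.0, 140.0, 110.0, 130.0],
    "cholesterol": [185.0, 178.0, 220.0, 160.0, 235.0],
})

rng = np.random.default_rng(1)

def mask_then_impute(data, column, fraction, rng):
    """Blank out a random fraction of `column`, then refill each gap with
    the mean of the surviving values in the same age group (falling back
    to the overall mean if a whole group was blanked)."""
    out = data.copy()
    n_mask = int(round(fraction * len(out)))
    masked_idx = rng.choice(out.index, size=n_mask, replace=False)
    out.loc[masked_idx, column] = np.nan
    group_means = out.groupby("age_group")[column].transform("mean")
    out[column] = out[column].fillna(group_means).fillna(out[column].mean())
    return out, set(masked_idx)

# Step 2: introduce missingness; step 3: impute it back.
synthetic, changed_bp = mask_then_impute(df, "systolic_bp", 0.4, rng)
synthetic, changed_chol = mask_then_impute(synthetic, "cholesterol", 0.2, rng)

# Count how many original cell values survive completely untouched.
unchanged = (synthetic == df).to_numpy().sum()
print(f"{unchanged}/{df.size} cell values identical to the real data")
```

Only three of the twenty cell values are ever touched (two blood pressures and one cholesterol), so 85% of the “synthetic” dataset is the real data verbatim, whichever rows the random masking happens to pick.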

All in all this means that the problems we discussed earlier still apply: the synthetic records are still too similar to the original data, risking identification disclosure, and the structural integrity of the data remains unchanged, making it vulnerable to linkage attacks. Even when combined, multiple imputation and perturbation do not adequately address the challenges of synthetic data generation.

When to use Multiple Imputation and Perturbation

It is important to note that multiple imputation and perturbation do have a lot of value - just not for synthetic data generation. Multiple imputation is an excellent tool for handling missing data in existing records. If you have a dataset with missing values, MI can help you fill in those gaps and allow for more accurate analysis.

Perturbation has a legitimate but narrow role in disclosure control for partially synthetic data, where you’re replacing only the most sensitive variables while keeping the rest of the record real.

What You Should Use Instead and Why It Matters

Modern generative methods like GANs and diffusion models are specifically designed to learn and preserve complex relationships. Recent comparative studies consistently show that diffusion models outperform both GANs and traditional imputation-based approaches, although CTGAN, a GAN-based model, remains a strong performer for mixed categorical and continuous healthcare data. While the quality gap between synthetic and real data continues to narrow, it is highly dependent on the generation approach.

The shift towards more sophisticated models is happening across the field. The MHRA’s synthetic COVID-19 and cardiovascular datasets use advanced generation techniques based on probabilistic graphical models that preserve complex biological relationships and patient privacy. Similarly, NHS England’s Simulacrum cancer dataset uses a Bayesian network-based probabilistic sampling approach to generate data. Even traditional statistical agencies are adopting more advanced methods. Notably, the US Census Bureau transitioned from perturbation to using differential privacy in 2020.

Comparative evaluations across healthcare domains consistently show that modern generative methods maintain statistical fidelity and preserve complex relationships for fully synthetic data generation, something MI and perturbation cannot achieve. Choosing the right method at the outset has long-term implications for the utility, privacy, and trustworthiness of your synthetic data.

Broader Applications

In my examples, I have talked a lot about how this applies to healthcare data but in fact the concept of synthetic data can be applied to any field where data privacy is a concern, such as finance, education, and social sciences. The goal is to generate datasets that maintain the statistical properties of the original data without exposing sensitive information about individuals, companies, transactions, or other entities. Healthcare data is a particularly useful domain to discuss these concepts because anyone who has ever had a medical record can immediately understand why privacy is so important. In reality the same principles apply across many fields - I don’t think any of us would want our bank transaction history, social media history or educational records to be easily identifiable in a dataset!


If you are interested in synthetic data and want to discuss it further, please feel free to reach out to me on LinkedIn or send me a message via the Contact Page. I would love to hear from you!
