Representativeness in Synthetic Data: What It Means and How to Measure It
Caroline Morton
April 7, 2026
In my introduction to synthetic data article, I described how GAN-generated records have been used to predict patient length of stay in German hospitals. It’s a compelling example of how synthetic data can serve as a credible proxy for real patient records. But how do we know that the synthetic datasets we generate actually represent the patient population they are modelling?
Consider the German study above. If the data subtly underrepresented elderly patients with multiple comorbidities (a group who typically account for the longest and most resource-intensive stays), the model could look quite accurate on paper, while producing systematically skewed predictions in practice. The difference between ‘it looks like real data to some degree’ and ‘this actually represents the population’ is called representativeness, and it is what this article is all about.
One simple way to think about representativeness is to recognise that overall statistical similarity and representativeness are not the same thing. An example of this was shown by a Spanish research team who generated 16,560 synthetic lung cancer records from 886 real patients. They ran 58 statistical tests to check how closely the synthetic data matched the real dataset. In 48 of those tests, there was no significant difference (great!), and only 10 showed discrepancies. That is an 83% pass rate, which looks impressive on paper.

However, these results raise several important questions. Which were the 10 tests that failed, what did they represent clinically, and do those gaps affect research conclusions? This is the representativeness problem we face in generating synthetic healthcare data.
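To make the idea of a per-variable pass rate concrete, here is a minimal sketch of that kind of check using two-sample Kolmogorov-Smirnov tests. The variable names, sample sizes, and 0.05 threshold are illustrative assumptions, not taken from the Spanish study; the point is that reporting *which* variables failed matters as much as the headline pass rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Toy stand-ins for a real cohort (n=886) and a synthetic one (n=16,560).
real = {
    "age": rng.normal(65, 10, 886),
    "tumour_size_mm": rng.gamma(4.0, 8.0, 886),
}
synthetic = {
    "age": rng.normal(65, 10, 16560),              # well matched
    "tumour_size_mm": rng.gamma(2.0, 8.0, 16560),  # deliberately skewed
}

def fidelity_report(real, synthetic, alpha=0.05):
    """Two-sample KS test per continuous variable; return pass count and failures."""
    failures = []
    for var in real:
        stat, p_value = stats.ks_2samp(real[var], synthetic[var])
        if p_value < alpha:
            failures.append((var, round(stat, 3)))
    return len(real) - len(failures), failures

passed, failures = fidelity_report(real, synthetic)
print(f"{passed}/{len(real)} variables passed")
for var, stat in failures:
    print(f"FAILED: {var} (KS statistic {stat})")
```

A headline like “1/2 variables passed” hides nothing here, but at 58 variables it easily could, which is exactly the problem raised above.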
So what does “representativeness” actually mean?
Representativeness is not a single property. It is defined by four dimensions that need to work together to produce data that represent the population: statistical fidelity, population coverage, temporal validity, and clinical coherence. I’ve summarised each of these terms below for clarity.
Statistical fidelity describes the preservation of data distributions as well as the relationships between variables. For example, it might define the way age interacts with comorbidity burden, or how treatment response varies across demographics.
Population coverage means that the data reflects the full range of the target population, including minority groups, rare conditions, and underserved patients who aren’t often represented in smaller standard datasets.
Temporal validity is what enables synthetic data to capture how patients’ health changes over time, including disease progression, treatment evolution, and complication development, rather than a single snapshot in time.
Clinical coherence means that the relationships captured by the synthetic data are ones that a clinician would actually recognise as real and plausible, not just statistically valid but also meaningful in clinical practice.
A good illustration of how these dimensions relate to each other comes from Chen and colleagues’ 2019 validation of Synthea, which generated over 1.2 million synthetic Massachusetts residents. The study tested statistical fidelity alongside clinical quality measures. The researchers found that while demographics and procedure-based measures (such as colorectal cancer screening) aligned well with national averages, significant discrepancies against real-world benchmarks emerged when they tested outcome-based clinical quality measures (such as complication rates, disease-specific outcomes, and mortality). What the study shows is that strong demographic alignment tells you very little about whether the data holds up clinically, which is why representativeness cannot be confirmed by statistical tests alone.
How do you actually measure it?
Given that representativeness covers the four distinct dimensions described above, no single test can measure all of them. Instead, measuring representativeness requires layering several methods, each targeting a different aspect of how well the synthetic data reflects reality. Here are some of the commonly applied approaches:
| Approach | What it checks | Real-world example |
| --- | --- | --- |
| Distribution-based metrics | How closely synthetic distributions match real data | Gonzalez-Abril et al. used Kolmogorov-Smirnov (KS) and Chi-square tests across 58 variables in synthetic lung cancer records |
| Correlation preservation | Whether variable relationships survive the generation process | Foraker et al. compared Spearman correlations between real and synthetic data across three clinical conditions, finding no significant differences in variable relationships |
| Predictive validity | Whether ML models trained on synthetic data perform comparably | Models trained on MDClone synthetic COVID data matched results from real N3C data across 230,703 patients |
| Clinical validation | Whether domain experts recognise outputs as clinically realistic | Chen et al. compared Synthea outputs against national clinical quality measures |
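To illustrate the predictive validity row, here is a minimal train-on-synthetic, test-on-real (TSTR) sketch. The simulated cohorts, the logistic model, and the AUC comparison are all illustrative assumptions, not the pipeline used in the MDClone/N3C work:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_cohort(n):
    """Toy cohort: outcome risk rises with age and comorbidity count."""
    age = rng.normal(65, 12, n)
    comorbidities = rng.poisson(2, n)
    logit = 0.05 * (age - 65) + 0.5 * (comorbidities - 2) - 0.5
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return np.column_stack([age, comorbidities]), y

X_real, y_real = make_cohort(2000)    # held-out real test set
X_synth, y_synth = make_cohort(5000)  # stand-in for a synthetic cohort
X_trtr, y_trtr = make_cohort(5000)    # real training data for the baseline

def eval_auc(X_train, y_train, X_test, y_test):
    """Train on one cohort, report discrimination (AUC) on another."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_tstr = eval_auc(X_synth, y_synth, X_real, y_real)  # train-synthetic
auc_trtr = eval_auc(X_trtr, y_trtr, X_real, y_real)    # train-real baseline
print(f"TSTR AUC = {auc_tstr:.3f}, TRTR AUC = {auc_trtr:.3f}")
```

If the two AUCs are close, models trained on the synthetic data generalise to real patients about as well as models trained on real data, which is the practical claim behind the predictive validity approach.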
What this means in practice
The field is increasingly moving toward validation frameworks that assess synthetic data across resemblance, utility, and privacy together, because optimising for one dimension often trades off against another. This is something I have thought a lot about: in many ways, representativeness and privacy pull in opposite directions. In an ideal world, we want data that are both highly representative and highly private, but in practice there is often a trade-off between the two. The more closely your synthetic data resembles the real data, the more likely it is to contain real data points, which can compromise privacy. Conversely, you could have a completely private dataset that is not based on real data at all, but it is much less likely to be representative of the underlying population, and hence less useful.

The way I have come to think about this is that where your data falls on the line between representativeness and privacy has to be determined by its use case. A synthetic dataset that is designed for testing the transition between two software systems might not need to be representative at all; it just needs to be structurally similar enough to the real data to test the software, so we can prioritise privacy in that case. On the other hand, a synthetic dataset that is designed for training a machine learning model to predict patient outcomes needs to be highly representative of the real patient population, so we would prioritise representativeness in that case, even if it means accepting a lower level of privacy.
This use-case-first thinking is supported by the evidence: a 2023 benchmarking study of six healthcare datasets found no single generation method was consistently best across all three dimensions, and concluded that the right choice depends entirely on what the data needs to do. Understanding which dimensions your research requires, testing for those specifically, and involving domain expertise in the evaluation up front will ensure your data are scientifically defensible, statistically reliable, and fit for purpose.
Where it gets more complicated
Synthetic data generation faces three well-documented challenges when it comes to population representativeness: rare event preservation, bias amplification, and longitudinal complexity. Each one tends to surface at a different stage of the generation and validation process, and each requires a different response.
Rare event preservation is the hardest problem to solve with standard tools. In a validation study, MDClone generated synthetic data from over 1,200 Ottawa Hospital cancer and stroke patients. It preserved most statistical properties but lost meaningful clinical signal in a low-incidence subgroup with prior stroke history. Purpose-built architectures like MTGAN, a conditional GAN, can generate rare disease cases accurately at scale, but preserving rare events requires choosing a model specifically designed for it.
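The kind of subgroup check that catches this failure mode can be sketched in a few lines of pure Python. The field name, incidence rates, and 20% tolerance below are illustrative assumptions, not the MDClone validation protocol:

```python
def incidence(cohort, flag):
    """Fraction of records in which a boolean flag is set."""
    return sum(1 for record in cohort if record[flag]) / len(cohort)

def rare_event_preserved(real, synthetic, flag, max_relative_loss=0.2):
    """Flag synthetic data that loses more than 20% of a rare group's incidence."""
    real_rate = incidence(real, flag)
    synth_rate = incidence(synthetic, flag)
    return synth_rate >= (1 - max_relative_loss) * real_rate

# Toy cohorts: 4% prior-stroke incidence in the real data, only 1% synthetic.
real = [{"prior_stroke": i % 25 == 0} for i in range(1200)]
synthetic = [{"prior_stroke": i % 100 == 0} for i in range(5000)]

print(rare_event_preserved(real, synthetic, "prior_stroke"))  # → False
```

The overall distributions of such a dataset can still pass aggregate tests, which is exactly why this check has to target the rare subgroup directly.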
Bias amplification is a more insidious challenge because it’s invisible unless you’re specifically looking for it. A 2021 study showed that HealthGAN reproduced the bias already present in the MIMIC-III dataset, with certain racial and age subgroups underrepresented in the synthetic output, highlighting that any biases present in the training data carry over to the generated dataset. A 2024 study applying BayesBoost to nearly 500,000 UK primary care records showed that active correction is possible, but only when bias detection is built into the process from the start.
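A simple subgroup-representation audit, in the spirit of the HealthGAN finding, compares each group’s share of the synthetic output against its share of the training data. The group labels, shares, and 10% relative tolerance here are illustrative assumptions:

```python
from collections import Counter

def subgroup_shares(labels):
    """Proportion of records in each subgroup."""
    counts = Counter(labels)
    total = len(labels)
    return {group: n / total for group, n in counts.items()}

def underrepresented(real_labels, synth_labels, tolerance=0.10):
    """Groups whose synthetic share falls >10% (relative) below their real share."""
    real_shares = subgroup_shares(real_labels)
    synth_shares = subgroup_shares(synth_labels)
    return [group for group, share in real_shares.items()
            if synth_shares.get(group, 0.0) < (1 - tolerance) * share]

# Toy data: the generator has drifted toward the majority group.
real_labels = ["white"] * 700 + ["black"] * 200 + ["asian"] * 100
synth_labels = ["white"] * 800 + ["black"] * 150 + ["asian"] * 50

print(underrepresented(real_labels, synth_labels))  # → ['black', 'asian']
```

Running an audit like this as part of generation, rather than after publication, is the “built in from the start” detection the BayesBoost work argues for.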
Longitudinal complexity describes the difficulty of modelling disease progression over time. It introduces temporal drift, an issue where synthetic patient trajectories gradually diverge from realistic patterns in ways that single-snapshot methods won’t catch. Models like LS-EHR address this directly, but capturing longitudinal complexity requires choosing a model built for that purpose.
Conclusion
Representativeness is a design problem as much as a measurement one, and the four dimensions outlined here do not operate independently. No single method captures all of them, and optimising for one can undermine another, which is why layered evaluation has become the field’s working standard.
For researchers, the right questions matter more than the right tools. If you define the research purpose first, then it is much easier to select generation and validation methods that match it. This approach is what separates synthetic data that is technically impressive from synthetic data that is actually useful. It is not possible to create a single synthetic dataset that serves all purposes (from software transitions to training students to training machine learning models), so we must start with the use case first and work outwards from there.
Further reading
This blog is part of a wider series on synthetic data, which you can find here. If you are new to synthetic data, I recommend starting with my introduction to synthetic data and use cases blogs.
Enjoyed this? Subscribe to my newsletter.
I write about open science, research code, and building better tools for researchers.
Browse the newsletter archive →