Representativeness in Synthetic Data: What It Means and How to Measure It
Caroline Morton
April 7, 2026
In my introduction to synthetic data article, I described how GAN-generated records have been used to predict patient length of stay in German hospitals. It’s a compelling example of how synthetic data can serve as a credible proxy for real patient records. But how do we know that the synthetic datasets we generate actually represent the patient population they are modelling?
Consider the German study above. If the data subtly underrepresented elderly patients with multiple comorbidities (a group who typically account for the longest and most resource-intensive stays), the model could look quite accurate on paper, while producing systematically skewed predictions in practice. The difference between ‘it looks like real data to some degree’ and ‘this actually represents the population’ is called representativeness, and it is what this article is all about.
One simple way to think about representativeness is to recognise that overall statistical similarity and representativeness are not the same thing. An example of this was shown by a Spanish research team who generated 16,560 synthetic lung cancer records from 886 real patients. They ran 58 statistical tests to check how closely the synthetic data matched the real dataset. In 48 of those tests, there was no significant difference (great!), and only 10 showed discrepancies. That is an 83% pass rate, which looks impressive on paper.

However, these results raise several important questions. Which were the 10 tests that failed, what did they represent clinically, and do those gaps affect research conclusions? This is the representativeness problem we face in generating synthetic healthcare data.
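To make the idea of a per-variable pass rate concrete, here is a minimal sketch of that kind of check using two-sample Kolmogorov-Smirnov tests. The variable names, sample sizes, and 0.05 threshold are illustrative assumptions, not taken from the Spanish study; the point is that reporting *which* variables failed matters as much as the headline pass rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Toy stand-ins for a real cohort (n=886) and a synthetic one (n=16,560).
real = {
    "age": rng.normal(65, 10, 886),
    "tumour_size_mm": rng.gamma(4.0, 8.0, 886),
}
synthetic = {
    "age": rng.normal(65, 10, 16560),              # well matched
    "tumour_size_mm": rng.gamma(2.0, 8.0, 16560),  # deliberately skewed
}

def fidelity_report(real, synthetic, alpha=0.05):
    """Two-sample KS test per continuous variable; return pass count and failures."""
    failures = []
    for var in real:
        stat, p_value = stats.ks_2samp(real[var], synthetic[var])
        if p_value < alpha:
            failures.append((var, round(stat, 3)))
    return len(real) - len(failures), failures

passed, failures = fidelity_report(real, synthetic)
print(f"{passed}/{len(real)} variables passed")
for var, stat in failures:
    print(f"FAILED: {var} (KS statistic {stat})")
```

A headline like “1/2 variables passed” hides nothing here, but at 58 variables it easily could, which is exactly the problem raised above.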
So what does “representativeness” actually mean?
Representativeness is not a single property. It is defined by four dimensions that need to work together to produce data that represent the population: statistical fidelity, population coverage, temporal validity, and clinical coherence. I’ve summarised each of these terms below for clarity.
Statistical fidelity describes the preservation of data distributions as well as the relationships between variables. For example, it might define the way age interacts with comorbidity burden, or how treatment response varies across demographics.
Population coverage means that the data reflects the full range of the target population, including minority groups, rare conditions, and underserved patients who aren’t often represented in smaller standard datasets.
Temporal validity is what enables synthetic data to capture how patients’ health changes over time, including disease progression, treatment evolution, and complication development, rather than a single snapshot in time.
Clinical coherence means that the relationships captured by the synthetic data are ones that a clinician would actually recognise as real and plausible, not just statistically valid but also meaningful in clinical practice.
A good illustration of how these dimensions relate to each other comes from Chen and colleagues’ 2019 validation of Synthea, which generated over 1.2 million synthetic Massachusetts residents. The study tested statistical fidelity alongside clinical quality measures. The researchers found that while demographics and procedure-based measures (such as colorectal cancer screening) aligned well with national averages, significant discrepancies against real-world benchmarks emerged when they tested outcome-based clinical quality measures (such as complication rates, disease-specific outcomes, and mortality). What the study shows is that strong demographic alignment tells you very little about whether the data holds up clinically, which is why representativeness cannot be confirmed by statistical tests alone.
How do you actually measure it?
Given that representativeness covers the four distinct dimensions described above, no single test can measure all of them. Instead, measuring representativeness requires layering several methods, each targeting a different aspect of how well the synthetic data reflects reality. Here are some of the commonly applied approaches:
| Approach | What it checks | Real-world example |
| --- | --- | --- |
| Distribution-based metrics | How closely synthetic distributions match real data | Gonzalez-Abril et al. used Kolmogorov-Smirnov (KS) and Chi-square tests across 58 variables in synthetic lung cancer records |
| Correlation preservation | Whether variable relationships survive the generation process | Foraker et al. compared Spearman correlations between real and synthetic data across three clinical conditions, finding no significant differences in variable relationships |
| Predictive validity | Whether ML models trained on synthetic data perform comparably | Models trained on MDClone synthetic COVID data matched results from real N3C data across 230,703 patients |
| Clinical validation | Whether domain experts recognise outputs as clinically realistic | Chen et al. compared Synthea outputs against national clinical quality measures |
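To illustrate the predictive validity row, here is a minimal train-on-synthetic, test-on-real (TSTR) sketch. The simulated cohorts, the logistic model, and the AUC comparison are all illustrative assumptions, not the pipeline used in the MDClone/N3C work:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_cohort(n):
    """Toy cohort: outcome risk rises with age and comorbidity count."""
    age = rng.normal(65, 12, n)
    comorbidities = rng.poisson(2, n)
    logit = 0.05 * (age - 65) + 0.5 * (comorbidities - 2) - 0.5
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return np.column_stack([age, comorbidities]), y

X_real, y_real = make_cohort(2000)    # held-out real test set
X_synth, y_synth = make_cohort(5000)  # stand-in for a synthetic cohort
X_trtr, y_trtr = make_cohort(5000)    # real training data for the baseline

def eval_auc(X_train, y_train, X_test, y_test):
    """Train on one cohort, report discrimination (AUC) on another."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_tstr = eval_auc(X_synth, y_synth, X_real, y_real)  # train-synthetic
auc_trtr = eval_auc(X_trtr, y_trtr, X_real, y_real)    # train-real baseline
print(f"TSTR AUC = {auc_tstr:.3f}, TRTR AUC = {auc_trtr:.3f}")
```

If the two AUCs are close, models trained on the synthetic data generalise to real patients about as well as models trained on real data, which is the practical claim behind the predictive validity approach.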
What this means in practice
The field is increasingly moving toward validation frameworks that assess synthetic data across resemblance, utility, and privacy together, because optimising for one dimension often trades off against another. This is something I have thought a lot about: in many ways, representativeness and privacy pull in opposite directions. In an ideal world, we want data that are both highly representative and highly private, but in practice there is often a trade-off between the two. The more closely your synthetic data resembles the real data, the more likely it is to contain real data points, which can compromise privacy. Conversely, you could have a completely private dataset that is not based on real data at all, but it is much less likely to be representative of the underlying population, and hence less useful.

The way I have come to think about this is that where your data falls on the line between representativeness and privacy has to be determined by its use case. A synthetic dataset that is designed for testing the transition between two software systems might not need to be representative at all; it just needs to be structurally similar enough to the real data to test the software, so we can prioritise privacy in that case. On the other hand, a synthetic dataset that is designed for training a machine learning model to predict patient outcomes needs to be highly representative of the real patient population, so we would prioritise representativeness in that case, even if it means accepting a lower level of privacy.
This use-case-first thinking is supported by the evidence: a 2023 benchmarking study of six healthcare datasets found no single generation method was consistently best across all three dimensions, and concluded that the right choice depends entirely on what the data needs to do. Understanding which dimensions your research requires, testing for those specifically, and involving domain expertise in the evaluation up front will ensure your data are scientifically defensible, statistically reliable, and fit for purpose.
Where it gets more complicated
Synthetic data generation faces three well-documented challenges when it comes to population representativeness: rare event preservation, bias amplification, and longitudinal complexity. Each one tends to surface at a different stage of the generation and validation process, and each requires a different response.
Rare event preservation is the hardest problem to solve with standard tools. In a validation study, MDClone generated synthetic data from over 1,200 Ottawa Hospital cancer and stroke patients. It preserved most statistical properties but lost meaningful clinical signal in a low-incidence subgroup with prior stroke history. Purpose-built architectures like MTGAN, a conditional GAN, can generate rare disease cases accurately at scale, but preserving rare events requires choosing a model specifically designed for it.
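The kind of subgroup check that catches this failure mode can be sketched in a few lines of pure Python. The field name, incidence rates, and 20% tolerance below are illustrative assumptions, not the MDClone validation protocol:

```python
def incidence(cohort, flag):
    """Fraction of records in which a boolean flag is set."""
    return sum(1 for record in cohort if record[flag]) / len(cohort)

def rare_event_preserved(real, synthetic, flag, max_relative_loss=0.2):
    """Flag synthetic data that loses more than 20% of a rare group's incidence."""
    real_rate = incidence(real, flag)
    synth_rate = incidence(synthetic, flag)
    return synth_rate >= (1 - max_relative_loss) * real_rate

# Toy cohorts: 4% prior-stroke incidence in the real data, only 1% synthetic.
real = [{"prior_stroke": i % 25 == 0} for i in range(1200)]
synthetic = [{"prior_stroke": i % 100 == 0} for i in range(5000)]

print(rare_event_preserved(real, synthetic, "prior_stroke"))  # → False
```

The overall distributions of such a dataset can still pass aggregate tests, which is exactly why this check has to target the rare subgroup directly.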
Bias amplification is a more insidious challenge because it’s invisible unless you’re specifically looking for it. A 2021 study showed that HealthGAN reproduced the bias already present in the MIMIC-III dataset, with certain racial and age subgroups underrepresented in the synthetic output, highlighting that any biases present in the training data carry over to the generated dataset. A 2024 study applying BayesBoost to nearly 500,000 UK primary care records showed that active correction is possible, but only when bias detection is built into the process from the start.
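A simple subgroup-representation audit, in the spirit of the HealthGAN finding, compares each group’s share of the synthetic output against its share of the training data. The group labels, shares, and 10% relative tolerance here are illustrative assumptions:

```python
from collections import Counter

def subgroup_shares(labels):
    """Proportion of records in each subgroup."""
    counts = Counter(labels)
    total = len(labels)
    return {group: n / total for group, n in counts.items()}

def underrepresented(real_labels, synth_labels, tolerance=0.10):
    """Groups whose synthetic share falls >10% (relative) below their real share."""
    real_shares = subgroup_shares(real_labels)
    synth_shares = subgroup_shares(synth_labels)
    return [group for group, share in real_shares.items()
            if synth_shares.get(group, 0.0) < (1 - tolerance) * share]

# Toy data: the generator has drifted toward the majority group.
real_labels = ["white"] * 700 + ["black"] * 200 + ["asian"] * 100
synth_labels = ["white"] * 800 + ["black"] * 150 + ["asian"] * 50

print(underrepresented(real_labels, synth_labels))  # → ['black', 'asian']
```

Running an audit like this as part of generation, rather than after publication, is the “built in from the start” detection the BayesBoost work argues for.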
Longitudinal complexity describes the difficulty of modelling disease progression over time. It introduces temporal drift, an issue where synthetic patient trajectories gradually diverge from realistic patterns in ways that single-snapshot methods won’t catch. Models like LS-EHR address this directly, but capturing longitudinal complexity requires choosing a model built for that purpose.
Conclusion
Representativeness is a design problem as much as a measurement one, and the four dimensions outlined here do not operate independently. No single method captures all of them, and optimising for one can undermine another, which is why layered evaluation has become the field’s working standard.
For researchers, the right questions matter more than the right tools. If you define the research purpose first, then it is much easier to select generation and validation methods that match it. This approach is what separates synthetic data that is technically impressive from synthetic data that is actually useful. It is not possible to create a single synthetic dataset that serves all purposes (from software transitions to training students to training machine learning models), so we must start with the use case first and work outwards from there.
Further reading
This blog is part of a wider series on synthetic data, which you can find here. If you are new to synthetic data, I recommend starting with my introduction to synthetic data and use cases blogs.
Enjoyed this? Subscribe to my newsletter.
I write about open science, research code, and building better tools for researchers.
Browse the newsletter archive →