What are GANs and how can they generate synthetic data?

Author

Caroline Morton

Date

November 24, 2025

This post introduces the concept of Generative Adversarial Networks (GANs) and explores how they can be used to generate synthetic data. I focus on synthetic healthcare data in particular, because many of the published research papers on GANs for synthetic data address this application.

What are GANs?

Generative Adversarial Networks (GANs) are a class of machine learning models that consist of two neural networks competing against each other. The first network, called the generator, creates synthetic data samples, while the second network, called the discriminator, evaluates whether each sample is real (from the training data) or fake (produced by the generator). This competition is inherently adversarial, hence the name. The generator aims to produce data that is indistinguishable from real data, while the discriminator strives to become better at identifying fake data. Through this process, both networks improve over time, and the generator’s output becomes increasingly realistic.

In a healthcare setting, training continues until the generator produces synthetic records so convincing that the discriminator cannot reliably identify them as artificial. The result is a trained model that can generate as many synthetic patient records as needed. Ideally, these records maintain the statistical properties, correlations, and clinical patterns found in the real dataset without reproducing any actual patient’s information.
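For readers who prefer notation, the competition between the two networks is usually written as a minimax game. This is the standard formulation from the original GAN paper (Goodfellow et al., 2014) rather than anything specific to healthcare. Here $G$ is the generator, $D$ is the discriminator (which outputs the probability that its input is real), $x$ is a real record drawn from the training data, and $z$ is the random noise fed to the generator.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to make this value large by scoring real records near 1 and generated records near 0, while the generator tries to make it small by producing records the discriminator scores as real.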

If you are new to machine learning and neural networks, the easiest way to think of this is through an analogy:

Imagine a classroom art contest between two students. One is the artist, the other is the judge. The artist tries to draw pictures of cats that look just like the ones in a book of real cat photos. The judge looks at each picture and says whether it’s a real cat photo or a fake drawing. At first the artist’s cats look awful - scribbles, weird faces, wrong shapes and angles. The judge spots the fakes instantly. But each time, the artist learns from the judge’s feedback and improves the drawings. The judge also learns to spot more subtle mistakes. After many rounds, the artist gets so good that the judge can’t reliably tell whether a cat picture is real or drawn. That’s how a GAN works. The artist is the generator, the judge is the discriminator. They keep challenging each other until the generator produces synthetic data so convincing that even the discriminator can’t tell the difference.
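The back-and-forth between artist and judge can be sketched in a few lines of code. The example below is a deliberately tiny toy, not a realistic healthcare GAN: the "real data" is a single number drawn from a normal distribution (think of one standardised lab value), the generator is a straight line, and the discriminator is a logistic regression. All names and settings (`train_toy_gan`, the learning rate, the real mean of 4.0) are assumptions made for illustration. Real GANs use deep networks and a framework that computes gradients automatically, but the alternating update pattern is the same.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, real_mean=4.0, real_sd=1.0, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator (the artist): fake = a * z + b
    w, c = 0.0, 0.0  # discriminator (the judge): p(real) = sigmoid(w * x + c)

    for _ in range(steps):
        x_real = rng.gauss(real_mean, real_sd)  # one real sample
        z = rng.gauss(0.0, 1.0)                 # random noise
        x_fake = a * z + b                      # one generated sample

        # Judge's turn: gradient step on -[log D(real) + log(1 - D(fake))].
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Artist's turn: gradient step on -log D(fake), using the updated judge.
        d_fake = sigmoid(w * x_fake + c)
        grad_a = (d_fake - 1.0) * w * z
        grad_b = (d_fake - 1.0) * w
        a -= lr * grad_a
        b -= lr * grad_b

    return {"a": a, "b": b, "w": w, "c": c}
```

Calling `train_toy_gan()` returns the final parameters of both players. Do not expect them to settle neatly: even a toy setup like this can keep oscillating as each side reacts to the other.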

Specialised GANs for Healthcare Data

Healthcare data presents unique challenges that have driven the development of specialised GAN architectures.

  • Conditional GANs enable researchers to generate synthetic patients with specific medical characteristics, such as diabetic patients within particular age ranges or cancer patients at defined disease stages. This targeted generation proves invaluable for studying rare conditions or underrepresented populations.

  • Wasserstein GANs address the training instability that often plagues standard GANs when working with complex medical datasets, providing more reliable synthetic data generation.

  • Temporal GANs capture how patient conditions evolve over time, modelling disease progression patterns and treatment responses across extended periods.

  • Privacy-preserving variants incorporate differential privacy directly into the training process, placing a mathematical bound on how much any single patient’s record can influence the trained model and, in turn, the synthetic data it generates.
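To make the conditional GAN idea from the list above more concrete: a common way to condition a generator is to hand it the requested characteristic alongside the random noise, with the discriminator shown the same label next to each record. The sketch below only builds that generator input. The condition labels, the function names, and the noise size are all hypothetical and chosen for illustration.

```python
import random

# Hypothetical condition labels; a real project would define its own.
CONDITIONS = ["type_2_diabetes", "breast_cancer_stage_2", "asthma"]


def one_hot(label, labels=CONDITIONS):
    """Encode a condition as a vector with a single 1.0 at its position."""
    if label not in labels:
        raise ValueError(f"unknown condition: {label}")
    return [1.0 if label == candidate else 0.0 for candidate in labels]


def generator_input(label, noise_dim=4, rng=None):
    """Concatenate random noise with the one-hot condition vector."""
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label)
```

For example, `generator_input("asthma")` returns four noise values followed by `[0.0, 0.0, 1.0]`. Changing the label while keeping the noise source is what lets a trained conditional generator be asked for a particular kind of patient.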

From Individual Patient Modelling to Population Health

What sets GANs apart is their ability, at least in theory, to model both individual patient trajectories and entire populations. When working as intended, they can generate complete synthetic patient profiles encompassing demographics, diagnostic histories, medication regimens, and laboratory results while preserving the complex interdependencies that make healthcare data valuable for research. At the individual level, patient journey modelling generates realistic disease progression patterns and treatment responses over time. This longitudinal modelling enables researchers to study long-term outcomes, treatment efficacy, and disease trajectories without requiring decades of real patient follow-up data. The NHS Digital Simulacrum project demonstrates this approach in practice, using synthetic data to model cancer patient pathways for nationwide research collaboration.

Beyond individual patient modelling, GANs enable population-scale research by generating synthetic datasets that maintain realistic demographic distributions and disease patterns across entire communities. RTI International has developed synthetic population databases for the Models of Infectious Disease Agent Study program, enabling researchers to simulate disease outbreaks and evaluate public health interventions such as H1N1 pandemic response scenarios without compromising individual privacy. These applications extend to health services research, where synthetic populations enable analysis of healthcare utilisation patterns, resource allocation, and system capacity planning across diverse demographic scenarios.

Ensuring Quality and Clinical Validity

One of the key challenges with synthetic healthcare data is ensuring that it is of high quality and clinically valid, which requires rigorous assessment across multiple dimensions. Statistical tests compare the distributions of variables between real and synthetic datasets and check that relationships between important variables are preserved, while privacy assessments check whether synthetic data can be traced back to individual patients.
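One common statistical check for a single numeric variable is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the cumulative distributions of the real and synthetic values. The sketch below computes only the statistic (0 means the two samples have identical distributions, 1 means they do not overlap at all), not a p-value, and the function name is an assumption for illustration. It is one option among many, and it says nothing about relationships between variables.

```python
def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of two numeric samples."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    i = j = 0
    largest_gap = 0.0
    while i < n and j < m:
        # Step both CDFs past the next smallest value (ties move together).
        value = min(real_sorted[i], synth_sorted[j])
        while i < n and real_sorted[i] <= value:
            i += 1
        while j < m and synth_sorted[j] <= value:
            j += 1
        largest_gap = max(largest_gap, abs(i / n - j / m))
    return largest_gap
```

In practice a statistics library would usually be used instead, for example `scipy.stats.ks_2samp`, which also reports a p-value.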

Clinical validation can be an important evaluation step, with medical experts reviewing synthetic records for clinical plausibility to check that generated patient profiles reflect realistic disease presentations and treatment patterns. Finally, task performance testing evaluates whether machine learning models trained on synthetic data perform comparably to those trained on real patient records when deployed in clinical settings. This last step may not apply to those using synthetic data for other purposes, such as software testing, training, or classic statistical analysis.
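Where task performance testing does apply, the comparison is often described as "train on synthetic, test on real": fit the same model once on real records and once on synthetic records, then score both on held-out real records. The sketch below uses a deliberately simple one-feature threshold rule as a stand-in for the model. The function names and the `(value, label)` record layout are assumptions for illustration, and a real evaluation would use the actual downstream model and metric.

```python
def fit_threshold(records):
    """Fit the rule 'predict 1 if value >= threshold' on (value, label) pairs."""
    candidates = sorted({value for value, _ in records})
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((value >= threshold) == bool(label) for value, label in records)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, records):
    """Share of records the threshold rule classifies correctly."""
    correct = sum((value >= threshold) == bool(label) for value, label in records)
    return correct / len(records)


def utility_gap(real_train, synthetic_train, real_test):
    """Accuracy lost on real test data by training on synthetic data instead."""
    real_model = fit_threshold(real_train)
    synthetic_model = fit_threshold(synthetic_train)
    return accuracy(real_model, real_test) - accuracy(synthetic_model, real_test)
```

A gap close to zero suggests the synthetic data carries the signal this particular task needs. It does not show the data is suitable for other tasks.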

Despite their promise, GANs face significant technical and regulatory hurdles in healthcare applications.

Subpopulation Overfitting

One of the major issues is training instability, which can produce mode collapse, where the generator creates only a limited variety of synthetic patients and fails to capture the full diversity of real medical populations.

If we go back to our art contest analogy, this is like the artist getting stuck drawing the same cat over and over again because they found a way to fool the judge with that one drawing. The artist stops learning and improving, and the drawings become repetitive and uninteresting. This is a common problem in GAN training, where the generator finds a “shortcut” to fool the discriminator without truly capturing the complexity of the real data. It is particularly problematic in healthcare, where patient populations are diverse and complex: mode collapse can lead to synthetic datasets consisting of very similar patient profiles, for example from a single sociodemographic group or disease type. The discriminator is fooled because each record looks realistic at face value, but at a population level the synthetic data lacks the variability needed to be useful and is clearly not representative of the real patient population.
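A simple way to look for this kind of collapse is to compare how often each subgroup appears in the real and synthetic data. The sketch below assumes each record has already been reduced to a single group label (for example an age band or a diagnosis code) and the function names are hypothetical. It reports groups that are missing from the synthetic data entirely, plus the total variation distance between the two sets of proportions (0 means identical shares, 1 means no overlap).

```python
from collections import Counter


def subgroup_shares(labels):
    """Proportion of records falling into each group."""
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def coverage_report(real_labels, synthetic_labels):
    """Compare subgroup proportions between real and synthetic records."""
    real = subgroup_shares(real_labels)
    synthetic = subgroup_shares(synthetic_labels)
    # Groups present in the real data that the generator never produced.
    missing = sorted(group for group in real if group not in synthetic)
    groups = set(real) | set(synthetic)
    # Total variation distance: half the summed absolute differences in share.
    distance = 0.5 * sum(
        abs(real.get(group, 0.0) - synthetic.get(group, 0.0)) for group in groups
    )
    return {"missing_groups": missing, "total_variation": distance}
```

This only catches collapse along the grouping you chose to check. A generator can match subgroup proportions and still produce near-identical patients within each group.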

Computational Requirements

Another issue is computational cost. Generating large-scale healthcare datasets often requires specialised hardware and extended training periods, so research groups looking to implement GANs need access to sufficient computational resources, which may well include GPUs or scalable cloud infrastructure. This is expensive, both financially and computationally, and research groups might also lack the technical expertise to set up and manage these systems effectively. Cloud infrastructure raises a further question: given the sensitivity of the training data, should it really be stored with a third-party cloud provider?

Rare Disease Modelling

Modelling rare diseases also poses difficulties as GANs struggle to model low-frequency events present in training data. Rare diseases, by definition, have limited data available, making it challenging for GANs to learn the underlying patterns and relationships necessary to generate realistic synthetic records. This can lead to synthetic datasets that underrepresent rare conditions, limiting their utility for research focused on these diseases.

Going back to our art contest analogy, this is like the artist trying to draw a particular breed of cat that they have only seen a couple of times. Because they don’t have enough examples to learn from, their drawings of that breed end up looking quite different from the real thing, and the judge can easily tell that they are fake. In healthcare, this means that GANs may struggle to accurately represent patients with rare diseases, leading to synthetic data that fails to capture the unique characteristics and complexities of these conditions. Techniques such as data augmentation, transfer learning, and specialised GAN architectures can help address this issue, but it remains a significant challenge.

Regulatory Compliance

Regulatory compliance adds another layer of complexity, with healthcare organisations requiring clear guidance on how synthetic data fits within existing data governance frameworks. It is often unclear when synthetic data counts as personal data under regulations such as GDPR, and organisations need to ensure they are compliant when using synthetic data for research or other purposes.

Data Privacy Considerations

While GANs can generate synthetic data that does not directly correspond to real patients, there is still a risk of re-identification if the synthetic data is too similar to the real data. For me this is one of the biggest concerns when using GANs for healthcare data generation. Due to the black box nature of GANs (and most other machine learning models) and the extensive training required, it can be difficult to fully understand how well privacy is being preserved in the synthetic data. It is hard to say with absolute certainty that no individual patient information has been leaked through the neural network’s learned representations into the synthetic data.

Privacy Metrics

With this in mind, it is essential to implement robust privacy metrics when evaluating synthetic healthcare data generated by GANs. A lot of the literature focuses on statistical similarity metrics, but these do not directly address privacy concerns. I think this is an area that needs more development, particularly in terms of standardising privacy evaluation frameworks for synthetic healthcare data. I plan to publish more on this topic in future blogs as privacy is a key concern for pretty much all methods of generating synthetic healthcare data, not just GANs, and deserves its own discussion.
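As one illustration of what a privacy-oriented check can look like, some evaluations report the distance from each synthetic record to its closest real record: a synthetic record that sits almost on top of a real one may be a memorised copy. The sketch below assumes every record is a tuple of numeric features already scaled to comparable ranges, and the function names and the threshold of 0.1 are hypothetical. It is a heuristic, not a privacy guarantee: small distances are a warning sign, but large distances do not prove that nothing has leaked.

```python
import math


def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic record to its nearest real record."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def flag_near_copies(synthetic_rows, real_rows, threshold=0.1):
    """Indices of synthetic records lying within `threshold` of a real record."""
    return [
        index
        for index, row in enumerate(synthetic_rows)
        if distance_to_closest_record(row, real_rows) <= threshold
    ]
```

The comparison is brute force, so the cost grows with the number of real records times the number of synthetic records. Larger datasets would need a nearest-neighbour index.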

Conclusion

GANs are increasingly being adopted across healthcare research and clinical practice. They are here already, and as the UK government’s commitment to healthcare AI intensifies and the need for training data grows, synthetic data will likely become essential infrastructure for medical innovation. Some predict that synthetic data will outpace real data for training AI models by 2030. GANs are positioned to become fundamental tools for generating this synthetic data, but there are still significant challenges to overcome. These challenges need to be part of the conversation as we move towards wider adoption of synthetic data in healthcare. We do not want to end up in a situation where synthetic data is used widely but its quality and privacy are questionable.

If you would like to learn more about GANs, I recommend starting with the following resources:

  • Yoon et al. (2023): Development of GANs for generating high-fidelity and privacy-preserving synthetic electronic health records.
  • Yan et al. (2024): A comprehensive tutorial for generating synthetic electronic health record data using generative adversarial networks.
  • Sun et al. (2023): Development of DP-CGANS, a differentially private conditional generative adversarial network.

If you are interested in synthetic data and want to discuss it further, please feel free to reach out to me on LinkedIn or send me a message via the Contact Page. I would love to hear from you!
