What is Synthetic Data and Why Does it Matter?
Caroline Morton
October 11, 2025
Healthcare research is facing a fundamental paradox. The demand for comprehensive datasets to drive medical innovation has never been greater, yet access to real patient data remains severely restricted by privacy laws and ethical constraints. In this blog post, I will explore how synthetic data is emerging as a powerful solution to this challenge, enabling researchers to access high-quality datasets without compromising patient privacy.

What are the challenges of accessing real patient data?
As previously mentioned, healthcare research is often constrained by limited access to real patient data. Privacy regulations such as GDPR and HIPAA impose strict controls on how patient information can be used, making it difficult for researchers to obtain the data they need. To be clear, there is nothing more personal than our health data, and there should be no compromise when it comes to protecting patient privacy. Data should only ever be used in secure, ethical ways that respect patient consent and confidentiality. However, these necessary protections can also create significant barriers to research. It can take months or even years to gain access to real patient data, which is a particular problem for early career researchers who are funded for one or two years but cannot access the data they need in that timeframe. Beyond privacy concerns, researchers also face limited sample sizes for rare diseases, underrepresented patient populations, and the inability to share data across institutions for collaborative studies. Nor can the data be released alongside a research paper so that others can validate or replicate the findings.
This tension between scientific progress and patient protection has led to increasing interest in synthetic data.

What is Synthetic Data?
Put simply, synthetic data is artificially generated information that mimics the statistical properties of real-world data without containing any actual patient information. In health research, synthetic data can be used to create datasets that preserve the essential characteristics needed for analysis, such as correlations between different variables or treatment response patterns across populations, while ensuring that no real individual's record appears in the output. This approach allows researchers to conduct studies without the need for extensive ethical and regulatory approvals, enabling faster and more cost-effective research timelines, unrestricted data sharing between institutions, and complete patient privacy protection.
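To make "preserving statistical properties" concrete, here is a minimal, hypothetical sketch in Python. It stands in a toy "real" dataset of two correlated clinical measurements (invented numbers, purely for illustration), fits a mean vector and covariance matrix to it, and then draws entirely new synthetic rows from a multivariate normal with those fitted statistics. The correlation structure carries over even though no original row is reused.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for a real dataset: systolic blood pressure (mmHg) and
# age (years), which are positively correlated in adult populations.
# These figures are invented for illustration.
real = rng.multivariate_normal(
    mean=[130.0, 55.0],
    cov=[[180.0, 45.0],
         [45.0, 120.0]],
    size=500,
)

# Summarise the real data: these statistics, not the rows themselves,
# drive the generator.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Draw brand-new synthetic rows from the fitted distribution.
synthetic = rng.multivariate_normal(mu, sigma, size=500)

# Check that the correlation between the two variables is preserved.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real corr: {real_corr:.2f}, synthetic corr: {synth_corr:.2f}")
```

Real generators are far more sophisticated than a single multivariate normal, of course, but the principle is the same: summarise, then sample, so the synthetic rows share the shape of the data without copying it.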
How Synthetic Data Is Generated
The creation of synthetic healthcare data relies on multiple approaches, each suited to different applications and privacy requirements. I am going to be writing a more detailed blog post, or more likely a series of blog posts, on the different methods used to generate synthetic data, but here I will provide a brief overview at a very high level. Common methods include:
- Rule-based Simulations: These methods use established clinical guidelines and statistical distributions to simulate patient data. For example, the Synthea platform generates synthetic patient records by simulating disease progression and healthcare interactions based on publicly available health statistics.
- Probabilistic Models: These approaches use statistical techniques to model the relationships between different variables in the data. For example, Bayesian networks can be used to generate synthetic data that preserves the joint distributions of multiple variables.
- Machine Learning Techniques: Advanced methods such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) leverage deep learning to learn complex patterns in real datasets and generate new, synthetic samples that closely resemble the original data. This is an interesting area because it generates realistic data, but there are concerns about privacy because the model is trained on real data, and there is a risk that it could inadvertently reproduce identifiable information.
- Hybrid Approaches: Some methods combine multiple techniques to balance the trade-offs between data utility and privacy. For example, a hybrid approach might use rule-based simulations to generate baseline data and then apply machine learning techniques to introduce variability and complexity.
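As a flavour of the rule-based approach, the sketch below generates patient records from simple, publicly statable rules: an age range and an age-dependent probability of a diagnosis. This is a toy illustration, not how Synthea actually works, and all the risk figures and field names are invented placeholders; a real rule-based simulator would take its rules from published health statistics and clinical guidelines.

```python
import random

random.seed(0)

def simulate_patient():
    """Generate one synthetic patient record from simple rules.

    The age range and risk figures are invented placeholders for
    illustration only.
    """
    age = random.randint(18, 90)
    # Toy rule: hypertension risk rises roughly linearly with age, capped at 70%.
    p_hypertension = min(0.05 + 0.008 * (age - 18), 0.7)
    hypertensive = random.random() < p_hypertension
    # Toy treatment rule: most diagnosed patients receive a prescription.
    on_medication = hypertensive and random.random() < 0.8
    return {"age": age, "hypertension": hypertensive, "on_medication": on_medication}

cohort = [simulate_patient() for _ in range(1000)]
diagnosed = sum(p["hypertension"] for p in cohort)
print(f"{diagnosed} of {len(cohort)} synthetic patients have hypertension")
```

Because every value is produced by a rule rather than derived from a real record, there is nothing to re-identify; the trade-off is that the data is only as realistic as the rules you write.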
As with any part of life, there are some bad actors, and synthetic data is no exception. Some companies claim to generate synthetic data, but in reality they are just anonymising or pseudonymising real data, or using techniques like multiple imputation, deleting some values and replacing them with simulated ones. This is not synthetic data, and it does not provide the same level of privacy protection.

Why is Synthetic Data Adoption Accelerating?
The adoption of synthetic data in healthcare research is accelerating rapidly, driven by several converging factors:
- Privacy and Regulatory Compliance: Increasingly complex data governance frameworks make accessing real patient data more challenging. In the UK, we have NHS data governance and GDPR to navigate. Again, these are necessary protections, but they can create significant delays in research timelines. Synthetic data offers a way to bypass these barriers while still enabling robust analysis. Interestingly, governments are recognising these barriers to access and investing in synthetic data initiatives. For example, NHS England has developed the Simulacrum cancer dataset for widespread research use, and the Medicines and Healthcare products Regulatory Agency (MHRA) has announced the creation of two synthetic datasets for Covid-19 and cardiovascular disease research.
- Machine Learning and AI Advancements: Advances in machine learning and generative models, such as Generative Adversarial Networks (GANs) and large language models (LLMs), have made it possible to create high-quality synthetic datasets that closely resemble real-world data. These technologies can capture complex relationships within the data, making synthetic datasets more useful for a wider range of research applications.
- Computational Power: The increasing availability of powerful computing resources has made it feasible to generate large-scale synthetic datasets quickly and cost-effectively. Previously it might have taken weeks or months to generate a synthetic dataset that wasn't very good or large enough to be useful. Now it can be done in hours or days. In my own projects, I use the Rust programming language to generate synthetic datasets in a fraction of the time it would take using traditional methods, and I can generate much larger datasets, assess them for quality, and iterate quickly to improve them. I do all of this on a personal laptop, with no need for high-performance computing clusters or cloud services.
- Collaborative Research Needs: In the last few years, and especially during the Covid-19 pandemic, there has been a growing emphasis on collaborative research across institutions and borders. Federated analysis is often proposed as a solution, where data remains within its original location but the same analysis is run in multiple places. However, this approach can be technically complex and resource-intensive, with many different datasets to manage and analyse. Synthetic data provides an alternative by enabling the sharing of datasets that can be used for collaborative studies without the need to share real patient data. You can imagine the power of being able to create one core patient population that generates synthetic data fitting the schema each institution needs. This would enable collaborative research without the need for complex federated analysis setups.
- Widespread Adoption of Technology in Healthcare: The healthcare sector is increasingly adopting digital technologies. When I started as a doctor, we had paper notes, and I have seen the move to electronic health records, digital imaging, telemedicine, and increasingly wearable health devices. This digital transformation has involved industry partners coming into hospitals and healthcare systems to implement these technologies, and these partners need to test and validate their products. Synthetic data provides a way for these companies to access realistic datasets for development and testing without the need for real patient data, which can be difficult to obtain. If you are transitioning from one electronic health record system to another, it is in everyone's interest to test the new system with realistic data before going live. Synthetic data can provide this.
- Training Those AI Models: As AI adoption accelerates, the demand for training data is predicted to outpace the availability of real-world datasets, making synthetic alternatives increasingly critical and applying further pressure to create and adopt synthetic data. Gartner estimates that by 2030, synthetic data will surpass real data for training and testing AI models, a fundamental shift that reflects growing confidence in the potential for synthetic data to transform healthcare research.
- Open Science and Transparency: I include this one because it is important to me, although I recognise it is not as major a contributor as the others. There is a growing movement towards open science and transparency in research. Recently the BMJ changed their policy to require authors to make their research code available for publication. This is a great step forward, but if you talk to researchers, they will often say that the code is useless without the data, and if the data is real patient data, they cannot share it. This is a fair point, although I would argue that sharing code is still a step forward and useful for others to see the methods used. The point still stands, though, that sharing real data is usually not appropriate. Synthetic data provides a way to share datasets that can be used to validate and replicate research findings without compromising patient privacy. Just release a fake dataset alongside your code!

Wider applications of Synthetic Data
While this blog post focuses on synthetic data in healthcare research, it is important to note that synthetic data is of interest in many other fields as well. For example, synthetic data is being used in finance to create realistic datasets for fraud detection and risk modelling, in autonomous vehicle development to simulate driving scenarios, and in retail to analyse customer behaviour without compromising personal information.
Basically, the principles and benefits of synthetic data are broadly applicable across industries where data privacy and access are critical concerns, or where you simply want to create a huge dataset quickly and cheaply. As synthetic data generation techniques continue to improve, we can expect to see even broader adoption across various sectors. I think this is a really exciting time because we can look at how synthetic data is being used in other fields and learn from their experiences to inform best practices in healthcare research.

About 18 months ago, I attended a one day conference on synthetic data, and most of the talks were from outside healthcare, in particular finance and fraud. What is interesting in these fields is that they have a similar issue to healthcare around sharing, but for different reasons. In healthcare, we have privacy concerns which prevent us from sharing real patient data across institutions even within one study, and again, this is as it should be. In finance, the concerns are about competition and commercial sensitivity. Firms do not want to share their data with their competitors, but they do want to collaborate on research to improve fraud detection methods. Synthetic data provides a way for them to share data without sharing the real data, which is a similar concept to healthcare.
Conclusion
With the global healthcare analytics market expected to reach over USD 170 billion by 2030, the demand for accessible healthcare data will only increase. This emerging reality suggests synthetic data may be shifting from an optional alternative to a practical necessity. If you are interested in learning more about synthetic data, I recommend starting with the following resources:
- Goncalves et al. (2020): A comprehensive evaluation of different synthetic data generation methods for healthcare data, comparing probabilistic models, GANs, and imputation-based approaches across data utility and privacy metrics using SEER cancer registry data.
- Kühnel et al. (2024): An evaluation of synthetic data generation for longitudinal health studies, including practical method comparisons.
- Walonoski et al. (2018): The authors developed Synthea, an open-source software platform for generating complete synthetic patient lifespans using publicly available health statistics and clinical guidelines. This study is significant because it created the first widely adopted, freely available platform for generating realistic synthetic healthcare data at scale, while ensuring complete privacy protection through the use of only public data sources.
If you are interested in synthetic data and want to discuss it further, please feel free to reach out to me on LinkedIn or send me a message via the Contact Page. I would love to hear from you!