How Synthetic Data Is Used in Healthcare, Research and Beyond

Author

Caroline Morton

Date

March 2, 2026

In a previous post, I introduced what synthetic data is and why it is generating so much interest in healthcare. In this article I want to look at practical use cases, both in healthcare and in other industries, to show how synthetic data is being used in the real world. The list is not exhaustive, but my aim is to convey the breadth of applications. I often get asked why I choose to focus on this area, and my hope is that this article will show why it is such an exciting space to be working in.

Synthetic Data in Healthcare

Healthcare Research

One of the most underappreciated benefits of synthetic data is the ability it gives researchers to share their methodologies. When a researcher develops an analysis pipeline on real patient data, they often cannot share their code publicly without risking exposure of the underlying data. Synthetic datasets that mirror the statistical structure of real data change that. The code, the methods, and the analytical approach can be shared freely across institutions and borders, because the data they were built on contains no real patient information.

Tools like the Simulacrum in the UK and Synthea in the US make this possible in practice. Multiple data generation methods exist, from probabilistic models and GANs to rule-based simulation, and the right choice depends heavily on the use case, the data type, and how much real patient data is available to train on. I’ve covered GANs here and multiple imputation (MI) approaches here if you want to go deeper on the methods. It is also worth highlighting that synthetic data doesn’t substitute for real patient data in regulatory submissions: it is a complement, most valuable in the development and methodology phase rather than for drawing clinical conclusions or informing patient care, for which real patient data is required.
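To make the probabilistic-model end of that spectrum concrete, here is a minimal sketch in Python. It is not how the Simulacrum or Synthea work internally, and the column names are invented; it simply fits a simple distribution to each column of a “real” table and samples new rows:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A stand-in for a real patient table (column names are invented).
real = pd.DataFrame({
    "age": rng.normal(55, 12, 500).round(),
    "sex": rng.choice(["F", "M"], 500),
    "systolic_bp": rng.normal(130, 15, 500).round(),
})

def synthesise(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Sample each column from a distribution fitted to the original.

    Columns are sampled independently, so correlations are NOT preserved;
    production generators model the joint distribution instead.
    """
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = rng.normal(df[col].mean(), df[col].std(), n).round()
        else:
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index.to_numpy(), n, p=freqs.to_numpy())
    return pd.DataFrame(out)

synthetic = synthesise(real, n=1_000)
print(synthetic.head())
```

Sampling columns independently throws away the correlations between variables, which is precisely why production-grade generators model the joint distribution. Even so, this naive version produces a table with the right schema and roughly the right marginals, which is often enough for sharing and testing analysis code.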

Testing software and algorithms

Synthetic data is also widely used for testing software and algorithms. In healthcare, this could be for testing electronic health record (EHR) systems, medical devices, or clinical decision support tools. Let’s imagine you are a company that develops software for hospitals. You need to test your software on realistic patient data, but you can’t use real patient data due to privacy concerns. Synthetic data allows you to create realistic datasets that can be used for testing without risking patient privacy. This is not limited to healthcare; in finance, synthetic data can be used to test fraud detection algorithms without exposing real customer data.
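As a sketch of what that looks like in practice, a test harness can generate batches of structurally realistic records and push them through the system’s validation layer. Everything below, from the record fields to the validate_record function, is a hypothetical stand-in rather than a real EHR API:

```python
import random
import uuid
from datetime import date, timedelta

random.seed(0)

def make_patient() -> dict:
    """One synthetic patient record: realistic shape, no real person behind it."""
    dob = date(1930, 1, 1) + timedelta(days=random.randint(0, 30_000))
    return {
        "patient_id": str(uuid.uuid4()),
        "date_of_birth": dob.isoformat(),
        "nhs_number": str(random.randint(10**9, 10**10 - 1)),  # right format, not a real number
    }

def validate_record(record: dict) -> bool:
    """Stand-in for the software's ingest validation logic."""
    return all(k in record and record[k]
               for k in ("patient_id", "date_of_birth", "nhs_number"))

# Exercise the ingest path with a large synthetic batch.
batch = [make_patient() for _ in range(1_000)]
assert all(validate_record(r) for r in batch)
```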

Training students and professionals

Synthetic data is also a valuable tool for training students and professionals. Would-be researchers are often trained through a Masters programme that requires a research project and a thesis, but this usually means access to real patient data, which can be difficult to obtain within the time constraints of the programme. Synthetic data allows students to work with realistic datasets and develop their research skills without needing real patient data. The same is true for professionals who want to upskill or reskill in data science or machine learning: synthetic data provides a safe and accessible way to practise and learn. Obtaining real data for a week-long short course is simply not feasible, but synthetic data can be generated and used for exactly this purpose.

AI and machine learning model development

Training AI and machine learning models requires large volumes of realistic data, but in healthcare, that data is often locked behind governance approvals that can take months or even years. For developers building clinical decision support tools or diagnostic algorithms, this creates a bottleneck: you cannot iterate quickly on a model if every dataset request requires a new ethics application. Synthetic data helps here by providing a realistic development environment where models can be built, debugged, and benchmarked before real patient data is ever requested.

This is especially relevant for rare disease research and clinical trial design, where small patient cohorts limit statistical power. Synthetic augmentation can make analyses viable that would otherwise be underpowered. For trial design, running simulations on synthetic populations before real-world recruitment enables researchers to test protocol variations, selection criteria, or therapeutic efficacy early and cheaply.
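One common augmentation technique, sketched below under the simplifying assumption of purely numeric features, is SMOTE-style interpolation: each synthetic record is a random point on the line between a real record and one of its nearest neighbours:

```python
import numpy as np

rng = np.random.default_rng(7)

def augment(cohort: np.ndarray, n_new: int, k: int = 3) -> np.ndarray:
    """SMOTE-style augmentation: each new record is a random point on the
    line between a real record and one of its k nearest neighbours."""
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(cohort))
        dists = np.linalg.norm(cohort - cohort[i], axis=1)
        neighbours = np.argsort(dists)[1 : k + 1]  # index 0 is the record itself
        j = rng.choice(neighbours)
        alpha = rng.random()
        new_rows.append(cohort[i] + alpha * (cohort[j] - cohort[i]))
    return np.vstack(new_rows)

# 20 real patients with 4 numeric features, expanded to 200 synthetic records.
small_cohort = rng.normal(size=(20, 4))
augmented = augment(small_cohort, n_new=200)
print(augmented.shape)  # (200, 4)
```

Whether interpolated records are clinically plausible is exactly the kind of question that needs domain review, so treat this as a sketch of the mechanics rather than a recommendation.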

In collaboration with the Clinical Practice Research Datalink (CPRD) and researchers at Brunel University, the MHRA released two synthetic datasets, one on COVID-19 symptom and risk profiles, and one on cardiovascular disease. These were designed to support development and validation of AI algorithms in medical devices, a use case where developers need realistic data but cannot access real patient records.

Federated analysis

Federated analysis is a method of analysing data across multiple institutions without sharing the underlying data. Hospital A runs an analysis on its real patient data and shares only the results with Hospital B, which runs the same analysis on its own real patient data. This allows collaboration and sharing of insights without patient data ever moving between institutions. A key problem is how to develop and share the analysis code itself: the hospitals can share code, but not the real patient data it was developed on, and that data might be subtly different across institutions. Synthetic data can be used to develop the code and methods, which can then be shared without risking patient privacy. If each institution works from datasets that mirror the structure of its own real data but are derived from the same synthetic source, researchers can adapt the code to each local schema and confirm that both sites get the same results on the synthetic data before running the code on real patient data.
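A minimal sketch of that workflow might look like the following, with an invented analysis and invented column names. Both hospitals run the shared code against the common synthetic dataset, one of them after adapting it to a different local schema, and agreement is the signal to proceed to real data:

```python
import pandas as pd

def analysis(df: pd.DataFrame) -> pd.Series:
    """The shared research code: mean systolic BP by sex (illustrative)."""
    return df.groupby("sex")["systolic_bp"].mean().round(2)

# Both hospitals start from the SAME shared synthetic dataset.
synthetic = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "systolic_bp": [128.0, 135.0, 122.0, 141.0],
})

# Hospital A's schema matches the synthetic data directly.
result_a = analysis(synthetic)

# Hospital B stores sex under a different column name, so it adapts the
# code to its local schema before reusing the shared analysis.
local_b = synthetic.rename(columns={"sex": "gender"})  # B's schema
result_b = analysis(local_b.rename(columns={"gender": "sex"}))

# Matching results on synthetic data are the green light to run on real data.
assert result_a.equals(result_b)
```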

The same logic beyond healthcare

The uses for synthetic data also extend beyond healthcare to other industries, partly because the underlying problem is the same. Real data is scarce, sensitive, and can be imbalanced and expensive to collect at scale. Added to that are the concerns around intellectual property and competitive advantage, which can make companies reluctant to share data even within the same industry. Synthetic data provides a way to overcome these challenges and enable collaboration and innovation across industries.

Finance and banking

In finance, synthetic data can be used to test fraud detection algorithms without exposing real customer data. Fraudsters are rarely constrained to a single banking institution, and fraudulent transactions are still relatively rare events compared to legitimate ones. This means banks need to share data to develop effective fraud detection algorithms, but they cannot share real customer data due to privacy concerns and worries over competitive advantage. Synthetic data allows banks to share realistic datasets for developing and testing fraud detection algorithms. It is close to the perfect use of synthetic data: to test the algorithm you need to “know” the answer (i.e. which transactions are fraudulent), but in real data you would not know that without a lengthy investigation. With synthetic data, you can generate as much data as you need, and you know exactly which transactions are fraudulent, which allows for effective testing and development.
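Here is a toy sketch of why this works so well: the fraud labels are assigned at generation time, so the precision and recall of a candidate detector can be computed immediately, with no investigation. The distributions and threshold are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10_000
is_fraud = rng.random(n) < 0.002  # fraud is rare: roughly 20 cases in 10,000

# Assumption for illustration: fraudulent amounts come from a higher-value
# distribution than legitimate ones.
amount = np.where(is_fraud, rng.lognormal(7, 1, n), rng.lognormal(4, 1, n))

# A toy detector: flag anything above a fixed threshold.
flagged = amount > 1_000

# Ground truth is known by construction, so evaluation is immediate.
true_pos = (flagged & is_fraud).sum()
precision = true_pos / max(flagged.sum(), 1)
recall = true_pos / max(is_fraud.sum(), 1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```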

Other use cases include mitigating credit scoring biases and developing open banking training sets. Software testing, as in the healthcare use case, is also relevant here: a synthetic dataset that mirrors the structure of real customer data would be useful for testing a new finance app that pulls data from multiple banks via an API, without risking exposure of real customer data.

These use cases were covered extensively in a 2024 report by the FCA, which stressed the importance of quality input data for training models, awareness of biases and inaccuracies, and careful data validation. The report noted that “close collaboration with subject matter experts is needed to balance the variety of methodological choices whilst tailoring synthetic data to the use case in question.”

Market Research

Beyond healthcare and banking, both highly regulated industries, synthetic data is also being used in market research. Companies like Kantar and Ipsos are using synthetic consumer personas to increase survey samples for hard-to-recruit populations and to run test scenarios before committing to expensive fieldwork. For example, you might want to test survey logic before funding a large survey, and synthetic data makes that possible.
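As a small sketch of the “test the logic first” idea, synthetic respondents can be run through a survey’s branching rules to check that every path terminates sensibly before any fieldwork is funded. The question IDs and persona fields here are invented:

```python
import random

random.seed(3)

def survey_flow(respondent: dict) -> list:
    """Walk one respondent through the branching logic, returning the
    sequence of question IDs they would see (IDs are invented)."""
    path = ["Q1_age"]
    if respondent["age"] < 18:
        return path + ["END_screened_out"]
    path.append("Q2_uses_product")
    path.append("Q3_satisfaction" if respondent["uses_product"] else "Q3_why_not")
    return path

# Synthetic personas covering the full range of answers.
personas = [
    {"age": random.randint(13, 80), "uses_product": random.random() < 0.4}
    for _ in range(5_000)
]

# Every persona must reach a terminal question; a failure here means the
# survey has a dead end that fieldwork would have hit at full cost.
paths = [survey_flow(p) for p in personas]
assert all(p[-1].startswith(("Q3", "END")) for p in paths)
```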

The linked report also highlights some of the issues with synthetic data, including the risk of generating data that is not representative of the real population, exaggerating biases, or producing results that lack depth and variety. This is a risk in any industry, and it has been highlighted in healthcare research as well. If you create synthetic data from source data with poor-quality input for certain populations (such as Gen Z or the elderly), the synthetic data will also be of poor quality for those populations. This risk needs to be carefully managed, and it is an active area of research in the synthetic data field.
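A simple first check for this kind of subgroup failure is to compare each population’s share and summary statistics between the real and synthetic data. A minimal sketch, with invented column names:

```python
import pandas as pd

def subgroup_report(real: pd.DataFrame, synth: pd.DataFrame,
                    group: str, value: str) -> pd.DataFrame:
    """Compare each subgroup's share and mean between real and synthetic data;
    large gaps flag populations the generator has served poorly."""
    r = real.groupby(group)[value].agg(["count", "mean"]).add_prefix("real_")
    s = synth.groupby(group)[value].agg(["count", "mean"]).add_prefix("synth_")
    report = r.join(s)
    report["share_gap"] = (report["synth_count"] / report["synth_count"].sum()
                           - report["real_count"] / report["real_count"].sum())
    return report

# Toy example: the synthetic generator has underrepresented the 70+ group
# and drifted its mean score.
real = pd.DataFrame({"age_band": ["18-30"] * 80 + ["70+"] * 20,
                     "score": [0.5] * 80 + [0.7] * 20})
synth = pd.DataFrame({"age_band": ["18-30"] * 95 + ["70+"] * 5,
                      "score": [0.5] * 95 + [0.9] * 5})
print(subgroup_report(real, synth, "age_band", "score"))
```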

Autonomous vehicles

Synthetic data is also being used in the development of autonomous vehicles. Training self-driving cars requires large volumes of realistic data, but collecting real-world driving data is expensive and time-consuming, and you need coverage of a wide variety of driving conditions that is difficult to achieve on real roads. Maybe you want the car to handle a difficult intersection combined with a jaywalking pedestrian or a bicycle coming out of a side street; you cannot wait around for that to happen in the real world, but you can generate the scenario synthetically. The same goes for weather: you might want to test how the car performs at that same intersection in heavy rain, light snow, or fog, which is hard to collect in the real world in a timely manner, especially when the only variable you want to change is the weather. As a side note, this might be particularly tricky in sunny California, where a lot of these companies are based!
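In practice this is usually done in simulation, where a scenario is just configuration. Here is a sketch of the idea; the parameters are invented, and in a real pipeline each scenario would be handed to a simulator:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Scenario:
    intersection: str
    hazard: str
    weather: str

# Conditions that would be slow, expensive, or unsafe to collect on real roads.
intersections = ["four_way_unprotected", "roundabout"]
hazards = ["jaywalking_pedestrian", "bicycle_from_side_street", "none"]
weathers = ["clear", "heavy_rain", "light_snow", "fog"]

# Every combination of the same junction under every hazard and weather,
# holding everything else constant.
grid = [Scenario(i, h, w) for i, h, w in product(intersections, hazards, weathers)]
print(len(grid))  # 24 scenario variants; each would parameterise a simulator run
```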

Institutional investment is accelerating

The level of institutional interest is now substantial and accelerating. In April 2025, the UK Government and Wellcome Trust jointly announced up to £600 million to establish a new Health Data Research Service (HDRS), a secure single-access gateway to NHS datasets designed to streamline access for approved researchers. In November 2025, the UK Government confirmed that the HDRS will incorporate synthetic data techniques, allowing researchers to work with data that mirrors real patterns without containing individual patient information. Alongside the MHRA datasets, this paints a picture of institutions increasingly treating synthetic data as a practical tool for enabling research. Meanwhile, the US FDA’s Center for Devices and Radiological Health (CDRH) has active research programmes exploring synthetic data for medical AI validation, while the NIH’s Data Science Strategic Plan (2025 to 2030) explicitly calls for research into developing, validating, and using synthetic clinical datasets for AI training and applications.

On the private side, Nvidia acquired synthetic data company Gretel in March 2025 for over $320 million, while in October 2025, KPMG acquired YData to accelerate its AI strategy and build a Synthetic Data Centre of Excellence. Together, these are clear signals that major players and governments see synthetic data generation as strategically important infrastructure rather than a niche research tool.

Conclusion

The direction of travel is clear. Synthetic data will not replace real patient data, and the utility-privacy tradeoff in high-fidelity generation is still not fully resolved. But the infrastructure, the investment, and the institutional commitment are all moving in the same direction. What is less clear is how we should be evaluating whether synthetic datasets are actually good enough for the purposes they are being used for. How do we measure representativeness? How do we know when a synthetic dataset has preserved enough of the statistical structure of the original to be useful, without preserving so much that it becomes a privacy risk? These are the questions I will be tackling in the next post in this series.

If you are new to this series, you can catch up on my previous posts on what synthetic data is, GANs, and multiple imputation approaches.

Related Posts

Multiple Imputation and Perturbation: Why They're Not Built for Synthetic Data

This blog explores why multiple imputation and perturbation are not suitable for generating synthetic data.

An Introduction to Electronic Health Records

A quick primer on what an electronic health record is and how it is used in clinical practice and research. This post is UK-focused but the principles are the same in many other countries.

Women in Rust 2025

Celebrating another wonderful year of women making strides in the Rust programming community.
