Synthetic Data: The Complete Series

Author

Caroline Morton

Date

April 9, 2026

I have spent the last year thinking and writing about synthetic data. The promise is straightforward: generate data that looks and behaves like the real thing, without exposing anything confidential. I write from the perspective of someone working in epidemiology and healthcare research, but the principles extend well beyond this area. Synthetic data is being used across finance, pharma, university research centres, and large corporate enterprises. What I have found drives adoption in almost every case is the same tension: organisations want more data to work with, but the real data is sensitive. In healthcare and finance, that sensitivity is about protecting individuals from being identified; after all, what could be more sensitive than a health record or a bank statement? In commercial settings, it is about protecting data that is valuable to competitors.

The posts are grouped into sections that suggest a sensible reading order, especially if you are new to the topic. Each post also stands alone and can be read independently. The series is ongoing, and I will update this page as new posts are published.

Foundations

We start with an introduction to synthetic data, some of the main methods of generation at a high level, and the applications that are driving the demand for synthetic data.

Methods of generation

These posts deal with the methods of generating synthetic data. Each post describes a different method, its advantages and disadvantages, and its potential use cases and applications.
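Before reading about any particular method, it can help to see what "generating" synthetic data means in the simplest possible case. The sketch below is not a method from the posts in this series; it is a deliberately naive baseline that fits each column's mean and standard deviation and samples each column independently. It preserves the marginal distributions but throws away the correlations between columns, which is one reason more sophisticated generators exist.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy "real" dataset: 500 patients with two numeric columns
# (age and systolic blood pressure; entirely made up for illustration).
real = np.column_stack([
    rng.normal(55, 12, 500),   # age
    rng.normal(130, 15, 500),  # systolic blood pressure
])

# Naive generator: fit each column's mean and standard deviation,
# then sample each column independently. Marginals are preserved;
# any relationship between the columns is lost.
means = real.mean(axis=0)
stds = real.std(axis=0)
synthetic = rng.normal(means, stds, size=real.shape)

print("real means:     ", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

Even this toy version surfaces the questions the rest of the series is about: the synthetic records expose no individual, but how much of the real data's structure do they actually retain?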

Applications

Here we cover the real-world use cases for synthetic data.

  • How Synthetic Data Is Used in Healthcare, Research, and Beyond - Use cases across healthcare, finance, autonomous vehicles, and software testing, plus the growing institutional investment in synthetic data infrastructure.

  • Open Science and the Case for Synthetic Data (coming soon) - Why synthetic data has great potential for helping us do better science in a more open way.

Evaluation

In these posts we cover some of the evaluation metrics that matter in synthetic data generation with regard to privacy, representativeness, and utility.

  • Representativeness in Synthetic Data: What It Means and How to Measure It - What it means for synthetic data to be representative, the four dimensions that matter, and why optimising for representativeness trades off against privacy.

  • How Do We Measure Utility? (coming soon) - Metrics and frameworks for evaluating whether a synthetic dataset is good enough for a given research question.

  • The Privacy-Utility Tradeoff (coming soon) - The fundamental tension at the heart of synthetic data: the more realistic the data, the greater the privacy risk.
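The tension the tradeoff post describes can be made concrete with a toy sketch. None of this is a method from the series: the "generator" below simply copies real records and perturbs them with Gaussian noise, the utility proxy is how well a between-column correlation survives, and the privacy proxy is how far each synthetic record has moved from the real record it came from. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" dataset: two correlated columns (think age and blood pressure).
n = 1_000
age = rng.normal(50, 10, n)
bp = 0.8 * age + rng.normal(0, 5, n)
real = np.column_stack([age, bp])

def make_synthetic(real, noise_scale, rng):
    """Naive generator: copy each real record and add Gaussian noise.
    More noise makes records harder to link back, but statistics drift."""
    return real + rng.normal(0, noise_scale, real.shape)

for noise in [0.1, 5.0, 25.0]:
    synth = make_synthetic(real, noise, rng)
    # Utility proxy: how well the real correlation is preserved.
    util = np.corrcoef(synth[:, 0], synth[:, 1])[0, 1]
    # Privacy proxy: mean distance from each synthetic record to its source.
    dist = np.mean(np.linalg.norm(synth - real, axis=1))
    print(f"noise={noise:>5}: correlation={util:.3f}, mean distance={dist:.2f}")
```

Running this shows the tradeoff directly: as the noise grows, the distance from the source records rises (better for privacy) while the preserved correlation collapses (worse for utility). Real evaluation metrics are more sophisticated, but they are wrestling with this same curve.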

Get in touch

This series is ongoing. If you work with synthetic data or are considering it for your research, I would be glad to hear from you.


Related Posts


What are GANs and how can they generate synthetic data?

This blog explores Generative Adversarial Networks (GANs) and how they can be used to generate synthetic healthcare data.


Why Rust for Data-Intensive Applications

Explores why Rust matters for research data pipelines - not for performance, but for correctness. Learn how Rust's type system prevents data failures.


Your Errors Are Data Too

How Rust's error handling patterns let you treat errors as structured observations about your data - capturing context, categorising failures, and producing data quality reports as first-class pipeline outputs.
