Synthetic Data: The Complete Series

Author

Caroline Morton

Date

April 9, 2026

I have spent the last year thinking and writing about synthetic data. The promise is straightforward: generate data that looks and behaves like the real thing, without exposing anything confidential. I write from the perspective of someone working in epidemiology and healthcare research, but the principles extend well beyond this area. Synthetic data is being used across finance, pharma, university research centres, and large corporate enterprises. What I have found drives adoption in almost every case is the same tension: organisations want more data to work with, but the real data is sensitive. In healthcare and finance, that sensitivity is about protecting individuals from being identified; after all, what could be more sensitive than a health record or a bank statement? In commercial settings, it is about protecting data that is valuable to competitors.

The posts are grouped into sections that suggest a sensible reading order, especially if you are new to the topic. Each post also stands alone and can be read independently. The series is ongoing, and I will update this page as new posts are published.

Foundations

We start with an introduction to synthetic data, some of the main methods of generation at a high level, and the applications that are driving the demand for synthetic data.

Methods of generation

These posts deal with the methods of generating synthetic data. Each post describes a different method, its advantages and disadvantages, and its potential use cases and applications.
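Before reading about any particular method, it can help to see what "generating" synthetic data means in the simplest possible case. The sketch below is not a method from the posts in this series; it is a deliberately naive baseline that fits each column's mean and standard deviation and samples each column independently. It preserves the marginal distributions but throws away the correlations between columns, which is one reason more sophisticated generators exist.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy "real" dataset: 500 patients with two numeric columns
# (age and systolic blood pressure; entirely made up for illustration).
real = np.column_stack([
    rng.normal(55, 12, 500),   # age
    rng.normal(130, 15, 500),  # systolic blood pressure
])

# Naive generator: fit each column's mean and standard deviation,
# then sample each column independently. Marginals are preserved;
# any relationship between the columns is lost.
means = real.mean(axis=0)
stds = real.std(axis=0)
synthetic = rng.normal(means, stds, size=real.shape)

print("real means:     ", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

Even this toy version surfaces the questions the rest of the series is about: the synthetic records expose no individual, but how much of the real data's structure do they actually retain?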

Applications

Here we cover the real-world use cases for synthetic data.

  • How Synthetic Data Is Used in Healthcare, Research, and Beyond - Use cases across healthcare, finance, autonomous vehicles, and software testing, plus the growing institutional investment in synthetic data infrastructure.

  • Open Science and the Case for Synthetic Data (coming soon) - Why synthetic data has great potential for helping us do better science in a more open way.

Evaluation

In these posts we cover some of the evaluation metrics that matter in synthetic data generation with regard to privacy, representativeness, and utility.

  • Representativeness in Synthetic Data: What It Means and How to Measure It - What it means for synthetic data to be representative, the four dimensions that matter, and why optimising for representativeness trades off against privacy.

  • How Do We Measure Utility? (coming soon) - Metrics and frameworks for evaluating whether a synthetic dataset is good enough for a given research question.

  • The Privacy-Utility Tradeoff (coming soon) - The fundamental tension at the heart of synthetic data: the more realistic the data, the greater the privacy risk.
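The tension the tradeoff post describes can be made concrete with a toy sketch. None of this is a method from the series: the "generator" below simply copies real records and perturbs them with Gaussian noise, the utility proxy is how well a between-column correlation survives, and the privacy proxy is how far each synthetic record has moved from the real record it came from. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" dataset: two correlated columns (think age and blood pressure).
n = 1_000
age = rng.normal(50, 10, n)
bp = 0.8 * age + rng.normal(0, 5, n)
real = np.column_stack([age, bp])

def make_synthetic(real, noise_scale, rng):
    """Naive generator: copy each real record and add Gaussian noise.
    More noise makes records harder to link back, but statistics drift."""
    return real + rng.normal(0, noise_scale, real.shape)

for noise in [0.1, 5.0, 25.0]:
    synth = make_synthetic(real, noise, rng)
    # Utility proxy: how well the real correlation is preserved.
    util = np.corrcoef(synth[:, 0], synth[:, 1])[0, 1]
    # Privacy proxy: mean distance from each synthetic record to its source.
    dist = np.mean(np.linalg.norm(synth - real, axis=1))
    print(f"noise={noise:>5}: correlation={util:.3f}, mean distance={dist:.2f}")
```

Running this shows the tradeoff directly: as the noise grows, the distance from the source records rises (better for privacy) while the preserved correlation collapses (worse for utility). Real evaluation metrics are more sophisticated, but they are wrestling with this same curve.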

Get in touch

This series is ongoing. If you work with synthetic data or are considering it for your research, I would be glad to hear from you.


Related Posts


What are GANs and how can they generate synthetic data?

This blog explores Generative Adversarial Networks (GANs) and how they can be used to generate synthetic healthcare data.


Why Rust for Data-Intensive Applications

Explores why Rust matters for research data pipelines - not for performance, but for correctness. Learn how Rust's type system prevents data failures.


Your Errors Are Data Too

How Rust's error handling patterns let you treat errors as structured observations about your data - capturing context, categorising failures, and producing data quality reports as first-class pipeline outputs.
