Synthetic Data: The Complete Series

Author

Caroline Morton

Date

April 9, 2026

I have spent the last year thinking and writing about synthetic data. The promise is straightforward: generate data that looks and behaves like the real thing, without exposing anything confidential. I write from the perspective of someone working in epidemiology and healthcare research, but the principles extend well beyond this area. Synthetic data is being used across finance, pharma, university research centres, and large corporate enterprises. What I have found drives adoption in almost every case is the same tension: organisations want more data to work with, but the real data is sensitive. In healthcare and finance that sensitivity is about protecting individuals from being identified; after all, what could be more sensitive than your health record or bank statements? In commercial settings it is about protecting data that would be valuable to competitors.

The posts are grouped into sections and listed in a sensible reading order, especially if you are new to the topic. Each post also stands alone and can be read independently. The series is ongoing, and I will update this page as new posts are published.

Foundations

We start with an introduction to synthetic data, some of the main methods of generation at a high level, and the applications that are driving the demand for synthetic data.

Methods of generation

These posts cover the methods of generating synthetic data. Each post describes a different method, its advantages and disadvantages, and its potential use cases and applications.

Applications

Here we cover the real-world use cases for synthetic data.

Evaluation

In these posts we cover some of the evaluation metrics that matter in synthetic data generation with regard to privacy, representativeness, and utility.

  • Representativeness in Synthetic Data: What It Means and How to Measure It - What it means for synthetic data to be representative, the four dimensions that matter, and why optimising for representativeness trades off against privacy.

  • Is your Synthetic Data actually private? - How to think about privacy in synthetic data, what the risks are, and which metrics we can use to measure them.

  • How Do We Measure Utility? (coming soon) - Metrics and frameworks for evaluating whether a synthetic dataset is good enough for a given research question.

  • The Privacy-Utility Tradeoff (coming soon) - The fundamental tension at the heart of synthetic data: the more realistic the data, the greater the privacy risk.

Get in touch

This series is ongoing. If you work with synthetic data or are considering it for your research, I would be glad to hear from you.

