Caroline Morton

Synthetic Data: The Complete Series

Thu, 09 Apr 2026 00:00:00 +0100

I have spent the last year thinking and writing about synthetic data. The promise is straightforward: generate data that looks and behaves like the real thing, without exposing anything confidential. I write from the perspective of someone working in epidemiology and healthcare research, but the principles extend well beyond this area. Synthetic data is being used across finance, pharma, university research centres, and large corporate enterprises. What I have found drives adoption in almost every case is the same tension: organisations want more data to work with, but the real data is sensitive. In healthcare and finance that sensitivity is about protecting individuals from being identified; after all, what could be more sensitive than your health record or bank statements? In commercial settings it is about protecting data that is valuable to competitors.

Representativeness in Synthetic Data: What It Means and How to Measure It

Tue, 07 Apr 2026 00:00:00 +0100

In my article, introduction to synthetic data, I described how GAN-generated records had been used to predict patient length of stay in German hospitals. It’s a compelling example of how the synthetic data we generate can serve as a credible proxy for real patient records. But how do we know that the synthetic datasets we generate actually represent the patient population they are modelling?

Consider the German study above. If the data subtly underrepresented elderly patients with multiple comorbidities (a group who typically account for the longest and most resource-intensive stays), the model could look quite accurate on paper, while producing systematically skewed predictions in practice. The difference between ‘it looks like real data to some degree’ and ‘this actually represents the population’ is called representativeness, and is what this article is all about.
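
As a rough illustration of the kind of check this implies, the sketch below compares the share of one subgroup in a real and a synthetic dataset. Everything in it is an assumption made for illustration: the `Record` type, the thresholds (aged 75 or over, two or more comorbidities) and the toy records are hypothetical, and a single subgroup share is only one of many possible ways to look at representativeness.

```rust
/// Hypothetical minimal record: only the two fields this check needs.
struct Record {
    age: u8,
    comorbidities: u8,
}

/// Share of records for patients aged 75+ with two or more comorbidities.
/// Both thresholds are illustrative assumptions.
fn elderly_multimorbid_share(records: &[Record]) -> f64 {
    if records.is_empty() {
        return 0.0;
    }
    let n = records
        .iter()
        .filter(|r| r.age >= 75 && r.comorbidities >= 2)
        .count();
    n as f64 / records.len() as f64
}

/// Absolute gap between the subgroup's share in the real and synthetic data.
fn subgroup_gap(real: &[Record], synthetic: &[Record]) -> f64 {
    (elderly_multimorbid_share(real) - elderly_multimorbid_share(synthetic)).abs()
}

fn main() {
    // Toy records, invented purely to exercise the functions.
    let real = vec![
        Record { age: 82, comorbidities: 3 },
        Record { age: 79, comorbidities: 2 },
        Record { age: 45, comorbidities: 0 },
        Record { age: 60, comorbidities: 1 },
    ];
    let synthetic = vec![
        Record { age: 81, comorbidities: 2 },
        Record { age: 50, comorbidities: 0 },
        Record { age: 38, comorbidities: 1 },
        Record { age: 66, comorbidities: 1 },
    ];
    println!("gap in subgroup share: {:.2}", subgroup_gap(&real, &synthetic));
}
```

A large gap for a clinically important subgroup is a warning sign even when the overall distributions look similar.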

Why Rust for Data-Intensive Applications

Wed, 01 Apr 2026 00:00:00 +0100

This is a prologue to a series on Rust for data-intensive research applications - written after the first three parts, which is perhaps the wrong order, but reflects how the thinking actually developed. I wanted to write something that introduces the series as a whole, explains the motivation behind it and, I hope, is accessible to researchers who may not be familiar with Rust. It is also, in the spirit of Austin Kleon’s Show Your Work!, an attempt to share the process of thinking rather than just the conclusions.

Your Errors Are Data Too

Mon, 23 Mar 2026 00:00:00 +0000

This is the third post in the Rust for Data-Intensive Applications series. The Serde post covered moving the validation boundary to the point of ingestion. The newtypes post covered encoding domain knowledge in types so the compiler enforces it. This post is about what happens when things go wrong at either of those boundaries, and why capturing that information carefully is as important as the valid records themselves.
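
To make the idea concrete, here is a minimal standard-library sketch of an ingestion step that keeps rejected rows next to the valid ones instead of discarding them. The `Patient` and `Rejected` types, the `id,age` line format and the function names are all hypothetical; this is not code from the post.

```rust
/// Hypothetical validated record.
#[derive(Debug, PartialEq)]
struct Patient {
    id: u32,
    age: u8,
}

/// Hypothetical rejection record: keeps the line number, the raw input and the reason.
#[derive(Debug, PartialEq)]
struct Rejected {
    line: usize,
    raw: String,
    reason: String,
}

/// Parse one "id,age" line into a Patient, or explain why it failed.
fn parse_line(line: usize, raw: &str) -> Result<Patient, Rejected> {
    let reject = |reason: String| Rejected {
        line,
        raw: raw.to_string(),
        reason,
    };
    let (id, age) = raw
        .split_once(',')
        .ok_or_else(|| reject("expected two comma-separated fields".to_string()))?;
    let id = id
        .trim()
        .parse::<u32>()
        .map_err(|e| reject(format!("bad id: {e}")))?;
    let age = age
        .trim()
        .parse::<u8>()
        .map_err(|e| reject(format!("bad age: {e}")))?;
    Ok(Patient { id, age })
}

/// Split the input into valid records and rejections; no line is silently dropped.
fn ingest(input: &str) -> (Vec<Patient>, Vec<Rejected>) {
    let mut valid = Vec::new();
    let mut rejected = Vec::new();
    for (i, raw) in input.lines().enumerate() {
        match parse_line(i + 1, raw) {
            Ok(patient) => valid.push(patient),
            Err(rejection) => rejected.push(rejection),
        }
    }
    (valid, rejected)
}

fn main() {
    let (valid, rejected) = ingest("1,34\n2,abc\n3,300\n4,71");
    println!("{} valid, {} rejected", valid.len(), rejected.len());
    for r in &rejected {
        println!("line {}: {:?} ({})", r.line, r.raw, r.reason);
    }
}
```

Because the rejections are ordinary values, they can be counted, grouped by reason and reported alongside the results rather than disappearing into a log.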

Errors in research code are different

In application development, an error means something went wrong that needs fixing. Your code threw an exception, your service returned a 500, your database query failed. The goal is to find the error, understand it, and eliminate it.

Why Use Newtypes? Encoding Domain Knowledge in the Type System

Mon, 16 Mar 2026 00:00:00 +0000

I have spent a lot of time debugging research pipelines, and the bugs that scare me most are not the ones that crash loudly. They are the ones that produce plausible-looking output and let you carry on for weeks before anyone notices something is wrong. Wrong-order bugs are the worst offender. You have a function that takes several parameters of the same primitive type, the caller passes them in the wrong order, the compiler says nothing, and the output sits just inside the range of values you would expect to see in real data. By the time you find it, if you find it, you have already done a lot of work on incorrect results.
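
A minimal sketch of the wrong-order problem and the newtype fix follows. The example (mean arterial pressure estimated from systolic and diastolic readings) and the names `SystolicMmHg`, `DiastolicMmHg`, `map_raw` and `map_typed` are my own illustrative assumptions, not taken from the post.

```rust
/// Hypothetical newtypes: both wrap an f64, but the compiler treats them as distinct types.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SystolicMmHg(f64);

#[derive(Debug, Clone, Copy, PartialEq)]
struct DiastolicMmHg(f64);

/// Primitive version: both parameters are f64, so swapped arguments still compile.
/// Uses the common approximation: diastolic plus one third of the pulse pressure.
fn map_raw(systolic: f64, diastolic: f64) -> f64 {
    diastolic + (systolic - diastolic) / 3.0
}

/// Newtype version: passing the arguments in the wrong order is a type error.
fn map_typed(systolic: SystolicMmHg, diastolic: DiastolicMmHg) -> f64 {
    diastolic.0 + (systolic.0 - diastolic.0) / 3.0
}

fn main() {
    let correct = map_raw(120.0, 80.0);
    // Swapped arguments: compiles, and the result still looks like a blood pressure.
    let swapped = map_raw(80.0, 120.0);
    println!("correct: {correct:.1}, swapped: {swapped:.1}");

    let typed = map_typed(SystolicMmHg(120.0), DiastolicMmHg(80.0));
    println!("typed: {typed:.1}");
    // The next line does not compile (mismatched types), which is the point:
    // map_typed(DiastolicMmHg(80.0), SystolicMmHg(120.0));
}
```

With primitives, the swapped call returns a value in a believable range, so nothing downstream flags it. With newtypes, the same mistake never gets past the compiler.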

Serde Rust: Data Serialisation for Data Scientists

Sun, 08 Mar 2026 00:00:00 +0000

I have a confession to make: I love Serde.

Serde, for those not in the know, is the Rust ecosystem’s workhorse for serialisation and deserialisation, but for data pipelines I find it more helpful to think of it as something slightly different: a schema enforcement mechanism and a validation boundary.

That distinction matters. In many data pipelines, validation is treated as something that happens later. You ingest the data, clean it, transform it, and only then check whether it actually matches the assumptions your code is making. By the time you discover a problem, you may already have done quite a lot of work on data that is not what you thought it was.

How Synthetic Data Is Used in Healthcare, Research and Beyond

Mon, 02 Mar 2026 00:00:00 +0000

In a previous post, I introduced what synthetic data is and why it is generating so much interest in healthcare. In this article I want to talk about the practical use cases, both in healthcare and in other industries, to show how synthetic data is being used in the real world. This is not an exhaustive list, but my aim is to give a sense of the breadth of applications. I often get asked why I choose to focus on this area, and my hope is that this article will show why it is such an exciting space to be working in.

Accidental Functional Programming in Rust (From an Epidemiologist's Perspective)

Mon, 09 Feb 2026 00:00:00 +0000

This post started as a talk I gave at Lambda World 2025. It’s the kind of conference where you end up in a two-hour conversation about Scala data streams despite not knowing Scala. If you’d rather watch than read, the video is embedded below.

Side note: ignore the worryingly smiley face in the thumbnail. Someone at the conference decided to run my headshot through an AI filter to increase its pixels and I ended up looking like a Pixar character. I don’t know why they thought that was a good idea, but here we are!

How to Create a Codelist

Mon, 02 Feb 2026 00:00:00 +0000

This blog post is the second part of a two-part series that accompanies a lecture I am giving on Codelists as part of the Health Data in Practice MSc at Queen Mary University of London.

In the previous blog, we looked at what a codelist is and why they are both important and difficult to create. In this post, we are going to look at some of the methods for creating codelists for your study.

What is a Codelist?

Sun, 01 Feb 2026 00:00:00 +0000

This blog post accompanies a lecture that I am giving on Codelists as part of the Health Data in Practice MSc at Queen Mary University of London.

In this post, I am going to introduce what a codelist is, why they are needed and how they are used in health data research. There will be a follow-up post where I will walk through some of the ways that codelists can be created or sourced.

Error Handling in Rust: anyhow and thiserror

Sun, 25 Jan 2026 00:00:00 +0000

This is part two of my error handling series for Women in Rust (here’s part one). It stands alone as a reference, but we’ll build on examples from the previous post.

Last time, we hit a wall: the ? operator made our code clean but stripped out our context. This post fixes that with two crates - anyhow for quick, contextual errors, and thiserror for when you need callers to handle errors differently.
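
The sketch below shows the problem using only the standard library, so it is a stand-in for what anyhow or thiserror would provide rather than a use of either crate. The `ConfigError` type and the `rows`/`cols` fields are hypothetical names chosen for illustration.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error type that records which field failed as well as why.
#[derive(Debug)]
enum ConfigError {
    BadField {
        field: &'static str,
        source: ParseIntError,
    },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::BadField { field, source } => {
                write!(f, "invalid value for `{field}`: {source}")
            }
        }
    }
}

impl std::error::Error for ConfigError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            ConfigError::BadField { source, .. } => Some(source),
        }
    }
}

/// Bare `?`: the caller learns that an integer failed to parse, but not which one.
fn total_bare(rows: &str, cols: &str) -> Result<u32, ParseIntError> {
    Ok(rows.parse::<u32>()? * cols.parse::<u32>()?)
}

/// `map_err` before `?`: the field name travels with the error.
fn total_with_context(rows: &str, cols: &str) -> Result<u32, ConfigError> {
    let rows = rows
        .parse::<u32>()
        .map_err(|source| ConfigError::BadField { field: "rows", source })?;
    let cols = cols
        .parse::<u32>()
        .map_err(|source| ConfigError::BadField { field: "cols", source })?;
    Ok(rows * cols)
}

fn main() {
    if let Err(e) = total_bare("10", "x") {
        println!("bare: {e}");
    }
    if let Err(e) = total_with_context("10", "x") {
        println!("with context: {e}");
    }
}
```

Writing the `Display` and `Error` implementations by hand is the boilerplate that the two crates are designed to remove.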

Error Handling in Rust: Fundamentals

Mon, 19 Jan 2026 00:00:00 +0000

Rust’s approach to error handling is one of its most distinctive features, and one of the most confusing for newcomers. In this first of two posts, we’ll cover the core concepts: recoverable vs unrecoverable errors, when to panic, and how to propagate errors effectively. This series accompanies my Women in Rust talk on the topic, but stands alone as a reference.

Before we get into the mechanics, we need to understand the fundamental distinction that shapes all error handling in Rust.
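
As a small preview of that distinction, the sketch below puts the two styles side by side: a `Result` for a failure the caller can reasonably handle, and a panic for a situation that can only arise from a bug. The function names and the decision about which case is which are illustrative assumptions on my part.

```rust
/// Recoverable: a malformed value is an expected outcome, so return a Result
/// and let the caller decide what to do.
fn parse_age(raw: &str) -> Result<u8, String> {
    raw.trim()
        .parse::<u8>()
        .map_err(|e| format!("cannot parse age {raw:?}: {e}"))
}

/// Unrecoverable (by assumption): an empty slice here would mean a bug in the
/// calling code, so the function panics instead of returning an error.
fn mean_age(ages: &[u8]) -> f64 {
    assert!(!ages.is_empty(), "mean_age called with no ages");
    let total: u32 = ages.iter().map(|&a| u32::from(a)).sum();
    f64::from(total) / ages.len() as f64
}

fn main() {
    // The caller handles the recoverable case explicitly.
    match parse_age("forty") {
        Ok(age) => println!("age: {age}"),
        Err(e) => println!("skipping record: {e}"),
    }
    println!("mean: {}", mean_age(&[30, 40, 50]));
}
```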

Women in Rust 2025

Fri, 19 Dec 2025 00:00:00 +0000

What a fantastic year 2025 has been for the Women in Rust community!

Thank you to everyone who has contributed to making this year so special. This was our first full year of events after launching in Spring 2024, and the enthusiasm and engagement from the amazing women in our community has been truly inspiring. I am going to take us back through the year to highlight some of the wonderful moments we shared together. We had 18 events this year with a mix of online and in-person formats. Here are some of the highlights:

Multiple Imputation and Perturbation: Why They're Not Built for Synthetic Data

Thu, 18 Dec 2025 00:00:00 +0000

Synthetic data is often described as a solution to the limited data problem, and as we have discussed in previous blogs on synthetic data, it can be a powerful tool for creating larger datasets that model real data while protecting privacy. When it comes to generating synthetic data, there are many methods available, each with its own strengths and weaknesses. Some of these methods are designed specifically for generating synthetic data, while others are not.

What are GANs and how can they generate synthetic data?

Mon, 24 Nov 2025 00:00:00 +0000

This blog is going to introduce the concept of Generative Adversarial Networks (GANs) and explore how they can be used to generate synthetic data. This is particularly important in the case of generating synthetic healthcare data as many of the published research papers out there focus on this application.

What are GANs?

Generative Adversarial Networks (GANs) are a class of machine learning models that consist of two neural networks competing against each other. The first network, called the generator, creates synthetic data samples, while the second network, called the discriminator, evaluates whether the samples are real (from the training data) or fake (produced by the generator). This competition is inherently adversarial, hence the name. The generator aims to produce data that is indistinguishable from real data, while the discriminator strives to become better at identifying fake data. Through this process, both networks improve over time, leading to the generation of highly realistic synthetic data. Training continues until the generator produces synthetic healthcare data so convincing that the discriminator cannot reliably identify it as artificial. The result is a trained model capable of generating unlimited synthetic patient records that maintain the statistical properties, correlations, and clinical patterns found in real datasets while containing no actual patient information.

Clinic to Code to Care

Sun, 26 Oct 2025 00:00:00 +0100

This blog is an adaptation of a talk that Steph Jones and I gave at Women in Data and AI in October 2025. It explores the journey of information from a patient in clinic to how that information is coded for research and ultimately ends up informing statistical and machine learning models that can help improve patient care. I hope it provides a useful overview of the process and highlights some of the challenges and opportunities along the way.

What is Synthetic Data and Why Does it Matter?

Sat, 11 Oct 2025 00:00:00 +0100

Healthcare research is facing a fundamental paradox. The demand for comprehensive datasets to drive medical innovation has never been greater, yet access to real patient data remains severely restricted by privacy laws and ethical constraints. In this blog post, I will explore how synthetic data is emerging as a powerful solution to this challenge, enabling researchers to access high-quality datasets without compromising patient privacy.

Why the Adapter Pattern is King in Health Data

Sun, 09 Mar 2025 00:00:00 +0000

Healthcare data is messy. If you’ve ever worked in a clinical setting, you know the frustration of logging into multiple systems just to piece together a patient’s history. Pharmacy records don’t sync with GP notes, hospital systems don’t talk to each other, and critical information gets lost in the gaps. In the worst cases, paper documentation is still part of the process.

Recently, a close friend went to a midwife appointment, only to be offered a whooping cough and flu vaccine she had already received two weeks earlier at her GP. The midwife’s system didn’t communicate with the GP’s records. If she hadn’t remembered getting the vaccines, she might have been given an unnecessary second dose. This kind of duplication doesn’t just waste resources; it can lead to medical errors. And this is 2025.

Finding Similarity with Vector Search: A Beginner's Guide

Fri, 10 Jan 2025 00:00:00 +0000

Have you ever wondered how to find someone or something that’s most like you, whether it’s a roommate, someone who shares your Christmas traditions, or even a celebrity? Vector search is the answer. It’s a modern way to find matches based on multiple preferences at once, and tools like SurrealDB make it incredibly easy to use. Let’s explore what vector search is and how it works, step by step.

What is Vector Search?

Vector Search is a method to find the most similar items in a dataset by considering multiple dimensions (or preferences) simultaneously. Unlike traditional database queries, which filter data based on specific conditions, vector search calculates the “distance” between items in a multi-dimensional space to find the closest match. The easiest way to explain this is with an example.
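
Before the worked example, here is a minimal brute-force sketch of the idea: each item is a vector of preference scores, and the best match is the one with the smallest Euclidean distance to the query. It is not how SurrealDB implements vector search; the function names, the candidate names and the three preference dimensions are hypothetical.

```rust
/// Euclidean distance between two equal-length preference vectors.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Return the name of the candidate closest to `query`, or None if there are no candidates.
fn nearest<'a>(query: &[f64], candidates: &'a [(&'a str, Vec<f64>)]) -> Option<&'a str> {
    candidates
        .iter()
        .min_by(|(_, a), (_, b)| euclidean(query, a).total_cmp(&euclidean(query, b)))
        .map(|(name, _)| *name)
}

fn main() {
    // Hypothetical dimensions: [tidiness, night owl, sociability], each scored 0 to 5.
    let candidates = vec![
        ("Alex", vec![1.0, 5.0, 3.0]),
        ("Sam", vec![4.0, 2.0, 4.0]),
    ];
    let me = [4.0, 1.0, 5.0];
    match nearest(&me, &candidates) {
        Some(name) => println!("closest match: {name}"),
        None => println!("no candidates"),
    }
}
```

Comparing the query against every candidate is fine for a handful of items; dedicated vector search tools exist because this does not scale to large datasets.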

Code Review for Research Code

Mon, 23 Sep 2024 00:00:00 +0100

This blog is a written version of a talk I have given a few times now on how to conduct a code review for research code. I am writing it up to provide a reference for those who have attended the talk and for those who are interested in learning more about code review for research code.

This guide is intended to be a high-level overview of what a code review is, why you should do it, and how to do it. It does require some knowledge of GitHub, but I will try to explain things as I go along. If you have any questions, please feel free to ask me.

SNOMED and friends

Sat, 31 Aug 2024 00:00:00 +0100

This blog is a short introduction to SNOMED CT from my perspective as someone who has interacted with SNOMED in a variety of different ways. I am going to try to reflect a variety of points of view from my time as a clinician, an epidemiologist using electronic health records for research and a software engineer trying to create realistic synthetic data.

What is SNOMED CT?

SNOMED CT stands for Systematized Nomenclature of Medicine - Clinical Terms, and it is designed to be a comprehensive multi-lingual set of clinical healthcare terminology that can be used to record and exchange clinical health information.

An Introduction to Electronic Health Records

Thu, 29 Aug 2024 00:00:00 +0100

A question that I get asked a lot is “What is an Electronic Health Record (EHR)?” This is a great question, and I hope to answer it in this blog post. This will be UK-focussed, as that is where I am based and have experience, but the principles are the same in other countries, even if the systems are slightly different. This is a very high-level overview and I will go into more detail in future posts. The aim is to provide a basic understanding of what an EHR is and how it is used in research and clinical practice.