Why Synthetic Data is Good for Open Science

I regularly give talks on reproducibility and open science at conferences and workshops, and it is one of my most requested topics. The statistics about reproducibility are pretty shocking, with some papers reporting that up to 74% of studies cannot be reproduced. This is obviously a huge problem for science and something we need to address. I have thought a lot about the barriers and blockers to reproducibility, and a key one is that researchers often do not share their code, which makes it hard to truly understand what they did and to check it. When I ask researchers why they don’t share their code, the most common answer I get is that they cannot share their data, so sharing the code feels pointless.

Recently I have been thinking a lot about how synthetic data might be a solution to this problem. It can provide a way for researchers to share their code and still allow others to run it and check it, even if they cannot share the real data. In this blog post, I want to explore this idea in more detail and discuss how synthetic data can be a bridge between the need for reproducibility and the need to protect sensitive data.

What is Synthetic Data?

Synthetic data is artificially generated data that mimics the statistical properties of real data. It can be created using a variety of methods, including machine learning algorithms, complex statistical models, or even simple random sampling techniques. I have written about synthetic data before, so if you want to learn more, check out my previous “jumping off” blog post on the topic, along with my posts that go into more detail about what synthetic data is and how to create it.

What is Reproducibility?

Reproducibility is the ability to obtain the same results as a study using the same data and methods. It is fundamental to all that we do in science because it allows us to check the validity of our own and others’ work. If a study is not reproducible, its results cannot be fully trusted, and it can lead to wasted time and resources as other researchers try to build on work that may not be valid. In modern science, where for most of us the code is the methodology, reproducibility also means sharing our code and data so that other people can run the code against the data and see if they get the same results.

Note that reproducibility is different from replicability, which is the ability to achieve the same or similar results using different data and methods. It is easy to confuse the two terms, but they are not the same thing. This blog post is about reproducibility: if I give you the same data and the same code to run against that data, you should get the same results as I did.

The problem: Data is often sensitive and cannot be shared

Here is where this all gets a bit tricky. In many fields, the data we work with is sensitive and cannot be shared openly, for good reason. I often work with health data, where the data is your medical records. This is obviously very personal information that should not be shared with the world, even if the names and identifiers are removed. But this is not just a problem for health data. In the social sciences you might be dealing with surveys of political opinions or with financial data. In conservation and ecology you might be working with data about the locations of endangered species. In all of these cases, if you cannot share your data, then other people cannot run your code against it to check and verify your work. This creates two opposing forces: on one hand, we want to share our code and be open about our methods, but on the other hand, we need to protect the privacy of the subjects in our data. This is a real tension, and it has second-order consequences for sharing code: researchers who cannot share their data often do not share their code either.

This is where synthetic data comes in.

Synthetic Data is a Bridge

Synthetic data provides a bridge between these two opposing forces. It allows you to share your code together with a dataset that is structured like the real data and that the code can be run against. People can clone your code repository, download or generate their own synthetic data, and then run the code against it to see what happens. They can check things like: Does the code run? What sorts of errors do you get? What do the outputs look like? This increases the transparency of your work and allows people to understand the logic of your analysis in a much more concrete way. Reading code without being able to run it is a bit like trying to learn a dish from a recipe without ever cooking it. You can read the recipe and understand all the steps, but until you have actually done it yourself, there are limits to your understanding. Sharing code without data is like sharing a recipe without the ingredients. It is better than nothing, but it is not ideal. Synthetic data gives you ingredients to go with the recipe. They are not the real ingredients, but they are close enough that you can still learn how to cook the dish and understand the process.

What do we want from our synthetic data?

There are lots of use cases of synthetic data (see this blog for more on that) and they have different levels of fidelity and different requirements, but for this specific use case of reproducibility, we want our synthetic data to have certain qualities:

It should have the same schema as the real data (same columns, same data types).
It should have the same rate of messiness as the real data (missing values, outliers, etc.), with the same patterns of missingness and outliers: for example, strings that cannot be parsed as dates in a date column, or numbers outside the expected range, if those are issues in the real dataset.
It should have similar statistical properties to the real data such as similar distributions of values, similar correlations between variables, etc. However it does not need to be a perfect match. That is not the point. The point is that the code runs, the pipeline holds together without erroring, and a reader can see exactly what you did at every step. They can inspect the data transformations (like creating a categorical variable from a continuous one), check the logic, and understand the analytical decisions you made and how that relates to the protocol, which you hopefully have also published!
On the flip side, there is a strong argument that the data shouldn’t be too close to the real data. If it is, you might run into issues around privacy and whether any real data has inadvertently leaked into the synthetic data. I have written about that in more detail in my previous blog on privacy and synthetic data if you want to learn more.
It should be a usable size. Your synthetic data should be small enough to be easily downloaded and shared alongside your code, but not so small that your models will never fit. If your real dataset has 10 million records, you do not need to generate a 10 million row synthetic dataset. You can get away with a much smaller dataset that still captures the structure and properties of the real data. That might be 10,000 rows, or 100,000 rows, or even 1 million rows, depending on the complexity of your data and the needs of your code.
It should be easily transportable and shareable. It should be in a format that is easy to work with (like CSV or Parquet). Often we work with data that lives on a server somewhere and write code that pulls it down, but it is much easier if the synthetic data can be included directly in your code repository. You want to avoid relying on an external API or service to generate the synthetic data for you, because it is highly likely that the API will have a breaking change at some point in the future, or that the service will simply go away.
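
As a minimal sketch of the qualities above, the following Python script generates a small synthetic dataset using only the standard library. The column names (`patient_id`, `age`, `admission_date`, `smoker`), the value ranges, and the messiness rates are all hypothetical, chosen for illustration rather than taken from any real dataset. In practice you would set them from summaries of your real data that are safe to disclose.

```python
import csv
import io
import random

# Hypothetical messiness rates, for illustration only.
MISSING_RATE = 0.05    # share of ages left blank
OUTLIER_RATE = 0.01    # share of ages outside the plausible range
BAD_DATE_RATE = 0.02   # share of dates that will not parse


def make_row(rng, row_id):
    """Build one synthetic record with deliberately injected messiness."""
    age = str(rng.randint(18, 90))
    roll = rng.random()
    if roll < MISSING_RATE:
        age = ""                           # missing value
    elif roll < MISSING_RATE + OUTLIER_RATE:
        age = str(rng.randint(150, 999))   # out-of-range value

    date = f"2023-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
    if rng.random() < BAD_DATE_RATE:
        date = "not recorded"              # string that is not a valid date

    return {
        "patient_id": f"SYN{row_id:06d}",
        "age": age,
        "admission_date": date,
        "smoker": rng.choice(["yes", "no"]),
    }


def generate(n_rows, seed=42):
    """Return n_rows synthetic records; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return [make_row(rng, i) for i in range(n_rows)]


def to_csv(rows):
    """Serialise the records to CSV text that can be committed next to the code."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate(1000)
csv_text = to_csv(rows)
```

This sketch draws each column independently, so it does not reproduce correlations between variables. It only covers schema, messiness, size, and portability. Because the seed is fixed and nothing external is called, anyone can regenerate the identical file, or the CSV can simply be committed to the repository.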

Other Benefits in Research and Teaching

Once you have a good synthetic dataset, it can be useful for more than just reproducibility. You can use it for teaching. If you are running a workshop or a module, synthetic data can usually be downloaded and used without data permissions issues, such as working out who the data controller is. Whether that holds depends very much on the method you used to generate it.

If you are collaborating with colleagues at other institutions, sharing synthetic data is usually much more straightforward than sharing the real data. In my experience sharing real data across institutional boundaries involves ethics committees, data sharing agreements, and often months of administrative work. Sharing synthetic data sidesteps all of that and allows you to get on with the science much more quickly. You can start the administrative work in parallel but you do not have to wait for it to be completed before you can start working together.

Synthetic data can also be useful for documentation of the real data. We often use data dictionaries to document our datasets. This might look like a spreadsheet with the column names, data types, and descriptions of what each column means. This is useful but it does not give you all the information. For example, how much missingness is there? Are there any outliers? Are there any weird values that don’t match the data type, for example, a string where it should be a boolean? A synthetic dataset alongside your code serves as living documentation of what your data looks like. It is more informative than a data dictionary alone because people can actually look at the values and understand the shape of the data.
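
To make the “living documentation” idea concrete, here is a rough sketch of the kind of summary a reader could produce from a shared synthetic file in a few lines of Python. The tiny inline dataset, its column names, and the plausible age range of 0 to 120 are made up for illustration.

```python
import csv
import io
from datetime import datetime

# A tiny made-up synthetic extract, inlined so the example is self-contained.
SAMPLE = """patient_id,age,admission_date,smoker
SYN000001,34,2023-01-15,yes
SYN000002,,2023-02-30,no
SYN000003,212,not recorded,maybe
SYN000004,58,2023-03-09,no
"""


def is_date(value):
    """True if the value parses as a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def profile(rows):
    """Count the kinds of problems a data dictionary would not show."""
    return {
        "rows": len(rows),
        "missing_age": sum(1 for r in rows if r["age"] == ""),
        "age_out_of_range": sum(
            1 for r in rows if r["age"] != "" and not 0 <= int(r["age"]) <= 120
        ),
        "unparsable_date": sum(1 for r in rows if not is_date(r["admission_date"])),
        "unexpected_smoker": sum(1 for r in rows if r["smoker"] not in {"yes", "no"}),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = profile(rows)
print(summary)
```

A data dictionary would list `admission_date` as a date and `smoker` as yes/no. Only by looking at the values do you find the impossible date, the free-text entry, and the unexpected category.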

Limitations of synthetic data

Synthetic data is not a panacea. It takes time to create and it is not always straightforward to generate good synthetic data that captures the properties of the real data, particularly when you have many variables with complex relationships between them. It is often not feasible to create synthetic data that will allow a complicated multivariate regression model to fit, and even if you can, it might not be worth the time and effort to do so.

Something which does concern me is that researchers who don’t understand the complexity of creating good synthetic data might create synthetic data that is too close to the real data and then share that publicly. I think that is a real risk and something that we all need to be aware of. If you are going to share synthetic data, it is always helpful to include a description of how you generated it and what properties it has. This is not just for the benefit of the reader but also for yourself as it forces you to think about potential privacy risks. See my blog here for more information about that: Privacy and Synthetic Data.
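
One crude check that can be run before sharing is to confirm that no synthetic row is an exact copy of a real row. The sketch below uses made-up records and hypothetical column names. Passing it does not show that the synthetic data is safe, since it says nothing about near matches or rare combinations of values. It only catches the most obvious kind of leak.

```python
def exact_matches(real_rows, synthetic_rows, columns):
    """Return synthetic rows whose values on `columns` also appear in the real data."""
    real_keys = {tuple(row[c] for c in columns) for row in real_rows}
    return [
        row for row in synthetic_rows
        if tuple(row[c] for c in columns) in real_keys
    ]


# Made-up records for illustration. A real check would run inside the secure
# environment where the real data lives, and only the result would leave it.
real = [
    {"age": 34, "postcode_area": "AB1", "diagnosis": "asthma"},
    {"age": 71, "postcode_area": "CD2", "diagnosis": "diabetes"},
]
synthetic = [
    {"age": 36, "postcode_area": "AB1", "diagnosis": "asthma"},
    {"age": 71, "postcode_area": "CD2", "diagnosis": "diabetes"},  # copied row
]

leaked = exact_matches(real, synthetic, ["age", "postcode_area", "diagnosis"])
print(f"{len(leaked)} synthetic row(s) exactly match a real row")
```

The count of matches, and which columns were compared, is the kind of detail that could go into the description of how the synthetic data was generated.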

My final thoughts

Where I think synthetic data makes the biggest difference is in checking data cleaning and manipulation code. This is the code that creates your analysis dataset from the raw data, creates your variables, and handles missing data. In my experience it is often the most complex and error-prone part of the analysis and the least likely to have been reviewed by anyone else. It is also where serious errors creep in (see this article of mine from years ago describing a study where the outcome variable was flipped in data cleaning). Sharing synthetic data allows people to run that code and check it. Even if you can’t get the models to converge and fit, you can still check the data cleaning and manipulation code and that is a huge win for reproducibility and transparency.
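
As an illustration, here is a sketch of the kind of check a reader could run once synthetic data is available. The cleaning function, the column names, and the raw coding scheme ("1" means died, "2" means survived) are all hypothetical. The point is only that a flipped outcome variable is easy to spot when the code can be executed on rows with known answers.

```python
def clean(raw_rows):
    """Hypothetical cleaning step: derive a binary outcome and an age band."""
    cleaned = []
    for row in raw_rows:
        if row["age"] == "":
            continue  # drop records with missing age
        age = int(row["age"])
        cleaned.append({
            "id": row["id"],
            # Raw coding assumed here: "1" = died, "2" = survived.
            "died": 1 if row["status"] == "1" else 0,
            "age_band": "65+" if age >= 65 else "under 65",
        })
    return cleaned


# Synthetic rows with known answers, small enough to check by eye.
synthetic_raw = [
    {"id": "SYN1", "age": "70", "status": "1"},  # died, 65+
    {"id": "SYN2", "age": "40", "status": "2"},  # survived, under 65
    {"id": "SYN3", "age": "", "status": "2"},    # missing age, should be dropped
]

result = clean(synthetic_raw)

# If the outcome were flipped during cleaning, the first assertion would fail.
assert [r["died"] for r in result] == [1, 0]
assert [r["age_band"] for r in result] == ["65+", "under 65"]
assert [r["id"] for r in result] == ["SYN1", "SYN2"]
```

None of this requires a model to converge. It only requires data with the right shape and code that can be run.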

The synthetic data does not need to be perfect. It just needs to be good enough that someone can run your code and see what happens. That is a surprisingly low bar, and I think more of us should be clearing it.
