Why Rust for Data-Intensive Applications
Caroline Morton
April 1, 2026
This is a prologue to a series on Rust for data-intensive research applications - written after the first three parts, which is perhaps the wrong order, but reflects how the thinking actually developed. I wanted to write something that introduces the series as a whole, explains the motivation behind it, and I hope is accessible to researchers who may not be familiar with Rust. It is also, in the spirit of Austin Kleon’s Show Your Work!, an attempt to share the process of thinking rather than just the conclusions.
The performance argument is a red herring
If you speak to any Rustacean about why they like Rust, they will probably mention performance as one of the main reasons why they love it. It is the language of choice for high-performance applications like game engines, operating systems, and web browsers. I originally got into Rust because I needed a faster language to speed up a simulation product I was working on. So I understand the appeal of performance as a main selling point for Rust. But when it comes to research data pipelines, performance is not the main concern. In fact, it is often not a concern at all.

Most research pipelines are not CPU-bound; they are correctness-bound: the bottleneck is not how fast the code runs, but how quickly it can produce a correct answer. A faster wrong answer is still a wrong answer. Unless you are dealing with enormous datasets or real-time streaming, performance gains do not meaningfully change research outcomes. In my mind, the real reason Rust matters for research data pipelines is not that it is fast, but that it is designed to help you write correct code.
In this blog, I am going to outline the three main ways that research data pipelines fail, and how Rust’s type system addresses each of those failure modes by construction. I will then give a brief overview of the three parts of the series, which will go into more detail on each of those topics. As I have said, I have already written the first three blogs - on serde, newtypes and error reporting - which form Part I of the series dealing with individual correctness of the data.
Problem one: The data lies to you
Scientific data is messy. It is not just that there are typos and missing values; it is that the data can be internally inconsistent, that datasets can arrive with different columns and formats each time you get a new version, and that the data can violate domain constraints in ways that are not obvious until you try to use it.
In an ideal situation, we would be fixing these issues upstream. Certainly a product company would be looking into how to improve the data quality at the source. But in research, we are usually working with a dataset that we did not create. In my domain of epidemiology, the data is collected as part of routine care, i.e. not for research purposes, and provided by commercial data vendors. The data is often incomplete, inconsistent, and full of errors or implausible values. You have doctors and nurses entering data into the records of patients and this data is then transformed, “cleaned” and aggregated by the data vendor before it reaches you, sometimes in unknowable ways. The data is not lying to you on purpose, but it is lying to you nonetheless. If your code assumes that the data is clean and well-formed, you are going to get wrong answers without even knowing it.

What I have seen happen is that ad-hoc validation logic ends up scattered across the pipeline, often in the middle of the analysis code, and usually added after someone notices that a value that should not be there has ended up there. This means you are discovering errors after you have already done a lot of downstream work, and you are not sure how many other errors you have missed. Silent failures are the worst kind of failure, because they produce plausible-looking output that survives manual inspection but is actually wrong.
One of the advantages I have found of using Rust for research data pipelines is that it forces you to confront the messiness of the data at the boundary where it enters your program. The deserialisation step is the only place where raw bytes become typed values, so it is the only place where you can test all of your structural assumptions about the data. If the data does not conform to the structure you expect, deserialisation will fail explicitly and you will have to deal with it. You can take that failure and figure out why it is there, and if need be, go back to the data vendor and ask them to fix it.
This is what we could think of as a contract between the data and the code - the type is the contract, and the deserialisation is the checkpoint where you verify that the data meets that contract. I have discussed this problem and solution in more detail in my serde post, which is the first part of the series, but I wanted to give a preview of it here because it is such a fundamental part of the argument for Rust in research data pipelines.
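The serde post works through this properly; as a minimal standard-library-only sketch of the same idea (the field names, the comma-separated input and the 42-week limit here are illustrative, not taken from the series), the type can own its parsing so the contract is checked at the boundary:

```rust
use std::str::FromStr;

// The type is the contract: a Record can only exist if the raw
// text met every structural assumption encoded below.
#[derive(Debug)]
struct Record {
    patient_id: u64,
    gestational_age_weeks: u8,
}

impl FromStr for Record {
    type Err = String;

    // Parse one comma-separated line, e.g. "1042,39".
    fn from_str(line: &str) -> Result<Self, Self::Err> {
        let mut fields = line.split(',');
        let patient_id = fields
            .next()
            .ok_or("missing patient_id")?
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("patient_id: {e}"))?;
        let weeks = fields
            .next()
            .ok_or("missing gestational_age_weeks")?
            .trim()
            .parse::<u8>()
            .map_err(|e| format!("gestational_age_weeks: {e}"))?;
        if weeks > 42 {
            return Err(format!("implausible gestational age: {weeks}"));
        }
        Ok(Record { patient_id, gestational_age_weeks: weeks })
    }
}

fn main() {
    // Well-formed input parses; malformed input fails loudly, here,
    // rather than silently downstream.
    assert!("1042,39".parse::<Record>().is_ok());
    assert!("1042,95".parse::<Record>().is_err()); // implausible value
    assert!("1042".parse::<Record>().is_err());    // missing column
}
```

With serde the parsing boilerplate is derived rather than hand-written, but the shape of the argument is the same: past this point, a `Record` cannot exist in an invalid state.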
The data does not only lie at the level of individual records. A row can pass every structural check and still be wrong in context. A patient appears twice with different dates of birth. A death date precedes a birth date across two linked datasets. A record is a perfect duplicate of another, and you do not know whether it is a data entry error or a genuine second event. These are not problems that deserialisation can catch, because each record in isolation is well-formed. They are compositional problems, and they emerge when records are joined. Part II of this series looks at how Rust’s type system can encode guarantees so that a pipeline which would allow these failures does not compile.
And then there is the version of this problem that only shows up at production scale. The data looked fine on the sample you developed against, but the full dataset has a column that used to be non-null and now contains blanks from one particular source or time period. Or a new extract arrives with an extra column that breaks your parser (literally the bane of my life!). The data lied to you, but it only lied under conditions you had not tested. Part III will look at how property-based testing and structured tracing help you catch the lies you did not think to look for.
Problem two: Your code forgets what it knows
Domain knowledge lives inside researchers’ heads. The methodology might make it into comments or a README, but the small constraints rarely make it into code. Some are biological facts: pregnancy cannot last more than 42 weeks. Others are dataset-specific: perhaps you have a primary care dataset where pregnancies cannot be registered before 12 weeks’ gestation, because that is when referral to a midwife happens. This is domain knowledge - you know something about the system in which the data was collected that is not explicitly represented in the data itself.
Primitive types describe shape, not meaning. The compiler cannot distinguish between an integer representing a patient’s age and one representing the length of a hospital stay. This allows silent bugs: you can, for example, pass arguments in the wrong order and the function will still compile, producing output that looks plausible but is wrong.
Rust’s newtype pattern encodes constraints directly in the type system. A GestationalAgeInWeeks type can wrap an integer and expose a new() constructor that returns an error for values above 42. The compiler now enforces what you know to be true about the domain. I have written about this in detail in my post on newtypes.
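As a rough sketch of what that might look like (the constructor shape and error type here are my illustrative choices, not necessarily those from the post):

```rust
// A newtype: the wrapped u8 is private, so the only way to obtain
// a value is through the checked constructor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GestationalAgeInWeeks(u8);

impl GestationalAgeInWeeks {
    /// Fails for values above 42, encoding the biological constraint
    /// once, at construction, instead of at every use site.
    pub fn new(weeks: u8) -> Result<Self, String> {
        if weeks > 42 {
            Err(format!("gestational age of {weeks} weeks is implausible"))
        } else {
            Ok(GestationalAgeInWeeks(weeks))
        }
    }

    pub fn get(self) -> u8 {
        self.0
    }
}

fn main() {
    assert!(GestationalAgeInWeeks::new(39).is_ok());
    assert!(GestationalAgeInWeeks::new(50).is_err());
    // A function taking GestationalAgeInWeeks can no longer be handed
    // a patient's age or a length of stay by mistake: those would be
    // different types, and the mix-up becomes a compile error.
}
```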
Domain knowledge does not just apply to individual values; it also applies to relationships between pipeline stages. Data must be deduplicated before it is aggregated. Missing values need to be dealt with before you run a regression, or calculate a mean. This type of knowledge ends up being ordering constraints on the pipeline. Part II of this series looks at how builder and typestate patterns let you encode these constraints in the type system, so that you cannot compile a pipeline that violates them.
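A minimal sketch of how such an ordering constraint can be encoded with typestate (the stage names and the toy deduplicate-then-aggregate pipeline are invented for illustration; Part II develops the real patterns):

```rust
use std::marker::PhantomData;

// Typestate sketch: a dataset is tagged with its pipeline stage,
// and aggregation only exists for deduplicated data.
struct Raw;
struct Deduplicated;

struct Dataset<Stage> {
    rows: Vec<u64>,
    _stage: PhantomData<Stage>,
}

impl Dataset<Raw> {
    fn new(rows: Vec<u64>) -> Self {
        Dataset { rows, _stage: PhantomData }
    }

    // The only way to reach the Deduplicated state is through here.
    fn deduplicate(mut self) -> Dataset<Deduplicated> {
        self.rows.sort_unstable();
        self.rows.dedup();
        Dataset { rows: self.rows, _stage: PhantomData }
    }
}

impl Dataset<Deduplicated> {
    // aggregate() is not defined on Dataset<Raw>, so calling it on
    // raw data is a compile error, not a runtime bug.
    fn aggregate(&self) -> u64 {
        self.rows.iter().sum()
    }
}

fn main() {
    let total = Dataset::new(vec![3, 1, 3, 2]).deduplicate().aggregate();
    assert_eq!(total, 6); // 1 + 2 + 3, with the duplicate removed first
    // Dataset::new(vec![1]).aggregate() would not compile.
}
```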

In Part III, we will touch on this as well. One issue in research is that there is often a long lead time between the end of a study and a peer reviewer requesting corrections to the final paper. So you might understand an edge case when you write the code, but six months later you (or perhaps a colleague) need to modify the pipeline and have forgotten it. We will look at property-based testing, which deliberately generates adversarial inputs that exercise those edge cases, making sure they are still handled correctly even if the person modifying the code does not remember them.
Problem three: Your pipeline has no memory
Research data is messy. Most records contain errors, unexpected values, or missing data. In a product company, you would treat these as bugs to fix upstream. But in research, the errors are data. They are observations to be captured, not problems to solve. In fact, the methods section of a research paper must report exactly what data was excluded and why: 83 patients missing a date of birth; 17 records with implausible gestational ages over 42 weeks; 5 patients with birth dates after death dates.
The problem is that we usually collect this information after the fact. We run the analysis, notice something odd in the output, write a separate script to investigate, and patch the exclusion criteria. The pipeline itself has no memory of what it discarded or why. This means the “audit trail” essentially lives in non-code places - Slack messages, emails, or comments in a Jupyter notebook. It is not part of the pipeline itself, and, importantly, is not structured in a way that can be counted or queried.
What we need is a data quality report as a first-class pipeline output, constructed as the pipeline runs. Not logged for later inspection. Not inferred from what made it through. Built directly into the return type of every function that touches the data.
Rust’s error model makes this natural. A function that processes a record does not return a single value; it returns Result<T, E>. Success and failure are both structured outcomes. You cannot ignore the error case without being explicit about it. This means you can accumulate a complete record of what went wrong, where, and how often, as the pipeline runs. The error type carries the same weight as the success type. The data quality report becomes an artefact of the same process that produces the analysis dataset, not a separate investigation.
I have written about this approach in detail in my error handling post. But the idea extends further than a single function’s error type. If your pipeline fails halfway through, you often have no way of knowing what the data looked like at the point of failure, or how far it got. Researchers sometimes get around this by outputting intermediate datasets at each stage, but this is not ideal, and if you are dealing with very large datasets, might not be possible at all. Part II is going to look a little at how the type system can make it impossible to pass data from one stage to another without the necessary guarantees, but Part III is going to look at how structured tracing can give you a detailed record of the pipeline’s execution, so that if something goes wrong, you can step back in time and see exactly what happened. We want to make it possible for the pipeline to narrate its own execution, so that you can audit it and trust it, and debug it when it fails.
The thesis: correctness at every layer
Now that we have covered the main problems with messy data, I want to make my case. The argument is not “Rust is good”; it is “the properties Rust enforces are the same properties science requires”. In my mind, science demands that you know what your data is, where the biases and inaccuracies lie, and that you cannot accidentally misuse it. You must be able to account for every record that entered your pipeline and where it left. I think we can use Rust to make this compiler-enforced rather than pushing it solely onto the researcher, their memory, and convention. Too often we use languages where correctness is optional, where it becomes a function of an individual’s vigilance rather than a structural feature of the codebase.
We often think of reproducibility as a documentation problem. I have often given talks on the importance of documentation, and of course it is important. But more and more I am realising that reproducibility is not just a documentation issue; it is often a type system problem. An integer that is actually a string. A string that is actually missing. A value that is not biologically plausible. These errors hide in the gap between what the code assumes and what the data contains - a gap often too large to be recorded in a README or even in comments.
There are three layers of correctness: individual correctness, compositional correctness, and operational correctness. These map onto scientific rigour with increased scope at each layer. We start with measurement validity, then internal dataset validity, then external validity.

Series map
Part I: Individual correctness
Part I addresses correctness at the level of individual records - the three problems outlined above, and the Rust patterns that address each one.
The first post covers deserialisation as a validation boundary. The second post introduces newtypes as a way to encode domain knowledge directly into the type system. The third post treats validation errors as data rather than noise, building a data quality report as a first-class pipeline output.
Part II: Compositional correctness
Individual correctness guarantees that each record is valid in isolation. Compositional correctness asks whether those records can be processed incorrectly as they move through the pipeline, and whether the type system can prevent that.
We touched on this in the problems above: data that is correct in isolation but wrong in context, pipeline stages that can be assembled in the wrong order, domain knowledge that applies to the sequence of operations rather than to individual values. Part II will look at builder and typestate patterns as a way to encode these constraints, and at missingness as a first-class value in the type system, just as error reporting is a first-class output. We want a pipeline that guarantees that correctness established at one stage is carried forward in the type, not re-examined at every subsequent stage.
Part III: Operational correctness
Parts I and II establish correctness properties for a pipeline running on well-understood data. Part III asks whether those properties hold under real conditions: bigger scale, partial failure, concurrency, persistence, re-running your analysis.
Again, the problems above hinted at this layer: data structures or missingness you had not tested, edge cases forgotten over time, and pipelines that cannot narrate their own execution. Part III will look at structured tracing as a mechanism for provenance, and property-based testing as a way to generate the adversarial inputs you did not think to write by hand. The goal is to make it possible to trust your pipeline even when it is running on data you have not seen before, and to debug it when it fails. This topic matters to me because we too often build beautiful pipelines that work for one study, then never reuse that code for the next one because we are not sure it will work on the new data. Often it would; we simply lack confidence in it, so we rewrite it from scratch.
Finally
This series is a work in progress, and I am writing it as I go. The ideas in Parts II and III are less fully worked out than those in Part I, which has the benefit of already being written! Some of what I have outlined here will probably change as I work through the detail and do some more thinking - and that is kind of the point. I am not presenting a finished framework; I am thinking out loud about how we can use Rust, or even take some of its concepts and apply them in other languages, to make our research pipelines better and more reproducible. I hope that by sharing this process of thinking in public, I can get feedback and ideas from other people doing similar work. If you are, please get in touch via my contact page.