Your Errors Are Data Too

Author: Caroline Morton
Date: March 23, 2026

This is the third post in the Rust for Data-Intensive Applications series. The Serde post covered moving the validation boundary to the point of ingestion. The newtypes post covered encoding domain knowledge in types so the compiler enforces it. This post is about what happens when things go wrong at either of those boundaries, and why capturing that information carefully is as important as the valid records themselves.

Errors in research code are different

In application development, an error means something went wrong that needs fixing. Your code threw an exception, your service returned a 500, your database query failed. The goal is to find the error, understand it, and eliminate it.

In a research data pipeline, errors often mean something different. When a record fails validation because a patient age is 150, that is probably not a bug in your code. More often than not, it is a finding about your data. When 47 records fail because they are missing a required field, that is information about the quality of your data extract. When 340 records have values outside the expected clinical range, that is something you might want to report in your methods section.

Rust distinguishes between recoverable and unrecoverable errors, and that distinction maps neatly onto this. A code error - a logic bug, a missing dependency, an unexpected panic - should probably halt the pipeline. A data error should be captured, counted, and reported. The goal is not to eliminate data errors but to make them legible.
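
That distinction can be made concrete in code. Below is a minimal sketch with hypothetical types: a data error is returned as a value so it can be counted later, while a code or setup error halts loudly. Everything here (the `RawRecord` struct, the age bounds, the config helper) is illustrative, not from the series so far.

```rust
// A data error is a value; a code error is a panic.

#[derive(Debug)]
pub struct RawRecord {
    pub age: i64,
}

#[derive(Debug)]
pub enum DataError {
    ImplausibleAge(i64),
}

// Data error: recoverable. The caller can count it, categorise it,
// and report it alongside the valid records.
pub fn validate_age(record: &RawRecord) -> Result<u8, DataError> {
    if (0..=120).contains(&record.age) {
        Ok(record.age as u8)
    } else {
        Err(DataError::ImplausibleAge(record.age))
    }
}

// Code error: unrecoverable. A missing config file is a bug in the
// setup, not a finding about the data, so halting is appropriate.
pub fn load_config(path: &str) -> String {
    std::fs::read_to_string(path)
        .expect("config file must exist: this is a setup bug, not a data finding")
}
```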

The naive approach and what it loses

We will start with the most basic form of error handling. This is the approach you might take in a quick Python or R script, and it is the approach many Rust newcomers take as well. You have a function that processes each record and returns an output; as you loop over the records, you print any errors and move on. You are treating errors as a side effect of running the pipeline rather than as data in their own right.

fn process_records(records: Vec<RawRecord>) -> Vec<EpisodeRecord> {
    records.into_iter().filter_map(|record| {
        match process_record(&record) {
            Ok(episode) => Some(episode),
            Err(e) => {
                println!("Error processing record {}: {}", record.id, e);
                None
            }
        }
    }).collect()
}

This works, but it loses almost all of the useful information. We print the error and move on, which means we cannot aggregate errors, filter them by category, or report on them in any structured way. Printing errors and then ignoring them is poor practice in more or less any language. And with a large dataset, these messages scroll past rapidly; we may never notice that 340 records failed for the same reason.

A taxonomy of data quality failures

So we can do better, and we frequently do. As I have said before, thiserror is a great tool for defining custom error types that capture the information we want. The question is: what information do we want to capture? Not all errors are created equal, and not all are equally useful to report on. If we have a single DataError type with a message, we are back to the same problem: we can print the message, but we cannot easily categorise or aggregate it.

I think there are three meaningful categories, and they map onto the series so far.

A structural error means the data could not be parsed at all. A date field that contains “banana”. A required field that is simply absent. Serde catches these at the ingestion boundary.

A validation error means the data parsed correctly but fails a basic constraint. A negative age. A weight of zero. Your newtype constructors catch these.

A domain error means the value is syntactically valid and passes basic constraints, but violates something that requires clinical or scientific knowledge to identify. A discharge date before an admission date. These are the errors that only you, as a researcher, know to look for.

We can define an enum that captures this taxonomy:

#[derive(Debug, thiserror::Error)]
pub enum PipelineError {
    #[error("structural error: {0}")]
    Structural(String),

    #[error("validation error: {0}")]
    Validation(String),

    #[error("domain error: {0}")]
    Domain(String),
}
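
A domain check against this taxonomy might look like the sketch below. The `EpisodeDates` struct is hypothetical, and dates are represented as plain day numbers to keep the example free of date-crate dependencies; the enum is restated without thiserror so the sketch is self-contained.

```rust
// The taxonomy enum, restated without thiserror for self-containment.
#[derive(Debug)]
pub enum PipelineError {
    Structural(String),
    Validation(String),
    Domain(String),
}

// Hypothetical episode struct; days since some epoch, for illustration.
pub struct EpisodeDates {
    pub admission_day: u32,
    pub discharge_day: u32,
}

// A rule that requires clinical knowledge to state: you cannot be
// discharged before you were admitted. Both fields parse fine and pass
// basic range checks, so only a domain check catches this.
pub fn check_episode_dates(e: &EpisodeDates) -> Result<(), PipelineError> {
    if e.discharge_day < e.admission_day {
        return Err(PipelineError::Domain(format!(
            "discharge day {} before admission day {}",
            e.discharge_day, e.admission_day
        )));
    }
    Ok(())
}
```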

One thing worth noting: if you followed the newtypes post and encoded domain rules in your constructors - a GestationalAgeInWeeks that refuses values over 42 - those failures will surface as validation errors in the pipeline, not domain errors. That is fine. The category you assign depends on how you have structured your pipeline. The important thing is that the failure is captured and counted, not which bucket it ends up in.
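
One convenient way to route newtype failures into the right bucket is a From impl, so the `?` operator does the conversion for you. The sketch below assumes a hypothetical `GestationalAgeError` returned by the newtype constructor; the enum is restated without thiserror so the example stands alone.

```rust
// The taxonomy enum, restated for self-containment.
#[derive(Debug)]
pub enum PipelineError {
    Structural(String),
    Validation(String),
    Domain(String),
}

// Hypothetical error type returned by the newtype constructor.
#[derive(Debug)]
pub struct GestationalAgeError(pub u8);

#[derive(Debug)]
pub struct GestationalAgeInWeeks(u8);

impl GestationalAgeInWeeks {
    pub fn new(weeks: u8) -> Result<Self, GestationalAgeError> {
        if weeks <= 42 {
            Ok(Self(weeks))
        } else {
            Err(GestationalAgeError(weeks))
        }
    }
}

// Route constructor failures into the Validation bucket.
impl From<GestationalAgeError> for PipelineError {
    fn from(e: GestationalAgeError) -> Self {
        PipelineError::Validation(format!("gestational age {} weeks exceeds 42", e.0))
    }
}

pub fn parse_gestational_age(weeks: u8) -> Result<GestationalAgeInWeeks, PipelineError> {
    // `?` converts GestationalAgeError into PipelineError via the From impl.
    Ok(GestationalAgeInWeeks::new(weeks)?)
}
```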

This alone is an improvement. When errors appear in your logs, you can see at a glance which category they belong to. But we can do better than logs.

The DataQualityReport as a first-class pipeline output

Researchers already have a vocabulary for tracking what happens to data as it flows through a pipeline. The CONSORT flowchart. The data flow diagram in a methods section. The table that shows how many records were excluded at each stage and why. The DataQualityReport is that diagram made executable.

Instead of treating errors as a side effect of running the pipeline, we can treat the quality report as a first-class output alongside the valid records. Here is a struct that captures the information we want:

#[derive(Debug, Default)]
pub struct DataQualityReport {
    pub total_records: usize,
    pub valid_records: usize,
    pub structural_errors: usize,
    pub validation_errors: usize,
    pub domain_errors: usize,
}

And here is a pipeline function that produces both outputs:

pub fn process_records(records: Vec<RawRecord>) -> (Vec<EpisodeRecord>, DataQualityReport) {
    let mut report = DataQualityReport {
        total_records: records.len(),
        ..Default::default()
    };

    let valid_records = records.into_iter().filter_map(|record| {
        match process_record(&record) {
            Ok(episode) => {
                report.valid_records += 1;
                Some(episode)
            }
            Err(e) => {
                match e {
                    PipelineError::Structural(_) => report.structural_errors += 1,
                    PipelineError::Validation(_) => report.validation_errors += 1,
                    PipelineError::Domain(_) => report.domain_errors += 1,
                }
                None
            }
        }
    }).collect();

    (valid_records, report)
}

We can give DataQualityReport a report() method that prints a readable summary, and a to_txt() method that writes it to a file:

impl DataQualityReport {
    pub fn report(&self) {
        println!("Data quality report");
        println!("  Total records:      {}", self.total_records);
        println!("  Valid records:      {}", self.valid_records);
        println!("  Structural errors:  {}", self.structural_errors);
        println!("  Validation errors:  {}", self.validation_errors);
        println!("  Domain errors:      {}", self.domain_errors);
    }

    pub fn to_txt(&self, path: &std::path::Path) -> std::io::Result<()> {
        use std::io::Write;
        let mut file = std::fs::File::create(path)?;
        writeln!(file, "Data quality report")?;
        writeln!(file, "  Total records:      {}", self.total_records)?;
        writeln!(file, "  Valid records:      {}", self.valid_records)?;
        writeln!(file, "  Structural errors:  {}", self.structural_errors)?;
        writeln!(file, "  Validation errors:  {}", self.validation_errors)?;
        writeln!(file, "  Domain errors:      {}", self.domain_errors)?;
        Ok(())
    }
}

The output of report() might look like this:

Data quality report
  Total records:      10000
  Valid records:      9611
  Structural errors:  47
  Validation errors:  340
  Domain errors:      2

This is the information you would put in a CONSORT flowchart or a methods section. As you process each PipelineError, you could pull out more information to include in the report, such as the most common error messages or the specific fields that failed validation. The point is that you have a structured way of capturing and reporting on data quality issues that is separate from the valid records themselves.

With a little more wrangling and thinking about categories, you could end up with something like this. Note that getting to this level of detail requires storing the error messages themselves rather than just counts, but even the simple version gives you the shape of the problem:

Data quality report
  Total records:      10000
  Valid records:      9611

Structural errors:  47
  - Missing required field: 30
  - Unparseable date: 17

Validation errors:  340
  - Negative age: 200
  - Weight over 1000: 140

Domain errors:      2
  - Discharge date before admission date: 2
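
One way to get to that level of detail is to count occurrences of each message per category with a HashMap. This is a sketch, not the post's DataQualityReport: the enum is restated without thiserror, and the `DetailedReport` struct and its `record` method are illustrative names.

```rust
use std::collections::HashMap;

// The taxonomy enum, restated for self-containment.
#[derive(Debug)]
pub enum PipelineError {
    Structural(String),
    Validation(String),
    Domain(String),
}

// Per-message counts in each category, instead of bare totals.
#[derive(Debug, Default)]
pub struct DetailedReport {
    pub structural: HashMap<String, usize>,
    pub validation: HashMap<String, usize>,
    pub domain: HashMap<String, usize>,
}

impl DetailedReport {
    pub fn record(&mut self, err: &PipelineError) {
        // Pick the bucket for this category, then bump the count
        // for this specific message.
        let bucket = match err {
            PipelineError::Structural(msg) => self.structural.entry(msg.clone()),
            PipelineError::Validation(msg) => self.validation.entry(msg.clone()),
            PipelineError::Domain(msg) => self.domain.entry(msg.clone()),
        };
        *bucket.or_insert(0) += 1;
    }
}
```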

What you can do with the report

Having the report as a structured output rather than a stream of log messages opens up several options.

You can implement a threshold approach: abort if structural errors exceed a certain percentage, but tolerate a higher rate of validation errors that you intend to report on and account for in a sensitivity analysis.

pub fn check_thresholds(report: &DataQualityReport) -> Result<(), String> {
    let structural_rate = report.structural_errors as f64 / report.total_records as f64;
    let validation_rate = report.validation_errors as f64 / report.total_records as f64;

    if structural_rate > 0.01 {
        return Err(format!(
            "structural error rate {:.1}% exceeds threshold of 1%",
            structural_rate * 100.0
        ));
    }

    if validation_rate > 0.05 {
        return Err(format!(
            "validation error rate {:.1}% exceeds threshold of 5%",
            validation_rate * 100.0
        ));
    }

    Ok(())
}

You can send it to whoever provided the extract with specific, actionable language. “Your extract contained 47 records where the patient identifier field was empty, and 17 records where the admission date could not be parsed.” That conversation is possible because your errors are structured. The alternative is “something went wrong, can you check it?” which is a conversation that goes nowhere.

You can write the report to a file alongside your output data so that anyone who uses the dataset in the future can see exactly what the pipeline found and what it excluded.

You can include a summary in your methods section: “Of 10,000 records, 9,611 passed validation. 47 records were excluded due to structural errors in the source extract. 340 records contained values outside the expected clinical range and are reported separately in the sensitivity analysis.”

That sentence is your DataQualityReport in prose. The code produced it; you are just transcribing it, and in fact, if you wanted to fiddle around with formatted strings, you could get your DataQualityReport to produce that exact sentence for you.
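
Here is a sketch of what that could look like. The struct is the one from this post; the `methods_sentence` helper and its exact wording are my invention, for illustration only.

```rust
#[derive(Debug, Default)]
pub struct DataQualityReport {
    pub total_records: usize,
    pub valid_records: usize,
    pub structural_errors: usize,
    pub validation_errors: usize,
    pub domain_errors: usize,
}

impl DataQualityReport {
    // Render the report as methods-section prose.
    pub fn methods_sentence(&self) -> String {
        format!(
            "Of {} records, {} passed validation. {} records were excluded due to \
             structural errors in the source extract. {} records contained values \
             outside the expected clinical range and are reported separately.",
            self.total_records,
            self.valid_records,
            self.structural_errors,
            self.validation_errors
        )
    }
}
```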

How this connects to the previous posts

The errors that happen at the boundaries we have been building - and it is your choice where those boundaries lie - are what the PipelineError enum captures. This information is not just for debugging; it is a measurement of the quality of your data and the reliability of your pipeline. By categorising errors and counting them in a structured way, you can produce a report that tells you not just what you have but what you lost and why.

The three posts together describe a philosophy for structuring the ingestion and validation layers of a research data pipeline in Rust:

  • Serde: did the data arrive in the shape we expected?
  • Newtypes: does the data mean what we think it means?
  • Errors as data: what do the failures tell us about the dataset?

Some disadvantages

This approach requires more upfront design work. You have to think about your error taxonomy before you write the pipeline, which means understanding your data and your domain well enough to categorise the ways things can go wrong. That is not always possible at the start of a project.

Getting the granularity right

The granularity question is also hard to get right. If your error types are too coarse, the report does not tell you enough. If they are too granular, it becomes noise and you stop looking at it. The worst case is a near 1:1 mapping between error types and individual error conditions, which is just recreating the problem you started with in a different form. I think the right level of abstraction is usually aligned with the stages of your pipeline: ingestion errors, validation errors, domain errors. thiserror helps here because you can include specific context in the error message without creating a new type for every possible failure mode.

Testing also becomes slightly more involved. You are not just testing that your pipeline produces the right output; you are also testing that it produces the right report. That means writing tests that deliberately introduce bad data and assert that the report captures it correctly. This is more work upfront, but it pays dividends when the pipeline runs on real data and produces a report you can trust. Later in this series we will cover property-based testing and how it can help with this sort of problem, but for now, let's just be aware that this is a more complex testing scenario than simply checking for correct outputs.
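
A test of this kind might look like the sketch below. To keep the example self-contained, the pipeline is deliberately minimal and hypothetical: records are plain ages, anything outside 0 to 120 counts as a validation error, and the report tracks only three counters.

```rust
// A deliberately minimal pipeline so the test stands alone.
#[derive(Debug, Default, PartialEq)]
pub struct DataQualityReport {
    pub total_records: usize,
    pub valid_records: usize,
    pub validation_errors: usize,
}

pub fn process_ages(ages: &[i64]) -> (Vec<u8>, DataQualityReport) {
    let mut report = DataQualityReport {
        total_records: ages.len(),
        ..Default::default()
    };
    let valid: Vec<u8> = ages
        .iter()
        .filter_map(|&a| {
            if (0..=120).contains(&a) {
                report.valid_records += 1;
                Some(a as u8)
            } else {
                report.validation_errors += 1;
                None
            }
        })
        .collect();
    (valid, report)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn report_counts_bad_records() {
        // Deliberately include two implausible ages.
        let (valid, report) = process_ages(&[30, 150, -1, 65]);
        // Test the output AND the report.
        assert_eq!(valid, vec![30, 65]);
        assert_eq!(report.total_records, 4);
        assert_eq!(report.valid_records, 2);
        assert_eq!(report.validation_errors, 2);
    }
}
```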

Final thoughts

We as researchers already think about data quality as a measurable property of a dataset. We build CONSORT flowcharts. We write methods sections that account for every record we excluded and why. We run sensitivity analyses on the data we were not sure about. Rust gives you the tools to make that thinking executable as the pipeline runs, rather than reconstructing it after the fact. It does involve more upfront planning and design work, but it means less fiddly post-hoc data wrangling and more confidence in the quality of your dataset.

Your error types become a schema for what can go wrong and your report is a measurement of how wrong it actually was. By the time your pipeline finishes, you should know not just what you produced but what you could not process, and what that means for your findings.
