Serde Rust: Data Serialisation for Data Scientists

Author: Caroline Morton

Date: March 8, 2026

I have a confession to make: I love Serde.

Serde, for those not in the know, is the Rust ecosystem’s workhorse for serialisation and deserialisation, but for data pipelines I find it more helpful to think of it as something slightly different: a schema enforcement mechanism and a validation boundary.

That distinction matters. In many data pipelines, validation is treated as something that happens later. You ingest the data, clean it, transform it, and only then check whether it actually matches the assumptions your code is making. By the time you discover a problem, you may already have done quite a lot of work on data that is not what you thought it was.

Serde lets you move that boundary to the point of ingestion. You define a type that represents a valid record. Deserialisation into that type either succeeds or fails with an error that you can handle explicitly. There is no intermediate state where the data exists in your program but has not yet been checked.

This idea is not new. It is “parse, don’t validate” applied to data engineering.

Throughout this post I will use messy health data as an example, because that is what I work with most often. The patterns are not health-specific. They apply just as well to financial data, sensor feeds, operational logs, or any other pipeline where the source data is less tidy than the schema documentation suggests.

All the code in this post is available as a runnable Rust project on GitHub with test CSV files for each pattern.

Where this fits: validation boundaries

For years I have worked with data from hospitals, primary care, laboratories, and national registries. The hardest part has rarely been scale. It has been messiness.


A field that was numeric in one extract becomes a string in the next. A categorical variable arrives with three different spellings of the same value. A date column changes format across providers, or even within the same provider over time. These things are normal in real datasets, and to be honest they drive me crazy! Code written to handle one day’s extract can break when the next day’s data changes shape, and that is a constant source of bugs and maintenance work.

Imagine a simple record with a patient ID, a date of birth, and a smoking flag. At minimum you want to know that the ID is present, the date is parseable, and the smoking field contains something meaningful. More realistically you want to normalise equivalent representations and reject nonsense values. An equivalent representation might be that “yes”, “Y”, “1”, and “true” all mean the same thing in the smoking field. A nonsense value might be “maybe” or “unknown”. You want to be able to say, “if the smoking field is ‘yes’, then we can treat it as a 1”. And you want to be able to say, “if the date of birth is in any of these formats, we can parse it into a date object”.

In many pipelines that logic ends up scattered across multiple stages: one transformation for dates, another for booleans, a filter somewhere else, and a few ad hoc repairs when analysis starts failing. The result is code that works most of the time but is difficult to reason about.

Serde collapses that gap. The type becomes the contract.

To illustrate the starting point, consider a minimal struct that represents a row in a dataset:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct PatientRecord {
    patient_id: String,
    date_of_birth: String,
    is_smoker: bool,
}

Even this simple definition already gives you structural validation. If a required field is missing or a value cannot be parsed as a boolean, deserialisation fails.

The problem is that real datasets rarely arrive this cleanly. A smoking flag might appear as "yes", "Y", "1", "true", or "TRUE" depending on which system produced the extract. The default boolean parser will reject most of these.

That is where custom deserialisers become useful.

Pattern 1: Normalising inconsistent encodings at the boundary

Many datasets contain values that are semantically identical but encoded in several different ways. Smoking flags are a common example. One source sends 1 and 0, another sends "yes" and "no", and a third uses "Y" and "N".

Instead of handling these inconsistencies later in the pipeline, you can normalise them at the point of deserialisation:

use serde::{Deserialize, Deserializer};

fn flexible_bool<'de, D>(deserializer: D) -> Result<bool, D::Error>
where
    D: Deserializer<'de>,
{
    let value = String::deserialize(deserializer)?.trim().to_lowercase();

    match value.as_str() {
        "y" | "yes" | "true" | "1" | "t" => Ok(true),
        "n" | "no" | "false" | "0" | "f" => Ok(false),
        other => Err(serde::de::Error::custom(format!(
            "unrecognised boolean value: '{other}'"
        ))),
    }
}

You then attach the deserialiser to the relevant field:

#[derive(Debug, Deserialize)]
struct PatientRecord {
    patient_id: String,

    #[serde(deserialize_with = "flexible_bool")]
    is_smoker: bool,
}

Now every time the field is deserialised, the same normalisation rule is applied automatically. The rest of the pipeline can treat the value as a clean boolean.

Since this is just a function, we can reuse it anywhere the same inconsistency appears. If we have another field that represents whether a patient has diabetes, we just annotate it the same way:

#[derive(Debug, Deserialize)]
struct PatientRecord {
    patient_id: String,

    #[serde(deserialize_with = "flexible_bool")]
    is_smoker: bool,

    #[serde(deserialize_with = "flexible_bool")]
    has_diabetes: bool,
}

The parser lives in one place in the codebase, and every field that needs it gets the same behaviour. It is super annoying to have to write this kind of logic in multiple places in a pipeline, and so I would much rather have it attached to the type itself, written once, and reused everywhere. I want to reduce the cognitive load of reading and writing pipeline code so we can concentrate on the actual analysis, rather than constantly thinking about how the data might be encoded.

Small parsing functions like this are also straightforward to test:

#[cfg(test)]
mod tests {
    use super::flexible_bool;
    use serde::de::IntoDeserializer;

    fn parse(input: &str) -> Result<bool, serde::de::value::Error> {
        let deserializer: serde::de::value::StrDeserializer<serde::de::value::Error> =
            input.into_deserializer();
        flexible_bool(deserializer)
    }

    #[test]
    fn parses_true_variants() {
        for input in ["yes", "Yes", "YES", "y", "Y", "true", "1", "t", "T"] {
            assert_eq!(parse(input).unwrap(), true, "expected true for '{input}'");
        }
    }

    #[test]
    fn parses_false_variants() {
        for input in ["no", "No", "NO", "n", "N", "false", "0", "f", "F"] {
            assert_eq!(parse(input).unwrap(), false, "expected false for '{input}'");
        }
    }

    #[test]
    fn rejects_unknown_values() {
        for input in ["maybe", "unknown", "2", "yep", "nah"] {
            assert!(parse(input).is_err(), "expected error for '{input}'");
        }
    }
}

Testing the parsing layer pays dividends later. When you encounter a new variant in the data, you add it to the parser and add a test for it. When you encounter a value that should be rejected, you add a test for that too. Over time you build up confidence that your parser handles the real-world messiness of your data correctly. Tests also provide a valuable alternative source of documentation for the parsing rules, which can be more precise and easier to understand than an inline comment.

Pattern 2: Multi-format date parsing

Dates are often the most painful field in any multi-source dataset.


In one Python pipeline using Polars I ended up writing something like this:

pl.coalesce(
    pl.col(date_col).str.strptime(pl.Date, "%Y%m%d", strict=False),
    pl.col(date_col).str.strptime(pl.Date, "%F", strict=False),
    pl.col(date_col).str.strptime(pl.Date, "%d/%m/%Y", strict=False),
    pl.col(date_col).str.strptime(pl.Date, "%d-%m-%Y", strict=False),
    pl.col(date_col).str.strptime(pl.Date, "%FT%T", strict=False),
    pl.col(date_col).str.strptime(pl.Date, "%B %d, %Y", strict=False),
    pl.col(date_col).str.strptime(pl.Date, "%FT%TZ", strict=False),
).alias(date_col)

Each format appeared because a real dataset required it. The approach works, but it is repetitive and easy to forget.

With Serde, the parsing logic can live with the type:

use chrono::NaiveDate;
use serde::{Deserialize, Deserializer};

fn parse_clinical_date<'de, D>(deserializer: D) -> Result<NaiveDate, D::Error>
where
    D: Deserializer<'de>,
{
    let raw = String::deserialize(deserializer)?;
    let s = raw.trim();

    let formats = [
        "%Y-%m-%d",
        "%Y%m%d",
        "%d-%b-%Y",
        "%d/%m/%Y",
    ];

    for fmt in &formats {
        if let Ok(date) = NaiveDate::parse_from_str(s, fmt) {
            return Ok(date);
        }
    }

    Err(serde::de::Error::custom(format!(
        "no known date format matched: '{s}'"
    )))
}

Used like this:

#[derive(Debug, Deserialize)]
struct HospitalEpisode {
    patient_id: String,

    #[serde(deserialize_with = "parse_clinical_date")]
    admission_date: NaiveDate,
}

Optional date fields require a slightly different version that recognises sentinel values such as "NULL" or "NA":

fn parse_optional_clinical_date<'de, D>(
    deserializer: D,
) -> Result<Option<NaiveDate>, D::Error>
where
    D: Deserializer<'de>,
{
    let raw = String::deserialize(deserializer)?;
    let s = raw.trim();

    if s.is_empty() || s == "NULL" || s == "NA" || s == "." {
        return Ok(None);
    }

    let formats = ["%Y-%m-%d", "%Y%m%d", "%d-%b-%Y", "%d/%m/%Y"];

    for fmt in &formats {
        if let Ok(date) = NaiveDate::parse_from_str(s, fmt) {
            return Ok(Some(date));
        }
    }

    Err(serde::de::Error::custom(format!(
        "no known date format matched: '{s}'"
    )))
}

Again, the benefit is that the rule lives in one place.

This pattern has been a game-changer for me. Epidemiological studies are complicated enough without having to think about all the weird edge cases of how dates might be encoded. By handling this right at the point of deserialisation, I can be confident that any date field in my data is properly parsed and validated before it even enters the rest of my pipeline. The aim is always the same: reduce cognitive load so we can concentrate on the actual data analysis.

Pattern 3: Domain-constrained types

This is where the investment in Serde really starts to compound. Rather than representing categorical variables as strings and validating them somewhere downstream, you can encode the valid domain directly into the type system.

I love Serde’s alias attribute for this. Let’s say you have a smoking status field that arrives as “current”, “smoker”, “active”, “current smoker”, or half a dozen other variations depending on which GP system produced the extract. Each of the aliases in this enum was added because a real dataset contained that string:

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, PartialEq, Eq, Hash, Deserialize, Serialize)]
#[serde(rename_all = "lowercase")]
enum SmokingStatus {
    #[serde(alias = "non-smoker", alias = "nonsmoker", alias = "never smoked")]
    Never,

    #[serde(alias = "ex", alias = "ex-smoker", alias = "former smoker")]
    Former,

    #[serde(alias = "smoker", alias = "active", alias = "current smoker")]
    Current,

    #[serde(alias = "NA", alias = "", alias = "not recorded")]
    Unknown,
}

Once deserialised, the rest of the pipeline deals with SmokingStatus::Current rather than a string that might be spelled three different ways. If Serde encounters a value that does not match any variant or alias, it returns an error. No custom deserialiser needed - you get this straight out of the box with #[derive(Deserialize)] and the alias annotations.

The same idea works for numeric constraints. IMD quintiles, for example, are integers from 1 to 5. Anything outside that range is invalid, and we want to catch it at the point of deserialisation rather than discovering it later when our analysis produces odd results:

use serde::{Deserialize, Deserializer};

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ImdQuintile(u8);

impl<'de> Deserialize<'de> for ImdQuintile {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        let v = u8::deserialize(deserializer)?;
        if (1..=5).contains(&v) {
            Ok(ImdQuintile(v))
        } else {
            Err(serde::de::Error::custom(format!(
                "IMD quintile must be 1-5, got {v}"
            )))
        }
    }
}

Data types defined early in the pipeline carry with them all the rules about what is valid and invalid. The rest of the code can just work with clean, well-defined structures without worrying about the messy realities of the input. If we need to update the logic, we do so in one place and it automatically applies everywhere that type is used.

Pattern 4: Rename fields at the boundary


When you are working with data from multiple sources, column naming is a constant source of friction. You might have a hospital extract with a column called date that represents the admission date, and a GP extract also with a column called date that represents the consultation date. If you are going to join these datasets, you need to rename them.

Serde’s rename attribute lets you handle this at the point of deserialisation. Your struct fields use clear, descriptive names, and Serde maps them from whatever the source columns are actually called:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct HospitalEpisode {
    patient_id: String,

    #[serde(rename = "date")]
    admission_date: String,

    #[serde(rename = "disch_date")]
    discharge_date: String,

    #[serde(rename = "diag_1")]
    primary_diagnosis: String,
}

#[derive(Debug, Deserialize)]
struct GpConsultation {
    patient_id: String,

    #[serde(rename = "date")]
    consultation_date: String,

    #[serde(rename = "code")]
    clinical_code: String,
}

Once these records are in your program, there is no ambiguity. The hospital record has an admission_date and the GP record has a consultation_date. The renaming happened once, at the boundary, and every function downstream gets clear, self-documenting field names.

For sources that consistently use a different naming convention, rename_all saves you from annotating every field:

#[derive(Debug, Deserialize)]
#[serde(rename_all = "camelCase")]
struct FhirObservation {
    resource_type: String,   // "resourceType" in source
    effective_date: String,  // "effectiveDate" in source
    value_quantity: f64,     // "valueQuantity" in source
}

The goal is that internal types reflect your domain rather than the quirks of upstream systems.

Pattern 5: Mixed-type columns

One of the most common data quality issues I encounter is columns that contain mixed types. At first this seems unhinged - how can a column be both a string and a number? But in practice, it happens all the time. A CRP (C-reactive protein) column might contain 45, <1, or sample haemolysed. If you parse it as numbers, you lose the non-numeric values. If you parse it as strings, you lose the ability to work with the numeric ones easily. In pandas, this typically ends up as an object column and you deal with it later. Or you don’t, and it bites you.

Serde’s untagged enums let you model this reality directly:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum CrpValue {
    Numeric(f64),
    NonNumeric(String),
}

Serde tries each variant in order. If the value parses as an f64, you get CrpValue::Numeric. If not, it falls through to the String variant. The data is preserved either way, but now the type tells you which kind of value you are dealing with. Note that with CSV data, where all fields arrive as strings, you would use a custom Deserialize implementation that tries to parse as a number first. The concept is the same; the companion repo shows the full implementation.

The question is what to do with the non-numeric values. A string like "<1" is not garbage - it tells us the CRP was below the detection limit, which we often want to recode as 0. But "sample haemolysed" is genuinely unusable and should be treated as missing. A regex gives us a clean way to distinguish between the two:

use regex::Regex;
use std::sync::LazyLock;

static BELOW_LIMIT: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"^<\s*\d+(\.\d+)?$").unwrap()
});

impl CrpValue {
    fn to_numeric(&self) -> Option<f64> {
        match self {
            CrpValue::Numeric(n) => Some(*n),
            CrpValue::NonNumeric(s) if BELOW_LIMIT.is_match(s.trim()) => Some(0.0),
            CrpValue::NonNumeric(_) => None,
        }
    }
}

When you call .to_numeric() and get None, you have to decide what to do. Exclude the row? Impute? Log it? The choice is explicit, and the compiler enforces it at every call site.

As a researcher, I want to be forced to confront the messiness rather than having it hidden in an object column. Values at the extremes of the distribution - the "<1" results, the very high readings - are often the most clinically interesting. They represent the wellest or most unwell patients, and I definitely do not want to lose them by accidentally treating them as missing. At the same time, I want to be explicit about how I handle genuinely unusable values, and I do not want that logic scattered across multiple places. The enum lets me model the reality directly in the type and deal with it in one place.

This pattern applies anywhere a column carries mixed semantics. Financial data has fields that might be a price, "N/A", or "SUSPENDED". Sensor data has readings that alternate between numeric values and error codes. The untagged enum gives you a way to represent the mess honestly.

We can then do something like this in our pipeline:

/// LabResult struct representing a single row of the CSV,
///  with custom deserialization for the date and CRP value.
#[derive(Debug, Deserialize)]
struct LabResult {
    patient_id: String,

    #[serde(deserialize_with = "parse_clinical_date")]
    sample_date: NaiveDate,

    value: CrpValue,
}

What is interesting here is that the value field needs no Serde attribute at all. The parsing logic for the mixed-type column lives in the Deserialize implementation for CrpValue, so when LabResult is annotated with #[derive(Deserialize)], Serde automatically uses that logic for the field.

Pattern 6: Delimited lists in a single field

Serde has a number of sister crates that extend its functionality, and in my experience they are brilliant. serde_with is one of them, and I use it all the time. One of the things it gives you is the ability to parse delimited lists stored in a single field. This is unfortunately really common in health data, where a column might contain a list of clinical codes separated by semicolons, commas, or pipes. Patients rarely come with just one condition, so a simple hospital admission might have multiple diagnosis codes stored as "E119;I10;J45".

We could write a custom deserialiser for this, but serde_with gives us a convenient attribute to handle it:

use serde::Deserialize;
use serde_with::formats::SemicolonSeparator;
use serde_with::{serde_as, StringWithSeparator};

#[serde_as]
#[derive(Debug, Deserialize)]
struct PatientDiagnoses {
    patient_id: String,

    /// Diagnosis codes stored as "E119;I10;J45" in the source CSV.
    #[serde_as(as = "StringWithSeparator::<SemicolonSeparator, String>")]
    diagnosis_codes: Vec<String>,
}

This converts "E119;I10;J45" directly into a Vec<String>. The crate provides CommaSeparator, SemicolonSeparator, and SpaceSeparator out of the box, and you can implement the Separator trait to define your own if your data uses a different delimiter.


You can also use this with your own custom types as long as they implement FromStr, so if you had an IcdCode type with validation in its FromStr implementation, you could deserialise directly into a Vec<IcdCode>:

#[serde_as]
#[derive(Debug, Deserialize)]
struct PatientDiagnoses {
    patient_id: String,

    /// Deserializes "E119;I10;J45" directly into a Vec<IcdCode>,
    /// running each code through IcdCode's FromStr validation.
    #[serde_as(as = "StringWithSeparator::<SemicolonSeparator, IcdCode>")]
    diagnosis_codes: Vec<IcdCode>,
}
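The IcdCode type here is hypothetical; a minimal FromStr sketch (the shape check below is illustrative, not real ICD-10 validation) might look like:

```rust
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq)]
struct IcdCode(String);

impl FromStr for IcdCode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let code = s.trim().to_uppercase();
        // Minimal shape check: a letter followed by at least two
        // alphanumerics, e.g. "E119" or "I10". Real ICD-10
        // validation is considerably stricter.
        let mut chars = code.chars();
        let starts_with_letter =
            matches!(chars.next(), Some(c) if c.is_ascii_alphabetic());
        let rest_ok = chars.all(|c| c.is_ascii_alphanumeric() || c == '.');
        if starts_with_letter && rest_ok && code.len() >= 3 {
            Ok(IcdCode(code))
        } else {
            Err(format!("invalid ICD code: '{s}'"))
        }
    }
}

fn main() {
    assert!(IcdCode::from_str("e119").is_ok());  // normalised to "E119"
    assert!(IcdCode::from_str("I10").is_ok());
    assert!(IcdCode::from_str("10").is_err());   // must start with a letter
    assert!(IcdCode::from_str("??").is_err());
}
```

Because StringWithSeparator runs each element through FromStr, a single malformed code anywhere in the list fails the whole record at the boundary, which is exactly the behaviour we want.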

Pretty cool, right? This is one of those things that seems small but has a big impact on the readability and maintainability of your code. It keeps the parsing logic close to the data definition, and it makes it clear to anyone reading the code that this field is expected to contain a delimited list.

The serde_with crate provides several other transformations that are worth knowing about for data pipelines. These are not health-specific, but I have found them useful across different kinds of projects:

use serde_with::{serde_as, NoneAsEmptyString, BoolFromInt, DisplayFromStr};

#[serde_as]
#[derive(Debug, Deserialize)]
struct SensorReading {
    device_id: String,

    /// Source sends empty string instead of null for missing locations.
    #[serde_as(as = "NoneAsEmptyString")]
    location: Option<String>,

    /// Source encodes booleans as 0/1 integers.
    #[serde_as(as = "BoolFromInt")]
    is_calibrated: bool,

    /// Any type implementing FromStr can be deserialized from a string.
    #[serde_as(as = "DisplayFromStr")]
    reading_id: ReadingId,
}

DisplayFromStr is particularly useful when you already have a type with a well-tested FromStr implementation. Rather than writing a separate Deserialize impl that duplicates the parsing logic, you reuse what exists.

Pattern 7: Structs as schema and filter

In hospital episode data, the source file often includes columns that you don’t need or don’t trust. A computed length_of_stay field is a classic example. In my experience, this field is frequently wrong: off-by-one errors, inconsistent handling of same-day admissions, discharge notes written days after the actual discharge. You are better off computing it yourself.

In Python, dealing with unwanted columns means either loading everything and dropping what you don’t need, or maintaining a usecols list that has to stay in sync with the rest of your code. Neither is great.

With Serde and the csv crate, you just don’t include the column in your struct. Any columns in the source that don’t have a corresponding field are silently ignored:

#[derive(Debug, Deserialize)]
struct HospitalEpisode {
    patient_id: String,
    episode_id: String,

    #[serde(deserialize_with = "parse_clinical_date")]
    admission_date: NaiveDate,

    #[serde(deserialize_with = "parse_clinical_date")]
    discharge_date: NaiveDate,

    primary_diagnosis: String,

    #[serde(deserialize_with = "flexible_bool")]
    is_emergency: bool,

    // length_of_stay is in the source CSV but we don't include it here.
    // It simply doesn't exist in our program. We compute it ourselves.
}

impl HospitalEpisode {
    fn length_of_stay(&self) -> i64 {
        (self.discharge_date - self.admission_date).num_days()
    }
}

The struct is both the schema and the filter. There is no “drop this column” step, no list of column names to maintain. And because length_of_stay is a method that computes from the validated admission and discharge dates, we know it is correct by construction rather than trusting whatever the source system computed.

Error handling in pipelines

Throughout this post, I have been using serde::de::Error::custom inside custom deserialisers to return errors when values don’t match what we expect. This is the right approach inside a deserialiser, because Serde expects its own error type at that level. But in a real pipeline, Serde errors are just one of several things that can go wrong. You also have I/O errors, domain validation errors, and errors from downstream writes.

I use thiserror to bring all of these together into a single error type for the pipeline:

use thiserror::Error;

#[derive(Debug, Error)]
enum DataPipelineError {
    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),

    #[error("CSV error at line {line}: {source}")]
    Csv {
        line: usize,
        source: csv::Error,
    },

    #[error("JSON error: {0}")]
    Json(#[from] serde_json::Error),

    #[error("domain validation error at line {line}: {message}")]
    DomainValidation {
        line: usize,
        message: String,
    },
}

The custom deserialisers speak Serde’s language internally. When you call csv_reader.deserialize::<HospitalEpisode>() and it fails, the csv::Error (which wraps the Serde error) gets mapped into your DataPipelineError::Csv variant. Your pipeline code then works with a single error type that distinguishes between “the data was structurally invalid” and “the data parsed but violates a domain rule.” This distinction matters when you are working with a data provider to resolve quality issues. “Your file has 47 structurally invalid rows and 312 rows with logically impossible date combinations” is a much more useful conversation than “it didn’t work.”


I wrote more about error handling patterns in Rust in a series of posts, starting here, so I won’t go into more detail in this blog. The key point is that Serde’s errors are designed to compose with the rest of your error handling strategy, not replace it.

What Serde does not do (and what to watch out for)

I have been enthusiastic throughout this post, so it is worth being honest about the rough edges.

The learning curve for custom deserialisers is steep. If you are coming from Python or R, some of the more advanced Serde concepts like lifetime annotations and the deserialization internals take time to get comfortable with. The patterns in this post are the ones I have found most useful, and I hope they give you a head start, but there is a real investment involved. I think it is worth it, but I would be doing you a disservice if I pretended otherwise.

Compile times are real. On a project with a lot of derived structs, the proc macro expansion is measurable. For data pipeline code where correctness matters more than iteration speed, the trade-off has always been worthwhile for me. But if you are used to the fast feedback loop of a Python script, it is an adjustment.

Error messages from nested deserialisation can be opaque. If a field fails to parse deep inside a nested structure, the default error message can be unhelpful. This is where good test coverage of your custom deserialisers pays off. If you have tests that cover the expected inputs and edge cases for your parsers, you can be confident that they will handle the real data correctly, even if the error messages are not always crystal clear.

Serde’s derive macros give you structural validation, not semantic validation. #[derive(Deserialize)] will confirm that a date field contains a parseable date, that an integer field contains an integer, and that an enum field matches one of the defined variants. It will not, on its own, tell you that a patient with a birth date in 2030 and an admission date in 2024 is impossible. You can build semantic checks into your custom deserialisers - for example, rejecting dates in the future - but that is logic you write yourself and it of course limits the reusability of the deserialiser.

In practice, I find it useful to keep these as separate layers: Serde handles the structural validation at the point of deserialisation, and a separate domain validation step checks the business rules afterwards. It keeps both layers testable and easy to reason about.

Middle ground approach with polars

If you’re not ready to make the jump to Rust for your data pipelines, I would recommend looking at Polars as a stepping stone. It’s a DataFrame library written in Rust with a Python API, so you get a lot of the performance benefits of Rust without leaving the Python ecosystem. I’ve used it extensively and it is a huge improvement over pandas. But when you need the full control over validation, type safety, and memory that this post has been about, Serde and Rust are where I end up.

Wrapping up

The patterns in this post are the ones I reach for most often when building data pipelines in Rust. They are drawn from years of working with messy health data, but the underlying idea is not specific to health, or to Rust, or to Serde. It is about where you place the validation boundary in your pipeline.

Most pipelines I have worked on treat validation as something that happens later: after ingestion, after cleaning, after transformation. The problem is that by “later,” you have already done a lot of work on data that might not be what you think it is. Every step between ingestion and validation is a place where bugs can hide and incorrect assumptions can linger.

Serde lets you move that boundary to the earliest possible point. You define a type that says “this is what a valid record looks like.” Deserialisation either gives you a value of that type or an error. There is no in-between. The messy encodings, the inconsistent date formats, the mixed-type columns, the ambiguous column names - all of it is handled once, at the boundary, and the rest of your code works with clean, validated types.

I find that this approach reduces the cognitive load of working with messy data enormously. Instead of carrying a mental model of which fields have been cleaned, which transformations have been applied, and which edge cases might still be lurking, I can look at a struct definition and know exactly what guarantees it provides. The compiler enforces those guarantees, and the custom deserialisers document the messiness of the source data in a way that is both executable and reviewable.

That is why I love Serde. Not because it makes the easy things easier - honestly, for simple cases, Python is faster to get started with. But because it makes dangerous things safer. And in my world, where the data feeds into epidemiological studies and clinical decision-making, safe is what matters.
