Applying the Open-Closed Principle to Research Code

Let’s start with a scenario. You wrote some code for a research project last December. You ran the analysis, reported a number, wrote it up, and submitted the paper to a journal. Peer review came back a few months later with the usual list of points. One reviewer asked you to use a slightly different definition of the exposure variable as a sensitivity check. You opened the function, added an elif branch for the new definition, tidied up the shared logic above it while you were there, and sent the revised paper back. Months later, your boss asks you to pull a subset of the project together for a conference poster. You rerun the original analysis. The number is different. You cannot tell whether the number in the submitted paper is right, or the number on your screen is right, or both, or neither. You feel the dread of “oh no”!

For many people this scenario will be familiar. Times have changed. Science has progressed, and the day to day of doing science increasingly looks like the day to day of a software engineer. We are writing code, more and more of it, and we are writing it under research conditions. Unlike our friends in software or product teams, we often don’t have close colleagues who are reviewing and improving our code as we go along. In fact we are often not even working on the same codebase. So how can we know if we are writing good code? This is an area that fascinates me, and I like to look at design principles and philosophy from software engineering and think about how we can apply them to our scientific code. In this blog, I am going to take the principle of “Open/Closed” and apply it to research code to see if it is useful, and if so, how we can use it going forward. This is a nice companion piece to The Single Responsibility Principle for Scientists Who Write Code but it also stands alone if you just want to dip in here.

What is the Open/Closed Principle?

The Open/Closed Principle was first written about in 1988 by Bertrand Meyer, and it is the “O” of the SOLID principles of software design. Incidentally, the Single Responsibility Principle from my last blog is the “S”. The original definition is that software entities, like classes and functions, should be open for extension but closed for modification. In other words, you should be able to add new behaviour without editing the code that already works. What does that mean in practice? For example, if you have some code that saves information as a CSV or JSON file, and you want to add saving to PDF, you should not need to alter any of the functions that save to CSV or JSON to achieve this. You are adding rather than modifying. The reason this is a great idea is that in large codebases, modifying code in one file can unintentionally break functionality in a distant part of the program, because software engineers are encouraged to write reusable functions (the Don’t Repeat Yourself or DRY principle) and so the same function is often called from many places.
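As a rough sketch of that save example, here is one way it could look in Python. The function names are my own invention for illustration, and because writing a real PDF needs a third-party library, a plain-text table stands in for the "new format" here.

```python
import csv
import io
import json

def save_as_csv(rows: list[dict]) -> str:
    """Serialise rows to CSV text. Not edited once something depends on it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def save_as_json(rows: list[dict]) -> str:
    """Serialise rows to JSON text. Also left alone."""
    return json.dumps(rows)

# Added later: the new format is a new function (a stand-in for PDF export).
# Neither of the two functions above was opened to make this possible.
def save_as_text_table(rows: list[dict]) -> str:
    keys = list(rows[0].keys())
    header = " | ".join(keys)
    lines = [" | ".join(str(row[key]) for key in keys) for row in rows]
    return "\n".join([header, *lines])

def export(rows: list[dict], save_fn) -> str:
    """The caller chooses the format by passing in the function."""
    return save_fn(rows)

rows = [{"id": 1, "sbp": 120}]
print(export(rows, save_as_csv))
print(export(rows, save_as_text_table))
```

The extension point is the `save_fn` argument: supporting another format means writing another function and passing it in.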

The Open/Closed Principle was developed thinking about an object-oriented approach to coding, where we have inheritance and classes in a hierarchy. However in research, we are typically not using this style and tend to adopt a more functional approach. The question is how can we apply the Open/Closed Principle?

Closed

For me, the principle is really about a relationship between old code and new behaviour. For our purposes, it means that once a function has produced a reported result, it is closed. Any new behaviour comes from new code rather than edits to the old. Closed here means frozen relative to the results it has produced. A function can be retired, replaced, or sit alongside newer variants, but it should not be modified, because the moment it is modified, the link between the function and the number it produced in the past is broken. How you define “reported results” is up to you but I think of it as anything written in a format that someone outside your team has seen.

Why research code violates this principle by default

I would argue that the scientific process sets us up to violate this principle by default. Science is always iterating. A new paper comes out, a new equation gets validated and adopted during a long research project, a reviewer asks for an additional sensitivity analysis, or a collaborator wants to tweak a definition. These are all good things for science and are part of healthy collaboration. However, in practice each request ends up in an elif branch of a function that is already in use. That makes sense: you are working on the codebase and this is an addition to the code. But if you have ever been stuck in the scenario above, you will know there are some hidden costs.

Rerunning and iteration

Why are these costs hidden? It is really threefold in my experience. Firstly, the code usually still works. There is no error message and nothing to really flag that a result has gone through a subtly different code path. Secondly, the result is usually plausible. In my experience it is usually just a little off but totally believable. Thirdly, and I think this is a big one, we assume that when we get a different result, we have made a mistake. We have done something different or made an error somewhere, and we beat ourselves up about it. If only we could find the mistake. Researchers spend quiet hours trying to track this down.

I would argue that this is less about a mistake and more about a mismatch. Science iterates constantly, but our results are snapshots in time. When science is code, our code has been iterating too, and that makes it very hard to go back to those snapshots. You would never edit the methods section of a submitted paper without a paper trail, but the function implementing those methods gets edited casually, and if you are using git, even your git history rarely tells you which commit corresponds to which reported number.

What can we take from the Open/Closed Principle?

So how do we fix this? I think we need to move away from using branching logic (the if/elif) and into named functions that encapsulate each new addition. It is easier to work through an example at this point, and I am going to use missing data imputation because there are several methods that are commonly used. This is just an example though, and the principle applies to all areas of research, not just epidemiology or medicine. Here we are dealing with some data that has a degree of missingness and we are generating values for those missing values based on various simple methods.

Let’s say we have three imputation methods we might use to fill in missing values in a time series of patient measurements:

Mean imputation, where we replace missing values with the mean of the observed values
Median imputation, where we replace missing values with the median of the observed values
Last observation carried forward (LOCF), where we replace each missing value with the most recent observed value before it

We might start off with something like this:

def impute_missing(values, method):
    if method == "mean":
        # Fill with mean
        ...
    elif method == "median":
        # Fill with median
        ...

We have two methods and we produce a result that is now in a paper that has been submitted for peer review. Now let’s imagine that a collaborator asks us to also include LOCF as a sensitivity analysis for a poster they are submitting (a valid side quest), because the data are longitudinal and it might be more appropriate. So we add this in:

def impute_missing(values, method):
    if method == "mean":
        # Fill with mean
        ...
    elif method == "median":
        # Fill with median
        ...
    elif method == "locf":
        # Last observation carried forward
        ...

This works, and it feels like the natural thing to write when you start with one method and then add another. But every time you come back to this function to add a new variant, you are touching code that has already produced a reported result, in our case a submitted paper. This is the violation we are trying to avoid. There is a risk that when the reviewers’ comments come back, we will not be able to reproduce the same result.
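To make that risk concrete, here is a small made-up illustration. Both functions and the "tidy-up" are hypothetical: the second version adds the LOCF branch and, while it is there, also decides that zero readings are invalid. The mean path was never meant to change, but it does.

```python
from statistics import mean, median

def impute_missing_as_submitted(values, method):
    """The function as it stood when the paper was submitted."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    else:
        raise ValueError(f"Unknown method: {method}")
    return [fill if v is None else v for v in values]

def impute_missing_after_edit(values, method):
    """The same function after adding LOCF and tidying the shared logic."""
    # Hypothetical tidy-up: zero readings are now dropped as invalid too.
    observed = [v for v in values if v is not None and v > 0]
    if method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    elif method == "locf":
        filled, last = [], None
        for v in values:
            last = v if v is not None else last
            filled.append(last)
        return filled
    else:
        raise ValueError(f"Unknown method: {method}")
    return [fill if v is None else v for v in values]

readings = [120.0, None, 0.0, 130.0]
before = impute_missing_as_submitted(readings, "mean")
after = impute_missing_after_edit(readings, "mean")
print(before[1], after[1])  # the imputed value differs, with no error raised
```

Nothing fails and both numbers look plausible, which is exactly why this kind of change is so hard to spot later.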

The fix is to give each method its own function with a descriptive name, and have them share a signature so they are interchangeable at the call site. What does a signature mean? It means that the functions take in the same kinds of arguments and return the same type of result, so for example a series of values in and a series of values out.

def impute_mean(values):
    ...

def impute_median(values):
    ...

def impute_locf(values):
    ...

So now if a new method gets added, say multiple imputation or model-based imputation, you write a brand new function and do not touch the old ones at all. The old ones are “closed”. You can be reassured that your paper from last year used impute_mean, and that function has not been edited since, so the result is reproducible intentionally and by construction. We can then use these functions like this:

def analyse_with_imputation(df, column, impute_fn):
    df = df.copy()
    df[column] = impute_fn(df[column])
    return fit_model(df)

results_mean = analyse_with_imputation(df, "systolic_bp", impute_mean)
results_median = analyse_with_imputation(df, "systolic_bp", impute_median)
results_locf = analyse_with_imputation(df, "systolic_bp", impute_locf)

To my knowledge most programming languages allow this, where you can pass a function into another function and have it applied inside. Here our analysis code takes the imputation method as an argument rather than branching internally. If we wanted to add in a regression as a method, we could do the same:

results_regression = analyse_with_imputation(df, "systolic_bp", impute_regression)
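If you want to run this pattern end to end, here is a minimal sketch. The `fit_model` below is a hypothetical stand-in (it just returns the mean of the column) because the real model is not shown in this post, and the data are invented.

```python
import pandas as pd

def impute_mean(values: pd.Series) -> pd.Series:
    return values.fillna(values.mean())

def impute_median(values: pd.Series) -> pd.Series:
    return values.fillna(values.median())

def impute_locf(values: pd.Series) -> pd.Series:
    return values.ffill()

def fit_model(df: pd.DataFrame) -> float:
    """Hypothetical stand-in for a real model: the mean systolic BP."""
    return float(df["systolic_bp"].mean())

def analyse_with_imputation(df, column, impute_fn):
    df = df.copy()  # leave the caller's dataframe untouched
    df[column] = impute_fn(df[column])
    return fit_model(df)

df = pd.DataFrame({"systolic_bp": [120.0, None, 130.0, 150.0]})

# Same orchestrator, three different methods passed in as arguments.
for impute_fn in (impute_mean, impute_median, impute_locf):
    print(impute_fn.__name__, analyse_with_imputation(df, "systolic_bp", impute_fn))
```

The orchestrator contains no mention of any particular method, so adding a fourth one means adding one function and one entry in the loop.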

There are a few things that fall out of this which I think are worth naming explicitly.

Firstly, naming becomes really important. When each method has its own function, the function name becomes the thing a future reader uses to trace a reported number back to the code that produced it. impute_mean, impute_median, and impute_locf are good names because they describe the method, which is what you would write in the methods section of a paper. impute_v1 and impute_v2 are bad names because they describe the order you happened to write them in rather than what they are. Your git history also starts to tell the more accurate story: “added LOCF imputation” rather than “modified imputation function (again)”.
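One way to lean on those names, sketched here with a hypothetical `run_and_record` helper and a plain dictionary as the record, is to store the function's name next to the number it produced. This is only an illustration using plain lists, not a recommendation of any particular results format.

```python
from statistics import mean, median

def impute_mean(values):
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_median(values):
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def run_and_record(values, impute_fn):
    """Hypothetical helper: keep the method name alongside the result."""
    imputed = impute_fn(values)
    return {
        "imputation": impute_fn.__name__,  # e.g. "impute_median"
        "mean_after_imputation": mean(imputed),
    }

record = run_and_record([1.0, None, 2.0, 9.0], impute_median)
print(record)
```

With descriptive names the record is self-explanatory. With impute_v2 in that field it would tell a reader very little.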

Secondly, adding type hints to your signatures really helps. They make the shape of each function visible at a glance, and they let your editor or type checker catch mistakes before you run the code.

import pandas as pd

def impute_mean(values: pd.Series) -> pd.Series:
    ...

def impute_median(values: pd.Series) -> pd.Series:
    ...

def impute_locf(values: pd.Series) -> pd.Series:
    ...

The signatures are now genuinely identical, which is what makes them interchangeable at the call site. The differences between methods live inside the functions, where they belong. A reader who wants to know how each method works can look inside, and a reader who just wants to swap methods does not need to know anything about the internals.
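You can also give the shared signature itself a name. This is a small sketch using `Callable` from Python's standard `typing` module. The alias name `ImputeFn` and the `apply_imputation` function are my own choices for illustration.

```python
from typing import Callable

import pandas as pd

# A name for "any function that takes a series and returns a series".
ImputeFn = Callable[[pd.Series], pd.Series]

def impute_mean(values: pd.Series) -> pd.Series:
    return values.fillna(values.mean())

def impute_locf(values: pd.Series) -> pd.Series:
    return values.ffill()

def apply_imputation(values: pd.Series, impute_fn: ImputeFn) -> pd.Series:
    """Accepts any function that matches the ImputeFn shape."""
    return impute_fn(values)

series = pd.Series([1.0, None, 3.0])
print(apply_imputation(series, impute_mean).tolist())  # [1.0, 2.0, 3.0]
print(apply_imputation(series, impute_locf).tolist())  # [1.0, 1.0, 3.0]
```

A type checker can then flag a call site that passes in a function with the wrong shape before anything is run.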

Finally, we can use the DRY principle that we mentioned earlier. The imputation methods share quite a lot of logic underneath. They all need to identify which values are missing, validate that there are some non-missing values to work with, and preserve the index of the original series. We do not need to repeat this in every method function. We can pull it into shared helpers:

def validate_has_observed_values(values: pd.Series) -> None:
    # Shared guard: there must be at least one observed value to impute from.
    if values.dropna().empty:
        raise ValueError("Cannot impute when all values are missing")

def impute_mean(values: pd.Series) -> pd.Series:
    validate_has_observed_values(values)
    # fillna returns a new series and keeps the original index.
    return values.fillna(values.mean())

def impute_median(values: pd.Series) -> pd.Series:
    validate_has_observed_values(values)
    return values.fillna(values.median())

def impute_locf(values: pd.Series) -> pd.Series:
    validate_has_observed_values(values)
    # Forward fill: a missing value before the first observation stays missing.
    return values.ffill()

The method functions stay closed once they have produced a reported result. The helpers stay shared and can be used by any method. This is DRY working with the Open/Closed principle rather than against it: the shared logic lives in one place, and each method function composes those helpers in the way it requires, without any branching on which method is being used.
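Here is a quick check of how these pieces behave on a small invented series. It repeats two of the functions so that it runs on its own.

```python
import pandas as pd

def validate_has_observed_values(values: pd.Series) -> None:
    if values.dropna().empty:
        raise ValueError("Cannot impute when all values are missing")

def impute_mean(values: pd.Series) -> pd.Series:
    validate_has_observed_values(values)
    return values.fillna(values.mean())

def impute_locf(values: pd.Series) -> pd.Series:
    validate_has_observed_values(values)
    return values.ffill()

series = pd.Series([None, 4.0, None, 8.0], index=["a", "b", "c", "d"])

mean_filled = impute_mean(series)
locf_filled = impute_locf(series)
print(mean_filled.tolist())  # [6.0, 4.0, 6.0, 8.0]
print(locf_filled.tolist())  # [nan, 4.0, 4.0, 8.0]

# The shared guard fires the same way whichever method is used.
all_missing = pd.Series([None, None], dtype="float64")
try:
    impute_mean(all_missing)
except ValueError as error:
    print(error)
```

Note that LOCF leaves the first value missing because there is nothing before it to carry forward. That is a property of the method, and it is worth knowing before it lands in a model.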

A word of caution though. Be careful what you pull into a shared helper. If you find yourself adding an if method == "mean" branch inside a helper, you have just recreated the bloated branching function one level down. Helpers should be genuinely shared logic that does not vary by method, not a place to hide variant-specific behaviour.

How does this fit with the Single Responsibility Principle?

If you read my last post on the Single Responsibility Principle, you might be thinking: doesn’t this go against what you said there? In that post I argued that a function like calculate_egfr should take raw values rather than a dataframe, so that it does not depend on any particular data structure and can be reused anywhere. The imputation example above does the same thing: each function takes a series, not a dataframe with hard-coded column names, and does one well-defined thing.

But what happens when the variants genuinely need different inputs? The imputation example works cleanly because every method takes the same thing (a series of values) and returns the same thing (a series of values). Sometimes you do not have that luxury. Let me show you what I mean using the eGFR (estimated glomerular filtration rate) calculation, where there are several published equations with subtle but important differences in what they need.

Take CKD-EPI 2009 and CKD-EPI 2021. The 2009 version requires a race coefficient as one of its inputs. The 2021 version dropped the race coefficient entirely, after the equation was updated to remove race-based adjustments. So the pure equations look like this:

def ckd_epi_2009(creatinine: float, age: int, sex: str, black: bool) -> float:
    """The CKD-EPI 2009 equation. Returns eGFR."""
    ...

def ckd_epi_2021(creatinine: float, age: int, sex: str) -> float:
    """The CKD-EPI 2021 equation. Race coefficient removed."""
    ...

These are pure SRP-style functions, exactly the kind I argued for in the previous post. They take raw values, return a number, know nothing about pandas, and can be tested in isolation by passing in known inputs and asserting on the output. Notice that ckd_epi_2021 does not take a black argument at all. The race coefficient was removed from the equation in 2021, and the signature reflects that. A reader of the code can see this scientific decision without opening the function.
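To show what testing in isolation can look like, here is a sketch of the 2021 function with a body filled in. Treat it with care: the coefficients are my transcription of the published CKD-EPI 2021 creatinine equation and should be checked against the original publication before any real use. I am also assuming creatinine is in mg/dL and that sex is passed as "female" or "male".

```python
def ckd_epi_2021(creatinine: float, age: int, sex: str) -> float:
    """Sketch of the CKD-EPI 2021 creatinine equation.

    Assumptions: creatinine in mg/dL, sex is "female" or "male".
    Coefficients should be verified against the original publication.
    """
    if sex not in ("female", "male"):
        raise ValueError(f"Unexpected sex value: {sex!r}")
    kappa = 0.7 if sex == "female" else 0.9
    alpha = -0.241 if sex == "female" else -0.302
    ratio = creatinine / kappa
    egfr = 142 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.200 * 0.9938 ** age
    if sex == "female":
        egfr *= 1.012
    return egfr

# Property-style checks that need no dataframe and no pandas.
# Higher creatinine should give a lower eGFR, all else equal.
assert ckd_epi_2021(1.5, 60, "male") < ckd_epi_2021(1.0, 60, "male")
# Older age should give a lower eGFR, all else equal.
assert ckd_epi_2021(1.0, 70, "female") < ckd_epi_2021(1.0, 50, "female")

print(round(ckd_epi_2021(1.0, 60, "male"), 1))
```

In a real project you would also assert against worked values from a trusted source. The point here is only that a pure function makes such checks a few lines long.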

But these two functions have different signatures, which is a problem if we want to swap them at a single call site the way we did with the imputation methods. If we tried to call them both from the same place, we would end up branching on which equation it is, and we are right back to the bloated if/elif we were trying to escape.

The way out is a thin row adapter for each equation:

def ckd_epi_2009_row(row) -> float:
    return ckd_epi_2009(
        creatinine=row["creatinine"],
        age=row["age"],
        sex=row["sex"],
        black=row["black"],
    )

def ckd_epi_2021_row(row) -> float:
    return ckd_epi_2021(
        creatinine=row["creatinine"],
        age=row["age"],
        sex=row["sex"],
    )

Each adapter does one thing: it pulls out the fields the underlying equation needs and calls it. The adapters share a signature (row) -> float, which is what makes them interchangeable at the call site. If you want to read more about adapters in health data science, I wrote about this a while back. The pure equations underneath keep all the SRP virtues from the previous post.

Now the orchestrator stays clean:

def analyse_cohort(df, egfr_fn):
    df = df.copy()
    df["egfr"] = df.apply(egfr_fn, axis=1)
    return df.groupby("ckd_stage").size()

results_2009 = analyse_cohort(df, ckd_epi_2009_row)
results_2021 = analyse_cohort(df, ckd_epi_2021_row)
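Here is a runnable sketch of the whole adapter pattern. To avoid misquoting the real equations, `equation_a` and `equation_b` use placeholder arithmetic that I made up; they only mimic the situation where one equation needs an extra input. The summary step is also simplified to a mean so that the example stays small.

```python
import pandas as pd

# PLACEHOLDER arithmetic only: these are NOT published eGFR equations.
def equation_a(creatinine: float, age: int, sex: str, black: bool) -> float:
    value = (100.0 / creatinine - 0.5 * age) * (0.9 if sex == "female" else 1.0)
    return value * (1.2 if black else 1.0)

def equation_b(creatinine: float, age: int, sex: str) -> float:
    return (100.0 / creatinine - 0.5 * age) * (0.9 if sex == "female" else 1.0)

# Thin adapters: both share the shape (row) -> float.
def equation_a_row(row) -> float:
    return equation_a(
        creatinine=row["creatinine"],
        age=row["age"],
        sex=row["sex"],
        black=row["black"],
    )

def equation_b_row(row) -> float:
    return equation_b(
        creatinine=row["creatinine"],
        age=row["age"],
        sex=row["sex"],
    )

def analyse_cohort(df: pd.DataFrame, egfr_fn) -> float:
    df = df.copy()
    df["egfr"] = df.apply(egfr_fn, axis=1)
    # Simplified stand-in for the real summary step.
    return float(df["egfr"].mean())

cohort = pd.DataFrame({
    "creatinine": [1.0, 2.0],
    "age": [40, 60],
    "sex": ["female", "male"],
    "black": [True, False],
})

print(analyse_cohort(cohort, equation_a_row))
print(analyse_cohort(cohort, equation_b_row))
```

The orchestrator never asks which equation it has been given. All the knowledge about which columns an equation needs lives in that equation's adapter.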

If a new 2026 equation came out that needed cystatin C as well as creatinine, you would write ckd_epi_2026 with whatever signature is right for that equation, and then ckd_epi_2026_row to adapt it to your dataframe. The orchestrator and all the previous equations and adapters stay untouched.

So the two principles end up working together rather than in opposition. SRP keeps each pure equation focused on one concern: the science. Open/Closed keeps each equation archival once it has produced a reported result. The adapters are a thin layer between them, doing the unglamorous work of mapping column names to function arguments. These are likely one-use only functions just for this research study, and that is fine.

A quick note. This is more code than just writing one row-taking function per equation, and for a small one-off analysis you might decide it is not worth it. That is also fine. The split is more useful when you are testing the equations properly, reusing them across studies, or working in a team where someone else is going to call your code.

Names are part of the method

I have talked a lot about naming conventions for variables at various conferences and tutorials I have done for researchers. It is a favourite topic of mine (I am not sure what that says about me!). Here, however, I am totally justified. The Open/Closed principle only works if the function names make the scientific meaning plain to see. A function with a meaningless name is just as difficult to unpick as complex branching logic. You can’t trace it!

So my tips are:

Avoid names that describe the order you wrote the functions in rather than what they are: v1, v2, _new, _final, _updated - these all rot the moment a third version appears, and none of them answer the question “which version did you use?”. You don’t want to get into the place of egfr_v2_final_final_no_really_final.
The name should match what you would write in your methods section of the paper. So for example if you are writing we used the CKD-EPI 2021 equation, your function should be called ckd_epi_2021. This makes it easier for you and anyone else looking at the code to match them.
Be suspicious of adjectives rather than methods. So avoid: improved_, corrected_, better_, or fixed_, because they encode a value judgement rather than a description and they imply the old version was wrong, which is rarely what you mean in a context where the old version produced a reported result.
Let the function signatures reflect the science: if one equation takes a particular coefficient and the next one drops it, the signatures should differ visibly rather than being papered over with **kwargs or ignored parameters, because a reader noticing the dropped argument is learning something real.
Keep the variant out of the orchestrator’s name: run_mortality_analysis should take the equation as an argument, not be called run_mortality_analysis_ckd_epi_2021, because a variant that has leaked into the pipeline name is a variant you can no longer swap.
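The point about signatures can be checked mechanically. Here is a small sketch using Python's standard `inspect` module. The function bodies are left empty because only the signatures matter, and `egfr_any` is a hypothetical example of the papered-over style to avoid.

```python
import inspect

def ckd_epi_2009(creatinine: float, age: int, sex: str, black: bool) -> float:
    ...

def ckd_epi_2021(creatinine: float, age: int, sex: str) -> float:
    ...

# Hypothetical anti-pattern: the signature hides which inputs matter.
def egfr_any(creatinine: float, age: int, sex: str, **kwargs) -> float:
    ...

params_2009 = list(inspect.signature(ckd_epi_2009).parameters)
params_2021 = list(inspect.signature(ckd_epi_2021).parameters)
dropped = set(params_2009) - set(params_2021)

print(params_2009)  # ['creatinine', 'age', 'sex', 'black']
print(params_2021)  # ['creatinine', 'age', 'sex']
print(dropped)      # {'black'}
```

With the explicit signatures, the scientific change between the two equations is visible to a reader and to tooling. With `egfr_any`, all anyone can see is `kwargs`.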

Open/Closed as a reproducibility principle

I am just going to reflect on what this means for reproducibility. When we talk about reproducibility in science, we usually talk about tooling. Using Docker so the environment is the same. Requirements files so the package versions are the same. Random seeds so the stochastic bits are the same. These things definitely matter, and I am not arguing against any of them. But they are all solving the same problem, which is environment drift. This is a huge issue in research, as I have already written about. But they do something different from what I am arguing for in this blog. They protect you against the world around your code changing, like a Python version update or a breaking change in a package. What they do not protect you against is you changing the shape of your own code!

Different piece of reproducibility

Open/Closed protects you against your code changing. Or more accurately, it helps protect you against your code changing in ways that quietly invalidate results you have already reported. The strongest form of reproducibility is not “I can rerun this study and get the same answer”. That is a useful thing of course, but it is a weaker claim than people realise, because if the code has been edited since the answer was reported, you are really just hoping the edits did not affect the result! The strongest form of reproducibility in my mind is “the code that produced this number still exists, unchanged, and I can point to it”. That is what closed-for-modification gives you.

I think researchers already understand this principle in other parts of their work; they just have not connected it to their code. We have pre-registration, where you write down your analysis plan before you see the data, and you do not silently change it later. We have versioned protocols, where every revision gets a number and a date and the old versions are kept on file. We have study amendments, which are formal documents that exist precisely because changing the plan after the fact is a serious thing that needs to be acknowledged. The whole apparatus of research integrity is built on the idea that you do not quietly edit the things that defined your study.

Open/Closed is the code-level version of that same idea. You would not silently edit a pre-registered analysis plan, because everyone in research understands why that would be a problem. But many of us will quite happily edit the function that implements that analysis plan, because no one ever told us it was the same kind of problem. It is the same kind of problem! The function is the analysis plan, just written in a language that runs. Basically a machine-readable research protocol.

Wrapping up

If I had to boil this down to a few rules of thumb, they would be these. Once a function has produced a result you have shared with anyone outside your team, treat it as archival and do not edit it. New variants go in new functions, with names that describe what the function does rather than the order you wrote it in. Variants should share a signature so they are interchangeable at the call site, and where they cannot share a signature, a thin adapter can do the bridging. The orchestrating code is the thing that changes over time, the scientific primitives like a calculation equation are the things that stay still, and keeping those two layers distinct is key.

So if you find yourself about to add an elif to a function that already works, stop. The fix is almost always a sibling function rather than a new branch.
