The Single Responsibility Principle for Scientists Who Write Code

Author: Caroline Morton

Date: April 15, 2026

Most research code starts life as a script. You load a dataset, clean it, fit a model and write out the results. It works, but then your study evolves, and you end up with a function that looks like this.

import pandas as pd
import statsmodels.formula.api as smf

def analyse_cohort(input_path, output_path):
    # Load data
    df = pd.read_csv(input_path)
    df.columns = [c.lower().strip() for c in df.columns]
    df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], dayfirst=True)
    df["index_date"] = pd.to_datetime(df["index_date"], dayfirst=True)

    # Derive age
    df["age"] = (df["index_date"] - df["date_of_birth"]).dt.days / 365.25

    # Clean and exclude
    df = df[df["age"] >= 18]
    df = df[df["age"] <= 100]
    df = df.dropna(subset=["systolic_bp", "creatinine", "hba1c"])
    df = df[df["systolic_bp"] > 0]
    df = df[df["creatinine"] > 0]
    df = df[~df["region"].isin(["unknown", ""])]

    # Create derived variables
    df["egfr"] = round(175 * ((df["creatinine"] / 88.4) ** -1.154) * (df["age"] ** -0.203))
    df.loc[df["sex"] == "F", "egfr"] *= 0.742
    df["ckd_stage"] = pd.cut(
        df["egfr"],
        bins=[0, 15, 30, 45, 60, 90, float("inf")],
        labels=["5", "4", "3b", "3a", "2", "1"],
    )
    df["hba1c_mmol"] = (df["hba1c"] - 2.15) / 0.0915
    df["diabetic"] = df["hba1c_mmol"] >= 48

    # Fit model
    df["outcome"] = df["died_within_year"].astype(int)
    formula = "outcome ~ age + C(sex) + egfr + systolic_bp + diabetic + C(region)"
    model = smf.logit(formula, data=df).fit(disp=0)

    # Write everything out
    summary = model.summary2().tables[1]
    summary.to_csv(output_path / "model_coefficients.csv")
    df.to_csv(output_path / "analysis_cohort.csv", index=False)
    with open(output_path / "exclusions.txt", "w") as f:
        f.write(f"Final cohort size: {len(df)}\n")
    print(f"Done. {len(df)} patients included.")

This is the sort of function that grows organically. It loads data, cleans it, applies exclusions, creates clinical variables, fits a model, and writes the output. Nobody sets out to write a long and complicated function that does six different things. It starts out as a script, and then somebody wraps it in a function so that it can be used on different datasets or cuts of the same data, perhaps a small dataset for development and then the full one. As the analysis develops, things get slotted in, and suddenly this function is quite long and unwieldy.

So what’s the problem?

There are three broad issues with this sort of structure. First, it quickly becomes quite hard to read and reason about. Research on how programmers read code suggests we can hold roughly 2-6 chunks of information in working memory at any one time. As we load and accumulate variables in our code, we quickly exceed that limit. By the time you reach the model fitting section of analyse_cohort(), you have lost track of what df contains.

Second, you can’t test any part of the function without loading the real data and then running the whole function, which in this case is probably the whole pipeline. That might be slow, or error prone. Suppose we want to test whether the eGFR calculation (a measure of kidney function) is accurate: we would have to load the actual data, and if analyse_cohort() fails, is that because the eGFR calculation is wrong or because the model doesn’t fit?

Third, if you want to reuse any part of the function elsewhere in the study, or in a different study, you have to copy-paste it out. This matters for two reasons: you have to make sure you got every single line, and if you later update how you do that calculation (say you start excluding negative input values as inaccurate), you have to remember to update it everywhere you have copied and pasted this code.

All of these consequences have the same root problem: your function has too many responsibilities. It is trying to do too many things.

Too many things

What is the single responsibility principle?

The Single Responsibility Principle was formulated by Robert C. Martin back in 2003 as part of his popular book Agile Software Development: Principles, Patterns, and Practices, but the ideas had been circulating for at least a decade before then. The original text says “a module should have one, and only one, reason to change”, and it was later clarified that this is really about actors: a module should be responsible to only one group of stakeholders. Now, I don’t find the term “actors” all that helpful, but we can roughly translate it to concerns. A concern might be data loading, data cleaning, statistical modelling, calculating a new variable, or formatting an output. These are all different concerns that might change for different reasons in the future, and to satisfy different stakeholders.

The short version of what this means is that a function should do one thing, and you should be able to describe what it does in one sentence without using the word “and”. By this test, our function analyse_cohort() fails immediately. It loads data and cleans it and … you get the idea!

What SRP looks like in research code

Let’s apply the single responsibility principle to our analyse_cohort() function. We will break it down into its component parts and then write a function for each part.

First data loading:

def load_patient_data(path: Path) -> pd.DataFrame:
    """
    Load patient data into a dataframe, lowercasing column names and stripping surrounding whitespace
    """
    df = pd.read_csv(path)
    df.columns = [c.lower().strip() for c in df.columns]
    return df

We load the CSV, lowercase the column names, and strip out any surrounding whitespace - a structural fix so we can reference columns reliably downstream. Notice this function isn’t just a thin wrapper around pd.read_csv. If all we were doing was reading a CSV, we wouldn’t need a function - we’d just call pd.read_csv() directly.

The column normalisation is the bit that’s specific to our data, and that’s what earns this function its existence. Notice our function name describes exactly what it is doing. We can also write a one-line docstring, which we couldn’t do for analyse_cohort() because it did six things. We can also add type hints that tell us exactly what goes in and what comes out. For those of you not familiar with type hints, this reference is helpful.
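To see that normalisation in action without touching disk, we can hand pd.read_csv a file-like object instead of a path - a minimal sketch, where the messy header is invented purely for illustration:

```python
import io

import pandas as pd

def load_patient_data(path) -> pd.DataFrame:
    """
    Load patient data into a dataframe, lowercasing column names
    and stripping surrounding whitespace
    """
    df = pd.read_csv(path)
    df.columns = [c.lower().strip() for c in df.columns]
    return df

# pd.read_csv accepts any file-like object, so we can exercise the
# column normalisation on an in-memory CSV with a messy header.
csv = io.StringIO("Patient_ID , Systolic_BP \n1,120\n")
df = load_patient_data(csv)
print(list(df.columns))  # ['patient_id', 'systolic_bp']
```

The same trick - substituting a StringIO for a real file - is handy in tests, since it keeps them fast and free of fixture files.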

Next, data cleaning. We can split this into two functions:

def convert_dates_from_string(df: pd.DataFrame) -> pd.DataFrame:
    """
    Convert date columns from strings to datetimes
    """
    df = df.copy()
    df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], dayfirst=True)
    df["index_date"] = pd.to_datetime(df["index_date"], dayfirst=True)
    return df

def drop_invalid_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove rows with missing or invalid values
    """
    df = df.dropna(subset=["systolic_bp", "creatinine", "hba1c"])
    df = df[df["systolic_bp"] > 0]
    df = df[df["creatinine"] > 0]
    return df

Here we have separated out two data cleaning actions - converting date strings into date objects and removing invalid clinical values like a negative blood pressure. By not mixing these in one function called clean_data(), we have improved readability: we can see exactly what our data cleaning process is.
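Because drop_invalid_values now stands alone, we can watch it work on a toy dataframe - the three invented patients below are purely for illustration:

```python
import pandas as pd

def drop_invalid_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove rows with missing or invalid values
    """
    df = df.dropna(subset=["systolic_bp", "creatinine", "hba1c"])
    df = df[df["systolic_bp"] > 0]
    df = df[df["creatinine"] > 0]
    return df

# Three toy patients: one valid, one with a negative blood pressure,
# one with a missing creatinine.
df = pd.DataFrame({
    "systolic_bp": [120, -1, 130],
    "creatinine": [90.0, 85.0, None],
    "hba1c": [6.1, 7.2, 5.9],
})

print(len(drop_invalid_values(df)))  # 1 - only the first patient survives
```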

Then we apply our eligibility criteria:

def derive_age(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate patient age at index date.
    """
    df = df.copy()
    df["age"] = (df["index_date"] - df["date_of_birth"]).dt.days / 365.25
    return df

def exclude_ineligible_patients(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove patients who do not meet study eligibility criteria.
    """
    df = df[df["age"] >= 18]
    df = df[df["age"] <= 100]
    df = df[~df["region"].isin(["unknown", ""])]
    return df

We derive age as its own step because it gets used in two places - the eligibility filter here and the eGFR calculation later. If it lived inside exclude_ineligible_patients, we’d have to derive it again when we calculate eGFR, and now we have the same logic in two places - exactly the copy-paste problem we identified earlier that we want to avoid. With age derived separately, exclude_ineligible_patients is purely filtering.
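A quick sanity check of derive_age on a single invented patient - born 1 January 1980 with an index date of 1 January 2020 - shows the day-count arithmetic landing on a whole number of years for this particular pair of dates:

```python
import pandas as pd

def derive_age(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate patient age at index date.
    """
    df = df.copy()
    df["age"] = (df["index_date"] - df["date_of_birth"]).dt.days / 365.25
    return df

df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["01/01/1980"], dayfirst=True),
    "index_date": pd.to_datetime(["01/01/2020"], dayfirst=True),
})

# 40 years spanning 10 leap days: 14610 days / 365.25 = exactly 40.0.
print(derive_age(df)["age"].iloc[0])  # 40.0
```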

Now, you could argue that each filter should be its own function, but my yardstick here is to avoid shadowing the functionality of the underlying data structure - which in this case is a pandas dataframe. If we are just filtering on age via pandas, a wrapper function is unnecessary unless you really need the readability. When it comes to testing, we would end up testing pandas functionality rather than our own code, which is something to avoid.

Now we move onto creating our clinical variables - of which there are three. These are all quite different so we should create different functions for each. Now there are a couple of options here, and I have a preference. We could create a calculate_egfr() that takes in a dataframe, and outputs a dataframe like this:

def calculate_egfr(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate estimated glomerular filtration rate (eGFR).
    """
    df = df.copy()
    creatinine_mgdl = df["creatinine"] / 88.4
    df["egfr"] = 175 * (creatinine_mgdl ** -1.154) * (df["age"] ** -0.203)
    df.loc[df["sex"] == "F", "egfr"] *= 0.742
    return df

Or, we could create a function that takes in plain values, does the calculation, and is applied to the dataframe via pandas’ own mechanism, using a lambda function.

def calculate_egfr(creatinine: float, age: float, is_female: bool) -> int:
    """
    Calculate estimated glomerular filtration rate (eGFR).
    Creatinine in µmol/L.
    """
    creatinine_mgdl = creatinine / 88.4
    egfr = 175 * (creatinine_mgdl ** -1.154) * (age ** -0.203)
    if is_female:
        egfr *= 0.742
    return round(egfr)

We can then use it like this:

df["egfr"] = df.apply(
    lambda row: calculate_egfr(row["creatinine"], row["age"], row["sex"] == "F"),
    axis=1
)

I prefer the second version. This function knows nothing about pandas or whatever dataframe library you are using; in fact, it does not care about the data structure at all. It takes a creatinine value, an age, and whether the patient is female, and it gives you back a number. This makes it far more reusable because you are not tied to your column names. For example, if you had a second dataframe where age was called age_in_years, the first option would not work at all because that column does not exist. It is also easier to test, as we can calculate the result for some known values and then assert on the output. More on testing later.
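For instance, against a hypothetical second dataset whose columns happen to be named serum_creatinine and age_in_years (invented names for illustration), the pure function works unchanged - only the glue code changes:

```python
import pandas as pd

def calculate_egfr(creatinine: float, age: float, is_female: bool) -> int:
    """
    Calculate estimated glomerular filtration rate (eGFR).
    Creatinine in µmol/L.
    """
    creatinine_mgdl = creatinine / 88.4
    egfr = 175 * (creatinine_mgdl ** -1.154) * (age ** -0.203)
    if is_female:
        egfr *= 0.742
    return round(egfr)

# A hypothetical second dataset with differently named columns.
other = pd.DataFrame({
    "serum_creatinine": [90.0],
    "age_in_years": [50.0],
    "sex": ["M"],
})
other["egfr"] = other.apply(
    lambda row: calculate_egfr(
        row["serum_creatinine"], row["age_in_years"], row["sex"] == "F"
    ),
    axis=1,
)
print(other["egfr"].iloc[0])  # 77
```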

We do the same for CKD staging and diabetic classification:

def stage_ckd(egfr: float) -> str:
    """
    Assign CKD stage based on eGFR value.
    """
    if egfr < 15:
        return "5"
    elif egfr < 30:
        return "4"
    elif egfr < 45:
        return "3b"
    elif egfr < 60:
        return "3a"
    elif egfr < 90:
        return "2"
    return "1"

def classify_diabetic(hba1c: float) -> bool:
    """
    Classify whether a patient is diabetic based on HbA1c.
    """
    hba1c_mmol = (hba1c - 2.15) / 0.0915
    return hba1c_mmol >= 48

Notice that the clinical thresholds - 48 mmol/mol for diabetes, the eGFR cutoffs for CKD staging - each live in exactly one place. Disease definitions do change - they certainly have since I trained! If the threshold for diabetes is updated, or the CKD staging boundaries are revised, you change one function and every analysis that calls it picks up the new definition. In the original analyse_cohort(), these thresholds were buried in a 40-line function, and if you had copy-pasted that function into a second study, you would need to find and update every copy.
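A few spot-checks make those boundaries concrete. The expected values below follow directly from the cutoffs in the two functions, restated here so the snippet is self-contained:

```python
def stage_ckd(egfr: float) -> str:
    """
    Assign CKD stage based on eGFR value.
    """
    if egfr < 15:
        return "5"
    elif egfr < 30:
        return "4"
    elif egfr < 45:
        return "3b"
    elif egfr < 60:
        return "3a"
    elif egfr < 90:
        return "2"
    return "1"

def classify_diabetic(hba1c: float) -> bool:
    """
    Classify whether a patient is diabetic based on HbA1c.
    """
    hba1c_mmol = (hba1c - 2.15) / 0.0915
    return hba1c_mmol >= 48

# Boundary behaviour is explicit: an eGFR of exactly 15 is stage 4, not 5.
print(stage_ckd(14.9))         # '5'
print(stage_ckd(15))           # '4'
print(stage_ckd(50))           # '3a'
print(classify_diabetic(6.5))  # False (about 47.5 mmol/mol)
print(classify_diabetic(7.0))  # True (about 53 mmol/mol)
```

Checks like these transfer directly into unit tests, which is where they really earn their keep.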

We could then wrap these into a single function that creates all our clinical variables:

def create_clinical_variables(df: pd.DataFrame) -> pd.DataFrame:
    """
    Derive all clinical variables needed for the analysis.
    """
    df = df.copy()
    df["egfr"] = df.apply(
        lambda row: calculate_egfr(row["creatinine"], row["age"], row["sex"] == "F"),
        axis=1,
    )
    df["ckd_stage"] = df["egfr"].apply(stage_ckd)
    df["diabetic"] = df["hba1c"].apply(classify_diabetic)
    return df

This function knows about our dataframe and our column names - that’s its job. It acts as glue between our pure clinical logic and our data structure. The calculations themselves remain independent and testable.

Finally our model fitting and outputting the results to files:

from statsmodels.discrete.discrete_model import BinaryResultsWrapper

def fit_logistic_model(df: pd.DataFrame) -> BinaryResultsWrapper:
    """
    Fit a logistic regression model for one-year mortality.
    """
    df = df.copy()
    df["outcome"] = df["died_within_year"].astype(int)
    formula = "outcome ~ age + C(sex) + egfr + systolic_bp + diabetic + C(region)"
    return smf.logit(formula, data=df).fit(disp=0)

def save_results(model, df: pd.DataFrame, output_path: Path) -> None:
    """
    Write model coefficients and analysis cohort to disk.
    """
    summary = model.summary2().tables[1]
    summary.to_csv(output_path / "model_coefficients.csv")
    df.to_csv(output_path / "analysis_cohort.csv", index=False)
    with open(output_path / "exclusions.txt", "w") as f:
        f.write(f"Final cohort size: {len(df)}\n")

Note that here we are writing multiple files. If we were just writing one file, this would probably be a thin wrapper around pandas functionality and wouldn’t need to be a function.

You might wonder whether the whole pipeline should itself become a function. I think this is up to you. If you have more than one dataset, then it is a good idea. We can do it like this:

def analyse_cohort(input_path: Path, output_path: Path) -> None:
    """
    Run the full analysis pipeline from raw data to results
    """
    df = load_patient_data(input_path)
    df = convert_dates_from_string(df)
    df = drop_invalid_values(df)
    df = derive_age(df)
    df = exclude_ineligible_patients(df)
    df = create_clinical_variables(df)
    model = fit_logistic_model(df)
    save_results(model, df, output_path)

If you compare this function to the original, you will see it does the same thing, but now the main body reads like a table of contents. Each line is a step with a sensible name. You can reorder the steps if appropriate, or add a new function call easily. When the guidelines for one of your clinical variables change, you can update that specific function and the change applies everywhere.

Some practical notes

In the original function, you might have noticed a print statement reporting a result. This is what we call a side effect: the function reaches out and does something beyond returning a value. Print statements are fine when you are debugging, but they should not live in production code - they are usually invisible to tests and can clutter output. If you do want printed output, it is best to use a logger.

You also might want to chain the calls together like this:

df = load_patient_data(input_path).pipe(convert_dates_from_string).pipe(drop_invalid_values)

Please don’t do this. It might look elegant, although I would argue it isn’t, because I find it much harder to read. It also makes debugging painful: if it breaks somewhere, the stack trace points at the whole chain rather than the step that broke.

Another advantage of the step-by-step approach in our analyse_cohort() function is that we can pull each line out and use it in a Jupyter notebook. You can run df = load_patient_data(input_path) in one cell, inspect the result, and write some markdown to explain what you are doing. This makes the analysis very easy to read and understand, and is something I have found useful for teaching or explaining research to the rest of the team.

Testing

Now that each function does one thing, we can test each thing. Testing is not common in research code, but it is absolutely something we should adopt. This post is not intended to be a deep dive on testing - that warrants its own series - but the core idea is simple: you give your function some known inputs and check that you get the expected output back.

Let’s take calculate_egfr(). We test it with two different sets of inputs:

def test_calculate_egfr_male():
    assert calculate_egfr(creatinine=90, age=50, is_female=False) == 77

def test_calculate_egfr_female():
    assert calculate_egfr(creatinine=90, age=50, is_female=True) == 57

We pass in some values and we get a number back which we check. It is much harder to write a test like this for the original analyse_cohort() because the eGFR calculation is trapped inside a function that also needs a csv file, valid dates, and a model that fits. You would need the rest of the code to be correct to test just a few lines of the eGFR calculation.

We can write edge case tests as well. I like this naming structure:

def test_<name_of_function_being_tested>_<specific_condition>

This way you can tie a test back to its function by name. Tests become their own form of documentation.

Edge cases can be domain-specific or code-specific. A domain-specific edge case might be a clinically implausible value. A code-specific one might be passing in the wrong type:

def test_calculate_egfr_negative_creatinine():
    with pytest.raises(ValueError):
        calculate_egfr(creatinine=-10, age=50, is_female=False)

def test_calculate_egfr_age_is_string():
    with pytest.raises(TypeError):
        calculate_egfr(creatinine=90, age="fifty", is_female=False)

If you actually run these two tests against our current calculate_egfr(), they expose gaps. A negative creatinine does not raise a ValueError - the arithmetic just produces a nonsensical value - so the first test fails. A string age does blow up with a TypeError, but only incidentally, somewhere in the arithmetic, with an unhelpful error message rather than through deliberate validation. The tests have exposed gaps in our function that we did not know were there. So we go back, add validation checks, and now the function is better than it was before we tested it. This is the cycle that SRP enables: smaller functions are easier to test, testing reveals problems, and fixing those problems leads to better code.
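One way to add that validation - a sketch, with the exact checks and error messages being a matter of taste:

```python
def calculate_egfr(creatinine: float, age: float, is_female: bool) -> int:
    """
    Calculate estimated glomerular filtration rate (eGFR).
    Creatinine in µmol/L.
    """
    # Fail fast on the gaps the tests exposed.
    for name, value in (("creatinine", creatinine), ("age", age)):
        if not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be numeric, got {type(value).__name__}")
    if creatinine <= 0:
        raise ValueError(f"creatinine must be positive, got {creatinine}")
    if age <= 0:
        raise ValueError(f"age must be positive, got {age}")
    creatinine_mgdl = creatinine / 88.4
    egfr = 175 * (creatinine_mgdl ** -1.154) * (age ** -0.203)
    if is_female:
        egfr *= 0.742
    return round(egfr)
```

With these guards in place, both edge-case tests pass for the right reason: bad input is rejected explicitly at the top of the function, with a message that says what went wrong, rather than failing somewhere in the arithmetic.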

Wrapping up

A common objection I hear is that this is more code. That is true, but I would argue the complexity is lower: each piece is simpler to understand. It is easy to lose track of what you are doing in a large chunk of code inside a monolithic function. If you look at our final analyse_cohort() function, the flow is right there - a sequence of steps taken in order.

We can expand SRP outwards and think beyond functions. This might mean separating our code into files such as data_loading.py or models.py. We are essentially saying: this is all the code that does data loading, and this is all the code that does modelling. It makes your project more navigable than one single file. That is probably a topic for another day, however!

The practical takeaway is that if you find yourself adding extra lines to an existing function, ask whether it might be better to extract that functionality into its own function. If you have been following along with my blog posts and talks, you know I care a great deal about reproducibility and scientific trust: I see more widespread adoption of SRP in research as a simple but important part of this.
