Clinic to Code to Care
Caroline Morton
October 26, 2025
This blog is an adaptation of a talk that Steph Jones and I gave at Women in Data and AI in October 2025. It explores the journey of information from a patient in clinic to how that information is coded for research and ultimately ends up informing statistical and machine learning models that can help improve patient care. I hope it provides a useful overview of the process and highlights some of the challenges and opportunities along the way.
How data gets into the system
Imagine you are a patient visiting your doctor because you have a cough. The doctor asks you a series of questions about your symptoms and how long you have had them, and performs a number of physical examinations, including measuring your blood pressure and temperature, listening to your lungs, and checking your throat. They come to the conclusion that you have a chest infection and prescribe you some antibiotics, with advice to rest up and come back if you don’t feel better in a week or so. From your perspective, this is a straightforward interaction, and you will see your doctor typing up notes on their computer. You might assume that this simple interaction generates something like:
Patient has a cough and fever. Blood pressure 120/80, temperature 38°C. Signs of a chest infection on examination. No known drug allergies. Prescribed amoxicillin. Advised to rest and return if no improvement in one week.
However, the reality is that the information you provide to your doctor is recorded in a much more complex way. The notes your doctor takes contain a mixture of structured data - what we call coded events - and some free text, which we call unstructured data. The coded events are things like your symptoms, the results of your physical examination, and the treatment your doctor prescribes. These are recorded using standardised codes that allow for consistent recording and analysis of health data. For example, your cough might be recorded using the SNOMED CT code 49727002, which represents “Cough”. Your fever might be recorded using the code 103001002, which represents “Feeling feverish”. The antibiotics prescribed might be recorded using the code 774586009, which represents “Amoxicillin”. Blood pressure and temperature are recorded as numerical values.
But how does your doctor know which codes to use? This is where clinical coding comes in. Clinical coding is the process of translating the information recorded in your medical notes into standardised codes. Behind the scenes, there is a complex tree of codes that represent different medical concepts. In general practice, the coding system is called SNOMED CT, while hospitals use a mixture of SNOMED CT, ICD-10, and OPCS-4. There is a whole blog written about SNOMED here.
In summary, SNOMED codes are numerical representations of medical concepts that allow for consistent recording, and they are arranged in a hierarchical tree structure. For example, Cough would be a child of Respiratory Symptom, and the parent of subtypes of cough like Dry Cough or Allergic Cough. This hierarchical structure allows for more flexible and detailed recording of medical information. I will explain why this adds complexity later.
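To make the hierarchy concrete, here is a minimal Python sketch of the idea. The only real code here is 49727002 (“Cough”); the child codes and the exact parent-child relationships are illustrative placeholders, not real SNOMED CT content.

```python
# A minimal sketch of a SNOMED-style hierarchy as a parent -> children mapping.
# 49727002 ("Cough") is real; the child codes below are made-up placeholders.
HIERARCHY = {
    "49727002": ["100000001", "100000002", "100000003"],  # Cough -> its subtypes
    "100000001": [],  # e.g. "Dry cough" (placeholder code)
    "100000002": [],  # e.g. "Productive cough" (placeholder code)
    "100000003": [],  # e.g. "Allergic cough" (placeholder code)
}

def descendants(code: str) -> set[str]:
    """Return every code below `code` in the hierarchy (children, grandchildren, ...)."""
    found: set[str] = set()
    stack = [code]
    while stack:
        for child in HIERARCHY.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

# Everything a codelist author has to review when they start from "Cough".
print(descendants("49727002"))
```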
So how does your doctor know which codes to use?
Well, they don’t. Your doctor is not a clinical coder. They are typing up notes on their computer, and the system they are using will offer them a drop-down menu of possible words and phrases to choose from, a bit like autocomplete. So as they start to type “pneumonia”, the system will offer them a list of possible matches, and they can select the one that best fits, probably the first one. They don’t see the numerical code, just the word or phrase. From their perspective, they are just typing up notes on their computer, and this is a helpful computer system that makes it easier for them to record the information by adding in these suggestions. They can save a few seconds of typing, and they don’t have to remember the exact spelling of every medical term.
What is happening on the backend is that the system inserts a SNOMED CT code into the notes, along with the corresponding term. So when your doctor selects “Cough” from the drop-down menu, the system inserts the code 49727002 into the notes, along with the term “Cough”. This is what we call a coded event. It gets recorded with the date and time of the consultation, and the patient’s unique identifier.
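As a rough sketch of what the system does behind the dropdown, the snippet below looks up the selected term and builds the coded-event row. The lookup table and field names here are simplified assumptions for illustration, not a real GP system’s interface.

```python
from datetime import datetime

# Simplified term -> SNOMED CT code lookup. Real systems search a full terminology
# server, but the principle is the same: pick a term, the code comes along with it.
TERM_TO_CODE = {
    "Cough": "49727002",
    "Feeling feverish (finding)": "103001002",
    "Pneumonia (disorder)": "233604007",
}

def record_coded_event(patient_id: str, term: str, value=None) -> dict:
    """Build the row that gets written when the clinician selects a term."""
    return {
        "datetime": datetime.now().isoformat(timespec="minutes"),
        "patient_id": patient_id,
        "code": TERM_TO_CODE[term],
        "term": term,
        "value": value,  # used for numerical results like blood pressure
    }

print(record_coded_event("12345", "Cough"))
```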
With this information, you can see that our simple interaction with our doctor has generated a number of coded events:
| Date | Time | Patient ID | Code | Term | Value |
|---|---|---|---|---|---|
| 2025-10-01 | 10:00 | 12345 | 49727002 | Cough | |
| 2025-10-01 | 10:00 | 12345 | 103001002 | Feeling feverish (finding) | |
| 2025-10-01 | 10:00 | 12345 | 11111111 | Systolic Blood Pressure | 120 |
| 2025-10-01 | 10:00 | 12345 | 22222222 | Diastolic Blood Pressure | 80 |
| 2025-10-01 | 10:00 | 12345 | 386725007 | Body temperature | 38 |
| 2025-10-01 | 10:00 | 12345 | 272016000 | On examination - chest finding | |
| 2025-10-01 | 10:00 | 12345 | 162965007 | On examination - coarse crepitations (finding) | |
| 2025-10-01 | 10:00 | 12345 | 233604007 | Pneumonia (disorder) | |
| 2025-10-01 | 10:00 | 12345 | 716186003 | No known drug allergy (situation) | |
| 2025-10-01 | 10:00 | 12345 | 774586009 | Amoxicillin 500mg capsules (product) | |
Importantly, the clinician still sees the words and phrases, so it appears as if they are just typing up notes; they will usually only be aware that coding is happening when they see it on the patient’s summary care record.
From clinic to database
Once your doctor has finished typing up their notes, they will save them, and the information will be stored in a database. The data is stored in tables as you might expect, with each row representing a single coded event, and columns for the patient identifier, the code, the term, the date and time of the event, and other relevant information. I am simplifying a bit here, as usually events, observations (like blood pressure), and prescriptions are stored in separate tables, but the principle is the same. It all ends up being combined with other people’s data in a large database.
Now our simple interaction with our doctor has generated 10 coded events, and if we multiply that by the number of patients a doctor sees in a day, and the number of doctors in a practice, and the number of practices in a region, and the number of regions in a country, you can start to see how quickly this data can add up. In the UK, there are millions of rows of data generated every day from primary care alone.
This data is incredibly valuable for research. It allows researchers to study patterns of disease, the effectiveness of treatments, and the impact of public health interventions. You can see from our simple interaction with our doctor that we have generated a wealth of information about our health, which could perhaps be included in a research study about chest infections, or the likelihood of being prescribed antibiotics for a cough, or the average blood pressure of patients of a certain age. The possibilities are endless.
From database to clean data
The next step in the journey is to clean and prepare the data for analysis. We want to extract a subset of the data that is relevant to our research question, and this is where things can get a bit tricky. Usually we want to end up with a table that contains one row per patient, and columns for the variables we are interested in. We need to “flatten” the data from multiple rows per patient to a single row per patient, where each patient has a value for each variable we are interested in. For a logistic regression, we might want a binary outcome variable (e.g. did the patient have a chest infection or not), and if we are doing a time-to-event analysis, a date of diagnosis (e.g. when the chest infection was diagnosed). We also want to include some covariates, such as diabetes or a previous history of chest infections, all of which involve flattening the data from multiple rows per patient to a single row per patient.
This presents a number of challenges. First, we need to define our variables. For example, how do we identify all patients with a cough? We might decide to use the SNOMED CT code 49727002, but what about other codes that might represent a cough, such as “Dry Cough” or “Productive Cough”? These are children of the parent code “Cough” in the SNOMED CT hierarchy. We need to decide whether to include these codes or not, and this can be a subjective decision. For example, we might want to exclude “Allergic Cough” if we are specifically interested in infectious causes of cough, as it is not relevant to that research question.
This is where codelists come in. A codelist is simply a list of codes that we have decided to include in our analysis. We might create a codelist for “Cough” that includes the parent code 49727002, and all its children except for “Allergic Cough” and “Coughing ineffective”. We will need to do this for every variable we are interested in, and this can be a time consuming process. There are some resources available online that provide pre-made codelists for common conditions, but these are not always comprehensive or up to date, so we often end up creating our own codelists from scratch. This is a crucial step in the process, as the quality of our codelists will directly impact the quality of our analysis.
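In practice, a codelist is often just a small table of codes and terms, kept as a CSV and version controlled so the inclusion and exclusion decisions are explicit and reviewable. A minimal sketch (only 49727002 is a real code; the others are placeholders):

```python
import csv

# A hypothetical codelist for "infectious cough": the parent code plus the children
# we decided to keep after review. Codes other than 49727002 are placeholders.
cough_codelist = [
    {"code": "49727002", "term": "Cough"},
    {"code": "100000001", "term": "Dry cough"},
    {"code": "100000002", "term": "Productive cough"},
    # "Allergic cough" and "Coughing ineffective" deliberately left out
]

with open("cough_codelist.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["code", "term"])
    writer.writeheader()
    writer.writerows(cough_codelist)
```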
Once we have defined our variables and created our codelists, we can start to extract the relevant data from the database or, more usually, from flat CSV files. We write code to do this, usually in SQL, Python, R, or Stata.
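As a hedged sketch of what that extraction code might look like in Python with pandas: filter the long table of coded events down to the codes in our codelist, then flatten it to one row per patient with a binary flag and a first-event date. The filenames and column names are assumptions for illustration.

```python
import pandas as pd

# Long table of coded events: one row per event, as in the table earlier in this post.
events = pd.read_csv("coded_events.csv", parse_dates=["date"], dtype={"code": str})
codelist = pd.read_csv("cough_codelist.csv", dtype={"code": str})

# Keep only the events whose code appears in our codelist.
cough_events = events[events["code"].isin(codelist["code"])]

# Flatten: one row per patient, with the date of their first matching event.
per_patient = (
    cough_events.groupby("patient_id")
    .agg(first_cough_date=("date", "min"))
    .reset_index()
)
per_patient["has_cough"] = 1

# Join back onto the full patient list so patients with no matching code get a 0.
patients = events[["patient_id"]].drop_duplicates()
analysis_table = patients.merge(per_patient, on="patient_id", how="left")
analysis_table["has_cough"] = analysis_table["has_cough"].fillna(0).astype(int)
print(analysis_table.head())
```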
From clean data to analysis
Once we have extracted the relevant data, we can write our statistical code to perform our analysis. This might involve running a logistic regression to identify risk factors for chest infections, or a time to event analysis to study the impact of antibiotics on the duration of symptoms. We might also want to create some visualisations to help us understand the data, such as Kaplan-Meier curves or forest plots.
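For example, a logistic regression on a flattened, one-row-per-patient table might look something like the sketch below (using statsmodels; the column names are hypothetical and assume the covariates have already been derived).

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per patient: a binary outcome plus covariates, built as in the previous step.
analysis_table = pd.read_csv("analysis_table.csv")

# Logistic regression: odds of a chest infection given age, diabetes and prior infections.
model = smf.logit(
    "chest_infection ~ age + diabetes + previous_chest_infection",
    data=analysis_table,
).fit()
print(model.summary())
```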
The possibilities are endless, and the data we have extracted from our simple interaction with our doctor can provide valuable insights into the patterns of disease and the effectiveness of treatments. However, bias can creep in at each stage, and it is the job of the epidemiologist or data scientist to try to identify and mitigate these biases. Every study has limitations. Here we are concentrating on biases that arise from the data itself, rather than from study design or analysis, and we are not going to cover confounding, which is a whole other topic.
Here are a few biases that can arise from the data itself:
- Selection bias - not everyone goes to their GP when they are unwell: some will go to A&E, some will self-care, some will go to a pharmacist, and some will not seek care at all. This means that the data we have extracted from the database might not be representative of the entire population. For example, if we are studying chest infections, we might only be capturing the cases that present to primary care, missing both the milder cases that self-care and the more severe cases that go straight to A&E. This can lead to biased estimates of disease risk and treatment effectiveness. Since the data is only generated from one part of the healthcare system, we are missing a lot of information from other parts of the system.
- Information bias - the data we have extracted from the database might be incomplete or inaccurate. For example, if a doctor forgets to code a cough, or codes it incorrectly, this can lead to misclassification of patients and biased estimates of disease risk. Maybe the doctor didn’t click the dropdown menu and just typed “cough” in free text, which is not coded and usually not extractable for research (although natural language processing is starting to change this).
- Measurement bias - we have put this under measurement bias, but it is really a combination of information and measurement bias. SNOMED CT is an incredibly rich and complex coding system that includes what you would assume to be incredibly rare or specific events, like being hit by debris from a falling aircraft (code 443761005). However, it is also missing codes, or lacks sufficiently specific codes, for some conditions, which makes those conditions hard to study.
- Codelist bias - the codelists we have created might not be comprehensive or accurate. For example, if we have missed a code that represents a cough, this can lead to misclassification of patients and biased estimates of disease risk. If we have included codes that are not relevant to our research question, this can also lead to biased estimates of disease risk.
- Temporal bias - the data we have extracted from the database might not be up to date. This is important because research is carried out on snapshots of data, not on a live database (there are some exceptions to this, but they are rare).
- Incorrect data - sometimes data is just plain wrong. A common example is pregnancy codes being applied in a variety of ways at different stages of pregnancy. For example, a pregnancy code might be used to indicate a positive pregnancy test, or a code that gives an estimated due date might be used to indicate a confirmed pregnancy. The end of pregnancy is also often not recorded, or not recorded in a timely manner. The pregnancy might be recorded as ongoing for weeks to months after it has ended, and might only be updated incidentally, for example at the six-week postnatal check. All of this can lead to misclassification of patients and biased estimates of disease risk. A clever researcher might use a new “person” joining a household as a proxy for a new baby being born, but this is not always accurate, and address matching, even within the same database, is not always reliable.
How does this data end up in AI models?
There is not a clear line between statistical models, machine learning models, and AI. Many different centres use the same data, and the academic groups that are experts in statistical models are not usually the same as those working on machine learning models. However, the data journey is similar for both types of models.
One thing that Steph made clear in our talk is that there is often a large range of data sources that are used to train AI models, and that a key part of the process is data integration. This involves combining data from multiple sources, such as primary care, secondary care, laboratory data, imaging data, and social care data. This is a complex process that involves matching patients across different datasets, and dealing with missing data, inconsistent coding, and other challenges.
Some examples of the sort of issues with the data:
- Inconsistent coding - the same condition recorded differently (e.g. “HTN” vs “Hypertension”).
- Missing values - diagnoses without dates, prescriptions without doses.
- Duplicate patients - same person across hospital systems.
- Unit confusion - mg vs µg, mmHg vs kPa.
- Temporal gaps - visit dates that don’t line up, missing follow-ups.
Data might be recorded in multiple ways across different systems, and this needs to be standardised and harmonised before it can be used for analysis. This task is even bigger if you want to take into consideration other countries’ healthcare systems, to get a more diverse dataset or an idea of what is happening globally. This is all very much an extension of the sort of problems that epidemiologists have been dealing with, as I have described above, but on a much larger scale. I was struck by the fact that epidemiologists are often trying to use the smallest amount of data possible to answer a specific question, whereas AI researchers are often trying to use the largest amount of data possible to train a model that can be applied more generally. This leads to different approaches to data cleaning and preparation, and different priorities when it comes to dealing with missing data, inconsistent coding, and other challenges.
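As a toy illustration of the kind of harmonisation step this involves - mapping synonyms onto a single term and converting doses onto a common unit - here is a small sketch; the mappings and column names are assumptions, not any particular centre’s pipeline.

```python
import pandas as pd

# A toy extract from two systems with inconsistent coding and mixed units.
df = pd.DataFrame({
    "patient_id": ["A1", "A2", "A3"],
    "diagnosis":  ["HTN", "Hypertension", "hypertension"],
    "dose":       [500, 0.5, 500],
    "dose_unit":  ["mg", "g", "mg"],
})

# Harmonise diagnosis terms onto a single label.
synonyms = {"htn": "Hypertension", "hypertension": "Hypertension"}
df["diagnosis"] = df["diagnosis"].str.lower().map(synonyms).fillna(df["diagnosis"])

# Convert all doses to milligrams.
to_mg = {"mg": 1, "g": 1000, "µg": 0.001}
df["dose_mg"] = df["dose"] * df["dose_unit"].map(to_mg)
print(df)
```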
Using a common data model
At the AI Centre for Value Based Healthcare where Steph works, they are using OMOP as their common data model. OMOP stands for Observational Medical Outcomes Partnership Common Data Model and is a widely used standard for representing healthcare data in a consistent way. It is well worth reading more about OMOP if you are interested in this topic, and there is a great introduction here. The diagram in that link gives a good overview of the different tables in the OMOP data model, and how they relate to each other, and was something that Steph referred to in our talk.
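As a very rough sketch of what this looks like in practice, a single SNOMED-coded event from our example might be mapped into a row of the OMOP condition_occurrence table along these lines. The concept_id value below is a placeholder: real mappings are looked up in the OHDSI vocabulary tables.

```python
# Sketch of mapping one coded event from our example into an OMOP-style
# condition_occurrence row. The concept_id is a placeholder; in practice the
# standard concept for the SNOMED source code comes from the OHDSI vocabularies.
source_event = {
    "patient_id": "12345",
    "code": "233604007",
    "term": "Pneumonia (disorder)",
    "date": "2025-10-01",
}

condition_occurrence = {
    "person_id": source_event["patient_id"],
    "condition_concept_id": 0,                       # placeholder for the mapped standard concept
    "condition_start_date": source_event["date"],
    "condition_source_value": source_event["code"],  # the original SNOMED code is kept
}
print(condition_occurrence)
```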
Using standardised tools
Steph also talked about the importance of using standardised tools for data extraction and transformation. Drawing again on her work at the AI Centre for Value Based Healthcare, she recommended using dbt, which is a bit like a recipe book for data transformation. It helps automate checks, build trusted data pipelines, and document the data transformation process. This last part was particularly interesting to me, as I have often found it hard to understand how a cleaned dataset was created from the raw data when reading research papers, or even when collaborating with other researchers. Having a documented data transformation process is crucial for reproducibility and transparency. dbt makes short work of common data transformation tasks like converting dates, removing duplicates, combining raw tables, and mapping local clinical codes to OMOP. The standardisation of dates is music to my ears, as I once had to write code to convert over 10 different date formats into a standard format for analysis!
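To give a flavour of that date problem, here is a minimal Python sketch of normalising a handful of the formats you might meet in raw extracts; in a dbt project the same logic would live in SQL models, but the idea is identical.

```python
from datetime import datetime

# A few of the many date formats that turn up in raw healthcare extracts.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%b-%Y", "%Y%m%d"]

def standardise_date(raw: str) -> str:
    """Try each known format in turn and return an ISO 8601 date string."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

print([standardise_date(d) for d in ["2025-10-01", "01/10/2025", "01-Oct-2025", "20251001"]])
```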
Discussion points
I have covered a lot of ground in this blog, and there are many more topics that I could explore in more detail. Here are some discussion points that I think are worth considering, and that Steph and I talked about in our talk and with the audience:
- Transparency and explainability are key to building trust in AI models. Clinicians and patients need to understand how these models work and how they are making decisions.
- As discussed above, there are many points where bias can creep into the data. There are the types of bias that have existed for a long time in epidemiology, but there are also new types that arise from using LLMs and AI models and integrating them with clinical systems. Algorithmic bias, as discussed, is a big topic and something we need to be aware of and actively try to mitigate. We always need to be asking what a model was optimised for, and what the implications of this are for its use in healthcare. Deployment bias is another important topic.
- Synthetic data is an exciting new area that has the potential to address unbalanced datasets, missingness and encourage sharing and open science. If you want a primer on synthetic data, check this blog out. What was really interesting from the question and answer session after our talk was that around 70% of the questions were about synthetic data, and people were really keen to understand how it could be used across a range of different industries, not just health. I plan to write more about synthetic data in future blogs as this is a rapidly evolving area with lots of new developments and lots of potential applications.
- Continuous post-deployment monitoring with fairness metrics to detect performance drift or disparate impact on specific populations is crucial to ensure that AI models remain effective and equitable over time.
Steph and I also talked about the importance of having more women in data and AI, and the importance of diversity in teams working on these projects. Diverse teams are more likely to identify and mitigate biases in the data and the models, and to develop solutions that are more equitable and inclusive.
References and further reading
- What is an EHR?
- What is SNOMED?
- Mapping Data Flows in the NHS - this is a fantastic paper by Zhang et al that goes into a lot more detail about how data flows through the NHS, and I would recommend reading it if you are interested in this topic.
- Steph Jones’ Blog