Creating Codelists for Health Data Research

This blog post is the second part of a two-part series that accompanies a lecture I am giving on Codelists as part of the Health Data in Practice MSc at Queen Mary University of London.

In the previous post, we looked at what codelists are and why they are both important and difficult to create. In this post, we are going to look at some of the methods for creating a codelist for your study.

Broadly speaking, there are two approaches: you either make your own from scratch or you adapt an existing codelist for your purposes. We are going to start by looking at how to create a codelist from scratch.

Making a new codelist from scratch

This approach should be used when you are fairly sure that no existing codelist does what you need. In other words, there is no prior art that you can build upon. This is quite rare, but it does happen.

Let’s imagine that we want to create a codelist to find all the patients who have visited their GP and been recorded as having a cough. In reality, there are plenty of existing codelists for cough, but let’s pretend for the purposes of a simple example that we don’t have one or can’t access one.

Before we start looking for relevant codes, we should sit down and define the following:

What are our inclusion criteria?
What are our exclusion criteria?

To work through this properly, let’s define a study question that will help frame who should be in our cough group:

Do Type 2 Diabetics receive more oral antibiotics for their coughs than non-diabetics in general practice?

In this scenario, our population is Type 2 Diabetics, our exposure is cough, and our outcome is oral antibiotics. For now, we are going to concentrate only on our exposure of cough.

We could write our inclusion and exclusion criteria as:

All acute coughs, excluding coughs for known non-acute reasons such as allergic cough and chronic cough.

Now we have our definition, we can search for codes. In this case, we are going to use SNOMED codes as these are used in general practice. There are several ways we can search for these:

Use the SNOMED browser provided by NHS Digital
Use another code browser such as OpenCodelists
Use a dataset of known codes and search with a programming script

I am sure there are other ways, but these are the ones I have used in my career so I am going to concentrate on these.

Method 1: The SNOMED browser

The first two options involve a website with a point-and-click interface. This is hugely valuable because you can quickly browse through codes. But not all websites are created equal.

The SNOMED browser is a simple searching tool. When we search for “cough”, we get back a long list of possible matches:

SNOMED browser search results for “cough”

If I search for “cough” as of 31st January 2026, I get 1115 matches. Clicking on a non-granular term (i.e. “cough” rather than “dry cough”) shows 40 child terms underneath it. We can then click through these and note down the ones we want. Perhaps we include “productive cough” but not “psychogenic cough” or “reflux cough”.

That covers 41 of the 1115 matches. You would then need to go through the rest of the terms and quickly assess them. For example, “history of whooping cough” might not be relevant for this study. Many codes have more than one term (for example, “Cough (finding)”, our parent code, has the alternative term “Observation of cough”), so the number of distinct codes is smaller than the number of matches and you can normally work through the list quite quickly.

If you have followed along, you will end up with a codelist of around 30 to 40 relevant codes.

The problem with manual searching

Now the question is: how do we record what we did? Because we have done a bunch of clicking around and noting things down, it gets hard to remember exactly what decisions we made and why.

It is hard to remember what you did when manually searching for codes

If someone wanted to check our work or reproduce it, they would struggle. This is why other tools exist.

Method 2: OpenCodelists

OpenCodelists is my favourite tool for this job. Full disclosure: I used to work for the team that created it, back in the early days of OpenSAFELY.

Here you can create a codelist, search for “cough”, and then use the inherent tree structure of SNOMED to capture all the codes underneath any of your search terms.

OpenCodelists search results for “cough”

It supports wildcards, encourages you to record metadata, and makes your life a lot easier by exposing the tree structure in the browser. Your codelist also becomes citable and shareable from day one, which is great for reproducibility.

OpenCodelists metadata entry screen

I highly recommend that you use this tool. The team also have excellent documentation.
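
To make the "tree structure" idea concrete, here is a minimal sketch of what capturing everything underneath a term amounts to. It is not how OpenCodelists is implemented. The `CHILDREN` mapping, the `descendants` helper, and the labels in it are all made up for illustration and stand in for real SNOMED concept identifiers.

```python
from collections import deque

# Hypothetical parent -> children relationships. The labels are placeholders
# for illustration only, not real SNOMED identifiers or a real hierarchy.
CHILDREN = {
    "cough": ["dry cough", "productive cough", "chronic cough"],
    "productive cough": ["productive cough with green sputum"],
    "chronic cough": [],
}


def descendants(concept, children):
    """Return every concept underneath `concept`, at any depth."""
    seen = set()
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            # A concept can sit under more than one parent, so skip repeats.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Include the parent term and everything below it...
selected = {"cough"} | descendants("cough", CHILDREN)

# ...then remove any branch that fails our exclusion criteria.
EXCLUDED_BRANCHES = ["chronic cough"]
for branch in EXCLUDED_BRANCHES:
    selected -= {branch} | descendants(branch, CHILDREN)

print(sorted(selected))
```

Including or excluding a whole branch in one step is what makes a tree-aware tool so much faster than clicking through terms one at a time.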

Method 3: Searching with code

The final method is to use a dataset of codes supplied by a data provider such as CPRD. Most providers produce something similar. This approach lets you load the dataset from a CSV or similar file and then use code to search for terms such as “cough”.

What is great about this is that it makes explicit what you have searched for and what you have excluded. I will link to a very old example that I created for a study in stroke back in 2017. Even without understanding Stata code, you can see from the comments that I am searching for terms (with wildcards like "*infect*" and "*sepsi*"), adding some codes manually by ICD-10 code, and then excluding terms (such as "*history*", "*benign*", "*postinfect*" and "*non-infect*").

The beauty of this approach is that if someone wants to recreate the exact same codelist and check my work, they can simply run the script. It is fully reproducible. However, it does require some programming knowledge, so it is more of an advanced technique.
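
For readers who do not use Stata, the same pattern applied to our cough example might look roughly like this in Python. It is a sketch under stated assumptions, not the script from the stroke study. The dictionary contents, the codes (`A001` and so on), and the `matches_any` and `build_codelist` helpers are invented for illustration; a real provider file will have its own column names and real codes.

```python
import csv
import io
from fnmatch import fnmatchcase

# Hypothetical code dictionary. In practice you would open the provider's
# file instead. These codes and terms are invented, not real SNOMED content.
DICTIONARY_CSV = """code,term
A001,Cough
A002,Dry cough
A003,Productive cough
A004,Chronic cough
A005,Allergic cough
A006,History of whooping cough
A007,Chest pain
"""

# The search strategy, written down explicitly so it can be reviewed.
INCLUDE_PATTERNS = ["*cough*"]
EXCLUDE_PATTERNS = ["*chronic*", "*allergic*", "*history*"]


def matches_any(term, patterns):
    """True if the lower-cased term matches any wildcard pattern."""
    term = term.lower()
    return any(fnmatchcase(term, pattern) for pattern in patterns)


def build_codelist(rows, include, exclude):
    """Keep rows matching an include pattern and no exclude pattern."""
    return [
        row
        for row in rows
        if matches_any(row["term"], include)
        and not matches_any(row["term"], exclude)
    ]


rows = list(csv.DictReader(io.StringIO(DICTIONARY_CSV)))
codelist = build_codelist(rows, INCLUDE_PATTERNS, EXCLUDE_PATTERNS)

for row in codelist:
    print(row["code"], row["term"])
```

The include and exclude patterns sit at the top of the script, so a reviewer can see the whole search strategy at a glance and rerun it against a newer version of the dictionary.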

What we have covered so far

We have now walked through three methods for creating a codelist from scratch: manual searching with the SNOMED browser, using a dedicated tool like OpenCodelists, and writing code to search systematically. Each has its trade-offs between ease of use and reproducibility.

But in reality, you will often find that someone has already created something close to what you need. In the next section, we will look at how to find and adapt existing codelists, which is often a more efficient starting point.

Adapting an existing codelist

There are a great many existing codelists if you know where to look. Many are published online in repositories like OpenCodelists, the HDR UK Phenotype Library, or LSHTM Data Compass. Others live in GitHub repos. Some are kept within departments and might be retrievable if you know who to ask. Papers, especially older ones, sometimes publish codelists in an appendix. Arguably this non-machine-readable format is the worst option other than no codelist at all!

For our cough example, we can use OpenCodelists and search for “cough”. Two SNOMED codelists come up straight away: one from the University of Bristol and one from OpenSAFELY. We can download these and view them ourselves.

Here we want to pay close attention to the metadata, in particular the original purpose of the codelist. We can see that the University of Bristol cough codelist was created for an asthma study, for example, and we might decide that makes it less relevant to our diabetes and antibiotics question.

Once you have found one or more candidate codelists, you have a few options:

Accept an existing codelist as-is, if it fits your inclusion and exclusion criteria well
Take two codelists and combine them, then apply your own inclusion and exclusion criteria
Use an existing codelist as a starting point and add or remove codes as needed

The choice is up to you. What matters is that you feel confident you have captured all the codes that fit your criteria, and that you can explain and justify your decisions. You will quickly see that good metadata is absolutely key to judging whether a codelist works for your study.
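
As a rough sketch of the second option, combining two codelists and then applying your own criteria could look like this. Both codelists, the codes in them, and the `EXCLUDE_WORDS` rule are hypothetical; a simple word match like this is only a first pass and every exclusion should still be reviewed by a person.

```python
# Two hypothetical downloaded codelists, as code -> term.
# The codes are invented placeholders, not real SNOMED identifiers.
codelist_a = {"A001": "Cough", "A002": "Dry cough", "A004": "Chronic cough"}
codelist_b = {"A001": "Cough", "A003": "Productive cough", "A005": "Allergic cough"}

# Combine them, recording which source each code came from.
combined = {}
for name, codelist in (("codelist_a", codelist_a), ("codelist_b", codelist_b)):
    for code, term in codelist.items():
        entry = combined.setdefault(code, {"term": term, "found_in": []})
        entry["found_in"].append(name)

# Apply our own exclusion criteria (acute coughs only).
EXCLUDE_WORDS = ("chronic", "allergic")
final = {
    code: entry
    for code, entry in combined.items()
    if not any(word in entry["term"].lower() for word in EXCLUDE_WORDS)
}

for code, entry in sorted(final.items()):
    print(code, entry["term"], entry["found_in"])
```

Keeping the `found_in` field means you can later explain where each code came from, which feeds directly into your metadata.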

Useful tools

A few tools worth mentioning that can make your life easier:

Comparing codelists

Sometimes you will want to compare two codelists side by side to see where they overlap and where they differ. My advice here is to use a proper comparison tool like Beyond Compare rather than fiddling around in Excel. Just alphabetise each CSV file and then compare them. This is another example of when software engineers have already solved tricky problems for us, so we should just use their tooling.

Beyond Compare screenshot comparing two codelists

Shout out to Scooter Software for making such a great tool that is accidentally perfect for this task!
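
If you would rather script the comparison, or want a summary to paste into your notes, a few set operations give the same overlap and difference information. This is a minimal sketch that assumes each CSV has `code` and `term` columns. The file contents and the `load_codes` helper are made up for illustration.

```python
import csv
import io

# Hypothetical contents of two codelist CSV files (invented codes).
FIRST_CSV = """code,term
A001,Cough
A002,Dry cough
A004,Chronic cough
"""

SECOND_CSV = """code,term
A001,Cough
A003,Productive cough
"""


def load_codes(text):
    """Read a codelist CSV into a code -> term dictionary."""
    return {row["code"]: row["term"] for row in csv.DictReader(io.StringIO(text))}


first = load_codes(FIRST_CSV)
second = load_codes(SECOND_CSV)

in_both = sorted(first.keys() & second.keys())
only_in_first = sorted(first.keys() - second.keys())
only_in_second = sorted(second.keys() - first.keys())

print("In both:", in_both)
print("Only in first:", only_in_first)
print("Only in second:", only_in_second)
```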

Checking code usage

OpenCodeCounts, also from the Bennett Institute (the team behind OpenCodelists and OpenSAFELY), provides counts of how often each code is actually used in the data in England. This means you can stop wasting time arguing about whether a code should be included in your codelist if it turns out nobody in the whole country has ever been coded with it. Very handy for prioritising your efforts.
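
Once you have usage counts in hand, ranking your draft codelist by them is straightforward. The sketch below does not use any OpenCodeCounts API. The `usage_counts` dictionary and its numbers are invented, and it assumes you have already exported counts into a simple code-to-count mapping.

```python
# Hypothetical usage counts, code -> number of recorded uses.
# The codes and numbers are invented for illustration.
usage_counts = {"A001": 125000, "A002": 4300, "A003": 0}

draft_codelist = ["A003", "A002", "A001", "A009"]

# Review the most-used codes first.
ranked = sorted(draft_codelist, key=lambda code: usage_counts.get(code, 0), reverse=True)

# Codes with a recorded count of zero are low priority for debate.
never_used = [code for code in draft_codelist if usage_counts.get(code) == 0]

# A code missing from the counts is not the same as a zero count,
# so flag it for a manual check rather than dropping it silently.
not_in_counts = [code for code in draft_codelist if code not in usage_counts]

print("Review order:", ranked)
print("Never used:", never_used)
print("Missing from counts:", not_in_counts)
```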

Version control and documentation

If you want to version control your codelists and have in-depth discussions in the open about your decisions, I would recommend using GitHub. GitHub Issues are particularly good for this. You can have nuanced discussions about why you included or excluded particular codes, and then link back to those issues in your metadata. Issues stick around even after they are closed, so they become a permanent record of your reasoning.

Why this all matters

Whatever method you use, the goal is the same: make your codelist reproducible and defensible. Someone reading your paper should be able to understand exactly how you identified your study population, and believe that you have done a thorough job. If they cannot, your results become harder to trust and resources get wasted as other researchers recreate work from scratch rather than building on yours.

Codelists are proper scientific artifacts and should be treated as such. They deserve the same care and documentation as your statistical methods or your data cleaning pipeline.

Conclusion

Creating a codelist is a scientific decision that directly affects which patients end up in your study. Whether you build from scratch or adapt an existing one, the key is to be systematic, document your choices, and be able to explain why you included or excluded particular codes.

Where possible, reuse existing codelists that have been validated in previous studies. There is no prize for reinventing the wheel, and using established codelists makes your work more comparable to others in the field. When you do create or adapt a codelist, share it with good metadata so others can build on your work in turn.
