key: cord-0997327-ipy290l3 authors: MacRae, Clare; Whittaker, Hannah; Mukherjee, Mome; Daines, Luke; Morgan, Ann; Iwundu, Chukwuma; Alsallakh, Mohammed; Vasileiou, Eleftheria; O’Rourke, Eimear; Williams, Alexander T; Stone, Philip W; Sheikh, Aziz; Quint, Jennifer K title: Deriving a Standardised Recommended Respiratory Disease Codelist Repository for Future Research date: 2022-02-16 journal: Pragmat Obs Res DOI: 10.2147/por.s353400 sha: 5df4708c0a0889ae25018e7e50e906078833ccc5 doc_id: 997327 cord_uid: ipy290l3 BACKGROUND: Electronic health record (EHR) databases provide rich, longitudinal data on interactions with healthcare providers and can be used to advance research into respiratory conditions. However, since these data are primarily collected to support health care delivery, clinical coding can be inconsistent, resulting in inherent challenges in using these data for research purposes. METHODS: We systematically searched existing international literature and UK code repositories to find respiratory disease codelists for asthma from January 2018, and chronic obstructive pulmonary disease (COPD) and respiratory tract infections from January 2020, based on prior searches. Medline searches using key terms provided article lists. Full-text articles, supplementary files, and reference lists were examined for codelists, and codelist repositories were searched. A reproducible methodology for codelist creation was developed, with recommended lists for each disease created based on multidisciplinary expert opinion and previously published literature. RESULTS: Medline searches returned 1126 asthma articles, 70 COPD articles, and 90 respiratory infection articles, with 3%, 22% and 5% including codelists, respectively. Repository searching returned 12 asthma, 23 COPD, and 64 respiratory infection codelists. 
We have systematically compiled respiratory disease codelists and from these derived recommended lists for use by researchers to find the most up-to-date and relevant respiratory disease codelists that can be tailored to individual research questions. CONCLUSION: Few published papers include codelists, and where published, diverse codelists were used, even when answering similar research questions. Whilst some advances have been made, greater consistency and transparency across studies using routine data to study respiratory diseases are needed. Electronic health record (EHR) databases include rich, longitudinal data on an individual's interactions with health care providers. They comprise part of the clinical information systems which health care providers use during clinical consultations across primary, secondary and tertiary care. From these systems, data can then be extracted to enhance patient care through clinical research, healthcare planning, decision-making, and clinical audit. These routine data have been used to make significant advances in research into the epidemiology, burden, and natural history of respiratory disease, leading to improved prevention, detection and management, and to inform health service planning and policy. 1, 2 The scale of these data facilitates a wide range of research with high statistical power due to the in-depth variety of variables recorded and the number of patients contributing to the data, particularly as they are increasingly linked with other data sources. 28 However, EHRs are primarily populated to support health care delivery rather than research. This gives rise to challenges, including high volume and irregularly collected, 2 informatively observed (where data collection is driven by clinical requirements), missing, and incorrectly coded data. 3, 4 To study a health condition in EHR databases, an operational definition based on clinical codes is often used. 
Clinical codes are alphanumerical codes ascribed to specific clinical events or descriptions. Numerous code systems exist, and each diagnosis can have multiple clinical codes associated with it. Therefore, in order to search for a particular diagnosis, multiple clinical codes are required, constituting a clinical codelist. 5 The choice of codes requires clinical and epidemiological expertise and knowledge about data quality and provenance in addition to knowledge about the databases being interrogated. 6 However, there is often significant variation in the clinical codes used to define respiratory conditions. 7, 8 This can result in considerable differences in study findings, 9 such as incidence and prevalence, across studies and limits the generalisability and comparability of findings. 10, 11 As an example, Mukherjee et al examined UK asthma prevalence and reported that annual prevalence of clinician-reported-and-diagnosed asthma was 5.7% (3.6 M individuals) when derived from primary care databases and 6.8% (4.3 M individuals) when derived from the financial incentive-based Quality and Outcomes Framework in UK primary care, whereas annual prevalence of patient-reported clinician-diagnosed-and-treated asthma was 9.6% (6.0 M individuals) derived from national health surveys. 12 Therefore, it is important that standardised codelists exist to support research reproducibility and translation of findings between institutions, and to reduce duplication of work. 13 Previously published codelists for respiratory diseases have included specific codes relating to study-specific research questions, and few validation studies have been published. 14, 15 We sought to respond to this knowledge gap by developing a systematically derived collection of published codes from EHRs for three common respiratory disease categories: asthma, chronic obstructive pulmonary disease (COPD) and selected respiratory infections (Box S1). 
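The idea of a codelist as an operational definition can be made concrete with a minimal sketch. The `Event` record, the `patients_matching` helper and the three-entry codelist below are hypothetical and purely illustrative; the codes are taken from the asthma search terms in this paper (J45, J46, H33, the latter in shortened form), and a real codelist would be far longer and clinically reviewed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One coded entry in a patient's record (hypothetical structure)."""
    patient_id: int
    system: str  # coding system, e.g. "ICD-10" or "Read v2"
    code: str


# Illustrative only: not a recommended codelist.
ASTHMA_CODELIST = {("ICD-10", "J45"), ("ICD-10", "J46"), ("Read v2", "H33")}


def patients_matching(events, codelist):
    """Return the IDs of patients with at least one event in the codelist."""
    return {e.patient_id for e in events if (e.system, e.code) in codelist}


events = [
    Event(1, "ICD-10", "J45"),
    Event(2, "ICD-10", "J44"),   # a COPD code, so not in the asthma list
    Event(3, "Read v2", "H33"),
    Event(3, "ICD-10", "J46"),   # same patient, second matching code
]
asthma_patients = patients_matching(events, ASTHMA_CODELIST)
```

Keying the codelist on the pair (coding system, code) rather than the code alone is a design choice made here so that identical strings from different coding systems cannot be confused.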
We aimed to amalgamate all codelists into one document and from that produce a recommended list for each disease, which can be used by researchers to identify relevant respiratory-related codes. We describe our methodology for this work in detail to allow researchers to replicate it as appropriate and to ensure transparency and reproducibility of codes using EHRs. 16 We systematically searched the literature and existing code repositories to identify all codes relating to asthma, COPD, and respiratory infections. We used a similar approach to previously published respiratory-related validation studies and built on this work. 7,14-17 Reviewers were split into three groups to search for codes related to asthma, COPD, and respiratory infections, respectively. Each group comprised at least three epidemiological researchers and one clinician researcher to evaluate disease codes. The Medline database was used because of its comprehensive coverage of clinical medicine research. Searches were performed using key search terms for asthma, COPD, and respiratory infections separately (Table 1) and abstracts and full-text articles were screened by at least two researchers for each disease. We included full-text studies that reported codelists for asthma, COPD, and respiratory infections and that were published from January 2020 (Figure 1) in the English language. These dates were chosen in order to identify up-to-date codes that could be added to existing systematic reviews and codelists already available in the Health Data Research UK (HDR UK) Phenotype Library. 18 Our objective was to identify which codelists are being used in research, rather than comment on validity; therefore, a risk of bias analysis was not performed. Supplementary material from the included studies was also reviewed to ensure all codes were identified. 
Specific codelists of interest included: Read version 2 codes, Clinical Terms version 3 (CTV3), SNOMED CT codes or Clinical Practice Research Datalink (CPRD) medcodeid codes, International Classification of Primary Care (ICPC) codes, International Classification of Diseases (ICD) 9, ICD 10, and ICD 11, and UK Biobank self-diagnosis codes. For asthma, studies that were published from 1 January 2018 were also included to capture work published since the last primary care validation study. 6 For COPD and respiratory infections, we searched full texts from 1 January 2020. We excluded codes related to SARS-CoV-2 infections because of the rapidly evolving evidence base in this area. There were no restrictions on study design or age of populations studied. Studies that met the search criteria were screened by at least two researchers using a reference manager (Covidence, Zotero & Excel) for the three diseases separately and codelists that were published were compiled into a Microsoft Excel spreadsheet. In addition to a literature search, we searched existing UK code repositories: CALIBER HDR UK Phenomics portal, the Cambridge University repository, LSHTM Data Compass, QResearch, Oxford-RCGP RSC, Manchester Clinical Codes, and OpenSAFELY. Relevant codelists for asthma, COPD, and respiratory infections were added to the list of codes identified from the literature search. Once all codes were compiled, recommended lists of codes for asthma, COPD, and respiratory infections were created based on validation studies, previous publications and multidisciplinary clinical expertise and consensus. The recommended lists of codes were labelled "BREATHE recommended codes" for diagnosis codes and were further categorised into phenotypes as appropriate. Phenotype lists were created by two non-clinical authors, and a second review was performed by a clinical author. Asthma-related phenotypes included incident and prevalent asthma and exacerbation of asthma. 
COPD-related phenotypes included emphysema, chronic bronchitis, incident COPD, prevalent COPD, and exacerbations of COPD. Finally, respiratory infections included 20 different types of infection (supplement).

Table 1 (Medline search terms):
("codeset*" OR "code set*" OR "codelist*" OR "code list*." OR "H33." OR "H33" OR "H33%%" OR "J45" OR "J46" OR "Read code*.mp" OR "ICD-10.mp" OR "ICD-11.mp" OR "SNOMED*" OR "ATC" OR "medcode" OR "CPRD" OR "Clinical Practice Research Datalink" OR "Secure Anonymised Information Linkage" OR "SAIL" OR "OPCRD" OR "Optimum Patient Care Research Database" OR "The Health Improvement Network" OR "THIN" OR "QResearch*" "Biobank" OR "ONS" OR "HES" OR "hospital episode statistics" OR "BNF" OR "NHS" OR "National Health Service" OR "Databases,
AND ("codeset*.mp" OR "code set*" OR "codelist*.mp" OR "Read code*.mp" OR "ICD-10.mp" OR "ICD-11.mp" OR "SNOMED CT.mp" OR "SNOMED-CT.mp" OR "ATC.mp" OR "medcode.mp" OR "CPRD.mp" OR "SAIL.mp" OR "OPCRD.mp" OR "Biobank.mp" OR "ONS.mp" OR "HES.mp" OR "BNF.mp" OR "NHS.mp" OR "National Health Service.mp") AND ("UK.mp" OR "United Kingdom.mp" OR "England.mp" OR "Scotland.mp" OR "Wales.mp" OR "Northern Ireland.mp") NOT ("COVID-19.mp" OR "coronavirus.mp" OR "SARS-CoV-2.

For asthma, 1126 articles were identified, of which 37 (3%) included published asthma codes and were therefore included. The COPD search retrieved 240 articles of which 53 (22%) included published COPD codes. The respiratory infection search retrieved 90 articles, of which 5% included published codes. In terms of codes identified in repositories, 12 codelists were identified for asthma from Cambridge University, 19 Keele University, 20 LSHTM Data Compass, 21 Manchester clinical codes, 22 NHS England, 23 OpenSAFELY, 24 and UK Biobank. 25 Twenty-three codelists were identified for COPD from the Cambridge University repository, LSHTM Data Compass, Manchester Clinical Codes, and OpenSAFELY. 
In total, 64 codelists were identified for respiratory infections (pneumonia: 21, acute bronchitis: 12, aspergillosis: 0) from the Manchester Clinical Codes, CALIBER HDR UK Phenomics portal, OpenSAFELY, LSHTM Data Compass and Oxford-RCGP RSC repositories. Figure 2 illustrates the total number of codes published in the included articles for each disease. Most diagnosis codes for asthma were Read v2, whereas the majority of diagnosis codes for all other diseases were SNOMED CT codes or the corresponding CPRD medcodeid codes. For all diseases, very few Biobank and ICPC codes were found. All codelists for asthma, COPD, and respiratory infections can be found on the HDR UK Phenomics portal: https://phenotypes.healthdatagateway.org/about/breathe/#collections. We undertook a systematic literature search of articles that published codelists along with their manuscript and searched codelist repositories to create a comprehensive list of codes used in previous studies related to specific respiratory diseases. From this, we derived a recommended list for each disease for future use, either as it stands or as a starting point for deriving a list for future work. Relatively few published papers include or reference published codelists. The majority of included studies used asthma codelists, which is likely to relate to worldwide asthma prevalence being more than twice that of COPD. 26 Of all codelists identified in the literature and repositories, the range of codes used to define specific disorders varied and it was common for research groups to reuse their own codes in each study. Codelists for a variety of databases are continuously updated and made available to the wider scientific community to allow up-to-date and transparent epidemiological research. Codes are added and removed from specific databases (such as SNOMED CT codes) over time and researchers should be aware of this and update their codelists as needed. 
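One way to act on the point about codes being added and removed over time is to compare a saved codelist with the set of codes present in the current release of the database dictionary. The sketch below is an assumption about how such a check might look, not a method used in this work; `audit_codelist` and the identifiers `C001` to `C004` are hypothetical placeholders rather than real clinical codes.

```python
def audit_codelist(codelist, current_release):
    """Split a saved codelist into codes still present in the current
    release and codes that no longer appear there."""
    saved, current = set(codelist), set(current_release)
    return {
        "still_present": sorted(saved & current),
        "missing_from_release": sorted(saved - current),
    }


# Hypothetical placeholder identifiers, not real clinical codes.
saved_codelist = ["C001", "C002", "C003"]
current_release = ["C001", "C003", "C004"]

report = audit_codelist(saved_codelist, current_release)
```

Codes listed under `missing_from_release` would need manual review, since a code absent from one browser or release may be a retired code, a local code, or a transcription error. The check does not find newly added codes that ought to join the list; that still requires a fresh term search.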
A further nuance is that SNOMED CT codes and Read V2 codes do not always map directly onto one another, so a separate code search should be conducted for each code type in the database being used for a study in order to identify all possible codes. Researchers also need to be aware that local SNOMED CT codes and Read V2 codes exist, so not all code browsers may include all possible codes. Overall, our work highlights the importance of systematic search strategies and clinical input to identify codes relevant to specific diseases. Not only do authors often not publish codelists with their work but there are also few validation studies of codelists. To date, our team has created inclusive codelists for COPD, asthma, and respiratory infections that can be used to find these respiratory diseases within UK datasets of routinely collected electronic health records, including sources such as the CPRD GOLD and Aurum, Hospital Episode Statistics, and the SAIL Databank. These codes are relatively broad and cover a range of phenotypes as well as incident and prevalent disease codes. Researchers must be cautious when using these codes to identify specific populations of individuals with these diseases, and the choice of codes will depend on the researcher's specific research question. For example, researchers should only consider incident codes when identifying an incident disease population, and if the study group wishes to identify, for example, emphysema, a specific emphysema codelist should be used rather than a more general COPD codelist. Furthermore, EHRs are primarily maintained to support health care rather than for research: specific codes may be preferred by clinicians, some codes may not be coded correctly, and some may not be used at all. 
27 A study examining the usage of disease codes in primary care found that, in two million consultations performed over a seven-year period, 50% of EHRs were populated with only eight of the 352 (2.3%) possible codes, and in 95% of cases only 36 codes out of 352 (10.2%) were used. Twenty-one percent of all possible allergy codes were never used. 28 This highlights the challenges of using EHR data and the importance of creating robust codelists to identify all possible events. The choice of codes for the same clinical condition may also need to be reviewed locally and tailored to the population or dataset as coding practices and data quality may differ between UK regions. In addition, these codes may have limited utility outside the UK and knowledge of local healthcare systems is essential to appreciate why certain codes are used and when. In addition to the use of specific codes for case definition, other important parameters should be considered depending on the clinical condition and database. This is because specific codes for various conditions may be underused by health care providers and captured by other variables in the database such as prescriptions, tests (such as spirometry to diagnose COPD) and symptoms. One example is the Quality and Outcomes Framework (QOF) indicator for asthma, AST001 (currently suspended in 2021), which uses a 12-month lookback period for prescriptions (in addition to a recorded diagnosis) to identify individuals with active/treated asthma. Similarly, when identifying patients with exacerbations of COPD, other parameters such as prescriptions for respiratory-related oral corticosteroids and antibiotics and symptoms should be used in addition to exacerbation and lower respiratory tract infection codes to identify all possible events and patients. Transparency in research is vital and increasingly expected by funding organisations. 
Whilst initiatives such as RECORD have helped to increase transparency in reporting of studies undertaken using observational routinely collected data and have led to an extension of STROBE for this purpose, journals do not often mandate that codes or methodologies used for deriving codes for use in routine sources of data are published. It is important for studies to disclose the codelists used to allow the methods to be fully understood, findings interpreted clearly and for analysis to be replicated. One way in which to do this is to include or reference exact lists of codes in published manuscripts rather than only including vague code ranges. We aim to build on this work and create codelists for specific phenotypes (as well as incident and prevalent codes) for asthma, COPD, and respiratory infections with input from respiratory clinicians. These codes could be used to identify sub-populations of patients with respiratory diseases (such as severe asthma) or specific disease-related events (such as an exacerbation of COPD). We have compiled codelists with the intention of helping researchers find the most up-to-date codes relevant to their study, which will ultimately help comparative respiratory research. Our standardised codelists for respiratory diseases address these issues by creating comprehensive lists that can be used to research respiratory disease, leading to new and clinically important research insights to improve respiratory health. Since lists of codes vary by research question, these lists might need to be tailored to the exact research question being addressed and can be seen as a starting point for defining respiratory diseases in EHRs. More transparency in reporting is needed, as are validation studies for phenotypes that have not yet been validated, given that these data will be used increasingly frequently and by more people. 
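The discussion notes that case definitions may need to combine diagnosis codes with other variables, giving the example of a 12-month prescription lookback alongside a recorded asthma diagnosis. A minimal sketch of that logic follows. It is a simplification under stated assumptions, not the QOF business rules: the function name `has_active_asthma`, the 365-day window and the input format (plain lists of dates already filtered by a diagnosis codelist and a prescription codelist) are all hypothetical.

```python
from datetime import date, timedelta


def has_active_asthma(diagnosis_dates, prescription_dates, index_date,
                      lookback_days=365):
    """True if the patient has any asthma diagnosis on or before the index
    date AND an asthma prescription within the lookback window."""
    diagnosed = any(d <= index_date for d in diagnosis_dates)
    window_start = index_date - timedelta(days=lookback_days)
    treated_recently = any(
        window_start < p <= index_date for p in prescription_dates
    )
    return diagnosed and treated_recently


index = date(2021, 3, 31)

# Diagnosed years ago and prescribed within the last 12 months.
active = has_active_asthma([date(2015, 6, 1)], [date(2020, 11, 15)], index)

# Diagnosed, but the last prescription falls outside the window.
lapsed = has_active_asthma([date(2015, 6, 1)], [date(2019, 1, 10)], index)

# Recent prescription without any recorded diagnosis.
undiagnosed = has_active_asthma([], [date(2021, 1, 5)], index)
```

Keeping the diagnosis and prescription conditions as separate inputs makes it straightforward to swap in a different codelist or window length when the research question changes.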
SAIL team for creating the website pages.

JKQ conceptualised the study and all authors contributed to study design, searching the literature and collating and deriving recommended codelists. CM, HW and MM drafted the original manuscript, with critical revision of the manuscript by all authors. All authors approved the final manuscript. All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. The corresponding author is also the guarantor for this manuscript and accepts full responsibility for the work, had access to all the data and was responsible for the decision to publish.

This work is supported by BREATHE - The Health Data Research Hub for Respiratory Health [MC_PC_19004]. BREATHE is funded through the UK Research and Innovation Industrial Strategy Challenge Fund and delivered through Health Data Research UK. The funder had no role in study design, data collection, analysis or interpretation, or manuscript writing. All authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

CM, HW, MM, LD, AM, CI, MA, EV, EOR, ATW, PWS have nothing to declare. JKQ reports grants from AUK-BLF, The Health Foundation, MRC, grants and personal fees from AZ, BI, GSK, Bayer, grants from Chiesi, outside the submitted work. AS reports grants from AUK-BLF and HDR UK.

Determinants of change in physical activity during moderate-to-severe COPD exacerbation
Exploiting the potential of routine data to better understand the disease burden posed by allergic disorders
Five analytic challenges in working with electronic health records data to support clinical trials with some solutions
Using routine health data for research: the devil is in the detail
Introduction to codelists; OpenSAFELY documentation
Measuring diagnoses: ICD code accuracy
Validation of chronic obstructive pulmonary disease recording in the clinical practice research datalink (CPRD-GOLD)
Will systematized nomenclature of medicine-clinical terms improve our understanding of the disease burden posed by allergic disorders?
Quality of recording of diabetes in the UK: how does the GP's method of coding clinical data affect incidence estimates? Cross-sectional study using the CPRD database
How to validate a diagnosis recorded in electronic health records
Learning health systems need to bridge the 'two cultures' of clinical informatics and data science
The epidemiology, healthcare and societal burden and costs of asthma in the UK and its member nations: analyses of standalone and linked national databases
Defining asthma and assessing asthma outcomes using electronic health record data: a systematic scoping review
Validation of asthma recording in electronic health records: a systematic review
Validation of the recording of acute exacerbations of COPD in UK Primary Care Electronic Healthcare Records
Code sets for respiratory symptoms in electronic health records research: a systematic review protocol
Identifying clinically important COPD sub-types using data-driven approaches in primary care population based electronic health records
Health Data Research UK (HDRUK) Phenotype Library
Code lists
London School of Hygiene and Tropical Medicine (LSHTM) data compass
UK Biobank Primary Care Linked Data
Prevalence and attributable health burden of chronic respiratory diseases, 1990-2017: a systematic analysis for the Global Burden of Disease Study
Quality of morbidity coding in general practice computerized medical records: a systematic review
Usage of allergy codes in primary care electronic health records: a national evaluation in Scotland