title: Making Words Count with Computerised Identification of Hypertrophic Cardiomyopathy Patients
authors: Slater, L. T.; Bradlow, W.; Desai, T.; Aziz, A.; Evison, F.; Ball, S.; Gkoutos, G. V.
date: 2021-04-15
doi: 10.1101/2021.04.13.21255353

Background: The traditional outpatient model in hypertrophic cardiomyopathy (HCM) is under pressure. Population health management based on an accurate patient record provides an efficient, cost-effective alternative.
Methods: To improve the accuracy of the HCM patient list in a single hospital, we developed a rule-based information extraction natural language processing (NLP) framework. The framework employed ontological expansion of vocabulary and exclusion-first annotation, and was trained by an 'expert in the loop'. The output stratified patients with atrial fibrillation (AF) and heart failure (HF), those without active cardiology care, and likely screened individuals.
Results: The algorithm was validated against multiple data sources, including manual validation, for HCM, AF, HF and family history of the disease. Overall precision and recall were 0.854 and 0.865 respectively. The pipeline found 25,356 documents featuring HCM-related terms belonging to 11,083 patients. Excluding scanned documents resulted in 17,178 letters from 3,120 patients. Subsequent categorisation identified 1,753 real cases, of whom 357 had AF and 205 had HF. There were 696 likely screened individuals. Adjusting for 304 false-negative patients, the total HCM cohort was 2,045 patients, of whom 214 were not under a cardiologist. NLP uncovered 709 patients who were absent from both the registry and hospital disease codes.
Conclusion: This novel NLP framework generated a hospital-wide record of patients with HCM and defined various cohorts, including the small set of HCM patients lacking current cardiology input. Existing data sources inadequately described this population, spotlighting NLP's essential role for clinical teams planning to move to a population health management model of care.
Keywords: text mining, natural language processing, hypertrophic cardiomyopathy, population health management

Hypertrophic cardiomyopathy (HCM) affects between 0.2% 1 and 1.4% 2 of individuals and is an important cause of sudden death, atrial fibrillation (AF) and heart failure (HF). Patient numbers are growing, driven by an expansion in those with mild disease 3. The fixed one-to-two-yearly outpatient appointment model 4,5 was strained even before the COVID-19 pandemic. Population health management (PHM) offers a solution to these challenges 6 but relies on an accurate record of diagnosed individuals. Clinical teams can find HCM patients within automatically generated clinic lists, but this is time-consuming. The most accessible record comes from routine hospital statistics, but this contains only patients with prior inpatient episodes. Disease registries can be repurposed but are prone to missing patients. Developing, populating and maintaining them is also laborious and error-prone. Searching the electronic record by hand using keyword terms is the most arduous option. Natural language processing (NLP) can automate this search and support population health management 7. While applications in cardiovascular disease are spreading 8-11, publications in HCM are limited 12,13. Our study's aim was to improve the accuracy of a single hospital's list of patients with HCM.
To do so, we set out to:
• expand our existing NLP algorithm 13, deploy it across the electronic patient record and test its accuracy
• define the entire cohort, subgroup it by AF and HF complications, identify which patients needed to see a cardiologist and find people likely being screened
• measure the impact NLP had on total headcount compared to existing data sources

We undertook this work at the Queen Elizabeth Hospital site of University Hospitals Birmingham NHS Foundation Trust. The authors declare that all supporting data are available within the article (and its online supplementary files).

Information Extraction Framework
Figure 1 depicts the seven steps that make up the information extraction pipeline. We implemented the pipeline using the Komenti semantic text-mining framework 14, a natural language processing program developed in-house. This is available under an open-source licence at https://github.com/reality/Komenti.

We first developed a vocabulary of diagnoses and their synonyms that identify the conditions of interest in the text. A clinical specialist (WB) produced a list of words and phrases that indicate HCM, AF and HF (see Appendix for the full list). The vocabulary contained phrases that pointed to a disease complication, for example 'weak heart' for HF. Komenti's vocabulary expansion algorithm broadened the number of searched-for labels and synonyms through linkage to biomedical ontologies.

Documents were obtained from the UHB patient document management system (OpenText) on 02/06/2019. The entire clinical document record, around 22TB in size, was collected on a server in PDF format and queried using a keyword search to identify documents that included labels for HCM. The corpus included referrals from primary, secondary and tertiary care, clinic letters, discharge summaries and typed digital notes. Cardiac imaging reports were excluded since they are stored outside OpenText. A minority of these documents were scanned; this included all Emergency Department documents and genetic reports. Examination of scanned documents used an optical character recognition (OCR) extraction tool found within the OpenText suite. Assessment for complications used only the most recent document mentioning HCM together with AF and HF.

After identifying documents matching HCM keywords, we ultimately removed scanned documents, which were a source of numerous errors (described in Results; Interim Validation). We then tokenised the remaining documents into sentences. Terms from the vocabulary pertaining to HCM, AF and HF were matched in the sentences, and only sentences containing those keywords were retained. We then developed a vocabulary for the exclusion of sentences that contained irrelevant mentions of HCM, AF and HF, in which the text did not actually refer to the condition of interest (see Results; Interim Validation for examples). This step also involved excluding acronyms that could be confused for other conditions.

The pipeline grouped sentences by scenarios where mentions did not indicate definite disease for the patient, before eliminating them to leave only genuine cases. The three groups were:
• Negated: the patient does not have the condition
• Uncertain: the patient might have the condition
• Family history: the sentence refers to a family member and not the patient

The negation and uncertainty detection algorithms employ a graph-based dependency extraction method to quantify the grammatical proximity between the tag and the word of interest 13.
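As a rough illustration of that grammatical-proximity idea, the sketch below measures the shortest path through a dependency parse between a negation cue and each condition mention. It is not the Komenti implementation; it assumes the spaCy and networkx libraries and a small English model, purely for demonstration.

```python
# Illustrative sketch only (not the Komenti algorithm): score how close a
# negation cue sits to each condition mention in the dependency tree.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def dependency_distances(sentence, cue, mentions):
    """Shortest-path distance in the dependency tree from `cue` to the
    final token of each mention (e.g. 'fibrillation')."""
    doc = nlp(sentence)
    # Undirected graph over head -> child edges of the parse
    graph = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)
    cue_idx = next(t.i for t in doc if t.lower_ == cue.lower())
    distances = {}
    for mention in mentions:
        target = next(t.i for t in doc if t.lower_ == mention.split()[-1].lower())
        distances[mention] = nx.shortest_path_length(graph, cue_idx, target)
    return distances

print(dependency_distances(
    "The patient has HCM, but does not have atrial fibrillation.",
    cue="not",
    mentions=["HCM", "atrial fibrillation"],
))
# With a typical parse, 'not' lies closer to 'fibrillation' than to 'HCM',
# so the negation attaches to atrial fibrillation rather than HCM.
```

In the full pipeline, this distance signal is combined with the bespoke negation and uncertainty vocabularies described next.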
A bespoke vocabulary of negation words was developed with the clinical specialist. The uncertainty vocabulary was based on a resource originally intended for topic modelling 15 and refined alongside the clinical specialist. The dependency-based parsing approach disambiguated complex individual sentences featuring more than one condition of interest (Figure 2). For example, in the sentence "the patient has HCM, but does not have atrial fibrillation", the algorithm would link 'not' to 'AF' rather than 'HCM'. To identify mentions where there was a family history of the disease, we took a simple approach in which the mention of family vocabulary words in the sentence (brother, sister, father, mother, etc.) was considered sufficient to mark the sentence as referring to a family member.

After matching each sentence to a context, it was annotated with a tag. The framework determined the patient's final status for each condition from these sentence-level context tags. Where multiple sentences expressed conflicting evidence about the condition, the final decision was arbitrated by counting which context occurred most frequently (see Figure 3). The framework repeated the process for family relations mentioned in letters.
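A minimal sketch of this majority-vote arbitration follows, assuming per-sentence context tags have already been assigned; the tag names are illustrative placeholders rather than Komenti's internal labels.

```python
# Illustrative sketch: collapse per-sentence context tags into a single
# patient-level status for one condition by majority vote.
from collections import Counter

def patient_status(sentence_tags):
    """Return the most frequent context tag, e.g. 'positive', 'negated',
    'uncertain' or 'family_history' (placeholder names)."""
    if not sentence_tags:
        return "not_mentioned"
    return Counter(sentence_tags).most_common(1)[0][0]

# Three letters assert the diagnosis, one letter questions it.
print(patient_status(["positive", "positive", "uncertain", "positive"]))
# -> 'positive'
```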
A simple web server and client tool was developed for clinicians to validate the results of the algorithm-generated annotations. The tool surfaced all relevant documents for a set of patients and condensed them into a single-page web application (Figure 4). This displayed the algorithm's judgment for HCM and comorbidities alongside the evidence for the decision, based on an excerpt from the clinic letter. Space was provided for free-text feedback at the bottom of the page. The algorithm's performance was evaluated on a patient-by-patient basis and the results were saved as JSON files for export.

For each experiment, training on 300 patients took place over several iterations of the pipeline using feedback from a clinical specialist (WB). The interim validation results determined whether the algorithm had correctly assigned the status (uncertain, negative, family history, positive) for each of the concepts based on the text provided. It also allowed feedback for cases where the algorithm's classification had been correct but the vocabulary had been incorrect or non-specific. These results informed modifications to each of Steps 1-5, particularly vocabulary and classification. Interim validation was repeated until the training set achieved high precision and recall.

We validated the pipeline outputs in a series of experiments against three structured sources: a clinico-genomic registry used to collect data on HCM patients in the Heart Muscle Disease Service since 2015; International Classification of Diseases, Tenth Revision (ICD-10) codes from Hospital Episode Statistics (I421 and I422); and a database of automated ECG reports. The first two contained information on HCM, AF and HF, while the last resource related to AF only. When validating HCM comorbidities, we only used records for individuals who had both HCM and disease complications featured in ICD-10 codes and the registry. The definition of HF in the registry was evidence of current heart failure or a prior admission with heart failure. Interrogation of the automated ECG reports and ICD-10 codes took place before extraction of the corpus. Consequently, validation against these databases excluded patients whose clinical documents were created after these cut-off dates.

The clinical specialist (WB) further validated by hand an additional 300 patients excluded from the training set. Performance was measured using precision (true positives / (true positives + false positives)) and recall (true positives / (true positives + false negatives)). Precision was derived from manual validation, and recall from validation against the databases. We excluded from the experiments several instances where the validation would be flawed and produce misleading performance values:
• precision measurements versus databases, as the databases do not cover every patient (i.e. they are not gold-standard comparators)
• recall measurements versus manual validation, as their inclusion would distort how well the algorithm performed at finding the true number of cases
• recall measurements versus databases in patients that Komenti did not classify, i.e. patients for whom the document search did not find sentences mentioning HCM. This situation arose when the diagnosis lay outside the document corpus, or existed in an unusable form within the corpus (i.e. scanned documents; see Results; Interim Validation).

It took two days to collect the corpus and an hour for the pipeline to run.

We identified the entire cohort, stratified it and found which patients needed to see a cardiologist using the following stages:
Stage 1. Removed scanned documents from the total number of documents with HCM keyword mentions.
Stage 2. Confirmed cases identified by the algorithm after filtering irrelevant, negative or uncertain mentions, before subcategorising for AF and HF. Patients who were negative or uncertain for HCM but had a family history of the condition were assumed to have been screened.
Stage 3. False negatives (HCM patients present in the registry and ICD-10 groups which NLP had not found) were added to the NLP-found group to produce the total HCM cohort.
Stage 4. Surviving HCM patients who were not under active cardiology follow-up were found by dividing the total HCM cohort into the following groups using administrative data:
- Dead
- Lost to follow-up (patient did not return for continued care as expected, defined here as seen in the outpatient setting over 24 months ago with follow-up arranged but not seen since, not on a waiting list, not discharged)
- Under active care (under a specialist cardiologist in the Heart Muscle Disease or Inherited Cardiac Conditions Service, a general cardiologist or a non-cardiologist)
- Seen as an inpatient only
- Discharged (and not under active care in another service)
Patients on a follow-up waiting list were defined as being currently under active care. New patients awaiting their first consultation were not included. We defined patients who need to see a cardiologist as those lost to follow-up, those under active management but not under cardiology, and those seen as an inpatient only.

The total cohort was divided into three groups to determine the contribution made by NLP: Group 1 (in the registry); Group 2 (in ICD-10 codes, not in Group 1); and Group 3 (found by NLP, not in Group 1 or 2).
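This three-way split amounts to simple set arithmetic over patient identifiers. The sketch below uses made-up identifiers purely for illustration; the study linked records by hospital number.

```python
# Illustrative sketch: partition the total cohort by data source.
# Identifiers are hypothetical stand-ins for hospital numbers.
registry = {"P01", "P02", "P03"}
icd10 = {"P02", "P04", "P05"}
nlp_found = {"P01", "P04", "P06", "P07"}

group1 = registry                            # in the registry
group2 = icd10 - group1                      # in ICD-10 codes only
group3 = nlp_found - group1 - group2         # found by NLP alone
total_cohort = registry | icd10 | nlp_found  # union of all three sources

print(sorted(group1), sorted(group2), sorted(group3), len(total_cohort))
```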
The University of Birmingham Ethical Review granted approval (ERN_20-0338). No informed consent was required as this was a service improvement project, and the documents were not de-identified as we intend to follow up individuals lost to follow-up. The hospital number was used to link a patient's clinical documents to their registry record. Data remained within the Trust and only included information related to HCM.

The training phase revealed various errors (Table 1); the commonest arose in scanned documents, where OCR misread a fragment of a word as an entity of interest (Figure 5). Consequently, we excluded all scanned documents, including those from the Emergency Department and the West Midlands Regional Genetics Service. Irrelevant documents were also excluded, for example when the document header contained the entity of interest, such as "Heart Failure clinic" (Figure 5). To address this problem, we adjusted the pipeline to detect addresses, phone numbers and postcodes. We had developed the vocabulary to maximise coverage of diagnosed patients, but this resulted in a high false-positive rate from non-specific phrases. For example, the algorithm initially flagged Grade 1 diastolic dysfunction as being indicative of heart failure. The vocabulary was updated by excluding these phrases.
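The added address, phone-number and postcode detection can be approximated with a couple of regular expressions. The patterns below are illustrative placeholders under that assumption, not the ones used in the pipeline.

```python
# Illustrative sketch: flag letterhead-style lines so keyword matches inside
# them (e.g. "Heart Failure clinic" in an address block) can be discarded.
# Patterns are assumptions, not those used in the study pipeline.
import re

UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")
UK_PHONE = re.compile(r"\b(?:\+44\s?|0)\d(?:[\s-]?\d){8,10}\b")

def looks_like_address_line(line: str) -> bool:
    """Heuristic: treat a line as letterhead if it contains a postcode
    or a phone number."""
    return bool(UK_POSTCODE.search(line) or UK_PHONE.search(line))

print(looks_like_address_line("Heart Failure Clinic, Example Road, Birmingham B1 1AA"))
# -> True, so a 'heart failure' match on this line would be ignored
```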
Performance
Table 2 shows the framework's overall performance measured against the three structured data sources, along with summary statistics for the manual validation. For the identification of HCM patients, precision and recall were high. Manual validation showed that a small number of negative and uncertain cases were incorrectly classified by the algorithm. For HCM patients with AF, performance was good in all experiments, while for HCM patients with HF, recall was lower when validated against ICD-10 codes. There are two possible explanations for the latter. The first is that clinical coders identified patients with HCM and HF within ICD-10 codes from Emergency Department documentation, or in handwritten inpatient notes, which could not be examined via NLP. The second is that clinical coders picked up on HF recorded in documents which did not also mention HCM. Table 3 displays a breakdown of the manual validation performance per classification and shows that good precision persists across the various subcohorts, indicating that the framework can accurately stratify patients into HCM-related subgroups. The low precision for patients with HCM and HF relates to the low patient numbers. NLP also improved the quality of registry data by identifying 24 people who had been misentered in the registry as having HCM when in fact they were gene-positive but phenotype-negative.

In Stage 1, as shown in Figure 6, from a total of 25,356 letters with HCM keywords relating to 11,083 patients, 8,178 scanned documents were removed. This left 17,178 documents belonging to 3,120 patients. In Stage 2, the algorithm then classified and excluded 876 patients who were negative for the condition and 457 patients with uncertain disease (Table 4). A further 34 patients were found by the NLP algorithm but then discovered not to exist, due either to incorrect metadata in the document system or to incorrect extraction of hospital numbers from letters. They were excluded from further analysis. Of confirmed HCM patients, 357 had AF and 205 had HF. A total of 1,029 patients were found to have a family history of HCM, 696 of whom had an uncertain or negative status themselves.

In Stage 3, 1,753 patients were classified as positive for HCM. A further 304 patients (91 in the registry, 213 in ICD-10 codes) who were not found by NLP were added, producing a total cohort of 2,045 HCM patients. The registry false negatives arose because our organisation has historically provided local hospitals with cardiovascular magnetic resonance (CMR) imaging and genetic testing via outreach from the neighbouring West Midlands Regional Genetics Service. Although these patients were captured in the registry, they did not have NLP-accessible documents. For ICD-10 false negatives, it is likely that inpatient stays were coded for HCM but excluded from NLP analysis as the diagnostic data was recorded in handwritten inpatient notes and omitted from electronic discharge summaries.

In Stage 4, this information was integrated with administrative data (Table 5). Almost 1,200 patients were under active management, 361 were discharged and 310 had died. In total, 214 patients were not under active care with a cardiologist. Subdivision of the confirmed HCM cohort by data source (Figure 7, Table 6) showed that ICD-10 added 404 patients (Group 2) to Group 1, which contained 932 patients. NLP contributed 709 patients (Group 3). The relatively modest provision by ICD-10 codes reflects the fact that coding is only applied to inpatient and not outpatient visits. NLP recapitulated almost all patients captured by the registry and ICD-10 codes, apart from the 304 false-negative patients.

Need for Population Health Management Model of Care
Specialist HCM services are grappling with three factors. Firstly, demand is rising, with a tenfold increase in the number of diagnoses after 2010 compared to pre-2000 3. There is an associated shift in demographics and disease features. The average patient is older, less severely affected and lower-risk 3, likely a consequence of increased disease awareness, imaging and genetic-led screening. Secondly, the COVID pandemic has uprooted the traditional, fixed-appointment follow-up schedule. Although this model does not prevent patients from being lost to follow-up, it did at least provide a safety net for long-term monitored patients. Finally, while clinical teams have the knowledge needed to re-design services 16-20, current electronic records make it difficult to act on this data systematically. Instead, new knowledge is implemented by hand on a patient-by-patient basis. Population health management offers a means to respond by automatically capturing, stratifying and tracking an entire population of patients to address inequities, intensify care of the sickest and manage the healthiest remotely. Our study examines how NLP might help clinical teams adopt this approach by first generating an accurate and partially stratified patient list in a large teaching hospital.

Comparison with Existing Models
NLP frameworks employ either machine learning or rule-based recognition/classification strategies.
Although machine learning approaches return results quickly once a training set is assembled, we chose the rule-based alternative because its grounding in the grammatical structure of language makes it disease-agnostic. Incorporating an expert clinician's background knowledge to generate vocabulary and rules for concept recognition in text has been shown to outperform machine learning methods in certain settings 21. The web validation tool and workflow, in combination with the use of domain vocabularies, enable quick and continuous expert feedback, leading to improved performance. For the clinician, the web validation tool markedly shortened the review process. It also allowed them to understand how errors came about, as the evidence for each assertion was presented.

The strategy of determining clear-cut cases by excluding patients whose disease mentions did not imply that they had the condition is typical of negation classifiers in information retrieval pipelines (for examples, see NegEx and negation-detection 22,23). Our approach differs from these methods in that it uses a dependency-parser distance procedure and includes additional classifiers for sentence context. In a recent head-to-head comparison with these algorithms, involving 5,000 sentences from HCM patients at our hospital, ours performed favourably 13.

The only other work applying NLP in HCM 12 involved an open-source NLP package called MedTagger. It focused on mining for instances of syncope and positive family history of sudden death within 200 patients with HCM selected from a specialist service registry. Finding diagnosed HCM patients across the entire electronic patient record is a more complicated task than searching within known patients, as evidenced by the lower performance for identification of HCM compared with detection of the two comorbidities. The Mayo work used a neural network-based approach to information extraction, while our method employs a rule-based system. In specific settings, rule-based NLP systems offer better performance than machine learning-based strategies because they can be tailored to the needs of an experiment 21. Different validation sources (billing codes for the previous work, ICD-10 for ours) also make the results difficult to compare. In our setting, we found that ICD-10 codes were an inadequate resource for diagnoses.

After excluding scanned documents, NLP was able to place a full list of patients in the clinical team's sights quickly, minimising the manual work needed to obtain this information. To ensure HCM patients are not missed, the notes of the 1,333 negative or uncertain HCM patients could be rapidly reviewed via the web tool. Without this pipeline, evaluating by hand 3,120 patient records featuring over 17,000 documents containing HCM keywords would be impractical. Subdividing HCM patients by the two commonest complications means patients can be cohorted in care pathways. We did not attempt the more complex task of stratifying patients by sudden death risk using text-mined risk factors 18,19. Merging NLP output with follow-up arrangements identified the 1.9% of patients who were at high risk of being entirely overlooked as they were not under a cardiologist.
By determining the small number of particularly vulnerable patients who would benefit from further evaluation, the clinical team could allay concerns regarding the amount of additional work NLP would generate. It is likely that the final number would be smaller still after adjusting for patients under active cardiology management elsewhere. Care could also be streamlined for likely screened patients, identified as a byproduct of the use of family history in patient annotation. NLP was able to find almost all of the HCM patient cohort already known to the hospital, highlighting the shortcomings of ICD-10 codes in describing a condition predominantly managed in the outpatient setting. The results of this work provide an insight into how our organisation delivers HCM care and handles information, a pattern likely to be replicated in other hospitals delivering regional services.

Our framework did not analyse scanned documents due to the errors associated with this document type. In the future, improved OCR techniques could allow NLP to bring content from these documents 'into play', and we plan to integrate such techniques into the pipeline. This work argues for an information management strategy that avoids scanning by using only electronic communication with external organisations, and promotes point-of-care tools with which clinicians can quickly capture diagnoses prospectively in a structured form.

Further investigation is needed to measure the algorithm's accuracy across different clinical environments. We anticipate that the methodology would be equally useful in other diseases, especially inherited conditions, owing to its family-history detection component. We recognise the need to train and validate the framework for each disease group and centre. Since the work here has dealt with most general errors, we anticipate that training requirements will fall over time. The algorithm's performance will continue to improve as it receives knowledge from training feedback. We will also need to address technical barriers to deployment at other sites. The pipeline is ideally run on server infrastructure. However, its low resource requirements mean it could be run on a standard hospital computer, where running time would be proportional to the performance capacity of the machine; we predict this would take around a week in practice.

Misclassification could result from transcription errors arising between clinic letters. If AF and HF, but not HCM, occurred in documents that post-date those containing the HCM diagnosis, these complications would be missed. Patients may also be incorrectly classified. In the future, this could be mitigated by using confirmatory data from genetic or cardiac imaging results.

In this study, our NLP framework improved the accuracy of patient lists by discovering more than 700 additional patients not known to any structured resource, including those managed by the specialist service. It characterised patients with the two disease complications and those without cardiology follow-up, and identified individuals likely being screened.
This work revealed the limitations of routine hospital records and disease registries for teams moving to a PHM model of care. Although the use of NLP creates fresh challenges around information management, the potential payoff for targeted and responsive care is substantial.

Figure legend: The testing cycle was repeated until high performance was achieved.
Figure legend: The tool determines the strength of relationships between modifiers and mentions to arrive at the context in which the mention is made.
Figure legend: Patient numbers are provided on the left for each stage of the pipeline. The boxes with dotted outlines describe document numbers. Boldface indicates key points in the flowchart. NLP, natural language processing.
Table note: The rows with a (*) in the final column are mutually exclusive. As the same patient could be seen by multiple teams, the numbers within the team rows relate to the patient's current status with that team only, i.e. a patient may be actively monitored by general cardiology but have been discharged from the Heart Muscle Disease Service. The total rows for patients not being actively monitored by any cardiology team are based on their last interaction. The total to be seen (bold) is the sum of lost to follow-up, seen only by a non-cardiologist and seen as an inpatient only.
Table 6. Subdivision of the confirmed HCM cohort by data source.
Found in registry (Group 1): 932
Found in ICD-10 codes: 777
Found in ICD-10 codes, also in registry: 373
Group 2 (found in ICD-10 codes, not in Group 1): 404
Found using NLP: 1,753
Found using NLP, also in registry: 480
Found using NLP, also in ICD-10 codes: 203
Found using NLP, also in both registry and ICD-10 codes: 361
Group 3 (unknown to registry and ICD-10 codes): 709
Not found by NLP but present in registry: 91
Not found by NLP but present in ICD-10 codes: 213
NLP indicates natural language processing; ICD-10, International Classification of Diseases, Tenth Revision.

References
1. Prevalence of Hypertrophic Cardiomyopathy in a General Population of Young Adults: Echocardiographic Analysis of 4111 Subjects in the CARDIA Study.
2. Prevalence of Unexplained Left Ventricular Hypertrophy by Cardiac Magnetic Resonance Imaging in MESA.
3. The SHaRe Investigators. Temporal Trend of Age at Diagnosis in Hypertrophic Cardiomyopathy: An Analysis of the International Sarcomeric Human Cardiomyopathy Registry. Circ Heart Fail.
4. ESC Guidelines on diagnosis and management of hypertrophic cardiomyopathy: the Task Force for the Diagnosis and Management of Hypertrophic Cardiomyopathy of the European Society of Cardiology (ESC).
5. 2020 AHA/ACC Guideline for the Diagnosis and Treatment of Patients With Hypertrophic Cardiomyopathy.
6. Committee on Assuring the Health of the Public in the 21st Century.
7. Chapter 13 - Big Data and Population-Based Decision Support.
8. Semantic computational analysis of anticoagulation use in atrial fibrillation from real world data.
9. Development of a Portable Tool to Identify Patients With Atrial Fibrillation Using Clinical Notes From the Electronic Medical Record.
10. Use of Natural Language Processing to Improve Identification of Patients With Peripheral Artery Disease.
11. Automatic multilabel detection of ICD10 codes in Dutch cardiology discharge letters using neural networks.
12. Automated extraction of sudden cardiac death risk factors in hypertrophic cardiomyopathy patients by natural language processing.
13. A fast, accurate, and generalisable heuristic-based negation detection algorithm for clinical text.
14. Komenti: A semantic text mining framework.
15. Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach.
16. For the SHaRe Investigators. Hypertrophic Cardiomyopathy With Left Ventricular Systolic Dysfunction: Insights From the SHaRe Registry.
17. Genotype and Lifetime Burden of Disease in Hypertrophic Cardiomyopathy.
18. Hypertrophic Cardiomyopathy Outcomes Investigators. A novel clinical risk prediction model for sudden cardiac death in hypertrophic cardiomyopathy (HCM risk-SCD).
19. Enhanced American College of Cardiology/American Heart Association Strategy for Prevention of Sudden Cardiac Death in High-Risk Patients With Hypertrophic Cardiomyopathy.
20. Distinct Subgroups in Hypertrophic Cardiomyopathy in the NHLBI HCM Registry.
21. Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches. arXiv:1903.03985 [cs].
22. A simple algorithm for identifying negated findings and diseases in discharge summaries.
23. Don't Let Notes Be Misunderstood: A Negation Detection Method for Assessing Risk of Suicide in Mental Health Records.

Appendix
Exclusion vocabulary: heart failure model, survival score, maggic, explained that hcm, may also be considered, high risk idf, no way to prevent, target, disease/chest pain, medical centre, w33, hcm clinic, cilnic, electronically, identified the cause, impediment to surgery, grateful, services, consider referral, esc hcm risk, koine, discharge information, screening, screened, screen, screens, hcm risk, in a patient, for a patient, previously
Uncertainty vocabulary: almost, suggestive, if, suggested, indicative, either, question, query, not confirmed, nothing, apparently, appeared, appearing, appears, conceivable, could, depend, depended, depending, depends, points to, check, checking, probability, unlikely, likely, review, reviewed, differential, differentials, even if, reported as normal, exclude, maybe, may be, indicate, look for, exclude, assess, excludes, arranged, suspected, screening, screened, suspicion, suspicious, distinguish, distinguishing, family studies, might, nearly, occasionally, perhaps, possible, possibly, seldom, seldomly, sometimes, somewhat, uncertain, uncertainly, can, frequently, likely, tbc, often, ought, probable, whether, probably, rarely, regularly, should, tends, usually, abeyance, abeyances, almost, alteration, alterations, ambiguities, ambiguity, ambiguous, anomalies, anomalous, anomalously, anomaly, anticipate, anticipated, anticipates, anticipating, anticipation, anticipations

Acknowledgements
We are grateful to Ameenathal Fawzy, Asgher Champsi, Rubia Akhtar, Haroon Ahmad, Noreen Akram, and Dino FA Motti for performing initial validation. We would also like to thank Andreas Karwath, Paul Schofield, and Egon Willighagen for useful discussions surrounding the NLP technology. This work uses data provided by patients and collected by the NHS as part of their care and support.

Funding
GVG and LTS acknowledge support from the NIHR Birmingham ECMC, NIHR Birmingham SRMRC, Nanocommons H2020-EU (731032), the NIHR Birmingham Biomedical Research Centre and the MRC HDR UK (HDRUK/CFC/01), an initiative funded by UK Research and Innovation, the Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities. The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research, the Medical Research Council or the Department of Health.