key: cord-1043570-df42e8m1
authors: Ljubic, Branimir; Roychoudhury, Shoumik; Cao, Xi Hang; Pavlovski, Martin; Obradovic, Stefan; Nair, Richard; Glass, Lucas; Obradovic, Zoran
title: Influence of Medical Domain Knowledge on Deep Learning for Alzheimer's Disease Prediction
date: 2020-09-20
journal: Comput Methods Programs Biomed
DOI: 10.1016/j.cmpb.2020.105765
sha: 982c06261ab475bf4ff586bceb40f47a8ad3a9ce
doc_id: 1043570
cord_uid: df42e8m1

BACKGROUND AND OBJECTIVE: Alzheimer's disease (AD) is the most common type of dementia and can seriously affect a person's ability to perform daily activities. Estimates indicate that AD may rank third as a cause of death for older people, after heart disease and cancer. Identification of individuals at risk for developing AD is imperative for testing therapeutic interventions. The objective of the study was to determine whether the diagnosis of AD from EMR data alone (without relying on diagnostic imaging) could be significantly improved by applying clinical domain knowledge in data preprocessing and positive dataset selection rather than setting naïve filters.

METHODS: Data were extracted from a repository of heterogeneous ambulatory EMR data collected from primary care medical offices all over the U.S. Medical domain knowledge was applied to build a positive dataset from data relevant to AD. Selected Clinically Relevant Positive (SCRP) datasets were used as inputs to a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) deep learning model to predict whether the patient will develop AD.

RESULTS: Risk score prediction of AD using the drugs domain information in an SCRP AD dataset of 2,324 patients achieved a high out-of-sample score of 0.98-0.99 Area Under the Precision-Recall Curve (AUPRC) when using 90% of the SCRP dataset for training. AUPRC dropped to 0.89 when training the model using fewer than 1,500 cases from the SCRP dataset. The model was still significantly better than when using naïve dataset selection.
CONCLUSION: The LSTM RNN method that used data relevant to AD performed significantly better when learning from the SCRP dataset than when datasets were selected naïvely. The integration of qualitative medical knowledge for dataset selection with deep learning technology provided a mechanism for significant improvement of AD prediction. Accurate and early prediction of AD is significant for the identification of patients for clinical trials, which could result in the discovery of new drugs for the treatment of AD. The proposed predictions of AD also contribute to a better selection of patients who need imaging diagnostics for the differential diagnosis of AD from other degenerative brain disorders.

According to the National Institute on Aging, more than 5.5 million Americans are diagnosed with AD. [1] Estimates indicate that AD may rank third as a cause of death for older people, behind only heart disease and cancer. [2,3] Identification of individuals at risk for developing AD is imperative for testing therapeutic interventions. [4] Many researchers have presented an overview of the classification of Mild Cognitive Impairment (MCI). [5] Early diagnosis could help with the recruitment of patients to participate in clinical trials and with the testing of possible new drug therapies for AD. Several studies indicate that the use of imaging for early detection of AD is imperative to early diagnosis. [6] Diagnostic imaging can be very costly, which raises the question of the cost-effectiveness of imaging in AD. [7]

Heterogeneous structures of Electronic Medical Records (EMR) data pose a challenge for machine learning (ML) algorithms. [8] [9] [10] For our study, we used a repository of heterogeneous ambulatory EMR data collected from primary care medical offices spread over the U.S. These data are typical of ambulatory EMR data from medical practices and are not data from clinical trials.
The data for this study were sourced from IQVIA and EMR vendors and then mapped into the Observational Medical Outcomes Partnership (OMOP) format.

ML algorithms utilizing Magnetic Resonance Imaging (MRI) for the prediction of AD have been developed. [11] [12] [13] An application of RNN models was designed to differentiate AD patients from healthy control individuals using neuroimaging data. [14] Longitudinal EMRs have been used to study the progression of chronic diseases like AD. [15, 16] An LSTM RNN can effectively predict AD progression by fully leveraging the temporal and medical patterns derived from patients' office visits. Our research goal was to implement an LSTM RNN deep learning configuration to predict AD diagnosis using EMR data alone (without relying on diagnostic imaging). [17, 18]

The study objective was to show that the selection of relevant input datasets is important for the overall predictive performance of the LSTM RNN model. We wanted to determine whether applying medical domain knowledge in data preprocessing and positive dataset selection significantly improves the prediction of AD compared with the naïve model. The current coronavirus (COVID-19) health crisis, which has severely affected the entire world, demonstrates the importance of relevant medical data in the creation of analytical and predictive models as well as in the application of adequate preventive health measures. [19] Furthermore, we attempted to efficiently apply the drugs domain in the prediction of AD. The objective was also to evaluate the contribution of individual clinical domains, as well as of an ensemble of a few domains, to the prediction of AD.

An accurate early prediction of AD could help with the recruitment of patients for clinical trials, which could help find new drug therapies for AD. The predictions of AD also contribute to a better selection of patients who need imaging diagnostics for the differential diagnosis of AD.
Our comprehensive methodology comprises the selection of relevant inputs using medical domain knowledge and the construction of the LSTM RNN deep learning model.

We used the ambulatory EMR database to predict the occurrence of AD. The data are in the OMOP common data model (CDM) format. The terminologies used to describe clinical conditions vary from database to database. In the OMOP concept, data contained in different types of observational databases are transformed into a common format. The OMOP CDM provides a common data standard for analyzing multiple data sources concurrently. [20] The OMOP concept allows the evaluation of individual clinical domains separately as well as of an ensemble of different domains.

Patients with an AD diagnosis in the EMR database were extracted using the OMOP code for AD. We found 24,734 AD patients in the EMR database. This dataset does not contain patients treated with experimental therapies for dementia or AD. It contains only medications that are approved by the FDA and available for everyday use. The measurement domain includes labs measured in ambulatory settings, such as blood tests and vital signs. It does not contain data obtained in clinical settings, such as results for CSF biomarkers related to tau and amyloid. The condition domain data contain only basic diagnoses and do not include specific cognitive measurements such as episodic memory deficits according to psychometric tests. We did not use any imaging data.

In the initial experiment, we built positive datasets by setting certain filters; we call this approach naïve. We filtered out conditions (diagnoses) that appeared fewer than 30 times in the whole dataset; these are rare diseases that carry little information and represent noise. We also filtered out patients who had fewer than four visits. The number of visits was set to four or more to capture the temporal nature of patients' histories.
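The two naïve filters described above can be illustrated with a short sketch. It is not the authors' code: the record layout (patient id, visit id, condition code), the function name `naive_filter`, and the choice to count visits after rare conditions have been removed are assumptions made for illustration only; only the two thresholds (30 occurrences, four visits) come from the text.

```python
from collections import Counter

MIN_CONDITION_COUNT = 30  # from the text: drop conditions seen fewer than 30 times
MIN_VISITS = 4            # from the text: keep patients with four or more visits


def naive_filter(records, min_condition_count=MIN_CONDITION_COUNT,
                 min_visits=MIN_VISITS):
    """Apply the two naive filters to (patient_id, visit_id, condition_code) tuples.

    The tuple layout is hypothetical; real OMOP tables have more columns.
    """
    records = list(records)

    # Filter 1: remove conditions that are rare across the whole dataset.
    condition_counts = Counter(code for _, _, code in records)
    frequent = [r for r in records if condition_counts[r[2]] >= min_condition_count]

    # Filter 2: keep only patients with enough distinct visits.
    # Assumption: visits are counted after rare conditions are removed.
    visits_per_patient = {}
    for patient_id, visit_id, _ in frequent:
        visits_per_patient.setdefault(patient_id, set()).add(visit_id)
    keep = {p for p, v in visits_per_patient.items() if len(v) >= min_visits}

    return [r for r in frequent if r[0] in keep]
```

Both thresholds are exposed as parameters so that the effect of looser or stricter filters on the size of the positive dataset can be checked quickly.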
[21] The negative dataset was selected randomly from the entire ambulatory EMR database and consists of patients who had 4 visits, who did not develop AD, and who were born before 1950. By selecting this age group, we made the age distribution of patients in the negative dataset almost identical to the age distribution of patients in the positive datasets. The average age at AD diagnosis in the positive datasets was 80.2 years. About 51% of patients in the positive datasets were older than 80 at the time of AD diagnosis. The interval between the last visit before AD and the visit at which AD was diagnosed varies between a few days and many years.

Further, we applied medical domain knowledge to build positive datasets. The development of AD can roughly be classified into three stages: the preclinical stage, the MCI stage, and the clinical AD stage. Since the OMOP vocabulary does not have a hierarchical organization like ICD coding, we had to search for terms that match ICD codes for MCI. We included the following conditions from the OMOP vocabulary to define the pre-AD stage: Mild Cognitive Impairment, Memory impairment, Organic mental disorder, Amnesia, Forgetful, Cognitive disorder. The final dataset represents the union of the above-mentioned conditions.

In the naïve setup, there were cases in which patients had, for example, knee injury, common cold, and flu in their medical history, and the naïve model predicted the occurrence of AD from these. Clinicians advised us that the data should have some relevance to AD for the predictions to be accepted by medical experts. We therefore applied medical domain knowledge and constructed datasets that include the MCI stage, to avoid predicting AD only from unrelated diseases. All selected patients had the MCI condition in the pre-AD stage. None of the visits assigned to the pre-AD stage contained the OMOP code for AD diagnosis. We selected all data over a continuous period before the first AD visit date in our positive datasets.
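The selection of pre-AD histories described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: conditions are represented by their names rather than OMOP concept ids, `AD_LABEL` is a hypothetical stand-in for the OMOP AD code, and the function names are invented for the example. The six condition names are the ones listed in the text.

```python
from datetime import date

# Condition names listed in the text as defining the pre-AD (MCI) stage.
PRE_AD_CONDITIONS = {
    "Mild Cognitive Impairment", "Memory impairment", "Organic mental disorder",
    "Amnesia", "Forgetful", "Cognitive disorder",
}
AD_LABEL = "Alzheimer's disease"  # hypothetical stand-in for the OMOP AD concept


def pre_ad_history(visits, ad_label=AD_LABEL):
    """Return visits strictly before the first AD visit, oldest first.

    visits: list of (visit_date, set_of_condition_names) for one patient.
    Returns None when the patient has no AD diagnosis at all.
    """
    ad_dates = [d for d, conditions in visits if ad_label in conditions]
    if not ad_dates:
        return None
    first_ad = min(ad_dates)
    # Strictly earlier visits only, so no retained visit carries the AD code.
    return sorted((v for v in visits if v[0] < first_ad), key=lambda v: v[0])


def is_scrp_candidate(visits, pre_ad_conditions=PRE_AD_CONDITIONS):
    """True if the pre-AD history contains at least one MCI-stage condition."""
    history = pre_ad_history(visits)
    if not history:
        return False
    return any(conditions & pre_ad_conditions for _, conditions in history)
```

A patient whose pre-AD history holds only unrelated conditions (for example a knee injury) is rejected by `is_scrp_candidate`, which mirrors the motivation given in the text for moving away from the naïve selection.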
The first AD visit date is the date when the patient was diagnosed with AD for the first time. We did not include visits after the first AD diagnosis, to avoid data leakage.

Experiments were conducted using the following three domains: conditions, measurements, and drugs. We selected patients who had at least 4 visits and data in each of the three domains, to be able to compare prediction results for AD among these domains. This preprocessing resulted in the final relevant dataset of 2,324 patients. We call this dataset the Selected Clinically Relevant Positive (SCRP) dataset. One SCRP positive dataset was constructed for each of the domains. Experiments with patients who had data in different combinations of two domains were also conducted. The datasets used in the experiments are presented in Table 1.

In the training phase, the AD positive to negative ratio was 1:2 (majority sub-sampling). In the testing phase, the AD positive to negative ratio was 1:9 in experiments with the naïve dataset and between 1:5 and 1:25 in experiments with SCRP datasets. The 1:9 positive to negative ratio corresponds to the percentage of AD in the general population (10-15%). In SCRP datasets we used ratios between 5% and 25% in the testing phase to simulate different possible subpopulations. Negative examples that complemented the positive datasets were selected randomly from the preprocessed dataset of patients described earlier. For each domain, we ran at least 20 trials.

In all experiments, the number of hidden LSTM units was set to 100, the number of epochs to 100, and the batch size to 100. We used the Adam optimizer. The objective was to predict whether the patient will be diagnosed with AD at the next visit. The precision-recall curve, whose area (AUPRC) we report, is a plot of precision (y-axis) against recall (x-axis) for different thresholds. We used the Softmax function to output a risk score for AD. We ran our LSTM RNN code separately for each of the three domains.
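The evaluation described above, a softmax risk score summarized by the area under the precision-recall curve, can be sketched in plain Python. The text does not say which AUPRC estimator was used, so the step-wise (average precision) form below is an assumption, as are the function names and the two-logit output layout (negative class first, AD class second).

```python
import math


def softmax_risk(logits):
    """Return the AD risk score from a pair of logits (negative, positive)."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    return exps[1] / sum(exps)


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 0/1 ground truth; scores: risk scores, higher means more likely AD.
    Tied scores are ranked in input order, a simplification of this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this threshold times the recall increment (1 / n_pos).
            area += (true_positives / rank) / n_pos
    return area
```

Unlike ROC-based metrics, this quantity depends on the positive to negative ratio of the test set, which is one reason the test ratios reported in the text matter when comparing AUPRC values across experiments.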
In the end, we combined the results of each domain into an ensemble model (the average of the outputs from the best performing domains).

In the naïve model, after applying the filters described in Section 2, 2,600 patients with at least 4 visits were in the positive dataset. Risk score predictions in the naïve model in the form of

The final SCRP dataset, with all three domains and at least 4 visits, comprised 2,324 patients. We ran the LSTM RNN model using the condition, measurement, and drug domains separately. Initially, we used 10% of the data for testing and 90% for training our model. AD prediction based on the condition domain was similar to the naïve model. However, the measurement (labs) domain produced significantly better results than the naïve model (AUPRC 0.98-0.99). We achieved a successful application of the drugs domain in our model. The selected drugs (145 distinct drugs) were used to treat different conditions considered as pre-AD, and they were administered before patients had been diagnosed with AD. These were standard drugs already Table 2.

We kept the size of the testing set fixed and slowly decreased the size of the training set. The influence of different splits of the dataset into training and testing sets is illustrated in Figure 6 for the drugs domain-specific LSTM RNN model. In each of the scenarios we attempted, the critical training set size below which the model's accuracy started to drop appears to be about 1,500 patients. Experiments that estimated the effect of increased coverage on prediction quality, as well as of different numbers of visits (2, 3, or 4), are presented in Table 3.

Arbabshirani and colleagues presented a review of more than 200 studies describing ML techniques in the prediction of AD and other brain diseases, using imaging. [ and subsequently the selection of the most optimal ML models.
Our results suggest that special consideration is required to identify a medically relevant positive dataset as the optimal input to the LSTM RNN prediction model. The conditions domain was crucial in the selection of relevant information that contributed to the creation of datasets with patients who had MCI conditions before AD was diagnosed. The drugs domain was successfully applied in our model. Other researchers usually attempt to apply all drugs contained in their datasets. [9] Tang and colleagues attempted to apply all drugs from the MIMIC dataset in their models and did not achieve good disease prediction results using this domain. [9] Most drugs are not relevant to the diseases of interest and act as noise that degrades prediction. Our research shows that the optimal approach with the drugs domain is to apply drugs class by class and to evaluate the contribution of each class and of combinations of classes to the prediction of diseases, which differs from the model described by Tang and colleagues.

The significance of accurate and early prediction of AD lies in the identification of patients for clinical trials, which could result in the discovery of new drugs for the treatment of AD. The predictions of AD also contribute to a better selection of patients who need some form of imaging diagnostics for AD.

Alzheimer's Disease? Available at
Predictive Modeling of the Progression of Alzheimer's Disease with
Classification and epidemiology of MCI
Early detection of Alzheimer's disease using neuroimaging
Cost-effectiveness of PET in the diagnosis of Alzheimer's disease
Learning from heterogeneous temporal data in electronic health records
Predictive modeling in urgent care: a comparative study of machine learning approaches
Predictive modeling of the severity/progression of Alzheimer's diseases
Machine learning framework for early MRI-based Alzheimer's conversion prediction in MCI subjects
Disease Diagnosis and Biomarker Identification
Forecasting the progression of Alzheimer's disease using neural networks and a novel preprocessing algorithm
Predictive Modeling of Longitudinal Data for Alzheimer's Disease Diagnosis Using RNNs. International Workshop on Predictive Intelligence In Medicine
Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review
Finding progression stages in time-evolving event sequences
Long Short Term Memory
Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network
Predicting Complications of Diabetes Mellitus Using Advanced Machine Learning Algorithms
Alzheimer's Disease Diagnosis From Diffusion Tensor Images Using Convolutional Neural Networks
Single Subject Prediction of Brain Disorders in Neuroimaging: Promises and Pitfalls
Using Path Signatures to Predict a Diagnosis of Alzheimer's Disease
Studying the Manifold Structure of Alzheimer's Disease: A Deep Learning Approach Using Convolutional Autoencoders
A Novel Deep Learning Framework on Brain Functional Networks for Early MCI Diagnosis
Prediction of Alzheimer's Disease Dementia With MRI Beyond the Short-Term: Implications for the Design of Predictive Models
Random Forest Prediction of Alzheimer's Disease Using Pairwise Selection From Time Series Data
A Parameter-Efficient Deep Learning Approach to Predict Conversion From Mild Cognitive Impairment to Alzheimer's Disease
Machine Learning-based Virtual Screening and Its Applications to Alzheimer's Drug Discovery: A Review
Longitudinal Clinical Score Prediction in Alzheimer's Disease With Soft-Split Sparse Regression-Based Random Forest
Identifying Undetected Dementia in UK Primary Care Patients: A Retrospective Case-Control Study Comparing Machine-Learning and Standard Epidemiological Approaches

This research sets the framework for future analyses of disease-relevant temporal heterogeneous EMR data. Further research is necessary to evaluate different groups of drugs, measurements, and conditions and their contribution to the successful prediction of AD.

The authors declare no conflict of interest.

B.L., S.R., and X.H.C. designed and implemented the method and conducted all the experiments. All authors were involved in writing the paper.

This study was supported in part by the IQVIA grant Disease Detection and Disease Progression Project.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.