key: cord-1043374-sfrr4zww
authors: Roland, T.; Boeck, C.; Tschoellitsch, T.; Maletzky, A.; Hochreiter, S.; Meier, J.; Klambauer, G.
title: Machine Learning based COVID-19 Diagnosis from Blood Tests with Robustness to Domain Shifts
date: 2021-04-09
journal: nan
DOI: 10.1101/2021.04.06.21254997
sha: 48a5c30303b22c98e7bffa94c0d7731d87f11769
doc_id: 1043374
cord_uid: sfrr4zww

We investigate machine learning models that identify COVID-19 positive patients and estimate the mortality risk based on routinely acquired blood tests in a hospital setting. However, during pandemics or new outbreaks, disease and testing characteristics change, thus we face domain shifts. Domain shifts can be caused, e.g., by changes in the disease prevalence (spreading or tested population), by refined RT-PCR testing procedures (taking samples, laboratory), or by virus mutations. Therefore, machine learning models for diagnosing COVID-19 or other diseases may not be reliable and may degrade in performance over time. To countermand this effect, we propose methods that first identify domain shifts and then reverse their negative effects on the model performance. Frequent re-training and re-assessment, as well as stronger weighting of more recent samples, keep model performance and credibility at a high level over time. Our diagnosis models are constructed and tested on large-scale data sets, steadily adapt to observed domain shifts, and maintain high ROC AUC values throughout the pandemic.

Reverse transcription polymerase chain reaction (RT-PCR) tests 1 are still the gold standard for diagnosing the coronavirus disease 2019 (COVID-19) 2 . However, RT-PCR tests are expensive, time-consuming, and not suited for high-throughput or large-scale testing efforts. In contrast, antigen tests are cheap and fast, but they come with considerably lower sensitivity than RT-PCR tests 3 . Blood tests for COVID-19 are a promising technique, since they unify the best of RT-PCR and antigen tests: they are cheap, fast, efficient, and have sufficiently high sensitivity when combined with machine learning (ML) methods. Furthermore, automatically checking all routinely taken blood tests for COVID-19 allows frequent, fast and broad scanning at low cost, thus providing a powerful tool to contain new outbreaks 4, 5 . Therefore, we assess ML methods for diagnosing COVID-19 from blood tests. ML can enhance the sensitivity of cheap and fast tests such as antigen 6 or blood tests, thereby enabling a cost-efficient alternative to RT-PCR tests. ML-enhanced tests could be particularly useful for asymptomatic patients with a routine blood test, who would otherwise not be tested for COVID-19. In this scenario, COVID-19 positive patients could be identified and isolated, and a further spread of the virus might be prevented. Especially in developing countries with limited testing capacities, ML-enhanced tests can evolve into an efficient tool for combating a pandemic. To confine the spread of infectious diseases, and especially the COVID-19 pandemic, ML approaches can be applied in very different ways 7 . ML algorithms help in developing vaccines and drugs for the treatment of COVID-19 [8] [9] [10] . COVID-19 and the patient's prognosis can be predicted from chest CT scans, X-rays [11] [12] [13] [14] or sound recordings of coughs or breathing [15] [16] [17] .
Furthermore, it has been shown that ML models based on blood tests are capable of detecting COVID-19 infection [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] and predicting other outcomes, such as survival or admission to an intensive care unit 33-41 . An ML model is constructed from a training data set and then predicts the label or class for a new data item. The quality, size and characteristics of the training data set strongly determine the predictive quality of the resulting model on new data. The central ML paradigm is that training data and future (test) data have the same distribution. This paradigm guarantees that the constructed or learned model generalizes well to future data and has high predictive performance on new data. However, this paradigm is violated during pandemics. Data sets collected during the progression of the COVID-19 pandemic are characterized by strong changes in distribution, called domain shifts. These domain shifts violate the central ML paradigm; nevertheless, they were insufficiently considered or even neglected during the evaluation of ML models. Unexpected behavior of the models in real-world hospital settings often stems from neglected domain shifts 42 . Such unexpected behavior could even lead to unfavorable consequences, like a major disease outbreak in a hospital. Most of the previous ML studies evaluated the predictive performance of the learned models by cross-validation, bootstrapping or fixed splits on randomly drawn samples [18] [19] [20] [21] [22] [26] [27] [28] [29] [30] [31] [32] . However, the theoretical justification of these evaluation methods is heavily founded on the central ML paradigm: that the distributions remain constant over time. Disregarding domain shifts is a culpable negligence, since they may lead to an overoptimistic performance estimate on which medical practitioners base their decisions; these decisions are then misguided. Yang et al. 25 and Plante et al. 24 addressed domain shifts via evaluation on an external data set. Yang et al. 25 trained and evaluated their models on data from the same period; therefore, temporal domain shifts were not sufficiently considered. The training and external evaluation sets of Plante et al. 24 only include pre-pandemic negatives; pandemic negatives were not used. Soltan et al. 23 considered the temporal domain shift by conducting a prospective evaluation. However, analogous to Plante et al. 24 , the negatives are all pre-pandemic; therefore, the domain shift is artificially generated and can deviate from domain shifts during the pandemic.

In the following, we describe the categories of domain shifts that can occur in COVID-19 data sets. For the categorization, we have to consider two random variables, which are both obtained by testing a patient:

• x: Outcome of a fast and cheap test. The measurement values for a patient, which serve as input data (input features) for an ML model. We assume that the COVID-19 status (positive / negative) can, to some extent, be inferred from these tests. The measurements can arise from a fast and cheap test such as a blood test or vital sign measurement. To illustrate this variable, we assume that x is the fibrinogen level, since it tends to rise during a systemic inflammation 43 .

• y: Outcome of the slow and expensive COVID-19 RT-PCR test, which is assumed to be binary, y ∈ {0, 1}, to indicate the COVID-19 status. The test result y is assumed to be the ground truth and should give the actual COVID-19 status.
Our goal is to use ML methods to predict y from x, in order to replace the slow and expensive COVID-19 RT-PCR test by a fast and cheap test. Examples of temporal domain shifts, which affect the model performance and the trustworthiness of performance estimates, are shown in Figure 1 a. We distinguish the following categories:

• Prior shift: p(y). The probability of observing a certain RT-PCR test result, e.g., y = 1, strongly changes during the pandemic. If the overall prevalence of the disease in the population is high, the probability of observing a positive test usually increases.

• Covariate shift: p(x). The distribution of the patient features is also affected by the overall pandemic course. E.g., if the prevalence of the disease is high, more persons suffer from disease symptoms, with potentially high fibrinogen, and go to the hospital. Nevertheless, fibrinogen levels could also change without connection to the pandemic, for example, with the time of year 46 . Or, if testing becomes obligatory, the tested population changes and so do the measurements.

• General domain shift: p(y, x). The joint distribution of patient features and labels also changes during the pandemic, for example with new virus mutations. A new mutation could lead to a more severe disease progression 47 and to even higher fibrinogen.

• Concept shift: p(y|x). The probability of observing a certain RT-PCR test result given a patient characterized by their measurements, such as blood tests. We model this by p(y = 1|x) ≈ g(x; w), with the model g and the model parameters w. The RT-PCR test result y changes even if the patient features x are the same, which can occur with changing test technologies, changing test procedures, changing thresholds, and so on.

Neglecting and insufficiently countering the above-mentioned domain shifts can lead to undesired consequences and failures of the models:

Unreliable performance estimates. Performance estimates without consideration of domain shifts might be overoptimistic, and the actual performance of the model can deviate significantly from the estimate 42 (see Figure 1 and Section 2.5).

Degradation of predictive performance over time. Standard ML approaches are unable to cope with domain shifts over time and during the progression of a pandemic, which can result in a decrease of predictive performance 44, 48, 49 .

Figure 1: Domain shifts in COVID-19 data sets. a, COVID-19 numbers in Austria over time, illustrating factors causing a temporal domain shift. The numbers are sketched according to data from the Austrian BMSGPK (https://www.data.gv.at/COVID-19/). b, The actual model performance is calculated for each month from June to December 2020 and the estimated model performance is calculated on the respective previous month. c, Estimated and actual performance with 95 % confidence intervals. The estimated and actual ROC AUC are significantly different in December, and the PR AUC differs significantly in November and December, showing the effect of the domain shifts. Note that the PR AUC is sensitive to changes of prevalence.

In light of the domain shifts, we suggest lifelong learning and assessment [50] [51] [52] , thereby maximizing the clinical utility of the models.
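The prevalence sensitivity of the PR AUC noted in the Figure 1 caption can be illustrated with a small, self-contained simulation of a prior shift p(y). The sketch below is purely illustrative and uses synthetic data rather than the study's data or code: the quality of the risk score is held fixed while only the prevalence changes, so the ROC AUC stays roughly constant whereas the PR AUC moves with the prevalence.

```python
# A prior shift p(y): the prevalence of positives changes between two evaluation
# periods while the quality of the risk score stays the same. The ROC AUC is
# largely unaffected, but the PR AUC baseline moves with the prevalence.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def sample_period(n, prevalence):
    """Draw labels at the given prevalence and a noisy risk score of fixed quality."""
    y = (rng.random(n) < prevalence).astype(int)
    score = y + rng.normal(scale=1.0, size=n)   # same signal-to-noise in both periods
    return y, score

for prevalence in (0.01, 0.10):                 # e.g., an early vs. a peak pandemic month
    y, score = sample_period(50_000, prevalence)
    print(f"prevalence={prevalence:.2f}  "
          f"ROC AUC={roc_auc_score(y, score):.3f}  "
          f"PR AUC={average_precision_score(y, score):.3f}")
```

Comparing the two printed lines shows why a PR AUC estimated in a low-prevalence month cannot be carried over unchanged to a high-prevalence month.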
Concretely, to counter these domain shifts, we propose a) frequent temporal validation to identify domain shifts and b) re-training the models with higher weights for recently acquired samples. To this end, a continuous stream of COVID-19 samples is required, which can be achieved by routinely testing a subset of samples with an RT-PCR test. We evaluate and compare our proposed approach of lifelong learning and assessment against standard ML approaches on a large-scale data set. This data set comprises 127,115 samples after pre-processing and merging, which exceeds the data set size of many small-scale studies [18] [19] [20] [21] [22] 32 by far. Our data set comprises pre-pandemic negative samples as well as pandemic negative and positive samples spanning multiple different departments of the Kepler University Hospital, Linz. As opposed to studies that require additional expensive features 19, 22, 23 , our models use no features other than blood tests, age, gender and hospital admission type. This way, the blood tests can be automatically scanned for COVID-19 in a cost-effective way without any additional time effort for the hospital staff. We additionally report the predictive ability for the mortality risk of the COVID-19 positive samples on the basis of the blood tests only, again with no additional expensive features [33] [34] [35] 38, 40, 41, 53, 54 . Compared to previous studies 33, 36, 37, 39 , our mortality models are trained on a large number of COVID-19 positive patients. We again take domain shifts and other potential biases into account for mortality prediction.

Our data set comprises 125,542 negative and 1,573 positive samples for training and evaluation of the ML models for COVID-19 diagnosis. Of the negatives, 116,067 were acquired before the pandemic and 9,475 during the pandemic. The RT-PCR test sample was collected after the blood test, within a window of 48 hours between the two tests. In the COVID-19 diagnosis data set, 919 cases survived and 118 cases died with COVID-19. For the mortality prediction, the features and samples are selected on the basis of the COVID-19 positive patients, rather than on the 2019 cohort as for the COVID-19 diagnosis data set. The data sets are imbalanced in both tasks, the COVID-19 diagnosis and the mortality prediction. The pre-selection of the samples and the merging are described in more detail in Section 4 and in Figure 2. In the 2019 cohort and in the 2020 cohort, women and men occur about equally often (2019 cohort: 48 % men, 2020 cohort: 48 % men). However, in the positives cohort, there are more men (56 % men). The death rate among COVID-19 positive patients aged 80 years or older is 20 %; in patients younger than 80 years, it is 8 %. In our data set, men died more than twice as often as women (68 % men). In the age group below 80 years, men died even three times as often as women with COVID-19 (75 % men).

We show the capability of the ML models to classify COVID-19 and to predict the mortality risk. We compare the performance of self-normalizing neural networks (SNN) 55 , K-nearest neighbors (KNN), logistic regression (LR), support vector machines (SVM), random forests (RF) and extreme gradient boosting (XGB). XGB and RF outperform the other model classes in the COVID-19 diagnosis and also in the mortality prediction. The domain shifts are exposed when comparing the evaluations on different cohorts. The hyperparameters are selected on a validation set or via nested cross-validation to avoid a hyperparameter selection bias.
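As a rough illustration of such a model-class comparison on tabular data, the following hedged sketch fits several of the mentioned classifiers on synthetic, imbalanced data and ranks them by validation ROC AUC. It is not the study's pipeline: the data, the hyperparameters and the omission of the SNN are simplifications made for brevity.

```python
# Hedged sketch of a model-class comparison on tabular, imbalanced data.
# Synthetic data and placeholder hyperparameters are used; not the study's pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Synthetic stand-in for pre-processed blood-test features (about 5 % positives).
X, y = make_classification(n_samples=5_000, n_features=30, weights=[0.95], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=25),
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(n_estimators=500, n_jobs=-1),
    "XGB": XGBClassifier(n_estimators=300, eval_metric="logloss"),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores = model.predict_proba(X_val)[:, 1]   # probability of the positive class
    print(f"{name}: validation ROC AUC = {roc_auc_score(y_val, val_scores):.3f}")
```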
Performance is estimated either via standard cross-validation or via temporal cross-validation (for details see Section 4.3).

In this experiment, we investigate the effects of a standard ML approach, in which a model is trained on data collected in a particular time period, then assessed on a hold-out set and then deployed. Concretely, we train an XGB model on data from July 2019 until October 2020, and assess the model performance on data from November 2020. We then simulate that the model is deployed and used in December 2020. Without domain shifts, the predictive performance would remain similar, but in the presence of domain shifts, the performance would be significantly different. Thus, domain shifts are exposed by comparing the actual performance with the estimated performance determined on the respective previous month, see Figure 1 b. The area under the receiver operating characteristic curve (ROC AUC) estimate is higher than the actual performance in most months (Figure 1 c). The ROC AUC performance estimate for December was significantly lower than the actual performance in December. The estimated and actual area under the precision-recall curve (PR AUC) differ significantly in November and December. These results show that there is a domain shift and thus a necessity for up-to-date assessments; otherwise, the performance estimate is not trustworthy.

In this section, we set up five modeling experiments with two prediction tasks and different assessment strategies:

(i) COVID-19 diagnosis prediction assessed by random cross-validation with pre-pandemic negatives,
(ii) COVID-19 diagnosis prediction assessed by random cross-validation with recent negatives,
(iii) COVID-19 diagnosis prediction assessed by temporal cross-validation,
(iv) mortality prediction assessed by random cross-validation,
(v) mortality prediction assessed by temporal cross-validation.

We then compare the performance estimates obtained by the different assessment strategies. If the performance estimates by random cross-validation and temporal cross-validation are similar, then the underlying distribution of the data is likely to be similar over time. If the performance estimates of (ii) are different from those of (i), then former and current negatives follow different distributions. If performance estimates from (iii) are lower than those of (i) and (ii), the distribution of the data changes over time, hence indicating the presence of domain shifts. Likewise, changing performance estimates from (iv) to (v) indicate a domain shift over time. The results in terms of threshold-independent performance metrics for the comparison of the models are shown in Table 1 a and b and in Figure 3. More information about the discriminating capability of individual features is shown in Figure 5 and in Table 5.

(i) COVID-19 diagnosis prediction & random cross-validation with pre-pandemic negatives. In this experiment, all cases from 2019 and 2020 are randomly shuffled, see Section 4.3 for more details. In experiment (i) the highest performance is achieved; however, domain shifts are not considered in the performance estimate.
The model with the highest ROC AUC of 0.97±0.00 and PR AUC of 0.52±0.01 is the RF. Note that the baseline of a random estimator (RE) is at 0.50±0.00 for ROC AUC and 0.01±0.00 for PR AUC, the latter due to the high imbalance of positive and negative samples. For in-hospital application, a threshold is required to assign the probability estimates of the models to the positive or negative class. This threshold is a trade-off between identifying all positive cases and a low number of false positives. Therefore, we report the threshold-dependent metrics for multiple thresholds, which are determined by defining negative predictive values on the validation set. The results with these determined thresholds are shown in Table 1 c for the RF.

(ii) COVID-19 diagnosis prediction & random cross-validation with recent negatives. The test set of experiment (ii) only comprises cases that have been tested for COVID-19 with an RT-PCR test. The 2020 cohort comprises patients who are suspected of having COVID-19; some might even have characteristic symptoms. Therefore, a classification of the samples in the 2020 cohort is more difficult and potential biases between the 2019 and 2020 cohorts cannot be exploited. XGB outperforms the other models with a ROC AUC of 0.92±0.00 and a PR AUC of 0.62±0.00.

(iii) COVID-19 diagnosis prediction & temporal cross-validation. In experiment (iii), the model is trained with samples until October and evaluated on samples from November and December. XGB achieves the highest ROC AUC of 0.81±0.00 and a PR AUC of 0.71±0.00. We face an additional performance drop in comparison to experiments (i) and (ii), which points to a domain shift over time. Among other causes, this domain shift over time occurs due to potential changes in the lab infrastructure, the testing strategy, the prevalence of COVID-19 in different patient groups, or maybe even due to mutations of the virus, see Figure 1 and Section 1 for more details. These results again emphasize the necessity of countering the domain shifts with lifelong learning and assessment.

(iv) Mortality prediction & random cross-validation. We predict the mortality risk of COVID-19 positive patients, who only occur in the 2020 cohort. The samples are randomly shuffled and a five-fold nested cross-validation is performed. RF outperforms the other models for the mortality prediction with a ROC AUC of 0.88±0.02 and a PR AUC of 0.63±0.11. We report the threshold-dependent metrics in Table 1 d, although the prediction scores of survival or death provided by our models are more informative for clinicians in practice than a hard separation into the two classes by a threshold.

(v) Mortality prediction & temporal cross-validation. In experiment (v), the model is trained with samples until October and evaluated on samples from November and December for the mortality prediction of COVID-19 positive patients. Again, RF outperforms the other models with a ROC AUC of 0.86±0.01 and a PR AUC of 0.56±0.01. The performance drops from experiment (iv) to (v), revealing a domain shift over time for the mortality prediction.

We propose re-training and re-assessment at high frequency to tackle the domain shifts by exploiting newly acquired samples, thereby achieving high performance and model credibility in the real-world hospital setting.
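A hedged sketch of how such a lifelong-learning loop could look is given below: the model is re-trained before every evaluation month on all previously seen samples, with recent samples weighted higher. The data layout, the feature set and the synthetic data are assumptions made for illustration only; the weights 0.01 for pre-pandemic samples and [1, 1, 2, 3] for the most recent months correspond to the best weighting reported later, but the exact implementation details of the published pipeline may differ.

```python
# Hedged sketch of the lifelong-learning loop: before each evaluation month the model
# is re-trained on all previously seen samples, with recent samples weighted higher.
# Data layout and synthetic data are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Synthetic stand-in data: 24 consecutive months with 1,000 samples each.
X, y = make_classification(n_samples=24_000, n_features=20, weights=[0.9], random_state=0)
feature_cols = [f"f{i}" for i in range(20)]
df = pd.DataFrame(X, columns=feature_cols)
df["label"] = y
df["month_idx"] = np.repeat(np.arange(24), 1000)   # 0-11 -> 2019, 12-23 -> 2020
df["is_2019"] = df["month_idx"] < 12

def sample_weights(month_idx, is_2019, eval_month):
    """0.01 for 2019 samples, 3 for the most recent month, 2 for the one before, else 1."""
    lag = eval_month - month_idx
    w = np.where(lag == 1, 3.0, np.where(lag == 2, 2.0, 1.0))
    w = np.where(is_2019, 0.01, w)
    return w * len(w) / w.sum()                    # keep only the relative weighting

for eval_month in (22, 23):                        # e.g., November and December 2020
    train = df[df["month_idx"] < eval_month]
    test = df[df["month_idx"] == eval_month]
    w = sample_weights(train["month_idx"].values, train["is_2019"].values, eval_month)
    model = XGBClassifier(n_estimators=300, eval_metric="logloss")
    model.fit(train[feature_cols], train["label"], sample_weight=w)
    auc = roc_auc_score(test["label"], model.predict_proba(test[feature_cols])[:, 1])
    print(f"evaluation month {eval_month}: test ROC AUC = {auc:.3f}")
```

The same loop can be run with longer gaps between re-trainings to study different re-training frequencies, as investigated in Figure 4 b.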
To enable such frequent re-training and re-assessment, we suggest continuously determining the COVID-19 status of some patients with an RT-PCR test to acquire fresh samples, which is indispensable to prevent the model behavior from drifting into unexpected and poor performance. These measures are essential to enable trustworthy ML models of clinical utility. The effect of the re-training frequency of the model is shown in Figure 4 b. The performance of the ML models increases with the re-training frequency, as more frequent re-training reduces the domain shift between the training and the test samples. The evaluation procedure is shown in Figure 4 a and described in Section 4.4. To counter the domain shifts, we additionally propose to weight recent samples more strongly during training of the COVID-19 diagnosis model, see Figure 4 c. On the validation set (May to October), we determine the best weighting as a function of sample recency. The highest performance gain on the validation set is achieved by setting the weight of the 2019 cohort samples to 0.01, the weight of the samples of the most recent month to 3, and the weight of the second most recent month to 2 ([1, 1, 2, 3]). Compared to weighting all samples equally, this increases the ROC AUC on the validation set from 0.8118 (95 % CI: 0.7849-0.8386) to 0.8502 (95 % CI: 0.8271-0.8734) (p-value = 9e-6), which is statistically significant. The selected weighting is tested on November and December, leading to a statistically significant increase of the ROC AUC from 0.7996 (95 % CI: 0.7831-0.8162) to 0.8120 (95 % CI: 0.796-0.828) (p-value = 0.0045). The method to determine the weighting is described in more detail in Section 4.4.

For clinical insight, the violin plots in Figure 5 show the discriminating capability of the selected features for the three different cohorts (2019 and 2020 cohort, 2020 cohort, COVID-19 positive cohort). The plotted features are selected based on their ROC AUC in the five experiments; the top-10 features as predictors for all five tasks are listed in Table 5 in the supplementary material.

Figure 3: Comparison of model classes for COVID-19 diagnosis and mortality prediction. a, ROC and b, PR curves for the test set of the COVID-19 diagnosis prediction in experiment (i). c, ROC and d, PR curves for the mortality prediction in experiment (iv). a-d, Curves plotted for the different model classes at one random seed. RF and XGB outperform the other model classes as well as the random estimator (RE) baseline and the best feature (BF) used as an estimator.

Table 1: Performance metrics. a, Experiments (i)-(iii) are the results of the COVID-19 diagnosis prediction. In experiment (i) the test set is randomly selected from the shuffled 2019 and 2020 cohorts. In experiment (ii) the test set is a random subset of the 2020 cohort, and experiment (iii) shows the results of a prospective evaluation on November and December 2020. b, Threshold-independent metrics for the mortality prediction with random shuffling of the positives set (experiment (iv)) and with prospective evaluation on November and December (experiment (v)). The ML models are trained, validated and tested with five random seeds. The mean and the standard deviation (±) of the ROC AUC and PR AUC are listed. c and d, Performance metrics on the test set of the RF for different thresholds, selected on the basis of the negative predictive value on the validation set (NPV val), for c, the COVID-19 diagnosis prediction in experiment (i) and d, the mortality prediction in experiment (iv).

Through multiple experiments, we expose domain shifts and their detrimental effect on ML models for COVID-19 diagnosis. We suggest carefully assessing the model performance frequently to avoid unexpected behavior with potentially adverse consequences, such as an even greater spread of the disease due to trusting a wrongly classifying model. The model should be re-trained after particular time periods to exploit newly acquired samples and, thus, to countermand the domain shift effect. To this end, we propose to assign a higher weight to recent samples, which, as we show, increases the predictive performance. In this large-scale study, we train and evaluate our models with more samples than most studies [18] [19] [20] [21] [22] . Besides our large number of tested subjects, we also exploit pre-pandemic negative samples, which vastly increases our data set size. In comparison to Soltan et al. 23 and Plante et al. 24 , we use the pre-pandemic as well as the pandemic negatives in our data set. We achieve high predictive performance with our models, comparable to previous studies 18, 19, 21, 25, 35 , although the results cannot be directly compared as our assessment procedure is more rigorous. Different assessment procedures within our study also yield highly variable performance estimates. Some studies suggested logistic regression models for COVID-19 and mortality prediction 39, 53 ; however, most identified (X)GB or RF as the best model classes 18, 20, 25, 31, 38 . We confirm these findings and suggest using XGB or RF for COVID-19 diagnosis and RF for mortality prediction, as these exhibit the highest performance in our experiments. Our models only require a small set of features of a patient, concretely a minimum of twelve blood test parameters plus age, gender and hospital admission type; in total at least 15 features. In case many blood test parameters are available, the model exploits up to 100 pre-selected features. Missing features are imputed, thereby allowing model application also on samples with a small number of features. This enables automatically scanning the blood tests without additional effort by the hospital staff, as opposed to published models that require more expensive features, e.g., vital signs, which might not be as easily available 19, 22, 23 . One limitation of our work could be that we did not evaluate the generalization of our model to other hospitals. A transfer of a COVID-19 diagnostic model should only be done with thorough re-assessments, as a domain shift between hospitals might be present. Among other causes, such domain shifts from one institution to another could result from different testing strategies, laboratory equipment or demographics of the population in the hospital catchment area.
Re-training of models, rather than transferring them to another hospital, should be considered to obtain a well-performing and trustworthy model. However, this is not part of our investigation. Our findings and suggestions about domain shifts should be accounted for in all hospitals when applying a COVID-19 model. We evaluate our models on different cohorts to show the high performance as well as to reveal the domain shifts. However, the 2020 cohort only contains subjects that were tested for COVID-19 and for whom a blood test was taken. Hence, the 2020 cohort is only a subset of the total patient cohort on which the model will be applied. To counteract missing samples from a particular group, we also use the pre-pandemic negatives, which should cover a wide variety of negatives due to the large data set. An evaluation on all blood tests of 2020 is simply not possible due to the lack of RT-PCR tests, which serve as labels in our ML approach. Non-tested subjects of 2020 cannot be assumed to be negative; therefore, we discard them. This could only be circumvented by explicitly testing a large number of patients for this study who would not be tested otherwise. For lifelong learning and assessment, testing a subset of the patients with an RT-PCR test is still necessary to identify and counter the temporal domain shift. However, this does not diminish the benefit of the model: by automatically scanning all blood tests, a large number of patients can be checked for COVID-19, which would not be feasible with expensive and slow RT-PCR tests. The benefit of the model also transfers to hospitals or areas with limited testing capacity. Rather than replacing RT-PCR tests, the model can be applied as a complement to or replacement for antigen tests. The model can be re-trained with the already implemented pipeline. The computational effort is relatively low, as the model only requires tabular data and no time series (sound recordings) or images (CT, X-ray) [11] [12] [13] [14] [15] [16] [17] . Other studies do not consider the domain shifts and the associated necessity for re-training, although it is indispensable for clinical utility. Lifelong learning and assessment does not only provide a performance gain for diagnostic models in pandemics like COVID-19, but also for other medical tasks and, in general, other applications of ML where we face a continuous stream of data. We demonstrate the high capability of ML models in detecting COVID-19 infections and the COVID-19-associated mortality risk on the basis of blood tests on a large-scale data set. With our findings concerning domain shifts and lifelong learning and assessment, we want to advance ML models to be accurate and trustworthy in real-world hospital settings. Lifelong learning and assessment is an important tool to allow the transition of research results to actual application in hospitals. By advancing this field of research, we want to increase patient safety, protect clinical staff, and contribute to ending the pandemic.
Ethics approval for this study was obtained from the ethics committee of the Johannes Kepler University, Linz (approval number: 1104/2020). The study is conducted on the blood tests (including age, gender and hospital admission type) from July 2019 until December 2020 and the COVID-19 RT-PCR tests from 2020 of the Kepler University Hospital, Med Campus III, Linz, Austria. In our study, we analyze anonymized data only. We predict the result of the RT-PCR COVID-19 test on the basis of routinely acquired blood tests. A block diagram of the data set is sketched in Figure 2. We only use blood tests taken before the RT-PCR test to avoid bias caused by a known test result. We limit the time between the blood test and the COVID-19 RT-PCR test to 48 hours to ensure that the blood test matches the determined COVID-19 status. Additionally, we incorporate pre-pandemic blood tests from the year 2019 as negatives into our data set to cover a wide variety of COVID-19 negative blood tests. For the data from the year 2020, we aggregate the blood tests of the last 48 hours before the test. If parameters are measured more than once, we take the most recent value, see Figure 2 b. In case no COVID-19 test follows the blood test within 48 hours, the blood test samples are discarded. Additionally, we discard all samples with a deviating RT-PCR test result within the next 48 hours, as the label might be incorrect. The data from 2019 do not contain COVID-19 tests; therefore, blood tests with a temporal distance of less than 48 hours are aggregated. The features age, gender and admission type (inpatient or outpatient) are added to the samples. For the prediction of the COVID-19 diagnosis, we select the 100 most frequent features in the 2019 cohort as the feature set. For the mortality task, the 100 most frequent features are selected based on the positives cohort, as the model is only applied to COVID-19 positive samples. Each sample requires a minimum of 15 features (any twelve blood test features plus age, gender and hospital admission type). All other features and samples are discarded. The fact that the samples only require a minimum of 15 features can lead to many missing entries, as the feature vector has a length of 100. For each sample, we create 100 additional binary entries, which indicate whether each of the features is missing or measured. The missing values are filled by median imputation. Hence, the models can be applied to blood tests with few measured values.

Given the presence of domain shifts, we define five experimental designs to estimate the performance. The experiments differ in the data split into training, validation and test sets. These splits are conducted on patient level, such that one patient only occurs in one of the sets. In the first three experiments, we train models for COVID-19 diagnosis prediction. We train and evaluate the COVID-19 diagnosis models with five random seeds on a fixed data split. (i) In our first experiment, we randomly shuffle all patients and split regardless of the patient cohorts (60 % training, 20 % validation, 20 % testing). (ii) The training and validation sets include the 2019 cohort and 80 % (60 % training, 20 % validation) of the 2020 cohort. The test set comprises the remaining samples (20 %) of the 2020 cohort.
Therefore, the performance is estimated on patients who actually were tested for COVID-19. (iii) The training and validation sets include the 2019 cohort and the 2020 cohort before November (80 % training, 20 % validation). We conduct a prospective performance estimate on the test set with all samples from November and December 2020. In experiments (iv) and (v), we train the models to predict the mortality risk of COVID-19 positive patients. (iv) The training (60 %), validation (20 %) and test (20 %) sets comprise the positive cases from the 2020 cohort. Due to the limited number of samples, we estimate the performance with five-fold nested cross-validation. (v) The training and validation sets include the positive cases from 2020 before November (80 % training, 20 % validation). The test set comprises the cases from November and December. The test set is fixed, but again, we train and evaluate the models with five random seeds. Z-score normalization is applied to the entire data set, with the mean and standard deviation calculated from the respective training set.

We compare multiple different models suitable for tabular data. The pre-processing, training and evaluation are implemented in Python 3.8.3. In particular, the model classes RF, KNN and SVM are trained with the scikit-learn package 0.22.1. XGB is trained with the XGBClassifier from the Python package XGBoost 1.3.1. The SNN and LR are trained with PyTorch 1.5.0. The models are selected and evaluated based on the ROC AUC 56 , which is a measure of the model's discriminating power between the two classes. Further, we report the PR AUC 56 and we calculate threshold-dependent metrics, for which the predictions are separated into positives and negatives instead of probability estimates. These metrics are negative predictive value (NPV), positive predictive value (PPV), balanced accuracy (BACC), accuracy (ACC), sensitivity, specificity and the F1-score (F1) 57 . We additionally report the thresholds, which are determined on the validation set to achieve the intended NPV. We perform a grid search over the hyperparameters of the models, see Table 2 in the supplementary material. The best hyperparameters are selected based on the ROC AUC on the validation set. In the COVID-19 diagnosis prediction tasks (experiments (i)-(iii)), we use one fixed validation fold due to the high number of samples. The models are trained and evaluated with five random seeds. For the mortality prediction tasks (experiments (iv) and (v)), the mean ROC AUC over five validation folds is calculated to select the hyperparameters. Further, the selected models are evaluated on the test set to estimate the performance. Experiment (iv) is evaluated with five-fold nested cross-validation and all other experiments use a fixed test set. The mean and standard deviation of the models, which are trained, validated and tested with five random seeds, are reported.

We conduct three experiments to show the necessity of lifelong learning and assessment for trustworthy and accurate models. The first experiment investigates the deviation of the estimated from the actual performance. To this end, we test the models on the months June until December. The performance estimate is calculated on the respective preceding month (May until November), see Figure 1 b. The 95 % confidence intervals are determined via bootstrapping by sampling 1,000 times with replacement. The deviations of estimated and actual performance are checked for significance.
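A minimal sketch of such a bootstrapped 95 % confidence interval for the ROC AUC is shown below (1,000 resamples with replacement). The resampling details and the toy predictions are assumptions made for illustration; the study's exact implementation is not reproduced here.

```python
# Minimal sketch of a bootstrapped 95 % confidence interval for the ROC AUC
# (1,000 resamples with replacement). Seed and toy data are assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_roc_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:                    # need both classes for AUC
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Toy example with random predictions (expected ROC AUC around 0.5):
y_true = np.array([0, 0, 1, 1, 0, 1] * 50)
y_score = np.random.default_rng(1).random(len(y_true))
auc, (lo, hi) = bootstrap_roc_auc_ci(y_true, y_score)
print(f"ROC AUC = {auc:.3f} (95 % CI: {lo:.3f}-{hi:.3f})")
```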
For this first experiment, XGB is trained with the hyperparameters selected in experiment (iii). Further, we check the effect of the model training frequency on the performance. We evaluate the trained model on different numbers of subsequent months without re-training. We also refer to this number of subsequent months as the model training frequency. A model training frequency of two months is sketched in Figure 4 a. We evaluate the different model training frequencies with an increment of one month, concatenate the predictions as well as the targets, and calculate the ROC AUC and its 95 % confidence interval with bootstrapping (1,000 resamples with replacement). We do not report the PR AUC, as the prevalence is not comparable across the test sets of the different model training frequencies.

In our third experiment for lifelong learning and assessment, we investigate the effect of higher weights for recent samples during training, as shown in Figure 4 c. Therefore, we define May until October as our validation months to select the optimal weighting and we evaluate the selection on November and December. We train the models with all available data before the respective validation month, with the best hyperparameters determined in experiment (iii). The predictions and targets are concatenated over all validation months. With a one-sided, paired DeLong test 58 , we test our hypothesis that the ROC AUC increases when recent samples are weighted higher than older samples, in comparison to the ROC AUC when all samples are weighted equally. We pass the concatenated prediction and target vectors to the DeLong test, which returns the p-value and the ROC AUC, calculated with the pROC package 1.17.0.1 in R. We identify the best weighting by combining all listed options of weights for the 2019 cohort and for the most recent months on the validation set. The default weight of the samples is 1. We restrict the 2019 cohort weights to the set {1, 0.1, 0.01, 0.001}, and the weights of the previous months to {[1, 1, 1, 1], [1, 1, 1, 2], [1, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]}, with the last entry in each bracket being the weight of the most recent month, the second-to-last entry that of the second most recent month, and so forth. Afterwards, we normalize the weights to the number of training samples, so that only the relative weighting is changed. As determined by the hyperparameter search, we also pass the scaling factor scale_pos_weight to the model to balance positive and negative samples. The best weighting parameters are selected on the validation set and tested on November and December.

Besides the ML models, we additionally report statistical evaluations to allow clinical insight: we calculate the ROC AUCs of individual features analogously to the five experiments in Section 4.3. For these evaluations, the features themselves are considered as predictors. This way, we can identify features with discriminating capability and compare these with the ML models. The ROC AUC is equivalent to the concordance statistic (c-statistic) for binary outcomes 59 . Note that we do not train a model for this purpose; we simply use the feature value, with a positive or negative sign, as a predictor on the test set.
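The following hedged sketch shows one way to score an individual feature as a predictor: the ROC AUC of the raw feature value is compared with that of its negation, so the better direction is kept and reported as a sign, analogous to Table 5. Dropping missing values and the toy marker data are simplifying assumptions, not the study's exact procedure.

```python
# Hedged sketch of using an individual feature as a predictor: the ROC AUC of the raw
# feature value and of its negation are compared, and the better direction is kept,
# reported as a sign as in Table 5. Dropping missing values is a simplification.
import numpy as np
from sklearn.metrics import roc_auc_score

def single_feature_auc(feature_values, labels):
    feature_values = np.asarray(feature_values, dtype=float)
    labels = np.asarray(labels)
    mask = ~np.isnan(feature_values)               # ignore non-measured values
    auc_pos = roc_auc_score(labels[mask], feature_values[mask])
    return (auc_pos, "+") if auc_pos >= 1 - auc_pos else (1 - auc_pos, "-")

# Toy example: a marker that tends to be higher in positives (a ferritin-like pattern).
labels = np.array([0] * 80 + [1] * 20)
values = np.concatenate([np.random.default_rng(0).normal(1.0, 1.0, 80),
                         np.random.default_rng(1).normal(2.0, 1.0, 20)])
print(single_feature_auc(values, labels))          # e.g. (0.7..., '+')
```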
In this way, we identify important features for the COVID-19 diagnosis and the mortality task (Table 5 in the supplementary material). Additionally, we visualize the most important features selected from the evaluation described above. In Figure 5, the features are plotted on the full data set (2019 and 2020 cohort) and on the 2020 cohort for the COVID-19 diagnosis, as well as on the 2020 cohort for the mortality prediction. The violin plots only contain measured features; imputed feature values are not displayed, for better visual clarity.

The data set is not available for public use due to data privacy reasons. Code is provided at https://github.com/ml-jku/covid. The authors declare no competing interests.

Table 3: Comparison of estimated and actual performance. These metrics are calculated on the basis of the COVID-19 diagnosis prediction task with XGB. For significantly different deviations, the confidence intervals (CI) are colored in red. a, The actual performance is calculated on the listed month and the estimate is determined on the respective previous month. b, The estimate is determined by random samples from the 2020 cohort. c, The estimate is determined by random samples from the 2019 and 2020 cohorts.

Figure 6: Comparison of ML models for different experiments. ROC and PR curves for the test sets of a and b, experiment (ii); c and d, experiment (iii); e and f, experiment (v). The curves are plotted for the different models (one random seed) and compared with the best feature (BF) and the random estimator (RE).

Figure 7: Performance of models in dependence of the minimum number of days until death. In this figure, we answer how early before death the risk of dying can be predicted. Samples for which death occurred within the given minimum number of days are excluded from the test set (but not from the training and validation sets). a, The mean ROC AUC and b, PR AUC values of five test folds are plotted. c, The mean ROC AUC and d, PR AUC values of five random seeds in the prospective evaluation are shown. For visual clarity, the standard deviations (error bars) are only plotted for the RF. The mortality risk can be estimated early before death, as the discriminating capability of the models remains high with an increasing minimum number of days until death. The mean PR AUC in b and d decreases with an increasing minimum number of days until death, as does the random estimator baseline, due to the decreasing ratio of deceased to survivors.
Figure 8: c, Estimated and actual ROC AUC are significantly different in all months but August, due to heavy domain shifts, and the PR AUC differs in October and November. The mean deviation of estimated and actual ROC AUC and PR AUC is higher in c compared to a and b. d, Three options to determine the performance estimate. In a, the performance estimate is calculated from the preceding month. In b, the samples for the estimate are randomly selected from the 2020 cohort (20 %), and in c they are randomly sampled from the 2019 and the 2020 cohort (20 % of the 2020 cohort and an equal proportion from the 2019 cohort).

Figure 9: Deviation of estimated from actual performance for mortality risk with two options to determine the performance estimate. a, The estimate is determined on the respective previous month. Note that the confidence interval at an early stage of the pandemic is large due to a low number of samples. b, The estimate is determined on a randomly sampled 20 % of the COVID-19 positives that occurred before the month on which the actual performance is calculated. There is a significant difference in October for ROC AUC and PR AUC, which means that the performance estimate is overoptimistic in October.

Table 4: Comparison of estimated and actual performance for mortality risk prediction. These metrics are calculated from the predictions of an RF trained with the hyperparameters as determined in experiment (v). For significantly different deviations, the confidence intervals (CI) are colored in red. a, The actual performance is calculated on the listed month and the estimate is determined on the respective previous month. b, The estimate is determined by random samples from the positives cohort occurring before the month on which the actual performance is calculated.

Table 5: Features with discriminating capability. Top-10 features as predictors for the COVID-19 diagnosis (experiments (i)-(iii)) and the mortality prediction (experiments (iv) and (v)). The sign in brackets indicates whether the target is connected with the positive (+) or the negative (−) sign of the feature value, i.e., patients with high ferritin and low calcium have a higher probability for the class COVID-19 positive. The standard deviation (±) is listed for experiment (iv); for the other experiments the test set is fixed. Abbreviations: absolute eosinophil count (AEC), immature granulocytes (IG), absolute basophil count (ABC), lactate dehydrogenase (LDH), C-reactive protein (CRP), absolute lymphocyte count (ALC), red cell distribution width (RDW).
References:
- Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
- A Novel Coronavirus from Patients with Pneumonia in China
- Rethinking Covid-19 Test Sensitivity - A Strategy for Containment
- Frequency of Routine Testing for Coronavirus Disease 2019 (COVID-19) in High-risk Healthcare Environments to Reduce Outbreaks
- Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance. medRxiv
- Evaluation of rapid antigen test for detection of SARS-CoV-2 virus
- Machine and Deep Learning towards COVID-19 Diagnosis and Treatment: Survey, Challenges, and Future Directions
- Artificial Intelligence for COVID-19 Drug Discovery and Vaccine Development. Front.
- COVID-19 Coronavirus Vaccine Design Using Reverse Vaccinology and Machine Learning
- Large-scale ligand-based virtual screening for SARS-CoV-2 inhibitors using deep neural networks
- Review on Diagnosis of COVID-19 from Chest CT Images Using Artificial Intelligence
- Using Artificial Intelligence for COVID-19 Chest X-ray Diagnosis
- Automated COVID-19 diagnosis from X-ray images using convolutional neural network and ensemble of machine learning classifiers
- Classification of COVID-19 chest X-rays with deep learning: new models or fine tuning?
- Robust Detection of COVID-19 in Cough Sounds
- COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS
- Artificial Intelligence Diagnosis Using Only Cough Recordings
- Detection of COVID-19 Infection from Routine Blood Exams with Machine Learning: A Feasibility Study
- Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests
- Machine Learning Prediction of SARS-CoV-2 Polymerase Chain Reaction Results with Routine Blood Tests
- A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity
- Development of machine learning models to predict RT-PCR results for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in patients with influenza-like symptoms using only basic clinical data
- Rapid triage for COVID-19 using routine clinical data for patients attending hospital: development and prospective validation of an artificial intelligence screening test
- Development and External Validation of a Machine Learning Tool to Rule Out COVID-19 Among Adults in the Emergency Department Using Routine Blood Tests: A Large, Multicenter, Real-World Study
- Routine Laboratory Blood Tests Predict SARS-CoV-2 Infection Using Machine Learning
- Exploring the Relation between Blood Tests and Covid-19 Using Machine Learning
- Ensemble learning model for diagnosing COVID-19 from routine blood tests
- Complete blood count might help to identify subjects with high probability of testing positive to SARS-CoV-2
- IA: an intelligent system to support diagnosis of Covid-19 based on blood tests
- Use of Machine Learning and Artificial Intelligence to predict SARS-CoV-2 infection from Full Blood Counts in a population
- Prediction of COVID-19 From Hemogram Results and Age Using Machine Learning. Front. health inform
- Hemogram data as a tool for decision-making in COVID-19 management: applications to resource scarcity scenarios
- CoVA: An Acuity Score for Outpatient Screening that Predicts Coronavirus Disease
- Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation
- Deep learning prediction of likelihood of ICU admission and mortality in COVID-19 patients using clinical variables
- Development of a prognostic model for mortality in COVID-19 infection using machine learning
- An Artificial Intelligence Model to Predict the Mortality of COVID-19 Patients at Hospital Admission Time Using Routine Blood Samples: Development and Validation of an Ensemble Model
- Early risk assessment for COVID-19 patients from emergency department data using machine learning
- The Predictive Effectiveness of Blood Biochemical Indexes for the Severity of COVID-19. Can
- A multipurpose machine learning approach to predict COVID-19 negative prognosis in São Paulo
- Severity Detection for the Coronavirus Disease 2019 (COVID-19) Patients Using a Machine Learning Model Based on the Blood and Urine Tests. Front
- To Annotate or Not? Predicting Performance Drop under Domain Shift
- Fibrinogen as a key regulator of inflammation in disease
- An introduction to domain adaptation and transfer learning. arXiv, 1812
- Cross-Domain Few-Shot Learning by Representation Fusion
- The relationship between elevated fibrinogen and markers of infection: a comparison of seasonal cycles
- Increased mortality in community-tested cases of SARS-CoV-2 lineage B.1.1.7
- WILDS: A Benchmark of in-the-Wild Distribution Shifts
- Lifelong Machine Learning: Second Edition
- Continual lifelong learning with neural networks: A review
- Alaa, A. M. & van der Schaar, M. Lifelong Bayesian Optimization. arXiv
- Development and external validation of a logistic regression derived formula based on repeated routine hematological measurements predicting survival of hospitalized Covid-19 patients. medRxiv
- Machine learning based early warning system enables accurate mortality risk prediction for COVID-19
- Self-normalizing neural networks. NIPS
- The relationship between Precision-Recall and ROC Curves
- A Survey of Predictive Modeling on Imbalanced Domains
- Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach
- Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors

Funding: This project was funded by the Medical Cognitive Computing Center (MC3) and AI-MOTION (LIT), DeepFlood (LIT-2019-8-YOU-213), PRIMAL (FFG-873979), S3AI (FFG-872172).