Prognosis Prediction in Covid-19 Patients from Lab Tests and X-ray Data through Randomized Decision Trees

Alfonso Emilio Gerevini, Roberto Maroldi, Matteo Olivato, Luca Putelli, Ivan Serina

Date: 2020-10-09

This paper is published in the Proceedings of the 5th International Workshop on Knowledge Discovery in Healthcare Data (KDH) at ECAI 2020.

Abstract. AI and Machine Learning can offer powerful tools to help in the fight against Covid-19. In this paper we present a study and a concrete tool based on machine learning to predict the prognosis of hospitalised patients with Covid-19. In particular, we address the task of predicting the risk of death of a patient at different times of the hospitalisation, on the basis of some demographic information, chest X-ray scores and several laboratory findings. Our machine learning models use ensembles of decision trees trained and tested using data from more than 2000 patients. An experimental evaluation of the models shows good performance in solving the addressed task.

1 Introduction

The fight against Covid-19 is a new important challenge for the world that AI and machine learning can help to face at various levels [15, 28, 29]. In March 2020, at the time of the coronavirus emergency in Italy, we started working in close collaboration with one of the hospitals with the most Covid-19 patients in Italy, Spedali Civili di Brescia, to help predict the prognosis of hospitalised patients. Our work was focused on the task of predicting the risk of death of a patient at different times of the hospitalisation. As discussed in [28], predicting whether a patient is at risk of decease or adverse events can help the hospital, for instance, to organize the allocation of limited health resources in a more efficient way.

Our predictive models are built on the basis of demographic information (sex and age), the values of ten laboratory tests, and the chest X-ray score(s), an innovative measure developed and used at Spedali Civili di Brescia to assess the severity of the pulmonary conditions [3]. Other important information, such as the patient comorbidities or the time and duration of the symptoms related to Covid-19, was not used because it was not available to us.

Using raw data from more than 2000 patients, we built some datasets describing the "clinical history" of each patient during the hospitalisation. In particular, each dataset contains a "snapshot" of the infection conditions of every considered patient at a certain day after the start of the hospitalisation. For each dataset, we built a different predictor, allowing us to make progressive predictions over time that take into account the evolution of the disease severity in a patient, which helps the formulation of a personalized prediction of the prognosis. A change of the predicted risk over time for a patient could also hint at a link between specific events or treatments and the increase or decrease of the risk for the patient. As snapshot times for a patient, in our experiments we considered the 2nd, 4th, 6th, 8th and 10th hospitalisation day, and the day before the end of the hospitalisation. Our datasets were engineered to cope with a number of practical issues, including missing values and feature value categorization, and to add some helpful artificial features.
We also addressed the "concept drift" issue [6, 23], since we observed that the risk of death was clearly sensitive to the time period when the patient was hospitalised; the risk was significantly higher during the earlier period of the emergency (March 2020), when in northern Italy the spread of the virus infection was very high and many people were hospitalised.

Moreover, given the very sensitive nature of our task, we introduced a threshold to discard the model predictions that have a low estimated probability. Such a threshold is a parameter that is automatically calculated and optimised during the training phase.

We considered several machine learning algorithms. A first experimental comparison of their performance on our datasets showed that methods based on forests of trees have more promising performance, and so we decided to focus on this approach. The obtained prediction models have good performance over a randomly chosen test set of 200 patients for each considered period, in terms of both F2 and ROC-AUC scores. In particular, overall the system makes very few errors in predicting patient survival, i.e., the specificity of the prediction is very high.

In the following, after discussing related work, we describe our datasets, we present our prediction models and their experimental evaluation, and finally we give conclusions and mention future work.

2 Related Work

Artificial Intelligence and Machine Learning techniques can be used for tackling the Covid-19 pandemic in different aspects. However, given that the pandemic started only a few months ago, most works are still preliminary, often only pre-printed and not properly peer-reviewed, and without a clear description of the developed techniques and of their results.

A preliminary study is presented in [15]. Given a set of only 53 patients with mild symptoms and their lab tests, comorbidities and treatment, the authors train several machine learning models (Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, KNN) to predict if a patient will be subject to more severe symptoms, obtaining a prediction accuracy score of up to 0.8 using 10-fold cross validation. The generalizability and strength of these results are questionable, given the very small set of considered patients. Another example is the pre-printed work by Li Yan et al. [29], which uses lab tests for predicting the mortality risk; the proposed model is a very simple decision tree based on the three most important features. While the performance seems promising, the test set used for evaluation was very small (29 patients).

Various AI and machine learning techniques have been developed for prognosis and disease progression prediction [7] in the context of diseases different from Covid-19 [20, 21, 22]. In particular, in the last few years, several works about predicting mortality risk or adverse events and on the use of AI in critical care [19] have been published. The survey in [1] presents a review of statistical and ML systems for predicting the mortality risk, the need for beds in intensive care units [30], or the length of the patient hospitalisation. In particular, it is worth mentioning the work by Harutyunyan et al. [11], which uses LSTM Neural Networks for predicting both the mortality risk and the length of the hospitalisation. An overview of the issues and challenges for applying ML in a critical-care context is available in [16].
This work stresses the need to deal with corrupted data, like missing values, imprecision, and errors, which can increase the complexity of prediction tasks. Lab test findings and their variation over time are the main focus of the work by Hyland et al. [14], which describes a system that processes these data to generate an alarm predicting that a patient will have a circulatory failure 2 hours in advance.

3 Data

During the Covid-19 outbreak, from February to April 2020, more than two thousand patients were hospitalised in the hospital Spedali Civili di Brescia. During their hospitalisation, the medical staff performed several exams on them in order to monitor their conditions, check the response to some treatments, verify the need to transfer a patient to the ICU, etc. We had data from a total of 2015 hospitalised patients; for each of these patients, the specific data that were made available to us are:

• the age and sex;
• the values and dates of several lab tests (see Table 1);
• the scores (each one from 0 to 18), assigned by the physicians, assessing the severity of the pulmonary conditions resulting from the X-ray exams [3];
• the values and dates of the throat-swab exams for Covid-19;
• the final outcome of the hospitalisation at the end of the stay, which is the classification value of our application (either in-hospital death, released survivor, or transferred to another hospital or rehabilitation center).

Table 1 specifies the considered lab tests, their normal range of values, and their median values in our set of patients. We had no further information about symptoms, their timing, comorbidities, generic health conditions or clinical treatment. Moreover, we have no CT images or text reports associated with the X-ray exams. The available information about whether a patient was or had been in the ICU was not clear enough to be used. Finally, of course, the names of the patients and of the involved medical staff were not provided.

3.1 Data Quality Issues

When applying machine learning to raw real-world data, there are some non-trivial practical issues to deal with, such as the quality of the available data and related aspects, which in biomedical applications are especially important given the very sensitive domain [12]. In our case, one such issue is that the length of the hospitalisation period can differ considerably from one patient to another (from a few days to two months), due to different reasons including the novelty and the characteristics of the disease, its high contagiousness, or the absence of an effective treatment. Therefore, the number of performed lab tests and relative findings varies significantly among the considered set of patients (from only three to hundreds). Moreover, the lab tests and X-ray exams are not performed at a regular frequency due, e.g., to the different kinds and timing of the relative procedures, the need for different resources (X-ray machines, lab equipment, technical staff, etc.), or the different severity of the health conditions of the patients. For example, in our data we see that a patient can be tested for PCR every day and not be subject to a Ferritin exam for two weeks. This leads to the need to handle the issues of missing values and outdated values. When we consider a snapshot of a patient at a certain day, we have a missing value for a lab test (or X-ray) feature if that test (X-ray) has not been performed.
We have an outdated value for a feature if the corresponding lab test (X-ray) was performed several days earlier: since in the meanwhile the disease has progressed, the findings of the lab test could be inconsistent with the current conditions of the patient, and so they could mislead the prediction. Data quality issues arise especially for patients hospitalised in the period of the highest emergency, when several hundreds of patients were in the hospital at the same time.

3.2 Concept Drift

An examination of the data available for our cohort of patients revealed that their prognostic risk is influenced by multiple factors, such as the number of patients currently hospitalised and the consequent availability of ICU beds or other resources, the experimentation of new therapies, and the increase of the clinical knowledge. In machine learning, this change of data distribution is known as concept drift [6, 23]. A classical method to deal with this problem is training the algorithm using only a subset of samples, depending on the data distribution that we are considering [6, 24]. For this reason, we divided the considered set of patients into two groups: the High Contagion Phase (HCP) group, composed of the patients admitted during the last weeks of February and the first weeks of March (the most critical period of the pandemic outbreak in Italy), and the Moderate Contagion Phase (MCP) group, composed of the patients admitted from the last ten days of March to the end of April. The main differences between these groups of patients are:

1. the mortality rate of the HCP patients is about twice the mortality rate of the MCP patients;
2. in HCP patients the median value of the hospitalisation period is 8 days, while in MCP patients it is 14 days. Further details are given in Figure 1;
3. for many of the considered lab tests, the mortality rate associated with having values in a particular range significantly changes in the two groups. For example, in HCP patients the mortality rate for the patients who had a PCR value 10 times above the normal range is 40.1%, while in MCP patients it is 21.1%.

Figure 1: Length of stay in hospital (left) and weekly death rate histograms for the High Contagion Phase (in blue) and for the Moderate Contagion Phase (in orange). On the x-axis, for the length of stay we indicate the range of days, and for the death rate we indicate the week when the patient was released. On the y-axis we indicate the percentage of patients.

These differences clearly indicate that the data in the HCP and MCP groups represent different target (concept) functions; therefore, predicting mortality during the high contagion phase and during the moderate phase can be considered as two different tasks. If we had only the patients hospitalised during the high contagion phase, using these data for training an algorithm that predicts the mortality during the moderate phase would lead to many errors. In our case, we generated two different systems, one for each of the two groups of patients. We are currently investigating ways to automatically select the set of patients for training starting from the latest ones, and keeping the less recent ones until we find significant changes in the mortality rate or in the data distribution.
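As a minimal illustration of this cohort separation, the following sketch splits a hypothetical patient table by admission date and compares the mortality rates of the two groups; the column names, the toy data and the exact cut-off date are assumptions for illustration, not the actual schema or boundary used in our system.

```python
import pandas as pd

# Hypothetical patient-level table: one row per patient with the
# admission date and the final outcome (illustrative data only).
patients = pd.DataFrame({
    "admission_date": pd.to_datetime(
        ["2020-02-28", "2020-03-05", "2020-03-30", "2020-04-10"]),
    "outcome": ["dead", "alive", "alive", "alive"],
})

# Assumed cut-off between the High Contagion Phase (late February to
# mid-March 2020) and the Moderate Contagion Phase (last ten days of
# March onwards); the precise boundary date is an assumption.
CUTOFF = pd.Timestamp("2020-03-21")

hcp = patients[patients["admission_date"] < CUTOFF]
mcp = patients[patients["admission_date"] >= CUTOFF]

# Under concept drift the two cohorts represent different target
# functions, so each one gets its own training/test split and model.
for name, cohort in [("HCP", hcp), ("MCP", mcp)]:
    rate = (cohort["outcome"] == "dead").mean()
    print(f"{name}: {len(cohort)} patients, mortality rate {rate:.1%}")
```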
4 Features and Datasets

The main task of our work is to provide survival/death predictions at different days of the patient hospitalisation, according to the current patient conditions reflected by the available lab findings and X-ray scores. In this section we describe the specific extracted features and the (training and testing) datasets that we built for this purpose.

4.1 Feature Extraction

The issues presented in Section 3.1 compel us to perform a robust pre-processing phase with the goal of extracting features that summarize the patients' conditions and can be processed by a machine learning algorithm. The pre-processing is applied to both the HCP and MCP data. Given that we have no information about the survival or the decease of a patient after a transfer (which can be due to the limited availability of beds or ICU places), we exclude from our training and test sets the 142 patients who were admitted to Spedali Civili di Brescia and then transferred to another hospital. However, the 74 patients who were transferred to a rehabilitation center can be considered not at risk of death; therefore we include them in our datasets and consider these transferred patients as released alive.

In order to provide a prediction for a patient at different hospitalisation times, we introduced the concept of patient snapshot to represent the patient's health conditions at a given day. In this snapshot, for each lab test of Table 1, we consider its most recent value. In the ideal case, we would know the lab test findings at every day. However, as explained in Section 3.1, in a real-world context the situation is very different. For example, in our data, if we take a snapshot of a patient 14 days after the admission into the hospital, we have cases with very recent values of PCR, LDH or WBC (obtained one or a few days before), very old values for Fibrinogen or Troponin-T (obtained the first day of the hospitalisation), and even no value for Ferritin. Given the difficulty of setting a predefined threshold that separates recent and old values of the lab tests (e.g., for Fibrinogen and Troponin-T), we choose to always use the most recent value, even if it could be outdated. In order to allow the learning algorithm to capture that a value may not be significant to represent the current status of the patient (because it is too old), we introduce a feature called ageing for each test finding. If a lab test has been performed at day d0, and the snapshot of a patient is taken at day d1, the ageing is defined as the number of days between d1 and d0. If there is no available value for a lab test, its ageing is considered a missing value.

A patient snapshot can contain the values of the lab test findings in two forms: either numerical, in which we report the value itself, or categorical, in which the value is transformed into an integer number expressing the severity of the test finding within a partition of the possible real values. This partition is based on the range of values for normal conditions and on how the test values are distributed over the data of all patients. For example, we divide the D-Dimer values into 6 categories: the normal range, up to 2 times the maximum value of the normal range, up to 4 times, 6 times, 10 times, and over 10 times. The categorical form could help the algorithm to have a clearer understanding of the data and improve performance.
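As an illustration, here is a minimal sketch of this categorical transformation using the D-Dimer partition described above; the normal-range maximum in the example is a hypothetical value, not the clinical reference of Table 1.

```python
def categorize(value, normal_max, multipliers=(2, 4, 6, 10)):
    """Map a raw lab value to an integer severity category.

    Category 0 is the normal range; each following category covers values
    up to the next multiple of the normal-range maximum, and the last
    category collects everything above the largest multiple (6 categories
    in total with the default multipliers, as for D-Dimer).
    """
    if value is None:
        return None  # missing value, handled later by the imputation hyperparameter
    if value <= normal_max:
        return 0
    for i, m in enumerate(multipliers, start=1):
        if value <= m * normal_max:
            return i
    return len(multipliers) + 1

# Example with a hypothetical D-Dimer normal-range maximum of 500 ng/mL:
print(categorize(450, 500))   # 0 -> within the normal range
print(categorize(1800, 500))  # 2 -> up to 4 times the normal maximum
print(categorize(6000, 500))  # 5 -> over 10 times the normal maximum
```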
Monitoring the conditions of a patient means knowing not only the patient's status at a specific time, but also how the conditions evolve during the hospitalisation. For this purpose, we introduce a feature called trend, defined as follows. For each lab test, if there is no available value or if the patient has not performed the lab test at least two times, the trend is a missing value. Otherwise, given the values v1 and v2 of the findings for the lab test performed at days d1 and d2, and a threshold T that we set to 15%, if v2 > (1 + T) * v1 then the trend is increasing, while if v2 < (1 − T) * v1 the trend is decreasing; otherwise the trend is stable. We distinguish two types of trends: the start trend, which uses the distance between the most recent value and the first available value, and the last trend, which uses the distance between the last value and the penultimate one. We are currently investigating techniques for including more than two values in the trend calculation.

To summarize, for each lab test in a patient snapshot, we have the most recent finding and the relative ageing and trend, as well as the static features age and sex.
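To make the ageing and trend features just defined concrete, the following minimal Python sketch computes them for one lab test; the list-of-(day, value)-pairs layout is an assumption about the data representation, while the 15% threshold and the start/last trend definitions follow the text.

```python
def ageing_and_trends(history, snapshot_day, T=0.15):
    """Compute ageing, start trend and last trend for one lab test.

    `history` is a chronologically sorted list of (day, value) pairs for
    the exams performed up to `snapshot_day`; trends compare the most
    recent value with the first value (start trend) or the penultimate
    value (last trend), using the 15% relative threshold T.
    """
    def trend(v_old, v_new):
        if v_new > (1 + T) * v_old:
            return "increasing"
        if v_new < (1 - T) * v_old:
            return "decreasing"
        return "stable"

    if not history:
        return None, None, None  # no value: ageing and trends are missing
    last_day, last_value = history[-1]
    ageing = snapshot_day - last_day
    if len(history) < 2:
        return ageing, None, None  # fewer than two exams: trends are missing
    start_trend = trend(history[0][1], last_value)
    last_trend = trend(history[-2][1], last_value)
    return ageing, start_trend, last_trend

# Example: PCR measured at days 1, 5 and 9, snapshot taken at day 10.
print(ageing_and_trends([(1, 120.0), (5, 80.0), (9, 75.0)], 10))
# -> (1, 'decreasing', 'stable')
```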
4.2 Training and Test Sets

In this section we describe how we generated the training and test sets for the purpose of predicting, at different days from the start of the patient's hospitalisation, the final outcome of her/his stay. First, for both the HCP and MCP sets, we used stratified sampling for selecting 80% of the patients for training the models and 20% for testing them. Then, we created specific training and test sets for each element in a sequence of times when the model is used to make the prediction:

• 2 days of hospitalisation. We include all the patients' snapshots containing the first values for each lab test conducted in the first two days after the hospital admission. Note that if a patient has performed a lab test more than once in the first two days, the snapshot will consider the oldest value. In fact, the purpose of the model we want to build is to provide the prediction as soon as possible, with the first information available. Furthermore, in these snapshots the ageing and trend values are not included.
• 4 days and 6 days of hospitalisation. In these cases, the corresponding snapshots also contain the ageing and trend features, and the lab values are the most recent ones in the available data. Given that only a few days have passed since admission, we consider the start trend.
• 8 days and 10 days of hospitalisation. The procedure for creating the corresponding snapshots is the same as for the 4-days and 6-days cases, except that we consider the last trend instead of the start trend.
• End day (the last day before the patient is released or dies). In this case, for each lab test the snapshot includes both the start trend and the last trend features.

It is important to observe that, while the datasets of the later days contain more information about the single patients (more lab test findings, fewer missing values), the overall number of patients in the datasets decreases as the prediction day increases. This is due to the fact that more patients are released or die within longer periods of hospitalisation, and therefore such patients are not included in the corresponding datasets. Finally, note that the splitting of the data between training and testing is done only once, considering all patients. Thus if, for instance, a patient belongs to the training set of 2 days, then it does not belong to the test sets of the following days.

5 Machine Learning Algorithms

In this section we briefly describe the machine learning algorithms used in our prognosis prediction system.

5.1 Decision Trees and Tree Ensembles

Decision Trees [25] are one of the most popular learning methods for solving classification tasks. In a decision tree, the root and each internal node provide a condition for splitting the training samples into two subsets, depending on whether the condition holds for a sample or not. In our context, for each numerical feature f, a candidate splitting condition is f ≤ C, where C is called the cut point. The final splitting condition is chosen by finding the f and C providing the best split according to one of some possible measures, like Information Gain, Entropy index or Gini index. A subset of samples at a tree node can either be split again by further feature conditions, forming a new internal node, or form a leaf node labelled with a specific classification (prediction) value; in our application domain the label is either the alive class or the dead class.

Let us consider a decision tree with a leaf node l and a subset S of associated training samples. A test instance X that reaches l from the tree root is classified (predicted) y with probability

P(y|X) = TP / (TP + FP),

where TP (True Positives) is the number of training samples in S that have class value y, and FP (False Positives) is the number of samples in S that do not have class value y [5]. Given that in our task we have only two classes (y and ȳ), P(ȳ|X) = 1 − P(y|X). The classification outcome of a decision tree for X is the class value with the highest probability.

Random Forests (RF) [4] is an ensemble learning method [32] that builds a number of decision trees at training time. For building each individual tree of the random forest, a randomly chosen subset of the data features is used. While in the standard implementation of random forests the final classification label is provided using the statistical mode of the class values predicted by each individual tree, in the well-known tool Scikit-Learn [18], which we used for our system implementation, the probability of the classification output is obtained by averaging the probabilities provided by all trees. Hence, given a random forest with n decision trees, a class (prediction) value y is assigned to an instance X with the following probability:

P(y|X) = (1/n) · Σ_{i=1..n} P_i(y|X),

where P_i(y|X) is the probability provided by the i-th tree.

Extremely Randomized Trees (Extra Trees or ET) [8] are another ensemble learning method based on decision trees. The main differences between Extra Trees and Random Forests are:

• In the original description of Extra Trees [8], each tree is built using the entire training dataset. However, in most implementations of Extra Trees, including Scikit-Learn [18], the decision trees are built exactly as in Random Forests.
• In standard decision trees and Random Forests, the cut point is chosen by first computing the optimal cut point for each feature and then choosing the best feature for branching the tree; in Extra Trees, instead, the algorithm first randomly chooses k features and then, for each chosen feature f, randomly selects a cut point C_f in the range of the possible f values. This generates a set of k couples {(f_i, C_i) | i = 1, ..., k}. Then, the algorithm compares the splits generated by each couple (i.e., under the split test f_i ≤ C_i) to select the best one, using a split quality measure such as the Gini index or others.

The probability P(y|X) of assigning a class value y to an instance X is computed as in Random Forests (see the equation above).
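For concreteness, the following snippet shows how the averaged class probabilities described above are exposed by Scikit-Learn [18] for both ensemble methods; the toy data is purely illustrative and does not reflect our actual feature matrix.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Toy snapshot matrix: rows are patients, columns stand for features such
# as age, lab values, ageing and trend encodings (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)  # 0 = alive, 1 = dead

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=100, random_state=0).fit(X, y)
    # predict_proba averages P_i(y|X) over the n trees, as in the equation
    # above; each output row is [P(alive|X), P(dead|X)].
    proba = clf.predict_proba(X[:3])
    print(Model.__name__, proba.round(3))
```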
5.2 Hyperparameter Optimization

Most machine learning algorithms have several hyperparameters to tune, such as, for instance, the number of decision trees to create in a Random Forest and their maximum depth. Since in our application handling the missing values is an important issue, we also used a hyperparameter for this, with three possible settings: a missing value is set to either the average value, the median value, or a special constant (−1). In order to find the best performing configuration of the hyperparameters, we used the Random Search optimization approach [2], which consists of the following main steps:

1. We divide our training sets into k folds, with either k = 10 or k = 5, depending on the dimension of the considered dataset.
2. For each randomly selected combination of hyperparameters, we run the learning algorithm in k-fold cross validation.
3. For each fold, we evaluate the performance of the algorithm with that configuration using the Macro F-β score metric with β = 2. The F-β score is the weighted harmonic mean of the precision and recall measures; the β parameter indicates how many times more important the recall is with respect to the precision:

F_β = (1 + β²) · (precision · recall) / (β² · precision + recall).

We choose β = 2 in order to give particular importance to false negatives, i.e., those patients whom our system could not identify as at death risk. Given that we can compute the F2-score for both the alive class and the dead class, we considered the Macro F2-score, which is the arithmetic mean of the scores for the two classes.
4. The overall evaluation score of the k-fold cross validation for a configuration of the hyperparameters is obtained by averaging the scores obtained for each fold.
5. The hyperparameter configuration with the best overall score is selected.
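A minimal sketch of this search procedure, using Scikit-Learn's RandomizedSearchCV with a Macro F2 scorer, is shown below; the parameter grid and data are illustrative, and the missing-value hyperparameter of our system is only indicated in a comment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 2, size=300)

# Macro F2: the arithmetic mean of the F2 scores of the two classes,
# weighting recall twice as much as precision (beta = 2).
macro_f2 = make_scorer(fbeta_score, beta=2, average="macro")

# Illustrative search space; the missing-value strategy described in the
# text (mean / median / constant -1) could be searched the same way,
# e.g. via a SimpleImputer step inside a Pipeline.
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10, 20],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,       # the paper samples 4096 random configurations
    scoring=macro_f2,
    cv=5,            # k = 5 or 10 depending on the dataset size
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```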
5.3 Handling Uncertain Predictions

The output for an instance X of every generated classification model is an array of two probabilities, P(alive|X) and P(dead|X), defined as described in Section 5.1. We can see them as "degrees of certainty" of the prediction: the higher the probability is, the more reliable the prediction is. Given the very sensitive nature of our task, the system discards potential predictions supported by a low probability. This is achieved using a prediction threshold under which the system considers the prediction uncertain (and the patient risk unpredictable). Note that if we used a threshold value that is too high, many patients could be classified as uncertain, and our model would be much less useful for clinical practice. To avoid this, at training time we impose a maximum percentage of samples that can be considered uncertain (unpredictable); we implemented this with an input parameter, called max_u, and for our experimental analysis we used max_u = 25%.

We designed an algorithm called FINDUNCERTAINTHRESHOLD that is used in the training phase to decide the threshold and optimize the prediction performance on the training samples that pass it, under the max_u constraint. The pseudocode of the algorithm is in Figure 2.

Figure 2: FINDUNCERTAINTHRESHOLD, the algorithm for computing, during the training phase, an optimised prediction threshold under which the model labels an instance as uncertain. Inputs: L, the array of labels (alive or dead), where L[i] is the label of sample i of the validation data (fold); P = [p_i = (p_alive, p_dead)_i | i is the sample index in the validation set]; max_u, the maximum percentage of the samples in the validation set that can be labeled as uncertain (not predictable); n, the maximum number of thresholds to try; EvaluateScore, the score function to maximize by dropping the uncertain samples. Output: (v, th), where v is the score function value after dropping the uncertain samples and th is the optimized threshold value.

Given the original labels L of the validation samples and their prediction probabilities P derived by the learning algorithm, FINDUNCERTAINTHRESHOLD first computes: the predicted labels L_pred (i.e., the class values with the highest probabilities) and the relative P_max probabilities; the original score v, obtained by evaluating the input score function on all samples; and an initial value of the threshold th, defined as the minimum probability in P_max. The main loop then finds an optimal value of the threshold and computes the score function for the validation set reduced to the validation samples whose predicted labels have probabilities above the threshold. The considered threshold values are obtained by using δ-increments. First we compute the new threshold th′ by increasing the current threshold by δ, and then we derive the set S of sample ids with prediction probabilities higher than th′. Next we compute the percentage u of samples that are labeled as uncertain using threshold th′. If u ≥ max_u, we can terminate, returning the current best score v and the corresponding threshold value th (a greater threshold value cannot lead to labeling fewer samples as uncertain than the returned th value). Otherwise (u < max_u), we take the correct sample labels L′ and the predicted sample labels L′_pred for the samples identified by S, and we compute the new score value v′ using L′ and L′_pred. If v′ is a better score than v, we update both the threshold and the score values.

FINDUNCERTAINTHRESHOLD is executed during the training phase. In particular, during the hyperparameter search, for each attempted hyperparameter configuration we compute through FINDUNCERTAINTHRESHOLD an optimized threshold and the relative score function value. These two values are obtained by averaging the optimal thresholds and corresponding scores over all folds of the cross validation for the attempted configuration. The hyperparameter search returns the best configuration together with the relative (averaged) threshold.
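To make the procedure concrete, here is a minimal Python sketch of FINDUNCERTAINTHRESHOLD following the description above; the uniform δ step and the integer label encoding are simplifying assumptions, since the exact increment scheme of the pseudocode in Figure 2 is not reproduced here.

```python
import numpy as np

def find_uncertain_threshold(labels, probas, max_u, n, evaluate_score):
    """Sketch of FINDUNCERTAINTHRESHOLD.

    labels: integer-encoded true labels (0 = alive, 1 = dead) of the
    validation fold; probas: array of shape (m, 2) with the per-sample
    probabilities [P(alive|X), P(dead|X)]; max_u: maximum fraction of
    samples that may be labelled uncertain; n: maximum number of
    thresholds to try; evaluate_score: score function to maximise on
    the samples that pass the threshold (e.g. Macro F2).
    """
    labels = np.asarray(labels)
    preds = probas.argmax(axis=1)        # predicted labels L_pred
    p_max = probas.max(axis=1)           # probability of each predicted label
    best_score = evaluate_score(labels, preds)  # original score v on all samples
    th = best_th = p_max.min()           # initial threshold
    delta = (p_max.max() - th) / n       # simplified uniform delta-increment
    for _ in range(n):
        th += delta                      # new candidate threshold th'
        keep = p_max > th                # set S of samples still predictable
        u = 1.0 - keep.mean()            # fraction labelled as uncertain
        if u >= max_u:
            break  # larger thresholds can only drop more samples
        score = evaluate_score(labels[keep], preds[keep])  # new score v'
        if score > best_score:
            best_score, best_th = score, th
    return best_score, best_th

# Usage with a Macro F2 score function (hypothetical y_val, proba_val):
# from sklearn.metrics import fbeta_score
# f2 = lambda y, p: fbeta_score(y, p, beta=2, average="macro")
# v, th = find_uncertain_threshold(y_val, proba_val, max_u=0.25, n=100,
#                                  evaluate_score=f2)
```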
6 Experimental Evaluation

In this section, we evaluate the performance of the machine learning models that we built. Our system was implemented using the Scikit-Learn [18] library for Python, and the experimental tests were conducted using an Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz. The performance of the learning algorithms with the relative optimized hyperparameters was evaluated on the test set in terms of F2 score and ROC-AUC score. The second metric is defined as the area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate; it thus also takes into account the probability that the predictive system produces false positives (i.e., false alarms). This metric is a standard method for evaluating medical tests and risk models [9, 10].

In a preliminary study we examined various machine learning approaches and compared their average performances over the HCP datasets. Figure 3 shows a summary of the relative performance in terms of F2 score. We considered Decision Trees [25], Extra Trees (ET) [8], Gaussian Naive Bayes [31], Multilayer Perceptron with two layers (MLP) [13], Quadratic Discriminant Analysis [26], Random Forests (RF) [4] and Support Vector Machines [27]. The best performance was obtained with RF and ET. The MLP and SVM performed much worse and with a much higher variability over the datasets, probably related to the missing values and the scarcity of data. For the MCP datasets the relative performance was similar. Given the observed better performance of RF and ET, we focused the evaluation of our system on these learning algorithms.

Regarding the training time, including the hyperparameter search over 4096 random configurations and the optimization of the uncertainty threshold, for any specific dataset (e.g., the MCP numerical dataset for 2 days) the overall training time is between 20 and 30 minutes. Therefore, we can build all four of the most promising models, generated by RF and ET using the numerical version (RF-N, ET-N) or the categorical version (RF-C, ET-C) of the dataset, in less than two hours, and then select the best performing model among them. It is also worth noting that in our system the models for predicting the prognostic risk at different days are completely independent from each other, and so we can consider prediction tasks at different days as different tasks.

In Figure 4 and in Table 2 we show the performance of our system at each considered day, for both the High Contagion Phase and the Moderate Contagion Phase. As we can see, we obtain promising results in terms of F2 score for an early evaluation of the risk during the HCP (with score 77.1% at day 2), while we encounter some problems at the 6th and 10th days. For the MCP datasets, the system performs better at the later days; in particular, for the 10th day F2 is 80.4% and ROC-AUC is 90.2%. For HCP, both RF and ET obtain good results on both the numerical and categorical versions of the datasets. Instead, for MCP, using the categorical datasets does not give good performance, and we do not observe an improvement for the later prediction days (the F2 score is always below 70%).

In all but one case, the models using the uncertainty threshold increase the performance in terms of both F2 and ROC-AUC scores. In particular, in the most problematic cases of HCP, such as the 6-days and 10-days datasets, the prediction performance improves in terms of F2 by more than 7 points. The improvement is less significant for MCP. Note that, while the threshold value under which the system labels an instance (patient risk) as uncertain is derived at training time by imposing a maximum percentage of uncertain samples (we used 25%), there is no formal guarantee that this percentage limit is satisfied on the test set. However, in most cases the percentage of uncertain test samples (indicated with % Unc in Table 2) is much below the limit imposed during training, except for the test set of the 6th day in HCP, where the unpredicted (labelled as uncertain) patients are 26.1%. The performance on the "end" dataset is good for both HCP and MCP even without omitting the uncertain patients (F2 score 86.6% for HCP, and 86.9% for MCP).

Figure 4 gives graphical pictures comparing the performance of our system for HCP and MCP in terms of F2 and ROC-AUC. The performance behaviour over time significantly differs in the two contagion periods, reflecting the concept drift discussed in Section 3.2. For HCP, considering the results without omitting the uncertain test instances (blue curves), the prediction performance is very good at the 2nd day and it decreases at the 6th and 10th days.
Instead, for MCP the performance improves over time, reaching 90.2% in terms of ROC-AUC at the 10th day, as also reported in Table 2. This is due to several factors:

• MCP includes patients that have hospitalisation periods much longer than the patients in HCP, which can make it more difficult to predict the mortality risk for some patients with only a few days of hospitalisation;
• on the contrary, in HCP half of the patients stayed in hospital for less than 8 days. This significantly decreases the size of the 8-days and 10-days training sets, which contain only 431 and 339 patients, respectively. The lack of training data in these datasets is only partially compensated by the increase in the number of lab tests for a single patient;
• as described in Section 3.2, the MCP patients are much more unbalanced (with only 11% deceased patients) than the HCP patients, and this increases the difficulty of learning a high-performing model [17].

Figure 5 shows the confusion matrices for the test sets generated using our predictive models; above the line we have the HCP datasets and below the MCP datasets. Although the training phase was optimised (through the use of the F2 metric) to avoid false negatives, for the HCP datasets there are several false negatives (bottom-left of the matrices). This can be explained by the scarcity of lab test and X-ray data in the HCP data, which affects prediction. However, false negatives are significantly reduced with the models that can classify a patient as uncertain. For example, at day 6 the system classifies as uncertain 4 patients who would otherwise be false negatives. Moreover, when there are fewer false negatives, such as at days 8 and 10, classifying some patients as uncertain also helps to avoid false positives and so to generate fewer false alarms. Remarkably, especially for the MCP datasets, we have very few false negatives even at the early days, which is quite important in our application context. On the other hand, especially for days 2 and 4, our system produces many false positives. This type of error is reduced in the models with uncertain patients, down to only 5 false alarms for the end dataset (e.g., at day 2 we avoid 16 false positives).

7 Conclusions and Future Work

We have presented a system for predicting the prognosis of Covid-19 patients, focusing on the death risk. We built and engineered some datasets from lab test and X-ray data of more than 2000 patients in a hospital in northern Italy that was severely hit by Covid-19. Our predictive system uses a collection of machine learning algorithms and a new method for setting, at training time, an uncertainty threshold for prediction that helps to significantly reduce the prediction errors. Overall, the experimental results are quite promising, and show that our system often obtains high ROC-AUC scores. The observed predictive performance is especially good in terms of false negatives (patients erroneously predicted as survivors), which are very few. This gives a predictive test for patient survival with very good specificity, in particular when the system can classify a patient as uncertain. On the other hand, in terms of false positives, there is room for significant improvement. We are confident that the availability of more information, such as patient comorbidities or clinical treatments, will help to improve performance, reducing the number of both false positives and (few) false negatives.
For future work we plan to extend our datasets with more information (both additional features and patients), to consider further methods for dealing with the observed concept drift, and to address other prediction tasks such as the duration of the hospitalisation or the need for ICU beds and critical hospital resources. Moreover, we are analyzing the importance of the features used in our models, and we intend to investigate additional learning techniques.

Acknowledgements. The work of the first author has been supported by Fondazione Garda Valley.

References

[1] Patient length of stay and mortality prediction: A survey.
[2] Random search for hyperparameter optimization.
[3] Covid-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression.
[4] Random forests.
[5] Evaluating probability estimates from decision trees.
[6] A survey on concept drift adaptation.
[7] Automatic classification of radiological reports for clinical care.
[8] Extremely randomized trees.
[9] Receiver operating characteristic curve analysis of clinical risk models.
[10] Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation.
[11] Multitask learning and benchmarking with clinical time series data.
[12] Analyzing the effect of data quality on the accuracy of clinical decision support systems: a computer simulation approach.
[13] Neural networks: a comprehensive foundation.
[14] Early prediction of circulatory failure in the intensive care unit using machine learning.
[15] Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity.
[16] Machine learning and decision support in critical care.
[17] Learning from imbalanced data: open challenges and future directions.
[18] Scikit-learn: Machine learning in Python.
[19] Enabling machine learning in critical care.
[20] Deep learning for classification of radiology reports with a hierarchical schema.
[21] The impact of self-interaction attention on the extraction of drug-drug interactions.
[22] Applying self-interaction attention for extracting drug-drug interactions.
[23] Dataset shift in machine learning.
[24] Training feedforward neural networks with dynamic particle swarm optimisation.
[25] Data Mining with Decision Trees: Theory and Applications.
[26] Bayesian quadratic discriminant analysis.
[27] Least squares support vector machine classifiers.
[28] How artificial intelligence and machine learning can help healthcare systems respond to Covid-19.
[29] Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan.
[30] ForecastICU: a prognostic decision support system for timely prediction of intensive care unit admission.
[31] The optimality of Naive Bayes.
[32] Ensemble Methods: Foundations and Algorithms.