key: cord-0167358-yqi535az authors: Rodr'iguez, Carlos E.; Mena, Rams'es H. title: COVID-19 Clinical footprint to infer about mortality date: 2021-04-15 journal: nan DOI: nan sha: 92ba2cfc5d899cf8e4bf0c58f8e1c084d9c81ec7 doc_id: 167358 cord_uid: yqi535az Information of 1.6 million patients identified as SARS-CoV-2 positive in Mexico is used to understand the relationship between comorbidities, symptoms, hospitalizations and deaths due to the COVID-19 disease. Using the presence or absence of these latter variables a clinical footprint for each patient is created. The risk, expected mortality and the prediction of death outcomes, among other relevant quantities, are obtained and analyzed by means of a multivariate Bernoulli distribution. The proposal considers all possible footprint combinations resulting in a robust model suitable for Bayesian inference. On 31 December 2019, the World Health Organization (WHO) received a troubling report from Chinese health officials, WHO (2021) . A mystery pneumonia had sickened dozens of people in Wuhan, the capital of Hubei Province in China. A virus, that we know now as SARS-CoV-2, had been transmitted from an unknown animal host to humans and since then has turned up lives worldwide with an unprecedented speed. As of 19 th March, 2021 the WHO database has confirmed 121,464,666 cases SARS-CoV-2 virus globally with 2,684,093 reported deaths caused by COVID-19 disease from 238 countries. The most affected countries are the United States with 532,971 deaths out of 29,317,562 confirmed cases; Brazil with 284,775 deaths out of 11,693,838 cases; Mexico with 182,009 deaths out of 1,734,503 confirmed cases and India with 159,370 deaths out of 11,514,331 confirmed cases, WHO (2021) . The incubation period of COVID-19, defined as the time between exposure to the virus and symptom onset, is on average five to six days, but it can be as long as 14 days, Lauer et al. (2020) . The symptoms of COVID-19 range from those that might not be noticeable to severe life-threatening illness. Some infected people have no symptoms, known as asymptomatic or pre-symptomatic carriers, Arons et al. (2020) . According to the WHO most infected people will develop mild to moderate illness and recover without hospitalization. The WHO divides COVID-19 symptoms in three groups: most common symptoms; fever, dry cough and tiredness, less common symptoms; aches and pains, sore throat, diarrhea, conjunctivitis, headache, loss of taste or smell, a rash on skin, or discoloration of fingers or toes, and serious symptoms; difficulty breathing or shortness of breath, chest pain or pressure, loss of speech or movement, WHO (2021) . It is important to mention that individual symptoms appear to have poor diagnostic properties. Indeed, based on currently available data, neither absence nor presence of any symptoms are accurate enough to rule in or rule out the disease, Struyf et al. (2020) . Thus, the gold standard for COVID-19 diagnosis is the laboratory technique known as RT-PCR test (Reverse-Transcription Polymerase Chain Reaction test), however there are other alternatives, see Oliveira et al. (2020) . Comorbidities and its effects in COVID-19 patients People of any age who have underlying medical conditions, such as hypertension and diabetes, have shown worse prognosis, Sanyaolu et al. (2020) . Diabetic patients have increased morbidity and mortality rates and have been linked to more hospitalization and intensive care unit (ICU) admissions, Singh et al. (2020) . People with chronic obstructive pulmonary disease (COPD) or any respiratory illnesses are also at higher risk for severe illness from COVID-19, Zhao et al. (2020) . Impact of sex and age on COVID-19 outcomes From the first reports from China a sex imbalance with regard to the fatality rate of COVID-19 patients has been detected. Case fatality rates reported in China, Italy, Spain, France, Germany, and Switzerland support the view that a consistent biological phenomenon is operating, accounting for a higher case fatality in men. Such observation is independent of country specific demographics and testing strategies, Gebhard et al. (2020) . Age has also been identified as a variable with high impact over the mortality rate of COVID-19 cases. All age groups appear to have significantly higher mortality compared with the immediately younger age group, Bonanad et al. (2020) . The aim of this research is to use the available data from the COVID-19 pandemic in Mexico to gain insight into the COVID-19 disease. The first particular objective is to understand the relationship between comorbidities, symptoms, mortality risk and hospitalization. The second is to identify differences by sex and age group. And the third is mortality prediction, that is given the commodities, symptoms, age and sex of a patient identified as SARS-CoV-2 positive, then estimate the mortality risk of a given patient. It is important to observe that almost all relevant variables in this study are binary and indicate the presence or absence of a symptom/comorbidity or whether a patient has died, has been hospitalized, or not. A straightforward first strategy to analyze the impact of comorbidities and symptoms over death outcomes and hospitalizations under a binary data setting is to use contingency tables. Hence, basic probabilities such as P (Y j = 1) and P (Y i = 1|Y j = 1) can be easily computed via the observed frequencies. This can be extended to account for higher order interactions of the form P (Y i |Y j , Y r ). The idea to compute these probabilities efficiently is to concatenate the observed combinations y j y r and then to obtain the correspondent frequencies. The same ideas could be applied to obtain probabilities such as P (Y i |Y j1 , Y j2 , . . . , Y jk ). Moving to a modelling framework, a second vanilla strategy is to use logistic regression models. Under such a strategy, an idea would be to estimate death outcome probabilities using all available variables. The challenge under this setting is to select a small subset of variables that could describe the death outcomes effectively. However, it is not possible to obtain all the summaries of interest as the joint distribution over all variables is sometimes needed and not robustly available for such regression method. A third and possibly more general approach is to use a multivariate binary distribution. To achieve the aforementioned objectives via the generation of relevant information summaries such as those obtained via contingency tables while maintaining the predictive capabilities of the logistic regression, we have chosen the multivariate Bernoulli distribution (see Dai et al., 2013) . In this case all the relevant quantities are obtained in closed form and its implementation is straightforward. This model is a generalization of the well known Bernoulli distribution that takes the value 1 (success) with probability p and the value 0 (failure) with probability 1 − p. In the multivariate case each observation is a vector of, let us say, k successes/failures. The multivariate Bernoulli distribution assigns positive probability to each of the 2 k possible combination of successes/failures. It is important to note that this model can estimate not only the main effects and the interactions between pairs of variables, but is also capable of modeling higher-order interactions. For us the information of each patient is represented by a k vector of ones and zeros: 1 (presence) or 0 (absence) of each comorbidity, symptom, hospitalization and death. Such data composition will be referred to as the COVID-19 footprint for each patient. With the multivariate Bernoulli distribution and the footprint for each patient is relatively simple to use a Bayesian approach to obtain meaningful inference. Furthermore, under such an approach it is possible to obtain inferences about quantities such as the mortality risk given certain comorbidity, sex and age group, the mortality risk given certain symptom, sex and age group. Indeed, this will be done for all patients identified as SARS-CoV-2 positive by a laboratory test. In Section 2 information about the prevalence of the comorbidities that the Mexican population suffers the most are provided, and the data used for this study is described. In Section 3, the multivariate Bernoulli model is introduced and some of its properties discussed. Section 4, deals with the Bayesian ideas to make inference via the multivariate Bernoulli distribution. In Section 5, the available data from the pandemic in Mexico is analyzed and some interesting results are presented and described. Finally, a discussion is provided in Section 6. In Mexico, the first cases of SARS-CoV-2 were detected on the 29 th of February 2020 and to the date of writing (19 th March, 2021), this represents 1,734,503 confirmed cases and 182,009 deaths. With this numbers Mexico has become the country with the third highest death toll with coronavirus, with only the US and Brazil recording greater numbers. The World Health Organization has said people with underlying medical problems like high blood pressure, heart and lung problems, diabetes, or cancer, are among those most vulnerable to severe cases of the new coronavirus disease, along with the elderly WHO (2021). Over the past 30 years, Mexico has become one of the countries in the world most heavily affected by the global epidemic of obesity. It is now the second country worldwide with obesity prevalence. Between 2006 Between , 2012 Between and 2018 prevalence increased from 69.5% to 71.3% and then to 75.2% (respectively) in population of 20 years and over, while the rate of obesity alone rose from 30% in 2006 to 32.4% in 2012 and then to 36.1% in 2018 (estimates). Also, Mexico is now one of the countries with the highest child obesity rates in the world with one in three children being overweight or obese. Diabetes, the chronic disease most directly linked with obesity, is spreading rapidly and in Mexico in 2018 affected 10.3% of the adult population (aged over 20 years old), while in 2012 it affected 9.2%. High blood pressure, or hypertension, has less noticeable symptoms. But if untreated, it increases the risk of serious problems such as heart attacks and strokes. In Mexico, the prevalence of hypertension in 2012 was of 30.2% and in 2018 of 32.7%. However, these estimates may not be comparable as there was a methodological change in the National Health and Nutrition Survey ‡(NHNS) of 2018. Digital baumanometers were introduced; being less susceptible to measurement error, they provide interviewers with better estimates of blood pressure values, see Campos et al. (2019) . All the information in this section has been obtained via the NHNS. This survey is conducted every six years by the National Institute of Statistics and Geography which is an autonomous agency of the Mexican Government dedicated to coordinate the National System of Statistical and Geographical Information of Mexico. ‡https://www.inegi.org.mx/programas/ensanut/2018/ The source of information for this work is the database of the National Epidemiological Surveillance System for monitoring possible cases of COVID-19 in Mexico (SINAVE/SISVER for its acronym in Spanish), coordinated by the Secretary of Health (Spanish: Secretaría de Salud †Is the government agency in charge of all social health services in Mexico, a very important part of the Mexican health system.). The SINAVE/SISVER platform considers cases that are suspected of COVID-19. People who have had flu like symptoms, or that believe to have been infected with the SARS-CoV-2 virus, are entitled to attend any public or private health service in Mexico, and after an initial examination is suspected to suffer from the COVID-19 disease are registered on this database. By March 19 th , 2021 this database had information for a total of 5,839,528 suspected cases and 115 variables. This is the latest update of this data base that we have. For the relevance of this analysis we will use only the following variables: • Result from the lab test (positive or negative for the presence of the SARS-CoV-2 virus in a blood test). We work with the information of 1,584,288 positive cases. • Gender (Female and Male) and age group (four groups < 20, [20, 40) , [40, 60) and > 60). • Symptoms: fever (1), cough (2), ears pain (3), difficulty breathing (4), irritability (5), diarrhea (6), chest pain (7), chills (8), headache (9), muscle pain (10), joint pain (11), attack general state (12), nasal discharge (13), increased respiratory frequency and depth (14), vomiting (15), abdominal pain (16), conjunctivitis (17), blue color lack of oxygen (18) and sudden onset of symptoms (19). • Comorbidities: chronic kidney failure (1), COPD (2), heart disease (3), diabetes (4), immunosuppression (5), hypertension (6), obesity (7), smoking (8) and asthma (9). • Disease outcomes: hospitalization (1) and death (2). Symptoms, comorbidities, sex and outcomes of the disease are dichotomous variables, namely each patient has only two possible values presence (1) or absence (0) of the symptom and/or comorbidity. Sex is also taken as dichotomous, male (1) and female (0). Also the variables recording if the patient has been hospitalized (1) or not (0) or the patient has died (1) or has not (0) are decoded in a similar manner. To include the age groups; < 20, [20, 40), [40, 60) and > 60, under the framework of the multivariate Bernoulli distribution described below, these can be treated as (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1) respectively. Therefore, in this study the clinical footprint of each patient that has been identified as SARS-CoV-2 positive is given by y 35 = (sex, age group 1:4 , comorbidity 1:9 , symptom 1:19 , hospitalized, death), = (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , h, d), which is a vector of zeros and ones of length 35. In other words, our dataset is a matrix of zeros and ones of dimensions 1, 584, 288 × 35. The Secretary of Health via the General Directorate of Epidemiology (Spanish: Dirección General de Epidemiología) updates daily the data base with the suspected cases and 48 variables, these data can be accessed using the following link https://www.gob.mx/salud/documentos/ datos-abiertos-152127. Additional variables are available upon request via the platform http://covid-19.iimas.unam.mx/, as described in Loza et al. (2020) . Health authorities in Mexico classify a patient as a COVID-19 case if he/she fells in one of the following three categories: • Confirmed by epidemiological clinical association. Confirmed by association applies when the case had contact with a COVID-19 case, and the latter is registered in SISVER plataform. The case was not tested or the test was invalid. • Confirmed by dictamination committee. Confirmed by ruling out only applies to deaths when the case was not tested or a test was taken, but it was invalid. • Confirmed by SARS-CoV-2. The case has a laboratory test or antigenic test and was positive for SARS-CoV-2, regardless of whether the case has a clinical epidemiological association. As of 19 th of March 2021 there have been 1, 734, 503 confirmed cases: 120, 944 by association, 5, 307 by dictamination and 1, 608, 252 by a laboratory test. In this work we only consider cases that have tested positive for SARS-CoV-2 in a laboratory test. The complete data base has 5,839,528 records. Focusing only in the positive cases by laboratory test, there are 1,608,252 patients. At this instance, there are no missing values in the variables with information about deaths, hospitalizations, sex and age. However, if symptoms (19) and comorbidities (9) are included, there are 0.24% of cells, in the 1, 608, 252 × 28 database, with missing values. The percentage of missing values over the commorbidities ranges from 0.25% (obesity) to 0.29% (diabetes), while for the symptoms ranges from 0.08% (difficulty breathing) to 0.5% (sudden onset of symptoms). Assuming a missing completely at random mechanism, we followed a case deletion strategy to obtain the 1,584,288 registers that we work in this analysis. As an additional source of uncertainty we have that the information about the symptoms and comorbidities is mainly self declared by patients and the Mexican health authorities know there must be many patients not fully aware if they suffer from certain comorbidities. The number of SARS-CoV-2 positive tests has turned into an important indicator, it has been used to decide whether or not nations or regions around the world can open their economies. This is assessed using what has been called the percent positive which is simply the percentage of all coronavirus tests performed that are actually positive. The WHO recommended that the percent positive remain below 5% for at least two weeks before governments consider reopening after a lock down period, WHO (2021). In Mexico, the percent positive since March 2020 has been larger than 30% and there even have been periods where it almost reached 50%. These very high percentages are easy to explain, the Mexican health authorities operate with limited resources, which forbids a widespread testing. Thus mainly patients with COVID-19 symptoms have been tested. Hence, it is clear that there is bias in the data base of 5,839,528 suspected cases, and is far from a random sample. All this said, the 1,584,288 positive cases are expected to provide us with important information for inference. Let Y k = (Y 1 , Y 2 , . . . , Y k ) be a k-dimensional random vector of possibly correlated Bernoulli random variables and let y k = (y 1 , y 2 , . . . , y k ) be a realization of Y k . In our case k = 35. The multivariate Bernoulli distribution can be described via its mass probability function where is the vector of probabilities associated to each possible outcome, and there are 2 k possible outcomes. as the matrix where the rows are the 2 k possible outcomes of the random vector Y k and ρ k l,. as the k dimensional vector of the l possible outcome. It is possible to write the above mass probability function in shorter forms, that is where γ l (y k ) = 1(ρ k l,. = y k ). These expressions clearly indicate that we have 2 k parameters, namely one parameter value per each possible footprint. Indeed, much in the same spirit as for the univariate Bernoulli distribution, the above multivariate version constitutes a dense model in its support. Namely all possible combinations are considered. The above finite mixture representation suggests the possibility of using a discrete random probability measure as a potential model, namely a Bayesian nonparametric distribution. However, such an approach, e.g. via the Dirichlet processes, would require and infinite number of footprints and a diffuse baseline distribution to ease its implementation. While potentially possible some adaptations and interpretations would be at hand, so we prefer to keep it simple. The marginal distribution of the random vector Y k which follows a multivariate Bernoulli distribution with density function (2) to any order is still a multivariate Bernoulli distribution. Thus, for s < k is a multivariate Bernoulli distribution where w s (ρ s r,. ) = 2 k l=1 1 (ρ k l,1:s = ρ s r,. ) w k (ρ k l,. ) , for r = 1, . . . , 2 s , with ρ k l,1:s as the first s observations of the l outcome of the random variable Y k , and ρ s r,. as the s dimensional vector of the r possible outcome of the marginalized random variable Y s . As a straightforward consequence, for j ∈ {1, . . . , k}, we have where Thus each Y j follows a Bernoulli distribution as expected. Also, the conditional distribution of Y s given the rest is also multivariate Bernoulli distribution (s < k). Finally, in this brief outline, the covariance between any two random variables Y j and Y s , is given by Then, the correlation is given by 3.1. Bayesian inference For a random sample y 1,k , y 2,k , . . . , y n,k the likelihood of the multivariate Bernoulli distribution is given by If a Dirichlet distribution, i.e. p(w 2 k |δ 1 , . . . , δ 2 k ) = Dir(w k 2 |δ 1 , . . . , δ 2 k ) is chosen as a prior distribution over w 2 k , then it is trivial to see that the posterior distribution is given by where α l = r l + δ l , for l = 1, . . . , 2 k . In Section 4 it is shown how to use these weights to obtain meaningful probabilities. In this section our aim is to study the relationship between comorbidities, symptoms, sex, age and death for all those patients that resulted positive for SARS-CoV-2 virus in a laboratory test in Mexico. Hence, in this section we will use the data of 1,584,288 cases, and for each patient we have the footprint for i = 1 . . . , 1, 584, 288 . To fit the Bernoulli multivariate model we take the footprint y i,35 for each patient. Thus, our theoretical sample space is of 2 35 possible outcomes. However, for the four variables of age group there are only four possible results, namely the same person cannot belong to two different age groups at the same time. Hence, the true sample space of our model is of 2 32 outcomes. The key to handle this huge number of sets is simply to identify the active outcomes in (7), i.e. the unique different outcomes that have been observed in the data. Within the 1, 584, 288 cases, we have observed m = 585, 034 active outcomes. Setting k = 35 and denoting β k l,. for l = 1, . . . , m as the active outcomes, we can order the rows of the matrix with all the possible outcomes such that in the first m rows we have the active outcomes, thus ρ k l,. = β k l,. for l = 1, . . . , m. With this we can write (7) as 1(β k l,. = y i,k ), for l = 1 . . . , m, and r l = 0, for l > m. Then α l = r l + δ l , for l = 1 . . . , m and for l > m we have α l = δ l . To obtain the marginal distribution of w l along with the distribution of any combination of the weights, we will need to compute This is the starting point of how to handle a theoretical sample space of 2 k outcomes, and in the next sections it will become clear that we only need to focus in the m active outcomes. Assuming δ l = δ in the Dirichlet prior for the weights, we have α 0 = m l=1 r l + (2 k )(δ). If, for example, δ = 1 is assumed we will be giving uniform weight over the 2 k possible outcomes, but this is unrealistic since only m active outcomes out of the 2 k possibilities have been observed (with 2 k m). Also, the sample size n = 1, 584, 288 cases, is quite large and do not expect drastic changes if we update the information and consider more cases. Hence, taking δ = ν/2 k , we have and giving a small value for ν < 1 our inference will be heavily based on the information available. In this section we focus on the weights w j and w js , defined in (4) and (5) respectively, rather than the complete vector (1). The point being the simplified random variables w j and w js allows us to compute important summaries directly. Consider (9) and observe that, for example, for any j over the set {6, . . . , 14}, the random variable Y j indicates the presence or absence of comorbidity j − 5. Letting C l as the event of suffering from comorbidity l, with l ∈ 1, . . . , 9. Then, from (3), (4), (8) and knowing that the marginals of a Dirichlet distribution are Beta random variables, we have and α 0 is computed as in (10). As a summary measure the expected value can be obtained as The same applies to symptoms, hospitalization, death, sex and age group. The complete list of events are • C j -the patient suffers from comorbidity j, with j ∈ {1, . . . , 9}. • S l -the patient presents symptom l, with l ∈ {1, . . . , 19}. • M -the patient is male. Note M c = F , the patient is female. • A g -the patient belongs to group age g, with g ∈ {1, 2, 3, 4}. • D -the patient has died. Note death is the variable 35 in (9). • H -the patient has being hospitalized. Note hospitalization is the variable 34 in (9). With this conventions, and assuming j, s, l and g range over the correct set on each case (5), hence by the aggregation property of the Dirichlet distribution w j,s |y k,1 , . . . , y k,n ∼ Beta(w j,s |η j,s , and then To compute these summaries in the case of females, we need to consider 1(β k l,1 = 0) instead of 1(β k l,1 = 1), but this is a straightforward adaptation of what has been described. If a new patient has been diagnosed as SARS-CoV-2 positive, then the information for this patient would be given by its clinical footprint y i * ,35 = (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , h, d), where the sex, age group, comorbidities and symptoms are known but the outcome variables hospitalized (h) and death (d) are unknown. In this case it is easier to work with the original notation for the weights (1), since all the variables are needed. It is straightforward to obtain the predictive probability of death by calculating p(Y 35 = 1|(Y 1 , . . . , Y 33 ) = (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 ), y 1,35 , . . . , y n,35 ) = w 35 (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , h = 0 or 1, d = 1) w 35 (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , h = 0 or 1, d = 0 or 1) |y 1,k , . . . , y n,k , again is a Beta distribution with parameters η * and β * where with y 1 * ,35 = (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , 0, 0), y 2 * ,35 = (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , 0, 1), y 3 * ,35 = (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , 1, 0), y 4 * ,35 = (s, g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 , 1, 1). Hence, the posterior mean predictive is given by g 1 , . . . , g 4 , j 1 , . . . , j 9 , l 1 , . . . , l 19 ), y 1,35 , . . . , y n,35 )) = α * α * + β * , and the predictive is simply the Bernoulli distribution with z ∈ {0, 1} and α * α * + β * the probability of death. A couple of comments are in order. First, if prediction about the hospitalization status of the patient is required this can be done in an analogous manner. Second, often only a few subset of variables are needed to make prediction, in this case the variables that are not needed are integrated out as it was done with the hospitalizations. Focusing again the analysis on the n = 1, 584, 288 positive cases, here we compute mortality risk given a particular footprint configuration. Also the expected mortality and correlation between variables is obtained. The section is closed with a prediction of death outcomes. For patients that have been identified as SARS-CoV-2 positive in a laboratory test in Mexico, it is expected that approximately • 10 out of 100 patients to die. • 22 out of 100 patients to be hospitalized. • 4 out of 100 patients to be asymptomatic. • 70 out of 100 patients to suffer symptoms such as headache, cough or fever. • 30 out of 100 patients to to suffer difficulty while breathing. • 60 out of 100 patients will not have any comorbidity. • 19 out of 100 patients to suffer from hypertension, 16 from obesity and 15 diabetes. All these summaries were computed via posterior expectations, e.g, E(P (C j |data)) or E(P (D|data)). To identify the comorbidities and symptoms that imply greater mortality risk, we need to obtain the conditional probabilities P (D|C j , data) and P (D|S l , data) that were described in section 4. In Figure 1 we display a graphic of these distributions. First, it is important to note that the colored labels in Figure 1 are ordered following the same order as the conditional distributions over the x axis. Thus, for example, chronic kidney failure, hospitalization, COPD, bluish coloration due to lack of oxygen, etc., are the comorbidities or Fig. 1 . Distribution for the random variables P (D|C j , data), for j = 1 . . . , 9, P (D|S l , data), for l = 1 . . . , 19 and P (D|H, data). symptoms that imply greater mortality risk. Second, in these labels we have included a "-c" for comorbidity and a "-s" for symptom to differentiate each condition. Then, it is easy to identify two groups of comorbidities or symptoms. The first group is formed by the comorbidities or symptoms (including hospitalization) associated with a mortality risk greater than 23%, then those with mortality risk lower than 17%. The symptoms in the group with higher mortality risk are all related to lungs becoming inflamed, while the comorbidities range from chronic kidney failure, COPD, heart disease diabetes to hypertension. It is straightforward to generate this information desegregated by sex and age group. Such cases are displayed in Figure 2 , see also Table 4 in Appendix A, for more detail. For these cases, we compute, e.g. E(P (D|C j , M, A g , data)) for the case of comorbidities, males and age group. If age increases, the mortality risk increases in all cases. This, can clearly be seen via the mortality risk by COVID-19 when the patient has no comorbidities: for patients below 20 years we can expect 1 death out of 100 patients (male or female), but increases substantially and for patients over 60 years old, we expect 30 deaths out of 100 male patients and 21 deaths for female patients. This indicates that age is a key variable that increases the mortality risk dramatically even in patients with no comorbidities. Sex is also a key variable. In most cases males have greater mortality risk given comorbidities or symptoms than female patients. Being hospitalized is also associated with a very high mortality risk. Other interesting findings from Figure 2 (and Table 4 in Appendix A) are the following: • The expected mortality risk (as a percentage) for males and females can be described by a straightforward exponential model of the kind y = ae bx where y is the probability of death and x is the age. Using the four age groups, and thus assuming x = 1, 2, 3 and 4. In the case of males, and using least squares fit, it was obtained thatâ = 1.1 andb = 0.88 while for females we haveâ = 0.91 andb = 0.84. These are the red lines included in Figure 2 . • For cases under 20 years old chronic kidney failure (comorbidity) is associated with the highest mortality risk for males and females, while immunosupression (comorbidity) is the second condition with higher risk for males but for females is hypertension (comorbidity). This difference is interesting since hypertension for males is on 6 th place. • For cases between 20 and 40 years old chronic kidney failure and immunosuppression are the comorbidities with higher mortality risk for males and females. In third place for males is bluish coloration due to lack of oxygen, while for female patient is diabetes. • For cases between 40 and 60 years old chronic kidney failure (comorbidity) is associated with the highest mortality risk for males and females, then for both follows bluish coloration due to lack of oxygen (symptom) and increased respiratory rate and depth (symptom) in second and third places respectively. • For cases above 60 years old, bluish coloration due to lack of oxygen (symptom) is associated with the highest mortality risk for males and females, then for males and females follows increased respiratory rate and depth (symptom), and the fourth is chronic kidney failure (comorbidity). Let us assume that v = 100, 000 people in Mexico were diagnosed as positive for SARS-CoV-2 virus. To appreciate which comorbidities and symptoms have greater impact over COVID-19 mortality, from these v cases, we can approximate the number of people who suffer from comorbidity C j , and this is given by vp(C j |data). Then the approximate the mortality within those with C j would be vp(C j |data)p(D|C j , data) = vp(D, C j |data). Thus, to obtain information about the mortality associated with each comorbidity we need the distribution of p(D, C j |data) and we will have a similar expression for the symptoms. Instead of working with with vp(D, C j |data) we obtain its expectation and this is shown in Figure 3 (assuming v = 100, 000 cases). Hospitalization is clearly associated with the highest mortality, however it is not a comorbidity nor a symptom. Following hospitalization there are seven symptoms, and the one that leads to the a greater mortality is difficulty breathing. The comorbidities that have greater impact over the mortality are hypertension and diabetes, and this is of great concern to health authorities due to the enormous problem of diabetes and hypertension in Mexico. Now we can do the complete exercise to estimate the mortality related to each comorbidity (or symptom) by sex and age group. For simplicity, let Y be the distribution of the mortality related to each comorbidity by sex and age group out of v cases (the expression for symptoms and female cases is similar), then This is again a constant multiplied by a Beta random variable. The corresponding expected values are displayed in Figure 4 (see Table 6 in Appendix A for more details). In the previous section we obtained p(S l |M, A g , data), p(S l |F, A g , data) and its expectation. With this it is straightforward to identify if there are patterns of symptoms for all the cases by sex and age group. In Figure 5 the expected probability of each symptom given sex and age group is displayed via a heat-map. Cough, fever and headache are the most common symptoms, but the probability of having the first two increases marginally with age. Also, the mortality risk associated to difficulty in breathing, chest pain and increased respiratory frequency and depth increases with age. This is for both, male and female patients. On the other side, the probability of having nasal discharge decreases with age. Correlation between comorbidities, symptoms, hospitalization and deaths The multivariate Bernoulli model give us the possibility to obtain the correlation between the comorbidities, deaths and hospitalizations. And this is done using equation (6), but here we need to approximate the posterior mean E Corr(Y j , Y s )|y k,1 , . . . , y k,n = E w j,s − w j w s w j (1 − w j ) w s (1 − w s ) |y k,1 , . . . , y k,n , via classic Monte Carlo, and the idea is described in Appendix C. With the matrix of pairwise correlations a hierarchical clustering is used to obtain groups of variables closely associated to each other. In Figure 6 the correlation plot and the groups are displayed. The correlation plot indicates that there are six groups of variables closely related to each other. The group of the outcomes hospitalizations and deaths includes the comorbidities immunosuppression, COPD, heart disease, chronic kidney failure, diabetes and hypertension. The only symptom that is included in this group is difficulty breathing. Finally, the group age of all those greater than 60 years old is also included. Observe that this coincides with Figure 4 , which indicates that larger mortality rates are associated to the variables in this group. To predict COVID-19 death outcomes using the ideas outlined in Section 4, only six variables are used: sex (s), age group (g), two comorbidities (diabetes j 1 and hypertension j 2 ), one symptom (difficulty breathing l) and information about the hospitalizations (h). These variables have been included as the correlation plot in Figure 6 , indicates that these are those with higher correlation with the death outcomes. It could seem that relationships of higher order are not been considered, however we compared this parsimonious model against the complete model with 31 variables and obtained similar results (not shown). The parameters of the beta distribution are calculated as described in Section 4. This can be achieved efficiently concatenating the values of the six variables for the training set, which creates the corresponding footprint for each combination of sex, age group, comorbidities, symptom and hospitalization. These footprints along with the death outcomes are used to generate a contingency table as shown in Table 5 . Note that instead of using the footprints 1000, 0100, 0010 and 0001 for age group we only use 1, 2, 3, and 4 and this is as there are only four possible combinations for this variable. Then, the footprints for each combination of of sex, age group, symptoms, comorbidities and hospitalization in the prediction set are obtained. If in the prediction set we have, for example, the footprint 130011, the contingency table in Table 5 , can be used to obtain η * = v/2 5 + 11, 164 and β * = v/2 5 + 22, 378, and the posterior mean predictive is given by η * /(η * + β * ) ≈ 0.33. If this is greater than a cutpoint c the predicted outcome would be death, otherwise this footprint implies an alive outcome. If a footprint in the prediction set cannot be found in the training set, instead of considering only the prior, sex and age group are used to compute the posterior mean predictive in these cases. It is important to stress that this model, together with the ideas outlined in this section, can handle large data bases easily. Indeed, a logistic regression was also implemented to have a benchmark model, and the multivariate Bernoulli model needs less CPU time to calculate the predictions. Note that in the case of the logistic model, age has been included as a continuous variable, as it was used without any categorization. To determine the cut-point c we maximize the proportion of death outputs that are correctly identified (true positive rate or TPR) plus the proportion of alive outcomes that are correctly identified (true negative rate or TNR). Thus, the optimal cut-point maximizes the quantity TPR + TNR. This optimization exercise is performed by randomly dividing the total 1, 584, 288 cases randomly into a training set of 1, 484, 288 observations and a prediction set of 100, 000 cases. Then, obtaining the posterior mean predictive for each footprint in the training set and using these to classify the observations with the same footprint in the prediction set. In this last step is where the optimization over c is performed. To avoid any bias from the selection of a particular training set and prediction set, this processes was repeated 100 times to obtain c 1 , . . . , c 100 , and the optimal cut-point is taken as the mean. The library of R cutpointr was used to find the optimal cut-points, see Thiele (2020) . The optimal cut-point for the multivariate Bernoulli and logistic models were c = 0.112 and c = 0.058 respectively. The maximum value for TPR + TNR is of 1.777 and 1.776 for the multivariate Bernoulli and logistic models respectively. The process outlined in the preceding paragraph is similar to that of using a receiver operating characteristic curve (ROC) curve. In this case the quantities (1-TNR, TPR) are plotted for different values of the cut-point, c, and this is the ROC curve. Thus, the optimal c is the one that produces the best possible combination of TNR and TPR. Note however, the values of c are not plotted. It might be seen as a clearer strategy to have a function to optimize and see explicitly how it changes with the change of the cut-point. However, for completion we mention that the average (over the 100 samples) area under the ROC curve (AUC) for the multivariate Bernoulli was found to be equal to 0.935 while for the logistic regression was equal to 0.936. In general terms classifiers with larger AUC are considered better to discriminate between groups, hence according to this criterion there is virtually no difference between the discrimination ability of both models. That said, under our approach a much wider set of summaries and prediction possibilities are available. To test the prediction power of the algorithm we use a 20 fold cross-validation setting: divided the total 1, 584, 288 cases randomly into 20 sets or folds of approximately 79,214 cases each. The first fold is treated as a validation set, and the method is fit on the remaining 19 folds. The confusion matrix is then computed on the observations in the held-out fold, thus generating 19 confusion matrices. Taking the average of these 19 confusion matrices an average confusion matrix C 1 was obtained. This procedure is repeated 20 times; each time, a different group of observations is treated as a validation set. This process results in 20 estimates of the confusion matrix, C 1 , C 2 , ..., C 20 . The 20-fold Cross-Validation estimate of the overall confusion matrix is computed by averaging these 20 matrices. See Tables 2 and 3. Hence, via the multivariate Bernoulli we have (TPR, TNR) = (92.7%, 84.9%), while using a logistic regression (TPR, TNR) = (93.6%, 84.4%) is obtained. Classification-wise the results are comparable, and it is worth noting that there is no difference in performance between a model that uses age as a continuous variable and a model that uses age coded as categorical variable. We have used a multivariate Bernoulli distribution to analyze information about the COVID-19 pandemic in Mexico. In particular we have used the data of patients identified as SARS-CoV-2 positive in a laboratory tests and determined: (1) the comorbidities immunosuppression, COPD, heart disease, chronic kidney failure, diabetes, hypertension and the symptom difficulty breathing are the variables with stronger association with hospitalizations and deaths; (2) the expected mortality risk for males and females follows an exponential increase which is a function of the age of the patient, and the increase rate of the curve for males is steeper than for female patients; (3) the comorbidities associated with greater expected mortality in Mexico are hypertension, diabetes and obesity, and this is mainly because the prevalence of these chronic diseases in Mexico is very high; (4) cough, fever and headache are the most common symptoms, but the probability of having the first two increases marginally with age. Also, the probability of suffer difficulty breathing, chest pain and increased respiratory frequency and depth increases with age; (5) using only two comorbidities (diabetes and hypertension), one symptom (difficulty breathing), information about hospitalizations, the age group and sex of the patients it is possible to predict the mortality of the COVID-19 disease with the following accuracy: 92.7% of death outputs have been correctly classified and 84.9% of alive outputs have been correctly identified. Finally, it is important to mention that via the logistic regression we tried to select a small subset of variables that could describe death outcomes effectively (out of the 31 variables). We used the library of R bigstep (Szulc (2019) ) and then try all the available criteria (AIC, BIC and modifications), and were only able to reduce to 16 variables (this analysis is not included in the paper). Via the multivariate Bernoulli and obtaining the correlation matrix is easy to see that there are six variables closely correlated to the death outcomes. Table 4 . E (P (D|S l , A g , Sex)) and E (P (D|C j , A g , Sex)) in percentages. These tables have been computed using the information for all those patients who tested positive for SARS-CoV-2 in a laboratory test. Let w j,−s = 2 k l=1 1 (ρ k l,j = 1) 1 (ρ k l,s = 1) w k (ρ k l,. ), η j,−s = 2 k l=1 1(ρ k l,j = 1) 1(ρ k l,s = 1) (r l + δ l ) Table 6 . E (P (D|A g , Sex)), E (P (H|A g , Sex)), E (P (C j |A, Sex)) and E (P (S l |A g , Sex)) in percentages. Estimates for Sex = M and F have been included with equivalent definitions for w s,−j and η s,−j . First, it is easy to see that w j,s + w j,−s = 2 k (1 (ρ k l,j = 1) 1 (ρ k l,s = 1) + 1 (ρ k l,j = 1) 1 (ρ k l,s = 1)) w k (ρ k l,. ), = 2 k l=1 1 (ρ k l,s = 1) + 1 (ρ k l,s = 1) 1 (ρ k l,j = 1)w k (ρ k l,. ), = 2 k l=1 1 (ρ k l,j = 1)w k (ρ k l,. ) = w j , and in the same manner η j,s + η j,−s = η j . Distribution of w j,s /w j We have p(w j,s , w j,−s |η j,s , η j,−s , α 0 ) = Dir(w j,s , w j,−s |η j,s , η j,−s , α 0 − η j,s − η j,−s ). Performing the transformation U = w j,s and V = w j,s + w j,−s (observe that V = w j ) the Jacobian is equal to 1 and the density is given by p(u, v) = p wj,s,wj,−s (u, v − u|η j,s , η j,−s , α 0 ), ∝ u ηj,s−1 (v − u) ηj,−s−1 (1 − v) α0−ηj,s−ηj,−s . Here u > 0, v > u and 1 > v, thus this is a valid probability distribution function. Now transforming X = V and Y = U V , then the Jacobian is equal to x and the density is given by p(x, y) = x p U,V (xy, x|η j,s , η j,−s , α 0 ), ∝ x(xy) ηj,s−1 (x − xy) ηj,−s−1 (1 − x) α0−ηj,s−ηj,−s ∝ x ηj,s+ηj,−s−1 (1 − x) α0−ηj,s−ηj,−s y ηj,s−1 (1 − y) ηj,−s−1 Then X ⊥ Y , where X ∼ Beta(x|η j , α 0 − η j ) and Y ∼ Beta(y|η j,s , η j − η j,s ). Note that η j = η j,−s + η j,s . Since Y = wj,s wj , then w j,s w j ∼ Beta w j,s w j |η j,s , η j − η j,s , ⇒ E w j,s w j = η j,s η j . We want to approximate E Corr(Y j , Y s )|y k,1 , . . . , y k,n = E w j,s − w j w s w j (1 − w j ) w s (1 − w s ) |y k,1 , . . . , y k,n . First, note that it is straightforward to generate samples from (w j,s , w j,−s , w s,−j ) ∼ Dir(w j,s , w j,−s , w s,−j |η j,s , η j,−s , η s,−j , α 0 − η j,s − η j,−s − η s,−j ), ∼ Dir(w j,s , w j,−s , w s,−j |η j,s , η j − η j,s , η s − η j,s , α 0 + η j,s − η j − η s ), and with each sample we can compute w j = w j,s + w j,−s , w s = w j,s + w s,−j and then ξ = g(w j,s , w j , w s ) = w j,s − w j w s w j (1 − w j ) w s (1 − w s ) . Thus, generating N samples and computing ξ 1 , . . . , ξ N we can approximate E Corr(Y j , Y s )|y k,1 , . . . , y k,n ≈ 1 N N l=1 ξ l , and this is a usual Monte Carlo approximation. Presymptomatic sars-cov-2 infections and transmission in a skilled nursing facility The effect of age on mortality in patients with covid-19: A meta-analysis with 611,583 subjects Prevalencia, diagnostico y control de hipertension arterial en adultos mexicanos en condicion de vulnerabilidad. resultados de la ensanut 100k Multivariate bernoulli distribution Impact of sex and gender on covid-19 outcomes in europe The incubation period of coronavirus disease 2019 (covid-19) from publicly reported confirmed cases: Estimation and application Sistema de informacion nacional depurado sobre la evolucion de la pandemia COVID-19 SARS-CoV-2 and the COVID-19 disease: a mini review on diagnostic methods Comorbidity and its impact on patients with covid-19 Diabetes in covid-19: Prevalence, pathophysiology, prognosis and practical considerations Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19 disease bigstep: Stepwise Selection for Large Data Sets cutpointr: Determine and Evaluate Optimal Cutpoints in Binary Classification Tasks Coronavirus disease (covid-19) pandemic The impact of copd and smoking history on the severity of covid-19: A systemic review and meta-analysis The first and second authors are grateful for the support of PAPIIT-UNAM projects IA103220, IV100220 and IG100221.