key: cord-0899595-uayj7fes
authors: Lucero, Jorge C.
title: Identification of COVID-19 mortality patterns in Brazil by a functional QR decomposition analysis
date: 2021-03-16
journal: Biophysical Reviews & Letters
DOI: 10.1142/s1793048022500035
sha: b4a47c32fae388111f40b4b6859df3bd5b69fd7e
doc_id: 899595
cord_uid: uayj7fes

The subset selection problem of linear algebra is applied to identify independent patterns of COVID-19 evolution within Brazil. The data consist of a set of mortality curves in states of Brazil. A subset of the most independent curves is selected by using a functional version of the QR matrix decomposition technique with column pivoting. The selected subset is then used as a basis to represent the remaining curves, filtering out any data redundancy. For each independent curve, an associated epidemiological region of influence is defined. The results show two main independent curves with a similar two-peak pattern and a 50-day shift between the patterns. Two main epidemiological regions are then identified: one encompassing most of the country from the center and northeast states to the south, and another one containing the Amazonian region in the northwest.

On March 11, 2020, the World Health Organization declared a worldwide pandemic of the COVID-19 disease caused by the new coronavirus SARS-CoV-2. The disease was identified for the first time in Wuhan, People's Republic of China, in December 2019. As of the present date (January 21, 2022), around 344 million cases have been reported, with 5.6 million deaths [23]. In Brazil, the number of cases has reached 23 million, and over 622 thousand lives have been lost [12].

Throughout the world, health authorities have implemented vaccination campaigns and a number of measures enforcing adequate hygiene and social distancing [13, 22]. Naturally, the response to those campaigns and measures depends on demographic characteristics, compliance of the population, timing, and emergence of mutations of the virus [6, 17]. Thus, data-driven models of the pandemic propagation constitute a useful tool to characterize and analyze underlying patterns, assess the effectiveness of implemented policies, and forecast the evolution of the pandemic; a number of such models have been proposed [9].

Here, we consider a modeling approach based on the QR decomposition technique of linear algebra [7], in order to identify regions with independent patterns of COVID-19 evolution within Brazil. The QR decomposition is a matrix factorization technique that provides a simple and numerically robust solution to the so-called "subset selection problem". In that problem, a set of n observation vectors is given and a subset of the k most independent ones is sought. The subset may then be used as a basis to represent the n − k remaining vectors, filtering out any data redundancy. This process has some similarities to the well-known technique of principal component analysis (PCA), in the sense that it achieves a reduction of the dimensionality of the data. However, instead of expressing the data in terms of transformations of the data, it does so in terms of a set of the most nonredundant observation vectors, and therefore the results tend to have an easier interpretation [3].

In a previous study [11] , the QR decomposition was applied to identify kinematic regions of the face that follow independent motion patterns during speech. The study argued that, whereas PCA could be used to extract facial gestures (i.e., temporal patterns of motion), the QR decomposition approach was more adequate to express the motion of the face in terms of eigenregions which acted as independent biomechanical units. The present study has a similar purpose in the sense that it intends to build a spatio-temporal model in terms of regions of independent behavior. Therefore, the same modeling strategy of the previous facial study will be followed, except that a functional extension of the QR decomposition will be considered.

The proposed extension fits within a functional data analysis (FDA) context [16], in which data are expressed as sets of curves instead of discrete numerical values as in traditional statistics. Techniques of FDA have been successfully applied to a variety of problems in biomedicine and public health [20]. In recent work, functional principal components analysis (fPCA) combined with functional clustering was used to identify patterns of COVID-19 incidence and mortality across countries [1, 10]. Further, variations of subset selection problems in functional contexts have also been addressed recently, such as regression analysis with a scalar response and a functional predictor [8], dimension reduction of a functional predictor for a categorical variable [19], and others [3, 5]. Thus, the present study has the secondary goal of introducing the functional extension of the QR decomposition as an addition to the set of available FDA tools.

The evolution of the pandemic is assessed in terms of mortality rates (i.e., death counts per day), which provide a more reliable measure than infection rates [21]. Official data of COVID-19 were obtained from a repository at the Ministry of Health of Brazil [12], accessed on January 21, 2022. The data consist of records of death counts per day since February 25, 2020, in Brazil's 27 federative units (26 states and a Federal District). For simplicity, the federative units will be called "states" throughout the analysis.

For each state, the period from the first confirmed death was extracted, and all extracted records were cut to the length of the shortest one (646 days). Then, the records were normalized to population size of each state and expressed in deaths per million individuals,

$$x_{ij} = \frac{\text{Number of deaths at day } j}{\text{Population size}} \times 10^6, \tag{1}$$

for $i = 1, 2, \ldots, 27$ and $j = 1, 2, \ldots, 646$.

A few isolated negative mortality values were detected in the records, and those were removed by averaging them with nearby data points, as follows: if $x_{ij} < 0$, then

$$x_{ik} \leftarrow \frac{x_{i,j-1} + x_{ij} + x_{i,j+1}}{3}, \tag{2}$$

for $k = j - 1, j, j + 1$.

In addition, a square root transformation $y_{ij} = \sqrt{x_{ij}}$ was applied to the data. The transformation compresses the dynamic range of the data, which prevents the occurrence of negative values of death rates when reconstructing the data from the selected subset [14]. A logarithmic transformation has the same effect and was also tested, but it tended to produce larger errors.
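The preprocessing steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the function name `preprocess` is hypothetical, the averaging step is read here as replacing a negative value and its two neighbours by their three-point mean, and the clipping of any negative value left at the ends of a record is an added assumption.

```python
import math

def preprocess(deaths, population):
    """Daily death counts -> square root of deaths per million (illustrative sketch)."""
    # Eq. (1): normalize the counts to deaths per million inhabitants.
    x = [d / population * 1e6 for d in deaths]
    # Assumed reading of the averaging step: a negative value and its two
    # neighbours are all replaced by their three-point mean.
    for j in range(1, len(x) - 1):
        if x[j] < 0:
            mean = (x[j - 1] + x[j] + x[j + 1]) / 3.0
            x[j - 1] = x[j] = x[j + 1] = mean
    # Square root transform; clipping at zero guards against a negative value
    # at either end of the record (an assumption, not taken from the text).
    return [math.sqrt(max(v, 0.0)) for v in x]

# Toy record with one negative correction, for a population of one million.
y = preprocess([3, -3, 6, 4], 1_000_000)
```

Replacing all three points by their mean keeps the total number of deaths in the window unchanged, which is one reason this reading of the correction seems plausible.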

The first step of the analysis is to put the discrete data into functional form [16]. For each state $i$, the existence of a smooth non-negative real function $f_i(t)$ is assumed, such that

$$y_{ij} = f_i(t_j) + \varepsilon_{ij}, \tag{3}$$

where $t_j$ is the time at the end of day $j$ (with $t_1 = 0$), and $\varepsilon_{ij}$ is an observational error or noise term. Each mortality function $f_i$ is defined over the domain $t \in [0, T]$, with $T = 645$ days, and is expressed in a basis expansion form

$$f_i(t) = \sum_{k=1}^{K} c_{ik}\, g_k(t), \tag{4}$$

where $g_k(t)$, $k = 1, 2, \ldots, K$, is a set of basis functions and $c_{ik}$ are the expansion coefficients. The expansion coefficients are computed by minimizing the cost function

$$J_i = \sum_{j} \left[ y_{ij} - f_i(t_j) \right]^2 + \lambda \int_0^T \left[ D^2 f_i(t) \right]^2 dt, \tag{5}$$

where $\lambda$ is a roughness penalty coefficient and $D^2$ denotes the second order derivative. For the basis in Eq. 4, a truncated Fourier cosine series [4] was adopted, i.e.,

$$g_1(t) = \frac{1}{\sqrt{T}}, \qquad g_k(t) = \sqrt{\frac{2}{T}} \cos\frac{k \pi t}{T}, \quad k = 2, 3, \ldots, K. \tag{6}$$

This basis was chosen because of its stability, ease of computation, and orthonormality on the interval $[0, T]$, which facilitates the QR decomposition. A basis size of $K = 20$ was selected by visual inspection of the results. Further, the optimal roughness penalty coefficient $\lambda$ was determined by minimizing the sum of the generalized cross validation (GCV) measure over all functions $f_i$ [16], which produced $\lambda = 10$. Fig. 1 shows all data in functional form and one example comparing the functional form to the original discrete data. The resultant functions are visually smooth and approximate the original data well, without weekly or short-term fluctuations.
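A penalized least squares fit of this kind can be sketched with NumPy as follows. The sketch is illustrative and makes several assumptions that are not taken from the text: the function names are hypothetical, the basis is indexed so that it contains the constant term and the cosine harmonics of order 1 to K − 1, the GCV score uses the common form n·SSE/(n − trace H)², and the data are synthetic.

```python
import numpy as np

def cosine_basis(t, T, K):
    """Evaluate K orthonormal cosine functions on [0, T] at the times t."""
    t = np.asarray(t, dtype=float)
    k = np.arange(K)                        # harmonics 0 .. K-1 (assumed indexing)
    G = np.sqrt(2.0 / T) * np.cos(np.outer(t, k) * np.pi / T)
    G[:, 0] = 1.0 / np.sqrt(T)              # constant term
    return G

def smoother(t, T, K=20, lam=10.0):
    """Return G and M such that c = M @ y minimizes the penalized criterion."""
    G = cosine_basis(t, T, K)
    # For this basis the roughness penalty is diagonal: the integral of
    # (D^2 g_k)^2 over [0, T] equals (k*pi/T)^4 for harmonic k.
    omega = (np.arange(K) * np.pi / T) ** 4
    M = np.linalg.solve(G.T @ G + lam * np.diag(omega), G.T)
    return G, M

def gcv(y, G, M):
    """Generalized cross validation score, n * SSE / (n - trace(H))^2."""
    H = G @ M                               # hat matrix mapping data to fitted values
    resid = y - H @ y
    n = y.size
    return n * (resid @ resid) / (n - np.trace(H)) ** 2

# Usage on synthetic data (not the records analyzed in the paper).
t = np.arange(646.0)                        # t_1 = 0, one sample per day
T = t[-1]
rng = np.random.default_rng(0)
y = 3.0 + np.sin(2 * np.pi * t / 300.0) + 0.3 * rng.standard_normal(t.size)
scores = {lam: gcv(y, *smoother(t, T, 20, lam)) for lam in (0.1, 10.0, 1000.0)}
G, M = smoother(t, T, 20, 10.0)
c = M @ y                                   # expansion coefficients
f = G @ c                                   # smoothed curve at the sample times
```

In an analysis with several curves, the GCV scores of all curves would be summed for each candidate value of the penalty coefficient, and the value with the smallest sum retained.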

In the so-called subset selection problem of linear algebra, a data matrix $A \in \mathbb{R}^{m \times n}$ and an observation vector $b \in \mathbb{R}^{m \times 1}$ are given, with $m \ge n$, and a predictor vector $x$ is sought in the least squares sense; i.e., a minimizer of $\|Ax - b\|_2^2$ [7]. However, instead of using the whole data matrix $A$ to predict $b$, only a subset of its columns is used so as to filter out any data redundancy. This problem may be solved by the QR decomposition with column pivoting. The decomposition expresses $A$ in the form $AP = QR$, where $P \in \mathbb{R}^{n \times n}$ is a column permutation matrix, $Q \in \mathbb{R}^{m \times m}$ is an orthogonal matrix, and $R \in \mathbb{R}^{m \times n}$ is an upper triangular matrix with positive diagonal elements. A simplified variant is the "thin" version, in which $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.

The first column of $AP$ is the column of $A$ that has the largest 2-norm, and the $k$th column of $AP$ ($k > 1$) is the column of $A$ with the largest component in a direction orthogonal to the directions of the first $k − 1$ columns. Thus, the algorithm reorders the columns of $A$ so as to make its first columns as well conditioned as possible. The first columns of $AP$ may then be adopted as the sought subset of least dependent columns. The diagonal elements of $R$ ($r_{ii}$), also called the "R values", measure the size of the orthogonal components, and they appear in decreasing order for $i = 1, \ldots, n$.
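The pivoting rule just described can be illustrated with a short NumPy sketch. It uses a modified Gram-Schmidt variant, chosen here only for readability; library routines (for example `scipy.linalg.qr` with `pivoting=True`) rely on Householder reflections instead. The function name `qr_pivot` and the toy matrix are hypothetical.

```python
import numpy as np

def qr_pivot(A):
    """Thin QR with column pivoting (modified Gram-Schmidt): A[:, perm] = Q @ R."""
    A = np.array(A, dtype=float)            # working copy, overwritten by residuals
    m, n = A.shape
    r = min(m, n)
    Q = np.zeros((m, r))
    R = np.zeros((r, n))
    perm = np.arange(n)
    for k in range(r):
        # Pivot: bring forward the remaining column with the largest residual norm.
        p = k + int(np.argmax(np.linalg.norm(A[:, k:], axis=0)))
        A[:, [k, p]] = A[:, [p, k]]
        R[:, [k, p]] = R[:, [p, k]]
        perm[[k, p]] = perm[[p, k]]
        R[k, k] = np.linalg.norm(A[:, k])
        if R[k, k] == 0.0:
            break                           # remaining columns are fully dependent
        Q[:, k] = A[:, k] / R[k, k]
        # Remove the new direction from the remaining columns.
        R[k, k + 1:] = Q[:, k] @ A[:, k + 1:]
        A[:, k + 1:] -= np.outer(Q[:, k], R[k, k + 1:])
    return Q, R, perm

# Toy example: column 2 is twice column 0, so the two are redundant.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 0.0]])
Q, R, perm = qr_pivot(A)
print(perm)                                 # column order chosen by the pivoting
print(np.diag(R))                           # R values, in non-increasing order
```

In the toy matrix the last R value is zero because columns 0 and 2 are parallel: once one of them has entered the subset, the other has no component left in an orthogonal direction.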

The decomposition may be extended to the functional case as follows. First, the data set of $n$ functions $f_i(t)$ is expressed as $A = [f_1, f_2, \ldots, f_n]$. From Eq. (4), we have

$$A = GC, \tag{7}$$

where $G = [g_1, g_2, \ldots, g_K]$ and $C$ is a $K \times n$ matrix of coefficients $c_{ik}$. Letting $AP = QR$ with $Q = GB$, where the orthonormality of the basis implies that $B$ has orthonormal columns, we obtain

$$CP = BR, \tag{8}$$

which represents the standard (discrete) QR decomposition of matrix $C$, and may be computed using available algorithms of matrix algebra. Once a suitable number $k$ of independent mortality functions has been chosen, the data set is approximated as the linear combination of the first $k$ functions, with $A \approx GC'$, and

$$C' = B_{Kk} R_{kn} P^T, \tag{9}$$

where $B_{Kk}$ and $R_{kn}$ are formed by the first $k$ columns of $B$ and the first $k$ rows of $R$, respectively. Finally, the region of influence of each independent mortality function is determined by the size of the elements of $C'$; i.e., element $c'_{ij}$ measures the relative effect of function $i$ over state or country $j$. Fig. 2 shows the whole set of R values for the data. The R values decrease as the number of selected functions increases, and their distribution suggests two main independent mortality functions [18].
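The functional procedure reduces to operations on the coefficient matrix, which may be sketched as below. This is an illustration under stated assumptions rather than the study's implementation: `qr_pivot` is the same Gram-Schmidt sketch of column pivoting (hypothetical name), `functional_subset` is a hypothetical helper, and the weight matrix `W`, which expresses every curve as a combination of the k selected curves, is one possible way of quantifying the effect of each selected function on each state.

```python
import numpy as np

def qr_pivot(C):
    """Thin QR with column pivoting (modified Gram-Schmidt): C[:, perm] = B @ R."""
    A = np.array(C, dtype=float)
    m, n = A.shape
    r = min(m, n)
    B, R, perm = np.zeros((m, r)), np.zeros((r, n)), np.arange(n)
    for k in range(r):
        p = k + int(np.argmax(np.linalg.norm(A[:, k:], axis=0)))
        A[:, [k, p]] = A[:, [p, k]]
        R[:, [k, p]] = R[:, [p, k]]
        perm[[k, p]] = perm[[p, k]]
        R[k, k] = np.linalg.norm(A[:, k])
        if R[k, k] == 0.0:
            break
        B[:, k] = A[:, k] / R[k, k]
        R[k, k + 1:] = B[:, k] @ A[:, k + 1:]
        A[:, k + 1:] -= np.outer(B[:, k], R[k, k + 1:])
    return B, R, perm

def functional_subset(C, k):
    """Select k curves from the K x n coefficient matrix C and fit the rest to them."""
    C = np.asarray(C, dtype=float)
    B, R, perm = qr_pivot(C)
    C_perm = B[:, :k] @ R[:k, :]            # rank-k approximation of C[:, perm]
    C_approx = np.empty_like(C_perm)
    C_approx[:, perm] = C_perm              # undo the column permutation
    # Weights of the k selected curves in each curve (assumed influence measure);
    # requires the first k R values to be nonzero.
    W = np.empty((k, C.shape[1]))
    W[:, perm] = np.linalg.solve(R[:k, :k], R[:k, :])
    return perm[:k], C_approx, W, np.diag(R)

# Usage with random coefficients (K = 20 basis functions, n = 27 curves).
rng = np.random.default_rng(0)
C = rng.standard_normal((20, 27))
selected, C_approx, W, r_values = functional_subset(C, k=2)
region = np.argmax(np.abs(W), axis=0)       # dominant selected curve for each state
```

Because the basis is orthonormal, the Euclidean norm of a column of `C` equals the norm of the corresponding function on [0, T], so pivoting on the columns of `C` is equivalent to pivoting on the functions themselves.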

The two main mortality functions correspond to the states of Mato Grosso (MT) in west-central Brazil, and Amazonas (AM) at the northwest, and they are plotted in Fig. 3 . Fig. 4 shows the respective epidemiological regions that result from fitting the remaining states to the two main ones, as explained in Section 3. The first region encompasses most of the country from the center and northeast to the south, whereas the second one contains the Amazonian region at the northwest.

Both curves in Fig. 3 have a similar two-peak pattern, with a 50-day shift between them. The first peaks correspond to the initial wave of the pandemic, and they occur in mid-April 2020 in Amazonas and at the beginning of June in Mato Grosso. The earlier occurrence in Amazonas may be a consequence of its international borders with Peru, Colombia and Venezuela, and the flow of people across them [2]. Other contributing factors may have been its high percentage of indigenous population, which is more susceptible to contagious diseases, as well as its less developed public health care system [15]. The second peaks occur at the beginning of 2021 in Amazonas and at the end of February in Mato Grosso. The timing matches the appearance of the new lineage P.1 of the SARS-CoV-2 virus, which had a higher transmissibility than previous lineages and was first detected in Manaus (capital city of Amazonas) [17].

This letter has introduced a simple functional extension of the QR decomposition technique of linear algebra, and shown its application to identify independent patterns of COVID-19 evolution in Brazil. Each pattern defines an epidemiological region, and the overall evolution of the pandemic in the country may be modeled (in the square root domain) as a linear combination of the behavior of those regions. Naturally, the accuracy of the model depends on the number of independent patterns considered. Only the first two mortality patterns were discussed here for a general qualitative view; however, a larger number should be included if a more precise representation is desired.

The functional expansion of the data adopted an orthogonal basis to facilitate the computation of the QR decomposition. Nevertheless, further development of the decomposition algorithm to allow for the use of non-orthogonal basis systems, such as the widely used B-splines, would be desirable as a next step.

Deep impact of COVID-19 in the healthcare of Latin America: The case of Brazil

A partial overview of the theory of statistics with functional data

Fourier series and orthogonal functions

Feature selection for functional data

Evaluation of the effect of different policies in the containment of epidemic spreads for the COVID-19 case

Matrix Computations, 3rd edn

Functional linear regression that's interpretable. The Annals of Statistics

Data-driven modeling of COVID-19 -Lessons learned

Prevention-versus promotion-focus regulatory efforts on the disease incidence and mortality of COVID-19: A multinational diffusion study using functional data analysis

Analysis of facial motion patterns during speech using a matrix factorization algorithm

COVID 19 -Painel Coronavírus

A global database of COVID-19 vaccinations

Principal components of vocal-tract area functions and inversion of vowels by linear regression of cepstrum coefficients

COVID-19 in the indigenous population of Brazil

Functional Data Analysis

Resurgence of COVID-19 in Manaus, Brazil, despite high seroprevalence

Rule base reduction: some comments on the use of orthogonal transforms

Interpretable dimension reduction for classifying functional data

Applications of functional data analysis: A systematic review

Modelling fatality curves of COVID-19 and the effectiveness of intervention strategies

Isolation, quarantine, social distancing and community containment: pivotal role for old-style public health measures in the novel coronavirus (2019-nCoV) outbreak

COVID-19 Coronavirus Pandemic

This work was supported by the Committee of Research, Innovation, and Extension to Combat COVID-19 (COPEI) of the University of Brasília.