key: cord-0939041-r4vf80m7
authors: Mollalo, Abolfazl; Vahedi, Behrooz; Bhattarai, Shreejana; Hopkins, Laura C.; Banik, Swagata; Vahedi, Behzad
title: Predicting the Hotspots of Age-Adjusted Mortality Rates of Lower Respiratory Infection across the Continental United States: Integration of GIS, Spatial Statistics and Machine Learning Algorithms
date: 2020-08-22
journal: Int J Med Inform
DOI: 10.1016/j.ijmedinf.2020.104248
sha: 48eb808ff55a2c47ded100c7204b65284d5dd67c
doc_id: 939041
cord_uid: r4vf80m7

OBJECTIVE: Although lower respiratory infections (LRI) are among the leading causes of mortality in the United States, their association with underlying factors and geographic variation have not been adequately examined. METHODS: In this study, explanatory variables (n = 46) including climatic, topographic, socio-economic, and demographic factors were compiled at the county level across the continental US. Machine learning algorithms - logistic regression (LR), random forest (RF), gradient boosting decision trees (GBDT), k-nearest neighbors (KNN), and support vector machine (SVM) - were used to predict the presence/absence of hotspots (P < 0.05) for elevated age-adjusted LRI mortality rates in a geographic information system framework. RESULTS: Overall, there was a historical shift in hotspots away from the western United States into the southeastern parts of the country and they were highly localized in a few counties. The two decision tree methods (RF and GBDT) outperformed the other algorithms (accuracies: 0.92; F1-scores: 0.85 and 0.84; area under the precision-recall curve: 0.84 and 0.83, respectively). Moreover, the results of the RF and GBDT indicated that higher spring minimum temperature, increased winter precipitation, and higher annual median household income were among the most substantial factors in predicting the hotspots. CONCLUSIONS: This study helps raise awareness of public health decision-makers to develop and target LRI prevention programs.

Lower respiratory infections (LRI) are diseases of the lower respiratory tract and include bronchitis, bronchiolitis, pneumonia, and recently emerged coronavirus. LRI are major public health concerns across the world (Dasaraju and Liu, 1996; Mollalo et al., 2020a) and are among the leading causes of mortality and morbidity in children and adults (Rahmanian Malosh et al., 2018). In 2016, LRI caused nearly 2.38 million deaths worldwide, including 652,572 children under five years old and 1,080,958 adults over 70 years old, making it the sixth leading cause of death for all ages (Troeger et al., 2018).

LRI are the cause of a significant number of hospitalizations in developed countries (Torzillo et al., 1999). In the US, LRI have been classified as the 7th leading cause of death and years of life lost (Murray et al., 2018). In this country, bronchiolitis is the leading diagnosis of LRI in children younger than two years old, causing almost 150,000 annual hospitalizations (Hasegawa et al., 2013). Similarly, pneumonia is another common reason for hospital admissions in the US and is the most common severe bacterial infection in children (Huang et al., 2011). However, with the success of childhood vaccination programs such as the 7-valent and 13-valent pneumococcal conjugate vaccines, the proportion of elderly affected by LRI in the US has significantly declined (Walter & Wunderink, 2017).

Previous studies have shown that many socioeconomic factors such as education level, income, and poverty (Sonego et al., 2015) and environmental factors such as climate and air pollution (Lapena et al., 2005; Mirsaeidi et al., 2016) were significantly associated with LRI prevalence. Further, demographic factors such as age, gender, and race (Wang et al., 2016) and behavioral factors such as cigarette smoking (McEvoy & Spindel, 2017) were correlated with LRI prevalence. Few studies have examined the spatial variation of LRI in small geographic regions. For example, Beamer et al. (2016) identified distinct patterns of significant spatial clusters for each LRI phenotype within Tucson, Arizona. Those clusters were associated with various community-level risk factors such as increased air pollution, poor housing conditions, and low socioeconomic status. Beck et al. (2015) conducted a study in Cincinnati, Ohio, to examine geographic variation of LRI hospitalization rates across Hamilton County using the Getis-Ord Gi* statistic. They also examined whether such variation was correlated with socioeconomic status using the non-parametric Kruskal-Wallis test. The results indicated significant differences in the median hospitalization rates by census tract quintile for both bronchiolitis and pneumonia. Further, socioeconomic conditions had substantial influences on those hospitalization rates, and hotspots were located in the impoverished neighborhoods in the urban core.

In recent decades, the use of novel modeling techniques such as machine learning algorithms has increased in public health studies, in particular in respiratory disease research (Reid et al., 2016). For instance, Heckerling et al. (2004) trained a back-propagation artificial neural network (ANN) optimized by a genetic algorithm to predict pneumonia among patients (n=1044) with respiratory complaints from the University of Illinois and the University of Nebraska. A multitude of variables, such as demographics, symptoms, signs, and comorbidity with other respiratory diseases, including asthma and lung disease, were compiled to predict the presence or absence of pneumonia among the patients. The ANN model successfully predicted pneumonia on the test dataset with 93% accuracy. In a case-control study in Taiwan, Kuo et al. (2019) compared the performance of seven machine learning classifiers, including random forest and logistic regression, to predict hospital-acquired pneumonia among schizophrenic patients. Among the employed algorithms, random forest had the highest accuracy (93%) in predicting pneumonia. Further, the significant predictors were clozapine use, clozapine prescription, and prescription duration.

While several studies have been conducted in smaller geographic regions, to our knowledge, no previous nationwide study has examined geographic variations of LRI mortality rates and their association with underlying factors across the US. Identifying hotspots of LRI mortality rates (i.e., counties with higher than expected mortality) and predicting their presence from population-level underlying factors can help public health decision makers target interventions at the national level. Thus, in this ecological study, we investigated the geographic variation of age-adjusted LRI mortality rates across the continental US from 1980 to 2014 using spatial statistics. Further, we employed several machine learning algorithms to predict hotspot occurrence from potential risk factors in a geographic information system (GIS) framework.

Continental US age-adjusted mortality rates of LRI were obtained at the county level from the Global Health Data Exchange (http://ghdx.healthdata.org/record/ihme-data/united-statesmortality-rates-county-1980-2014). The data were available for eight years: 1980, 1985, 1990, 1995, 2000, 2005, 2010, and 2014. The disease data were then spatialized at the county level in ArcGIS 10.7 (ESRI, Redlands, CA). The ESRI shapefile of the administrative boundaries of US counties was obtained from the US Census Bureau's Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line products for the year 2018 (http://www.census.gov/). Explanatory variables (n=46) including climatic, topographic, socio-economic, and demographic factors were compiled at the county level across the continental US and stored in a file geodatabase in ArcGIS 10.7. The variables were selected according to either the previously published literature or domain knowledge.

Both low and high air temperatures can aggravate symptoms, particularly among individuals with preexisting respiratory diseases. Low air temperature can adversely affect the epithelium by narrowing the respiratory airways and reducing lung function. In contrast, high air temperature can increase allergic illnesses for several reasons, such as increased pollen production or a longer pollen season, which in turn can make respiratory symptoms worse. Increased precipitation may facilitate the spread of respiratory diseases. Vitamin D, which is produced by sunlight exposure, may protect the human body against respiratory diseases. Climate data including daily air temperature (°C), daily precipitation (mm), and daily sunlight (kJ/m²) were obtained from the Centers for Disease Control and Prevention Wide-Ranging Online Data for Epidemiologic Research (CDC WONDER) database (http://wonder.cdc.gov/), and were aggregated for the spring (March 19-June 20), summer (June 20-September 22), autumn (September 22-December 21) and winter (December 21-March 20) seasons (i.e., seasonal minimum and maximum temperature, seasonal average precipitation, and seasonal average sunlight).
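The seasonal aggregation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the processing actually used in the study: the function names, the record layout, and the toy values are hypothetical, the season boundaries are taken from the date ranges in the text, and two assumptions are made explicit in the comments (a boundary day is assigned to the season that starts on it, and the seasonal minimum and maximum are taken over the daily values).

```python
from datetime import date

# Season start dates (month, day) as listed in the text. Where the published
# ranges touch, a day is assigned to the season that has just started
# (an assumption of this sketch).
SEASON_STARTS = [
    ((3, 19), "spring"),
    ((6, 20), "summer"),
    ((9, 22), "autumn"),
    ((12, 21), "winter"),
]


def season_of(day):
    """Return the season label for a datetime.date."""
    label = "winter"  # January 1 to March 18 belongs to the winter begun in December
    for start, name in SEASON_STARTS:
        if (day.month, day.day) >= start:
            label = name
    return label


def seasonal_summary(records):
    """Aggregate (date, temp_c, precip_mm) tuples into per-season statistics."""
    grouped = {}
    for day, temp_c, precip_mm in records:
        grouped.setdefault(season_of(day), []).append((temp_c, precip_mm))
    return {
        season: {
            "min_temp": min(t for t, _ in rows),      # assumed: minimum of daily values
            "max_temp": max(t for t, _ in rows),      # assumed: maximum of daily values
            "mean_precip": sum(p for _, p in rows) / len(rows),
        }
        for season, rows in grouped.items()
    }


# Hypothetical daily records for one county (values are made up).
toy_records = [
    (date(2014, 1, 15), -3.0, 2.0),
    (date(2014, 3, 25), 8.0, 4.0),
    (date(2014, 4, 10), 12.0, 0.0),
    (date(2014, 7, 4), 30.0, 1.0),
]
toy_summary = seasonal_summary(toy_records)
```

The same grouping would apply to the sunlight and PM2.5 series, with the mean taken per season and county.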

Fine particulate matter (PM2.5), which may contain soot, smoke, and dust, can get deep into human lungs and enter the bloodstream. According to Bowe et al. (2019), exposure to high levels of PM2.5 is associated with almost 200,000 deaths in the United States. Moreover, cigarette smoking can damage human airways and the small air sacs in the lungs. Daily PM2.5 air quality data were obtained from the CDC WONDER database. The mean values of PM2.5 for the four seasons were computed for each county. Also, the data pertaining to cigarette smoking prevalence in the US for men and women were obtained from Dwyer-Lindgren et al. (2014).

Respiratory infections are more complicated in infants and children living at high altitudes. During acute LRI, hypoxemia occurs more frequently in children at high altitudes, which may result in increased mortality (Niermeyer et al., 2009). Therefore, the topographic data (i.e., median altitude and slope) of US counties were also incorporated as explanatory variables. Shuttle Radar Topography Mission (SRTM) altitude data with 30 m spatial resolution were obtained from the National Map website (http://nationalmap.gov/). The altitude and slope values for counties were then quantified using zonal statistics in the ArcGIS Spatial Analyst extension.

Lower socio-economic status can be associated with unequal access to health care, which in turn can lead to elevated disease mortality. A broad range of socioeconomic and demographic variables, including the proportion of the white and black population, median household income, poverty, unemployment rate, (lack of) health insurance, and the number of physicians per county, were obtained from the US Census Bureau's American FactFinder (https://factfinder.census.gov/) and included in the geodatabase. All data used in this study are publicly available from the above sources.

The spatial pattern of age-adjusted LRI mortality rates (i.e., clustered, dispersed, or random) across the continental US was examined with global and local indices of spatial autocorrelation for each of the eight study years. Moran's I and Getis-Ord General G were employed to investigate the extent to which nearby counties had similar LRI rates. Moran's I is calculated using the following formula:

$$I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, z_i z_j}{\sum_{i=1}^{n} z_i^{2}} \qquad (1)$$

where $z_i$ and $z_j$ are the deviations of LRI mortality rates from the average mortality rate for county $i$ and county $j$, respectively; $w_{ij}$ is a binary weight matrix between county $i$ and county $j$ based on the first-order Queen contiguity (i.e., each element in the weight matrix is non-zero when the counties share borders of non-zero length); and $n$ is the aggregate number of counties. The value of $I$ ranges between -1 (negative spatial autocorrelation) and +1 (positive spatial autocorrelation), while values close to 0 indicate no spatial autocorrelation (Mollalo et al., 2014).
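As a worked illustration of the global Moran's I statistic, the following minimal sketch computes it in plain Python for a hypothetical set of four "counties" arranged in a row. The function name, the toy contiguity matrix, and the rates are all made up for illustration; the study itself computed the statistic in ArcGIS.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square binary weight matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]              # deviations from the mean
    s0 = sum(sum(row) for row in weights)       # sum of all weights
    cross = sum(
        weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n)
    )
    return (n / s0) * cross / sum(zi * zi for zi in z)


# Hypothetical toy example: four units on a line, each adjacent to its neighbors.
toy_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered_rates = [10.0, 12.0, 20.0, 22.0]     # similar values sit side by side
alternating_rates = [10.0, 20.0, 10.0, 20.0]   # neighbors always differ

i_clustered = morans_i(clustered_rates, toy_weights)      # positive value
i_alternating = morans_i(alternating_rates, toy_weights)  # negative value
```

In this toy setting, the clustered arrangement gives a positive index and the alternating arrangement gives a negative one, which matches the interpretation of the sign given above.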

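Using the same notation as for Eq. (1), with $x_i$ and $x_j$ denoting the LRI mortality rates of county $i$ and county $j$, Getis-Ord General G is computed as:

$$G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, x_i x_j}{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j}, \qquad \forall\, j \neq i$$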

A significant value of G indicates spatial clustering of LRI mortality rates. Both Moran's I and Getis-Ord General G statistics were calculated in ArcGIS 10.7.
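For illustration, the General G statistic in its standard form (the ratio of weighted to unweighted cross-products of the raw rates over all pairs with $j \neq i$) can be sketched as follows. The function name and toy data are hypothetical, and the significance test (z-score and p-value) that GIS software reports alongside G is not reproduced here.

```python
def general_g(values, weights):
    """Getis-Ord General G: weighted over unweighted cross-products, j != i."""
    n = len(values)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    numerator = sum(weights[i][j] * values[i] * values[j] for i, j in pairs)
    denominator = sum(values[i] * values[j] for i, j in pairs)
    return numerator / denominator


# Hypothetical toy example: four units on a line with binary contiguity weights.
toy_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
toy_rates = [10.0, 12.0, 20.0, 22.0]  # made-up mortality rates
toy_g = general_g(toy_rates, toy_weights)
```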

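Local measures of spatial autocorrelation such as Getis-Ord Gi* were also applied to locate the identified spatial autocorrelations of LRI mortality rates (P<0.05) as follows (Grubesic et al., 2014; Aldstadt, 2010):

$$G_i^{*} = \frac{\sum_{j=1}^{n} w_{ij}\, x_j - \bar{X}\sum_{j=1}^{n} w_{ij}}{S\sqrt{\dfrac{n\sum_{j=1}^{n} w_{ij}^{2} - \left(\sum_{j=1}^{n} w_{ij}\right)^{2}}{n-1}}}$$

where $x_j$ is the LRI mortality rate of county $j$, and $\bar{X}$ and $S$ are the mean and standard deviation of the mortality rates across the $n$ counties.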

A high positive and a high negative value of $G_i^{*}$ imply a hotspot and a coldspot, respectively. However, the focus of this study is on mapping and analyzing the identified hotspots of LRI mortality rates for further modeling. More detailed information about the clustering and hotspot detection techniques has been published elsewhere (Mollalo et al., 2015).
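The local statistic can be illustrated with a short sketch that uses the commonly used z-score form of Gi*, in which each unit's neighborhood includes the unit itself. This is an assumption about the exact variant; the function name, the toy weights, and the rates are hypothetical, and the 1.96 cut-off simply corresponds to a two-sided P<0.05 without any multiple-testing correction.

```python
import math


def getis_ord_gi_star(values, weights):
    """Return one Gi* z-score per unit; weights[i][i] is expected to be 1."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        row = weights[i]
        sum_w = sum(row)
        sum_w2 = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Hypothetical toy example: five units on a line; each row flags the unit itself
# and its immediate neighbors.
toy_weights = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
toy_rates = [1.0, 1.0, 1.0, 9.0, 9.0]  # made-up rates with a high-valued end

toy_scores = getis_ord_gi_star(toy_rates, toy_weights)
toy_hotspots = [z > 1.96 for z in toy_scores]    # high values cluster here
toy_coldspots = [z < -1.96 for z in toy_scores]  # low values cluster here
```

In this toy row only the last unit is flagged as a hotspot and only the second unit as a coldspot; presence or absence flags of this kind are what the classifiers in the next section are trained to predict.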

Five different machine learning classifiers were employed to identify hotspot locations (P<0.05) of the LRI age-adjusted mortality rates for the year 2014, because the explanatory data were not available for the earlier years of the study period. The classifiers were vanilla logistic regression (LR), random forest (RF), gradient boosting decision trees (GBDT), k-nearest neighbors (KNN), and support vector machine (SVM). These classifiers were selected due to their successful performance in identifying intricate patterns in many binary classification applications (Naghibi et al., 2017; Than Noi et al., 2018). The scikit-learn Python package was used to develop the classifiers.

LR, a linear function for binary classification, applies maximum likelihood estimation to minimize the errors after transforming the presence or absence of LRI hotspots into a logit variable (Bailey et al., 2003). The output of LR is the likelihood of LRI hotspot occurrence as a function of several explanatory variables and can be expressed as:

$$p = \frac{1}{1 + e^{-z}}$$

where $p$ is the predicted likelihood of LRI hotspot occurrence, bounded between 0 and 1, and $z$ is a linear combination of the variables whose value varies between $-\infty$ and $+\infty$. More precisely:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $\beta_0$ is the intercept and $\beta_i$ ($i = 1, \dots, n$) are the coefficients associated with the variables $x_i$ ($i = 1, \dots, n$). Detailed information about LR is provided by Hosmer and Lemeshow (2000).
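The logistic transformation can be illustrated with a minimal sketch. The coefficients and the two explanatory variables below are hypothetical and are not estimates from the study; the sketch only shows how a linear predictor is mapped to a value between 0 and 1.

```python
import math


def hotspot_probability(x, intercept, coefficients):
    """Logistic model: p = 1 / (1 + exp(-z)) with z = b0 + sum(b_i * x_i)."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two standardized explanatory variables.
toy_intercept = -1.0
toy_coefficients = [0.8, -0.5]

p_neutral = hotspot_probability([0.0, 0.0], 0.0, toy_coefficients)        # z = 0
p_county = hotspot_probability([2.0, -1.0], toy_intercept, toy_coefficients)
is_hotspot = p_county >= 0.5  # 0.5 is a conventional, assumed decision threshold
```

Here `p_neutral` equals 0.5 because the linear predictor is zero, and `p_county` corresponds to z = -1.0 + 1.6 + 0.5 = 1.1.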

RF (Breiman, 2001) is an ensemble learning method in which a large number of decision trees are grown on bootstrap samples of the training data. The input data are repeatedly split by the generated classification trees, and the final decision is the class that obtains the maximum number of 'votes' from the individual trees (Boström, 2007; Hastie et al., 2009; Mollalo et al., 2018). In this study, the number of trees was set to 1000. The maximum depth of the trees (the number of levels from the root to the deepest node) was chosen by cross-validation from the set {2, 3, 4}.
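The two mechanisms behind RF, bootstrap sampling and majority voting, can be sketched as follows. This is a conceptual illustration rather than the scikit-learn implementation used in the study: the "trees" are hypothetical threshold rules standing in for fitted decision trees, and all names and values are made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, x):
    """Each 'tree' is any callable mapping a feature vector to a class label."""
    return majority_vote([tree(x) for tree in trees])


# Hypothetical stand-ins for fitted trees: three threshold rules on two features.
toy_trees = [
    lambda x: int(x[0] > 5.0),
    lambda x: int(x[1] > 100.0),
    lambda x: int(x[0] + x[1] > 90.0),
]

rng = random.Random(0)                      # fixed seed for repeatability
toy_rows = list(range(10))                  # row indices of a tiny training set
toy_replicate = bootstrap_sample(toy_rows, rng)
toy_label = forest_predict(toy_trees, (6.0, 95.0))  # votes are 1, 0, 1
```

In a real forest, each tree would be trained on its own replicate such as `toy_replicate`, and an odd number of trees avoids ties in a two-class vote.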

Similar to RF, GBDT is an ensemble method based on bootstrap sampling, which generates many decision trees. While RF uses the bagging method (e.g., equal probability of sample selection in each iteration), GBDT uses a boosting method (i.e., weighted (unequal) sample selection in each run). After each iteration, the weights are adjusted so that higher weights are assigned to the models with good performance (Friedman, 2002).

Suppose $x_i$ is a training sample, $y_i$ is the associated label of $x_i$, and $N$ is the number of training samples. For any training sample $x_i$, $F(x_i)$ is the classification of $x_i$, and $L(y_i, F(x_i))$ is the loss between $F(x_i)$ and $y_i$. GBDT determines an optimal model such that $\sum_{i=1}^{N} L(y_i, F(x_i))$ is minimized. In the first step, GBDT initializes the decision tree $F_0(x)$ and then iteratively constructs $m$ new trees. In each iteration, a negative gradient is computed and a new tree $h(x)$ is added to reduce the residuals. The optimal model $F^{*}(x)$ can be calculated as follows:

$$F^{*}(x) = F_0(x) + v \sum_{t=1}^{m} \rho_t\, h_t(x) \qquad (7)$$

where $m$ is the number of iterations; $v$ controls the learning rate; $\rho_t$ is the weight of $h_t(x)$; and $h_t(x)$ is the trained decision tree in the $t$-th iteration.
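The additive structure of the boosted model can be illustrated with a deliberately small sketch. Several simplifying assumptions are made and should not be read as the configuration used in the study: the loss is squared error (so the negative gradient is just the residual), each tree is a one-split regression stump on a single feature, every tree weight is fixed at 1, and the data are made up.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_iter=50, learning_rate=0.1):
    """Return F(x) = F0 + learning_rate * sum of stumps fitted to residuals."""
    f0 = sum(ys) / len(ys)                   # initial constant model
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_iter):
        # Negative gradient of the squared-error loss 0.5 * (y - F)^2.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(tree(x) for tree in trees)


# Hypothetical one-feature data: the label switches from 0 to 1 above x = 3.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
toy_model = gradient_boost(toy_x, toy_y)
```

With a learning rate of 0.1, each iteration removes one tenth of the remaining residual, so the fitted values approach the labels gradually rather than in a single step.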

The k-nearest neighbors classifier (where k is a positive integer) is a non-parametric, distance-based algorithm that assigns a test sample to the class that is most common among its k nearest training samples. In other words, a county is classified as a hotspot of LRI if a majority of its neighboring counties are hotspots (Peterson, 2009). Using a random search algorithm, k = 10 was selected as the optimal number of nearest neighbors. Also, the explanatory variables are not involved in this algorithm.

The distance can be calculated in a variety of ways, including Euclidean, Hamming, Manhattan, and Minkowski distances. We used the Manhattan distance, which yielded better results, as the distance metric. It is calculated as:

$$d(p, q) = \sum_{i=1}^{n} \lvert p_i - q_i \rvert$$

where $p$ and $q$ are $n$-dimensional vectors such that $p = (p_1, p_2, \dots, p_n)$ and $q = (q_1, q_2, \dots, q_n)$.
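A minimal sketch of k-nearest-neighbor classification with the Manhattan distance is given below. The default k = 10 mirrors the value reported above, while the coordinates, labels, and the smaller k used in the toy call are hypothetical.

```python
from collections import Counter


def manhattan(p, q):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_predict(train_x, train_y, query, k=10):
    """Majority label among the k training points closest to the query."""
    order = sorted(
        range(len(train_x)), key=lambda i: manhattan(train_x[i], query)
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of points.
toy_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = [0, 0, 0, 1, 1, 1]   # 1 = hotspot, 0 = not a hotspot

label_near_origin = knn_predict(toy_x, toy_y, (0.5, 0.5), k=3)
label_far = knn_predict(toy_x, toy_y, (5.5, 5.5), k=3)
```

An odd k is used in the toy call to avoid tied votes between the two classes.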

The SVM classifier, first proposed by Vapnik (1992), is grounded in statistical learning theory. Consider a dataset of high-dimensional points, viewed as vectors $\{x_i \in \mathbb{R}^{d} : i = 1, \dots, n\}$, $d > 1$, where each point $x_i$ belongs to one of two classes defined by $\{y_i \in \{0, 1\} : i = 1, \dots, n\}$. Here, $y_i$ corresponds to the presence/absence of LRI hotspots. If we assume these points to be linearly separable (i.e., they can be separated by a linear boundary), the goal of SVM is to find the hyperplane in the $d$-dimensional space that maximizes the margin (i.e., the distance between the closest points, or support vectors), as illustrated in Figure 1 (Yoon et al., 2011).

The hyperplane can be expressed as $f(x) = \operatorname{sgn}(w \cdot x + b)$, where $w$ is the orientation of the hyperplane, $b$ is the offset of the hyperplane from the origin, and $\operatorname{sgn}$ is the sign function (i.e., sgn = +1 for presence and sgn = -1 for absence of an LRI hotspot). SVM can also work when the points are not linearly separable by using a soft margin. The soft margin allows a trade-off between the margin of separation and the misclassification penalty, one form of which can be the aggregated distance of the misclassified points to the separating hyperplane. With the class labels coded as $y_i \in \{-1, +1\}$, the optimal separating hyperplane can be found using Lagrange multipliers from:

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$

where $\alpha_i$ are the Lagrange multipliers and the value of $C$ (regularization) controls the trade-off between maximizing the margin and minimizing the errors. Finally, $w$ and $b$ can be obtained as follows:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad b = \frac{1}{N_s} \sum_{s=1}^{N_s} \left( y_s - w \cdot x_s \right)$$

where $N_s$ is the number of support vectors placed on the margin lines. Many real-world problems are nonlinear. In this case, SVM utilizes kernel functions to transform the data into a space of higher dimension than the original one, in which the input data can be separated by a linear boundary (Scholkopf et al., 2001). For non-linearly separable cases, the above formula is extended using a kernel function, which maps the input dataset onto a higher dimensional feature space as shown in Figure 2. The decision function is modified as:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b \right)$$

where $K(x_i, x_j)$ is a Gaussian radial basis function kernel:

$$K(x_i, x_j) = \exp\left( -\gamma \lVert x_i - x_j \rVert^{2} \right)$$

Appropriate results highly depend on the selection of $C$ and $\gamma$. Here, we used a grid search to find the optimum values for the two parameters. This method checks various combinations of $C$ and $\gamma$ in a range of pre-defined values ($C$ between 0.5 and 20 with increments of 0.5 and $\gamma$ between 0.005 and 1.0 with increments of 0.1). It should be noted that these ranges are boundaries of the search space and have been chosen to cover a large enough space. For example, in our case, 20 is numerically large enough for $C$.
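To make the search space concrete, the sketch below evaluates a Gaussian radial basis function kernel and enumerates one possible reading of the grid described above. It is an assumption of this sketch that the $\gamma$ grid starts at 0.005 and advances in steps of 0.1 without exceeding 1.0 (0.005, 0.105, ..., 0.905); the variable names are hypothetical, and no SVM is actually trained here.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# C from 0.5 to 20 in steps of 0.5 (40 values).
c_values = [0.5 * i for i in range(1, 41)]

# gamma from 0.005 in steps of 0.1, staying below 1.0 (10 values, assumed reading).
gamma_values = [round(0.005 + 0.1 * i, 3) for i in range(10)]

# Every (C, gamma) pair that a grid search would evaluate.
parameter_grid = list(product(c_values, gamma_values))

# The kernel equals 1 for identical points and decays with distance.
k_same = rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma=0.105)
k_far = rbf_kernel((1.0, 2.0), (4.0, 6.0), gamma=0.105)
```

Under this reading the grid contains 400 candidate pairs, each of which would be scored by cross-validation on the training set.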

Fig. 2. A non-linear boundary in the input space (left) and a maximum margin hyperplane in feature space (right).

To employ the algorithms, 70% and 30% of the dataset were randomly selected as the training and test datasets, respectively. A randomized search algorithm was used to tune the hyper-parameters of each classification algorithm. L1 regularization (LASSO) was used to reduce the complexity of the model and to avoid overfitting. This is done by penalizing the absolute magnitude of the weights, which shrinks the weights of less informative variables to zero and leads to a sparser model.
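A random 70/30 partition of this kind can be sketched with the standard library alone. The function name, the seed, and the sample size of 1000 are hypothetical; in practice the split would be applied to the county records, typically with a library helper rather than hand-written code.

```python
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)     # reproducible shuffle
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


# Hypothetical dataset of 1000 records.
train_idx, test_idx = train_test_split_indices(1000)
```

Because hotspots are the minority class, a stratified split that preserves the hotspot share in both parts is a common refinement; whether the study stratified its split is not stated.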

The performances of the classifiers were assessed with several metrics: overall accuracy ($\frac{TP + TN}{TP + TN + FP + FN}$), precision ($\frac{TP}{TP + FP}$), recall ($\frac{TP}{TP + FN}$), F1-score ($\frac{2 \times precision \times recall}{precision + recall}$), false positive rate or FPR ($\frac{FP}{FP + TN}$), and area under the receiver operating characteristic (ROC) curve (ROC AUC). In the above formulas, $TP$, $TN$, $FP$, and $FN$ represent the number of true positives, true negatives, false positives, and false negatives, respectively.
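The threshold-based metrics can be computed directly from the four confusion-matrix counts, as in the following sketch. The counts are made up for illustration and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F1-score, and FPR from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }


# Hypothetical confusion-matrix counts for a test set of 100 counties.
toy_metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
```

For these counts, accuracy is 0.9, precision, recall, and F1-score are all 8/9, and the false positive rate is 1/11.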

The area under the precision-recall curve (PR AUC), which shows the trade-off between precision and recall at different thresholds, was also measured because the classes are imbalanced (Goutte & Gaussier, 2005). All evaluation metrics were computed on the test dataset.
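One way to approximate the area under the precision-recall curve is the step-wise sum often called average precision, sketched below. The text does not state which estimator was used, so this choice is an assumption; the labels and scores are made up, and tied scores are not treated specially.

```python
def pr_auc(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_positives = sum(y_true)
    tp = fp = 0
    area = 0.0
    previous_recall = 0.0
    for i in order:                      # lower the threshold one sample at a time
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        recall = tp / total_positives
        precision = tp / (tp + fp)
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Hypothetical predicted hotspot probabilities for four test counties.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_area = pr_auc(toy_labels, toy_scores)

perfect_area = pr_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])  # perfect ranking
```

A perfect ranking yields an area of 1, while ranking a non-hotspot above a hotspot lowers it (5/6 in the toy case).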

The null hypothesis of complete spatial randomness was rejected for all years based on Moran's I (range: 0.36-0.61; p-values<0.001) and General G (range: 0.0018-0.0019; p-values<0.001) statistics. The z-scores of both statistics almost consistently increased to large values from 1980 to 2014, indicating highly significant clustering (Table 1). Clustering was minimal from 1980 to 1990, but sharply and consistently increased thereafter. In the earlier years of the study period (1980-1985), the hotspots of the LRI mortality rates identified by the Getis-Ord Gi* hotspot detection technique were mostly concentrated in the western US. In contrast, from 1990 to 2000, these hotspots became less prominent, while LRI hotspots shifted toward the southeastern parts of the US (Figure 3). These counties continued to represent hotspots through the remaining periods.

In total, 118 counties (3.8% of US counties) were persistently identified as part of LRI hotspots (Figure 4). Among these were counties in Georgia (n=49), Kentucky (n=25), and Virginia (n=22) that were persistently affected, and accounted for 81.3% of total persistent hotspot counties.

Fig. 4. Location of counties that were persistently identified as hotspots of LRI mortality rates by Getis-Ord Gi* hotspot detection technique, 1980-2014.

All the classification algorithms predicted the hotspots of LRI mortality rates with relatively high accuracy (≥ 0.84); however, GBDT and RF were the most accurate models (0.92) (Table 2). Precision-recall plots of the employed models (Figure 5) indicated that GBDT had the highest PR AUC, indicating the largest values of both precision and recall for different cut-off values. GBDT achieved the highest F1-score (85%) and PR AUC (84%) compared to the other models, while the LR model had the worst performance (Table 2). Also, the results of RF were slightly better than those of KNN and SVM. Overall, of the employed machine learning algorithms, the decision trees (i.e., GBDT and RF) yielded more accurate predictions. The contributions of variables were analyzed for the GBDT and RF models (Figure 6). The GBDT model indicated that spring minimum temperature, winter precipitation, and median household income had the greatest positive influence in predicting the hotspots.

In this study, we integrated spatial statistical tools with machine learning classifiers in a GIS platform to identify hotspots of the LRI mortality rates across the continental US and to identify the most substantial LRI-associated environmental and socioeconomic factors. Given the lack of nationwide spatial analysis and modeling of LRI, our modeling framework can be applied as a general protocol, specifically to more prevalent respiratory diseases in the US such as asthma, chronic obstructive pulmonary disease, pneumonia, and COVID-19, to support public health decision making at the national level. Overall, there was a historical shift in hotspots away from the western US into the southeastern parts of the country, and the hotspots were highly localized in a few counties. Environmental factors contributed most strongly to these hotspots, while economic and social factors seem to be of secondary significance.

According to Fischer et al. (2016), advanced computational models can translate the occurrence of infectious diseases into decision-support tools. Unlike traditional models, machine learning algorithms can quantify the association between infectious disease and explanatory variables, even with incomplete or noisy data (Mollalo et al., 2019), in a shorter time and at lower cost.

Moran's I and General G statistics confirmed that LRI mortality rates are spatially clustered (P<0.001) across the continental US. Counties with high mortality rates tend to be located closer together than expected by chance. Using Getis-Ord Gi*, we identified several hotspots across the continental US. Additionally, spatial-temporal analysis of the clusters found a notable geographic shift in the location of hotspots from the west coast to the southeast of the US during the study period. The spatial pattern and the shift in the locations of hotspots over time may partially reflect the vast differences in the drivers of geographic patterns of LRI mortality rates, including environmental, socio-economic, and behavioral factors. It may also be attributed to health disparities or improved health care quality, such as the PCV7 and PCV13 vaccination programs, during the study period. The latter is consistent with the substantial global decline of Streptococcus pneumoniae (the leading cause of LRI mortality), as estimated by GBD 2016 Lower Respiratory Infections Collaborators (2018). Moreover, some states (including Georgia, Kentucky, and Virginia) and counties included persistent hotspots, suggesting that resources and policy interventions should be targeted to these areas.

All the classifiers showed considerable accuracy; however, due to the imbalanced dataset, in general, ensemble decision trees outperformed the (complex) SVM and the traditional and frequently applied LR. Additionally, although SVM was only slightly less accurate than the decision trees, it is less interpretable, slower to run, and more susceptible to overfitting. Allyn et al. (2017) developed LR, RF, GBDT, SVM, and Naïve Bayes models to predict the mortality of 4676 patients after elective cardiac surgery from December 2005 to December 2012. Their results showed that RF outperformed the other classifiers (AUC=0.788). Our results are also in agreement with the findings of Churpek et al. (2016), who compared LR, tree-based models, KNN, SVM, and neural networks. Their findings showed that RF was the most accurate classifier (AUC=0.801), followed by the gradient boosting machine (AUC=0.794).

The findings of decision trees indicated that higher spring temperature and increased precipitation during winter are among the most substantial predictors of the presence or absence of the hotspots. The contribution of these environmental factors is most likely due to changes in the epidemiology of weather-sensitive pathogens and in the host immune response, which can, in turn, lead to respiratory infections (Hossain et al., 2019). Other studies show that respiratory infections are seasonal, occurring especially during winter and rainy months. Seasonality may play a role due to the proximity of people in enclosed environments during cold weather, which can facilitate the spread of infections during those seasons. For example, Thomas et al. (1994) found that RSV infection was more prevalent in children during the winter months in Canada. In Malaysia, LRI was positively correlated with the monthly number of rainy days but negatively associated with the monthly mean temperature (Chan et al., 2002). A study conducted in Pakistan showed that LRI cases were more frequent in months when the minimum temperature was lower (Erling et al., 1999). In Brazil, however, statistically significant associations were found between viral LRI and increasing temperature and decreasing humidity (Gurgel et al., 2016). Inconsistent findings may be due to different studied organisms or different spatial units of analysis. For example, from county-level studies, one cannot draw conclusions at the individual level due to the ecological fallacy. Moreover, age is a potential confounder that needs to be adjusted for, particularly in studying mortality rates of diseases, to avoid distorting the relationship.

The findings of decision trees also implied that economic status, such as median household income and the proportion of the population living below the poverty line (according to the definition of the US Census Bureau (https://www.census.gov/)), was among the substantial socioeconomic factors in describing LRI hotspots. Although we cannot provide an explicit explanation for economic factors, poor access to basic treatments is a plausible explanation. The findings were consistent with a large body of literature worldwide. LRI was found predominantly in the disadvantaged populations in South Auckland, New Zealand (Trenholme et al., 2017). These populations were living in areas in the bottom quintile for socio-economic deprivation and with high rates of smoke exposure and poor living conditions. Similarly, impoverished children living in informal households without electricity and running water had approximately four times higher LRI mortality rates in South Africa (Hutton et al., 2019).

There are several limitations of the current research study. First, the variables incorporated in the machine learning models underwent several transformations and are susceptible to measurement or analysis errors. Also, neglecting the role of spatial autocorrelation, especially in sparse data, may produce biased estimates of the importance of variables. Another limitation is attributed to the selection of spatial scale. The values within each county are assumed to be uniform, but there might be sharp contrasts between neighboring sub-county areas; however, the choice of the spatial unit was dictated by the available data. Future studies should analyze and predict hotspots of LRI at the sub-county level, such as the zip code or census tract level, for targeted human interventions, particularly for Virginia, Kentucky, and Georgia, which were persistently identified as LRI hotspots. Additionally, future LRI studies should incorporate the concentrations of other criteria air pollutants such as ground-level ozone, sulfur oxides, lead, carbon monoxide, and nitrogen oxides, as they may cause serious damage to internal organs, especially the lungs, which can lead to higher LRI mortality.

To our knowledge, this is the first study that incorporated national datasets on the LRI mortality rate using machine learning algorithms. Despite the above limitations, these findings have important public health implications. Understanding why the counties with high LRI mortality rates cluster geographically can further help to reduce mortality in these regions. Moreover, the results of decision tree modeling can provide insight for future research geared toward identifying factors, such as median household income and climate factors, that contribute to elevated LRI mortality rates. Despite significant efforts for control, there are many clustered counties, particularly in Georgia, Kentucky, and Virginia, where LRI mortality rates have remained elevated for the past 35 years.


Spatial clustering

A comparison of a machine learning model with EuroSCORE II in predicting mortality after elective cardiac surgery: a decision curve analysis

Modelling soil series data to facilitate targeted habitat restoration: a polytomous logistic regression approach

Spatial Clusters of Child Lower Respiratory Illnesses Associated with Community-Level Risk Factors

Geographic Variation in Hospitalization for Lower Respiratory Tract Infections Across One County

Estimating class probabilities in random forests

Burden of cause-specific mortality associated with PM2.5 air pollution in the United States

Seasonal variation in respiratory syncytial virus chest infection in the tropics

Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards

Infections of the Respiratory System

Cigarette smoking prevalence in US counties

The impact of climate on the prevalence of respiratory tract infections in early childhood in Lahore

CDC grand rounds: modeling and public health decisionmaking

Stochastic gradient boosting

Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory infections in 195 countries, 1990-2016: a systematic analysis for the Global Burden of Disease Study

A probabilistic interpretation of precision, recall and Fscore, with implication for evaluation

Spatial Clustering Overview and Comparison: Accuracy, Sensitivity, and Computational Expense

Relative frequency, Possible Risk Factors, Viral Codetection Rates, and Seasonality of Respiratory Syncytial Virus among Children with Lower Respiratory Tract Infection in Northeastern Brazil

Trends in Bronchiolitis Hospitalizations in the United States

Random forests

Use of genetic algorithms for neural networks to predict community-acquired pneumonia

Applied Logistic Regression

Sociodemographic, climatic variability and lower respiratory tract infections: a systematic literature review

Healthcare utilization and cost of pneumococcal disease in the United States

Clinical Features and Outcome of Children with Severe Lower Respiratory Tract Infection Admitted to a Pediatric Intensive Care Unit in South Africa

Predicting hospital-acquired pneumonia among schizophrenic patients: a machine learning approach

Climatic factors and lower respiratory tract infection due to respiratory syncytial virus in hospitalised infants in northern Spain

The risk of lower respiratory tract infection following influenza virus infection: A systematic and narrative review

Pulmonary effects of maternal smoking on the fetus and child: effects on lung development, respiratory morbidities, and life long lung health

Climate change and respiratory infections

Spatial and spatio-temporal analysis of human brucellosis in Iran

Geographic information system-based analysis of the spatial and spatio-temporal distribution of zoonotic cutaneous leishmaniasis in Golestan Province, north-east of Iran

A 24-year exploratory spatial data analysis of Lyme disease incidence rate in

Machine learning approaches in GIS-based ecological modeling of the sand fly Phlebotomus papatasi, a vector of zoonotic cutaneous leishmaniasis in Golestan province

A GIS-Based Artificial Neural Network Model for Spatial Distribution of Tuberculosis across the Continental United States

Artificial Neural Network Modeling of Novel Coronavirus (COVID-19) Incidence Rates across the Continental United States

GIS-based spatial modeling of COVID-19 incidence rate in the continental United States

The state of US health, 1990-2016: Burden of diseases, injuries, and risk factors among US states

Application of support vector machine, random forest, and genetic algorithm optimized random forest models in groundwater potential mapping

Child health and living at high altitude

Epidemiology of Influenza in Patients with Acute Lower Respiratory Tract Infection in South of Iran

Differential respiratory health effects from the 2008 northern California wildfires: a spatiotemporal approach

Learning with kernels: support vector machines, regularization, optimization, and beyond

Risk factors for mortality from acute lower respiratory infections (ALRI) in children under five years of age in low and middle-income countries: a systematic review and meta-analysis of observational studies

Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using Sentinel-2 imagery

Respiratory syncytial virus subgroup B dominance during one winter season between 1987 and 1992 in Vancouver

Etiology of acute lower respiratory tract infection in Central Australian Aboriginal children

Respiratory virus detection during hospitalisation for lower respiratory tract infection in children under 2 years in South Auckland

Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory infections in 195 countries, 1990-2016: a systematic analysis for the Global Burden of Disease Study

Principles of risk minimization for learning theory

Severe Respiratory Viral Infections: New Evidence and Changing Paradigms

Spatiotemporal analysis for the effect of ambient particulate matter on cause-specific respiratory mortality in Beijing

A comparative study of artificial neural networks and support vector machines for predicting groundwater levels in a coastal aquifer

The first author would like to thank Professor Gregory Glass for kindly reviewing an earlier version of the manuscript. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.