title: Using Machine Learning to Predict Poverty Status in Costa Rican Households
authors: Kim, Ji Yoon
date: 2021-11-26

This study presents two supervised multiclassification machine learning models that predict the poverty status of Costa Rican households, as a way to help the government and business sectors make decisions in a rapidly changing social and economic environment. Using the Costa Rican household dataset collected via the proxy means test conducted by the Inter-American Development Bank, Random Forest and Gradient Boosted Trees achieved F1 scores of 64.9% and 68.4%, respectively. This study also finds that education has the greatest impact on predicting poverty status.

Over the past two decades, Costa Rica has removed its restrictive foreign investment measures and liberalized its international trade policies [1]. These efforts have brought economic growth to Costa Rica and have led it to become an upper-middle-income country. According to the World Bank Group [1], measured against the upper-middle-income poverty line of $5.50 per day, the share of Costa Ricans with incomes below the poverty line decreased from 12.9% in 2010 to 10.6% in 2019. Despite this strong economic growth, Costa Rica has recently been experiencing economic hardship due to the COVID-19 pandemic. Costa Rica's gross domestic product (GDP) decreased by 4.1% in 2020. A sharp increase in unemployment pushed an estimated 124,000 people into poverty, which raised the poverty rate to 13.0% in the same year [1].

To minimize the social and economic impacts of unexpected crises, it is necessary to consider introducing data-driven technology capable of making dynamic predictions. According to the United Nations Development Programme (UNDP) [2], traditional statistical methods may require two years of data collection and analysis to predict poverty.
Machine learning can empower government and business sectors to make more intelligent and strategic decisions, ultimately supporting the lives of vulnerable people in society and leading towards a sustainable future.

To build a machine learning model for poverty prediction, this study referenced a research paper titled "Poverty Classification Using Machine Learning: The Case of Jordan," which presents a machine learning model to predict poverty among Jordanian households [3]. Alsharkawi et al. [3] implemented a classification model that is robust enough to deal with changes in political, social, and economic factors. Their study achieved an F1 score of 81.0% using gradient boosting implemented with LightGBM, which is an acceptable level of accuracy compared to the average F1 score of 87.1% among poverty prediction classification models (i.e., Naïve Bayes, Decision Tree, K-Nearest Neighbors, Logistic Regression, and ID3) applied elsewhere (i.e., Lagangilang, Abra, Philippines) [4].

This paper aims to build a machine learning model to predict the poverty status of Costa Rican households, using the Inter-American Development Bank's Costa Rican household dataset. The dataset was compiled through a proxy means test that includes questionnaires related to household composition, observable characteristics of the household (e.g., material of roof), and ownership of electronic devices. As shown in Appendix - Exhibit 1, the dataset has 143 columns (i.e., a mix of categorical and numerical variables) and 9,557 rows. Of the 143 variables, the variable titled "Target" is used as the dependent variable to predict poverty status. This variable consists of four classes (i.e., extreme poverty, moderate poverty, vulnerable households, and non-vulnerable households). To examine whether the dataset is imbalanced, univariate analysis is conducted.
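A univariate check of this kind can be sketched in a few lines of Python. The snippet is illustrative only: the helper name `class_shares`, the toy label list, and the integer coding of the four classes are assumptions made for the example, not the study's code or data.

```python
from collections import Counter

def class_shares(labels):
    """Return {class: share of observations} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Toy labels (hypothetical coding): 1 = extreme poverty ... 4 = non-vulnerable.
toy_target = [4, 4, 4, 4, 4, 4, 3, 2, 2, 1]

shares = class_shares(toy_target)
majority_class = max(shares, key=shares.get)

# A share well above 1 / (number of classes) signals an imbalanced target.
print(majority_class, shares[majority_class])
```

Applied to the real "Target" column, the same computation would give the per-class shares that the exhibits report.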
As shown in Exhibit 2, of the 9,557 survey participants, 5,996 were non-vulnerable households. This represents about 62.7% of the total, which indicates that this is the majority class of the dataset. As shown in Exhibit 3, 6,829 of the 9,557 survey participants live in urban areas, which represents about 71.4% of the total. As shown in Appendix -Exhibit 1, seven variables (i.e., v2a1, v18q1, dependency, edjefe, edjefa, meaneduc, and SQBmeaned) have missing values. In particular, v2a1 and v18q1 have estimated missing values percentages of 72.0% and 75.0%, respectively. In this study, five models are considered (i.e., Decision Tree, Random Forest, Gradient Boosted Trees, Naïve Bayes, and K-Nearest Neighbors). A Decision Tree is a model that utilizes the tree-like model for analyzing and forecasting the data. The tree consists of the root node, internal nodes, and leaf nodes, and is recursively split into sub-trees [5] . A Decision Tree is one of the most widely used machine learning models because the model can handle categorical and numerical datasets, as well as a mix of categorical and numerical datasets. It can also be applied by non-expert users more easily than other machine learning models, as it requires less skill in data pre-processing, and because the model has a built-in resistance to outliers [6] . Decision Trees can be utilized for datasets with missing values; many studies have found the method to work with such datasets [7] , [8] . However, because Decision Trees require careful parameter tuning to prevent the model from becoming biased towards the majority class [9] , they were not used in this study. [10] Random Forest creates multiple independent trees using a random sample of data and aggregates trees that are created using a Decision Tree model. By aggregating the results of different trees into one result, Random Forest can limit overfitting without increasing errors that are caused by bias [11] . 
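The bootstrap-and-aggregate idea described above can be illustrated with a deliberately simplified sketch. It assumes a single numeric feature and one-split trees ("stumps"), and it omits the per-split random feature selection that a full Random Forest would include; all function names and the toy rows are hypothetical and are not taken from the study.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows):
    """Fit a one-split tree on rows of (feature_value, label)."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        if not right:  # a split must leave observations on both sides
            continue
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, l_lab, r_lab)
    if best is None:  # all feature values identical: always predict the majority
        label = majority([y for _, y in rows])
        return lambda x: label
    _, threshold, l_lab, r_lab = best
    return lambda x: l_lab if x <= threshold else r_lab

def fit_forest(rows, n_trees, seed=0):
    """Train each tree on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Aggregate the individual tree predictions by majority vote."""
    return majority([tree(x) for tree in forest])

# Toy data (hypothetical): one feature, two classes.
rows = [(1, "poor"), (2, "poor"), (3, "poor"),
        (7, "non-poor"), (8, "non-poor"), (9, "non-poor")]
forest = fit_forest(rows, n_trees=25)
print(predict_forest(forest, 1), predict_forest(forest, 9))
```

Individual stumps trained on unlucky bootstrap samples can be wrong, but the vote across many of them is far more stable, which is the variance-reduction effect the paragraph above attributes to aggregation.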
Because Random Forest is built from Decision Trees, which can be utilized with missing values, it is also a well-known algorithm for handling datasets with missing values. Because Random Forest can decrease the risk of overfitting [12], and because it works well with non-linear data [13], it was used in this study to predict the poverty status of Costa Rican households.

EXHIBIT 6. GRADIENT BOOSTED TREES STRUCTURE [14]

Gradient Boosted Trees can be used to improve the predictive performance of a Decision Tree. Gradient Boosted Trees generate trees sequentially, and each new tree iteratively corrects the previously trained trees [15], [16]. Gradient Boosted Trees are prone to overfitting, as each new model is developed based on the previous trees. However, regularization parameters (e.g., learning rate or shrinkage parameter) help prevent overfitting by controlling the amount of information coming from previously fitted trees when forming new trees [17]. Various algorithms can be applied to Gradient Boosted Trees to handle the missing values in the datasets, allowing Gradient Boosted Trees to minimize loss functions and the risks of under/overfitting [18]. Because their ability in "minimizing some loss function" allows Gradient Boosted Trees "to be more accurate than some more theoretically intensive predictive models" [17, p. 9], they were used in this study.

Naïve Bayes is a popular algorithm in machine learning because it can be used efficiently with large datasets and can be interpreted easily. However, as the word naïve suggests, Naïve Bayes assumes that the features are independent [7]. Also, special consideration is needed when using Naïve Bayes with datasets that have both numerical and categorical variables [19]. Thus, Naïve Bayes was not used in this study, due to the limitations of the model.

K-Nearest Neighbors is simple to implement because it is a non-parametric algorithm. It does not require training steps, as it does not build any models [20].
"Instead an observation is predicted to be the class of that of the largest proportion of the k nearest observations" [21, p. 251 ]. Because K-Nearest Neighbors is sensitive to outliers [22] , it was not used in this study. As shown in Appendix -Exhibit 9, multiple individual variables with similar characteristics are merged into one variable. As a result, 18 new variables are formed, and they are encoded as dummy variables along with other ungrouped binary categorical variables. The age variable is a continuous data type, with a range from 0 to 100. This variable is divided into six groups (i.e., children, adolescents, young adults, adults, middle-aged adults, and old adults). These ordinal groups are mapped with unique labels to transform them from continuous to categorical data types. Because inaccurate binning can add bias to the dataset, numerical variables with dependent relationships to other variables remain as numerical variables. The dataset is reorganized according to the following criteria. First, when multiple variables contain the same values under different variable names 1 , only one variable remains and the rest of the variables are deleted. Second, when the same property 2 is expressed in two different data types (i.e., categorical and numerical), and when multiple variables are similar to each other 3 , the variable containing more meaningful information is retained. Third, variables that contain limited information 4 are removed from the dataset. Of the seven variables with missing values (i.e., v2a1, v18q1, dependency, edjefe, edjefa, meaneduc, and SQBmeaned), two (i.e., v2a1 and v18q1) were deleted from the dataset. Deleting variables can cause a loss of information and introduce bias into the model [21] ; however, the proportion of missing values for both variables was too large to be replaced with statistical values (i.e., mean, median, and mode). 
For the remaining four 5 variables (i.e., dependency, edjefe, edjefa, and meaneduc), missing values were replaced with the median value of the variable. Replacing missing values with the median is not the most accurate approach, so other techniques (e.g., predicting missing values using algorithms) can be considered in future studies. As shown in Appendix - Exhibit 9, 125 variables were used to build the model after data cleaning and wrangling. Among these variables, 17 (i.e., rooms, r4h1, r4h2, r4h3, r4m1, r4m2, r4m3, r4t1, r4t2, r4t3, escolari, rez_esc, dependency, edjefe, edjefa, meaneduc, and overcrowding) are numerical variables of either discrete or continuous data types. To examine their distribution and skewness, these variables were individually plotted, as shown in Appendix - Exhibit 10.

Normalization and standardization were performed to rescale the numerical variables. Standardization was used to rescale the numerical variables, except the dependency variables. Normalization was performed on the dependency variables, as they had lower and upper bounds of 0% and 100%. Normalization rescaled variables to fit within the range of 0 to 1:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where max(x) indicates the maximum value and min(x) indicates the minimum value of the variable [21]. Standardization rescaled variables to have a mean of 0 and a standard deviation of 1:

$$x' = \frac{x - \bar{x}}{\sigma}$$

where x̄ indicates the mean of the variable and σ indicates the standard deviation of the variable [21].

After normalization and standardization, Principal Component Analysis (PCA) was performed to "reduce the dimensionality (number of variables) of the dataset but retain most of the original variability in the data" [23, p. 5]. As shown in Exhibit 13, the amount of variance explained by components beyond the first 60 principal components is very low.

V. DATA MODELLING

Some important findings were made during exploratory analysis and data cleaning and wrangling. First, the model should be able to deal with the supervised multiclassification problem.
Second, the model should be able to work with heterogeneous datasets (i.e., a mix of categorical and numerical variables). Third, the model should excel in processing outliers and missing values. Therefore, Random Forest and Gradient Boosted Trees were selected from the five previously considered models to predict the poverty status of Costa Rican households.

Before building the models, the study randomly split the dataset into train and test sets. The train set comprises 80% of the dataset, while the test set comprises the other 20%. Because the dataset is imbalanced, stratified five-fold cross-validation was performed on the train set to determine the generalized performance of the model. Stratified cross-validation helps ensure "that the proportions between classes are the same in each fold" [24, p. 255]. On the train set, the accuracy of the Random Forest model is 76.0%, while the accuracy of the Gradient Boosted Trees model is 77.6%. To determine the models' predictive power on the test set, accuracy was calculated. The test accuracy of Random Forest and Gradient Boosted Trees is 78.1% and 79.6%, respectively.

Because the models are built on an imbalanced dataset, other performance evaluation metrics (i.e., F1, recall, and precision) were used to compare the performance of the models. As shown in Exhibit 14, the Random Forest model achieved the highest score with 88.5%, followed by Gradient Boosted Trees with 82.4%. Random Forest performed well on precision, while it underperformed Gradient Boosted Trees on recall. Therefore, F1 was used, as this measure provides "the harmonic mean of precision and recall" [4, p. 13]. Gradient Boosted Trees achieved an F1 score of 68.4%, and Random Forest achieved a score of 64.9%. Choosing a different evaluation metric, however, does not have the same effect as balancing the data.
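The two ideas in the evaluation above, stratified folds and F1 as a harmonic mean, can be made concrete with a small standard-library sketch. It rests on stated assumptions: the paper does not say how F1 was averaged over the four classes, so macro averaging is assumed here; the fold assignment is a plain round-robin without shuffling; and the toy labels, the two-fold split, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Split indices into k folds, spreading each class evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # round-robin within each class
    return folds

def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: the harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy imbalanced labels (hypothetical): class 4 is the majority.
y_true = [4] * 12 + [3] * 4 + [2] * 2 + [1] * 2
folds = stratified_folds(y_true, k=2)
fold_counts = [Counter(y_true[i] for i in fold) for fold in folds]

# A classifier that always predicts the majority class.
y_pred = [4] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(fold_counts)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case the always-majority classifier reaches 60% accuracy while its macro F1 is far lower, because three of the four classes are never predicted. This is the kind of gap that motivates reporting F1 alongside accuracy on an imbalanced target.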
The most accurate approach will therefore involve equally weighting all four classes through data-balancing techniques (e.g., over-sampling, under-sampling, the synthetic minority over-sampling technique (SMOTE), and class weights) during data pre-processing. These techniques can be explored further in future studies.

Because Gradient Boosted Trees achieved a higher F1 score than Random Forest, feature importance analysis was performed on Gradient Boosted Trees to determine which variables had the greatest effect on the model. As shown in Exhibit 15, the meaneduc variable (i.e., average years of education for adults) was found to be the most impactful variable.

This study has several limitations. First, although the dataset was assembled by a credible organization, the Inter-American Development Bank, some information (e.g., the data collection period and dataset creation time) is unavailable; therefore, it is difficult to understand what kinds of variance are included in the dataset. Second, Costa Rica's population in 2020 was 5,094,114 [25], but the dataset used in this study contains only 9,557 observations. The small sample size suggests that the population's characteristics may not be adequately represented in the dataset. However, the historical patterns of poverty among Costa Rican households and the percentage of urban populations agree with the dataset. As this dataset was originally published by the Inter-American Development Bank to develop a machine learning model to predict poverty status, this study assumes that the collected samples truly reflect the population of Costa Rica. If the collected sample does not reflect the population demographics for some reason (e.g., the sample is collected from specific regions or from a specific target group), the research findings would less closely reflect the population.
Along with the limitations of the dataset itself, due to resource and time constraints, several important techniques could not be performed. In future studies, these approaches can be applied to improve poverty status prediction. In this study, two variables (i.e., v2a1 and v18q1) with a significant amount of missing values were deleted, and the missing values of four variables (i.e., dependency, edjefe, edjefa, and meaneduc) were replaced with the median value of those variables. However, these methods are not the most accurate techniques for handling missing values. Because inaccurate data cleaning and wrangling techniques can introduce bias or reduce variance in the dataset, it is important to pre-distinguish the types of missing values (e.g., missing completely at random, missing at random, and missing not at random). Another approach involves predicting the approximate value of missing values using algorithms (e.g., K-Nearest Neighbors). These more precise techniques will correct the reduction in accuracy caused by mishandling missing values. As mentioned earlier, among the four classes, the nonvulnerable class comprises approximately 62.7% of the dataset. This indicates that the dataset is imbalanced. Therefore, the dataset has to be balanced to prevent the machine learning model from becoming biased towards the majority class. There are four methods to balance the dataset. The first method is random under-sampling. Random under-sampling will "randomly delete examples in the majority class" [26, p. 113 ]. The disadvantage of random under-sampling is that "this method can discard potentially useful data that could be important for the induction process" [27, p. 2] . The second method is random over-sampling. Random over-sampling will "randomly duplicate examples in the minority class" [26, p. 113 ]. Random over-sampling has its own disadvantages as well. 
Random over-sampling does not "add any new information" to the model, as it involves "duplicating examples in the minority class" [26, p. 121]. The third method is the synthetic minority over-sampling technique (SMOTE) [26]. Lastly, class weights can be used to equally weigh all four data classes. This technique places different weights on each class to emphasize the minority class [3]. All four techniques have their advantages and disadvantages; therefore, future studies can apply these techniques to the model to find the best-performing data-balancing method.

As shown in Exhibit 14, performance varies between the four evaluation metrics. Recall and F1 underperform training accuracy, while test accuracy and precision outperform it. Because the dataset is imbalanced, not all classes may be classified equally well. The other possibility is that the test set may represent a localized portion of the data, as it comprises only 20% of the dataset. However, these are just two of many possible explanations for this performance. In future studies, further examination (e.g., building separate one-versus-rest classifiers to review the performance of each class) can be conducted to clearly distinguish factors that may cause under/overfitting and to determine the generalized performance of the model.

Other studies have shown that Naïve Bayes works well in predicting poverty status [4]; therefore, future studies can consider implementing Naïve Bayes. However, Naïve Bayes performs well only when variables are independent. Social datasets contain variables that are sometimes highly correlated with each other, forming a dependent relationship. In future studies, further feature engineering can be attempted to eliminate dependency between variables. Assuming that independence can be established by eliminating dependency, two approaches can be considered in future studies for implementing Naïve Bayes in predicting poverty status.
First, binning can be considered as a way to transform a heterogeneous dataset into a homogeneous dataset. Numerical variables can be binned to transform them into categorical variables. However, variables grouped by unspecific criteria can introduce bias to the dataset; thus, binning should be conducted only when specific, objective, and clear criteria are available. Second, if the dataset cannot be made homogeneous and retains a mix of categorical and continuous variables, special consideration is needed when implementing the Naïve Bayes classifier. Hsu et al. [19] developed the Extended Naïve Bayes (ENB) classifier, in which probabilities of categorical variables are calculated using the original method in the Naïve Bayes model, and variances of numerical variables are found using statistical theory.

In this study, a dataset collected through a proxy means test by the Inter-American Development Bank was used to predict the poverty status of Costa Rican households. Based on characteristics of the dataset (i.e., multiclassification, heterogeneous dataset, missing values, and outliers), Random Forest and Gradient Boosted Trees were selected to develop multiclassification poverty prediction models. Before building the Random Forest and Gradient Boosted Trees models, irrelevant or highly correlated variables were deleted, and missing values were replaced with the median value of the variable to simplify the dataset. Both normalization and standardization were used to rescale the numerical variables. Also, PCA was performed to reduce the dimensionality of the dataset. As a result, under the assumption that the dataset reflects the characteristics of Costa Rica's population, the Random Forest model achieved a 64.9% F1 score, while the Gradient Boosted Trees model achieved a score of 68.4%. However, in terms of F1 scores, these models underperformed the Jordanian model and the average of other models found in the literature.
Further, this study found that education (i.e., meaneduc) has the greatest impact on predicting poverty status. Finding a causal relationship between educational attainment and poverty was not a goal of this study, so further examination of this topic was not carried out. However, many prominent scholars have shown that additional years of education increase individual income [28].

Despite their limitations, both Random Forest and Gradient Boosted Trees demonstrated the ability to predict poverty status among Costa Rican households. Future studies could address the limitations described in this study to improve the performance of these models. Further, the models' robustness could be measured by adding a variety of social and economic factors into the dataset. Such efforts will continue after this study to strengthen the models, as this is an area of research with development potential.

8 Variables in this row are grouped together based on a characteristic (i.e., roof materials) to encode to dummy variables; however, the study determined that 66 observations are not applicable to any of the four variables in the group. These observations were deleted in this study.
9 Variables in this row are grouped together based on a characteristic (i.e., electricity type) to encode to dummy variables; however, the study determined that 15 observations are not applicable to any of the four variables in the group. These observations were deleted in this study.
10 Variables in this row are grouped together based on a characteristic (i.e., education level) to encode to dummy variables; however, the study determined that three observations are not applicable to any of the nine variables in the group. These observations were deleted in this study.
11 The original dataset has a character encoding error.
12 The original dataset has a character encoding error.
[1] The World Bank, Costa Rica
[2] United Nations Development Program
[3] Poverty classification using machine learning: the case of Jordan
[4] Performance comparison of different classification algorithms for household poverty classification
[5] Data mining and knowledge discovery handbook
[6] Optimal decision trees for categorical data via integer programming
[7] Handbook of statistics
[8] Data science concepts and practice
[9] Scikit-learn: machine learning in Python
[10] An empirical study of downstream analysis effects of model pre-processing choices
[11] Ensemble machine learning: methods and applications
[12] Statistical analysis and data mining: the ASA data science journal
[13] Comparing the performance of random forest, SVM and their variants for ECG quality assessment combined with nonlinear features
[14] Machine learning reveals the influences of grain morphology on grain crushing strength
[15] Greedy function approximation: a gradient boosting machine
[16] LightGBM: a highly efficient gradient boosting
[17] Exploration of missing data imputation methods
[18] XGBoost: a scalable tree boosting system
[19] Extended naive Bayes classifier for mixed data
[20] Solving the multiple-instance problem: a lazy learning approach
[21] Python machine learning cookbook: practical solutions from preprocessing to deep learning
[22] Explaining the success of nearest neighbor methods in prediction
[23] Efficient intrusion detection using principal component analysis
[24] Introduction to machine learning with Python: a guide for data scientists
[25] The World Bank, Population, total - Costa Rica
[26] Imbalanced classification with Python: better metrics, balance skewed classes, cost-sensitive learning
[27] Handling imbalanced datasets: a review
[28] Returns to education: the causal effects of education on earnings, health and smoking