Towards International Relations Data Science: Mining the CIA World Factbook

Panagiotis Podiotis

2020-10-12

Abstract

This paper presents a three-component work. The first component sets the overall theoretical context, which lies in the argument that the increasing complexity of the world has made it more difficult for International Relations (IR) to succeed both in theory and in practice. The era of information and the events of the 21st century have moved IR theory and practice away from real policy making (Walt, 2016) and have left the field entrenched in opinions and political theories that are difficult to prove. At the same time, the rise of the "Fourth Paradigm - Data-Intensive Scientific Discovery" (Hey et al., 2009) and the strengthening of data science offer an alternative: "Computational International Relations" (Unver, 2018). The use of traditional and contemporary data-centered tools can help update the field of IR by making it more relevant to reality (Koutsoupias & Mikelis, 2020). The "wedding" between Data Science and IR is no panacea, though. Changes are required both in perceptions and in practices. Above all, for Data Science to enter IR, the relevant data must exist. This is where the second component comes into play. I mine the CIA World Factbook, which provides cross-domain data covering all countries of the world. Then, I execute various data preprocessing tasks, culminating in simple machine learning which imputes missing values and yields a more complete dataset. Lastly, the third component presents various projects making use of the produced dataset in order to illustrate the relevance of Data Science to IR through practical examples. Ideas regarding the future development of this project are then discussed in order to optimize it and ensure continuity. Overall, I hope to contribute to the "fourth paradigm" discussion in IR by providing practical examples while at the same time providing the fuel for future research.

Chapter 1 - Introduction

1.1 Relevance to the field

"In discipline after discipline . . . academics have all but lost sight of what they claim is their object of study" (Ian, 2005, p. 2); and the discipline of International Relations is no exception. Praising the advantages and the necessity of the field of International Relations to politics and society would be redundant in terms of time and paper; the field can only be strengthened when its weaknesses are recognized and solutions are sought. In that spirit, the field of International Relations, either as a subfield of Political Science (Kouskouvelis, 2007, pp. 22-23) or even as Political Science itself (Carr, 1939), suffers from inherent weaknesses. These weaknesses are evident on both sides of the same coin: International Relations theory[1] and application[2]. While the debate around IR's theoretical weaknesses is still ongoing, there is much wider consensus on its practical failures (Baron, 2014). Factors like the lack of experimentation data, the long-term aspect of political outcomes, the difficulty of substantiating an opinion into a theorem and the varying schools of thought (Martin, 1972, p. 846) all make IR's role significantly harder in interpreting the world and forecasting the future. Make no mistake, this paper will not discuss the Great Debates (David, 2013), nor will it try to discredit IR theory.
It merely serves to promote a practical, multidisciplinary approach to International Relations. Despite the fact that Complexity Sciences (Caws, 1963) and fields like System Dynamics Models, Data Mining and Quantitative methods (Unver, 2018) have been around for many decades, their application in IR has been relatively limited when compared to other sectors. Technical literature remains underdeveloped, and most guides for the young IR student come from the field of Computer Science and from non-academic[3] sources. Philip A. Schrodt, in his paper "Forecasting Political Conflict in Asia using Latent Dirichlet Allocation Models" (Schrodt, 2011), presents a plethora of computer-powered, IR-related projects which, as mentioned before, exist at high institutional levels. Such technical efforts to assist IR in describing, explaining or forecasting reality should be further democratized for IR students. While data science courses are beginning to appear and increase in political science study programs, few political science students understand the potential of, or utilize, the relevant tools. The ideas of the Fourth Paradigm as described in (Hey et al., 2009) are now more relevant than ever. While most aspects of human life become increasingly reliant on and invested in the mining and use of data, international relations is falling behind. It is high time that the "fourth paradigm - data-intensive scientific discovery" is re-examined in International Relations. It is time we give "Computational IR" (Unver, 2018) a chance. Events such as the annexation of Crimea by Russia, BREXIT, the revolutions in the Middle East[4] (Gadi et al., 2013), the election of Donald Trump[5] and the SARS-CoV-2 outbreak highlight the modern challenges for the policy-maker and decision-taker. In the era of information (Kremer & Müller, 2014), political intelligence (Fishman & Greenwald, 2015), within the big data and OSINT realms, assumes a central role for leaders. Kenneth Waltz's single-picture prescriptions, as painted in Man, the State and War: A Theoretical Analysis (Waltz, 1959, p. 300), seem to become synonymous with IR theory itself: a theory increasingly unable to capture the chaotic modern world. This proposition can be illustrated through the continuous reexamination of realism (Korab-Karpowicz, 2010), one of the most recognized branches of IR theory. The "…search for the inclusive nexus of causes" (Waltz, 1959, p. 300) can become easier to materialize with the democratization of computational machines and algorithms. Solid theory and political practices can be boosted by capitalizing on data mining and analytics tools.

[1] A non-exhaustive list of examples being the work of Helen Milner in "Review: International Theories of Cooperation among Nations: Strengths and Weaknesses" (Helen, 1992, p. 481) and the article of W. Julian Korab-Karpowicz, "Political Realism in International Relations" (2010).
[2] Certain aspects of realism are discussed in Barack Obama's interview (Goldberg, 2016) for The Atlantic. Even though IR theory is not discussed in an official academic setting, the clutter surrounding political theory and actual practice becomes quite evident, especially when additional non-academic sources are reviewed.
[3] Mostly online forums, websites, video guides etc.
It is this very ability to extract quantitative and qualitative data from societies, data which can be used for more effective policies (Koutsoupias & Mikelis, 2019), substantiated opinions and theory, that makes the multidisciplinary cross-over between IR (Rosenberg, 2016) and Data Science so attractive and useful. Under no circumstances does this introduction aim to reduce the importance of IR and of related theories. It serves as a reminder that multidisciplinary knowledge and projects can lead to better policies and theory evaluation. I am aware of the fact that this new approach may ultimately fail, but considering the probable benefits it is definitely worth trying.

1.2 Structure, Aims and Objectives

The paper at hand consists of three components, one theoretical and two technical. The theoretical one calls for the deeper use of Data Science in IR and for the strengthening of Computational International Relations. The other two technical components support the arguments of the theoretical one by creating, using and presenting actual tools and methods. This paper can thus be illustrated as a three-component structure:

[Figure: the three-component structure of the paper]

Overall, the author aims to 1) provide new literature for the field of International Relations Data Science by raising awareness; 2) create a high-quality dataset which will serve as groundwork for future research; and 3) develop his knowledge and abilities in Data Science by exploring various algorithms and concepts. In order for the above aims to be met, it is crucial that: a) this application is effectively placed within the International Relations context; b) the methodology and executed steps are presented in a concise and detailed manner; c) additional uses of the data and tools produced are discussed in order to ensure continuity. Considering the ideas discussed in section 1.1 Relevance to the field, the present work can be regarded as multidisciplinary in nature: specifically, an explanatory data-science project standing within the theoretical IR cosmos. The reader should expect a variety of methodologies within this paper, since it has both technical and theoretical aspects. The three-component structure presented in 1.2 Structure, Aims and Objectives plays a crucial role in that regard. The theoretical part is argumentative and heavily influenced by the ideas of the "Fourth Paradigm" (Hey et al., 2009), "Computational IR" (Unver, 2018) and by the experiences of the late 2010s. Its aims are mostly met in 1.1 Relevance to the field. Moving on to the two technical components, Chapter 3 - Relevant Projects presents some findings of other projects which were extracted from the CIA World Factbook dataset using statistical, Natural Language Processing and Machine Learning tools. For more details on the methodologies and methods of these projects, the reader is strongly advised to refer to the relevant papers directly. The CIA World Factbook dataset creation, which is the center of gravity of this paper, was carried out using a simplified version of CRISP-DM[6]. The selection criteria and other details are presented below. Due to the lack of data-centered methodologies in IR, Data Science methodologies[7] were examined. CRISP-DM was selected and modified in order to better serve the requirements of this project (for a comprehensive summary of the project's pipeline see Appendix A. Program Pipeline). The modifications mostly refer to removing certain business and model-deployment steps.
Such steps are irrelevant to this project, since it is neither placed within a large corporation or institution with bureaucracy, nor is it ultimately producing a model for use outside of its narrow scope.

[6] Summarized effectively in (Shearer, 2000) and (Hipp & Wirth, 2000).
[7] A variety of online and other sources were examined, with the initial references being: (Azevedo & Santos, 2008), (Martínez-Plumed, et al., 2019) and (Mariscal, Marbán, & Fernández, 2010).

The selection criteria which led to the CRISP-DM methodology were:
1. Fit for the present project; this project is mostly about data mining and processing rather than advanced modeling.
2. Widely used, proven and accepted (Gregory, 2014).

Under these criteria, the aforementioned modified version of CRISP-DM was selected as the ideal[8] methodology.

[8] The author is aware that CRISP-DM may be ideal but not optimal. There is a great number of robust methodologies, but an evaluation of each one not only falls outside the scope of this paper but is also challenging in terms of time.

[Figure: (Shearer, 2000)]
[Figure 3: CRISP-DM tasks and outputs (Hipp & Wirth, 2000)]

A more technical presentation of the various Python scripts is laid out in the project plan provided below. The summarized version can be found in Appendix A. Program Pipeline. The output of step 4 was manually compared to Dataframe v1 in order to observe whether semantics had been altered during the transformation. Step 5 was iterated multiple times and the outcomes were compared. After the optimal models and parameters were found, step 6 was executed, effectively forming the last version of the Dataframe (v5). After the publication of this paper, the code will be uploaded on the web. Comments and suggestions of the online community will be taken under consideration and the code will hopefully keep evolving in the public domain. With respect to the IR background of this paper, it was obligatory that data describing states, the largest actors of the International System, were found. Moreover, the author of the original data must enjoy a high degree of validity and be a recognized authority. In that sense, considering the US as one of the most important global policy powers and the CIA as one of the most recognized intelligence agencies, the CIA World Factbook seemed ideal. Furthermore, the policy[9] governing the use of the Factbook data gives flexibility to the researcher. Last but not least, the inquisitive nature of this paper, along with the author's familiarity[10] with the Factbook, also played an important role in its selection. Background information regarding the importance of and need for this project is provided in 1.1 Relevance to the field. Existing IR datasets[11] are limited in scope, usually focused on a specific sector in great depth. Multidomain datasets with regard to IR are scarce and their data are usually of low validity. In an effort to alter the situation described above, an objective has been set: to create a complete, reliable and easy-to-use dataset for International Relations study and policy. An optimization approach was adopted and will continue to unfold even after this paper. The project's inventory of resources has various disparities. While the human capital and hardware capabilities are limited, the amount of data and software tools is rather robust and extensive. In our case, the CIA World Factbook has been selected as the source of data; the rationale behind this choice is discussed above. Various libraries were used; more details are given in the respective parts of the paper.
All data and libraries were contained in a project-specific virtual environment. Generated data were saved in pickle format, in order for data types to be maintained, and in Excel (.xlsx) format for human exploration. All software tools used are available for free use or licensed to the device used. All tools performed as designed during every iteration of activities. All libraries used were selected based on the following criteria:
1. Presence in relevant literature (fame).
3. Extensive documentation.
4. Ease of use, considering the non-technical background of the author.

The assumptions regarding this project are:
1. CIA World Factbook data are expected to be of high confidence and low bias.
2. Some data were determined to be Missing Not at Random at the discretion of the author (see Appendix D. Columns Assumed MNAR).
3. Some assumptions are made while converting textual columns to numerical or labels (see Appendix E. Columns Generated from Textual - Assumptions).
4. The data produced or transformed by each algorithm were manually compared to the data of the previous phase in order to determine the quality of implementation. Thorough manual comparisons, sometimes computer-assisted, were assumed to provide a good picture of overall algorithm effectiveness.

The constraints and limitations of the project are:
1. Existing academic theory is limited in regard to Data Science applications for IR.
2. There is a lack of academic technical guides for IR students.
3. IR students tend to have weak technical literacy.
4. There is a general lack of consensus and axioms regarding machine learning model evaluation metrics. Most of the related knowledge is empirical.
5. There are no methodologies covering data science in International Relations.
6. The gap between IR theory, actual policy making and data science remains vast.
7. Information such as units of measure is provided separately on the CIA website, thus an important amount of manual labour is required.
8. Lack of standardization in textual data makes conversion to labels difficult. Vectorization could be an effective alternative, to be tested in future versions of the code.

This project aims to extract the maximum amount of data from the CIA World Factbook into a pandas DataFrame and then calculate missing values with machine learning. The dataframe must contain refined data, be easily interpretable and be ready to be used by individuals unrelated to this paper. Missing values of the dataset must be imputed accurately in order for a more complete dataset to be produced. It is thus a project consisting of three major parts: 1) data extraction, 2) data processing, 3) missing values imputation (prediction). The creation of an easy-to-use and accurate DataFrame representing the CIA World Factbook with high confidence is the primary success criterion. The referencing and usage of the dataset in the public domain will constitute the ultimate success evaluation metric, along with the relative feedback. Adaptability of the code to future versions of the Factbook is also highly desired. Data were downloaded through the official CIA World Factbook website[12], thus no web scraping was needed. The downloaded file (.zip) not only contained the data for all Factbook entries but also included pictures, appendices, fonts etc. In our case, only the files containing entry data were used. Such data were found in .html and .json files. Moreover, the .json files contained per-entry data, while the .html files contained per-field data covering all Factbook entries.
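As a brief illustration of the storage choices mentioned above, the following minimal sketch (paths and file names are hypothetical, not the actual archive layout) opens one per-entry .json file and persists a DataFrame in both pickle and .xlsx formats:

    import json
    import pandas as pd

    # open one per-entry file (hypothetical path inside the unzipped download)
    with open("factbook/json/gr.json", encoding="utf-8") as f:
        entry = json.load(f)
    print(list(entry)[:5])  # top-level keys of the entry, assuming a dict structure

    df = pd.DataFrame()  # populated by the extraction scripts described below

    # pickle preserves pandas data types; .xlsx allows human exploration
    df.to_pickle("factbook_dataframe_v1.pkl")
    df.to_excel("factbook_dataframe_v1.xlsx")  # requires an Excel writer such as openpyxl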
In essence, both the .json and .html files contained the scraped online Factbook pages in HTML: the same data, but from different perspectives. All data, including numbers, are by default handled as strings; numbers needed to be defined as such. Moving on, the structure of the downloaded files was explored and compared to the online version. It was discovered that the Factbook is structured hierarchically: categories contain fields, fields contain subfields, and some subfield data are grouped. In order for the maximum amount of data to be captured and for data types to be preserved and defined, it was essential that subfield data were captured. The need to capture all data at the lowest level possible, along with the lack of structure, resulted in the failure of XPath approaches, which could not adapt sufficiently to the structure of the Factbook. Moreover, HTML-reading functions and existing algorithms did not perform as intended due to the fact that the downloaded data were in distorted HTML format, a byproduct of the scraping conducted by the CIA in the first place. With that in mind, a custom-made approach was developed. At first, an empty pandas DataFrame was created in order to accommodate all extracted data. Then, all 268 .json files representing individual Factbook entities were opened iteratively using Python os functions, all characters were converted to lowercase, and the subfields of each entity were parsed using regular expressions and string-manipulation methods and functions. The data extraction began from the category level. Grouped subfield values were parsed using the subfield titles
" as a reference point and were then treated as numeric. Grouped columns were labeled in the dataframe by the word "num" at the beginning of their title due to the numerical type of their data. After all files had been processed, empty columns were dropped from the DataFrame and ("N/A", "na", "na%", "nan", "n/a", "$na") 17 values were replaced with numpy's nan Two major problems were encountered in this phase. The first problem was the structure of data referring to terrorist groups which led to numerous columns being created with most of them being more than 90% empty due to the fact that most terrorist groups operate in a small minority of states globally. The second problem stems from the fact that subfields of non-state entries (examples are: "World", "European Union", "Antarctica" etc.) are structured in a different manner and single data columns are created. Both problems were tackled on the next phase of the project with cleaning and preprocessing. The Dataframe resulting from the first phase of the project has: A quick exploration of our dataset as shown in the above table found that 58.3 % of cells in our 2-dimensional table were empty. It became evident at this stage that even though our data are of high quality, our dataset isn't. Further exploration of our data uncovered that a) The CIA does not offer complete data for each entity, b) the data extraction code has weaknesses due to the inherent lack of structure in the data. Problem a) was solved at a later stage with missing values imputation and missing not at random detection while problem b) had to be tackled at the data-cleaning stage which will be describe in-detail below. The weaknesses of the dataframe-populating algorithm are two. Firstly, it spots Fields and creates the relative column before moving on to look for contained subfields and the second weakness is its inability to process grouped subfields' data hence appending them as columns. As a consequence, 25 columns only contain a single value, while approximately 40 columns contain data as part of their title. Regarding data structure, the fact that not only states but also overseas territories, geographic regions, the European Union and other non-state entries are included, leads to an increase of data noise and decrease of overall structure and code performance. For this purpose, a data-cleaning script was developed aiming to reduce the noise and data size while preventing data loss. Despite these weaknesses, the data extraction and preprocessing phase captured, defined and preprocessed all 268 CIA World Factbook Entries and all subfields. Big portion of the Dataframe is empty because of the reasons discussed above. The quality 19 of the Dataframe was greatly improved over the next phases. The project aims to capture as much CIA World Factbook data as possible and then present them in an easy to interpret and ready-to-use dataset. Unfortunately, the Factbook's uneven and incomplete data collection for each entity and the presence of non-state entities which follow different structures, led to an important number of empty cells (in the Dataframe). 19 In terms of missing data and shape. 
Considering the current state of data science, the output dataset of this project will surely be used for machine learning, and has already been used as presented in Chapter 3 - Relevant Projects; thus an important problem emerges: the small number of entities (observations), when combined with multiple missing data inside columns (features), renders it almost useless for machine learning (see the curse of dimensionality). The reduction of empty cells was achieved with the following approaches, in the respective order: row removal from the dataset; column concatenation; filling of MNAR columns (ex. "txt people-and-society-major-infectious-diseases food or waterborne diseases", which does not appear in countries with no such diseases and is at first considered nan while in reality the true value is 0; for all columns assumed MNAR see Appendix D. Columns Assumed MNAR); column removal; and missing value imputation, which is covered in 2.4 Modeling and Evaluation (Missing Values Imputation). Special attention was given to maintaining an equilibrium between data loss and missing values reduction during the data selection and cleaning processes. The data cleaning process included the following actions, in the order presented below: conversion of textual columns to categorical labels, calculation of amounts, calculation of sums, and special per-column processing. Some columns can be broken down into more than one of the above. For example, column "txt transportation-ports-and-terminals container port(s) (teus)" contains both the number of container ports and the volume of cargo transferred through each one. Columns were then divided into 4 groups, depending on which of the above methods could be applied to them. Group 1 included columns ready to be converted to categorical; Group 2 contained columns out of which only amounts could be calculated; Group 3 was formed by columns out of which only sums could be calculated; and lastly, Group 4 consisted of columns which needed special (per-column) processing and were subject to any combination of the aforementioned methods. Similar code was prepared and executed for Groups 1, 2 and 3. Specifically, the code made use of regular expressions and basic Python operations in order to deconstruct text and provide the desired outputs. For columns of Group 4, different code was developed for each column. This process was the longest in terms of both time consumed and code size. The highly unstructured nature of the data raised serious challenges not only in terms of code but also in terms of data distortion. Each and every column of Group 4 was studied and the relative structure of the data was determined[22]. In many instances, data of the same column had different structures, probably due to the primary data-collecting agency and not the CIA. After a column was studied, the potential data conversions were determined. The data were then deconstructed and manipulated with attention to avoid losing or distorting them. Unfortunately, in one instance, due to the lack of standardization, data had to be supplemented from third-party sources. This was the case for column "txt geography-climate" (example of data for the UK: "temperate; moderated by prevailing southwest winds over the North Atlantic Current; more than one-half of the days are overcast" (CIA, 2019)). It was determined that the only possible output is categorical/label. Sentences were searched for punctuation characters (; : ,) in the respective order. Sentences were then split at the highest-level symbol found and the left-most section was picked for further processing.
The string segment resulting from the previous splitting operation was searched for keywords (tropical, semiarid, temperate, etc.). Keywords were selected after reviewing the data for commonly occurring words. Depending on the keyword found, each country was assigned a climate type based on the Köppen climate classification system (Beck, et al., 2018). For more information on the conversion of textual columns see Appendix E. Columns Generated from Textual - Assumptions. New columns created from the processing phase (source columns were not replaced) were named using the original title (ex. txt military-and-security-military-branches) but with the data type indicator (in this case txt) replaced either by "lbl" for categorical data (ex. lbl government-citizenship citizenship by birth), "sum" for summed amounts (ex. sum transportation-pipelines, which originally included lengths of various pipelines) or "amount" for the amount of original elements (ex. amount military-and-security-military-branches). Empty cells of category "lbl" were filled with the string "None/NA", and empty cells of category "num" or "amount" were filled with the number 0 if believed to be missing not at random (MNAR). In the case of sanitation and water-source columns, the provided values refer to the unimproved conditions. In cases where ranges of values were provided (ex. refugees 2.5-3.0), the largest value was chosen. Overall, it should be noted that the columns produced in this phase of the project (all those having "lbl", "amount", "sum", "enc" as data type indicator) are not always expected to be 100% accurate, and attention should be given when they are used. This project suffers from one of the most common problems in Data Science, missing values, which contributes negatively to a second problem, dimensionality. Missing values do not allow for holistic research to be conducted and seriously limit machine learning capabilities. The processing algorithm described above filled only a small portion of the missing data. Another option would be to generalize and fill all missing values with 0, mean values or nan, but such a choice would seriously hurt the quality and credibility of the dataset. In that context, it was decided that more missing values could be computed through machine learning; specifically, numerical values. The decision to avoid textual missing-data computation was taken on technical grounds. The objective of the models implemented in this project was to successfully fill missing values strictly within the scope of this project, not to be deployed in the real world. For the selection of the appropriate machine learning models, the following were carefully considered: 1. The author has limited machine learning knowledge and computing resources, rendering deep learning models unavailable as an option. Moreover, the models selected needed to be proven and well-established, requiring as little hyperparameter tuning as possible. 2. The dataset consists of textual, numeric and labeled data. The scope of this paper, along with the unstructured nature of the textual data contained, led to the decision that only numeric data would be predicted. The prediction of numeric values pointed to regression models. 3. The data have high variance, high dispersion, and uneven and differing distributions, with many instances of outliers due to variability. There are also instances of multiple groups of columns referring to the same category or field, increasing multicollinearity. Random Forests are known to handle chaotic data and many features well.
Multiple Linear Regression was chosen because it is simple, fast, efficient and proven, but also because it takes into consideration features that simple regression does not; Random Forests, for being known to handle chaotic data and provide good results with limited hyperparameter tuning. After having established the generic model types, it was crucial that the best model of each category be selected. This applied mostly to the linear regression models, for which sklearn offers a plethora of models. Not only models but also feature selection methods were compared. For the comparison of models and methods, it was decided that the Mean Absolute Percentage Error (MAPE) would be used. This metric is not only easy to use and interpret but is also known to be a good option for model comparisons. In order to tackle its biggest weakness, actual 0 values were removed along with the paired prediction in order to avoid divisions by zero. The validity of the metric was increased by repeating the process through K-fold validations, which were themselves placed within automated feature selection processes, also executed multiple times. The different ways of splitting the same group of data, along with the multiple different iterations, raise the confidence in the metric. Each model was run 10 times per feature selection threshold over the whole dataset. Successful models were counted cumulatively using a modified Mean Absolute Percentage Error which dropped actual 0 values along with the paired predictions. Successful models were considered those with a MAPE smaller than 15%. At the same time, the optimal feature selection techniques and thresholds were calculated. Specifically, features were selected based on Pearson's and Spearman's correlation coefficients[23][24], which take values from -1 to 1 and were converted to absolute values (0 to 1), since the strength and not the direction of correlation is important. In our study, the only case in which the correlation coefficient equaled 1 was between a column and itself. Due to that, 9 discrete absolute-value thresholds could be established at the single-decimal-digit level. Pearson's correlation as a basis for feature selection outperforms Spearman's. Contrary to other machine learning projects, overfitting is not a big problem, considering that the models are not to be used outside of this project. The test conducted above not only serves to identify which correlation type is better for our case but also helps discover the optimal threshold for choosing columns as features. Threshold 0.4 provided the most successful models[25][26]. Even though Single Linear Regression is concise in terms of outcomes, the model was run 10 times per column because of the Grid Search Cross Validation, which can provide different hyperparameters due to the randomness in splitting the data for training, testing and validating. Thus, every repetition of the model can lead to the creation of more efficient models not built in previous iterations.

[23] Between the target column and the feature column.
[24] The Pearson's correlation coefficient describes linear correlation.
[25] In the model evaluation phase, actual values of 0 were removed along with the predicted value in an effort to avoid nan/inf MAPE outcomes.
[26] More iterations would have yielded results of higher confidence, but computing resources did not allow for that.

Ridge Multiple Linear Regression was performed after consulting the findings presented in Table 4; the same table helped identify the optimal feature selection method, namely the Pearson's correlation coefficient.
Structure-wise, the algorithm was identical to the Single Linear Regression one, with the only exception being that not one but rather a group of columns was used as features after considering the correlation thresholds.

Successful models/columns completed: 11. Missing value reduction: 3.5%. (Successful models were considered those with MAPE < 15%.)

For the Random Forest Regression, scikit-learn's RandomForestRegressor (scikit-learn developers, 2020) was used. Contrary to the Linear Regression methods discussed above, the Random Forest Regression can better capture the complex and varied relations of our data. Unfortunately, a higher degree of hyperparameter tuning is required. Grid search cross validation was compared with the baseline model (hyperparameters at default) and, interestingly, performed worse. Specifically, after 10 executions of each code, the grid-search cross-validated approach provided 2 successful models over a 2-hour period, while the baseline model provided 6 successful models in 2.57 minutes. (Successful models were considered those with MAPE < 15%.) Approach B was implemented using scikit-learn's "sklearn.feature_selection.SelectFromModel" (scikit-learn developers, 2020). Importance weights were drawn from the Random Forest Regressor itself. Train and test data were transformed separately and after splitting, in order to avoid contamination or bias. The implementation of the algorithm yielded the results described below. It should be noted that the data had already been subject to the Single Linear and Multiple Linear Regressions described above, with sixty columns already filled.

Successful models/columns completed: 6. Missing value reduction: 3.5%. (Successful models were considered those with MAPE < 15%.)

The present paper covers an underrepresented topic in IR, adding to the existing limited literature. It hopes to become a breakthrough by further introducing the ideas of "Computational IR", Data Science and the "Fourth Paradigm"; thus, it offers a new perspective on the field. It challenges and invites researchers and practitioners to experiment with this new approach in a holistic manner. Moreover, it constitutes an anthology of basic tools and approaches ideal for IR students who want to explore the world of Data Science within the scope of their studies. The backbone methodology (CRISP-DM) is one of the most, if not the most, widely accepted and proven in various fields of study and application. Hard-to-dispute data of high quality are drawn from the prestigious US Central Intelligence Agency and its decades-old and widely recognized encyclopedia, the "World Factbook". The code developed for this project is modular, open-source and can be applied both to future and past publications of the Factbook. There is a great window of potential for further development of both the ideas and the code of this paper. The reporting of findings in Chapter 3 - Relevant Projects, along with the provision of a ready-to-use dataset, all support the main argument that Data Science can be extremely useful for the IR scholar and decision-taker. This is a rare case in IR of a theoretical argument coexisting in an otherwise rather technical paper. This coexistence not only serves to support the argument but also enables readers to conduct further research or build upon this very project themselves. This effort brings the contemporary ideals of the internet into the study and practice of International Relations by employing a collection of computational resources.
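Before turning to limitations, the evaluation machinery referred to above can be made concrete. The following is an illustrative reconstruction, not the project's exact code: a zero-safe MAPE (dropping actual 0 values along with their paired predictions) and a Pearson-threshold feature selector (0.4 being the best-performing threshold reported above):

    import numpy as np
    import pandas as pd

    def mape_nonzero(y_true, y_pred):
        # Mean Absolute Percentage Error, dropping pairs whose actual value is 0
        # in order to avoid division by zero, as described in the text
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        mask = y_true != 0
        return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))) * 100

    def select_features(df: pd.DataFrame, target: str, threshold: float = 0.4):
        # absolute Pearson correlation against the target column;
        # strength, not direction, of correlation is what matters here
        corr = df.corr(method="pearson", numeric_only=True)[target].abs()
        return [col for col in corr.index if col != target and corr[col] >= threshold]

    # models with mape_nonzero(...) < 15 (%) were counted as "successful"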
Prior to analyzing technical weaknesses, it is crucial that some theoretical warnings are raised. It should be noted that data science is no panacea, nor is it totally supported by axioms, especially in the social sciences. It merely constitutes a major tool supplementing the field. In that spirit, Campbell's and Goodhart's Laws should be recalled: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." (Campbell, 1979) and "When a measure becomes a target, it ceases to be a good measure." (Strathern, 1997). The International Relations data researcher should never assume a priori that reality can be totally explained and forecasted with models, nor completely disregard existing literature and history for the sake of data. On a technical level, the largest weakness of this project lies in the missing values imputation phase. This phase is very reliant on the Mean Absolute Percentage Error, which is calculated without actual 0 values and their paired predictions in order to avoid division by zero. As a result, models are evaluated, selected and implemented based on a single metric whose validity is still widely debated (as happens with most of the other related metrics). The creation of new numerical columns derived from textual ones has inherent weaknesses. The complexity of written language and the lack of uniformity and structure do not guarantee 100% accurate data transformation. Overall, the multifaceted character of this project makes its structure, presentation and holistic comprehension demanding tasks. The author plans to optimize the code in the future, ultimately aiming towards a totally object-oriented program. The Factbook will constitute the object, and the user will be able to perform various operations, breaking the original pipeline of the project. Moreover, such a program could be made into a library specifically oriented towards IR students and researchers; in the latter case, a user guide may also be produced. For the time being, all aspects of the code can be further enhanced, with the most important and time-demanding being the machine-learning missing value imputation phase, which could be replaced with deep learning models to provide more tailored predictions. The missing values imputation process can also be strengthened if more metrics[27] are used to better evaluate models. There is also large room for improvement when it comes to feature selection; in no case has the author exhausted or tested all possible approaches. NLP tasks (Liddy, 2001) can help mine large quantities of data from the textual columns, which have seen limited use in the relevant projects. Vectorization could provide a smoother transformation of textual columns to numerical ones, but at the cost of losing human interpretability. The generated dataset can be used in every type of machine learning project. It can be combined with other data in order to build forecasting models, or even be expanded to include older Factbook data (2000-2018 data are available) in order to produce time-series forecasting models or assist in visualizing the effects of country policies within the last two decades. PCA can provide strong indexes for each country, embodying a great wealth of information per sector (ex. transportation, energy etc.).
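As a minimal sketch of the PCA idea just mentioned, the snippet below derives a single per-country index from the transportation columns; the file name and the column prefix are hypothetical, not part of the released dataset:

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    df = pd.read_pickle("factbook_dataframe_v5.pkl")  # hypothetical file name

    # gather the numeric transportation columns (hypothetical prefix convention)
    transport = df.filter(like="transportation-").select_dtypes("number").fillna(0)

    # first principal component as a single "transportation index" per country
    scores = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(transport))
    df["transportation_index"] = scores.ravel()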
Lastly, this dataset can be used for classification tasks, such as labeling developed/developing countries, whose outputs can then be used in other projects as labels. MNAR columns will be revised and further studied. The approach and ideas of this paper can be used by a government to reverse engineer its true relationship with the US. Furthermore, it also serves as verification of the findings of public opinion polls (like the YouGov one used) in an effort to create bilateral-relations labels between the US and other countries. Lastly, it can also be used as a STRATCOM tool to promote a certain narrative either in favor of or against a bilateral relation with the US. This work was a poster prepared for an undergraduate course on religion. Protestantism seems to occur in countries with the highest religious diversity. Catholicism seems to be stronger than Orthodoxy and Protestantism in terms of dominant religion. Moreover, Protestantism seems to be the "softest" dominant religion. Israel and Armenia (which both have unique dominant religions) were clustered along the atheist and non-dominant clusters, meaning that these two countries are exceptions to the rule and should be examined on a per-case basis. The above are only a fraction of the findings of the related paper (poster); those interested are strongly advised to read it. The usefulness of the aforementioned paper lies mostly in education, research and advising, as it can help convey large amounts of information regarding countries and religions in fewer than 1,000 words while fueling future research and hypotheses. The next project compared the two blocs of countries mentioned in its title. After having understood and presented the political background, a data-centered comparison was conducted using 227 columns of data from the CIA World Factbook. Saudi Arabia, the United Arab Emirates and Egypt were found to be a more homogeneous bloc than Turkey and Qatar. This paper is not only useful within university amphitheaters but can also assist real-world policies and private-sector investment. As its title suggests, the last paper focused on education. Contrary to the other projects presented in this section, this one focused more on education itself rather than on international relations. It made use of combined CIA World Factbook, UN and OECD data in order to find the correlations between the quality of education in a country and more than 100 other factors. After using various tools, correlations were ultimately identified; the detailed findings are quoted in the original paper. That paper provides ample information and ideas regarding not only the future study and research of education but also the creation of more efficient education policies. It successfully managed to draw data from IR and, by using data science and statistical methods and tools, generated insights which can be used in a variety of sectors, spanning from social science research to gender studies and governance.

Appendices - User Guide*

Under no circumstances can this section be considered a complete user guide. The author is planning to produce a user guide over the next year, after having received and implemented community and academic feedback on this project.
Appendix D. Columns Assumed MNAR

txt people-and-society-major-infectious-diseases respiratory diseases, txt people-and-society-major-infectious-diseases animal contact diseases, txt people-and-society-major-infectious-diseases water contact diseases, txt people-and-society-major-infectious-diseases degree of risk, num people-and-society-children-under-the-age-of-5-years-underweight, txt people-and-society-major-infectious-diseases vectorborne diseases, txt people-and-society-major-infectious-diseases food or waterborne diseases, num people-and-society-hiv-aids-deaths, txt people-and-society-people-note, txt government-dependency-status, txt government-dependent-areas, txt government-diplomatic-representation-in-the-us chief of mission, txt government-diplomatic-representation-in-the-us chancery, txt government-diplomatic-representation-in-the-us telephone, txt government-diplomatic-representation-in-the-us fax, txt government-diplomatic-representation-in-the-us consulate(s) general, txt government-diplomatic-representation-in-the-us consulate(s), txt government-diplomatic-representation-from-the-us branch office(s), txt government-diplomatic-representation-from-the-us chief of mission, txt government-diplomatic-representation-from-the-us embassy, txt government-diplomatic-representation-from-the-us mailing address, txt government-diplomatic-representation-from-the-us telephone, txt government-diplomatic-representation-from-the-us fax, txt government-diplomatic-representation-from-the-us consulate(s) general, txt government-diplomatic-representation-from-the-us consulate(s), txt government-constitution, txt government-country-name abbreviation, txt government-government-note, txt government-legal-system, num energy-electricity-access population without electricity, num energy-electricity-installed-generating-capacity, txt military-and-security-military-branches, txt military-and-security-military-service-age-and-obligation, txt military-and-security-maritime-threats, txt military-and-security-military-note, txt communications-communications-note, num transportation-airports, num transportation-airports-with-paved-runways total, num transportation-airports-with-paved-runways 2,438 to 3,047 m, txt transportation-ports-and-terminals major seaport(s), txt transportation-ports-and-terminals oil terminal(s), txt transportation-ports-and-terminals cruise port(s), num transportation-airports-with-paved-runways under 914 m, num transportation-airports-with-unpaved-runways total, num transportation-airports-with-unpaved-runways under 914 m, num transportation-roadways total, num transportation-roadways paved, num transportation-roadways unpaved, num transportation-merchant-marine total, txt transportation-merchant-marine by type, num transportation-airports-with-paved-runways over 3,047 m, num transportation-airports-with-paved-runways 1,524 to 2,437 m, num transportation-airports-with-paved-runways 914 to 1,523 m, num transportation-airports-with-unpaved-runways over 3,047 m, num transportation-airports-with-unpaved-runways 2,438 to 3,047 m, num transportation-airports-with-unpaved-runways 1,524 to 2,437 m, num transportation-airports-with-unpaved-runways 914 to 1,523 m, num transportation-heliports, num transportation-waterways, txt transportation-ports-and-terminals river port(s), num transportation-railways total, num transportation-railways standard gauge, num transportation-railways narrow gauge, num
transportation-railways broad gauge, sum transportation-ports-and-terminals oil terminal(s), sum geography-environment-international-agreements signed, but not ratified, sum transnational-issues-refugees-and-internally-displaced-persons refugees (country of origin), sum people-and-society-major-infectious-diseases water contact diseases, sum government-dependent-areas, sum transportation-ports-and-terminals river port(s), sum transnational-issues-refugees-and-internally-displaced-persons idps, sum government-international-organization-participation, sum people-and-society-ethnic-groups, sum people-and-society-major-infectious-diseases animal contact diseases, sum people-and-society-major-infectious-diseases respiratory diseases, sum geography-natural-resources, sum terrorism-terrorist-groups-home-based, sum transportation-ports-and-terminals lng terminal(s) (import), sum geography-environment-international-agreements party to, sum transportation-ports-and-terminals major seaport(s), sum terrorism-terrorist-groups-foreign-based, sum people-and-society-major-infectious-diseases degree of risk

References

Azevedo, A., & Santos, M. F. (2008). KDD, SEMMA and CRISP-DM: A parallel overview. IADIS European Conference on Data Mining.
Baron (2014). The continuing failure of international relations and the challenges of disciplinary
Beck, et al. (2018). Present and future Köppen-Geiger climate classification maps at 1-km resolution.
Campbell (1979). Assessing the impact of planned social change.
Carr (1939). The Twenty Years' Crisis.
Caws (1963). Science, Computers, and the Complexity of Nature.
CIA. Publications: Download.
David (2013). Theory is dead, long live theory: The end of the Great Debates and the rise of eclecticism in International Relations.
SciPy: Open Source Scientific Tools for Python.
Fishman & Greenwald (2015). The Intercept.
Gadi, et al. (2013). Social Media and the Arab Spring: Politics Comes First.
Goldberg (2016). The Atlantic.
Helen (1992). International Theories of Cooperation among Nations: Strengths and Weaknesses.
Hey, et al. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond: Microsoft Research.
Hipp & Wirth (2000). CRISP-DM: Towards a Standard Process Model for Data Mining.
Ian (2005). The Flight from Reality in the Human Sciences.
JetBrains. PyCharm.
Korab-Karpowicz (2010). Political Realism in International Relations (encyclopedia entry).
Kouskouvelis (2007). Εισαγωγή στις Διεθνείς Σχέσεις [Introduction to International Relations].
Koutsoupias & Mikelis (2019). Exploring International Relations Journal Articles: A Multivariate Approach.
Koutsoupias & Mikelis (2020). Text, Content and Data Analysis of Journal Articles: The Field of International Relations.
Kremer & Müller (2014). Cyberspace and International Relations.
Liddy (2001). Natural Language Processing. Encyclopedia of Library and Information Science.
Mariscal, Marbán, & Fernández (2010). A survey of data mining and knowledge discovery process models and methodologies.
Martin (1972). Political Theory and Political Science: Studies in the Methodology of Political Inquiry.
Rosenberg (2016). International relations in the prison of political science. International Relations.
Schrodt (2011). Forecasting Political Conflict in Asia using Latent Dirichlet Allocation Models. Paper.
scikit-learn developers (2020). scikit-learn.
Shearer (2000). The CRISP-DM Model: The New Blueprint for Data Mining.
Strathern (1997). "Improving Ratings": Audit in the British University System.
Unver (2018). Computational International Relations: What Can Programming, Coding and Internet Research Do for the Discipline?
Walt (2016). Foreign Policy.
Waltz (1959). Man, the State and War: A Theoretical Analysis.
Terror on Facebook, Twitter, and Youtube.
pandas: powerful Python data analysis. Release 1.0.3.
Cross-industry standard process for data mining.

Notes

A "(MAPE)" indicator in a column title means that the column had missing data which were calculated with ML, along with the Mean Absolute Percentage Error of the model. Characters "-" and " " do not necessarily always divide categories, fields and subfields. In some cases they appear within those.
Specifically, some categories may contain "-" themselves, like people-and-society, where the hyphens are not to be considered dividers between category and field. The same applies to some fields as well (example: labor-force-by-occupation, which is a single field). Subfields may contain " " in between.

Explanations

num (MAPE): 1.05 economy-gross-national-saving hist
- "num" denotes that numerical data are contained.
- "(MAPE):" means that missing values were calculated using machine learning; "1.05" is the Mean Absolute Percentage Error of the model which imputed the missing values.
- "economy" is the category of the data; "gross-national-saving" is the field.
- "hist" means that data for various years were available; the latest were picked.

enc people-and-society-major-infectious-diseases degree of risk_high
- "enc" means that the contained data are binary numbers (1 = yes and 0 = no).
- "people-and-society" is the column's category, "major-infectious-diseases" is the field, "degree of risk" is the subfield, "_high" is the label. Countries with 1 in this column have a high degree of risk regarding infectious diseases.

txt geography-climate: Climates were converted to labels on the basis of the Köppen climate classification system (Beck, et al., 2018). Climates not originally included were manually added using external sources. For details regarding assumptions see the original code.

txt transportation-pipelines: "condensate/gas" and "oil/condensate" were grouped as "condensate". "refined petroleum product", "oil and refined products", "petroleum products" and "refined petroleum products" were classified as "refined products". "natural gas", "gas transmission pipelines", "high-pressure gas distribution pipelines", "mid- and low-pressure gas distribution pipelines", "domestic gas" and "gas transmission pipes" were classified as "gas". "crude oil" and "extra heavy crude" were considered simply as "oil". "gas and liquid petroleum" was transformed to "liquid petroleum gas". "distribution pipes", "unknown", "water" and "cross-border pipelines" were grouped under the label "oil/gas/water". "ethanol/petrochemical" was labeled as "chemicals". Columns for each type were created and filled with the respective values, otherwise considered MNAR.

txt military-and-security-military-service-age-and-obligation: For column "lbl military-and-security-conscription", "no conscription" and "no compulsory" were assumed as the "no" label, while "compulsory" was assumed as "yes". If an age younger than 15 was found, then "lbl military-and-security-military-service-age" was considered "none".

txt military-and-security-military-branches: If "no regular" was found, then "amount military-and-security-military-branches" was assumed to be 0.

txt government-dependency-status: Values were generalized into "self-sovereign" and "dependent".

txt government-government-type: "unresolved" was assumed to be "in transition". If "totalitarian" or "dictatorship" was found in the CIA comments of the subfield, then it was adopted as the government type.

txt government-legal-system: "local" was grouped with "customary"; "islamic" was grouped with "religious".

txt government-executive-branch head of government: "head of government" was labeled as "prime minister".
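The pipeline-type groupings above can be restated as a simple mapping. The dictionary below mirrors the assumptions listed in this appendix, while the function name and the fallback behavior are illustrative, not the project's exact code:

    # groupings taken from the assumptions listed in this appendix
    PIPELINE_GROUPS = {
        "condensate/gas": "condensate",
        "oil/condensate": "condensate",
        "refined petroleum product": "refined products",
        "oil and refined products": "refined products",
        "petroleum products": "refined products",
        "refined petroleum products": "refined products",
        "natural gas": "gas",
        "gas transmission pipelines": "gas",
        "high-pressure gas distribution pipelines": "gas",
        "mid- and low-pressure gas distribution pipelines": "gas",
        "domestic gas": "gas",
        "gas transmission pipes": "gas",
        "crude oil": "oil",
        "extra heavy crude": "oil",
        "gas and liquid petroleum": "liquid petroleum gas",
        "distribution pipes": "oil/gas/water",
        "unknown": "oil/gas/water",
        "water": "oil/gas/water",
        "cross-border pipelines": "oil/gas/water",
        "ethanol/petrochemical": "chemicals",
    }

    def normalize_pipeline_type(raw: str) -> str:
        # fall back to the raw value when no grouping assumption applies
        return PIPELINE_GROUPS.get(raw.strip().lower(), raw)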