title: An Interaction-based Convolutional Neural Network (ICNN) Towards Better Understanding of COVID-19 X-ray Images
authors: Lo, Shaw-Hwa; Yin, Yiqiao
date: 2021-06-13 journal: Algorithms DOI: 10.3390/a14110337

The field of Explainable Artificial Intelligence (XAI) aims to build explainable and interpretable machine learning (or deep learning) methods without sacrificing prediction performance. Convolutional Neural Networks (CNNs) have been successful in making predictions, especially in image classification. However, these famous deep learning models use tens of millions of parameters based on a large number of pre-trained filters that have been repurposed from previous data sets. We propose a novel Interaction-based Convolutional Neural Network (ICNN) that does not make assumptions about the relevance of local information. Instead, we use a model-free Influence Score (I-score) to directly extract the influential information from images to form important variable modules. We demonstrate that the proposed method produces state-of-the-art prediction performance of 99.8% on a real-world data set classifying COVID-19 chest X-ray images without sacrificing the explanatory power of the model. This proposed design can efficiently screen COVID-19 patients before human diagnosis and can serve as a benchmark for addressing future XAI problems in large-scale data sets.

The outbreak of the novel coronavirus SARS-CoV-2 (previously 2019-nCoV, also known as COVID-19) quickly spread to nearly every country in the world [23] [16] [24]. Since the beginning, the top priority for controlling the spread of COVID-19 has been to monitor suspected cases for appropriate quarantine measures and treatment. Pathogenic laboratory testing (RT-PCR, collected with the invasive nasal swab most readers will be familiar with) is the standard procedure, but the accuracy of this test is limited [23]. Chest CT scans have a high sensitivity for diagnosis of coronavirus disease 2019 (COVID-19) [2] and can be a primary tool for current COVID-19 detection [2]. This innovation, if utilized earlier, could have revolutionized the initial screening procedure for COVID-19 and might have prevented many unnecessary deaths. Looking forward, we believe that other related diseases could use similar detection methods to prevent future epidemics and pandemics. Creating these methods now will ensure the development of testing procedures with the speed, accuracy, and explainability required to deploy such Artificial Intelligence (AI) systems into future testing environments.

AI-centered, medical-imaging-based deep learning systems have already been developed using Convolutional Neural Networks (CNNs), which have shown promising results in feature extraction and learning [24]. However, due to the increasing complexity of deep CNN models, it is difficult to explain their predictions to human users. To better assist the medical community by providing and deploying Explainable Artificial Intelligence (XAI) systems for analyzing chest scans, and to save critical time that is necessary for disease control, we call for immediate attention to developing a self-interpretable and explainable Convolutional Neural Network architecture capable of early disease detection.

In 2016, DARPA initiated the Explainable Artificial Intelligence (XAI) challenge.
The goal is to build suites of machine learning algorithms that are interpretable without sacrificing prediction performance (see Figure 1). The trade-off between learning performance and the effectiveness of explanations is illustrated in Figure 1. The work in this paper delivers a novel interaction-based methodology that produces interpretability and explainability while maintaining state-of-the-art learning performance.

Figure 1: The trade-off between learning performance and the effectiveness of explanations [7] [21]. The figure presents the relationship between learning performance (usually measured by prediction performance) and the effectiveness of explanations (also known as explainability). The proposed method in our work aims to take any deep learning method and provide explainability without sacrificing prediction performance. In the diagram, the proposed method is the orange dot in the upper-right corner of the relationship plot.

A popular description of interpretability states that the essential element of XAI is the ability to explain, or to present in understandable terms, to a human [8]. Another popular version defines interpretability as the degree to which a human can understand the cause of a decision [19]. Though intuitive, these definitions lack mathematical formality and rigor from certain perspectives [1]. In this paper, we regard the core issue of an XAI problem to be "how features or variables are used to produce the prediction performance". In other words, the effectiveness of explanations in the DARPA document (see Figure 1) is innately a variable-set assessment and selection problem. This means that the explainability and interpretability of a machine learning algorithm refer directly to the measures statisticians use to assess how the features or variables affect the final prediction results. In order to establish accountability, responsibility, and transparency in an AI system, one must first establish an explainable and interpretable measure for assessing feature importance. We define the following three dimensions, D1, D2, and D3, for a measure to be interpretable and explainable from a statistical and variable-assessment perspective. In other words, these three dimensions serve as the key premises of the definition of an explainable and interpretable measure.

D1. A quantifiable measure of interpretability and explainability does not need to rely on knowledge of the correct specification of the underlying model.

D2. Any desirable measure of interpretability and explainability should state clearly how a combination of explanatory variables influences the response variable. Moreover, it should directly compute a score for a set of variables so that reasonable comparisons can be made: any additional influential variables should raise the measure, while any redundant or noisy variables should decrease its magnitude.

D3. A measure appropriate for interpretability and explainability needs to state clearly the impact the influential explanatory variables have on the predictivity of the response variable. In other words, this measure should speak directly for the predictivity (see equation [2] of [17]) of the explanatory variables.

In the field of image classification, Convolutional Neural Networks (CNNs) are well known for their high prediction performance [15] [13] [22] [11] [10] [6]. Though CNNs are successful in their application areas, there is no theoretical proof of why they perform so well [3], especially when large sets of filters are used [12] (see Table 1).
Another challenge is the underlying assumption of using learned feature maps produced by large filter sets [12] [13]. Hence, it is difficult to explain how every parameter contributes to the prediction performance and to deliver explainability that satisfies the three dimensions (D1, D2, and D3) defined above.

Figure 2: This executive diagram summarizes the key components of the proposed methods of this paper. We start with the COVID-19 image data. With a small rolling window defined, we execute the Backward Dropping Algorithm (BDA) to select the important features within this window. The BDA might, for example, select {X_1, X_2} as a variable module. Then we can construct a new variable using the technique of Interaction-based Feature Engineering (see the construction of X^† in equation (4) to appreciate this new design). For a more detailed discussion, please see Appendix 5.8.

Table 1: Summary of some well-known CNNs and their number of parameters.
  Network         Number of Parameters
  LeNet [15]      60,000
  AlexNet [13]    60 million
  ResNet50 [10]   25 million
  DenseNet [11]   0.8-40 million
  VGG16 [22]      138 million

Chernoff, Lo, and Zheng (2009) [5] present a general intensive approach, based on a method pioneered by Lo and Zheng (2002) [18], for detecting which, out of many potential explanatory variables, have an influence (impact) on a dependent variable Y. This paper presents an interaction-based feature selection methodology incorporating the notion of the influence score, I-score, as a major technique for detecting higher-order interactions in complex and large-scale data sets. Our work investigates the potential of the I-score within a novel deep learning framework. The executive diagram in Figure 2 outlines the road map of the proposed methodology and the architecture of a novel Interaction-based Convolutional Neural Network (ICNN). This novel architecture takes full advantage of the I-score and the Backward Dropping Algorithm. In other words, it produces convolutional layers that are self-interpretable, which provides explanatory power for the features of the image data at any convolutional layer where this design is implemented. In the following section, we discuss the contribution of our work and why the proposed method satisfies the three dimensions defined in §1. With all three dimensions satisfied (D1, D2, and D3), the proposed design of an Interaction-based Convolutional Neural Network (ICNN) is an ideal candidate to address XAI problems.

The key novelty of our proposed approach (see the executive design in Figure 2) rests on the collection of features identified by the I-score. Based on the contributions described above, the proposed method is model-free and hence satisfies the first dimension, D1. It provides a quantifiable measure of how much a combination of explanatory variables impacts the outcome variable, which allows statisticians to make comparisons and screen for influential features; this satisfies the second dimension, D2. In addition, the statistic, the I-score, provides a measurement for explanatory variables that is directly associated with their predictivity for the outcome variable, which satisfies the third dimension, D3. With all three dimensions (D1, D2, and D3) satisfied, the design of the proposed architecture presents an Interaction-based Convolutional Neural Network (ICNN) that is innately interpretable and explainable. It extracts influential information from the image data and generates explanatory features that are directly associated with the predictivity of the data.
Because all three dimensions are met, the proposed design of CNN is interpretable, and it can serve as a touchstone in the field of XAI. For a more detailed discussion, please see Appendix 5.1.

2 Proposed Method

2.1 Influence Score (I-score)

Assume that the response variable Y is binary (taking values 0 and 1) and that all explanatory variables are discrete. Consider the partition P_k generated by a subset of k explanatory variables {X_{b_1}, ..., X_{b_k}}. Assume all variables in this subset are binary. Then we have 2^k partition elements; see the first paragraph of Section 3 in (Chernoff et al., 2009 [5]). Let n_1(j) be the number of observations with Y = 1 in partition element j, and let \bar{n}_1(j) = n_j × π_1 be the expected number of Y = 1 observations in element j under the null hypothesis that the subset of explanatory variables has no association with Y, where n_j is the total number of observations in element j and π_1 is the proportion of Y = 1 observations in the sample. In Lo and Zheng (2002) [18], the influence score is defined as

I = \sum_{j \in P_k} \big[ n_1(j) - \bar{n}_1(j) \big]^2.    (1)

The statistic I is the summation of the squared deviations of the frequency of Y from what is expected under the null hypothesis. There are two properties associated with the statistic I. First, the measure I is non-parametric: there is no need to specify a model for the joint effect of {X_{b_1}, ..., X_{b_k}} on Y. The measure I is created to describe the discrepancy between the conditional means of Y on {X_{b_1}, ..., X_{b_k}}, disregarding the form of the conditional distribution. Second, under the null hypothesis that the subset has no influence on Y, the expectation of I remains non-increasing when dropping variables from the subset. The second property makes I fundamentally different from Pearson's χ² statistic, whose expectation depends on the degrees of freedom and hence on the number of variables selected to define the partition. We can rewrite the statistic I in its general form, where Y is not necessarily discrete, as

I = \sum_{j \in P_k} n_j^2 (\bar{Y}_j - \bar{Y})^2,    (2)

where \bar{Y}_j is the average of the Y-observations over the j-th partition element (the local average) and \bar{Y} is the global average. For the rest of the paper, we refer to this formula as the Influence Measure or Influence Score (I-score). Under the same null hypothesis, it is shown (Chernoff et al., 2009 [5]) that the normalized I, I/(nσ²) (where σ² is the variance of Y), is asymptotically distributed as a weighted sum of independent χ² random variables, each with one degree of freedom, such that the total weight is less than one. It is precisely this property that serves as the theoretical foundation for the following algorithm.

2.2 The Backward Dropping Algorithm (BDA)

The Backward Dropping Algorithm is a greedy algorithm that searches for the optimal subset of variables maximizing the I-score through step-wise elimination of variables from an initial subset sampled in some way from the variable space. The key step of the algorithm is as follows.

Drop Variables: Tentatively drop each variable in S_b and recalculate the I-score with one variable fewer; then drop the variable whose removal gives the highest I-score. Call this new subset S_b, which has one variable fewer than before, and update l = |S_b| (the length of the current subset of variables). Repeat this step, retaining the subset that attains the highest I-score along the way (end).
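To make the two ingredients above concrete, the following is a minimal NumPy sketch (not the authors' released code; the names i_score and backward_dropping, and the stopping rule, are illustrative assumptions) of how the general-form I-score in equation (2) and the backward-dropping step could be implemented for already-discretized variables.

```python
import numpy as np

def i_score(X_subset, y):
    """General-form I-score (equation (2)) of a subset of discretized variables.

    X_subset : (n, k) array of discrete (e.g., 0/1) explanatory variables.
    y        : (n,) array of responses (binary or continuous).
    Each distinct row of X_subset is one partition element; the score is
    sum_j n_j^2 * (ybar_j - ybar)^2 over the partition elements.
    """
    X_subset = np.asarray(X_subset)
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    # group observations by the partition element (distinct row) they fall into
    _, cell_id = np.unique(X_subset, axis=0, return_inverse=True)
    score = 0.0
    for j in np.unique(cell_id):
        in_cell = cell_id == j
        n_j = int(in_cell.sum())
        score += (n_j ** 2) * (y[in_cell].mean() - y_bar) ** 2
    return score

def backward_dropping(X_window, y, min_vars=1):
    """Greedy Backward Dropping Algorithm on the variables inside one window.

    Starts from all columns of X_window, repeatedly drops the variable whose
    removal gives the highest I-score, and returns the subset (column indices)
    that attained the highest I-score along the way.
    """
    keep = list(range(X_window.shape[1]))
    best_subset, best_score = list(keep), i_score(X_window[:, keep], y)
    while len(keep) > min_vars:
        scores = {v: i_score(X_window[:, [u for u in keep if u != v]], y)
                  for v in keep}
        drop = max(scores, key=scores.get)   # dropping this variable maximizes I
        keep.remove(drop)
        if scores[drop] > best_score:
            best_score, best_subset = scores[drop], list(keep)
    return best_subset, best_score
```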
This section proposes an Interaction-based Convolutional Neural Network. The architecture is presented in Figure 3. In this diagram, we start with an image sized 64 by 64 and suppose we use a window of shape 4 by 4; this small window therefore covers 16 pixels locally. This set of 16 pixels can be converted into binary variables, which gives us a well-defined partition. For example, we can discretize these pixels into black and white: each pixel takes value 1 or 0 (two levels), and the set of 16 pixels gives us a partition of size 2^16. This is the setup for running the Backward Dropping Algorithm. Based on the definition of the Backward Dropping Algorithm, in each round we tentatively drop one variable at a time and compute the I-score; at each step we drop the variable whose removal yields the highest I-score. Using this procedure, we are able to select a subset of variables out of the original 16 pixels. A visualization of the real application of this proposed convolutional layer is presented in Row (b) of Figure 5.

A deeper but similar architecture can be constructed in the same way. In Figure 3, we propose a deep Interaction-based Convolutional Neural Network: instead of building a single Interaction-based Convolutional Layer, multiple such layers can be constructed using the same procedure. See Row (b) and Row (c) in Figure 5.

In this subsection, we define a procedure to mechanically engineer an interaction-based feature. Consider an image sized 3 × 3 (that is, an image with 9 pixels), and write these pixels as X_1, X_2, ..., X_9. Consider a small window of size 2 × 2 passing from the first row and the first column of this 3 × 3 image. This means we start with {X_1, X_2, X_4, X_5} within this small 2-by-2 window. We use the Backward Dropping Algorithm to narrow down to {X_1, X_2}, because we observe that this subset of variables delivers the highest I-score. Since both X_1 and X_2 are discretized into two levels, the partition Π_{X_1,X_2} is well defined (see equation (3)). In other words, assume X_1 and X_2 both take only the values 0 and 1. Then the partition Π_{X_1,X_2} is defined as

Π_{X_1,X_2} = { π_1: (X_1 = 1, X_2 = 1), π_2: (X_1 = 1, X_2 = 0), π_3: (X_1 = 0, X_2 = 1), π_4: (X_1 = 0, X_2 = 0) }.    (3)

For each partition element π_j, j ∈ {1, 2, 3, 4}, we can compute the local mean of the response variable Y from the observations in the training set. Hence, a new interaction-based feature can be constructed as

X^† := \bar{y}_j  if the observation falls in partition element π_j ∈ Π,    (4)

where \bar{y}_j is the local mean of Y, computed on the training set, over the j-th partition element. In the general situation, consider an input matrix of size s_in by s_in. Suppose the window size is w × w and the stride level is l (in the example above, w = 2 and l = 1). By s_out = (s_in − w)/l + 1, we can compute the output matrix of size s_out by s_out (which coincides with the X^† matrix). In this case, we define a running index b taking values {1, 2, 3, ..., s_out × s_out}. In the example of the previous paragraph, b can take values {1, 2, 3, 4}, because s_out = (s_in − w)/l + 1 = (3 − 2)/1 + 1 = 2 and thus s_out × s_out = 4. For each round b of the Backward Dropping Algorithm (b ∈ {1, 2, 3, 4} in this example), we construct a new variable module following formula (4): X^†_b := \bar{y}_j, where \bar{y}_j is the local mean of Y, computed on the training-set observations, over the j-th partition element of the partition Π formed by the variables selected in the b-th round of the Backward Dropping Algorithm. Hence, the relationship from a 3-by-3 input matrix to a 2-by-2 output matrix can be visualized as a diagram in which the input matrix has size 3 by 3 and the output matrix has size 2 by 2; in the output matrix, the X^†'s are defined using equation (4) with observations in the training set.
Table 2: This table summarizes the dimensions of the data. We download the COVID data from the work by [20]. This totalled 576 COVID images and 2,000 non-COVID images. First, we split a test set from the total images; the test set consists of 60 COVID cases and 60 non-COVID cases. We are then left with 516 images for the COVID class and 1,940 images for the non-COVID class. This is the in-sample set, which we use for training and validating (abbreviated Tr. and Val. below). For the in-sample set, we upsample the images by adding noise drawn from a normal distribution, which gives us 5,000 COVID images and 5,000 non-COVID images for training and validation. The out-of-sample test set has 120 observations and is only used at the very end to check the learning performance.

The data are sourced from the work by [20]. For a detailed discussion of the source of the data, please see Appendix 5.6. We present a brief summary of the data in Table 2.

Figure 4: The figure presents two panels. Panel A exhibits 16 randomly sampled images from the class "COVID". Panel B exhibits 16 randomly sampled images from the class "non-COVID". We observe that the images for the COVID class appear more cloudy and unclear than the images for the non-COVID class. This is because in the X-ray images for the COVID class there are substances that are not air. These substances can be liquid, germs, or inflammatory fluid, which causes the images to have cloudy, unclear, and shady areas.

From Figure 4, it is visible to the human eye that the chest areas are clear in the healthy chest X-ray images, whereas the images for COVID cases are not quite clear. This indicates that inflammatory cells or other related body fluids fill the chest. Instead of air, which shows up as clear areas in the pictures, these areas in COVID cases tend to be cloudy and unclear.

In this subsection, we present the experimental results. A brief comparison between conventional methods from the literature and the proposed method is presented in Table 3. A detailed report of the proposed methodology is presented in Table 4. We also plot the AUC values for the proposed models in Figure 9. The report in Table 3 shows the results of previous work on this data set. Minaee et al. (2020) [20] used a 50-layer Convolutional Neural Network, ResNet50, and delivered an AUC value of 99%. Their work also proposed SqueezeNet, which delivered an even higher performance at 99.2% [20]. This near-perfect performance is largely due to the design of

We read the experimental results of the proposed model in the following order. As stated in §3.3, the proposed architecture can be designed to be as deep or as wide as the user desires. All models start with input images of dimensions 128 by 128; in other words, the input data has 128 × 128 = 16,384 pixels. For simplicity of notation, let us refer to the combination of a starting point of 6, a window size of 2 by 2, and a stride level of 2 as the first parameter set, and to the combination of a starting point of 1, a window size of 2 by 2, and a stride level of 2 as the second parameter set. For the training of the proposed model and the tuning parameters, please see Appendices 5.2, 5.3, and 5.4.

Model 1. This model starts with input images sized 128 by 128. Using the first parameter set, we create the first Interaction-based Convolutional Layer (1st Conv. in Table 4). This new matrix has dimension [(128 − 6 − 2 + 1)/2 + 1] × [(128 − 6 − 2 + 1)/2 + 1] = 61 × 61 = 3,721. These 3,721 variables are used directly to create the output layer with 2 units (assuming a softmax output).
Therefore, the total number of parameters for this network architecture is 3,721 × 2 = 7,442. The test-set performance, measured by AUC, is 98.5% for Model 1.

Models 2-6. We also provide further variants of Model 1; please see Appendix 5.7.

This section presents visualizations of the proposed architecture, shown in Figure 5. Unlike Figure 2, which is an executive summary with each position representing many samples, the visualizations in Figure 5 are sample-wise plots. In other words, the 10 original images sized 128 by 128 in Panel A and Panel B are the same samples shown in the second row (1st Conv. Layer) and the third row (2nd Conv. Layer).

Visualization Interpretation. In Figure 5, the original images for COVID-19 patients show grey and cloudy textures in the chest area. Because an X-ray image is brightest where most of the emitted X-ray beams are blocked by the object, we observe bones as "white" while the empty margins are completely "black". For muscle and organs inside the human body, the emitted X-ray beams are only partially blocked, and this produces the greyscale in the chest area of the X-ray images. For COVID-19 patients, there are grey and shaded areas in the chest X-ray pictures. This is due to the inflammatory fluid present when patients exhibit pneumonia-like symptoms. The fluid inside the chest is a consequence of the human immune system fighting the infection. This shading (as seen in Panel A of Figure 5) prevents us from clearly observing the location of the lungs. This is different in Panel B, where the lung areas are dark and almost black, because a healthy lung is filled with air (normal cases appear black on the X-ray image). The black-and-white contrast in the two panels is directly related to how much inflammatory fluid there is in the lungs. This contrast translates to greyscale in the pictures and is directly related to COVID versus non-COVID status (i.e., the response variable Y). The same contrast can be seen in the new variables (the X^†'s based on equation (4)) in the 1st Conv. Layer (sized 61 by 61): for COVID-19 patients the lung area is cloudy and unclear, while for the healthy cases it is clearly visible. This is not a surprising coincidence, because the proposed new variable modules, the X^†'s, are engineered using equation (4), which relies on the training-set response variable through the local means \bar{y}_j. The 61-by-61 images from the proposed algorithm are a direct translation of not only the original pixels but also the response variable. In other words, this visualization shows how the I-score sees the image data.

Explainable AI System for Early COVID-19 Screening. As the most important contribution of this paper, an Explainable Artificial Intelligence (XAI) system is proposed to assist radiologists in the initial screening of COVID-19 and other related diseases using chest X-ray images, for treatment and disease control, with accountability, responsibility, and transparency to human users and patients.

A Heuristic and Theoretical Framework of XAI. This paper introduces a heuristic and theoretical framework for addressing XAI problems in large-scale and high-dimensional data sets. We provide three dimensions (D1, D2, D3) as necessary conditions and premises required for a measure to be regarded as explainable and interpretable.

An Interaction-based Convolutional Neural Network (ICNN).
To address the XAI problems heuristically described above, this paper introduces a novel design of an explainable and self-interpretable Interaction-based Convolutional Neural Network (ICNN) to address the major issues of explainability, interpretability, transparency, and trustworthiness in black-box algorithms. We introduce and implement a non-parametric and interaction-based feature selection methodology and use it as a replacement for the pre-defined filters that are widely used in ultra-deep CNNs. Under this paradigm, we present an Interaction-based Convolutional Neural Network (ICNN) that extracts important features. We believe that the proposed design can be adapted to any type of CNN. Thus, any CNN architecture that adopts the proposed technology can be regarded as an Interaction-based Convolutional Neural Network (ICNN, or Interaction-based Network). We encourage both the statistics and computer science communities to further explore this area to deliver more transparency, trustworthiness, and accountability to deep learning algorithms and to build a world with truly Responsible A.I.

We would like to dedicate this work to H. Chernoff, a world-renowned statistician and mathematician, in honor of his 98th birthday and his contributions to the Influence Score (I-score) and the Backward Dropping Algorithm (BDA). We are particularly fortunate to have received many useful comments from him. Moreover, we are very grateful for his guidance on how the I-score plays a fundamental role in measuring the potential of a small group of explanatory variables for classification, which has much broader impact in the fields of pattern recognition, computer vision, and representation learning. This research is supported by National Science Foundation BIGDATA IIS 1741191.

Figure 5: This figure presents a visualization summary for 10 randomly sampled images each from the COVID class and the non-COVID class. Panel A is for COVID and Panel B is for non-COVID. Each panel samples 10 images from the data. Each of the 10 images is then fed into the proposed architecture, and the visualization presents these images after each proposed Interaction-based Convolutional Layer. For a detailed discussion, please see Appendix 5.8.

The illustration of the proposed Interaction-based Convolutional Neural Network (ICNN) is presented in Figure 2. This executive diagram walks the reader through a step-by-step procedure of how the proposed network architecture is designed and why it satisfies the three dimensions (D1, D2, and D3) in the definition of interpretability, and hence can serve as a benchmark for addressing XAI problems.

Proposed Architecture. The executive diagram for the proposed architecture is presented in Figure 2. First, the architecture starts with image data consisting of X-ray pictures sized 128 by 128 (see the detailed discussion of the COVID-19 data set in §4). The architecture proposes using a rolling window of size 2 by 2 (we use a 2-by-2 window for simplicity; larger sizes are applicable in practice depending on the data). Since the window size is 2 by 2, there are four variables each time the window rests on a location of the image. Within this subset of variables, we execute the proposed Backward Dropping Algorithm (BDA). This procedure finely selects a subset of variables that is highly predictive by omitting the noisy variables in this small neighborhood of the image, as illustrated in the sketch below.
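As a synthesis of the steps just described, the following sketch shows how one such Interaction-based Convolutional Layer could be fitted on training images and then applied to new images. It reuses the hypothetical i_score/backward_dropping helpers sketched in §2; all function names are illustrative assumptions rather than the authors' released code, and the images are assumed to be already discretized (e.g., binarized).

```python
import numpy as np

def fit_icnn_layer(images, y, window=2, stride=1, start=0):
    """Fit one Interaction-based Convolutional Layer on discretized training images.

    images : (n, s, s) array of discrete pixel values (e.g., 0/1).
    y      : (n,) binary labels.
    For every window position, the Backward Dropping Algorithm picks the
    influential pixel subset; the training local means of y on the partition
    generated by that subset define the new X^† feature (equation (4)).
    """
    images = np.asarray(images)
    y = np.asarray(y, dtype=float)
    n, s, _ = images.shape
    flat = images.reshape(n, -1)
    layer, y_bar = [], float(y.mean())
    for r in range(start, s - window + 1, stride):
        for c in range(start, s - window + 1, stride):
            # flattened indices of the pixels inside this window
            idx = [(r + i) * s + (c + j)
                   for i in range(window) for j in range(window)]
            keep, _ = backward_dropping(flat[:, idx], y)      # BDA selection
            sel = [idx[k] for k in keep]
            # local mean of y on every partition element of the selected pixels
            cells, cell_id = np.unique(flat[:, sel], axis=0, return_inverse=True)
            local_means = {tuple(cells[j]): y[cell_id == j].mean()
                           for j in range(len(cells))}
            layer.append((sel, local_means))
    return layer, y_bar

def transform_icnn_layer(images, layer, y_bar):
    """Map (training or new) images to the X^† feature map of a fitted layer."""
    flat = np.asarray(images).reshape(len(images), -1)
    out = np.empty((len(flat), len(layer)))
    for b, (sel, local_means) in enumerate(layer):
        for i in range(len(flat)):
            # partition cells never seen in training fall back to the global mean
            out[i, b] = local_means.get(tuple(flat[i, sel]), y_bar)
    return out
```

Stacking further layers as in Figure 3 would amount to reshaping the output of transform_icnn_layer back into a square map and fitting another layer on it.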
Next, the proposed architecture is transparent in disclosing to its human users which locations of the image are important for making prediction decisions and which locations are noisy. In every Interaction-based Convolutional Layer, we can compute the proposed measure, the I-score, for any single variable. We can also compute the I-score for, and finely select predictive variables from, any combination of variables. The larger the I-score value, the more important the variables are for making predictions. With the simple visualization presented in Figure 2, we are able to use a spectrum of colors to illustrate this phenomenon: we can code high I-score values to one side of the color spectrum and low I-score values to the opposite side. The areas that are informative have high I-score values and a very different color from areas that are noisy. This characteristic of the I-score allows the statistician to make comparisons and variable-selection assessments. Therefore, it satisfies the second dimension, D2, defined in §1.

Third, the proposed architecture has a direct association with the predictivity (see equation [2] of [17]) of a variable set. This means that the important features screened by the I-score tell the statistician how much impact a variable set has on the response variable. In addition, it is beneficial to be able to compute the I-score for any variable in any Interaction-based Convolutional Layer, which implies that the association with predictivity is well stated at each step of the architecture. This can be visualized using the AUC values (see §4.4 for a detailed discussion of AUC values) coded onto the same spectrum locations as the I-score values. Moreover, we observe the paths of the I-score and AUC values to behave in parallel: the I-score and AUC values move up and down together, so a variable with a high I-score also has a high AUC value, and vice versa. This interesting yet powerful phenomenon allows the architecture to satisfy the third dimension, D3, stated in the definition of an interpretable measure, which is novel in the literature.

In summary, the proposed Interaction-based Convolutional Neural Network (ICNN) relies on the I-score, a measure that satisfies the three dimensions of an explainable and interpretable measure. We regard the proposed ICNN as an explainable and interpretable deep learning algorithm. Thus far, we have not been able to identify other methods and procedures that satisfy all three dimensions defined in §1.

For the neural network classifier adopted in this paper, we use a standard neural network architecture with some basic design choices (see the diagram after equation 5). This procedure has input variables (the input layer), an optional hidden layer, and an output layer. The input layer is the set of variables ready to be fed into the machine to build the classifier. The hidden layer can consist of any number of units and is an optional design to deepen the network architecture. The output layer is constructed in order to compare with the output variable; this comparison is quantified by a loss function. The path from the input layer to the output layer completes the procedure of forward propagation. Based on the loss function, we are able to compute the gradient, which allows us to optimize the weights of the architecture backwards using gradient descent (or an upgraded version of gradient descent). This completes backward propagation.
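For concreteness, the classifier head just described could be written as follows with TensorFlow/Keras (an assumption; the paper does not prescribe a specific library, and the hidden-layer width, ReLU hidden activation, and training hyper-parameters shown here are illustrative). The model takes the X^† variable modules as its input layer, uses the RMSprop optimizer and binary cross-entropy loss discussed below, and reports AUC.

```python
import tensorflow as tf

def build_classifier(n_modules, hidden_units=64):
    """Small feed-forward classifier on top of the X^† variable modules."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_modules,)),                      # the X^† modules
        tf.keras.layers.Dense(hidden_units, activation="relu"),  # optional hidden layer
        tf.keras.layers.Dense(1, activation="sigmoid"),          # predicted probability
    ])
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

# Illustrative usage on feature maps produced by the fitted ICNN layer(s):
# X_tr, X_val : (n, n_modules) arrays of X^† values;  y_tr, y_val : 0/1 labels.
# model = build_classifier(X_tr.shape[1])
# model.fit(X_tr, y_tr, validation_data=(X_val, y_val), epochs=20, batch_size=32)
# print(model.evaluate(X_val, y_val, return_dict=True)["auc"])
```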
The training of a neural network model is based on many iterations of forward and backward propagation.

Forward Propagation. To illustrate the procedure of model training, let us consider a set of input variables {X^†_1, X^†_2, X^†_3}. In the proposed work, these are the variable modules, denoted X^†'s, that we created using Interaction-based Feature Engineering (see equation (4)). For this discussion, we define a set of weights {w_1, w_2, w_3} to construct a linear transformation. The symbol Σ in the diagram below represents this linear transformation, which takes the form X^†_1 w_1 + X^†_2 w_2 + X^†_3 w_3. For simplicity of notation, we write Σ = \sum_{j=1}^{3} w_j X^†_j for short. We then denote by a(·) an activation function and choose the sigmoid as this activation function, so that the output ŷ is defined as a(Σ). In other words,

\hat{y} := a(Σ) = \frac{1}{1 + e^{-\sum_{j=1}^{3} w_j X^†_j}}.

The general form (assuming there are p variable modules) is expressed as

\hat{y} := a\Big(\sum_{j=1}^{p} w_j X^†_j\Big).

Architecture. The architecture of this neural network is presented below. For simplicity of drawing the picture, we assume there are 3 input variable modules, {X^†_1, X^†_2, X^†_3}. In practice, the number of variable modules (the total number of X^†'s) depends on the image dimensions, window size, stride level, and starting point (please see §4.3 and equation (8) for the exact calculation). In the diagram, the three input modules feed into the weighted sum Σ, followed by the sigmoid activation function a(·) that produces ŷ.

For the loss function, we use the binary cross-entropy loss. This loss function is designed to minimize the distance between a target probability distribution P and an estimated distribution Q when the task is a two-class classification problem. The cross-entropy loss function is defined as

L(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \big],

where y_i is the ground truth of the response variable for the i-th observation and \hat{y}_i is the predicted value of the response variable for the i-th observation. The linear transformation, the non-linear transformation, and the computation of the loss function complete the forward propagation.

Backward Propagation. To search for the optimal weights, we use an optimizer called RMSprop (short for Root Mean Square Propagation, a name suggested by Geoffrey Hinton). With the loss function computed above, we derive the gradient of the loss function, ∇L := ∂L(y, ŷ)/∂w. At each iteration t, we compute v_{t,∇L} := β v_{t−1,∇L} + (1 − β)(∇L)², where β is a tuning parameter; note that the square of ∇L is taken element-wise. Then we update the weights using w_t := w_{t−1} − η · ∇L / \sqrt{v_{t,∇L}}, where η is the learning rate. The value of the learning rate η is a tuning parameter and is usually a very small number. This process starts with the loss function and goes back to the beginning to update the weights w = {w_1, w_2, w_3}; hence it earns the name backward propagation.

This section investigates the tuning parameters of the proposed method. In the design of the Interaction-based Convolutional Neural Network (ICNN), the window size and the stride level are hyper-parameters.

Window Size. The window size is the size of the local area that we narrow down to in order to run the Backward Dropping Algorithm. For example, in the first artificial example in §2.3, the data is sized 6 × 6. A window size of 2 × 2 means that we start with a local area covering the first two columns of the first row and the first two columns of the second row.
For each row i and each column j in this 6 × 6 grid structure, a window size of 2 × 2 means a local area given by the following 2-by-2 matrix:

  (i, j)      (i, j+1)
  (i+1, j)    (i+1, j+1)

For a 2-by-2 matrix, the Backward Dropping Algorithm will iteratively drop variables amongst these 4 variables while computing the I-score; the result will be a subset of these four variables. We can also change the size of this window to 3 × 3, which means we investigate the following matrix:

  (i, j)      (i, j+1)    (i, j+2)
  (i+1, j)    (i+1, j+1)  (i+1, j+2)
  (i+2, j)    (i+2, j+1)  (i+2, j+2)

This means that the Backward Dropping Algorithm will start with 9 variables. After omitting the noisy variables within this set, the resulting predictive set will be a subset of these 9 variables.

Stride Level. The stride level is the number of rows or columns the window moves over at each step. This tuning parameter allows the algorithm to move faster, at the cost of skipping some variables. For example, consider a stride level of 1 starting from row i and column j, using a 2 × 2 window. We can visualize this action with the following diagram:

  Original matrix:
  (i, j)      (i, j+1)        stride=1        (i, j+1)    (i, j+2)
  (i+1, j)    (i+1, j+1)       −→             (i+1, j+1)  (i+1, j+2)

If we are at the end of a row, we move down to the first column of the next row. For example, in a grid structure of size 6 × 6, assume we are in the last position of a certain row i. The action of stride level 1 is shown in the following diagram:

  Original matrix at the end of a row:
  (i, 5)      (i, 6)          stride=1        (i+1, 1)    (i+1, 2)
  (i+1, 5)    (i+1, 6)         −→             (i+2, 1)    (i+2, 2)

Again assume we are at row i and column j. Suppose we set the stride level to 2 and we want to move down. This means we increase the row index in increments of 2, and the action is the following:

  (i, j)      (i, j+1)        stride=2        (i+2, j)    (i+2, j+1)
  (i+1, j)    (i+1, j+1)       −→             (i+3, j)    (i+3, j+1)

If this window happens to be in the final position of a row, then we move down by skipping one row and start from the first column. In a grid structure of size 6 × 6, this action is shown in the following diagram:

  Original matrix at the end of a row:
  (i, 5)      (i, 6)          stride=2        (i+2, 1)    (i+2, 2)
  (i+1, 5)    (i+1, 6)         −→             (i+3, 1)    (i+3, 2)

Starting Point. Another tuning parameter that we recommend adjusting is the starting point. The starting point is the location of the first pixel at which we begin the proposed operation. The vanilla starting point is to start the rolling window from the pixel located at the first row and the first column (Starting point = 1; this pixel is colored red in the original figure). Alternatively, we can initiate the starting point at a higher level, such as two or three, which allows the algorithm to run more efficiently on large-scale data sets. For a simple example, in a matrix sized 6 × 6 (see §2.3 for the first artificial example), the first row of variables is {X_1, X_2, ..., X_6} and the second row is {X_7, X_8, ..., X_12}. With a starting point of 2, we start passing the rolling window at X_8, because this variable sits at the position in the second row and the second column (Starting point = 2; this pixel is colored red in the original figure). This tuning parameter is particularly useful when the edges of the images are noisy and non-informative. For example, in §4.1 we notice from the training images of the COVID-19 chest X-ray data set that the margins of the X-ray images are dark and contain no part of the human body within a few pixels of the edge. This is an example where this tuning parameter is helpful, and it speeds up the training process.
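The three tuning parameters can be summarized in a small sketch that enumerates the window positions and the resulting output size. The floor in output_size is an inference from the worked examples in the text (a 6 × 6 grid with w = 2, l = 1, p = 1 gives 5 × 5; a 128 × 128 image with w = 2, l = 2, p = 6 gives 61 × 61), and the function names are illustrative.

```python
def window_positions(s_in, w=2, l=1, p=1):
    """Top-left corners visited by a w-by-w window on an s_in-by-s_in grid.

    l is the stride level and p the (1-indexed) starting point, so the first
    window sits at row p, column p; returned positions are 1-indexed (i, j).
    """
    starts = list(range(p, s_in - w + 2, l))   # last admissible start is s_in - w + 1
    return [(i, j) for i in starts for j in starts]

def output_size(s_in, w=2, l=1, p=1):
    """Side length of the output feature map; matches equation (8) below."""
    return (s_in - p - w + 1) // l + 1

# A 6-by-6 grid with a 2-by-2 window, stride 1, starting point 1
# visits 25 positions and yields a 5-by-5 output:
assert output_size(6, w=2, l=1, p=1) == 5
assert len(window_positions(6, w=2, l=1, p=1)) == 25
# The 128-by-128 X-ray images with w=2, l=2, starting point 6 give 61-by-61,
# and starting point 12 gives 58-by-58, matching the worked examples:
assert output_size(128, w=2, l=2, p=6) == 61
assert output_size(128, w=2, l=2, p=12) == 58
```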
Computation of Dimensions. The discussion above introduced the tuning parameters of window size, stride level, and starting point. These parameters update our input matrix and generate a new matrix of a different size. Let us denote the window size by w, the stride level by l, and the starting point by p. Given an input matrix of size s_in by s_in, the output matrix has dimensions s_out by s_out, where

s_out = \lfloor (s_in − p − w + 1) / l \rfloor + 1.    (8)

For simplicity, we assume input matrices are square; in other words, in the application of this paper, we process the input images to have the same width and height, i.e., 128 by 128 pixels. For future work, this can be generalized to different shapes.

This paper focuses on using the Area Under the Curve (AUC) as the main evaluation metric. The AUC value is obtained from the Receiver Operating Characteristic (ROC) curve, a path constructed using different pairs of specificity and sensitivity.

Sensitivity and Specificity. The notion of sensitivity is interchangeable with recall or the true positive rate. In a simple two-class classification problem, the goal is to investigate the covariate matrix X in order to produce an estimated value of Y. The output of a Neural Network model is always between 0 and 1, which acts as a probabilistic statement describing the chance that an observation is class 1 or class 0. Given a threshold between 0 and 1, we can compute the sensitivity as

Sensitivity = TP / (TP + FN),

the proportion of truly positive observations that are predicted positive at that threshold. On the other hand, specificity is also used to create the ROC curve. Given a certain threshold between 0 and 1, we can compute the specificity as

Specificity = TN / (TN + FP),

the proportion of truly negative observations that are predicted negative at that threshold. If two observations from different classes receive the same predicted probability, then at least one of them must be ranked incorrectly; this information is reflected by the same procedure. Please see Table 7.

In this section, we illustrate the proposed methodologies on the following artificial examples. In the first artificial example, we demonstrate the procedure of constructing an Interaction-based Convolutional Layer and engineering Interaction-based Features. Let us consider the following experiment. We create an artificial data set with 500 data points in the training set and 10,000 data points in the testing set. The observations are drawn independently from Bernoulli(1/2); in other words, we independently generate X_i ∼ Bernoulli(1/2) for i ∈ {1, 2, 3, ..., 36}. We define the underlying model (referred to here as model (11)) to be

Y = \begin{cases} (X_1 + X_2) \bmod 2 & \text{with probability } 1/2, \\ (X_3 + X_4 + X_5) \bmod 2 & \text{with probability } 1/2. \end{cases}    (11)

Model (11) is a two-module example: the first module is a two-way interaction in modulo 2, and the second module is a three-way interaction in modulo 2. The response variable Y is defined by the first module 50% of the time and by the second module 50% of the time. In this setup, no individual explanatory variable has any marginal predictive power on the response variable Y.

Scenario I. Assume the statistician knows the model in this simulation. This means he is fully aware that S_1 = {X_1, X_2} is one important variable set and S_2 = {X_3, X_4, X_5} is the other. In other words, he can use the first module as a predictor of the response variable Y. He is able to compute the theoretical prediction rate of the first variable set to be 75%. This is because the response variable is defined by the first variable module S_1 = {X_1, X_2} exactly 50% of the time, so S_1 is able to guess Y correctly on at least those 50% of occurrences; the remaining 50% of the time the response variable is defined by the second variable module S_2 = {X_3, X_4, X_5}.
Since there is no marginal signal in those cases, the performance there is exactly like random guessing, and hence S_1 only gets the remaining occurrences correct 50% of the time. In other words, assuming the training sample has n data points, we can write

θ_c(S_1) = \frac{1}{n} \mathbb{E}[\text{number of correct predictions using } S_1] = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot \frac{1}{2} = 0.75.    (12)

This result extends to the other variable module as well. If the statistician uses the second variable module S_2 = {X_3, X_4, X_5}, a similar calculation gives

θ_c(S_2) = \frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot 1 = 0.75.    (13)

The theoretical prediction rate for model (11) is the maximum percentage accuracy delivered by S_1 and S_2; in this case, max(θ_c(S_1), θ_c(S_2)) = 0.75.

Scenario II. In practice, it is often the case that we do not observe exactly what the underlying model is in a given data set. The recommendation is to use the I-score to select the important local information and then create an interaction-based feature based on the selected variables. The diagram for this procedure can be seen in Figure 8. The classical Convolutional Neural Network procedure starts with many pre-defined filters (or kernels) from a previous data set. The size of a filter can be 2 × 2, 3 × 3, or any higher dimension. The procedure starts at the first row and the first column and then passes the filter across the rows and columns of the image. The proposed methodology shares similar characteristics; however, instead of using a pre-trained filter, we apply the Backward Dropping Algorithm in each local area.

Constructing the Proposed Convolutional Layer. In the proposed artificial example, we have a data set with 36 variables. This means each observation can be reshaped into a 6 × 6 grid structure; in other words, each observation may be considered an image. The first row and first column holds variable X_1, and the first row and last column holds variable X_6. This means we can arrange the variables into the following structure:

  Originally:
  X_1   X_2   X_3   X_4   X_5   X_6
  X_7   X_8   X_9   X_10  X_11  X_12
  X_13  X_14  X_15  X_16  X_17  X_18
  X_19  X_20  X_21  X_22  X_23  X_24
  X_25  X_26  X_27  X_28  X_29  X_30
  X_31  X_32  X_33  X_34  X_35  X_36

The four variables in the top-left corner (colored red in the original figure) are the first to be used in the Backward Dropping Algorithm: with a window size of 2 × 2, we use {X_1, X_2, X_7, X_8} and execute the Backward Dropping Algorithm on the variables within this window. Notice that in the diagram (see Figure 8), the blue region indicates a local area where we run the Backward Dropping Algorithm, and the pink region indicates the engineering workflow of building an Interaction-based Feature using equation (4). (The artificial data has 6 × 6 = 36 variables, which can be arranged in a grid structure of shape 6 by 6; we use a window size of 2 by 2, and passing this window from the top-left corner of the original 6-by-6 matrix creates a new matrix of dimension 5 by 5. The number 32 is the number of units in the hidden layer; in this example there is one hidden layer, which is sufficient for the dimension of the data.)

For each window, we run the Backward Dropping Algorithm; the procedure follows the steps introduced in §2.2, and the steps for this window can be seen in Table 5. First, we start with all 4 variables, {X_1, X_2, X_7, X_8}. At each step, we take turns dropping one variable and computing the I-score of the remaining variables. It can be seen that X_8 should be dropped, because the I-score rises from 160.18 in step 1 to 319.65 in step 2. We conduct the same procedure in step 2 and realize that X_7 needs to be dropped; the I-score then increases to 638.17. We can keep dropping variables, but the statistician realizes that the I-score is highest when only two variables, {X_1, X_2}, are left.
Hence, we put an up-arrow at the step that indicates the optimal variable selection (see the up-arrow in Table 5). In this experiment, the optimal selection is the variable module {X_1, X_2}, with an I-score of 638.17. The discussion above shows how to finely select the important variables within each local area under a window size of 2 × 2. Similar in flavor to the standard Convolutional Neural Network procedure of applying a filter to a local area of an image, the proposed method also investigates local information; however, instead of relying on many pre-defined filters, the proposed approach uses the I-score, a model-free influence measure that is directly associated with predictivity.

After the variable module is selected within a local window, we adopt formula (4) to create Interaction-based Features. For example, in the local window above, we narrow down to the variable module {X_1, X_2}. This means we can create the first Interaction-based Feature as

X^†_1 := \begin{cases} \bar{y}_1 & \text{if the observation falls in } X_1 = 1, X_2 = 1, \\ \bar{y}_2 & \text{if the observation falls in } X_1 = 1, X_2 = 0, \\ \bar{y}_3 & \text{if the observation falls in } X_1 = 0, X_2 = 1, \\ \bar{y}_4 & \text{if the observation falls in } X_1 = 0, X_2 = 0, \end{cases}

where \bar{y}_j, j ∈ {1, 2, 3, 4}, are the training-set local means over the 2² = 4 partition elements generated by {X_1, X_2}. This gives us the first predictor with which to build a classifier and make predictions.

The experimental results from the first artificial example can be seen in Table 6, Table 7, and Table 8. Suppose a statistician knows the model; the theoretical prediction rate is 75% (derived using formulas (12) and (13)). In reality, we suppose that the statistician has no knowledge of the data set. In this case, he can directly proceed using a Neural Network or even a Convolutional Neural Network to make predictions. However, without the correct model specification, the performance of these models is sub-par: both the Neural Network and the Convolutional Neural Network perform at 50% on the test set, which is essentially random guessing.

Alternatively, the statistician can proceed using the proposed methods. First, one can generate an Interaction-based Convolutional Layer. Instead of using many pre-trained filters for feature mapping, the proposed method adopts the I-score and the Backward Dropping Algorithm to construct X^†_{{X_1,X_2}}, X^†_{{X_8,X_9,X_21}}, and so on. The proposed methodology also has an exact I-score associated with each variable module, which allows us to rank feature importance. With the I-score, it is obvious that the first variable module X^†_{{X_1,X_2}} is much more influential than the other variable modules. We pursue a window size of 2 × 2. In a data set with 6² = 36 variables, this gives us (6 − 2 + 1)² = 25 variable modules. Using these 25 variable modules, we are able to build a classifier that delivers 76% AUC (see §4.4 for a detailed discussion of AUC values) on the test set, much higher than the 50% from before. We can also use a window size of 3 × 3, which gives us (6 − 3 + 1)² = 16 variable modules. These 16 modules allow us to build another classifier using a Neural Network and deliver 76% AUC on the test set as well. The detailed variable modules and their corresponding I-scores are exhibited in Table 6. In this table, we present four columns: New Modules, Variables, associated I-score, and corresponding AUC. Each new module is constructed using equation (4); the variables come from the BDA; the I-score of a variable set is computed using equation (2). For the computation of AUC, please see §4.4.
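To illustrate equation (4) end to end on the first artificial example, the following self-contained sketch generates data from model (11) with the sample sizes stated above, builds the X^† feature for the module {X_1, X_2}, and checks that this single feature alone classifies roughly 75% of the test set correctly. The random seed, the 0.5 threshold, and the variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_model_11(n):
    """36 independent Bernoulli(1/2) pixels; Y follows S1 = {X1, X2} half of the
    time and S2 = {X3, X4, X5} the other half (mod-2 sums), as in model (11)."""
    X = rng.integers(0, 2, size=(n, 36))
    use_s1 = rng.random(n) < 0.5
    y = np.where(use_s1, (X[:, 0] + X[:, 1]) % 2, (X[:, 2] + X[:, 3] + X[:, 4]) % 2)
    return X, y

X_tr, y_tr = simulate_model_11(500)       # training set
X_te, y_te = simulate_model_11(10_000)    # testing set

# Equation (4) for the module {X1, X2}: the new feature is the training-set
# local mean of Y on each of the 2^2 = 4 partition cells of (X1, X2).
cells, cell_id = np.unique(X_tr[:, [0, 1]], axis=0, return_inverse=True)
local_mean = {tuple(cells[j]): y_tr[cell_id == j].mean() for j in range(len(cells))}
x_dagger_te = np.array([local_mean[tuple(row)] for row in X_te[:, [0, 1]]])

# Thresholding this single interaction-based feature at 1/2 recovers roughly
# the 75% theoretical prediction rate on the test set.
accuracy = np.mean((x_dagger_te > 0.5).astype(int) == y_te)
print(f"test accuracy from the X_dagger feature on {{X1, X2}}: {accuracy:.3f}")
```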
In the artificial data set with 6 × 6 = 36 variables, a window size of 2 × 2 allows us to generate (6 − 2 + 1)² = 25 variable modules. Each module is created from the variables selected in a local 2 × 2 window after applying the Backward Dropping Algorithm, and the module is generated using formula (4). We easily observe that the variable module X^†_1, formed from {X_1, X_2}, is the most influential candidate, because it is directly linked to predictivity. This can be confirmed by the AUC value when this module is used to build a classifier by itself.

A different window size can be selected. In this artificial example, we also tried the size 3 × 3. For data with 6 × 6 = 36 variables, a window size of 3 × 3 allows us to create (6 − 3 + 1)² = 16 variable modules. The results of this experiment can be seen in Table 7, where we observe a slightly different set of variable modules selected. We notice that the most influential variable module X^†_1, based on {X_1, X_2}, remains the same. In other words, if a local section of the data has real predictive power, the proposed method will almost always pick out that information. This can be verified using the AUC value (see §4.4 for a detailed discussion of AUC values) associated with the variable module. Since we pass over a window sized 3 × 3, we are sometimes able to screen for higher-order interactions. As discussed before, the I-scores are presented in Table 7 so that feature ranking is possible if desired. From this table, we observe that another module, X^†_9, constructed from {X_3, X_4, X_5}, is also very important; though it has a lower I-score than the first module X^†_1, its I-score is higher than that of any other candidate in the data set.

Prediction. After the variable modules (the variables named X^†'s) are constructed, we can proceed with building a neural network classifier. In Table 6, there are 25 variable modules, i.e., {X^†_1, ..., X^†_25}. We use these variable modules as the input layer of a neural network architecture; in other words, the input layer has 25 neurons (these neurons are exactly the 25 variable modules named X^†'s). For the variable modules in Table 7, the input layer consists of 16 neurons, {X^†_1, ..., X^†_16}. Since we are dealing with a rather small input layer (only 25 neurons), there is no hidden layer when we use the 25 variable modules (the X^†'s) in Table 6. Alternatively, suppose we use the 16 variable modules in Table 7. We can compute \hat{Y} as

\hat{Y} = a\Big(\sum_{j=1}^{16} w_j X^†_j\Big),

assuming the same sigmoid activation function a(·) as in §4.2. This computation is forward propagation (a standard procedure widely used in neural networks); it is discussed in §4.2, and the architecture diagram there gives a detailed description. We assume there is no bias term. The objective is then to minimize the loss between Y and \hat{Y}; hence, we may write min_w L(Y, \hat{Y}). We can simply choose the loss function to be the squared error,

L(Y, \hat{Y}) = \sum_{i=1}^{n} (y_i − \hat{y}_i)^2.

In order to minimize this loss function, an optimizer is required, and we can choose the standard gradient descent algorithm.

The results for the first artificial example are presented in Table 8. We discussed in Scenario I that the theoretical prediction rate is 75% if a statistician knows the model. In practice, we assume that the statistician has no knowledge of the model. In this case, the standard approach is to work directly with all variables; for example, one can build a classifier such as a Neural Network.
In this example, the data set has 6 × 6 = 36 variables, so one can also adopt a Convolutional Neural Network by reshaping the input for each observation from 1 × 36 into 6 × 6. Both the Neural Network and the Convolutional Neural Network deliver poor performance, i.e., approximately 50% AUC. This is because the underlying model is built from modulo-2 interactions and there is no marginal signal. In addition, the in-sample training set has only 500 observations, which further raises the difficulty for complex architectures such as deep learning models. However, this issue can be resolved if we have the correct model specification. Based on the proposed methodology, we are able to create the variable modules presented in Table 6 and Table 7; their test-set performance is presented in Table 8. A window size of 2 × 2 produces 25 variable modules and delivers an AUC of 75%. A window size of 3 × 3 produces 16 variable modules and delivers an AUC of 76%. These results are close to the 75% theoretical prediction rate.

The source of the data is the work by [20], which can be accessed at https://pubmed.ncbi.nlm.nih.gov/32781377/. The authors [20] made the dataset publicly available to the research community at https://github.com/shervinmin/DeepCovid.git. We download 576 COVID images directly from their work. We also collect 2,000 healthy chest X-ray images from their database and use these images as the non-COVID class. The work by [20] also includes other diseases, such as pneumonia, in the data; the goal of our work is the binary classification of COVID versus non-COVID cases [25].

Table 8: Test-set performance for the first artificial example. The proposed method that uses the I-score has prediction performance close to the theoretical prediction rate (see §4.4 for a detailed discussion of AUC values). The notation "NN" stands for neural network; this is discussed in §4.2, together with the architecture diagram there.
  Method                                                                      Test AUC
  Theoretical Prediction                                                      0.75
  Net-3                                                                       0.50
  LeNet-5                                                                     0.50
  Interaction-based Conv. Layer: window size 2 × 2 (25 modules, Table 6)      0.75
  Interaction-based Conv. Layer: window size 3 × 3 (16 modules, Table 7)      0.76

Discussion for Figure 2. Using the selected variables X_1 and X_2, we construct X^†. The procedure of the BDA is illustrated in the bottom-left corner of the figure (we use a 2-by-2 window for simplicity). We set the starting point to 12 (i.e., we start from the pixel at the 12th row and the 12th column). From the data (sized 128 by 128) to the 1st layer (58 by 58), this gives a new dimension computed as (128 − 12 − 2 + 1)/2 + 1 = 58 (see equation (8)). We can repeat the process for the 2nd Layer and the 3rd Layer. After the 3rd Layer, we shrink the dimension to 14 by 14 (which gives us 196 new variable modules, i.e., the new X^†'s). We fully connect these 196 variable modules with the 10 neurons in the hidden layer (in practice, the number of hidden layers and the number of neurons are tuning parameters). This novel design is fundamentally different from the conventional practice of using pre-trained filters, because it proposes to use an explainable measure, the I-score, to extract and build information directly from the images in the training data. For each local variable (pixels in the data, variables in the layers), we compute the I-score value and also the AUC (see §4.4 for a detailed discussion of AUC values) for that variable (using this variable as a predictor alone). We observe that the I-score value fully represents the predictivity of each local variable, which can be confirmed by the variable's AUC value.
The color spectra of both the I-score and the AUC are presented in the bottom part of the diagram. We observe that the I-score values exhibit parallel behavior with the AUC values: variables with high I-score values have high AUC values, which indicates strong predictive power of the information at those locations. This design relies heavily on the I-score and yields an architecture that is interpretable at each location of the image in each convolutional layer. More importantly, the proposed design satisfies all three dimensions (D1, D2, and D3 in §1, Introduction) of the definition of interpretability and explainability.

Discussion for Figure 5. This figure presents a visualization summary for 10 randomly sampled images each from the COVID class and the non-COVID class. Panel A is for COVID patients and Panel B is for non-COVID people. The first row plots the original images, sized 128 by 128. The 1st Conv. Layer generates 61 × 61 = 3,721 new variables; we plot the same 10 images from both classes using these 3,721 variables in the second row and print the predicted COVID probability in the top-left corner of each image. The 2nd Conv. Layer generates 30 × 30 = 900 variables; we plot the same 10 image samples from both classes using these 900 variables in the third row and again print the predicted COVID probability in the top-left corner of each image, assuming only these 900 variables are used as predictors. The plot of the original images for COVID-19 patients shows grey and cloudy textures in the chest area. This is due to inflammatory fluid present when patients exhibit pneumonia-like symptoms. This shading (as seen in Panel A) prevents us from clearly observing the location of the lungs. This is different in Panel B, where the lung areas are dark and almost black, which means the lung is filled with air (i.e., normal cases). The black-and-white contrast in the two panels is directly related to how much inflammatory fluid there is in the lungs, which translates to greyscale in the pictures. The same contrast can be seen in the new variables (the X^†'s based on equation (4)) in the 1st Conv. Layer (sized 61 by 61): for COVID-19 patients the lung area is cloudy and unclear, while for the healthy cases it is clearly visible.

See Table 4 for detailed information, including the parameters of each layer and the out-of-sample prediction performance.

Original Images to 1st Conv. Layer. The input images are sized 128 by 128. With the 1st Conv. Layer constructed, we have 61 × 61 = 3,721 new variables. We trace back to the same samples shown in the first row of Figure 5 and use these 3,721 variables only. When we plot these samples with the new variables, we reshape them back into matrix form of 61 by 61. Panel A is for the COVID class and Panel B is for the non-COVID class. In addition, we use Model 1 in Table 4 to produce the text stating the predicted probability of the COVID class. Red text implies the ground truth is the COVID class (Panel A), and green text implies the ground truth is the non-COVID class (Panel B).

1st Conv. Layer to 2nd Conv. Layer. From the resulting matrix of the 1st Conv. Layer, we have 3,721 variables. Following the proposed design in Table 4, we create a new convolutional layer, the 2nd Conv. Layer, which has 30 × 30 = 900 variables. We take the same 10 sampled images as before and use these 900 variables to present them, reshaping the 900 variables into shape 30 by 30.
The result is a smaller matrix whose plot exhibits a miniature version of the same patterns as before. We use Model 4 to generate the predicted probabilities. These probabilities are printed in the top left corner of each image and are color coded as before (red probabilities have ground truth of the COVID class, while green probabilities have ground truth of the non-COVID class).

Explainable AI System for Early COVID-19 Screening. As the most important contribution of this paper, an Explainable Artificial Intelligence (XAI) system is proposed to assist radiologists in the initial screening of COVID-19 and other related diseases using chest X-ray images for treatment and disease control. This innovation can revolutionize how AI systems are deployed in hospitals and healthcare systems. We anticipate that other related diseases with viral pneumonia signs can use the same detection methods proposed in our paper, which ensures the development of testing procedures with accountability, responsibility, and transparency to human users and patients.

A Heuristic and Theoretical Framework of XAI. This paper introduces a heuristic and theoretical framework for addressing XAI problems in large-scale and high-dimensional data sets. We provide three dimensions as necessary conditions and premises for a measure to be regarded as explainable and interpretable. The first dimension, D1, states that an interpretable measure does not need to rely on knowledge of the true model, because any mistakes made in model fitting would carry over into the explanation of the features. The second dimension, D2, states that an interpretable measure should indicate the impact a combination of variables has on the response variable: any inclusion of influential variables should increase the measure, while any injection of noisy and useless variables should decrease it. This desirable property allows human users to directly compare the impact of the features when any classifier is trained to make prediction decisions. Though the paper provides detailed work on a specific image data set, the proposed method can be generalized and adapted to any big-data problem; moreover, it opens up future pipelines for feature selection and dimension reduction in any large-scale and high-dimensional data set. Last, the third dimension, D3, associates an interpretable measure with the predictivity of a set of features. This property benefits human users because it allows us to establish connections and foresee the potential prediction performance (such as AUC values) that a set of features can deliver before any model fitting procedure.

An Interaction-based Convolutional Neural Network (ICNN). To address the XAI problems described above, this paper introduces a novel design of an explainable and self-interpretable Interaction-based Convolutional Neural Network (ICNN). We provide a flexible approach that addresses the major issues of explainability, interpretability, transparency, and trustworthiness in black-box algorithms. We introduce and implement a non-parametric, interaction-based feature selection methodology and use it as a replacement for the pre-defined filters that are widely used in ultra-deep CNNs. Under this paradigm, we present an Interaction-based Convolutional Neural Network (ICNN) that extracts important features.
The proposed architecture uses these extracted features to construct influential and predictive variable modules that are directly associated with the predictivity of the response variable. The proposed design and its many prudent characteristics provide an extremely flexible pipeline that can learn, extract useful information, and unleash the hidden potential of any large-scale or high-dimensional data set. The proposed methods have been presented with both artificial examples and a real-data application to the COVID-19 Chest X-ray Image Data. We conclude from both the simulation and the empirical results that the I-score shows unparalleled potential to explain informative and influential local information in large-scale data sets. High I-score values suggest that the local information possesses higher lower bounds of predictivity, which leads not only to high prediction performance but also to explanatory power. By arranging features according to the I-score from high to low, we are able to tailor the dimension of our model to any neural network architecture. Furthermore, we have pointed to a largely unexplored direction of interaction-based neural network architectures, which can move us many steps closer towards Explainable Artificial Intelligence. In fact, we believe that the proposed design can be adapted to any type of CNN; any CNN architecture that adopts the proposed technology can be regarded as an Interaction-based Convolutional Neural Network (ICNN, or Interaction-based Network). We encourage both the statistics and computer science communities to further explore this area to deliver more transparency, trustworthiness, and accountability to deep learning algorithms and to build a world with truly Responsible A.I.

We do not touch the test set until the very end, after training and validation are completed. For the remaining images, we use a data augmentation technique that adds noise drawn from a normal distribution. This gives us a total of 5,000 COVID cases and 5,000 non-COVID cases. These 10,000 images constitute the training and validation sets (abbreviated tr. and val. in Table 2) and contain two classes: COVID and non-COVID.

Deep-learning-based detection of COVID-19 on CT scans has been proposed, and [20] demonstrated the use of deep CNNs, including ResNet18, ResNet50, and DenseNet-121, to classify COVID-19 from X-ray images.

This means that from the 1st Conv. Layer to the hidden layer there are 3,721 × 64 = 238,144 parameters. From the hidden layer of 64 neurons to the output layer with 2 units, there are 64 × 2 = 128 parameters. In total there are 238,144 + 128 = 238,272 parameters. The performance of this architecture is 99.7%. The design of this one hidden layer with 64 units reduced the error rate from 1

This model builds two Interaction-based Convolutional Layers; the 1st Conv. Layer and the 2nd Conv. Layer each use their own set of tuning parameters, and a sketch of the corresponding dimension arithmetic is given below.
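For readers who want to check the dimension arithmetic, the following is a minimal sketch of the sliding-window output-size calculation used throughout (equation 8). The window sizes 8 and 3 (with stride 2) used here for the 61 × 61 and 30 × 30 layers are hypothetical choices that merely reproduce the dimensions quoted in the text; the actual settings are those listed in Table 4.

def out_size(w, window, stride, start=0):
    # Number of window positions along one side, after skipping the first
    # `start` rows/columns, for a window of the given size moved with the
    # given stride (integer division, as in equation 8).
    return (w - start - window + 1) // stride + 1

print(out_size(128, window=2, stride=2, start=12))  # 58: the BDA illustration above
print(out_size(128, window=8, stride=2))            # 61  (hypothetical settings)
print(out_size(61, window=3, stride=2))             # 30  (hypothetical settings)
print(61 * 61, 30 * 30)                             # 3721 and 900 variable modules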
From the 1st Conv. Layer we construct the 2nd Conv. Layer, and these 900 variables can be passed directly into the output layer for making predictions. Model 4 is the deepest among all six models; more specifically, it has two Interaction-based Convolutional Layers and one hidden layer. From the 2nd Conv. Layer, the 900 variables are fully connected to the hidden layer, and from the hidden layer with 64 units to the output layer with 2 units there are 64 × 2 = 128 parameters.
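To illustrate how the parameter counts quoted above arise, here is a minimal sketch of the fully connected head that sits on top of a layer of 3,721 variable modules (the one-hidden-layer design with 64 units). Bias terms are omitted so that the totals match the counts in the text (3,721 × 64 = 238,144 and 64 × 2 = 128, i.e. 238,272 in total); whether the actual models include bias terms is an assumption here.

import tensorflow as tf

# Fully connected head: 3,721 variable modules -> 64 hidden units -> 2 outputs.
head = tf.keras.Sequential([
    tf.keras.Input(shape=(3721,)),
    tf.keras.layers.Dense(64, activation="relu", use_bias=False),
    tf.keras.layers.Dense(2, activation="softmax", use_bias=False),
])
print(head.count_params())   # 238272 = 3721*64 + 64*2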