key: cord-1014136-kkq1u6zq
authors: nan
title: RCoNet: Deformable Mutual Information Maximization and High-Order Uncertainty-Aware Learning for Robust COVID-19 Detection
date: 2021-06-18
journal: IEEE Trans Neural Netw Learn Syst
DOI: 10.1109/tnnls.2021.3086570
sha: 192c50fa9fc1c857a56702260c8cbf78e2d72b99
doc_id: 1014136
cord_uid: kkq1u6zq

The novel 2019 Coronavirus (COVID-19) infection has spread worldwide and is currently a major healthcare challenge around the world. Chest computed tomography (CT) and X-ray images have been well recognized as two effective techniques for clinical COVID-19 disease diagnosis. Due to its faster imaging time and considerably lower cost than CT, detecting COVID-19 in chest X-ray (CXR) images is preferred for efficient diagnosis, assessment, and treatment. However, considering the similarity between COVID-19 and pneumonia, CXR samples with deep features distributed near category boundaries are easily misclassified by the hyperplanes learned from limited training data. Moreover, most existing approaches for COVID-19 detection focus on the accuracy of prediction and overlook uncertainty estimation, which is particularly important when dealing with noisy datasets. To alleviate these concerns, we propose a novel deep network named RCoNet$^k_s$ for robust COVID-19 detection, which employs Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF), and Multiexpert Uncertainty-aware Learning (MUL). With DeIM, the mutual information (MI) between input data and the corresponding latent representations can be well estimated and maximized to capture compact and disentangled representational characteristics. Meanwhile, MHMF can fully explore the benefits of using high-order statistics and extract discriminative features of complex distributions in medical imaging. Finally, MUL creates multiple parallel dropout networks for each CXR image to evaluate uncertainty and thus prevent performance degradation caused by the noise in the data. The experimental results show that RCoNet$^k_s$ achieves the state-of-the-art performance on an open-source COVIDx dataset of 15 134 original CXR images across several metrics. Crucially, our method is shown to be more effective than existing methods in the presence of noise in the data.

Coronavirus disease 2019 (COVID-19) has caused an ongoing pandemic that has significantly impacted everyone's life since it was first reported, with hundreds of thousands of deaths and millions of infections emerging in over 200 countries [1], [2]. As indicated by the World Health Organization (WHO), due to its highly contagious nature and the lack of corresponding vaccines, the most effective measures to control the spread of COVID-19 infection are social distancing and contact tracing. Hence, early and fast diagnosis of COVID-19 has become essential to control further spreading and to ensure that patients can be hospitalized and receive proper treatment in time. Since the emergence of COVID-19, reverse transcription polymerase chain reaction (RT-PCR), a viral nucleic acid detection method based on gene sequencing, has been the accepted standard for COVID-19 detection [3]. However, because of the low accuracy of RT-PCR and the limited number of medical test kits in many hyperendemic regions or countries, it is challenging to rapidly detect every individual affected by COVID-19 [4], [5].
Therefore, alternative testing methods, which are faster and more reliable than RT-PCR, are urgently needed to combat the disease. Since most COVID-19 positive patients were diagnosed with pneumonia, radiological examinations could help detect and assess the disease. Recently, chest computed tomography (CT) has been shown to be efficient and reliable for real-time clinical diagnosis of COVID-19, outperforming RT-PCR in terms of accuracy. Moreover, several deep learning-based methods have been proposed for COVID-19 detection using chest CT images [6]-[9]. For example, an adaptive feature selection approach based on a trained deep forest model was proposed in [10] for COVID-19 detection. In [11], an uncertainty vertex-weighted hypergraph learning method was designed to distinguish COVID-19 from community-acquired pneumonia (CAP) using CT images. However, the routine use of CT, which requires expensive equipment, takes considerably more time than X-ray imaging and places a massive burden on radiology departments. Compared to CT, X-rays can significantly speed up disease screening and have hence become a preferred method for disease diagnosis. Accordingly, deep learning-based methods for detecting COVID-19 from chest X-ray (CXR) images have been developed and shown to achieve accurate and speedy detection [12], [13]. For instance, a tailored convolutional neural network called COVID-Net, trained on an open-source dataset, was proposed in [14] for the detection of COVID-19 cases from CXR images. Oh et al. [15] proposed a novel probabilistic gradient-weighted class activation map to enable infection segmentation and detection of COVID-19 on CXR images. Fig. 1 shows three samples from the COVIDx dataset [14], which contains three different classes: normal, pneumonia, and COVID-19. However, due to the similar pathological appearance of pneumonia and COVID-19 in the early stage, the CXR samples may have latent features distributed near the category boundaries, which can easily be misclassified by the hyperplane learned from the limited training data. Moreover, to the best of our knowledge, most of the existing methods for COVID-19 detection were designed to extract low-dimensional latent representations, which may not be able to fully capture the statistical characteristics of complex distributions (e.g., the non-Gaussian distributions presented in CXR images). Furthermore, quantifying uncertainty in COVID-19 detection is still a major yet challenging task for existing deep networks, especially in the presence of noise in the training samples (i.e., label noise and image noise). To address the above problems, we propose a novel deep network architecture, referred to as RCoNet$^k_s$, for robust COVID-19 detection which, in particular, contains the following three modules, i.e., Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF), and Multiexpert Uncertainty-aware Learning (MUL): 1) The DeIM module estimates and maximizes the mutual information (MI) between the input data and the learned high-level representations, which pushes the model to learn discriminative and compact features. We employ deformable convolution layers in this module, which are able to explore disentangled spatial features and mitigate the negative effect of similar samples across different categories.
2) The MHMF module fully explores the benefits of using a mix of high-order moment statistics to better characterize the feature distributions in medical imaging and reduce the negative effects of noise. 3) The MUL module creates multiple parallel dropout networks, each of which can be treated as an expert, to derive a multiple-experts-based diagnosis similar to clinical practice, which improves the prediction accuracy. MUL also quantifies the prediction uncertainty by obtaining the variance in the predictions across different experts. 4) The experimental results show that our proposal achieves the state-of-the-art performance in terms of most metrics, both on the open-source COVIDx dataset of 15 134 original CXR images and on its noisy counterpart. The remainder of this article is organized as follows: In Section II, we review related work on MI estimation and uncertainty learning. In Section III, after an overview of our proposed approach, we discuss the main components of RCoNet$^k_s$. In Section IV, we compare our proposed architecture with existing deep learning-based methods evaluated on a publicly available dataset of CXR images, as well as on the same dataset under noisy conditions. We also conduct extensive experiments to demonstrate the benefits of DeIM, MHMF, and MUL on the system's performance. Finally, we conclude this article in Section V. In this section, we introduce related work on MI estimation and uncertainty learning that lays the foundation of this article. MI, as a fundamental concept in information theory, is widely applied to unsupervised feature learning for quantifying the correlation between random variables. MI has been exploited in a wide range of domains and tasks, including the biomedical sciences [16], blind source separation (BSS, e.g., independent component analysis [17]), feature selection [18], [19], and causal inference [20]. For example, the object tracking task considered in [21] was treated as a problem of optimizing the MI between features extracted from a video with most color information removed and those from the original full-color video. Closely related work presented in [22] considered learning representations to predict cross-modal correspondence by maximizing the MI between features from the multiview encoders and the content of the held-out view. Moreover, Mutual Information Neural Estimation (MINE), proposed in [23], was designed to learn a general-purpose estimator of the MI between continuous variables based on dual representations of the Kullback-Leibler (KL) divergence, which is scalable, flexible, and, most crucially, trainable via back-propagation. Inspired by MINE, our proposal estimates and maximizes the MI between the CXR image inputs and the corresponding latent representations to improve diagnosis performance. Aiming at combating the significant negative effects of uncertainty in deep neural networks, uncertainty learning has been attracting substantial research attention, as it facilitates reliability assessment and risk-based decision making [24]-[26]. In recent years, various frameworks have been proposed to characterize the uncertainty in the model parameters of deep neural networks, referred to as model uncertainty, which arises from the limited size of the training data [27], [28] and can be reduced by collecting more training data [25], [29], [30].
Meanwhile, another kind of uncertainty in deep learning, referred to as data uncertainty, measures the noise inherent in the given training data and hence cannot be eliminated by collecting more training data [31]. To combat these two kinds of uncertainty, many works on various computer vision tasks, e.g., face recognition [24], semantic segmentation [32], object detection [33], and person reidentification [34], have introduced deep uncertainty learning to improve the robustness and interpretability of deep learning models. For the face recognition task in [25], an uncertainty-aware probabilistic face embedding (PFE) was proposed to represent face images as distributions by utilizing data uncertainty. Exploiting the advantages of Bayesian deep neural networks, one recent study [35] leveraged model uncertainty for the analysis and learning of face representations. To our knowledge, our proposal is the first work that utilizes high-order moment statistics and multiple expert networks to estimate uncertainty for COVID-19 detection using CXR images. In this section, we introduce the novel RCoNet$^k_s$ for robust COVID-19 detection, which incorporates DeIM, MHMF, and MUL, as illustrated in Fig. 2. Here, $k$ is the number of levels of moment features that are combined in MHMF, and $s$ is the number of expert networks in MUL, both of which will be further clarified in the sequel. The CXR images are first processed by DeIM, which consists of a stack of deformable convolution layers, extracting discriminative features. The compact features are then fed into the MHMF module to generate mixed high-order moment latent features, reducing the negative effects caused by similar images and noise. The proposed MUL utilizes the learned high-order features to generate the final diagnoses. Due to the similarity between COVID-19 and pneumonia in the latent space, we propose DeIM to extract discriminative and informative features, reducing the negative influence caused by the lack of distinctiveness in the deep features. In particular, we train the model by maximizing the MI between the input and the corresponding latent representation. We use a stack of five convolutional stages, as shown in Fig. 2, to encode inputs into latent representations, denoted by a differentiable parametric function $E_\psi : \mathcal{X} \to \mathcal{Z}$, where $\psi$ denotes the set of all the trainable parameters in these layers, and $\mathcal{X}$ and $\mathcal{Z}$ denote the input and output spaces, respectively. The detailed architecture of each convolutional stage is presented in Fig. 2; it consists of several convolutional layers, each followed by a batch normalization layer. Note that we employ deformable convolutional layers, which can better extract spatial information of the irregular infected area compared to conventional convolutional layers. More specifically, a regular convolution operates on a predefined rectangular grid from an input image or a set of input feature maps, while a deformable convolution operates on a deformable grid in which each grid point is moved by a learnable offset. For example, the receptive grid $\mathcal{P}$ of a regular convolution with kernel size $3 \times 3$ is fixed and can be given by
$$\mathcal{P} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}, \tag{1}$$
with the corresponding output at location $p_0$ of the output feature map computed as
$$b(p_0) = \sum_{p_n \in \mathcal{P}} w(p_n)\, a(p_0 + p_n), \tag{2}$$
while, for deformable convolution, the receptive grid is moved by the learned offsets $\Delta p_n \in \mathbb{R}^2$ and the output is given as follows:
$$b(p_0) = \sum_{p_n \in \mathcal{P}} w(p_n)\, a(p_0 + p_n + \Delta p_n), \tag{3}$$
where $b(p_0)$ denotes the value at location $p_0$ on the output feature map $b$, $p_n$ enumerates the locations in $\mathcal{P}$, $w(p_n)$ represents the weight at location $p_n$ of the kernel, and $a(\cdot)$ is the value at a given location on the input feature map.
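As a concrete illustration, the sketch below shows how such a deformable convolution stage could be assembled in PyTorch with torchvision's DeformConv2d, where a plain convolution predicts the offsets $\Delta p_n$; the channel sizes and the offset-prediction head are illustrative assumptions, not the exact architecture of $E_\psi$.

```python
# Minimal sketch of one deformable convolution block: a plain 3x3 convolution
# predicts per-location offsets (2 values per kernel point), which
# torchvision's DeformConv2d uses to sample the input at p_0 + p_n + dp_n
# instead of the fixed grid p_0 + p_n.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # 2 * k * k offset channels: (dy, dx) for each of the k*k kernel points.
        self.offset_pred = nn.Conv2d(in_ch, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)          # learned offsets (Delta p_n)
        out = self.deform_conv(x, offsets)     # deformable sampling + weighting
        return torch.relu(self.bn(out))


if __name__ == "__main__":
    block = DeformableBlock(in_ch=3, out_ch=64)
    cxr = torch.randn(2, 3, 224, 224)          # a dummy batch of CXR-sized images
    print(block(cxr).shape)                    # torch.Size([2, 64, 224, 224])
```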
We can see that with the introduction of the offsets $\Delta p_n$, the receptive grid is no longer fixed to be a rectangle and instead becomes deformable. We optimize $E_\psi$ by maximizing the MI between the input and the output, i.e., $I(X; Z)$, where $Z \triangleq E_\psi(X)$. Computing the precise MI requires knowledge of the probability density functions (PDFs) of $X$ and $Z$, which is intractable to obtain in practice. To overcome this issue, MINE proposed in [23] estimates MI by using a lower bound based on the Donsker-Varadhan representation [36] of the KL-divergence,
$$\hat{I}^{(\mathrm{DV})}_{\theta}(X; Z) = \mathbb{E}_{\mathbb{J}}\big[T_\theta(x, z)\big] - \log \mathbb{E}_{\mathbb{M}}\big[e^{T_\theta(x, z)}\big], \tag{4}$$
where $\mathbb{J}$ represents the joint probability of $X$ and $Z$, i.e., $\mathbb{J} \triangleq P(X, Z)$, and $\mathbb{M}$ denotes the product of the marginal probabilities of $X$ and $Z$, i.e., $\mathbb{M} \triangleq P(X)P(Z)$. $T_\theta : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ denotes a global discriminator modeled by a neural network with parameters $\theta$, which is trained to maximize $\hat{I}^{(\mathrm{DV})}_{\theta}(X; Z)$ to approximate the actual MI. Hence, we can simultaneously estimate and maximize $I(X; E_\psi(X))$ by maximizing
$$(\hat{\psi}, \hat{\theta}) = \operatorname*{arg\,max}_{\psi, \theta}\; \hat{I}^{(\mathrm{DV})}_{\theta}\big(X; E_\psi(X)\big). \tag{5}$$
Since the encoder $E_\psi$ and the MI estimator $T_\theta$ are optimized simultaneously with the same objective function, we can share some layers between them and replace $T_\theta$ with $T_{\theta,\psi}$ to account for this fact. Since we are primarily interested in maximizing the MI rather than estimating its precise value, we can alternatively use a Jensen-Shannon MI estimator (JSD) [37], which offers a more interpretable tradeoff,
$$\hat{I}^{(\mathrm{JSD})}_{\theta,\psi}\big(X; E_\psi(X)\big) = \mathbb{E}_{\mathbb{P}}\Big[-\mathrm{sp}\big(-T_{\theta,\psi}(x, E_\psi(x))\big)\Big] - \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\Big[\mathrm{sp}\big(T_{\theta,\psi}(x', E_\psi(x))\big)\Big], \tag{6}$$
where $\mathrm{sp}(z) = \log(1 + e^z)$ is the softplus function, $x$ is an input sample drawn from the empirical probability distribution $\mathbb{P}$, and $x'$ denotes a fake sample drawn from a distribution $\tilde{\mathbb{P}}$. This estimator is illustrated by the DeIM block shown in Fig. 2, which takes the latent representation $E_\psi(x)$, the input sample $x$, and the fake sample $x'$ as input, and outputs the difference between the two softplus terms as the estimate of the MI. Another alternative MI estimator is the Noise-Contrastive Estimator (NCE) [38], which is defined as
$$\hat{I}^{(\mathrm{NCE})}_{\theta,\psi}\big(X; E_\psi(X)\big) = \mathbb{E}_{\mathbb{P}}\Big[T_{\theta,\psi}(x, E_\psi(x)) - \mathbb{E}_{\tilde{\mathbb{P}}}\big[\log \textstyle\sum_{x'} e^{T_{\theta,\psi}(x', E_\psi(x))}\big]\Big]. \tag{7}$$
Our experiments have found that the NCE estimator outperforms the JSD estimator in some cases, but the two behave quite similarly most of the time. The existing works [39] that implement these estimators use some latent representation of $x$, which is then merged with randomly drawn features to obtain the "fake" samples $x'$. In contrast, we use the samples from other categories as the "fake" samples instead. For example, if the input is a pneumonia sample, then the fake sample is either a normal or a COVID-19 sample. We note that this pushes the learned encoder to derive more distinguishable features for samples from different categories.
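For concreteness, the following sketch shows how the JSD estimator in (6) could be computed in PyTorch; the small fully connected discriminator and the flattened-input pairing are assumptions made for illustration, not the paper's exact $T_{\theta,\psi}$.

```python
# Minimal sketch of the Jensen-Shannon MI estimator: the discriminator T scores
# (input, code) pairs, positive pairs come from the same image, and "fake"
# pairs reuse the code with an input drawn from a different class.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalDiscriminator(nn.Module):
    """T_{theta,psi}: maps a flattened image x and its code z to a scalar score."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_flat: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_flat, z], dim=1)).squeeze(1)


def jsd_mi_estimate(T: GlobalDiscriminator,
                    x: torch.Tensor,       # real inputs, flattened
                    x_fake: torch.Tensor,  # inputs from other categories
                    z: torch.Tensor) -> torch.Tensor:
    """E_P[-sp(-T(x, z))] - E_P~[sp(T(x', z))]; maximized w.r.t. encoder and T."""
    positive = -F.softplus(-T(x, z)).mean()
    negative = F.softplus(T(x_fake, z)).mean()
    return positive - negative
```

In training, the negative of this estimate would simply be added to the task loss, so that minimizing the total loss tightens the MI bound.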
The presence of image noise and label noise in CXR datasets may cause the image latent representations generated by deep neural networks to be scattered across the entire feature space. To deal with this issue, [24], [25], [34] represent each image as a Gaussian distribution, defined by a mean (a standard feature vector) and a variance. However, the deep features of the CXR samples considered in this article typically follow a complex, non-Gaussian distribution [40], [41], which cannot be fully captured by first-order (mean) or second-order (variance) statistics. We therefore seek a combination of statistics of different orders to more precisely characterize the latent representation of the CXR images. We illustrate the moment features of different orders [42] in Fig. 3, where we plot 350 data points in $\mathbb{R}^2$ sampled from a mixture of three different Gaussian distributions. We can observe that the high-order moment features are more expressive of the statistical characteristics than the low-order ones. More specifically, they capture the shape of the cloud of samples more accurately. Therefore, we include the MHMF module in the proposed model, as shown in Fig. 2, which takes the latent representation $E_\psi(X)$ as input and outputs a combination of high-order moment features. This will potentially solve the scattering problem and capture the subtle differences between CXR images of similar categories, i.e., pneumonia and COVID-19 in our case. We show how to obtain the complicated high-order moment features in the following. Define the $r$th-order moment feature as $\phi_r(a)$, where $a \in \mathbb{R}^{H \times W \times C}$ denotes a latent feature map of dimension $H \times W \times C$. Many recent works adopt the Kronecker product to compute high-order moment features [41]. However, calculating the Kronecker product of high-dimensional feature maps is computationally intensive and hence infeasible for real-world applications. Inspired by [43]-[45], we approximate $\phi_r(a)$ by exploiting $r$ random projectors that rely on certain factorization schemes, such as Random Maclaurin [46]. We use $1 \times 1$ convolution kernels as the random projectors to estimate the expectations of the high-order moment features. That is,
$$\phi_r(a) \approx (K_1 * a) \odot (K_2 * a) \odot \cdots \odot (K_r * a), \tag{8}$$
where $\odot$ represents the Hadamard (element-wise) product, $*$ denotes convolution, and $K_1, K_2, \ldots, K_r$ are $1 \times 1$ convolution kernels with random weights. Note that Random Maclaurin produces an estimator that is independent of the input distribution, which causes the estimated high-order moments to contain noninformative high-order components. We eliminate these components by learning the weights of the projectors, i.e., the $1 \times 1$ convolution kernels, from the data. Also, note that the Hadamard product of a number of random projections may end up producing estimated high-order moment features that are similar to low-order ones. To solve this problem, we estimate the high-order moments in a recursive way instead,
$$\phi_r(a) = \phi_{r-1}(a) \odot (K_r * a), \qquad \phi_1(a) = K_1 * a. \tag{9}$$
Since moments of different orders capture different informative statistics, we design the MHMF module to keep the estimated moments of different levels of order, as shown in Fig. 2, the output of which is given as
$$J(a) = \big[\phi_1(a), \phi_2(a), \ldots, \phi_k(a)\big]. \tag{10}$$
Hence, $J(a)$ is rich enough to capture the complicated statistics and produce discriminative features for inputs of different categories.
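A minimal PyTorch sketch of the MHMF computation in (8)-(10) is given below; the global average pooling used to turn each order's feature map into a vector, as well as the channel sizes, are assumptions made for illustration rather than the paper's exact configuration.

```python
# Minimal sketch of the mixed high-order moment feature: learnable 1x1
# convolutions act as projectors, each order is built recursively via Hadamard
# products, and the per-order features are pooled and concatenated into J(a).
import torch
import torch.nn as nn


class MixedHighOrderMoments(nn.Module):
    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        self.k = k
        # one learnable 1x1 projector per moment order
        self.projectors = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(k)]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        feats = []
        phi = self.projectors[0](a)                 # phi_1(a) = K_1 * a
        feats.append(self.pool(phi).flatten(1))
        for r in range(1, self.k):
            phi = phi * self.projectors[r](a)       # phi_r = phi_{r-1} (Hadamard) K_r * a
            feats.append(self.pool(phi).flatten(1))
        return torch.cat(feats, dim=1)              # J(a): orders 1..k concatenated


if __name__ == "__main__":
    mhmf = MixedHighOrderMoments(channels=256, k=4)
    a = torch.randn(2, 256, 7, 7)                   # a dummy latent feature map
    print(mhmf(a).shape)                            # torch.Size([2, 1024])
```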
The MHMF module, as described in Section III-B, generates the MHMFs of each sample in the latent space, which we aim to further exploit to derive compact and disentangled information for COVID-19 detection. Meanwhile, quantifying uncertainty in disease detection is undoubtedly important for understanding the confidence level of computer-based diagnoses. Motivated by clinical practice, we present a novel neural network in this section, referred to as MUL, which takes in the MHMFs and outputs both the prediction and a quantification of the diagnostic uncertainty caused by the noise in the data. The structure of the MUL module is shown in Fig. 2; it consists of multiple dropout layers that process the output of MHMF in parallel, each of which, together with the following fully connected layers, can be regarded as an expert for COVID-19 detection. We note that each dropout layer uses a different mask, which results in a different subset of latent information being kept, while the following fully connected layers share the same weights across different experts. The masks for the dropout layers are generated randomly at each iteration during training but are fixed at inference time. We denote the input-output function of each expert by $C^j_e(\cdot)$, $j = 1, \ldots, N$, where $N$ is the total number of experts. Hence, the classification loss $\mathcal{L}^j_e$ of the $j$th expert is given as follows:
$$\mathcal{L}^j_e = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_w\Big(C^j_e\big(J(E_\psi(x_i))\big),\, y_i\Big), \tag{11}$$
where $n$ represents the total number of labeled CXR samples, $y_i$ denotes the one-hot representation of the class label, $i = 1, \ldots, n$, and we recall that $J(\cdot)$ denotes the MHMF operation given in (10) and $E_\psi(\cdot)$ is the preprocessing step on the CXR samples. Note that the total number of COVID-19 cases is much smaller than that of non-COVID cases, i.e., normal and pneumonia cases. This imbalance in the dataset leads to a high ratio of false-negative classifications. To mitigate this negative effect, we employ a weighted cross-entropy $\mathcal{L}_w(\cdot)$ given as follows:
$$\mathcal{L}_w(\hat{y}_i, y_i) = -\sum_{c=1}^{C} \lambda_c\, y_{i,c} \log \hat{y}_{i,c}, \tag{12}$$
where $C$ is the total number of classes, $y_{i,c}$ is the $c$th element of $y_i$, and $\hat{y}_{i,c}$ denotes the corresponding prediction. $\lambda_c$ represents the weight that controls how much the error on class $c$ contributes to the loss, $c = 1, \ldots, C$. Finally, the loss $\mathcal{L}_M$ of the whole MUL module is derived by averaging the loss values of all the experts,
$$\mathcal{L}_M = \frac{1}{N}\sum_{j=1}^{N} \mathcal{L}^j_e. \tag{13}$$
We use the variance of the per-expert classification losses $\mathcal{L}^j_e$ with respect to the average loss $\mathcal{L}_M$ to quantify the uncertainty, denoted by $\sigma$, which is given as
$$\sigma = \frac{1}{N}\sum_{j=1}^{N} \big(\mathcal{L}^j_e - \mathcal{L}_M\big)^2. \tag{14}$$
The proposed MUL module improves the diagnostic accuracy, as the final prediction combines the results from multiple experts, and also mitigates the negative effects caused by the noise in the data by introducing the dropout layers. Moreover, our experiments have revealed that the more experts in the MUL module, the faster the system converges during training. The whole architecture of RCoNet$^k_s$ is presented in Fig. 2, where the CXR images are first processed by a stack of deformable convolution layers and then transformed into high-order moment latent features by the MHMF module. Finally, the MUL module utilizes the learned high-order features to generate the final diagnoses. The loss used to optimize RCoNet$^k_s$ is given as follows:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_M - \alpha \mathcal{L}_I, \tag{15}$$
where $\mathcal{L}_M$ is the prediction loss given by (13), and $\mathcal{L}_I$ denotes the MI between the input $X$ and the latent representation $E_\psi(X)$ estimated by either (6) or (7). $\alpha$ is a positive hyperparameter that governs how much $\mathcal{L}_M$ and $\mathcal{L}_I$ contribute to the total loss. During training, the trainable parameters of the whole system are updated iteratively to minimize $\mathcal{L}_{\mathrm{total}}$, which jointly minimizes the prediction loss $\mathcal{L}_M$, thus improving the accuracy, and maximizes the MI $\mathcal{L}_I$.
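To make the MUL head and the losses in (11)-(15) concrete, the sketch below shows one possible PyTorch realization; the number of experts, the single shared linear classifier, and the class weights are illustrative assumptions, and standard nn.Dropout is used for brevity whereas the paper fixes the dropout masks at inference time.

```python
# Minimal sketch of the multi-expert head: parallel dropout masks over the
# shared MHMF feature, a shared classifier, weighted cross-entropy per expert,
# their mean as L_M, and their variance as the uncertainty estimate sigma.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiExpertHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int = 3,
                 num_experts: int = 4, drop_rates=(0.1, 0.3, 0.5, 0.3)):
        super().__init__()
        self.dropouts = nn.ModuleList([nn.Dropout(p) for p in drop_rates[:num_experts]])
        self.classifier = nn.Linear(feat_dim, num_classes)  # weights shared by all experts

    def forward(self, feat: torch.Tensor):
        # one logit tensor per expert, each seeing a differently masked feature
        return [self.classifier(drop(feat)) for drop in self.dropouts]


def mul_losses(expert_logits, targets, class_weights):
    """Returns (L_M, sigma): mean expert loss and its variance as uncertainty."""
    per_expert = torch.stack([
        F.cross_entropy(logits, targets, weight=class_weights)
        for logits in expert_logits
    ])
    loss_m = per_expert.mean()
    sigma = ((per_expert - loss_m) ** 2).mean()
    return loss_m, sigma


if __name__ == "__main__":
    head = MultiExpertHead(feat_dim=1024)
    feats = torch.randn(8, 1024)                     # MHMF features for a batch
    labels = torch.randint(0, 3, (8,))
    weights = torch.tensor([1.0, 1.0, 20.0])         # normal, pneumonia, COVID-19
    loss_m, sigma = mul_losses(head(feats), labels, weights)
    alpha, mi_estimate = 0.2, torch.tensor(0.0)      # MI term from the DeIM estimator
    total = loss_m - alpha * mi_estimate             # L_total = L_M - alpha * L_I
```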
We use a public CXR dataset, referred to as COVIDx, to evaluate the proposed model, which is constructed following [14], [47]; the dataset is finally divided into 13 624 training and 1510 test samples. The numbers of samples from the different categories used for training and testing are summarized in Table I. Moreover, we also adopt various data augmentation techniques to generate more COVID-19 training samples, such as flipping, translation, and rotation by five random angles, to tackle the data imbalance issue so that the proposed model can learn an effective mechanism for detecting COVID-19. In our experiments, we use the following six metrics to evaluate the COVID-19 detection performance of the different approaches: accuracy (ACC), sensitivity (SEN), specificity (SPE), balanced accuracy (BAC), positive predictive value (PPV), and F1 score. We compare the proposed RCoNet$^k_s$ with five existing deep learning methods for COVID-19 detection, including COVID-Net [14], CoroNet, and ReCoNet, a detection network that exploits a CNN-based multilevel preprocessing filter block and a multitask learning loss. We implement RCoNet$^k_s$ using the PyTorch library and apply ResNeXt [50] as the backbone network. We train the model with the Adam optimizer, with an initial learning rate of $2 \times 10^{-4}$ and a weight decay factor of $1 \times 10^{-4}$. All the experiments are run on an NVIDIA GeForce GTX 1080Ti GPU. We set the batch size to 8 and resize all images to $224 \times 224$ pixels. The hyperparameter $\alpha$ in the loss function given in (15) is set to be within the range $[0, 0.4]$. The drop rate of each dropout layer in the MUL module is randomly chosen from $\{0.1, 0.3, 0.5\}$. The loss weight $\lambda_c$ for each category, which is used to calculate the weighted cross-entropy given in (12), is set to 1, 1, and 20 for the normal, pneumonia, and COVID-19 samples, respectively, corresponding to the number of training samples in each category. We adopt fivefold cross-validation and evaluate our proposed model with different numbers of moment orders $k$ in the MHMF module and different numbers of experts $s$. To evaluate the performance of the proposed model in the presence of label noise, we derive a noisy dataset from the given dataset in the following way: we randomly select a given percentage of training samples in each category and assign wrong labels to these samples. In particular, to ensure that the fake COVID-19 samples are fewer than the real ones, we assign the COVID-19 label to selected normal and pneumonia samples in such a way that the number of normal and pneumonia samples assigned the COVID-19 label equals the number of COVID-19 samples assigned either the normal or the pneumonia label. We show a realization of the derived noisy dataset when the percentage of fake samples is set to 10% in Table II.
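The label-noise protocol described above could be realized as in the following sketch; the per-category selection and the exact balancing of fake COVID-19 labels reflect one reading of the description and are assumptions rather than the authors' script.

```python
# Minimal sketch of label-noise injection: a fixed fraction of labels in each
# category is corrupted, while the number of samples relabeled *to* COVID-19 is
# kept equal to the number of COVID-19 samples relabeled *away* from it.
import random


def corrupt_labels(labels, noise_ratio=0.1, seed=0):
    """labels: list of strings in {'normal', 'pneumonia', 'covid19'}; returns a noisy copy."""
    rng = random.Random(seed)
    noisy = list(labels)
    idx_by_class = {c: [i for i, l in enumerate(labels) if l == c]
                    for c in ("normal", "pneumonia", "covid19")}

    # Pick the same fraction of samples from every category.
    picked = {c: rng.sample(idx, int(noise_ratio * len(idx)))
              for c, idx in idx_by_class.items()}

    # Selected COVID-19 samples lose their label to normal/pneumonia ...
    for i in picked["covid19"]:
        noisy[i] = rng.choice(["normal", "pneumonia"])

    # ... and exactly as many non-COVID samples become fake COVID-19 cases,
    # so fake COVID-19 samples never outnumber the real ones; the remaining
    # picked normal/pneumonia samples swap labels with each other.
    non_covid_picked = picked["normal"] + picked["pneumonia"]
    rng.shuffle(non_covid_picked)
    to_covid = non_covid_picked[:len(picked["covid19"])]
    for i in to_covid:
        noisy[i] = "covid19"
    for i in non_covid_picked[len(to_covid):]:
        noisy[i] = "pneumonia" if labels[i] == "normal" else "normal"
    return noisy
```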
1) Performance on Clean Data: The numerical results on the clean dataset, without any artificial noise added, are shown in Table III. The results are presented in the form of mean ± standard deviation over the cross-validation folds. We can see that RCoNet$^4_4$, i.e., the proposed model with $k = 4$ levels of mixed moment features and $s = 4$ experts, achieves notable performance improvements over the comparison methods in terms of most of the metrics considered, including ACC, SPE, BAC, PPV, and F1 score. We note that the performance of RCoNet$^k_s$ can be further improved with a different choice of $k$ and $s$. For instance, RCoNet$^5_4$ achieves a better SEN and F1 score than RCoNet$^4_4$. The higher ACC and F1 score validate that RCoNet$^k_s$ is able to obtain latent features, i.e., the mixed moment features of different levels of order, that maintain interclass separability and intraclass compactness better than the other models. Note that RCoNet$^5_4$ leads to a higher SEN than all the other methods, which is particularly important for COVID-19 detection, since successfully detecting COVID-19 positive cases is the key to controlling the spread of this highly contagious disease. Moreover, it can be observed that RCoNet$^k_s$ has a smaller variance than the others, which demonstrates the robustness and stability of our model. We also evaluate the complexity of the proposed model in terms of the number of parameters and the computational cost, i.e., floating-point operations (FLOPs), which is presented in Table III. It can be observed that the proposed model has far fewer parameters than several existing methods, except ReCoNet. However, we note that the FLOPs of RCoNet$^k_s$ are quite close to those of ReCoNet, which means that it takes a similar amount of time to diagnose COVID-19 from CXR images with these two models. We can also observe that increasing $k$, i.e., the number of mixed moment features, causes only a small or even negligible increase in the number of parameters and FLOPs, which suggests that we can improve the performance of the proposed model by optimizing $k$ without a significant increase in complexity. As for $s$, the number of experts in MUL, we select 4, which a large number of experiments confirmed to yield better performance at only a small additional computational cost. We further compare the proposed model to the existing ones when noise is present in the training dataset. We generate three noisy training datasets in the aforementioned way from the clean dataset, with 10%, 20%, and 30% of the samples given wrong labels, respectively. The results, averaged over five independent experiments, are presented in Table IV. It can easily be seen that the more fake samples we add, the more the performance of all the methods degrades. Note that the proposed RCoNet$^4_4$ still achieves the state-of-the-art results in all the considered cases with different percentages of noisy samples in the training dataset. Moreover, the performance gain over the existing methods slightly increases with the ratio of noisy samples, verifying that our model is more robust to the noise. Note that the extreme case of 30% noisy samples leads to great performance degradation for all the models; in practice, the percentage of label noise is usually around 10% to 20%. We present the confusion matrices in Fig. 4 to summarize the prediction accuracy for the different categories. We can observe that, although trained with a very limited number of COVID-19 samples, our model still maintains high accuracy in detecting COVID-19 cases, even in the presence of noisy samples. One remarkable advantage of our model is its ability to quantify the uncertainty in the final prediction, which is significantly crucial for COVID-19 detection. This is done by obtaining the variance in the output of the different experts in MUL, as described in Section III-C. The larger the variance is, the more the different experts disagree with each other, and, hence, the more uncertain the model is about the final prediction. We present two CXR samples in Fig. 6, including the predictions and the corresponding uncertainty levels given by RCoNet$^k_s$. We can see that the correctly classified CXR image has a low uncertainty level, i.e., 0.0094, whereas the misclassified CXR sample has a high uncertainty level, i.e., 0.4792, suggesting that an alternative way of diagnosis should be sought to correct this prediction. This greatly improves the reliability of the predictions of RCoNet$^k_s$ and reduces the chance of misdiagnosis. We also show in Fig. 7 the average uncertainty levels of RCoNet$^k_s$ trained on clean and noisy datasets with different ratios of noisy samples. It can be observed that the uncertainty level increases almost linearly with the percentage of noisy samples in the dataset, which highlights the negative impact of noise on model training. We further numerically analyze the benefits of the three key modules of RCoNet$^k_s$, i.e., the DeIM, MHMF, and MUL modules, in this section. 1) Effectiveness of DeIM: As shown in Fig. 5, we utilize the t-distributed stochastic neighbor embedding (t-SNE) method [51] to visualize the learned latent features. Comparing the feature distribution presented in Fig. 5(a) with that produced by RCoNet-D presented in Fig. 5(b), we can tell that the introduction of DeIM leads to better class separation in the latent space.
2) Effectiveness of MHMF: We can observe in Fig. 5(a)-(d) that the latent features of the COVID-19 samples generated by the models without MHMF always lie around the category boundary and are not well separable from those of some pneumonia samples. Meanwhile, the latent feature distributions presented in Fig. 5(e)-(h), derived by the models with MHMF, show significant separability between the different categories, which implies that MHMF can extract discriminative features. We also include numerical results of RCoNet$^k_s$, trained and tested on the COVIDx dataset, with regard to different values of $k$, i.e., the number of levels of moment features to be mixed, and $s$, i.e., the number of experts, in Table V in terms of accuracy. We can observe that, for a given value of $s$, the accuracy first increases with $k$ but decreases once $k$ is larger than 4. This demonstrates that including more levels of moment features can improve the model performance; however, overly high-order moments may lead to performance degradation, which may be because these features are not useful for COVID-19 detection. 3) Effectiveness of MUL: From Table V, we observe that, for a given value of $k$, the accuracy first increases with $s$ but saturates around $s = 5$. This implies that having more experts in MUL can increase the prediction accuracy, but it is not necessary to have too many. We also evaluate how sensitive the model performance, in terms of accuracy, is to the value of $\alpha$. We show the average accuracy over five independent experiments of RCoNet$^4_4$ trained on the dataset with different ratios of noisy samples in Fig. 8. As we can see, a larger $\alpha$, which means that the prediction loss $\mathcal{L}_M$ contributes less to the total loss, does not necessarily lead to a degradation in accuracy. This means that maximizing the MI between the input and the latent features keeps useful information within the latent features, thus improving the prediction accuracy. We have also shown the learning curves of the different models in Fig. 9, which shows that RCoNet$^4_4$ converges slightly faster than the others, including COVID-Net, ReCoNet, and CoroNet. In this article, we proposed a novel deep network model, named RCoNet$^k_s$, for robust COVID-19 detection, which contains three key components, i.e., DeIM, MHMF, and MUL. DeIM estimates and maximizes the MI between the input data and the latent representations simultaneously to obtain category separability in the latent space. MHMF overcomes the limited expressive capability of low-order statistics and uses a combination of both low- and high-order moment features to extract more informative and discriminative features. MUL generates the final diagnosis and the uncertainty estimate by combining the outputs of multiple parallel dropout networks, each acting as an expert. We numerically validated that the proposed RCoNet, trained on either the public COVIDx dataset or its noisy version, outperforms the existing methods in terms of all the metrics considered. We note that these three modules can easily be integrated into other frameworks for different tasks.
REFERENCES
[1] K. Zhang et al., "Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography," Cell, vol. 181, no. 6, pp. 1423.e11-1433.e11, Jun. 2020.
Accurate screening of COVID-19 using attention-based deep 3D multiple instance learning
Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
Relational modeling for robust and efficient pulmonary lobe segmentation in CT scans
Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia
AI augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other etiology on chest CT
Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks
Diagnosis of coronavirus disease 2019 (COVID-19) with structured latent multi-view representation learning
Inf-Net: Automatic COVID-19 lung infection segmentation from CT images
Adaptive feature selection guided deep forest for COVID-19 classification with chest CT
Hypergraph learning for identification of COVID-19 with CT imaging
Coronavirus disease 2019 (COVID-19): A perspective from China
COVIDLite: A depth-wise separable deep neural network with white balance and CLAHE for detection of COVID-19
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
Deep learning COVID-19 features on CXR using limited training data sets
Multimodality image registration by maximization of mutual information
Independent component analysis: Algorithms and applications
Input feature selection by mutual information based on Parzen window
Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy
Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements
Tracking emerges by colorizing videos
Look, listen and learn
Mutual information neural estimation
Data uncertainty learning in face recognition
Probabilistic face embeddings
Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding
Weight uncertainty in neural networks
Uncertainty in deep learning
A practical Bayesian framework for backpropagation networks
What uncertainties do we need in Bayesian deep learning for computer vision
Deep convolutional encoder-decoder network with model uncertainty for semantic segmentation
Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving
Robust person re-identification by modelling feature uncertainty
Face recognition with Bayesian convolutional networks for robust surveillance systems
Asymptotic evaluation of certain Markov process expectations for large time, I
f-GAN: Training generative neural samplers using variational divergence minimization
Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics
Learning representations by maximizing mutual information across views
Blind image quality assessment based on high order statistics aggregation
HoMM: Higher-order moment matching for unsupervised domain adaptation
Sorting out typicality with the inverse moment matrix SOS polynomial
Metric learning with HORDE: High-order regularizer for deep embeddings, Int. Conf. Comput. Vis. (ICCV)
Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening
BIER: Boosting independent embeddings robustly
Random feature maps for dot product kernels
ReCoNet: Multi-level preprocessing of chest X-rays for COVID-19 detection using convolutional neural networks, medRxiv
Densely connected convolutional networks
CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images
Aggregated residual transformations for deep neural networks
DeCAF: A deep convolutional activation feature for generic visual recognition