key: cord-0560658-haocckgd authors: Zhang, Jianjia title: Triple-view Convolutional Neural Networks for COVID-19 Diagnosis with Chest X-ray date: 2020-10-27 journal: nan DOI: nan sha: ebf6645e9ae38a765e9b9169c31763e2d5284146 doc_id: 560658 cord_uid: haocckgd The Coronavirus Disease 2019 (COVID-19) is affecting an increasingly large number of people worldwide, posing significant stress on health care systems. Early and accurate diagnosis of COVID-19 is critical for screening infected patients and breaking person-to-person transmission. Chest X-ray (CXR) based computer-aided diagnosis of COVID-19 using deep learning has become a promising solution to this end. However, the diverse radiographic features of COVID-19 make this task challenging, especially considering that each CXR scan typically generates only a single image. Data scarcity is another issue, since collecting a large-scale medical CXR data set is difficult at present. Therefore, how to extract more informative and relevant features from the limited samples available becomes essential. To address these issues, unlike traditional methods that process each CXR image from a single view, this paper proposes triple-view convolutional neural networks for COVID-19 diagnosis with CXR images. Specifically, the proposed networks extract individual features from three views of each CXR image, i.e., the left lung view, the right lung view and the overall view, in three streams and then integrate them for joint diagnosis. The proposed network structure respects the anatomical structure of human lungs and is well aligned with the clinical diagnosis of COVID-19 in practice. In addition, labeling the views does not require experts' domain knowledge, which is needed by many existing methods. The experimental results show that the proposed method achieves state-of-the-art performance, especially in the more challenging three class classification task, and admits wide generality and high flexibility.
The Coronavirus Disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is quickly spreading over the world and a huge number of people have been affected. Nearly 27 million COVID-19 cases and 0.9 million deaths had been confirmed globally as of 07 Sep. 2020 [1]. The numbers still keep rising dramatically due to the high rate of community transmission and the lack of appropriate treatments and vaccines [2]. A critical measure to stop the virus transmission and restrain the outbreak is early diagnosis [3], [4]. Early diagnosis not only enables timely treatment for the patients affected, but also allows quick isolation of their close contacts for disease containment [4]. Although real-time reverse-transcription polymerase chain reaction (RT-PCR) is considered the gold standard for making a definitive diagnosis of COVID-19 infection, it encounters several issues in such a global pandemic. Firstly, its false negative rate is high. A patient initially tested negative could later test positive [5], [6]. Therefore, a series of RT-PCR tests, which can take up to two days [4], may be required to confirm a case. Secondly, RT-PCR test kits may not be sufficiently available in all areas across the world, especially when considering that the global supply chains are at risk due to major disruptions. In addition, the laboratory equipment required by RT-PCR could also be a bottleneck to conducting large scale tests.

Jianjia Zhang is with the School of Computer Science, University of Technology Sydney, Sydney, NSW 2007, Australia, e-mail: Jianjia.zhang@uts.edu.au. Luping Zhou is with the School of Electrical and Information Engineering, The University of Sydney, Sydney, NSW 2006, Australia, e-mail: luping.zhou@sydney.edu.au. Lei Wang is with the School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia, e-mail: leiw@uow.edu.au.
These issues may result in delayed or even missed diagnoses, and a computer-aided diagnosis system could, at least partially, automate the diagnosis and facilitate large-scale screening of COVID-19 patients. SARS-CoV-2 infection can attack various types of lung cells [7] and trigger an inflammatory response in the air sacs of the lungs, leading to a typical symptom of COVID-19 patients: pneumonia. Making matters even worse, both the left and right lungs are often involved in early, intermediate and late stage patients [8], causing breathlessness and even death. The inflammatory response can be detected by radiology examinations, especially chest computed tomography (CT) or chest X-ray (CXR). Typical radiographic features of COVID-19 patients in these scans include ground-glass opacities (GGO), multifocal patchy consolidation and/or interstitial changes with a peripheral distribution [3], [9]. These visual features specific to COVID-19 patients are used by clinicians for COVID-19 diagnosis. At the same time, they admit the possibility of computer-aided diagnosis. Considerable research interest has been devoted to this possibility, and the existing works can be summarized from four different perspectives: 1) Radiology exam used: CXR [2], [10], [11], [12], [13], [14] and CT [8], [3], [15] scans are the radiology exams most commonly used by recent works for computer-aided diagnosis of COVID-19. Ultrasound is also explored in a few recent studies, such as [16], [17]. At the same time, some other studies explore a combination of multiple source data for joint learning. For example, the works in [4], [18] integrate CT scans with non-imaging clinical metadata, e.g., clinical symptoms, exposure history, and/or laboratory testing, for COVID-19 prediction. 2) Diagnostic objectives: The existing works can be categorized into three groups according to their aims.
The first group, e.g., in [11], [2], [12], [13], [14], [18], aims to differentiate normal controls, COVID-19 patients, and other non-COVID-19 pneumonia, e.g., Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS), which forms a three class classification problem. The second group [8], [3], [15] aims to solve a binary classification problem of distinguishing COVID-19 from other non-COVID-19 pneumonia. The remaining group [4] also works on a binary classification problem, but it focuses on recognizing COVID-19 from normal. 3) Labeling information: Some existing works, e.g., [11], [8], [12], [13], only use the ground truth diagnosis results as labels in the training stage. Meanwhile, many other works [2], [3], [4], [14], [18] also utilize lung mask or lesion segmentation results to train a more focused classifier. For example, the work in [3] proposes an online attention module in convolutional neural networks (CNNs) to focus on the infection regions in the lungs when making diagnosis decisions. 4) Shallow or deep models: Most existing works [16], [2], [3] use CNNs to develop COVID-19 diagnosis models. Shallow models are also explored in a few works [11], [8], either with hand-crafted features or combined with deep models as in [4], [18]. There are several challenging issues faced by the earlier studies. One critical issue is that the radiographic features of COVID-19 are diverse and can vary [19]. Another challenge is the scarcity of training data, since collecting large-scale training data is difficult, if not impossible, in the current emergency of the COVID-19 pandemic. Moreover, the manual segmentation of lung or lesion masks for each scan required by the works in [2], [3], [4], [14], [18] is not only costly and time-consuming, but also requires experts' domain knowledge. These issues can affect the scalability and generality of the models.
In this case, it is desirable to design an effective model using limited training data without requiring domain knowledge. This is attempted in this paper. Inspired by the encouraging success achieved by the earlier studies and motivated by the issues above, this work proposes a novel triple-view CNN for COVID-19 diagnosis with CXR images, denoted as TV-CovNet in the remaining sections. Specifically, the proposed TV-CovNet consists of three streams that take three views of the lungs in each CXR image as inputs, i.e., the left lung view, the right lung view and the overall view, and conducts diagnosis by integrating the information from them. This idea respects the anatomical lung structure and the pathology of COVID-19. The bronchi of the left and right lungs are internally connected by the trachea, so SARS-CoV-2 could easily transfer from one lung to the other, especially considering the high transmission rate of COVID-19. This is why bilateral lung involvement can often be observed in COVID-19 patients. The proposed TV-CovNet is also inspired by clinical practice in diagnosing COVID-19 with CXR images, i.e., a clinician usually checks the left and right lungs individually and then jointly before making a decision. In comparison with traditional methods that treat each CXR image as a single-view image, the triple-view structure and joint decision making enable TV-CovNet to extract more representative and relevant features from each CXR image. This is especially important when the training data are limited. Moreover, for each CXR image at either the training or the test stage, we only need to provide three lung bounding boxes, and no expert knowledge from clinicians is required. This improves the generality of the proposed method in comparison with the methods in [2], [3], [4], [14], [18], which rely heavily on accurate contour segmentation of lung or lesion areas.
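Because only three bounding boxes per image are needed, the cropping step can be sketched with plain array slicing. The snippet below is a minimal illustration, not the paper's implementation: the box coordinates are hypothetical, and the overall view is assumed here to be the tightest box enclosing both lung boxes.

```python
# Minimal sketch of the triple-view cropping step. A CXR image is modeled as a
# 2D numpy array; each lung bounding box is a (top, left, bottom, right) tuple.
# Coordinates below are placeholders, not values from the paper.
import numpy as np

def crop_views(cxr, left_box, right_box):
    """Return (left_view, overall_view, right_view) crops of a CXR image."""
    t, l, b, r = left_box
    left_view = cxr[t:b, l:r]
    t, l, b, r = right_box
    right_view = cxr[t:b, l:r]
    # Assumption: the overall view is the tightest box enclosing both lungs.
    t = min(left_box[0], right_box[0]); l = min(left_box[1], right_box[1])
    b = max(left_box[2], right_box[2]); r = max(left_box[3], right_box[3])
    overall_view = cxr[t:b, l:r]
    return left_view, overall_view, right_view

cxr = np.zeros((512, 512))
lv, ov, rv = crop_views(cxr, (50, 40, 460, 250), (50, 260, 460, 470))
print(lv.shape, ov.shape, rv.shape)  # (410, 210) (410, 430) (410, 210)
```

Each cropped view would then be resized to the backbone network's input resolution before being fed into its stream.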
As far as we are aware, we are among the first to investigate the individual left and right lung views for CXR based COVID-19 diagnosis. CXR scans are used in the proposed TV-CovNet since they are cheaper, faster and more accessible across the world in comparison with CT. This paper's main contributions are summarized as follows: • We propose a novel triple-view CNN to conduct COVID-19 diagnosis using CXR images. The proposed networks can extract and integrate the diagnostic clues from the left lung, right lung and overall lung views for a joint decision, which respects the anatomical structure of the lungs and is well aligned with the practical diagnosis performed by clinicians. • Various methods are explored to integrate the information from the three views in a fusion layer. As will be demonstrated in the experimental evaluation, average pooling at the score layer attains the best performance. • The superior performance of the proposed method over the competing methods and its wide generality are consistently verified in two tasks with three backbone network architectures. Specifically, one task is differentiating normal from COVID-19, and the other is distinguishing normal, COVID-19, and other non-COVID-19 pneumonia. Most recent works study only one of these two tasks, but both are covered in this paper since we believe that either one may be important in certain applications. The three backbone network architectures are ResNet-50, ResNet-101 and ResNet-152. • We also demonstrate the high flexibility of the proposed network, i.e., it could easily be integrated with other state-of-the-art deep learning methods to further improve the diagnosis performance. Motivated by its success in computer vision tasks, e.g., object detection and image classification, deep learning has been intensively studied in recent years for the diagnosis of various conditions ranging from breast lesions [20] and cardiac disease [21] to, the focus of this paper, pneumonia [22].
Due to the current COVID-19 pandemic, there is intensive research interest devoted to computer-aided COVID-19 diagnosis using deep learning with radiology imaging. As one of the main complications caused by COVID-19, pneumonia is an infection of the lung tissues, including the bronchi, bronchioles and alveoli, resulting in breathing difficulties or even respiratory failure. CXR [2], [10], [11], [12], [13], [14] and CT [8], [3], [15], [4], [18] scans are the radiologic examinations most commonly used by clinicians to identify pneumonia inflammation. COVID-19 positive cases present radiographic abnormalities such as ground-glass opacity and bilateral patchy shadowing in CXR and CT images [23], [13]. Although a CT scan could provide more detailed diagnostic clues for identifying COVID-19 positive cases, CXR is probably a more practical option, especially in resource-constrained or heavily-affected areas, for its various advantages over CT [13]. Firstly, X-ray machines are more widely and readily accessible across the world. They are among the most commonly equipped devices in medical institutions of all levels due to their wide application and significantly lower price in comparison with CT. Secondly, CXR has higher scanning efficiency, allowing rapid screening. Thirdly, there are various kinds of portable CXR machines, which can better adapt to various application contexts. Last but not least, a CXR scan delivers a much lower dose of radiation than a CT scan, typically less than 4% of the latter [24]. These advantages motivate this paper to develop a CXR based COVID-19 diagnosis method. While diagnosis via CXR scans enjoys the various advantages explained above, it also brings challenges to developing a robust and effective diagnosis method. On the one hand, each CXR scan typically generates only a single image for diagnosis.
In comparison, a CT scan generates 3D volumetric data and enables detailed visualization from multiple perspectives. On the other hand, the CXR scans from COVID-19 patients that can be used as training data are limited, since it is difficult to collect a large number of samples in the current emergency. In this case, how to better extract informative features from each sample becomes essential. Many existing works [2], [14] resort to manual labeling of lesion or lung masks; however, this is time-consuming and can only be done by experts. This paper attempts to address this issue, without requiring experts for labeling, from another perspective. That is, unlike the existing works that treat a CXR scan as a single-view image, we explore multi-view feature extraction from each CXR scan by building multi-stream networks. In fact, there have been a number of attempts to develop multi-stream networks in computer vision tasks, e.g., action recognition [25] and 3D reconstruction [26]. In these tasks, the input data can be naturally decomposed into multiple components. For example, a video clip of an action sequence is split into spatial and temporal components in [25]. In the area of medical data analysis, a similar methodology has been explored to fuse multi-stream information within shallow models [27], [28], [8] or deep learning models [29], [4]. These works can be roughly categorized into two groups: i) the first group aims to fuse different modalities, e.g., MRI and PET data in [28], PET and CT in [29], and CT and clinical metadata in [4]; ii) the second group works on combining multiple features extracted from the same modality, e.g., features with multiple templates of brain regions of interest in [27] and multiple hand-crafted features from CT images in [8]. These works consistently demonstrate that a fusion of multiple modalities or features is able to improve the performance over any single modality or feature.
Unfortunately, these multi-stream networks can hardly be readily applied to a sole CXR scan, upon which we are trying to develop a computer-aided diagnosis method in this paper. One obvious reason is that we presume no other modality scan or clinical information is available except a single CXR image per subject. In addition, although a combination of multiple hand-crafted features in shallow models is always doable, CNN based end-to-end methods are preferable due to their capability to learn more specific and adaptive features, especially for challenging CXR images. After a careful review of the existing works and the challenges explained above, we propose TV-CovNet for COVID-19 diagnosis with CXR images. As aforementioned, the proposed TV-CovNet is in alignment with both the anatomical lung structure and clinicians' practical diagnosis of COVID-19. The overall framework of the proposed TV-CovNet is illustrated in Fig. 1. As seen, each input CXR image, which is assumed to be in the posteroanterior (PA) view in this paper, is firstly cropped into the left lung view (the top blue stream), the overall view (the middle orange stream) and the right lung view (the bottom green stream). In comparison with the lesion or lung mask labeling in the literature, such as in [2], [3], [4], [14], [18], this cropping step is efficient and requires no domain expert knowledge. Therefore, this cropping step is not restricted to clinicians, which could significantly improve the generality of the proposed method. Also, it will remain applicable when large-scale training data become available in the future. Once cropped, the three views are fed into the three streams of TV-CovNet respectively. As shown in Fig. 1, three colors, i.e., blue, orange and green, are used to indicate the corresponding streams. The backbone network architecture of each stream is flexible, i.e., it can be a specifically-designed or off-the-shelf network architecture.
The network architectures in the three streams can be identical or different. The features extracted from the three streams will be combined in a fusion layer, which will be explained in detail in the following section. The fusion layer is followed by a final fully connected classification layer. The advantages of the proposed TV-CovNet are recapped as follows: • In comparison with the existing works taking a single overall view of CXR scans as input, TV-CovNet can extract more detailed and complementary information from the left lung, right lung and overall views in three different streams, providing more clues for accurate diagnosis. This is motivated by clinical practice in diagnosing COVID-19. Each CXR scan of human lungs in the PA view is composed of the left and right lungs, and these two lungs may present different visual characteristics although they are internally connected and, in most COVID-19 cases, bilaterally affected. Therefore, the left and right lungs are usually reviewed individually and jointly by a clinician to identify more diagnostic clues; • TV-CovNet does not require experts for manual data labeling, admitting high generality to data sets of different sizes. More detailed labeling information, e.g., the lesion or lung mask labeling in [2], [3], [4], [14], [18], could certainly help to train a more focused network and may improve the diagnosis accuracy. At the same time, however, it may limit the generality of the networks due to the extensive requirement for experts' domain knowledge. In contrast, cropping three views in TV-CovNet is much simpler and more efficient, without requiring domain knowledge. In addition, as a byproduct, TV-CovNet could possibly alleviate the effects of dataset bias. In order to construct a CXR training dataset for CNN based COVID-19 diagnosis, most works in the literature combine multiple datasets from various institutes as one, as in [2], [3], [11], [12].
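The three-streams-plus-fusion data flow described above can be sketched numerically. The toy code below is only an illustration of the structure, assuming simple linear stream encoders in place of the real backbone networks and mean score pooling as the fusion function; none of the shapes or weights come from the paper.

```python
# Toy sketch of the triple-view structure: three per-view "streams" (stand-ins
# for the ResNet backbones), each producing a class-score vector, fused by mean
# score pooling. All weights and dimensions are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
D, C = 8, 3  # feature dimension and number of classes (Normal/COVID-19/Other)

def stream(view, W_feat, W_cls):
    """One stream: a feature extractor followed by a classification layer."""
    f = np.maximum(view @ W_feat, 0.0)  # stand-in for backbone features (ReLU)
    return f @ W_cls                    # class scores for this view

# One (feature, classifier) weight pair per view: left, overall, right.
weights = [(rng.normal(size=(16, D)), rng.normal(size=(D, C))) for _ in range(3)]
views = [rng.normal(size=16) for _ in range(3)]  # flattened toy "view images"

scores = [stream(v, Wf, Wc) for v, (Wf, Wc) in zip(views, weights)]
s = np.mean(scores, axis=0)   # fusion layer: mean score pooling over the views
pred = int(np.argmax(s))      # joint diagnosis decision
print(s.shape, pred)
```

In the actual networks the three streams and the fusion layer are trained jointly end-to-end, which is what distinguishes this structure from an ensemble of independently trained per-view classifiers.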
However, as pointed out in [30], [11], joining different databases may add bias from the non-lung areas: the networks may learn to recognize which database a test sample originates from rather than the lung injuries, especially when the training data are scarce. To a certain extent, the proposed method could minimize the negative effects of this issue by feeding only the cropped lung areas in the left and right lung views. On the one hand, the left and right lung views are non-overlapping, so no specific non-lung area will appear in both of these two views. On the other hand, the strategy of diagnosing from triple views of the lungs helps to alleviate the effects of the non-lung areas in the overall view. The above section explains how TV-CovNet extracts features from three streams. With these output features, a following issue is how to effectively integrate them in a fusion layer, as illustrated in Fig. 1. To address this issue, five fusion methods will be studied. Specifically, the five fusion methods perform at two levels, i.e., the feature level (sub-figures (b), (c), (d) in Fig. 2) and the score level (sub-figures (e), (f) in Fig. 2). Feature level fusion. Let us denote the output feature vectors from the left lung, overall and right lung streams as f_L, f_O and f_R, respectively. They are assumed to have the same dimensionality D, i.e., f_L, f_O, f_R ∈ R^D, as demonstrated in Fig. 2 (a). This can be easily met by using the same backbone network architecture in the three streams, or by appending a fully connected layer with D-dimensional output if different network architectures are used. At this level, a fusion function F: (f_L, f_O, f_R) → f combines the three feature vectors into a single feature f. The combined feature f will be fed into the classification layer to calculate the class score s. We investigate the following three fusion functions, illustrated in Fig. 2 (b), (c), (d): • Max feature pooling.
f = F_max(f_L, f_O, f_R) takes the element-wise maximum of the three input feature vectors, i.e., f_i = max(f_{L,i}, f_{O,i}, f_{R,i}) for i = 1, ..., D. The resulting feature f after the max pooling operation is still a vector with the same dimension D as the input vectors. • Mean feature pooling. f = F_mean(f_L, f_O, f_R) takes the element-wise mean of the three input feature vectors, i.e., f = (f_L + f_O + f_R)/3. Similarly, the resulting feature f after mean pooling is also a D-dimensional vector. • Feature concatenation. f = F_cat(f_L, f_O, f_R) concatenates the three input feature vectors, i.e., f = [f_L; f_O; f_R]. The resulting concatenated feature f is 3D-dimensional. Score level fusion. If fusion is applied at the score level, the scores of the last classification layer in the three streams will be combined. Let s_L, s_O and s_R denote the corresponding class score vectors from the left, overall and right lung streams. At this level, a fusion function F: (s_L, s_O, s_R) → s combines the class score vectors into the final class score s. We investigate the following two fusion functions, as illustrated in sub-figures (e), (f) in Fig. 2: • Max score pooling. Similar to the max feature pooling above, s = F_max(s_L, s_O, s_R) takes the element-wise maximum of the three input score vectors. • Mean score pooling. s = F_mean(s_L, s_O, s_R) takes the element-wise mean of the three input score vectors, i.e., s = (s_L + s_O + s_R)/3. The dataset used in our evaluation is constructed from CXR images from the following two publicly accessible sources: • COVID-19 image data collection [31]. This dataset contains CXR images of patients confirmed with COVID-19 or other non-COVID-19 pneumonia. All the CXR images in the PA view are collected into the combined dataset. The COVID-19 cases form the "COVID-19" class, while the other non-COVID-19 pneumonia cases are assigned to the "Other" class. • Chest X-ray Database [32], [33]. This CXR dataset is composed of normal cases and patients with tuberculosis. The normal cases form the "Normal" class and the tuberculosis cases are assigned to the "Other" class. The numbers of CXR samples in each class of the combined dataset are shown in Table I. The left lung, right lung and overall views of each CXR scan, as shown in Fig.
1, in the combined dataset will be cropped. We use the ResNet [34] architecture pretrained on ImageNet [35] as the backbone network due to its promising performance in recent works on COVID-19 diagnosis [12], [13], [14]. In order to verify the generality of the proposed TV-CovNet, three backbone networks are adopted in two classification tasks. Specifically, ResNet-50, ResNet-101 and ResNet-152 will each be applied within TV-CovNet to evaluate their performance. The two classification tasks are a binary classification task between Normal and COVID-19 and a three class classification task among Normal, COVID-19 and Other. Cross entropy is used as the objective function, with a learning rate of 0.01 and a momentum of 0.9. The training process is iterated for 100 epochs with a batch size of 10 samples, and the learning rate is decreased by a factor of 0.1 every 20 epochs. Among all the samples in each class, we randomly select 60% as the training set and use the rest as the test set. This random training/test partition is repeated 15 times for each classification task to obtain stable statistics. The following two recent works achieving state-of-the-art performance on COVID-19 diagnosis with CXR images will be involved in our experimental evaluation: • COVID-Net [13]. COVID-Net is one of the earliest works on designing CNNs for the detection of COVID-19 cases with CXR images. It classifies normal, COVID-19 and non-COVID-19 pneumonia, which shares the same setting as the three class classification task in this paper. Their publicly accessible implementation at https://github.com/lindawangg/COVID-Net is used in our evaluation, and the overall view of CXR images is used as input to train the networks. • LocalPatch [2]. LocalPatch is another recent work closely related to ours. LocalPatch tries to address the issue of data scarcity by developing a patch-based CNN with many random patches cropped from each CXR image.
The final classification result for a test sample is obtained by majority voting over the inference results of the patches cropped from the test CXR image. We implemented their network by following the paper. As in the paper, 100 random patches are cropped from the overall view of each CXR scan for majority voting during the inference stage. The essential objective of this paper is to extract informative visual features from the left and right lung views, in addition to the overall view, to boost the diagnosis accuracy. In this sense, an implicit assumption is that the left or right lung view should contain certain information for diagnosis. In order to verify the validity of this assumption, this section explores COVID-19 diagnosis using single-stream networks with a single view of CXR images as input. Specifically, a pretrained ResNet, i.e., one of ResNet-50, ResNet-101 or ResNet-152, is used as the backbone network and the final classification layer is reset to predict the three classes of "Normal", "COVID-19" and "Other". Note that each classifier has only one stream, as in the commonly used setting from the literature, and it is fine-tuned with samples from one of the overall, left lung or right lung views. The performance of the fine-tuned networks is reported in Table II. As seen, with the overall view, ResNet-50 achieves an accuracy of 79.4%, which is reasonably good considering the challenges of CXR based COVID-19 diagnosis. If the left or right lung view is used, ResNet-50 obtains 80.2% and 79.6%, respectively, which are comparable to or even slightly higher than that of the overall view. This demonstrates that a single lung view indeed contains substantial valuable information for diagnosis. Moreover, it is interesting to see that the left or right lung view could even outperform the overall view, considering that each lung view is essentially an interior part of the overall view.
This is probably because, although the overall view contains all the visual contents of both the left and right lung views, it also contains considerable non-lung areas, whose visual characteristics could introduce bias or disturbance to the classifier. In contrast, these non-lung areas are partially excluded from the left or right lung views, so the networks can focus more on the lung area and extract COVID-19 related features. Similar results can be observed when ResNet-101 or ResNet-152 is used. These results lay a solid foundation for the integration of multiple views to extract complementary diagnostic information and alleviate the effects of non-related features. The above section verifies the assumption of the proposed TV-CovNet, and this section will evaluate its effectiveness in integrating the left lung, right lung and overall views for joint diagnosis. For simplicity and clear comparison, an identical backbone network architecture is applied to the three streams of TV-CovNet. The network architecture is set as ResNet-50, ResNet-101 and ResNet-152, respectively, to verify its generality with respect to different network depths. Both binary classification, i.e., between Normal and COVID-19, and three class classification among Normal, COVID-19 and Other will be performed. The results in the binary case will be evaluated by five metrics computed from the numbers of true positive, true negative, false positive and false negative predictions. In the three class case, accuracy is used as the evaluation metric. As aforementioned, the random partition of training/test sets is repeated 15 times for each task to obtain stable statistics. Table III reports the results in four portions, with the first portion containing the competing methods, including COVID-Net [13] and the ResNet-18 based LocalPatch [2] method. The remaining three portions report the performance of methods with ResNet-50, ResNet-101 and ResNet-152 [34], respectively.
In each of these three portions, we first show the performance of the baseline method, i.e., a single-stream ResNet with the overall view as input. It is followed by the results of the LocalPatch [2] method with ResNet. Then we report the performance of the proposed TV-CovNet with the five fusion methods, where "fea" denotes the feature level fusion and "sc" refers to the score level fusion introduced in Section III-A, while "cat", "max" and "mean" indicate feature concatenation, max pooling and mean pooling, respectively. As reported in the first portion, the competing method COVID-Net [13] obtains an accuracy of 78.0% in the three class classification case. It is not applied to the two class case since the CNN in that method is intentionally designed for the three class classification task. The ResNet-18 based LocalPatch [2] method improves the performance to 80.5% in the three class classification case, an improvement of 2.5 percentage points. In the second portion, as seen in the binary classification case, ResNet-50 [34] already achieves a very promising accuracy of 98.7%. LocalPatch [2] with ResNet-50 improves the performance of ResNet-50 in all five metrics, verifying the effectiveness of generating patches for majority voting in [2]. When the proposed TV-CovNet is applied with the ResNet-50 architecture as the backbone network, among the feature fusion methods, TV-CovNet fea cat and TV-CovNet fea max obtain accuracy comparable to the baseline ResNet-50 [34], but no improvement is observed. In contrast, TV-CovNet fea mean improves over ResNet-50 in all five metrics. This demonstrates the effectiveness of integrating triple-view information for joint diagnosis. When the score fusion methods are applied, TV-CovNet sc max is comparable to ResNet-50 [34], while TV-CovNet sc mean outperforms ResNet-50 [34], becoming comparable to ResNet-50 + LocalPatch [2]. In the more challenging three class classification task, a similar trend can be observed.
Specifically, TV-CovNet fea cat, TV-CovNet fea max and TV-CovNet sc max are comparable to ResNet-50, while TV-CovNet fea mean and TV-CovNet sc mean achieve considerable improvement over ResNet-50. In particular, TV-CovNet sc mean outperforms all the competing methods, achieving an improvement of 4.3 percentage points over ResNet-50 and 2.1 percentage points over the state-of-the-art method LocalPatch [2]. Similar results can be observed with the ResNet-101 and ResNet-152 architectures as backbone networks, as shown in the bottom two portions of Table III. In particular, the proposed mean score pooling (denoted as sc mean) method consistently obtains the best performance among the competing methods. In order to provide more details on the results above, the accuracies of the baseline ResNet and TV-CovNet with mean score pooling over the 15 splits are shown in Table IV, for both the binary classification case and the three class classification case. As seen, regardless of whether ResNet-50, ResNet-101 or ResNet-152 is used, the proposed TV-CovNet sc mean outperforms ResNet in most splits, and the improvement is statistically evident, as verified by the small p-value (< 0.05) of Student's t-test. The corresponding confusion matrices averaged over the 15 splits are shown in Table V, for both the binary classification case and the three class classification case. As can be seen, all the diagonal entries of the confusion matrices in the right column, obtained by the proposed method, are enlarged, while all the off-diagonal entries are reduced in comparison with the results of the baseline ResNet in the left column. This observation is consistent in both the binary and three class cases with any of ResNet-50, -101 or -152. In summary, as verified in our experiments above, fusion at the score level performs better than fusion at the feature level in the proposed TV-CovNet. Regarding the fusion methods, mean pooling performs better than the feature concatenation or max pooling methods.
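The significance check mentioned above, a Student's t-test over per-split accuracies on the same 15 random splits, can be reproduced as a paired test. The sketch below uses made-up accuracy values purely for illustration; it does not reproduce the paper's numbers.

```python
# Paired Student's t-test over per-split accuracies, comparing a baseline
# against an improved model evaluated on the same 15 random splits.
# All accuracy values below are invented for illustration only.
import numpy as np

def paired_t(a, b):
    """t statistic for paired samples a and b (positive when a > b on average)."""
    d = np.asarray(a) - np.asarray(b)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

baseline = np.array([0.79, 0.80, 0.78, 0.81, 0.79, 0.80, 0.78, 0.79,
                     0.80, 0.79, 0.78, 0.80, 0.79, 0.81, 0.78])
gains = np.array([0.05, 0.03, 0.04, 0.06, 0.02, 0.04, 0.05, 0.03,
                  0.04, 0.05, 0.03, 0.04, 0.02, 0.06, 0.04])
tv_covnet = baseline + gains  # hypothetical per-split improvements

t = paired_t(tv_covnet, baseline)
# With 14 degrees of freedom, |t| > 2.145 corresponds to p < 0.05 (two-sided).
print(t > 2.145)  # True
```

A paired test is appropriate here because both models are evaluated on identical train/test partitions, so per-split differences are the natural unit of comparison.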
TV-CovNet with mean pooling at the score level consistently outperforms the competing methods and demonstrates state-of-the-art performance. The above section has verified the effectiveness of the proposed TV-CovNet, which integrates the visual features from the left lung, right lung and overall views. A natural question arises: if the key objective is to integrate the complementary information from the three views for joint diagnosis, would ensemble methods also serve this purpose? To answer this question, we train three classifiers separately on the cropped left lung, right lung and overall views, respectively, and ensemble the output scores of the three classifiers during the test phase by applying a max or mean function. The results are reported in Table VI. As seen, the ensemble method with max (denoted ensemble_max) or mean (denoted ensemble_mean) consistently improves over the baseline ResNet in both the two- and three-class classification cases, no matter whether ResNet-50, ResNet-101 or ResNet-152 is used. However, the ensemble methods do not perform as well as the proposed TV-CovNet sc_mean, since the latter better integrates the complementary information from the three views in an end-to-end learning manner. In contrast, the ensemble methods treat the three views independently and do not allow interaction between them during the training stage. This section investigates whether the overall view is still required in TV-CovNet, considering that the left lung and right lung views already contain all the visual information of the two lungs. To this end, we remove the overall-view stream from the proposed TV-CovNet and train double-view networks, denoted DV-CovNet, with the left lung and right lung views only. Its performance is compared with that of TV-CovNet. Since mean score pooling has consistently obtained the best performance among the five fusion methods in all the experiments above, only this fusion method is applied in this section.
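The ensemble baseline described above can be sketched as follows (a hypothetical illustration, not the paper's code): each view has its own independently trained classifier, and only their output score vectors are combined at test time, so the views never interact during training.

```python
import numpy as np

def ensemble_predict(score_fns, x, mode="mean"):
    """Test-time ensembling of independently trained per-view classifiers.

    score_fns: one scoring function per view (left, right, overall), each
    mapping an input to a class-score vector. mode 'mean' is ensemble_mean,
    'max' is ensemble_max. Returns (predicted label, fused scores).
    """
    scores = np.stack([fn(x) for fn in score_fns])
    if mode == "mean":
        fused = scores.mean(axis=0)
    elif mode == "max":
        fused = scores.max(axis=0)
    else:
        raise ValueError(f"unknown ensemble mode: {mode}")
    return int(np.argmax(fused)), fused

# Toy stand-ins for the three per-view classifiers (fixed outputs for
# illustration; real classifiers would depend on the input x).
left    = lambda x: np.array([0.1, 0.8, 0.1])
right   = lambda x: np.array([0.2, 0.5, 0.3])
overall = lambda x: np.array([0.3, 0.4, 0.3])
label, fused = ensemble_predict([left, right, overall], None, mode="mean")
```

The structural contrast with TV-CovNet is that here the combination happens only at inference, whereas TV-CovNet back-propagates through the fused score during training.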
As reported in Table VII, DV-CovNet indeed improves over the baseline ResNet, but it is outperformed by its TV-CovNet counterpart. This demonstrates that the overall view does contribute in TV-CovNet. This is probably because, although the left and right lung views contain all the visual content of the two lungs, the overall view enables learning certain co-occurrence features and forming a hierarchical representation of a CXR image, which benefits the final diagnosis.
C. Do pretrained networks help?
In the experiments above, the ResNet parameters are initialized with networks pretrained on ImageNet [35]; however, the images in ImageNet are natural images rather than medical images. This section verifies whether initialization with pretrained parameters benefits the diagnosis accuracy, given the large domain gap between the images from ImageNet and those in the CXR dataset constructed in this paper. Table VIII reports the performance of ResNet and TV-CovNet with and without pretrained parameter initialization. As seen, initialization with pretrained parameters always outperforms random initialization, for both the baseline ResNet and the proposed TV-CovNet, with any of ResNet-50, -101 or -152.
TABLE VIII. Comparison between non-pretrained and pretrained initialization.
It has been verified above that, among the five fusion methods in Section III-A, mean score pooling leads to the best performance in TV-CovNet. In that method, the three views are treated as equally important. This section studies whether weighted fusion can further improve the performance. To this end, three positive scalars w_L, w_O, w_R are assigned as stream weights to the corresponding scores s_L, s_O, s_R of the three streams. These weights are adaptively optimized during the training stage, and the weighted score mean, i.e., (w_L * s_L + w_O * s_O + w_R * s_R) / (w_L + w_O + w_R), is used as the final prediction score. As reported in Table IX, the weighted mean performs worse than TV-CovNet sc_mean. This is probably due to the limited training data scale, and the adaptively learned weights may lead to over-fitting in this case.
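A minimal sketch of the weighted score fusion follows. The exact parameterization used in the paper is not shown in this excerpt; here, as an assumption, three unconstrained parameters are mapped through exp() to keep the effective weights positive, and the weights are normalized so that equal parameters reduce exactly to plain mean score pooling.

```python
import numpy as np

def weighted_score_mean(s_L, s_O, s_R, theta):
    """Weighted score fusion with learnable stream weights (a sketch).

    theta: three unconstrained scalars; w = exp(theta) / sum(exp(theta))
    keeps w_L, w_O, w_R positive and normalized, so uniform theta gives
    the unweighted mean (sc_mean). In training, theta would be optimized
    jointly with the network parameters.
    """
    w = np.exp(np.asarray(theta, dtype=float))
    w = w / w.sum()
    return w[0] * s_L + w[1] * s_O + w[2] * s_R

# With equal parameters this is exactly the unweighted mean of the scores.
s_L, s_O, s_R = np.array([0.2, 0.7]), np.array([0.4, 0.5]), np.array([0.3, 0.6])
fused = weighted_score_mean(s_L, s_O, s_R, theta=[0.0, 0.0, 0.0])
```

The extra three parameters are what the paper suggests can over-fit at small training scale: a strongly skewed theta effectively discards two of the three views.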
The proposed TV-CovNet is a flexible framework that can be easily integrated with various network architectures and training strategies. This section shows an integration of the proposed method with LocalPatch [2] as an example. Specifically, following LocalPatch [2], we generate random region crops from each of the three views and use these crops as input to train TV-CovNet. During the test stage, we also generate 100 crops for each view and apply majority voting to obtain the final label, as in [2]. As seen in Table X, the integrated method, TV-CovNet sc_mean + LocalPatch, further improves the diagnosis accuracy in the three-class classification task with any of ResNet-50, -101 or -152. The two-class case is not reported, since the existing methods already achieve very high accuracy there and the more challenging three-class case better demonstrates the comparison. This section studies the performance trend of the baseline ResNet and the proposed TV-CovNet with respect to different scales of training data in the three-class classification case. As shown in Fig. 3, the x-axis indicates the ratio of samples assigned to the training set and the y-axis shows the corresponding performance of ResNet-50 and the proposed TV-CovNet-50 sc_mean. As seen, both methods achieve higher accuracy as the ratio increases, but the accuracy of TV-CovNet-50 sc_mean rises faster and its margin over ResNet-50 becomes larger. This study indicates the promising learning capacity of TV-CovNet. To better extract informative visual features from the two lungs in CXR images for COVID-19 diagnosis, we proposed a triple-view network structure. The proposed structure respects the anatomical structure of human lungs and is well aligned with clinicians' diagnostic practice with CXR images.
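The LocalPatch-style test-time procedure described above (random crops per view, then majority voting over per-crop predictions) can be sketched as follows. This is a hypothetical illustration: `classify_crop` stands in for the trained network's per-crop prediction, and the crop size is an assumed parameter.

```python
import random
from collections import Counter

def majority_vote_predict(classify_crop, image, n_crops=100, crop=64, seed=0):
    """LocalPatch-style inference sketch: sample random square crops from
    a view, classify each crop, and return the majority label.

    image: a 2-D grid (list of rows) standing in for one cropped view;
    classify_crop: hypothetical per-crop classifier returning a label.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    votes = []
    for _ in range(n_crops):
        y = rng.randrange(0, h - crop + 1)   # top-left corner of the crop
        x = rng.randrange(0, w - crop + 1)
        patch = [row[x:x + crop] for row in image[y:y + crop]]
        votes.append(classify_crop(patch))
    # Majority vote over the per-crop labels.
    return Counter(votes).most_common(1)[0][0]

# Toy usage: a constant classifier, so the majority label is trivially fixed.
image = [[0] * 128 for _ in range(128)]
label = majority_vote_predict(lambda patch: "COVID-19", image, n_crops=100)
```

In the integrated method, this voting would be applied per view on top of the triple-view fused scores, matching the 100-crop protocol of [2].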
The advantages and effectiveness of the proposed structure are experimentally verified in both the binary classification task between normal and COVID-19 cases and the three-class classification task among normal, COVID-19 and non-COVID-19 pneumonia cases. All the results consistently show that the proposed structure achieves state-of-the-art performance. Various properties of the proposed method are discussed, including its comparison with ensemble methods, its flexibility to be extended, and its promising modeling capacity with increasing training data scale. These discussions provide insight into why the proposed method performs well and may inspire future explorations along this line. Several open issues are worth exploring in this line of research. Firstly, the proposed networks can be extended to other scan modalities, e.g., CT, or to a combination of multiple scans. Secondly, the backbone networks in the different streams need not be the same; specially designed networks for each stream may admit more adaptive feature extraction and achieve better diagnosis accuracy. Last but not least, collecting training data at a larger scale is also critical for further evaluation and performance improvement of the proposed method.
References
[1] Weekly epidemiological update.
[2] Deep learning COVID-19 features on CXR using limited training data sets.
[3] Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia.
[4] Artificial intelligence for rapid identification of the coronavirus disease 2019.
[5] Coronavirus disease 2019 (COVID-19): a perspective from China.
[6] A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster.
[7] SARS-CoV-2 and viral sepsis: observations and hypotheses.
[8] Diagnosis of coronavirus disease 2019 (COVID-19) with structured latent multi-view representation learning.
[9] CT imaging features of 2019 novel coronavirus (2019-nCoV).
[10] Deep learning for screening COVID-19 using chest X-ray images.
[11] COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios.
[12] Towards an efficient deep learning model for COVID-19 patterns detection in X-ray images.
[13] COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images.
[14] X-ray image based COVID-19 detection using pre-trained deep learning models.
[15] AI augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other etiology on chest CT.
[16] Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound.
[17] Is there a role for lung ultrasound during the COVID-19 pandemic?
[18] Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography.
[19] Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study.
[20] Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans.
[21] Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved?
[22] CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning.
[23] Clinical characteristics of coronavirus disease 2019 in China.
[24] Comparison of effective radiation doses from X-ray, CT, and PET/CT in pediatric patients with neuroblastoma using a dose monitoring program.
[25] Two-stream convolutional networks for action recognition in videos.
[26] GarNet: a two-stream network for fast and accurate 3D cloth draping.
[27] Inherent structure-based multiview learning with multitemplate feature representation for Alzheimer's disease diagnosis.
[28] Multi-view feature selection and classification for Alzheimer's disease diagnosis.
[29] Accurate esophageal gross tumor volume segmentation in PET/CT using two-stream chained 3D deep network fusion.
[30] A critic evaluation of methods for COVID-19 automatic detection from X-ray images.
[31] COVID-19 image data collection.
[32] Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration.
[33] Automatic tuberculosis screening using chest radiographs.
[34] Deep residual learning for image recognition.
[35] ImageNet: a large-scale hierarchical image database.