key: cord-0498505-1kz2btih
authors: Huang, Huimin; Cai, Ming; Lin, Lanfen; Zheng, Jing; Mao, Xiongwei; Qian, Xiaohan; Peng, Zhiyi; Zhou, Jianying; Iwamoto, Yutaro; Han, Xian-Hua; Chen, Yen-Wei; Tong, Ruofeng
title: Graph-based Pyramid Global Context Reasoning with a Saliency-aware Projection for COVID-19 Lung Infections Segmentation
date: 2021-03-07
journal: nan
DOI: nan
sha: e03fce0ec184976628b276202efd2973fcaa08b4
doc_id: 498505
cord_uid: 1kz2btih

Coronavirus Disease 2019 (COVID-19) spread rapidly in 2020, prompting a large number of studies on lung infection segmentation from CT images. Although many methods have been proposed, the task remains challenging because infections of various sizes appear in different lobe zones. To tackle these issues, we propose a Graph-based Pyramid Global Context Reasoning (Graph-PGCR) module, which is capable of modeling long-range dependencies among disjoint infections as well as adapting to size variation. We first incorporate graph convolution to exploit long-range contextual information from multiple lobe zones. Unlike previous approaches based on average pooling or maximum object probability, we propose a saliency-aware projection mechanism to pick out infection-related pixels as a set of graph nodes. After graph reasoning, the relation-aware features are projected back to the original coordinate space for the downstream tasks. We further construct multiple graphs with different sampling rates to handle the size variation problem. In this way, distinct multi-scale long-range contextual patterns can be captured. Our Graph-PGCR module is plug-and-play and can be integrated into any architecture to improve its performance. Experiments demonstrate that the proposed method consistently boosts the performance of state-of-the-art backbone architectures on both a public dataset and our private COVID-19 dataset.

The outbreak of Coronavirus Disease 2019 (COVID-19) has spread rapidly across the world and has been declared a global pandemic [1].
Accurate lung infection segmentation is one of the most important pre-processing steps for the assessment and quantification of COVID-19 [2] [3] [4] [5]. The classic UNet [6] and UNet++ [7] have been widely used as segmentation architectures for COVID-19. Recently, UNet-Inf [8] with a parallel partial decoder was proposed to segment lung infections. Despite achieving good results, these approaches are still incapable of exploiting sufficient information from multifocal infections that appear in different lobe zones [9, 10]. This may hinder segmentation performance, especially when each pixel is considered in isolation, as local information is noisy and ambiguous. It is also noteworthy that infections of various sizes occur at different scales. To tackle these two challenges, it is imperative to perform multi-scale long-range interactions on COVID-19 CT images, which helps model long-range dependencies among multiple lesions.

Recently, graph convolution [11] has been incorporated into computer vision tasks for global reasoning. Existing approaches fall into two categories: feature space graph convolution and coordinate space graph convolution. Feature space graph convolution projects the feature into a non-coordinate space and captures interdependencies along the channel dimension of the feature map [12] [13] [14] [15], while coordinate space graph convolution projects the feature into a new coordinate space and explicitly models the spatial relationships between pixels [16] [17] [18] [19] [20], which helps produce coherent predictions across disjoint infections.

In this paper, we propose a saliency-aware projection-based Graph-based Pyramid Global Context Reasoning (Graph-PGCR) module for COVID-19 lung infection segmentation.
Unlike existing work, in which infection-related pixels are highlighted via average pooling [16] or maximum object probability [17], we propose a saliency-aware projection (SAP) that attends to 'where' the informative parts are and thus selects discriminative pixels to form a fully-connected graph. In addition, we take multi-scale cues into consideration to address the challenge that different infections appear at various scales. Inspired by the Pyramid Pooling Module [21], we build a pyramid global context reasoning architecture to harvest multi-scale representations via SAP with various sampling rates. A coarser graph is constructed with a lower sampling rate, providing more global dependencies for the larger receptive field, while a finer graph is built with a higher sampling rate, embedding more explicit long-range context for the smaller receptive field. In this way, we can perform graph reasoning on each scale and aggregate local and global clues to make the final prediction more reliable. Our Graph-PGCR module is plug-and-play and thus can be integrated into a wide variety of existing network architectures to further enhance their performance.

In summary, the main contributions of this research are four-fold: (i) we propose a Graph-PGCR module to model long-range dependencies among disjoint infections as well as to adapt to size variation; (ii) we propose a SAP mechanism to select the infection-related pixels as a set of graph nodes, where global contextual information can be propagated via graph convolution; (iii) we construct multiple graphs to harvest multi-scale contextual patterns from infections of various sizes; (iv) we conduct extensive experiments on public and private COVID-19 datasets, where our method yields consistent improvements over a number of baselines.

Fig.1 illustrates an overview of the proposed Graph-PGCR module in a segmentation architecture (e.g., UNet).
Given an input image, we first extract features via the UNet encoder, and then our Graph-PGCR module is integrated to capture multi-scale long-range representations. Using the saliency-aware projection, the input feature map $X$ is first sampled into $K$ (e.g., $K = 3$) parallel pyramid levels with various scales. After graph convolution on each level individually, reprojection via upsampling and multi-scale fusion with concatenation layers are performed to generate the fused feature representation, which is finally fed into the UNet decoder for prediction. It is worth noting that the input feature map $X$ can be extracted from any layer of a deep convolutional model. In the following subsections, we introduce the details of each component in the Graph-PGCR module.

In order to project infection-related pixels in coordinate space into a set of graph nodes in a new coordinate space, we propose a saliency-aware projection mechanism, which integrates an attention mechanism with a pooling operation. Specifically, the attention mechanism learns where to emphasize or suppress, while the pooling operation picks out discriminative pixels. Before projection, we need to reduce the dimension of the feature map $X$, which enhances the capacity of the projection. Inspired by the dual attention network [22], we implement the channel attention module to capture the channel dependencies between any two channel maps via a self-attention mechanism. After enhancing the feature representation, we adopt a 1×1 convolution layer to reduce the feature dimension from $C$ to $D$. Similarly, the feature map is also enhanced by a spatial attention module to model the spatial dependencies between any two positions. With the channel and spatial attention modules, we can emphasize interdependent feature maps and improve the feature representation of specific semantics.
Considering that pooling along the channel dimension can effectively highlight informative regions [23], we further perform max-pooling and average-pooling operations along the channel axis and then concatenate them to generate an attention map $M \in \mathbb{R}^{H \times W}$, which focuses on the salient pixels related to infections and suppresses unnecessary ones. As shown in Fig.2, the attention map with spatial size $H \times W$ is divided into several non-overlapping sub-regions with a stride of $s$ pixels. Within each region, the pixel with maximum localization probability, $v = \arg\max_{p} M(p)$ with $p$ ranging over the pixels of that region, is selected as a node. This process results in a set of nodes $\mathcal{V} = \{v_i\}_{i=1}^{N}$ for the feature map $X$. $N$ equals $\lceil H/s \rceil \times \lceil W/s \rceil$ and represents the number of nodes, where $\lceil \cdot \rceil$ is the ceiling operation, which gives the smallest integer equal to or larger than its input. Note that the process can be considered a sampling process and $1/s$ can be considered the sampling rate. In view of this, each node $v_i$ is represented by its corresponding image coordinates. It is worth noting that the spatial interval between nodes can be controlled by adjusting $s$. A coarser graph is constructed with larger values of $s$, which may capture longer-range interactions among nodes. In contrast, all pixels are assigned as individual nodes in the extreme case where $s = 1$. The initial feature representation of each $v_i$ is extracted from the feature map enhanced in both channel and spatial dimensions. This results in a set of node features, $V \in \mathbb{R}^{D \times N}$, where $D$ equals the feature dimension of the enhanced feature map.

Although graph reasoning explores the global context, the long-range contextual pattern differs across scales of the same image. Specifically, the finer representation with a smaller receptive field (smaller $s$) embeds more explicit context, while the coarser representation with a larger receptive field (larger $s$) explores global dependencies.
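The node selection described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `saliency_aware_projection` is hypothetical, the channel and spatial attention modules are omitted, and the two channel-pooled maps are simply averaged, which stands in for whatever learned fusion follows their concatenation.

```python
import math
import numpy as np

def saliency_aware_projection(features, stride):
    """Pick one node per stride x stride region of a (C, H, W) feature map.

    Returns (coords, node_features):
      coords        -- (N, 2) array of (row, col) positions of the selected pixels
      node_features -- (C, N) array with the feature vector of each node
    where N = ceil(H / stride) * ceil(W / stride).
    """
    c, h, w = features.shape
    # Max- and average-pooling along the channel axis. Averaging the two maps
    # is an ASSUMPTION replacing the learned step that follows concatenation.
    attention = 0.5 * (features.max(axis=0) + features.mean(axis=0))

    coords = []
    for top in range(0, h, stride):
        for left in range(0, w, stride):
            # Border regions are smaller when H or W is not a multiple of stride.
            region = attention[top:top + stride, left:left + stride]
            # Most salient pixel inside this region becomes the node.
            r, col = np.unravel_index(np.argmax(region), region.shape)
            coords.append((top + r, left + col))

    coords = np.array(coords)
    node_features = features[:, coords[:, 0], coords[:, 1]]
    return coords, node_features

# Usage: a 14x14 map sampled with stride 4 gives ceil(14/4)^2 = 16 nodes.
demo = np.random.default_rng(0).random((8, 14, 14))
demo_coords, demo_nodes = saliency_aware_projection(demo, stride=4)
print(demo_coords.shape, demo_nodes.shape)
```

With `stride=1` every pixel becomes its own node, which matches the extreme case mentioned in the text; larger strides yield fewer, more widely spaced nodes.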
Taking the multi-scale schema into consideration, we incorporate it with graph reasoning to extend the long-range contextual patterns, and thus devise the Graph-PGCR module.

Fig. 2. An illustration of the proposed saliency-aware projection (SAP) mechanism. In this example, a square attention map $M$ is sampled with stride $s = 2$. We then find the pixel with maximum localization probability in each region (shown in different colors), which is selected as the node.

As seen in Fig.1, the graph convolution begins with subsampling the convolved features into $K$ parallel pyramid levels with various scales via saliency-aware projection. A higher sampling rate (smaller $s_k$) generates finer representations, while coarser features are extracted at a lower sampling rate (larger $s_k$). After selecting infection-related pixels as a set of nodes, a lightweight fully-connected graph with adjacency matrix $A_k \in \mathbb{R}^{N_k \times N_k}$ is generated in the $k$-th separate branch for propagating information across nodes, as depicted in Fig.3. The adjacency matrix $A_k$ is defined as the similarity between nodes: the more similar the feature representations of two nodes, the stronger the connectivity between them. It can be formulated as:

$$A_k = \theta_k(V_k)^{\top} \otimes \phi_k(V_k) \tag{1}$$

where $\theta_k(\cdot)$ and $\phi_k(\cdot)$ are two learnable 1-dimensional linear transformations along the node-wise dimension, and $\otimes$ is matrix multiplication. We further apply a softmax layer to yield a normalized adjacency matrix $\tilde{A}_k$. Then we conduct the graph convolution [11] in our model as:

$$Z_k = \sigma\left(W_k \, \rho_k(V_k) \, \tilde{A}_k\right) \tag{2}$$

where $\rho_k(\cdot)$ is a learnable linear transformation, $W_k \in \mathbb{R}^{D \times D}$ is a trainable weight matrix, $\sigma(\cdot)$ is the ReLU activation function, and $Z_k$ is the output feature map after graph convolution in the $k$-th separate branch.

To provide complementary features for the downstream task, the last step is to map the relation-aware features (in $\mathbb{R}^{D \times N_k}$) generated by the $k$-th separate branch back to the coordinate space ($\mathbb{R}^{C \times H \times W}$), which is compatible with regular CNNs.
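One branch of the graph reasoning step can be sketched as follows. This is a hedged NumPy illustration rather than the authors' code: the function name `graph_reasoning`, the use of plain weight matrices for the learnable transformations, the order of the matrix products (chosen so that node features of shape (D, N) keep that shape), and the softmax axis are all assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def graph_reasoning(nodes, w_theta, w_phi, w_rho, w_out):
    """Fully-connected graph convolution over node features.

    nodes   -- (D, N) node features, one column per node
    w_theta, w_phi, w_rho, w_out -- (D, D) matrices standing in for the
        learnable linear transformations (ASSUMED parameterization)
    Returns (out, adjacency) with shapes (D, N) and (N, N).
    """
    theta = w_theta @ nodes                      # (D, N)
    phi = w_phi @ nodes                          # (D, N)
    # Pairwise similarity between nodes, then softmax normalization.
    # Normalizing each column (axis=0) is an ASSUMPTION: it makes every
    # output node a weighted average over all input nodes.
    adjacency = softmax(theta.T @ phi, axis=0)   # (N, N)
    aggregated = (w_rho @ nodes) @ adjacency     # propagate along the graph
    out = np.maximum(w_out @ aggregated, 0.0)    # weight matrix + ReLU
    return out, adjacency

# Usage: 49 nodes (a 14x14 map with stride 2) with 64-dimensional features.
rng = np.random.default_rng(0)
D, N = 64, 49
demo_nodes = rng.standard_normal((D, N))
demo_weights = [0.1 * rng.standard_normal((D, D)) for _ in range(4)]
demo_out, demo_adj = graph_reasoning(demo_nodes, *demo_weights)
print(demo_out.shape, demo_adj.shape)
```

Because the graph is fully connected, the cost of the adjacency matrix grows quadratically with the number of nodes, which is one practical reason to reduce the node count by sampling before reasoning.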
To achieve the dimension transformation, we reshape the relation-aware $Z_k \in \mathbb{R}^{D \times N_k}$ into $Z_k \in \mathbb{R}^{D \times \lceil H/s_k \rceil \times \lceil W/s_k \rceil}$. Then, a simple but effective upsampling operation is adopted as the reprojection function. In practice, bilinear interpolation is performed to resize $\lceil H/s_k \rceil \times \lceil W/s_k \rceil$ to the original spatial input size $H \times W$. To maintain the original information, we further use a multi-scale fusion to fuse the reshaped relation-aware features from each scale with the original feature map $X$ in a learnable way, so that the result carries both local and global context information. The multi-scale fusion process can be formulated as:

$$Y = \mathcal{F}\left(\left[X, \mathcal{U}(Z_1), \ldots, \mathcal{U}(Z_K)\right]\right) \tag{3}$$

where $\mathcal{F}(\cdot)$ realizes the feature aggregation mechanism with a 1×1 convolution followed by batch normalization and a ReLU activation function, $\mathcal{U}(\cdot)$ indicates the upsampling operation, and $[\cdot]$ represents concatenation. As a result, we have the feature $Y$ with channel dimension $C$.

The method was evaluated on two datasets: a public dataset and our private COVID-19 dataset.

(i) The public COVID-19 dataset [24]: It contains 20 COVID-19 CT scans from the Coronacases Initiative and Radiopaedia, which were manually annotated for the left lung, right lung and COVID-19 infection. In the experiment, we trained our models using the 16 volumes with 2-fold cross-validation and averaged the experiment results as the final performance.

(ii) The private COVID-19 dataset: We collected 102 COVID-19 CT scans (from the Department of Radiology, The First Affiliated Hospital, College of Medicine, Zhejiang University), with ethics approval. The left lung, right lung, and infection were annotated by two radiologists with five years of experience in chest radiology. For our study, 82 scans were randomly selected for training and the other 20 scans for testing. After 2-fold cross-validation, we averaged the experiment results as the final performance.

The input image consisted of three slices, namely the slice to be segmented and the upper and lower slices, and was cropped to 224×224×3.
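The three-slice input described above can be assembled as in the sketch below. It is only an illustration under assumptions the text does not state: the function name `make_three_slice_input` is hypothetical, the crop is taken from the center, the channel order is (previous, current, next), and a boundary slice reuses itself as its missing neighbour.

```python
import numpy as np

def make_three_slice_input(volume, index, crop=224):
    """Stack a slice with its two neighbours and crop to (crop, crop, 3).

    volume -- (S, H, W) CT volume with H >= crop and W >= crop
    index  -- index of the slice to be segmented
    """
    num_slices, h, w = volume.shape
    # ASSUMPTION: at the first/last slice the missing neighbour is replaced
    # by the slice itself.
    prev_idx = max(index - 1, 0)
    next_idx = min(index + 1, num_slices - 1)
    stacked = np.stack(
        [volume[prev_idx], volume[index], volume[next_idx]], axis=-1
    )  # (H, W, 3)
    # ASSUMPTION: a center crop; the text only says the input "was cropped".
    top = (h - crop) // 2
    left = (w - crop) // 2
    return stacked[top:top + crop, left:left + crop, :]

# Usage on a synthetic volume of five 256x256 slices.
demo_volume = np.random.default_rng(0).random((5, 256, 256))
demo_input = make_three_slice_input(demo_volume, index=2)
print(demo_input.shape)
```

Feeding neighbouring slices as channels gives a 2D network some through-plane context without the memory cost of a fully 3D model.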
The networks were optimized using stochastic gradient descent with a learning rate of 1e-3 and a weight decay of 5e-4. To effectively model the global contextual information, our Graph-PGCR module was appended at the end of the encoder, as seen in Fig.1. After feature extraction, the input feature map had a size of 1024×14×14. We simply set the node feature dimension $D = 64$ in our implementation. The Dice coefficient was employed as our principal performance metric for each case.

This section examines the effect of the key components of the Graph-PGCR module on the public COVID-19 dataset for lung infection segmentation. The comparison includes the original UNet as a baseline, the Dual Attention (DA) module [22], and the proposed Graph-PGCR with different projection mechanisms and multiple scales with different $s$. Table 1 shows the segmentation performance when gradually adding components to the UNet. As seen, each component of the Graph-PGCR module contributes to the performance. Overall, the proposed Graph-PGCR module yields a total improvement of 3.88% over the baseline UNet.

A series of UNet-based variants are adopted to further examine the effectiveness of the proposed Graph-PGCR module, including UNet++ [7], UNet-Inf [8] and UNet 3+ [25]. In addition to the Dual Attention (DA) module, we further compare our method with the state-of-the-art graph context reasoning module (i.e., GloRe) [13]. The hyper-parameters of the graph, e.g., the number of nodes and their feature dimension, are set based on [13]. It is worth noting that these modules are appended at the same place as our proposed module.

(i) Quantitative comparison: Table 2 shows the comparison results on the public and private datasets, from which we make the following observations. First, the proposed Graph-PGCR module ($s = 2$) improves the performance over the baselines under different segmentation networks. Moreover, our proposed Graph-PGCR module ($s = 2$) outperforms the GloRe module.
Additionally, the Graph-PGCR module ($s = 2, 4, 7$) with multiple GCR branches achieves the best performance in all four architectures, obtaining average improvements of 3.0 and 2.5 points across the four backbones on the two datasets.

(ii) Qualitative comparison: Fig.4 visualizes the segmentation results of different plug-in modules based on the UNet 3+ network on our private dataset, including the DA module, the GloRe unit, and our proposed Graph-PGCR module ($s = 2, 4, 7$). The results illustrate how effective our proposed Graph-PGCR module is at segmenting irregular and even small infections. Specifically, it generates segmentation results that are close to the ground truth with far fewer mis-segmented infections. The success of the Graph-PGCR module is attributed to its ability to capture multi-scale long-range dependencies.

In this paper, we develop an effective GCN-based approach, termed the Graph-based Pyramid Global Context Reasoning (Graph-PGCR) module, to model multi-scale long-range contextual relationships, which are critical for COVID-19 lung infection segmentation. With the saliency-aware projection that selects infection-related pixels as graph nodes, a fully-connected graph is constructed in which global contextual information is propagated across all nodes via graph convolution. The multi-scale schema is also adopted to explore distinct contextual patterns from multiple graphs. Experiments show that the proposed Graph-PGCR module can effectively capture global contextual dependencies in COVID-19 CT images and consistently improves over four strong baselines on the lung infection segmentation task.
[1] World Health Organization
[2] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19
[3] A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images
[4] A rapid, accurate and machine-agnostic segmentation and quantification method for CT-based COVID-19 diagnosis
[5] Residual Attention U-Net for Automated Multi-Class Segmentation of COVID-19 Chest CT Images
[6] U-Net: Convolutional Networks for Biomedical Image Segmentation
[7] Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support
[8] Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images
[9] Relational modeling for robust and efficient pulmonary lobe segmentation in CT scans
[10] Diagnosis and treatment protocol for novel coronavirus pneumonia (Trial Version 6)
[11] Semi-supervised classification with graph convolutional networks
[12] Beyond grids: Learning graph representations for visual recognition
[13] Graph-based global reasoning networks
[14] Pedestrian attribute recognition by joint visual-semantic reasoning and knowledge distillation
[15] Referring Image Segmentation via Cross-Modal Progressive Comprehension
[16] Dual graph convolutional network for semantic segmentation
[17] Deep vessel segmentation by learning graphical connectivity
[18] Edge-aware Graph Representation Learning and Reasoning for Face Parsing
[19] GINet: Graph Interaction Network for Scene Parsing
[20] Multiple-human parsing in the wild
[21] Pyramid scene parsing network
[22] Dual attention network for scene segmentation
[23] CBAM: Convolutional block attention module
[24] Towards Efficient COVID-19 CT Annotation: A Benchmark for Lung and Infection Segmentation
[25] UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation