key: cord-0215957-gajahm5p authors: Zhang, Shichuan; Zhu, Chenglu; Li, Honglin; Cai, Jiatong; Yang, Lin title: Weakly Supervised Learning for cell recognition in immunohistochemical cytoplasm staining images date: 2022-02-27 journal: nan DOI: nan sha: 50b39552842b9b58051910b2dd1ce061f54adbf9 doc_id: 215957 cord_uid: gajahm5p Cell classification and counting in immunohistochemical cytoplasm staining images play a pivotal role in cancer diagnosis. Weakly supervised learning is a potential method to deal with labor-intensive labeling. However, the inconstant cell morphology and subtle differences between classes also bring challenges. To this end, we present a novel cell recognition framework based on multi-task learning, which utilizes two additional auxiliary tasks to guide robust representation learning of the main task. To deal with misclassification, the tissue prior learning branch is introduced to capture the spatial representation of tumor cells without additional tissue annotation. Moreover, dynamic masks and consistency learning are adopted to learn the invariance of cell scale and shape. We have evaluated our framework on immunohistochemical cytoplasm staining images, and the results demonstrate that our method outperforms recent cell recognition approaches. Besides, we have also done some ablation studies to show significant improvements after adding the auxiliary branches. Immunohistochemical (IHC) staining is a universal protocol to signalize specific tumor cells. The cancer progress can be evaluated by quantitating the response of different cells to antigens, which promotes making prognostic observation and treatment decisions. Recent automatic cell recognition (classification and counting) algorithms [1, 2] based on nucleus staining reduce the labor-intensive counting and improve the efficiency of pathologists. 
Some of the specific IHC antibody reactions occur in the cytoplasm [3], which challenges current IHC analysis models because cell boundaries are unclear and nuclei are easily confused. Therefore, a general deep-learning-based analysis method for IHC cytoplasm staining images is needed, which can help to confirm the type of cancer, the stage of cancer, and potential treatment options in clinical diagnosis [4]. Recently, many nuclear recognition models with a regression loss have shown superior performance in IHC image analysis; they apply weakly supervised learning using Gaussian density maps generated from point annotations [5]. However, the cytoplasmic staining pattern does not present uniform recognition objects, and its stained images differ from nuclear-stained images, in which the stainable antibodies lie in a circular or elliptical nucleus. After cytoplasmic staining, cells may take on various shapes and neighboring cells may appear as indistinguishable masses. Consequently, the general regression models for nuclear recognition may not be appropriate in this scenario. Besides, a pixel classification method [6] for cell recognition improves model performance by multi-task scheduling, which uses a repel code, a Voronoi diagram and clustering algorithms to strengthen the supervision information of point annotations. Nevertheless, cells touch and overlap each other under cytoplasmic staining patterns, which is not conducive to generating pseudo masks with such clustering algorithms [1]. In addition, subtle differences between tissue cells and positive-tumor cells are inevitable in IHC images, which increases the difficulty of categorization. All these problems bring new challenges for cell recognition in cytoplasm staining images. The pseudo masks generated from point annotations are used for model regression.
Due to the limitation of the Gaussian field, the model may easily overfit to a fixed size and an approximately round shape, and this problem may become more serious in the cytoplasmic staining pattern. Fortunately, consistency learning is an effective regularization strategy to improve the usability of rough pseudo masks. In semi-supervised semantic segmentation tasks [7, 8], it minimizes the discrepancy between the outputs of the auxiliary decoding branches through a consistency loss, which promotes representation learning from uncertain masks. Moreover, a tissue prior can guide cell recognition through an auxiliary task [9], which is constructed to learn the spatial distribution of tumors with extra annotations. A traditional multi-task learning framework whose auxiliary tasks learn from different types of pseudo masks has been introduced to deal with nuclear recognition. So well-designed auxiliary tasks can effectively improve the

Inspired by explicit tissue prior and consistency learning, we present a cell recognition framework for the quantification of IHC cytoplasmic staining images by introducing three decoder branches: a main decoder, a dynamic decoder and a prior decoder. Since the tissue prior is generated by an additional model trained with early stopping, the encoder and one auxiliary branch (the prior decoder) can be constrained by spatial distribution information without extra tissue annotations. In addition, the other auxiliary branch (the dynamic decoder) and the encoder are guided by dynamic masks with stochastic disturbance. Last, the consistency loss between the outputs of the main decoder and the dynamic decoder is minimized to learn the invariance of shape and scale.

The proposed framework is shown in Fig. 1. It contains three parts: an encoder, three decoders and a pre-trained model. The decoders form three branches: the main branch, the consistency learning branch and the tissue prior learning branch.
In the whole framework, only the encoder and the main decoder are used in both training and testing, while the other two auxiliary branches provide more information to the encoder during training. The forward process consists of the encoder, which downsamples the input image to a deep feature, and the decoders, which upsample the feature to the original size.

Semantic segmentation based on an encoder-decoder structure is used for cell counting and classification. Thus, a fixed-size circle mark is generated for each cell during the training of the main decoder. The estimated shape and size are chosen to approximate the ground truth mask for cell segmentation and classification. There are two proximity masks $\{G^l_{i,j} \mid G^l_{i,j} \in \{0, 1\},\ 0 < i < m,\ 0 < j < n\}$ for each input image of size $m \times n$: one for positive-tumor cells and the other for the remaining cells. The main decoder outputs two corresponding probability maps $\{P^l_{i,j} \mid 0 < P^l_{i,j} < 1,\ 0 < i < m,\ 0 < j < n\}$, where $l \in \{0, 1\}$ represents the category of cells and $(i, j)$ are the pixel coordinates. A value $P^l_{i,j}$ in the output map is the probability that the corresponding input pixel belongs to class $l$. Thus we use the cross-entropy loss for pixel classification, as shown in Eq. 1. Taking into account the imbalance between the number of object pixels and the number of background pixels, we also apply the Intersection over Union (IOU) loss [10] for the training of the main decoder. The IOU loss, shown in Eq. 2, is sensitive to this imbalance. Eq. 3 shows the total loss $L_m$ for the training of the main decoder and the encoder, where $\alpha$ is a hyperparameter whose value is 0.8 in our experiments.

Not all cells are approximately round, especially in IHC images with cytoplasmic staining. The cells usually have different sizes and shapes, even within the same category. So the encoder and the main decoder trained with $L_m$ are prone to overfitting and missed detections.
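The main-decoder objective described above can be illustrated with a short NumPy sketch. Eqs. 1 to 3 are not reproduced in this text, so the exact forms here are assumptions: a pixel-wise binary cross-entropy, a soft `1 - IoU` term, and a convex combination weighted by `alpha`. The function names and the `(2, m, n)` array layout are hypothetical.

```python
import numpy as np

def cross_entropy_loss(prob, mask, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over classes and pixels."""
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(mask * np.log(p) + (1.0 - mask) * np.log(1.0 - p)))

def iou_loss(prob, mask, eps=1e-7):
    """Soft IoU loss (1 - intersection / union), averaged over classes.

    ASSUMPTION: the source only cites an IoU loss; this soft form is one
    common choice and may differ from the authors' Eq. 2.
    """
    inter = np.sum(prob * mask, axis=(1, 2))
    union = np.sum(prob + mask - prob * mask, axis=(1, 2))
    return float(np.mean(1.0 - (inter + eps) / (union + eps)))

def main_loss(prob, mask, alpha=0.8):
    """Combined loss for the encoder and main decoder.

    ASSUMPTION: a convex combination of the two terms. The source states
    only that alpha = 0.8, not how alpha enters Eq. 3.
    """
    return alpha * cross_entropy_loss(prob, mask) + (1.0 - alpha) * iou_loss(prob, mask)

# Toy example: two class maps (positive-tumor, other) on an 8 x 8 image.
mask = np.zeros((2, 8, 8))
mask[0, 2:5, 2:5] = 1.0              # one positive-tumor mark
prob = np.full((2, 8, 8), 0.1)
prob[0, 2:5, 2:5] = 0.9              # a reasonably good prediction
loss_value = main_loss(prob, mask)
```

Because the IoU term is normalized by the union of prediction and mask, it does not shrink simply because background pixels dominate the image, which is the usual motivation for pairing it with cross-entropy.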
To make the model insensitive to inaccurate shapes and sizes when extracting features, we design an auxiliary branch (the dynamic decoder). In each training iteration, new estimated masks, in which the cell marks are random polygons of random size, are generated for the training of the encoder and the dynamic decoder by Eq. 4. Then we minimize the difference between the output probability maps of the dynamic decoder and the main decoder with the consistency loss shown in Eq. 5, where $\tilde{G}^l_{i,j}$ denotes the generated dynamic masks, in which the proximity marks of cells have distinct shapes and sizes in different iterations, and $\tilde{P}^l_{i,j}$ is the output probability map of the dynamic decoder. Through consistency learning, the lower layers of the encoder focus on features that are conducive to cell identification beyond size and shape. It is an effective regularization strategy for the encoder, which reduces the risk of overfitting the proximity masks.

In IHC images, cells of the same kind are generally distributed in clusters. In other words, tumor cells are usually adjacent to tumor cells, and vice versa. This tissue prior helps to classify different types of cells with similar appearance. In practice, pathologists identify tumor cells based not only on cell appearance, such as size, shape and intensity, but also on the spatial distribution of cells. Thus we design another auxiliary branch (the prior decoder). It has the same layers as the dynamic decoder and the main decoder. Positive-tumor cells are generally arranged tightly and touch each other, so the union of the rough areas of the individual positive-tumor cells forms the proximity mask for the tumor area. The ground truth mask for the training of the prior decoder and the encoder comes from the output of a pre-trained model. The pre-trained model has the same architecture as the combination of the encoder and the main decoder and is trained with the circle masks.
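The two kinds of proximity masks mentioned above (fixed circle marks and dynamic marks with random shape and size) can be sketched from point annotations as follows. This is a minimal illustration under stated assumptions, not the authors' Eq. 4: the radius values, vertex counts and function names are hypothetical, and the dynamic mark is built as a star-shaped region whose boundary radius is linearly interpolated in angle between random vertices, which only approximates a random polygon.

```python
import numpy as np

def circle_mask(points, shape, radius=6):
    """Fixed-size circle mark around every annotated (row, col) point."""
    mask = np.zeros(shape, dtype=np.uint8)
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    for r, c in points:
        mask[(rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2] = 1
    return mask

def dynamic_mask(points, shape, rng, r_range=(4.0, 9.0), n_vertices=(5, 9)):
    """Randomly shaped, randomly sized mark around every annotated point.

    ASSUMPTION: each mark is a star-shaped blob defined by random vertices
    in polar coordinates; the source does not specify the sampling scheme.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    for r, c in points:
        k = rng.integers(n_vertices[0], n_vertices[1] + 1)   # vertex count
        angles = np.sort(rng.uniform(0.0, 2.0 * np.pi, k))   # vertex angles
        radii = rng.uniform(r_range[0], r_range[1], k)       # vertex radii
        theta = np.arctan2(rows - r, cols - c) % (2.0 * np.pi)
        dist = np.hypot(rows - r, cols - c)
        # Boundary radius at each pixel's angle, with periodic interpolation.
        boundary = np.interp(theta.ravel(), angles, radii,
                             period=2.0 * np.pi).reshape(shape)
        mask[dist <= boundary] = 1
    return mask

rng = np.random.default_rng(0)
points = [(20, 20), (40, 45)]                   # hypothetical point annotations
fixed = circle_mask(points, (64, 64))           # same mask in every iteration
dynamic = dynamic_mask(points, (64, 64), rng)   # changes on every call
```

Calling `dynamic_mask` once per training iteration yields a different mark for the same annotation each time, which is what lets a consistency term discourage the encoder from memorizing one particular shape or size.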
In the circle masks, the mark of each cell is a circle with a specified radius, resulting in inaccurate labeling of cell boundaries. Inspired by robust learning for classification with label noise [11], an early-stopped training process with $L_{ce}$ is used for the pre-trained model. Its output shows the rough tissue prior, as shown in Fig. 1. The training loss for the prior decoder is shown in Eq. 6, where $\hat{G}^l_{i,j}$ is the ground truth mask for the training of the prior decoder and the encoder, taken from the output of the pre-trained model, and $\hat{P}^l_{i,j}$ is the output of the prior decoder. Only the positive-tumor prior is used in this branch because other types of tissue cells are not in a state of aggregation.

All three decoders, corresponding to the three tasks, share the same encoder in the lower layers. The two auxiliary tasks make the encoder attend to their own objectives, which are useful for the main task: cell classification and counting. Therefore, the encoder benefits from the two auxiliary tasks in representation learning.

We collected 80 images with IHC cytoplasmic staining, cropped as the heat map from whole slide images under ×40 magnification, which cover four types of cancer and their corresponding stained antibodies. All cells were separated into two categories by two pathologists: positive tumor cells (dark or weak brownish stains) and the other cells without IHC reaction. We split the dataset into a training set (56 images with 15,309 annotated positive tumor cells and 37,277 other cells) and a testing set (24 images with 6,053 annotated positive tumor cells and 8,821 other cells).

In the whole framework shown in Fig. 1, the encoder and the main decoder follow the structure of the commonly used segmentation model DeepLabv3+ [12] with a ResNet-101 backbone, which has fewer parameters than U-Net [13]. The dynamic decoder and the prior decoder in the proposed framework have the same layers as the main decoder.
We first train the pre-trained model with early stopping at 10 epochs to get the rough cluster prior for tumor cells. We then use the total loss computed from the three branches to train the model, where $\lambda_c$, $\lambda_p$ and $\lambda_d$ are the weights for $L_c$, $L_p$ and $L_d$, respectively. In the training stage, Adam serves as the optimizer with the following settings: the momentum is 0.9, the weight decay is $2 \times 10^{-5}$ and the initial learning rate is 0.001. The hyperparameters $\lambda_c$, $\lambda_p$ and $\lambda_d$ are set to 0.5, 0.5 and 1. In the testing stage, we only adopt the encoder and the main decoder to recognize cells, which output two probability maps as pixel-level predictions for the two categories through a sigmoid function. Finally, the number of cells in each category is calculated by picking the local peaks in the probability maps [2].

We use meanF1 and totalF1 to evaluate the performance of the proposed multi-task learning framework, where totalF1 is computed from the recall and precision over all cells in the testing set, and meanF1 is the average of the F1 scores of the individual images in the testing set. Therefore, meanF1 can effectively reflect the performance on each testing image, especially when the number of positive-tumor cells is relatively small. The related works [1, 5] are compared with ours under identical experimental conditions. The results are shown in Table 1, where (P) and (N) indicate positive tumor cells and the other cells, respectively. In addition, we conduct an ablation experiment to validate the effect of the proposed improvements. In Table 1, Ours is the model with the encoder and only the main decoder. On the basis of Ours, Ours+ has one auxiliary branch (the dynamic decoder) and Ours++ has two auxiliary branches (the dynamic decoder and the prior decoder). By adding the prior decoder, the whole framework improves the performance on cell classification, as shown in the first row of Fig. 2.
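The difference between totalF1 and meanF1 defined above can be made concrete with a small sketch. It assumes that each test image has already been reduced to counts of true positives, false positives and false negatives for one cell category. The function names, the toy counts, and the convention of scoring an image with no cells and no predictions as 0 are assumptions for illustration only.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when nothing is annotated or predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def total_f1(counts):
    """Pool the counts of all images first, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_score(tp, fp, fn)

def mean_f1(counts):
    """Compute F1 per image, then average over images."""
    return sum(f1_score(*c) for c in counts) / len(counts)

# Hypothetical (tp, fp, fn) counts for two test images:
# a dense image and an image with a single positive-tumor cell.
counts = [(90, 10, 10), (1, 3, 0)]
pooled = total_f1(counts)    # dominated by the dense image
averaged = mean_f1(counts)   # each image weighs the same
```

With these toy counts, the pooled score is about 0.89 while the per-image average is 0.65, because the three false positives in the sparse image barely affect the pooled counts but strongly lower that image's own F1. This is why meanF1 is the more informative of the two when positive-tumor cells are scarce in some images.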
In the second row, the model Ours++ effectively suppresses the missed detection of positive-tumor cells compared with Ours. Thus, both auxiliary branches help the representation learning of the encoder. Considering that post-processing affects the statistical results, we tried different radii r. We regard the output as a successful prediction if the distance between the center of a predicted cell and the point annotation is less than r. The radius r shown in Fig. 3 is measured in pixels. Our method has clear advantages at all reasonable radii.

Fig. 3. The vertical axes of the two plots are totalF1 and meanF1 of positive-tumor (P) cells, respectively, and the horizontal axis is the radius r, which is the threshold on the distance between the predicted and the annotated location.

In this paper, we propose a cell recognition model based on multi-task learning. The main task is cell classification and counting based on cell segmentation with point annotations. We append two auxiliary tasks to provide more information for the encoder. The first auxiliary task (the prior decoder) learns the tissue prior information to provide a spatial distribution basis, apart from shape, size and color, for cell classification. The tissue prior is obtained from another pre-trained model instead of manual labeling. The second auxiliary task (the dynamic decoder) is cell segmentation based on dynamic masks generated from point annotations. Last, the consistency loss between the outputs of the dynamic decoder and the main decoder is minimized. Compared with related works, the proposed model is more suitable for cell recognition in IHC cytoplasm staining images.
A limitation is that this

[1] Weakly supervised deep nuclei segmentation using partial points annotation in histopathology images
[2] Efficient and robust cell detection: A structured regression approach
[3] Napsin A expression in human tumors and normal tissues
[4] mTOR/p70S6K signal transduction pathway contributes to osteosarcoma progression and patients' prognosis
[5] Microscopy cell counting and detection with fully convolutional regression networks
[6] Weakly supervised multi-task learning for cell detection and segmentation
[7] Dual-consistency semi-supervised learning with uncertainty quantification for COVID-19 lesion segmentation from CT images
[8] Semi-supervised semantic segmentation with cross-consistency training
[9] Pixel-to-pixel learning with weak supervision for single-stage nucleus recognition in Ki67 images
[10] UnitBox: An advanced object detection network
[11] Unsupervised label noise modeling and loss correction
[12] Encoder-decoder with atrous separable convolution for semantic image segmentation
[13] U-Net: Convolutional networks for biomedical image segmentation