title: Dual-Branch Network with Dual-Sampling Modulated Dice Loss for Hard Exudate Segmentation from Colour Fundus Images
authors: Liu, Qing; Liu, Haotian; Liang, Yixiong
date: 2020-12-03

Automated segmentation of hard exudates in colour fundus images is a challenging task due to two issues: extreme class imbalance and enormous size variation. This paper aims to tackle these issues and proposes a dual-branch network with a dual-sampling modulated Dice loss. The network consists of two branches, a large hard exudate biased learning branch and a small hard exudate biased learning branch, each responsible for its own duty. Furthermore, we propose a dual-sampling modulated Dice loss for training, such that the dual-branch network is able to segment hard exudates of different sizes. In detail, for the first branch, we use a uniform sampler to sample pixels from the predicted segmentation mask for the Dice loss calculation; this naturally biases the branch in favour of large hard exudates, as Dice loss incurs a larger cost for misidentifying large hard exudates than small ones. For the second branch, we use a re-balanced sampler to oversample hard exudate pixels and undersample background pixels for the loss calculation. In this way, the cost of misidentifying small hard exudates is enlarged, which forces the parameters of the second branch to fit small hard exudates well. Considering that large hard exudates are much easier to identify correctly than small ones, we propose an easy-to-difficult learning strategy that adaptively modulates the losses of the two branches. We evaluate our proposed method on two public datasets and the results demonstrate that it achieves state-of-the-art performance.

Hard exudate is one of the most significant manifestations of diabetic retinopathy (DR) (Klein et al. 1987). Automated and accurate segmentation of hard exudates in colour fundus images has several potential clinical applications, such as large-scale automated DR screening, computer-aided diagnosis and DR severity assessment (Sasaki et al. 2013). Hard exudate segmentation can be formulated as a dense classification problem. In the era of deep learning, the natural first choice is a fully convolutional network (FCN): the goal is to optimise the parameters of a designed FCN to best fit the exudate ground truth by minimising a specified loss function. However, achieving this goal is challenging due to two issues:

Table 1: Class distribution imbalance and exudate region size variation in the exudate segmentation datasets DDR (Li et al. 2019) and IDRiD (Porwal et al. 2018). Ratio_neg/pos denotes the ratio of background pixels to hard exudate pixels. Size_large and Size_small are the relative sizes (with respect to the whole image) of the top 10% largest and top 10% smallest exudate regions in the dataset.

Dataset   Ratio_neg/pos   Size_large / Size_small
DDR       512             9.7 × 10⁻⁵ / 2.0 × 10⁻⁶
IDRiD     110             1.1 × 10⁻⁴ / 5.0 × 10⁻⁶

• Extreme class imbalance. To illustrate how extreme the class imbalance is, we count the ratio of negative samples (i.e. background pixels) to positive samples (i.e. hard exudate pixels) in two public datasets for hard exudate segmentation, DDR (Li et al. 2019) and IDRiD (Porwal et al. 2018) (see Table 1). The ratio is 512 in DDR and 110 in IDRiD.
With such extremely class-imbalanced data, how to design the loss function and train the segmentation model so as to alleviate the bias towards the majority class becomes critical.
• Enormous variation in size across connected components of hard exudate regions. Most connected hard exudate regions are small. In particular, we calculate the relative area of connected hard exudate regions and find that almost 90% of hard exudate pixels belong to connected regions whose relative area with respect to the whole fundus image is less than 9.7 × 10⁻⁵ in DDR (Li et al. 2019) and 1.1 × 10⁻⁴ in IDRiD (Porwal et al. 2018). More seriously, 10% of hard exudate pixels belong to connected regions with relative areas less than 2.0 × 10⁻⁶ in DDR and 5.0 × 10⁻⁶ in IDRiD. The size ratios between the largest 10% and smallest 10% of regions in DDR and IDRiD are thus almost 48 and 22 respectively (a sketch of how such statistics can be computed is given below). This variation in size, which the FCN model needs to handle, is enormous and raises a huge challenge for representation and classifier learning.
An intuitive approach to hard exudate segmentation is to fine-tune semantic segmentation networks, such as HED (Xie and Tu 2015), PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and Deeplabv3+ (Chen et al. 2018), which were originally designed for dense classification tasks on natural scene images. These methods handle the class imbalance issue by using a class-balanced cross-entropy (CBCE) loss rather than the traditional cross-entropy loss. Inspired by HED (Xie and Tu 2015), Guo et al. (Guo et al. 2019a) propose a variant of CBCE loss called bin loss and fine-tune the parameters of HED for hard exudate segmentation. Bin loss (Guo et al. 2019a, 2020) considers not only the class imbalance problem but also how hard background pixels are to classify correctly. FCNs trained with CBCE loss successfully avoid the background bias. However, directly up-weighting the loss of the minority class and down-weighting the loss of the majority class according to the inverse class frequency is too rough, and it makes the segmentation model biased in favour of exudate pixels. Taking PSPNet (Zhao et al. 2017) as an example, when trained with CBCE loss the model always wrongly identifies confusing structures and background regions around hard exudates, as shown in Fig. 1.
An alternative strategy is to train FCNs with Dice loss (Milletari, Navab, and Ahmadi 2016), a regional loss which measures the overlap error between the prediction and the ground truth. It works better than CBCE loss when the class imbalance is serious. However, because the Dice-loss costs on small exudate regions are slight compared with those on large hard exudate regions, FCNs trained with Dice loss are biased towards large hard exudate regions, which results in misidentification of small hard exudates. The large variation in size across connected components of hard exudate makes this bias even more serious. Fig. 1(c) and (g) show the results of PSPNet (Zhao et al. 2017) trained with Dice loss, which tends to misclassify small hard exudate regions as background.
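For concreteness, the statistics in Table 1 can be reproduced from the binary ground-truth masks with standard tooling. The following is a minimal sketch, not the authors' code: the mask file layout is hypothetical, and the paper's 90%/10% figures are taken over exudate pixels, so each region would need to be weighted by its area to match them exactly; plain region-count percentiles are used here for brevity.

```python
import glob

import numpy as np
from PIL import Image
from scipy import ndimage

n_pos, n_neg, rel_areas = 0, 0, []
for path in glob.glob("DDR/train/label/EX/*.png"):  # hypothetical layout
    mask = np.asarray(Image.open(path).convert("L")) > 0
    n_pos += int(mask.sum())
    n_neg += int((~mask).sum())
    # Label connected exudate regions and record their relative areas.
    labelled, num = ndimage.label(mask)
    if num:
        sizes = ndimage.sum(mask, labelled, index=range(1, num + 1))
        rel_areas.extend(np.asarray(sizes) / mask.size)

rel_areas = np.sort(np.asarray(rel_areas))
print("Ratio_neg/pos:", n_neg / n_pos)
print("10th percentile of region size:", rel_areas[int(0.1 * len(rel_areas))])
print("90th percentile of region size:", rel_areas[int(0.9 * len(rel_areas))])
```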
In this paper, we propose a dual-branch network with a dual-sampling modulated (DSM) Dice loss to take care of both large and small connected hard exudate regions. As shown in Fig. 2, our dual-branch network consists of two branches, a large hard exudate biased segmentation branch and a small hard exudate biased segmentation branch, and it is trained with the DSM Dice loss. Each branch performs its own duty of representation and classifier learning for hard exudates of a particular size: the large hard exudate biased segmentation branch learns a segmentation model biased towards large hard exudates, while the small hard exudate biased segmentation branch is biased towards small hard exudates. The bias of each branch is achieved by the proposed DSM loss. For the large hard exudate biased segmentation branch, Dice loss with a uniform pixel sampler is used. For the small hard exudate biased segmentation branch, a re-balanced pixel sampler is used to oversample hard exudate pixels and undersample background pixels. Hard exudate pixels are thus sampled multiple times, which increases the penalty on misidentification of small hard exudate regions and thereby compensates the large hard exudate biased segmentation branch well. The biases of the two branches are shifted by a modulation weight that varies with the training epoch. We evaluate the effectiveness of the proposed dual-branch network on DDR (Li et al. 2019) and IDRiD (Porwal et al. 2018); the results show that it outperforms existing hard exudate segmentation methods. Furthermore, to demonstrate the generality of the underlying idea, we combine it with several dense classification networks. The results show that dual-branch networks trained with our proposed dual-sampling modulated Dice loss achieve performance superior to single-branch networks trained with Dice loss. In summary, the contributions of this paper are as follows:
• We propose a novel framework named dual-branch network to handle the issues of extreme class imbalance and enormous variation in size existing in the task of automated hard exudate segmentation from colour fundus images.
• We propose a dual-sampling modulated Dice loss to guide the learning of the dual-branch network. It implements an easy-to-difficult learning strategy that adaptively modulates the losses of the two branches, such that the dual-branch network gradually shifts its attention from the easy task of large hard exudate segmentation to the hard task of small hard exudate segmentation.
• We conduct extensive experiments on two public datasets, DDR (Li et al. 2019) and IDRiD (Porwal et al. 2018), and demonstrate that the dual-branch network achieves state-of-the-art performance on hard exudate segmentation.
Unsupervised Hard Exudate Segmentation Methods. Earlier methods such as (Walter et al. 2002; Sopharak et al. 2008; Ravishankar, Jain, and Mittal 2009; Welfer, Scharcanski, and Marinho 2010) adopt morphological operations to enhance exudates and then use simple thresholding to separate exudates from the background. Similarly, Pereira et al. (Pereira, Gonçalves, and Ferreira 2015) propose median filtering and normalisation on the green plane of the fundus image for enhancement. J. Kaur and D. Mittal (Kaur and Mittal 2018) first remove the vessels and the optic disc, then enhance the image by adaptive image quantisation and normalisation; finally, adaptive thresholding is used to identify exudates. These methods are simple and do not need expensive annotations by ophthalmologists, but they often fail on confusing structures that have high contrast to the background.
Coarse-to-fine Supervised Hard Exudate Segmentation Methods. These methods are data-driven and require expert annotations.
They commonly involve two stages: (1) a coarse detection stage that extracts candidates and (2) a fine segmentation stage that finally determines whether each candidate is a hard exudate region. For example, in (Zhang et al. 2014; Wang et al. 2020), candidates are first extracted by mathematical morphology and a random forest is then trained for classification. Rather than learning to decide whether a candidate is a hard exudate, other researchers focus on high-quality candidate extraction via learning. For example, Liu et al. (Liu et al. 2017) first learn to extract multiscale hard exudate candidate patches, which removes numerous background regions, and then identify hard exudate regions from the candidate patches according to characteristics such as intensity contrast to the background and area. Kusakunniran et al. (Kusakunniran et al. 2018) first learn a multilayer perceptron to detect candidate hard exudate seeds; with the clusters of initial seeds, iterative graph cut is used for segmentation. Additionally, Parham et al. propose to distinguish exudate patches from non-exudate patches by either training a lightweight deep network on candidate patches or training a support vector machine on features extracted by a pre-trained ResNet-50 (He et al. 2016). However, how to extract hard exudate candidate patches from whole images during the testing phase remains unsolved.
End-to-end Hard Exudate Segmentation Methods. Recent hard exudate segmentation methods train an FCN with a loss function in an end-to-end manner. Mo et al. (Mo, Zhang, and Feng 2018) design a fully convolutional residual network named FCRN for exudate segmentation, while Guo et al. (Guo et al. 2019b) design a lightweight neural network named LWENet. In (Guo et al. 2019a), L-seg is proposed for multi-lesion segmentation. All of these methods take the class imbalance problem into consideration and train their networks with CBCE loss. In (Guo et al. 2020), an improved CBCE loss incorporating hard negative mining is proposed for hard exudate segmentation. Both CBCE loss and bin loss avoid the background bias by increasing the cost of wrongly classifying exudate pixels; however, due to the imbalanced cost weights, FCNs trained with CBCE loss or bin loss tend to suffer from exudate bias.
It is worth mentioning that dual networks and dual sampling have been used for class-imbalanced classification. In (Zhou et al. 2020), a bilateral-branch network (BBN) equipped with two samplers is proposed for class-imbalanced image classification. In (Ouyang et al. 2020), a dual-sampling network (DSN), which consists of two separate branches with two samplers, is proposed for diagnosis of COVID-19. Our method is inspired by BBN (Zhou et al. 2020) and DSN (Ouyang et al. 2020). Although all of these methods contain dual branches with dual samplers, ours differs from BBN (Zhou et al. 2020)
and DSN (Ouyang et al. 2020) in three aspects: (1) our dual-branch network is designed for dense classification while BBN (Zhou et al. 2020) and DSN (Ouyang et al. 2020) are designed for image-level classification; (2) images fed into our dual-branch network are randomly sampled from the training set, while images fed into BBN (Zhou et al. 2020) and DSN (Ouyang et al. 2020) are sampled according to pre-defined samplers; (3) the samplers in our dual-branch network operate on the predicted segmentation masks and sample the pixels involved in the Dice loss calculation, while the samplers in BBN (Zhou et al. 2020) and DSN (Ouyang et al. 2020) operate at the input layer and sample images for representation and classifier learning.

Figure 2: Overview of the proposed method. The left part is the dual-branch network, which is constructed from two identical segmentation branches with partial weight sharing; arbitrary segmentation models such as PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015) can be adopted. The right part illustrates the proposed dual-sampling modulated Dice loss, which uses a uniform pixel sampler and a re-balanced pixel sampler to sample the pixels involved in the loss calculation. The two losses are adaptively modulated by a hyper-parameter set according to the training epoch.

Our goal is to pursue deep network parameters that best fit hard exudate ground truth of different sizes given the training images. For hard exudates of different sizes, learning the representation and classifier in different manners is desirable. To this end, we adopt two branches that separately learn representations and classifiers. One branch, named the large hard exudate biased segmentation branch, is mainly responsible for large hard exudates; the other, named the small hard exudate biased segmentation branch, is responsible for small hard exudates. To achieve the size bias of each branch adaptively, we design a dual-sampling modulated Dice loss, termed the DSM loss. Fig. 2 illustrates our proposed dual-branch network. Formally, let $X \in \mathbb{R}^{H \times W \times 3}$ denote a training colour fundus image of size $H \times W$ and $Y \in \{0,1\}^{H \times W}$ the corresponding ground truth, which is a binary map in the context of hard exudate segmentation. From the training set, we randomly fetch two images $\{X_L, Y_L\}$ and $\{X_S, Y_S\}$ and feed them into the large hard exudate biased segmentation branch and the small hard exudate biased segmentation branch respectively, obtaining the final predictions $\hat{Y}_L$ and $\hat{Y}_S$. Next we elaborate the architecture of our dual-branch network and the training details with our dual-sampling modulated Dice loss.
We let both branches economically share the same segmentation network structure, as illustrated in Fig. 2; our dual-branch design can adopt an arbitrary segmentation network. In this paper, we take three state-of-the-art segmentation networks, PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015), as examples. PSPNet and Deeplabv3 adopt ResNet50 (He et al. 2016), equipped with dilated convolution (Chen et al. 2017a) in the last two stages, as the backbone, while HED adopts the five-stage VGG16 (Simonyan and Zisserman 2015) as the backbone; both ResNet50 and VGG16 contain five stages of convolutions. Additionally, in PSPNet a pyramid pooling module is attached to the last convolutional stage, followed by a classifier that makes dense predictions, and an auxiliary classifier attached to the second convolutional stage generates an auxiliary loss that helps optimise the learning process. In Deeplabv3, an atrous spatial pyramid pooling module is attached to the last convolutional stage to generate multiscale feature maps. In HED, five side-output layers are stitched onto the five convolutional stages and a fusion layer finally aggregates the side-output predictions. To reduce model complexity and speed up inference, when PSPNet (Zhao et al. 2017) or Deeplabv3 (Chen et al. 2017b) is used as the segmentation branch, the weights of the first four stages of the backbone are shared between the branches while the remaining weights are learned separately; for HED (Xie and Tu 2015), only the first three stages of the backbone are shared.
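To make the partial weight sharing concrete, below is a minimal PyTorch sketch of the dual-branch wrapper. It is our illustration rather than the released implementation: `make_shared`, `make_tail` and `make_head` are hypothetical factory functions standing in for, e.g., ResNet50 stages 1-4, stage 5 and a PSPNet-style dense classifier.

```python
import torch
import torch.nn as nn

class DualBranchSegNet(nn.Module):
    """Two identical segmentation branches with shared early backbone stages.

    A sketch only: the factory arguments are hypothetical stand-ins for the
    shared stages, the per-branch late stages and the per-branch classifiers.
    """

    def __init__(self, make_shared, make_tail, make_head):
        super().__init__()
        self.shared = make_shared()   # early stages, one copy, shared
        self.tail_L = make_tail()     # late stage(s) of the large-HE branch
        self.tail_S = make_tail()     # late stage(s) of the small-HE branch
        self.head_L = make_head()     # dense classifier, large-HE branch
        self.head_S = make_head()     # dense classifier, small-HE branch

    def forward(self, x_L, x_S):
        # Each branch receives its own randomly fetched training image.
        y_L = self.head_L(self.tail_L(self.shared(x_L)))
        y_S = self.head_S(self.tail_S(self.shared(x_S)))
        return y_L, y_S

    @torch.no_grad()
    def predict(self, x):
        # Inference: feed the test image through both branches and average.
        y_L = self.head_L(self.tail_L(self.shared(x)))
        y_S = self.head_S(self.tail_S(self.shared(x)))
        return 0.5 * (y_L + y_S)
```

Sharing the early stages keeps the low-level representation common to both branches, so the extra parameter cost (roughly 1.5 times for PSPNet/Deeplabv3, as reported in the experiments) comes only from the duplicated late stages and heads.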
In this way, the representations for the final classifiers are specific to hard exudates of different sizes. The loss terms in PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015) are replaced with our proposed dual-sampling modulated Dice loss.
In each training iteration, two pairs of fundus images and their ground truths, denoted by $\{X_L, Y_L\}$ and $\{X_S, Y_S\}$, are randomly fetched. $X_L$ is fed into the large hard exudate biased segmentation branch, yielding the prediction $\hat{Y}_L$; similarly, $X_S$ is fed into the small hard exudate biased segmentation branch, yielding $\hat{Y}_S$. As hard exudate segmentation suffers from extreme class imbalance and large variation in size, rather than using the class-balanced cross-entropy loss (Guo et al. 2019a, 2020) we propose the dual-sampling modulated Dice loss (DSM loss). As shown in Fig. 2, the total loss can be expressed as

$\mathcal{L} = \alpha \mathcal{L}_L + (1-\alpha) \mathcal{L}_S, \quad (1)$

where $\mathcal{L}_L$ and $\mathcal{L}_S$ are the losses of the two branches, defined below, and $\alpha$ is a modulation weight.
Dual-Sampling Modulated Dice Loss. In our design of the DSM loss, two different samplers sample pixels from the predictions of the two branches separately, and Dice loss then measures the dissimilarity between the set of sampled pixels and their ground truths. For the large hard exudate biased segmentation branch, given the predicted segmentation mask $\hat{Y}_L$, we use a uniform pixel sampler which samples hard exudate pixels and background pixels with equal probability. We denote the sampled pixel set by $\hat{S}_L = \{\hat{y}_{L,n}\}_{n=1}^{N}$, where $N = H \times W$, and the corresponding ground-truth set by $S_L = \{y_{L,n}\}_{n=1}^{N}$, where $y_{L,n}$ is the ground truth of $\hat{y}_{L,n}$. We use Dice loss (Milletari, Navab, and Ahmadi 2016) to calculate the dissimilarity between the sampled pixel set and the corresponding ground-truth labels:

$\mathcal{L}_L = 1 - D(\hat{S}_L, S_L), \quad (2)$

where $D(\hat{S}_L, S_L)$ measures the degree of overlap between the two sets:

$D(\hat{S}_L, S_L) = \frac{2 \sum_{\hat{y}_{L,n} \in \hat{S}_L} \hat{y}_{L,n} y_{L,n}}{\sum_{\hat{y}_{L,n} \in \hat{S}_L} \hat{y}_{L,n} + \sum_{y_{L,n} \in S_L} y_{L,n}}. \quad (3)$

Obviously, the loss defined by Eqs. (2) and (3) focuses on the overlap error, which greatly alleviates the bias towards the majority class exhibited by cross-entropy loss as well as the bias towards the minority class exhibited by CBCE loss. However, Dice loss produces a severe penalty for misidentifying large hard exudate regions but only a slight penalty for misidentifying small ones, which results in a bias towards large hard exudate regions. To remedy the misidentification of small hard exudate regions, we use a re-balanced pixel sampler on the prediction of the small hard exudate biased segmentation branch to sample the pixels involved in the loss calculation. In particular, we oversample exudate pixels and undersample background pixels; hard exudate pixels in small regions are thus sampled multiple times with high probability, which increases the penalty for misidentifying small hard exudate regions and shifts the learning focus to them. Formally, we randomly sample $N_1$ hard exudate pixels and $N - N_1$ background pixels with replacement. We denote the sampled pixel set by $\hat{S}_S = \{\hat{y}_{S,n}\}_{n=1}^{N}$ and the corresponding ground-truth set by $S_S = \{y_{S,n}\}_{n=1}^{N}$, where $y_{S,n}$ is the ground truth of $\hat{y}_{S,n}$. Similarly, the loss for the small hard exudate biased segmentation branch is calculated as

$\mathcal{L}_S = 1 - D(\hat{S}_S, S_S), \quad (4)$

where $D(\hat{S}_S, S_S)$ is the Dice coefficient computed as in Eq. (3).
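The following PyTorch sketch summarises Eqs. (1)-(4). It is a minimal reading of the text rather than the authors' code: it assumes flattened sigmoid foreground probabilities, and the exact with-replacement sampling scheme of the re-balanced sampler is our interpretation.

```python
import torch

def dice_loss(pred, gt, eps=1e-6):
    # Eqs. (2)/(4): one minus the Dice overlap of Eq. (3) on sampled pixels.
    inter = 2.0 * (pred * gt).sum()
    return 1.0 - inter / (pred.sum() + gt.sum() + eps)

def dsm_loss(pred_L, gt_L, pred_S, gt_S, alpha, sample_rate=0.5):
    """Dual-sampling modulated Dice loss (a sketch of Eq. (1)).

    pred_*: flattened foreground probabilities in [0, 1];
    gt_*:   flattened binary ground-truth masks.
    Assumes gt_S contains at least one exudate and one background pixel.
    """
    # Large-HE branch: the uniform sampler amounts to using every pixel once.
    loss_L = dice_loss(pred_L, gt_L)

    # Small-HE branch: re-balanced sampler draws N1 exudate pixels and
    # N - N1 background pixels with replacement (N1 / N = sample_rate).
    n = pred_S.numel()
    n1 = int(sample_rate * n)
    pos = torch.nonzero(gt_S > 0.5).squeeze(1)
    neg = torch.nonzero(gt_S <= 0.5).squeeze(1)
    idx = torch.cat([pos[torch.randint(len(pos), (n1,))],
                     neg[torch.randint(len(neg), (n - n1,))]])
    loss_S = dice_loss(pred_S[idx], gt_S[idx])

    # Eq. (1): modulate the two branch losses with the epoch-dependent alpha.
    return alpha * loss_L + (1.0 - alpha) * loss_S
```

With `sample_rate = 0.5`, half of the sampled pixels are exudate pixels, so pixels of small exudate regions are drawn many times and their misclassification dominates $\mathcal{L}_S$.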
As segmenting large hard exudates is much easier than segmenting small ones, we propose an easy-to-difficult learning strategy: we adaptively modulate the losses of the two branches such that the dual-branch network first learns to handle the easy task and then focuses on the difficult one. At the beginning of training, we multiply the loss of the large hard exudate biased segmentation branch by a large weight $\alpha$ and the loss of the small hard exudate biased segmentation branch by a small weight $1-\alpha$, enforcing the network to learn to segment large hard exudates. As the large hard exudate biased segmentation branch becomes more and more proficient at segmenting large hard exudates, we gradually decrease $\alpha$ and increase $1-\alpha$; the focus of the dual-branch network thereby gradually shifts to the segmentation of small hard exudates. Formally, our proposed dual-sampling modulated Dice loss is expressed as

$\mathcal{L}_{DSM} = \alpha \mathcal{L}_L + (1-\alpha) \mathcal{L}_S, \quad (5)$

where the modulation weight $\alpha$ is set according to the training epoch, decreasing as training proceeds.
Inference. In the inference phase, the test fundus image is fed into both branches and two predictions are obtained. As both branches are equally important, we simply perform an element-wise average of the two predictions to obtain the final prediction.
In this section, we first introduce the data and evaluation metrics for hard exudate segmentation, present the implementation details, then give an ablation analysis and finally make comparisons with the state of the art.
Data. In our experiments, we validate our method on two public datasets for hard exudate segmentation. The first is the DDR dataset (Li et al. 2019), released in 2019 for diabetic retinopathy classification, lesion segmentation and lesion detection. To the best of our knowledge, DDR is the largest dataset for hard exudate segmentation. Fundus images in DDR were collected from 147 hospitals covering 23 provinces in China, and their sizes range from 1088 × 1920 to 3456 × 5184. The large variation in image size and the large domain gap make classification and segmentation on DDR more challenging. For lesion segmentation, DDR provides 757 fundus images with pixel-level annotations, among which 383 are for training, 149 for validation and 225 for testing. The other dataset is IDRiD (Porwal et al. 2018, 2020), provided by the grand challenge on "Diabetic Retinopathy - Segmentation and Grading" in 2018. It provides 81 fundus images of size 4288 × 2848 with pixel-level hard exudate annotations, among which 54 are used for training and 27 for testing. All of these images were acquired at an eye clinic in India.
Evaluation Metrics. We evaluate segmentation methods at both the pixel level and the region level. For pixel-level metrics, following (Li et al. 2019) and (Porwal et al. 2020), Intersection over Union (IoU) and Area Under the Precision-Recall Curve (AUPR) are adopted. We also adopt the F-score, the harmonic mean of sensitivity ($SN$) and positive predictive value ($PPV$). For region-level metrics, we follow (Zhang et al. 2014; Liu et al. 2017; Guo et al. 2020) and re-define true positives (TP), false positives (FP) and false negatives (FN). We denote the set of predicted hard exudate connected components by $\hat{\mathcal{C}} = \{\hat{C}_1, \cdots, \hat{C}_N\}$ and the set of ground-truth hard exudate connected components by $\mathcal{C} = \{C_1, \cdots, C_M\}$. A pixel is defined as a TP if and only if it belongs to

$\{\hat{C}_i : |\hat{C}_i \cap \mathcal{C}|/|\hat{C}_i| \ge \sigma\} \cup \{C_j : |C_j \cap \hat{\mathcal{C}}|/|C_j| \ge \sigma\},$

where $|\cdot|$ denotes cardinality and $\sigma \in [0, 1]$ is the overlap ratio threshold. The larger $\sigma$ is, the more rigorous the condition under which a pixel is treated as a TP. A pixel is considered an FP if and only if it belongs to

$\{\hat{C}_i : |\hat{C}_i \cap \bar{\mathcal{C}}|/|\hat{C}_i| > 1 - \sigma\},$

where $\bar{\mathcal{C}}$ is the complement of $\mathcal{C}$. A pixel is considered an FN if and only if it belongs to

$\{C_j : |C_j \cap \bar{\hat{\mathcal{C}}}|/|C_j| > 1 - \sigma\},$

where $\bar{\hat{\mathcal{C}}}$ is the complement of $\hat{\mathcal{C}}$. The remaining pixels are considered TNs. The region-level F-score is defined as

$F_\sigma = \frac{2 \cdot SN_\sigma \cdot PPV_\sigma}{SN_\sigma + PPV_\sigma},$

where $SN_\sigma = \frac{TP}{TP + FN}$ is the region-level sensitivity and $PPV_\sigma = \frac{TP}{TP + FP}$ is the region-level positive predictive value. In our experiments, $F_{0.2}$, $F_{0.35}$, $F_{0.5}$, $F_{0.65}$ and $F_{0.8}$ are reported.
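To make the region-level protocol concrete, here is a small NumPy/SciPy sketch. The exact overlap conditions follow our reading of the definitions above and should be treated as an interpretation, not the authors' evaluation code.

```python
import numpy as np
from scipy import ndimage

def region_level_counts(pred, gt, sigma=0.5):
    """Region-level TP/FP/FN for one image.

    pred, gt: binary masks. A connected component contributes its pixels to
    TP when at least a fraction sigma of it overlaps the other mask;
    unmatched predicted pixels become FP and unmatched ground-truth pixels
    become FN, following our reading of the definitions above.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.zeros_like(pred)

    pred_lab, n_pred = ndimage.label(pred)
    for i in range(1, n_pred + 1):
        comp = pred_lab == i
        if (comp & gt).sum() / comp.sum() >= sigma:
            tp |= comp                  # sufficiently overlapped prediction
    gt_lab, n_gt = ndimage.label(gt)
    for j in range(1, n_gt + 1):
        comp = gt_lab == j
        if (comp & pred).sum() / comp.sum() >= sigma:
            tp |= comp                  # sufficiently covered ground truth

    fp = (pred & ~tp).sum()             # predicted pixels left unmatched
    fn = (gt & ~tp).sum()               # ground-truth pixels left unmatched
    return int(tp.sum()), int(fp), int(fn)

def f_region(tp, fp, fn):
    sn = tp / (tp + fn)                 # region-level sensitivity SN_sigma
    ppv = tp / (tp + fp)                # region-level PPV_sigma
    return 2 * sn * ppv / (sn + ppv)    # region-level F-score F_sigma
```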
Data Preprocessing and Augmentation. For images in DDR (Li et al. 2019), we first crop the bounding box of the field of view; for cropped boxes whose short side is less than 1024, we enlarge the short side to the length of the long side via zero padding. Finally, we resize the images to 1024 × 1024. For images in IDRiD (Porwal et al. 2018, 2020), we directly resize them to 1440 × 960. Following (Guo et al. 2019a, 2020), on both datasets two tricks are adopted to augment the training data: rotation (90°, 180° and 270°) and flipping (horizontal and vertical).
Experimental Setting. We build three variants of the dual-branch segmentation network based on PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015), call them dual-PSPNet, dual-Deeplabv3 and dual-HED respectively, and implement them within the PyTorch framework. We initialise the backbone parameters with models pre-trained on ImageNet (Russakovsky et al. 2015) and the parameters associated with the classifiers from a Gaussian distribution with zero mean and standard deviation 0.01. SGD is used for parameter optimisation. The hyper-parameters are: initial learning rate 0.03, decayed with the poly policy with power 0.9; weight decay 0.0005; momentum 0.9; batch size 2; and number of training epochs 100 on DDR (Li et al. 2019) and 40 on IDRiD (Porwal et al. 2018). The sample rate N1/N of the re-balanced pixel sampler is set to 0.5. The models are trained on two GeForce RTX 2080 Ti GPUs.
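As an illustration of the preprocessing just described, the sketch below crops the field of view, pads to a square and resizes, and enumerates the rotation/flip augmentations. The FOV is located by a simple intensity threshold, which is our assumption rather than a detail given in the paper.

```python
import numpy as np
from PIL import Image

def preprocess_ddr(img, thresh=10):
    """Crop the field of view, zero-pad to square, resize to 1024 x 1024.

    The FOV is found by thresholding near-black borders; the threshold is
    our assumption. The paper pads only when the short side is below 1024;
    we pad unconditionally for brevity.
    """
    arr = np.asarray(img)
    fov = arr.max(axis=2) > thresh              # non-background pixels
    ys, xs = np.where(fov)
    arr = arr[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = arr.shape[:2]
    side = max(h, w)
    canvas = np.zeros((side, side, 3), dtype=arr.dtype)   # zero padding
    y0, x0 = (side - h) // 2, (side - w) // 2
    canvas[y0:y0 + h, x0:x0 + w] = arr
    return Image.fromarray(canvas).resize((1024, 1024))

def augment(img, mask):
    # Rotations by 90/180/270 degrees plus horizontal and vertical flips,
    # applied identically to the image and its ground-truth mask.
    ops = (Image.Transpose.ROTATE_90, Image.Transpose.ROTATE_180,
           Image.Transpose.ROTATE_270, Image.Transpose.FLIP_LEFT_RIGHT,
           Image.Transpose.FLIP_TOP_BOTTOM)
    return [(img, mask)] + [(img.transpose(t), mask.transpose(t)) for t in ops]
```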
Effect of Sample Rate. We take dual-PSPNet as an example and first explore the effect of the sample rate N1/N of the re-balanced pixel sampler. We set it to 0.25, 0.5, 0.75 and the reversed class frequency, and train dual-PSPNet on the DDR (Li et al. 2019) training set in each setting. Results on the test set are reported in Table 2. They show that dual-PSPNet performs best when the sample rate N1/N is set to 0.5. In what follows, unless otherwise stated, N1/N = 0.5 is the default setting.
Ablation Study on Different Losses. To evaluate the dual-branch network, we conduct experiments on three state-of-the-art dense classification methods, PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015), under several settings: single-branch network with CBCE loss, single-branch network with Dice loss (Milletari, Navab, and Ahmadi 2016), and our proposed dual-branch network with DSM loss, on both DDR (Li et al. 2019) and IDRiD (Porwal et al. 2018). Table 3 reports the results on DDR. Obviously, single-branch networks with Dice loss (Milletari, Navab, and Ahmadi 2016) are much better at hard exudate segmentation than those with CBCE loss. This is because CBCE loss re-weights costs according to the inverse class frequency: it over-weights costs on hard exudate pixels and under-weights costs on background pixels, which biases the network in favour of hard exudates. Thus the pixel-level sensitivity is very high while the PPV is very low, and their harmonic mean, the pixel-level F-score, is low. On the contrary, single-branch networks trained with Dice loss improve the pixel-level PPV significantly at the cost of inferior sensitivity; nevertheless, the final pixel-level F-score is greatly improved. Compared with single-branch networks with Dice loss, our dual-branch networks with the proposed DSM loss further improve the performance. In terms of IoU, our dual-branch networks with DSM loss are superior to the single-branch ones. In terms of AUPR, our dual-branch networks outperform the single-branch ones except for dual-HED. In terms of region-level metrics, Table 3 shows that dual-Deeplabv3 with DSM loss consistently achieves better results than the single-branch ones. Our dual-PSPNet with DSM outperforms the single-branch PSPNet with CBCE loss and Dice loss except at σ = 0.65. Our dual-HED with DSM outperforms the single-branch HED (Xie and Tu 2015) with Dice loss when σ is small; when σ exceeds 0.5, it achieves an inferior region-level F-score compared with the single-branch network trained with Dice loss. The possible reason is that our dual-HED with DSM segments small hard exudates coarsely, so background pixels around small hard exudates are misidentified. When σ is small, those misidentified background pixels are treated as true positives, and our dual-HED with DSM loss achieves higher region-level F-scores than the single-branch network with Dice loss; when σ increases, those misidentified background pixels are counted as false positives, and our region-level F-scores fall below those of the single-branch network with Dice loss. On IDRiD (Porwal et al. 2018), Table 4 shows that our dual-branch networks with DSM outperform single-branch networks with CBCE and Dice loss, except for dual-HED with DSM when σ is 0.5. In terms of the number of parameters, the two branches with PSPNet (Zhao et al. 2017) or Deeplabv3 (Chen et al. 2017b) contain almost 1.5 times the parameters of the corresponding single branch, while the two branches with HED (Xie and Tu 2015) contain 1.9 times.
Visual comparisons of our dual-branch networks with DSM loss against single-branch networks with CBCE and Dice losses on DDR (Li et al. 2019) and IDRiD (Porwal et al. 2018) are provided in Fig. 3 and Fig. 4 respectively. We can see that single-branch networks with CBCE loss are prone to misidentify background pixels around hard exudates, while single-branch networks with Dice loss are prone to misidentify small hard exudates; our dual-branch networks work better than the single-branch networks.

Figure 3: Visual comparisons with single-branch networks on DDR (Li et al. 2019). The single-branch networks are PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015). Pixels in yellow are exudate pixels that are correctly classified. Pixels in red are background pixels that are wrongly classified as exudate pixels. Pixels in green are exudate pixels that are wrongly classified as background pixels. Dashed magenta boxes highlight hard exudates that are misidentified.

Figure 4: Visual comparisons with single-branch networks on IDRiD (Porwal et al. 2018). The single-branch networks are PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015). Pixels in yellow are exudate pixels that are correctly classified. Pixels in red are background pixels that are wrongly classified as exudate pixels. Pixels in green are exudate pixels that are wrongly classified as background pixels. Dashed magenta boxes highlight hard exudates that are misidentified.

On DDR (Li et al. 2019), we compare our three dual-branch networks with three deep learning based methods: Deeplabv3+ (Chen et al. 2018), DNL (Yin et al. 2020) and SPNet (Hou et al. 2020). For DNL (Yin et al. 2020) and SPNet (Hou et al. 2020), we provide results with two losses: CBCE loss and Dice loss. All of these methods were originally designed for natural scene image segmentation. Table 5 reports the results; we note that the results of Deeplabv3+ (Chen et al. 2018) in the first row are provided by (Li et al. 2019) and the rest are obtained by fine-tuning on DDR (Li et al. 2019).
As shown in Table 5, our dual-branch networks achieve superior performance in terms of both pixel-level and region-level metrics.

Table 5: Comparisons with state-of-the-art methods on DDR (Li et al. 2019) in terms of IoU, F_pixel, AUPR and F_σ for σ = 0.2, 0.35, 0.5, 0.65 and 0.8. Results of Deeplabv3+ (Chen et al. 2018) are directly borrowed from (Li et al. 2019).

On IDRiD (Porwal et al. 2018), we compare our dual-branch networks with five deep learning based methods: DNL (Yin et al. 2020), SPNet (Hou et al. 2020), L-seg (Guo et al. 2019a), LWENet (Guo et al. 2019b) and Bin loss (Guo et al. 2020). For LWENet (Guo et al. 2019b) and L-seg (Guo et al. 2019a), the predicted binary masks are provided by the authors; with them, IoU, F_pixel and the region-level metrics are computed. For the remaining methods, results are obtained by fine-tuning. Table 6 reports the results. We can see that our dual-branch network achieves superior results to the compared methods.
Visual comparisons between previous methods and ours on DDR (Li et al. 2019) and IDRiD (Porwal et al. 2018) are also performed. Fig. 5 shows the segmentation results of our dual-branch networks based on PSPNet (Zhao et al. 2017), Deeplabv3 (Chen et al. 2017b) and HED (Xie and Tu 2015), of DNL (Yin et al. 2020) with CBCE loss and Dice loss, and of SPNet (Hou et al. 2020) with CBCE loss and Dice loss. We can see that (1) DNL (Yin et al. 2020) and SPNet (Hou et al. 2020) with CBCE loss are hard exudate biased and prone to misidentify background pixels as hard exudate pixels; (2) SPNet (Hou et al. 2020) with Dice loss is background biased and prone to misidentify hard exudate pixels as background pixels; and (3) our dual-branch networks perform better than them. In Fig. 6, the segmentation results of ours, DNL (Yin et al. 2020), SPNet (Hou et al. 2020), L-seg (Guo et al. 2019a), LWENet (Guo et al. 2019b) and Bin loss (Guo et al. 2020) are shown. Similarly, we can see that (1) DNL (Yin et al. 2020) and SPNet (Hou et al. 2020) with CBCE loss are hard exudate biased while with Dice loss they are background biased; (2) L-seg (Guo et al. 2019a) is background biased while LWENet (Guo et al. 2019b) and Bin loss (Guo et al. 2020) are hard exudate biased; and (3) ours achieves better results than the compared methods.

Figure 6: Visual comparisons with DNL (Yin et al. 2020), SPNet (Hou et al. 2020), L-seg (Guo et al. 2019a), LWENet (Guo et al. 2019b) and Bin loss (Guo et al. 2020) on IDRiD (Porwal et al. 2018). Pixels in yellow are exudate pixels that are correctly classified. Pixels in red are background pixels that are wrongly classified as exudate pixels. Pixels in green are exudate pixels that are wrongly classified as background pixels.

In this paper, we propose a dual-branch network to address the issues of extreme class imbalance and enormous variation in size across target regions in the segmentation of hard exudates from colour fundus images. Our dual-branch network uses two branches with partial weight sharing to learn representations and classifiers for hard exudates of different sizes. It is trained with the proposed dual-sampling modulated Dice loss, which enables the dual-branch network to first learn to segment large hard exudates and then small hard exudates. Experimental results on two public datasets for hard exudate segmentation demonstrate that our dual-branch network outperforms existing segmentation networks trained with either CBCE loss or Dice loss.
References
Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs
Rethinking atrous convolution for semantic image segmentation
Encoder-decoder with atrous separable convolution for semantic image segmentation
L-Seg: An end-to-end unified framework for multi-lesion segmentation of fundus images
A Lightweight Neural Network for Hard Exudate Segmentation of Fundus Image
Bin loss for hard exudates segmentation in fundus images
Deep residual learning for image recognition
Strip pooling: Rethinking spatial pooling for scene parsing
A generalized method for the segmentation of exudates from pathological retinal fundus images
A novel color space of fundus images for automatic exudates detection
Exudate detection in fundus images using deeply-learnable features
The Wisconsin Epidemiologic Study of Diabetic Retinopathy: VII. Diabetic nonproliferative retinal lesions
Hard exudates segmentation based on learned initial seeds and iterative graph cut
Diagnostic Assessment of Deep Learning Algorithms for Diabetic Retinopathy Screening
A location-to-segmentation strategy for automatic exudate segmentation in colour retinal fundus images
V-net: Fully convolutional neural networks for volumetric medical image segmentation
Exudate-based diabetic macular edema recognition in retinal images using cascaded deep residual networks
Dual-Sampling Attention Network for Diagnosis of COVID-19 from Community Acquired Pneumonia
Exudate segmentation in fundus images using an ant colony optimization approach
Indian diabetic retinopathy image dataset (IDRiD): A database for diabetic retinopathy screening research
IDRiD: Diabetic Retinopathy - Segmentation and Grading Challenge
Automated feature extraction for early detection of diabetic retinopathy in fundus images
ImageNet large scale visual recognition challenge
Quantitative measurement of hard exudates in patients with diabetes and their associations with serum lipid levels
Very Deep Convolutional Networks for Large-scale Image Recognition
Automatic detection of diabetic retinopathy exudates from non-dilated retinal images using mathematical morphology methods
A contribution of image processing to the diagnosis of diabetic retinopathy - detection of exudates in color fundus images of the human retina
Hard exudate detection based on deep model learned information and multi-feature joint representation for diabetic retinopathy screening
A coarse-to-fine strategy for automatically detecting exudates in color eye fundus images
Holistically-nested edge detection
Disentangled non-local neural networks
Exudate detection in color retinal images for mass screening of diabetic retinopathy
Pyramid scene parsing network
BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition