key: cord-0531376-m6ogyg5h authors: Nguyen, Duy M. H.; Nguyen, Thu T.; Vu, Huong; Pham, Quang; Nguyen, Manh-Duy; Nguyen, Binh T.; Sonntag, Daniel title: TATL: Task Agnostic Transfer Learning for Skin Attributes Detection date: 2021-04-04 journal: nan DOI: nan sha: 7d94c799aab9b3448b100bb9f19a39e66bf36efc doc_id: 531376 cord_uid: m6ogyg5h

Existing skin attributes detection methods usually initialize with a pre-trained ImageNet network and then fine-tune on a medical target task. However, we argue that such approaches are suboptimal because medical datasets are largely different from ImageNet and often contain limited training samples. In this work, we propose Task Agnostic Transfer Learning (TATL), a novel framework motivated by dermatologists' behaviors in the skincare context. TATL learns an attribute-agnostic segmenter that detects lesion skin regions and then transfers this knowledge to a set of attribute-specific classifiers to detect each particular attribute. Since TATL's attribute-agnostic segmenter only detects skin attribute regions, it enjoys ample data from all attributes, allows transferring knowledge among features, and compensates for the lack of training data from rare attributes. We conduct extensive experiments to evaluate the proposed TATL transfer learning mechanism with various neural network architectures on two popular skin attributes detection benchmarks. The empirical results show that TATL not only works well with multiple architectures but also can achieve state-of-the-art performances while enjoying minimal model and computational complexities. We also provide theoretical insights and explanations for why our transfer learning framework performs well in practice.

Melanoma is one of the most dangerous types of skin cancer. Even though it only accounts for 1% of all skin cancer cases, it is responsible for the majority of skin cancer deaths (Ward and Farma, 2017).
In 2021, it is estimated that 207,390 new cases of melanoma will be diagnosed and that 7,180 deaths from the disease will occur in the United States alone (Society, 2021). Moreover, the 5-year relative survival rate for melanoma drops from 99% for cases diagnosed at a localized stage to 27% for cases diagnosed at a distant stage (Society, 2021). Therefore, there have been tremendous efforts in detecting the disease in its early stages (Masood and Ali Al-Jumaily, 2013; Curiel-Lewandrowski et al., 2019). One of the most promising technologies is dermoscopy, which generates high-resolution images of skin lesions and allows dermatologists to examine the lesion regions more carefully. However, dermoscopy still requires extensive training, which is expensive, time-consuming, error-prone, and might not be widely available (Zalaudek et al., 2008). Therefore, it is important and highly beneficial to develop automatic systems to detect abnormal skin lesions and aid dermatologists during diagnosis (Nunnari et al., 2021b). For this purpose, the International Skin Imaging Collaboration (ISIC) hosted challenges for automatic melanoma detection based on dermoscopic images (ISIC-2018, ISIC-2017) (Codella et al., 2018). In this work, we focus on Task 2, which predicts the locations of dermoscopic attributes in an image. In particular, the challenge focused on five dermoscopic attributes: Streaks, Globules, Pigment Network, Negative Network, and Milia-like Cysts. Locating these clinically meaningful skin lesion patterns helps detect anomalous regions and provides an explanation that dermatologists can verify and use for further diagnosis. For instance, the Negative Network, which consists of relatively light regions and some darker regions, is usually considered a melanoma-specific structure (Pizzichetta et al., 2013). We provide an example of such attributes in Figure 1.
and different kinds of skin features, which makes a promising first step towards a more practical and beneficial medical-aid deep learning system.

Medical image analysis is a vital research avenue and has a significant impact on practice. However, most medical image datasets have limited training samples and often suffer from data imbalance. Therefore, a popular strategy is transfer learning, which uses a pre-trained ImageNet model as an initialization on which additional components are built. Transfer learning is the basis of many existing methods (Abràmoff et al., 2016; De Fauw et al., 2018; Gulshan et al., 2016; Rajpurkar et al., 2017), and is the norm for practitioners (Tan et al., 2018). However, recent studies by Cheplygina (2019) and He et al. (2020) conducted a large-scale analysis of the benefit of this strategy. They concluded that for medical images, transfer learning based on pre-trained ImageNet is not consistently better than random initialization. One reason is that medical images are vastly different from the ImageNet dataset, so the pre-trained weights are not helpful for the current task. Another reason is that medical data are often imbalanced and scarce due to data privacy. For example, Table 2 illustrates the distributions of each skin characteristic in the ISIC 2017 and ISIC 2018 datasets, with the rarest class (Streaks) accounting for only 7.98% (113 images) and 3.86% (100 images) of the training data, respectively. In comparison, the most common class (Pigment Network) accounts for 79.03% (1119 images) and 58.67% (1522 images) of the total samples. Fortunately, TATL can address the scarcity of training data in such situations by transferring the knowledge from the Attribute-Agnostic Segmenter. Moreover, since we apply the one-class-one-model strategy (Buda et al., 2018), each Target-Segmenter classifier in TATL only detects one attribute, which can alleviate the data imbalance problem.
Self-supervised learning, first mentioned in Schmidhuber (1990), refers to a technique of creating additional tasks for training where the label is also a part of the data (images) rather than a set of separate labels (annotations). This strategy has been a successful pre-training technique in various vision applications, including image colorization (Vondrick et al., 2018; Larsson et al., 2016; Zhang et al., 2016), image inpainting (Pathak et al., 2016; Chen et al., 2020), and video representation (Misra et al., 2016; Lee et al., 2017). In self-supervised learning, a newly created task for pre-training is called the "pretext task", and the main tasks used for fine-tuning are called the "downstream tasks". Various strategies have been proposed to construct the pretext task based on image rotation (Gidaris et al., 2018), temporal correspondence (Li et al., 2019; Wang et al., 2019b), cross-modal consistency (Wang et al., 2019a), and instance discrimination with contrastive learning (Wu et al., 2018). Recently, in the medical domain, He et al. (2020) successfully applied contrastive self-supervised learning to diagnosing COVID-19 from CT scans in order to reduce the risk of overfitting. Chen et al. (2019) also presented a context restoration framework in which the image is disordered by randomly changing the order of its sub-patches, and then a neural network is trained to recover the original input. Our TATL approach shares some similarities with self-supervised learning in the sense of building better pre-trained models without extra training instances by learning through auxiliary tasks. Here, the Attribute-Agnostic Segmenter plays the role of learning a pretext task, while the Target Segmenters, which detect the individual attributes, learn the downstream tasks.
However, our work differs from the SSL approaches because we define an auxiliary task through solving $(x, g(y))$, while SSL methods follow the scheme $(x, f(x))$, where $(x, y)$ denotes the image and its corresponding label, $g$ is an operator on the training label, and $f$ is a transformation on the image, such as rotation (Gidaris et al., 2018) or dividing images into sub-patches and shuffling their positions (Chen et al., 2019). The construction of $g$ is specifically designed for the medical domain, which makes TATL's pretext task closely complement the subsequent downstream tasks. Therefore, if the pretext task of recognizing skin attribute regions performs well, it will likely facilitate detecting the attributes of such areas. Finally, by providing skin attribute regions or abnormal regions from the pretext task, TATL is helpful to end-users by allowing dermatologists to validate the employed system's diagnostics.

This section aims to formalize our problem setting and outline the dermatologists' practice to diagnose skin attributes, which later motivates our method. We consider the skin attributes detection problem on a target dataset $D = \{(x_1, \{y^{(i)}_1\}_{i=1}^{|Y|}), \dots, (x_n, \{y^{(i)}_n\}_{i=1}^{|Y|})\}$ consisting of training images and their corresponding masks. The detector, parameterized by $W$, can be initialized from a pre-trained model on another dataset, which we call the source dataset $D_{src}$. Moreover, each training sample in the target domain consists of an image $x \in R^{c \times w \times h}$ and a set of labels $\{y^{(i)}\}_{i=1}^{|Y|}$, where $y^{(i)}$ is a binary mask indicating the skin region associated with the $i$-th attribute. In this work, we consider five different attributes: Globules, Milia, Negative Network, Pigment Network, and Streaks, abbreviated as $Y = \{G, M, N, P, S\}$, i.e., $|Y| = 5$. It is worth noting that a sample may not have all the attributes; the label for each missing attribute is the empty mask.
The training process can be performed by minimizing the empirical risk (Eq. (1)), where $\hat{y}^{(i)}_j$ denotes the binary mask prediction of the network on the $j$-th sample for the $i$-th attribute. For each attribute, we use the Tversky loss $L_{Tversky}$, which is a generalization of the Dice loss (Eelbode et al., 2020; Jadon, 2020; Salehi et al., 2017), and the soft Jaccard loss $L_{Jaccard}$ (Eelbode et al., 2020; Kawahara and Hamarneh, 2018) to penalise the deviation between the network's predictions and the ground truths. In the formal definitions of these loss functions, the prediction $\hat{y}$ and the ground truth $y$ are first reshaped into vectors; $\langle \cdot, \cdot \rangle$ and $\|\cdot\|_1$ denote the inner product and the $L_1$ norm, respectively. The parameter $\alpha$ keeps the loss functions well defined by avoiding division by zero, for example when $y = \hat{y} = 0$. The parameter $\beta$, in turn, controls the relative weight of false positives and false negatives. In our experiments, we choose $\lambda_1 = \lambda_2 = 0.5$ to balance the importance of the two loss functions, and set $\alpha = 1$ and $\beta = 0.6$ through validation experiments.

Figure 2 depicts a prediction pipeline inspired by the conventional diagnosis process of dermatologists, as discussed in Section 1. In the first step, dermatologists identify lesion regions by eliminating irrelevant background and rescaling these regions to a higher resolution for better visualization (Stage 1). Following that, they spot any abnormal and clinically relevant sub-areas on the lesion (Stage 2). Finally, by accounting for these factors, doctors diagnose specific skin attributes by comparing the textures and colors of various features with those of nearby regions (Stage 3). We argue that identifying lesion and skin attribute regions is crucial since it provides focal points for later steps, and we develop a skin attributes detection framework that closely follows the three-step procedure represented in Figure 2.
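As an illustration of the per-attribute loss described earlier in this section, the following is a minimal sketch rather than the exact formulation used in the experiments. It assumes the commonly used forms of the Tversky and soft Jaccard losses, with $\alpha$ acting as a smoothing constant in both numerator and denominator and $\beta$ weighting false positives against false negatives; the function names and the choice of which error type $\beta$ multiplies are assumptions made for this example.

```python
def tversky_loss(y_true, y_pred, alpha=1.0, beta=0.6):
    """Tversky loss on flattened masks (lists of floats in [0, 1]).

    Assumption: beta weights false positives and (1 - beta) weights
    false negatives; alpha is a smoothing term that avoids 0/0.
    With beta = 0.5 and alpha = 0 this reduces to the Dice loss.
    """
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    return 1.0 - (tp + alpha) / (tp + beta * fp + (1 - beta) * fn + alpha)


def soft_jaccard_loss(y_true, y_pred, alpha=1.0):
    """Soft Jaccard loss: 1 - (<y, y_hat> + a) / (|y|_1 + |y_hat|_1 - <y, y_hat> + a)."""
    inter = sum(t * p for t, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    return 1.0 - (inter + alpha) / (total - inter + alpha)


def combined_loss(y_true, y_pred, lam1=0.5, lam2=0.5, alpha=1.0, beta=0.6):
    """Weighted sum of the two losses with the weights reported in the text."""
    return (lam1 * tversky_loss(y_true, y_pred, alpha, beta)
            + lam2 * soft_jaccard_loss(y_true, y_pred, alpha))


if __name__ == "__main__":
    # An empty ground truth with an empty prediction stays well defined.
    print(combined_loss([0, 0, 0, 0], [0, 0, 0, 0]))
    # A completely missed attribute is penalised.
    print(combined_loss([1, 1, 0, 0], [0, 0, 0, 0]))
```

In this sketch, the empty-mask case evaluates to a loss of zero because of the smoothing term, which matches the role of $\alpha$ described in the text.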
We realize the diagnosis procedure in a single framework named Task Agnostic Transfer Learning (TATL). First, TATL employs a Segment-Net to segment the lesion regions from an input image. Then, TATL trains an Attribute-Agnostic Segmenter to detect all skin attribute regions in the image, regardless of their attributes, which is inspired by the second step in the procedure. Finally, the parameters of the Attribute-Agnostic Segmenter are used as an initialization for the Target Segmenters (Tar-S), each of which is trained to identify just one specific attribute; they form the final step of the method. TATL not only closely resembles how dermatologists diagnose but also enjoys two additional benefits over conventional approaches. First, TATL provides additional information about the skin attribute regions regardless of their attributes, which can be helpful for dermatologists. Remarkably, such areas reveal variations and commonalities of relevant lesions, thereby reducing subjective errors in the evaluation process. This can be demonstrated by two examples in Figure 9. Here, attributes such as "Negative Network" or "Globules" can be challenging to identify in isolation. In contrast, the union of all attributes provided by TATL can correctly cover those areas. Second, adapting weights trained on skin attribute regions to a specific attribute can guide the network to pay attention to shared features across diverse attributes, thus making the trained systems more robust and stable. We demonstrate these properties empirically in Subsections 5.3 to 5.8.

We now detail our TATL framework and discuss its theoretical properties, which shed light on its competitiveness. We first provide an overview of our TATL framework in Section 4.1. Then, we discuss the pretext task with the Segment-Net and Attribute-Agnostic Segmenter in Section 4.2, and the downstream task with the Target Segmenter in Section 4.3.
Then, we summarize the TATL framework and outline its algorithm in Section 4.4. Lastly, we conclude this section with a theoretical insight in Section 4.5.

The Encoder-Decoder Architecture. The core component in our TATL framework is the encoder-decoder architecture, which takes an image as input and produces a binary mask as output. While we employ two kinds of encoder-decoder networks in our method, they share the same design as follows. The encoder part could be any feature extraction layers in arbitrary architectures such as ResNet152 (He et al., 2016) or EfficientNet (Tan and Le, 2019). For feature extraction purposes, we thus discard non-linear rectification layers in these architectures. For the decoder path, we use layers that up-sample the encoder's features back to the original input's dimension. In particular, to match the encoder's stages, the decoder consists of an up-sampling layer and a sequence of convolutional blocks, where each block has two 3 × 3 convolution filters with activation functions in between. Each stage in the decoder receives a feature map from its immediately preceding layer and a corresponding feature map from the encoder's stage. The two inputs are combined by either addition or concatenation, corresponding to the settings of LinkNet (Chaurasia and Culurciello, 2017) and U-Net (Ronneberger et al., 2015b), respectively.

The TATL Framework. Our TATL framework consists of three kinds of encoder-decoder networks. The first network, Segment-Net, segments and upscales the lesion regions in the original image. Then, the second network, Attribute-Agnostic Segmenter, takes the lesion regions as input and learns to segment the skin attribute regions, possibly including any of the five attributes of interest. Finally, for each attribute, a corresponding network, the Target Segmenter, is trained to segment that attribute's regions. Moreover, the Target Segmenter's decoder also rescales the final mask to match the original image's dimensions.
Therefore, our TATL framework consists of seven networks in total: one Segment-Net, one Attribute-Agnostic Segmenter, and five Target Segmenters corresponding to the five attributes. Each network uses either the Link-shape or U-shape architecture with the b0-EfficientNet (Tan and Le, 2019) as the main backbone because it is lightweight compared to other architectures (Table 7). We now introduce some notation to detail our framework. We denote $\{f_{Seg}(\cdot, W_{seg}), f_U(\cdot, W_U), f_i(\cdot, W_i)\}$ as the corresponding networks of the Segment-Net, the Attribute-Agnostic Segmenter, and the Target Segmenters, respectively.

Here, we refer to the TATL Pretext Task as the problem of recognizing regions containing any attributes. The pretext training scheme consists of two stages: (i) cropping the skin lesion with the Segment-Net; and (ii) segmenting the attribute-agnostic mask with the Attribute-Agnostic Segmenter. In the following, we describe the training procedure with the corresponding components in detail.

In Stage 1, we use the Segment-Net, trained on the lesion dataset $D_{seg}$, to eliminate extraneous skin regions and keep only the lesion regions. Specifically, given an original image, we re-scale it to a size of 386 × 512 and pass it into the Segment-Net. A bounding box with an offset of 40 pixels in four directions is used to crop the Segment-Net's output so that errors in the segmentation step are not propagated to the later stages. Once this bounding box has been created, the cropped region is scaled to match the resolution of the input image and used as the new input for the next stage.

The second stage focuses on training the Attribute-Agnostic Segmenter. We first define the Attribute-Agnostic region as a region that contains at least one of the attributes in $Y$. From this, we define an intermediate Attribute-Agnostic dataset $D_U = \{x, M_U\}$, where $M_U$ is the binary mask corresponding to an image whose value is 1 whenever a pixel belongs to an attribute from $Y$ (Pretext Task).
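The two data-preparation operations described above, cropping around the predicted lesion with an offset and merging the per-attribute masks into one attribute-agnostic mask, can be sketched as follows. This is a minimal illustration on nested Python lists, not the pipeline's actual implementation: the function names are hypothetical, the fallback for an empty lesion mask is an assumption, and the final rescaling of the crop to the network's input resolution is omitted.

```python
from functools import reduce
from operator import or_


def lesion_bounding_box(mask, offset=40):
    """Return (top, left, bottom, right) around the non-zero pixels of a
    binary mask, enlarged by `offset` pixels in four directions and
    clipped to the image. bottom/right are exclusive indices."""
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows:
        # Assumption: keep the full image when no lesion is predicted.
        return (0, 0, height, width)
    top = max(rows[0] - offset, 0)
    bottom = min(rows[-1] + 1 + offset, height)
    left = max(cols[0] - offset, 0)
    right = min(cols[-1] + 1 + offset, width)
    return (top, left, bottom, right)


def crop(image, box):
    """Crop a 2-D image (list of rows) to the given bounding box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def union_mask(masks):
    """Attribute-agnostic mask: pixel-wise bitwise OR of all attribute masks."""
    return [
        [reduce(or_, pixels) for pixels in zip(*rows)]
        for rows in zip(*masks)
    ]


if __name__ == "__main__":
    lesion = [[0] * 5 for _ in range(5)]
    lesion[2][2] = 1
    print(lesion_bounding_box(lesion, offset=1))
    print(union_mask([[[1, 0]], [[0, 0]], [[0, 1]]]))
```

A larger offset makes the crop more tolerant of segmentation errors at the cost of including more background, which is the trade-off examined in the offset experiment of Table 5.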
Note that given an image $x$ and its set of attribute masks, $M_U$ is the union of all the masks and can be easily constructed with the bitwise OR operator as:

$M_U = y^{(1)} \mid y^{(2)} \mid \dots \mid y^{(|Y|)}$,

where $\mid$ denotes the bitwise OR operator. The dataset $D_U$ is used to train the Attribute-Agnostic Segmenter such that it can detect the skin attribute regions belonging to any of the attributes. It is important to note that $D_U$ contains masks covering all attributes, thus ameliorating the negative effect of training data scarcity, especially for minor classes.

Given $W_U$ learned by the Attribute-Agnostic Segmenter, we can proceed to Stage 3 and train the segmenter for each of the attributes (Downstream Task). Different from previous approaches, we initialize the Target-Segmenter's parameters from the Attribute-Agnostic Segmenter's parameters as $W^e_i \leftarrow W^e_U$ and $W^d_i \leftarrow W^d_U$ for each attribute $i$. Lastly, a set of Target-Segmenters is trained to segment the attributes. Having a dedicated network for each attribute is advantageous in alleviating the imbalanced training data problem. Moreover, we explore two strategies for training the Target-Segmenters, which correspond to allowing knowledge sharing across attributes or not. First, we freeze all the encoders (TATL-Freeze) to allow feature sharing across attributes, because each encoder is initialized from the Attribute-Agnostic Segmenter. Second, we allow both the encoder and decoder to be updated (TATL-Non Freeze), which allows each Target-Segmenter to adapt to its dedicated attribute. We summarize these Pretext (with Attribute-Agnostic Segmenter) and Downstream Tasks in Algorithm 1 and Figure 3. Ablation studies on freezing the encoder layers are also discussed in Table 8 of Subsection 5.4.

Algorithm 1
Input: ImageNet pre-trained weights of the employed backbone, $W_{ImgNet} = \{W^e_{ImgNet}, W^d_{ImgNet}\}$; the attribute dataset $D = \{(x_1, \{y^{(i)}_1\}_{i=1}^{|Y|}), \dots, (x_n, \{y^{(i)}_n\}_{i=1}^{|Y|})\}$
Output: Trained weights $W_i = \{W^e_i, W^d_i\}$, $i \in Y = \{G, M, N, P, S\}$
// Create the attribute-agnostic masks
for each image $x \in D$ do
    $M_U[x] = y^{(1)} \mid y^{(2)} \mid \dots \mid y^{(|Y|)}$
end
// Learning the Attribute-Agnostic Segmenter
Initialise: $W^e_U \leftarrow W^e_{ImgNet}$ and $W^d_U \leftarrow W^d_{ImgNet}$
for minibatch $\{x_k, y_k\}$ where $x_k \in D$, $y_k \in M_U$ do
    Minimize $f_U(\{x_k, y_k\} \mid W_U)$ using Eq. (1)
    Update $W^e_U$, $W^d_U$
end
// Learning the Target-Segmenter for each attribute
for attribute $i \in Y$ do
    Initialise: $W^e_i \leftarrow W^e_U$ and $W^d_i \leftarrow W^d_U$
    for minibatch $\{x_k, y^i_k\}$ where $x_k, y^i_k \in D$ do
        Minimize $f_i(\{x_k, y^i_k\} \mid W_i)$ using Eq. (1)
        Update $W^e_i$, $W^d_i$
    end
end
return $W^e_i$ and $W^d_i$ where $i \in Y$

We summarize the training and inference pipeline of TATL as follows.

Training Step. Given a training dataset, TATL training performs the following steps:
1. Training the Segment-Net using the dataset $D_{seg} = \{x, M_{seg}\}$.
Note that we only apply a U-shape encoder-decoder architecture in the first step, while we use both U-shape and Link-shape connections in the second step.

Inference Step. Given an input image, it is first segmented by the Segment-Net and then fed into five different Target-Segmenters to segment the five kinds of skin features. To compare with other competitors, we selected the b0-EfficientNet (Tan and Le, 2019) as the network backbone and used both the adding (Link-shape) and concatenating (U-shape) operations to combine feature maps obtained from the encoder and decoder layers. The final binary map for each skin attribute is produced by averaging the probability estimates from these two network architectures. Note that our prediction pipeline requires a focused lesion image generated by the Segment-Net, which might be influenced by the employed segmentation method.
Fortunately, TATL only uses a bounding box with a specific offset around the predicted lesion regions, which makes the following steps less sensitive to errors in the segmentation step. We present an experiment in Table 5 to validate this property.

This section provides theoretical insights to justify our approach using recent results on the data-dependent stability of optimization schemes. In particular, we investigate the model's generalization to a target domain based on its initialization weights and show, via Proposition 1, that initializing in a TATL fashion gives a tighter generalization error bound than ImageNet initialization.

First, we introduce the notation used, which is briefly summarized in Table 1. We consider the supervised training problem with $X \subset R^n$ as the input space and $Y \subset R$ as the output space. We also assume that training and testing instances are sampled i.i.d. from a probability distribution $D$ over $Z = X \times Y$. Also, we denote the training set by $S = \{z_i\}_{i=1}^{m}$, where a training sample $z_i$ consists of an input image $x \in X$ and its corresponding label $y \in Y$. We express $H = \{w_j\}$ as the hypothesis space, where $w_j \in R^d$ denotes a hypothesis (model) with dimension $d$ that maps an input instance $x$ to its corresponding label $y$. Lastly, we define a map $A_S : S \to H$ as a learning algorithm that returns a hypothesis given a training data set $S$.

Table 1: Notation
Notation                Definition
$X$, $Y$                the input and output spaces
$S$                     the training set
$Z$                     the joint input-output space
$A_S$                   learning algorithm given the training data $S$
$\epsilon(D, w_1)$      stability function of $D$ and $w_1$

Kuzborskij and Lampert (2018) established a data-dependent notion of algorithmic stability for Stochastic Gradient Descent (SGD), which starts from an initialization weight $w_1$ and updates it sequentially. Here, $\ell(w, z)$ is a loss function, which measures the difference between predicted values and true values with parameters $w \in H$ on a sample $z$. We write $(D, w_1)$ for the pair of the data-generating distribution and the initialization point $w_1$ of SGD, and $\epsilon(D, w_1)$ for a stability function of $D$ and $w_1$.
To characterize a randomized learning algorithm $A$, we use its on-average stability. Theorem 1 (Kuzborskij and Lampert, 2018) relates this stability to the gap between $R(A_S)$ and $\hat{R}_S(A_S)$, the risk and empirical risk of $A_S$, respectively. In other words, the generalization of a learning algorithm on unseen data drawn from the same distribution is controlled by its $\epsilon(D, w_1)$-on-average stability, which depends on the initialized weights $w_1$. In the following, we examine the model's generalization performance through the lens of its training algorithm's stability.

In the transfer learning setting, Theorem 1 provides a tool to understand the model's generalization on the target domain, given that it is initialized from one of the pre-trained models on a set of source domains (source tasks). Specifically, we suppose that the target task is characterized by a joint probability distribution $D_{tgt}$ and assume that a set of source hypotheses $\{w^{src}_k\} \subset H$, $k \in K$, is trained on $K$ different source tasks. In this paper, we consider two distinct source cases with $K = \{ImgNet, TATL\}$, where "ImgNet" refers to weights trained on ImageNet and "TATL" is our approach of learning the attribute-agnostic mask. Now we are ready to analyze the generalization bound of the TATL versus ImageNet initialization strategies by utilizing the results in Kuzborskij and Lampert (2018).

Proposition 1. Assume that the non-convex loss function $\ell(\cdot, z) \in [0, 1]$ has a $p$-Lipschitz Hessian and is $\beta$-smooth, and that step sizes of the form $\alpha_t = c/t$ satisfy $c \le \min\{1/\beta, 1/(4(2\beta \ln T)^2)\}$. Then, with high probability, $\epsilon(D_{tgt}, w^{src}_k)$ of the SGD scheme satisfies the bound in Eq. (6), where $\|\cdot\|_2$ is the spectral norm.
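For reference, the quantities used by Theorem 1 are commonly written as follows. This is a sketch based on the standard definitions in the data-dependent stability literature, stated here as an assumption about notation; in particular, the index $i_t$ of the sample drawn at step $t$ and the expectation over the algorithm's randomness $\mathbb{E}_A$ are notational choices made for this illustration.

```latex
% SGD update from the initialization w_1 with step size \alpha_t,
% where z_{i_t} is the training sample drawn at step t
w_{t+1} = w_t - \alpha_t \nabla \ell(w_t, z_{i_t})

% Risk and empirical risk of the hypothesis A_S returned on S = \{z_i\}_{i=1}^{m}
R(A_S) = \mathbb{E}_{z \sim D}\big[\ell(A_S, z)\big], \qquad
\hat{R}_S(A_S) = \frac{1}{m} \sum_{i=1}^{m} \ell(A_S, z_i)

% Generalization gap of an \epsilon(D, w_1)-on-average stable algorithm
\mathbb{E}_S \, \mathbb{E}_A \big[ R(A_S) - \hat{R}_S(A_S) \big] \le \epsilon(D, w_1)
```

Read this way, a smaller stability value $\epsilon(D, w_1)$ for a given initialization translates directly into a smaller expected gap between the risk and the empirical risk.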
Intuitively, Theorem 1 and Proposition 1 suggest that an initialization's generalization error depends on two factors: (i) how well it performs on the target domain without any training, which is characterized by $\hat{R}_S$; and (ii) the loss function's curvature around this initialization, which is characterized by the empirical $\|\nabla^2 \ell(w^{src}_k, z)\|_2$ over $m$ training samples, denoted as $\hat{\gamma}$. This result provides an intuitive explanation of why TATL provides a more favorable initialization than the traditional ImageNet pre-trained models. Notably, we will explain why TATL, which initializes the Target-Segmenter from the Attribute-Agnostic Segmenter, can perform better than initializing the segmenter from ImageNet pre-trained models.

Pre-trained ImageNet models are unlikely to perform well on medical images due to the vast difference between the two domains. Therefore, such models often have a higher empirical error on the target domain $\hat{R}_S$ and usually lie in regions of high curvature $\hat{\gamma}$. On the other hand, TATL uses an initialization from the Attribute-Agnostic Segmenter, which is pre-trained on self-generated data of the target task. Note that the Attribute-Agnostic Segmenter can detect any of the attributes, and therefore enjoys a lower empirical risk $\hat{R}_S$ compared to ImageNet models. Moreover, due to its construction, the Attribute-Agnostic Segmenter's parameters likely lie in a region close to the local minimum of each attribute detector, which enjoys lower curvature $\hat{\gamma}$. Consequently, TATL exploits the target task's knowledge to form an initialization with a high probability of attaining lower empirical error and curvature, which translates to a tighter generalization error bound than initializing from pre-trained ImageNet models. We empirically verify this hypothesis by comparing the bound's values in Eq. (6) for different initialization strategies in Figure 7 of Section 5.8.
We conduct experiments on two well-known datasets for skin attributes detection: the ISIC 2017 and ISIC 2018 Task 2 datasets. Table 2 provides a summary of the two datasets. It is worth noting that the ISIC 2017 dataset only contains four classes: Streaks, Negative Network, Milia, and Pigment Network, while the ISIC 2018 dataset introduces a new class, Globules. Moreover, both datasets exhibit high data imbalance among the attributes. For example, in the ISIC 2018 dataset, the class "Streaks" only appears in 3.86% of the training data while "Pigment Network" is observed in 58.67% of the training data.

We implemented all experiments using the PyTorch framework (Paszke et al., 2019) on 4 NVIDIA TITAN RTX GPUs. All images were pre-processed by centering and normalizing the pixel intensities per channel. In addition, we re-scale all images to a resolution of 386 × 512 in the training steps and transform the final predictions to a size of 256 × 256 in the evaluation step, following the standard of the ISIC challenge. We used the SGD optimizer (Goodfellow et al., 2016) with an initial learning rate of 0.01 and momentum of 0.9 to be consistent with the theory presented in Section 4.5. For TATL, we obtained the Segment-Net by training a b0-EfficientNet backbone with U-shape on both the ISIC 2018 and ISIC 2017 Task 1 datasets using the loss function in Eq. (1). Given the segmentation results, we defined a bounding box around the masks with an offset of 40 pixels in four directions to mitigate the segmentation errors, before feeding them to the Attribute-Agnostic Segmenter and the Target-Segmenters. The Attribute-Agnostic Segmenter and the Target-Segmenters were then trained for 40 epochs with early stopping after 10 epochs. Finally, we measure our performance and compare it with other baselines using five-fold cross-validation and report the average Dice and Jaccard coefficients, as in Koohbanani et al. (2018).
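The two evaluation metrics and the averaging over folds can be sketched as follows. This is a minimal illustration on flattened binary masks, not the official ISIC evaluation code: the function names are hypothetical, and returning 1.0 when both the prediction and the ground truth are empty is an assumed convention.

```python
def dice_coefficient(y_true, y_pred):
    """Dice = 2 |A ∩ B| / (|A| + |B|) for flattened binary masks."""
    inter = sum(t & p for t, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def jaccard_index(y_true, y_pred):
    """Jaccard = |A ∩ B| / |A ∪ B| for flattened binary masks."""
    inter = sum(t & p for t, p in zip(y_true, y_pred))
    union = sum(y_true) + sum(y_pred) - inter
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_over_folds(fold_scores):
    """Average a metric over cross-validation folds."""
    return sum(fold_scores) / len(fold_scores)


if __name__ == "__main__":
    truth = [1, 1, 0, 0]
    pred = [1, 0, 1, 0]
    print(dice_coefficient(truth, pred), jaccard_index(truth, pred))
```

For a single pair of masks the two metrics are monotonically related (Dice = 2J / (1 + J)), so they rank individual predictions identically, although their averages over a dataset can differ.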
We compare our method against the winners of ISIC 2017 (Kawahara and Hamarneh, 2018) and ISIC 2018 (Koohbanani et al., 2018) and report the Dice and Jaccard indices in Table 4 and Table 3, respectively. Here, the results of the first-place method in ISIC 2018 are taken from the original paper (Koohbanani et al., 2018), while we use the source code published by Kawahara and Hamarneh (2018) and re-run the experiments in our setting with five-fold cross-validation. Due to the high competitiveness of skincare challenges, we utilize both U-shape and Link-shape architectures with b0-EfficientNet (Tan and Le, 2019) as the backbone for the Attribute-Agnostic Segmenter and Target-Segmenters, and then take the average of their probability predictions. For a comprehensive comparison, we include five variants of TATL, most of which correspond to removing components of the TATL framework:
• the vanilla encoder-decoder architecture without the Segment-Net and the Attribute-Agnostic Segmenter (TATL Stage 3);
• a variant that performs the second and last stages of our TATL: first train the Attribute-Agnostic Segmenter on the original images and then a set of Target-Segmenters (TATL Stage 2, 3);
• a variant that performs the first and last stages of our TATL: segment the lesion regions and then the attributes (TATL Stage 1, 3);
• our full TATL framework that performs all three stages (TATL Stage 1, 2, 3);
• our full TATL framework initialized from scratch (TATL w/o ImgNet Stage 1, 2, 3).

Our TATL shows competitive performance against other approaches on both benchmarks and metrics. Notably, our method presents substantial improvements over other baselines on attributes with the least amount of training data. For example, in Table 3, TATL Stage 1, 2, 3 for Negative Network achieved a Jaccard index of 0.283, which is 5.5 percentage points higher than the ISIC 2018 winner (0.228) (Koohbanani et al., 2018).
Also, this TATL setting for the Streaks feature has a Dice index of 0.401, which is 13.1 percentage points higher than the ISIC 2018 winner (0.270) (Koohbanani et al., 2018).

This experiment aims to investigate the contribution of the first and second stages to the final performance, given that Stage 3 is always enabled for supervised training. Table 3 and Table 4 demonstrate that enabling Stage 2, i.e., TATL Stage 2, 3, results in a significant improvement in most skin attributes compared to using the Segment-Net (TATL Stage 1, 3). For example, the average Dice score in Table 3 increased from 0.383 for TATL Stage 3 to 0.466 for TATL Stage 2, 3, while only reaching 0.401 for TATL Stage 1, 3. This result thus emphasizes the critical role of learning the Attribute-Agnostic Segmenter in our framework. In addition, adding Stage 1 to the Stage 2 and 3 model further improves the results. For example, in Table 4, when using a pre-trained ImageNet model, adding Stage 1 increased the average Dice score from 0.367 to 0.387. Overall, the findings support our hypothesis that all three stages contribute to TATL's competitive performance.

We investigate the advantages of employing pre-trained models in TATL (Algorithm 1) by examining TATL w/o ImgNet Stage 1, 2, 3. The outcomes suggest that employing pre-trained models in TATL favours classes with more training data. Remarkably, for ISIC 2017's Pigment Network, the attribute with the most training samples, pre-trained models improve the Jaccard index from 0.473 to 0.516 (9.09% relative improvement). On the other hand, the contributions of pre-trained models on attributes with limited training samples, such as Streaks and Negative Network, are much less significant, e.g., ISIC 2017's Negative Network Jaccard index increased from 0.209 to 0.215 (2.87% relative improvement).
In general, we conclude that the improvements observed on minor classes come from our TATL framework rather than from ImageNet pre-training; however, the TATL version with ImageNet pre-training, on average, outperforms the TATL version using random weights.

The Influence of Cropping Lesion Segmentation on the Final Result

Our inference step requires the segmented lesion region in order to eliminate less relevant parts for the later phase. This task is handled by the Segment-Net (Stage 1). The lesion region is then cropped by a bounding box extended with an offset in all four directions to generate the input for the next step. In Table 3 and Table 4, the ablation study for Stage 1 used a U-shape network with a b0-EfficientNet backbone and an offset of 40 pixels. We now investigate how much the errors in Stage 1 propagate to the final predictions by varying the segmentation method and the offset value. We conducted tests on ISIC 2017 in which four models were trained to segment the four skin features, using the same configuration in Stages 2 and 3 but taking distinct Stage 1 inputs produced by different networks such as U-Net (Ronneberger et al., 2015b), SegNet (Badrinarayanan et al., 2017), and Mask-RCNN (He et al., 2017). In addition, we varied the offset from 0 to 60 pixels in 20-pixel steps. Table 5 reports the results of the various approaches under these changes, using the Dice score as the accuracy measure in all experiments. We observe that an offset of 0 pixels reduces the accuracy of the subsequent step because segmentation errors lead to the loss of essential information, particularly near the border of the cropped image. This also explains why higher accuracy in the lesion segmentation step leads to better performance in the feature segmentation stage. When the offset is increased to 20 pixels, all methods improve; for instance, the U-Net case rises from 0.358 to 0.362.
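The cropping step described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: it assumes the lesion mask is a binary 2D grid, that the bounding box is axis-aligned, and that the offset is clamped at the image border. The function names `lesion_bounding_box` and `crop` are hypothetical.

```python
def lesion_bounding_box(mask, offset=40):
    """Return (top, bottom, left, right), inclusive, of the lesion's bounding box
    enlarged by `offset` pixels in all four directions and clamped to the image."""
    height, width = len(mask), len(mask[0])
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows:
        # Assumption: with no detected lesion, fall back to the full image.
        return 0, height - 1, 0, width - 1
    top = max(rows[0] - offset, 0)
    bottom = min(rows[-1] + offset, height - 1)
    left = max(cols[0] - offset, 0)
    right = min(cols[-1] + offset, width - 1)
    return top, bottom, left, right


def crop(image, box):
    """Crop a 2D grid (list of rows) to an inclusive bounding box."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy 6x6 example: lesion pixels at (2, 2), (2, 3) and (3, 4).
mask = [[0] * 6 for _ in range(6)]
mask[2][2] = mask[2][3] = mask[3][4] = 1
image = [[10 * r + c for c in range(6)] for r in range(6)]

tight_box = lesion_bounding_box(mask, offset=0)
padded_box = lesion_bounding_box(mask, offset=1)
full_box = lesion_bounding_box(mask, offset=40)
tight_crop = crop(image, tight_box)
```

With an offset of 0, any lesion pixel missed by the segmenter at the boundary is cut away, which is consistent with the accuracy drop reported for that setting; a larger offset keeps a safety margin at the cost of including more background.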
Moreover, with offsets of 40 and 60 pixels, the margins between the baselines vanish, and both settings perform better than the 20-pixel offset. Table 5 also reveals a trade-off in selecting large offset values. In particular, a large value of 60 pixels can reduce performance compared to 40 pixels because the cropped image may include more unrelated content. In summary, we conclude that adding an offset of 40 pixels around the segmented lesion area allows our inference step to remain stable despite perturbations in the lesion segmentation.

This experiment examines the contribution of each component network to the overall performance of TATL. In particular, we evaluate our method using only the U-shape or only the Link-shape architecture with the b0-EfficientNet backbone and compare them to the networks used by the ISIC 2018 first-place method. Furthermore, we also include the performance of b0-EfficientNet in the ISIC-2018 challenge for an overall comparison. Table 6 reports the results, with blue and red marking the best Jaccard and Dice scores, respectively. In this table, our two variants, labeled U-Eff(TATL) and L-Eff(TATL), employ the U-shape and the Link-shape, respectively. All of the top results come from one of our variants, with the U-shape architecture ranking first on the Pigment Network and Globules features and the Link-shape architecture ranking second. Among the baselines, the one with the b0-EfficientNet backbone appeared to outperform the others. For skin attributes with a large amount of training data, such as Pigment Network (79.03%) or Milia-like cysts (33.55%), our model raised the Jaccard index of b0-EfficientNet from 55.4% to 56.5% and from 15.7% to 16.8%, respectively. The Dice coefficient improved by 0.9% on Pigment Network and by 1.6% on Milia-like cysts.
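For readers less familiar with the two metrics reported throughout, the following sketch computes the Dice coefficient and the Jaccard index for a pair of binary masks. It uses the standard set-overlap definitions, not the challenge's official evaluation code; the function names and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def jaccard_index(pred, target):
    """Jaccard index |A ∩ B| / |A ∪ B| for flat binary masks."""
    intersection = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    return 1.0 if union == 0 else intersection / union


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
d = dice_score(pred, target)
j = jaccard_index(pred, target)

# The two metrics are monotonically related: J = D / (2 - D).
assert abs(j - d / (2.0 - d)) < 1e-12
```

Because of this monotone relation, the Dice score is never lower than the Jaccard index for the same prediction, which is why the Dice columns in the tables carry the larger values.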
The lower the number of images, the greater the margin of improvement achieved by our model through the transfer learning step in TATL. For example, on Streaks, our L-Eff(TATL) reached 25.2% and 39.3% in Jaccard and Dice, respectively, which were 11.3% and 15.1% higher than the best baseline results with the EfficientNet backbone. Overall, TATL with the Link-shape structure performed the best across all network backbones, followed by TATL with the U-shape by a minor margin. Furthermore, these configurations surpass all remaining baselines by a large margin, demonstrating the benefit of using TATL.

While achieving state-of-the-art performance, our method enjoys a significant reduction in the number of parameters. We provide the number of trainable parameters of the different architectures in Table 7. Notably, compared to the winners of the ISIC 2017 (Kawahara and Hamarneh, 2018) and ISIC 2018 (Koohbanani et al., 2018) challenges, our method has 1.4 to 2.33 times and 30 to 50 times fewer parameters during training, respectively. Consequently, TATL consumes less GPU memory and can thus be trained at higher image resolution or deployed on mobile devices with low memory cost.

In Section 5.3, we demonstrated that TATL achieved promising results using the b0-EfficientNet backbone. In this experiment, we explore the robustness of TATL to network architectures beyond EfficientNets. In particular, we compare ImageNet initialization against TATL on five different backbone networks (including VGG16 and ResNet151) and evaluate the performance on the Negative Network and Streaks attributes because they are the most challenging ones, with the fewest training samples. We consider the following settings:
• The first one, denoted as TATL (FE), applies the TATL technique but freezes the encoder and only updates the weights of the decoder module while training for a specific skin attribute.
• The second configuration, denoted as TATL (NF), is similar to the former but allows the parameters of the encoder to be updated.
• The last setting (ImageNet) does not apply the TATL transfer learning process and instead trains directly from weights pre-trained on the ImageNet dataset.

For each model, we use five different backbone architectures and two different convolutional network shapes (U-shape and Link-shape). We also run each configuration three times and measure the corresponding performance with five-fold cross-validation to estimate average results. This setup results in a total of 600 experiments, which provides a comprehensive analysis of TATL. Table 8 reports the results and shows that applying TATL improves the performance of all backbones except ResNet-v2 on the Negative attribute. However, the difference between the Dice values in this case was not noticeable (less than 1%). In contrast, TATL boosted the Dice score by nearly 13% when using DenseNet-169 with the U-shape to segment Streaks regions in the ISIC2018 dataset, and by 11.4% when using ResNet151 with the Link-shape on a similar task. On combinations such as DenseNet-169 + U-shape and ResNet151 + Link-shape, TATL consistently provided significant improvements. In summary, we find that the proposed TATL transfer learning can operate with various network architectures, demonstrating its efficacy and generalizability.

Using pre-trained ImageNet models has been a common practice in many computer vision applications because of the diversity of the ImageNet dataset, which makes the pre-trained models stable and able to detect local relationships. In contrast, TATL does not use additional data sources and relies only on the dataset at hand to create and learn the skin attribute regions. As a result, TATL's performance depends on the amount of training data available for the current task.
Nevertheless, the results in Section 5.3 showed that TATL achieved promising results on standard benchmarks using all the labeled data provided. In this section, we explore the stability of TATL under different data sizes. For each skin feature, its testing data are reserved for evaluation, and we subsample a portion of the remaining data for training. In particular, we vary the amount of training data for each skin feature from 100% (the original experiments in Section 5.3) down to only 40% and consider three competitors:
• TATL using a pre-trained ImageNet model;
• TATL with a random initialization;
• only a pre-trained ImageNet model.

We report the results in Figure 4. In most cases, reducing the amount of training data results in worse performance for all methods. Moreover, both TATL versions achieved better performance than using only a pre-trained ImageNet model, except for the Milia feature at 80% training data. Interestingly, the two TATL versions achieved similar performance on skin attributes with limited training data (Negative Network and Streaks), while the gap was larger on the attributes with more training data (Pigment Network). Overall, however, both TATL variants achieved better results than the standard strategy of using a pre-trained ImageNet model. Furthermore, TATL with a pre-trained ImageNet model achieved the best average results across attributes.

As discussed in Section 3.2, TATL is inspired by dermatologists' behaviors, which motivates the learning of skin attribute regions. In this experiment, we examine the relationship between the features of skin attribute regions learned by the Attribute-Agnostic Segmenter and the specific features learned for each particular attribute by the Target-Segmenters. We hypothesize that there are additional benefits of performing supervised learning on attributes whose features are similar to those obtained from the Attribute-Agnostic learning step.
To validate this hypothesis, we consider two TATL variants:
• Downstream TATL: the standard TATL framework (Figure 3), where the model for each skin attribute is initialized from the Attribute-Agnostic step;
• Pretext TATL: a TATL variant that directly uses the model trained in the Attribute-Agnostic learning step for inference on each skin attribute, without additional supervised learning on the downstream task (Figure 3 with only the left block).

We run experiments on the ISIC 2017 benchmark with these two TATL variants and report the Dice score for each attribute after the learning procedures in Figure 5, as well as the loss curves with respect to the number of training steps in Figure 6. Note that the Pigment Network has the most training samples among the four attributes, while the remaining three, Streaks, Negative Network, and Milia, are considered the minority classes. Figure 5 shows that for the Pigment Network feature, the improvement of Downstream TATL over Pretext TATL is almost negligible, which indicates that the benefit of additional supervised training is minor for attributes with ample training data. In contrast, the performance gaps on the minor attributes are much larger, suggesting that TATL brings more benefits to the classes with limited training samples. We further verify this result in Figure 6, where the validation loss for the Pigment Network attribute plateaus very early (at around 300 iterations) while it continues to decrease for the minor attributes. Overall, these experiments show that the downstream fine-tuning step is particularly beneficial for the small classes. Therefore, one can infer that TATL, with all three stages, can significantly boost the performance on the classes with limited training data, which are the most challenging to improve.
In this experiment, we explore the benefits of TATL compared to other training paradigms, namely self-supervised learning (SSL), training with data augmentation, and attention-based methods. Note that TATL is related to the SSL paradigm in that both derive better pre-trained models by solving auxiliary tasks. For the SSL baselines, we implement an additional pre-training phase to replace TATL's first and second stages. In particular, we consider the task of predicting the rotation angle applied to the input (Gidaris et al., 2018) or reconstructing the image after scrambling its pixels (Chen et al., 2019). We also consider supervised training with data augmentation to increase the total number of training samples, and an attention-based architecture that suppresses irrelevant regions in an input image while focusing on salient features useful for a specific task (Oktay et al., 2018). In summary, given the same U-shape architecture and a pre-trained b0-EfficientNet as the network backbone, we compare TATL against the standard pre-trained ImageNet models and the following training strategies:
• self-supervised learning with the image rotation-based method (Gidaris et al., 2018);
• self-supervised learning with the image context restoration method (Chen et al., 2019);
• supervised training with data augmentation, in which we use random rotation, flip, shift, brightness, or zoom;
• the U-Eff network with attention gates as proposed in Oktay et al. (2018).

Table 9 presents the results, with the metrics computed by averaging over all skin attributes in the ISIC-2018 challenge. In general, both the attention approach and the image reconstruction SSL provide marginal improvements over the traditional ImageNet initialization on the U-shape design. However, TATL still outperforms these strategies on both evaluation metrics and architecture designs.
This evidence confirms our finding that transferring knowledge from the Attribute-Agnostic Segmenter is beneficial for the skin-attribute segmentation task.

In Section 5.7, we demonstrated that TATL could outperform standard training paradigms such as using a pre-trained model and SSL methods. This section explores how the theoretical insights developed in Section 4.5 support the empirical success of TATL. Recall that, from Proposition 1, one can infer that under the same SGD algorithm and the same hyper-parameters, such as the learning rate or the number of epochs, a model is expected to achieve a small generalization error on a testing set if its error on the corresponding training data, measured at the point of initialization (without any supervised learning), is small. Given that TATL achieved lower testing errors in the experiments (Tables 8 and 9), we now conduct a test to verify explicitly whether these results correspond to TATL having the lowest generalization error bound among the compared strategies. In particular, we estimate the generalization error bound on the right-hand side of Eq. (6) on ISIC 2018 for four cases: our TATL, pre-trained ImageNet, rotation-based SSL (Gidaris et al., 2018), and image context reconstruction-based SSL (Chen et al., 2019). We do not compare with the attention-based method and the data augmentation approach since both use the same pre-trained ImageNet initialization. For each method, we use the U-Eff network and run a full pass over all training samples of each attribute to estimate the spectral norm of the Hessian matrix and the empirical risk R_S. Here, the largest eigenvalue is approximated by the power iteration method (Solomon, 2015). We set K = 4 and c = 0.01 for all attributes and present the relative relations among categories in Figure 7. It is noteworthy that TATL obtained the lowest generalization error bound values for all skin attributes, especially for Streaks and Negative Network.
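To make the eigenvalue estimate mentioned above concrete, the sketch below applies power iteration to a small explicit symmetric matrix. It is only an illustration of the numerical method: the matrix stands in for a Hessian, and the function names, iteration count, and seeded random start vector are assumptions of this example rather than details from the experiments.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, num_iters=200, seed=0):
    """Estimate the spectral norm of a symmetric matrix, i.e. the largest
    absolute eigenvalue, by repeated multiplication and normalization."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eigenvalue = 0.0
    for _ in range(num_iters):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # v lies in the null space; nothing to estimate
        v = [x / norm for x in w]
        # Rayleigh quotient of the normalized iterate.
        eigenvalue = sum(x * y for x, y in zip(v, matvec(matrix, v)))
    return abs(eigenvalue)


diag_estimate = power_iteration([[2.0, 0.0], [0.0, 1.0]])
dense_estimate = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```

For a neural network, the Hessian is far too large to store explicitly, so in practice `matvec` would be replaced by a Hessian-vector product computed with automatic differentiation; the iteration itself is unchanged.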
These observations are consistent with our experiments, in which TATL outperformed the other training strategies (as shown in Table 9) and surpassed the pre-trained ImageNet baseline by a wide margin on the two attributes Negative Network and Streaks (Table 8, Eff. Net-b0 row, ISIC 2018 column). In conclusion, we argue that TATL's efficacy is supported in both experimental and theoretical settings.

Figure 8 illustrates some sample results of our proposed TATL model. The ground-truth segmentation is highlighted in green, and our prediction is marked in red. For attributes with many training samples, such as Globules or Pigment Network, TATL produces a better segmentation that covers most of the ground-truth areas. Furthermore, although the predictions for Streaks and Negative Network missed some affected regions, they still captured the main locations of these attributes. Our model also provides extra information to end-users through the predicted union regions. For instance, in Figure 9 we show the correlation between the union regions and typical skin attributes for the same example as in Figure 8. For each skin attribute, the ground truth is shown in green and drawn over the binary map of the predicted union region. The predicted union can cover both large regions, as in Pigment Network, and small disconnected regions, as in Negative Network (Figure 9a) and Globules (Figure 9b). This result can be especially useful in scenarios where dermatologists cannot detect deformed or disconnected regions. In such cases, the union region provided by TATL can help by highlighting the region of interest. As a result, TATL's outputs can not only speed up the diagnosis process but also help dermatologists diagnose better, which is critical because they, not our model, will make the final diagnosis.
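The union region discussed above can be illustrated with a small sketch. It assumes that the union is the pixel-wise logical OR of the per-attribute binary masks; the dictionary layout, the toy masks, and the function name `union_mask` are hypothetical and not taken from the authors' code.

```python
def union_mask(attribute_masks):
    """Combine per-attribute binary masks (same shape) into one
    attribute-agnostic mask: a pixel is 1 if any attribute marks it."""
    masks = list(attribute_masks.values())
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [int(any(mask[r][c] for mask in masks)) for c in range(width)]
        for r in range(height)
    ]


# Toy 2x3 masks for two attributes with disjoint regions.
masks = {
    "pigment_network": [[1, 1, 0], [0, 0, 0]],
    "globules": [[0, 0, 0], [0, 0, 1]],
}
union = union_mask(masks)
```

Because the union mask is non-empty whenever any single attribute is present, it pools the annotations of all attributes, which matches the intuition that the agnostic segmenter sees more positive pixels than any individual attribute model.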
Figure 9 (panels a and b): An illustration demonstrating the usefulness of our union segmentation, which can help dermatologists know which area they should focus on. Our predicted union covers most of the ground-truth attribute regions, which are depicted in green.

Our work proposes a novel strategy to initialize the attribute segmenters' parameters using an attribute-agnostic segmenter trained on abnormal skin regions. We empirically demonstrate its benefit over the traditional strategy of using ImageNet pre-trained models. Based on the promising results, we outline several potential and interesting directions for future research.

Generalization to Other Medical Image Analysis Tasks. We develop TATL specifically to address the skin-attribute detection problem. It would be interesting to test TATL's generalization capabilities on other medical image analysis tasks, where using pre-trained ImageNet models is likely to be suboptimal. For example, tasks such as brain lesion segmentation (Hu et al., 2018; Duy et al., 2018; Mallick et al., 2019; Nguyen et al., 2017), abnormal chest detection (Hashir et al., 2020; Ibrahim et al., 2021; Nguyen et al., 2021), or diabetic retinopathy lesion segmentation share similar characteristics with our problem setting: the data are often imbalanced, and the classes share semantic features that can be leveraged to improve the overall performance. Therefore, it is of interest to explore the applications of TATL to such tasks and make possible adjustments.

Real-world Applications Using TATL. Our ultimate goal is to develop a model that not only makes predictions but also provides helpful information and assists dermatologists in making the final decisions. Our TATL framework realizes this goal by providing a mask of skin attribute regions regardless of their attributes, compensating for inaccurate predictions of later stages, especially on minor classes.
A promising future direction for TATL is integrating it into an online learning setting with a human in the loop (Nunnari et al., 2021a). In particular, a model is trained to detect some diseases and then deployed in a real-world environment with a stream of data and feedback from dermatologists and patients. In such scenarios, the model can continuously improve its performance by accumulating attribute-agnostic information via the dermatologists' feedback and then transferring it to the target segmenters, allowing for fast adaptation to new patients and more accurate predictions over time.

A Holistic Medical Image Analysis Method Beyond TATL. Intuitively, TATL works by achieving a tighter generalization error bound than other initialization strategies. However, the theoretical result in Proposition 1 bounds the error only in terms of the initialization parameters. In practice, additional aspects can affect the model's generalization, such as (i) the number of source tasks (training classes in our case); (ii) which properties among those tasks can be safely transferred; and (iii) beyond an initialization, which mechanisms allow for a successful knowledge transfer. Such properties are not yet rigorously studied, and exploring them can potentially provide a holistic method for medical image analysis: a method that not only starts with a quality initialization but also exploits the complex relationships among medical images to improve its performance over time. Such a method can provide accurate detection and assist dermatologists in diagnosing rare diseases more precisely, which results in effective treatments at a lower cost.

We have investigated the limitations of the common fine-tuning strategy in state-of-the-art skin attributes detection methods. We show that such strategies are not optimal when the current task is largely different from ImageNet and contains limited training data.
This limitation motivated us to develop TATL, a novel transfer learning method that exploits all attribute data to train the agnostic segmenter. By transferring the agnostic segmenter's knowledge to each attribute classifier, TATL alleviates the scarcity of training data, especially for small classes, and allows knowledge sharing among attribute models. Through extensive experiments on the ISIC 2017 and ISIC 2018 benchmarks, we demonstrate the efficacy of TATL over existing state-of-the-art methods. Moreover, TATL is shown to work effectively with various backbone networks while enjoying minimal model and computational complexity. Finally, we present theoretical insights suggesting that TATL works in practice by bridging the domain gap via the task-agnostic segmenter, thus leading to competitive performance.

Table 4

Table A.10: Standard deviation of the Jaccard metric on the ISIC2018 challenge. The best results are in bold font. Stage 1: segmenting the lesion region, Stage 2: training the Attribute-Agnostic Segmenter, Stage 3: training the Target-Segmenters.

Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning
Segnet: A deep convolutional encoder-decoder architecture for image segmentation
A systematic study of the class imbalance problem in convolutional neural networks
Dermoscopy Image Analysis: Overview and Future Directions
Linknet: Exploiting encoder representations for efficient semantic segmentation
Self-supervised learning for medical image analysis using image context restoration
A simple framework for contrastive learning of visual representations
Cats or cat scans: transfer learning from natural or medical image source data sets?
Xception: Deep learning with depthwise separable convolutions
Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic)
Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic), in
Clinically applicable deep learning for diagnosis and referral in retinal disease
Accurate brain extraction using active shape model and convolutional neural networks
Optimization for medical image segmentation: theory and practice when evaluating with dice score or jaccard index
Unsupervised representation learning by predicting image rotations
Deep learning
Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs
Quantifying the value of lateral views in deep learning for chest x-rays
Mask r-cnn
Deep residual learning for image recognition
Sample-efficient deep learning for covid-19 diagnosis based on ct scans
Deep learning for image-based cancer detection and diagnosis-a survey
Densely connected convolutional networks
Pneumonia classification using deep learning from chest x-ray images during covid-19
A survey of loss functions for semantic segmentation
Fully convolutional neural networks to detect clinical dermoscopic features
Leveraging transfer learning for segmenting lesions and their attributes in dermoscopy images
Data-dependent stability of stochastic gradient descent
Learning representations for automatic colorization
Unsupervised representation learning by sorting sequences
Joint-task self-supervised learning for temporal correspondence
Brain mri image classification for cancer detection using deep wavelet autoencoder-based deep neural network
Computer Aided Diagnostic Support System for Skin Cancer: A Review of Techniques and Algorithms
Shuffle and learn: unsupervised learning using temporal order verification
An attention mechanism with multiple knowledge sources for covid-19 detection from ct images
3d-brain segmentation using deep neural network and gaussian mixture model
A visually explainable learning system for skin lesion detection using multiscale input with attention u-net
Anomaly detection for skin lesion images using replicator neural networks, in: International Cross-Domain Conference for Machine Learning and Knowledge Extraction
On the overlap between grad-cam saliency maps and explainable visual features in skin cancer images
A software toolbox for deploying deep learning decision support systems with xai capabilities
Pytorch: An imperative style, high-performance deep learning library
Context encoders: Feature learning by inpainting
Negative pigment network: an additional dermoscopic feature for the diagnosis of melanoma
Transfusion: Understanding transfer learning for medical imaging
Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning
Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015
U-net: Convolutional networks for biomedical image segmentation
Tversky loss function for image segmentation using 3d fully convolutional deep networks
Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments
Cancer facts & figures 2021
Numerical algorithms: methods for computer vision, machine learning, and graphics
Inception-v4, inception-resnet and the impact of residual connections on learning
A survey on deep transfer learning
Efficientnet: Rethinking model scaling for convolutional neural networks
Tracking emerges by colorizing videos
Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation
Learning correspondence from the cycle-consistency of time
Cutaneous melanoma: etiology and therapy
Unsupervised feature learning via non-parametric instance discrimination
Time required for a complete skin examination with and without dermoscopy: a prospective, randomized multicenter study
Colorful image colorization

This research has been supported by the Ki-Para-Mi project (BMBF, 01IS1903-8B), the pAItient project (BMG, 2520DAT0P2), and the Endowed Chair of Applied Artificial Intelligence, Oldenburg University. Binh T. Nguyen is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2019-18-01. We would like to thank Dr. Fabrizio Nunnari (German Research Centre for Artificial Intelligence, Germany) and Dr. Paul Swoboda (Max Planck Institute for Informatics, Germany) for their valuable discussions.

Table 3