title: Malignancy Prediction and Lesion Identification from Clinical Dermatological Images
authors: Xia, Meng; Kheterpal, Meenal K.; Wong, Samantha C.; Park, Christine; Ratliff, William; Carin, Lawrence; Henao, Ricardo
date: 2021-04-02

We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images, which can be acquired interchangeably via smartphone or dermoscopy capture. Additionally, we do not assume that images contain single lesions, thus the framework supports both focal and wide-field images. Specifically, we propose a two-stage approach in which the model first identifies all lesions present in the image regardless of sub-type or likelihood of malignancy, then estimates their likelihood of malignancy and, through aggregation, also generates an image-level likelihood of malignancy that can be used for high-level screening processes. Further, we consider augmenting the proposed approach with clinical covariates (from electronic health records) and publicly available data (the ISIC dataset). Comprehensive experiments validated on an independent test dataset demonstrate that i) the proposed approach outperforms alternative model architectures; ii) the model based on images outperforms a pure clinical model by a large margin, and the combination of images and clinical data does not significantly improve over the image-only model; and iii) the proposed framework offers comparable performance in terms of malignancy classification relative to three board-certified dermatologists with different levels of experience.

Prior to the COVID-19 pandemic, access to dermatology care was challenging due to limited supply and increasing demand. According to a survey study of dermatologists, the mean ± standard deviation (SD) waiting time was 33±32 days, 64% of appointments exceeded the 3-week criterion cutoff, and 63% of appointments exceeded the 2-week criterion cutoff for established patients. During the COVID-19 pandemic, the number of dermatology consultations was reduced by 80-90%, limiting visits to urgent issues only and leading to delays in care of dermatologic concerns. Moreover, the issue of access is very significant for the growing Medicare population, expected to account for 1 in 5 patients by 2030 [1], due to a higher incidence of skin cancer. Access issues in dermatology are concerning as there has been an increasing incidence of skin cancers, particularly a 3-fold increase in melanoma over the last 40 years [2]. Many of the skin lesions of concern are screened by primary care physicians (PCPs). In fact, up to one third of primary care visits address at least one skin problem, and skin tumors are the most common reason for referral to dermatology [3]. A high volume of referrals places a strain on specialty care, delaying visits for high-risk cases. Given the expected rise in baby boomers, who have a significantly increased risk of skin cancer, there is an urgent need to equip primary care providers to screen and risk-stratify patients in a real-time, high-quality and cost-conscious fashion. PCPs have variable experience and training in dermatology, often resulting in low concordance between their evaluation and that of dermatologists [3].
A consistent clinical decision support (CDS) system has the potential to mitigate this variability and to serve as a powerful risk stratification tool, leveraging the frontline network of providers to enhance access to high-quality, high-value care. In addition, such a tool can aid tele-dermatology workflows that have emerged during the global pandemic.

Over the last decade, several studies in the field of dermatology have demonstrated the promise of deep learning models such as convolutional neural networks (CNNs) for the classification of skin lesions [4, 5], with dermoscopy-based machine learning (ML) algorithms reaching sensitivities and specificities for melanoma diagnosis of 87.6% (95% CI: 72.72-100.0) and 83.5% (95% CI: 60.92-100.0), respectively, by meta-analysis [6]. Several authors have reported superior performance of ML algorithms for classification of squamous cell carcinoma (SCC) and basal cell carcinoma (BCC), with larger datasets improving performance [7, 4]. From a machine-learning methods perspective, a common approach for classification with dermoscopy images consists of refining pre-trained CNN architectures such as VGG16 as in [8], or AlexNet after image pre-processing, e.g., background removal, as in [9]. Alternatively, some approaches consider lesion sub-types independently [10], use sonified images [11], or combine clinical data with images to increase the information available to the model for prediction [12]. However, dermoscopy images generally have good quality, high resolution and minimal background noise, making them less challenging to analyze than clinical, wide-field images. Beyond dermoscopy images, similar refinement approaches have been proposed based on architectures such as ResNet152 [7, 13], with additional pre-processing (illumination correction) [14], by using detection models to account for the non-informative background [15, 16], or by first extracting features with CNN-based models, e.g., Inception v2, and then performing feature classification with other machine learning methods [11]. Moreover, comparative studies [17, 5] have shown that models based on deep learning architectures can perform similarly to dermatologists on various classification tasks. However, these ML algorithms are often developed with curated image datasets containing high-quality clinical and dermoscopy photographs with limited skin variability, i.e., majority-Caucasian or majority-Asian sets such as the ISIC dataset (dermoscopy), the Asan dataset, the Hallym dataset, MED-NODE and the Edinburgh dataset [7]. The use of such algorithms, trained on images often acquired from high-quality cameras and/or dermatoscopes, may be limited to specialty healthcare facilities and research settings, with questionable transferability to resource-limited settings and primary care, thus creating a gap between healthcare providers and patients. Smartphone-based imaging is a promising image capture platform for bridging this gap, offering several advantages including portability, cost-effectiveness and connectivity to electronic medical records for secure image transfer and storage. To democratize screening and triage in the primary care setting, an ideal ML-based CDS tool should be trained, validated and tested on smartphone-acquired clinical and dermoscopy images, representative of the clinical setting and patient population, for the greatest usability and validity.
While consumer-grade smartphone images present quality challenges such as variability in angle, lighting, distance from the lesion of interest and blurriness, they show promise for improving clinical workflows. Herein, we propose a two-stage approach to detect skin lesions of interest in wide-field images taken with consumer-grade smartphone devices, followed by binary lesion classification into two groups, Malignant vs. Benign, covering all skin cancers (melanoma, basal cell carcinoma and squamous cell carcinoma) and the most common benign tumors. Ground-truth malignancy was ascertained via biopsy, as opposed to consensus adjudication. As a result, the proposed approach can be integrated and generalized into primary care and dermatology clinical workflows. Importantly, our work also differs from existing approaches in that our framework can detect lesions from both wide-field clinical and dermoscopy images acquired with smartphones.

This paper is organized as follows: in Section 2 we present the problem formulation and the proposed approach. In Section 3 we describe the data used, the implementation details, and quantitative and qualitative experimental results. Finally, in Section 4 we conclude with a discussion of the proposed approach and acknowledge some limitations of the study.

We represent a set of annotated images as $\mathcal{D} = \{X_n, Z_n, U_n, y_n\}_{n=1}^N$, where $N$ is the number of instances in the dataset, $X_n \in \mathbb{R}^{h \times w \times 3}$ denotes a color (RGB) image of size $w \times h$ (width × height) pixels, $Z_n$ is a non-empty set of annotations $Z_n = \{z_{n1}, \ldots, z_{nm_n}\}$, with elements $z_{ni}$ corresponding to the $i$-th region of interest (ROI) represented as a bounding box with coordinates $(x_{ni}, y_{ni}, w_{ni}, h_{ni})$ (horizontal center, vertical center, width, height), and $U_n = \{u_{n1}, \ldots, u_{nm_n}\}$ are the ROI labels, where $m_n$ is the number of ROIs in image $X_n$. Further, $y_n \in \{0, 1\}$ is used to indicate the global image label. In our specific use case, the images in $\mathcal{D}$ are a combination of smartphone-acquired wide-field and dermoscopy images with ROIs of 8 different biopsy-confirmed lesion types (ROI labels): Melanoma, Melanocytic Nevus, Basal Cell Carcinoma, Actinic Keratosis/Bowen's Disease, Benign Keratosis, Dermatofibroma, Vascular Lesions and Other Benign lesions. The location of the different lesions was obtained by manual annotation as described below in Section 3.1. For malignancy prediction, the set of malignant lesions, denoted $\mathcal{M}$, is defined as Melanoma, Basal Cell Carcinoma, and Actinic Keratosis/Bowen's Disease/Squamous Cell Carcinoma, while the set of benign lesions contains all the other lesion types. For the global image label $y_n$, a whole image (smartphone or dermoscopy) is deemed malignant if at least one of its ROI labels is in the malignant set, $\mathcal{M}$. Below, we introduce deep-learning-based models for malignancy prediction, lesion identification and image-level classification for end-to-end processing. An illustration of the two-step malignancy prediction and lesion identification framework is presented in Figure 1.

Assuming we know the position of the ROIs, i.e., $\{X_n, Z_n\}_{n=1}^N$ are always available, the problem of predicting whether a lesion is malignant can be formulated as a binary classification task. Specifically, we specify a function $f_\theta(\cdot)$, parameterized by $\theta$, whose output is the probability that a single lesion is consistent with a malignant histopathological finding in the area, i.e.,

$p(u_{ni} \in \mathcal{M} \mid X_n, z_{ni}) = f_\theta(X_n, z_{ni})$,   (1)

where $f_\theta(\cdot)$ is a convolutional neural network that takes the region of $X_n$ defined by $z_{ni}$ as input. In practice, we use a ResNet-50 architecture [18], with additional details described in Section 3.2.
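To make the formulation concrete, a minimal sketch of $f_\theta(\cdot)$ is given below: a ResNet-50 with a single sigmoid output applied to the image region defined by each ROI. The crop-and-resize helper, the 224×224 input size and the function names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class MalignancyClassifier(nn.Module):
    """Sketch of f_theta: ResNet-50 backbone with a single-logit head that
    outputs p(u_ni in M | X_n, z_ni) for a cropped lesion region."""
    def __init__(self, pretrained=True):
        super().__init__()
        self.backbone = models.resnet50(pretrained=pretrained)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, crops):                  # crops: (B, 3, H, W)
        return torch.sigmoid(self.backbone(crops)).squeeze(1)  # (B,) probabilities

def crop_roi(image, roi, out_size=224):
    """Extract and resize the region of X_n defined by z_ni = (x, y, w, h),
    given as (horizontal center, vertical center, width, height) in pixels."""
    x, y, w, h = roi
    top, left = max(int(y - h / 2), 0), max(int(x - w / 2), 0)
    crop = image[:, top:top + int(h), left:left + int(w)]      # image: (3, H, W)
    return F.interpolate(crop.unsqueeze(0), size=(out_size, out_size),
                         mode="bilinear", align_corners=False).squeeze(0)
```

In the two-stage model described later, the same classifier is simply applied to detected rather than manually annotated regions.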
Above we assume that the location (ROI) of the lesions is known, which may be the case in dermoscopy images, as illustrated in Figure 1. However, in general, wide-field dermatology images are likely to contain multiple lesions whose locations are not known or recorded as part of clinical practice. Fortunately, if lesion locations are available for a set of images (via manual annotation), the task can be formulated as a supervised object detection problem, in which the model takes the whole image as input and outputs a collection of predicted ROIs along with their likelihood of belonging to a specific group. Formally, the detection model maps the whole image to a set of predictions,

$X_n \mapsto \{(\hat{z}_{ni}, \hat{p}_{ni})\}_{i=1}^{\hat{m}_n}$,   (2)

where $\hat{p}_{ni} = [\hat{p}_{ni1}, \ldots, \hat{p}_{niC}] \in (0, 1)^C$ is the likelihood that the predicted region $\hat{z}_{ni} = (\hat{x}_{ni}, \hat{y}_{ni}, \hat{w}_{ni}, \hat{h}_{ni})$ belongs to one of $C$ groups of interest, i.e., $p(\hat{z}_{ni} \in c) = \hat{p}_{nic}$. In our case, we consider three possible choices for $C$, namely, i) $C = 1$, denoted one-class, where the model seeks to identify any lesion regardless of type; ii) $C = 2$, denoted malignancy, in which the model seeks to separately identify malignant and benign lesions; and iii) $C = 8$, denoted sub-type, in which the model is aware of all lesion types of interest. Note that we are mainly interested in finding malignant lesions among all lesions present in an image, as opposed to identifying the type of every lesion in the image. Nevertheless, it may be beneficial for the model to be aware that different types of lesions may have common characteristics, which may be leveraged for improved detection. Conversely, provided that some lesion types are substantially rarer than others (e.g., dermatofibroma and vascular lesions constitute only 1% each of all the lesions in the dataset described in Section 3.1), seeking to identify all lesion types may be detrimental to the overall detection performance. This label granularity trade-off will be explored in the experiments. In practice, we use a Faster-RCNN (region-based convolutional neural network) [19] with a feature pyramid network (FPN) [20] and a ResNet-50 [18] backbone as the object detection architecture. Implementation details can be found in Section 3.2.
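For illustration, the sketch below builds a one-class lesion detector with torchvision's Faster-RCNN/FPN implementation; the experiments in this paper were run with Detectron (see Section 3.2), so this is an assumed, simplified stand-in rather than the exact configuration used. Note that torchvision returns boxes in corner format (xmin, ymin, xmax, ymax), whereas the annotations above use a center/width/height convention.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_lesion_detector(num_classes=2):
    """One-class lesion detector (C = 1): background vs. lesion.
    Setting num_classes=3 or 9 would give the malignancy and sub-type variants."""
    model = fasterrcnn_resnet50_fpn(pretrained=True)   # ResNet-50 + FPN backbone
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

detector = build_lesion_detector()
detector.eval()
with torch.no_grad():
    image = torch.rand(3, 800, 1066)          # dummy wide-field image in [0, 1]
    preds = detector([image])[0]              # dict with 'boxes', 'labels', 'scores'
    # preds['boxes']: (m_hat, 4) predicted ROIs; preds['scores']: (m_hat,) confidences
```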
For screening purposes, one may be interested in estimating whether an image is likely to contain a malignant lesion, so the case can be directed to the appropriate dermatology specialist. In such a case, the task can be formulated as a whole-image classification problem,

$p(y_n = 1 \mid X_n) = h_\phi(X_n)$,   (3)

where $p(y_n = 1 \mid X_n) \in (0, 1)$ is the likelihood that image $X_n$ contains a malignant lesion. The model in (3) can be implemented in a variety of different ways. Here we consider three options, two of which leverage the malignancy prediction and lesion identification models described above.

Direct image-level classification. $h_\phi(\cdot)$ is specified as a convolutional neural network, e.g., ResNet-50 [18] in our experiments, to which the whole image $X_n$ is fed as input. Though this is a very simple model that has advantages from an implementation perspective, it lacks the context provided by (likely) ROIs that would make it less susceptible to interference from non-informative background variation, thus negatively impacting classification performance.

Two-stage approach. $h_\phi(\cdot)$ is specified as the combination of the one-class lesion identification and the malignancy prediction models, in which detected lesions are assigned a likelihood of malignancy using (1). This is illustrated in Figure 1 (Right). Then we obtain

$p(y_n = 1 \mid X_n) = a\big(\{f_\theta(X_n, \hat{z}_{ni})\}_{i=1}^{\hat{m}_n}\big)$,   (4)

where we have replaced the ground-truth locations $z_{ni}$ in (1) with the $\hat{m}_n$ predicted locations from (2), and $a(\cdot)$ is a permutation-invariant aggregation function. In the experiments we consider simple parameter-free options, (5)-(7): average pooling, $a(\{q_i\}) = \frac{1}{\hat{m}_n}\sum_i q_i$; noisy-OR pooling, $a(\{q_i\}) = 1 - \prod_i (1 - q_i)$; and max pooling, $a(\{q_i\}) = \max_i q_i$, where $q_i$ denotes the per-lesion malignancy likelihood being aggregated. Other more sophisticated (parametric) options, such as noisy AND [21] and attention mechanisms [22], may further improve performance but are left as interesting future work.

One-step approach. $h_\phi(\cdot)$ is specified directly from the sub-type lesion identification model in (2), by aggregating over detected lesions the likelihood mass that each $\hat{p}_{ni}$ assigns to the malignant lesion types in $\mathcal{M}$, where $a(\cdot)$ is either (5), (6) or (7).

From the options described above, the direct image-level classification approach is conceptually simpler and easier to implement, but it does not provide an explanation (lesion locations) for its predictions. The one-step approach is a more principled end-to-end system that directly estimates lesion locations, lesion sub-type likelihoods and the overall likelihood of malignancy; however, it may not be suitable in situations where the availability of labeled lesion sub-types is limited, in which case one may also consider replacing the sub-type detection model with the simpler malignancy detection model. Akin to this simplified one-step approach, the two-stage approach provides a balanced trade-off between the ability to estimate the location of the lesions and the need to identify lesion sub-types. All these options will be quantitatively compared in the experiments below.
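As a sketch of the parameter-free aggregation functions $a(\cdot)$ above, the three pooling operators can be written in a few lines; the tensor of per-lesion malignancy probabilities and its example values are purely illustrative.

```python
import torch

def average_pool(p_hat):   # p_hat: (m_hat,) per-lesion malignancy probabilities
    return p_hat.mean()

def max_pool(p_hat):
    return p_hat.max()

def noisy_or_pool(p_hat):
    # The image is malignant unless every detected lesion is benign.
    return 1.0 - torch.prod(1.0 - p_hat)

p_hat = torch.tensor([0.10, 0.85, 0.30])      # e.g., three detected lesions
print(average_pool(p_hat))                    # tensor(0.4167)
print(max_pool(p_hat))                        # tensor(0.8500)
print(noisy_or_pool(p_hat))                   # tensor(0.9055)
```

Average pooling dilutes a single suspicious lesion among many benign ones, whereas max and noisy-OR pooling are dominated by the most suspicious regions; Table 5 compares these choices empirically.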
Comprehensive experiments were performed to analyze the performance of the proposed approach. First, we describe the details of the dataset and the models being considered, and present evaluation metrics for each task, while comparing the various design choices described in the previous section. Then, we study the effects of adding clinical covariates and publicly available data (the ISIC dataset).

To compare the proposed model with human experts, we had three dermatology-trained medical doctors with different levels of experience label each of the images without access to the biopsy report or context from the medical record. In terms of experience, MJ has 3 years of dermoscopy experience, AS has 6 years of dermoscopy experience and MK has 10 years of dermoscopy experience. Provided that MK also participated in lesion annotation with access to biopsy report information, we allowed 12 months of separation between the lesion annotation and malignancy adjudication sessions. Detailed lesion type counts are presented in Table 1. The average area of the lesions is 186,276 (7,379-153,169) pixels² (roughly 432 × 432 pixels in size), while the average area of the images is 7,200,000 (3,145,728-12,000,000) pixels² (roughly 2683 × 2683 pixels in size).

Malignancy Classification. For malignancy classification we use a ResNet-50 architecture [18], as shown in Figure 1 (Bottom right). The feature maps obtained from the last convolutional block are aggregated via average pooling and then fed through a fully connected layer with sigmoid activation that produces the likelihood of malignancy. The model was initialized from a ResNet-50 pre-trained on ImageNet and then trained (refined) using a stochastic gradient descent (SGD) optimizer for 100 epochs, with batch size 64, initial learning rate 0.01, momentum 0.9 and weight decay 1e-4. The learning rate was decayed using a half-period cosine function, i.e., $\eta(t) = 0.01 \times [0.5 + 0.5 \cos(t\pi/T_{\max})]$, where $t$ and $T_{\max}$ are the current step and the maximum number of steps, respectively.

Lesion Identification. The lesion identification model is specified as a Faster-RCNN [19] with a FPN [20] and a ResNet-50 [18] backbone. The feature extraction module is a ResNet-50 truncated at the 4-th block. The FPN then reconstructs the features at higher resolutions for better multi-scale detection [20]. Higher-resolution feature maps are built as a combination of the same-resolution ResNet-50 feature map and the next lower-resolution feature map from the FPN, as illustrated in Figure 1 (Top right). The combination of feature maps from the last layer of the feature extraction module and all feature maps from the FPN are then used for region proposal and ROI pooling; see [20] for further details. The model was trained using an SGD optimizer for 80,000 steps, with a batch size of 512 images, initial learning rate 0.001, momentum 0.9 and weight decay 1e-4. The learning rate was decayed 10x at the 60,000-th and 80,000-th steps, respectively.

Direct Image-Level Classification Model. The direct image-level classification model in Section 2.3 has the same architecture and optimization parameters as the malignancy classification model described above.

Clinical Model. The clinical model was built using logistic regression with standardized input covariates and discrete (categorical) covariates encoded as one-hot vectors.

Combined Model. In order to combine the clinical covariates with the images into a single model, we use the malignancy classification model as the backbone while freezing all convolutional layers during training. Then, we concatenate the standardized input covariates and the global average-pooled convolutional feature maps, and feed them through a fully connected layer with sigmoid activation that produces the likelihood of malignancy. The combined model was trained using an SGD optimizer for 30 epochs, with batch size 64, initial learning rate 0.001, momentum 0.9 and weight decay 1e-4. The learning rate was decayed using a half-period cosine function, as in the malignancy classification model.

Implementation. We used Detectron [25] for the lesion identification model. All other models were coded in Python 3.6.3 using the PyTorch 1.3.0 framework, except for the clinical model, which was implemented using scikit-learn 0.19.1. The source code for all the models used in the experiments is available (upon publication) at github.com/user/dummy.

For malignancy prediction, two threshold-free metrics of performance are reported, namely, the area under the curve (AUC) of the receiver operating characteristic (ROC) and the average precision (AP) of the precision-recall curve, both described below. The AUC is calculated as the area under the curve traced by the true positive rate and the false positive rate, $\mathrm{AUC} = \int_0^1 \mathrm{TPR}_t \, d(\mathrm{FPR}_t)$, where the threshold $t$ takes values in the set of sorted test predictions $\{\hat{p}_i\}_{i=1}^N$ from the model, and $\mathrm{TPR}_t$ and $\mathrm{FPR}_t$ are estimated as sample averages for a given threshold $t$. Similarly, the AP is calculated as the area under the precision-recall curve, $\mathrm{AP} = \int_0^1 \mathrm{PPV}_t \, d(\mathrm{TPR}_t)$, where $\mathrm{PPV}_t$ is the positive predictive value, or precision, for threshold $t$. The calculation of the AUC and AP areas follows the trapezoid rule. The intersection over union (IoU) is defined as the ratio between the overlap of the ground-truth and estimated ROIs, $\{z_{ni}\}_{i=1}^{m_n}$ and $\{\hat{z}_{ni}\}_{i=1}^{\hat{m}_n}$, respectively, and the union of their areas. For a given ROI, IoU = 1 indicates complete overlap between prediction and ground truth, while IoU = 0 indicates no overlap. In the experiments, we report the median and interquartile range of the IoU for all predictions in the test set. The mean average precision (mAP) is the AP calculated on the binarized predictions from the detection model, such that predictions with IoU ≥ t are counted as correct and those with IoU < t as incorrect, for a given IoU threshold t set to 0.5, 0.75 and the range (0.5, 0.95) in the experiments. These values are standard in object detection benchmarks; see for instance [26]. We also report the recall at IoU > 0 as a general, easy-to-interpret metric of the ability of the model to correctly identify lesions in the dataset. Specifically, we calculate it as the proportion of lesions (of any type) in the dataset for which predictions overlap with the ground truth.
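A brief sketch of these evaluation metrics: AUC and AP can be computed with scikit-learn's threshold sweeps, and IoU directly from box coordinates. The corner-format boxes and the toy labels and predictions below are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Threshold-free classification metrics (toy labels and predictions).
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.7, 0.4, 0.1, 0.9])
auc = roc_auc_score(y_true, y_prob)           # area under the ROC curve
ap = average_precision_score(y_true, y_prob)  # area under the precision-recall curve

def iou(box_a, box_b):
    """IoU for two boxes in (xmin, ymin, xmax, ymax) corner format
    (the paper's annotations are center/width/height and would need conversion)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# For mAP at threshold t, a predicted ROI counts as correct when
# iou(predicted_box, ground_truth_box) >= t, e.g., t = 0.5 or 0.75.
```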
Malignancy Prediction. First, we present results for the malignancy prediction task, for which we assume that lesions in the form of bounding boxes (ROIs) have been pre-identified from smartphone (wide-field) or dermoscopy images. Specifically, we use ground-truth lesions extracted from larger images using manual annotations, as previously described. Table 2 shows AUCs and APs for the malignancy prediction model described in Section 2.1 on the independent test dataset. We observe that

Malignancy Detection. Provided that in practice lesions are not likely to be pre-identified by clinicians, we present automatic detection (localization) results using the models presented in Section 2.2. Specifically, we consider three scenarios: i) one-class: all types of lesions combined; ii) malignancy: all types of lesions grouped into malignant and benign; and iii) sub-type: all types of lesions considered separately. Table 3 shows the mean average precision (mAP) at different thresholds, Recall (sensitivity) and IoU summaries (median and interquartile range), all on the independent test set. In order to make mAP comparable across the different scenarios, we calculate it for all lesions regardless of type, i.e., mAP is not calculated for each lesion type and then averaged, but rather by treating all predictions as lesions. We observe that, in general terms, the one-class lesion identification model outperforms the more granular malignancy and sub-type approaches. This observation is also consistent in terms of Recall and IoU. For the one-class model specifically, 82.9% of the predicted regions are true lesions at IoU ≥ 0.5 (at least 50% overlap with ground-truth lesions), whereas the precision drops to 26.8% at the more stringent IoU ≥ 0.75. Interestingly, the 95.6% Recall indicates that the one-class model is able to capture most of the true lesions at IoU > 0, and at least 50% of the predicted regions have an IoU > 0.73 (IoU > 0.59 for 75% of the lesions) in the independent test set.

Image Classification. The image-level malignancy prediction results are reported in Figure 2. Predictions on the independent test set were obtained from the average-pooled image classification model in Section 2.3 with the one-class detection model in Section 2.2 and the malignancy prediction model in Section 2.1. From the performance metrics reported, we note that the proposed approach is comparable with manual classification by three expert dermatologists (AS, MK and MJ). Interestingly, on dermoscopy images, the model slightly outperforms two of the three dermatologists, and the difference in their performance is consistent with their years of experience, MK being the most experienced and best-performing dermatologist.
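For completeness, the sketch below shows how the two-stage image-level predictions reported in Figure 2 can be assembled from the pieces sketched earlier (detector, per-ROI classifier and average pooling); the helper names, the 0.5 detection-score cutoff and the crop handling are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def image_malignancy_score(image, detector, classifier, score_thresh=0.5, crop_size=224):
    """Two-stage prediction for a single image (3, H, W): detect lesions,
    score each detected region for malignancy, then average-pool the scores.
    `detector` and `classifier` follow the earlier sketches (assumed names)."""
    detector.eval()
    classifier.eval()
    with torch.no_grad():
        det = detector([image])[0]                 # one-class lesion detections
        boxes = det["boxes"][det["scores"] >= score_thresh]
        if boxes.shape[0] == 0:
            return torch.tensor(0.0)               # no lesion found
        crops = []
        for xmin, ymin, xmax, ymax in boxes.tolist():
            crop = image[:, int(ymin):int(ymax), int(xmin):int(xmax)]
            crops.append(F.interpolate(crop.unsqueeze(0), size=(crop_size, crop_size),
                                       mode="bilinear", align_corners=False))
        p_hat = classifier(torch.cat(crops))       # per-lesion malignancy probabilities
        return p_hat.mean()                        # average-pooling aggregation
```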
Additional results comparing the different image-level malignancy prediction strategies described in Section 2.3, namely, i) direct image-level classification, ii) two-stage with one-class lesion identification, and one-step with iii) malignancy or iv) sub-type identification models, all with average-pooling aggregation, are presented in Table 4. In terms of AUC, the one-class approach consistently outperforms the others, while in terms of AP, sub-type is slightly better. Interestingly, the direct image-level classification model, which takes the whole image as input without attempting to identify the lesions, performs reasonably well and may be considered in scenarios where computational resources are limited, e.g., mobile and edge devices.

Further, we also compare the different aggregation strategies (average, max and noisy OR pooling) described in Section 2.3 and the lesion identification models (one-class, malignancy and sub-types) described in Section 2.2; results are presented in Table 5, from which we see that the combination of average pooling and one-class lesion detection slightly outperforms the alternatives.

Figure 2: Performance metrics of the malignancy prediction models. ROC and PR curves, top and bottom rows, respectively, for all images (Left), smartphone (wide-field) images (Middle) and dermoscopy images (Right) on the test set. Predictions were obtained from the one-class model followed by the malignancy prediction model and the image classification aggregation approach. Also reported are the TPR (sensitivity) and FPR (1-specificity) for three dermatology-trained MDs (AS, MK and MJ).

Table 5: Performance metrics of the image-level malignancy prediction model with different aggregation strategies (average, noisy OR and max pooling) and lesion identification models (one-class, malignancy and sub-types). The best performing combination is highlighted in boldface.

Next, we explore the predictive value of clinical features and their combination with image-based models. Specifically, we consider three models: i) the logistic regression model using only clinical covariates; ii) the two-stage approach with one-class lesion identification; and iii) the combined model described in Section 3.2. Note that since we have a reduced set of images for which both clinical covariates and images are available, as described in Section 3.1, all models have been re-trained accordingly. Figure 3 shows ROC and PR curves for the three models and the TPR and FPR values for three dermatology-trained MDs on the independent test set. Results indicate a minimal improvement in classification metrics when the clinical covariates are combined with the image-based model.

Table 6: Performance metrics (AUC and AP) of the models with data augmentation. We consider three models with and without ISIC2018 dermoscopy image dataset augmentation. The three models considered are the malignancy prediction model described in Section 2.1, and the direct image-level classification and two-stage approach with one-class lesion identification described in Section 2.3.

Finally, we consider whether augmenting the discovery dataset with the publicly available ISIC2018 dataset improves the performance characteristics of the proposed model. Specifically, the ISIC2018 (training) dataset, which consists of only dermoscopy images, is meant to compensate for the low representation of dermoscopy images in our discovery dataset, i.e., only 11% of the discovery images are dermoscopy.
Results in Table 6, stratified by image type (all images, smartphone (wide-field) only and dermoscopy only), are presented for three different models: i) malignancy prediction (assuming the positions of the lesions are available); ii) direct image-level classification; and iii) the two-stage approach with one-class lesion identification. As expected, data augmentation consistently improves the performance metrics of all models considered.

Figure 4 shows examples of the one-class lesion identification model described in Section 2.2. Note that the model is able to accurately identify lesions in images with vastly different image sizes, for which the lesion-to-image ratio varies substantially. We attribute the model's ability to do so to the FPN, which allows the model to obtain image representations (features) at different resolution scales. Further, in Figure 5 we show, through a two-dimensional t-SNE map [27], that the representations produced by the lesion detection model (combined backbone and FPN features) roughly discriminate between malignant and benign lesions, while also clustering in terms of lesion types.

The early skin lesion classification literature largely used high-quality clinical and dermoscopy images for proof of concept. However, the usability of these algorithms in the real world remains questionable and must be tested prospectively in clinical settings. Consumer-grade devices produce images of variable quality; however, this approach mimics the clinical workflow and provides universally applicable image capture for any care setting. The utility of wide-field clinical images taken with a smartphone was recently demonstrated by Soenksen et al. for the detection of "ugly duckling" suspicious pigmented lesions vs. non-suspicious lesions, with 90.3% sensitivity (95% CI: 90.0-90.6) and 89.9% specificity (95% CI: 89.6-90.2), validated against three board-certified dermatologists [28]. This use case demonstrates how the clinical workflow in dermatology can be replicated with ML-based CDS. However, the limitation is that the number needed to treat (NNT) for true melanoma detection from pigmented lesion biopsies by dermatologists is 9.60 (95% CI: 6.97-13.41) by meta-analysis [29]. Hence, the task of detecting suspicious pigmented lesions should be compared against histological ground truth rather than concordance with dermatologists, for improved accuracy and comparability of model performance. Furthermore, pigmented lesions are a small subset of the overall task of detecting skin cancer, as melanomas constitute fewer than 5% of all skin cancers. Our approach utilizing wide-field images to detect lesions of interest demonstrated encouraging mAP, IoU and Recall metrics, considering the sample size used. This primary step is critical in the clinical workflow, where images are captured for lesions of interest but lesion annotation is not possible in real time. An ideal ML-based CDS would identify lesions of interest and also provide the likelihood of malignancy and the sub-type annotations as feedback to the user. Our study demonstrates malignancy classification for the three most common skin cancers (BCC, SCC and Melanoma) vs. benign tumors with smartphone images (clinical and dermoscopy), with encouraging accuracy when validated against histopathological ground truth.
The usability of this algorithm is further supported by comparison with dermatologists with variable levels of dermoscopy experience, showing comparable performance to dermatologists in both clinical and dermoscopy binary classification tasks, despite the low proportion of dermoscopy images (11%) in the discovery set. This two-stage model, at the current performance level, could be satisfactorily utilized at scale as a complete end-to-end system for PCP triage to dermatology of images concerning for malignancy (pending prospective validation). Interestingly, the additional ISIC high-quality dataset (predominantly dermoscopy images) improved performance across both clinical and dermoscopy image sets. This suggests that smartphone image data can be enriched by adding higher-quality images. It is unclear whether this benefit is due to improvements in image quality or in volume; this remains an area of further study. Finally, we demonstrated that comprehensive demographic and clinical data are not critical for improving model performance in a subset of patients, as the image classification model alone performs on par with the combined model. Clinicians often make contextual diagnostic and management decisions when evaluating skin lesions to improve their accuracy. Interestingly, this clinical-context effect that improves diagnostic accuracy, at least for pigmented lesions, may be dependent on years of dermoscopy experience [5]. The value of clinical context for model performance has not been studied extensively and remains an area of further study in larger datasets.

Limitations. Limitations of the study include a small discovery image dataset, predominantly including light and medium skin tones, with less than 2% of images representing dark skin tones. However, this may reflect the bias in the task itself, as skin cancers are more prevalent in light- followed by medium-skin tones. Given the large range of skin types and lesions encountered in clinical practice, additional images may improve performance and generalizability. At scale, image data pipelines with associated metadata are a key resource needed to obtain inclusive ML-based CDS for dermatology. Improved image quality and/or volume improves performance, as demonstrated by the incorporation of the ISIC dataset into the model; however, this theoretical improvement in performance needs validation in prospective clinical settings. While the pure clinical model incorporates a comprehensive list of covariates and accounts for the temporal association of this metadata with the detection of lesions, it is not an exhaustive list, as it does not include social determinants such as sun-exposure behavior and tanning bed usage, two critical factors contributing to the increasing incidence of skin cancer. In particular, metadata including lesion symptoms and evolution is missing and should be incorporated in future studies. Finally, it should be noted that the lesions included in this study were evaluated and selected for biopsy in dermatology clinics. If this model were to be utilized in other clinical settings, such as primary care, additional validation would be needed, as the pre-test probability of lesion detection differs among clinical settings [29].
References
[1] The next four decades: The older population in the United States: 2010 to 2050. Number 1138.
[2] Cancer Stat Facts: Melanoma of the skin.
[3] Dermatology in primary care: prevalence and patient disposition.
[4] Dermatologist-level classification of skin cancer with deep neural networks.
[5] Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists.
[6] Machine learning and melanoma: the future of screening.
[7] Gyeong Hun Park, Ilwoo Park, and Sung Eun Chang. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm.
[8] Skin lesion classification from dermoscopic images using deep learning techniques.
[9] Using deep learning to detect melanoma in dermoscopy images.
[10] Detection of skin diseases from dermoscopy image using the combination of convolutional neural network and one-versus-all.
[11] Skin cancer detection by deep learning and sound analysis algorithms: A prospective clinical study of an elementary dermoscope.
[12] A new deep learning approach integrated with clinical data for the dermoscopic differentiation of early melanomas from atypical nevi.
[13] Deep-learning-based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis.
[14] Melanoma detection by analysis of clinical images using convolutional neural network. Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).
[15] Skin lesion segmentation in clinical images using deep learning.
[16] The development of a skin cancer classification system for pigmented skin lesions using deep learning.
[17] A convolutional neural network trained with dermoscopic images performed on par with 145 dermatologists in a clinical melanoma image classification task.
[18] Deep residual learning for image recognition.
[19] Faster R-CNN: Towards real-time object detection with region proposal networks.
[20] Feature pyramid networks for object detection.
[21] Classifying and segmenting microscopy images with deep multiple instance learning.
[22] Attention is all you need.
[23] Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC).
[24] The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data.
[25] Piotr Dollár, and Kaiming He. Detectron.
[26] Microsoft COCO: Common objects in context.
[27] Visualizing data using t-SNE.
[28] Using deep learning for dermatologist-level detection of suspicious pigmented skin lesions from wide-field images.
[29] Meta-analysis of number needed to treat for diagnosis of melanoma by clinical setting.

The authors would like to thank Melodi J. Whitley, Ph.D., MD (MJ) and Amanda Suggs, MD (AS) for their assistance with the manual classification of images for the test dataset; and the Duke Institute for Health Innovation (DIHI) for providing access to the clinical data.