key: cord-0872064-70nbtnx5 authors: He, Yu; Pan, Ian; Bao, Bingting; Halsey, Kasey; Chang, Marcello; Liu, Hui; Peng, Shuping; Sebro, Ronnie A.; Guan, Jing; Yi, Thomas; Delworth, Andrew T.; Eweje, Feyisope; States, Lisa J.; Zhang, Paul J.; Zhang, Zishu; Wu, Jing; Peng, Xianjing; Bai, Harrison X. title: Deep learning-based classification of primary bone tumors on radiographs: A preliminary study date: 2020-11-22 journal: EBioMedicine DOI: 10.1016/j.ebiom.2020.103121 sha: 6667a1989306b5c636558e3b3a6e75ccde0467af doc_id: 872064 cord_uid: 70nbtnx5
BACKGROUND: To develop a deep learning model to classify primary bone tumors from preoperative radiographs and compare performance with radiologists. METHODS: A total of 1356 patients (2899 images) with histologically confirmed primary bone tumors and preoperative radiographs were identified from five institutions' pathology databases. Manual cropping was performed by radiologists to label the lesions. Binary discriminatory capacity (benign versus not-benign and malignant versus not-malignant) and three-way classification (benign versus intermediate versus malignant) performance of our model were evaluated. The generalizability of our model was investigated on data from an external test set. Final model performance was compared with the interpretations of five radiologists of varying levels of experience using permutation tests. FINDINGS: For benign vs. not benign, the model achieved an area under the curve (AUC) of 0.894 and 0.877 on cross-validation and external testing, respectively. For malignant vs. not malignant, the model achieved an AUC of 0.907 and 0.916 on cross-validation and external testing, respectively. For three-way classification, the model achieved 72.1% accuracy vs. 74.6% and 72.1% for the two subspecialists on cross-validation (p = 0.03 and p = 0.52, respectively). On external testing, the model achieved 73.4% accuracy vs. 69.3%, 73.4%, 73.1%, 67.9%, and 63.4% for the two subspecialists and three junior radiologists (p = 0.14, p = 0.89, p = 0.93, p = 0.02, p < 0.01 for radiologists 1–5, respectively). INTERPRETATION: Deep learning can classify primary bone tumors using conventional radiographs in a multi-institutional dataset with similar accuracy compared to subspecialists, and better performance than junior radiologists. FUNDING: The project described was supported by RSNA Research & Education Foundation, through grant number RSCH2004 to Harrison X. Bai.
Although primary bone tumors are uncommon, with incidence rates of 4–7% among children and adolescents in the United States [1], primary malignancies of the bone and joints are ranked as the third leading cause of death in patients with cancer who are younger than 20 years of age [2]. Bone tumors vary widely in their biological behavior and require different management depending on their classification as benign, intermediate, or malignant by the World Health Organization (WHO) [3]. Benign bone tumors (e.g. osteochondroma, osteoid osteoma, etc.) have a limited capacity for local recurrence and are almost always readily cured by complete local excision/curettage [3]. Tumors in the intermediate group (e.g. giant cell tumor, chondroblastoma, etc.) have the potential to be locally aggressive or, in rare cases, to metastasize. Therefore, bone tumors classified as intermediate often require wide excision margins inclusive of normal tissue, and/or the use of adjuvant therapy, in order to ensure local control [3]. Malignant bone tumors (e.g. chondrosarcoma, osteosarcoma, etc.) not only have the potential for locally destructive growth and recurrence, but also carry significant risk of distant metastases [3]. The differential diagnosis of a primary bone tumor mostly depends on review of the conventional radiographs and the age of the patient.
The plain radiograph remains the most useful examination for differentiating these cases, while CT and MRI are helpful only in selected cases. Besides demographic information such as the patient's age, the radiographic appearance of the tumor, including size, location, margin, type of matrix, and presence of periosteal reaction and cortical destruction, provides key clues that help the radiologist differentiate indolent from aggressive bone tumors [3]. Because bone tumors have a variety of appearances and are relatively uncommon, few radiologists develop sufficient expertise to make a definite diagnosis. Among general radiologists, accuracy in interpretation of bone lesions can be low, leading to misdiagnosis that can be detrimental to patient outcome [4]. Many patients with benign tumors are referred for bone biopsy, which increases morbidity and cost and is subject to sampling error [5], or are evaluated with advanced imaging modalities, which increase health care costs. Artificial intelligence, especially deep learning with convolutional neural networks, has shown great promise in classifying two-dimensional images of some common diseases and relies on databases of thousands of annotated or unannotated images [6–9]. Deep learning models can recognize predictive features directly from images by utilizing a back-propagation algorithm, which recalibrates the model's internal parameters after each round of training [10]. Recent studies have shown the potential of deep learning in the assessment of solid liver lesions on ultrasonography [11], renal lesions [12,13] and glioma on MR imaging [10,14–17], and abnormal chest radiographs [18]. An algorithm that can distinguish benign from malignant bone tumors on routine radiographs with high accuracy can facilitate triage, guide patient management, and save patients from unnecessary procedures.
In this study, we trained a deep learning algorithm to classify primary bone tumors on plain film and compared its performance with that of radiologists of varying levels of experience.
Evidence before this study: Primary malignancy of the bone and joints is ranked as the third leading cause of death in patients with cancer who are younger than 20 years. The plain radiograph remains the most useful examination for differentiating benign from aggressive lesions. Because primary bone tumors have a low incidence and a wide variety of uncommon features, few radiologists develop sufficient expertise to make a definite diagnosis. For general radiologists and those working in resource-limited regions, radiographic interpretation can be less accurate, leading to misdiagnosis and unnecessary biopsies. Correctly classifying primary bone tumors on radiographs is a challenging problem even for subspecialists. The aim of this project was to raise the level of plain radiography analysis through deep learning to the level of the musculoskeletal subspecialist. Artificial intelligence, especially deep learning with convolutional neural networks, has shown great promise in classifying two-dimensional images of some common diseases. With the use of PubMed and Google Scholar, a systematic literature search was performed to identify original research papers in English from inception to October 2019, using the terms ("bone tumors" OR "bone cancer") AND ("DCNN" OR "deep learning" OR "machine learning") AND ("radiographs" OR "plain film"). No previously published report was found. Our study is the first to establish a deep learning algorithm for classifying primary bone tumors on conventional radiographs using a multi-institutional dataset with similar accuracy to subspecialists and higher accuracy than junior radiologists. The performance is expected to improve further in the future with larger datasets. Correctly classifying bone tumors on plain radiograph is important for clinical decision making as it can guide subsequent management. Our algorithm has the potential to improve the interpretation of primary bone tumor radiographs to the level of the subspecialists. If further validated, the algorithm can prevent patients from undergoing unnecessary invasive biopsies and help guide clinical management, especially in areas without subspecialty expertise.
Patients with primary bone tumor confirmed by histology according to the 2013 World Health Organization (WHO) classification were retrospectively identified from five large academic centers from July 2008 to July 2019. Plain radiographs and clinical variables including patient demographics (i.e., age and sex) were collected. The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Boards at all five institutions. The inclusion criteria for the study were (i) histopathologically confirmed (biopsy or surgery) primary bone tumor according to current WHO criteria, (ii) available pre-procedure plain radiographs, including all projections that show the lesion clearly, and (iii) image quality adequate for analysis, without motion or other artifacts. The images were screened by a radiologist (Y.H.) with 7 years of experience reading musculoskeletal (MSK) plain film. (See Supplementary Figure S1, which demonstrates inclusion and exclusion criteria.) Out of respect for patient confidentiality and consent, the radiographs and clinical information datasets analyzed in this study are not available for download but are available upon reasonable request to the corresponding author.
All images were downloaded in DICOM format at their original dimensions and resolution. Images were converted from DICOM to 8-bit JPEG. The images were then loaded into Click 2 Crop software (v5.2.2), and regions of interest containing the whole tumor were manually cropped from the original image, including some of the surrounding area while capturing the margin of the lesion, by a radiologist (Y.H.) with 7 years of experience reading MSK plain film. Images were padded and resized to 512 by 512 pixels. Single-channel images were converted to 3-channel images by repeating the single channel 3 times [12,19,20]. Pixel values were normalized by scaling values into the range [0, 1], then subtracting (0.485, 0.456, 0.406) and dividing by (0.229, 0.224, 0.225) channel-wise.
Model training was performed in Python 3.7 and PyTorch 1.6 using an NVIDIA GV100 32GB graphics processing unit. Models were based on the EfficientNet-B0 convolutional neural network architecture [21]. Model weights were initialized with weights pretrained on the ImageNet database. Training was performed using a batch size of 96, a dropout probability of 0.2 before the final fully-connected layer, and data augmentation consisting of horizontal flips, affine transformations, and contrast adjustments. Models were trained for 3-way classification (benign, intermediate, and malignant) and binary classification (benign versus not-benign and malignant versus not-malignant). The RAdam optimizer was used with a categorical cross-entropy loss and a cosine annealing learning rate schedule with an initial learning rate of 3 × 10⁻⁴. Models were trained for 20 epochs. The model for each training episode was selected based on Cohen's kappa score on the validation set. For each test fold, 3 training episodes were performed to form a 3-model ensemble. Predictions were averaged across all models and all radiographic views to produce a final prediction for each case. During each training epoch, 1 image per patient was sampled so that the model was exposed to the same number of images per patient over the course of the entire training period.
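The pixel normalization and the learning rate schedule described above can be illustrated with a minimal pure-Python sketch. It is not the code used in the study: the function names `normalize_pixel` and `cosine_lr` are hypothetical, division by 255 is assumed as the way 8-bit values are scaled into [0, 1], and the schedule is assumed to decay from the initial rate to zero over the 20 epochs without warm-up or restarts (the text does not state the minimum rate).

```python
import math

# Channel-wise mean and standard deviation quoted in the text.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(value_8bit):
    """Turn one 8-bit grayscale value into three normalized channel values."""
    scaled = value_8bit / 255.0          # assumption: 8-bit range mapped to [0, 1]
    channels = (scaled, scaled, scaled)  # repeat the single channel 3 times
    return tuple((c - m) / s for c, m, s in zip(channels, MEAN, STD))


def cosine_lr(epoch, total_epochs=20, initial_lr=3e-4, min_lr=0.0):
    """Cosine-annealed learning rate at the start of a given epoch.

    Assumption: a single cosine decay from initial_lr to min_lr across
    total_epochs; min_lr = 0 is illustrative and not taken from the text.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the schedule starts at the initial learning rate, reaches half of it at epoch 10, and falls to the minimum at epoch 20.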
An external test set comprising images from two of our five institutions (institutions 4 and 5) was used to evaluate the generalization performance of the model. The external test set consisted of 639 images from 291 patients. Each patient contributed 1 lesion to the dataset. Of the 291 lesions, 162 (368 images) were benign based on histopathology, 61 (126 images) were intermediate, and 68 (145 images) were malignant. Please see Supplementary Figure S2 for a schematic of our pipeline.
Two board-certified musculoskeletal subspecialists (H.L. and S.P.), who see more than 100 bone tumors per year and have 25 and 23 years of experience, and three junior radiologists with 6, 1, and 7 years of experience reading MSK plain film, respectively, evaluated conventional radiographs of the bone lesions while blinded to histopathologic data, and labeled each case as benign, intermediate, or malignant based on their own interpretations. They were given the age and sex of each patient. The 2 musculoskeletal subspecialists interpreted the uncropped images of the entire cohort (data from all five institutions) and the cropped images of the external test set (data from institutions 4 and 5), while 2 junior radiologists evaluated the uncropped and cropped images of the external test set only. One junior radiologist (J.G.) evaluated only the uncropped images of the external test set, because she was exposed to the gold standard during the recropping process. Ground truth labels were obtained using the final pathology results. The model's results were compared with the radiologists' interpretations and the final pathology results to assess model performance. Information on the five radiologists is shown in Supplemental Table S1.
The model performance was evaluated using several metrics. Receiver operating characteristic (ROC) curves and area under the curve (AUC) for benign versus not-benign and malignant versus not-malignant were used to evaluate binary discriminatory capacity.
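As an illustration of the binary metrics named above, the following sketch computes the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and sensitivity and specificity at a fixed threshold. The function names are hypothetical and this is not the study's evaluation code; the pairwise loop is chosen for readability rather than speed.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)
```

For a task such as malignant versus not-malignant, `labels` would hold 1 for malignant cases and `scores` the predicted probability of malignancy; moving `threshold` trades sensitivity against specificity along the ROC curve.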
Cohen's kappa scores and categorical accuracy were used to evaluate the three-way classification performance of the model and radiologists. Five-fold cross-validation was used to analyze model performance, ensuring no patient overlap across folds. First, the dataset was divided into 5 disjoint partitions based on patient ID, each approximately 20% of the overall dataset, which comprise the test folds. Next, for each test fold, the remaining 80% of the dataset was used for training (70%) and validation (10%). A separate model was trained for each fold, and the out-of-fold predictions were obtained for the test fold. The cross-validation scheme is illustrated in Supplemental Figure S3 and Supplemental Table S2. This cross-validation procedure allowed us to obtain an out-of-fold prediction for each sample in the dataset, maximizing the sample size on which model performance was evaluated without data leakage. Model performance was also evaluated on an external test set to assess generalizability beyond the institutions present in the internal cohort. To evaluate the impact of manual lesion cropping on model performance, a second radiologist (J.G.) with 1 year of experience reading MSK plain film independently recropped the images in the external test set, and model performance was evaluated on the external test set using this set of recropped images.
Statistical analysis was performed using the R statistical computing language, as well as non-parametric methods implemented in Python 3.7. 95% confidence intervals for AUCs were obtained via the DeLong method. For Cohen's kappa scores and categorical accuracy, 95% confidence intervals were generated using 10,000 bootstrap samples. Permutation tests with 10,000 iterations were used to calculate p-values. p < 0.05 was considered to indicate a statistically significant difference in performance. Comparison with radiologists was performed only for 3-way classification. Subgroup analysis based on age was also performed.
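The patient-level cross-validation scheme described above can be sketched as follows. This is an illustrative reconstruction, not the study's code: the function names, the random seed, and the shuffling strategy are assumptions, and the validation set is taken as 10% of all patients drawn from the non-test folds.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Assign every unique patient to exactly one test fold."""
    unique_ids = sorted(set(patient_ids))  # one entry per patient, not per image
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    return [unique_ids[i::n_folds] for i in range(n_folds)]


def split_for_fold(folds, test_index, val_fraction_of_total=0.10, seed=0):
    """Return (train, validation, test) patient lists for one cross-validation round."""
    test = list(folds[test_index])
    rest = [pid for i, fold in enumerate(folds) if i != test_index for pid in fold]
    rng = random.Random(seed + test_index)
    rng.shuffle(rest)
    n_val = round(val_fraction_of_total * (len(rest) + len(test)))
    return rest[n_val:], rest[:n_val], test
```

Because the split is made on patient IDs rather than on images, all views of a patient fall into the same partition, which is what prevents leakage between training and test folds.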
Table 1 summarizes the 5 datasets used in this study. Overall, the mean age was 24.7 ± 18.1 years, with 50.1% benign tumors (mean age 22.8 ± 16.9 years), 23.4% intermediate tumors (mean age 23.5 ± 15.7 years), and 26.5% malignant tumors (mean age 27.7 ± 21.1 years), as indicated by the final pathology results. There was a slight male predominance (58.2%). Differences in the distributions of age (one-way analysis of variance, p < 0.01), sex (chi-square test, p = 0.013) and pathology (chi-square test, p < 0.01) were statistically significant among the 5 institutions. Please see Supplementary Figure S4.
On cross-validation, the AUCs for the two binary classifications (benign vs. not benign and malignant vs. not malignant) were 0.894 and 0.907, respectively. For benign vs. not benign, at a naive threshold of 0.5, the model achieved 82.7% sensitivity and 81.8% specificity. Sensitivity and specificity for the model can be adjusted along the ROC curve by calibrating the model threshold. For malignant vs. not malignant, at a naive threshold of 0.5, the model achieved 77.7% sensitivity and 89.6% specificity. On external testing, the AUCs for these 2 classifications were 0.877 and 0.916, respectively. The data were divided into quartiles by age for subgroup analysis: younger than 12 years old, 12–18 years old, 19–36 years old, and older than 36 years old. The performance of the deep learning model on the 2 binary classification problems (benign vs. not benign and malignant vs. not malignant) is summarized in Table 2.
Three-way classification results for the deep learning model and two subspecialists are shown in Table 3. For three-way classification, Cohen's kappa scores for the model and subspecialists were 0.548, 0.605, and 0.565, respectively. On cross-validation, the difference between model predictions and subspecialist 1's ratings was statistically significant (permutation test, p = 0.03). The difference between model predictions and subspecialist 2's ratings was not statistically significant (permutation test, p = 0.52).
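For readers unfamiliar with the statistics reported here, the sketch below shows one common way to compute Cohen's kappa and a paired permutation test on the accuracy difference between two readers. It is an assumption-laden illustration rather than the study's analysis code: the kappa is unweighted, the test randomly swaps the two readers' predictions case by case and is two-sided, and all function names are hypothetical.

```python
import random
from collections import Counter


def accuracy(truth, pred):
    """Fraction of cases where the prediction matches the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two label sequences."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single class")
    return (observed - expected) / (1.0 - expected)


def paired_permutation_test(truth, pred_a, pred_b, n_iter=10000, seed=0):
    """Two-sided p-value for the accuracy difference between two readers."""
    rng = random.Random(seed)
    observed = abs(accuracy(truth, pred_a) - accuracy(truth, pred_b))
    extreme = 0
    for _ in range(n_iter):
        perm_a, perm_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:  # swap the two readers' calls for this case
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = abs(accuracy(truth, perm_a) - accuracy(truth, perm_b))
        if diff >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_iter + 1)  # add-one correction avoids p = 0
```

Here `cohens_kappa(truth, pred)` measures agreement of one reader with pathology, while `paired_permutation_test(truth, model_pred, reader_pred)` compares the model with one radiologist on the same cases.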
In addition, the data were divided into age quartiles, and detailed stratified model performance by age is summarized in Table 3. Whereas class distributions for both subspecialists were similar, the model predicted a higher number of benign tumors (50.9% vs. 43.2% and 43.5%) and fewer intermediate tumors (18.1% vs. 23.6% and 24.5%). Malignant tumor predictions were more similar across model and subspecialists (31.0% vs. 33.1% and 32.0%). Three-way classification results for the deep learning model and five radiologists on uncropped images of the external test data are shown in Table 4. Cohen's kappa scores for the model and five radiologists were 0.560, 0.483, 0.553, 0.555, 0.430, and 0.367, respectively. Differences between model predictions and the ratings of radiologists 1–3 were not statistically significant (permutation tests, p = 0.14, p = 0.89 and p = 0.93). Differences between model predictions and the ratings of radiologists 4 and 5 were statistically significant (permutation tests, p = 0.02 and p < 0.05). In addition, the data were divided into three equally sized age groups, and detailed stratified model performance by age is shown in Table 4. Three-way classification results for the deep learning model and five radiologists on cropped images of the external test data are shown in Supplementary Table S3. Intra-rater reliability between evaluations of cropped and uncropped images on the external test data was moderate for radiologists 1–3 and fair for radiologist 5. Cohen's kappa scores of intra-rater reliability for the four radiologists were 0.544, 0.560, 0.509, and 0.385, respectively. Figs. 2–4 depict examples of model-subspecialist disagreement under 3 scenarios for prediction of malignancy. Fig. 2 depicts 3 examples of malignant tumors that were predicted to be not malignant by both the deep learning model and the subspecialists, selected from a total of 15 lesions in the first scenario.
These cases either had uncharacteristic appearances (n = 8) or were located in unusual locations (e.g., vertebral body or coccyx) (n = 7). Examples include an osteosarcoma that is completely sclerotic (Fig 2a), a chondrosarcoma that has no calcification of cartilage matrix (Fig 2b), and an Ewing sarcoma that has no cortical destruction or periosteal reaction (Fig 2c). Fig. 3 depicts one example of a malignant tumor that was predicted to be malignant by the deep learning model but not by the subspecialists, selected from a total of 9 lesions in the second scenario. Almost all these cases were ill-defined lytic lesions without aggressive periostitis (Fig 3). Fig. 4 shows 2 examples of the opposite situation, in which the subspecialists predicted malignancy but the model did not, selected from a total of 22 lesions in the third scenario. These cases all have an aggressive type of periosteal reaction (lamellated, amorphous or sunburst) (Fig 4a), or have a permeative or moth-eaten appearance (Fig 4b).
In this study, we constructed and evaluated a deep learning model for lesion classification on a collection of 2899 images from 1356 patients with histologically confirmed primary bone tumors and preoperative radiographs. The model achieved similar performance to the subspecialists in three-way classification, and better performance than the junior radiologists. Correctly classifying bone tumors on plain radiograph is important for clinical decision making as it can guide subsequent management [3]. This is especially true in locales where there is a relative lack of subspecialty radiology expertise. Because many bone lesions are uncommon or rare, few radiologists develop sufficient expertise to diagnose them accurately. In clinical practice, one relies on learning and recalling characteristic imaging features of various lesions, both of which are subject to bias. Inappropriate classification of a benign bone tumor can lead to unnecessary biopsy and, subsequently, increased morbidity and cost.
In fact, a study utilizing questionnaires revealed that biopsy wounds yielded complications in 17.3% of patients with malignant primary tumors of bone or soft tissue who underwent biopsy, and that biopsy was detrimental to these patients' prognosis and overall outcome 8.5% of the time [22]. Biopsy of a malignant bone tumor without appropriate planning can increase the risk of tumor seeding along the biopsy tract, with the incidence of seeding reported as up to 19.2% following osteosarcoma biopsy [23]. Sampling error is another problem for bone biopsy. A diagnosis was not obtained in 7.9% of reported cases of CT image-guided core biopsy of musculoskeletal tumors [24], or in 4.7% of open biopsy cases [25,26]. Rates of incorrect diagnosis at tertiary cancer centers also range from 6% to 12% for image-guided core needle biopsies [27]. When the referring center is accounted for, this rate increases to 23% [28]. In addition, CT-guided core biopsy is associated with a re-biopsy rate of up to 20% of cases [26]. There are also a host of other factors that can prevent providers from obtaining adequate tissue for diagnosis. For instance, many bone tumors are extremely vascular and often yield what appears to be blood only. For lesions that have massive bony sclerosis, such as osteosarcomas, the material obtained is often of poor quality and non-diagnostic. Specimens of benign or malignant cystic lesions or tumors with necrosis are also difficult to obtain for biopsy. The aim of this project was to raise the level of plain radiography analysis through deep learning to the level of the musculoskeletal subspecialists. Most radiologists rely on "pattern recognition" to differentiate benign from malignant lesions on plain radiograph, which can often lead to erroneous conclusions.
Some common radiologic criteria used for this distinction include cortical destruction [29], periostitis [29], orientation or axis of the lesion [30], and zone of transition [30]. However, all have limitations. Cortical bone can be replaced by part of the noncalcified matrix (fibrous matrix or chondroid matrix) of benign fibro-osseous lesions and cartilaginous lesions, giving the false impression of cortical destruction on plain film [31]. Periostitis and orientation of the lesion can be nonspecific [30]. Although the zone of transition is arguably the most useful indicator of whether a lesion is benign or malignant (i.e. a narrow zone of transition indicates a benign lesion and vice versa), it only applies to lytic lesions: a blastic or sclerotic lesion will always appear to have a narrow zone of transition and may erroneously be diagnosed as benign even if it is malignant [31]. Despite these challenges, we identified no study in the literature that applies deep learning to differentiate benign from malignant bone lesions on plain radiograph. Past studies had the limitations of small cohort size, focus on specific differential diagnoses, and use of advanced imaging modalities [7,8,32,33]. Our model demonstrated good binary discriminatory capacity on cases from different hospitals stratified by age. For the group older than 36 years, the model's binary discriminatory capacity was slightly lower than that of the younger age groups. This can be explained by the smaller sample size on which the deep learning algorithm was trained, since most bone tumors were diagnosed in pediatric patients. The good model performance on external testing supports the generalizability of our algorithm. On cross-validation and external testing, our model achieved similar categorical accuracy to the subspecialists for the 3-category classification.
This demonstrates that classification of primary bone tumors on radiographs is a challenging problem even for experienced radiologists subspecialized in MSK. Our model performed better than the junior radiologists for all age groups except the group younger than 10 years. This may be because our external test data contained a very high proportion of benign bone tumors, such as osteochondroma and osteoid osteoma, which are easy to recognize even for less experienced radiologists. Deep learning is often considered a black box. To understand the choices and mistakes that the model and subspecialists made, we investigated specific cases of model-subspecialist disagreement under 3 scenarios for prediction of malignancy. We concentrated on wrong predictions of malignant tumors because these would have impacted management and outcome if our algorithm were used in lieu of biopsy. In the first scenario, where both model and subspecialists were wrong, we found that the tumors either had uncharacteristic appearances or were located in unusual locations, such as vertebral bodies or the coccyx, where the characterization on plain film was poor. In the second scenario, where the model was right but the subspecialists were wrong, we found that almost all the cases were ill-defined lytic lesions without aggressive periostitis. It appears that our deep learning model was better than the subspecialists at evaluating the zone of transition, which is often considered the most reliable plain film indicator for benign versus malignant lesions, as discussed above. In the third scenario, where the model was wrong and the subspecialists were right, there were some common findings. These cases all had a permeative, moth-eaten appearance or an aggressive type of periosteal reaction (lamellated, amorphous, or sunburst).
Although many benign processes, such as infection, eosinophilic granuloma, and trauma, can cause aggressive periostitis, our study only included primary bone tumors, so aggressive periostitis helped the subspecialists recognize these tumors as malignant. It seems that our deep learning model was not good at recognizing a permeative appearance or aggressive periostitis and associating it with malignancy. This can be explained by a lack in number or variety of these patterns, or both, in training. It is also important to note the difference in class distribution between model and subspecialist predictions. The deep learning model predicted a greater number of benign tumors (50.9% vs. 43.2–43.5%) than the subspecialists, largely at the expense of intermediate lesions (18.1% vs. 23.6–24.5%). This is most likely due to the class distribution in the training set (48.7% benign, 24.0% intermediate, and 27.2% malignant), as deep learning model predictions will tend toward the training distribution. This may also suggest that benign and intermediate lesions share similar features learned by the model, causing confusion between these 2 classes. There are several limitations to our work. First, this is a retrospective study with cases identified from a search of pathology databases at five institutions. In the general population, benign bone tumors are far more common than malignant ones. However, because the five participating centers are tertiary care centers, most typical benign bone tumors are diagnosed directly, without biopsy and pathology. Therefore, our data contained a smaller number of benign bone tumors and a larger number of intermediate and malignant bone tumors, indicating selection bias. Second, we included only primary bone tumors and did not consider other conditions (e.g. osteomyelitis, metastasis, bone-tumor mimickers, etc.) commonly encountered in clinical practice that often cause diagnostic difficulty.
It is also well known that benign processes such as infection and eosinophilic granuloma can mimic malignant tumors [5] . However, a lot of cases were without pathology or were not diagnosed with confidence on pathology. Future studies will include these cases. Third, the images were cropped by a MSK radiologist to highlight the tumor before being inputted into the network. To evaluate the impact of this manual cropped on model performance, a junior radiologist was asked to recrop the images in the external set and model performance was evaluated on this recropped set to compare with the original radiologist. Although we believe that manual cropping keeps the radiologist in the loop who are already interpreting the study, is easy to implement clinically and requires only seconds to complete, it is important to emphasize that the current pipeline is not ready for real-time clinical use. Future study will incorporate deep learning based lesion localization before classification to achieve a fully automated pipeline for clinical integration. Finally, our cohort size is still small compared to the millions of images on ImageNet used to train deep neural network models. Algorithm development can benefit from incorporation of more data from additional institutions, which will result in better performance. In conclusion, our study shows that deep learning with DCNN can classify primary bone tumors on conventional radiographs using a multi-institutional dataset with similar accuracy to subspecialists, and better performance than the junior radiologists. Our algorithm has the potential to improve primary bone tumor radiographs interpretation to the level of the subspecialists. Future study will focus on development of a fully automatic pipeline including lesion localization, incorporation of studies such as CT or MRI through deep learning and inclusion of bone tumor mimic pathologies. Childhood and adolescent cancer statistics World Health Organization. 
International Agency for Research on Cancer. WHO classification of tumours of soft tissue and bone
Bone tumor diagnosis using a naive Bayesian model of demographic and radiographic features
Fundamentals of diagnostic radiology. Philadelphia: Wolters Kluwer Health
Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks
Deep learning for classification of benign and malignant bone lesions in [F-18]NaF PET/CT images
Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs
Clinically applicable deep learning for diagnosis and referral in retinal disease
Residual convolutional neural network for the determination of IDH status in low- and high-grade gliomas from MR imaging
Deep learning for differentiation of benign and malignant solid liver lesions on ultrasonography
Deep learning to distinguish benign from malignant renal lesions based on routine MR imaging
Deep learning based on MRI for differentiation of low- and high-grade in low-stage renal cell carcinoma
Automatic assessment of glioma burden: a deep learning algorithm for fully automated volumetric and bidimensional measurement
Machine learning reveals multimodal MRI patterns predictive of isocitrate dehydrogenase and 1p/19q status in diffuse low- and high-grade gliomas
A visually interpretable, dictionary-based approach to imaging-genomic modeling, with low-grade glioma as a case study
MRI features predict survival and molecular markers in diffuse lower-grade gliomas
Generalizable inter-institutional classification of abnormal chest radiographs using efficient convolutional neural networks
Deep learning based on MR imaging for predicting outcome of uterine fibroid embolization
Artificial intelligence augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other origin at chest CT
rethinking model scaling for convolutional neural networks: ICML 2019
THE CLASSIC: The hazards of biopsy in patients with malignant primary bone and soft-tissue tumors
Revisiting tract seeding and compartmental anatomy for percutaneous image-guided musculoskeletal biopsies
Accuracy of computed tomography guided core needle biopsy of musculoskeletal tumours
The accuracy and clinical utility of intraoperative frozen section analysis in open biopsy of bone
Surgical biopsy with intra-operative frozen section. An accurate and cost-effective method for diagnosis of musculoskeletal sarcomas
Diagnosis of primary bone tumors with image-guided percutaneous biopsy: experience with 110 tumors
Ultrasound-guided needle biopsy of primary bone tumours
Primary bone tumors of adulthood
Systematic approach to musculoskeletal benign tumors
Nonneoplastic lesions that simulate primary tumors of bone
Identification of the most significant magnetic resonance imaging (MRI) radiomic features in oncological patients with vertebral bone marrow metastatic disease: a feasibility study
Comparison of radiomics machine-learning classifiers and feature selection for differentiation of sacral chordoma and sacral giant cell tumour based on 3D computed tomography features

We acknowledge the help of Ke Jin (K.J.) in data collection.

The project described was supported by the RSNA Research & Education Foundation through grant number RSCH2004 to H. Bai. The content is solely the responsibility of the authors and does not necessarily represent the official views of the RSNA R&E Foundation. The funders had no role in study design, data collection, data analysis, interpretation, or writing of the manuscript.

HXB and ZZ conceived the study; YH, JW, BB, MC, XP, and PJZ collected the data; YH and JG preprocessed the images; HL, RAS, SP, and JG evaluated the bone lesions; IP analyzed the data; HXB and KH helped in the analyses and discussion of the results; YH and IP wrote the manuscript; TY, ATD, EF, and LJS helped in manuscript editing and revision.
All authors contributed to the review and editing and approved the final version of the manuscript.

Ethics approval and consent to participate

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Boards at the Hospital of the University of Pennsylvania, Children's Hospital of Philadelphia, Hunan Children's Hospital, Xiangya Hospital, and Second Xiangya Hospital. The need for informed consent was waived by the institutional review boards for this retrospective study.

Out of respect for patient confidentiality and consent, the radiograph and clinical information datasets analyzed in this study are not available for download but are available upon reasonable request to the corresponding author. The source code has been uploaded to a public GitHub repository at https://github.com/i-pan/bone-tumor.

The authors declare that they have no conflicts of interest.

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.ebiom.2020.103121.