OncoNet: Weakly Supervised Siamese Network to automate cancer treatment response assessment between longitudinal FDG PET/CT examinations

Joshi, Anirudh; Eyuboglu, Sabri; Huang, Shih-Cheng; Dunnmon, Jared; Soin, Arjun; Davidzon, Guido; Chaudhari, Akshay; Lungren, Matthew P

2021-08-03

FDG PET/CT imaging is a resource-intensive examination critical for managing malignant disease and is particularly important for longitudinal assessment during therapy. Approaches to automating longitudinal analysis face many challenges, including the lack of available longitudinal datasets, the difficulty of managing large, complex multimodal imaging examinations, and the need for detailed annotations for traditional supervised machine learning. In this work we develop OncoNet, a novel machine learning algorithm that assesses treatment response from 1,954 pairs of sequential FDG PET/CT exams through weak supervision using the maximum standardized uptake values (SUVmax) in associated radiology reports. OncoNet demonstrates an AUROC of 0.86 and 0.84 on internal and external institution test sets, respectively, for determining change between scans, while also showing strong agreement with clinical scoring systems with a kappa score of 0.8. We also curated a dataset of 1,954 paired FDG PET/CT exams designed for response assessment for the broader machine learning in healthcare research community. Automated assessment of radiographic response from FDG PET/CT with OncoNet could provide clinicians with a valuable tool to rapidly and consistently interpret change over time in longitudinal multi-modal imaging exams.

Cancer is one of the leading causes of death worldwide, and accurate diagnosis, staging, and restaging are essential to optimize therapeutic management. Advanced medical imaging techniques such as positron emission tomography (PET) coupled with computed tomography (CT) are integral to cancer diagnosis and the assessment of treatment response. PET is the most sensitive non-invasive imaging modality, capable of detecting picomolar amounts of radiolabeled sugar molecules trapped in cancer cells, while CT provides high tissue resolution for precise localization. The clinical interpretation of PET/CT scans involves synthesizing multiple data sources: clinical information, the metabolic findings from PET, and the anatomic information from CT. In clinical practice, radiologists and nuclear medicine physicians must interpret consecutive PET/CT examinations to determine whether a cancer patient receiving treatment is responding appropriately to therapy. They do this by assessing whether the amount of malignant tissue is decreasing, unchanged, or increasing across exams. This process is chiefly qualitative and sometimes subjective, but oncology treatment planning increasingly demands standardized, quantitative data [1]. Further, the process is extremely labor intensive and time-consuming, and a stark rise in utilization of FDG PET/CT imaging suggests that automation technologies would be of high impact in clinical oncologic imaging workflows. Deep learning approaches have produced state-of-the-art results for automated interpretation of various medical imaging modalities [2][3][4]. Prior work in deep learning for PET/CT has demonstrated the ability of automated systems to detect and estimate locations of abnormalities in individual PET/CT imaging exams [5].
However, as discussed above, there is a pressing need for automated methods capable of comparing consecutive oncologic imaging studies and estimating changes in disease burden. If successful, automation of this clinically important task could improve routine clinical oncologic imaging workflows, enhance standardized quantification of imaging biomarkers for oncologic therapy trials, and help operationalize communication of treatment response to referring clinicians and patients. Existing approaches that compare consecutive medical images over time have been validated on 2D imaging modalities such as radiography and retinal fundus imaging [6, 7]. However, applying these techniques to PET/CT presents significant new challenges: (1) a single PET/CT exam is composed of hundreds of PET and CT image slices, so training a model to compare multiple complex PET/CT examinations is methodologically challenging and computationally expensive; (2) established scoring systems for measuring longitudinal changes are subjective, so human readers often produce inconsistent scores; (3) clinical information describing longitudinal changes in consecutive PET/CT imaging studies is reported in narrative text reports, making it challenging to extract meaningful labels for deep learning training in large datasets; and (4) existing work on PET/CT is limited by small reported experimental datasets or reliance on phantoms, which limits scientific advancement.

The purpose of this study is to model the task of longitudinal treatment response assessment on volumetric multi-modality oncologic PET/CT imaging examinations, automatically determining disease progression in pairs of FDG PET/CT studies obtained before and after treatment.

The training set annotations were extracted using a rules-based heuristic applied to the radiology reports. Radiology reports contain a Findings section detailing regions of the scan (head and neck, thorax, abdomen) along with information on identified lesions, measurements, and SUV values. We propose to use the maximum SUVs (SUVmax) recorded for lesions in the thorax before and after treatment as weak supervision for OncoNet. If the percentage difference in SUVmax values between the scans is greater than 25% or less than -25%, the label is considered tumor progression or resolution, respectively, and stable if between -25% and 25%, in accordance with the Lugano 2014 criteria for tumor evaluation, assessment, and response prediction using 18F-FDG PET/CT [8] (a sketch of this labeling rule is shown below). The validation and test set SUV annotations were determined by a board-certified radiologist reviewing the radiology reports. For each exam, the radiologist assigned a single SUV score corresponding to the most metabolically active lesion reported. The categorical label was determined as above using the Lugano 2014 criteria.

Internal Test Set: Our thorax test set consisted of 46 scans from 13 patients, leading to 33 paired longitudinal scans. The exams were sampled randomly from the original dataset and contain 11 pairs of each of the three classes.

External Test Set: For our external validation we used a public dataset (ACRIN 6668) from The Cancer Imaging Archive, which contains studies from 242 patients. The study was conducted as a multicenter trial with the goal of determining whether PET SUV uptake in non-small cell lung carcinoma (NSCLC) is a useful predictor of long-term clinical outcome (survival) after definitive chemoradiotherapy. Using the metadata in which SUVmax was recorded for the thorax led to a subset of 60 patients. From this subset we filtered out scans where the PET and CT were not acquired on the same date or had different numbers of slices; the remaining exams form the subset of the dataset used for external evaluation.
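To make the weak-supervision rule concrete, the following Python sketch shows one way the SUVmax-based labeling described above could be implemented. The ±25% thresholds follow the Lugano 2014 criterion quoted in the text; the function name, signature, and example values are illustrative assumptions, not the authors' code.

def label_response(suvmax_pre: float, suvmax_post: float) -> str:
    """Map the relative change in SUVmax between paired exams to a weak label."""
    if suvmax_pre <= 0:
        raise ValueError("SUVmax of the baseline exam must be positive")
    pct_change = (suvmax_post - suvmax_pre) / suvmax_pre * 100.0
    if pct_change > 25.0:
        return "progression"   # metabolic activity increased by more than 25%
    if pct_change < -25.0:
        return "resolution"    # metabolic activity decreased by more than 25%
    return "stable"            # changes within +/-25% are treated as stable

# Example: a lesion whose SUVmax drops from 12.0 to 4.5 (-62.5%) is labeled "resolution".
print(label_response(12.0, 4.5))

In the training set these SUVmax values come from the rules-based report heuristic described above, whereas in the validation and test sets they are assigned by the reviewing radiologist.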
OncoNet consists of three main components: an encoder, a decoder, and a classifier head. The encoder is formed using the Inflated Inception V1 3D convolutional neural network (I3D) pretrained on Kinetics using optical flow [9]. The final classification layer was removed, making the output a 3-dimensional encoding of the input scan. The encoding shape is 7 × 7 × l/6, where l is the number of slices in the original exam. The decoder consists of a soft attention mechanism and a linear classification layer. The soft attention is a dot product between each voxel e_{i,j,k} in the encoded representation and a learned weight matrix w. A softmax is applied to the scores computed by the dot product, and a linear combination of the voxels is computed. The intuition behind the soft attention is to place higher weight on certain voxels of the exam in a data-driven manner while determining change in tumor burden. This encoder-decoder structure has shown effectiveness in prior work on PET/CT abnormality detection [5]. The encoder and decoder weights are shared during each forward pass for each exam in the pair. The output representations from the decoder are used to compute a difference representation, a flattened tensor of dimension (hidden size,), which is then passed into a classifier head to determine response to treatment. The classifier head is formed of two linear layers with a ReLU activation function in between (a minimal code sketch of this architecture appears below).

The model is trained with a batch size of 2 for a maximum of 30 epochs using early stopping based on validation loss. The learning rate is 0.0001 with an Adam optimizer and a step decay of the learning rate every 10 epochs by a factor of 0.1. Dropout is used as regularization in the classifier head during training. Cross-entropy loss is used for supervision.

The test set SUV annotations were determined by a board-certified radiologist reviewing the radiology reports. For each exam, the radiologist assigned a single SUV score corresponding to the most metabolically active lesion present in the scan. The categorical labels for tumor response were determined by the percentage difference in SUVmax scores for the paired scans pre and post treatment. If the percentage difference was greater than 25%, the response was determined to be "progression"; if it was less than -25%, the response was determined to be "resolution"; values in between were determined to be "stable".

Each model was evaluated using AUROC on the predictions compared to the test set labels. AUROCs were computed for each class and each region individually. F1 scores were also computed. 95% confidence intervals were computed using 1,000 bootstrap replicates for average AUROC across classes. The models were also evaluated on an external public dataset from additional institutions in a similar manner. The SUVmax values for the external dataset were derived directly from the dataset metadata rather than from radiology reports. We visualize the model outputs using gradient-based guided backpropagation saliency maps, which compute the gradients of the loss with respect to the original pixels in each 2D slice of the exam. These gradients are plotted to demonstrate the pixels to which the model prediction is most sensitive.
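As a rough illustration of the encoder-decoder-classifier structure and the siamese difference head described in this section, the following PyTorch sketch is a minimal stand-in: a small 3D CNN replaces the Kinetics-pretrained I3D encoder, and all module and parameter names (SoftAttentionDecoder, OncoNet, channels, hidden) are hypothetical rather than taken from the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionDecoder(nn.Module):
    """Dot-product attention over encoded voxels followed by a weighted sum."""
    def __init__(self, channels: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(channels))  # learned attention weights

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        # enc: (batch, channels, D, H, W) -> flatten the spatial dimensions into voxels
        b, c, d, h, w = enc.shape
        voxels = enc.view(b, c, -1).transpose(1, 2)       # (batch, n_voxels, channels)
        scores = voxels @ self.w                          # dot product per voxel
        attn = F.softmax(scores, dim=1).unsqueeze(-1)     # attention weights over voxels
        return (attn * voxels).sum(dim=1)                 # (batch, channels)

class OncoNet(nn.Module):
    """Siamese-style change classifier over a pair of PET/CT exams."""
    def __init__(self, channels: int = 64, hidden: int = 128, n_classes: int = 3):
        super().__init__()
        # Stand-in encoder; the paper uses an Inflated Inception V1 (I3D) backbone.
        self.encoder = nn.Sequential(
            nn.Conv3d(2, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = SoftAttentionDecoder(channels)
        # Two linear layers with ReLU in between, with dropout for regularization.
        self.classifier = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(), nn.Dropout(p=0.5),
            nn.Linear(hidden, n_classes),
        )

    def embed(self, exam: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(exam))

    def forward(self, exam_pre: torch.Tensor, exam_post: torch.Tensor) -> torch.Tensor:
        # The same encoder/decoder weights are used for both exams; the classifier
        # sees the elementwise difference of the two decoder representations.
        diff = self.embed(exam_post) - self.embed(exam_pre)
        return self.classifier(diff)

# Toy usage: two-channel (PET + CT) volumes of 32 slices at 64 x 64 resolution.
model = OncoNet()
pre, post = torch.randn(1, 2, 32, 64, 64), torch.randn(1, 2, 32, 64, 64)
logits = model(pre, post)  # shape (1, 3): resolution / stable / progression

The weight sharing across the two forward passes and the elementwise difference representation mirror the description above; the stand-in encoder, channel sizes, and dropout rate are placeholders.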
OncoNet uses an encoder-decoder architecture previously validated on the task of anatomically-resolved PET/CT abnormality detection [5]. Like a siamese neural network [10], OncoNet computes decoder representations from two forward passes (one for each PET/CT scan in the pair). Finally, it computes the elementwise difference between the two representations and feeds it into a final classification layer. As an ablation, we study an approach that computes a difference between the PET/CT imaging exams prior to passing into the encoder-decoder network and runs a single pass through the network. This single-pass approach has been studied in prior literature on automated disease progression prediction [11]. We find that OncoNet scores 0.84 [0.75-0.95] AUROC on the task of longitudinal change prediction on the thorax and outperforms the single-pass approach by 0.14 AUROC (p < 0.01).

In Figure 3 we produce guided backpropagation saliency maps over each slice of the PET/CT to understand which voxels OncoNet's classification is most sensitive to. The top row shows an instance of disease progression where saliency is concentrated on the new tumor in the lower left. The middle row shows a stable tumor nodule; there is no spike in saliency over the stable tumor, suggesting that the model is not focused solely on abnormality detection but on change in abnormality. Finally, in the bottom row, when there is a reduction in disease from the previous scan, the saliency identifies the region where the tumor was resolved.

We evaluated OncoNet on a public external dataset from The Cancer Imaging Archive contributed by the ACRIN Cooperative Group [12]. The single-pass ablation performs worse on the external set than on the internal set by 0.11 AUROC, but maintains performance on the flipped perturbation for both internal and external test sets. See Table 2 for complete metrics.

A board-certified radiologist compared each paired FDG PET/CT scan in the test set and scored the scans based on the Deauville five-point scale. This scoring system is routinely used in clinical practice to quantify treatment response in FDG PET/CT. A score of 1, 2, or 3 indicates response to treatment, with a granular breakdown provided in the methods. A score of 4 could indicate a partial response, where metabolic activity has either remained constant or decreased but not below the level in the liver. A score of 5 indicates a new tumor or increased metabolic activity compared to the previous scan. We study whether OncoNet's predictions agree with the radiologist under two levels of stratification. (1) In clinical practice, stratifying a patient into categories 1-4 versus category 5 is important for the radiologist to determine whether the patient is getting worse. We evaluate agreement with category 5 using the model label corresponding to disease progression; labels corresponding to resolution or stable disease would agree with categories 1-4. (2) Since there is no direct one-to-one mapping between the "stable" model class and the Deauville scoring system, we also evaluate another stratification to study whether the "resolution" and "progression" outputs correspond to categories 1-3 and category 5, respectively. We selected all the exams where the model predicts "resolution" or "progression" and, of those, the exams that received scores of 1, 2, 3, or 5. We computed Cohen's kappa agreement between the model outputs and the clinical scores and found kappa values of 0.80 for stratification (1) and 0.73 for stratification (2) (a sketch of these agreement computations is shown below).
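The following sketch shows how the two agreement analyses just described could be computed with scikit-learn's cohen_kappa_score; the example Deauville scores and model predictions are illustrative placeholders, not study data.

from sklearn.metrics import cohen_kappa_score

deauville = [1, 2, 5, 3, 4, 5, 2, 5]                  # radiologist's five-point scores
predicted = ["resolution", "stable", "progression",   # OncoNet's three-class outputs
             "stable", "stable", "progression",
             "resolution", "stable"]

# (1) Progression vs. non-progression: Deauville 5 against categories 1-4.
rad_progress = [int(score == 5) for score in deauville]
model_progress = [int(label == "progression") for label in predicted]
kappa_1 = cohen_kappa_score(rad_progress, model_progress)

# (2) Restrict to exams where the model predicts "resolution" or "progression"
#     and the radiologist assigned 1, 2, 3, or 5, then compare the two extremes.
pairs = [(score, label) for score, label in zip(deauville, predicted)
         if label in ("resolution", "progression") and score in (1, 2, 3, 5)]
rad_binary = [int(score == 5) for score, _ in pairs]
model_binary = [int(label == "progression") for _, label in pairs]
kappa_2 = cohen_kappa_score(rad_binary, model_binary)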
The purpose of this study was to model the task of longitudinal treatment response prediction on multi-slice, multi-modality, multi-class oncologic imaging examinations to achieve automated determination of disease progression, improvement/response, or stability using pairs of FDG PET/CT studies obtained before and after treatment. Earlier work toward an end-to-end framework utilizing a weakly supervised approach to lesion detection and localization in PET/CT demonstrated excellent performance when leveraging an individual multi-slice imaging examination [5]. While that work represents innovation in automated analysis of FDG PET/CT using deep learning techniques, considering only the individual examination, without the context of change over time in consecutive studies, ultimately lessens the clinical impact, because comparative quantification of disease over time, especially during oncologic therapy, is a chief indication for performing FDG PET/CT imaging. Toward that goal, prior efforts to automate the quantification of FDG-PET disease progression have consisted largely of semi-automated approaches requiring significant manual input [13]. For example, the Auto-PERCIST software, based on traditional rules-based methodologies, can extract quantitative data for relevant imaging pathology (SUVmax, volume, etc.) and was shown to reduce inter-reader variability. However, this system demands much of the interpreting physician, requiring user input for lesion identification, manual registration of comparison examinations, and expert localization and selection of the reference tumor as the basis for comparison across studies. Other work used a CNN-based deep learning approach to achieve automated segmentation of lung tumors in thoracic FDG-PET images, but relied on phantom images or small datasets without an end-to-end approach [14]. By contrast, our approach is a fully automated, end-to-end method that reports progression, stable disease, or response without requiring any manual input from the human expert, achieves state-of-the-art inter-rater agreement with human experts, and is robust to external populations with varying scanner parameters, protocols, and acquisition settings.

Prior work in automated disease progression assessment has used class activation maps to visualize the progression of disease over multiple days for detecting COVID-19 from CT [15]. Other work on MRI for automatically assessing treatment response primarily focused on using deep learning to segment tumor regions and correlating the intermediate extracted features with progression [16]. Alternative approaches assigned severity scores to retinopathy imaging that were tracked longitudinally to determine response [17]. Such approaches, however, need extensive segmentation annotations, which are prohibitively expensive for modalities like PET/CT. They are also limited by not assessing treatment response directly from the data in an end-to-end fashion. OncoNet leverages pretraining for abnormality detection and derives its supervision signal directly from associated radiology reports, making it highly scalable and unbiased by hand-crafted intermediate features. FDG PET/CT has become indispensable in the routine clinical management of cancer patients and in therapeutic clinical trials [18][19][20].
Response to cancer treatment is determined by serial size and SUV measurements of index cancerous lesions seen on PET/CT scans; the percentage change in these measurements between scans is used to monitor response to therapy and demands standardized, reproducible assessments for meaningful comparisons and conclusions across multiple trials. For example, the PET Response Criteria in Solid Tumors 1.0 (PERCIST 1.0) was proposed in 2009 as a method to standardize the assessment of tumor response and includes the assessment of SUVmax on PET [21, 22]. However, poor inter-reader agreement using scoring systems across examinations has been widely reported, with values as low as 0.14 (range 0.14-0.68) under a variety of experimental settings and comparison methods; the agreement rate is likely lower in clinical practice than in ideal study settings [23][24][25]. Such variability is an often-cited hurdle to broader utilization of quantitative FDG PET/CT for response assessment, especially when examining early treatment-related changes [1, 26]. In practice the SUVmax is reasonably easy to determine with many forms of software and, as mentioned above, can improve inter-reader agreement [27]. We found that OncoNet achieved human-expert-level agreement on treatment response on a three-point scale (i.e., progression, stable, response). Routine use of OncoNet for simplified categorical measures of disease state may lead to improved consistency and may also help address the challenges of patients' direct access to medical imaging records, as mandated under the final rule of the 21st Century Cures Act, by providing simplified quantitative outcome measures for tracking oncologic disease over time.

This study includes several important limitations. It is a retrospective study, a design with well-established shortcomings and inherent limitations. The deep learning model described was developed and trained on data from a single large academic institution, and while robust external test evaluation was performed, additional study is needed to comprehensively understand the generalizability of our model and inform the direction of future work. The evaluation of this approach considered only a few of the many use cases for FDG PET/CT; however, the methods and results should be considered when applying the approach to other predictive tasks. Lastly, while our results are promising, delivering production-ready models in their final clinical form is beyond the scope of this study, and additional work is needed before deploying such models in clinical practice.

In conclusion, this work describes the development of OncoNet as an end-to-end approach for quantitative longitudinal treatment response assessment on multi-slice, multi-modality oncologic FDG PET/CT imaging examinations. OncoNet achieved an AUROC of 0.85 for automated determination of disease resolution, stability, or progression using pairs of FDG PET/CT studies obtained before and after treatment, with robust external validation (AUROC 0.84). OncoNet further achieves agreement with a board-certified radiologist with a kappa of 0.8. OncoNet's methodology and associated annotated dataset are designed to achieve automated quantitative oncologic imaging evaluation over time, with potentially broad implications for cancer care, and contribute to the broader machine learning in healthcare research community.
[1] PET/CT evaluation of response to chemotherapy in non-small cell lung cancer: PET Response Criteria in Solid Tumors (PERCIST) versus Response Evaluation Criteria in Solid Tumors (RECIST)
[2] CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison
[3] PENet: a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging
[4] Deep learning-assisted diagnosis of cerebral aneurysms using the HeadXNet model
[5] Multi-task weak supervision enables anatomically-resolved abnormality detection in whole-body FDG-PET/CT
[6] Deep learning algorithm predicts diabetic retinopathy progression in individual patients
[7] Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging
[8] Lugano 2014 criteria for assessing FDG-PET/CT in lymphoma: an operational approach for clinical trials. Drug Design, Development and Therapy
[9] Quo vadis, action recognition? A new model and the Kinetics dataset
[10] Siamese neural networks for one-shot image recognition
[11] Automated estimation of progression of interstitial lung disease in CT images
[12] Prediction of survival by [18F]fluorodeoxyglucose positron emission tomography in patients with locally advanced non-small-cell lung cancer undergoing definitive chemoradiation therapy: results of the ACRIN 6668/RTOG 0235 trial
[13] Quantitation of cancer treatment response by 2-[18F]FDG PET/CT: multicenter assessment of measurement variability using Auto-PERCIST™
[14] A deep-learning-based fully automated segmentation approach to delineate tumors in FDG-PET images of patients with lung cancer
[15] Automated quantification of COVID-19 severity and progression using chest CT images
[16] Automated quantitative tumour response assessment of MRI in neuro-oncology with artificial neural networks: a multicentre, retrospective study
[17] Monitoring disease progression with a quantitative severity scale for retinopathy of prematurity using deep learning
[18] Methods for staging non-small cell lung cancer: diagnosis and management of lung cancer: American College of Chest Physicians evidence-based clinical practice guidelines
[19] Non-small cell lung cancer, version 5.2017, NCCN clinical practice guidelines in oncology
[20] ACR Appropriateness Criteria non-invasive clinical staging of bronchogenic carcinoma
[21] From RECIST to PERCIST: evolving considerations for PET response criteria in solid tumors
[22] Practical PERCIST: a simplified guide to PET Response Criteria in Solid Tumors 1.0
[23] Inter-reader reliability of early FDG-PET/CT response assessment using the Deauville scale after 2 cycles of intensive chemotherapy (OEPA) in Hodgkin's lymphoma
[24] Machine learning-based assignment of Deauville scores is comparable to interobserver variability on interim FDG PET/CT images of pediatric lymphoma patients
[25] Variance of standardized uptake values for FDG-PET/CT greater in clinical practice than under ideal study settings
[26] Metabolic monitoring of breast cancer chemohormonotherapy using positron emission tomography: initial evaluation
[27] The predictive role of interim positron emission tomography for Hodgkin lymphoma treatment outcome is confirmed using the interpretation criteria of the Deauville five-point scale

Acknowledgements. We would like to acknowledge the GE Blue Sky team (Elizabeth Philps, Omri Ziv, Gil Kovalsky, Melissa Desnoyers, Shai Kremer) for their financial support for this industry-academic collaboration. Correspondence: hjoshi@cs.stanford.edu.