title: Quality monitoring of federated Covid-19 lesion segmentation
authors: Gonzalez, Camila; Harder, Christian; Ranem, Amin; Fischbach, Ricarda; Kaltenborn, Isabel; Dadras, Armin; Bucher, Andreas; Mukhopadhyay, Anirban
date: 2021-12-16

Federated Learning is the most promising way to train robust Deep Learning models for the segmentation of Covid-19-related findings in chest CTs. By learning in a decentralized fashion, heterogeneous data can be leveraged from a variety of sources and acquisition protocols whilst ensuring patient privacy. It is, however, crucial to continuously monitor the performance of the model. Yet when it comes to the segmentation of diffuse lung lesions, a quick visual inspection is not enough to assess the quality, and thorough monitoring of all network outputs by expert radiologists is not feasible. In this work, we present an array of lightweight metrics that can be calculated locally in each hospital and then aggregated for central monitoring of a federated system. Our linear model detects over 70% of low-quality segmentations on an out-of-distribution dataset and thus reliably signals a decline in model performance.

The Covid-19 pandemic has strained medical resources across the world while demonstrating the value of time-saving workflow enhancements. Deep Learning solutions that segment Covid-19-characteristic lesions in CTs to quantify clinically relevant infection parameters have shown promising results, yet present approaches frequently lack the maturity required for clinical use [1]. This is mainly because neural networks fail silently and appropriate quality controls are lacking. Scanner models and acquisition protocols vary between and within hospitals, shifting the image distribution and causing deep learning models to produce low-quality outputs with high confidence [2]. Covid-19-related ground-glass opacities and consolidations can take various forms, from multiple small regions to diffuse involvement of the entire lung [3]. Identifying low-quality segmentation masks is very time-consuming and requires extensive experience, and thorough monitoring of all network outputs by expert readers is not logistically feasible.

Automated quality assurance for segmentation masks is not yet a well-developed field. Existing approaches include training a CNN on the logits of the segmentation prediction [4] or applying Reverse Classification Accuracy [5] to predict segmentation quality. These are either computationally expensive or depend on rigid target shapes, which Covid-19 lesions do not exhibit. Failed segmentations can, however, be identified by observing certain properties of the segmentation masks. We propose an array of lightweight yet reliable quality metrics for segmentation masks that do not require ground-truth annotations. These can be calculated locally, without the need for expert reader review, and then aggregated for each hospital for central monitoring of federated systems, as illustrated in Fig. 1.
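The aggregation protocol itself is not specified here, so the following is a purely illustrative sketch of the local-compute, central-monitor idea. All names and the flagging threshold are our assumptions, not from the paper; the point is only that each hospital shares summary statistics of its locally computed quality scores rather than any patient data.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SiteReport:
    """Summary statistics a hospital shares with the central monitor.
    Only aggregates leave the site; images and masks stay local."""
    site_id: str
    n_cases: int
    mean_quality: float     # mean predicted Dice over local cases
    failed_fraction: float  # share of cases with predicted Dice < 0.6

def make_report(site_id: str, predicted_dice: np.ndarray) -> SiteReport:
    """Computed locally at each hospital from the predicted quality scores."""
    return SiteReport(
        site_id=site_id,
        n_cases=len(predicted_dice),
        mean_quality=float(np.mean(predicted_dice)),
        failed_fraction=float(np.mean(predicted_dice < 0.6)),
    )

def flag_declining_sites(reports: List[SiteReport],
                         max_failed_fraction: float = 0.1) -> List[str]:
    """Central monitor: flag sites whose local failure rate exceeds a
    threshold. The 0.1 value is an arbitrary illustration, not from the paper."""
    return [r.site_id for r in reports if r.failed_fraction > max_failed_fraction]
```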
We implemented our code with Python 3.8 and PyTorch 1.6 and performed a retrospective study using several open-source datasets as well as in-house data. The code can be found at github.com/MECLabTUDA/QA_Seg.

Data: To obtain a dataset of predicted segmentations, we extracted predictions from an nnU-Net [6] trained on the COVID-19 Lung Lesion Segmentation Challenge (Challenge) dataset [7]. We also predicted segmentations on MosMed [8], as well as on in-house data comprising a further 50 cases. Images were interpolated to dimensions (50, 512, 512). Further details can be found in Table 1. We partitioned the predictions into in-distribution (ID) data for the Challenge and in-house datasets, with which we trained our classifiers, and out-of-distribution (OOD) data for MosMed. The ID datasets were randomly divided into ID train and ID test splits. We considered the Dice score between ground-truth and predicted masks as a measure of segmentation quality, as it is the most widely used metric for segmentation overlap. As shown in Table 1, the ID data is heavily skewed towards good-quality segmentations. We define a failed segmentation as one with a Dice score lower than 0.6 (following Valindria et al. [5]) and report the prevalence of failed segmentations in Table 1.

Proposed features: Inspired by van Rikxoort et al. [10], we sought to predict the quality of segmentation masks, in the form of the Dice coefficient, using only four features (see Fig. 2), defined as follows:

- Connected Components: While lung lesions may occupy several components, failed segmentations are often more disconnected. We counted the number of connected components using Scikit-Image [11], considering voxels to belong to the same component if they lie within a City Block distance of 3 of each other, i.e., if they are face-, edge-, or corner-adjacent.
- Intensity Mode: From the intensity values in the CT, we can identify tissue that is very unlikely to be infected. Inspired by Kalka et al. [12], we fitted a Gaussian distribution to the intensities of the largest component and returned its mean.
- Segmentation Smoothness: In a correct segmentation mask, we expect two consecutive slices to overlap strongly and thus have a high two-dimensional Dice score. We computed the smoothness of every component by averaging the Dice scores of all pairs of consecutive slices that were not identical, and then averaged the smoothness over all components.
- Lesions within Lungs: A correct segmentation mask should be completely contained within the lungs. To factor this in, we used a pre-trained lung segmentation model [9] and recorded the percentage of segmented tissue that lies inside the lung.
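The following is a minimal sketch of how these four features can be computed under our reading of the descriptions above; it is not the authors' released implementation (which lives in their repository). It assumes the CT volume and binary lesion mask are NumPy arrays of shape (50, 512, 512) with slices along the first axis, plus a binary lung mask from the pre-trained model of [9]; the fallback values for empty masks are our assumptions.

```python
import numpy as np
from scipy.stats import norm
from skimage.measure import label

def num_connected_components(mask):
    """Feature 1: number of connected components. connectivity=3 treats
    face-, edge-, and corner-adjacent voxels (City Block distance <= 3)
    as belonging to the same component."""
    _, num = label(mask, return_num=True, connectivity=3)
    return num

def intensity_mode(ct, mask):
    """Feature 2: mean of a Gaussian fitted to the CT intensities of the
    largest connected component of the predicted mask."""
    labeled, num = label(mask, return_num=True, connectivity=3)
    if num == 0:
        return 0.0  # assumption: fallback for empty masks
    sizes = np.bincount(labeled.ravel())[1:]  # voxel count per component
    largest = int(np.argmax(sizes)) + 1
    mu, _sigma = norm.fit(ct[labeled == largest])
    return float(mu)

def _dice2d(a, b):
    """Two-dimensional Dice overlap between two binary slices."""
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

def smoothness(mask):
    """Feature 3: per component, average 2D Dice over all pairs of
    consecutive, non-identical axial slices; then average over components."""
    labeled, num = label(mask, return_num=True, connectivity=3)
    per_component = []
    for i in range(1, num + 1):
        comp = labeled == i
        dices = [_dice2d(comp[z], comp[z + 1])
                 for z in range(comp.shape[0] - 1)
                 if not np.array_equal(comp[z], comp[z + 1])]
        if dices:
            per_component.append(np.mean(dices))
    return float(np.mean(per_component)) if per_component else 1.0

def fraction_within_lungs(mask, lung_mask):
    """Feature 4: share of the predicted lesion volume lying inside the
    lung mask produced by a pre-trained lung segmentation model."""
    total = mask.sum()
    return float(np.logical_and(mask, lung_mask).sum() / total) if total > 0 else 1.0

def extract_features(ct, mask, lung_mask):
    """Stack the four quality features into one vector per case."""
    return np.array([
        num_connected_components(mask),
        intensity_mode(ct, mask),
        smoothness(mask),
        fraction_within_lungs(mask, lung_mask),
    ])
```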
Models and training: With these features, we trained and evaluated several models to predict segmentation quality. We regressed the quality directly with a Ridge Regression (RR) and a Support Vector Regression (SVR), both trained until convergence, as well as a Multi-Layer Perceptron (MLP) with hidden layers of sizes (50, 100, 100, 50), trained for 200 epochs to minimize the Mean Squared Error. We also discretized the quality values into five bins and performed classification with Support Vector Machine (SVM) and Logistic Regression (LR) models, using balanced class weights. Unless otherwise stated, we used the default Scikit-learn [13] implementations.

Evaluation: As we were primarily interested in detecting failed segmentations, we report the sensitivity of all five models on this task. We also report the specificity for identifying the correct quality interval (averaged over the five bins) on all ID and OOD datasets. In addition, we report the Mean Absolute Error as a metric that quantifies each model's ability to predict the segmentation quality directly.

In terms of sensitivity (the detection of faulty segmentations), the classifiers (LR and SVM) outperformed the regression models by a large margin (see Table 2). This can be attributed to the class weights of the LR and SVM models balancing the unevenly represented classes in the training data, which improved their performance on differently distributed data. Though we were unable to detect the single failed segmentation out of 10 on the in-house dataset, we highlight the performance of the LR model, which detects over 60% of failed segmentations on both of the larger datasets, Challenge and MosMed. All models showed a high specificity of over 0.8 on all datasets. The regression models achieved a lower Mean Absolute Error but seemed to overfit the good-quality segmentations in the training dataset, which might explain their worse sensitivity.

Table 2. Sensitivity of finding failed segmentations (Dice < 0.6), specificity of identifying the correct quality interval (averaged over 5 bins), and Mean Absolute Error (mean +/- std) for each model on the ID and OOD datasets.

We further evaluated the LR model using 10,000 bootstrapping runs, sampling 192 data points from the training set and, for every run, evaluating the sensitivity of the model trained on these samples on the ID and OOD datasets. The resulting 95% confidence interval for the sensitivity covers the range from 0.22 to 1.0. Furthermore, using a hypothesis test with a significance level of 0.05, we can reject every null hypothesis stating that the sensitivity of the LR model is below 0.28.

To evaluate the individual contribution of each feature, we performed an ablation study in which we left out each feature in turn when training LR models. The Intensity Mode feature proved to be the least useful: leaving it out allows us to correctly identify 5 more high-quality segmentations as such, but 9 fewer faulty segmentations are detected. All in all, using all four features achieves the best sensitivity-to-specificity trade-off. We attribute most of the falsely classified segmentations to the low representation of bad segmentations in the training data and to these segmentations displaying plausible shapes. For example, segmentation masks covering only a few spots of healthy lung tissue, containing intensity values of possibly infected areas while maintaining a smooth shape, were not detected.
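For concreteness, here is a sketch of the classification setup and the bootstrap estimate described above, under stated assumptions: the feature matrices X_train and X_ood and the ground-truth Dice vectors dice_train and dice_ood are hypothetical names for precomputed arrays, the equal-width bin edges are our assumption (the paper states only that five bins were used), and sampling with replacement follows the usual bootstrap convention.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

BIN_EDGES = [0.2, 0.4, 0.6, 0.8]  # assumption: five equal-width Dice bins

def to_bins(dice):
    """Discretize Dice scores into five quality classes (0..4)."""
    return np.digitize(dice, BIN_EDGES)

def failed_sensitivity(model, X, dice):
    """Share of truly failed segmentations (Dice < 0.6) that the
    classifier also places in a failed bin (classes 0-2)."""
    truly_failed = dice < 0.6
    if not truly_failed.any():
        return np.nan
    return float((model.predict(X[truly_failed]) <= 2).mean())

rng = np.random.default_rng(0)
sensitivities = []
for _ in range(10_000):                       # 10,000 bootstrap runs
    idx = rng.choice(len(X_train), size=192)  # 192 points per run, with replacement
    y = to_bins(dice_train[idx])
    if len(np.unique(y)) < 2:
        continue  # guard: logistic regression needs at least two classes
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train[idx], y)
    sensitivities.append(failed_sensitivity(clf, X_ood, dice_ood))

# Empirical 95% confidence interval for the OOD sensitivity
lo, hi = np.percentile(sensitivities, [2.5, 97.5])
```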
We introduced a simple method to monitor the performance of an nnU-Net trained to detect lung infections caused by Covid-19. We designed four features and found that an LR model using them reliably detects faulty segmentation masks. All features are lightweight and require no ground-truth annotations, so they can be used to monitor the deployment of a distributed, federated learning system.

Our findings have some limitations. First, we tested our methods retrospectively on a statically trained nnU-Net. This allowed us to evaluate our methods accurately, as we had access to ground-truth test annotations, but a prospective study on a federated system with a few participating institutions would better emulate real deployment. Second, the CT data was acquired from ICU patients, introducing considerable bias in patient demographics, which are likely not representative of the general Covid-19 population. This also suggests that a measure other than Dice may be better suited for the general population, as the expressiveness of Dice depends heavily on lesion size. Finally, each dataset was annotated by a different group of experts, so the definitions of the findings may vary across datasets. This is often the case when evaluating with OOD data but should be taken into account when considering the differences in performance.

In conclusion, training models in a federated fashion makes it possible to leverage heterogeneous data sources without compromising patient privacy. However, it is necessary to constantly monitor the quality of the model outputs. In this work, we introduced an array of lightweight quality metrics that can be calculated locally and aggregated for central monitoring. These are particularly well suited to the use case of lung lesion segmentation in chest CTs, as lesions vary greatly in form and location, and verifying their correctness is time-intensive even for trained radiologists. Future work should expand the metric catalogue and assess the effectiveness of the proposed methods in a model deployed across multiple hospitals. Our results present a first step towards effective quality control of federated lung lesion segmentation.

References

[1] Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans.
[2] Detecting when pre-trained nnU-Net models fail silently for Covid-19 lung lesion segmentation.
[3] Performance of radiologists in differentiating COVID-19 from non-COVID-19 viral pneumonia at chest CT.
[4] CNN-based quality assurance for automatic segmentation of breast cancer in radiotherapy.
[5] Reverse classification accuracy: predicting segmentation performance in the absence of ground truth.
[6] nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation.
[7] Rapid artificial intelligence solutions in a pandemic: the COVID-19-20 lung CT lesion segmentation challenge.
[8] Chest CT scans with COVID-19 related findings dataset.
[9] Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem.
[10] Automatic lung segmentation from thoracic computed tomography scans using a hybrid approach with error detection.
[11] scikit-image: image processing in Python.
[12] An automated method for predicting iris segmentation failures.
[13] Scikit-learn: machine learning in Python. Journal of Machine Learning Research.