CheXstray: Real-time Multi-Modal Data Concordance for Drift Detection in Medical Imaging AI
Authors: Soin, Arjun; Merkow, Jameson; Long, Jin; Cohen, Joseph Paul; Saligrama, Smitha; Kaiser, Stephen; Borg, Steven; Tarapov, Ivan; Lungren, Matthew P
Date: 2022-02-06

Clinical Artificial Intelligence (AI) applications are rapidly expanding worldwide and have the potential to impact all areas of medical practice. Medical imaging applications constitute the vast majority of approved clinical AI applications. Though healthcare systems are eager to adopt AI solutions, a fundamental question remains: what happens after the AI model goes into production? We use the CheXpert and PadChest public datasets to build and test a medical imaging AI drift monitoring workflow that tracks data and model drift without contemporaneous ground truth. We simulate drift in multiple experiments to compare model performance with our novel multi-modal drift metric, which uses DICOM metadata, an image appearance representation from a variational autoencoder (VAE), and model output probabilities as input. Through experimentation, we demonstrate a strong proxy for ground truth performance using unsupervised distributional shifts in relevant metadata, predicted probabilities, and the VAE latent representation. Our key contributions include (1) proof-of-concept for medical imaging drift detection that includes the use of a VAE and domain-specific statistical methods, (2) a multi-modal methodology to measure and unify drift metrics, (3) new insights into the challenges and solutions involved in observing deployed medical imaging AI, and (4) creation of open-source tools that enable others to easily run their own workflows and scenarios. This work has important implications. It addresses the concerning translation gap in continuous medical imaging AI model monitoring, which is common in dynamic healthcare environments. For continuous features (e.g. age, latent encoding from the VAE), Kolmogorov-Smirnov (K-S) tests are applied, and for categorical features (e.g. sex, view projection), χ2 tests are used to measure the distribution shift. Results: In aggregate, we found agreement between our proposed multi-modal data concordance metric and medical imaging AI model performance metrics. Through experimentation, we demonstrate a strong proxy for ground truth performance (AUROC) using unsupervised distributional shifts in relevant DICOM metadata tags, predicted probabilities, and the VAE latent representation. This comprehensive approach to unsupervised drift detection was found to correlate with supervised drift detection approaches that leverage ground truth labels. Conclusion: We propose methodologies to achieve real-time drift monitoring metrics in the absence of contemporaneous ground truth for a medical imaging AI model and demonstrate the robustness of our approach with a chest X-ray use case.
Our key contributions include (1) proof-of-concept for medical imaging drift detection approaches, including the use of variational autoencoders and medical-imaging-specific statistical methods, (2) a methodology for measuring and unifying drift metrics across patient demographics, imaging metadata and pixel-based statistics, (3) new insights into the unique challenges of, and proposed solutions for, observing medical imaging AI models in production, and (4) creation of open-source tools that leverage existing open-source medical imaging datasets, enabling others to easily use our tools. This work has important implications for addressing the translation gap related to continuous medical imaging AI model monitoring in dynamic healthcare environments. Artificial intelligence (AI) applications in medical imaging have expanded substantially over the past 5 years [1]. The growth is evident in both the rising volume of academic publications and the accelerating commercial approvals of these applications for clinical practice [2, 3, 1, 4, 5]. Alongside this trend of new discovery and market-ready products, clinicians are increasingly eager to adopt AI solutions into their practice [6]. However, to date clinical translation has been disproportionately limited. The reasons behind the translational gap in real-world clinical practice are multi-factorial, partially explained by technical and infrastructure hurdles, a lack of IT resources, and the absence of clear data-driven clinical utility analyses. Many of these barriers to adoption are being addressed by existing or emerging solutions [7, 8, 9, 10]. Yet even with successful site-specific model validation and successful integration and deployment in clinical workflows, a fundamental problem remains: what happens after the AI model goes into production? Of particular interest in these production systems is how model performance changes over the life cycle of an AI model. Traditional performance drift detection requires monitoring a metric of interest (such as AUROC, F1 score, or precision-recall scores) and alerting when that metric falls below a specified value, allowing administrators to investigate and perhaps trigger an adjustment or retraining of the model. Clearly, translating medical imaging AI safely and effectively requires a real-time understanding of performance to address critical questions that remain unanswered in healthcare AI monitoring. The lack of visibility and the inability to guard against performance drift remain critical barriers to widespread adoption of AI solutions in healthcare [11]. The current lack of answers to these questions in the field reflects the unrealistic expectation that input data and model performance will remain static indefinitely, which runs counter to decades of machine learning operations research, as outlined by extensive experience in AI model deployment in other verticals [12, 13]. Identifying a solution for real-time model monitoring in production clinical workflows is crucial, and includes detecting both out-of-distribution data and data drift using statistical techniques. Most performance monitoring solutions in production environments require systematic access to contemporaneous or near real-time ground truth data to inform metrics [14]. But in healthcare, ground truth data is seldom, if ever, available in real time, particularly for medical imaging. Further, existing model monitoring solutions are designed to leverage structured tabular data, and no solution currently exists for imaging data.
The challenge we face in the medical imaging AI model monitoring task, then, is to derive a systematic approach to real-time clinical AI model performance monitoring for medical imaging data (pixel and non-pixel data) without contemporaneous ground truth labels. To tackle this critical issue, we propose a system that relies on statistics of input data, deep-learning-based pixel data representations, and output predictions, coupled with a novel multi-modal integration solution, to allow real-time monitoring that can alert when data has drifted in ways that may adversely affect model performance; a solution which, to date, has never been described for medical imaging models. Monitoring of machine learning models in production is a distinct domain that lies between traditional software systems and quality outcome management. It requires appropriate practices, strategies, and tools [12]. These challenges are exacerbated in medical imaging by the lack of tools for monitoring pixel-based AI models across the field.

Figure 1: Overview of our multi-modal concordance algorithm. From each object in a data stream of X-ray exams, we extract DICOM metadata, model predicted probabilities and a latent representation produced by a variational autoencoder. We collect these values from exams in a reference set and compare distributions of the extracted data to a detection window to produce a similarity measure for each component. We then standardize and weight these measures to combine them into a single value representative of the total concordance between the reference data and the detection window. This provides a simple metric capable of detecting data drift away from the reference.

Further, medical imaging data is often accompanied by various metadata regarding patient demographics, device model and manufacturer, patient position, image projection, and a number of device settings, all of which may lead to unexpected results from AI systems. The predictive performance of medical imaging AI models can degrade in drift scenarios such as changing patient populations, changing disease prevalence, new acquisition protocols, new imaging software or equipment updates, new clinics, and many more. Furthermore, it is insufficient to simply measure changes in these statistics individually for practical notification and intervention: data points per day may number in the hundreds or thousands, and while changes in some are critical, many are superfluous. Thus, appropriately unifying the multitude of individual metrics is vital to monitoring drift for healthcare AI. In this manuscript, we explore a data-driven approach to a system for real-time AI model monitoring in a medical imaging environment. We demonstrate a critical use case of providing drift metrics that correlate strongly with changes in model performance (based on ground truth metrics) but do so without the need for contemporaneous ground truth. The key contributions of the presented work include (1) a multifaceted medical imaging drift detection approach including the use of variational autoencoders and statistical methods, (2) a methodology for measuring and unifying drift metrics across patient demographics, imaging metadata and pixel-based statistics, (3) new insights into the unique challenges and proposed solutions for medical imaging AI models in production, and (4) creation of open-source tools, demonstrated on existing public datasets, that allow the research community to build and validate their own custom monitoring systems.
In this section, we outline the building blocks of our approach to medical imaging AI drift detection, covering data specifications, the AI model, drift detection concepts and foundational metrics. Our experimentation utilizes two publicly available datasets to test a medical imaging AI drift workflow: CheXpert [15] and PadChest [16]. CheXpert is a large dataset of chest X-rays and a competition for automated chest X-ray interpretation that includes uncertainty labels as well as radiologist-labeled reference-standard evaluation sets. CheXpert contains 224,316 chest radiographs of 65,240 patients who underwent an examination at Stanford University Medical Center between October 2002 and July 2017 and includes both inpatient and outpatient medical scans. The PadChest dataset is a large, labeled chest X-ray dataset containing 160K high-resolution images with their corresponding labeled reports. Its detailed description and labeling methods are described in [16]. PadChest includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and including additional information on image acquisition and patient demography. Both human annotation and natural language processing (NLP) annotation were used to obtain PadChest's labels. Labels include 19 differential diagnoses, 103 anatomic locations and 179 different radiological findings, which were mapped onto the NLM standard Unified Medical Language System (UMLS) using controlled biomedical vocabulary unique identifiers (CUIs). These were further organized into semantic hierarchical concept trees. Unlike many other datasets, the chronology of the scans was not removed during de-identification, making PadChest uniquely useful for model and data drift experiments. Typically, training a new clinical model leverages a pre-trained model which is fine-tuned on a new task and the class biases of a different clinical setting. To mimic this process, we started with an available model pretrained on CheXpert, then trained and validated it on PadChest data. Fine-tuning a model pretrained on CheXpert on PadChest data requires unifying their label sets. PadChest's original label set is extensive; from it we merged relevant labels into a common label set of ten: Atelectasis, Cardiomegaly, Consolidation, Edema, Lesion, No Finding, Opacity, Pleural Abnormalities, Pleural Effusion and Pneumonia (see Table 3 for details). We split PadChest into training, validation and test sets based on exam dates, allowing us to produce simulated data streams from the data. We used 2013-01-01 and 2014-01-01 to partition the data into training, validation and test sets. Our training data spans 2007-05-03 to 2012-12-31 (12-29 to 12-31 contain frontal samples), the validation data spans 2013-01-01 to 2013-12-31, and our test data starts on 2014-01-01 and continues to the end of the dataset (2017). Note that in early 2014 the method by which images were labeled began to change: where studies were previously labeled using NLP, they were now manually labeled by expert radiologists. There are also gaps in the data where the number of exams per day drops significantly. Since most AI models have a life cycle of about one year, we conduct our experiments within the first year of the test set (from 2014-01-01 to 2014-12-31).
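As a minimal sketch of this temporal partitioning (assuming the exam dates are available in a pandas DataFrame; the file name and the study_date column are illustrative, not the actual PadChest field names):

```python
import pandas as pd

# Hypothetical metadata table; "study_date" stands in for the PadChest exam-date field.
exams = pd.read_csv("padchest_metadata.csv", parse_dates=["study_date"])

train = exams[exams.study_date < "2013-01-01"]             # 2007-05-03 .. 2012-12-31
val = exams[(exams.study_date >= "2013-01-01") &
            (exams.study_date < "2014-01-01")]             # calendar year 2013
test = exams[exams.study_date >= "2014-01-01"]             # 2014-01-01 to end of dataset

# Drift experiments run on the first year of the simulated production stream.
test_2014 = test[test.study_date <= "2014-12-31"]
```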
For a CheXpert pathology label-wise training data distribution breakdown, refer to [15]. Table 1 contains the resulting post-label-mapping distribution and date cutoffs between domains (training, validation, test). Our classifier is a DenseNet [17] with 121 layers, as originally implemented in PyTorch for the CheXpert competition. Starting from ImageNet [18] weights, we first pretrain the model on frontal-only CheXpert training data (N = 191,000), using a U-Ones scheme (uncertain labels considered positive) as outlined in [15]. We selected the model that yielded the best performance on the 10 pathologies listed in Section 2.1.1 and then retrained only the final classifier layers on PadChest frontal-only training data. After PadChest training completes, we deploy the model in a pseudo-clinical setting and pass the entirety of PadChest through the system in a sequential, date-wise fashion, recording both raw scores and activations from the system for each sample. Using these activations, we are able to measure performance benchmarks for any time window within the dataset. The performance benchmarks for each stage of model development are found in Table 2. Note that while transfer learning from ImageNet for downstream medical imaging tasks is the current standard in deep learning, we fine-tune the model on two large frontal X-ray datasets to unify the pathology labels, allowing us to accurately explore different drift scenarios in a pseudo-clinical deployment setting. When objects in a dataset include timestamps, it is referred to as a data stream. The underlying statistics and properties of a data stream are subject to change over time, which gives rise to drift. Broadly speaking, drift falls into one of two categories: covariate shift (sometimes called input or feature drift) and concept shift. Covariate shift [19] is defined as a deviation within the input variables of a data stream. Covariate shift is common in medical imaging; examples include changes to imaging protocols, imaging software or equipment updates, and changing patient demographics. After a covariate shift, a deployed model may be operating in an untested or poorly validated environment in which performance degradation becomes an obvious concern [20]. When a significant time gap exists between contemporary data and model deployment, the likelihood of drift, and consequently of classification errors, increases [11]. For healthcare AI systems, drift-associated errors can cause unwarranted harm to patients. Ideally, we would like to continuously monitor any clinically deployed AI system and refresh the algorithm with new training data upon observing any significant performance degradation. Whereas covariate shift refers to changes in the input data x, concept shift occurs when the relationship between input data x and output variable y changes. Modern AI systems are built upon stationarity, the idea that the characteristics of a target class remain static [21]. This assumption allows models to be trained to identify those characteristics and then predict the presence of that class in unseen data. The assumption is not always valid, particularly when the target class can be influenced by outside factors. Take, for instance, the impact of COVID-19 on an automated chest X-ray interpretation model trained pre-COVID.
A model designed to predict mortality for COVID-19 patients using chest radiograph images may work on a dataset taken from the height of the pandemic, but as treatments advance and disease prevalence shifts, mortality prediction based on imaging features may no longer be sufficiently accurate. In this work, we assume that the concept is fixed, and our experiments concentrate on detecting changes related to covariate shift. In our approach, instead of highlighting differences in the data stream, we monitor the similarity, or concordance, of the data stream with respect to a reference dataset; when the concordance metric decreases, the degree to which the data has drifted has increased. Specifically, our method measures the concordance between a 'reference' set and a second set we refer to as a 'detection' window. The 'reference' set comprises a gold-standard collection of samples with known characteristics (and model performance) with which we wish to stay in concordance. To measure this concordance on a detection window, we calculate a number of individual metrics that compare statistics between the two samples. We calculate these metrics sequentially on a data stream, providing concordance metrics over time. More formally, we define a reference set $\omega_R$ that is a collection of individual exams $I$, $\omega_R = \{I_0, I_1, \ldots, I_{K-1}, I_K\}$. Using this sample, we are able to measure the concordance of a detection window at time $t$ ($\omega_t$) by applying our collection of metric functions, $\Psi = \{\psi_1, \psi_2, \ldots, \psi_{N-1}, \psi_N\}$, where each metric compares a subset of statistics between the two samples. For metric functions, we chose two statistical tests, one for continuous real-valued features and another for discrete (categorical) features. For continuous features (e.g. age, z from the VAE), a Kolmogorov-Smirnov (K-S) test is applied, which measures distribution shift from the reference window. The K-S test is a non-parametric test used to measure the distribution shift of a continuous variable from a reference sample. As a non-parametric test, the K-S test compares samples without assuming a specific distribution of the variable, making it an efficient and effective way to detect a change in distribution from one time to another [22]. While we used these metrics for our experimentation, our framework is extensible and modular, built specifically to allow additional metrics to be added seamlessly. For categorical features (e.g. sex, projection), the χ2 (chi-square) goodness-of-fit test is used, which compares observed frequencies in the data to expected values. Another non-parametric test, the χ2 goodness-of-fit test calculates whether an input sample with observed frequencies is likely to have been obtained from the frequencies observed in the reference set [23]. Both of these tests provide a statistical measure of the similarity between the two distributions as well as a p-value indicating the likelihood under the null hypothesis. We found the p-value to give noisy results and to be an inconsistent measure of similarity between two distributions. The test statistics, on the other hand, directly compare the two distributions and provide a softer and more consistent measure of similarity. For these reasons, our experiments concentrate on the test statistics for measuring concordance and ignore the p-values for both tests.
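A minimal sketch of these two per-feature tests using SciPy; the arrays are placeholders for a reference sample and a detection-window sample, and, as in our experiments, only the test statistic is kept while the p-value is discarded:

```python
import numpy as np
from scipy import stats

def ks_statistic(reference, detection):
    """Two-sample K-S statistic for a continuous feature (e.g. age or a VAE latent dimension)."""
    statistic, _pvalue = stats.ks_2samp(reference, detection)
    return statistic

def chi2_statistic(reference, detection, categories):
    """Chi-square goodness-of-fit statistic for a categorical feature (e.g. sex or projection)."""
    ref_counts = np.array([np.sum(reference == c) for c in categories], dtype=float)
    det_counts = np.array([np.sum(detection == c) for c in categories], dtype=float)
    # Expected frequencies come from the reference distribution, rescaled to the detection sample size.
    expected = ref_counts / ref_counts.sum() * det_counts.sum()
    statistic, _pvalue = stats.chisquare(f_obs=det_counts, f_exp=expected)
    return statistic

# Toy usage on synthetic data
rng = np.random.default_rng(0)
print(ks_statistic(rng.normal(60, 15, 1000), rng.normal(65, 15, 500)))
print(chi2_statistic(rng.choice(["M", "F"], 1000), rng.choice(["M", "F"], 500), ["M", "F"]))
```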
To evaluate the performance of our model, we calculate AUROC, which serves as a discrimination measure of how well the model can separate patients into two groups: those with and those without a given pathology of interest [24]. The AUROC is the integral of the receiver operating characteristic curve, which measures the trade-off between true positive rate (TPR) and false positive rate (FPR) at different decision thresholds. A test with no better performance than chance has an AUROC of 0.5, while a test with perfect accuracy would have an AUROC of 1.0. Accordingly, an AUROC of 0.90 indicates that if we draw one test X-ray from the disease group and one from the non-disease group, 90% of the time the predicted score for the disease-group X-ray will be greater. Since monitoring AUROC over the production time frame can provide clear-cut evidence of a model drifting, it is a metric that physicians, hospital systems and AI clients would ideally track in real time. However, this also requires real-time, domain-expert-labeled ground truth (a cost-prohibitive and infeasible ask even for the most dynamic healthcare systems). AUROC serves another critical purpose for this work, giving insight into model-performance-based drift analysis in contrast with other statistical metrics that aim to detect drift without ground truth. We measure data concordance by capturing metrics from three sources: 1) DICOM metadata, 2) image appearance data, and 3) model response data, and then unify those metrics into a single multi-modal concordance metric, which we refer to as the Multi-Modal Concordance (MMC). Our multi-modal metric comprises a diverse set of signals that cover a variety of areas where drift can occur. DICOM metadata contains information on the origin and construction of the image as well as patient demographics; changes in these variables may be indicative of feature drift. In addition to image source information, our approach utilizes appearance-based features from a variational autoencoder (VAE) to directly monitor visual feature drift. Lastly, we incorporate model responses, which can indicate whether other changes have occurred that directly affect model output statistics, including covariate shift or prior probability shift. DICOM (Digital Imaging and Communications in Medicine) is the international standard to transmit, store, retrieve, print, process, and display medical imaging information [25]. Data available in the DICOM format is produced by the imaging device. A DICOM file consists of a header and image pixel intensity data packed into a single file. The image header holds embedded metadata. This metadata includes demographic information such as patient sex and date of birth, as well as a record of imaging attributes that govern how the image is captured, stored and transmitted. All of these attributes can affect an AI system's responses. We use these DICOM variables to characterize data shifts, given that changes to these attributes can be indicative of changes to imaging features and patient population makeup. For metadata, clinical and imaging protocol attributes extracted from the PadChest dataset are analyzed to generate the drift metrics in this study and fall into three categories: 1) patient demographic features, including age and sex, 2) image formation metadata, i.e.
X-ray scan information including view position, device manufacturer, frontal projection (Y/N), X-ray tube current, X-ray exposure and relative exposure, and 3) image storage information such as pixel representation, spatial resolution, bits stored, window width, and pixel aspect ratio. Medical imaging data is complex, and a medical imaging data stream can shift without any accompanying metadata changes. Changes in imaging hardware, disease presentation or patient demographics that are not captured in the DICOM metadata may be invisible to the human eye but noticeable to a sensitive ML model. After such a shift, features originally present in the training and validation data may morph or disappear altogether, impacting model performance. It is crucial to be able to capture these changes when monitoring a live data stream. Recent work on high-dimensional data drift detection proposes combining dimensionality reduction (e.g. PCA, autoencoders) with two-sample hypothesis testing [14]. We leverage a variational autoencoder (VAE) to generate an encoded representation of each image, upon which we apply statistical metrics to detect drift. Rather than measuring change by image reconstruction loss, which is a scalar quantity, we utilize the latent space encoding generated by the VAE, which gives us a feature-rich representation with compressed yet relevant information about the input images. This encoded representation can be checked for distributional shifts more easily than the pixel representation [26, 27, 28] and is more descriptive than a scalar reconstruction value. Using this feature-rich latent space allows a much more fine-grained (and often explainable) analysis. Generic autoencoders compress input data into an encoded representation, then reconstruct the original data using only the latent values. Building on this concept, variational autoencoders (VAEs) assume that the data follows some underlying parametric probability distribution (typically a multivariate Gaussian) and attempt to model the parameters of this distribution, which become the image's latent representation. A by-product of this process is that the input is not simply compressed, but encoded into a probabilistic latent space in which inputs with similar features have similar latent representations [29]. We leverage this fact and encode input images into this descriptive latent representation, then build a statistical model of it, allowing us to compare our reference data with a detection window. We trained our variational autoencoder (VAE) from scratch on the PadChest training data, including both frontal and lateral images. At test time, we feed all input images within the detection window through the encoding portion of the VAE and capture the resulting z parameters of the probability distribution. These parameters are used to establish the statistical similarity of new data to our reference dataset. The goal of monitoring live medical data streams is to measure the consistency of a data stream over the life cycle of a deployed model. If the data stream begins to drift and model performance begins to degrade, it stands to reason that the model responses will also change. Thus, it is crucial to monitor the model responses for any such changes. By and large, model responses shift for two reasons: 1) the underlying class distribution has shifted (prior probability shift), or 2) the visual representations have shifted (covariate shift).
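To make the three signal sources concrete, the sketch below collects the per-exam inputs used for monitoring. It assumes a trained classifier, a trained VAE encoder that returns the latent mean, a preprocessing function, and pydicom for metadata; the function and tag choices are illustrative rather than the exact CheXstray implementation:

```python
import pydicom
import torch

def extract_exam_signals(dicom_path, classifier, vae_encoder, preprocess):
    """Gather DICOM metadata, the VAE latent mean and model soft predictions for one exam."""
    ds = pydicom.dcmread(dicom_path)

    # 1) DICOM metadata: demographics plus image formation/storage attributes.
    metadata = {
        "PatientSex": getattr(ds, "PatientSex", None),
        "PatientAge": getattr(ds, "PatientAge", None),
        "ViewPosition": getattr(ds, "ViewPosition", None),
        "Manufacturer": getattr(ds, "Manufacturer", None),
        "BitsStored": getattr(ds, "BitsStored", None),
    }

    image = preprocess(ds.pixel_array)            # -> torch.Tensor shaped (1, C, H, W)
    with torch.no_grad():
        # 2) Appearance signal: latent mean (z) from the VAE encoder, e.g. a 128-dim vector.
        z = vae_encoder(image).squeeze(0)
        # 3) Model response signal: per-pathology soft predictions.
        probs = torch.sigmoid(classifier(image)).squeeze(0)

    return metadata, z.cpu().numpy(), probs.cpu().numpy()
```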
Both prior probability shift and covariate shift can dramatically affect performance; the latter in particular may have a significant impact on model performance but would likely go unnoticed without a performance audit. Measuring the responses directly allows us to catch any such changes in the data stream. In general, model response monitoring takes one of two forms: measuring soft predictions or hard predictions. Soft predictions refer to the raw scores or activations from the model, whereas hard predictions refer to the final labels output by the system, typically after applying a threshold. In this work, we exclusively use soft predictions for monitoring. Soft predictions contain valuable information on the relative certainty of our predictions. Model responses may incrementally shift as rare or more subtle signs of pathology increase in frequency, a change that would remain hidden by hard predictions until the responses pass some threshold. By directly measuring shifts in soft responses, we are able to detect these subtle changes. In this section, we discuss how our approach unifies multiple metrics across a variety of multi-modal inputs. Specifically, we discuss our sampling methodology for creating detection windows as well as our strategy for standardizing and aggregating individual metrics into a single multi-modal data concordance metric. Rolling Detection Window: To construct our detection windows, we use a sliding window technique. Sliding window sampling functions have parameters for window length (l) and stride (s). The window length determines the size of each window and the stride denotes the spacing between neighboring samples. We use temporally defined stride and window length values to collect all exams within a time window, looking backwards from the indexed date. For example, if the window length is 30 days, then the sample for December 31st would include all exams from December 1st through December 31st. Note that our approach calculates multiple metric values $m_i$ from each detection window $\omega$ using metric functions, $\psi_i(\omega) = m_i$. Over Sampling: Many distribution similarity metrics, including those used in this work, are sensitive to sample size; even when two samples are drawn from the same distribution, they may produce differing results simply due to sample size. Our method mitigates this issue by repeatedly calculating metrics on a fixed-size sample and averaging the results. To do this, we use a bootstrap method that samples the detection window to draw K samples, then calculates metrics on this bootstrapped sample. We repeat this process N times and average the results to obtain a final value. More formally, computation of a given metric on a detection window is a function $\Theta(\cdot)$ that uses another function, $\theta_K$, to collect K samples from a detection window $\omega$, calculates metric $\psi_i$ on them, repeats this N times and aggregates the results: $\Theta_i(\omega) = \frac{1}{N}\sum_{n=1}^{N} \psi_i\big(\theta_K(\omega)\big)$, where $\theta_K$ is a function that collects K samples from $\omega$ with replacement. Detection Window Sets: A detection window set is a collection of detection windows, where each detection window is typically captured with a time index. If we denote a detection window taken at time t as $\omega_t$, then we define a detection window set taken from time a to time b as $\Omega_{[a,b]} = \{\omega_a, \omega_{a+1}, \ldots, \omega_{b-1}, \omega_b\}$.
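A minimal sketch of the rolling detection window and the bootstrap over-sampling step; metric_fn stands for any ψ_i (such as the K-S statistic sketched earlier), and the column names and defaults mirror the settings reported in our experiments but are otherwise illustrative:

```python
import numpy as np
import pandas as pd

def detection_window(exams, end_date, length_days=30):
    """All exams in the window of `length_days` ending at (and including) `end_date`."""
    end = pd.Timestamp(end_date)
    start = end - pd.Timedelta(days=length_days)
    return exams[(exams.study_date > start) & (exams.study_date <= end)]

def bootstrapped_metric(metric_fn, reference_values, window_values, K=2500, N=20, seed=0):
    """Average a sample-size-sensitive metric over N bootstrap draws of size K (theta_K)."""
    rng = np.random.default_rng(seed)
    draws = [metric_fn(reference_values,
                       rng.choice(window_values, size=K, replace=True))
             for _ in range(N)]
    return float(np.mean(draws))

# Stride of one day: evaluate a window ending on each calendar day of the test stream.
# for day in pd.date_range("2014-01-01", "2014-12-31", freq="D"):
#     window = detection_window(test_2014, day)
#     if len(window) < 150:   # skip sparse windows, as in our experiments
#         continue
#     value = bootstrapped_metric(ks_statistic, reference_ages, window.age.to_numpy())
```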
We can then collect metrics at each time step, resulting in a metric value set at multiple time steps: we define $\bar\psi_i(\omega_t) = \bar{m}_i^t$ as an individual metric calculated at time t from detection window $\omega_t$, and we can then capture the metric values $\{\bar{m}_i^a, \ldots, \bar{m}_i^b\}$ over a detection window set. The metrics outlined in the previous sections provide diverse inputs for monitoring drift; however, presenting a cohesive framework that bridges supervised and unsupervised data requires a holistic approach and metric unification. There are three main challenges to metric unification: 1) fluctuation normalization, 2) scale standardization and 3) metric relevancy. Without normalizing for acceptable fluctuation, it is impossible to differentiate changes that occur within normal operation from those that truly represent drift. Furthermore, since each of these metrics is based on a different type of test comparing separate statistics, there is no guarantee that the values will reside on the same relative scale. For example, a χ2 test statistic has no upper bound, whereas a two-sample K-S test statistic lies between 0 and 1. Merging metrics across non-standardized values may result in improper unification, where large values overpower smaller ones regardless of their relative importance. We tackle fluctuation normalization and scale standardization by using a standardization function, Γ, which transforms all individual metrics into a numerical space with common upper and lower bounds. In this work, we use a simple function that normalizes an input value m with fixed values for scale and offset: $\Gamma(m) = \frac{m - \zeta}{\eta}$, where the scale and offset factors are represented by η and ζ, respectively. Next, we unify the individual metric values across all standardized metrics through weighted aggregation using predefined weights α_i for each metric. Putting it all together, we calculate our multi-modal concordance metric, MMC, on a detection window ω from L metrics as follows: $MMC(\omega) = \sum_{i=1}^{L} \alpha_i \, \Gamma_i\big(\bar\psi_i(\omega)\big)$, where $\bar\psi_i(\omega)$ represents the i-th metric calculated on detection window ω, Γ_i represents the standardization function, and α_i represents the weight used for the i-th metric value. Calculating MMC on a time-indexed detection window set Ω_[a,b], we now have a robust multi-modal concordance measure capable of monitoring drift over the given time period from a to b, MMC_[a,b]. A number of strategies exist to choose appropriate values for η, ζ and α; these strategies range from manual selection to fully automated procedures. Indeed, each of the metric weights, scales and offsets could be manually chosen using clinical heuristics. In our experiments, we used automatic methods to calculate values for η, ζ and α. Instead of manually choosing weights, which may be time consuming, we propose an automatic method for obtaining scale and offset values, as well as metric weights. First, we calculate values for η_i and ζ_i using a detection window set collected from the validation data. Specifically, we generate raw metric values using the individual metric functions ψ_i on all windows in a detection window set, calculate the mean and standard deviation of each metric value across the detection window set, and set each ζ_i and η_i to the corresponding mean and standard deviation. Second, we obtain values for each α_i using a strategy that ties concordance directly to performance by leveraging the correlation between the individual metrics ψ_i and performance on validation data.
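A minimal sketch of the unification step, assuming each raw metric has already been computed per detection window; Γ is the mean/standard-deviation standardization described above, and the optional weights follow the performance-correlation strategy just described (all variable names are illustrative):

```python
import numpy as np

def fit_standardization(metric_series):
    """Per-metric offset (zeta) and scale (eta) from a reference detection window set."""
    zeta = {name: float(np.mean(values)) for name, values in metric_series.items()}
    eta = {name: float(np.std(values)) for name, values in metric_series.items()}
    return zeta, eta

def standardize(name, value, zeta, eta):
    """Gamma: map a raw metric value into a common, fluctuation-normalized space."""
    return (value - zeta[name]) / eta[name]

def fit_weights(standardized_series, performance):
    """alpha_i: correlation of each standardized metric with performance on Omega_alpha."""
    return {name: float(np.corrcoef(values, performance)[0, 1])
            for name, values in standardized_series.items()}

def mmc(raw_metrics, zeta, eta, alpha=None):
    """Weighted aggregation into a single concordance value.
    alpha=None gives the unweighted MMC_0 (simple average); otherwise MMC_w."""
    names = list(raw_metrics)
    standardized = np.array([standardize(n, raw_metrics[n], zeta, eta) for n in names])
    if alpha is None:
        return float(np.mean(standardized))
    weights = np.array([alpha[n] for n in names])
    return float(np.sum(weights * standardized))
```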
Each weight α_i is calculated using a detection window set Ω_α as $\alpha_i = \mathrm{corr}\big(\bar\psi_i(\Omega_\alpha),\, \rho(\Omega_\alpha)\big)$, where $\bar\psi_i(\Omega_\alpha)$ and $\rho(\Omega_\alpha)$ represent the standardized metric values and the performance on the detection window set Ω_α. Selection of Ω_α requires careful consideration. The detection window set Ω_r used to standardize metrics is often not suitable, as it contains only high-performing samples (by design). We generate Ω_α by adding poor-performing samples into Ω_r through hard data mining of our validation set. For experimentation, we also use a baseline method for calculating weights in which all weights are equal and we instead apply a simple average across the standardized metrics. Throughout this work, we denote MMC calculated without any weights as MMC_0 and the weighted counterpart as MMC_w. Note that, depending on the individual metrics used, the sign of α may need to be flipped to measure similarity rather than distance (as is the case for both statistical tests used in this work). In this work, we aim to investigate the connections between various drift detection outcomes over the lifetime of a medical imaging model deployed in production. To observe these connections, we designed an experimental setup with an adaptive learning workflow backed by our open-source framework and stress-tested it on engineered drift scenarios that simulate different indicators of medical imaging AI drift in a production environment. Our framework has a modular design and can be used in a plug-and-play manner to test multiple input drift modalities and scenarios with the included or new datasets. Using this framework, we highlight two real-time drift scenarios, either generating an artificial data stream backed by real data or injecting data into a real data stream to induce drift, thus enabling us to retain genuine data properties while ensuring drift substantial enough to risk degrading the model. In all of our experiments, we used the validation set as the reference data against which we wished to detect concordance/drift. We also used the validation set to generate the metric standardization factors and metric weights. We used data starting 2014-01-01 and ending 2014-12-31 as our test set to simulate a production data stream. Manipulation of this test data stream forms the basis for our drift observations. All experiments used a detection window stride (s) of 1 day and length (l) of 30 days. For our sampling function Θ, we set K = 2500 and N = 20. Note that, in some situations, the number of exams in a detection window can be extremely low, leading to anomalies in our metric calculations. For this reason, we skip any detection window that contains fewer than 150 exams. We used these same values to produce a reference detection window set, Ω_r, which was used to calculate the standardization factors η and ζ. Lastly, we generated an additional detection window set, Ω_α, to calculate the metric weights α_i ∈ A. This detection window set was obtained by augmenting Ω_r with poor-performing samples through hard data mining. Our principal experiment specifically investigates whether performance changes are detectable by our concordance metric. To accomplish this, we induce performance degradation through hard data mining and observe its effects on concordance and drift.
For this experiment, we mine hard data by including only samples for which the classifier has a low degree of certainty on a per-label basis: exams where a pathology was indicated but the model scored low for that pathology and, conversely, exams where the predicted score was high but the annotator found no indication of the pathology. Using this method, we created a sample pool from which we drew to populate each day's simulated exams. Every exam during our test period was replaced by an exam randomly drawn from this pool, maintaining the same number of daily exams as in the original data stream. See Figure 3 for performance and MMC plots. Clinical workflows, especially those that include AI modeling, are complex and rely heavily on metadata. There are many situations in which this metadata can be inaccurate or inconsistent, causing these workflows to deteriorate or fail over time. We explore exaggerated cases of such situations in two experiments. The first injects lateral view images, which our model was not trained on, and the second adds pediatric data that typical AI systems are not cleared to report on. For the lateral-image-induced drift experiment, we simulate a failure in a metadata filter by adding lateral images to our test data stream. With the availability of all metadata variables as well as pathology labels for the PadChest lateral images, this use case enables an end-to-end demonstration of our drift detection pipeline covering each specified metric and input modality. The original PadChest dataset includes both frontal and lateral data, so to "inject" lateral data we simply began including these images in the detection windows. We also simulate a complete failure of this workflow by removing all frontal images, leaving only lateral images. See Figure 4 for performance and MMC plots. The next experiment tests how our method works when metadata is missing or unavailable and uses the Pediatric Pneumonia Chest X-ray dataset from [30]. This dataset contains 5,856 chest X-rays labeled as either pneumonia or normal and contains only images and labels. Since all metadata has been removed from this dataset, it allows us to investigate how our method performs using only a subset of metrics, particularly the VAE-based drift metrics. As mentioned above, our code repository enables a workflow whereby a user can experiment with their own chest X-ray dataset, labels, and metadata variables to visualize the entire drift pipeline. The Pediatric Pneumonia Chest X-ray dataset does not include any temporal data, so we injected data from this dataset in a manner similar to the method used in the experiment of Section 4.2, except that the entire dataset was used as the sample pool. As before, the data was sampled randomly without replacement until no samples remained, at which point the pool was reset and reshuffled. See Figure 5 for performance and MMC plots. The performance and concordance metrics of each experiment are visualized in Figures 3, 4 and 5. We start by discussing the results in Figure 3, which corresponds to our first experiment, outlined in Section 4.2. In this figure, the top panel shows micro-averaged AUROC, the middle panel depicts our metric MMC_w, and the bottom panel shows our metric without the use of performance-correlated weights (MMC_0). As mentioned in Section 4.2, this experiment induces performance drift by limiting samples to hard data. In the figure legend, the Q value refers to the degree to which the samples were limited.
Specifically, Q denotes the quantile of predicted probabilities used to find high-scoring negatives and low-scoring positives; i.e., Q = 0.25 means that the highest-scoring 25% of negative samples and the lowest-scoring 25% of positive samples (by predicted probability) were used. "All data" sets this quantile value to 1.0, thereby using all of the negative and positive samples. In this figure, we see a clear correlation between measured ground truth performance and our concordance metric when comparing the top panel (performance) to both the middle and bottom panels (MMC_w and MMC_0, respectively). First, we notice that the baseline does not show a visible drop. This is representative of a clinical scenario where an exceedingly well-performing model remains as effective as at baseline (validation) over the course of production. We also see that as we reduce the Q value, thereby decreasing performance, there is a corresponding drop in the concordance metric. In this experiment, we also depict both the weighted and unweighted versions of our metric, and the value of using performance-correlated weights is apparent when comparing the middle and bottom panels. When using the weighted version of our multi-modal metric, we see very clear separation of all three curves, correlating with the performance changes in the graph. In the bottom panel, however, the three curves are closely clustered. This comparison shows how our weighting methodology emphasizes relevant metrics to yield a consistent performance proxy. Figure 4 depicts the results of our second experiment, which investigates clinical workflow failures that could lead to model drift, described in Section 4.3.1. In this figure, we again show performance in the top panel and our metric in the bottom panel. In this experiment, there are two trials: a baseline, which is the original PadChest data stream (blue), and a second trial (red) in which drift has been induced. We show two vertical lines that denote the points in time where we modify the data stream in the second trial. At point A, we simulate a failing metadata workflow by allowing lateral images to be passed to the model, and at point B we simulate a catastrophic failure in the metadata workflow by removing in-distribution (frontal) data, leaving only laterals. In this experiment, we again see a response in MMC_w correlated with performance, as shown in Figure 4. At point A, where we introduce lateral images, both performance and MMC_w drop a modest amount, with AUROC dropping from above 0.9 to around 0.85 and MMC_w dropping to about −4. At point B, we see another decrease in performance (to ≈ 0.75 AUROC) and our metric drops to, and hovers around, −8. This demonstrates our method's robustness to changes in data composition that are detectable by metadata tags as well as by visual appearance. Results for our final experiment appear in Figure 5, which simulates a situation where metadata is unavailable or is not used properly in a clinical workflow and the model begins to receive images that it has not been validated to classify. This experiment, described in Section 4.3.2, investigates how well our system performs without the use of any metadata, relying only on VAE latent representations and predicted probability distribution shifts to detect drift. The performance metric in this experiment differs slightly from the other two in that we measure only pneumonia AUROC, as the pediatric data includes only pneumonia labels, invalidating performance measures for the other classes.
At point A in this experiment, we modify the data stream by injecting pediatric cases such that for every exam in the original PadChest data stream there are three pediatric exams added. Then, at point B, the data stream switches over to only pediatric data. As seen in Figure 5, the performance metric unsurprisingly fluctuates more than in the other figures, as it represents the performance of a single pathology rather than an average. Comparing the baseline performance with that of the trial with a modified data stream, we see a drop in performance and concordance at both points where we modify the data stream. Even though the baseline performance oscillates between ≈ 0.9 and ≈ 0.7, when we inject pediatric data the AUROC drops and remains consistently below the baseline performance. Likewise, we see a significant drop in MMC_w at both data stream modification points. This indicates that our approach remains robust even when metadata is not available and can successfully detect drift using only VAE latent representations and predicted probabilities. In this experiment, we also notice an exaggerated drop in MMC_w at point A compared to the drop in performance. The large drop in concordance is expected, since pediatric data is indeed out-of-distribution, but we likely see a smaller drop in performance because the specific features of pneumonia in these images are similar enough to those in frontal PadChest images for the classifier to pick them up; performance is therefore less affected by this drift. Concordance takes a more holistic view of data drift, and these images represent significant drift in the data stream, so in this case the drift metrics are more sensitive to data stream changes even though the classifier is more robust to those same changes. This is a desirable characteristic for concordance measurement: AI models are not typically cleared for use on pediatric patients, so when the system is flooded with patients under 12, alerting to data drift is appropriate regardless of performance changes. The purpose of this work was to explore a data-driven approach to building a system that can perform real-time AI model monitoring for a medical imaging model over time and to identify methodologies that yield useful drift/performance metrics in the absence of contemporaneous ground truth, in a chest X-ray model use case. We found that, rather than measuring change in per-image reconstruction loss, which is a scalar quantity, utilizing the z vector (the latent space encoding generated by the VAE) provides a feature-rich representation with compressed yet relevant information about the input X-rays. Furthermore, we found that we were able to generate a strong proxy for ground truth performance using this latent representation along with relevant DICOM metadata tags and distributional shifts in the model's predicted probabilities. By unifying concordance metrics captured from these data, we present a multi-modal approach that can monitor real-time medical imaging AI systems. We demonstrate through experimentation that this approach to unsupervised drift detection correlates with supervised performance drift and has crucial implications for addressing the translation gap in continuous model performance monitoring in dynamic healthcare environments that lack contemporaneous ground truth.
When we monitor drift, one objective is to inform decisions regarding model performance in production, with the expectation that if data distributions are similar between training and production then the model should perform as expected. If the distributions have changed, the whole system might need an update. The task of drift detection focuses on global data distributions over the whole dataset, in order to determine whether there is a significant shift compared to past data or the model training data. Data drift might occur as a gradual shift in features along one of many potential dimensions; the relationship to model performance will determine the need for intervention. This is conceptually different from the traditional task of out-of-distribution detection, where the focus is to find individual "unusual" or "different" features in the input data. In other words, global data drift and out-of-distribution outliers can exist independently: the entire dataset might drift without outliers, just as an individual outlier might appear without data drift. If drift is detected in this framework, the goal is to intervene at the model level (i.e. pull it out of production, retrain, rebuild, etc.). In contrast, if out-of-distribution input is detected, the assumption is made that the model still performs well overall, but that for that particular input the prediction would not be accurate, and intervention would occur at the data level (i.e. logic is applied to qualify the model decision or avoid the model). Safe and effective monitoring in production requires that both global drift and out-of-distribution inputs can be identified, with different interventions needed; in the medical imaging use case this is further challenged by the fact that ground truth is not immediately available. In this framework, model monitoring must accommodate both gradual global drift and individual outlier cases; for the former, this work aimed to design a drift detection metric for the medical imaging use case that alerts to potential conditions that would impact overall model performance in the absence of ground truth. Currently proposed workarounds for the lack of ongoing medical imaging AI model performance monitoring solutions primarily include relying on human experts/users to provide model feedback during deployment and/or performing periodic expert audits on retrospective data by curating representative data and applying ground truth labels for performance analysis. This is an unsustainable and problematic solution for several reasons. First, asking end users to expend additional cognitive effort (and clicks) to provide model feedback risks decreasing the purported efficiency advantages of using AI models and adds to user burnout. Second, while it is generally agreed that periodic "model audits" resembling the initial pre-deployment analysis will be important to ensure ongoing model performance in production, the institutional effort involved in curating, annotating, and analyzing model performance is a powerful disincentive. These approaches are difficult to scale in an environment with limited resources and, potentially, many different models deployed in a given clinical practice. One needs to decide when to perform these checks, on what data and how frequently, and to leverage expensive expert clinical resources to repeatedly curate and annotate ground truth test datasets.
Therein lies the critical trade-off between maintaining consistent patient care and burdening clinicians with performance evaluation, a burden compounded by the fact that individual annotation (with its inherent variance) produces noisy feedback. Worse, drift can occur across innumerable axes and features of the imaging domain, which risks overlooking important changes in performance and leading to patient harm. Not only does this approach provide insufficient performance detail, it also fails to address medical imaging AI systems that do not directly interact with the end user in real time (e.g. image reconstruction, autonomous screening workflows), or that perform super-human tasks (e.g. opportunistic screening, mortality prediction), and it further risks missing unconscious biases. To combat the divergence between static models and their dynamic clinical environments, strategies to detect drift in real time must be developed and adopted. We present a multi-modal and sequential drift detection system for medical image classifiers, which can be flexibly modified to fit different data domains. Previous work has mainly been limited to a single type of data, such as streaming text [31], image and video [32], or metadata-like informational markers from clinics, airlines, the internet of things (IoT), etc. [33]. Even though those methods could be expanded to other data types, doing so would generate multiple separate drift detection metrics. Integrated analysis of clinical data and medical images (pathology and radiology) is routine in clinical practice, and the lack of multi-modal models which perform such integration represents a significant gap. Our system generates a unified metric from multiple features using standardization and weighting strategies, which provides a more holistic evaluation to aid decision-making. In a production environment for a medical imaging AI model, drift metrics are likely to serve as the primary signal that model performance might have changed in the absence of real-time access to ground truth; we have demonstrated that statistical change in model inputs and outputs may serve as a valuable proxy that can signify possible decay in model performance. Ideally, a model experiencing significant drift would be paused from production, allowing for confirmatory performance auditing, identification of potential root causes, intervention with a new model, retraining of the current model, refactoring of DICOM study routing workflows, and, ideally, informing continuous learning strategies. But the actions available to respond to drift in real-time production in medical imaging are constrained by common regulatory environments. For example, the current model approval process used by the FDA may discourage continuous model code changes and updates, as they may trigger a re-submission for approval. This in turn discourages continuous monitoring and reporting of flaws, due to the high overhead required to provide updates to mitigate these performance gaps. This regulatory challenge does not, however, change the fundamental need for model monitoring to achieve safe, effective model deployment, nor the fact that detection of model drift is not by itself sufficient without a mechanism to address the disparity in a timely manner.
In the future, as regulatory permissions are expected to expand to include retraining/continuous learning informed by Machine Learning Operations (MLOps) monitoring systems such as the one proposed here, the information provided by drift metrics could inform a learning healthcare system, allowing for intelligent model monitoring, auditing, retraining, and redeployment [34, 11]. This work has several important limitations. First was the use of public datasets, which precluded access to additional clinical contextual information and important population metrics. Importantly, the primary dataset was chosen for its preserved temporal relationships between studies, availability of DICOM metadata, and well-labeled ground truth for pathologies, in order to simulate a distribution of imaging examinations over time. While useful for temporal relationships, the dataset was highly curated retrospectively, and the sustained performance of the baseline model over time is likely a reflection of that curation homogeneity rather than of a lack of underlying model drift; a real-world sequential dataset with ground truth labels would serve as a more realistic baseline with which to observe and detect routine drift over time in the absence of the manufactured drift experiments performed in this work. There are many other important medical imaging use cases not explored here, including incorporating new disease data, the impact of hospital protocol or equipment changes, and more. Further, medical imaging use cases vary in complexity, including increasing data size and complexity (e.g. MRI reconstruction, CT imaging) as well as varied clinical tasks (e.g. segmentation, diagnosis, outcome prediction), which were not explicitly evaluated in this work. Nonetheless, the underlying methodological approach to medical imaging model monitoring explored and validated by this work could be further investigated in these and other important use cases. In conclusion, this work demonstrated a system that can perform AI model monitoring for a medical imaging model, with methodologies that can achieve real-time drift metrics in the absence of contemporaneous ground truth in a chest X-ray use case, to inform of potential changes in model performance. This work will inform further development of automated medical imaging AI monitoring tools to ensure ongoing safety and quality in production and to enable safe and effective AI adoption in medical practice. The important contributions include the use of a VAE to encode medical images for the purpose of detecting input data changes in the absence of ground truth labels, data-driven unsupervised drift detection statistical metrics that correlate with supervised drift detection approaches and ground truth performance, and open-source code and datasets to support validation and reproducibility for the broader community. We release the full code to train, validate and test our approach. The mapping for each label from PadChest radiological findings appears in Table 3. Please see our code at https://github.com/microsoft/MedImaging-ModelDriftMonitoring for more training details.
Table 3 (excerpt), mapping PadChest radiological findings to merged labels:
[...]: infiltrates, alveolar pattern, pneumonia, interstitial pattern, increased density, consolidation, bronchovascular markings, pulmonary edema, pulmonary fibrosis, tuberculosis sequelae, cavitation, reticular interstitial pattern, ground glass pattern, atypical pneumonia, post radiotherapy changes, reticulonodular interstitial pattern, tuberculosis, miliary opacities
Pleural Abnormalities: costophrenic angle blunting, pleural effusion, pleural thickening, calcified pleural thickening, calcified pleural plaques, loculated pleural effusion, loculated fissural effusion, asbestosis signs, hydropneumothorax, pleural plaques
Pleural Effusion: pleural effusion
Pneumonia: pneumonia

We outline key results pertaining to the VAE-based drift detector and visualize the detector's ability to identify drifted data through multiple signals. We find that leveraging the latent space representation, as opposed to the scalar reconstruction loss, yields a richer encoded feature set within which to contextualize the VAE's ability to separate out-of-distribution data in production. The VAE's encoded latent space representation (z) consists of a 128-length vector. To demonstrate that the VAE drift detector is contextually meaningful, we present indices of the latent encoding that exhibit strong correlations with drifted inputs (metadata features representing opposing categories, low to high activation scores, lateral X-ray images, etc.). Here we highlight changes specifically related to patient population and image construction as an example. Figure 6 shows average latent representations of PadChest exams of patients over 12 years old and of patients 12 years old and under, as well as averaged latent representations from the Pediatric dataset. We see significant differences between the latent spaces of X-rays of older patients and those of patients 12 and under. In the pediatric data, we see a representation similar to that of images of patients under 12 from the PadChest dataset, with some differences likely attributable to other differences in image formation between the two datasets. Figure 7 shows the average latent space representations of frontal, lateral and other view positions. Again, we see significant differences in their average encoded spaces. In particular, we see a number of values diametrically opposed between the frontal and lateral vectors, demonstrating the large differences between these view positions. In Figure 8, we show averaged latent spaces for each modality as reported by the DICOM metadata. The majority of the data is CR, so we see a fairly uniform average, as expected. With DX images, we see certain latent variables exaggerated compared to CR images.
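A minimal sketch of the subgroup comparison behind Figures 6-8, assuming a matrix of per-exam 128-dimensional latent means and a parallel metadata table (column names are illustrative):

```python
import numpy as np
import pandas as pd

def mean_latent_by_group(latents, metadata, column):
    """Average the 128-dim latent vectors per subgroup (e.g. ViewPosition, Modality, age bucket)."""
    latents = np.asarray(latents)                      # shape: (num_exams, 128)
    groups = {}
    for value in metadata[column].dropna().unique():
        mask = (metadata[column] == value).to_numpy()
        groups[value] = latents[mask].mean(axis=0)
    return pd.DataFrame(groups)                        # one column of averages per subgroup

# e.g. mean_latent_by_group(z_matrix, meta, "ViewPosition") to contrast frontal vs. lateral exams
```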
References:
Artificial intelligence in radiology: 100 commercially available products and their scientific evidence
Global trend in artificial intelligence-based publications in radiology from
The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database
Applications of artificial intelligence (AI) in diagnostic radiology: a technography study
The state of radiology AI: considerations for purchase decisions and current market offerings
Current clinical applications of artificial intelligence in radiology and their best supporting evidence
Imaging AI in practice: a demonstration of future workflow using integration standards
The algorithmic audit: working with vendors to validate radiology-AI algorithms: how we do it
Toward generalizability in the deployment of artificial intelligence in radiology: role of computation stress testing to overcome underspecification
Integrating AI into radiology workflow: levels of research, production, and feedback maturity
The clinician and dataset shift in artificial intelligence
Machine learning: the high interest credit card of technical debt
Monitoring and explainability of models in production
Failing loudly: an empirical study of methods for detecting dataset shift
CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison
PadChest: a large chest x-ray image dataset with multi-label annotated reports
Densely connected convolutional networks
ImageNet: a large-scale hierarchical image database
A unifying view on dataset shift in classification
Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review
Interpretability of sudden concept drift in medical informatics domain
On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine
Area under the ROC curve
Introduction to the DICOM standard
Gaurav Manek, and Vijay Ramaseshan Chandrasekhar. Efficient GAN-based anomaly detection
A benchmark of medical out of distribution detection
Does your model know the digit 6 is not a cat? A less biased evaluation of "outlier" detectors
An introduction to variational autoencoders. Foundations and Trends in Machine Learning
Labeled optical coherence tomography (OCT) and chest X-ray images for classification
Concept drift detection and adaptation with weak supervision on streaming unlabeled data
Automatically detecting data drift in machine learning classifiers
Combining active learning with concept drift detection for data stream mining
Continuous learning AI in radiology: implementation principles and early applications

Acknowledgments: This work was supported in part by the Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI) and Microsoft Health and Life Sciences.