Title: Subtle Data Crimes: Naively training machine learning algorithms could lead to overly-optimistic results
Authors: Efrat Shimron, Jonathan I. Tamir, Ke Wang, Michael Lustig
Date: 2021-09-16

While open databases are an important resource in the Deep Learning (DL) era, they are sometimes used "off-label": data published for one task are used for training algorithms for a different one. This work aims to highlight that in some cases, this common practice may lead to biased, overly-optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data preprocessing pipelines. We describe two preprocessing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for Magnetic Resonance Imaging (MRI) reconstruction: Compressed Sensing (CS), Dictionary Learning (DictL), and DL. In this large-scale, computationally extensive study, we demonstrate that the CS, DictL and DL algorithms yield systematically biased results when naively trained on seemingly-appropriate data: the Normalized Root Mean Square Error (NRMSE) improves consistently with the preprocessing extent, showing an artificial improvement of 25%-48% in some cases. Since this phenomenon is generally unknown, biased results are sometimes published as state-of-the-art; we refer to that as subtle data crimes. This work hence raises a red flag regarding naive off-label usage of Big Data and reveals the vulnerability of modern inverse problem solvers to the resulting bias.

Biased performance of machine learning models due to faulty construction of data cohorts or research pipelines has recently been identified for various tasks, including gender classification [2], COVID-19 prediction [3] and natural language processing [4]. However, to the best of our knowledge, it has not yet been studied for inverse problem solvers. We address this gap by highlighting scenarios that lead to biased performance of algorithms developed for image reconstruction from undersampled Magnetic Resonance Imaging (MRI) measurements; the latter is a real-world example of an inverse problem and a current frontier of DL research [5-13].

Figure 1: Subtle data crimes: how retrospective subsampling of processed data leads to biased results. (a) A common data processing pipeline, which is often implemented inside commercial MRI scanners, includes: k-space zero-padding, application of the inverse Fourier Transform, and coil combination via a Root Sum-of-Squares (RSS) step. The output image, which is interpolated and non-negative, is stored in a database. In subtle data crime I, this image is later used for synthesizing new k-space data; this yields artificial data in previously zero-padded areas. (b) A common data storage pipeline includes JPEG compression. In subtle data crime II, the compressed image is later used for retrospective experiments. (c) Standard research pipelines commonly involve retrospective subsampling of fully-sampled k-space data. In the subtle data crimes scenarios, the fully-sampled data are based on processed data, hence image reconstruction algorithms benefit from the early data processing. Moreover, since the "gold standard" image is based on the same processed data as the reconstructed one, error metrics become blind to the data processing and are therefore also prone to bias.
MRI measurements are fundamentally acquired in the Fourier domain, known as "k-space". Sub-Nyquist sampling is commonly applied to shorten the traditionally lengthy MRI scan time, and image reconstruction algorithms are used to recover images from the undersampled data [14-17]. The development of such algorithms should therefore ideally be done using raw k-space data. However, the development of DL methods requires thousands of examples, and databases containing raw k-space data are scarce. To date, only a handful of databases offer such data, e.g. [18-21], while many more offer reconstructed and processed Magnetic Resonance (MR) images, e.g. [22-29]. The latter offer images for post-reconstruction tasks such as segmentation and biomarker discovery. Nevertheless, due to their abundance, they are often downloaded and used for synthesizing "raw" k-space data using the forward Fourier transform; the synthesized data are then used for the development of reconstruction algorithms. We identified that this common approach can lead to undesirable consequences; the underlying cause is that the non-raw MR images are commonly processed using hidden pipelines. These pipelines, which are implemented by commercial scanner software or during database storage, include a full set or a subset of the following steps: image reconstruction, filtering, storage of magnitude data only (i.e. loss of the MRI complex values), lossy compression, and conversion to DICOM or NIFTI formats; these steps reduce the data entropy. We aim to highlight that when modern algorithms are trained and evaluated using such data, they benefit from the data processing and hence tend to exhibit overly-optimistic results compared to their performance on raw, unprocessed data. Since this phenomenon is largely unknown, such biased results are sometimes published as state-of-the-art, without reporting the data processing pipelines or addressing their effects. In order to raise community awareness of this growing problem, we coin the term subtle data crimes to describe such publications, in reference to the more obvious inverse crime scenario [30], which is described next.

Bias stemming from the underlying data has been previously recognized in a few scenarios related to inverse problems. The term inverse crime describes a scenario in which an algorithm is tested using simulated data, and the simulation resonates with the algorithm such that it leads to improved results [30-34]. Specifically, the authors of [33] described an inverse crime as a situation where the same discrete model is used both for simulating k-space measurements and for reconstructing an MR image from them; they showed that this leads to reduced ringing artifacts compared with reconstruction from raw or analytically-computed measurements. A second example is the evaluation of MRI reconstruction algorithms on real-valued magnitude images; in this case k-space exhibits conjugate symmetry, hence it is sufficient to use only about half of it for full image recovery. This symmetry is often leveraged in Partial Fourier methods such as Homodyne [15] and POCS [35], where additional steps are applied for recovery of the full complex data. However, neglecting the complex-valued nature of the data creates a better-conditioned inverse problem and may hence give an algorithm an obvious advantage when it is evaluated on such data rather than on raw k-space data.
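As a quick illustration of this symmetry, a minimal NumPy check: for any real-valued image, the DFT satisfies X(-k) = conj(X(k)) along each axis (the 8 × 8 size is arbitrary).

```python
import numpy as np

# Minimal check of the conjugate symmetry of real-valued images: for a real
# image, k-space satisfies F(-k) = conj(F(k)), so about half of it is redundant.
rng = np.random.default_rng(0)
img = rng.random((8, 8))          # any real-valued "magnitude" image
ksp = np.fft.fft2(img)
idx = (-np.arange(8)) % 8         # maps index k to -k (mod N) along each axis
print(np.allclose(ksp, np.conj(ksp[idx][:, idx])))   # -> True
```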
However, to the best of our knowledge, inverse crimes have not yet been studied in the context of machine learning or public data usage. Here we introduce two subtle forms of algorithmic bias that have not been considered before and are relevant to the current DL era. We show how they arise from two hidden data processing pipelines that characterize many open-access MRI databases: a commercial scanner pipeline and a JPEG data storage pipeline. To demonstrate these scenarios, we took raw MRI data and "spoiled" them with carefully-controlled data processing steps; we then used the processed datasets for training and evaluation of algorithms from three well-established MRI reconstruction frameworks: Compressed Sensing (CS) with a Wavelet transform [36], Dictionary Learning (DictL) [37], and DL [38]. Our large-scale experiments demonstrate that these algorithms yield overly-optimistic results when trained and evaluated on processed data. Preliminary results of this work were published in a conference proceeding [39].

The main contributions of this work are threefold. First, we reveal scenarios in which algorithmic bias of inverse problem solvers may arise from off-label usage of open-access databases, and we analyze the effects of inverse crimes on complex, high-dimensional learning systems via large-scale statistics. Second, we show that CS, DictL and DL algorithms are all prone to this form of subtle bias. While recent studies identified stability issues of MRI reconstruction algorithms [5, 40], to the best of our knowledge this is the first study that identifies a common vulnerability of canonical algorithms to such data-related bias. Third, by introducing the concept of subtle data crimes and setting a framework for studying them, we hope to raise community awareness of the growing problem of bias stemming from off-label usage of open-access data.

In this section we lay out the framework for our experiments.

Subtle Crime I: zero-padded k-space data. We first consider a data processing pipeline that is implemented inside many commercial MRI scanners to reconstruct the scanner output (i.e. the MR image). The k-space data are typically acquired using a multi-coil array, and the pipeline includes the following steps (Figure 1a): (1) image interpolation, implemented by zero-padding the raw multi-coil k-space data; (2) application of the inverse Discrete Fourier Transform (DFT); and (3) multi-coil image combination via a square Root Sum of Squares (RSS) step. Notice that although the acquired data are complex-valued, the RSS step produces a magnitude (real and non-negative) image. The scanner output is therefore an interpolated, real-valued, non-negative image; this is the type of image most prevalent in online MRI databases.
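For concreteness, the following minimal NumPy sketch simulates this pipeline; the array shapes and the commented synthesized-k-space step are illustrative assumptions, not an actual scanner's code.

```python
import numpy as np

def scanner_pipeline(kspace, pad_factor=2):
    """Simulate the commercial-scanner pipeline of Figure 1a (a sketch).

    kspace: complex array of shape (num_coils, ny, nx) -- raw multi-coil data.
    Returns an interpolated, real-valued, non-negative magnitude image.
    """
    nc, ny, nx = kspace.shape
    # (1) Interpolate by zero-padding k-space symmetrically around its center.
    py, px = (pad_factor - 1) * ny // 2, (pad_factor - 1) * nx // 2
    kspace_padded = np.pad(kspace, ((0, 0), (py, py), (px, px)))
    # (2) Inverse DFT per coil (centered).
    shifted = np.fft.ifftshift(kspace_padded, axes=(-2, -1))
    coil_images = np.fft.fftshift(np.fft.ifft2(shifted, axes=(-2, -1)),
                                  axes=(-2, -1))
    # (3) Root Sum-of-Squares coil combination -- discards the phase.
    return np.sqrt(np.sum(np.abs(coil_images) ** 2, axis=0))

# The off-label step of subtle data crime I: synthesizing "raw" k-space from
# the stored image. Its periphery now contains artificial, non-zero values:
# ksp_synth = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(image)))
```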
Figure 2: An experiment demonstrating how retrospective subsampling of k-space that was synthesized from processed data leads to an increased effective sampling density of the "true" k-space data. (a) Subsampling masks were generated for different combinations of zero-padding factors (left-right) and subsampling schemes (top-down). The masks were generated from symmetric 2D Probability Density Functions (PDFs) (profiles displayed), with 17% sampling in all cases. The regions covering the original non-padded k-space data are marked with yellow boxes. Notice that the zero-padding squashes the original data toward the center, so when a variable-density scheme is used, those data are sampled at an increased rate. (b) Effective sampling rate, i.e. the subsampling rate inside the original k-space area (yellow boxes in (a)), vs. the zero-padding factor. Notice that for the VD schemes, the effective rate is much higher than the global rate (17%) and may rise above 55%. (c) Real-world examples of k-space data generated from MR images found in public open-access databases [22, 23] show evidence of zero-padding (the yellow box is our estimation). These examples indicate that training algorithms using data from public databases could lead to increased effective sampling.

Let us assume that the scanner image is later downloaded and used for synthesizing new k-space data, with the aim of using those data for training a reconstruction algorithm. The synthesized k-space has two interesting features not originally present: it is larger than the original raw k-space (due to the zero-padding), and it has non-zero values everywhere (due to the non-linear RSS step). In other words, the "true" data now lie in the k-space center, while artificial data appear in its periphery (Figure 1a). However, since this k-space looks fully sampled, it is treated as "ground truth" and used for algorithm development.

A research pipeline that is commonly used in the development of MRI reconstruction algorithms is based on retrospective subsampling, where sub-Nyquist sampling is simulated by applying a binary sampling mask to a fully-sampled k-space (Figure 1c). In the scenario of subtle data crime I, such retrospective subsampling is applied to the synthesized k-space, which includes artificial data. Common subsampling masks are typically based on Variable-Density (VD) sampling schemes, which sample the center of k-space more densely than its periphery; VD schemes are used because they produce incoherent aliasing artifacts that can be removed by sparsity-promoting optimization-based reconstruction algorithms [36]. Importantly, because k-space was zero-padded earlier in the pipeline, applying a VD mask to the entire area of the synthesized k-space results in a higher effective sampling density of the "true" k-space data.

To demonstrate this, we performed the following experiment: we generated subsampling masks for combinations of three zero-padding factors and three subsampling schemes (Figure 2a). All the masks had a global sampling rate of 17%, which corresponds to an acceleration factor of R=6; this rate is measured over the full k-space area. We then measured the effective sampling rate, which we define as the sampling rate in the non-padded areas only (yellow boxes in Figure 2a), and plotted it against the zero-padding rate (Figure 2b).
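The sketch below reproduces this measurement under stated assumptions: masks are drawn from the weak-VD PDF f(r) = (1 - r)^p described in Materials and Methods, the normalization is approximate, and the fully-sampled calibration region is omitted.

```python
import numpy as np

def vd_mask(ny, nx, R=6, p=7, rng=None):
    """Monte-Carlo variable-density mask drawn from the PDF f(r) = (1 - r)^p
    (see Materials and Methods). A simplified sketch: the normalization is
    approximate and the calibration region is omitted."""
    rng = rng or np.random.default_rng()
    yy, xx = np.meshgrid(np.linspace(-1, 1, ny), np.linspace(-1, 1, nx),
                         indexing="ij")
    r = np.minimum(np.sqrt(yy ** 2 + xx ** 2), 1.0)
    pdf = (1.0 - r) ** p
    pdf *= (ny * nx / R) / pdf.sum()     # mean sampling probability ~ 1/R
    return rng.random((ny, nx)) < pdf    # pdf > 1 near the center: always sampled

def effective_rate(mask, pad_factor):
    """Sampling rate inside the original (non-padded) central k-space region,
    i.e. the yellow boxes of Figure 2a."""
    ny, nx = mask.shape
    cy, cx = ny // pad_factor, nx // pad_factor
    y0, x0 = (ny - cy) // 2, (nx - cx) // 2
    return mask[y0:y0 + cy, x0:x0 + cx].mean()

mask = vd_mask(640, 640, R=6, p=7)       # 17% global rate (R = 6)
print(f"global: {mask.mean():.0%}, effective at 2x padding: "
      f"{effective_rate(mask, 2):.0%}")
```

With 2x zero-padding this typically prints an effective rate well above the 17% global rate, in line with Figure 2b.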
Figure 3: Example of subtle data crime I: CS reconstructions from retrospectively-subsampled k-space of processed images. Notice how the reconstruction quality improves (both visually and in terms of NRMSE) with the zero-padding (data processing) extent. This improvement is completely artificial; it stems from the coupling of early processing and retrospective subsampling, which leads to an increased sampling of "true" non-padded data (as illustrated in Figure 2). The artificial improvement is more significant when the sampling is stronger around the k-space center (bottom row, strong VD).

The results indicate that for VD subsampling (both weak-VD and strong-VD), the effective sampling rate is much higher than the global rate. In the case of 2x zero-padding, which is often applied by default in commercial scanners, the effective rates were 24% (R=4.1) and 38% (R=2.6) for weak and strong VD sampling respectively, i.e. much denser than the global rate of 17% (R=6). Nevertheless, since this subtle effect is often missed by researchers, only the global rate is reported, and algorithms are presented as suitable for a subsampling rate far more aggressive than the one effectively used in practice. In summary, our experiment demonstrates that when processed data are retrospectively subsampled with a VD scheme, the sampling density of the "true" data increases. In the experiments described in the next section we demonstrate that this gives rise to overly-optimistic algorithm performance.

Subtle Crime II: JPEG-compressed data. The second studied pipeline involves JPEG compression of the scanner image (Figure 1b). Such compression is commonly used to reduce the storage footprint, and it is sometimes applied as part of the DICOM data saving pipeline, which is highly prevalent for storage of medical images. To demonstrate the JPEG effect, here we neglect the zero-padding scenario, although the two effects are sometimes combined. In the scenario of subtle data crime II, the JPEG-compressed image is stored in an online database and later downloaded and used for synthesizing a new k-space, which is used for algorithm development (Figure 1c). However, since JPEG compression reduces the data entropy, using JPEG data in retrospective-subsampling experiments leads to improved reconstruction fidelity; we aim to show that this leads to an artificial improvement of image reconstruction algorithms.

We studied the effects of the hidden data processing pipelines by simulating those pipelines and conducting a large-scale study using the carefully-controlled processed data. Implementation details are provided in the Materials and Methods section.

Figure 4: Subtle data crime I. The CS, DictL and DL algorithms were trained and tested using two versions of the same knee MRI dataset, processed without zero-padding and with 2x zero-padding. In the latter case, which represents the scenario of subtle data crime I, the reconstructions are sharper, with improved visibility of small clinically-relevant details. This illustrates that training inverse problem solvers using processed data may lead to overly-optimistic results.

The first experiment examines the effect of the commercial-scanner data processing pipeline (Figure 1a) on the CS algorithm. The results show that this algorithm produces increasingly sharper reconstructions as the k-space zero-padding factor grows, for both weak and strong VD sampling schemes (Figure 3). This effect is reflected by an artificial reduction of the NRMSE as a function of the zero-padding.

In the second experiment we implemented the three algorithms and applied them to two versions of the same knee MRI dataset: one prepared without zero-padding, and the other prepared with 2x zero-padding. The algorithms were trained on each dataset separately, and then tested with the corresponding version of a test image that includes fine details and a knee pathology (Figure 4). As can be seen, all the algorithms produced sharper images in the subtle data crime I scenario, where the data were zero-padded: the fine details and the pathology became more visible than in the non-padded case.
These results were further confirmed in a large set of experiments, where the algorithms were trained and tested on five versions of the underlying knee dataset, representing five data processing scenarios; each dataset contained 2971 images. The hyperparameter calibration, training and testing were performed for each dataset separately, to optimize the algorithmic results for each data processing scenario. We then computed the statistics of two image quality metrics, the NRMSE and the Structural Similarity Index (SSIM) [41], and plotted them against the zero-padding rate. Markedly, the results of the three algorithms exhibit the same behaviour: their NRMSE and SSIM values improve consistently with the zero-padding extent (Figure 5). This improvement is completely artificial and stems only from the data processing. Strikingly, for the 2x zero-padding case, which is often the default in commercial scanners, the NRMSE exhibits a large improvement of 26%-42% (Table 1).

To demonstrate the JPEG compression effect, we performed experiments in which the algorithms were trained and tested on different versions of the same underlying dataset. The JPEG compression level is determined by a Quality Factor (QF), where QF = 75 is the default (and already yields lossy compression), and values such as QF = 50 and QF = 20 yield increasingly lossy compression [42]. For reference, our experiments also include the case of image reconstruction from Non-Compressed (NC) data. In all cases, the hyperparameter calibration, algorithm training and inference were done on the same type of data (i.e. with NC data or a specific QF).

Figure 5: Subtle data crime I statistics. The CS, DictL and DL algorithms were trained and evaluated using data with various data processing extents. The processing pipeline, which is typically implemented inside commercial scanners, includes k-space zero-padding (Figure 1a). Retrospective subsampling experiments were performed with Variable Density (VD) subsampling with R = 4. The curves display the mean and STD of the NRMSE and SSIM error metrics for the test set. Notice that both metrics show an artificial improvement that is correlated with the data processing extent. This demonstrates that algorithms evaluated on retrospectively-subsampled processed data tend to yield an overly-optimistic evaluation.

In the first experiment, the DL algorithm was trained on the different datasets. Figure 6 displays an example from the test set, showing the gold standard images and the DL reconstructions for data undersampled with R = 4. Generally, the visual quality of all the images (both gold standard and reconstructed ones) degrades with increasing JPEG compression (left-to-right in Figure 6); this is expected for compressed data. However, the NRMSE metric shows an unexpected effect: it improves with the compression, i.e. the reconstruction error decreases although the visual image quality degrades. The reason for this phenomenon is that in retrospective experiments the reconstruction quality is measured w.r.t. a "gold standard" image that is also processed (see the pipeline in Figure 1c); the error metric is therefore blind to the data processing. Strikingly, the NRMSE can show a misleading improvement even when the human eye cannot see any difference, as demonstrated in the two left columns of Figure 6: although the reconstructions from NC and QF = 75 data are visually similar, the NRMSE of the latter is lower by 30%. This reflects the subtle bias induced by the pipeline of subtle data crime II.
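A minimal sketch of this storage pipeline, using PILLOW's standard JPEG encoder (the 8-bit scaling convention is our assumption; stored DICOM-derived JPEGs may use other conventions):

```python
import io
import numpy as np
from PIL import Image

def jpeg_pipeline(image, quality=75):
    """Sketch of the JPEG storage pipeline of Figure 1b: scale a magnitude
    image to 8 bits, compress at the given Quality Factor, and decode."""
    scale = image.max()
    im8 = np.clip(255.0 * image / scale, 0, 255).astype(np.uint8)
    buf = io.BytesIO()
    Image.fromarray(im8).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float32) / 255.0 * scale

# In subtle data crime II, "raw" k-space is then synthesized from the
# compressed image, inheriting its reduced entropy:
# ksp_synth = np.fft.fftshift(np.fft.fft2(jpeg_pipeline(img, quality=50)))
```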
The JPEG compression effect was further observed in a statistical analysis of a large-scale experiment, in which the algorithms were trained and tested on the four types of data (NC, QF=75, QF=50, QF=20) (Figure 7). As illustrated, the error metrics exhibit a consistent improvement with the compression; notably, this effect is systematically observed for all the studied algorithms and reduction factors (Table 2).

Figure 6: Example of subtle data crime II. The DL algorithm was trained and tested on non-compressed and JPEG-compressed data. Although the compression reduces the visual image quality, the NRMSE surprisingly decreases with increased compression, reflecting a seemingly-better image quality. The reason is that in retrospective experiments both the "gold standard" and reconstructed images are based on processed data, hence the error metric is blind to the data processing and prone to bias. Strikingly, although the reconstructions from non-compressed and default-compressed data are visually similar, the NRMSE of the latter is lower by 30%. This demonstrates the subtle bias induced by training and evaluating algorithms on JPEG-compressed data.

This study reveals that naive usage of open-access data in the development of MRI reconstruction algorithms could give rise to overly-optimistic results. The underlying cause is that open-access data are commonly prepared with hidden data processing pipelines that implicitly affect the data properties. Our large-scale study demonstrates that CS, DictL and DL algorithms exhibit biased results for data prepared with common data processing pipelines. Since this form of bias is largely unknown, it is frequently not addressed in the research literature; we introduce a framework for studying such bias and coin the term subtle data crimes to facilitate research in this field.

This work offers insights into the subtle mechanisms that lead to biased performance of modern reconstruction algorithms. Our main observation is that such bias stems from the unintentional coupling of hidden data processing pipelines with later retrospective-subsampling experiments; the data processing implicitly improves the conditioning of the inverse problem, and the retrospective subsampling enables the algorithms to benefit from that. This process may appear in different forms. In subtle data crime I, the zero-padding concentrates the "true" k-space data at the center, and when VD sampling is later applied, those data are densely sampled; the increased amount of "true" data available to the algorithm makes the inverse problem easier to solve, hence algorithms tend to exhibit misleadingly-good results. In subtle data crime II, the JPEG compression reduces the data entropy, i.e. it increases the data's sparsity and yields a more compact representation in a sparsifying transform domain. Modern reconstruction algorithms leverage sparsity priors or learn the compact representation from training data [6, 36, 43, 44]; therefore, they benefit from the compression and yield biased results.

Another main insight from this study is that in retrospective-subsampling experiments, the error metrics might give a misleading evaluation. This occurs because they measure the difference between two images (the gold standard and the reconstructed image) that are based on the same processed data. Ideally, the error metrics should measure the difference between the reconstructed image and the original unprocessed one, but because the latter is unavailable (it was not stored in the database), the metrics become blind to the data processing. As a result, they cannot reflect the true reconstruction quality, and they might produce misleading results.
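To make the entropy argument above tangible, the following sketch counts how many wavelet coefficients are needed to capture most of an image's energy before and after JPEG compression; on typical images the compressed version needs fewer, i.e. it is more compressible. The input file is hypothetical, and the PyWavelets package used here is our choice, not one the paper mentions.

```python
import io
import numpy as np
import pywt
from PIL import Image

def n_coeffs_for_energy(img, frac=0.95):
    """Number of db4 wavelet coefficients needed to capture `frac` of the
    total energy -- a rough proxy for transform-domain sparsity."""
    arr, _ = pywt.coeffs_to_array(pywt.wavedec2(img, "db4", level=4))
    e = np.sort(np.abs(arr.ravel()) ** 2)[::-1]
    return int(np.searchsorted(np.cumsum(e), frac * e.sum()) + 1)

img = np.asarray(Image.open("slice.png").convert("L"), dtype=float)  # hypothetical file
buf = io.BytesIO()
Image.fromarray(img.astype(np.uint8)).save(buf, format="JPEG", quality=20)
buf.seek(0)
jpg = np.asarray(Image.open(buf), dtype=float)
# On typical MR images the compressed version needs fewer coefficients:
print(n_coeffs_for_energy(img), n_coeffs_for_energy(jpg))
```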
This study also sheds light on a new type of sensitivity of MRI reconstruction algorithms. At present there is growing interest in identifying sensitivities of such algorithms [5, 40, 45-48]. However, recent studies focused mainly on investigating sensitivities with respect to adversarial attacks. While these attacks are an important research tool, they are not observed in practice since MRI scanners are closed systems. Here, on the other hand, we focused on sensitivity related to a more common cause: off-label usage of public databases. While reviewing papers, we noticed that such usage is becoming increasingly common due to the growing availability of public databases that offer various types of MRI data. Subtle data crime I may be common because MR images found in public databases are often based on images produced by commercial scanners, where the data processing pipeline described in Figure 1a is often applied by default. Additionally, subtle data crime II may be common because JPEG images are highly prevalent; 73.3% of Internet websites contain JPEG-format data [49]. These factors suggest that the subtle data crimes might be more common than intuitively expected.

It is worth mentioning that this work did not aim to benchmark the studied algorithms; instead, it aimed to show that they are all affected similarly by the subtle data crimes. However, as a side benefit, we did obtain benchmark comparisons. To ensure a fair comparison, we dedicated significant efforts to calibrating the hyperparameters of each algorithm for each processed version of the underlying dataset separately (see Materials and Methods); specifically, we dedicated one month of computations to tuning the DictL algorithm parameters through an extensive search over a very large parameter space. Moreover, we ensured that the algorithms were calibrated, trained and tested using identical datasets. We empirically observed that the studied algorithms perform overall on par, with an advantage of CS over DictL and a slight advantage of DL over both. However, due to the pipelines of the subtle data crimes, all of our computations were performed with single-coil magnitude (non-negative) images; benchmarking the algorithms on multi-coil, complex-valued MRI data is beyond the scope of this work and remains for future research.

In summary, this research aims to raise a red flag regarding naive off-label usage of open-access data in the development of machine learning algorithms. We showed that such usage may lead to biased results of inverse problem solvers. Furthermore, we demonstrated that training MRI reconstruction algorithms using such data could yield an overly-optimistic evaluation of their ability to reconstruct small clinically-relevant details and pathology; this increases the risk of translating biased algorithms into clinical practice. We therefore call for the attention of researchers and reviewers: data usage and pipeline adequacy should be considered carefully, reproducible research should be encouraged, and research transparency should be required.
By introducing the framework for studying subtle data crimes we hope to raise community awareness, stimulate discussions and set the ground for future studies of data usage.

Figure 7: Notice that all the curves show the same trend: the error metrics improve consistently with increased JPEG compression. This improvement is artificial and stems only from the data processing, which reduces the data entropy. The results therefore demonstrate the subtle bias caused by training inverse problem solvers on JPEG-compressed data.

To demonstrate the effects of the hidden data processing pipelines, we took raw MRI data and "spoiled" them with carefully-controlled processing steps. The raw data were obtained from the FastMRI database [18]. This section describes the raw datasets; the data processing steps were described in the main part of the paper for each subtle data crime separately.

1. Brain data. In the experiment presented in Figure 3 we used a single 320 × 320 brain image.

2. Knee Fat-Saturated Proton Density (FSPD) data. In the knee pathology experiment (Figure 4) we used data from multi-coil FSPD scans, since knee pathology is usually observed in this type of MRI scan. The training set consisted of 2849 randomly-chosen slices obtained from 300 subjects, and the test case was a specific image that contains a pathology (shown in Figure 4).

3. Knee Proton Density (PD) data. In the large-scale experiments performed to demonstrate the two subtle data crimes (Figures 5-7), we used data from multi-coil PD scans. Specifically, we used 1427 slices obtained from 80 subjects for training and 122 slices obtained from 7 subjects as the test set. All the slices were chosen randomly.

When constructing the knee PD and FSPD datasets we used only slices from central anatomical regions, i.e. edge slices that contain mostly noise were removed. Additionally, for each dataset, we chose 10 random slices obtained from 10 different subjects and reserved them for tuning the hyperparameters of the studied algorithms; these slices were not included in the training or test sets. It is worth mentioning that the limited number of slices used for hyperparameter calibration was dictated by the need to perform vast computations over a huge search space, especially for the DictL algorithm, as described below.

We designed our research framework to enable isolating the bias related to the subtle data crimes in a controlled setup. Additionally, since a side result of this study was the benchmarking of the studied algorithms, we also dedicated significant efforts to ensuring their fair comparison. Here we detail the steps that were taken for these two aims. First, to mimic a scenario in which users download a database from an online resource and then optimize the parameters of their algorithm for that specific dataset, we prepared separate processed datasets for each instance of the data processing parameters (i.e. for each zero-padding factor or JPEG QF), and ensured that there was no mixing between the datasets. We then calibrated, trained and tested the algorithms on each processed dataset separately. This ensured that each algorithm was evaluated using instance-optimal parameters; it therefore mitigated bias related to hyperparameter tuning. Secondly, we applied the three algorithms to identical datasets; their results are therefore comparable. Finally, we generated sampling masks on-the-fly, i.e. a different random mask was generated for each k-space example during the training and test sessions. This technique enables generating a large number of sampling masks while maintaining their statistics, hence it prevents over-fitting to any particular sampling mask.
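A sketch of how such on-the-fly masking could look as a PyTorch dataset (our illustration, not the paper's implementation; it reuses the vd_mask helper from the earlier sketch, and the array-of-images input format is a simplifying assumption):

```python
import torch
from torch.utils.data import Dataset

class OnTheFlyMaskDataset(Dataset):
    """A fresh random mask is drawn for every example each time it is fetched,
    so no particular sampling pattern can be overfit."""

    def __init__(self, images, R=4, p=7):
        self.images, self.R, self.p = images, R, p

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        img = torch.as_tensor(self.images[i], dtype=torch.complex64)
        ksp = torch.fft.fftshift(torch.fft.fft2(img, norm="ortho"))
        mask = torch.as_tensor(vd_mask(*img.shape, R=self.R, p=self.p),
                               dtype=torch.float32)
        return ksp * mask, mask, img   # undersampled k-space, mask, target
```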
In the retrospective-subsampling experiments, we generated random 2D subsampling masks from pre-defined PDFs using Monte-Carlo experiments. We implemented three subsampling schemes: (1) random-uniform, in which the PDF was constant and equal to 1/R (R is the acceleration factor); (2) weak VD, in which the PDF was constructed from the function $f(r) = (1 - r)^p$, where $r$ is the distance from the k-space center and $p$ is the power [36], set here to $p = 7$; and (3) strong VD, in which the PDF was also constructed from $f(r) = (1 - r)^p$, with the power set to $p = 1$, $p = 2$ and $p = 3$ for reduction factors of R = 2, R = 3 and R = 4, respectively. All the sampling masks included a small fully-sampled area in the center of k-space. In parallel imaging this area is often known as the calibration region [16]; in single-coil MRI experiments this region ensures sampling of the low-frequency data and helps stabilize the computational results. The calibration region size was 12 × 7 pixels for the 640 × 372 knee images and 6 × 6 pixels for the 320 × 320 brain image. In the zero-padding experiments, where the image size varied, the calibration region size scaled with the image size.

The CS, DictL and DL algorithms recover an MR image from subsampled k-space measurements by solving an inverse problem of the following general form:

$$\hat{x} = \arg\min_x \frac{1}{2}\left\| Ex - y \right\|_2^2 + \lambda R(x), \qquad [1]$$

where $x$ is the image to be reconstructed, $y$ are the k-space measurements, $E$ is an encoding operator that describes the imaging system, $R(x)$ is a regularization term, and $\lambda$ is a trainable parameter that controls the tradeoff between the Data Consistency (DC) term (the first term in Eq. [1]) and the regularization term. In MRI, the encoding operator is typically described as $E = UF$, where $F$ is the Fourier transform and $U$ is an operator that describes the k-space subsampling. The studied algorithms differ in their regularization terms and optimization techniques, as described next.

CS algorithm. This algorithm formulates Eq. [1] as a convex optimization problem with an $\ell_1$ prior that promotes the sparsity of $x$ in a sparsifying transform domain [36]. A common choice is an $\ell_1$-wavelet prior; the optimization problem is then

$$\hat{x} = \arg\min_x \frac{1}{2}\left\| Ex - y \right\|_2^2 + \lambda \left\| \Psi x \right\|_1, \qquad [2]$$

where $\Psi$ is the wavelet transform. Eq. [2] can be solved using different optimization techniques; here it was solved using the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [50]. Our implementation was based on the SigPy python toolbox [51]. The CS algorithm has one tunable parameter, $\lambda$, which we calibrated through a grid search over values in the range $\lambda \in [10^{-9}, 10^{-1}]$. We ran the grid search over 10 images from a subset of the data reserved for hyperparameter tuning, computed the mean NRMSE over those 10 images, and chose the $\lambda$ value that yielded the lowest mean NRMSE. Since in the experiments of subtle data crime I the image size varies with the zero-padding, we repeated this procedure for each image size separately; however, we empirically observed that the same $\lambda$ value was chosen for all image sizes. The chosen values were $\lambda = 0.005$ for the brain data (Figure 3) and $\lambda = 0.001$ for the knee data (Figures 4-7).
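A minimal single-coil sketch of this reconstruction using SigPy's L1-wavelet app is shown below. The input file name is hypothetical, the mask comes from the earlier vd_mask sketch, and we rely on our understanding that SigPy infers the sampling pattern from zeroed k-space entries when no explicit weights are given.

```python
import numpy as np
import sigpy as sp
import sigpy.mri as mr

img = np.load("knee_slice.npy").astype(np.complex64)   # hypothetical magnitude image
ksp = sp.fft(img)                                      # synthesized "raw" k-space
mask = vd_mask(*img.shape, R=4, p=7)                   # weak-VD mask (earlier sketch)
ksp_us = (ksp * mask)[None]                            # add a singleton coil axis
mps = np.ones_like(ksp_us)                             # trivial sensitivity map

# Solves Eq. [2]; lamda = 1e-3 is the value the paper chose for the knee data.
recon = mr.app.L1WaveletRecon(ksp_us, mps, lamda=1e-3).run()
```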
DictL algorithm. The DictL algorithm reconstructs the image $x$ by jointly learning a dictionary $D$ and a sparse code $A$ [37]. The dictionary, which is used for the reconstruction of patches, is a sparsifying transform learned directly from the subsampled k-space data; the image is reconstructed by representing each of its patches as a sparse linear combination of dictionary atoms, $x = DA$. The algorithm learns the dictionary adaptively from the subsampled k-space while reconstructing the image, i.e. it is trained without any external examples and without access to the fully-sampled k-space data; the learning is done over image patches. The optimization problem is formulated as follows:

$$\hat{x} = \arg\min_{x, D, A} \frac{1}{2}\left\| Ex - y \right\|_2^2 + \lambda \left\| x - \mathcal{R}(DA) \right\|_2^2 \quad \text{subject to} \quad \left\| a_l \right\|_0 \le K, \; l = 1, \dots, L, \qquad \left\| d_p \right\|_2 \le 1, \; p = 1, \dots, P,$$

where $\mathcal{R}$ is a reshaping operator that reshapes patches into an image, $a_l$ are the columns of $A$ (each column is a vectorized patch from the image), $K$ is the sparsity level, $L$ is the number of patches used as training examples during one iteration of the algorithm, $d_p$ are the columns of the dictionary $D$ (each column is a vectorized atom), and $P$ is the number of dictionary atoms. We implemented the DictL algorithm in Python using our open-source code [52].

DL algorithm. We studied the Model-based reconstruction using Deep Learned priors (MoDL) algorithm, which gives state-of-the-art performance in MRI reconstruction [38]. MoDL solves the following optimization problem:

$$\hat{x} = \arg\min_x \left\| Ex - y \right\|_2^2 + \lambda \left\| x - \mathcal{D}_w(x) \right\|_2^2,$$

where $\mathcal{D}_w(x)$ is the output of a Convolutional Neural Network (CNN). This optimization problem is solved using an unrolled deep neural network that includes interleaved CNNs and DC blocks. The DC blocks ensure consistency of the solution with the k-space measurements; the backpropagation through them is implemented using the Conjugate Gradient (CG) algorithm [55]. The MoDL unrolled network is trained in an end-to-end supervised manner, where the input is an aliased image obtained from the zero-filled subsampled k-space data and the target is a "gold standard" image obtained from the fully-sampled k-space. In our implementation, the network architecture included 6 unrolls, CNNs with a U-Net structure [56], weight sharing, and 8 CG steps in the DC blocks. The training was performed using an $\ell_1$ loss and the Adam optimizer [57], with gradient accumulation such that the effective batch size was 20. The number of epochs was 70. We implemented MoDL using PyTorch [58].
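The following PyTorch sketch illustrates the MoDL structure for the single-coil Cartesian case. It is our simplified illustration, not the authors' implementation: the denoiser is left abstract (and is assumed to map complex images to complex images, e.g. via stacked real/imaginary channels), and gradients flow through the unrolled CG loop via plain autograd rather than the paper's analytic CG backpropagation.

```python
import torch
import torch.nn as nn

def fft2c(x):
    """Centered, orthonormal 2D FFT over the last two dimensions."""
    x = torch.fft.ifftshift(x, dim=(-2, -1))
    return torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))

def ifft2c(k):
    """Centered, orthonormal 2D inverse FFT."""
    k = torch.fft.ifftshift(k, dim=(-2, -1))
    return torch.fft.fftshift(torch.fft.ifft2(k, norm="ortho"), dim=(-2, -1))

class MoDLSketch(nn.Module):
    """Simplified MoDL-style unrolled network: alternate a shared CNN denoiser
    D_w with a CG data-consistency block solving (E^H E + lam*I) x = E^H y + lam*D_w(x)."""

    def __init__(self, denoiser, unrolls=6, cg_iters=8):
        super().__init__()
        self.dw = denoiser                            # shared-weight CNN prior
        self.unrolls, self.cg_iters = unrolls, cg_iters
        self.lam = nn.Parameter(torch.tensor(0.05))   # trainable tradeoff weight

    def _ehe(self, x, mask):
        # E^H E for Cartesian sampling: masking in k-space.
        return ifft2c(mask * fft2c(x))

    def _cg(self, rhs, mask):
        # Plain conjugate gradient on the normal equations.
        x, r = torch.zeros_like(rhs), rhs.clone()
        p, rs = r.clone(), (r.conj() * r).sum().real
        for _ in range(self.cg_iters):
            Ap = self._ehe(p, mask) + self.lam * p
            alpha = rs / (p.conj() * Ap).sum().real
            x, r = x + alpha * p, r - alpha * Ap
            rs_new = (r.conj() * r).sum().real
            p, rs = r + (rs_new / rs) * p, rs_new
        return x

    def forward(self, ksp, mask):
        ehy = ifft2c(mask * ksp)    # zero-filled (adjoint) image: the network input
        x = ehy
        for _ in range(self.unrolls):
            z = self.dw(x)          # CNN prior, e.g. a small U-Net
            x = self._cg(ehy + self.lam * z, mask)
        return x
```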
In this section, we provide implementation details regarding the experiments performed to demonstrate the effects of subtle data crime I. In the first experiment, which demonstrates the difference between global and effective sampling (Figure 2), a set of 15 random masks was generated for each combination of a subsampling scheme and zero-padding factor. The curves in Figure 2 depict the mean effective rates measured over those sets. In the next experiments, which demonstrate the zero-padding effect (Figures 3-5), we implemented VD sampling with an acceleration factor of R = 4. In these experiments, the MoDL network could not be trained on full-size images, because the zero-padding enlarges the image size to an extent that poses a computational challenge even with modern GPUs. However, a major advantage of MoDL is that it is convolutional, so at inference it can be applied to images of any size [38]. We therefore trained MoDL on patches extracted from the training images; the patch size was 0.25 of the image size in each dimension, and a single patch was extracted randomly from each image. In contrast, during inference the network was applied to the full-size test images; the results shown in Figure 5 therefore represent the reconstruction error for full images.

In the second set of experiments we studied how the performance of reconstruction algorithms is influenced by JPEG compression of the underlying data. We prepared the processed datasets using the standard JPEG implementation found in the PILLOW library [59]. In the JPEG experiments the reduction factor ranged from R = 2 to R = 4 (see Table 2).

We quantified the subtle data crimes effects by studying how the data processing pipelines influence two highly common image quality metrics: the Normalized Root Mean Square Error (NRMSE) and the Structural Similarity Index (SSIM) [41]; the latter was implemented using the SSIM-PIL library [60].
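For reference, a minimal NRMSE implementation; normalizing by the reference dynamic range is one common convention, and since the paper does not spell out its normalization, that choice is our assumption.

```python
import numpy as np

def nrmse(gt, rec):
    """Root mean square error normalized by the reference dynamic range."""
    gt, rec = np.asarray(gt, dtype=float), np.asarray(rec, dtype=float)
    return np.sqrt(np.mean((gt - rec) ** 2)) / (gt.max() - gt.min())

# Crime-II caveat: in retrospective experiments `gt` is itself a processed
# image, so this metric is blind to the data processing.
```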
In this section, we give an overview of the compute times and resources used in this research. All of our experiments were performed on 12GB Nvidia Titan Xp GPUs and Intel(R) Xeon(R) Silver 4116 CPUs.

Hyperparameter tuning. The most extensive part of the research was the hyperparameter tuning for the DictL algorithm (described above), which was conducted for over four weeks on 200 CPUs in parallel. The calibration of the DL algorithm hyperparameters was conducted for a similar amount of time using a single GPU. The CS parameter tuning took only a few hours.

Experiments. The experiment demonstrating subtle data crime I with a single brain image (Figure 3) required only a few minutes on a standard laptop. In contrast, the experiment demonstrating this subtle data crime using fat-saturated PD knee data (Figure 4) required one day of computations on two GPUs and 40 CPUs. The experiments for obtaining the statistics of subtle data crime I (Figure 5) were computationally more demanding, since they required training and testing ten different instances of each algorithm, for the ten combinations of the studied zero-padding ratios and VD sampling schemes. The compute time of these experiments was one week on a single GPU and 200 CPUs. Similarly, the experiments demonstrating subtle data crime II (Figures 6 and 7) required training and evaluating twelve instances of each algorithm, for the twelve combinations of the four studied compression scenarios and three reduction factors. The CS computation time was several hours; however, the DictL runs took about eight days on 100 CPUs, and the DL runs took six days on 12 GPUs. Altogether, the compute time was about two months using 200 CPUs and 12 GPUs.

In the spirit of reproducible research, our code is publicly available at https://github.com/mikgroup/subtle_data_crimes. All the data used in this research are publicly available as part of the FastMRI database [18].

The authors acknowledge funding from grants U24 EB029240-01, R01EB009690 and R01HL136965. The authors thank Shreyas Vasanawala for his assistance with identifying the pathology cases in the FastMRI data.

References

Imagenet large scale visual recognition challenge
Gender shades: Intersectional accuracy disparities in commercial gender classification
Prediction Models for Diagnosis and Prognosis of COVID-19: Systematic Review and Critical Appraisal
A survey on bias and fairness in machine learning
On Instabilities of Deep Learning in Image Reconstruction and the Potential Costs of AI
Image reconstruction by domain-transform manifold learning
Learning a variational network for reconstruction of accelerated MRI data
Accelerating magnetic resonance imaging via deep learning
Image reconstruction is a new frontier of machine learning
An overview of deep learning in medical imaging focusing on MRI
Deep learning in radiology: An overview of the concepts and a survey of the state of the art with focus on MRI
Deep-learning methods for parallel magnetic resonance imaging reconstruction: A survey of the current approaches, trends, and issues
Image reconstruction: From sparsity to data-adaptive methods and machine learning
Principles of Magnetic Resonance Imaging
Quantitative evaluation of several partial Fourier reconstruction algorithms used in MRI
Generalized autocalibrating partially parallel acquisitions (GRAPPA)
SENSE: sensitivity encoding for fast MRI
fastMRI: A publicly available raw k-space and DICOM dataset of knee images for accelerated MR image reconstruction using machine learning
Mridata.org: An open archive for sharing MRI raw data
Calgary Campinas Public Dataset
SKM-TEA: A dataset for accelerated MRI reconstruction with dense image labels for quantitative clinical evaluation
The human connectome project
AccelMR dataset
Oasis dataset
The Cancer Imaging Archive
Inverse acoustic and electromagnetic scattering theory
Statistical inverse problems: discretization, model reduction and inverse crimes
Discrete inverse problems: insight and algorithms
Realistic analytical phantoms for parallel magnetic resonance imaging
Linear and nonlinear inverse problems with practical applications
A fast, iterative, partial-Fourier technique capable of local phase recovery
Sparse MRI: The Application of Compressed Sensing for Rapid MR Imaging
MR image reconstruction from highly undersampled k-space data by dictionary learning
MoDL: Model-based deep learning architecture for inverse problems
Subtle inverse crimes: Naïvely using publicly available images could make reconstruction results seem misleadingly better!
Measuring robustness in deep learning based compressive sensing
Image quality assessment: from error visibility to structural similarity
Sparse and Redundant Representations: from Theory to Applications in Signal and Image Processing
Optimization methods for magnetic resonance image reconstruction: Key models and optimization algorithms
Addressing the False Negative Problem of Deep Learning MRI Reconstruction Models by Adversarial Attacks and Robust Training
Solving Inverse Problems With Deep Neural Networks-Robustness Included?
Improving robustness of deep-learning-based image reconstruction
Boosting the signal-to-noise of low-field MRI with deep learning image reconstruction
Usage statistics of JPEG for websites
A fast iterative shrinkage-thresholding algorithm for linear inverse problems
SigPy: a python package for high performance iterative reconstruction
Step-by-Step Reconstruction Using Learned Dictionaries
K-SVD: An algorithm for Designing Overcomplete Dictionaries for Sparse Representation
Orthogonal matching pursuit for sparse signal recovery with noise
Methods of conjugate gradients for solving linear systems
U-net: Convolutional networks for biomedical image segmentation
Adam: A method for stochastic optimization
DeepInPy: Deep Inverse Problems for Python git repo
PILLOW (PIL Fork) Documentation
SSIM-PIL documentation