title: Is Medical Chest X-ray Data Anonymous?
authors: Packhauser, Kai; Gundel, Sebastian; Munster, Nicolas; Syben, Christopher; Christlein, Vincent; Maier, Andreas
date: 2021-03-15

With the rise and ever-increasing potential of deep learning techniques in recent years, publicly available medical datasets have become a key factor in enabling the reproducible development of diagnostic algorithms in the medical domain. Medical data contains sensitive patient-related information and is therefore usually anonymized before publication by removing patient identifiers, e.g., patient names. To the best of our knowledge, we are the first to show that a well-trained deep learning system is able to recover the patient identity from chest X-ray data. We demonstrate this using the publicly available large-scale ChestX-ray14 dataset, a collection of 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our verification system is able to identify whether two frontal chest X-ray images are from the same person with an AUC of 0.9940 and a classification accuracy of 95.55%. We further highlight that the proposed system is able to reveal the same person even ten and more years after the initial scan. When pursuing a retrieval approach, we observe an mAP@R of 0.9748 and a precision@1 of 0.9963. Furthermore, we achieve an AUC of up to 0.9870 and a precision@1 of up to 0.9444 when evaluating our trained networks on CheXpert and the COVID-19 Image Data Collection. Based on this high identification rate, a potential attacker may leak patient-related information and additionally cross-reference images to obtain further information. Thus, there is a great risk of sensitive content falling into unauthorized hands or being disseminated against the will of the concerned patients. Especially during the COVID-19 pandemic, numerous chest X-ray datasets have been published to advance research. Such data may therefore be vulnerable to attacks by deep learning-based re-identification algorithms.

Chest radiography (X-ray) is a modality that is routinely used for diagnostic procedures around the world 1. It has become the most common medical imaging examination for pulmonary diseases and allows a clear investigation of the thorax 2. Chest X-ray imaging is therefore well suited for diagnosing several pathologies, including pulmonary nodules, masses, pleural effusions, pneumonia, COPD, and cardiac abnormalities 3. It is also used for COVID-19 screening 4, as abnormalities characteristic of those infected with the coronavirus can be detected in radiographs 5. While chest radiography plays a crucial role in clinical care, discovering certain diseases and abnormalities in chest radiographs can be a challenging task for radiologists, which potentially results in undesirable misdiagnoses 6. Therefore, computer-aided detection (CAD) systems based on deep learning (DL) 7 techniques have been developed in recent years to facilitate radiology workflows. These systems, characterized by their enormous benefits, can be utilized for a wide range of applications, e.g., for the automatic recognition of abnormalities in chest radiographs 3, 8 and the detection of tumors in mammography 9. Some techniques even show the potential to exceed human performance 10.
However, CAD systems are treated only as an additional source of support for radiologists, increasing the certainty of their reading decisions. On the one hand, the large variety of medical applications allows DL to grow and to tackle real-life problems that were previously not solvable, or to improve solutions offered by traditional machine learning methods 7. On the other hand, DL is a data-driven approach and well known for its need for big data to train neural networks 11, 12. For these reasons, a vast number of medical datasets has been published in recent years, enabling researchers to develop diagnostic algorithms in the medical field in a reproducible way 13. These include several large-scale chest radiography datasets, e.g., the CheXpert 14, the PLCO 15, and the ChestX-ray14 16 datasets. Especially during the COVID-19 pandemic 17, 18, the number of publicly available chest radiography datasets increased rapidly. A few selected examples are the COVID-19 Image Data Collection 19, the Figure 1 COVID-19 Chest X-ray Dataset Initiative 20, the ActualMed COVID-19 Chest X-ray Dataset Initiative 21, and the COVID-19 Radiography Database 22. Chest radiography datasets typically consist of two parts: first, the image data itself, which provides clinical information about the anatomical structure of the thorax; second, the associated metadata, which contains sensitive patient-related information that is either stored in a separate file or embedded directly in the images 23. Proper data anonymization constitutes an important step when preparing medical data for public usage to ensure that a patient's identity cannot be revealed in publicly available datasets 23.

Figure 1. General problem scenario: Comparing a given chest radiograph to publicly available dataset images by means of DL techniques would either result in discrete labels indicating whether or not the dataset images belong to the same patient as the given radiograph (verification scenario) or yield a ranked list of the most similar radiographs related to the given scan (retrieval scenario). Images belonging to the same patient are highlighted with the same color. The given radiograph is marked with an asterisk. The shown cases would enable a potential attacker to link sensitive patient-related information contained in the dataset to the image of interest.

In practice, one attempts to remove any personally identifiable information from the data before it is shared. These objectives and requirements are specified, e.g., by the Health Insurance Portability and Accountability Act (HIPAA) 24 in the United States or the General Data Protection Regulation (GDPR) 25 in Europe. In 2017, Google entered into a project with the National Institutes of Health (NIH) to publish a dataset containing 100,000 chest radiographs. However, the release was canceled two days before publication after Google was informed by the NIH that the radiographs still contained personal information, indicating that the data had been incorrectly anonymized 26, 27. This major incident highlights the many pitfalls that can arise when clinical and technological institutions collect and share large medical datasets to revolutionize healthcare. In the past, various data de-identification techniques have been proposed, including commonly used methods such as pseudonymization 28 and k-anonymity 29. Pseudonymization describes a technique that replaces a true identifier, e.g., the name or the patient identification number, by a pseudonym that is unique to the patient but has no relation to the person 28.
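As a concrete illustration of pseudonymization as just described, the following minimal Python sketch replaces true identifiers with random, patient-unique pseudonyms. The records, field names, and helper function are purely hypothetical and serve only to make the technique tangible; they are not drawn from any dataset discussed in this work.

```python
import secrets

# Hypothetical patient records; the identifiers are invented for illustration.
records = [
    {"name": "Jane Doe", "diagnosis": "pneumonia"},
    {"name": "John Roe", "diagnosis": "cardiomegaly"},
]

pseudonyms = {}  # true identifier -> stable, patient-unique pseudonym

def pseudonymize(record):
    """Replace the true identifier with a random pseudonym that is unique
    to the patient but carries no relation to the person."""
    name = record.pop("name")
    if name not in pseudonyms:
        pseudonyms[name] = f"patient_{secrets.token_hex(4)}"
    record["patient_id"] = pseudonyms[name]
    return record

released = [pseudonymize(dict(r)) for r in records]
# The released records carry no direct identifier, but they remain linkable
# through the image content itself, which is exactly the weakness at issue.
```

Note that the pseudonym table must be kept secret: if it leaks, the mapping is trivially reversible, and even without it the released records remain linkable, as argued next.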
However, pseudonymization is a rather weak anonymization technique, as the patient's identity may still be revealed, e.g., by cross-referencing with other publicly available datasets. In contrast, k-anonymity modifies the data before sharing in such a way that every sample in the published dataset can be associated with at least k different subjects. In this way, the probability of identity disclosure is limited to at most 1/k 29, 30, e.g., to at most 20 % for k = 5. Nevertheless, when background knowledge is available, k-anonymity is susceptible to many attacks. To date, little attention has been paid to the possibility of re-identifying patients in large medical datasets by means of DL techniques. In theory, however, medical data disclosure, as illustrated in Figure 1, could be facilitated for potential attackers by using suitable DL approaches. Consider a publicly available dataset that is supposedly anonymized but contains further sensitive patient-related information, e.g., diagnosis, treatment history, and clinical institution. If a radiograph of known identity is accessible to a potential attacker and a properly working verification or re-identification model exists, then the model could be used to compare the given radiograph to each image in the dataset, which would essentially result in a set of images belonging to the same patient (patient verification) or yield a ranked list of the images most similar to the given radiograph (patient re-identification). In this way, the patient's identity may be linked to sensitive data contained in the dataset. As a result, further patient-related information may be leaked, highlighting the enormous data security and data privacy issues involved. In our work, we investigated whether conventional anonymization techniques are secure enough and whether it is possible to re-identify and de-anonymize individuals from their medical data using DL-based methods. To this end, we considered the public ChestX-ray14 dataset 16, which is one of the most widely used research datasets for radiographic problems. Our algorithms are trained to determine whether two arbitrary chest radiographs belong to the same patient or not. Moreover, we show that our proposed methods are able to perform a successful linkage attack on publicly available chest radiography datasets. Furthermore, this work aims to draw attention to the massive problem of releasing medical data without considering that DL systems can easily be used to reveal a patient's identity. We therefore call for reconsidering conventional anonymization techniques and for developing more secure methods that resist potential attacks by DL algorithms. First, we trained a siamese neural network (SNN) architecture on the ChestX-ray14 dataset to determine whether two individual chest radiographs correspond to the same patient or not. Our model was designed to process the two input images in two identical network branches, which are then combined by a merging layer. The fused information is fed through further network layers, resulting in a single output score indicating the identity similarity. Table 1 summarizes the outcomes of our evaluation. We analyzed a multitude of experimental setups with varying learning rates η and differing balanced training set sizes N_s.
Moreover, we investigated the effect of using epoch-wise randomized negative pairs (RNP) versus fixed training sets (FTS) for the entire learning procedure. When using RNP as the data handling technique, the negative image pairs were randomly constructed in each epoch, meaning that many more negative pairs could be utilized in a complete training run compared to FTS, where the generated image pairs remain the same for the entire learning procedure. For all experiments on the ChestX-ray14 dataset, we used the same balanced validation and testing sets with 50,000 and 100,000 image pairs, respectively, without patient overlap between any splits. To assess the performance of the trained models, we performed a receiver operating characteristic (ROC) analysis by computing the AUC value together with the 95 % confidence intervals from 10,000 bootstrap runs. Moreover, we calculated the accuracy, specificity, recall, precision, and F1-score. The results indicate that the amount of training data plays a crucial role in the patient verification task. We observe a significant performance increase as the training set size grows. For instance, when using a subset of 100,000 image pairs for training, we obtain an AUC value of 0.8610. In contrast, by enlarging the training set size to 800,000 image pairs (i.e., the total of 400,000 positive image pairs combined with 400,000 negative pairs), we achieve an AUC score of 0.9896. These findings are visualized in the ROC curves shown in Figure 2, which illustrate the effect of the training set size on the verification performance when using fixed training sets. Note that Table 1 only shows the best experiment per training set size N_s. Additional experiments were conducted to investigate the effect of the learning rate (LR). The corresponding results are provided in a separate table in the appendix (see Table 5). We also observed that randomly constructing the negative image pairs in each epoch led to further improvements in the final model performance. Using this data handling technique, we achieved our overall best results. The respective outcomes are reported in Table 1. When training our network architecture with a total of 800,000 training samples with epoch-wise randomly constructed negative pairs, the AUC score improved from 0.9896 to 0.9940. In addition, all other reported evaluation metrics apart from the recall also increased compared to the results achieved by the model trained with the fixed set. Figure 3 depicts the confusion matrix resulting from our best-trained model listed in Table 1 (last row), giving clear insights into the patient verification performance. We also analyzed how our best model behaves when comparing images of the same patient whose acquisition dates are several years apart. The results are illustrated in Figure 4a. We obtained a TPR of 0.97 for image pairs with small age differences of one year or less. As the age variation between the follow-up images and the initial scan increases, we observe a slight decrease in the TPR values. Nevertheless, our model still shows competitive results even if the patient's age in the two images differs by several years. Even for an age difference of twelve years, two images of the same patient are still correctly verified in 86 % of cases.
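For clarity, the epoch-wise randomized negative pairing (RNP) used in our best runs above can be sketched as follows. This is a minimal Python illustration, assuming a mapping from patient IDs to their image paths; the function and variable names are ours and do not reflect the authors' released code.

```python
import random

def build_epoch_pairs(images_by_patient, seed=None):
    """Construct one epoch of training pairs: fixed positives, fresh negatives.

    images_by_patient: dict mapping a patient ID to a list of image paths.
    Returns a list of (path_a, path_b, label) with label 1 for same patient.
    """
    rng = random.Random(seed)
    patients = list(images_by_patient)

    # Positive pairs: all unordered combinations per patient (fixed across epochs).
    positives = [
        (imgs[i], imgs[j], 1)
        for imgs in images_by_patient.values()
        for i in range(len(imgs)) for j in range(i + 1, len(imgs))
    ]

    # Negative pairs: re-sampled every epoch, so far more distinct negatives
    # are seen over a full training run than with a fixed training set (FTS).
    negatives = []
    while len(negatives) < len(positives):
        p, q = rng.sample(patients, 2)  # two different patients
        negatives.append((rng.choice(images_by_patient[p]),
                          rng.choice(images_by_patient[q]), 0))

    pairs = positives + negatives  # balanced by construction
    rng.shuffle(pairs)
    return pairs
```

Calling this function once per epoch yields the RNP behavior, whereas calling it once before training and reusing the result corresponds to the FTS setting.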
At this point, we want to mention that the NIH ChestX-ray14 dataset contains a few images with unrealistic age information, e.g., a patient age of 414 years. With only 16 such cases, the proportion of images with unreasonable age information is very low, accounting for only 0.01 % of the total number of images. Since incorrect information about a patient's age does not negatively influence our training strategy, we did not exclude such images from our experiments, meaning that a few positive pairs with seemingly large follow-up intervals could occur during training. However, the test set on which Figure 4 is based does not contain any positive image pairs with unreasonable follow-up intervals. We only report the TPRs for image pairs with follow-up intervals of up to 12 years in Figure 4a, as the number of pairs with larger intervals is relatively small. Additionally, we investigated the model's verification capability in the case of new abnormality patterns appearing in follow-up scans that did not occur in previously acquired chest radiographs. Figure 4b shows that, regardless of the abnormality, we observe nearly no decline in the TPR values, emphasizing the robustness of our trained SNN architecture. Furthermore, Figure 4c illustrates that changes in the projection view (e.g., one image taken in the anterior-posterior position and the other image acquired using the posterior-anterior view) hardly lead to any deterioration in performance. Moreover, we performed a qualitative evaluation, visually inspecting exemplary image pairs evaluated with our best-performing verification model. In Figure 5, we show four true positive (TP) classifications (a)-(d), one pair that has been classified as a false positive (FP) (e), and one example of a false negative (FN) image pair (f). The shown images clearly illustrate the high technical variance present in the ChestX-ray14 dataset. The first image pair (a) shows two images belonging to the same patient, acquired seven years apart. Clear differences in pixel intensities and lung shape are observed. However, both images belong to the same person, cf. the small vascular clips in the area of the upper right lung. Image pairs with large differences in scaling (b) or rotation (c) are also verified correctly. Our model is likewise robust to the patients' pathology: while the upper image of (b) shows characteristics of pneumothorax, the patient suffered from cardiomegaly, effusion, and masses in the lower image, according to the provided annotations. Similarly in (c), the upper image indicates the presence of infiltration and pneumothorax, whereas the lower scan shows signs of infiltration and nodules. Figure 5e shows an exemplary image pair that has falsely been classified as positive. Conversely, (f) depicts a positive image pair that has been incorrectly classified as negative. To visually demonstrate which parts of the images are responsible for the verification decision, we applied a siamese attention mechanism 31 to our network architecture, which utilizes the Grad-CAM algorithm 32. The obtained attention maps can be seen in Figures 7 and 8 in the appendix. They clearly indicate that the human anatomy, especially the shape of the lungs and ribs, is the driving factor for the network decisions. To investigate how foreign material (see Figure 5a) affects the verification performance, we evaluated our trained network on two small manually created subsets of around 200 images.
The first subset consisted only of images in which foreign material is visible, whereas the second contained solely images without foreign material. When constructing the subsets, we selected the patients at random and then assigned the corresponding patient images to the respective subset after visual assessment. Furthermore, we ensured that no more than 5 images were used per patient. Table 2 summarizes the results, indicating that patient verification works with high performance regardless of the occurrence of foreign material. We even observe a slight improvement in performance for the subset in which no foreign material is visible in the images. Finally, to analyze whether our trained model is able to generalize to other datasets that were not used during training, we evaluated our network on the CheXpert dataset and the COVID-19 Image Data Collection. The results are summarized in the last two rows of Table 2. It can be seen that our network still yields high AUC values of 0.9870 (CheXpert) and 0.9763 (COVID-19) for the verification task, although the model has not been fine-tuned on the respective datasets. The other presented evaluation metrics also show competitive values without deteriorating too much. For our patient re-identification experiments, we trained another SNN architecture on the ChestX-ray14 dataset. In contrast to the verification model, we omitted all layers from the merging layer onwards. The main objective was to learn appropriate feature representations instead of directly determining whether the inputs belong to the same patient or not. After training the network, we used the ResNet-50 backbone as a feature extractor for the actual image retrieval task. By computing the Euclidean distance between the embeddings of the query image and each other image, we obtained for each query image a ranked list of its most similar images in terms of identity. The results of the corresponding image retrieval experiments are summarized in Table 3.

Table 3. Overview of the obtained results for our image retrieval experiments. In this table, we report the mAP@R, the R-Precision, and the Precision@1. The first 4 rows show the results on the ChestX-ray14 dataset for different image resolutions used for evaluation. The fifth row shows the outcomes on the CheXpert dataset. The last row indicates the results on the COVID-19 Image Data Collection. Bold text represents the overall highest performance metrics.

When using the original image size of 1024×1024 pixels for evaluation, we obtain a precision@1 of more than 99 %, showing that the closest match is nearly always from the same patient. The high mean average precision at R (mAP@R) of about 97 % further shows that most of the relevant images are correctly identified. We observed a slight decrease in performance as the image size was reduced. Nevertheless, when the images were downsampled to a resolution of 512×512 pixels, we still obtained high performance values. When the image size was reduced too aggressively, e.g., to 224×224 pixels, the mAP@R and the R-Precision rates drop. Yet, we still observed a high precision@1 of more than 97 %. Similar to the experiments in the patient verification section, we evaluated our best-trained re-identification model on two small subsets, one of which contained only images with visible foreign material, while the other consisted exclusively of images without foreign material. The obtained results are presented in Table 4.
As can be seen, we achieve high performance values for both subsets. Thus, we hypothesize that our outcomes are independent of foreign material, which may occur only for specific patients. Lastly, we analyzed the re-identification performance on the CheXpert dataset and the COVID-19 Image Data Collection. As can be seen in the last two rows of Table 3, we also obtain high retrieval values although we did not perform any fine-tuning on either dataset, which demonstrates the feasibility of the trained re-identification network on previously unseen datasets. In this paper, we investigated the patient verification and re-identification capabilities of DL techniques on chest radiographs. We have shown that well-trained SNN architectures are able to compare two individual frontal chest radiographs and reliably predict whether these images belong to the same patient or not. Moreover, we have shown that DL models have the potential to accurately retrieve relevant images in a ranked list. Our models were evaluated on the publicly available ChestX-ray14 dataset and showed competitive results, with an AUC of up to 0.9940 and a classification accuracy of more than 95 % in the verification scenario, and an mAP@R of 97 % and a precision@1 of about 99 % in the image retrieval scenario. In particular, the fact that basic SNNs are capable of re-identifying patients despite potential age differences, disease changes, or differing projection views demonstrates the effectiveness of DL techniques for this task. However, note that the shown results were obtained empirically, i.e., they do not necessarily reflect true measures of certainty. As shown in Figure 5, the used dataset suffers from a high technical variance, which may be caused by the various windowing techniques applied to the images. In a real-life scenario, the resulting variations in image contrast and brightness could be significantly mitigated by using dynamic normalization approaches 33. Furthermore, we believe that variations in rotation and scaling can be counteracted by appropriate alignment algorithms. Nevertheless, even without such pre-processing steps, we were able to show that patient matching for chest radiographs is possible with high performance using DL techniques. Moreover, we hypothesize that special noise patterns characteristic of individual patients appear in the images, which might unintentionally improve the re-identification performance. For example, the initial anonymization strategy may be biased towards the clinical institution and, therefore, also towards follow-up images. To get a better impression of the re-identification capability of our SNN architecture, we also intend to investigate other datasets that show less or, ideally, no correlation between potential noise patterns and the patient identity. Further research on multiple datasets should therefore be considered. For our experiments, we already evaluated our models on two completely different datasets, the CheXpert dataset and the COVID-19 Image Data Collection. While the evaluation metrics are lower, we still obtain AUC scores of over 97 % (COVID-19) and 98 % (CheXpert) and precision@1 values of more than 88 % (COVID-19) and 94 % (CheXpert) without fine-tuning on these datasets. This indicates that patient verification and re-identification are also applicable to data acquired in various hospitals around the world, where the pre-processing steps taken before data publication may differ from those of the ChestX-ray14 dataset.
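To make the linkage-attack scenario discussed here concrete, the following schematic Python sketch shows how a trained verification model with the two-image interface described in the Methods could be scored against an entire public dataset. All names are placeholders, and the 0.5 threshold mirrors the evaluation threshold used above; this sketch illustrates the risk and is not the authors' implementation.

```python
import torch

@torch.no_grad()
def link_query_to_dataset(model, query_img, dataset_imgs, threshold=0.5):
    """Verification scenario: flag all dataset images that the trained SNN
    assigns to the same patient as the query radiograph.

    query_img:    tensor of shape (3, 256, 256), identity known to the attacker
    dataset_imgs: iterable of (image_id, tensor) pairs from a public dataset
    """
    model.eval()
    matches = []
    for image_id, img in dataset_imgs:
        score = model(query_img.unsqueeze(0), img.unsqueeze(0)).item()
        if score >= threshold:  # same decision threshold as in the evaluation
            matches.append((image_id, score))
    # Highest-confidence matches first; the metadata attached to these images
    # can now be linked to the query identity, which is the privacy risk at issue.
    return sorted(matches, key=lambda m: -m[1])
```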
The COVID-19 Image Data Collection is very heterogeneous, containing, e.g., images of different sizes, both gray-scale and color images, and images with visible markers, arrows, or date displays. For our experiments, only those images in the COVID-19 Image Data Collection that were acquired using the anterior-posterior or the posterior-anterior view were used, while images taken in the lateral position and CT scans were discarded. Apart from this, no further steps were taken to ensure the quality of the dataset. Although some of the factors mentioned above (e.g., brandings such as markers, arrows, or dates) may facilitate patient re-identification, we hypothesize that the COVID-19 Image Data Collection poses a realistic example of a public medical dataset, and we therefore consider the conducted experiment an authentic real-life application scenario. Furthermore, we want to accentuate that our trained network architectures are able to handle non-rigid transformations that may appear between two images of the same person in the ChestX-ray14 dataset. Such deformations can occur due to different breathing states in follow-up scans or due to different positioning during X-ray acquisition. Hence, the shape of the heart and lungs, or the contours of the ribs, may appear deformed compared to an initial scan. The obtained results lead to the assumption that our trained SNN architectures can withstand such deformations and can therefore be used for reliable patient re-identification on chest radiographs. We conclude that publicly available medical chest X-ray data is not entirely anonymous. Using a DL-based re-identification network enables an attacker to compare a given radiograph with public datasets and to associate accessible metadata with the image of interest. Thus, sensitive patient data is exposed to a high risk of falling into the unauthorized hands of an attacker, who may disseminate the gained information against the will of the concerned patient. At this point, we want to emphasize that data leakage of this kind requires that the attacker has previously gained access to an image of a known person. This could happen, for example, through a stolen CD containing raw medical data of a specific patient, or by bribing corrupt medical staff at a radiological facility. Furthermore, data breaches due to inadequate data security measures at, e.g., healthcare institutions or health insurance companies represent a possibility for attackers to obtain images of known patients, which could subsequently be utilized for a linkage attack as presented in our work. However, even if the attacker owns an image of an unknown identity, a re-identification model can be used to find the same patient across various datasets. Assuming multiple datasets contain the same patient but different metadata, an attacker would be able to obtain a more complete picture of the respective patient. We hypothesize that collecting patient information by this means could significantly help an attacker infer the true identity of the patient. We therefore urge that conventional anonymization techniques be reconsidered and that more secure methods be developed to resist potential attacks by DL-based algorithms. Potential solutions to the problems addressed in our work may be found in privacy-preserving approaches such as differential privacy (DP) 34, 35, which guarantees that the global statistical distribution of a dataset is retained while individually recognizable information is reduced 36.
This means that an outside observer is unable to draw conclusions about the presence or absence of a particular individual. Differentially private datasets can therefore withstand linkage attacks attempting to reveal a patient's identity. One commonly used technique to achieve DP is to modify the input by adding noise to the dataset 36, 37. This, however, may degrade the quality of the data, especially when applied to medical images. In this context, we consider differentially private synthetic medical image generation 38 to be a promising research area. Nevertheless, further exploration of these topics is required before general conclusions can be drawn. Aside from data-perturbation-based privacy approaches, we want to mention that the use of collaborative decentralized learning protocols such as federated learning (FL) 39 can significantly contribute to a safer use of medical data. By training a machine learning model collaboratively without centralizing the data, the need for raw data sharing or dataset release is eliminated 40. Thus, the medical data can remain with its owner, e.g., the healthcare institution where it was acquired, which resolves data governance and ownership issues 36. However, FL itself does not provide full data security and privacy, meaning that some risks remain unless it is combined with other privacy-preserving methods. To re-identify patients from their chest radiographs, we employ SNN architectures for both the classification and the retrieval tasks. An SNN receives two input images, which are processed by two identical feature extraction blocks sharing the same set of network parameters. The resulting feature representations can then be used to compare the inputs. The concept of an SNN was initially introduced by Bromley et al. 41 for handwritten signature verification. Taigman et al. 42 applied this idea in the field of face verification and proposed the DeepFace system. Moreover, Koch et al. 43 presented an approach for one-shot learning on the Omniglot 44 and MNIST 45 datasets. With a total of 112,120 frontal-view chest radiographs from 30,805 unique patients, the NIH ChestX-ray14 16 dataset is one of the largest publicly available chest radiography datasets in the scientific community. Due to follow-up scans, the image collection provides an average of 3-4 images per patient. The originally acquired radiographs were published as 8-bit gray-scale PNG images with a size of 1024×1024 pixels. Associated metadata is available for all images in the dataset. The additional data comprises information about the underlying disease patterns (either no finding or a combination of up to 14 common thoracic pathologies), the number of follow-up images taken, the patients' age and gender, and the projection view (anterior-posterior or posterior-anterior) used for radiography acquisition. According to the publisher, the dataset was carefully screened to remove all personally identifiable information before release 46. To this end, the patient names were replaced by integer IDs. Moreover, personal data in the image domain itself was made unrecognizable by placing black boxes over the corresponding image areas. The CheXpert 14 dataset contains 224,316 frontal and lateral chest radiographs of 65,240 patients who underwent a radiographic examination at Stanford University Medical Center between October 2002 and July 2017.
The originally acquired radiographs were published as 8-bit gray-scale JPG images with varying image resolutions. Note that only frontal chest radiographs were used in our work, whereas lateral images were excluded. The COVID-19 Image Data Collection 19 is a dataset that was created and published as an initiative to provide COVID-19-related chest radiographs and CT scans for machine learning tasks. It comprises data of 448 unique patients and a total of around 950 images with different image resolutions. In this work, only the available frontal radiographs were utilized, whereas the lateral images and CT scans were discarded. Since SNN architectures require pairs of images for training and evaluation, we construct both positive and negative image pairs from the images contained in the ChestX-ray14 dataset. In this context, a positive pair consists of two images belonging to the same patient, whereas a negative pair comprises two images that belong to different patients. Mathematically, the constructed dataset can be described as

\[ \mathcal{D} = \left\{ \left( \boldsymbol{x}_1^{(i)}, \boldsymbol{x}_2^{(i)}, y^{(i)} \right) \right\}_{i=1}^{N}, \qquad y^{(i)} = \begin{cases} 1 & \text{if both images belong to the same patient,} \\ 0 & \text{otherwise.} \end{cases} \]

To ensure that images from one patient appear only in either the training, validation, or testing set, we use a patient-wise splitting strategy. According to the official split, the data is roughly divided into 70 % training, 10 % validation, and 20 % testing. Based on this split, we construct the actual image pairs for each subset. For patient verification, we follow an offline mining approach, meaning that the positive and negative image pairs are generated once before conducting the experiments. First, the positive pairs are generated by considering only the patients for whom multiple images exist in the respective subset. For each patient with follow-up images, we produce all possible pair combinations, treating the images as unique. Following this approach, we are able to construct a total of around 400,000 positive image pairs for our training set. The negative pairs in each subset are randomly generated and afterwards concatenated with the respective positive pairs. For the patient re-identification experiments, we choose an online mining approach, meaning that image pairs are formed in each batch during the training procedure: the embeddings of all batch images are first computed and then used in all possible combinations as input for the loss function. Moreover, all patients with only one available image were discarded from the training set. For patient verification, the used SNN architecture (see Figure 6) receives two images x_1 and x_2 of size 3×256×256. Both inputs are processed by a pre-trained ResNet-50 incorporated in each network branch. In its original version, ResNet-50 was designed to classify images into the 1,000 object categories of the ImageNet 47 dataset. To adapt the ResNet-50 to our specific needs, we replace its classification layer with a layer consisting of 128 output neurons, producing the feature representations z_1 and z_2, respectively. To merge both network branches, the absolute difference of the sigmoid activations of the two feature vectors is computed. We add a fully-connected (FC) layer to reduce the dimensionality to one neuron, followed by another sigmoid activation function σ, which yields the final output score ŷ ∈ [0, 1]. The verification model is trained using the binary cross-entropy (BCE) loss. The network parameters are optimized by combining mini-batch stochastic gradient descent (SGD) 48, 49 with the adaptive moment estimation (Adam) 50 method.
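The verification architecture just described can be summarized in a condensed PyTorch sketch. It follows the stated design (shared ResNet-50 branches with a 128-dimensional embedding layer, absolute difference of sigmoid activations, an FC layer to a single sigmoid output, BCE loss, Adam optimization); any detail beyond those stated, such as the exact learning rate, is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

class VerificationSNN(nn.Module):
    """Siamese network for patient verification on chest radiographs."""

    def __init__(self, emb_dim=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = nn.Linear(backbone.fc.in_features, emb_dim)  # replaced classifier
        self.backbone = backbone            # shared weights = two identical branches
        self.head = nn.Linear(emb_dim, 1)   # merges the branches into one score

    def forward(self, x1, x2):
        z1 = torch.sigmoid(self.backbone(x1))
        z2 = torch.sigmoid(self.backbone(x2))
        d = torch.abs(z1 - z2)              # merging layer: absolute difference
        return torch.sigmoid(self.head(d)).squeeze(1)  # identity-similarity score

model = VerificationSNN()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # LR varied in our experiments
```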
The batch size N_b is set to 32 in all our experiments. We use different LRs to investigate their effect on the model's performance. Furthermore, we include an early stopping criterion with a patience of p = 5, which means that the training procedure stops as soon as the network does not improve for 5 epochs. We train the architecture using input dimensions of 3×256×256. We utilize ROC curves to visualize the performance of the trained verification models. A ROC curve is a two-dimensional graph in which the TPR is plotted against the false positive rate (FPR) at various threshold settings 51, thus indicating how many true positive classifications can be gained as an increasing number of false positive classifications is allowed. Additionally, we calculate the AUC, which corresponds to a proportion of the area of the unit square and always ranges from 0 to 1 51. The higher the AUC score, the better the model's average performance. Nevertheless, it has to be mentioned that a classifier with a high AUC might perform worse in a specific region of ROC space than a classifier with a low AUC value. Moreover, we evaluate the performance by computing the accuracy, specificity, recall, precision, and F1-score. For these metrics, the threshold at the output neuron is set to t = 0.5. For patient re-identification, we train an SNN architecture that receives two images x_1 and x_2 of size 3×1024×1024. Both inputs are processed by a pre-trained ResNet-50 incorporated in each network branch. However, the network head of the used ResNet-50 is slightly modified. The average pooling layer is replaced by an adaptive average pooling layer producing feature maps of size 5×5. In addition to the adaptive average pooling layer, an adaptive max-pooling layer is applied, which also yields feature maps of size 5×5. The outputs of the pooling layers are concatenated and processed by a 1×1 convolutional layer reducing the number of feature maps to 100. The feature maps are then flattened, followed by two successive FC layers resulting in 128-dimensional feature representations z_1 and z_2 for the first and the second network branch. The re-identification model is trained using the contrastive loss function 52, which is typically utilized to achieve a meaningful mapping from a high- to a low-dimensional space. By using the contrastive loss, the network learns to map similar inputs to nearby points on the output manifold, while dissimilar inputs are mapped to distant points. Negative pairs contribute to the loss only if their distance is smaller than a certain margin m. In this work, the margin is set to m = 1. For our image retrieval experiments, the SNN architecture is optimized using the SGD algorithm in combination with the 1cycle learning policy 53, 54. When using the 1cycle LR schedule, the LR η steadily increases until it reaches a chosen maximum value and gradually decreases again thereafter. This schedule changes the LR after every single batch and is followed for a pre-defined number of epochs. The upper bound is chosen at 0.1584 with the help of an LR finder. The lower bound is set to 0.0063. The L2 regularization technique is used with a decay factor of 10^-5. Moreover, the batch size is set to 32. We optimize the SNN architecture by first training the adapted network head of the incorporated ResNet-50 for 30 epochs with all other parameters frozen. Then, the complete architecture is trained for another cycle, this time consisting of 50 epochs.
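A PyTorch sketch of the modified re-identification head and the contrastive loss, following the description above (adaptive average plus max pooling to 5×5 maps, a 1×1 convolution down to 100 channels, two FC layers to 128 dimensions, margin m = 1). The width of the intermediate FC layer is not stated in the text and is an assumption here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class RetrievalBackbone(nn.Module):
    """ResNet-50 feature extractor with the modified head described above."""

    def __init__(self, emb_dim=128, hidden=512):  # hidden width is assumed
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")
        self.body = nn.Sequential(*list(resnet.children())[:-2])  # drop pool and fc
        self.avg = nn.AdaptiveAvgPool2d(5)
        self.max = nn.AdaptiveMaxPool2d(5)
        # Concatenating the two pooled maps doubles the 2048 channels;
        # the 1x1 convolution reduces them to 100 feature maps.
        self.conv1x1 = nn.Conv2d(2 * 2048, 100, kernel_size=1)
        self.fc = nn.Sequential(
            nn.Linear(100 * 5 * 5, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, x):
        f = self.body(x)                                   # (N, 2048, H, W)
        f = torch.cat([self.avg(f), self.max(f)], dim=1)   # (N, 4096, 5, 5)
        f = self.conv1x1(f).flatten(1)                     # (N, 2500)
        return self.fc(f)                                  # (N, 128) embedding

def contrastive_loss(z1, z2, y, margin=1.0):
    """y = 1 for a positive pair; negatives contribute only within the margin."""
    d = F.pairwise_distance(z1, z2)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()
```

After training, the backbone alone serves as the feature extractor, and retrieval reduces to ranking the Euclidean distances between the query embedding and all dataset embeddings.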
Since the batch size limits the construction of informative positive and negative pairs in the online mining strategy, the concept of cross-batch memory 55 is utilized to generate sufficient pairs across multiple mini-batches. This concept is based on the observation that the embedding features generally tend to change slowly over time. This "slow drift" phenomenon allows the use of embeddings from previous iterations that would normally be considered outdated and discarded. For our experiments, a memory size of 128 is chosen, meaning that the last 4 batches are considered for mining. To evaluate the re-identification performance of our trained model, several metrics are computed. R-Precision represents the precision at R, where R denotes the number of relevant images for a given query image. In other words, if the top-R retrieved images contain r relevant images, the R-Precision is calculated as

\[ \text{R-Precision} = \frac{r}{R}. \tag{2} \]

Note that this value is averaged over all query samples. Precision@1 constitutes a special case and evaluates how often the top-1 image in a retrieved list is relevant. To further consider the order of the relevant images within the retrieved list, the mean average precision at R (mAP@R) is computed as

\[ \text{mAP@R} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP@R}_q, \tag{3} \]

i.e., the mean of the average precision scores at R (AP@R) over all Q query images. The AP@R is the average of the precision values over all R relevant samples,

\[ \text{AP@R} = \frac{1}{R} \sum_{i=1}^{R} P@i \cdot \mathit{rel}@i, \tag{4} \]

where P@i refers to the precision at rank i and rel@i is an indicator function that equals 1 if the sample at rank i is relevant and 0 otherwise. The NIH ChestX-ray14 dataset used throughout the current study is available via Box at https://nihcc.app.box.com/v/ChestXray-NIHCC. The COVID-19 Image Data Collection is available on GitHub at https://github.com/ieee8023/covid-chestxray-dataset. The CheXpert dataset can be requested at https://stanfordmlgroup.github.io/competitions/chexpert. The code used to train and evaluate both the patient verification and the patient re-identification models will be made available after acceptance. Correspondence and requests for materials should be addressed to K.P. or S.G.
References

Medical Imaging Systems: An Introductory Guide
Interpretation of Plain Chest Roentgenogram
Learning to recognize Abnormalities in Chest X-Rays with Location-Aware Dense Networks
World Health Organization (WHO)
COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
Cognitive and System Factors Contributing to Diagnostic Errors in Radiology
Deep learning
Multi-task Learning for Chest X-ray Abnormality Classification on Noisy Labels
A region based convolutional network for tumor detection and classification in breast mammography
Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective
A gentle introduction to deep learning in medical image processing
Exploring Large-scale Public Medical Image Datasets
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial of the National Cancer Institute: History, organization, and status
ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases
Covid-19 pandemic: cardiovascular complications and future implications
Covid-19 pandemic: perspectives on an unfolding crisis. The British Journal of Surgery
COVID-19 Image Data Collection: Prospective Predictions are the Future
Figure 1 COVID-19 Chest X-ray Dataset Initiative
ActualMed COVID-19 Chest X-ray Dataset Initiative
COVID-19 Radiography Database
Preparing Medical Imaging Data for Machine Learning
Centers for Disease Control and Prevention. Health Insurance Portability and Accountability Act of 1996 (HIPAA)
Complete guide to GDPR compliance
Google axed release of vast x-ray dataset following NIH privacy concerns
Google scrapped the publication of 100,000 chest x-rays due to last-minute privacy problems
Pseudonymization of Radiology Data for Research Purposes
k-Anonymity: A Model for Protecting Privacy
Medical Data Privacy Handbook
Re-Identification with Consistent Attentive Siamese Networks
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Robust Classification from Noisy Labels: Integrating Additional Knowledge for Chest Radiography Abnormality Assessment. Medical Image Analysis
A Firm Foundation for Private Data Analysis
The Algorithmic Foundations of Differential Privacy
Secure, privacy-preserving and federated machine learning in medical imaging
Signal Processing and Machine Learning with Differential Privacy: Algorithms and challenges for continuous data. IEEE Signal Processing Magazine
Differentially Private Synthetic Medical Data Generation using Convolutional GANs
Federated Optimization: Distributed Machine Learning for On-Device Intelligence
The future of digital health with federated learning
Signature Verification using a "Siamese" Time Delay Neural Network
DeepFace: Closing the Gap to Human-Level Performance in Face Verification
Siamese Neural Networks for One-shot Image Recognition
Human-level concept learning through probabilistic program induction
Gradient-Based Learning Applied to Document Recognition
NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community
ImageNet: A Large-Scale Hierarchical Image Database
Neural Networks: Tricks of the Trade
Adam: A Method for Stochastic Optimization
An introduction to ROC analysis
Dimensionality Reduction by Learning an Invariant Mapping
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
Cyclical Learning Rates for Training Neural Networks
Cross-Batch Memory for Embedding Learning

The research leading to these results has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (ERC grant no. 810316). The authors declare no competing interests.