title: Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images
authors: Cherti, Mehdi; Jitsev, Jenia
date: 2021-05-31

Transfer learning aims to exploit pre-trained models for more efficient follow-up training on a wide range of downstream tasks and datasets, enabling successful training also on small data. Recently, strong improvements were shown for transfer learning and model generalization when increasing model, data, and compute budget scale in the pre-training. To compare the effect of scale in both intra- and inter-domain full and few-shot transfer, in this study we combine, for the first time, large openly available medical X-Ray chest imaging datasets to reach a dataset scale comparable to ImageNet-1k. We then conduct pre-training and transfer to different natural or medical targets while varying network size as well as source data scale and domain, the source being either large natural (ImageNet-1k/21k) or large medical chest X-Ray datasets. We observe strong improvement due to larger pre-training scale for intra-domain natural-natural and medical-medical transfer. For inter-domain natural-medical transfer, we find improvements due to larger pre-training scale on larger X-Ray targets in the full-shot regime, while for smaller targets and for the few-shot regime the improvement is not visible. Remarkably, large networks pre-trained on the very large natural ImageNet-21k are as good as or better than networks pre-trained on the largest available medical X-Ray data when performing transfer to large X-Ray targets. We conclude that high-quality models for inter-domain transfer can also be obtained by substantially increasing the scale of the model and of the generic natural source data, removing the necessity for large domain-specific medical source data in the pre-training. Code is available at: https://github.com/SLAMPAI/large-scale-pretraining-transfer
Re-using models obtained by pre-training on available source datasets to improve learning performance on upcoming target datasets is the core idea behind transfer learning. It has a long history in the machine learning field [1, 2] and was employed already at the very early rise of deep neural networks in the vision domain [3, 4]. Different architectures like AlexNet [5], OverFeat [6], and VGG [7] were pre-trained on supervised tasks using ImageNet-1k, a publicly available natural image dataset that contains about 1.4 million images and 1000 classes [8, 9]. After pre-training, the resulting models were taken as off-the-shelf generic feature reservoirs and re-used by re-training, or fine-tuning, on various downstream target datasets and tasks, including classification, object detection, and segmentation [3, 4]. Importantly, the transfer approach improved performance on target datasets compared to training from scratch with randomly initialized weights [10, 11]. Further, it enabled training models of good quality also on comparatively small amounts of data, in contrast to the large amounts usually required for learning high-quality models when training a deep neural network from scratch. A recent line of work on scaling laws in language modeling [12] and vision demonstrated strong improvements in a model's ability to generalize to unseen test data when increasing model, data, and compute budget scale during training.
In language modeling, very large Transformer networks pre-trained on very large text data have also shown very strong transfer performance on a broad range of novel tasks compared to pre-trained models of smaller scale [13]. In the same line, experimental studies on large-scale pre-training and transfer in the image domain found evidence that increasing network model and data size during pre-training results in transfer performance benefits [14, 15]. The majority of studies looking at the effect of pre-training scale on transfer deal with the intra-domain scenario, where source and target data are close to each other, often originating from the same domain, for instance natural images. This raises the question whether the observed positive effect of larger scale also holds in the inter-domain transfer scenario, where source and target data types are not so closely related. To address this, we conduct a series of large-scale pre-training and transfer experiments where we vary not only ResNet model [16, 14] and dataset size during pre-training, but also the domain of the source and target datasets, being either natural or medical X-Ray chest images, which allows us to study the effect of scale on both intra- and inter-domain transfer. To vary source data scale in the natural domain, we take either the large ImageNet-1k or the much larger ImageNet-21k [8]. To vary pre-training data scale for the medical X-Ray domain, we combine here, for the first time, large openly available medical X-Ray chest imaging datasets (CheXpert [17], MIMIC-CXR [18], PadChest [19], NIH Chest X-ray14 [20]) into supersets, with the largest scale comparable to ImageNet-1k. We then transfer to either natural or X-Ray image datasets as targets. For transfer, we also operate in either the full-shot or the few-shot regime, where only few examples per class are shown to the pre-trained models during fine-tuning on a target dataset. As large-scale pre-training requires heavy computational resources, we make use of a state-of-the-art supercomputer (JUWELS Booster [21]) tailored for distributed training to conduct our experiments. The results we obtain show that both intra- and inter-domain transfer benefit from larger pre-training scale. They also reveal a differentiated picture, suggesting that the effect of larger scale on transfer is expressed differently in the intra- and inter-domain scenarios and in the full-shot and few-shot regimes. Remarkably, and also of high relevance for practice, we observe that large networks pre-trained on the very large generic natural ImageNet-21k are as good as or better than networks pre-trained on the largest available medical domain-specific X-Ray superset data when performing full-shot transfer to large X-Ray targets. This indicates that high-quality models for domain-specific medical X-Ray targets can be obtained by increasing the scale of the network and of the generic natural image data in the pre-training, without relying on large amounts of domain-specific data that are often not available in practice. In contrast to previous studies that dealt with smaller scales [22], we conclude that inter-domain transfer from natural to medical images benefits from substantially larger pre-training scales. Scaling laws for generalization and transfer.
Strong evidence that increasing model and data size during training may result in steady improvement of generalization comes from language modeling studies systematically examining the dependency of test error on model size, data size, and compute budget used for training [13, 12, 23]. The experiments conducted there establish scaling laws of power-law form and show a consistent further decrease of test error when increasing model size, data size, and compute budget hand in hand over many orders of magnitude. For images, a similar line of work by Henighan et al. shows a decrease of top-1 classification test error when fine-tuning pre-trained generative image models of increasing size on ImageNet [24]. Those works use self-supervised training of autoregressive models, performed on text in language modeling and on image patches in the image domain, employing transformer networks as a backbone. Additional support for this line of work comes from studies that revisit the dependency of generalization performance on model size, data size, and epoch number during training and report double- or multi-descent curves for the test error [25, 26, 27]. There, substantially increasing model size, data size, or training time also shows a continuous drop in test error, pointing to generalization improvement, for instance when crossing the interpolation threshold and transitioning into the over-parameterized regime by scaling up the model size. Improving transfer by scaling up pre-training. Improvement in transfer on downstream datasets and tasks is also strongly evident from large-scale language modeling experiments. In the study by Brown et al. [13], large transformer networks on the order of hundreds of billions of parameters (GPT-3), pre-trained on large text datasets on the order of billions of sentences, were shown to have much stronger transfer performance than smaller GPT network models, measured by the test error on different downstream tasks. The difference in transfer performance between differently sized models was especially pronounced in the very low data regime when doing zero-shot or few-shot transfer, with only few examples available during fine-tuning. A further systematic study on transfer improvements induced by increasing scale was done by Hernandez et al. [23], who examined scaling laws for transfer on language modeling tasks in the low-data regime, defined as transfer using less than 10% of the available target data. The authors showed that increasing model size in the pre-training decreases test error on the target data, emphasizing that this test error improvement cannot be observed when increasing model size and training directly from scratch on the target without pre-training. It was also pointed out that the degree of proximity between the source and the target dataset plays a role when predicting the effect of scale on transfer performance, leading to a revised version of the scaling law that takes this additional dependency into account. In the image domain, studies on transfer improvement due to scale still have a far less systematic character. Models and datasets used for training on images are 3-4 orders of magnitude behind those studied in language modeling [13]. Recently, a number of works started to employ datasets like ImageNet-21k [8], YFCC-100M [28], JFT-300M [29] or JFT-3B [30] that are much larger than the standard ImageNet-1k to pre-train large network models on large data and observe the effect of scaling up on transfer.
The work on Big Transfer by Kolesnikov et al. [14] performed supervised classification-based pre-training on ImageNet-1k, ImageNet-21k, and JFT-300M using differently sized deep residual networks (ResNets [16]) to study the performance of the pre-trained models on transfer across different target datasets. They found consistent improvement in transfer performance when using larger models and larger data during pre-training. In the same direction, the works of [15, 30] pre-trained differently sized network models on ImageNet-1k, ImageNet-21k or JFT-3B [30], also observing consistent transfer improvement when scaling up model and data size during pre-training. The studies mentioned above deal with closely related source and target datasets containing mostly natural image data. In general, works testing transfer performance across different target datasets often employ targets that are rather close to the pre-training source data, like studies introducing transfer benchmarking suites whose targets resemble mostly the natural image domain [10, 31], or domain-specific transfer studies that stay within their given domain, for instance medical imaging [32]. Only few studies so far attempt to measure transfer performance between datasets that are further apart, for instance natural and medical images, while systematically varying model and data size during pre-training. The work by Raghu et al. [22] found no significant difference between models pre-trained on ImageNet-1k and models trained from scratch on target datasets containing medical images. However, it did not use datasets larger than the standard ImageNet-1k or networks larger than a standard ResNet-50 for pre-training. Another work examines transfer on CheXpert while varying network model size during pre-training [33]. It finds a slight benefit for transfer when pre-training with larger models; however, it does not vary source data size in the pre-training, using only the standard ImageNet-1k as a source. A recent study by Mustafa et al. builds on the Big Transfer work [14] and compares the transfer performance of differently sized ResNet models pre-trained on ImageNet-1k, ImageNet-21k, and JFT-300M on different medical imaging target datasets [34]. Slight evidence for transfer improvement was observed when using larger model and dataset sizes during pre-training, with inconsistencies across conditions and datasets, where in some cases no significant benefit from larger pre-training scale was seen. That work does not compare models pre-trained on natural image data to models pre-trained on medical imaging data of different scale when measuring transfer performance on medical imaging targets. In order to test the impact of model and data pre-training scale on transfer performance in the full and few-shot regimes under different source and target data type constellations in intra- and inter-domain scenarios, we conducted experiments pre-training differently sized ResNet models on a supervised classification task using either the large natural image datasets ImageNet-1k or ImageNet-21k, or compositions of the chest X-Ray medical imaging datasets CheXpert, MIMIC-CXR, PadChest, and NIH Chest X-ray14 (see Suppl. Tab. 3 for a comprehensive list and further details of the datasets). The pre-trained models were then fine-tuned on different target datasets that contain either natural or medical images. In the following, we describe the experimental procedures and outcomes in more detail.
For pre-training, we largely followed the training procedure and used the network architecture of [14]. More concretely, we pre-trained both ResNet-50x1 and ResNet-152x4 (in the following, R50x1 and R152x4) from [14] on different natural image and medical datasets. The smaller R50x1 has 26M weight parameters, while the larger R152x4 has 928M parameters. This substantial difference in size allows us to compare the effect of model scaling in the pre-training on subsequent transfer. The following describes the training procedure and hyper-parameters used for the natural and medical image domains. For natural images, we pre-trained the two models (R50x1 and R152x4) on ImageNet-1k (≈ 1.4 million images) and the much larger full ImageNet-21k (≈ 14 million images). For the ImageNet-1k and ImageNet-21k models, we used a standard supervised classification setup with softmax as the output activation and cross-entropy as the loss. We followed the training hyper-parameters of [14], with the difference that we used stochastic gradient descent (SGD) with adaptive gradient clipping (AGC) from [35], as we found that it helps both pre-training and transfer. With AGC, we found that the default base learning rate of 0.03 used in [14] made training unstable for the ImageNet-21k experiments, so we reduced it to 0.01 there; otherwise we used a base learning rate of 0.03. The rest of the hyper-parameters were similar, namely a momentum of 0.9, 90 epochs, ≈ 5000 warmup iterations, a batch size of 4096, the linear learning rate rescaling rule of [36], and the standard step-wise learning rate schedule for ImageNet [14]. For data augmentation, we used the standard random resized crop as in [14]. In the ImageNet-1k experiments, we additionally used RandAugment [37] and changed the learning rate schedule from step-wise to cosine annealing [38], as this improved the pre-training results. In order to speed up training, we used data-parallel training with Horovod [39], using 256 A100 GPUs for R152x4 models and 128 A100 GPUs for R50x1 models. Pre-training on ImageNet-21k with the large R152x4 takes about 81 hours using 256 GPUs, while with the small R50x1 it needs about 13.5 hours to finish using 128 GPUs on the JUWELS Booster supercomputer [21] (see Suppl. Sec. A and Suppl. Fig. 3 for more details on distributed training). For medical data, we pre-trained the two models (R50x1 and R152x4) on combinations of several medical datasets, which as supersets may contain any of the available datasets: CheXpert, MIMIC-CXR, NIH Chest X-ray14, PadChest. We refer to those combinations as X-Ray supersets in the following. The largest source X-Ray superset contains about 873K chest radiographs (see also Suppl. Tab. 3). The medical datasets are multi-label, as each image can be associated with several diseases. The datasets are combined by finding intersecting labels (diseases) and using the intersected label set as the target. In order to substantially vary data scale for the medical domain, we start with single available X-Ray datasets and progressively add other datasets into X-Ray supersets of successively growing size, which provides us with X-Ray source datasets spanning scales from small (≈ 200k samples) to large (≈ 870k samples) to perform pre-training on. For processing the datasets and extracting the labels from raw data, we used TorchXRayVision [40] from the work of [32].
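To illustrate the dataset combination step, the following is a generic sketch of merging multi-label chest X-Ray datasets into a superset by intersecting their label (disease) sets and remapping each sample's target vector accordingly. It is an illustration under assumptions, not the TorchXRayVision-based pipeline actually used: the wrapper class, attribute names, and label-vector format here are hypothetical.

```python
# Generic sketch (assumptions, not the released pipeline): merge multi-label X-Ray
# datasets by intersecting their disease label sets and keeping only shared labels.
from typing import List, Sequence

import numpy as np
from torch.utils.data import ConcatDataset, Dataset


class RelabeledXRayDataset(Dataset):
    """Wraps a multi-label dataset and keeps only the shared (intersected) labels."""

    def __init__(self, base: Dataset, base_label_names: Sequence[str], shared: Sequence[str]):
        self.base = base
        # Column indices of the shared labels inside the base dataset's label vector.
        self.keep = [base_label_names.index(name) for name in shared]

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, i):
        image, labels = self.base[i]           # labels: binary vector, one entry per disease
        return image, np.asarray(labels)[self.keep]


def build_superset(datasets: List[Dataset], label_names: List[Sequence[str]]) -> Dataset:
    # Intersect the label (disease) sets of all member datasets, preserving order.
    shared = [n for n in label_names[0] if all(n in names for names in label_names)]
    remapped = [RelabeledXRayDataset(d, names, shared)
                for d, names in zip(datasets, label_names)]
    # e.g. CheXpert + MIMIC-CXR + PadChest (+ NIH Chest X-ray14) as one superset
    return ConcatDataset(remapped)
```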
We followed the literature on medical datasets [32] and pre-trained using a multi-label setup with independent binary tasks, one for each label (disease), using sigmoid as the output activation function and binary cross-entropy as the loss for each label. We used the same hyper-parameters as in the natural image domain, except that the base learning rate was set to 0.01 instead of 0.03, as we found that a learning rate of 0.03 led to more overfitting. We followed [32] and used a center crop based on the smallest side, then resized the image to 224 × 224. In order to combat overfitting, we used the data augmentation from [32], which included random translation, random rotation, and random scaling. In addition to the data augmentation used in [32], we also applied random horizontal flipping. In order to speed up training, we used data-parallel distributed training with Horovod [39], using 64 A100 GPUs in all pre-training setups. For fine-tuning, we used the BiT-HyperRule [14], a heuristic that selects fine-tuning hyper-parameters (learning rate schedule, resolution, usage of MixUp, and total number of steps) based on training set size and image resolution. We used a batch size of 128 and an initial learning rate of 0.001 in all experiments. As in [14], we do not use weight decay. As in pre-training, we used stochastic gradient descent (SGD) with adaptive gradient clipping (AGC), as we found it to improve few-shot results. We used a momentum of 0.9. In each experiment, the classification head of the pre-trained model was replaced with a new classification head for the fine-tuning task. We fine-tuned all layers of the network. For each experiment, we performed 5 independent runs with different seeds to obtain an estimate of the variance of the performance. We ran each fine-tuning experiment on a single A100 GPU. As in [14], we consider two kinds of setups: few-shot setups (with 1, 5, 10, 100, or 500 examples per class) and fine-tuning on the full training set. We used CIFAR-10, CIFAR-100 [41], Flowers-102 [42], and Oxford-IIIT Pet [43] for natural image fine-tuning. For medical image fine-tuning, we used the single-label Tuberculosis [44] and COVIDx [45] datasets as small X-Ray targets (≈ 800 and ≈ 16k samples, respectively), and the multi-label CheXpert, MIMIC-CXR, NIH or PadChest as larger X-Ray targets (on the order of 100k-300k samples; see also Suppl. Tab. 3). In addition, to perform few-shot experiments analogous to the natural domain, we employ PadChest-cl, a single-label dataset derived from PadChest where we keep only images with exactly one label (one disease). For Flowers-102 and COVIDx, since the datasets are strongly imbalanced, we used oversampling. We measure either final accuracy or mean AUC on the test sets.
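As a minimal sketch of the fine-tuning setup just described (assuming a backbone that exposes its classifier as `model.fc`, and an AGC clipping threshold taken from the AGC paper [35] rather than from this study), the snippet below replaces the classification head, builds the SGD optimizer with momentum 0.9 and learning rate 0.001, and applies adaptive gradient clipping between the backward pass and the optimizer step.

```python
# Minimal fine-tuning sketch under assumptions; not the released code.
import torch
import torch.nn as nn


def unitwise_norm(t: torch.Tensor) -> torch.Tensor:
    # Per-output-unit norm for weight tensors, plain L2 norm for 1-D parameters (biases).
    if t.ndim <= 1:
        return t.norm()
    return t.norm(dim=tuple(range(1, t.ndim)), keepdim=True)


def adaptive_grad_clip(parameters, clip_lambda: float = 0.01, eps: float = 1e-3):
    # Rescale gradients whose unit-wise norm exceeds clip_lambda times the parameter norm,
    # in the spirit of AGC [35]; clip_lambda/eps are assumed defaults, not reported values.
    for p in parameters:
        if p.grad is None:
            continue
        max_norm = unitwise_norm(p.detach()).clamp_(min=eps) * clip_lambda
        g_norm = unitwise_norm(p.grad.detach()).clamp(min=1e-6)
        p.grad.detach().mul_((max_norm / g_norm).clamp(max=1.0))


def prepare_for_finetuning(model: nn.Module, num_classes: int) -> torch.optim.Optimizer:
    # Replace the classification head; single-label targets use cross-entropy on these
    # logits, multi-label X-Ray targets use sigmoid + binary cross-entropy instead.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)


# Inside the fine-tuning loop, clipping is applied between backward() and step():
#   loss.backward(); adaptive_grad_clip(model.parameters()); optimizer.step()
```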
The experiments allow us to examine both intra- and inter-domain transfer performance following pre-training while varying model size, source data size, and source and target dataset domain. In the following, we report the obtained results. Effect of scale on intra-domain transfer. The results we obtain for either natural-natural or medical-medical full-shot transfer (Tab. 1) deliver a clear picture, showing transfer improvement across target datasets when increasing pre-training model and data scale. Most consistent is the improvement due to the increase of network size, while for data scale there are only a few single cases where the increase does not result in improvement (e.g., when using the large ResNet-152x4 on Pets in the natural-natural and on PadChest in the medical-medical scenario; see Supplementary for more detailed results for each transfer experiment scenario). For few-shot transfer, we observe a differentiated picture. In line with previous work, for natural-natural transfer we obtain strong improvement due to larger scale in the very low data regime of 1- or 5-shot transfer, reaching in some cases 20%-30% absolute difference in test accuracy in favor of larger scale (as seen for CIFAR-100, Fig. 1a). In contrast, for the medical-medical scenario, there is no evidence for few-shot transfer improvement due to larger scale (Figs. 1b, 2b, 2d; see also Supplementary for further details). When increasing the number of shots and approaching the full-shot regime, the improvement due to scale becomes more and more visible. The observed variance is larger for few-shot transfer experiments, which may suggest less stable fine-tuning in those cases where the model has to adapt to target data based on only a very limited number of examples.
[Tab. 1 excerpt: mean AUC values for CheXpert, PadChest, and further large X-Ray targets; see Tab. 1 for the full results.]
For small X-Ray targets (Tuberculosis and COVIDx), we do not observe such consistent improvement due to larger scale. For instance, while we see improvement due to larger data scale for the small ResNet-50x1 on both small targets, the improvement is not there when increasing network size. There is also no evidence for a positive effect of larger scale on few-shot transfer, neither for large nor for small X-Ray targets (Figs. 1b, 2a, 2c). Again, the variance observed in the few-shot regime is large and shrinks when increasing the number of shots and moving towards full-shot transfer. Remarkably, when further comparing intra- and inter-domain transfer performance, we observe that the large ResNet-152x4 pre-trained on the very large generic natural ImageNet-21k is as good as or better than networks pre-trained on the largest available medical domain-specific X-Ray superset data when performing full-shot transfer to large X-Ray targets (Tab. 1, Figs. 2a, 2b). This fits into the overall picture observed here of larger model and data pre-training scale improving transfer on larger targets, as ImageNet-21k is an order of magnitude larger than the largest X-Ray superset constructed for this study. Our observations of the dependency of transfer performance on pre-training model and data scale and on source and target domain alignment suggest that both intra- and inter-domain transfer benefit from larger pre-training scale. The effect of pre-training scale depends however on the transfer conditions, revealing a differentiated picture of when larger scale may lead to transfer improvement. Larger pre-training scale improves intra-domain transfer. We obtain evidence that both natural-natural and medical-medical domain transfer are improved when increasing model and data size during pre-training (Tab. 1; see also Suppl. material). For the natural-natural transfer scenario, the improvement is evident in both the full-shot and few-shot regimes.
Increasing data size by using ImageNet-21k instead of ImageNet-1k or increasing network model size by using ResNet-152x4 instead of the smaller ResNet-50x1 creates a strong, consistent boost in transfer performance across all natural target datasets, which is in line with previous observations [14, 15]. The improvement is especially pronounced in the few-shot regime (e.g., Fig. 1a; see also Suppl. material), also adding evidence for more data-efficient transfer due to larger scale. The picture is more differentiated for the medical-medical transfer scenario. For the full transfer regime, the improvement due to larger pre-training scale is clear and consistent across different targets (Tab. 1). In few-shot transfer however, in contrast to the natural-natural scenario (Fig. 1a), there are no benefits due to larger scale (Figs. 2b, 2d). Here we have to keep in mind that both the absolute data size and the increase in data scale obtained by going from ImageNet-1k to ImageNet-21k (14M samples) in natural pre-training are much larger than what we achieve in medical pre-training, where going from a single X-Ray dataset to the largest combined X-Ray superset still yields a much smaller data volume (≈ 870k samples) than ImageNet-21k. This difference in data scale may explain the observed differences in the few-shot regime, while we also cannot rule out that the domain type (natural or medical) plays an important role in determining how transfer is affected by pre-training scale. Larger pre-training scale improves inter-domain transfer for larger targets. In contrast to intra-domain transfer, where natural-natural or medical-medical source and target datasets are closely related, in natural-medical inter-domain transfer the source and target are much further apart. It is therefore not trivial that the effect of pre-training scale brings similar, or any, improvement in this case as well. A strong discrepancy between source and target may render transfer ineffective, as was indeed observed in previous studies on natural-medical transfer done at smaller scales [22]. In contrast to these studies, we do find a significant positive effect of larger pre-training scale on inter-domain natural-medical full-shot transfer for larger medical targets (Tab. 1; see also Suppl. material). The transfer improvement is clearly expressed when increasing both model and natural image data pre-training scale, across all large X-Ray targets. Another remarkable finding arises when further comparing the performance of intra-domain medical-medical and inter-domain natural-medical transfer. The largest ResNet-152x4 pre-trained on the largest generic natural ImageNet-21k turns out to be as good as, or in many cases better than, any network pre-trained on the largest available medical domain-specific X-Ray superset data when performing full-shot transfer to large X-Ray targets (Tab. 1, Figs. 2a, 2b). This finding indicates that by substantially increasing model and generic natural image source data scale during pre-training, we can obtain models for transfer to medical domain-specific X-Ray targets that match or even outperform models pre-trained on large amounts of domain-specific X-Ray data, which are often not available in practice. The observation fits into the overall outcome that strongly increasing model and data pre-training scale improves full-shot transfer on larger targets: the ImageNet-21k scale (14M samples) is more than an order of magnitude larger than the scale of the largest X-Ray superset we constructed for this study (≈ 0.87M samples).
In contrast to full-shot transfer on large X-Ray targets, for the smaller targets COVIDx and Tuberculosis neither full-shot nor few-shot transfer shows improvement when increasing model and natural source data scale during pre-training (Fig. 2d; see also Suppl. material). We also see no evidence for a positive effect of larger scale on few-shot transfer for the large PadChest-cl target (Figs. 1b, 2a). Thus, we again obtain a differentiated picture of scaling benefits for transfer. Further scaling up of model and data during pre-training may homogenize this picture and make the scaling benefit more consistent across conditions, showing improvement also for smaller targets and for few-shot transfer. For instance, we could not use substantially larger datasets like JFT-300M or JFT-3B [29, 14, 30], which are proprietary and not publicly available. Additionally, the computational budget was not sufficient to experiment with networks larger than ResNet-152x4. However, there may also be fundamental limitations, due to strong incompatibility between source and target domains, that prohibit transfer improvement no matter how large the pre-training scale becomes. There is some evidence from language modeling studies hinting at such fundamental limits for transfer improvement on target datasets far from the source when doing straightforward scaling without further changes in model architecture. For instance, the work by Hendrycks et al. [46] finds no transfer improvement when increasing the size of large Transformer networks pre-trained on very large conventional text datasets and fine-tuning them on a specific target dataset far apart from the source, containing mathematical problems of advanced difficulty. Limitations of the current study. Several limitations of the current study impede more general conclusions about the effect of pre-training scale on intra- and inter-domain transfer from the observations made here. In the conducted transfer experiments, we made use of a heuristic hyper-parameter selection rule, the BiT-HyperRule introduced in [14], which determines fine-tuning hyper-parameters directly from the target datasets on which transfer is to be performed. This rule may be heavily biased towards transfer on natural image datasets, as those were the targets used in the original study. If the rule were modified to also take the target domain (natural or medical) into account, the derived hyper-parameters might provide a much better basis for fine-tuning during transfer. In general, performing hyper-parameter tuning for the training procedure can strongly boost performance [47], and this is no different for the transfer procedure. Therefore, it cannot be excluded that performing hyper-parameter tuning for each transfer task would alter the effect of larger pre-training scale on transfer. Hyper-parameter tuning would however also impose further cost on transfer that is avoided by employing the hyper-rule. We also have not explored backbone network architectures other than the standard ResNet. Although ResNet has proven itself a versatile network architecture for various vision tasks, it cannot be ruled out that, while the inductive bias inherent to its convolutional design is well suited for natural image statistics with strong local spatial correlations, it may be less suited to provide a good basis for generalization on other types of image signals.
Scaling up the ResNet architecture may thus be a viable strategy to improve generalization capability on natural image data, while other, more generic architectures may be required to benefit from scaling in the same way across more diverse data types. We also did not experiment with datasets larger than ImageNet-21k, as those are still mostly proprietary and not publicly available, as is the case for JFT-300M [29, 14]. For the medical imaging domain, we could not increase data scale substantially, as the amount of openly available chest X-Ray data is currently still limited. Finally, we studied the dependence of transfer improvement on pre-training scale exclusively in the supervised classification setting. Promising work is also done on pre-training in an unsupervised fashion with unlabeled data [48, 49, 50], where the benefits of scaling up pre-training for transfer may turn out to be substantial as well. Conclusion and outlook. To summarize, we presented evidence that substantially increasing model and data scale in the pre-training provides benefits for both intra- and inter-domain transfer across various target datasets from the natural and medical X-Ray image domains. The effect of pre-training scale on transfer performance depends on the transfer scenario. Transfer improvement due to larger pre-training scale was found to be substantial in the natural-natural and medical-medical intra-domain transfer scenarios, where source and target datasets are closely related, being especially strongly pronounced in the few-shot transfer regime for the natural-natural case and concentrated in the full-shot regime for the medical-medical case. For natural-medical inter-domain transfer, a clear positive effect of larger pre-training scale was found for full-shot transfer on large X-Ray targets. On small X-Ray targets and in the few-shot transfer regime, no clear inter-domain transfer improvements were observed. Remarkably, the largest ResNet-152x4 network pre-trained on the very large generic natural ImageNet-21k matched or even outperformed networks pre-trained on the largest medical domain-specific X-Ray superset data combined for this study when performing full-shot transfer to large X-Ray targets. This is relevant for practice, as large amounts of medical domain-specific data are often not available for pre-training. Here we show that high-quality models for large X-Ray targets can also be obtained by substantially increasing the pre-training model and generic natural image source data scale instead, removing the need for large domain-specific data. The study offers different follow-up directions. One of these is experiments at larger scale for both network and data size, for instance going beyond ImageNet-21k, or combining different source datasets in the pre-training that may contain both natural and medical images. This may also include experiments with scaling up architectures other than ResNet. Another direction to study the effect of scale is to employ various unsupervised learning strategies for pre-training [48, 49, 50] instead of supervised learning. Yet another fruitful path is to provide a measure of source and target domain similarity and to experiment with more than two distinct domains, systematically varying the relatedness between different source and target data. First steps in this direction for language modeling were already undertaken in [23].
Following these directions would pave the way towards scaling laws for transfer in the image domain, taking into account different pre-training regimes and the affinity between source and target domains, to enable systematic prediction of transfer performance and of improvement due to increased pre-training scale. Our work aims at advancing transfer learning, which can make learning algorithms perform better and more efficiently by re-using models already pre-trained on various tasks, therefore requiring less compute and data to learn solutions for other relevant tasks. The approach of improving transfer learning by increasing the scale of pre-training is generic, has impact far beyond the vision domain, for instance in language modeling, and is not bound to any specific application. Like any generic method, it can therefore be applied to enhance technologies for sensitive applications, for instance in the health domain or in public surveillance, that may have both strong positive and negative social impact, depending on the policies introduced on their usage. Special care should be taken with applications in the clinical domain, where further development of diagnostic tools based on data-driven machine learning should be accompanied by a broad panel of experts from the corresponding domains. On the one hand, the method depends on computationally heavy, energy-demanding large-scale pre-training. On the other hand, it promises to pay off the energy budget put into training by providing generic models that can be very efficiently adapted to a large range of problems via transfer, saving computational and energy costs that would otherwise be incurred when solving them from scratch. Here, we report the scaling behavior during large-scale pre-training for the ResNet networks used in the experiments. We performed scaling experiments to assess the scalability of data-parallel training distributed across many GPUs on multiple nodes using Horovod. The efficiency in Figure 3b (upper part of the figure with percentages) is computed using the following formula: E(N) = 100 × T(N) / (N × T(1)), where T(N) is the total measured throughput in Im/s for N GPUs. The best achievable efficiency, when scaling is perfect, is 100%. We also provide the raw throughput numbers (Im/s) in Figure 3a and Tab. 2. On 1024 GPUs, we achieve an efficiency of ≈ 93.7% with single precision (FP32). To make sure distributed training is stable, we check the final accuracy of full training for each number of GPUs, verifying that we reach accuracy acceptable for standard ImageNet-1k Top-1 and Top-5 results. The achieved scaling on JUWELS Booster allows performing full pre-training on ImageNet-21k with the large R152x4 in about 81 hours using 256 GPUs. For the small R50x1, full training needs about 13.5 hours to finish using 128 GPUs.
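For reference, the efficiency computation can be written as the small helper below. The throughput values in the usage example are placeholders chosen for illustration, not the measured numbers from Tab. 2.

```python
# Scaling-efficiency helper: E(N) = 100 * T(N) / (N * T(1)), where T(N) is the
# measured throughput (images/s) on N GPUs. 100% corresponds to perfect linear scaling.
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Return data-parallel scaling efficiency in percent."""
    return 100.0 * throughput_n / (n_gpus * throughput_1)


if __name__ == "__main__":
    t1 = 800.0          # hypothetical single-GPU throughput (Im/s), placeholder value
    t1024 = 768_000.0   # hypothetical throughput on 1024 GPUs (Im/s), placeholder value
    print(f"{scaling_efficiency(t1024, t1, 1024):.1f}%")  # -> 93.8% for these placeholders
```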
Table 3: Datasets used as sources for pre-training and as targets for transfer.

Source (pre-training), natural images:
ImageNet-1k [8]: 1.4M images, 1000 classes
ImageNet-21k [8]: 14M images, 21842 classes

Source (pre-training), X-Ray chest imaging:
CheXpert [17]: 224K radiographs of 65K patients, 14 classes
NIH Chest X-ray14 [20]: 112K radiographs of 32K patients, 14 classes
PadChest [19]: 160K radiographs of 67K patients, 19 classes
MIMIC-CXR [18]: 377K radiographs of 65K patients, 14 classes
Total X-Ray images: 873K chest radiographs, 229K patients

Target (transfer), natural images:
CIFAR-10, CIFAR-100 [41]: 60K images, 10 / 100 classes
Oxford Flowers-102 [42]: 8K images, 102 classes
Oxford-IIIT Pet [43]: 7.3K images, 37 classes

Target (transfer), X-Ray chest imaging:
PadChest [19]: 160K radiographs of 67K patients, 19 / 27 classes
COVIDx [45]: 16K radiographs of 15K patients, 2 / 3 classes
Tuberculosis [44]: 800 radiographs of 800 patients, 2 classes

All datasets employed in our experiments are publicly available and can be obtained following the links in Tab. 3. Here, we present more detailed results of the transfer experiments described in the main document. For medical X-Ray targets, we provide tables reporting transfer performance (Tabs. 5, 6, 7, 8, 9), listing each source X-Ray dataset and the supersets used for pre-training, as outlined in the experiments description in the main document.
Figure 4 (panel (b): CIFAR-10): Few-shot and full-shot transfer performance on target datasets when varying model size and dataset size in pre-training. Transfer improvement due to model and source data size is evident, especially strongly pronounced in the few-shot regime.
Table 6: Intra-domain transfer using differently sized medical X-Ray source data for pre-training with differently sized ResNets, target MIMIC-CXR (mean AUC metric). "+" indicates addition into a successively larger source superset. Clear transfer improvement is evident when scaling the model size. Using a superset containing CheXpert and PadChest improves the results, but adding NIH does not, or does so only very little. This could be explained by the fact that NIH is the smallest dataset among the medical pre-training datasets, and a larger increase in the superset would be needed to substantially improve the transfer results, as observed in transfer results obtained using models pre-trained on much larger natural data.
Table 7: Intra-domain transfer using differently sized medical X-Ray source data for pre-training with differently sized ResNets, target CheXpert (mean AUC metric). "+" indicates addition into a successively larger source superset. Clear transfer improvement is evident when scaling the model size. Using a superset containing PadChest and MIMIC-CXR improves the results; adding NIH does not lead to further improvement. This could be explained by the fact that NIH is the smallest dataset among the medical pre-training datasets, and a larger increase in the superset would be needed to substantially improve the transfer results, as observed in transfer results obtained using models pre-trained on much larger natural data.
Table 8: Intra-domain transfer using differently sized medical X-Ray source data for pre-training with differently sized ResNets, target PadChest (mean AUC metric). "+" indicates addition into a successively larger source superset. Clear transfer improvement is evident when scaling the model size. Improvement by increasing data size is not evident and only occurs for the small R50x1 model with a superset containing CheXpert and MIMIC-CXR; adding NIH (which is smaller than CheXpert and MIMIC-CXR) to the superset does not help further.
This indicates that a larger increase in the superset may be necessary to further improve the transfer results, as observed when using models pre-trained on much larger natural data.
Table 9: Intra-domain transfer using differently sized medical X-Ray source data for pre-training with differently sized ResNets, target NIH (mean AUC metric). "+" indicates addition into a successively larger source superset. Clear transfer improvement is evident when scaling the model size. We also observe transfer improvement when scaling data size, however the improvement seems to flatten. Since transfer results using models pre-trained on larger natural data show better performance, this indicates that a larger superset scale may be necessary to further improve transfer.
The repository containing the code used for running the experiments and producing the figures in this study can be found at https://github.com/SLAMPAI/large-scale-pretraining-transfer. All datasets used in the study are openly available and are listed, together with references to the original work, in Table 3. Further details on the usage of the datasets in the conducted experiments are also provided in the linked repository.
References
[1] Direct transfer of learned information among neural networks.
[2] A survey on transfer learning.
[3] CNN features off-the-shelf: An astounding baseline for recognition.
[4] Factors of transferability for a generic convnet representation.
[5] ImageNet classification with deep convolutional neural networks.
[6] OverFeat: Integrated recognition, localization and detection using convolutional networks.
[7] Very deep convolutional networks for large-scale image recognition.
[8] ImageNet: A large-scale hierarchical image database.
[9] ImageNet large scale visual recognition challenge.
[10] A large-scale study of representation learning with the visual task adaptation benchmark.
[11] Factors of influence for transfer learning across diverse appearance domains and task types.
[12] Scaling laws for neural language models.
[13] Language models are few-shot learners.
[14] Big Transfer (BiT): General visual representation learning.
[15] ImageNet-21k pretraining for the masses.
[16] Deep residual learning for image recognition.
[17] CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison.
[18] MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.
[19] PadChest: A large chest x-ray image dataset with multi-label annotated reports.
[20] ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.
[21] JUWELS Booster supercomputer, Jülich Supercomputing Centre (JSC).
[22] Transfusion: Understanding transfer learning for medical imaging.
[23] Scaling laws for transfer.
[24] Scaling laws for autoregressive generative modeling.
[25] Reconciling modern machine-learning practice and the classical bias-variance trade-off.
[26] Deep double descent: Where bigger models and more data hurt.
[27] Triple descent and the two kinds of overfitting: where & why do they appear?
[28] YFCC100M: The new data in multimedia research.
[29] Revisiting unreasonable effectiveness of data in deep learning era.
[30] Scaling vision transformers.
[31] A dataset of datasets for learning to learn from few examples.
[32] On the limits of cross-domain generalization in automated X-ray prediction.
[33] CheXtransfer: Performance and parameter efficiency of ImageNet models for chest X-ray interpretation.
[34] Supervised transfer learning at scale for medical imaging.
[35] High-performance large-scale image recognition without normalization.
[36] Accurate, large minibatch SGD: Training ImageNet in 1 hour.
[37] RandAugment: Practical automated data augmentation with a reduced search space.
[38] SGDR: Stochastic gradient descent with warm restarts.
[39] Horovod: fast and easy distributed deep learning in TensorFlow.
[40] TorchXRayVision: A library of chest X-ray datasets and models.
[41] Learning multiple layers of features from tiny images.
[42] Automated flower classification over a large number of classes.
[43] Cats and dogs.
[44] Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery.
[45] COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images.
[46] Measuring mathematical problem solving with the MATH dataset.
[47] Measuring the effects of data parallelism on neural network training.
[48] Big self-supervised models are strong semi-supervised learners.
[49] Self-training with noisy student improves ImageNet classification.
[50] Rethinking pre-training and self-training.

Figure caption fragment: In full-shot transfer, improvement due to model and data scale is evident when pre-training on X-Ray chest imaging source data. In the few-shot regime, no transfer improvement due to larger model or data size is observed.
Figure 7: Few-shot and full-shot transfer performance on the Tuberculosis dataset when pre-training with different model sizes on different sources (natural or medical datasets) of various sizes. In the natural-medical scenario (a), no transfer improvement due to model or data scale is evident. In the medical-medical scenario (b, medical sources), larger model and data size lead to transfer improvement in the full-shot regime.
Table caption fragment: Intra-domain transfer using differently sized medical X-Ray source data for pre-training with differently sized ResNets; (1) Top-1 Acc [%] metric, (2) mean AUC metric.
We would like to express gratitude to all the people who are working on making code, models, and data publicly available, advancing community-based research and making research more reproducible. Special thanks go to the creators and maintainers of openly available X-Ray medical imaging datasets that enabled our research, some of them gathered under the difficult circumstances of the COVID-19 pandemic. The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V.
(www.gauss-centre.eu) for funding this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS supercomputers JUWELS and JUWELS Booster at Jülich Supercomputing Centre (JSC). We also acknowledge computing resources from the Helmholtz Data Federation and further computing time provided on the JUSUF supercomputer within the JSC offer for epidemiology research on COVID-19.