key: cord-0569853-x6wub1br
authors: Gavrikov, Paul; Keuper, Janis
title: CNN Filter DB: An Empirical Investigation of Trained Convolutional Filters
date: 2022-03-29
journal: nan
DOI: nan
sha: 5d8700cd9fa525363e08f0e5fc13ab27ad7ef4ac
doc_id: 569853
cord_uid: x6wub1br

Currently, many theoretical as well as practically relevant questions towards the transferability and robustness of Convolutional Neural Networks (CNNs) remain unsolved. While ongoing research efforts are engaging these problems from various angles, in most computer vision related cases these approaches can be generalized to investigations of the effects of distribution shifts in image data. In this context, we propose to study the shifts in the learned weights of trained CNN models. Here we focus on the properties of the distributions of dominantly used 3x3 convolution filter kernels. We collected and publicly provide a dataset with over 1.4 billion filters from hundreds of trained CNNs, using a wide range of datasets, architectures, and vision tasks. In a first use case of the proposed dataset, we can show highly relevant properties of many publicly available pre-trained models for practical applications: I) We analyze distribution shifts (or the lack thereof) between trained filters along different axes of meta-parameters, like visual category of the dataset, task, architecture, or layer depth. Based on these results, we conclude that model pre-training can succeed on arbitrary datasets if they meet size and variance conditions. II) We show that many pre-trained models contain degenerated filters which make them less robust and less suitable for fine-tuning on target applications. Data&Project website: https://github.com/paulgavrikov/cnn-filter-db

Despite their overwhelming success in the application to various vision tasks, the practical deployment of convolutional neural networks (CNNs) still suffers from several inherent drawbacks. Two prominent examples are I) the dependence on very large amounts of annotated training data [1], which is not available for all target domains and is expensive to generate; and II) still widely unsolved problems with the robustness and generalization abilities of CNNs [2] towards shifts of the input data distributions. One can argue that both problems are strongly related, since a common practical solution to I) is the fine-tuning [3] of pre-trained models on small datasets from the actual target domain. This results in the challenge of finding suitable pre-trained models based on data distributions that are "as close as possible" to the target distributions. Hence, both cases (I+II) imply the need to model and observe distribution shifts in the context of CNNs. In this paper, we propose not to investigate these shifts in the input (image) domain, but rather in the 2D filter-kernel distributions of the CNNs themselves. We argue that the distributions of trained convolutional filters in a CNN, which implicitly reflect the sub-distributions of the input image data, are a more suitable and more easily accessible representation for this task. In order to foster systematic investigations of learned filters, we collected and publicly provide a dataset of over 1.4 billion filters with metadata from hundreds of trained CNNs, using a wide range of datasets, architectures, and vision tasks. To show the scientific value of this new data source, we conduct a first analysis and report a series of novel insights into widely used CNN models.
Based on our presented methods we show that many publicly provided models suffer from degeneration. We show that overparameterization leads to sparse and/or non-diverse filters (Fig. 1), while robust training increases filter diversity and reduces sparsity. Our results also show that learned filters do not significantly differ across models trained for various tasks, except for extreme outliers such as GAN-Discriminators. Models trained on datasets of different visual categories do not significantly drift either. Most shifts in studied models are due to degeneration, rather than an actual difference in structure. Therefore, our results imply that pre-training can be performed independently of the actual target data, and only the amount of training data and its diversity matter. This is in line with recent findings that models can be pre-trained even with images of fractals [4]. For classification models we show that the most variance in learned filters is found in the beginning and end of the model, while object/face detection models only show significant variance in early layers. Also, the most specialized filters are found in the last layers. We summarize our key contributions as follows:

* Funded by the Ministry for Science, Research and Arts, Baden-Wuerttemberg, Grant 32-7545.20/45/1 (Q-AMeLiA). The authors also thank Margret Keuper for her support and encouragement to submit this work.

• Publication of a diverse database of over 1.4B 3 × 3 convolution filters along with relevant meta information of the extracted filters and models [5].
• Presentation of a data-agnostic method based on sparsity and entropy of filters to find "degenerated" convolution layers due to overparameterization or non-convergence of trained CNN models.
• Showing that publicly available models often contain degenerated layers and can therefore be questionable candidates for transfer tasks.
• Analysis of distribution shifts in filters over various groups, providing insights that formed filters are fairly similar across a wide range of examined groups.
• Showing that the model-to-model shifts that exist in classification models are, contrary to the predominant opinion, not only seen in deeper layers but also in the first layers.

Paper organization. We give an overview of our dataset and its collection process in Sec. 3, followed by an introduction of methods for studying filter structure, distribution shifts, and layer degeneration such as randomness, low variance in filter structure, and high sparsity of filters. Then in Sec. 4 we apply these methods to our collected data. We show the impact of overparameterization and robust training on filter degeneration and provide intuitions for threshold finding. Then we analyze filter structures by determining a suitable filter basis and looking into reproducibility of filters in training, filter formation during training, and an analysis of distribution shifts for various dimensions of the collected meta-data. We discuss limitations of our approach in Sec. 5 and, finally, draw conclusions in Sec. 6.

We are unaware of any systematic, large-scale analysis of learned filters across a wide range of datasets, architectures, and tasks such as the one performed in this paper. However, there are of course several partially overlapping aspects of our analysis that have been covered in related works:

Filter analysis. An extensive analysis of features, connections, and their organization extracted from trained InceptionV1 [6] models was presented in [7-15]. The authors claim that different CNNs will form similar features and circuits even when trained for different tasks.

Transfer learning. A survey on transfer learning for image classification CNNs can be found in [16] and general surveys for other tasks and domains are available in [17, 18]. The authors of [19] studied learned filter representations in ImageNet1k classification models and presented the first approaches towards transfer learning. They argued that different CNNs will form similar filters in early layers, which will mostly resemble Gabor filters and color blobs, while deeper layers will capture specifics of the dataset by forming increasingly specialized filters.

[20] captured convolution filter pattern distributions with Gaussian Mixture Models to achieve cross-architecture transfer learning.

[21] demonstrated that convolution filters can be replaced by a fixed filter basis that 1 × 1 convolution layers blend.

Pruning criteria. Although we do not attempt pruning, our work overlaps with pruning techniques as they commonly rely on estimation criteria to understand which parameters to compress. These either rely on data-driven computation of a forward-pass [22-26] or backward-propagation [27, 28], or estimate importance solely based on the numerical weight (typically any ℓ-norm) of the parameters [29-33].

CNN distribution shifts. A benchmark for distribution shifts that arise in real-world applications is provided in [34], and [35] measured robustness to natural distribution shifts of 204 ImageNet1k models. The authors concluded that robustness to real-world shifts is low. Lastly, [36] studied the correlation between transfer performance and distribution shifts of image classification models and found that increasing training set size and model capacity increases robustness to distribution shifts.

We collected a total of 647 publicly available CNN models from [37-39] and other sources that have been pre-trained for various 2D visual tasks (see the supplementary materials for details). In order to provide a heterogeneous and diverse representation of convolution filters "in the wild", we retrieved pre-trained models for 11 different tasks such as classification, segmentation, and image generation. We also recorded various metadata such as depth and frequency of included operations for each model, and manually categorized the variety of used training sets into 16 visually distinctive groups like natural scenes, medical ct, seismic, or astronomy. In total, the models were trained on 71 different datasets. The dominant subset is formed by image classification models trained on ImageNet1k [40] (355 models). All models were trained with full 32-bit precision (footnote 2).

We apply a full-rank principal component analysis (PCA) transformation implemented via a singular-value decomposition (SVD) to understand the underlying structure of the filters [53]. First, we stack the relevant set of n flattened filters into an n × 9 matrix X. Thereupon, we center the matrix and perform an SVD into an n × 9 rotation matrix U, a 9 × 9 diagonal scaling matrix Σ, and a 9 × 9 rotation matrix V^T:

$$X - \bar{X} = U \Sigma V^T \qquad (1)$$

where X̄ denotes the vector of column-wise mean values of any matrix X. The diagonal entries σ_i, i = 0, ..., 8 of Σ form the singular values in decreasing order of their magnitude. Row vectors v_i, i = 0, ..., 8 in V^T then form the principal components. Every row vector c_i = (c_ij), j = 0, ..., 8 in U is the coefficient vector for filter f_i, i = 0, ..., n − 1. Then we obtain a vector â of the explained variance ratio of each principal component, where ∥·∥_1 denotes the ℓ1-norm:

$$\vec{a} = (\Sigma \cdot I)^2 / (n - 1), \qquad \hat{a} = \vec{a} / \lVert \vec{a} \rVert_1 \qquad (2)$$

Footnote 1: For more details refer to the supplementary materials.
Footnote 2: Initial experiments indicated, however, that mixed/reduced precision training [41] does not affect distribution shifts beyond noise.

Finally, each filter f ′ is described by a linear, shifted sum of principal components v i weighted by the coefficients c i .
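The decomposition and the reconstruction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for the paper: the function name `filter_pca` is hypothetical, and the coefficients are taken here as the projections U·Σ of the centered filters (the usual PCA scores), which is an assumption on our side. With that choice, each filter is recovered as the coefficient-weighted sum of the principal components plus the column-wise mean.

```python
import numpy as np

def filter_pca(filters):
    """PCA of 3x3 filters via a thin SVD (hypothetical helper, not from the paper).

    filters: array of shape (n, 3, 3) or (n, 9).
    Returns (components, coefficients, explained_ratio, mean).
    """
    X = np.asarray(filters, dtype=np.float64).reshape(len(filters), -1)  # n x 9
    mean = X.mean(axis=0)                       # column-wise mean (X-bar)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    a = s ** 2 / (X.shape[0] - 1)               # variance along each component
    a_hat = a / np.abs(a).sum()                 # l1-normalised explained variance ratio
    coeffs = U * s                              # assumption: coefficients are U scaled by the singular values
    return Vt, coeffs, a_hat, mean

rng = np.random.default_rng(0)
demo = rng.standard_normal((1000, 3, 3))        # random stand-in for trained filters
components, coeffs, a_hat, mean = filter_pca(demo)

# Linear, shifted sum of principal components weighted by the coefficients.
reconstructed = coeffs @ components + mean
```

Rows of `components` are the principal components v_i, and `a_hat` is sorted in decreasing order because the singular values are.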

All probability distributions are represented by histograms. The histogram range is defined by the minimum and maximum value of all coefficients. Each histogram is divided into 70 uniform bins. The divergence between two distributions is measured by the symmetric, non-negative variant of the Kullback-Leibler divergence (KL_sym) [54].

We define the drift D between two filter sets as the sum of the divergences of the coefficient distributions P_i, Q_i along every principal component index i. The sum is weighted by the ratio of variance â_i explained by the i-th principal component:

$$D(P \,\|\, Q) = \sum_i \hat{a}_i \cdot \mathrm{KL}_{sym}(P_i \,\|\, Q_i)$$

To avoid undefined expressions, all probability distributions F are set to hold ∀x ∈ X : F (x) ≥ ϵ.
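A minimal sketch of the histogram-based drift computation follows. Several details are assumptions rather than statements from the text: KL_sym is taken as KL(P‖Q) + KL(Q‖P), the value of `EPS` is arbitrary, and the function names are hypothetical.

```python
import numpy as np

EPS = 1e-10  # assumed floor for histogram bins; the text only requires F(x) >= eps

def histogram(values, lo, hi, bins=70):
    """Normalised histogram over [lo, hi] with every bin clipped to at least EPS."""
    counts, _ = np.histogram(values, bins=bins, range=(lo, hi))
    p = counts / max(counts.sum(), 1)
    p = np.maximum(p, EPS)
    return p / p.sum()

def kl_sym(p, q):
    """Symmetric KL divergence, taken here as KL(p||q) + KL(q||p) (assumption)."""
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

def drift(coeffs_p, coeffs_q, a_hat, bins=70):
    """Variance-weighted sum of per-component divergences between two coefficient sets."""
    lo = min(coeffs_p.min(), coeffs_q.min())    # shared range over all coefficients
    hi = max(coeffs_p.max(), coeffs_q.max())
    total = 0.0
    for i, weight in enumerate(a_hat):
        p = histogram(coeffs_p[:, i], lo, hi, bins)
        q = histogram(coeffs_q[:, i], lo, hi, bins)
        total += weight * kl_sym(p, q)
    return total

rng = np.random.default_rng(0)
a_hat = np.full(9, 1 / 9)                       # placeholder weights for the demo
set_a = rng.normal(size=(5000, 9))
set_b = rng.normal(size=(5000, 9))              # same distribution as set_a
set_c = rng.normal(loc=1.0, size=(5000, 9))     # shifted distribution
d_same = drift(set_a, set_b, a_hat)
d_shifted = drift(set_a, set_c, a_hat)
```

In this toy setting `d_shifted` is clearly larger than `d_same`, and the drift of a set against itself is zero.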

The Lottery Ticket Hypothesis [55] suggests that each architecture has a specific number of convolution filters that saturate its ability to transform a given dataset into a well separable feature space. Exceeding this number will result in a partitioning of the model into multiple inter-connected submodels. We hypothesize that these are seen in the form of degenerated filters in CNNs. In like manner, an insufficient number of training samples or training epochs will also lead to degenerated filters. We characterize the following types of degeneration.

1. High sparsity: All entries of a filter are near-zero.
2. Low diversity: Filters are structurally similar to each other, so that most of them can be reconstructed from one strong principal component.
3. Randomness: Filter weights are conditionally independent of their neighbours. This indicates that no or insufficient training was performed.

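Sparsity degeneration is detectable by the share of sparse filters S in a given layer. We call a filter f sparse if all entries are near-zero, i.e. if its largest absolute entry does not exceed a small threshold ϵ_0. Consequently, given the number of input channels c_in, number of output channels c_out, and a set of filters in layer L, we can measure the layer sparsity by:

$$S(L) = \frac{\left|\{ f \in L : \lVert f \rVert_\infty \le \epsilon_0 \}\right|}{c_{in} \, c_{out}}$$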
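As an illustration, the share of sparse filters in a layer could be computed as follows. The function name, the weight layout (c_out, c_in, 3, 3), and the default threshold `eps0` are assumptions made for this sketch; the text does not fix a concrete value for "near-zero".

```python
import numpy as np

def layer_sparsity(weights, eps0=1e-12):
    """Share of sparse filters in a convolution layer.

    weights: array of shape (c_out, c_in, 3, 3).
    A filter counts as sparse if its largest absolute entry is <= eps0
    (eps0 is an assumed threshold, not a value from the paper).
    """
    w = np.asarray(weights)
    c_out, c_in = w.shape[:2]
    max_abs = np.abs(w).reshape(c_out * c_in, -1).max(axis=1)
    return float(np.count_nonzero(max_abs <= eps0)) / (c_in * c_out)

rng = np.random.default_rng(0)
layer = rng.standard_normal((4, 8, 3, 3))
layer[0] = 0.0                                  # zero out all 8 filters of one output channel
sparsity = layer_sparsity(layer)                # 8 of 32 filters are sparse
```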

To detect the other types of degeneration we introduce a layer-wise metric based on the Shannon-Entropy of the explained variance ratio of each principal component obtained from a SVD of all filters in the examined layer (Sec. 3.2).

If H is close to zero, this indicates one strong principal component from which most of the filters can be reconstructed, and therefore a low filter diversity degeneration. On the other hand, a large entropy indicates a (close to) uniform distribution of the singular values and, thus, randomness of the filters. Sparse layers are a specific form of low diversity degeneration and, generally, both are correlated, whereas sparsity and randomness are mutually exclusive. It should be noted that |Σ · I| = min(c_in c_out, 9) and therefore the entropy only becomes expressive if c_in c_out ≫ 9.
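The layer entropy can be sketched as the Shannon entropy of the explained variance ratio â. The logarithm base is an assumption here: base 10 is used because the resulting maximum for nine components, log10(9) ≈ 0.954, agrees with the upper plateau L + b = 0.95 of the randomness threshold reported in this section. The function name is hypothetical.

```python
import numpy as np

def layer_entropy(weights):
    """Entropy of the explained variance ratio of all 3x3 filters in a layer.

    weights: array whose trailing dimensions are 3x3, e.g. (c_out, c_in, 3, 3).
    Uses log base 10 (assumption, see lead-in).
    """
    X = np.asarray(weights, dtype=np.float64).reshape(-1, 9)
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    a = s ** 2 / (X.shape[0] - 1)
    if a.sum() == 0:                            # all filters identical: no variance at all
        return 0.0
    a_hat = a / a.sum()
    nz = a_hat[a_hat > 0]                       # 0 * log(0) is treated as 0
    return float(-(nz * np.log10(nz)).sum())

rng = np.random.default_rng(0)
random_layer = rng.standard_normal((64, 64, 3, 3))
base = rng.standard_normal((3, 3))
scaled_layer = rng.standard_normal((64, 64, 1, 1)) * base   # one filter in different scales

h_random = layer_entropy(random_layer)          # close to log10(9)
h_low_diversity = layer_entropy(scaled_layer)   # close to 0
```

The second layer contains only differently scaled variants of one filter, which is the low diversity case described above.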

In this section we study different causes of degeneration and aim to provide thresholds for evaluation.

Overparameterization. The majority of the models that we have trained on our low-resolution datasets are heavily overparameterized for these relatively simple problems. We base this argument on the fact that we have models with different depths for most architectures and already observe near-perfect performance with the smallest variants. Therefore it is safe to assume that larger models are overparameterized, especially given that the performance only increases marginally (see the supplementary materials). First we analyze layer sparsity and entropy for these models trained on CIFAR-10/100 in comparison to all ImageNet1k classification models found in our dataset. For each dataset we have trained identical networks with identical hyperparameters. Both CIFAR-10 and CIFAR-100 consist of 60,000 32×32px images, but CIFAR-100 includes 10× more labels and thus fewer samples per class, forming a more challenging dataset. Fig. 2a shows that the overparameterized models contain significantly more sparse filters on average, and that sparsity increases with depth. In particular, we see the most sparse filters for CIFAR-10. However, ImageNet1k classifiers also seem to have some kind of "natural" sparsity, even though we do not consider most of these models as overparameterized. Entropy, on the other hand, decreases with increasing layer depth for every classifier, but more rapidly in overparameterized models (Fig. 2b). Again, the CIFAR-10 models degrade faster and show more degeneration. The overparameterized models contain layers that have an entropy close to 0 towards deeper layers, which indicates that these models are "saturated" and only produce differently scaled variants of the same filters. In line with the oversaturation, these models also have increasingly sparse filters, presumably as an effect of regularization.

Filter degeneration and model robustness. Our dataset also contains robust models from the RobustBench leaderboard [38]. When comparing robust models with non-robust models trained on ImageNet1k, it becomes clear that robust models form almost no sparse filters in deeper convolution layers (Fig. 2c), while regular models show some sparsity there. The entropy of robust models is also higher throughout depth (Fig. 2d), indicating that robust models learn more diverse filters.

Thresholds. To obtain a threshold for randomness given a number of filters n per layer, we perform multiple experiments in which we initialize convolution filters of different sizes from a standard normal distribution and fit a sigmoid T_H to the minimum results obtained for entropy.

We obtain the following values L = 1.26, x_0 = 2.30, k = 0.89, b = −0.31 and call any layer L with H > T_H(n) random. In contrast, defining a threshold for low diversity degeneration seems less intuitive and one can only rely on statistics: the average entropy H is 0.69 over all layers and continuously decreases from an average of 0.75 to 0.5 with depth. Additionally, the minimum of the 1.5 interquartile range (IQR) also steadily decreases with depth. The same applies to sparsity: the average sparsity S over all layers is 0.12; only 56.5% of the layers in our dataset hold S < 0.01, and 9.9% even show S > 0.5. In terms of convolution depth, the average sparsity varies between 9.9% and 14%, with the largest sparsity found in the last 20% of the model depth. The largest outliers of the 1.5 IQR are, however, found in the first decile. In both cases we find it difficult to provide a meaningful general threshold and suggest determining this value on a case-by-case basis (see the supplementary materials).
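For illustration, the randomness threshold could be evaluated as below. The fitted values are taken from the text, but the functional form is an assumption: a logistic curve in log2(n), chosen because the experiments sample n in powers of two and because this form yields a threshold of roughly zero for n = 2, where the centered filter matrix has rank one. The function names are hypothetical.

```python
import math

# Fitted values reported in the text.
SIG_L, SIG_X0, SIG_K, SIG_B = 1.26, 2.30, 0.89, -0.31

def entropy_threshold(n):
    """Randomness threshold T_H(n) for a layer with n filters.

    Assumed form: L / (1 + exp(-k * (log2(n) - x0))) + b.
    """
    return SIG_L / (1.0 + math.exp(-SIG_K * (math.log2(n) - SIG_X0))) + SIG_B

def is_random(entropy, n):
    """True if a layer with the given entropy and n filters exceeds the threshold."""
    return entropy > entropy_threshold(n)

small = entropy_threshold(2)            # close to 0
large = entropy_threshold(2 ** 21)      # approaches L + b = 0.95
average_layer = is_random(0.69, 4096)   # the reported average entropy is not flagged
```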

In the next series of experiments, we analyze only the structure of 3×3 filters, neglecting their actual numerical weight in the trained models. Therefore, we normalize each filter f individually by the absolute maximum weight into f ′ .
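The per-filter normalization can be sketched as a division by the largest absolute weight of each filter. The handling of all-zero filters (left unchanged) is an assumed convention that the text does not specify, and the function name is hypothetical.

```python
import numpy as np

def normalize_filters(filters):
    """Scale each 3x3 filter by its absolute maximum weight.

    filters: array of shape (n, 3, 3). All-zero filters are returned
    unchanged to avoid a division by zero (assumed convention).
    """
    f = np.asarray(filters, dtype=np.float64)
    scale = np.abs(f).reshape(len(f), -1).max(axis=1)
    scale = np.where(scale == 0, 1.0, scale)
    return f / scale[:, None, None]

demo = np.stack([
    np.array([[1.0, -2.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]]),
    np.zeros((3, 3)),
    -0.5 * np.ones((3, 3)),
])
normalized = normalize_filters(demo)
```

After this step every non-zero filter has a largest absolute entry of exactly 1, so only the structure of the filter remains and the sign pattern is preserved.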

Then we perform a PCA transformation on the scaled filters. Fig. 5 shows some qualitative examples of obtained principal components, split by several meta-data dimensions. The images of the formed basis are often similar for all groups except for a few outliers (such as GAN-discriminators). The explained variance, however, fluctuates significantly and sometimes changes the order of components. Consistently, we observe substantially higher variance on the first principal components. The explained variance does not necessarily correlate with the shift observed between models. Here, the biggest mean drift is also located in the first principal component (D = 0.90), but is then followed by the sixth, third, and second component (D = 0.78, 0.69, 0.58). The coefficients of the sixth component also contain the strongest outliers (Fig. 6). We visualize the distributions of PCA coefficients along every component for each group by plots of kernel density estimates (KDEs), e.g. Fig. 3 depicts the distributions of filters grouped by some selected visual categories in comparison to the distribution of coefficients for the full dataset. Filters extracted from models with degenerated layers (as seen in medical mri) result in spiky/multi-modal KDEs. The distributions can alternatively be visualized by bi-variate scatter plots that may reveal more details than KDEs. For example, they let us categorize the distributions into phenotypes depending on their distribution characteristic in the PCA space (Fig. 4): sun: distributions where both dimensions are gaussian-like. These are the coefficient distributions to be expected without significant sparsity/low diversity degeneration. Yet, this phenotype may also include non-converged filters; spikes: distributions suffering from a low variance degeneration resulting in local hotspots; symbols: at least one distribution is multi-modal, non-centered, highly sparse, or otherwise non-normal (low variance degeneration); point: coefficients are primarily located in the center (sparsity degeneration).

We train low-resolution networks on CIFAR-10 multiple times with identical hyperparameters except for random seeds and save a checkpoint of each model at the best validation epoch. Most models converge to highly similar coefficient distributions when retrained with different weight initializations (e.g. ResNet-9 with D < 5.3 · 10^−4). However, some architectures such as MobileNetv2 show higher shifts (D < 2.6 · 10^−2). We assume that this is due to the structure of the loss surface, e.g. the residual skip connections found in ResNets smooth the surface, whereas other networks may contain more local minima due to noisy surfaces [56].

Formation of filter structures during training. Although our dataset only includes trained convolutional filters, we tried to understand how the coefficient distribution shifts during training. Therefore we recorded checkpoints of a ResNet-9 trained on CIFAR-10 every 10 training epochs, beginning right after the weight initialization. Fig. 7 shows that the coefficient distributions along all principal components are gaussian-like in the beginning and eventually shift during training. For this specific model, distributions along major principal components retain the standard deviation during training, while the standard deviation of less significant component distributions decreases. The initialization observation helped us remove models from our collection where we failed to load the trained parameters for any reason, and it is the foundation for our provided randomness metric.

In this subsection we investigate transfer distance in different meta-dimensions of pre-trained models. We compute the shift D and visualize this in the form of heatmaps (Fig. 8) that show shifts between all pairings.

Shifts between tasks. Unsurprisingly, classification, segmentation, object detection, and GAN-generator distributions are quite similar, since the non-classification models typically include a classification backbone. The smallest mean shift to other tasks is observed in object detection, GAN-generators, and depth estimation models. The least transferable distributions are GAN-discriminators. Their distributions barely differ along principal components and can be approximated by a gaussian distribution. By our randomness metric this indicates a filter distribution that is close to random initialization, implying a "confused" discriminator that cannot distinguish between real and fake samples towards the end of (successful) training. It may be surprising to see a slightly larger average shift for classification. This is presumably due to many degenerated layers in our collected models, which are also visible in the form of spikes when studying the KDEs. An evaluation of distributions including only non-degenerated classifiers (see the supplementary materials) actually shows a lower average shift due to the aforementioned similarity to other tasks.

Shifts between visual categories and training sets. We find that the distribution shift is well balanced across most visual categories and training sets. Notable outliers include all medical types. They have visible spikes in the KDEs, once again indicating degenerated layers. Indeed, the average sparsity in these models is extreme in the last 80% of the model depth. Another interesting, albeit less significant outlier is the fractal category. It consists of models trained on Fractal-DB, which was proposed as a synthetic pre-training alternative to ImageNet1k [4]. The standard deviations of coefficient distributions tend to shrink towards the least significant principal components, but this trend is not visible for this category, indicating that sorting the basis by variance would yield a different order for this task and perhaps that the basis itself is not well suited. Also notable is a remarkably high standard deviation on the distribution of the first principal component. Interestingly, we also observe sub-average degeneration for this category. Shifts in other categories can usually be explained by a biased representation. For example, we only have one model for plants, our handwriting models consist exclusively of overparameterized networks that suffer from layer degeneration, and textures consists of only one GAN-discriminator, which naturally shows a high randomness.

Shifts by filter/layer depth. The shift between layers of various depth deciles increases with the difference in depth, with distributions in the last decile of depth forming the most distinct interval, and outdistancing the second-to-last and first decile that follow next. An interesting aspect is also the model-to-model shift across deciles. This shift exemplifies the uniqueness of formed filters. Our observations challenge the general recommendation for fine-tuning to freeze early layers in classification models, as the largest shifts are not only seen in deep layers but also in early vision (Fig. 9). Segmentation models show the most drift in deeper layers (see the supplementary materials). In contrast, object/face detection models only show drift in the early vision (object detection in the first, face detection in the first four depth deciles), but marginal drift in later convolution stages.

Shifts within model families. The shift between models of the same family trained for the same task is negligible (Fig. 10), indicating that every large enough dataset is good enough and the common practice of pre-training models with ImageNet1k even for visually distant application domains is indeed a valid approach. ResNet-family outliers consist only of models that show a high amount of sparsity. Additionally, this observation may be exploited by training small teacher networks and applying knowledge distillation [57] to initialize deeper models of the same family.

Our data is biased towards classification models and/or natural datasets such as ImageNet1k. Further, some splits will over-represent specific dimensions, e.g. tasks may include exclusive visual categories and vice versa. Also, as previously shown, many of the collected models show a large amount of degenerated layers that impact the distributions. This also biases measurements of the distribution shifts. We performed an ablation study by removing filters extracted from degenerated layers, but were unable to find a clear correlation between degeneration and distribution shifts (see the supplementary materials), presumably due to a lack of justified thresholds.

Our first results support our initial hypothesis that the distributions of trained convolutional filters are a suitable and easy-to-access proxy for the investigation of image distributions in the context of transferring pre-trained models and robustness. While the presented results are still in the early stages of a thorough study, we report several interesting findings that could be explored to obtain better model generalizations and assist in finding suitable pre-trained models for fine-tuning. One finding is the presence of large amounts of degenerated (or untrained) filters in large, well-performing networks, resulting in the phenotypes points, spikes, and symbols. We assume that their existence is a symptom in line with the Lottery Ticket Hypothesis [55].

[Figure residue: heatmap axis labels, counts in parentheses as given. Tasks: Face Detection (4), GAN-Discriminator (7), Object Detection (16), GAN-Generator (24), Depth Estimation (2), Style Transfer (5), Super Resolution (4), Panoptic Segmentation (2), Semantic Segmentation (15), Classification (555), Segmentation (11), Face Recognition (1), Auto-Encoder (1). Visual categories: faces (16), depth (2), natural (557), map (2), thermal (1), astronomy (2), art (5), seismic (4), cars (1), medical ct (4), fractals (2), textures (2), medical xray (9), medical mri (3), plants (1).]

We conclude that ideal models should have relatively high entropy (but H < T_H) throughout all layers and almost no sparse filters. Models that show an increasing or generally high sparsity or a massive surge in entropy with depth are most likely overparameterized and could be pruned, which would benefit inference and training speed. In contrast, initialized but not trained models will have a constantly high entropy H ≥ T_H throughout all layers and virtually no sparsity. Another striking finding is the observation of very low shifts in filter structure between different meta-groups: I) shifts inside a family of architectures are very low; II) shifts are mostly independent of the target image distribution and task; III) we also observe rather small shifts between convolution layers of different depths, with the highest shifts in the first and last layers. Overall, the analysis of over 1.4 billion learned convolutional filters in the provided dataset gives a strong indication that the common practice of pre-training CNNs is indeed a sufficient approach if the chosen model is not heavily overparameterized. Our first results indicate that the presented dataset is a rich source for further research in transfer learning, robustness, and pruning.

We provide CNN Filter DB as a ca. 100 GB large HDF5 file which contains the unprocessed 3 × 3 filters along with meta information as reported in Tab. 3.

We have collected models of the following tasks: Classification, GAN-Generator, Segmentation, Object Detection, Style Transfer, Depth Estimation, Face Detection, Super Resolution, GAN-Discriminator, Face Recognition, Auto-Encoder. The training sets were distributed into the following categories: plants, natural, art, map, handwriting, medical ct, medical mri, depth, faces, textures, fractals, seismic, astronomy, thermal, medical xray, cars.

A visualization of the accumulated frequency of models and filters by task, visual category, and training dataset combination can be found in Fig. 27 . Heatmaps for aggregated frequency of filters/models by task and visual category are shown in Fig. 11 .

As previously mentioned, we used rescaled filters for all distribution shift related experiments. In Fig. 12 we show the mean scale per layer depth decile of the unprocessed filters. We group the filters f by model and depth decile in sets S and compute the mean scale as follows:
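The grouping by depth decile could look like the sketch below for the filters of a single model. The definition of "scale" is an assumption: it is taken as the absolute maximum weight of a filter, i.e. the same quantity used for rescaling, and the relative depth of each filter is assumed to be given as a value in [0, 1]. The function name is hypothetical.

```python
import numpy as np

def mean_scale_by_decile(filters, rel_depth):
    """Mean filter scale per depth decile for one model.

    filters:   array of shape (n, 3, 3) with unprocessed weights.
    rel_depth: array of shape (n,), relative layer depth in [0, 1].
    Scale is assumed to be the absolute maximum weight of a filter.
    Returns an array of length 10 (NaN for deciles without filters).
    """
    f = np.asarray(filters, dtype=np.float64)
    scale = np.abs(f).reshape(len(f), -1).max(axis=1)
    decile = np.minimum((np.asarray(rel_depth) * 10).astype(int), 9)
    out = np.full(10, np.nan)
    for d in range(10):
        mask = decile == d
        if mask.any():
            out[d] = scale[mask].mean()
    return out

demo_filters = np.stack([2.0 * np.ones((3, 3)), -4.0 * np.ones((3, 3)), np.ones((3, 3))])
demo_depth = np.array([0.05, 0.05, 1.0])
scales = mean_scale_by_decile(demo_filters, demo_depth)
```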

The distributions show an unsurprising decrease with depth but also a high variance and many outliers across models, especially in the first two deciles.

Lastly, Tab. 4 contains all models we have used for our analysis.

We draw n = 2^1, ..., 2^21 filters with 3 × 3 shape from a standard normal distribution and calculate the entropy H as defined in the Methods section. We repeat this process 1000 times for each n and fit a sigmoid to the lowest entropy we have observed for each n. Fig. 13 shows the obtained samples alongside the fitted sigmoid T_H.
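A scaled-down version of this simulation is sketched below. To keep the run time short, it stops at n = 2^12 and uses 20 repetitions instead of 1000. The entropy uses a base-10 logarithm, which is an assumption, and the sigmoid fit itself is omitted. The helper names are hypothetical.

```python
import numpy as np

def entropy(flat_filters):
    """Entropy (log base 10, assumed) of the explained variance ratio of n x 9 filters."""
    X = np.asarray(flat_filters, dtype=np.float64)
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    a = s ** 2 / (X.shape[0] - 1)
    a_hat = a / a.sum()
    nz = a_hat[a_hat > 0]
    return float(-(nz * np.log10(nz)).sum())

def lowest_entropy(n, repeats, rng):
    """Lowest entropy observed over several draws of n standard-normal filters."""
    return min(entropy(rng.standard_normal((n, 9))) for _ in range(repeats))

rng = np.random.default_rng(0)
sizes = [2 ** e for e in range(1, 13)]          # the text goes up to 2**21
lowest = [lowest_entropy(n, repeats=20, rng=rng) for n in sizes]
```

For n = 2 the centered matrix has rank one, so the entropy is zero. For large n the lowest observed entropy approaches log10(9), the value for nine equally strong components.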

As mentioned in the Limitations section, we attempted to reproduce our experiments with a dataset that did not include filters from degenerated layers. We applied the following selection criterion to detect degeneration based on entropy H and sparsity S as defined in our Methods section:

While we had a solid foundation for the entropy upper bound (minus some noise), the lower bound for entropy and the bound for sparsity are based on the average we found in our datasets. Note that increasing the lower bound for H results in more similar distributions and therefore lower shift. Hence, this value should be picked very carefully to not filter out vital layers. Sparsity is usually seen in peaks around the center of the KDEs. Tuning this value has a significant impact on the shift (Fig. 16) since the large center peaks increase the KL-Divergence significantly (Fig. 14) . With the selected threshold we fail to find a meaningful correlation between the ratio of degenerated layers and the average shift to other groups (i.e. tasks or visual categories; Fig. 15 ).

We initially assumed that quantization may lead to the spikes phenotype, so we decided to test what shift we obtain when training with fp16 instead of fp32 precision. Spiky distributions should show high shifts in comparison to smooth distributions. We train all our low-resolution models on CIFAR-10 with the same hyperparameters and observe marginal shifts (Fig. 17). Outliers with somewhat higher shifts include MobileNet v2, but we have verified that the shift for MobileNet v2 does not exceed the shift one would measure by retraining with random seeds. The ResNet-9 shift does not exceed its retraining shift either; therefore we assume that this also applies to other models.

In addition to the main paper, we also report the shift by absolute depth for the first 20 layers of classification models in Fig. 18 and the shift by relative depth for more tasks in Fig. 19. Please note that Fig. 19e only contains the same network trained on different datasets.

In Fig. 20 we add interesting counterparts to the filter basis shown in the main paper. As one can observe, the filter basis remains quite similar. Changes usually affect the order of the components (since they are sorted by explained variance ratio), inversion (though this is not characteristic, since the coefficients can simply be inverted), and noise, presumably due to degeneration.

Fig. 29 shows KDEs for all tasks. Fig. 31 shows only KDEs of tasks of models that were trained with datasets belonging to the natural visual category. Fig. 30 shows KDEs for every visual category. Some categories show shifts due to bias representation, while others clearly contain a majority of degenerated filters. Fig. 32 shows KDEs by the visual category of the training dataset, limited to classification models. Several categories such as medical xray, plants, and handwriting are clearly impacted by degeneration. Fig. 33 shows KDEs of classification models split by convolution depth decile. The distribution shift with depth resembles the shift of all filters we observed while training ResNet-9 in our Results section. Fig. 35 shows selected models from the same family, with clear shifts between the families but low shifts within them. Lastly, Fig. 34 shows all models trained on MNIST. These consist exclusively of the intentionally overparameterized models. The KDEs show very clear signs of major degeneration in the form of stark spikes, especially around zero.
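The remarks on component order and inversion of the filter basis can be made concrete with a short sketch. It assumes that the basis is obtained by a PCA (centered SVD) of the flattened 3 × 3 filters, which is an assumption about the method rather than a statement of the authors' code; the function name `filter_basis` and the random stand-in filters are hypothetical.

```python
import numpy as np


def filter_basis(filters: np.ndarray):
    """Return (basis, explained_variance_ratio, coefficients) for 3x3 filters.

    ASSUMPTION: the basis comes from a PCA computed via a centered SVD.
    Requires at least 9 filters so that 9 components exist.
    """
    x = filters.reshape(len(filters), -1)            # (n, 9)
    centered = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)                  # already sorted descending
    coeffs = centered @ vt.T                         # one coefficient per component
    return vt.reshape(-1, 3, 3), ratio, coeffs


rng = np.random.default_rng(0)
filters = rng.standard_normal((1000, 3, 3))          # stand-in for trained filters
basis, ratio, coeffs = filter_basis(filters)

# Inverting a component together with its coefficients changes nothing:
flipped_basis, flipped_coeffs = basis.copy(), coeffs.copy()
flipped_basis[0] *= -1
flipped_coeffs[:, 0] *= -1
recon = coeffs @ basis.reshape(len(basis), -1)
recon_flipped = flipped_coeffs @ flipped_basis.reshape(len(basis), -1)
```

Because the components are ordered by explained variance ratio, two components with similar ratios can swap positions between filter sets, and because `recon` equals `recon_flipped`, the sign of a component carries no information on its own.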

The main paper showed only scatter plots between two select coefficient distributions $c_i$ and $c_j$. Here we include all bivariate scatter plots of selected examples for each phenotype over all pairs of distributions (i.e. i = 0, ..., 8 and j = 0, ..., 8): Fig. 22 shows the scatter plots over all filters that we have extracted; Fig. 23 shows spikes for filters that belong to the visual category medical ct; Fig. 24 shows symbols for filters that belong to an EfficientNet-l2-ns-475 pretrained on the massive JFT-300m and fine-tuned on ImageNet1k; Fig. 25 shows the point phenotype computed on filters of our intentionally overparameterized models trained on MNIST; and Fig. 26 shows spikes computed on filters of the task depth estimation.

Models were taken from [59] and slightly modified by us to support different input channel and class modalities. Additionally, some more architectures were added. Generally, these models are quite similar to the architectures proposed in their respective original publications. Typically, however, pooling is reduced, dilated or strided convolutions are replaced by regular convolutions, and convolution kernel sizes are reduced to be no larger than 3 × 3.

Figure 17. Distribution shift D between low resolution models trained on CIFAR-10 with fp16 and fp32 precision.

All models are trained on NVIDIA A100 GPUs with hyper-parameters that are independent of the dataset. Stochastic matrix multiplication is turned off via cuDNN settings. Inputs are scaled to 32 × 32 px and channel-wise normalized. CIFAR data is additionally zero-padded by 4 px along each dimension and then transformed using 32 × 32 random crops and random horizontal flips. For the hyper-parameters, an initial learning rate of 1e-8, a weight decay of 1e-2, a batch size of 256, and a Nesterov momentum of 0.9 are used.
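The preprocessing described above can be sketched in plain NumPy. This is an illustrative sketch only: the function names `augment` and `normalize`, the dictionary `HYPERPARAMS`, the flip probability of 0.5, and the order of normalization relative to padding are assumptions, while the numeric settings are the ones stated in the text.

```python
import numpy as np

# Settings as stated in the text; the dictionary itself is hypothetical.
HYPERPARAMS = {
    "initial_learning_rate": 1e-8,
    "weight_decay": 1e-2,
    "batch_size": 256,
    "nesterov_momentum": 0.9,
}


def normalize(image: np.ndarray, mean, std) -> np.ndarray:
    """Channel-wise normalization of a (C, H, W) image."""
    mean = np.asarray(mean, dtype=image.dtype).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=image.dtype).reshape(-1, 1, 1)
    return (image - mean) / std


def augment(image: np.ndarray, rng: np.random.Generator,
            pad: int = 4, crop: int = 32) -> np.ndarray:
    """Zero-pad by `pad` px per side, take a random crop, flip at random.

    ASSUMPTION: the horizontal flip is applied with probability 0.5.
    """
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    top = rng.integers(0, padded.shape[1] - crop + 1)
    left = rng.integers(0, padded.shape[2] - crop + 1)
    out = padded[:, top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, :, ::-1]                    # horizontal flip
    return np.ascontiguousarray(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((3, 32, 32)).astype(np.float32)   # stand-in for a CIFAR image
    mean, std = image.mean(axis=(1, 2)), image.std(axis=(1, 2))
    sample = augment(normalize(image, mean, std), rng)
    print(sample.shape)
```

A random crop from the padded image shifts the content by up to 4 px in each direction while keeping the 32 × 32 input size that the models expect.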

The ResNet-9 models were trained as detailed in Appendix I. However, a different random seed was provided for each model. Results are reported in Tab. 1.

Revisiting unreasonable effectiveness of data in deep learning era

Threat of adversarial attacks on deep learning in computer vision: A survey

Convolutional neural networks for medical image analysis: Full training or fine tuning?

Pre-training without natural images

CNN-Filter-DB v1.0.0

Going deeper with convolutions

Zoom in: An introduction to circuits

An overview of early vision in inceptionv1

Efficientnetv2: Smaller models and faster training

Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search

Inception-v4, inception-resnet and the impact of residual connections on learning

Neural architecture design for gpu-efficient networks

Gluoncv and gluonnlp: Deep learning in computer vision and natural language processing

Hardcore-nas: Hard constrained differentiable neural architecture search

Deep high-resolution representation learning for visual recognition

Squeeze-and-excitation networks

Mixconv: Mixed depthwise convolutional kernels

Mnasnet: Platform-aware neural architecture search for mobile

Searching for mobilenetv3

Rethinking spatial dimensions of vision transformers

Tensorflowslim image classification model library

Designing network design spaces

Progressive neural architecture search

Repvgg: Making vggstyle convnets great again

Revisiting resnets: Improved training and scaling strategies

Aggregated residual transformations for deep neural networks

Rethinking channel dimensions for efficient model design

Xnect: Real-time multi-person 3d motion capture with a single rgb camera

Selective kernel networks

Single-path nas: Designing hardware-efficient convnets in less than 4 hours

Billion-scale semi-supervised learning for image classification

Adversarial examples improve image recognition

Self-training with noisy student improves imagenet classification

Xception: Deep learning with depthwise separable convolutions

Visformer: The vision-friendly transformer

Model rubik's cube: Twisting resolution, depth and width for tinynets

Wide residual networks

Twins: Revisiting the design of spatial attention in vision transformers

Learning a discriminative feature network for semantic segmentation

Pyramid scene parsing network

Model | Ref. | Pretraining dataset | Training dataset | Task | Visual category | Filters
(continued from previous row) | | - | imagenet1k | Classification | natural | 1257472
timm seresnext26d 32x4d imagenet 11 | [153] | - | imagenet1k | Classification | natural | 90208
timm resnetrs420 imagenet 11 | [162] | - | imagenet1k | Classification | natural | 7494752
timm seresnext26t 32x4d imagenet 11 | [153] | - | imagenet1k | Classification | natural | 89928
timm seresnet152d imagenet 11 | [153] | - | imagenet1k | Classification | natural | 3292256
timm seresnext50 32x4d imagenet 11 | [153] | - | imagenet1k | Classification | natural | 157184
timm skresnet18 imagenet 11 | [166] | - | imagenet1k | Classification | natural | 1220608
timm skresnet34 imagenet 11 | [166] | - | imagenet1k | Classification | natural | 2342912
timm spnasnet 100 imagenet 11 | [167] | - | imagenet1k | Classification | natural | 2552
timm skresnext50 32x4d imagenet 11 | [166] | - | imagenet1k | Classification | natural | 314368
timm ssl resnet18 imagenet 11 | [168] | yfcc100m | imagenet1k | Classification | natural | 1220608
timm ssl resnet50 imagenet 11 | [168] | yfcc100m | imagenet1k | Classification | natural | 1257472
timm ssl resnext50 32x4d imagenet 11 | [168] | yfcc100m | imagenet1k | Classification | natural | 157184
timm ssl resnext101 32x4d imagenet 11 | [168] | yfcc100m | imagenet1k | Classification | natural | 296448
timm swsl resnet18 imagenet 11 | [168] | instagram1b | imagenet1k | Classification | natural | 1220608
timm ssl resnext101 32x8d imagenet 11 | [168] | yfcc100m | imagenet1k | Classification | natural | 1185792
timm swsl resnet50 imagenet 11 | [168] | instagram1b | imagenet1k | Classification | natural | 1257472
timm ssl resnext101 32x16d imagenet 11 | [168] | yfcc100m | imagenet1k | Classification | natural | 4743168
timm swsl resnext101 32x4d imagenet 11 | [168] | instagram1b | imagenet1k | Classification | natural | 296448
timm swsl resnext50 32x4d imagenet 11 | [168] | instagram1b | imagenet1k | Classification | natural | 157184
timm swsl resnext101 32x8d imagenet 11 | [168] | instagram1b | imagenet1k | Classification | natural | 1185792
timm tf efficientnet b0 ap imagenet 11 | [169] | - | imagenet1k | Classification | natural | 2720
timm tf efficientnet b0 imagenet 11 | [78] | - | imagenet1k | Classification | natural | 2720
timm tf efficientnet b0 ns imagenet 11 | [170] | jft300m | imagenet1k | Classification | natural | 2720
timm tf efficientnet b1 ap imagenet 11 | [169] | - | imagenet1k | Classification | natural | 5280
timm tf efficientnet b1 imagenet 11 | [78] | - | imagenet1k | Classification | natural | 5280
timm swsl resnext101 32x16d imagenet 11 | [168] | instagram1b | imagenet1k | Classification | natural | 4743168
timm tf efficientnet b1 ns imagenet 11 | [170] | jft300m | imagenet1k | Classification | natural | 5280
timm tf efficientnet b2 ap imagenet 11 | [169] | -