key: cord-0304415-sfysubcz authors: Nolte, Kristopher; Gao, Yunyun; Stäb, Sabrina; Kollmansberger, Philip; Thorn, Andrea title: Detecting ice artefacts in processed macromolecular diffraction data with machine learning date: 2021-10-29 journal: bioRxiv DOI: 10.1101/2021.10.28.466246 sha: 92481b53d98bbf4ae62386a3fdc22dace9036b3d doc_id: 304415 cord_uid: sfysubcz
Contamination with diffraction from ice crystals can negatively affect, or even impede, macromolecular structure determination; detecting the resulting artefacts in diffraction data is therefore crucial. However, once the data have been processed, it can be very difficult to recognize this problem automatically. To address this, a set of convolutional neural networks named Helcaraxe has been developed which can detect ice diffraction artefacts in processed diffraction data from macromolecular crystals. The networks outperform previous algorithms and will be available as part of the AUSPEX webserver and CCP4-distributed software.
Synopsis: A program utilizing machine learning and convolutional neural networks, named Helcaraxe, has been developed which can detect ice crystal artefacts in processed macromolecular diffraction data with unprecedented accuracy.
Crystals of biological macromolecules are routinely cryo-cooled to a temperature of 100 K before exposure to X-rays to reduce radiation damage during the diffraction experiment (Garman & Weik, 2019). Cryo-cooling can lead to the formation of ice (Garman & Owen, 2006). While anti-freeze agents and flash-cooling are commonly employed to minimise this, rings from the diffraction of small ice crystals are frequently found in diffraction images from cryo-cooled macromolecular samples (Chapman & Somasundaram, 2010) (Fig. 1). The emergence of fast-readout pixel detectors, which ideally are used to measure finely sliced diffraction images, makes visual identification of ice artefacts from individual images more difficult.
Ice rings can be observed when several images are averaged to increase contrast or through newly developed machine learning approaches (Czyzewski et al., 2021). Identifying whether a structure (or, more exactly, the integrated, scaled and merged diffraction data set) available in the worldwide Protein Data Bank (wwPDB) (Berman, 2000) is affected by ice ring contamination is even more difficult. Only a few entries have the corresponding raw data (i.e. images) available, as these are neither required for publication nor can they be deposited directly to the wwPDB. However, if an integrated and merged data set is affected by ice diffraction, one can assume that subsequent model refinement will be affected. It has been demonstrated that removing ice rings from the data during integration improves R values by as much as 4.8%. Thus, the correct identification of ice rings in data sets is an important step in assessing and, ultimately, improving data quality.
In addition to statistical identification in CTRUNCATE (Winn et al., 2011) and phenix.xtriage (Adams et al., 2010), the AUSPEX icefinder score, recently improved by Moreau and colleagues (Moreau et al., 2021), is one of the most reliable statistical tools to detect ice crystal artefacts in integrated, merged and scaled diffraction data sets. While statistical identification can identify stronger ice diffraction in processed data automatically, less distinct ice rings can go unnoticed. For this reason, AUSPEX also produces plots of observed intensities (I obs) or structure factor amplitudes (F obs) against resolution, which permit easy visual identification of ice ring contamination (Fig. 2). The fact that humans can easily recognize ice rings in these AUSPEX plots, while automatic statistical detection remains difficult, has led us to attempt an identification using artificial intelligence.
In recent years, the use of Convolutional Neural Networks (CNNs) for data-driven research has enabled the identification and recognition of complicated patterns in noisy data (Schmidt, 2019), leading to advances in all disciplines of science and data analysis. CNNs are exceptionally suited to the classification of multi-dimensional arrays because they can retain spatial input information (Yamashita et al., 2018). Here, we present the results of employing CNNs to detect ice artefacts in processed macromolecular diffraction data.
A total of 1827 integrated, scaled, and merged diffraction data sets indicated to have been measured at 100 K were used to generate training and validation sets (Supporting Information: Helcaraxe_train_labels.xlsx). These diffraction data were randomly selected, without duplicates, from the Coronavirus Structural Task Force repository (Croll et al., 2021) (396 diffraction data sets), the Integrated Resource for Reproducibility in Macromolecular Crystallography (Grabowski et al., 2016) (280) and the Protein Data Bank (Berman, 2000) (1151). Diffraction data were used in MTZ format, obtained through the conversion of sf.cif files by CCP4 cif2mtz (Winn et al., 2011). If CIF files had no observed intensities (I obs), structure factor amplitudes (F obs) were used instead. If MAD data had been measured, the wavelength listed first in the deposited CIF file was used.
To convert the data into a format that could be presented to a neural network, two-dimensional histograms of I obs or F obs against resolution, dubbed "Helcaraxe plots", were generated using the NumPy histogram2d function (Virtanen et al., 2020) (example code in Supporting Information). The size of the histograms was set to 80 × 80 pixels, as this proved to be the best compromise between information loss and data size. Multiple Helcaraxe plots were produced per data set, one around every expected ice-ring position present in the overall resolution range of the diffraction data set (Fig. 2 top).
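The histogram step can be illustrated with a minimal sketch. This is not the authors' script (their example code is in the Supporting Information); the function name `helcaraxe_plot`, the resolution window used in the usage note, and the choice to take the percentile limits from the reflections inside the window are assumptions made for illustration. The default percentile values are those stated in the text that follows.

```python
import numpy as np


def helcaraxe_plot(resolution, values, res_range, bins=80,
                   low_pct=0.5, high_pct=95.0):
    """Sketch of a Helcaraxe-style plot: a bins x bins 2D histogram of
    I obs (or F obs) against resolution around one expected ice ring.

    Assumption: the percentile limits are computed from the reflections
    inside the resolution window, which the paper does not specify.
    """
    resolution = np.asarray(resolution, dtype=float)
    values = np.asarray(values, dtype=float)

    # Keep only reflections inside the resolution window of this ice ring.
    res_min, res_max = res_range
    inside = (resolution >= res_min) & (resolution <= res_max)
    resolution, values = resolution[inside], values[inside]
    if values.size == 0:
        # No reflections in this window: return an empty plot.
        return np.zeros((bins, bins))

    # y-axis limits from percentiles, which discards extreme outliers.
    y_min, y_max = np.percentile(values, [low_pct, high_pct])

    # Fixing the x-range to the window gives a constant plot size
    # regardless of how wide the individual ice ring range is.
    hist, _, _ = np.histogram2d(
        resolution, values, bins=bins,
        range=[[res_min, res_max], [y_min, y_max]],
    )
    # hist[i, j]: resolution bin i, value bin j; use hist.T for display.
    return hist
```

Calling `helcaraxe_plot(d_spacing, i_obs, (3.60, 3.75))` with a hypothetical window around one ice ring returns an 80 × 80 array; reflections outside the percentile limits are simply not counted by `numpy.histogram2d`.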
The width of individual ice ring resolution ranges has been previously identified. To avoid extreme intensity outliers and for normalisation, the lower limit of the y-axis was set at the 0.5th percentile of intensities or amplitudes and the upper limit at the 95th percentile. These parameters proved to be the best middle ground between data loss and plot similarity. The x-axis was scaled to obtain constant histogram size despite the different widths of the individual ice ring resolution ranges.
80% of the Helcaraxe plots generated from the MTZ files were allocated randomly to the training set and 20% to the validation set. The validation set was used to evaluate the CNNs during training and to select two final CNN candidates for the Helcaraxe program. A separate set of 200 randomly chosen diffraction data sets labelled for ice ring contamination (previously published as AUSPEX Appendix section C and reproduced here in Supplementary Information) was assigned as a test set and was not used in training.
For annotation, ice rings were first identified in diffraction data sets by AUSPEX plots. Subsequently, Helcaraxe plots for training, validation, and test sets were generated as described and manually annotated for ice ring presence using the previous annotation results as guidance. A Helcaraxe plot was labelled as ice diffraction contaminated if at least two of the following criteria were met:
I. A vertical shift of I obs or F obs values in the shape of a spike must be visible to the naked eye.
II. At least 1% of the area of the plot was affected by the ice ring, meaning that either a part of the plot was blank because of the ice ring or intensities were shifted upwards. The area was measured by overlaying a grid.
III. The ice ring must be visible in the corresponding AUSPEX plot.
486 (26.6%) of the 1827 diffraction data sets used for the training and validation set were found to have ice diffraction contamination according to the aforementioned criteria (see Table 1). This resulted in the generation of 13,170 individual Helcaraxe plots, of which 984 (7.47%) were annotated as contaminated (Supplementary Information: Helcaraxe_train_labels.xlsx). The test set (AUSPEX Appendix section C), which was previously labelled for ice ring contamination, includes ice ring classification from other software and was also used by Moreau and colleagues to compare the performance of their algorithm (Moreau et al., 2021). The test set was reviewed again, consulting Moreau's annotation, and 9 labels (4.5%) were updated (Supporting Information: Helcaraxe_test_results.xlsx). Three diffraction data sets were omitted because they were already superseded or had a lower maximum resolution than those ranges contaminated by ice rings. Of the remaining 197 diffraction data sets, 40 (20.3%) were labelled as containing an ice ring. There was no overlap between the training/validation and the test set.
The network architecture of the employed CNNs consists of a convolutional and a fully connected part (Fig. 3). Two networks with the same architecture were trained, one for I obs and one for F obs values. Helcaraxe plots were supplied through an input layer that passes the plot directly to the subsequent convolutional segment. The first segment of the network extracts data features and consists of four blocks, with each block having two convolutional layers followed by an aggregating MaxPooling layer (Fig. 4), and Batch Normalization layers in between to reduce the risk of overfitting through normalization. The second segment is connected through a Flatten layer and contains two fully connected layers of artificial neurons separated by dropout layers, which randomly omit neurons during training to further reduce the risk of overfitting (Srivastava et al., 2014).
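To give a feel for what the four convolutional blocks do to an 80 × 80 Helcaraxe plot, the following sketch traces the spatial size of the feature maps through the convolutional segment. It assumes size-preserving ("same"-padded) convolutions and non-overlapping 2 × 2 max pooling; neither the padding nor the pool size is stated in the text, and the function name is hypothetical.

```python
def trace_feature_map_sizes(input_size=80, n_blocks=4, pool_size=2):
    """Edge length of the feature maps after each convolutional block.

    Assumptions (not given in the paper): convolutions keep the spatial
    size, and each block ends with pool_size x pool_size max pooling.
    """
    sizes = [input_size]
    for _ in range(n_blocks):
        # The two convolutions leave the size unchanged under the
        # 'same'-padding assumption; the pooling layer divides it.
        sizes.append(sizes[-1] // pool_size)
    return sizes


print(trace_feature_map_sizes())  # [80, 40, 20, 10, 5]
```

Under these assumptions the Flatten layer would receive 5 × 5 feature maps, which is one way to see why the fully connected part can stay small.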
The output layer is a single neuron that uses a sigmoid activation function. Therefore, a value (hereafter referred to as prediction) between 0 (no ice diffraction artefacts) and 1 (ice diffraction artefacts) is returned for each Helcaraxe plot. The threshold for classification was 0.5. Parameters which control the learning process (hyperparameters) were optimized using the Hyperband optimization algorithm (Li et al., 2018). The network used for predicting ice diffraction in F obs plots (hereafter referred to as the F obs network) was trained and validated using only F obs Helcaraxe plots. The network for predicting I obs plots (hereafter referred to as the I obs network) was trained through transfer learning, by fine-tuning the F obs network on I obs plots with a low learning rate (0.0005). This was done to make sure that the network could adapt to the differences in the I obs Helcaraxe plots without overriding the pattern recognition abilities already acquired by the F obs network. Network design and training were performed using TensorFlow 2.4.1 (Abadi et al., 2016). The final trained networks were selected from multiple training runs based on their performance against the validation set. Their ability to operate reliably on unseen data was tested using the independent test set.
To acquire an overview of how ice diffraction artefacts manifest in Helcaraxe plots, all plots from the training and validation sets which had been manually annotated as not containing ice diffraction were averaged, for amplitudes and intensities, respectively (Fig. 5A and C), as were all plots annotated as containing ice diffraction (Fig. 5B and D). The resulting averaged plots of intensities or structure factor amplitudes with no ice show a uniform vertical gradient. The corresponding plots with ice artefacts show an upward shift in the form of a spike in the middle. It is apparent that the points spread more evenly in F obs than in I obs plots.
Spikes are also more visually prominent in Fig. 5B than in Fig. 5D, that is, in F obs rather than I obs plots. This is potentially a consequence of the conversion of intensities into amplitudes by the French & Wilson method (French & Wilson, 1978), which imposes distribution expectations in order to facilitate conversion.
Two trained networks were selected from multiple training runs based on their performance on the validation set. They were both evaluated against the test set (see section 3.2) to confirm that they can generalize. The performance was measured using three metrics: accuracy (1), sensitivity (2) and specificity (3),
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2)$$
$$\text{specificity} = \frac{TN}{TN + FP} \qquad (3)$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative classifications. Judging by these criteria, the networks perform well on the validation set used in network training and on the independent test set (see Table 2), showing that there was no overfitting of the networks with regard to the training set. The performance of both networks is sufficient to detect ice rings in most cases. Both networks have a higher specificity than sensitivity.
[...] (Fig. 6A) or the diffraction data set contained relatively few reflections (Fig. 6C) (I obs: 5 of 12 false negative classifications, F obs: 2 of 13). Another cause of false negative classification is the presence of data points in the area of the ice spike (I obs: 4 of 12, F obs: 8 of 13) (Fig. 6C). We suspect this occurs when the ice crystals build up during measurement, so that both contaminated and uncontaminated intensities are present in the merged diffraction data. The main reason for false positive classification was a shift or absence of intensities in the usual ice ring range without the typical shape of a spike (as described in definition I in 2.1) (I obs: 7 of the 13 false positive classifications, F obs: 2 of 4).
To obtain insight into the decision-making process of the networks, SmoothGrad (Smilkov et al., 2017) was used to generate sensitivity maps that highlight which area has the most impact on the classification of the Helcaraxe plot.
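For readers unfamiliar with SmoothGrad, the idea is to average the gradient of the network output with respect to the input over several noisy copies of that input. The NumPy sketch below shows only this averaging step; `grad_fn` is a hypothetical stand-in for whatever computes the gradient of a trained network's prediction with respect to the plot, and the sample count and noise level are illustrative defaults, not the settings used for Helcaraxe.

```python
import numpy as np


def smoothgrad(grad_fn, x, n_samples=50, noise_level=0.1, seed=0):
    """Average grad_fn over noisy copies of x (SmoothGrad-style map).

    grad_fn: callable returning d(output)/d(input) for an input array
             (hypothetical; in practice supplied by the ML framework).
    noise_level: noise standard deviation as a fraction of the value
                 range of x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    sigma = noise_level * (x.max() - x.min())
    total = np.zeros_like(x)
    for _ in range(n_samples):
        # Perturb the input with Gaussian noise and accumulate gradients.
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples


# Toy usage: for f(x) = sum(x**2) the gradient is 2*x.
plot = np.random.default_rng(1).random((80, 80))
sensitivity_map = smoothgrad(lambda p: 2.0 * p, plot)
print(sensitivity_map.shape)  # (80, 80)
```

Averaging such maps over all plots of a test set, as done for the figure discussed next, is then a plain mean over the resulting arrays.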
The area at the bottom, especially in the middle (where ice rings usually appear), had the greatest influence on the decision of the network. The edges on the top, right, and left had close to no impact on the classification. Fig. 7 shows that the networks recognize the characteristic ice spike in Helcaraxe plots and use it as an indicator for the classification. The comparison of the two sensitivity maps suggests that the two models have adapted to the properties of their respective Helcaraxe plots (as described in 3.1). F obs plots have a more spread-out distribution than I obs plots, and the sensitivity map shows that the F obs model is sensitive to a broader area.
Figure 6 SmoothGrad's averaged sensitivity mask of the F obs and I obs network against the test set. Larger values indicate a higher significance of the pixel. The area where ice artefacts usually appear has higher relevance than the top, left and right edges. F obs plots have a more spread-out distribution, which is the likely reason why the F obs network also has a larger relevant area.
Performance of both F obs and I obs networks against the test set was compared to other ice ring detection algorithms, namely phenix.xtriage (Adams et al., 2010), CTRUNCATE (Winn et al., 2011), the AUSPEX icefinder score and the recent p ice algorithm (Moreau et al., 2021). Helcaraxe rates the individual resolution ranges in which ice rings can appear (as described in 3.1) and not the complete diffraction data set. Therefore, a data set was labelled as ice ring contaminated when the network classified even a single Helcaraxe plot as contaminated. The algorithm recently introduced by Moreau and colleagues outperformed all previous statistical methods. Helcaraxe reaches an even higher accuracy and sensitivity (both over 93%). Using I obs plots, Helcaraxe has a specificity of 92%; using F obs plots, 97%.
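The data-set-level rule and the three performance metrics can be sketched together as follows. This is an illustration rather than the Helcaraxe implementation: the function names are hypothetical, and treating a prediction strictly greater than 0.5 as "contaminated" is an assumption about how the stated threshold is applied.

```python
import numpy as np


def dataset_is_contaminated(plot_predictions, threshold=0.5):
    """Flag a data set if any of its Helcaraxe plots is classified
    as contaminated (assumed here: prediction > threshold)."""
    return bool(np.any(np.asarray(plot_predictions, dtype=float) > threshold))


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from boolean labels.

    Note: sensitivity (specificity) is undefined if y_true contains
    no positive (negative) example; this sketch does not guard for that.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # ice present, ice predicted
    tn = np.sum(~y_true & ~y_pred)  # no ice, none predicted
    fp = np.sum(~y_true & y_pred)   # no ice, ice predicted
    fn = np.sum(y_true & ~y_pred)   # ice present, none predicted
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

For example, per-plot predictions of `[0.1, 0.9, 0.2]` flag the whole data set, because one resolution range exceeds the threshold. This any-plot rule favours sensitivity at the data-set level, since a single false positive plot is enough to flag an otherwise clean data set.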
In general, Helcaraxe produces the most reliable output in comparison to the traditional statistical methods.
The Helcaraxe networks were then used to analyse 117,615 randomly selected diffraction data sets deposited in the PDB. We found that 21,741 (18.5%) PDB entries show evidence of ice contamination in the processed, scaled, and merged diffraction data. This number is similar to those from previous large-scale analyses of the PDB (19%, Thorn et al., 2017; 16%, Moreau et al., 2021). We analysed the historical evolution of the ice contamination. Ice ring presence started rising in the late 1990s, when cryo techniques became routine at the first synchrotron MX beamlines, and grew steadily when the first sample changers came online. Since the mid-2000s, the fraction of ice contaminated data has stabilized around 19%, even though the advent of pixel detectors translated into shorter measurement times and, consequently, less time in which the crystal is exposed to the cryo-stream, where ice crystals may accumulate on the sample surface. The data produced in this analysis could be a useful starting point for research into the impact ice rings have on structure solution and will be available from Helcaraxe's git repository (https://github.com/thorn-lab/helcaraxe).
Number of deposited diffraction data sets (green) and the number of these depositions which were annotated by Helcaraxe as containing at least one ice ring (blue). The contamination proportion is shown as a line (purple).
Helcaraxe has been integrated into a new Python-based version of AUSPEX (to be published) and can be used to automatically decide whether a specific resolution range contains ice ring artefacts. The previous icefinder score (Thorn et al., 2017) is still used to rate the severity of the artefacts, as Helcaraxe, trained only for the detection and not the assessment of ice rings, provides a less differentiated classification.
An additional discriminator function in the Helcaraxe script detects plots that are completely or partially blank, for example because resolution ranges were omitted during integration. This is achieved by checking whether, for specified resolution bins, the mean I obs (or F obs) is close to 0. If this is true, the plot is marked as non-predictable, and a warning is passed to the user. The runtime of AUSPEX is barely influenced by Helcaraxe, as the prediction is fast (1-3 seconds per diffraction data set). No additional input for AUSPEX is needed to use Helcaraxe, and users of AUSPEX can choose between the Helcaraxe networks and the AUSPEX icefinder score to detect potential ice crystal artefacts.
The identification of ice diffraction artefacts in integrated, scaled and merged data has been an ongoing problem in macromolecular crystallography, even with modern cryo-cooling techniques (Moreau et al., 2021) and new background estimation algorithms. To aid identification in automatic pipelines as well as by users, a set of neural networks named Helcaraxe was developed to identify whether a scaled and merged X-ray diffraction data set from a macromolecular crystal contains ice diffraction contamination. This program presents a significant improvement over previous automatic tools using classical statistical indicators. One area of future exploration would be to combine these approaches: reliable statistical methods such as those recently introduced by Moreau and colleagues could be used as an additional feature in the fully connected part of Helcaraxe's networks. Our work also shows that the multi-dimensional pattern recognition abilities of convolutional neural networks are a valuable addition to the toolbox of diffraction data analysis, and we expect to see a rise of AI methods in this field in the near future.
Helcaraxe is already in use in the Coronavirus Structural Task Force pipeline (Croll et al., 2021) and is integrated into the newest AUSPEX version, which is available through the AUSPEX webserver (https://www.auspex.de).