key: cord-0256167-c2i5eaua
authors: Gildea, Richard J.; Beilsten-Edmands, James; Axford, Danny; Horrell, Sam; Aller, Pierre; Sandy, James; Sanchez-Weatherby, Juan; Owen, C. David; Lukacik, Petra; Strain-Damerell, Claire; Owen, Robin L.; Walsh, Martin A.; Winter, Graeme
title: xia2.multiplex: a multi-crystal data analysis pipeline
date: 2022-04-14
journal: bioRxiv
DOI: 10.1101/2022.01.17.476589
sha: 4139ba310955b5539cb3cc90dfa97119c3589867
doc_id: 256167
cord_uid: c2i5eaua

In macromolecular crystallography radiation damage limits the amount of data that can be collected from a single crystal. It is often necessary to merge data sets from multiple crystals, for example small-wedge data collections on micro-crystals, in situ room-temperature data collections, and collection from membrane proteins in lipidic mesophase. Whilst indexing and integration of individual data sets may be relatively straightforward with existing software, merging multiple data sets from small wedges presents new challenges. Identification of a consensus symmetry can be problematic, particularly in the presence of a potential indexing ambiguity. Furthermore, the presence of non-isomorphous or poor-quality data sets may reduce the overall quality of the final merged data set. To facilitate and help optimise the scaling and merging of multiple data sets, we developed a new program, xia2.multiplex, which takes data sets individually integrated with DIALS and performs symmetry analysis, scaling and merging of multicrystal data sets. xia2.multiplex also performs analysis of various pathologies that typically affect multi-crystal data sets, including non-isomorphism, radiation damage and preferential orientation. After describing a number of use cases, we demonstrate the benefit of xia2.multiplex within a wider autoprocessing framework in facilitating a multi-crystal experiment collected as part of in situ room-temperature fragment screening experiments on the SARS-CoV-2 main protease.


Macromolecular structure determination routinely uses data sets obtained under cryogenic conditions from a single crystal. Radiation damage limits the amount of data that can be collected from a single crystal, and cryocooling vastly increases the dose that a crystal can tolerate, which has led to the dominance of cryocrystallography in macromolecular structure determination (Garman, 1999; Garman & Owen, 2007). However, it is often still necessary to merge multiple data sets from one or more crystals when dealing with radiation-sensitive samples and high-brilliance X-rays from third-generation light sources.

Multi-crystal data collection dates back to the early days of macromolecular crystallography (Kendrew et al., 1960; Clemons Jr et al., 2001), but has seen a resurgence in recent years (Yamamoto et al., 2017) as many scientifically important targets, such as membrane proteins and viruses, frequently yield small, weakly diffracting microcrystals. The development of crystallisation in lipidic mesophases (Caffrey, 2003; Caffrey, 2015) and the availability of microfocus beamlines (Smith et al., 2012) have facilitated data collection and structure solution of these difficult targets. Data collection strategies for small, weakly diffracting crystals rely on collecting many small wedges of data, typically 5-10° per crystal, at cryogenic temperatures.

For samples in lipidic mesophase this is often preceded by X-ray raster scanning to identify the location of crystals (Cherezov et al., 2007; Rasmussen et al., 2011; Rosenbaum et al., 2011; Cherezov et al., 2009; Warren et al., 2013). Such experiments are becoming increasingly automated thanks to developments such as MeshAndCollect (Zander et al., 2015) and ZOO (Hirata et al., 2019).

Multi-crystal data collections have also been applied to experimental phasing, where combining data from multiple crystals enhances weak anomalous signals using high-multiplicity data of sufficient quality to enable structure solution by single-wavelength anomalous dispersion (SAD) (Liu et al., 2011; Liu & Hendrickson, 2015) and sulfur SAD (S-SAD) (Akey et al., 2014; Liu et al., 2014; Huang et al., 2015; Huang et al., 2016; Olieric et al., 2016).

Although cryogenic structures have provided the gold standard for structural analysis of macromolecules for decades, it has been shown that cryocooling can hide biologically-significant structural features (Fraser et al., 2009; Fraser et al., 2011; Fischer et al., 2015) . Certain classes of macromolecular crystals, such as viruses, can also suffer when cryo-cooled. However, room-temperature data collection presents its own challenges, namely that radiation damage occurs at an absorbed dose one to two orders of magnitude lower than at cryogenic temperatures (Helliwell, 1988; Nave & Garman, 2005) . In contrast to cryogenic data collections, an inverse dose-rate effect on crystal lifetime has been observed in room-temperature data (Southworth-Davies et al., 2007) . As a result, obtaining a complete room-temperature data set from a single crystal is difficult, so combining data from multiple crystals becomes necessary.

As demand for room-temperature methods has increased, beamline developments have enabled routine room-temperature data collection on crystals directly from crystallisation plates (in situ). This has the added benefit of eliminating the need for crystal harvesting (Aller et al., 2015; Axford et al., 2015), and there now exists a beamline, VMXi at Diamond Light Source, dedicated to in situ data collection (Sanchez-Weatherby et al., 2019). Advances in beamline and detector technology have enabled the collection of room-temperature data at a higher dose rate (Owen et al., 2014; Schubert et al., 2016), increasing the general applicability of room-temperature data collection (Aller et al., 2015; Broecker et al., 2018).

Merging multiple data sets from small wedges presents a number of challenges.

For novel structures with unknown space group and unit cell parameters, identifying a consensus symmetry can be problematic, particularly in the presence of indexing ambiguities (Brehm & Diederichs, 2014; Kabsch, 2014). The presence of non-isomorphous or poor-quality data sets may also degrade the overall quality of the merged data set. To combat this, various methods have been developed to identify individual non-isomorphous data sets based on comparison of unit cell parameters (Foadi et al., 2013; Zeldin et al., 2015) or intensities (Giordano et al., 2012; Santoni et al., 2017; Diederichs, 2017). Rogue data sets, or even individual bad images, can be identified by algorithms such as the ΔCC1/2 method described by Assmann et al. (2016) and implemented within dials.scale (Beilsten-Edmands et al., 2020).

Microcrystal and room-temperature data collection strategies are a compromise between maximising useful signal and minimising the effects of radiation damage. By analysing radiation damage we can provide rapid feedback to guide an ongoing experiment and truncate the number of images used to produce the best final composite data set. The Rcp statistic introduced by Winter et al. (2019) can also be applied to multi-crystal data, under the assumption that the dose per image is approximately constant for all data sets. This may be appropriate for multi-crystal data collections where approximately uniformly-sized crystals are bathed in the X-ray beam.

Preferential orientation of crystals can be a concern for some multi-crystal data collections, depending on crystal symmetry and morphology, such as plate-like crystals in situ within a flat-bottomed crystallization well. Preferential orientation can lead to under-sampled regions of reciprocal space with systematically low multiplicity or missing reflections, which may have adverse consequences on downstream phasing or refinement. Providing feedback on preferential orientation provides the opportunity for a user to make modifications to their experiment to minimise any resulting issues, for example by fully exploiting the available experimental geometry, or changing the crystallisation conditions or platform (Maeki et al., 2016) .

Structural biologists have become accustomed to highly automated data analysis provided by synchrotron beamlines around the world (Holton & Alber, 2004; Winter, 2010; Vonrhein et al., 2011; Winter & McAuley, 2011; Winter et al., 2013; Monaco et al., 2013; Yamashita et al., 2018) , typically obtaining automated data processing results within minutes of the end of data collection for routine experiments. Multicrystal experiments can generate large volumes of data in minutes, which brings new challenges in terms of bookkeeping and data analysis.

There are two primary aspects in which automated data analysis can support multicrystal experiments. First, rapid feedback from data analysis during beamtime can help guide ongoing experiments, enabling more efficient use of beamtime and allowing a user to more selectively screen sample conditions. Relevant feedback may include suitable metrics on merged data quality, i.e. completeness, multiplicity and resolution, and feedback on experimental pathologies such as non-isomorphism, radiation damage and preferential orientation, that may hinder the experimental goals.

Secondly, after the completion of beamtime, the user may be prepared to invest more time and effort in interactively optimising the best overall data set for any given sample group. Automation is still highly relevant in this context, as the user may have collected data on many sample groups which they wish to process in a similar manner.

Standard autoprocessing pipelines such as xia2 (Winter, 2010) can handle multi-crystal data sets to some extent; however, they are optimised to process a small number of relatively complete data sets, rather than the many tens to hundreds of severely incomplete data sets that comprise a multi-crystal experiment. Recent software developments, for example KAMO (Yamashita et al., 2018), have focused on automating data processing of multi-crystal experiments.

Here we present a new program, xia2.multiplex, which has been developed to facilitate the scaling and merging of multiple data sets. It takes as input data sets individually integrated with DIALS and performs symmetry analysis, scaling and merging, and analyses various pathologies that typically affect multi-crystal data sets, including non-isomorphism, radiation damage and preferential orientation.

xia2.multiplex has been deployed as part of the autoprocessing pipeline at Diamond Light Source, including integration with downstream phasing pipelines such as DIMPLE (http://ccp4.github.io/dimple/) and Big EP (Sikharulidze et al., 2016) .

Using data sets collected as part of in situ room-temperature fragment screening experiments on the SARS-CoV-2 main protease, we demonstrate the use of xia2.multiplex within a wider autoprocessing framework to give rapid feedback during a multi-crystal experiment, and show how the program can be used to further improve the quality of the final merged data set.

Prior to using xia2.multiplex, each data set should be processed individually with DIALS. Data may be processed either in the primitive, P1, setting, or alternatively Bravais symmetry may be determined prior to integration, using dials.refine_bravais_settings. It is not necessary to individually scale the data at this point.

Preliminary filtering of data sets is performed using hierarchical unit cell clustering methods (Zeldin et al., 2015). Histograms and scatterplots of the unit cell distribution are generated for visual analysis, after which symmetry analysis and indexing ambiguity resolution are performed with dials.cosym. Finally the data are scaled with dials.scale, followed by radiation damage and isomorphism analysis. The main sequence of steps taken by xia2.multiplex is outlined in Figure 1.

Initial analysis of the Patterson symmetry of the data is performed using dials.cosym. This is an extension of the methods of Brehm & Diederichs (2014) for resolving indexing ambiguities in partial data sets, which are reviewed here for completeness.

The maximum possible lattice symmetry compatible with the averaged unit cell is used to compile a list of all potential symmetry operations. The matrix of pairwise correlation coefficients is constructed, of size (n × m)², where n is the number of data sets and m is the number of symmetry operations in the lattice group. The Pearson's correlation coefficient between data sets i and j, after application of the kth and lth symmetry operators respectively, is defined according to

Similarly to Brehm & Diederichs (2014), correlation coefficients are only calculated for pairs of data sets with three or more reflections in common. If a pair of data sets has two or fewer common reflections, then the correlation coefficient for that pair is assumed to be zero. The minimum number of common reflections required for calculation of correlation coefficients is configurable in dials.cosym and xia2.multiplex.
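As an illustration of the rule above, the following minimal Python sketch computes a Pearson correlation coefficient over the reflections common to two data sets and returns zero when fewer than three are shared. The function name `pairwise_cc`, the dictionary representation of a data set and the `min_pairs` parameter are assumptions made for this example; they are not the dials.cosym interface.

```python
import numpy as np


def pairwise_cc(intensities_a, intensities_b, min_pairs=3):
    """Pearson CC over the reflections common to two data sets.

    Each argument is a dict mapping a Miller index (h, k, l) to an intensity
    (hypothetical representation). Returns 0.0 when fewer than `min_pairs`
    reflections are shared, mirroring the rule described in the text.
    """
    common = sorted(set(intensities_a) & set(intensities_b))
    if len(common) < min_pairs:
        return 0.0
    a = np.array([intensities_a[hkl] for hkl in common], dtype=float)
    b = np.array([intensities_b[hkl] for hkl in common], dtype=float)
    if a.std() == 0.0 or b.std() == 0.0:
        return 0.0  # correlation is undefined for constant intensities
    return float(np.corrcoef(a, b)[0, 1])
```

In a symmetry search of the kind described here, one of the two sets of Miller indices would presumably first be transformed by the symmetry operator under test before the common reflections are matched.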

Each data set is represented as n × m coordinates in an m-dimensional space. Use of an m-dimensional space allows the presence of up to m orthogonal clusters of x_i, where the orthogonality between clusters corresponds to a correlation coefficient r_{i_k,j_l} close to zero. A modification of algorithm 2 of Brehm & Diederichs (2014), accounting for the additional symmetry-related copies of each data set, is used to iteratively minimise the function

using the L-BFGS minimisation algorithm (Liu & Nocedal, 1989), with randomly assigned starting coordinates x in the range 0-1.
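The minimisation can be sketched in a few lines of Python. This example assumes the functional takes the least-squares form Φ(x) = Σ_{i≠j} (r_ij − x_i · x_j)², in the spirit of Brehm & Diederichs (2014), with each symmetry copy of a data set treated as one row of the correlation matrix. The function name `embed` is hypothetical, and SciPy's `L-BFGS-B` method (used here without bounds) stands in for whichever L-BFGS implementation the program itself uses.

```python
import numpy as np
from scipy.optimize import minimize


def embed(rij, n_dim, seed=0):
    """Find coordinates x whose dot products approximate the matrix rij.

    rij   : symmetric (N, N) array of pairwise correlation coefficients
    n_dim : number of dimensions of the embedding
    Returns the (N, n_dim) coordinates and the final value of the functional.
    """
    n = rij.shape[0]
    mask = 1.0 - np.eye(n)  # exclude the i == j terms from the sum

    def fun(flat):
        x = flat.reshape(n, n_dim)
        resid = mask * (x @ x.T - rij)
        # Value of the functional and its analytical gradient
        return np.sum(resid**2), (4.0 * resid @ x).ravel()

    # Randomly assigned starting coordinates in the range 0-1
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(0.0, 1.0, size=n * n_dim)
    result = minimize(fun, x0, jac=True, method="L-BFGS-B")
    return result.x.reshape(n, n_dim), result.fun
```

The solution is only defined up to a rotation of the coordinate system, so it is the lengths of the vectors and the angles between them, rather than the raw coordinates, that carry meaning.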

It is necessary to use a sufficient number of dimensions to represent any systematic variation present between data sets.

Using m-dimensional space, where m is equal to the number of symmetry operations in the maximum possible lattice symmetry, should be sufficient to represent any systematic variation present due to pseudosymmetry. However, choosing the optimal number of dimensions is a balance between underfitting and overfitting. Using more dimensions than is strictly necessary may reduce the stability of the minimisation, particularly in the case of sparse data, where there is minimal overlap between data sets. As a result, we devised the following procedure to automatically determine the necessary number of dimensions.

3. Determine the 'elbow' point of the plot, in a similar manner to that used by Zhang et al. (2006), to give the optimal number of dimensions.

Alternatively, the user may specify the number of dimensions to be used for the analysis.
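One generic way to locate an elbow point is sketched below in Python: both axes are normalised, and the point furthest from the straight line joining the first and last points is selected. This is a common heuristic offered purely as an illustration; it is not claimed to be the method of Zhang et al. (2006) or the one used in dials.cosym, and the function name `elbow_index` is hypothetical.

```python
import numpy as np


def elbow_index(dims, values):
    """Index of the point furthest from the chord joining the end points.

    dims   : increasing sequence of trial numbers of dimensions
    values : final value of the minimised function for each entry of dims
             (must not all be equal, or the normalisation divides by zero)
    """
    x = np.asarray(dims, dtype=float)
    y = np.asarray(values, dtype=float)
    # Normalise both axes to the range 0-1 so that neither dominates
    xn = (x - x[0]) / (x[-1] - x[0])
    yn = (y - y.min()) / (y.max() - y.min())
    # Perpendicular distance of every point from the first-to-last chord
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    dist = np.abs(dx * (yn - yn[0]) - dy * (xn - xn[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))
```

For example, with trial dimensions 2 to 6 and function values that drop sharply between 2 and 3 dimensions and then flatten, the sketch selects 3 dimensions.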

2.1.2. Identification of symmetry

A modified form of the algorithms from the program POINTLESS (Evans, 2006; Evans, 2011) is used in the determination of the Patterson group symmetry from the results of the initial cosym procedure.

Evans (2011) estimates the likelihood of a symmetry element S k being present, given the correlation coefficient CC k , as
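Given the two conditional distributions discussed in this section, the likelihood can be written in the two-hypothesis form sketched below. The notation !S_k for "symmetry element S_k not present" is an assumption of this sketch, and the expression should be checked against Evans (2011) before it is relied upon.

```latex
p(S_k; CC_k) = \frac{p(CC_k; S_k)}{p(CC_k; S_k) + p(CC_k; !S_k)}
```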

The probability of observing the correlation coefficient CC_k if the symmetry is present, p(CC_k; S_k), is modelled as a truncated Lorentzian centred on the expected value of CC if the symmetry is present, E(CC; S), with a width parameter γ = σ(CC_k).

The distribution of CC_k if the symmetry is not present is modelled as

Diederichs (2017) makes clear the relationship between the results of the clustering procedure outlined above, and the correlation coefficient r_ij between two data sets i and j:

The lengths of the vectors |x_i| are inversely related to the amount of random error, i.e. they provide an estimate of CC*. The maximum possible correlation coefficient between two data sets is given by the product of their CC* values. The angles between two vectors represent genuine systematic differences. For points related by genuine symmetry operations we expect cos[∠(x_i, x_j)] ≈ 1, whereas for points related by symmetry operations that are not present we expect cos[∠(x_i, x_j)] = 0.
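The interpretation above can be illustrated with a short Python sketch, which assumes that the correlation coefficient is approximated by the dot product r_ij ≈ x_i · x_j = |x_i||x_j| cos[∠(x_i, x_j)]. The function name `decompose` is hypothetical.

```python
import numpy as np


def decompose(x):
    """Split coordinates into vector lengths and pairwise cosines.

    x : (N, d) array with one vector per data set (or symmetry copy)
    Returns the lengths |x_i| and the (N, N) matrix of cos[angle(x_i, x_j)].
    """
    lengths = np.linalg.norm(x, axis=1)
    unit = x / lengths[:, None]
    cosines = unit @ unit.T
    return lengths, cosines
```

Under this assumption the matrix of dot products is recovered as the outer product of the lengths multiplied elementwise by the cosines, which separates the random-error contribution (lengths) from the systematic differences (angles).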

We can therefore use cos[∠(x_i, x_j)] in place of CC_k, with E(CC; S) = 1. The estimated error σ(CC_k) used by Evans (2011)

Once a score has been assigned to each potential symmetry operator, all possible point groups compatible with the lattice group are scored as in Evans (2011).

Once the most likely Patterson group has been identified by the above procedure, it is then relatively straightforward to assign a suitable reindexing operation to each data set to ensure that all data sets are consistently indexed. First, a high-density point is chosen as a seed for the cluster. Then, for each data set, the symmetry copy of that data set nearest to the seed is identified. The symmetry operation corresponding to this symmetry copy is then the reindexing operation for this data set.
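The seed-and-nearest-copy assignment can be sketched as follows in Python. The array layout, the function name `choose_reindexing_ops` and the use of a fixed neighbour radius to pick a high-density seed are all assumptions of this example rather than details taken from dials.cosym.

```python
import numpy as np


def choose_reindexing_ops(coords, radius=0.3):
    """Pick one symmetry copy per data set, consistent with a common seed.

    coords : (n_datasets, n_ops, n_dim) array holding the coordinates of
             every symmetry copy of every data set
    radius : neighbour radius used to estimate point density (assumed value)
    Returns, for each data set, the index of the chosen symmetry operation.
    """
    n, m, d = coords.shape
    flat = coords.reshape(n * m, d)
    # Density estimate: number of points within `radius` of each point
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    density = (dists < radius).sum(axis=1)
    seed = flat[np.argmax(density)]
    # For each data set, select the symmetry copy closest to the seed
    to_seed = np.linalg.norm(coords - seed, axis=-1)
    return to_seed.argmin(axis=1)
```

The returned indices select, for every data set, the symmetry operation whose copy falls in the same cluster as the seed, so that applying those operations indexes all data sets consistently.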

After symmetry determination, an overall best estimate of the unit cell is obtained by refinement of the unit cell parameters against the observed 2θ angles, using the program dials.two_theta_refine (Winter et al., 2021). This program refines the unit cell constants by minimising the difference between observed and calculated 2θ values, where the observed values are determined from background-subtracted integrated centroids. This provides an overall best estimate of the unit cell that is a suitable representative average for use in subsequent downstream phasing and refinement.

Data are then scaled using the physical scaling model in dials.scale.

The default cutoff value of CC1/2 = 0.3 is chosen as one that works well in the context of autoprocessing, providing a consistent set of merging statistics for judging data quality during an ongoing experiment. Suitable cutoff values may depend on the downstream data processing requirements, but the current gold standard for publication is to use "paired refinement" to determine the resolution at which including higher resolution data in refinement no longer improves the model (Karplus & Diederichs, 2012).

After the data have been scaled in the Patterson group identified by dials.cosym (§2.1.2), analysis of potential systematic absences is performed by dials.symmetry in order to assign a final space group. In this analysis, the existence of each potential screw axis allowed by the Patterson group is tested by calculating a z-score based on the deviation from zero of the merged ⟨I/σ(I)⟩ for the expected absent reflections. From the individual z-scores, a likelihood for the presence of each screw axis is determined, and these likelihoods are combined to score and select the most likely non-enantiogenic space group.

xia2.multiplex performs a number of analyses that can be useful in assessing the extent of any radiation damage which may be present. Plots of scale factor and Rmerge vs. image number are generated to look for any trends associated with radiation damage. The Rcp statistic introduced by Winter et al. (2019) can also be applied to multi-crystal data. This statistic accumulates the pairwise measured intensity differences as a function of dose (or image number). In the absence of accurate dose information for each data set it is necessary to make the assumption that the dose per image is approximately constant for all data sets. In order to assess how many images per crystal are necessary to achieve a complete data set, a plot of completeness vs. dose is also generated.

Unit cell clustering, as implemented in the program BLEND (Foadi et al., 2013) and elsewhere (Zeldin et al., 2015), is used by xia2.multiplex as a preliminary filtering step to reject any highly non-isomorphous data sets.

xia2.multiplex implements two alternative intensity-based clustering methods that are suitable for identification and analysis of non-isomorphism, once symmetry determination, resolution of indexing ambiguities, and scaling have been carried out as described above. Clustering on correlation coefficients (Giordano et al., 2012; Santoni et al., 2017; Yamashita et al., 2018) begins by calculating a matrix of pairwise correlation coefficients:

A distance matrix defined as d_{i,j} = 1 − r_{i,j} is provided as input to the SciPy (Virtanen et al., 2020) hierarchical clustering routine using the average linkage method. Clusters are sorted by distance, and the completeness and multiplicity of each cluster are reported. Optionally, xia2.multiplex can scale and merge the data sets defined by each cluster that meets user-defined criteria for minimum completeness or multiplicity.
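The clustering step can be sketched with SciPy as follows. The function name `cluster_on_cc` and the use of a distance threshold to cut the dendrogram into flat clusters are assumptions made for illustration; the text only specifies the distance definition and the average linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_on_cc(cc_matrix, threshold):
    """Average-linkage hierarchical clustering on d = 1 - r.

    cc_matrix : symmetric (n, n) array of pairwise correlation coefficients
    threshold : distance at which the dendrogram is cut (assumed parameter)
    Returns the SciPy linkage matrix and a flat cluster label per data set.
    """
    dist = 1.0 - np.asarray(cc_matrix, dtype=float)
    np.fill_diagonal(dist, 0.0)  # a data set is at zero distance from itself
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")
    return tree, labels
```

The linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting, and the flat labels give one possible grouping of data sets for separate scaling and merging.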

A second intensity-based clustering method follows that described by Diederichs (2017), who demonstrated that the methods of Brehm & Diederichs (2014) could be generalised to search for any systematic differences between data sets, not just those caused by an indexing ambiguity. In addition to its use for identifying the Patterson symmetry (§2.1.2), dials.cosym can also be used for analysis of non-isomorphism.

In this mode, rather than searching for the presence of potential additional symmetry operators, the matrix of pairwise correlation coefficients of size n² reduces to Equation 7. The function defined by Equation 2 is minimised as before to obtain a representation of the similarity between data sets in a reduced dimensional space.

As made clear by Diederichs (2017), the length of a vector x_i is inversely proportional to the random error in data set X_i. The angle between vectors x_i and x_j corresponds to the level of systematic error between data sets X_i and X_j, and can thus be used to estimate the degree of non-isomorphism between those data sets.

Analysis of the angular separation of vectors, x, can be used to identify groups of systematically different data sets. Hierarchical clustering on the cosines of the angles between vectors is performed to identify possible groupings of data sets for further investigation. Optionally xia2.multiplex can re-scale multiple subsets of data, which can be controlled by specifying a maximum number of clusters to merge and/or the minimum required completeness or multiplicity for a cluster.
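A minimal Python sketch of clustering on the cosines of the angles between vectors is given below. It uses SciPy's cosine distance, defined as 1 − cos[∠(x_i, x_j)], so that vector lengths (random error) do not influence the grouping. The average linkage method, the distance threshold and the function name `cluster_on_angles` are assumptions of this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_on_angles(x, threshold):
    """Hierarchical clustering on 1 - cos(angle) between vectors.

    x         : (n, d) array with one coordinate vector per data set
    threshold : cosine distance at which the dendrogram is cut (assumed)
    Returns a flat cluster label for each data set.
    """
    condensed = pdist(np.asarray(x, dtype=float), metric="cosine")
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=threshold, criterion="distance")
```

Data sets that share a label point in similar directions and are therefore candidates for merging together, irrespective of differences in their vector lengths.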

The final approach to isomorphism analysis implemented within xia2.multiplex is the ΔCC1/2 method described by Assmann et al. (2016).

The report generated by xia2.multiplex includes stereographic projections of crystal orientation relative to the laboratory frame, generated with the program dials.stereographic_projection. A random distribution of points (each point corresponds to a crystal, or its symmetry equivalent) in a stereographic projection suggests a random distribution of crystal orientation, whereas a systematic non-random distribution may be indicative of preferential crystal orientation. xia2.multiplex also generates a number of plots that can aid in the analysis of the distribution of multiplicities.
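The geometry of such a plot can be illustrated with the standard stereographic projection formula, sketched below in Python with the beam along z. Folding directions onto the hemisphere z ≥ 0 is an assumption of this example, as is the function name; the sketch is not the dials.stereographic_projection implementation.

```python
import numpy as np


def stereographic_projection(directions):
    """Project crystal-axis directions onto the plane normal to the beam (z).

    directions : (n, 3) array of direction vectors in the laboratory frame
    Returns an (n, 2) array of points inside or on the unit circle.
    """
    v = np.asarray(directions, dtype=float)
    v = v / np.linalg.norm(v, axis=1)[:, None]
    # Assumption: a direction and its inverse are plotted as the same point,
    # so fold everything onto the hemisphere with z >= 0
    v = np.where(v[:, 2:3] < 0, -v, v)
    # Standard stereographic projection from the pole at z = -1
    return v[:, :2] / (1.0 + v[:, 2:3])
```

With this convention an axis parallel to the beam projects to the centre of the plot, and an axis perpendicular to the beam projects onto the unit circle.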

A new command, dials.missing_reflections, has been developed to identify

To assess the impact of ΔCC1/2 filtering on the resulting anomalous signal, we performed experimental phasing, structure refinement (via DIMPLE) and calculated anomalous difference maps using data both with and without ΔCC1/2 filtering of outliers. Substructure solution and autotracing were successful in both cases. ΔCC1/2 filtering also resulted in improved merging statistics, typically in CC1/2, CCanom, ⟨d″/sigI⟩, ⟨I/σ(I)⟩ and Rpim vs. resolution (Tables 1 and 2). For the NaBr and Sm soaks there is a particularly significant improvement in Rwork and Rfree after ΔCC1/2 filtering. These two soaks also correspond to the data sets that showed the largest improvement in anomalous difference peak height after removal of outlier data sets (Figure 2d).

We note that merging statistics such as correlation coefficients and R-factors, which are calculated only on the unmerged intensity values without taking into account their errors, can be affected by regions of lower data quality that are suitably downweighted with larger errors during scaling. The presence of these regions however does not adversely affect the resulting merged intensities, which are appropriately weighted.

This disparity is most likely to be evident for high multiplicity data with regions of significant radiation damage, in which case merged data quality indicators are most representative of the data quality.

As outlined in §2.5, there are several different methods available in xia2.multiplex for identifying outlier data sets. Above, we used ΔCC1/2 filtering to identify and exclude outlier partial data sets. Visualisation of the distribution and hierarchical clustering on unit cell parameters for the Sm soak (Figure 3e and f) identify data set 11 as an outlier, which was also the first data set to be excluded by ΔCC1/2 filtering. Similarly, hierarchical clustering on pairwise correlation coefficients (Figure 4a) and on the cosines of the angles between vectors, x, (Figure 4b) both identify data set 11 as an outlier. Whilst in this case all available methods for isomorphism analysis identified data set 11 as the least compatible data set, it is beneficial to have an array of different methods available, as the best method for a particular system may depend on the nature of any non-isomorphism involved.

Previously published in situ data for Haemophilus influenzae TehA were used to further demonstrate the applicability of xia2.multiplex and the tools contained therein. 73 partial data sets were processed individually with DIALS via xia2, providing no prior space group or unit cell information. 71 successfully integrated data sets were provided as input to xia2.multiplex, where data were combined and scaled using dials.cosym and dials.scale. Two data sets were identified as having inconsistent unit cells by preliminary filtering and removed, leaving 69 data sets for subsequent symmetry analysis and scaling. Structure refinement was performed by REFMAC (Murshudov et al., 2011) via DIMPLE. Data processing and refinement statistics using all data, and only those remaining after filtering by ΔCC1/2, are shown in Table 3. Six cycles of scaling and filtering were performed by dials.scale, where exclusion was performed on whole data sets. A single outlier data set (with a cutoff of 3σ) was removed at each of the first five cycles, removing a total of 6.2% of reflections. No significant outliers were identified in the sixth and final cycle.

Structure refinement was performed by REFMAC (Murshudov et al., 2011) via DIMPLE, using the model from PDB entry 4ycr, both with all scaled data and after filtering of outliers using the ΔCC1/2 method. Filtering of outlier data sets leads to a slight improvement in merging statistics, particularly in ⟨I/σ(I)⟩ and Rpim. There is also a slight reduction in the Rwork and Rfree reported by REFMAC.

shows that preferential crystal orientation may be an issue for this experiment (Figures 5c and d).

With the emergence of the novel coronavirus SARS-CoV-2 and the associated coronavirus disease 2019 (COVID-19), the SARS-CoV-2 main protease has quickly emerged as one of the primary targets for antiviral drug development (Jin et al., 2020; Jin et al., 2021; Walsh et al., 2021) . Fragment screening experiments using the XChem platform at Diamond Light Source (Cox et al., 2016; Collins et al., 2017; Krojer et al., 2017) screened over 1250 unique chemical fragments, yielding 74 fragment hits (Douangamath et al., 2020) .

Fragment screening experiments such as these are typically carried out using conventional cryogenic conditions to minimise the effects of radiation damage, with each structure obtained from a single crystal. Room-temperature data, however, can usefully identify or rule out structural artefacts induced by pushing the temperature far from the biologically relevant level (Guven et al., 2021).

Over the course of several beamline visits, room-temperature in situ data were collected for 30 ligand soaks that were previously shown to bind under cryogenic conditions. Here we highlight room-temperature data collections for five ligand soaks that showed evidence of ligand binding at room-temperature: Z1367324110 (PDB:

Analysis of the distribution of unit cell parameters and clustering on unit cell parameters indicated the presence of potential outlier data sets (Figures 7a and b).

Reprocessing with a lower unit cell clustering threshold resulted in improved merging statistics for some data sets (Figures 7e and f). Alternatively, ΔCC1/2 analysis may be useful in identifying outlier data sets. For ligand soak Z4439011520, ΔCC1/2 analysis by dials.scale identified two outlier data sets over two rounds of scaling and filtering (Figures 7c and d). ΔCC1/2 filtering removed data sets 0 and 18, which were also the two least compatible data sets identified by unit cell clustering, although only the latter was identified as an outlier according to the chosen unit cell clustering threshold.

Using the data improved by rejection of outlier data sets as above, initial structure solution was performed using MOLREP (Vagin & Teplyakov, 2010) with 7AEH as the search model. Structures were refined for 200 cycles in REFMAC5 (Murshudov et al., 2011) using rigid body refinement, followed by iterative rounds of restrained refinement with automatic TLS and assisted model building in COOT (Emsley et al., 2010) .

Final data processing and refinement statistics for five ligand soaks, Z1367324110, Z31792168, Z4439011520, Z4439011584 and ABT-957, are reported in Table 4.

Ligand soak ABT-957 is of particular interest, as this unexpectedly crystallised in space group P2₁, in contrast to the space group C2 typical of this protein, and indeed observed for the cryo-structure of this ligand (Redhead et al., 2021). Autoprocessing (including both xia2 and xia2.multiplex) was performed both using the user-specified target space group, C2, and with automatic space group determination. Out of 42 data sets collected, 18 data sets were successfully autoprocessed with DIALS via xia2 in the target space group C2, and combined with xia2.multiplex.

In contrast, all 42 data sets individually processed successfully with automatic space group determination, in a mixture of space groups P1, P2, P2₁ and C2. 33 data sets remained after filtering for inconsistent unit cells. Analysis of symmetry with dials.cosym identified the Patterson group P2/m, which features an indexing ambiguity due to the approximate pseudo-symmetry of the supergroup C2 (Tables 5 and 6).

All of the ligand-soaked structures collected showed a near-identical binding conformation between the cryogenic and room-temperature structures. A minor difference was observed in the conformation of ABT-957, with the C9-N-C1(R) amide bond in the room-temperature structure being flipped compared to the cryogenic structure.

xia2.multiplex has been developed to perform symmetry analysis, scaling and merging of multiple data sets. xia2.multiplex is distributed with DIALS and hence CCP4, and is available as part of the autoprocessing pipelines across MX beamlines at Diamond Light Source, including integration with downstream phasing pipelines such as DIMPLE and Big EP. It is capable of providing near real-time feedback on data quality and completeness during ongoing multi-crystal data collections, and can be used as part of an iterative workflow to obtain the best possible final data set after an experiment.

We have demonstrated its applicability using two previously-published room-temperature in situ multi-crystal data sets, including an example of experimental phasing. Using data sets collected as part of in situ room-temperature fragment screening experiments on the SARS-CoV-2 main protease, we have shown the ability of xia2.multiplex to provide rapid feedback during multi-crystal experiments, including the identification of an unexpected change in space group with ligand addition.

Remaining challenges include automatic identification of the best subset(s) of data to use for downstream analyses, and providing a user interface via applications such as SynchWeb or CCP4 to view results and facilitate an interactive workflow using xia2.multiplex. Support for MTZ files as input is planned in order to support running xia2.multiplex on the output of other data processing software such as XDS (Kabsch, 2010) and MOSFLM (Battye et al., 2011).

pairwise Rij correlation coefficients and (b) the (n × m) vectors x determined by the minimisation of Equation 2 during symmetry determination with dials.cosym. The Rij correlation coefficients are clustered towards 1 and the majority of the vectors x form a single cluster, suggesting the absence of an indexing ambiguity, i.e. the Patterson group of the data set corresponds to the maximum lattice symmetry. (c) and (d) as above, but after symmetry determination and scaling. The distribution of the n² Rij correlation coefficients is sharpened towards 1 as scaling improves the internal consistency of the data. There is also an effect from multiplicity when comparing to (a), as here the n² Rij values are calculated in the highest symmetry group for the lattice. All but one of the n vectors x form a tight cluster, with the vector lengths close to 1. Visualisation of the distribution of unit cell parameters (e) and clustering on unit cell parameters (f) suggests the presence of an outlier data set.

crystals, representing the direction of hkl = 100 and hkl = 001 for each crystal respectively, relative to the beam direction (z), which is shown as the central '+' into the page. A point close to the centre of the circle indicates that the crystal axis is close to parallel with the beam, whereas a point close to the edge of the unit circle indicates that the crystal axis is close to perpendicular with the beam. Preferential orientation can lead to regions with systematically low multiplicity or missing reflections. (e) shows the reflection multiplicities in the 0kl plane, where white corresponds to missing reflections. The bivariate distribution of multiplicities shown in (f) is also indicative of an uneven distribution of multiplicities.

Fig. 7. Outlier identification and removal for SARS-CoV-2 main protease ligand soak Z4439011520. Visualisation of the distribution of unit cell parameters (a) and clustering on unit cell parameters (b) may suggest possible outlier data sets. ΔCC1/2 filtering with dials.scale can also remove data sets that strongly disagree with the majority of data sets (c) and (d). Removing outlier data sets can improve overall merging statistics (e) and (f).

(Redhead et al., 2021) and (b) at room temperature. Contours for the ligand density are drawn at 3σ. (c) and (d) two slightly displaced views of the active site for SARS-CoV-2 main protease in complex with ABT-957 to show the conformational differences observed particularly for the oxopyrrolidine and benzyl moieties of ABT-957 when bound to Mpro at cryo (cyan) and room temperature (green). The structures were superimposed using PyMOL (Schrödinger LLC, 2020).
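The reading of the stereographic projections given in the caption above (centre of the circle for an axis parallel to the beam, edge of the unit circle for an axis perpendicular to it) follows directly from the projection geometry. The following is a minimal Python sketch of that geometry only, not the implementation used by DIALS. The function name `stereographic_projection` is hypothetical, and folding directions with negative z onto the upper hemisphere (treating a crystal axis as undirected) is an assumption made here for illustration.

```python
import math


def stereographic_projection(direction):
    """Project a 3D direction onto the plane perpendicular to the beam (z).

    Assumption: a crystal axis is treated as undirected, so directions with
    negative z are flipped onto the hemisphere facing along the beam.
    """
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    x, y, z = x / norm, y / norm, z / norm
    if z < 0.0:
        x, y, z = -x, -y, -z
    # Projection from the pole opposite the beam onto the equatorial plane.
    return (x / (1.0 + z), y / (1.0 + z))


def radius(point):
    """Distance of a projected point from the centre of the unit circle."""
    return math.hypot(point[0], point[1])


# Axis parallel to the beam: projects to the centre.
parallel = stereographic_projection((0.0, 0.0, 1.0))
# Axis perpendicular to the beam: projects onto the unit circle.
perpendicular = stereographic_projection((1.0, 0.0, 0.0))
# Axis tilted 45 degrees from the beam: radius is tan(45 / 2 degrees).
tilted = stereographic_projection((1.0, 0.0, 1.0))
```

In this sketch the projected radius equals tan(θ/2), where θ is the angle between the crystal axis and the beam, so points crowding near one radius or one sector of the circle would indicate preferential orientation.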

A new program, xia2.multiplex, has been developed to facilitate symmetry analysis, scaling and merging of multi-crystal data sets.

The authors would like to thank the DIALS development team for the various components that provide the foundations of xia2.multiplex, and those within the wider Diamond Light Source software team who have assisted in the deployment of xia2.multiplex. We would also like to thank the Diamond XChem team for

CC1/2 and Rpim data processing statistics for ligand Z4439011520 with the inclusion of progressively more data sets, in data collection order, top left to bottom right. (c) and (d) overall data completeness and gemmi (https://gemmi.readthedocs.io) blob search scores. (e), (f) and (g) the ligand density in the autoprocessed DIMPLE maps for 2, 9 and 20 crystals respectively. All contours are drawn at 3σ.
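To make the Rpim statistic in the caption above concrete, the sketch below evaluates the standard precision-indicating merging R factor, Rpim = Σ_hkl √(1/(n−1)) Σ_i |I_i − ⟨I⟩| / Σ_hkl Σ_i I_i, as data sets are pooled in collection order. It is an illustration under stated assumptions rather than the calculation performed by xia2.multiplex or dials.scale: the function names and toy intensities are hypothetical, the observations are assumed to be already scaled and mapped to symmetry-unique indices, and reflections measured only once are excluded from both sums.

```python
import math
from collections import defaultdict


def r_pim(observations):
    """Compute Rpim from (hkl, intensity) pairs.

    Assumptions: intensities are already scaled, hkl is the symmetry-unique
    index, and singly-measured reflections are skipped.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    numerator = 0.0
    denominator = 0.0
    for intensities in groups.values():
        n = len(intensities)
        if n < 2:
            continue  # no spread can be estimated from one measurement
        mean = sum(intensities) / n
        spread = sum(abs(i - mean) for i in intensities)
        numerator += math.sqrt(1.0 / (n - 1)) * spread
        denominator += sum(intensities)
    if denominator == 0.0:
        raise ValueError("no multiply-measured reflections with non-zero sum")
    return numerator / denominator


def cumulative_r_pim(datasets):
    """Rpim after pooling the first 1, 2, ... data sets in the given order."""
    pooled = []
    values = []
    for dataset in datasets:
        pooled.extend(dataset)
        values.append(r_pim(pooled))
    return values


# Hypothetical toy data: two small data sets in collection order.
dataset_1 = [((1, 0, 0), 100.0), ((1, 0, 0), 110.0), ((0, 1, 0), 50.0)]
dataset_2 = [((0, 1, 0), 50.0), ((0, 1, 0), 56.0), ((0, 0, 1), 10.0)]
trend = cumulative_r_pim([dataset_1, dataset_2])
```

Because of the √(1/(n−1)) weighting, Rpim tends to fall as multiplicity increases, which is why it is a natural statistic to track as further crystals are added.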