key: cord-0149262-160lzm2z
authors: Chung, Szu-Chi; Hung, Cheng-Yu; Siao, Huei-Lun; Wu, Hung-Yi; Chang, Wei-Hau; Tu, I-Ping
title: Cryo-RALib -- a modular library for accelerating alignment in cryo-EM
date: 2020-11-11
journal: nan
DOI: nan
sha: cc3231dba3764bb30c113ad5c82c03d81fa9fda6
doc_id: 149262
cord_uid: 149262

With the enhancement of algorithms, cryo-EM has become the most efficient technique for solving the structures of molecules. Take a recent event for example: after the outbreak of COVID-19 in January, the first structure of the 2019-nCoV Spike trimer was published in March using cryo-EM, which provided crucial medical insight for developing vaccines. The enabler behind this efficiency is GPU-accelerated computation, which shortened the whole analysis process to 12 days. However, the data characteristics, including strong noise, huge dimensionality, large sample size, and high heterogeneity with unknown orientations, make the analysis very challenging. Although the popular GPU-accelerated Bayesian approach has been shown to be successful in 3D refinement, it has been noted that the traditional method based on multireference alignment may better differentiate subtle structural differences under a low signal-to-noise ratio (SNR). However, a modular GPU-acceleration package for multireference alignment is still lacking in the field. In this work, we propose a modular GPU-accelerated alignment library called Cryo-RALib. The library contains both reference-free alignment and multireference alignment routines that can be widely used to accelerate state-of-the-art classification algorithms. In addition, we connect cryo-EM image analysis with the Python data science stack, which enables users to perform data analysis, visualization, and inference more easily. Benchmarked on a TaiWan Computing Cloud container, our implementation accelerates the computation by one order of magnitude. The library has been made publicly available at https://github.com/phonchi/Cryo-RAlib.

In contrast to X-ray crystallography, cryo-EM is amenable to the structural determination of proteins that resist crystallization. Since molecules are captured in their native states, cryo-EM can further analyze mixtures of conformations [1]. With the enhancement of algorithms, cryo-EM has become the mainstream tool for solving structures of molecules at near-atomic resolution. As automated data collection matures, the equipment can run 24/7 and generate 1,000 to 4,000 micrographs in a single day. However, a popular Bayesian cryo-EM clustering technique like RELION [2] takes several days to finish the classification task even with GPU acceleration. Evidently, image classification has become the general bottleneck in the overall workflow.

GPU hardware has been used to accelerate the cryo-EM workflow at different stages in the literature, for example, movie alignment [3], contrast transfer function estimation [4], identification of particles within images [5], Bayesian 2D classification algorithms like RELION [6], and 3D refinement and classification algorithms [6], [7]. It is noted that the computational complexity of the Expectation-Maximization algorithm used in the Bayesian approach is much higher than that of the approach based on multireference alignment (MRA): it is at least on the order of $O(Kt^2 n L^4 \log n)$ compared with $O(K n L^3)$, where $K$ is the number of classes, $n$ is the sample size, $L$ is the number of pixels along one direction of the particle image, and $t$ is the translational shift search range [1].
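As a back-of-the-envelope illustration (our own sketch, not code from the paper), note that the $K$ and $n$ factors cancel in the ratio of the two bounds, so the gap between the approaches is governed by $t^2 L \log n$:

```python
import math

# Hypothetical helper: ratio of the Bayesian/EM cost O(K t^2 n L^4 log n)
# to the MRA cost O(K n L^3). K and n cancel, leaving t^2 * L * log n.
def em_over_mra(t, L, n):
    # The base of the logarithm is not specified in the O-bound;
    # natural log is used here, so only the order of magnitude matters.
    return t**2 * L * math.log(n)

# At the scale of a typical benchmark (n = 1e5 particles, L = 360, t = 10),
# the EM-based approach is roughly five orders of magnitude more expensive.
print(f"EM/MRA cost ratio ~ {em_over_mra(10, 360, 1e5):.1e}")  # ~4e5
```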
The computational complexity increases very steeply with the particle size and the resolution. Consider RELION's benchmark dataset, the 80S ribosome [8] ($n = 10^5$, $L = 360$, assuming $t = 10$ and $K = 4$): the complexity is on the order of $O(10^{20})$. Besides, to further extend to atomic resolution, we must carefully deal with heterogeneity within the dataset. The method based on MRA [9] may better differentiate subtle structural differences under low SNR compared with the Bayesian approach [10].

2D and 3D classification based on the MRA or Bayesian approach is the most popular approach and a standard step in the cryo-EM workflow [9]. The algorithm is conducted as follows, where we use 2D to simplify the illustration, although the extension to 3D is straightforward. Firstly, K initial 2D averages are randomly generated. Secondly, the particle images are compared with those K initial 2D models over different 2D shifts and in-plane rotations. The positions and the 2D models that maximize the similarity measure (usually the cross-correlation (CC)) are recorded for each image. The new 2D averages are then constructed from the aligned images (a minimal sketch of this loop is given at the end of this subsection). After multiple iterations of parameter refinement, the assignment of each particle image to a particular 2D model is usually unambiguous, and the shifts and orientations can be assigned to the particle images. The popular Bayesian approach like RELION [2] differs slightly in that it employs maximum-likelihood algorithms that use fuzzy assignments for both orientations and classes instead of the single best one. The 2D models are then obtained through a weighted average over all possible orientations and classes.

Although the Bayesian approach is very powerful in 3D classification, algorithms based on MRA have been shown to provide better performance in 2D classification. Several powerful algorithms based on MRA are reviewed below. The first is CL2D [11], which develops an approach based on divisive hierarchical clustering and a new probabilistic image similarity metric based on correntropy to reduce the influence of noise and avoid the class imbalance phenomenon. It utilizes MRA in each iteration for class refinement. The second is the iterative stable alignment and clustering (ISAC) algorithm, which brings the concept of robustness into the realm of cryo-EM image processing [12]. It applies a clustering algorithm called EQual-size group K-means (EQK-means), an extension of K-means that enforces clusters of equal size (to solve the class imbalance problem). The images are then passed to a stability test to evaluate whether their orientation parameters are stable across many runs of reference-free alignment (RFA) [13]. The algorithm again uses MRA to update its class averages. The third is Prime [14], which follows the MRA architecture but uses a scheme called stochastic hill climbing, where "the in-plane rotation and class assignment that maximizes the similarity" is replaced with the relaxed requirement of identifying "the first in-plane rotation and class assignment that improves the previous similarity score". Other approaches that utilize MRA can be found in [15]-[17], and approaches that use RFA can be found in [15], [17], [18]. It is noted that most of the works mentioned above do not exploit the GPU. However, since the calculation of similarities is independent for each particle, 2D reference, and orientation, it is possible to calculate them in parallel on the GPU.
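To make the MRA loop concrete, the following is a minimal, brute-force NumPy sketch of a single iteration under simplifying assumptions (a coarse rotation grid, no shifts or mirrors). It illustrates the scheme described above, not the accelerated implementation:

```python
import numpy as np
from scipy.ndimage import rotate

def mra_iteration(images, references, angles=np.arange(0, 360, 10)):
    """One brute-force MRA iteration: assign every image to the (reference,
    rotation) pair with the highest cross-correlation, then rebuild the
    class averages from the aligned images."""
    K = len(references)
    sums = np.zeros_like(references)     # accumulators for the new averages
    counts = np.zeros(K, dtype=int)
    for img in images:
        best = (-np.inf, 0, 0.0)         # (cc, class index, angle)
        for k, ref in enumerate(references):
            for a in angles:
                rot = rotate(img, a, reshape=False)
                cc = np.vdot(rot, ref).real   # unnormalized cross-correlation
                if cc > best[0]:
                    best = (cc, k, a)
        _, k, a = best
        sums[k] += rotate(img, a, reshape=False)
        counts[k] += 1
    return sums / np.maximum(counts, 1)[:, None, None]

# Example: 100 noisy 64x64 images, K = 4 random initial references.
rng = np.random.default_rng(0)
images = rng.normal(size=(100, 64, 64))
refs = rng.normal(size=(4, 64, 64))
new_refs = mra_iteration(images, refs)
```

The triple loop over images, references, and rotations is exactly the independence the text points out: every cross-correlation can be evaluated in parallel, which is what the GPU implementation exploits.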
On the other hand, there are many implementation strategies for the alignment step [19]: alignment in real space, alignment using the 2D FFT, sinograms, alignment using the autocorrelation function, and resampling to polar coordinates (RPC). Among these implementations, RPC has been shown to have the best performance under low SNR, since interpolation errors do not accumulate, and it can be implemented efficiently. In this work, we exploit the parallelism mentioned above. The alignment is based on RPC, and both RFA and MRA are accelerated. In addition, we provide a simple link between cryo-EM image processing and the Python data science stack to enable rapid development of GPU-accelerated cryo-EM image operations. The programs are integrated into an open-source library called Cryo-RALib, which is publicly accessible to the community.

Algorithm 1: GPU-accelerated MRA
Input: set of particle images X; set of reference images Y
Output: set of class averages Y
 1. Find the largest batch size that fits into one GPU using binary search.
 2. repeat
 3.   for each batch do
 4.     Transfer the images X and Y in this batch from host memory into device texture memory.
 5.     do in parallel:
 6.       Convert y_i to polar coordinates and store it in global memory; perform an in-place FFT on the transformed image to obtain Y_i.
 7.       Convert x_i to polar coordinates with the given shift and store it in global memory; perform an in-place FFT on the transformed image to obtain X_i.
 8.       Compute the CC between X_i and Y_i and store it in the CC table.
 9.     Apply an in-place inverse FFT to the CC table.
10.     do in parallel:
11.       Find the largest CC value for each image using parallel reduction; apply the corresponding alignment parameters to each image and store the results back in global memory.
12.     Compute the new Y from the aligned images.
13.     Transfer the averages and parameters to the host.
14. until convergence
15. Return Y and the parameters.
(Notation: x_i and y_i denote images represented in polar coordinates; X_i and Y_i denote their Fourier representations.)

The proposed framework is built upon the popular general-purpose cryo-EM processing library EMAN2 [16]. The architecture of our proposed alignment framework is illustrated in Figure 1; the main program is implemented using the Message Passing Interface (MPI) and Nvidia CUDA. The implementation scales to multiple cores and multiple GPUs, and the processing flow is as follows. Firstly, the particle images are read in parallel and immediately preprocessed by all N MPI processes. Secondly, every process sends its data to the first n MPI processes, which hold the GPU resources, through MPI. Thirdly, the first n MPI processes perform the alignment algorithm described in Algorithm 1. Finally, the first n MPI processes send the data back to all N MPI processes, and all processes write the metadata and transformed images to disk. Notice that the main task is to compute the rotational cross-correlation function in the RPC procedure, defined as

$c(\varphi) = \int_{r_1}^{r_2} \int_{0}^{2\pi} x(r,\theta)\, y(r,\theta+\varphi)\, |r|\, d\theta\, dr,$

where x and y are represented in polar coordinates (a sketch of this computation appears at the end of this subsection).

At the core of our system is the data layout of the image files and the CCs. Our goal is to make the data layout compact. To speed up host-to-device memory transfer, texture memory is used, as shown in Figure 2(a). The information about the texture memory of the given GPU is extracted, and the maximum number of images that fit into one texture object is calculated. In addition, the metadata, including the width and height of the images, is stored alongside the images in the texture object. A floating-point array in global memory is used to store the images after the polar conversion, Fast Fourier Transform (FFT), inverse FFT (IFFT), and alignment.
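As background for the CC table described next, the following NumPy sketch shows how RPC turns the rotational search into 1D FFT correlations along θ via the correlation theorem. The helper names (to_polar, rotational_cc) and the nearest-neighbor sampling are our simplifications, not the library's routines:

```python
import numpy as np

def to_polar(img, n_rings, ring_len, r_min=1.0):
    """Resample a square image onto concentric rings (nearest-neighbor
    sampling to keep the sketch short)."""
    c = (np.array(img.shape) - 1) / 2.0
    radii = np.linspace(r_min, min(c), n_rings)
    thetas = 2 * np.pi * np.arange(ring_len) / ring_len
    rows = (c[0] + radii[:, None] * np.sin(thetas)).round().astype(int)
    cols = (c[1] + radii[:, None] * np.cos(thetas)).round().astype(int)
    return img[rows, cols], radii

def rotational_cc(x, y, n_rings=32, ring_len=128):
    """c(phi) for all discretized phi at once: correlate ring by ring with
    the FFT (correlation theorem), weight by |r|, and sum over rings."""
    px, radii = to_polar(x, n_rings, ring_len)
    py, _ = to_polar(y, n_rings, ring_len)
    X, Y = np.fft.rfft(px, axis=1), np.fft.rfft(py, axis=1)
    per_ring = np.fft.irfft(X * np.conj(Y), n=ring_len, axis=1)
    return (radii[:, None] * per_ring).sum(axis=0)  # length-ring_len c(phi)

# The best in-plane rotation is then simply the argmax over phi, e.g.:
# phi_star = 360.0 * np.argmax(rotational_cc(img, ref)) / 128
```

Because the interpolation happens once, when the image is resampled onto rings, every candidate rotation is evaluated without accumulating further interpolation error, which is the property that makes RPC attractive under low SNR.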
The results of the CC computation are stored in a table, as shown in Figure 2(b). The table holds the CC information of all (mirrored) particle images with all reference images under all shifts. All information is stored in a single contiguous array that can be regarded as a 2D matrix. Each row stores all CC information for one particle image (per reference, per shift, per original/mirrored). This layout ensures that it works well with CUDA kernels that operate on contiguous data ranges. Each CC result stores IFFT(FFT(x_i) × FFT(y_j)), i.e., the CC of particle image i and reference image j, and one CC result slot holds M values.

NVIDIA general-purpose GPU computing [20] enables parallelism through the single-instruction multiple-threads architecture. A GPU contains multiple banks of global memory that are accessible from a grid containing thousands of thread blocks. Each thread block has its own faster but smaller bank of memory, called shared memory, which is available to a number of threads that are executed concurrently. The designer can configure the number of blocks and threads through the built-in 3D variables to increase the throughput of the GPU code, called a kernel. Our implementation follows the CPU implementation from EMAN2 without any significant diversions, as shown in Algorithm 1. We detail some of the major design choices of each step below.

In the Cartesian-to-polar coordinate conversion, we have n_images × n_rings blocks in total, where n_rings blocks process one image. Each block contains len_rings threads, each of which computes one value of the polar representation of an image.

The CC computations of FFT(x) × FFT(y) boil down to a number of scalar products, with an additional factor for weighting the individual rings of the polar representation of the data. Each block processes one particle image and computes its CC values with the given reference images; the reference images span the y-dimension of the grid. Each block contains len_rings threads that sequentially accumulate the CC values across all n_rings rings. Note that the mirrored particle image is also computed immediately. This can be done by multiplying with the complex conjugate (representing a y-axis-flipped image). Consequently, each thread computes two multiplications and writes two complex result values. Finally, the FFT and IFFT are implemented using the optimized CUDA primitives from the cuFFT library.

A modified parallel reduction scheme is employed to find the maximum value of each row in the CC table, as shown in Figure 3. Firstly, two shared-memory regions are employed: the first half stores maximum values and the second half stores their indices in the original array. Secondly, when launching the kernel, we launch n_images blocks. However, we do not choose the number of threads based on the overall data size, since the length of each row is usually very large and can easily exceed the limit (the CUDA architecture limits the number of threads per block to 1,024). Instead, we launch a fixed number of threads (say 128; the number can be adjusted to find what runs fastest) and let each thread (and thus the entire block) loop through memory, computing partial argmax operations on each element in shared memory. Finally, the standard parallel reduction is employed to reduce the initialized memory to find the largest element and its index.
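The reduction scheme can be sketched with a CuPy RawKernel as below. This is an illustrative reconstruction of the described idea (fixed 128-thread blocks, a strided partial argmax, then a tree reduction in shared memory); the kernel and variable names are ours, not the library's:

```python
import cupy as cp

argmax_src = r'''
extern "C" __global__
void row_argmax(const float* table, int row_len,
                float* best_val, int* best_idx) {
    extern __shared__ float smem[];          // first half: candidate maxima
    int* sidx = (int*)&smem[blockDim.x];     // second half: their indices
    int row = blockIdx.x;                    // one block per particle image
    const float* data = table + (size_t)row * row_len;

    // Each thread scans a strided subset of the (very long) row.
    float v = -1e30f; int vi = 0;
    for (int i = threadIdx.x; i < row_len; i += blockDim.x) {
        if (data[i] > v) { v = data[i]; vi = i; }
    }
    smem[threadIdx.x] = v; sidx[threadIdx.x] = vi;
    __syncthreads();

    // Standard tree reduction over the partial results in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && smem[threadIdx.x + s] > smem[threadIdx.x]) {
            smem[threadIdx.x] = smem[threadIdx.x + s];
            sidx[threadIdx.x] = sidx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) { best_val[row] = smem[0]; best_idx[row] = sidx[0]; }
}
'''
kernel = cp.RawKernel(argmax_src, 'row_argmax')

# Toy CC table: one row per particle image; row length would in practice be
# n_refs x n_rotations x n_shifts x 2 (original/mirrored).
n_images, row_len = 1024, 4 * 72 * 49 * 2
table = cp.random.rand(n_images, row_len, dtype=cp.float32)
best_val = cp.empty(n_images, dtype=cp.float32)
best_idx = cp.empty(n_images, dtype=cp.int32)
threads = 128                                  # fixed, tunable block size
kernel((n_images,), (threads,),
       (table, cp.int32(row_len), best_val, best_idx),
       shared_mem=threads * (4 + 4))           # floats + ints per thread
```

The result can be checked against cp.argmax(table, axis=1); the strided loop is what keeps the block size fixed regardless of the row length.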
It is noted that our approach also enables coalesced memory access within each thread block, which improves the memory access speed. The computation of rotation and alignment is conducted in parallel after the parameters for each particle image have been found. In this step, we launch n_images blocks, and each block contains a number of threads equal to the width or height of the image, to fetch pixels of the source image from texture memory and write the transformed image to its destination in global memory.

The software stack of our library is shown in Figure 4. Popular computational cryo-EM packages are built around C and C++ [2], [16], [21], [22]. On the other hand, the recent trend in packages designed for data analysis, visualization, and inference is to build upon Python libraries. However, there is no convenient way to perform cryo-EM data analysis in the Python environment. In our framework, the functionality of EMAN2 and our CUDA code is made available through several well-defined interfaces. We used ctypes [23] and CuPy [24] to implement the required interoperability layer. The returned values can be accessed from the Python environment through the ctypes package. To compute the final averages and calculate statistics of the transformed particle images, we implement a custom class as the GPU array interface. This class encapsulates the __cuda_array_interface__ protocol, which was created for interoperability between different implementations of GPU array-like objects across Python projects. With this custom class, we can easily access the GPU buffer of the transformed images and operate on them in Python (a minimal sketch appears at the end of this subsection). This class helps us reduce host-device data transfers as much as possible.

In the Python environment, we use NumPy [25]/CuPy arrays to represent the particle images; they can be passed to popular computer vision or machine learning libraries for analysis [26], [27]. Besides, they can easily be converted to tensors for use in deep-learning libraries. Finally, the metadata are represented as Pandas [28] dataframes for exploratory data analysis. The images and metadata are stored in either the HDF5 or the Berkeley DB (BDB) file format, which is lightweight and supports parallel reads. Developers and users can use the Jupyter notebook interface with our API to perform a common data analysis pipeline. Several notebooks are included in the library to show that basic image operations like rotation and shift can easily be accelerated and that data analysis and visualization can be performed on the platform.

The typical processing pipeline of our framework is as follows. Firstly, the data source (cryo-EM images and metadata stored in HDF5, BDB, or text files) is read in through the Python wrapper in parallel. Secondly, the user performs preprocessing or exploratory data analysis in a Jupyter notebook. Thirdly, the main alignment and clustering task is performed by calling the MPI script. Finally, the transformed images and metadata can be read into a notebook for further data analysis and visualization. Figure 5 shows an example of exploratory data analysis, where the 1D and 2D histograms of rotations, shifts, and class assignments from the MRA procedure are plotted. We can then analyze the performance of particle picking, or check whether there are problems with preferred orientations or attraction by large clusters [11], etc.
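The custom array class can be sketched as follows. The class name and fields are hypothetical, but the __cuda_array_interface__ contract shown is the standard one that CuPy consumes:

```python
import cupy as cp

class DeviceImageStack:
    """Minimal sketch of the array-interface idea: wrap a raw device pointer
    (e.g. one obtained from the CUDA alignment code through ctypes) so that
    CuPy can view the buffer in place, avoiding a host round-trip."""
    def __init__(self, dev_ptr, n_images, height, width, owner=None):
        self._ptr = dev_ptr        # integer device pointer
        self._owner = owner        # keep the real allocation alive
        self.shape = (n_images, height, width)

    @property
    def __cuda_array_interface__(self):
        return {
            'shape': self.shape,
            'typestr': '<f4',              # little-endian float32
            'data': (self._ptr, False),    # (pointer, read_only flag)
            'version': 2,
        }

# Demo: pretend this buffer came from the CUDA alignment code.
raw = cp.random.rand(8, 64, 64, dtype=cp.float32)
stack = cp.asarray(DeviceImageStack(raw.data.ptr, 8, 64, 64, owner=raw))
class_average = stack.mean(axis=0)   # computed on the GPU, no host transfer
```

Because cp.asarray only reads the interface dictionary, the averages and statistics are computed directly on the device buffer that the alignment kernels wrote, which is how the host-device transfers are kept to a minimum.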
In addition, we use 2SDR [29] and t-SNE [30] to plot the 2D embedding of all particles, as shown in Figure 5(d)-(f). We can then get an idea about the alignment progress and the performance of the 2D classification.

We implemented a system for a fair comparison between the CPU implementation in EMAN2 and our GPU implementation on the TaiWan Computing Cloud (TWCC). The testing scripts run in a TWCC c.super instance with 90 GB of memory, four 3.00 GHz cores of Xeon Gold 6154 processors, and an Nvidia Tesla V100 GPU. The benchmark dataset from RELION, the 80S ribosome [8], is used, with the height and width downsampled to 90 pixels. We first compared against the CPU implementation of MRA from EMAN2, as shown in Figure 6a. The search range in the x and y directions is set to 3, the particle radius to 36 pixels, and the number of iterations to 6. We observe a speedup of 22x to 37x for different numbers of references. We then compared against the CPU implementation of RFA from EMAN2, as shown in Figure 6b. The search step is set to 1, the particle radius to 36 pixels, and the number of iterations to 6. The speedup is 2.4x to 9.4x for different 2D shifts. Figure 6a presents the performance increase of our implementation gained over multiple GPUs. Note that we concentrate on the scalability of the computation of our solution, so the reading time is excluded. It can be observed that our implementation scales well from 1 to 4 GPUs and provides near-linear speedup compared to a single GPU. On the other hand, when the number of GPUs is greater than four, the speedup saturates because the time is dominated by data transfer and data distribution.

With the enhancement of algorithms, cryo-EM has become the most efficient technique for solving the structures of molecules. However, the performance of the popular GPU-accelerated Bayesian approach is not satisfying for 2D classification, either in classification quality or in computation time. Therefore, other powerful methods based on MRA are needed. In the literature, however, MRA-based methods are not GPU-accelerated. In this work, we propose a modular GPU-accelerated alignment library called Cryo-RALib. The library contains several alignment routines that can be widely used to accelerate state-of-the-art classification algorithms. We also provide several well-defined interfaces to connect cryo-EM image analysis with the Python data science stack, which helps users perform analysis and visualization more easily. Our implementation accelerates the alignment procedure by one order of magnitude according to the benchmark results on TWCC. Further extending the library to accommodate more popular cryo-EM packages into the framework is an interesting task. Specifically, our goal is to provide a friendly environment for both users and developers to efficiently analyze their data and develop new methods with GPU acceleration.
References
[1] Computational methods for single-particle electron cryomicroscopy.
[2] RELION: implementation of a Bayesian approach to cryo-EM structure determination.
[3] MotionCor2: anisotropic correction of beam-induced motion for improved cryo-electron microscopy.
[4] Gctf: real-time CTF determination and correction.
[5] SPHIRE-crYOLO is a fast and accurate fully automated particle picker for cryo-EM.
[6] Accelerated cryo-EM structure determination with parallelisation using GPUs in RELION-2.
[7] cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination.
[8] Cryo-EM structure of the Plasmodium falciparum 80S ribosome bound to the anti-protozoan drug emetine.
[9] A primer to single-particle cryo-electron microscopy.
[10] Massively parallel unsupervised single-particle cryo-EM data clustering via statistical manifold learning.
[11] A clustering approach to multireference alignment of single-particle projections in electron microscopy.
[12] Iterative stable alignment and clustering of 2D transmission electron microscope images.
[13] Three-dimensional reconstruction of single particles embedded in ice.
[14] A stochastic hill climbing approach for simultaneous 2D alignment and clustering of cryogenic electron microscopy images.
[15] Multivariate statistical analysis of large datasets: single particle electron microscopy.
[16] EMAN2: an extensible image processing suite for electron microscopy.
[17] SPIDER and WEB: processing and visualization of images in 3D electron microscopy and related fields.
[18] Pre-pro is a fast pre-processor for single-particle cryo-EM by enhancing 2D classification.
[19] Efficiency of 2D alignment methods.
[20] CUDA: scalable parallel programming for high-performance scientific computing.
[21] Xmipp: a new generation of an open-source image processing package for electron microscopy.
[22] cisTEM, user-friendly software for single-particle image processing.
[23] Python for scientific computing.
[24] CuPy: a NumPy-compatible library for NVIDIA GPU calculations.
[25] Array programming with NumPy.
[26] Scikit-learn: machine learning in Python.
[27] RAPIDS: collection of libraries for end-to-end GPU data science.
[28] The pandas development team.
[29] Two-stage dimension reduction for noisy high-dimensional images and application to cryogenic electron microscopy.
[30] Visualizing data using t-SNE.

Acknowledgments
We would like to thank Ryan Jeng from Nvidia and the NCHC GPU Hackathon for their help with our research project. We would also like to thank the open-source projects that we built upon, especially EMAN2 and gpu isac 2.3.2.