is the regularization parameter of the nuclear norms. The hyperparameter $\lambda$ can in fact differ among the sources $k$, in which case the regularizer in ( ) can be replaced by $\sum_{k=1}^{K} \lambda_k \|\boldsymbol{T}_k\|_*$ if necessary. Please see the Supplementary Information for the reason to select subtype-specific regularization terms.

bioRxiv preprint doi: https://doi.org/ . / . . . ; this version posted January , . The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Optimization of swCAM objective function

The objective function in ( ) is bi-convex w.r.t. the two block-wise variables, i.e. $\boldsymbol{A} \triangleq [\boldsymbol{a}_1^T, \ldots, \boldsymbol{a}_M^T]^T$ and $\boldsymbol{T} \triangleq [\boldsymbol{T}_1^T, \ldots, \boldsymbol{T}_K^T]^T \in \mathbb{R}^{KL \times M}$. Accordingly, we can solve ( ) by alternately solving the following two convex subproblems until convergence:

$$\boldsymbol{T}^{p+1} \in \arg\min_{\Delta\boldsymbol{S}_i \succeq -\bar{\boldsymbol{S}},\ \forall i} \mathcal{J}(\boldsymbol{A}^{p}, \boldsymbol{T}) \qquad (\ )$$

$$\boldsymbol{A}^{p+1} \in \arg\min_{\boldsymbol{A} \succeq \boldsymbol{0}_{M \times K},\ \boldsymbol{A}\boldsymbol{1}_K = \boldsymbol{1}_M} \mathcal{J}(\boldsymbol{A}, \boldsymbol{T}^{p+1}) \qquad (\ )$$

where

$$\mathcal{J}(\boldsymbol{A}, \boldsymbol{T}) \triangleq \frac{1}{2}\sum_{i=1}^{M} \|\boldsymbol{x}_i - \boldsymbol{a}_i(\bar{\boldsymbol{S}} + \Delta\boldsymbol{S}_i)\|_2^2 + \lambda \sum_{k=1}^{K} \|\boldsymbol{T}_k\|_*$$

The CAM-estimated subtype-specific expression matrix serves as the initial reference $\bar{\boldsymbol{S}}$. Note that in ( ) and ( ) we have implicitly used the following relationship for concise representation: $\boldsymbol{T} \triangleq [\mathrm{vec}(\Delta\boldsymbol{S}_1^T), \ldots, \mathrm{vec}(\Delta\boldsymbol{S}_M^T)]$. Subproblem ( ) can be decoupled w.r.t. each row of $\boldsymbol{A}$:

$$\boldsymbol{a}_i^{p+1} \in \arg\min_{\boldsymbol{a}_i \succeq \boldsymbol{0}_K,\ \boldsymbol{a}_i\boldsymbol{1}_K = 1} \frac{1}{2}\|\boldsymbol{x}_i - \boldsymbol{a}_i(\bar{\boldsymbol{S}} + \Delta\boldsymbol{S}_i^{p+1})\|_2^2$$

which can be solved using quadratic programming. If a prior proportion matrix or the CAM-estimated proportion matrix is already of high quality, we can skip the alternating optimization over $\boldsymbol{A}$ and obtain $\boldsymbol{T}$ by solving subproblem ( ) only once.

To solve ( ), we notice that the main bottleneck is the huge number of variables (typically, $L$ is several tens of thousands), which prevents conventional convex solvers from being readily applicable here.
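The row-wise update of $\boldsymbol{A}$ described above is a least-squares problem over the probability simplex. The following is a minimal sketch of one way to solve it, assuming NumPy and using projected gradient descent in place of a general quadratic-programming solver (a swapped-in technique, not necessarily the authors' choice). The function names `project_simplex` and `update_proportions`, the iteration count, and the toy data are all hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = k[u - css / k > 0][-1]              # last index with a positive gap
    theta = css[rho - 1] / rho                # shift that makes the sum equal 1
    return np.maximum(v - theta, 0.0)

def update_proportions(x, S, n_iter=2000):
    """Projected gradient for min_a 0.5 * ||x - a S||^2 on the simplex.

    x : (L,) mixed expression of one sample
    S : (K, L) sample-specific subtype expression (reference plus variation)
    """
    G = S @ S.T                               # K x K Gram matrix
    step = 1.0 / np.linalg.eigvalsh(G)[-1]    # 1 / Lipschitz constant of the gradient
    a = np.full(S.shape[0], 1.0 / S.shape[0]) # start from uniform proportions
    for _ in range(n_iter):
        grad = a @ G - x @ S.T                # gradient of the quadratic loss
        a = project_simplex(a - step * grad)
    return a

# Toy, noise-free example (arbitrary sizes chosen for illustration only).
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=(3, 50))
a_true = np.array([0.2, 0.3, 0.5])
x = a_true @ S
a_hat = update_proportions(x, S)
```

Because $K$ is small, each such problem is cheap; the expensive part of the alternating scheme is the update of $\boldsymbol{T}$, which is why it receives the dedicated treatment that follows.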
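We propose to solve ( ) by adapting the alternating direction method of multipliers (ADMM), which has been widely applied to many large-scale problems in areas such as statistical learning, image processing and computational biology (Boyd, Parikh et al. ). ADMM naturally allows decoupling the non-smooth regularization term from the smooth loss term, which is computationally advantageous. Specifically, we reformulate ( ) in a form in which the primal variable is “split” into several parts, with the associated objective function “separable” across this splitting (Boyd, Parikh et al. ). We will use the following definitions:

$$
\begin{aligned}
\boldsymbol{T} &\triangleq [\mathrm{vec}(\Delta\boldsymbol{S}_1^T), \ldots, \mathrm{vec}(\Delta\boldsymbol{S}_M^T)] = \begin{bmatrix} \boldsymbol{T}_1 \\ \vdots \\ \boldsymbol{T}_K \end{bmatrix} \in \mathbb{R}^{KL \times M} \\
\boldsymbol{S} &\triangleq [\mathrm{vec}(\boldsymbol{S}_1^T), \ldots, \mathrm{vec}(\boldsymbol{S}_M^T)] \in \mathbb{R}^{KL \times M} \\
\boldsymbol{V} &\triangleq \boldsymbol{X}^T \in \mathbb{R}^{L \times M} \\
\boldsymbol{W} &\triangleq \begin{bmatrix} \boldsymbol{T} \\ \boldsymbol{S} \end{bmatrix} \in \mathbb{R}^{2KL \times M} \\
\boldsymbol{C}_1 &\triangleq \begin{bmatrix} \boldsymbol{I}_{KL} \\ \boldsymbol{I}_{KL} \end{bmatrix} \in \mathbb{R}^{2KL \times KL} \\
\boldsymbol{C}_2 &\triangleq -\boldsymbol{I}_{2KL} \in \mathbb{R}^{2KL \times 2KL} \\
\boldsymbol{C}_3 &\triangleq \begin{bmatrix} \boldsymbol{1}_M^T \otimes \mathrm{vec}(\bar{\boldsymbol{S}}^T) \\ \boldsymbol{0}_{KL \times M} \end{bmatrix} \in \mathbb{R}^{2KL \times M} \\
\boldsymbol{B} &\triangleq [\boldsymbol{0}_{KL \times KL}, \boldsymbol{I}_{KL}] \in \mathbb{R}^{KL \times 2KL} \\
\boldsymbol{B}_k &\triangleq [\boldsymbol{0}_{L \times (k-1)L}, \boldsymbol{I}_L, \boldsymbol{0}_{L \times (K-k)L}, \boldsymbol{0}_{L \times KL}] \in \mathbb{R}^{L \times 2KL}, \quad k = 1, \ldots, K
\end{aligned}
$$

Then we can simplify ( ) as the equivalent form:

$$
\min_{\boldsymbol{U} \in \mathbb{R}^{KL \times M},\ \boldsymbol{W} \in \mathbb{R}^{2KL \times M}} \frac{1}{2}\|\mathcal{A}(\boldsymbol{U}) - \boldsymbol{V}\|_F^2 + \lambda \sum_{k=1}^{K} \|\boldsymbol{B}_k\boldsymbol{W}\|_* + I_+(\boldsymbol{B}\boldsymbol{W}) \qquad (\ )
$$

$$
\text{s.t.}\quad \boldsymbol{C}_1\boldsymbol{U} + \boldsymbol{C}_2\boldsymbol{W} = \boldsymbol{C}_3,
$$

where $I_+(\cdot)$ is the indicator function for the non-negative orthant: $I_+(\boldsymbol{B}\boldsymbol{W}) = I_+(\boldsymbol{S}) = 0$ if $\boldsymbol{S} \succeq \boldsymbol{0}_{KL \times M}$, and $I_+(\boldsymbol{S}) = +\infty$ otherwise. The linear transformation in the first term is $\mathcal{A}(\boldsymbol{U}) = \mathcal{A}([\boldsymbol{u}_1, \ldots, \boldsymbol{u}_M]) = [\boldsymbol{H}_1\boldsymbol{u}_1, \ldots, \boldsymbol{H}_M\boldsymbol{u}_M]$ with $\boldsymbol{H}_i = [\boldsymbol{a}_i^p \otimes \boldsymbol{I}_L]$, $i = 1, \ldots, M$. Note that ( ) is in the ADMM form w.r.t. the two split block variables $\boldsymbol{U}$ and $\boldsymbol{W}$, and, once ( ) is solved, the solution of ( ) can be obtained by $\boldsymbol{T}^{p+1} = [\boldsymbol{I}_{KL}, \boldsymbol{0}_{KL \times KL}]\boldsymbol{W}^*$.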
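Given a penalty parameter $\gamma > 0$ (empirically, $\gamma :=$ generally guarantees good convergence speed), the augmented Lagrangian (ignoring some irrelevant terms) of problem ( ) is defined by

$$
\mathcal{L}(\boldsymbol{U}, \boldsymbol{W}, \boldsymbol{Z}) = \frac{1}{2}\|\mathcal{A}(\boldsymbol{U}) - \boldsymbol{V}\|_F^2 + \lambda \sum_{k=1}^{K} \|\boldsymbol{B}_k\boldsymbol{W}\|_* + I_+(\boldsymbol{B}\boldsymbol{W}) + \frac{\gamma}{2}\|\boldsymbol{C}_1\boldsymbol{U} + \boldsymbol{C}_2\boldsymbol{W} - \boldsymbol{C}_3 - \boldsymbol{Z}\|_F^2
$$

where “$-\gamma\boldsymbol{Z}$” $\in \mathbb{R}^{2KL \times M}$ is the dual variable (or Lagrange multiplier) associated with the constraint $\boldsymbol{C}_1\boldsymbol{U} + \boldsymbol{C}_2\boldsymbol{W} = \boldsymbol{C}_3$. Then, ADMM solves ( ) via the following iterative procedure:

$$\boldsymbol{U}^{q+1} \in \arg\min_{\boldsymbol{U} \in \mathbb{R}^{KL \times M}} \mathcal{L}(\boldsymbol{U}, \boldsymbol{W}^{q}, \boldsymbol{Z}^{q}) \qquad (\ \text{a})$$

$$\boldsymbol{W}^{q+1} \in \arg\min_{\boldsymbol{W} \in \mathbb{R}^{2KL \times M}} \mathcal{L}(\boldsymbol{U}^{q+1}, \boldsymbol{W}, \boldsymbol{Z}^{q}) \qquad (\ \text{b})$$

$$\boldsymbol{Z}^{q+1} = \boldsymbol{Z}^{q} - (\boldsymbol{C}_1\boldsymbol{U}^{q+1} + \boldsymbol{C}_2\boldsymbol{W}^{q+1} - \boldsymbol{C}_3) \qquad (\ \text{c})$$

where $\boldsymbol{W}$ can be initialized by $[\boldsymbol{T}^T, \boldsymbol{U}^T]^T$ with $\boldsymbol{T} = \boldsymbol{0}_{KL \times M}$ and $\boldsymbol{U} = \boldsymbol{1}_M^T \otimes \mathrm{vec}(\bar{\boldsymbol{S}}^T)$; $\boldsymbol{Z}$ can be simply initialized by $\boldsymbol{0}_{2KL \times M}$. As we will show, both ( a) and ( b) can be solved with closed-form expressions, thanks to the decomposability of ADMM.

Fig. . The objective function of swCAM for the sample-specific deconvolution problem and its reformulation by ADMM. (For convenient illustration, the $\boldsymbol{T}$ matrices in all figures are the transposed versions of those in the text and equations.)

Notice that ( a) is a column-wise separable optimization problem, so we can decouple it w.r.t. each column of $\boldsymbol{U}$:

$$\boldsymbol{u}_i^{q+1} \in \arg\min_{\boldsymbol{u}_i \in \mathbb{R}^{KL}} \frac{1}{2}\|\boldsymbol{H}_i\boldsymbol{u}_i - \boldsymbol{v}_i\|_2^2 + \frac{\gamma}{2}\|\boldsymbol{C}_1\boldsymbol{u}_i + \boldsymbol{y}_i^{q}\|_2^2 \qquad (\ )$$

where $[\boldsymbol{y}_1^{q}, \ldots, \boldsymbol{y}_M^{q}] \triangleq \boldsymbol{C}_2\boldsymbol{W}^{q} - \boldsymbol{C}_3 - \boldsymbol{Z}^{q}$. Subproblem ( ) is an unconstrained quadratic problem, whose solution is

$$\boldsymbol{u}_i^{q+1} = (\boldsymbol{H}_i^T\boldsymbol{H}_i + \gamma\boldsymbol{C}_1^T\boldsymbol{C}_1)^{-1}(\boldsymbol{H}_i^T\boldsymbol{v}_i - \gamma\boldsymbol{C}_1^T\boldsymbol{y}_i^{q}). \qquad (\ )$$

Since $\boldsymbol{H}_i^T\boldsymbol{H}_i = ((\boldsymbol{a}_i^p)^T\boldsymbol{a}_i^p) \otimes \boldsymbol{I}_L$ and $\boldsymbol{C}_1^T\boldsymbol{C}_1 = 2\boldsymbol{I}_{KL}$, the matrix inversion can be sped up by

$$(\boldsymbol{H}_i^T\boldsymbol{H}_i + \gamma\boldsymbol{C}_1^T\boldsymbol{C}_1)^{-1} = ((\boldsymbol{a}_i^p)^T\boldsymbol{a}_i^p + 2\gamma\boldsymbol{I}_K)^{-1} \otimes \boldsymbol{I}_L.$$

The right term in ( ) can also be simplified as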
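$$\boldsymbol{H}_i^T\boldsymbol{v}_i - \gamma\boldsymbol{C}_1^T\boldsymbol{y}_i^{q} = (\boldsymbol{a}_i^p)^T \otimes \boldsymbol{x}_i^T - \gamma(\boldsymbol{y}_{i,1}^{q} + \boldsymbol{y}_{i,2}^{q}),$$

where $\boldsymbol{y}_i^{q} = [(\boldsymbol{y}_{i,1}^{q})^T, (\boldsymbol{y}_{i,2}^{q})^T]^T$ with $\boldsymbol{y}_{i,1}^{q} \in \mathbb{R}^{KL}$ and $\boldsymbol{y}_{i,2}^{q} \in \mathbb{R}^{KL}$ being the first and second halves of $\boldsymbol{y}_i^{q}$, respectively. Finally, the column vectors of $\boldsymbol{U}^{q+1}$ in ( a) can be computed quickly by

$$\boldsymbol{u}_i^{q+1} = \mathrm{vec}\left\{ \mathrm{devec}\left\{ (\boldsymbol{a}_i^p)^T \otimes \boldsymbol{x}_i^T - \gamma(\boldsymbol{y}_{i,1}^{q} + \boldsymbol{y}_{i,2}^{q}) \,\middle|\, L, K \right\} ((\boldsymbol{a}_i^p)^T\boldsymbol{a}_i^p + 2\gamma\boldsymbol{I}_K)^{-1} \right\} \qquad (\ )$$

To solve ( b), we remove some irrelevant terms from its objective function:

$$\min_{\boldsymbol{W} \in \mathbb{R}^{2KL \times M}} \lambda \sum_{k=1}^{K} \|\boldsymbol{B}_k\boldsymbol{W}\|_* + I_+(\boldsymbol{B}\boldsymbol{W}) + \frac{\gamma}{2}\|\boldsymbol{C}_1\boldsymbol{U}^{q+1} + \boldsymbol{C}_2\boldsymbol{W} - \boldsymbol{C}_3 - \boldsymbol{Z}^{q}\|_F^2, \qquad (\ )$$

and then, by defining $\boldsymbol{U}_k^{q+1} \in \mathbb{R}^{L \times M}$, $k = 1, \ldots, K$, as the block matrices from top to bottom in $\boldsymbol{U}^{q+1} \in \mathbb{R}^{KL \times M}$, and $\boldsymbol{Z}_k \in \mathbb{R}^{L \times M}$, $k = 1, \ldots, K$, and $\boldsymbol{Z}_0 \in \mathbb{R}^{KL \times M}$ as the block matrices from top to bottom in $\boldsymbol{Z} \in \mathbb{R}^{2KL \times M}$ (i.e., $\boldsymbol{Z} \triangleq [\boldsymbol{Z}_1^T, \ldots, \boldsymbol{Z}_K^T, \boldsymbol{Z}_0^T]^T$), we decouple the objective function ( ) into functions of $\boldsymbol{T}_k$, $k = 1, \ldots, K$, and $\boldsymbol{S}$:

$$\min_{\boldsymbol{W} \in \mathbb{R}^{2KL \times M}} \sum_{k=1}^{K} \left\{ \lambda\|\boldsymbol{T}_k\|_* + \frac{\gamma}{2}\|\boldsymbol{U}_k^{q+1} - \boldsymbol{T}_k - \boldsymbol{1}_M^T \otimes \bar{\boldsymbol{s}}_k - \boldsymbol{Z}_k^{q}\|_F^2 \right\} + \left\{ I_+(\boldsymbol{S}) + \frac{\gamma}{2}\|\boldsymbol{U}^{q+1} - \boldsymbol{S} - \boldsymbol{Z}_0^{q}\|_F^2 \right\}$$

Therefore, $\boldsymbol{W}^{q+1}$ can be solved by the proximal point algorithm (PPA) (Parikh and Boyd ). Specifically, we have $\boldsymbol{W}^{q+1} = [(\boldsymbol{T}_1^{q+1})^T, \ldots, (\boldsymbol{T}_K^{q+1})^T, (\boldsymbol{S}^{q+1})^T]^T$, in which

$$\boldsymbol{T}_k^{q+1} \in \arg\min_{\boldsymbol{T}_k \in \mathbb{R}^{L \times M}} \lambda\|\boldsymbol{T}_k\|_* + \frac{\gamma}{2}\|\boldsymbol{U}_k^{q+1} - \boldsymbol{T}_k - \boldsymbol{1}_M^T \otimes \bar{\boldsymbol{s}}_k - \boldsymbol{Z}_k^{q}\|_F^2 \qquad (\ \text{a})$$

$$\boldsymbol{S}^{q+1} \in \arg\min_{\boldsymbol{S} \in \mathbb{R}^{KL \times M}} I_+(\boldsymbol{S}) + \frac{\gamma}{2}\|\boldsymbol{U}^{q+1} - \boldsymbol{S} - \boldsymbol{Z}_0^{q}\|_F^2 \qquad (\ \text{b})$$

Note that ( a) and ( b) are exactly the proximal operators of $\|\boldsymbol{T}_k\|_*$ and $I_+(\boldsymbol{S})$, respectively (Parikh and Boyd ), and their closed-form solutions are given by

$$\boldsymbol{T}_k^{q+1} = \sum_{\ell=1}^{r} \left(\sigma_{k\ell} - \frac{\lambda}{\gamma}\right)_+ \boldsymbol{\mu}_{k\ell}\boldsymbol{\nu}_{k\ell}^T, \quad k = 1, \ldots, K, \qquad (\ )$$

$$\boldsymbol{S}^{q+1} = [\boldsymbol{U}^{q+1} - \boldsymbol{Z}_0^{q}]_+, \qquad (\ )$$

where the singular value decomposition (SVD) is performed ahead of the computation of ( ), i.e. $\boldsymbol{U}_k^{q+1} - \boldsymbol{1}_M^T \otimes \bar{\boldsymbol{s}}_k - \boldsymbol{Z}_k^{q} = \sum_{\ell=1}^{r} \sigma_{k\ell}\boldsymbol{\mu}_{k\ell}\boldsymbol{\nu}_{k\ell}^T$.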
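The two closed-form updates above are standard proximal operators: singular value soft-thresholding for the nuclear norm, and clipping at zero for the non-negativity indicator. The sketch below illustrates both with NumPy. The function names `svt` and `project_nonneg` and the toy values of `lam` and `gamma` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding: prox of tau * nuclear norm evaluated at Y.

    Shrinks every singular value of Y by tau and drops those that fall below zero.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt              # scale columns of U, then recombine

def project_nonneg(Y):
    """Prox of the non-negative-orthant indicator: clip negative entries to zero."""
    return np.maximum(Y, 0.0)

# Toy check with hypothetical parameter values: threshold = lam / gamma = 2.
lam, gamma = 4.0, 2.0
Y = np.diag([3.0, 1.0])                     # singular values 3 and 1
T_new = svt(Y, lam / gamma)                 # singular values become 1 and 0
S_new = project_nonneg(np.array([[1.5, -0.2], [-3.0, 0.0]]))
```

In the T-update, `Y` would play the role of the matrix whose SVD is taken above, one $L \times M$ block per subtype, so each ADMM iteration costs $K$ thin SVDs rather than one SVD of the full $KL \times M$ matrix.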
A reasonable termination criterion is that the primal residual, $pri = \|\boldsymbol{C}_1\boldsymbol{U} + \boldsymbol{C}_2\boldsymbol{W} - \boldsymbol{C}_3\|$, and the dual residual, $dual = \|\gamma\boldsymbol{C}_1^T\boldsymbol{C}_2(\boldsymbol{W}^{q+1} - \boldsymbol{W}^{q})\|$, are both smaller than a predefined tolerance.

Model parameter tuning

In noisy scenarios, the setting of the penalty parameter $\lambda$ is critical in determining how much variation is preserved as patterns of interest and how much is ignored as noise. An extremely large $\lambda$ will force the individual variation to zero. Decreasing $\lambda$ allows more subtype-specific patterns to be detected, until overfitting sets in. Cross-validation is a popular parameter-tuning strategy for balancing underfitting and overfitting. One round of cross-validation excludes a certain portion of samples and uses the model learned from the other samples to predict the excluded ones. Every model is then assessed by summarizing prediction performance across multiple rounds. However, our sample-specific deconvolution estimates the individual expression of each sample in each subtype, which cannot be used to predict excluded samples directly. Thus, we propose to randomly exclude entries rather than samples of the $\boldsymbol{X}$ matrix (Fig. ), similar to the strategy used in missing-value imputation. The strategy works because the low-rank patterns in the $\boldsymbol{T}_k$ matrices are detectable from only a portion of the $\boldsymbol{X}$ entries and can predict the excluded $\boldsymbol{X}$ entries.

Fig. . -fold cross-validation strategy for model parameter tuning. A part of the entries is randomly removed before applying swCAM. The removed entries are reconstructed from the estimated $\boldsymbol{T}$ matrix and compared to the observed expressions to compute the RMSE, which decides the optimal parameter $\lambda$.
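The entry-wise hold-out scheme summarized in the caption above can be sketched as follows, assuming NumPy. This is only an illustration of the bookkeeping: `fit_and_reconstruct` is a hypothetical callable standing in for the swCAM solve with a given $\lambda$, and the number of folds and matrix size in the toy example are arbitrary, not the values used in the paper.

```python
import numpy as np

def entrywise_folds(shape, n_folds, seed=0):
    """Assign every matrix entry to one of n_folds folds, uniformly at random."""
    rng = np.random.default_rng(seed)
    n = shape[0] * shape[1]
    labels = np.arange(n) % n_folds          # fold sizes differ by at most one
    return rng.permutation(labels).reshape(shape)

def cv_rmse(X, folds, fit_and_reconstruct):
    """Cross-validated RMSE over held-out entries.

    fit_and_reconstruct(X, observed_mask) must return a full reconstruction of X
    computed only from entries where observed_mask is True (hypothetical API).
    """
    X_tilde = np.empty_like(X, dtype=float)
    for f in np.unique(folds):
        held_out = folds == f
        X_hat = fit_and_reconstruct(X, ~held_out)
        X_tilde[held_out] = X_hat[held_out]  # keep only the held-out predictions
    return np.sqrt(np.mean((X - X_tilde) ** 2))

# Toy usage with a trivial stand-in model that returns the data itself.
X = np.arange(12, dtype=float).reshape(3, 4)
folds = entrywise_folds(X.shape, n_folds=5)
rmse = cv_rmse(X, folds, lambda data, mask: data)
```

In practice one would call `cv_rmse` once per candidate $\lambda$ and keep the value with the smallest result.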
Specifically, we fix $\boldsymbol{A}$ and $\bar{\boldsymbol{S}}$ at their initialization values (from CAM estimation or a priori knowledge) and randomly remove entries of the $\boldsymbol{X}$ matrix, leading to the following objective function w.r.t. $\Delta\boldsymbol{S}_i$, $i = 1, \ldots, M$:

$$\min_{\{\Delta\boldsymbol{S}_i\}_{i=1}^{M}} \frac{1}{2}\sum_{i=1}^{M} \|P_{\Omega_i}(\boldsymbol{x}_i) - P_{\Omega_i}(\boldsymbol{a}_i(\bar{\boldsymbol{S}} + \Delta\boldsymbol{S}_i))\|_2^2 + \lambda \sum_{k=1}^{K} \|\boldsymbol{T}_k\|_* \qquad (\ )$$

$$\text{s.t.}\quad \bar{\boldsymbol{S}} + \Delta\boldsymbol{S}_i \succeq \boldsymbol{0}_{K \times L}, \quad \boldsymbol{T}_k = [\Delta\boldsymbol{S}_1^T(k), \ldots, \Delta\boldsymbol{S}_M^T(k)] \in \mathbb{R}^{L \times M}, \quad k = 1, \ldots, K,$$

where $P_{\Omega_i}(\boldsymbol{x}_i) \in \mathbb{R}^{L}$ denotes a vector with the entries in $\Omega_i$ left alone and all other entries set to zero. The workflow of our proposed -fold cross-validation strategy is:

(1) Randomly split all entries into folds;
(2) Remove one fold of entries and use the remaining folds of entries to solve ( ) with different $\lambda$ values $[\lambda_1, \lambda_2, \ldots]$;
(3) Use the estimated $\Delta\boldsymbol{S}_i(\lambda_\theta)$, $i = 1, \ldots, M$, $\theta = 1, 2, \ldots$, together with the fixed $\boldsymbol{A}$ and $\bar{\boldsymbol{S}}$ matrices, to reconstruct the $\boldsymbol{X}$ matrix, and record the reconstructed values only for the removed entries of $\boldsymbol{X}$;
(4) Repeat steps (2)-(3) for each fold to obtain a reconstructed matrix $\tilde{\boldsymbol{X}}(\lambda_\theta)$ in which every entry is reconstructed from an optimization with $\lambda = \lambda_\theta$ in which its original value was absent;
(5) Calculate the root mean square error (RMSE) by

$$RMSE(\lambda_\theta) = \sqrt{\frac{1}{ML}\sum_{i=1}^{M}\sum_{j=1}^{L} \left(\boldsymbol{X}_{ij} - \tilde{\boldsymbol{X}}_{ij}(\lambda_\theta)\right)^2} \qquad (\ )$$

(6) Choose the $\lambda_\theta$ yielding the minimum RMSE.

A warm start can be used in step (2) with decreasing parameters $\lambda_1 > \lambda_2 > \cdots$, using the estimate obtained with $\lambda_\theta$ as the initialization of the next optimization with $\lambda_{\theta+1}$. The optimization problem ( ) can be solved using an ADMM algorithm similar to ( - ), which solves ( ).
The only modification is that ( ) becomes

$$\boldsymbol{u}_i^{q+1} \in \arg\min_{\boldsymbol{u}_i \in \mathbb{R}^{KL}} \frac{1}{2}\|P'_{\Omega_i}(\boldsymbol{H}_i\boldsymbol{u}_i) - P'_{\Omega_i}(\boldsymbol{v}_i)\|_2^2 + \frac{\gamma}{2}\|\boldsymbol{C}_1\boldsymbol{u}_i + \boldsymbol{y}_i^{q}\|_2^2 \qquad (\ )$$

where $P'_{\Omega_i}(\cdot) = [\boldsymbol{1}_K^T \otimes P_{\Omega_i}(\cdot)^T]^T \in \mathbb{R}^{KL}$ ensures that all variables related to excluded entries are optimized only through the second term. The problem is still an unconstrained quadratic problem that can be solved easily. The remaining variables, which are unrelated to excluded entries, can still be optimized following ( - ).

Sparsity regularization

In addition to the low-rank assumption, we could also reasonably assume that only a limited number of genes are involved in functional modules, and thus impose a row-sparsity regularization by $\ell_{2,1}$-norm minimization. The alternative swCAM formulation is:

$$\min_{\boldsymbol{A},\ \{\Delta\boldsymbol{S}_i\}_{i=1}^{M}} \frac{1}{2}\sum_{i=1}^{M} \|\boldsymbol{x}_i - \boldsymbol{a}_i(\bar{\boldsymbol{S}} + \Delta\boldsymbol{S}_i)\|_2^2 + \lambda \sum_{k=1}^{K} \|\boldsymbol{T}_k\|_* + \delta\|\boldsymbol{T}\|_{2,1} \qquad (\ )$$

where $\delta > 0$ is the regularization parameter of the $\ell_{2,1}$ norm of $\boldsymbol{T}$, defined as

$$\|\boldsymbol{T}\|_{2,1} \triangleq \sum_{i=1}^{KL} \|\boldsymbol{t}_i\|_2,$$

which accounts for the row-sparsity of $\boldsymbol{T}$ ($\boldsymbol{t}_i$ denotes the $i$th row of $\boldsymbol{T}$). If necessary, the parameter $\delta$ can be varied across rows based on the characteristics of each gene, such as the mean-variance trend. The Supplementary Information gives more details on the optimization of ( ) by the ADMM method. The ℓ or ℓ -norm minimization, as commonly used sparsity regularization methods, could impose entry-wise sparsity on the $\boldsymbol{T}$ matrix. We also provide the ADMM optimization for sample-specific deconvolution with ℓ or ℓ -norm minimization, which could be useful in other sBSS problems.

Results

As swCAM focuses on estimating subtype-specific variation, simulating the biological variance within each subtype and the technical variance of each observation is important for validating swCAM performance. We conduct two sets of simulations. The first is an ideal scenario in which the variance is not related to the mean value.
The second is more realistic: genes with a larger mean usually have a larger variance.

Validation on ideal simulations

In the first simulation, we design twelve function modules, four in each of three subtypes. The observations for genes in samples were simulated with the subtype-specific expression baseline, $\bar{\boldsymbol{S}}$, sampled from the purified cell populations in the real benchmark microarray gene expression data GSE (Kuhn, Thu et al. ). The proportions $\boldsymbol{a}_i$, $i = 1, \ldots, M$, are drawn randomly from a flat Dirichlet distribution. The between-sample variation $\Delta\boldsymbol{S}_i(k, j)$, $i = 1, \ldots, M$, for the $k$th subtype and $j$th gene was drawn from the normal distribution $\mathcal{N}(0, \sigma_{kj}^{(s)})$ if the $j$th gene was involved in a function module in the $k$th subtype, and was set to zero otherwise (Fig. a). The genes in the same function module have pairwise correlation coefficients equal to one, thus generating a highly correlated gene set in each module. The $\sigma_{kj}^{(s)}$ are drawn from the uniform distribution $U[\ ,\ ]$. The technical noise, $\boldsymbol{n}_i$, $i = 1, \ldots, M$, was drawn from a zero-mean normal distribution with variance $\sigma_{ij}^{(n)} =$ .

The twelve functional modules can be recognized in the variation matrix from swCAM when $\lambda$ falls into a certain range (Fig. b~ i). Increasing the penalty parameter of the nuclear norm filters more noise, but at the cost of possibly missing the true variation signal. The RMSE derived by the -fold cross-validation strategy is relatively small when $\lambda$ = ~ and reaches its minimum at $\lambda$ = (Fig. a). The estimated variation matrix looks quite similar when ≤ $\lambda$ ≤ (Fig. e~ g), with clear patterns and some artifacts.
The artifacts are formed when the signal variation in one subtype spreads to other subtypes for the same genes; they are much weaker than the detected true signals unless $\lambda$ is extremely small. (As shown in the Supplementary Information, nuclear norm minimization for each subtype’s variation matrix is a good option for reducing artifacts compared to other regularization terms.) Interestingly, $\lambda$ = is also the point where both the primal and dual residuals surge in the ADMM algorithm (Fig. c~ f). This is because a larger $\lambda$ tends to train an over-simplified model, which approaches the optimal solution more easily in ADMM.

The recovery of sample-specific signals in a subtype is also affected by the mixing proportion of this subtype within the sample. When a subtype accounts for a very small portion of a certain sample, its true signal in this sample will be very weak and thus underestimated (green points in Fig. ). In contrast, the major subtype in a sample can be estimated very well by CAM-SS (red points in Fig. ).

Fig. . Heatmap of the estimated $\boldsymbol{T}$ matrix with varied $\lambda$ parameters compared to the ground truth in the ideal simulation. Increasing the penalty parameter of the nuclear norm filters more noise, but at the cost of possibly missing true signal variation.

Fig. . -fold cross-validation results under different $\lambda$ parameters in the ideal simulation.
(a) RMSE; (c) residuals for the primal feasibility condition; (e) residuals for the dual feasibility condition; (b), (d), (f) are zoomed curves of (a), (c), (e).

Fig. . Estimated $\boldsymbol{T}$ matrix versus ground truth when $\lambda$ = in the ideal simulation. The estimated entries are colored by their associated mixing proportions, showing that sample-specific expressions are estimated more accurately for high-proportion subtypes than for low-proportion ones.

Validation on realistic simulations

A mean-variance trend is widespread in molecular expression data. In our second simulation, all settings are the same as above except that the variance of subtype-specific expression, $\sigma_{kj}^{(s)}$, and the technical variance of observations, $\sigma_{ij}^{(n)}$, are proportional to the subtype-specific expression mean and the mixed expression level, respectively. The coefficient of variation (CV), the ratio of the standard deviation to the mean, is drawn from the uniform distributions $U[\ .\ ,\ .\ ]$ and $U[\ .\ ,\ .\ ]$, respectively.

The -fold cross-validation strategy still obtains the minimum RMSE at $\lambda$ = (Fig. a~ b), where both the primal and dual residuals also surge (Fig. c~ f). However, the variation matrix estimated by swCAM is blurred by artifacts trained from noise (Fig. ). Some highly expressed genes have relatively large variance, which could be falsely modeled as subtype-specific signal variation. As shown in Fig. 
, the entries with zero value in the ground-truth variation matrix can be overestimated.

Though the absolute expression values estimated by swCAM may deviate from the ground truth, we can still clearly detect functional modules by applying weighted gene correlation network analysis (WGCNA) (Zhang and Horvath , Langfelder and Horvath ) to the estimated sample-specific expressions (Fig. ). WGCNA constructs weighted networks based on correlation patterns among genes across samples and thus detects function modules of highly correlated gene sets. In Fig. , the second and third subtypes recover exactly the four true modules, with very few genes missed. The first subtype detects an extra false module, but it is a less significant pattern than the other modules and becomes undetectable with a stricter tree-height cut threshold. More importantly, without swCAM-based deconvolution (Fig. d), WGCNA on the mixture expression profiles finds none of the true modules, but instead three false modules that are related to the mixing process of the three subtypes.

Incorporation of $\ell_{2,1}$-norm regularization

In the above simulations, the deconvoluted sample-specific signals contain artifacts trained from the signals of other subtypes and artifacts trained from noise (Fig. and Fig. ). We can use an $\ell_{2,1}$-norm regularization to enforce sparsity of the genes that have signal variation across samples. This is expected to reduce artifacts, and it also follows the assumption that the genes contributing to source variation in hidden modules are limited. Figure shows the alleviated artifacts with $\lambda$ = and $\delta$ = , , or . . The true function modules are correctly detected with $\lambda$ = and $\delta$ = or . , and the false module in the first subtype is suppressed when $\delta$ = (Fig. ). Increasing the penalty parameter $\delta$ forces more genes to have zero variance, which suppresses the artifacts and false function modules but brings the risk of missing true signals. It is critical to propose a parameter tuning method for $\delta$.
However, the cross-validation strategy of randomly excluding entries for tuning the parameter $\lambda$ is based on the low-rank assumption, under which the hidden low-rank patterns can be trained from a part of the entries and then used to reconstruct the remaining entries. This strategy is not applicable to the selection of $\delta$, which needs further study.

Fig. . Heatmap of the estimated $\boldsymbol{T}$ matrix scaled by the associated means, compared to the ground truth, in the realistic simulation with varied $\lambda$ parameters. Increasing the penalty parameter of the nuclear norm filters more noise, but at the cost of possibly missing true variation signal.

Fig. . -fold cross-validation results under different $\lambda$ parameters in the realistic simulation. (a) RMSE; (c) residuals for the primal feasibility condition; (e) residuals for the dual feasibility condition; (b), (d), (f) are zoomed curves of (a), (c), (e).

Fig. . Estimated $\boldsymbol{T}$ matrix scaled by the associated means versus ground truth in the realistic simulation ($\lambda$ = ).
The estimated entries are colored by their associated mixing proportions, showing that sample-specific expressions are estimated more accurately for high-proportion subtypes than for low-proportion ones.

Fig. . Gene co-expressed function modules detected by WGCNA on swCAM-estimated sample-specific expression for each subtype (a~c) or on the originally observed expressions without deconvolution (d). (Network interconnectedness is measured by topological overlap; cutHeight = . ; minSize = .)

Fig. . Heatmap of the estimated $\boldsymbol{T}$ matrix scaled by the associated means, compared to the ground truth, in the realistic simulation with $\lambda$ = and varied $\delta$. Increasing the penalty of the $\ell_{2,1}$ norm enforces more zero columns in the $\Delta\boldsymbol{S}_k$ matrix.

Fig. . Gene co-expressed function modules detected by WGCNA on swCAM-estimated sample-specific expression for each subtype with $\lambda$ = and $\delta$ = or . . (Network interconnectedness is measured by topological overlap; cutHeight = . ; minSize = .)

Discussion

Most existing tissue deconvolution methods ignore the expression variability of subtypes across individual samples.
swCAM will significantly expand the utility of CAM by producing subtype-specific expression profiles in each sample. The success of swCAM depends on the low-rank assumption, which takes advantage of the biologically expected cooperation among genes and thus sheds light on solving the seemingly underdetermined sample-specific deconvolution problem. The low-rank assumption holds naturally in molecular expression data when there are activated functional modules required by particular biological processes or pathways in different subtypes. The detection of such subtype-specific associations or networks is one of the major targets in the analysis of molecular expression profiles. After sample-specific deconvolution by swCAM, conventional network analysis methods can be applied directly to the estimated sample-subtype-specific signals to construct subtype-specific networks, e.g. weighted correlation network analysis (WGCNA (Zhang and Horvath , Langfelder and Horvath )) and differential dependency network analysis (DDN (Zhang, Li et al. , Zhang, Tian et al. , Tian, Zhang et al. , Tian, Zhang et al. )).

The cross-validation strategy of excluding entries randomly is inspired by similar ideas in matrix imputation methods, which commonly assume that the matrix to be recovered has a low rank. Our results consistently show a U-curve over the parameter $\lambda$, demonstrating the feasibility of the proposed cross-validation strategy. Meanwhile, swCAM is not sensitive to the choice of $\lambda$, as the U-curve has a wide plateau over which the recovered sample-subtype-specific signals are similar and the detected modules are close.

It is also reasonable to assume that the genes involved in biological associations or networks are sparse. Therefore, the use of $\ell_{2,1}$-norm regularization for reducing artifacts and improving function module detection deserves further study.

When group information is available, we can also apply the basic CAM algorithm to each group to obtain group-wise expression profiles of subtypes.
Compared to sample-specific deconvolution, group-specific deconvolution aims at a lower resolution of the underlying subtype signals and thus could obtain more robust results. If the grouping is fine enough, group-specific deconvolution can also acquire the signal variation in each subtype and thus help detect function modules and construct biological networks.

Though swCAM can theoretically solve a seemingly underdetermined problem based on a low-rank assumption, it still needs improvement and validation.

First, the improvement of swCAM by sparsity regularization. The sparsity assumption is practically reasonable, and we have already shown some preliminary results after imposing $\ell_{2,1}$-norm regularization. However, introducing one more regularization term increases the difficulty of parameter tuning. Besides, the current cross-validation strategy with matrix entry sampling is not applicable to selecting the coefficient of the $\ell_{2,1}$-norm term. Therefore, the integration of sparsity regularization still needs further study.

Second, the improvement of function module detection based on swCAM-estimated sample-specific signals in each subtype. Recovering the exact values of sample-specific signals is impossible without stronger assumptions. Fortunately, our goal is to detect function modules or networks from the between-sample variation in each subtype. Thus, increasing the accuracy of the estimated intercorrelations among molecules can be regarded as the target of our further efforts.

Third, the validation of swCAM in real data analysis. We have demonstrated the capacity of swCAM to estimate sample-specific signals in each subtype using simulations where the between-sample variation matrices are low-rank.
Validation of swCAM on real molecular expression data would be difficult, as benchmark datasets with true subtype-specific signals are unavailable. One possible direction is to verify the constructed subtype-specific networks through biological experiments.

Conclusion

We propose a sample-specific deconvolution algorithm to estimate sample-specific molecular expressions for each subtype, from which the between-sample variation can be used to detect biological associations and construct networks in each subtype. The contributions of this work include the following. We formulate the objective function for swCAM with a penalty term that minimizes the nuclear norm of the between-sample variation matrix in each subtype, based on our expectation of the existence of subtype-specific networks. We design an efficient method based on ADMM to solve swCAM’s optimization problem in large-scale biological data. We design a -fold cross-validation strategy to select the coefficient of the nuclear norm term, and demonstrate its feasibility in simulations where a U-curve of RMSE is obtained to determine the optimal selection. We validate swCAM in simulations to demonstrate that sample-specific signals can be well estimated when the low-rank assumption holds. Even though artificial signal variances exist in swCAM estimates, the intercorrelations among genes are still well preserved for function module detection and biological network construction. We propose to use an extra $\ell_{2,1}$-norm regularization to enforce the sparsity of genes involved in networks and thus reduce the artifacts trained from noise or from the signals of other subtypes.
Acknowledgments

This work has been supported by the National Institutes of Health under grants HL - A , HL , NS - , and the Department of Defense under grant W XWH- - - (BC P ).

Competing financial interests

The authors declare no competing financial interests.

Reference

Boyd, S., N. Parikh, E. Chu, B. Peleato and J. Eckstein ( ). "Distributed optimization and statistical learning via the alternating direction method of multipliers." Found. Trends Mach. Learn. ( ): - .
Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni and O. Stegle ( ). "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nat Biotechnol ( ): - .
Cai, J.-F., E. J. Candès and Z. Shen ( ). "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization ( ): - .
Candes, E. J., C. A. Sing-Long and J. D. Trzasko ( ). "Unbiased risk estimates for singular value thresholding and spectral estimators." Trans. Sig. Proc. ( ): - .
Chasman, D. and S. Roy ( ). "Inference of cell type specific regulatory networks on mammalian lineages." Current Opinion in Systems Biology (Supplement C): - .
Chen, L. ( ). Mathematical modeling and deconvolution for molecular characterization of tissue heterogeneity. Ph.D. doctoral dissertation, Virginia Polytechnic Institute and State University.
Chen, L., Y. Lu, C.-T. Wu, R. Clarke, G. Yu, J. E. Van Eyk, D. Herrington and Y. Wang ( ). "Data-driven detection of subtype-specific differentially expressed genes." Scientific Reports.
Gal, E., M. London, A. Globerson, S. Ramaswamy, M. W. Reimann, E. Muller, H. Markram and I. Segev ( ).
"rich cell-type-specific network topology in neocortical microcircuitry." nature neuroscience : . hastie, t., r. tibshirani and j. friedman ( ). the elements of statistical learning. new york, ny, usa, springer new york inc. junttila, m. r. and f. j. de sauvage ( ). "influence of tumour micro-environment heterogeneity on therapeutic response." nature : . kuhn, a., d. thu, h. j. waldvogel, r. l. faull and r. luthi-carter ( ). "population-specific expression analysis (psea) reveals molecular changes in diseased brain." nat methods ( ): - . langfelder, p. and s. horvath ( ). "wgcna: an r package for weighted correlation network analysis." bmc bioinformatics : . parikh, n. and s. boyd ( ). "proximal algorithms." foundations and trends® in optimization ( ): - . recht, b., m. fazel and p. a. parrilo ( ). "guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization." siam review ( ): - . shen-orr, s. s., r. tibshirani, p. khatri, d. l. bodian, f. staedtler, n. m. perry, t. hastie, m. m. sarwal, m. m. davis and a. j. butte ( ). "cell type-specific gene expression differences in complex tissues." nat methods ( ): - . sonawane, a. r., j. platig, m. fagny, c.-y. chen, j. n. paulson, c. m. lopes-ramos, d. l. demeo, j. quackenbush, k. glass and m. l. kuijjer "understanding tissue-specific gene regulation." cell reports ( ): - . thouvenin, p. a., n. dobigeon and j. y. tourneret ( ). "hyperspectral unmixing with spectral variability using a perturbed linear mixing model." ieee transactions on signal processing ( ): - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . tian, y., b. zhang, e. p. hoffman, r. clarke, z. zhang, i.-m. shih, j. xuan, d. m. herrington and y. wang ( ). 
"knowledge-fused differential dependency network models for detecting significant rewiring in biological networks." bmc systems biology ( ): . tian, y., b. zhang, e. p. hoffman, r. clarke, z. zhang, m. shih ie, j. xuan, d. m. herrington and y. wang ( ). "kddn: an open-source cytoscape app for constructing differential dependency networks with significant rewiring." bioinformatics ( ): - . wang, n., e. p. hoffman, l. chen, l. chen, z. zhang, c. liu, g. yu, d. m. herrington, r. clarke and y. wang ( ). "mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues." scientific reports : . zhang, b. and s. horvath ( ). "a general framework for weighted gene co-expression network analysis." stat appl genet mol biol : article . zhang, b., h. li, r. b. riggins, m. zhan, j. xuan, z. zhang, e. p. hoffman, r. clarke and y. wang ( ). "differential dependency network analysis to identify condition-specific topological changes in biological networks." bioinformatics ( ): - . zhang, b., y. tian, l. jin, h. li, m. shih ie, s. madhavan, r. clarke, e. p. hoffman, j. xuan, l. hilakivi-clarke and y. wang ( ). "ddn: a cabig(r) analytical tool for differential network analysis." bioinformatics ( ): - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . analysis of next- and third-generation rna-seq data reveals the structures of alternative transcription units in bacterial genomes analysis of next- and third-generation rna-seq data reveals the structures of alternative transcription units in bacterial genomes qi wang , zhaoqian liu , , bo yan , wen-chi chou , laurence ettwiller , qin ma ,†, and bingqiang liu ,† school of mathematics, shandong university, jinan , china. 
Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH , USA.
New England Biolabs Inc., Ipswich, MA , USA.
Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA , USA.
†Corresponding author. Email: bingqiang@sdu.edu.cn (B.L.); qin.ma@osumc.edu (Q.M.)

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND . International license (http://creativecommons.org/licenses/by-nc-nd/ . /). This version posted January , . bioRxiv preprint doi: https://doi.org/ . / . . .

Abstract

Alternative transcription units (ATUs) are dynamically encoded under different conditions or environmental stimuli in bacterial genomes, and genome-scale identification of ATUs is essential for studying the emergence of human diseases caused by bacterial organisms. However, it is unrealistic to identify all ATUs using experimental techniques, due to the complexity and dynamic nature of ATUs. Here we present the first-of-its-kind computational framework, named SeqATU, for genome-scale ATU prediction based on next-generation RNA-seq data. The framework utilizes a convex quadratic programming model to seek an optimum expression combination of all of the to-be-identified ATUs. The predicted ATUs in E. coli reached a precision of . / . and a recall of . / . in the two RNA-sequencing datasets compared with the benchmarked ATUs from third-generation RNA-seq data. We believe that the ATUs identified by SeqATU can provide fundamental knowledge to guide the reconstruction of transcriptional regulatory networks in bacterial genomes.

Introduction

An operon in bacterial genomes is defined as a group of consecutive genes regulated by a common promoter that all share the same terminator ( ).
Genes in the same operon generally encode proteins with relevant or similar biological functions; e.g., lacZ, lacY, and lacA in the lac operon encode proteins that help cells use lactose ( , ). With decades of research on bacterial transcriptional regulation, the operon model has been found to have complex mechanisms that control expression ( - ). Multiple studies have shown that bacterial genes are dynamically transcribed under different triggering conditions, leading to shared genes among different mRNA transcripts ( - ). This dynamic architecture can be redefined by all of the alternative transcription units (ATUs) ( , ), and more details can be found in Fig. S .

ATU identification is of fundamental importance for understanding the transcriptional regulatory mechanisms of bacteria, and these dynamic structures have been demonstrated to be associated with human diseases ( - ). For example, Bhat et al. studied the alr-groEL operon, which is essential for the survival or virulence of M. tuberculosis ( , ), the causative agent of tuberculosis (TB), and found that the regulation of the sub-operon is distinct from that of the main operon (alr-groEL operon) under stress, especially during heat shock, pH, and SDS stresses ( ). Another example is Helicobacter pylori, a gastric pathogen that is the primary known risk factor for gastric cancer ( ). Sharma et al. found an acid-induced sub-operon cag - transcribed from the primary cag - operon in the cag pathogenicity island of the H. pylori genome under acid stress ( ).
Understanding the mechanism behind the complex ATU structure in these pathogenic bacteria can help us study the emergence of human diseases caused by bacterial organisms.

Several newly developed techniques have provided a comprehensive view of the E. coli transcriptome by identifying full-length primary transcripts ( - ). For example, SMRT-Cappable-seq ( ) combines the isolation of the full-length bacterial primary transcriptome with PacBio SMRT (single molecule, real-time) sequencing ( ), and simultaneous ’ and ’ end sequencing (SEnd-seq) ( ) captures both transcription start sites (TSSs) and transcription termination sites (TTSs) via circularization of transcripts ( ). Despite this progress in experimental techniques, some deficiencies remain. On the one hand, the read depth and error rate of the third-generation sequencing used in SMRT-Cappable-seq affect ATU prediction compared with Illumina-based RNA-seq ( , ). On the other hand, these experimental techniques are time-consuming, laborious, and costly, which makes them impractical for routine ATU prediction in bacteria under specific conditions. Thus, novel and robust computational methods for ATU identification in bacterial genomes based on RNA-seq are urgently needed.

Fortunately, many computational studies have been carried out to predict ATUs in bacteria, providing a preliminary basis for ATU prediction. Several public databases, such as RegulonDB ( ), DBTBS ( ), MicrobesOnline ( ), DOOR ( , ), OperomeDB ( ), DMINDA .
( ), and ProOpDB ( ), provide various levels of operon information and small amounts of ATU information. However, these databases cannot provide genome-scale ATU information under specific conditions. Some computational studies, including Rockhopper ( ), SeqTU ( , ), BAC-BROWSER ( ), rSeqTU ( ), and Operon-mapper ( ), utilize machine learning and model integration methods based on genomic information and gene expression profiles to identify bacterial transcription architecture. However, these works still cannot resolve the dynamic patterns and overlapping nature of ATUs.

Here, we present SeqATU, a novel computational method for genome-scale ATU prediction by analyzing next- and third-generation RNA-seq data (Fig. and Table S ). SeqATU utilizes a convex quadratic programming model (CQP) and aims to provide the optimum expression combination of all of the to-be-identified ATUs. Specifically, CQP minimizes the squared error between the predicted expression level of ATUs and the actual expression levels in genetic and intergenic regions. It is noteworthy that SeqATU also uses the bias rate function, which models the non-uniform read distribution, in the linear constraints of CQP to profile the complexity of the ATU architecture. Overall, SeqATU provides a generalized framework for the inference of ATUs based on next-generation RNA-seq data collected under multiple conditions and can be easily applied to any bacterial organism to identify the ATU architecture and construct a transcriptional regulatory network.

Please place Fig. here.
Materials and methods

Data collection

The two Cappable RNA-seq datasets used in this study, m enrich_seq and rienrich_seq, were obtained from E. coli grown under two different conditions: M minimal medium and rich medium, respectively ( ). The full-length primary transcripts were enriched as described in ( ) with modifications to adapt the protocol to Illumina sequencing. The capping and polyA tailing were performed as described in ( ). The capped RNA was enriched using hydrophilic streptavidin magnetic beads (New England Biolabs) and eluted with biotin using the same conditions ( ). Unlike in that protocol, the eluted RNA was enriched once more using streptavidin beads to further remove processed RNA (e.g., rRNA). Subsequently, the eluted RNA was used for library preparation using the NEBNext Ultra II Directional RNA Library Prep Kit (E ). Sequencing was performed on the Illumina MiSeq system (paired-end, bp). All reads were mapped to the E. coli genome using Burrows-Wheeler Aligner (BWA) with the default parameters ( ). Read alignment and other computational analyses were carried out using the E. coli genome NC_ . , and the corresponding gene annotations (gcf_ . _asm v _genomic.gff) were downloaded from NCBI.

Two experimentally verified ATU datasets, smrt_m enrich and smrt_rienrich, were used as the benchmark data to evaluate the predicted ATUs. They were generated by SMRT-Cappable-seq under the same conditions as the Illumina datasets m enrich_seq and rienrich_seq, respectively ( ).
In addition, the ATUs defined by RegulonDB ( ) and SEnd-seq ( ) were also used as additional evaluation data in our study.

Calculation of the expression values of genetic and intergenic regions

After the RNA-seq reads in m enrich_seq and rienrich_seq were mapped to the E. coli genome using BWA, we determined the number of reads $r(p)$ covering each genomic position $p$. Suppose that $g_i$ and $g_{i+1}$ are two consecutive genes on the same strand, and write $g_{i,i+1}$ for the intergenic region between them; we denote the expression value of $g_i$ as $e_i$ and the expression value of the intergenic region between genes $g_i$ and $g_{i+1}$ as $e_{i,i+1}$. Then, the calculation of $e_i$ and $e_{i,i+1}$ is given by:

$$e_i = \frac{\sum_{p \in g_i} r(p)}{|g_i|} \qquad (1)$$

$$e_{i,i+1} = \frac{\sum_{p \in g_{i,i+1}} r(p)}{|g_{i,i+1}|} \qquad (2)$$

where $p \in g_i$ denotes that genomic position $p$ is on the gene $g_i$ and $|g_i|$ denotes the genomic length of $g_i$.

Modeling non-uniform read distribution along mRNA transcripts

We introduced the bias rate function, which is similar to the bias curves in the work of Wu et al. ( ), to address the non-uniform distribution of the RNA-seq reads along mRNA transcripts ( - ). The bias function reflects the relative read distribution bias from the ’ end to the ’ end of an mRNA transcript. We assumed that the maximum read coverage of all the genomic positions of an mRNA transcript is the expression level without bias. It is noteworthy that a single gene mRNA transcript with no shared gene among different mRNA transcripts can serve as the ideal template for modeling non-uniform read distribution along mRNA transcripts.
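Both quantities introduced so far, the mean read coverage of a region and the unbiased expression level taken as the maximum coverage along a transcript, are simple to compute from a per-position read-count array. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 0-based inclusive coordinates, and the toy read counts are all hypothetical, and dividing each count by the maximum is one plausible reading of how a relative bias rate is obtained.

```python
def mean_coverage(read_counts, start, end):
    """Average per-position read count over the inclusive interval [start, end]."""
    region = read_counts[start:end + 1]
    return sum(region) / len(region)


def relative_bias(read_counts, start, end):
    """Read count at each position divided by the maximum count in the interval.

    Assumption: the maximum coverage along the transcript stands for the
    expression level without bias, as stated in the text above.
    """
    region = read_counts[start:end + 1]
    peak = max(region)
    return [count / peak for count in region]


# Hypothetical per-position read counts for a 12-bp stretch (toy data).
read_counts = [8, 10, 9, 7, 6, 2, 1, 3, 5, 5, 4, 6]

gene_expression = mean_coverage(read_counts, 0, 4)         # first gene
intergenic_expression = mean_coverage(read_counts, 5, 7)   # region between the genes
next_gene_expression = mean_coverage(read_counts, 8, 11)   # second gene

# Treat the first gene as a single-gene transcript and normalise its coverage.
bias_profile = relative_bias(read_counts, 0, 4)
```

In practice the read-count array would come from the alignment step; how it is produced is outside the scope of this sketch.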
The specific steps of modeling non-uniform read distribution are detailed as follows:

Step 1: Single gene mRNA transcript selection. We selected single gene mRNA transcripts from the evaluation data and plotted their expression distributions. Specifically, groups of single gene mRNA transcripts with lengths ranging from to , bp were selected from the evaluation data (more details are given in Method S ), and each group had ten randomly chosen mRNA transcripts. Apparent declining trends appeared in the long single gene mRNA transcripts, with lengths ranging from , to , bp (Fig. S ). A possible reason is that incomplete transcription and ’ end degradation or processing enrich the signal at the ’ end of long mRNA transcripts ( , ). Finally, we plotted the expression distribution of single gene mRNA transcripts with lengths ranging from , to , bp.

Step 2: Acquiring the bias rate function. We applied nonlinear regression to the expression distribution of the selected single gene mRNA transcripts and acquired the hypothetical function $f(x)$. Specifically, the $x$ axis and $y$ axis of the expression distribution were converted to the distance from the ’ end of an mRNA transcript and the bias rate of read distribution, respectively. To apply nonlinear regression to single gene mRNA transcripts with different lengths, normalization was also implemented
on �. Here, � = (��, ��, … , ��) and � = (��, ��, … , ��) are defined by:

�� = ⎩ ⎨ ⎧ �� − ������ ���� − �� × �, ������� �� − �� ���� − �� × �, ������� ( )

�� = ⎩ ⎪ ⎨ ⎪ ⎧ �(������) ���� , ������� �(�� ) ���� , ������� ( )

where � denotes the number of genomic positions on an mRNA transcript; � = (��, ��, … , ��) denotes the genomic positions on an mRNA transcript; ���� = ��; �(�� ) denotes the expression level of the genomic position �� , i.e., the number of reads covering the genomic position �� ; and ���� denotes the expression level without bias in an mRNA transcript, which is calculated as ��� {�(�� )}, ≤ � ≤ �. We used the function nls in R to acquire the hypothetical function �(�).

Step 3: Constructing bias rate vectors. We constructed a genetic or intergenic region bias rate vector for each mRNA transcript by calculating the bias rate of all of its component genetic or intergenic regions. The bias rate of a genetic or an intergenic region is the average bias rate of all the genomic positions that it contains. Considering an mRNA transcript � and its component gene set {��, ��, … , ��} (the details of the gene labels are described in Method S ), we denoted the genetic region bias rate vector as � = (��, ��, … , �� ), which was calculated using the formula:

�� = ⎩ ⎪ ⎨ ⎪ ⎧ ∑ �(�� ) ������ �������� ������� − ������� + , ������� ∑ �(�� ) �� ���� ��� − ��� + , ������� ( )
where � denotes the number of genomic positions on �; �� denotes the bias rate of �� for �; and �� = (��� , ��� , ��� , ��� , … , ��� , ��� ) is the range of the genomic positions of {��, ��, … , ��}, while the range of the genomic positions of �� is [��� , ��� ], ≤ � ≤ �. Similarly, the calculation of the intergenic region bias rate vector � = (��, ��, … , ����) is provided in Method S .

Modification of maximal ATU clusters

A maximal ATU cluster is defined as a maximal consecutive gene set such that each pair of its consecutive genes can be covered by at least one ATU. Similar to ATUs, maximal ATU clusters are also dynamically composed under different conditions or environmental stimuli in bacterial genomes ( , ). Such a maximal ATU cluster can be used as an independent genomic region for ATU prediction, which alleviates the difficulty of computationally predicting ATUs at the genome scale. The output of our in-house tool rSeqTU can serve as the maximal ATU cluster data, which lays a solid foundation for ATU prediction ( ). We modified the maximal ATU clusters from rSeqTU: (i) two maximal ATU clusters with distances less than bp were combined into one cluster and (ii) a maximal ATU cluster was split at the intergenic region where the opposite-strand genes were located. In addition, we selected the maximal ATU clusters with expression values over ten (see the details in Method S ), according to the study of Ettwiller et al. ( ).

The mathematical programming model for ATU prediction

The predicted ATU expression profile should be consistent with the observed expression profiles of the
genetic and intergenic regions. Therefore, the prediction of the ATU profiles can be modeled as an optimization problem, which seeks an optimum expression combination of all of the to-be-identified ATUs to minimize the gap between the predicted ATUs and the observed genetic and intergenic region expression profiles. Here, a convex quadratic programming model was built to solve this optimization problem.

We denoted a maximal ATU cluster as $C$, assuming that it contains the consecutive genes $\{g_1, \ldots, g_n\}$, and the intergenic regions of these genes are $\{g_{1,2}, \ldots, g_{n-1,n}\}$. The size of $C$ is defined as the number of its component genes $n$. Theoretically, there are $\frac{n \times (n+1)}{2}$ ATUs for $C$, and an ATU with consecutive genes $\{g_i, g_{i+1}, \ldots, g_j\}$ is denoted as $A_{i,j}$; the corresponding expression value is $x_{i,j}$, $1 \le i \le j \le n$. For the component gene $g_k$ of $C$, the gap between the gene expression value $e_k$ and the sum of the expression level of the ATUs containing it is denoted as $\varepsilon_k$, which provides the first $n$ equality constraints in our mathematical programming model, $k = 1, 2, \ldots, n$. Similarly, for the intergenic region $g_{k,k+1}$ of $C$, the gap between the intergenic region expression value $e_{k,k+1}$ and the sum of the expression level of the ATUs containing it is denoted as $\delta_k$, providing the last $n - 1$ equality constraints in our mathematical programming model, $k = 1, 2, \ldots, n - 1$. The goal of our mathematical programming model is to minimize the square of $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n, \delta_1, \ldots, \delta_{n-1})$, as the combination of $x_{i,j}$ with a minimal value of $\varepsilon\varepsilon^{T}$ corresponds to an optimum expression combination of all ATUs $A_{i,j}$ for $C$, $1 \le i \le j \le n$. Additionally, to control the
number of optimal solutions and reduce the false-positive errors, we added an $L_1$ regularization $\lambda\|x\|_1$ to $\varepsilon\varepsilon^{T}$ with $x_{i,j} \ge 0$, which is a linear function. Because of the variant expression level of different maximal ATU clusters, we used the expression value of $C$ as $\lambda$. In total, the convex quadratic programming model with unknown variables $(x, \varepsilon)$ is shown as follows:

$$
\begin{aligned}
\min\ & \varepsilon\varepsilon^{T} + \lambda\|x\|_1 \\
\text{s.t.}\ & \sum_{i=1}^{k}\sum_{j=k}^{n} x_{i,j}\,\alpha_{i,j}^{k} = e_k + \varepsilon_k, && k = 1, 2, \ldots, n \\
& \sum_{i=1}^{k}\sum_{j=k+1}^{n} x_{i,j}\,\beta_{i,j}^{k+1} = e_{k,k+1} + \delta_k, && k = 1, 2, \ldots, n-1 \\
& x = (x_{i,j}),\quad x_{i,j} \ge 0, && 1 \le i \le j \le n \\
& \varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n, \delta_1, \ldots, \delta_{n-1})
\end{aligned}
\qquad (6)
$$

where $\alpha = (\alpha_{i,j}^{k})$ is the genetic region bias rate vector for $C$, $\alpha_{i,j}^{k}$ is the bias rate of gene $g_k$ for ATU $A_{i,j}$, $1 \le i \le j \le n$, $i \le k \le j$; $\beta = (\beta_{i,j}^{k})$ is the intergenic region bias rate vector for $C$, and $\beta_{i,j}^{k}$ is the bias rate of the intergenic region $g_{k-1,k}$ for ATU $A_{i,j}$, $1 \le i < j \le n$, $i < k \le j$ (see the details in Method S ).

Two evaluation methods for ATU prediction

In the first evaluation method, precision and recall were defined based on perfect matching (Eqs. ). Perfect matching of two ATUs means that all of their component genes are the same. Here, the true positives ($TP$) are the number of predicted ATUs with the same component genes as an ATU in the evaluation data; the false positives ($FP$) are the number of predicted ATUs that do not exist in the
evaluation data; the false negatives ($FN$) are the number of ATUs that appear in the evaluation data but not in the prediction results of SeqATU.

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

In the second evaluation method, precision and recall were defined based on relaxed matching, which is measured by the similarity of two ATUs. Assuming that an ATU $a$ is in one of two datasets (the predicted ATU dataset and the evaluated ATU dataset), the definition and calculation of the similarity of $a$ are shown in the following three cases:

Case 1: If $a$ shares boundary genes at both ends of an ATU in the other dataset, i.e., all component genes of $a$ are the same as one in the other dataset, then $\text{similarity}(a) = 1$.

Case 2: If $a$ shares exactly one boundary gene of ATUs in the other dataset, then we denote $B_{5'}$ as the ATUs in the other dataset that share the 5′-end gene with $a$ and $B_{3'}$ as the ATUs in the other dataset that share the 3′-end gene with $a$, where $B_{5'} \cap B_{3'} = \emptyset$ and one of $B_{5'}$ and $B_{3'}$ can be empty. Then,

$$\text{similarity}(a) = \max_{b \in B_{5'}} \frac{s(b)}{m(b)} + \max_{b \in B_{3'}} \frac{s(b)}{m(b)} \qquad (8)$$

where $s(b)$ is the number of shared genes of $a$ and $b$ and $m(b)$ is the maximal size of $a$ and $b$.

Case 3: If $a$ shares no boundary genes at both ends of the ATUs in the other dataset, then $\text{similarity}(a) = 0$.

Finally, the precision and recall based on relaxed matching are calculated by the following formula:

$$\text{Precision} = \frac{\sum_{a \in S_P} \text{similarity}(a)}{N_P}, \qquad \text{Recall} = \frac{\sum_{a \in S_E} \text{similarity}(a)}{N_E} \qquad (9)$$

where $S_P$ is the set of predicted ATUs, $N_P$ is the number of predicted ATUs, $S_E$ is the set of evaluated ATUs, and $N_E$ is the number of evaluated ATUs.
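To make the convex quadratic programming model for ATU prediction more concrete, the sketch below builds the gene and intergenic constraints for one small cluster and solves the resulting non-negative, L1-penalised least-squares problem. It is a hedged illustration rather than the authors' solver: the text does not say which QP solver was used, so cyclic coordinate descent is swapped in here; the gap variables are eliminated by substituting them into the objective; and the function names, the uniform bias rates of 1.0, the toy expression values, and the small penalty `lam=0.1` are all assumptions (the paper sets the penalty from the cluster's expression value).

```python
def enumerate_atus(n):
    """List every candidate ATU (i, j) with 1 <= i <= j <= n; there are n(n+1)/2."""
    return [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]


def build_rows(n, gene_bias=lambda i, j, k: 1.0, gap_bias=lambda i, j, k: 1.0):
    """Coefficient rows: n gene constraints followed by n-1 intergenic constraints."""
    atus = enumerate_atus(n)
    rows = []
    for k in range(1, n + 1):
        # Gene k lies inside ATU (i, j) when i <= k <= j.
        rows.append([gene_bias(i, j, k) if i <= k <= j else 0.0 for i, j in atus])
    for k in range(1, n):
        # The region between genes k and k+1 lies inside ATU (i, j) when i <= k < j.
        rows.append([gap_bias(i, j, k) if i <= k < j else 0.0 for i, j in atus])
    return atus, rows


def solve_cqp(rows, observed, lam, sweeps=2000):
    """Minimise ||M x - observed||^2 + lam * sum(x) subject to x >= 0.

    Stand-in solver (assumption): exact cyclic coordinate descent.
    """
    n_vars = len(rows[0])
    columns = [[row[c] for row in rows] for c in range(n_vars)]
    x = [0.0] * n_vars
    residual = [-value for value in observed]  # M x - observed at x = 0
    for _ in range(sweeps):
        for c, column in enumerate(columns):
            norm_sq = sum(v * v for v in column)
            if norm_sq == 0.0:
                continue
            half_grad = sum(v * r for v, r in zip(column, residual))
            # One-dimensional minimiser, clipped at zero to keep x non-negative.
            updated = max(0.0, x[c] - (half_grad + lam / 2.0) / norm_sq)
            step = updated - x[c]
            if step:
                residual = [r + v * step for r, v in zip(residual, column)]
                x[c] = updated
    return x


# Toy cluster of three genes (hypothetical numbers): ATU (1, 3) expressed at 10
# and ATU (2, 3) expressed at 5, with uniform bias rates.
atus, rows = build_rows(3)
observed = [10.0, 15.0, 15.0, 10.0, 15.0]  # e_1, e_2, e_3, e_{1,2}, e_{2,3}
estimates = dict(zip(atus, solve_cqp(rows, observed, lam=0.1)))
```

With real data, `gene_bias` and `gap_bias` would return the bias rates of each region for each candidate ATU, and ATUs with an estimated expression near zero would be the ones the model excludes.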
Results

A reliable bias rate function is acquired in modeling non-uniform read distribution along mRNA transcripts

To ensure the reliability of the bias rate function in modeling non-uniform read distribution, we randomly selected four single gene mRNA transcript datasets from the two evaluation datasets (smrt_m enrich and smrt_rienrich), named m enrich_ , m enrich_ , rienrich_ , and rienrich_ . Four bias rate functions, all of them exponential, were generated by nonlinear regression on the mRNA transcripts in these four datasets (Fig. ). We found that these bias rate functions were similar ($R^2$ > . ) when we evaluated the R-square statistic (for more details, see Method S and Table S ). The similarity of the four bias rate functions indicated that the selection of the single gene mRNA transcript datasets had little impact on modeling non-uniform read distribution along mRNA transcripts, implying that different mRNA transcripts of E. coli share a common non-uniform read distribution. Specifically, we used the average of the four sets of coefficients as the final coefficients of the exponential function, which was $f(x) = ae^{bx}$ with $a$ = . and $b$ = . .

Please place Fig. here.

ATUs predicted by SeqATU reach precision and recall over .

The performance evaluation was conducted by comparing the predicted ATUs with the ATUs in smrt_m enrich and smrt_rienrich, which were generated based on the third-generation sequencing and are not sensitive to transcripts with low expression levels.
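As a side note on the exponential bias rate function reported earlier in this section, a curve of the form f(x) = a·e^(bx) can be fitted with very little code. The sketch below is an assumption-laden stand-in, not the authors' procedure: the paper uses nonlinear regression with R's `nls`, whereas this example fits a straight line to ln(y), which is a simpler technique that only approximates a nonlinear least-squares fit on noisy data. The function names and the synthetic coefficients 0.9 and -0.5 are hypothetical and are not the coefficients reported in the paper.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by ordinary least squares on ln(y). Requires y > 0."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxy / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


def r_squared(xs, ys, a, b):
    """Coefficient of determination of the fitted curve on the original scale."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic bias-rate profile with hypothetical coefficients a = 0.9, b = -0.5.
xs = [i / 10 for i in range(11)]
ys = [0.9 * math.exp(-0.5 * x) for x in xs]

a_hat, b_hat = fit_exponential(xs, ys)
fit_quality = r_squared(xs, ys, a_hat, b_hat)
```

Averaging the coefficients obtained from several transcript sets, as described above, would then amount to taking the mean of the fitted `a` values and the mean of the fitted `b` values.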
For a more accurate and fair evaluation, only the maximal ATU clusters that passed pre-selection were retained in the subsequent evaluations (more details about the pre-selection of maximal ATU clusters can be seen in Method S and Fig. S ). The precision and recall of the predicted ATUs were calculated for each maximal ATU cluster. By considering only perfect matching, the average precision and recall were . and . for m enrich_seq and . and . for rienrich_seq, respectively. When using relaxed matching, the average precision and recall increased to . and . for m enrich_seq and . and . for rienrich_seq, respectively. The statistics for precision and recall on maximal ATU clusters with different sizes are shown in Fig. A and Fig. S A. These results showed that the average precision and recall decreased as the size of the maximal ATU clusters increased (except for several large clusters, which are few in number). The results also indicated that the evaluation results based on relaxed matching were significantly higher than those based on perfect matching across different sizes. This result implied that the ATUs incorrectly predicted by SeqATU under perfect matching tended to be strongly similar to the ATUs in the evaluation data. In addition, we also found that more than a quarter of the ATUs incorrectly predicted by SeqATU under perfect matching ( %/ % for m enrich_seq/rienrich_seq) matched the transcription units in RegulonDB ( ).
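The difference between the perfect-matching and relaxed-matching scores discussed above can be illustrated with a short sketch. It rests on several assumptions that are not from the source: each ATU is represented as a hypothetical (first gene index, last gene index) pair on one strand, and the relaxed similarity is simplified to the single best overlap ratio (shared genes divided by the larger ATU size) among ATUs sharing a boundary gene, rather than the paper's exact combination of 5′-end and 3′-end terms.

```python
def gene_set(atu):
    """Gene indices covered by an ATU given as (first_gene, last_gene)."""
    first, last = atu
    return set(range(first, last + 1))


def perfect_scores(predicted, reference):
    """Precision and recall when only identical ATUs count as matches."""
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    return true_pos / len(pred), true_pos / len(ref)


def overlap_ratio(a, b):
    """Shared genes divided by the size of the larger ATU."""
    shared = len(gene_set(a) & gene_set(b))
    return shared / max(len(gene_set(a)), len(gene_set(b)))


def relaxed_similarity(atu, others):
    """1 for an identical ATU, 0 if no boundary gene is shared, else best overlap.

    Simplifying assumption: take the best ratio over all ATUs sharing either
    boundary gene instead of combining separate 5'-end and 3'-end terms.
    """
    if atu in others:
        return 1.0
    sharing = [b for b in others if b[0] == atu[0] or b[1] == atu[1]]
    return max((overlap_ratio(atu, b) for b in sharing), default=0.0)


def relaxed_scores(predicted, reference):
    """Precision and recall as mean similarity over predicted and reference ATUs."""
    precision = sum(relaxed_similarity(a, reference) for a in predicted) / len(predicted)
    recall = sum(relaxed_similarity(a, predicted) for a in reference) / len(reference)
    return precision, recall


# Hypothetical toy data: ATUs as (first gene, last gene) on one strand.
predicted = [(1, 3), (2, 3), (5, 6)]
reference = [(1, 3), (2, 4), (5, 6), (7, 7)]

perfect_precision, perfect_recall = perfect_scores(predicted, reference)
relaxed_precision, relaxed_recall = relaxed_scores(predicted, reference)
```

In this toy case the predicted ATU (2, 3) is counted as wrong under perfect matching but earns partial credit under relaxed matching because it shares a boundary gene with (2, 4), which is the effect described in the paragraph above.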
The two evaluation datasets (smrt_m enrich and smrt_rienrich) were both from SMRT-Cappable-seq, and one of the processing steps of that technique filtered out RNA reads smaller than , bp ( ), which indicated that the ATUs in these two evaluation datasets were not comprehensive. To address this issue, we enriched the evaluation data by adding the ATUs defined by SEnd-seq ( ), as SEnd-seq did not introduce any filtering based on RNA size. With the new evaluation data, the average precision of the ATUs predicted by SeqATU improved by % ( . ) and % ( . ) based on perfect matching for m enrich_seq and rienrich_seq, respectively, and by % ( . ) and % ( . ) based on relaxed matching. The statistics for precision across different sizes of the maximal ATU clusters are shown in Fig. B and Fig. S B, showing that the values of precision based on perfect matching were significantly improved across different sizes of maximal ATU clusters by using the evaluated ATUs from SMRT-Cappable-seq and SEnd-seq. This result suggested that the absence of some predicted ATUs from smrt_m enrich and smrt_rienrich may be due to the RNA length selection of SMRT-Cappable-seq. We also enriched the evaluation data by adding the ATUs in RegulonDB ( ) and again found improved precision across different sizes of maximal ATU clusters for m enrich_seq and rienrich_seq (Fig. S C).

Furthermore, to facilitate the understanding of the performance of SeqATU and to measure the influence of the maximal ATU clusters from rSeqTU on our ATU prediction method, SMRT maximal ATU clusters collected from smrt_m enrich and smrt_rienrich (for more details, see Method S )
were applied for the cqp in two conditions (m minimal medium and rich medium). precision and recall increased to . and . for m enrich_seq and to . and . for rienrich_seq based on perfect matching (fig. s d). with relaxed matching, precision and recall increased significantly to . and . for m enrich_seq and to . and . for rienrich_seq (fig. s d). these significantly improved results verified the ability of seqatu to predict atus accurately when given more accurate maximal atu clusters. in addition, the numbers of predicted atus and evaluated atus under maximal atu clusters of the same size were similar, except for the maximal size (fig. c), and both were far smaller than the theoretical number, which indicated that seqatu can effectively exclude most of the incorrect atus.

please place fig. here.

the bias rate constraints efficiently improve the ability of seqatu to predict atus

we used seqatu without bias rate constraints to predict the atus of e. coli and found that its performance decreased significantly compared with seqatu (fig. and fig. s ). specifically, the f-score of seqatu without bias rate constraints was . / . based on perfect matching for m enrich_seq/rienrich_seq, compared with . / . for seqatu. with relaxed matching, the f-score of seqatu without bias rate constraints was . / . for m enrich_seq/rienrich_seq, compared with . / . for seqatu. this result suggested that the bias rate constraints of seqatu could capture useful information about the non-uniform distribution of the rna-seq reads along the
mrna transcripts ( - ) and thereby efficiently improve the ability of the model to predict complex atus.

please place fig. here.

atus predicted by seqatu display a dynamic composition and overlapping nature

a total of , distinct atus were identified in m minimal medium, and , were identified in rich medium. among them, there were , / , distinct atus on the forward strand and , / , on the reverse strand for m enrich_seq/rienrich_seq. each predicted atu comprised an average of . genes, with the largest atu containing genes across the two conditions. the size distribution of the predicted atus is shown in fig. a: the majority of atus (more than %) contained fewer than five genes in both m minimal medium and rich medium. approximately % of the genes in e. coli were contained in more than one atu for m enrich_seq, compared with % for rienrich_seq, suggesting that the atus in a maximal atu cluster generally overlapped with each other (fig. b). in addition, there were , maximal atu clusters for m enrich_seq and , maximal atu clusters for rienrich_seq. seqatu identified a total of , identical atus under the two conditions, whereas there were , distinct atus. among the distinct atus across the two conditions, atus were from the same maximal atu clusters in the two maximal atu cluster datasets, and the rest were from different maximal atu clusters. the fact that there were distinct atus under the two conditions suggests that atus are dynamically responsive to
different conditions or environmental stimuli (for more examples of atus under different conditions, see fig. s ). the dynamic composition of the atus predicted by seqatu is of great significance for understanding the interactions inside polymicrobial communities. for example, chronic airway infection by pseudomonas aeruginosa contributes considerably to lung tissue destruction and impairment of pulmonary function in cystic fibrosis (cf) patients ( ). marie et al. found that the presence of e. coli complemented the growth defect of a p. aeruginosa bioa-disrupted mutant that is unable to grow on rich medium, and that e. coli can be beneficial to p. aeruginosa when biotin supply is limited ( ). an atu with a high expression level that contains the uvrb gene was identified by seqatu in rich medium but not in m minimal medium (fig. ). we predicted the uvrb gene to be involved in the biotin metabolism pathway, as the biob, biof, bioc, and biod genes contained in the same atu are known members of the biotin metabolism kegg pathway. the observation by marie et al. can therefore be explained if the atus containing the uvrb gene of e. coli provide the biotin supply for p. aeruginosa in rich medium. this result showed that seqatu could increase our understanding of interspecies competition and cooperation, which play an important role in shaping the composition and structure of polymicrobial bacterial populations.

please place fig. here.

please place fig. here.
predicted atus by seqatu are verified by experimental tsss and ttss

an experimental tss dataset of e. coli from send-seq ( ) and a tf binding site dataset of e. coli from the experimental dataset of regulondb ( ) were used to further verify the reliability of seqatu; they were named dataset and dataset , respectively. there were , experimental tsss in dataset and , experimental tf binding sites in dataset . we considered the 5'-end genes and no 5'-end genes of the atus predicted by seqatu, where a gene that is not the 5'-end gene of any predicted atu is named a no 5'-end gene. we identified , / , 5'-end genes and , / , no 5'-end genes of the predicted atus for m enrich_seq/rienrich_seq. a gene is considered validated by experimental tsss or tf binding sites if it is the immediate downstream gene of an experimental tss or tf binding site. the proportion of 5'-end genes of the predicted atus that were validated by experimental tsss or tf binding sites was over . times greater than that of the no 5'-end genes (table ). specifically, the proportion of 5'-end genes ( %/ % for m enrich_seq/rienrich_seq) validated by experimental tf binding sites was over three times greater than that of the no 5'-end genes ( . %/ . % for m enrich_seq/rienrich_seq). these results further verified the reliability of the atus predicted by seqatu at the tss level. in addition, four other experimental tss or promoter datasets from regulondb ( ), drna-seq ( ), and cappable-seq ( ) were examined. the results are shown in table s ; again, a higher proportion of 5'-end genes than of no 5'-end genes of the predicted atus was validated by experimental tsss or promoters. we also used two experimental tts datasets of e. coli from send-seq ( ) and regulondb ( ) to
verify the reliability of the atus predicted by seqatu at the tts level. these two experimental tts datasets were named dataset and dataset , respectively. there were , experimental ttss in dataset and experimental ttss in dataset . we considered the 3'-end genes and no 3'-end genes of the atus predicted by seqatu, where a gene that is not the 3'-end gene of any predicted atu is named a no 3'-end gene. we identified , / , 3'-end genes and , / no 3'-end genes of the predicted atus for m enrich_seq/rienrich_seq. a gene is considered validated by experimental ttss if it is the immediate upstream gene of an experimental tts. the proportion of 3'-end genes of the predicted atus that were validated by experimental ttss was over two times greater than that of no 3'-end genes (table ). specifically, the proportion of 3'-end genes ( %/ % for m enrich_seq/rienrich_seq) validated by experimental ttss from send-seq was over three times greater than that of no 3'-end genes ( %/ % for m enrich_seq/rienrich_seq). these results further verified the reliability of the atus predicted by seqatu at the tts level. in addition, two computationally predicted tts datasets from the works by nadiras et al. ( ) and kingsford et al. ( ) were examined. the results are shown in table s ; the proportion of 3'-end genes ( %/ % for m enrich_seq/rienrich_seq) validated by computationally predicted rho-independent ttss was over two times greater than that of no 3'-end genes ( %/ % for m enrich_seq/rienrich_seq).

please place table here.

please place table here.
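the validation rule used above (a gene counts as validated when it is the immediate downstream gene of an experimental tss or tf binding site) can be sketched in python as follows. this is a simplified illustration, not the seqatu implementation: it assumes forward-strand genes only, represents each gene by a start coordinate, applies no distance cutoff between site and gene, and uses hypothetical names (`genes_downstream_of_sites`, `validated_fraction`) and toy coordinates. the tts case would mirror it with gene end coordinates and the immediate upstream gene.

```python
from bisect import bisect_left
from typing import Dict, List, Set


def genes_downstream_of_sites(gene_starts: Dict[str, int], sites: List[int]) -> Set[str]:
    """Return forward-strand genes that are the immediate downstream gene of a site.

    gene_starts maps gene name -> start coordinate; sites holds TSS (or TF
    binding site) coordinates on the same strand.
    """
    ordered = sorted(gene_starts.items(), key=lambda kv: kv[1])
    starts = [start for _, start in ordered]
    validated: Set[str] = set()
    for site in sites:
        k = bisect_left(starts, site)  # first gene starting at or after the site
        if k < len(starts):
            validated.add(ordered[k][0])
    return validated


def validated_fraction(genes: Set[str], validated: Set[str]) -> float:
    """Proportion of `genes` that were validated by at least one site."""
    return len(genes & validated) / len(genes) if genes else 0.0


# Toy data (assumed): four forward-strand genes and two predicted ATUs.
gene_starts = {"a": 100, "b": 500, "c": 900, "d": 1400}
atus = [["a", "b"], ["c", "d"]]
tss_sites = [80, 870, 1350]

five_prime = {atu[0] for atu in atus}   # first gene of each predicted ATU
others = set(gene_starts) - five_prime  # the "no 5'-end" genes
validated = genes_downstream_of_sites(gene_starts, tss_sites)

print(validated_fraction(five_prime, validated))
print(validated_fraction(others, validated))
```

comparing the two printed proportions corresponds to the 5'-end versus no 5'-end comparison reported in the text; a real analysis would also need strand handling and, most likely, a maximum site-to-gene distance.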
the gene pairs frequently encoded in the same atus are more functionally related than those that can belong to two distinct atus

functional analysis was conducted by integrating go terms from the gene ontology (go) database ( ). in detail, we measured the level of functional relatedness for two types of consecutive gene pairs, following a definition similar to that in the work by mao et al. ( ). the two types of consecutive gene pairs were (i) gene pairs each consisting of the 5'-end gene of an atu and the gene immediately upstream of it on the same strand and (ii) all the other gene pairs inside an atu (fig. a). to measure the go-based functional similarity between a pair of genes, we used the scoring scheme of wu et al. ( ), who developed a go similarity score and showed that the larger the score, the more likely it is that two genes are functionally related. in brief, the go similarity score of a gene pair $g_i$ and $g_j$ is denoted as $S_{GO}(g_i, g_j)$:

$$S_{GO}(g_i, g_j) = \max_{t_i \in T(g_i),\ t_j \in T(g_j)} s(t_i, t_j)$$

where $t_i$ and $t_j$ are the go terms assigned to $g_i$ and $g_j$, respectively; $s(t_i, t_j)$ is the maximal number of common terms between paths in the two go graphs induced by the go terms $t_i$ and $t_j$. as a result, the mean go similarity score was higher for type-ii gene pairs than for type-i gene pairs ( . versus . for m enrich_seq and . versus . for rienrich_seq). a total of / type-ii gene pairs had go similarity scores greater than four ( %/ % of a total of / ), while only / type-i gene pairs had go similarity scores greater than four ( %/ % of a total of , / , ) for m enrich_seq/rienrich_seq. we also applied a χ²-test ( ) to determine whether the
distribution of the go similarity score $S_{GO}(g_i, g_j)$ was different for the type-i gene pairs and type-ii gene pairs. the χ² statistic corresponded to a p-value less than ��, which revealed that the distribution of $S_{GO}(g_i, g_j)$ for the type-ii gene pairs was significantly different from that for the type-i gene pairs. fig. b shows the distribution of $S_{GO}(g_i, g_j)$ for the type-i gene pairs and the type-ii gene pairs. these results strongly indicated that the type-ii gene pairs had a higher degree of go similarity than the type-i gene pairs, suggesting that the gene pairs frequently encoded in the same atus (type-ii gene pairs) are more functionally related than those that can belong to two distinct atus (type-i gene pairs). we also carried out a similar analysis of the two types of gene pairs based on kegg enrichment analysis (see method s for details) and found that the proportion of type-ii gene pairs whose two genes were contained in the same kegg pathway ( %/ % for m enrich_seq/rienrich_seq) was higher than the corresponding proportion of type-i gene pairs ( %/ % for m enrich_seq/rienrich_seq) (fig. c). the distribution of the kegg similarity scores of the two types of gene pairs is shown in fig. d, suggesting that genes of type-ii gene pairs have a higher probability of participating in the same kegg pathway than those of type-i gene pairs.

please place fig. here.

discussion

we developed seqatu, the first computational method for genome-scale atu prediction by analyzing next- and third-generation rna-seq data, using a cqp model. linear constraints provided by the bias
rate of read distribution were, for the first time, integrated into the cqp model. positional bias refers to the non-uniform distribution of reads over different positions of a transcript ( , ); it is handled by learning non-uniform read distributions from the given rna-seq reads ( ) or by modeling rna degradation ( ). the bias rate function we proposed addresses the non-uniform read distribution along mrna transcripts and is also suitable for standard next-generation rna-seq data that involve more degraded mrnas, as the exponential function has been used to model the degradation of mrna transcripts ( ). as a result, a total of , distinct atus for m enrich_seq and , distinct atus for rienrich_seq were identified by seqatu. the precision and recall reached . / . and . / . , respectively, based on perfect matching and . / . and . / . , respectively, based on relaxed matching for m enrich_seq/rienrich_seq. we further validated the predicted atus using experimental transcription factor binding sites and transcription termination sites from regulondb and send-seq. the proportion of the 5'- or 3'-end genes of predicted atus that were validated by experimental transcription factor binding sites and transcription termination sites was over three times greater than that of no 5'- or 3'-end genes, demonstrating the high reliability of predicted atus. gene pairs frequently encoded in the same atus were more functionally related than those that can belong to two distinct atus according to go and kegg enrichment analyses.
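the go-based comparison summarized above relies on a go similarity score defined as the maximum, over all pairs of go terms annotated to the two genes, of the number of terms shared between paths in the go graph. the python sketch below shows one way this score could be computed. it assumes each go term comes with a precomputed list of root-to-term paths and counts shared terms as a set intersection; the term identifiers, gene names, and function names are hypothetical toy values rather than real go data.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Path = Tuple[str, ...]  # one root-to-term path in the GO graph


def term_similarity(paths_a: Sequence[Path], paths_b: Sequence[Path]) -> int:
    """Maximal number of common terms between any path of a and any path of b."""
    return max(
        (len(set(pa) & set(pb)) for pa, pb in product(paths_a, paths_b)),
        default=0,
    )


def go_similarity(terms_i: Sequence[str], terms_j: Sequence[str],
                  paths: Dict[str, List[Path]]) -> int:
    """Maximum of term_similarity over all GO term pairs of the two genes."""
    return max(
        (term_similarity(paths[ti], paths[tj]) for ti, tj in product(terms_i, terms_j)),
        default=0,  # a gene without annotations yields a score of 0
    )


# Toy GO graph (assumed): T1 and T2 share the branch root -> A -> B.
paths = {
    "T1": [("root", "A", "B", "T1")],
    "T2": [("root", "A", "B", "T2")],
    "T3": [("root", "C", "T3")],
}
annotations = {"g1": ["T1"], "g2": ["T2", "T3"], "g3": ["T3"]}

print(go_similarity(annotations["g1"], annotations["g2"], paths))
print(go_similarity(annotations["g1"], annotations["g3"], paths))
```

scores computed this way for type-i and type-ii gene pairs could then be binned and compared, for example with a chi-square test on the two histograms.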
these results demonstrated the reliability and accuracy of our predicted atus, implying the ability of seqatu to reveal the transcriptional architecture of the bacterial genome.

in fact, the atu architecture of bacteria is much more complex than that determined with currently used experimental techniques. we investigated the 5'-end genes and no 5'-end genes of the experimental atus identified by smrt-cappable-seq ( ) using a combination of experimental tsss from regulondb ( ), drna-seq ( ), cappable-seq ( ), and send-seq ( ). we found that the proportion of 5'-end genes ( %) validated by experimental tsss was not significantly different from that of no 5'-end genes ( %). the high percentage of no 5'-end genes validated by experimental tsss implied that the atus identified by experimental techniques are only a small proportion of the full set of atus in bacterial organisms, owing to the dynamic mechanisms of atus. these results further verified the necessity of developing robust computational methods for atu identification.

seqatu not only provides a powerful tool to understand the transcription mechanism of bacteria but also provides a fundamental tool to guide the reconstruction of a genome-scale transcriptional regulatory network. first, the atu structure can help us to make new functional predictions, as genes in an atu tend to have related functions. second, atus can elucidate condition-specific uses of alternative sigma factors ( , ). for example, the thrlabc operon is regulated by transcriptional attenuation. totsuka et al.
found that under the log phase growth condition, the thrlabc operon is the only transcript, while two transcripts, thrlabc and thrbc, are found under the stationary phase growth condition. as validated experimentally, � � can regulate the additional promoter located in front of thrb under the stationary phase growth condition and thereby separately regulate thrbc, which elucidates the condition-specific uses of � � ( ). third, understanding the atu structure is of great help in constructing transcriptional and translational regulatory networks, such as the σ-tug (σ-factor-transcription unit gene) network ( ). the transcription regulatory network consists of nodes (atus and regulatory proteins) and links (interactions) ( ), and a comprehensive atu structure can provide a nearly complete set of nodes, which can improve the accuracy of regulatory prediction.

although seqatu has obtained satisfactory prediction results, several challenges remain regarding the computational prediction of atus. on the one hand, due to the influence of the 5' untranslated region (utr) and 3' untranslated region (utr) in the intergenic regions, the expression value of intergenic regions cannot be reproduced perfectly by the same calculation used for the expression value of genetic regions. without accurate reproduction, it is difficult to obtain the best expression combination of atus by the programming model based on the expression value of genetic and intergenic regions.
on the other hand, due to the lack of strand-specific rna-seq data, it is difficult to determine whether the expression level of the intergenic region between two consecutive genes on the same strand is derived from atus containing these two genes or from antisense rnas (asrnas) ( , ). all of these challenges, together with the great significance of atu prediction, encourage us to discover more information to determine the atu structure in bacteria. for example, we plan to add high-confidence tss and tts information to our programming model in the future. additionally, since the microbiome is increasingly recognized as a critical component in human diseases, such as inflammatory bowel disease ( ), antibiotic-associated diarrhoea ( ), neurological disorders ( ), and cancer ( ) ( ), predicting new atus of uncultured species from metagenomic and metatranscriptomic data is of great significance in uncovering new regulatory pathways and metabolic products during the development of diseases ( ). however, because the majority of species within a microbial community have unknown genomes or genome annotations, atu prediction from metagenomic and metatranscriptomic data is still a challenging task, which encourages us to pay more attention to it.

references

f. jacob, d. perrin, c. sanchez, j. monod, operon: a group of genes with the expression coordinated by an operator. c r hebd. seances. acad. sci , - ( ).
f. jacob, j. monod, genetic regulatory mechanisms in the synthesis of proteins. j. mol. biol. , - ( ).
z. liu, j. feng, b. yu, q. ma, b.
liu, the functional determinants in the organization of bacterial genomes. brief. bioinform., doi.org/ . /bib/bbaa ( ).
w.-c. chou, q. ma, s. yang, s. cao, d. m. klingeman, s. d. brown, y. xu, analysis of strand-specific rna-seq data using machine learning reveals the structures of transcription units in clostridium thermocellum. nucleic acids res. , e -e ( ).
s.-y. niu, b. liu, q. ma, w.-c. chou, rseqtu - a machine-learning based r package for prediction of bacterial transcription units. frontiers in genetics , ( ).
b. yan, m. boitano, t. a. clark, l. ettwiller, smrt-cappable-seq reveals complex operon variants in bacteria. nat. commun. , ( ).
x. ju, d. li, s. liu, full-length rna profiling reveals pervasive bidirectional transcription terminators in bacteria. nature microbiology , - ( ).
k. totsuka, k. totsuka, the transcription unit architecture of the escherichia coli genome. nat. biotechnol. , - ( ).
a. h. bhat, d. pathak, a. rao, the alr-groel operon in mycobacterium tuberculosis: an interplay of multiple regulatory elements. scientific reports , ( ).
c. m. sharma, s. hoffmann, f. darfeuille, j. reignier, s. findeiß, a. sittka, s. chabas, k. reiche, j. hackermüller, r. reinhardt, the primary transcriptome of the major human pathogen helicobacter pylori. nature , - ( ).
j. m. durand, g. r. bjork, putrescine or a combination of methionine and arginine restores virulence gene expression in a trna modification-deficient mutant of shigella flexneri: a possible role in adaptation of virulence. mol. microbiol. , - ( ).
l. e. wroblewski, r. m. peek, k. t.
wilson, helicobacter pylori and gastric cancer: factors that modulate disease risk. clin. microbiol. rev. , - ( ).
l. ettwiller, j. buswell, e. yigit, i. schildkraut, a novel enrichment strategy reveals unprecedented number of novel transcription start sites at single base resolution in a model prokaryote and the gut microbiome. bmc genomics , - ( ).
m. k. thomason, t. bischler, s. k. eisenbart, k. u. forstner, a. zhang, a. herbig, k. nieselt, c. m. sharma, g. storz, global transcriptional start site mapping using differential rna sequencing reveals novel antisense rnas in escherichia coli. j. bacteriol. , - ( ).
t. bischler, h. s. tan, k. nieselt, c. m. sharma, differential rna-seq (drna-seq) for annotation of transcriptional start sites and small rnas in helicobacter pylori. methods , - ( ).
d. dar, m. shamir, j. mellin, m. koutero, n. stern-ginossar, p. cossart, r. sorek, term-seq reveals abundant ribo-regulation of antibiotics resistance in bacteria. science , ( ).
j. clauwaert, g. menschaert, w. waegeman, an in-depth evaluation of annotated transcription start sites in e. coli using deep learning. biorxiv, doi: https://doi.org/ . / . . . , november , pre-print: not peer-reviewed. ( ).
s. goodwin, j. d. mcpherson, w. r. mccombie, coming of age: ten years of next-generation sequencing technologies. nat. rev. genet. , - ( ).
a. santos-zavaleta, h. salgado, s. gama-castro, m. sánchez-pérez, l. gómez-romero, d. ledezma-tejeida, j. s. garcía-sotelo, k. alquicira-hernández, l. j. muñiz-rascado, p. peña-loredo, regulondb v .
: tackling challenges to unify classic and high throughput knowledge of gene regulation in e. coli k- . nucleic acids res. , d -d ( ).
n. sierro, y. makita, m. j. l. de hoon, k. nakai, dbtbs: a database of transcriptional regulation in bacillus subtilis containing upstream intergenic conservation information. nucleic acids res. , - ( ).
p. s. dehal, m. p. joachimiak, m. n. price, j. t. bates, j. k. baumohl, c. dylan, g. d. friedland, k. h. huang, k. keith, p. s. novichkov, microbesonline: an integrated portal for comparative and functional genomics. nucleic acids res. , d -d ( ).
h. cao, q. ma, x. chen, y. xu, door: a prokaryotic operon database for genome analyses and functional inference. brief. bioinform. , - ( ).
x. mao, q. ma, c. zhou, x. chen, h. zhang, j. yang, f. mao, w. lai, y. xu, door . : presenting operons and their functions through dynamic and integrated views. nucleic acids res. , d -d ( ).
k. chetal, s. c. janga, operomedb: a database of condition-specific transcription units in prokaryotic genomes. biomed research international , - ( ).
j. yang, x. chen, a. mcdermaid, q. ma, dminda . : integrated and systematic views of regulatory dna motif identification and analyses. bioinformatics , - ( ).
t. blanca, c. ricardo, c. e. martinez-guerrero, m. enrique, proopdb: prokaryotic operon database. nucleic acids res. , d -d ( ).
r. mcclure, d. balasubramanian, y. sun, m. bobrovskyy, p. sumby, c. a. genco, c. k. vanderpool, b. tjaden, computational analysis of bacterial rna-seq data. nucleic acids res. ,
e -e ( ).
x. chen, w. chou, q. ma, y. xu, seqtu: a web server for identification of bacterial transcription units. scientific reports , ( ).
i. a. garanina, g. y. fisunov, v. m. govorun, bac-browser: the tool for visualization and analysis of prokaryotic genomes. frontiers in microbiology , ( ).
b. taboada, k. estrada, r. ciria, e. merino, operon-mapper: a web server for precise operon identification in bacterial and archaeal genomes. bioinformatics , - ( ).
h. li, r. durbin, fast and accurate short read alignment with burrows–wheeler transform. bioinformatics , - ( ).
z. wu, x. wang, x. zhang, using non-uniform read distribution models to improve isoform expression inference in rna-seq. bioinformatics , - ( ).
a. roberts, c. trapnell, j. donaghey, j. l. rinn, l. pachter, improving rna-seq expression estimates by correcting for fragment bias. genome biol. , - ( ).
r. bohnert, g. rätsch, rquant.web: a tool for rna-seq-based transcript quantitation. nucleic acids res. , w -w ( ).
w. li, t. jiang, transcriptome assembly and isoform expression level estimation from biased rna-seq reads. bioinformatics , - ( ).
b. xiong, y. yang, f. r. fineis, j.-p.
wang, degnorm: normalization of generalized transcript degradation improves accuracy in rna-seq analysis. genome biol. , ( ).
j. chaitanya, degradation of mrna in escherichia coli. iubmb life , - ( ).
x. mao, q. ma, b. liu, x. chen, h. zhang, y. xu, revisiting operons: an analysis of the landscape of transcriptional units in e. coli. bmc bioinformatics , ( ).
b. marie, k. h. thilo, f. thierry, t. mikael, r. adriana, v. d. christian, metabolic pathways of pseudomonas aeruginosa involved in competition with respiratory bacterial pathogens. frontiers in microbiology , ( ).
c. nadiras, e. eveno, a. schwartz, n. figueroa-bossi, m. boudvillain, a multivariate prediction model for rho-dependent termination of transcription. nucleic acids res. , - ( ).
c. l. kingsford, k. ayanbule, s. l. salzberg, rapid, accurate, computational discovery of rho-independent transcription terminators illuminates their relationship to dna uptake. genome biol. , r ( ).
m. ashburner, s. lewis, on ontologies for biologists: the gene ontology - untangling the web. novartis found. symp. , - ; discussion - , - , - ( ).
h. wu, z. su, f. mao, v. olman, y. xu, prediction of functional modules based on comparative genome analysis and gene ontology application. nucleic acids res. , - ( ).
s. a. teukolsky, b. p. flannery, w. press, w. vetterling, numerical recipes in c: the art of scientific computing. cambridge university press, cambridge ( ).
l. wan, x. yan, t. chen, f. sun, modeling rna degradation for rna-seq with applications. biostatistics , - ( ).
c.
yanofsky, attenuation in the control of expression of bacterial operons. nature , ( ).
b. k. cho, d. kim, e. m. knight, k. zengler, b. o. palsson, genome-scale reconstruction of the sigma factor network in escherichia coli: topology and functional states. bmc biol. , - ( ).
b.-k. cho, p. charusanti, m. j. herrgård, microbial regulatory and metabolic networks. curr. opin. biotechnol. , - ( ).
a. toledo-arana, o. dussurget, g. nikitas, n. sesto, h. guet-revillet, d. balestrino, e. loh, j. gripenland, t. tiensuu, k. vaitkevicius, the listeria transcriptional landscape from saprophytism to virulence. nature , - ( ).
b. yue, x. luo, z. yu, s. mani, z. wang, w. dou, inflammatory bowel disease: a potential result from the collusion between gut microbiota and mucosal immune system. microorganisms , ( ).
b. h. mullish, h. r. williams, clostridium difficile infection and antibiotic-associated diarrhoea. clin. med. , ( ).
m. maguire, g. maguire, gut dysbiosis, leaky gut, and intestinal epithelial proliferation in neurological disorders: towards the development of a new therapeutic using amino acids, prebiotics, probiotics, and postbiotics. rev. neurosci. , - ( ).
s. vivarelli, r. salemi, s. candido, l. falzone, m. santagati, s. stefani, f. torino, g. l. banna, g. tonini, m. libra, gut microbiota and cancer: from pathogenesis to therapy. cancers , ( ).
g. cammarota, g. ianiro, a. ahern, c. carbone, a. temko, m. j. claesson, a. gasbarrini, g. tortora, gut microbiome, big data and machine learning to promote precision medicine for cancer.
nature reviews gastroenterology & hepatology , - ( ).
s. s. a. zaidi, x. zhang, computational operon prediction in whole-genomes and metagenomes. briefings in functional genomics , - ( ).

acknowledgements

funding: this work was supported by the national natural science foundation of china (nsfc) [ to b.l., to b.l.]; the interdisciplinary science innovation group project of shandong university ( ); and the innovation method fund of china [ im to b.l.]. the authors would like to thank yang li for his assistance in language polishing.

authors’ contributions: b.l., q.m. and w.c. conceived the basic idea and designed the overall analyses. q.w. carried out most of the computational analysis and data interpretation. all the authors wrote the manuscript.

competing interests: the authors declare that they have no competing interests.

data and materials availability: the raw data and source code of seqatu and a detailed tutorial can be found at https://github.com/osu-bmbl/seqatu.

figures and tables

table . results of predicted atus verified by experimental tsss or tf binding sites. overview of the experimental tss and tf binding site datasets (dataset and dataset ) and the proportion of ’-end genes and no ’-end genes of the predicted atus by seqatu for m enrich_seq and rienrich_seq, which were validated by experimental tsss or tf binding sites.

                                   dataset            dataset
source                             ju et al. ( )      regulondb tf binding sites
technique                          send-seq           collection
tsss/tf binding sites              ,                  ,
m enrich_seq, ’-end genes          %                  %
m enrich_seq, no ’-end genes       %                  . %
rienrich_seq, ’-end genes          %                  %
rienrich_seq, no ’-end genes       %                  . %

table .
results of predicted atus verified by experimental ttss. overview of the experimental tts datasets (dataset and dataset ) and the proportion of ’-end genes and no ’-end genes of the predicted atus by seqatu for m enrich_seq and rienrich_seq, which were validated by experimental ttss.

                                   dataset            dataset
source                             ju et al. ( )      regulondb ttss
technique                          send-seq           collection
ttss                               ,                  ,
m enrich_seq, ’-end genes          %                  %
m enrich_seq, no ’-end genes       %                  . %
rienrich_seq, ’-end genes          %                  %
rienrich_seq, no ’-end genes       %                  . %

fig. . schematic overview of seqatu. the blue arrows and orange lines denote genes and rna-seq reads, respectively. the preprocessing stage requires rna-seq data in the fastq format, the reference genome sequence in the fasta format, and gene annotations in the gff format, and generates linear constraints for the next convex quadratic programming (cqp) stage. there are two steps in the preprocessing stage: (i) calculating the expression values of the genetic and intergenic regions and (ii) modelling the non-uniform read distribution along mrna transcripts; specifically, we acquired a bias rate function using nonlinear regression and then constructed genetic or intergenic region bias rate vectors.
the maximal atu cluster data determined by rseqtu and the linear constraints from preprocessing are both taken as inputs of cqp. cqp seeks the optimal expression combination of all of the to-be-identified atus that minimizes the gap between the predicted atu expression profile and the genetic and intergenic region expression profile. finally, the output of cqp is the predicted atus.

fig. . results of modelling non-uniform read distribution along mrna transcripts. the four bias rate functions fitted by nonlinear regression had similar coefficients across the four datasets m enrich_ , m enrich_ , rienrich_ and rienrich_ .

fig. . overall evaluation results of seqatu. (a) precision and recall based on perfect matching and relaxed matching for m enrich_seq (left) and rienrich_seq (right) using evaluated atus from smrt-cappable-seq. (b) average precision based on perfect matching for m enrich_seq (left) and rienrich_seq (right) using evaluated atus from smrt-cappable-seq (black) and evaluated atus from smrt-cappable-seq and send-seq (red). the size of each point denotes the number of maximal atu clusters of the same size. (c) average number of atus across different sizes of smrt maximal atu clusters for m enrich_seq (left) and rienrich_seq (right).

fig. . comparative analysis of the performance of seqatu and seqatu without the bias rate constraints for smrt maximal atu clusters. (a) precision, recall and f-score based on perfect matching for m enrich_seq and rienrich_seq. (b) precision, recall and f-score based on relaxed matching for m enrich_seq and rienrich_seq.

fig. . comprehensive analysis of the predicted atus by seqatu. (a) number of atus across different sizes. the size of an atu is the number of its component genes. (b) distribution of the number of atus per gene.
fig. . integrative genomics viewer (igv) representation of the mapping and atus. mapping and atus of m enrich_seq (orange) and rienrich_seq (blue) were shown for the maximal atu cluster containing the biob, biof, bioc, biod and uvrb genes.

fig. . interpretation and results of the functional relatedness of different gene pairs based on go and kegg enrichment analyses. (a) illustration of two different gene pairs i and ii. (b) functional relatedness results based on go enrichment analysis for m enrich_seq (left) and rienrich_seq (right). (c) the proportion of two different gene pairs whose genes are contained in the same kegg pathway for m enrich_seq (left) and rienrich_seq (right). (d) the functional relatedness results based on kegg enrichment analysis for m enrich_seq (left) and rienrich_seq (right).
integrated cross-study datasets of genetic dependencies in cancer

clare pacini, joshua m. dempster, isabella boyle, emanuel gonçalves, hanna najgebauer, emre karakoc, dieudonne van der meer, andrew barthorpe, howard lightfoot, patricia jaaks, james m. mcfarland, mathew j. garnett, aviad tsherniak, francesco iorio*

wellcome sanger institute, wellcome genome campus, hinxton, cambridge, cb sa, uk
open targets, wellcome genome campus, hinxton, cambridge, cb sa, uk
broad institute of mit and harvard, main street, cambridge, ma , usa
european molecular biology laboratory, european bioinformatics institute, wellcome genome campus, cambridge cb sa, uk
human technopole, via cristina belgioioso , milano - italy

* corresponding author: francesco.iorio@sanger.ac.uk

abstract

crispr-cas viability screens are increasingly performed at a genome-wide scale across large panels of cell lines to identify new therapeutic targets for precision cancer therapy. integrating the datasets resulting from these studies is necessary to adequately represent the heterogeneity of human cancers and to assemble a comprehensive map of cancer genetic vulnerabilities. here, we integrated the two largest public independent crispr-cas screens performed to date (at the broad and sanger institutes) by assessing, comparing, and selecting methods for correcting biases due to heterogeneous single guide rna efficiency, gene-independent responses to crispr-cas targeting originating from copy number alterations, and experimental batch effects. our integrated datasets recapitulate findings from the individual datasets, provide greater statistical power to cancer- and subtype-specific analyses, unveil additional biomarkers of gene dependency, and improve the detection of common essential genes.
we provide the largest integrated resources of crispr-cas screens to date and the basis for harmonizing existing and future functional genetics datasets.

biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

cancer is a complex disease that can arise from multiple different genetic alterations. the alternative mechanisms by which cancer can evolve result in considerable heterogeneity between patients, with the vast majority of them not benefiting from approved targeted therapies. in order to identify and prioritize new potential therapeutic targets for precision cancer therapy, analyses of cancer vulnerabilities are increasingly performed at a genome-wide scale and across large panels of in vitro cancer models. this has been facilitated by recent advances in genome editing technologies allowing unprecedented precision and scale via crispr-cas screens. of particular note are two large pan-cancer crispr-cas screens that have been independently performed by the broad and sanger institutes. the two institutes have also joined forces with the aim of assembling a joint comprehensive map of all the intracellular genetic dependencies and vulnerabilities of cancer: the cancer dependency map (depmap). the two generated datasets collectively contain data from over , screens of more than cell lines. however, it has been estimated that the analysis of thousands of cancer models will be required to detect cancer dependencies across all cancer types.
consequently, the integration of these two datasets will be key for the depmap and other projects aiming at systematically probing cancer dependencies. these integrated datasets will provide a more comprehensive representation of heterogeneous cancer types and form the basis for the development of effective new therapies with associated biomarkers for patient stratification. further, designing robust standards and computational protocols for the integration of these types of datasets will mean that future releases of data from crispr-cas screens can be integrated and analyzed together, paving the way to even larger cancer dependency resources.

we have previously shown that the pan-cancer crispr-cas datasets independently generated at the broad and sanger institutes are consistent on the domain of commonly screened cell lines. the reproducibility of these crispr screens holds despite extensive differences in the experimental pipelines underlying the two datasets, including distinct crispr-cas sgrna libraries. here we investigate the integrability of the full broad/sanger gene dependency datasets, yielding the most comprehensive cancer dependency resource to date, encompassing dependency profiles of , genes across different cell lines that span tissues and different cancer types. we compare different state-of-the-art data processing methods to account for heterogeneous single-guide rna (sgrna) on-target efficiency, and to correct for gene-independent responses to
crispr-cas targeting, evaluating their performance on common use cases for crispr-cas screens (figure a, b and c).

figure : schematic of the integration strategy. a. broad and sanger gene dependency datasets (raw count data of single-guide rnas) are downloaded from the respective web portals. b. the datasets from each institute are pre-processed with three different methods, accounting for gene-independent responses to crispr-cas targeting (arising from copy number amplifications) and heterogeneous sgrna efficiency, providing gene-level corrected depletion fold changes. then, four different batch-correction pipelines are applied to the gene-level fold changes across the two institute datasets for each of the pre-processing methods. c. twelve different integrated datasets resulting from applying three different pre-processing methods (as indicated by the border colors) and four different batch-correction pipelines (as indicated by the fill colors) are benchmarked. d. advantages provided by the final integrated datasets and conservation of analytical outcomes from the individual ones are investigated.

we show that our integration strategy accounts for and corrects technical biases whilst preserving gene dependency heterogeneity, and recapitulates established associations between molecular features and gene dependencies.
we highlight the benefits of the integrated dataset over the two individual ones in terms of improved coverage of the genomic heterogeneity across different cancer types, identification of new biomarker/dependency associations, and increased reliability of human core-fitness/common-essential genes (figure d). finally, we estimate the minimal size (in terms of the number of screened cell lines) required in order to effectively correct batch effects when integrating a new dataset. collectively, this study presents a robustly benchmarked framework to integrate independently generated crispr-cas datasets that provide the most comprehensive resource for the exploration of cancer dependencies and the identification of new oncology therapeutic targets.

results

overview of the integrated crispr-cas screens

the sanger’s project score crispr-cas dataset (part of the sanger depmap) and the broad’s q depmap dataset contain data for and cell lines, respectively. overall, these represent screens for unique cell lines (figure a, supplementary table ). together these cell lines spanned different tissues (figure b) and for of these the number of cell lines covered increased when considering both datasets together. similarly, the integrated dataset provided richer coverage of specific cancer types and clinically relevant subtypes (figure c). these preliminary observations highlight the first benefit of combining these resources to increase statistical power for tissue-specific as well as pooled pan-cancer analyses.
between the two datasets, there was an overlap of cell lines screened by both institutes, encompassing different tissue types (median = , min for soft tissue, biliary tract and kidney, max for lung; figure a and b). the set of overlapping cell lines enabled the estimation of batch effects due to differences in the experimental protocols underlying the two datasets, without biasing the correction toward specific cell line lineages.

figure . overview of crispr-cas screened cancer cell lines. a. number of cell lines screened by the broad and the sanger institutes and their overlap. b. overview of the number of cell lines screened for each tissue type across the two datasets. c. number of screened lung cancer and breast cancer cell lines split according to cancer types and pam subtypes, respectively, across the two datasets.

data pre-processing

known biases in crispr screens arise due to nonspecific cutting toxicity that increases with copy number amplifications (cnas), and heterogeneous levels of on-target efficiency across sgrnas targeting the same gene. multiple methods exist to correct for these biases.
here, we evaluate three: crisprcleanr, an unsupervised nonparametric cna effect correction method for individual genome-wide screens; a method resulting from using crisprcleanr with jacks, a bayesian method accounting for differences in guide on-target efficacy through joint analysis of multiple screens (ccr-jacks); and ceres, a method that simultaneously corrects for cna effects and accounts for differences in guide efficacy, also analyzing screens jointly.

batch effect correction

technical differences in screening protocols, reagents and experimental settings can cause batch effects between datasets. these batch effects can arise from factors that vary within institute screens (for example, differences in control batches and cas activity levels) as well as between institutes (such as differences in assay lengths and employed sgrna libraries). when focusing on the set of cell lines screened at both institutes, a principal component analysis (pca) of the cell line dependency profiles across genes (dpgs) highlighted a clear batch effect determined by the origin of the screen, irrespective of the pre-processing method, consistent with previous results (figure a). we quantile-normalized each cell line dpg and adjusted for differences in screen quality in the individual broad/sanger datasets. the combined broad/sanger dataset was then batch corrected using combat (methods).
following combat correction, the combined datasets on the overlapping cell lines showed reduced yet persistent residual batch effects clearly visible along the first two principal components (supplementary figure ). analysis of the first two principal components (using msigdb gene signatures and all cell lines, methods) showed enrichment for metabolic processes (phosphorus metabolic process q-value = . e- , protein metabolic process q-value = . e- , hypergeometric test) in the first principal component. the enrichment of metabolic processes is consistent with differences identified across these datasets due to different media conditions employed in the underlying experimental pipelines. the second principal component contained significant enrichments for protein complex organisation and assembly (q-value = . e- and . e- respectively, hypergeometric test) (supplementary table ), which have no obvious associations with technical biases found in crispr-cas screens.

based on these results, we considered four different batch correction pipelines and evaluated their use in our integrative strategy. in the first pipeline, we processed the combined broad/sanger dpg dataset using combat alone (combat). in the second, we applied a second round of quantile normalization following combat correction (combat+qn) to account for different phenotype intensities across experiments, which result in different ranges of gene dependency effects. in the third and fourth pipelines we also removed the first one or two principal components, respectively (combat+qn+pc and combat+qn+pc - ).

the final datasets contained data from unique screens of cell lines using each of the three pre-processing methods and four different batch correction pipelines as outlined in the previous section.
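
the quantile-normalization and principal-component removal steps of these pipelines can be sketched as follows. this is a minimal pure-python illustration under stated assumptions, not the authors' implementation: combat itself is not reproduced, the function names `quantile_normalize` and `remove_first_pc` are hypothetical, ties are broken by input order, and the matrix orientation (one profile per cell line, components computed over genes after centering each gene) is assumed because the text does not specify it.

```python
import math

def quantile_normalize(profiles):
    """Give every profile (one list of gene scores per cell line) the same distribution."""
    n_genes = len(profiles[0])
    sorted_profiles = [sorted(p) for p in profiles]
    # Reference distribution: mean of the r-th smallest score across profiles.
    reference = [sum(sp[r] for sp in sorted_profiles) / len(profiles)
                 for r in range(n_genes)]
    normalized = []
    for p in profiles:
        order = sorted(range(n_genes), key=lambda g: p[g])  # ties keep input order
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized

def remove_first_pc(profiles, n_iter=200):
    """Subtract the leading principal component, found by power iteration."""
    n, p = len(profiles), len(profiles[0])
    means = [sum(row[j] for row in profiles) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in profiles]
    v = [1.0 / math.sqrt(p)] * p  # starting direction over genes (unit length)
    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along the current direction
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    corrected = [[centered[i][j] - scores[i] * v[j] + means[j] for j in range(p)]
                 for i in range(n)]
    return corrected, v

# Toy example: 4 cell-line profiles over 3 genes (made-up numbers).
toy = [[1.0, 2.0, 0.0], [2.0, 0.5, 1.0], [0.0, 6.0, 3.0], [4.0, 1.0, 8.0]]
qn = quantile_normalize(toy)
corrected, pc1 = remove_first_pc(qn)
```

removing two components, as in the fourth pipeline, would amount to calling `remove_first_pc` a second time on the corrected profiles.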
to assess the performance of different batch correction pipelines we estimated, using the overlapping cell lines, the extent to which each cell line dpg from one study matched that of its counterpart (derived from the same cell line) from the other study following batch correction. to quantify the agreement, we calculated for each dpg its similarity to all other screen dpgs using a weighted pearson’s (wpearson) correlation (methods). we then calculated the proximity of a cell line to its counterpart compared to all other cell lines using the wpearson as a metric (recall of cell line identity) (figure b). the best performances were obtained when removing either the first or the first two principal components following combat and quantile normalization, i.e. combat+qn+pc or combat+qn+pc - . across pre-processing methods, ceres performed best with ( %) of the cell lines being closest to their counterpart from the other study (k = ) followed by crisprcleanr with cell lines ( %) and ccr-jacks with ( %). the recall of cell line identity was high for each integration pipeline with normalized area under the curve (nauc) values of . for ccr-jacks and . for crisprcleanr and ceres when considering the best performing combat+qn+pc - batch correction method.
figure : batch effect assessment and correction. a. principal component plots of the dependency profile across genes (dpgs) for cell lines screened in both broad and sanger studies and pre-processing methods. screens are colored by the institute of origin. b. percentages of cell line dpgs that have the corresponding (same cell line) dpg screened at the other institute among their k most correlated dpgs (the k-neighborhood). results are shown across different pre-processing methods (in different plots) and different batch correction pipelines (as indicated by the different colors). correlations between dpgs are computed using a weighted pearson correlation metric. genes with higher selectivity have a larger weight in the correlation calculation. as a measure of selectivity we used the average (across the two individual datasets) skewness of a gene’s dependency profile across cell lines. the proportion of cell lines closest to their counterpart from the other study (k = ) is shown and the normalised areas under the curves (nauc) are shown in brackets. the x-axis values are restricted to between - to highlight the range over which performance differences are visible between datasets.
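
the recall of cell line identity can be illustrated with a short sketch. it assumes that per-gene weights are already available (how skewness is turned into a weight is not stated here), and the screen names, scores and function names are hypothetical; it is an illustration of the idea rather than the code used in the study.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-gene weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)

def counterpart_ranks(profiles, counterpart, weights):
    """Rank (1 = most correlated) at which each screen finds its counterpart."""
    ranks = {}
    for screen, profile in profiles.items():
        others = [s for s in profiles if s != screen]
        others.sort(key=lambda s: weighted_pearson(profile, profiles[s], weights),
                    reverse=True)
        ranks[screen] = others.index(counterpart[screen]) + 1
    return ranks

def recall_at_k(ranks, k):
    """Fraction of screens whose counterpart lies in their k-neighbourhood."""
    return sum(1 for r in ranks.values() if r <= k) / len(ranks)

# Toy data: two cell lines, each screened at both institutes (made-up scores).
profiles = {
    "lineA_broad":  [-1.0, -0.2, 0.1, 0.0],
    "lineA_sanger": [-0.9, -0.3, 0.0, 0.1],
    "lineB_broad":  [0.0, -1.1, 0.2, -0.8],
    "lineB_sanger": [0.1, -1.0, 0.1, -0.9],
}
counterpart = {
    "lineA_broad": "lineA_sanger", "lineA_sanger": "lineA_broad",
    "lineB_broad": "lineB_sanger", "lineB_sanger": "lineB_broad",
}
weights = [1.0, 2.0, 0.5, 2.0]  # placeholder selectivity weights
ranks = counterpart_ranks(profiles, counterpart, weights)
recall_k1 = recall_at_k(ranks, 1)
```

evaluating `recall_at_k` over a range of k values gives a curve of the kind whose normalised area is reported as nauc.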
performance of the integration pipelines

we evaluated the performance of each of the integrated datasets, containing cell lines, under four use-cases: the identification of i) essential and non-essential genes, ii) lineage subtypes, iii) biomarkers of selective dependencies and iv) functional relationships.

identification of essential and non-essential genes

a cell line dpg with a large separation between the dependency scores (ds) of common essential and non-essential genes should yield lower misclassification rates when identifying dependencies specific to that cell line. for each cell line we measured the separation of ds between known common essential and non-essential genes across all integrated datasets. as a measure of separation we used the null-normalized mean difference (nnmd), defined as the difference between the mean ds of the common essential genes and that of the non-essential genes, divided by the standard deviation of the ds of the non-essential genes. by analysing multiple screens jointly, ceres and jacks borrow essentiality signal information across screens. as a consequence, these methods better identify consistent signals across cell line dpgs (i.e. for common essential and non-essential genes), especially for dpgs derived from lower quality experiments, or reporting weaker depletion phenotypes. consistently, ceres (median nnmd range [- . , - . ]) showed better nnmd values than crisprcleanr (median nnmd range [- . , - . ], wilcoxon test (wt) p-value < . e- ) and ccr-jacks (median nnmd range [- . , - . ], wt p-value < . e- ), and similarly ccr-jacks had better nnmd values than crisprcleanr (largest wt p-value < . ) (figure a). comparing the batch correction methods, combat+qn+pc - had marginally better performance across all pre-processing methods.

next, we evaluated the gene dependency false-positive rates across all integrated datasets.
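
as a small worked illustration of the nnmd defined above, the sketch below computes it for one cell line. the gene names and scores are invented, the function name `nnmd` is hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not say which is meant.

```python
import statistics

def nnmd(scores, essential, nonessential):
    """Null-normalized mean difference for one dependency profile.

    scores maps gene -> dependency score; more negative values mean a
    stronger separation of essential from non-essential genes.
    """
    ess = [scores[g] for g in essential if g in scores]
    non = [scores[g] for g in nonessential if g in scores]
    # Assumption: sample standard deviation of the non-essential scores.
    return (statistics.mean(ess) - statistics.mean(non)) / statistics.stdev(non)

# Toy profile with two essential and three non-essential genes (made up).
toy_scores = {"ess1": -1.0, "ess2": -1.2, "non1": 0.1, "non2": -0.1, "non3": 0.0}
toy_nnmd = nnmd(toy_scores, ["ess1", "ess2"], ["non1", "non2", "non3"])
```

here the essential genes have mean -1.1, the non-essential genes have mean 0 and standard deviation 0.1, so the nnmd is approximately -11.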
for each cell line dpg, we defined a set of putative negative controls composed of genes not expressed at the basal level in that cell line (methods). the false-positive rate was calculated as the number of negative controls identified as significant dependencies (in the top % most depleted genes) divided by the total number of negative controls in the dpg. there was little difference in false-positive rates across the four different batch correction pipelines, with a slight improvement when two principal components were removed (figure b). ceres outperformed ccr-jacks significantly for all batch correction methods (largest 𝜒 contingency table p-value . x - , n= . x ) and ccr-jacks outperformed crisprcleanr (p-value below machine precision). comparing the correction methods, the differences between combat and combat+qn and between combat+qn+pc and combat+qn+pc - were generally not significant across preprocessing methods, while the differences between either combat or combat+qn and either combat+qn+pc or combat+qn+pc - were generally significant (largest p-value . x - ). as a final test of control separation, we used the unexpressed genes as an empirical null distribution for each dpg to estimate p-values for all ds and thus false discovery rates (fdrs) within each dpg. we calculated the recall of a reference set of common essential genes at % fdr (figure c).
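
the two control-based measures above can be sketched as follows. several choices here are assumptions rather than details taken from the text: the add-one correction in the empirical p-values, the benjamini-hochberg procedure for converting p-values to fdrs, and the values of `top_fraction` and `fdr` in the toy example (the thresholds are elided in the text, so placeholders are used). all gene names and function names are hypothetical.

```python
import bisect

def false_positive_rate(scores, negative_controls, top_fraction):
    """Share of negative-control genes found among the most depleted genes."""
    n_top = int(len(scores) * top_fraction)
    most_depleted = set(sorted(scores, key=scores.get)[:n_top])
    controls = [g for g in negative_controls if g in scores]
    return sum(1 for g in controls if g in most_depleted) / len(controls)

def empirical_p_values(scores, null_genes):
    """P-value of each score against the scores of the unexpressed (null) genes."""
    null = sorted(scores[g] for g in null_genes if g in scores)
    # Add-one correction (assumption) keeps p-values strictly positive.
    return {g: (bisect.bisect_right(null, s) + 1) / (len(null) + 1)
            for g, s in scores.items()}

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    items = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(items)
    q_values, running_min = {}, 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        gene, p = items[rank - 1]
        running_min = min(running_min, p * m / rank)
        q_values[gene] = running_min
    return q_values

def recall_at_fdr(q_values, reference_genes, fdr):
    """Fraction of reference essential genes called at the given FDR."""
    ref = [g for g in reference_genes if g in q_values]
    return sum(1 for g in ref if q_values[g] <= fdr) / len(ref)

# Toy profile: 99 unexpressed genes, 50 strongly depleted essential genes
# and one weakly depleted essential gene (all values made up).
scores = {f"null{i}": i / 100 for i in range(99)}
scores.update({f"ess{i}": -3.0 - i / 100 for i in range(50)})
scores["ess_weak"] = 0.5
null_genes = [g for g in scores if g.startswith("null")]
essential = [g for g in scores if g.startswith("ess")]

fpr = false_positive_rate(scores, null_genes, top_fraction=0.4)  # placeholder
q = benjamini_hochberg(empirical_p_values(scores, null_genes))
recall = recall_at_fdr(q, essential, fdr=0.05)                   # placeholder
```

in this toy profile ten of the 99 unexpressed genes fall in the most depleted 40% of genes, and 50 of the 51 essential genes are recovered at the placeholder threshold.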
again ceres outperformed ccr-jacks, which outperformed crisprcleanr, and increasing the number of steps in the batch correction pipeline monotonically improved essential recall for all pre-processing methods. all differences between pre-processing methods and between batch correction methods were significant, with the largest observed related-samples t-test p-value . x - (n = ).

figure : use case recall of essential genes and lineage identification. a. null-normalized mean difference (nnmd, a measure of separation between dependency scores of prior-known essential and non-essential genes), defined as the difference in means between dependency scores of essential and non-essential genes divided by the standard deviation of dependency scores of the non-essential genes. lower values of nnmd indicate better separation of essential and non-essential genes. b. false positive rates across all pre-processing methods and batch-correction pipelines. in the gene dependency profile of a given cell line, a significant dependency gene was called a false positive if that gene was not expressed in that cell line. c. recall of known essential genes across all pre-processing methods and batch-correction pipelines at % fdr. d. agreement between cell line clusters based on dpg correlation and tissue lineage labels of the corresponding cell lines, across pre-processing methods and batch-correction pipelines. e. agreement of lung crispr-cas fitness profiles with the lung cancer subtypes. for each query lung cancer cell line in turn we computed correlation scores to all other lung cancer cell lines (responses). we then ranked the response cell lines according to these correlations. for each query cell line, the rank position k of the most correlated response cell line from the same cancer subtype (matching response) was identified. a rank of k = indicates that the query cell line was closest to another cell line from the same cancer subtype. the curves show the ratio of query cell lines with a matching response within a given rank position. the proportion of query cell lines with a matching response in k = is also shown as a percentage for each dataset. the normalised area under the curve (nauc) for each dataset is shown in brackets. the figure shows the x-axis zoomed in to between and .

identification of lineage subtypes

many dependencies are context specific, reducing cellular fitness in a subset of lineages, and can be used to elucidate gene function and identify cancer-type-specific vulnerabilities. to evaluate the ability of the integrated datasets to recapitulate tissue lineages and clinical subtypes we first estimated the extent of conserved similarity between screens of cell lines derived from the same tissue lineage. we evaluated the tendency of screens of cell lines from the same lineage to yield similar results by comparing unsupervised clusterings of the batch-corrected cell line dpgs to the lineage labels of the cell lines. to this aim, we performed one hundred k-means clusterings of each of the datasets, with k equal to the number of tissue lineages screened in at least one study.
we then calculated the adjusted mutual information (ami, methods) between each dpg clustering and the partition of the cell lines induced by their lineage labels. we observed higher than chance ami between the obtained k clusters and the tissue lineages of the cell line dpgs, regardless of the starting batch-corrected dataset (largest single-sample t-test p-value of . x - , n = , figure d). under each pre-processing method the removal of one or two principal components resulted in an increased ami between cell line dpg clusters and tissue lineages.

we next measured the ability of each of the integrated datasets to separate cell lines according to lineage subtypes. the integrated datasets contain over lung cell lines. these cell lines can further be stratified into subtypes such as small cell lung carcinoma and mesothelioma, whilst clinical subtypes such as pam classifications are available for the breast cancer cell lines (figure c). to quantify the clustering of cell lines by subtype we calculated the correlation between all cell line dpgs and, for a given query cell line, the rank of the most correlated cell line from the same subtype (k-rank). for the lung cancer cell lines, the percentage of cell lines whose closest neighbour was from the same subtype (k = ) was greatest for ceres ( - % across batch correction methods), followed by crisprcleanr ( - %) and ccr-jacks ( - %), with slight improvement with the removal of or principal components (figure e).
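the k-rank described above can be illustrated with a short sketch. it is a simplification under stated assumptions: profiles are plain lists of scores over a shared gene order, and an unweighted pearson correlation stands in for the weighted metric described in the methods; the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Plain (unweighted) Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def k_rank(query, profiles, subtype):
    """Rank (1 = nearest neighbour) of the most correlated other cell line
    that shares the query's subtype; None if no such cell line exists."""
    others = [c for c in profiles if c != query]
    ranked = sorted(others,
                    key=lambda c: pearson(profiles[query], profiles[c]),
                    reverse=True)
    for k, cell_line in enumerate(ranked, start=1):
        if subtype[cell_line] == subtype[query]:
            return k
    return None


profiles = {"A": [1, 2, 3, 4], "B": [1.1, 2.0, 3.2, 3.9],
            "C": [4, 3, 2, 1], "D": [1, 3, 2, 4]}
subtype = {"A": "y", "B": "x", "C": "y", "D": "x"}
ranks = {c: k_rank(c, profiles, subtype) for c in profiles}
```

the curves in the figure correspond to the fraction of query cell lines whose k-rank is at most a given value.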
the normalised area under the curve (nauc) values showed little variation across batch correction methods and were broadly similar between the pre-processing methods: ceres (lung = . , breast = . - . ), ccr-jacks (lung = . - . , breast = . - . ) and crisprcleanr (lung = . - . , breast = . - . ) (supplementary figure ).

identification of biomarkers

genes that show a pattern of selective dependency, i.e. a strong reduction of viability upon crispr-cas targeting in a subset of cell lines, are interesting potential novel therapeutic targets. furthermore, these selective dependencies are often associated with molecular features that may explain their dependency profiles (biomarkers). we investigated each of the integrated datasets' ability to reveal tissue-specific biomarkers of dependencies. as potential biomarkers we used a set of clinically relevant cancer functional events (cfes), across different tissue types. the cfes encompass mutations in cancer driver genes, amplifications/deletions of chromosomal segments recurrently altered in cancer, hypermethylated gene promoters and microsatellite instability status. for each cfe and tissue type, we performed a student's t-test for each selective gene dependency (sgd, methods), contrasting two groups of cell lines based on the status of the cfe under consideration (present/absent), for a total of , , biomarker/dependency pairs tested.

the total number of significant biomarker/dependency associations at % fdr showed little variation across batch-correction methods. however, a significantly larger number of biomarker/dependency associations was identified when using crisprcleanr compared to ccr-jacks (largest p-value . e- , proportion test) or ceres (largest p-value . e- , proportion test), whilst little significant difference was found between ccr-jacks and ceres (smallest p-value . , proportion test) (figure a, supplementary table ).
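for a single cfe/dependency pair, the test statistic behind the association analysis can be sketched as below. this is only an illustration: it assumes the classical equal-variance (pooled) form of student's t-test, stops at the statistic and degrees of freedom, and leaves the p-value and fdr correction to a statistics library; the function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(present, absent):
    """Pooled-variance Student's t statistic comparing dependency scores of
    cell lines with the CFE (present) and without it (absent).

    Returns (t, degrees_of_freedom). A negative t means the gene is more
    depleted, i.e. a stronger dependency, when the CFE is present.
    """
    n1, n2 = len(present), len(absent)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(present) + (n2 - 1) * variance(absent)) / df
    t = (mean(present) - mean(absent)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df


t_stat, dof = student_t([-1.0, -1.2, -0.8], [0.0, 0.2, -0.2])
```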
similar results were seen when the cfes were split according to whether the biomarker was a mutation, recurrent copy number alteration or hypermethylated region (supplementary figure ).

we next examined the ability of each dataset to recover known selective dependencies in individual cell lines. we downloaded a set of oncogenic gene alterations from oncokb. after filtering out genes that tend to be common essentials (mean dependency score lower than - . in the crisprcleanr-combat dataset, where - is the median of scores of known common essentials), we considered the oncogenes as positive controls in cell lines where they had indicated oncogenic or likely-oncogenic gain-of-function alterations, and as negative controls in all others. for each oncogene, we measured the nnmd between positive and negative cell lines (figure b). we found little difference in median performance by either pre-processing method or batch correction method. we then collected the dependency scores of all oncogenes in cell lines with a corresponding oncogenic alteration and measured the receiver operating characteristic (roc) auc between them and the dependency scores of the same genes in cell lines without oncogenic alterations (figure c). by this measure, crisprcleanr outperformed ceres by . % and ccr-jacks by . %, with minimal variation across batch correction methods.
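the roc auc used above has a simple rank interpretation: it is the probability that a randomly chosen positive (oncogene in an altered cell line) is more depleted than a randomly chosen negative. a minimal sketch, assuming that lower scores mean stronger dependency and using a hypothetical function name:

```python
def roc_auc_lower_is_positive(pos_scores, neg_scores):
    """ROC AUC where more negative scores indicate the positive class.

    Counts, over all positive/negative pairs, how often the positive score
    is lower than the negative one; ties count as one half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


auc = roc_auc_lower_is_positive([-1.0, -0.5], [-0.6, 0.0, 0.1])
```

the pairwise loop is quadratic and is meant for clarity; a rank-based formulation gives the same value for large inputs.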
recovery of functional relationships

we tested the ability of each dataset to identify expected dependency relations between paralogs, gene pairs coding for interacting proteins, or members of the same complex, using gene pair annotations from publicly available databases (methods). for each pair of genes known to have a functional relationship, we selected a random pair of genes with similar mean dependency scores across cell lines to serve as a null example. we calculated the false discovery rate for the known pairs using the absolute pearson correlation of their dependency profiles versus those of the null examples. recovery of known relationships was unsurprisingly low, since many genes with known functional relationships do not exhibit selective viability phenotypes. combat+qn+pc or pc - recovered the greatest number of expected gene dependency relations at % fdr (figure d).

figure : use case biomarkers and functional relationships. a. for each tissue, pairs of cancer functional events (cfes) and dependencies were tested for significant associations between the gene dependency and the absence/presence of a biomarker (cfe). the bar chart shows the total number of significant associations at % fdr across tissue types for each of the integrated datasets. b. the per-oncogene nnmd between cell lines with and without an indicated oncogenic gain-of-function alteration (more negative is better). c. for all identified oncogenes collectively, the receiver operating characteristic (roc) auc between oncogene scores in cell lines where they have an indicated gain-of-function mutation and cell lines where they do not. d. for each dataset, the number of known gene-gene relationships recovered at % fdr.

final selection of pre-processing methods and batch-correction pipelines

comparing the performance of batch correction methods across the use-cases, we found that combat+qn outperformed combat alone, and that removing one or two principal components gave similar or noticeably better performance than combat+qn. the principal component analysis indicated that combat+qn+pc corrected for linear and non-linear effects of technical confounders including assay length, guide library and media conditions. removing the first two principal components offered little improvement over removing the first principal component alone, and we found no attributable technical bias in the gene sets enriched in the second principal component. overall, we selected combat+qn+pc as the batch correction pipeline as it had good performance over all metrics and a reduced impact on the data with respect to combat+qn+pc - , whilst still correcting for multiple technical biases.
comparing the pre-processing methods, we found that ceres outperformed the other methods in identifying essential genes and lineage subtypes, that crisprcleanr showed higher performance in the biomarker association use case, and that these two methods performed comparably to each other and better than ccr-jacks in identifying known gene-gene relationships. in conclusion, we selected both ceres and crisprcleanr as processing methods and considered the two corresponding integrated datasets as the final results of our pipeline.

advantages of the integrated datasets over the individual ones

in line with the results from all the use-cases, we estimated the benefits of the integrated datasets with respect to the individual ones in terms of increased capacity to unveil reliable sets of common essential genes (using ceres), as well as increased diversity of genetic dependencies and biomarker associations (using crisprcleanr).

to evaluate the increased coverage of molecular diversity and genetic dependencies in the integrated dataset we first estimated the increase in the number of detected gene dependencies with respect to the two individual datasets. to this aim, using the crisprcleanr processed dataset we quantified the number of genes significantly depleted in n cell lines (at % fdr, methods) for a fixed number of cell lines n (with n = , , or n ≥ ) of the integrated dataset, as well as in the individual broad and sanger datasets. the integrated dataset identified more dependencies, indicating greater coverage of molecular features and dependencies than in the individual datasets (supplementary figure a).

we then evaluated the ability of the ceres processed integrated dataset to predict common essential genes, and its performance when compared to the individual datasets and to two existing sets of common essential genes from recent publications: behan and hart. we predicted common essential genes using two methods: the th-percentile method and the adaptive daisy model (adam). the majority of genes called common essential by one of the adam or th-percentile methods were also identified by the other ( , out of , , supplementary figure b). we assigned to each of the , common essential genes a tier based on the amount of supporting evidence of their common essentiality. tier , the highest confidence set, comprised the , genes found by both methods. tier had genes found by only one method (supplementary table ).

for each predicted set of common essential genes, we calculated recall rates of known essential gene sets obtained from kegg and reactome pathways. these pathways included ribosomal protein genes, genes involved in dna replication and components of the spliceosome (methods). the integrated set of common essentials (tier and ) showed greater recall of known essential genes compared to behan and hart, and increased recall over the individual datasets for out of the gene sets (figure a). we next generated a set of genes that were never expressed across the panel of cell lines, to serve as high confidence negative controls (methods). we calculated the proportion of negative controls in each set of common essential genes. the best performance was for the hart gene set ( %) followed by the integrated dataset ( . %) (figure b). as the positive and negative controls did not cover all genes, we further investigated the genes predicted to be common essentials.
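the tiering and the two control-based summaries above amount to simple set operations. the sketch below is illustrative only: the function names are hypothetical, the gene names are toy placeholders, and the two tiers are labelled by the evidence supporting them rather than by number.

```python
def tier_common_essentials(adam_calls, percentile_calls):
    """Split predicted common essentials by supporting evidence.

    Returns (both_methods, single_method): genes called by both the ADaM and
    percentile methods, and genes called by exactly one of them.
    """
    adam, pct = set(adam_calls), set(percentile_calls)
    both_methods = adam & pct
    single_method = (adam | pct) - both_methods
    return both_methods, single_method


def recall(predicted, reference):
    """Fraction of a reference essential gene set found in the prediction."""
    reference = set(reference)
    return len(reference & set(predicted)) / len(reference)


def negative_control_fraction(predicted, never_expressed):
    """Fraction of predicted genes that are never-expressed negative controls."""
    predicted = set(predicted)
    return len(predicted & set(never_expressed)) / len(predicted)


both, single = tier_common_essentials(["g1", "g2", "g3"], ["g2", "g3", "g4"])
combined = both | single
ribosomal_recall = recall(combined, ["g1", "g2", "g9", "g10"])
control_share = negative_control_fraction(combined, ["g4", "g7"])
```

higher recall and a lower share of negative controls both indicate a more reliable set of common essential genes.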
the integrated dataset predicted the largest number of common essentials, with genes found in the integrated dataset alone. these genes were enriched for cell cycle genes (fdr . e- ) and mitochondrial gene expression (fdr . e- ), indicative of essential cellular processes. similar results were observed for the , genes in the integrated set of common essentials but in neither of the existing datasets (behan and hart) (supplementary table ).

we next asked whether the crisprcleanr processed integrated dataset was able to unveil additional significant gene dependencies and cfe/gene-dependency statistical interactions compared to either one of the broad or sanger (individual) datasets. performing systematic biomarker analysis using cfes on cell lines from individual tissue lineages unveiled additional significant associations in the integrated dataset (when considering only cfe/gene-dependency pairs testable in the individual datasets at % fdr) with respect to those using the sanger dataset alone, and with respect to the broad dataset (supplementary table ). examples included decreased dependency on mdm in tp mutant lung cell lines for the sanger dataset, and increased dependency on stag in stag mutated central nervous system cancer cell lines for the broad dataset (figure c). furthermore, tissue-specific significant associations identified in the integrated dataset were tested but not found significant in either the broad or the sanger dataset (figure d).
figure : advantages of an integrated dataset. a. recall of essential gene sets for the integrated dataset, across different tiers, compared to two previously published gene sets (behan and hart). b. proportion of genes in the common essential gene sets that are constitutively not expressed across the panel of cell lines and therefore likely to be false positive results. c. examples of significant associations between genes and features found in the integrated dataset compared to the individual dataset. d. examples of significant associations found in the integrated dataset that were not significant in either of the individual datasets. e. the boxplots contain random samples of between % and % of the overlapping cell lines (number of cell lines in each sample indicated on the x-axis). for each sample, the pearson correlation of the dpgs following combat correction compared to the integrated dataset was calculated for each pre-processing method. f. the average silhouette width (asw) for each downsampled dataset, calculated using the institute of origin as the cluster label. an asw close to zero indicates near-random clustering, meaning that the samples do not cluster by the origin of the screen and batch effects have been removed.

sample size requirements for efficient data integration

to further increase the coverage of a cancer dependency map, new crispr-cas screens should be integrated into the existing datasets as they are generated. to aid in this integration we estimated the minimum number of overlapping cell lines that should be screened to efficiently calculate and correct batch effects. we performed a downsampling analysis on the cell lines screened at both sanger and broad, ranging from % to %, and used the obtained subset of cell lines to estimate and correct batch effects using combat. following this, for each cell line dpg generated at either institute, we computed the pearson correlation between its corrected version and the version obtained when batch correction used all overlapping cell lines (figure e). we found a high degree of correlation between datasets at all levels of downsampling, with the minimum of samples still reducing batch effects when compared to no batch correction (n = ) (supplementary figure c).

we next evaluated the batch correction using the average silhouette width (asw) of the clustering induced by the institute of origin of the cell lines, as a measure of the extent to which cell lines from the same institute clustered together. as expected, as the number of samples used to estimate and correct the batch effect decreased, the dpgs increasingly clustered by the batch of origin (figure f). the asw and pearson correlation metrics both showed clear convergence with increasing sample size, and at the same rate. given the convergence of these metrics, the results showed that the overlapping cell lines used were sufficient to maximise the batch correction using combat. further, the downsampling analysis showed that convergence was reached at cell lines and that between and cell lines would be sufficient to provide a batch-corrected dataset that is highly correlated (over . ) with that obtained when estimating and correcting batch effects using more than cell lines.

the overlapping cell lines contained cell lines from different lineages. to investigate the impact of the lineage composition of the cell lines on the batch correction we also used a single lineage to estimate the batch effects. in the overlapping cell lines the lung lineage had the most cell lines ( in total). we subsampled the lung cell lines to include , or cell lines (supplementary figure de) and found little difference in performance between using a single lineage and a mixture of lineages, indicating that this is not a major factor for estimating batch effects.

discussion

the integration of data from different high-throughput functional genomics screens is becoming increasingly important in oncology research to adequately represent the diversity of human cancers. integrating crispr-cas screens performed independently and/or using distinct experimental protocols requires correction and benchmarking strategies to account for technical biases, batch effects and differences in data-processing methods. here, we proposed a strategy for the integration of crispr-cas screens and evaluated methods accounting for biases within and between two dependency datasets generated at the broad and sanger institutes. our results show that established batch correction methods can be used to adjust for linear and non-linear study-specific biases. our analyses and assessment yielded two final integrated datasets of cancer dependencies across cell lines.
in contrast to existing databases of crispr-cas screens, our integrated datasets are corrected for batch effects, allowing for their joint analysis. following integration, dependency profiles of cell lines from the same tissue lineage and cancer-specific subtypes show good concordance. our integrated datasets cover a greater number of genetic dependencies, and the increased diversity of screened models allows additional associations between biomarkers and dependencies to be identified.

the integrated datasets were the output of two orthogonal pre-processing methods, crisprcleanr and ceres. the use-case analysis showed that ceres (which borrows information across screens) yields a final dataset better able to identify prior-known essential and non-essential genes and to cluster cell lines by lineage. in contrast, crisprcleanr (a per-sample method) was better able to detect associations between selective dependencies and potential biomarkers, and had better recall of known oncogenic addictions. therefore, results from both processing methods together provide the best overall data-driven functional cancer dependency map.

the data integration strategies and sample size guidelines outlined here can be used with future and additional crispr-cas datasets to increase coverage of cancer dependencies. this will be important for oncological functional genomics, for the identification of novel cancer therapeutic targets, and for the definition of a global cancer dependency map.
further, as library design improves, we would expect the coverage and accuracy of the integrated datasets to also improve.

data availability

the final integrated datasets are available for download at https://figshare.com/projects/integrated_crispr/ . the data will also be made accessible through the depmap (https://depmap.org) and score (https://score.depmap.sanger.ac.uk) web portals in early .

code availability

scripts and software packages implementing the integration pipeline described in this manuscript, and needed to reproduce results and figures, are available on github at https://github.com/depmap-analytics/integratedcrispr with data sources available on figshare: https://figshare.com/projects/integrated_crispr/ .

acknowledgments

this work was partially funded by open targets [project otar ] and by the wellcome trust [grant ]. we thank leo parts for a number of insightful discussions.

author contributions

cp conceived the study, designed, implemented and performed analyses, assembled figures, curated data and wrote the manuscript. jmd conceived the study, designed, implemented and performed analyses, assembled figures, and contributed to manuscript writing. ib contributed to pipeline implementation. eg performed analyses, assembled figures and revised the manuscript. hn assembled figures and revised the manuscript. ek, dvdm, ab, hl and pj contributed to data curation. jmm, mjg, and at revised the manuscript and contributed to study supervision.
fi conceived the study, designed analyses, contributed to figure production, wrote the manuscript, acquired funds and supervised the study.

competing interests

mjg and fi receive funding from open targets, a public-private initiative involving academia and industry. mjg receives funding from astrazeneca and performs consultancy for sanofi. fi performs consultancy for the joint cruk - astrazeneca functional genomics centre. at is a consultant for tango therapeutics and cedilla therapeutics. jmd, jm and at receive funding from the cancer dependency map consortium, but no consortium member was involved in or influenced this study. all the other authors declare no competing interests.

methods

preprocessing data

sanger data processed with crisprcleanr were obtained from the score website (https://score.depmap.sanger.ac.uk/). the crisprcleanr corrected counts were used as input into jacks, for the ccr-jacks processing method. raw counts and copy number profiles downloaded for the sanger dataset were processed with ceres. the broad data processed with ceres (unscaled gene effect scores, version q ) were downloaded from the broad depmap portal. the raw counts for broad data q were processed with crisprcleanr, and the crisprcleanr corrected counts were processed with jacks. gene names were matched across the broad and sanger datasets by updating both to the current version of hugo gene symbols from the hgnc website.
missing entries were mean imputed for the principal component removal and then re-assigned as na in the final matrix. cell lines processed by both ceres and crisprcleanr were used for analysis. tissue annotations for each cell line were obtained from the cell model passports (https://cellmodelpassports.sanger.ac.uk/) . batch correction pipelines the dependency profiles across genes (dpgs) for overlapping cell lines from each institute were first quantile normalized using the preprocesscore package in r . screen quality adjustments were made by fitting a spline to the average gene fold change across cell line dpgs. each dpg was then adjusted to remove the difference between the fitted spline and the diagonal. the overlapping cell lines were then batch corrected using three different methods. a standard least squares model was fitted in r. the combat correction was performed using the sva package in r . batch correction pipelines’ assessment and weighted pearson correlation metric cell lines’ rank neighborhoods were based on a weighted pearson correlation metric. the weights were defined as the absolute mean (over the broad and sanger datasets) of a gene dependency signal skewness across the overlapping cell lines for the broad and sanger datasets. using skewness upweights genes with a variable and sufficiently selective fitness profile whilst downweighting those that show weak/no-signal or unselective dependencies. then for each query dpg we ranked all the others based on how similar they were to the fixed one in decreasing order, according to the wpearson scores. for each position k in the resulting rank we then defined a k-neighborhood of the query dpg composed of all the other dpgs whose rank position was ≤ k. finally we determined the number of cell line dpgs that .cc-by-nc-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. 
had the dpg derived from screening the same cell line in the other dataset (a matching dpg) in its k-neighborhood. the final rank for each cell line was defined as the minimum rank obtained when comparing that cell line's dpg from the broad dataset to all dpgs and, similarly, its dpg from the sanger dataset to all dpgs.

analysis of principal components

the first two principal components (pcs) were extracted from combat corrected crisprcleanr data using the prcomp function in r. the top genes (according to the absolute value of their pc loadings) were selected for enrichment analysis. the gene lists were used as input into the gsea website (https://www.gsea-msigdb.org/) and were tested against the gene ontology biological processes, hallmark and canonical pathway databases. the top significantly enriched (q-value < . ) gene sets were downloaded from the website.

batch correction extended to cell lines

the combat estimates (pooled mean, variance and empirical bayes adjustments of mean and standard deviation) were computed for each batch from the analysis of the cell lines common to both initial datasets. the combat correction using these estimates was then applied to all screens, i.e. the union of the two initial datasets. in particular, each individual cell line dpg was shifted and scaled gene-wise using the batch correction vectors output by combat.
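The gene-wise shift and scale described above can be sketched as follows. This is a minimal illustration, assuming the standard location/scale form of the ComBat adjustment without covariates; the function and argument names (`apply_combat_estimates`, `gamma`, `delta_sd`) are hypothetical and are not taken from the sva package or from the authors' code.

```python
import numpy as np

def apply_combat_estimates(dpg, pooled_mean, pooled_var, gamma, delta_sd):
    """Adjust one cell line's DPG gene-wise with precomputed batch estimates.

    All arguments are 1-D arrays over the same genes. `gamma` and `delta_sd`
    stand for the batch's additive (mean) and multiplicative (standard
    deviation) adjustments; these names are assumptions for illustration.
    """
    dpg = np.asarray(dpg, dtype=float)
    pooled_mean = np.asarray(pooled_mean, dtype=float)
    pooled_sd = np.sqrt(np.asarray(pooled_var, dtype=float))
    # Standardize each gene with the pooled mean and standard deviation.
    z = (dpg - pooled_mean) / pooled_sd
    # Remove the batch shift and scale, then return to the original units.
    return pooled_sd * (z - np.asarray(gamma)) / np.asarray(delta_sd) + pooled_mean

# Toy example with three genes; missing values simply stay missing.
corrected = apply_combat_estimates(
    dpg=[1.0, 2.0, np.nan],
    pooled_mean=[0.0, 1.0, 0.0],
    pooled_var=[4.0, 1.0, 1.0],
    gamma=[0.5, -1.0, 0.0],
    delta_sd=[2.0, 0.5, 1.0],
)
```

Because the estimates are computed once on the overlapping cell lines, the same vectors can be reused for any screen from the same batch, which is what allows the correction to be extended beyond the overlap.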
further adjustments were then applied to all screens including quantile normalization, and the removal of either the first principal component of the joint datasets or the first two. finally, dpgs for overlapping cell lines passing a similarity threshold (detailed below) were averaged.

across the three pre-processing methods the number of cell lines that matched their counterparts exactly after combat correction ranged from % - % (figure b), suggesting that under all pre-processing methods there remained cell lines whose dpgs diverged between studies. for each of the cell lines that matched their counterpart as the first neighbor we considered their distances (1 - wpearson) as a measure of the variability in distance profiles between dpgs of the same cell line across institutes. we called divergent dpgs those with a distance greater than the th percentile of distances from matching cell lines. for cell lines with divergent dpgs across all three processing methods we selected the dpg from the screen with the highest quality to be included in the integrated datasets. as a quality metric we used the null-normalized mean difference (nnmd, defined in the main text) and took its consensual value across the three datasets (resulting from applying ceres, ccr-jacks and crisprcleanr).
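The wPearson distance and the percentile rule for divergent DPGs can be illustrated with a short sketch. It assumes the usual definition of a weighted Pearson correlation (weighted means, weighted covariance) and takes the gene weights as given. The percentile is left as a parameter because its value is not legible in the text, and the function names are hypothetical.

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation between two DPGs, ignoring missing genes."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y, w = x[keep], y[keep], w[keep]
    w = w / w.sum()                      # normalize weights to sum to one
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)

def is_divergent(distance, matching_distances, percentile):
    """Flag a DPG pair whose distance exceeds a percentile of matching pairs."""
    return distance > np.percentile(matching_distances, percentile)

# Toy profiles over four genes; weights would come from the skewness step.
weights = np.array([1.0, 2.0, 3.0, 4.0])
broad = np.array([1.0, 2.0, 3.0, 4.0])
sanger = np.array([2.0, 1.0, 4.0, 3.0])
distance = 1.0 - weighted_pearson(broad, sanger, weights)
```

With uniform weights the function reduces to the ordinary Pearson correlation, which is a convenient sanity check.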
agreement between dependency profile clusterings and cell line tissue labels

we selected genes with the highest variance in the ceres combat integrated dataset and performed repeated k-means clusterings of cell lines using the high variance genes for each pre-processing and batch-correction method. for each clustering, we calculated the adjusted mutual information between the obtained clusters and the cell line tissue labels as specified in the annotation provided by the sample_info file of the depmap_public_ q dataset using sklearn's python function adjusted_mutual_info_score (https://scikit-learn.org/stable/).

recall of known gene relationships

we assembled a set of functionally related gene pairs using paralogs identified by ensemblcompara, protein-protein interactions identified by li et al, and corum complex comemberships. for a given dataset, for each pair of related genes, we calculated a pearson correlation coefficient between those genes' dependency scores across cell lines. we then binned each gene that appeared in the list of known gene relationships according to its mean gene score using equally spaced bins. for each pair in the set of related gene pairs, we chose one gene as the query gene and replaced its related partner with another randomly selected gene of similar gene mean, i.e. belonging to the same bin, excluding genes known to be related to the query gene. we calculated pearson's correlation coefficients between these randomly selected gene pairs to generate a null distribution, from which we calculated empirical p-values and benjamini-hochberg fdrs for known related gene pairs. ensuring that the pairs of genes used in the null distribution have similar distributions of mean gene effect as the pairs of known related genes is necessary because variable screen quality can produce a high but artifactual correlation between any pair of common essential genes, and corum is highly biased towards common essentials.
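The mean-matched null and the resulting empirical p-values can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the replacement gene is drawn from the bin of the original partner, p-values are one-sided (null correlation at least as large as the observed one) with a pseudo-count of one, and every function name is hypothetical.

```python
import numpy as np

def matched_null_partner(query, partner, gene_means, related, n_bins, rng):
    """Draw a random gene from the partner's mean-score bin, unrelated to the query."""
    genes = list(gene_means)
    means = np.array([gene_means[g] for g in genes], dtype=float)
    # Equally spaced bins over the observed range of mean gene scores.
    edges = np.linspace(means.min(), means.max(), n_bins + 1)
    bins = np.clip(np.digitize(means, edges) - 1, 0, n_bins - 1)
    bin_of = dict(zip(genes, bins))
    excluded = {query, partner} | set(related.get(query, ()))
    candidates = [g for g in genes
                  if bin_of[g] == bin_of[partner] and g not in excluded]
    return str(rng.choice(candidates)) if candidates else None

def empirical_pvalues(observed, null):
    """One-sided empirical p-values of observed correlations against the null."""
    null = np.sort(np.asarray(null, dtype=float))
    observed = np.asarray(observed, dtype=float)
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    return (n_ge + 1) / (null.size + 1)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(monotone, 1.0)
    return q

rng = np.random.default_rng(0)
gene_means = {"a": -1.0, "b": -0.9, "c": -0.95, "d": 0.0, "e": 0.05}
null_partner = matched_null_partner("a", "b", gene_means, {"a": {"b"}}, 2, rng)
```

Drawing the replacement from the same bin keeps the null pairs comparable in mean gene effect to the known pairs, which is the property the text identifies as necessary.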
the need for mean-matched null pairs is discussed further in the comparisons of batch corrections in dempster et al.

unexpressed false positives

we defined a gene as unexpressed in a cell line if the log (transcripts per million + ) of its depmap expression was less than . . any score of an unexpressed gene in a cell line was called a false positive if it fell in the bottom % of gene scores for that cell line.

identifying selective dependencies

normlrt and the likelihood of the normal distribution were calculated in r using the mass package. for the skew t-distribution the st.mple function from the sn package was used to calculate the likelihood. if the fitting procedure failed, different degrees of freedom were used iteratively until a solution was found. the degrees of freedom used in order were , , , , and .

systematic association test between molecular features and gene dependencies

we performed a systematic two-sample unpaired student's t-test (with the assumption of equal variance between compared populations) to assess the differential essentiality of each gene across a dichotomy of cell lines defined by the status (present/absent) of each cfe in turn. we tested genes whose normlrt values were greater than in any integrated dataset.
from these tests, we obtained p-values against the null hypothesis that the two compared populations had an equal mean, with the alternative hypothesis indicating an association between the tested cfe/gene-dependency pair. p-values were corrected for multiple hypothesis testing using benjamini–hochberg (method 'fdr' using the p.adjust function in r). we also estimated the effect size of each tested association using cohen's delta (Δfc), i.e. the difference in population means divided by their pooled standard deviation.

evaluating known selective dependencies

a table of all annotated oncogene variants was downloaded from oncokb. the table was filtered first for genes that were (likely) oncogenic and alterations that were (likely) gain-of-function or switch-of-function. for each alteration, the depmap public q mutation and fusion calls were used to identify which cell lines had the alteration. these cell lines were treated as positive controls for the gene in question, with all other cell lines treated as negative controls. only oncogenes with at least one positive cell line were retained. for each integrated dataset, we calculated the roc auc between all positive oncogene-cell line pairs and negative pairs. then, for each oncogene with at least two positive cell lines, we calculated the nnmd between its positive and negative cell lines.

identification of common essential genes via the th percentile method

the th percentile method finds for each gene the cell line on the boundary of its th percentile least dependent cell lines. it then calculates the rank of that gene in that cell line, by sorting all the genes based on their dependency score in increasing order. a mixture of
two normal distributions is then fitted to the rank positions of all genes. those genes with ranks below the crossover point of these two distributions are labeled as common essentials.

adam method

binary depletion matrices for the integrated datasets were calculated as outlined in the next section and used with the adam method as described in behan et al. the adam method determines the number of cell lines dependent on a gene required to call that gene a common essential. the number of cell lines is calculated by maximizing the tradeoff between true positive rate (using a set of known prior essential genes) and the deviance from the null expected rate (calculated using random permutations of the binary depletion matrix). common essential genes were identified for each tissue separately (according to the cell line annotation from the cell model passports) and were then used as input into adam to determine pan-cancer common essential genes.

binary depletion calls

binary depletion calls were computed by considering each cell line dpg as a rank-based classifier of essential/non-essential genes (with gene rank positions determined by their fitness effect, i.e. average depletion fold-change of targeting single guide rnas abundance at the end of the assay with respect to plasmid counts). the fitness effect threshold was then fixed as that corresponding to the largest rank position r guaranteeing a false discovery rate (fdr) < %, when the predicted essential genes are those with a rank position ≤ r. this allowed us to assign to each gene in each cell line, in each of the two datasets, a binary dependency score.
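The rank-based thresholding can be sketched in a few lines. The sketch assumes that precision is computed only over the reference essential and non-essential genes ranked at or above each position, and that a gene is called depleted when its fold change lies strictly below the threshold. The FDR level is a parameter because its value is not legible in the text, and the names `fdr_threshold` and `depleted_genes` are hypothetical.

```python
def fdr_threshold(logfc, essential, nonessential, fdr):
    """Return (k*, f*): the largest rank with precision >= 1 - fdr and its logFC.

    `logfc` maps each gene to its depletion log fold change for one cell line.
    """
    ranked = sorted(logfc, key=logfc.get)   # most depleted genes first
    tp = fp = 0
    k_star, f_star = None, None
    for k, gene in enumerate(ranked, start=1):
        if gene in essential:
            tp += 1
        elif gene in nonessential:
            fp += 1
        # Precision over the reference genes ranked at or above position k.
        if tp + fp and tp / (tp + fp) >= 1 - fdr:
            k_star, f_star = k, logfc[gene]
    return k_star, f_star

def depleted_genes(logfc, f_star):
    """Genes whose log fold change is strictly below the threshold."""
    if f_star is None:
        return set()
    return {gene for gene, value in logfc.items() if value < f_star}

# Toy cell line: three reference essentials, two non-essentials, one other gene.
logfc = {"e1": -3.0, "e2": -2.5, "x": -2.0, "n1": -1.5, "e3": -1.0, "n2": 0.0}
k_star, f_star = fdr_threshold(logfc, {"e1", "e2", "e3"}, {"n1", "n2"}, fdr=0.3)
calls = depleted_genes(logfc, f_star)
```

The loop keeps the last position that satisfies the precision requirement rather than stopping at the first failure, so a temporary dip in precision does not truncate the list of calls.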
to identify significantly depleted genes for a given cell line at a % fdr, we ranked all the genes in the cell line dpg in increasing order based on their depletion log fold-changes. we used the ranked list to calculate the precision curve using a set of prior known essential (e) and non-essential (n) genes, respectively, derived from hart et al. to estimate the rank position corresponding to the % fdr threshold we calculated for each rank position k, a set of predicted essential genes p(k) = {s ∈ e ∪ n: r(s) ≤ k }, with r(s) indicating the rank position of s, and the corresponding positive predictive value (or precision) ppv(k) as:

ppv(k) = |p(k) ∩ e| / |p(k)|

we then determined the largest rank position k* with ppv(k*) ≥ . (equivalent to a fdr ≤ . ). the % fdr logfcs threshold f* was defined as the logfcs of the gene s such that r(s) = k*. we called all genes with a logfc < f* as significantly depleted at % fdr. binary dependency matrices were defined as gene by cell lines matrices with non null entries corresponding to significant dependency genes at % fdr, for each cell line, i.e. column.

positive controls for common essentials

to generate sets of prior known common essential genes we downloaded gene sets from msigdb (v . ) using the r package qusage. the gene sets used from kegg were kegg_spliceosome, kegg_ribosome, kegg_proteasome, kegg_rna_polymerase and kegg_dna_replication.
for the histones gene set we combined two reactome gene sets reactome_hats_acetylate_histones and reactome_hdacs_deacetylate_histones as well as the curated histones gene set from .

negative controls for common essentials

we compiled a set of negative controls for the common essential genes as those genes that were not expressed across all cell lines. we defined a gene as unexpressed across the panel of cell lines using the log (transcripts per million + ) of its ccle expression and the th percentile method (the input into the adam package (available at https://github.com/depmap-analytics/adam ) performing the th percentile method was - *log (tpm+ ) to ensure correct ranking). a gene defined as constitutively unexpressed was therefore one that was still lowly expressed in its highly ranked ( th percentile) most expressed cell line.

downsampling for batch correction sample sizes

we downsampled times the overlapping cell lines at different levels between % and %. random samples were generated using probabilities of selecting a cell line based on the relative proportions of each cell line lineage in the overlapping data set. using the downsampled set of overlapping cell lines combat was used to calculate the batch adjustment vectors. the batch adjustment vectors were then applied to all , cell lines.
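One way to draw such a lineage-weighted sample is sketched below. It assumes each cell line's selection probability is proportional to the share of its lineage in the overlapping set, and that sampling is without replacement; the text does not spell out either detail, and the function name `downsample_by_lineage` is hypothetical.

```python
import numpy as np

def downsample_by_lineage(lineages, fraction, rng):
    """Sample cell line indices with probabilities set by lineage proportions."""
    lineages = np.asarray(lineages)
    values, counts = np.unique(lineages, return_counts=True)
    proportion = dict(zip(values, counts / counts.sum()))
    # Each cell line inherits the relative frequency of its lineage.
    p = np.array([proportion[lineage] for lineage in lineages], dtype=float)
    p /= p.sum()
    size = max(1, int(round(fraction * lineages.size)))
    return rng.choice(lineages.size, size=size, replace=False, p=p)

rng = np.random.default_rng(1)
lineages = ["lung", "lung", "lung", "skin", "skin", "bone"]
sample = downsample_by_lineage(lineages, fraction=0.5, rng=rng)
```

Repeating the draw with different seeds and fractions would give the set of downsampled overlaps on which the batch adjustment vectors are re-estimated.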
the correlation between a cell line's fold changes batch corrected using the downsampled datasets and those corrected using the full set of overlapping cell lines was calculated and compared to the correlation with no batch correction. to evaluate the batch correction we also used the average silhouette width as a measure of clustering. we calculated the average silhouette width for each batch corrected data set (using samples of the overlapping cell lines) using the institute of origin as the cluster label. the average silhouette width is for perfect clustering (or complete separation of cell lines by the institute of origin) with indicating random performance of the clusters.

references

prasad, v. perspective: the precision-oncology illusion. nature , s ( ).
behan, f. m. et al. prioritization of cancer therapeutic targets using crispr-cas screens. nature , – ( ).
tsherniak, a. et al. defining a cancer dependency map. cell , – .e ( ).
mcdonald, e. r., rd et al. project drive: a compendium of cancer dependencies and synthetic lethal relationships uncovered by large-scale, deep rnai screening. cell , – .e ( ).
shalem, o. et al. genome-scale crispr-cas knockout screening in human cells. science , – ( ).
koike-yusa, h., li, y., tan, e.-p., velasco-herrera, m. d. c. & yusa, k. genome-wide recessive genetic screening in mammalian cells with a lentiviral crispr-guide rna library. nat. biotechnol. , – ( ).
wang, t., wei, j. j., sabatini, d. m. & lander, e. s. genetic screens in human cells using the crispr-cas system. science , – ( ).
steinhart, z. et al. genome-wide crispr screens reveal a wnt-fzd signaling circuit as a druggable vulnerability of rnf -mutant pancreatic tumors. nat. med. , – ( ).
shi, j. et al. discovery of cancer drug targets by crispr-cas screening of protein domains. nat. biotechnol. , – ( ).
tzelepis, k. et al. a crispr dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukemia. cell rep. , – ( ).
hart, t. et al.
high-resolution crispr screens reveal fitness genes and genotype-specific cancer liabilities. cell , – ( ).
meyers, r. m., bryan, j. g., mcfarland, j. m. & weir, b. a. computational correction of
copy number effect improves specificity of crispr–cas essentiality screens in cancer cells. nature ( ).
wellcome sanger institute. cancer dependency map. https://depmap.sanger.ac.uk/.
broad institute of harvard and mit. cancer dependency map. https://depmap.org/.
feng, f. y. & gilbert, l. a. lethal clues to cancer-cell vulnerability. nature vol. – ( ).
dempster, j. et al.
agreement between two large pan-cancer genome-scale crispr knock-out datasets. nature communications in press.
iorio, f. et al. unsupervised correction of gene-independent cell responses to crispr-cas targeting. bmc genomics , ( ).
allen, f. et al. jacks: joint analysis of crispr/cas knockout screens. genome res. , – ( ).
project score. https://score.depmap.sanger.ac.uk/.
depmap, b. depmap q public. ( ) doi: . /m .figshare. .v .
project achilles. https://figshare.com/articles/depmap_ q _public/ .
aguirre, a. j. et al. genomic copy number dictates a gene-independent cell response to crispr/cas targeting. cancer discov. , – ( ).
gonçalves, e. et al. structural rearrangements generate cell-specific, gene-independent crispr-cas loss of fitness effects. genome biol. , ( ).
doench, j. g. et al. rational design of highly active sgrnas for crispr-cas -mediated gene inactivation. nat. biotechnol. , – ( ).
leek, j. t., johnson, w. e., parker, h. s., jaffe, a. e. & storey, j. d. the sva package for removing batch effects and other unwanted variation in high-throughput experiments. bioinformatics , – ( ).
liberzon, a. et al. molecular signatures database (msigdb) . . bioinformatics , – ( ).
dempster, j. m. et al. agreement between two large pan-cancer crispr-cas gene dependency data sets. nat. commun. , ( ).
lagziel, s., lee, w. d. & shlomi, t. inferring cancer dependencies on metabolic genes from large-scale genetic screens. bmc biol. , ( ).
dempster, j. m., rossen, j., kazachkova, m. & pan, j. extracting biological insights from the project achilles genome-scale crispr screens in cancer cell lines. biorxiv ( ).
iorio, f. et al. a landscape of pharmacogenomic interactions in cancer. cell , – ( ).
chakravarty, d. et al. oncokb: a precision oncology knowledge base. jco precis oncol , ( ).
oncokb. all annotated variants. oncokb.org http://oncokb.org/api/v /utils/allannotatedvariants ( ).
aken, b. l. et al. ensembl . nucleic acids res.
, d –d ( ).
li, t. et al. a scored human protein-protein interaction network to catalyze genomic interpretation. nat. methods , – ( ).
ruepp, a. et al. corum: the comprehensive resource of mammalian protein complexes-- . nucleic acids res. , d – ( ).
hart, t. et al. evaluation and design of genome-wide crispr/spcas knockout screens. g , – ( ).
kanehisa, m. et al. kegg for linking genomes to life and the environment. nucleic acids res. , d – ( ).
fabregat, a. et al. the reactome pathway knowledgebase. nucleic acids res. , d –d ( ).
lenoir, w. f., lim, t. l. & hart, t. pickles: the database of pooled in-vitro crispr knockout library essentiality screens. nucleic acids res. , d –d ( ).
rauscher, b., heigwer, f., breinig, m., winter, j. & boutros, m. genomecrispr - a database for high-throughput crispr/cas screens. nucleic acids research vol. d –d ( ).
gonçalves, e., thomas, m., behan, f. m., picco, g. & pacini, c. minimal genome-wide human crispr-cas library. biorxiv ( ).
elmentaite, r., noell, g., turner, g., iyer, v. & parts, l. minimized double guide rna libraries enable scale-limited crispr/cas screens. biorxiv ( ).
van der meer, d. et al. cell model passports—a hub for clinical, genetic and functional datasets of preclinical cancer models. nucleic acids res. , d –d ( ).
bolstad, b. m. preprocesscore: a collection of pre-processing functions. r package version .
leek, j. t. et al. sva: surrogate variable analysis. r package version . .
depmap, b. depmap q public. ( ) doi: . /m .figshare. .v .
ripley, b. et al. package 'mass'. cran r , ( ).
Identification and design of vinyl sulfone inhibitors against cryptopain- , a cysteine protease from cryptosporidiosis-causing Cryptosporidium parvum

Arpita Banerjee

Author contributions:
Designed the computational experiments: AB
Performed the computational experiments: AB
Analyzed the data: AB
Wrote the paper: AB

Correspondence: arpita. @gmail.com

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND . International license (http://creativecommons.org/licenses/by-nc-nd/ . /). This version posted January , . ; bioRxiv preprint doi: https://doi.org/ . /

Abstract:
Cryptosporidiosis, a disease marked by diarrhea in adults and stunted growth in children, is associated with the unicellular protozoan pathogen Cryptosporidium, often the species parvum. Cryptopain- , a cysteine protease characterized in the genome of Cryptosporidium parvum, was earlier shown to be inhibited by a vinyl sulfone compound called K (or K- ). Cysteine proteases have long been established as valid drug targets, which can be covalently and selectively inhibited by vinyl sulfones. This computational study was initiated to identify purchasable vinyl sulfone compounds that could possibly inhibit cryptopain- with higher efficacy than K . Docking simulations identified a number of such possibly better inhibitors.
The work was extended to probe the enzymatic pocket of cryptopain- , through in-silico mutations, to derive a map of receptor-ligand interactions in the docked complexes. The idea was to provide crucial clues to aid the design of inhibitors that would bind the protease well by making favorable interactions with important residues of the enzyme. The analyses favored placement of ligands towards the front of the enzymatic cleft and disfavored interactions deep within. The S ’ and S subsites of the enzyme preferred to remain occupied by polar ligand subgroups. Reasonably distanced ring systems and polar backbones of ligands were desired across the cleft. Large as well as inflexible subgroups were not tolerated, although double-ringed systems such as substituted naphthalene, especially in S , were exceptions. The S subsite, which is typically a specificity determinant in papain (C ) family cysteine proteases such as cathepsin L-like cryptopain- , can possibly accommodate polar and hydrophobic ligand subgroups alike.

Keywords: vinyl sulfone inhibitors, cryptopain- , cysteine protease, molecular modeling, covalent docking, in-silico mutational analysis, drug design.

Running title: Identification and design of vinyl sulfone inhibitors against cryptopain-

Introduction:
Cryptosporidiosis is an intestinal disease that is clinically manifested by diarrhea in adults [ ] and stunted growth in children [ ]. The infection can persist indefinitely in immunocompromised individuals such as HIV patients, and could be fatal in the form of life-threatening diarrhea [ ].
The disease is caused by the unicellular protozoan parasite Cryptosporidium, which infects humans and animals [ ] through consumption of contaminated water and/or ingestion of contaminated food products [ ]. The majority of infections are caused by the Cryptosporidium species hominis and parvum [ ] [ ]. A cysteine protease named cryptopain- , characterized in the genome of Cryptosporidium parvum [ ], most likely facilitates host cell invasion and nutritional uptake (through proteolytic degradation) [ ] [ ] [ ]. The pathogenic enzyme, being cathepsin L-like, belongs to the papain-like or clan CA (family C ) cysteine proteases, which in general have been of particular use as therapeutic targets against parasitic infections [ ]. The catalytic triad of such enzymes is constituted by Cys, His and Asn residues [ ], [ ]. Proteases orthologous to cryptopain- have been validated as drug targets, viz. cruzain (from the Chagas’ disease agent Trypanosoma cruzi), rhodesain (from the sleeping sickness-causing Trypanosoma brucei), falcipain- (from the malarial parasite Plasmodium falciparum) and SmCB (from the intestinal schistosomiasis-causing Schistosoma mansoni) [ ] [ ], among others.

Vinyl sulfone compounds have been particularly effective inhibitors of such parasitic cysteine proteases [ ] [ ] [ ] [ ] [ ]. These inhibitors form a covalent bond with the active site Cys thiol to bind the proteases, thereby irreversibly blocking the enzymatic pocket. Such inhibition interferes with the pathogenic activity of the proteases, which would otherwise participate in a general acid-base reaction for hydrolysis of host-protein peptide bonds [ ]. Molecular modeling studies had previously shown that, unlike in serine proteases (which also cleave peptide bonds and have Ser in their active site), the catalytic His in cysteine proteases remains protonated to act as a general acid [ ].
Hydrogen bonding between the protonated His and the sulfone oxygen of a vinyl sulfone compound polarizes the vinyl group of the ligand and imparts a positive charge on its beta carbon, which promotes nucleophilic attack by the negatively charged Cys thiolate of the protease’s active site. The vinyl sulfone class of inhibitors is preferred over other covalent inhibitors because of its selectivity for cysteine proteases over serine proteases, relative inertness in the absence of the target protease [ ] [ ], and safe pharmacokinetic profile [ ] [ ].

The peptidyl vinyl sulfones that have been co-crystallized with cysteine proteases so far reveal that the -CO-NH- backbones of the pharmacologically active compounds fit snugly in the enzymatic cleft, with the ligand sidechains (or subgroups) protruding into the different subsites of the proteases. The subgroup near the vinyl carbon that undergoes nucleophilic attack is equivalent to P in the inhibitor/substrate [ ]. Therefore, ligand sidegroups starting from the vinyl side are designated P , P … and interact with the S , S … protease subsites. The ligand subgroups beyond the sulfonyl are referred to as P ’, P ’… and occupy the S ’, S ’… subsites on the prime side of the enzyme (Figure ). Typically, the P -S interaction is the key specificity determinant in papain (C ) family cysteine proteases [ ] [ ] like cryptopain- .
K (or K- ), a vinyl sulfone that binds cryptopain- as its target according to inhibitor competition experiments with an active site probe of the recombinant protease, has been demonstrated to arrest Cryptosporidium parvum growth in human cell lines at physiologically achievable concentrations [ ]. The cryptopain- structure, however, by itself or in complex with K , has not been solved to date. K -bound co-crystals of orthologous cysteine proteases such as cruzain, rhodesain and SmCB [ ] [ ] showed the orientation of the inhibitor in the cysteine proteases as depicted in Figure . The earlier mentioned study on cryptopain- had simulated the binding of K within the active site of the enzyme homology model [ ], and, mimicking nature, the inhibitor was placed in the orientation illustrated in Figure .

The present computational study was initiated to explore other (purchasable) vinyl sulfones that could better bind the active site of the cryptopain- enzyme, with possibly higher efficacy than K . The study was extended to probe the enzymatic pocket of cryptopain- to determine the preferential binding of certain ligand chemical groups at the subsites, for the purpose of providing clues for drug design against the pathogenic cysteine protease.

Materials and methods:

Homology model building of enzyme
The sequence of cryptopain- , with the accession number aba . , belonging to Cryptosporidium parvum was retrieved from GenBank [ ]. The protein sequence was downloaded in FASTA format. The homology model template search for cryptopain- (cathepsin L-like) through NCBI BLAST against the PDB database [ ] led to f , which is the activated Toxoplasma gondii cathepsin L (TgCPL) in complex with its propeptide. The template shared % sequence identity with the sequence to be modeled. The homology model of cryptopain- was built within the full refinement module of ICM [ ].
The structure-guided sequence alignment between the template and the model was generated using the default matrix with a gap opening penalty of . and a gap extension penalty of . . Loops were sampled for the alignment gaps where the template did not have coordinates for the model. The loop refinement parameters were used according to default settings. The acceptance ratio for the simulation process was . . The generated homology model, with a length of amino acids, was then validated on the PROCHECK [ ] and ProSA [ ] webservers.

Ligand structures from chemical compound database
K (or K- ) was downloaded from PubChem [ ] in SDF format. The vinyl sulfone substructure of K was then searched in PubChem, with the additional option of ‘ring systems not embedded’ so as to filter out structures in which the vinyl bonds would extend into ring systems. The search, which was not restricted to peptidyl vinyl sulfones, led to , hits (as of April , ). The purchasable compounds amongst the hits were downloaded in SDF format and checked for redundancy. From the non-redundant vinyl sulfone compounds, cyanide compounds were discarded due to the usual high toxicity profile of such compounds, and the remaining were saved to be used as ligands for docking into cryptopain- .

Docking simulation of covalent inhibition of enzyme
The N-terminal propeptide (which is not part of the active enzyme and acts as a self-inhibitory peptide for regulatory purposes) of the cryptopain- homology model was deleted.
The residues were then renumbered in the enzyme model, with position allocated to the beginning of the mature protease. The PDB file of the edited cryptopain- model was then prepared as a receptor in ICM, with the addition of protons and optimization of His, Pro, Asn, Gln and Cys residues. The protonation step was crucial for mimicking the reaction (and hence the bonds) between a vinyl sulfone and the cysteine protease. The active site residues of the binding pocket had been derived from the structural alignment of the cryptopain- homology model with the orthologous cruzain bound to K (PDB ID: oz ), followed by mapping of the residues around K in cruzain onto the cryptopain- sequence. The pre-determined pocket residues (except the catalytic Cys, or C ) were selected on the prepared cryptopain- in the GUI of ICM, and the relevant box size was created on the receptor to define the area for ligand docking. Further, C was selected to specify the covalent docking site. From the set of preloaded reactions in ICM, the alpha, beta-unsaturated sulfone/sulfonamide/cysteine reaction was selected, which specified the simulation of covalent bond formation between the presumed thiolate (C of the protease) and the beta carbon atom (of the vinyl group of the ligand). The receptor maps were finally made for grid generation.

K , downloaded from PubChem in SDF format, was read in as a chemical table in the GUI of ICM and was specified for docking into the prepared cryptopain- receptor. A thoroughness of . was set in the docking protocol, and twenty conformations of the ligand in the receptor were generated. Following K , a total of non-cyanide vinyl sulfone compounds were attempted for covalent docking into the cryptopain- homology model, using the same protocol as described above.

In-silico mutation of enzyme residues for assessing binding
For the purpose of evaluating the contribution of the individual residues to the binding of the ligands, mutational analysis was undertaken.
The protein-ligand stability was measured by in-silico mutation of the contact residues in the complexes. K -docked cryptopain- and the best-scored complexes (with a score of - or lower) were read in separately, and for each of them, the ligand-subgroup contacting residues were selected one at a time in the workspace panel and mutated to alanine. The outputs of the calculations were displayed in several columns. The dGwt column had the dG (Gibbs free energy) value for the wild type complex (without mutation), the dGmut column held the dG value for the mutated complex (where the residue was mutated to Ala), and the ddGbind (dGmut - dGwt) column showed the binding free energy change (in kcal/mol) upon mutation. The ddGbind value essentially predicted the stability of the native complex, thereby hinting at the contribution of the residue in question towards binding the ligand. Positive values of ddGbind implied the mutation to be less favorable, indicating greater contribution of the wild type residue towards binding. Hence, with more positive ddGbind, better binding of the ligand by the residue could be expected. Negative values, on the other hand, implied the mutated form to be more stable, thereby indicating the native residue’s involvement in unfavorable interactions with the ligand. The residues that were found to make a high number of favorable ligand interactions in thirty-two of the complexes (K -cryptopain- plus the thirty-one best-scored ones) were subjected to a fresh round of mutations in the updated version of the ICM software.
The recalculated ddGbind values were then tallied with the placement and orientation of ligand subgroups around the residues to decipher the preference of chemical groups across the enzymatic cleft of cryptopain- .

[The GUI of ICM was used to make the enzyme/complex structure figures. Illustration and compilation of figures were done in Inkscape, which is an open-source vector graphics editor.]

Results and discussion:

Validation of theoretical enzyme structure
The Ramachandran plot for the cryptopain- homology model showed % of the residues to lie in the allowed region, and the remaining % to be within the generously allowed region of the plot (Supplementary Figure a). The ProSA z-score for the cryptopain- model was - . , better than the - . z-score of its crystal structure template (Supplementary Figure b).

Screening of docked compounds
Besides K , a total of purchasable, non-redundant and non-cyanide vinyl sulfone compounds were docked and scored in the cryptopain- homology model ( symmetric molecules could not be docked using ICM). Post docking, the conformation of K in which the ligand P ’ group (beyond the sulfonyl) was oriented across the enzyme S ’ and its P ..P groups (beyond the vinyl) were placed across the S ..S subsites (as in Figure ), and which had the lowest score in that category, was chosen as the reference for the analysis. Such an orientation appeared first in the eighteenth pose (conformation) of K docked into cryptopain- , with a score of - . . The conformations of other docked vinyl sulfone compounds that had a similar orientation, where the ligand subgroups beyond the sulfonyl were placed across S ’ or beyond, with lowest scores <= - . (and hence possibly better binders than K ), were included in the study for further detailed analysis.
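The screening rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ICM workflow itself: the `Pose` record, its `oriented` flag (standing in for the visual check that prime-side groups lie across the prime-side subsites), the compound labels and all scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Pose:
    compound_id: str   # hypothetical label, not a real PubChem ID
    rank: int          # order in which the docking run reported the pose
    score: float       # docking score; more negative means better
    oriented: bool     # assumed flag: prime-side groups lie across prime-side subsites


def best_oriented_pose(poses: List[Pose]) -> Optional[Pose]:
    """Return the lowest-scoring pose that has the required orientation."""
    oriented = [p for p in poses if p.oriented]
    if not oriented:
        return None
    return min(oriented, key=lambda p: p.score)


def screen(poses_by_compound: Dict[str, List[Pose]], reference_score: float) -> List[Pose]:
    """Keep compounds whose best oriented pose scores at or below the reference."""
    hits = []
    for poses in poses_by_compound.values():
        best = best_oriented_pose(poses)
        if best is not None and best.score <= reference_score:
            hits.append(best)
    # Best (most negative) scores first.
    return sorted(hits, key=lambda p: p.score)
```

In this sketch the reference score would come from calling `best_oriented_pose` on the poses of the reference inhibitor, and `screen` would then be applied to the rest of the docked library. Poses with better raw scores but the wrong orientation are ignored, which mirrors why the reference pose need not be the top-ranked one.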
[The chemical structures of K and the thirty-one best-scored vinyl sulfones are provided in Supplementary Figure , as PubChem IDs associated with (some) chemical compounds change due to frequent updates to the database. The IDs mentioned throughout the text, tables and figures are from the current PubChem records as of May , .]

Ligand binding to preferential enzyme residues
The residues within Å of the ligand subgroups were noted for each complex. K -docked cryptopain- was taken as the reference, as K had been shown experimentally (on the bench) to bind cryptopain- . The protease subsite residues were thus primarily derived from this complex. Figure shows the chosen conformation of K docked into cryptopain- , with the derived subsites colored differently. For the other best-scored complexes, the additional contact residues that showed up were assigned subsites according to their vicinity to the already derived subsite residues in the three-dimensional structure of cryptopain- . Figure shows all the residues that were contacted by ligand subgroups across the enzymatic cleft in one or more of the complexes. Panels a, b, c and d of Figure show the selected conformations of the other vinyl sulfones in cryptopain- , amidst the subsites derived from the reference complex.

The ligand subgroup-contacting residues in each complex had been mutated to alanine, one at a time, to identify the favorable interactions based on the ddGbind values. The interactions that showed ddGbind values worse than - (less than - ) were not taken into account.
The residues that corresponded to the rest of the ddGbind values (greater than - ) were considered to be contributing to favorable interactions with the ligand. Supplementary Table lists the ddGbind interactions in terms of residue versus ligand (represented by PubChem IDs). The columns have all the residues that had been favorably contacted in one or more of the complexes, and the rows hold the compounds whose subgroups had shown favorable interactions with the corresponding column residues. Table lists the scores, contact residues, H-bonding residues and the favorably interacting subsite residues (derived from Supplementary Table ) in the complexes. The tables also feature the additional subsite residues that showed up in the other best-scored complexes, which included ligands that, unlike K , were not typical peptidyl vinyl sulfones.

Thirteen of the favorably interacting cryptopain- residues emerged as heavily contacted by ligand subgroups in the complexes (see Supplementary Table ). The number of times each of the residues was shown to make favorable interactions ranged from to . With a threshold of , Q , K , G , C , W , G , T , A , V , N , H , G , and W turned out to be the most frequently contacted of the favorably interacting residues. The derived residues were then subjected to ddGbind recalculations (barring A ). The results from the calculations were studied with respect to the orientation and positioning of the ligand subgroups near the mentioned residues in the complexes. The ddGbind values for the interaction of the frequently contacted residues with the ligands are listed in Table . The purpose was to deduce the contributing factors for binding and to shed light on the enzymatic-pocket preference for accommodating certain ligand groups, which could ultimately be useful for designing a potent vinyl sulfone inhibitor (better than K ) to target cryptopain- .
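The tally described above (ddGbind per residue and ligand, a cutoff on unfavorable values, then a count of how often each residue is favorably contacted) can be illustrated with a short Python sketch. It assumes the sign convention given in the methods, ddGbind = dGmut - dGwt. The record layout, the `cutoff` and `min_count` parameters and every value below are hypothetical; in particular, whether the frequency threshold is inclusive is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (ligand_id, residue, dG_wt, dG_mut); energies in kcal/mol, labels hypothetical
Record = Tuple[str, str, float, float]


def ddg_bind(dg_wt: float, dg_mut: float) -> float:
    """Binding free energy change on mutation to Ala: dGmut - dGwt."""
    return dg_mut - dg_wt


def favorable_contacts(records: Iterable[Record], cutoff: float) -> Dict[Tuple[str, str], float]:
    """Keep (ligand, residue) pairs whose ddGbind is greater than the cutoff."""
    kept = {}
    for ligand, residue, dg_wt, dg_mut in records:
        value = ddg_bind(dg_wt, dg_mut)
        if value > cutoff:
            kept[(ligand, residue)] = value
    return kept


def frequent_residues(kept: Dict[Tuple[str, str], float], min_count: int) -> Dict[str, int]:
    """Count favorable contacts per residue and keep those seen at least min_count times."""
    counts = Counter(residue for _ligand, residue in kept)
    return {residue: n for residue, n in counts.items() if n >= min_count}
```

A positive ddGbind here means the alanine mutant binds less well than the wild type, so the native residue is read as contributing to binding. The residues returned by `frequent_residues` correspond to the "frequently contacted" set that is examined subsite by subsite.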
Interactions: enzyme subsite residues - ligand subgroups
Unlike K , which occupied the central part of the pocket and was spread equally amongst all the subsites (Figure ), the best-scored vinyl sulfones more often occupied the upper part of the cleft and tended to position themselves on the right, making contacts mostly with S ’ and S . Ligands that lacked P ’, P ’ etc. were sometimes exceptions and were placed at the lower end of the cleft, heavily contacting S . The positioning of the ligand-contacting residues in the three-dimensional structure of the enzyme can be seen in Figure , and the other vinyl sulfone ligands’ placement therein is visible in Figure . The accommodation of various ligand subgroups of the best-scored vinyl sulfones across the enzymatic cleft is described as follows.

S ’ enzyme subsite
The S ’ subsite residues F and W , in the uppermost part of the pocket, were not amongst the frequently contacted residues, and hence were excluded from detailed analysis.

S ’ enzyme subsite
The derived S ’ residues N , H and W were frequently contacted by the other vinyl sulfones, along with an additional G (placed between N and H ). Q and K also featured as additional contacts which, though positioned on the opposite side in the structure, made interactions with P ’ of the ligands. Thus the residues were categorized as part of S ’.

The upper part of the heavily occupied enzymatic pocket region is constituted by S ’ residues: W on the right, and Q , K on the left.
W , which made most of the hydrophobic interactions with the ligand ring systems on the right side of the pocket, showed highly positive ddGbind values for the thiophene group in particular. The residue seemed to prefer pi stacking with ligand ring systems, as it showed favorable ddGbind values for in-plane ring interactions. The ligands with an ethenyl group, as well as the ones that did not place any subgroups near the residue, showed moderately favorable interactions. The ligands whose rings were out of plane with the residue’s six-membered ring, and the ones that had groups like bromopyridine near the residue, showed unfavorable interactions.

For Q , which is situated at the back of the cleft wall, the compounds’ covalent moiety with their sulfonyl group and/or benzyl/phenyl ring(s), when placed near the lower end of the residue, resulted in favorable interactions. Large halide-containing subgroups such as bromopyridine resulted in unfavorable interactions.

K , positioned at the front of the cleft, showed favorable interactions with reasonably distanced polar substituents. Interactions were favorable even when no substituent was close to the residue. Understandably, unfavorable interactions were observed when the non-polar moiety of the residue’s sidechain was near polar ligand atoms, and interactions of the non-polar ethenyl group of the ligand with the polar end of the residue also led to highly negative ddGbind values.

The mid-region of the highly occupied cleft is constituted by N , H and G (S ’ residues) on the right.
These frequently contacted residues were actually within the contact range of both P ’ and P of K . However, the proximity of the ligand’s P ’ to the sidechains of N and H in the reference complex led to the residues’ allocation to S ’, which therefore extends into the middle of the cleft.

N showed favorable interactions with halide-containing substituents, including bromopyridine, which otherwise had unfavorable interactions with the other residues. The ligands that had their benzyl/phenyl rings at a comfortable distance from the residue showed favorable interactions. Closely spaced ligand ring systems led to clashes.

H , which is situated at the back (compared to N ) of the enzyme’s mid-pocket, preferred interactions with the ligands’ sulfonyl or backbone. The residue showed favorable interactions, though not always, even when no ligand group was placed near it. Favorable ring interactions were observed when the ligands’ ring systems were mostly tilted towards W . Unfavorable ddGbind values were observed for inflexible ethenyl groups in ligands.

G , which is buried in the mid-pocket, made interactions primarily with the covalent-bond forming moieties of the ligands. The residue showed favorable interactions with reasonably distanced ring systems. Interactions were unfavorable for closely spaced rings and inflexible groups such as ethenyl.

Overall, the arrangement of the mentioned residues suggests that substituted benzene/naphthalene ring systems could be accommodated in the upper region of the subsite, where the ligand rings can engage in hydrophobic interactions with W , and the polar substituents on those rings could interact with Q and K to the left of the pocket. However, large (polar) halide-substituted rings such as bromopyridine could lead to clashes. The S ’ in the mid-pocket shows a preference for reasonably distanced ring systems and halide-substituted ligand subgroups.
The subsite is not likely to tolerate inflexible groups such as diazospiro, ethenyl etc.

S enzyme subsite
The frequently contacted (derived) S residues G and C were positioned on the left side of the mid-pocket. W , which emerged as an additional frequent contact, was placed close to G and C on the left, and formed part of S .

G was observed to favor interactions with double ring systems such as substituted naphthalene or two separate benzyl/phenyl rings placed near the residue. It also showed favorable interactions with groups like sulfonyl and/or polar backbone atoms. Ring as well as polar interactions showed the most favorable ddGbind values. The interactions became unfavorable when no ligand group was in the vicinity of the residue. Bromopyridine showed unfavorable interactions with this residue too.

C , the catalytic triad residue that formed the covalent bond with the vinyl sulfones, preferred the ligands to be placed away from it and towards the front of the cleft. The favorably interacting compounds were positioned to the right and at the bottom of the residue. The compounds that were tilted towards the inside of the cleft showed moderately unfavorable interactions, and so did the ones that did not place any ring system near the residue. Unfavorable interactions for the residue were observed with the close proximity of ligands’ polar substituents or backbone. Again, bromopyridine made unfavorable interactions with this residue as well.
Unlike other residues, C had far fewer borderline interactions, and the individual ddGbind values mostly fell clearly on either the favorable or the unfavorable side.

W made favorable interactions with the ring systems of the ligands that were placed away from it and towards the right side of the pocket. The interactions improved with a greater number of rings. The highest ddGbind value was obtained for the compound that had four ring systems. However, close interactions with either the ligand backbone or side chain resulted in unfavorable interactions. Inflexible groups such as diazospiro, even if placed away from the residue, led to negative ddGbind values.

Taken together, inflexible groups such as diazospiro, ethenyl etc. would not be tolerated by S . The subsite can accommodate multiple ring systems. The mid-pocket would have a preference for polar backbones of ligands that are positioned towards the front. The catalytic C of S too dictates that the compounds be placed not too deep inside the cleft. Large halide-containing subgroups such as bromopyridine will not be favored in the subsite. The site shows a propensity towards closely packed ring interactions.

S enzyme subsite
The lowest part of the heavily occupied pocket comprises the frequently contacted (derived) S subsite residues: G , T , A and V . The S residues are distributed on both sides of the cleft: G and T are on the left, and A and V are on the right.

G , placed above T , engaged mostly in H-bond interactions with the backbone of the ligands, rather than favorably accommodating their side chains. The residue showed favorable ddGbind values for slightly distanced ring systems of ligands. The most unfavorable interactions were shown for the compound containing bromopyridine.

For T , the highest positive ddGbind value was observed for a halide-substituted ligand subgroup (fluoro-triazinyl group) with its polar ring and polar backbone near the residue.
T preferred reasonably distanced ring interactions (polar and non-polar). However, with no ligand group placed near the residue, the interactions were unfavorable. With large subgroups like bromopyridine, the interactions were again unfavorable.

A had to be excluded from the mutational analysis, as the ddGbind value for an Ala to Ala mutation is zero and could not have provided any useful clue about the type of interactions.

V , despite being mostly hydrophobic, showed favorable interactions with comfortably distanced polar subgroups of ligands, including the fluoro-triazinyl group-containing compound that showed the best ddGbind value. Such polar groups were presumably stabilized by the long-range electrostatic effect of other S residues (see tables).

Summing up, S can certainly accommodate polar subgroups/backbones of ligands. The subsite, however, like the other subsites, does not favor large polar subgroups like bromopyridine.

Orientation and placement of ligands across the enzymatic cleft
The best-scored vinyl sulfones tended to occupy the S ’, S ’, S and S subsites. Unlike K , the other compounds showed optimal interactions mostly with the prime site residues of the enzyme. The S ’ residues made half of the frequently contacted favorable interactions with the ligands. The other half of such interactions were accounted for by S and S members.

With respect to the entire enzymatic cleft of cryptopain- , it can be deduced that the ligands’ placement towards the front of the cleft would be preferred to deep-seated interactions.
polar backbones of ligands (even if not peptidyl) would be desired. s ’ and s tend to be occupied, and are prone to make favorable interactions with polar subgroups of ligands. large halide-containing subgroups are not well tolerated, presumably because of their size. reasonably distanced ring interactions would be preferred all across the cleft. unlike inflexible groups like substituted naphthalene, which could be favorably accommodated in s , the strain arising from the inflexibility of ethenyl and/or diazospiro groups is not likely to be tolerated, especially in the s ’ and s subsites, as per the computational mutational analysis. quite relevantly, the compound that showed the maximum number of favorable interactions with the frequently contacted residues, (see table ), had all the preferred attributes and lacked the undesirable ones. the ligand-bound protease showed a very good score of - . . some other compounds that showed slightly better scores than were (score: - . ), (score: - . ), and (score: - . ). and were placed deep inside the cleft, which led to clashes with the covalent bond-forming c . the ligands’ polar backbones, in addition to the occupation of the enzymatic s ’ site with polar subgroups, somewhat mitigated the unfavorable interactions overall. the compounds also had the undesirable ethenyl near s ’, which contributed to unfavorable interactions with k in the case of (where the ethenyl was placed much closer to the residue). however, the overall scoring algorithm did not penalize the presence of ethenyl as much as the individual ddgbind calculations did. , which showed the best score, also had an ethenyl group (albeit not close to k ). this compound, however, was placed towards the front of the cleft, thereby avoiding unfavorable interactions with c . the ligand also had ring systems in abundance (six) for favorable interactions. rings comprised its (polar) backbone as well as its subgroups.
the ligand desirably occupied the s ’ and s subsites, though not with many polar subgroups.

conclusion:

the efficacy of the thirty-one best-scored compounds as drug candidates within physiological limits remains to be tested on the bench. the information garnered through this study on the substrate/ligand-binding cleft of the enzyme and its interaction with the chemical groups of the docked compounds could ultimately guide the design of potent vinyl sulfone inhibitors. and , which shared most of the preferred ligand-subgroup attributes, can serve as model compounds on which effective inhibitors against cryptopain- could be based. figure provides the chemical structures of the reference (k ) and the model compounds. unlike the other two mentioned compounds ( and ), the subgroups of the model ligands extended into s , typically the key specificity determinant in cathepsin l-like cysteine proteases such as cryptopain- . placed a polar subgroup at s , in contrast to the hydrophobic subgroup placed by . polar ligand subgroups (as in ) at the enzyme’s s are likely to be stabilized via polar/electrostatic interactions by residues like t , m , t , k and e . hydrophobic subgroups (as in ) could also be accommodated by virtue of s residues like a and v . thus, the study attempted to identify purchasable vinyl sulfone compounds that can possibly inhibit cryptopain- , and it provided crucial information on receptor-ligand interactions to help the future design of other vinyl sulfones, which could prove effective in curbing cryptosporidiosis.
acknowledgement: the author would like to thank prof. ruben abagyan of university of california san diego, for providing computational resources.
[table : columns: ligands; score; contact residues; h-bond residues; fav s ’ residues; fav s ’ residues; fav s residues; fav s residues; fav s residues. the first row is the reference ligand (k ); the row entries could not be recovered.]

table : the contact residues around k in cryptopain- are color-coded as per subsites. the residues around the p ’ sidegroup of k (s ’ subsite) are in orange. the s site is in green, s in pink and s in red. the residues that made favorable contacts with k are shown in bold in the subsequent columns. the residues around the ligand subgroups of the best-scored vinyl sulfones compounds (pubchem ids in ligands column) are listed. the favorable interactions (including additional contact residues, which does not appear for k ) are shown in bold and colored as per subsites. the additional s ’ subsite is shown in mauve.
the scores and the h-bonding residues for the individual complexes are also listed.

[table : residue columns: q , k , g , c , w , g , t , v , n , h , g , w ; rows: k ( ) followed by the best-scored ligands; the numeric ddgbind entries could not be recovered.]

table : the ddgbind values for the interaction of k and the best-scored ligands with the important residues of cryptopain- are tabulated. the residues that had shown a high number of favorable interactions (supplementary table ) were taken into consideration for the second round of calculations to chart this table. the values for the most favorable interactions are shown in purple, moderately favorable interactions in brown, slightly unfavorable in aquamarine and unfavorable in blue. the scale for demarcation varies for each residue, depending on the range and type of its interactions.

figure : illustration of the typical binding of vinyl sulfone inhibitors to cysteine protease enzymes. colored spheres represent the different subsites of the enzyme, and the ligand sidechain/subgroups of the vinyl sulfone inhibitor are in violet rectangles. spatial distribution of the subsites in three-dimensional protease structures differs from the linear arrangement that has been shown here for simplicity. the backbones of the enzyme and inhibitor are not shown.
the site of covalent bond formation at c has been marked in red. the positioning/denotation of the ligand subgroups within the different subsites of the enzyme is according to their placement near the vinyl warhead, depicting what has been observed so far in the solved structures of peptidyl vinyl sulfone-bound cysteine proteases. the ligand sidegroup nearest the beta carbon of vinyl is p that fits into s . the following ligand subgroups are p , p etc. the groups beyond the sulfonyl are p ’, p ’ etc. which interact with the prime side subsites of the enzyme.

figure : k or k- (pubchem id: ) docked into the three-dimensional (homology) model of cryptopain- . the selected conformation (score: - . ) shown here conforms to the arrangement of the ligand subgroups (p ’, p , p , p ) in the different enzyme subsites as depicted in figure , and so does the color code that demarcates the subsites.

figure : all the residues that are contacted by one or more ligands in the docked complexes of k and the best-scored (score <= - . ) vinyl sulfones are labeled and shown in spacefill representation (colored as per hydrophobicity) in the three dimensional structure (homology model) of cryptopain- .
the enzymatic triad residue c - the site of covalent attachment - is in yellow.

figure : panels a, b, c, d show the orientation and placement of the best-scored (score <= - . ) compounds docked into the cryptopain- theoretical structure. the ligands are shown with respect to the enzyme subsites that have been derived from the k -cryptopain- reference complex.

figure : the chemical structures (along with the pubchem identifiers) of the reference ligand k or k- , and the two model compounds - which showed optimum interactions with the enzymatic cleft of cryptopain- and thereby could aid the design of effective inhibitors to target the protease.

hla-spread: a comprehensive resource for hla associated diseases, drug reactions and snps across populations

dhwani dholakia , *#, ankit kalra #, uma kanga , mitali mukerji , *

. institute of genomics and integrative biology-council of scientific and industrial research, new delhi- , india.
. academy of scientific and innovative research, ghaziabad- , india.
. netaji subhas university of technology, new delhi- , india.
. all india institute of medical sciences, new delhi- , india.

* correspondence: mitali mukerji; email: mitali@igib.res.in dhwani dholakia; email: dhwani.dholakia@igib.in
#equal contribution

keywords: hla associations, natural language processing, adverse drug reactions, hla biomarker, transplantation, hla alleles

abstract

the extreme complexity of the hla system and its nomenclature makes it difficult to interpret and integrate relevant information on hla associations with diseases, adverse drug reactions (adr) and transplantation. a pubmed search displays ~ , studies on human leukocyte antigens (hla) reported from diverse locations and on multiple populations, and the ipd-imgt/hla database houses data on , hla alleles to date. we developed an automated pipeline with a unified graphical user interface, hla-spread, that provides structured information on snps, populations, resources, adrs and diseases. information on hla was extracted from ~ million pubmed abstracts using natural language processing (nlp). python scripts were used to mine and curate information on diseases, filter false positives and categorize them into tree hierarchical groups, and named entity recognition (ner) algorithms and semantic analysis were used to infer hla association(s). this resource from countries and ethnic groups provides interesting insights on markers associated with allelic/haplotypic association in autoimmune, cancer, viral and skin diseases, transplantation outcome and adrs for hypersensitivity. summary information on clinically relevant biomarkers related to hla disease associations with mapped susceptible/risk alleles is readily retrievable from hla-spread. this resource is the first of its kind and can help uncover novel patterns in hla gene-disease associations.
introduction

the human leukocyte antigen (hla) locus consists of six classical genes (hla-a, -b, -c, -dp, -dq and -dr) that play an important role in eliciting immune response against pathogens ( ) and three non-classical genes (hla-e, -f and -g) that interact with natural killer cells to regulate virus-infected and malignant cells ( ). hla genes harbour a large number of mutations. as of september , there are , hla alleles reported in the ipd-imgt/hla database. these variations mostly arise to generate defensive mechanisms against pathogens. however, some variations also confer risk of autoimmune diseases like rheumatoid arthritis, multiple sclerosis, type diabetes and graves’ disease. more than different autoimmune diseases, infectious diseases and adverse drug reactions have been reported to be associated with hla genes ( – ). these alleles have clinical utility as diagnostic markers, for example in rheumatoid arthritis and ankylosing spondylitis ( – ). they are also used in genetic screening, e.g. hla-b* : in the caucasian population for abacavir hypersensitivity, and hla-b* : in chinese and asians for carbamazepine-induced life-threatening conditions like stevens-johnson syndrome (sjs) and toxic epidermal necrolysis (ten), and also for sjs due to carbamazepine and other drug combinations ( , ). in the context of transplantation, mismatch of hla alleles between donor and recipient impacts the outcomes of solid organ and hematopoietic stem cell transplantation ( ). in addition, mismatching for certain hla loci is also reported to provide benefit in terms of the graft versus leukemia effect ( ). each of the reported studies is unique in itself, as they describe the molecular basis of disease associations, hla matching and anti-hla antibody formation that are relevant for transplantation. besides, studies also report some relevant and associated clinical information, e.g. different hla-b subtypes are reported to be associated with clinical categories under spondyloarthropathies ( ).
there are other studies that implicate hla allele association with the composition of gut microbiome and diseases ( – ). the expanse of this information is immense as there is wide genetic variability and heterogeneity among populations ( ). although advancements in hla typing technologies have been beneficial in identifying novel hla sequences ( ), they have also led to the same hla allelic variant being reported under different hla nomenclature. with the rapid increase in biomedical data, hla alleles and their associations in multiple diseases, it becomes imperative to create a platform with structured information to query and retrieve relevant information. current knowledge about hla is limited to individual papers that can be searched through pubmed, or to reviews where a subset of studies has been summarised. hitherto, there exists no database that compiles the existing hla-related information in an organised framework. in the absence of such a repository, and with gaps in meta-information, resource sharing among researchers and clinicians becomes a big challenge.

the integration of computer sciences with biomedical research has accelerated progress, both in terms of novel discoveries and data structuring. natural language processing (nlp) is a method to extract relevant information from unstructured data ( ). a simple nlp pipeline contains the following components: data assembly, pre-processing and normalization, named entity recognition (ner) and relation extraction (re). the output of nlp algorithms, i.e. a structured dataset, can be used to generate insights via direct interpretation or through downstream analyses. in recent times, nlp methods have started gaining popularity in biological sciences.
for instance, rakhi et al. ( ) reported a text mining pipeline to study spice-disease associations and link phytochemicals from different spices/herbs to diseases. another report by lee et al. highlights biobert, a pre-trained biomedical language representation model that can be used for various text mining tasks like named entity recognition (ner), relation extraction (re) and question answering, specifically on biomedical datasets. similarly, pubtator central ( ) is an open access tool available via ncbi that uses text mining algorithms for assisted bio-curation of entities in literature. the tool uses ner to identify and highlight six bio-entities, viz. gene, disease, chemical, mutation, cell line and species, from abstracts and open access articles available on pubmed. another interesting report by kuleshov et al. ( ) presents a machine-compiled database for studying genotype-phenotype associations, generated by applying text mining to genome-wide association studies (gwas). all these resources work on similar text mining algorithms, but each has a different set of applications and tasks to perform. using these resources as such for hla research often overlooks the extent of variability of the hla complex and the parameters involved in this domain. for instance, pubtator central is able to mine gene names from literature, but would not pick up hla allele information, e.g. hla-drb * : , when hla-drb is the search query. conventional processes to individually mine the large amount of unstructured literature available on hla research require both manpower and resources. for understanding and integrating the observations from hla studies we require knowledge of genomic datasets, i.e. diseases, snps, drugs, populations and ethnic groups, along with an understanding of the relationships between them. nlp-based text mining is an ideal approach to handle the complexity of this process and create structured information.
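the gap between gene-level and allele-level mentions described above can be illustrated with a small pattern-matching sketch. this is not the ner component used in hla-spread; it is a minimal example assuming the conventional colon-separated allele notation (gene, asterisk, two to four numeric fields, optional expression suffix). the pattern, the function name `find_hla_mentions` and the example sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the authors' implementation):
# "HLA-" + gene symbol, then either "*" with 1-4 colon-separated numeric
# fields and an optional expression suffix, or a word boundary (gene only).
HLA_MENTION = re.compile(
    r"\bHLA-(?P<gene>[A-Z]{1,3}\d?)"
    r"(?:\*(?P<fields>\d{2,3}(?::\d{2,3}){0,3})(?P<suffix>[NLSCAQ])?|\b)"
)


def find_hla_mentions(text):
    """Return gene-level and allele-level HLA mentions found in `text`."""
    mentions = []
    for match in HLA_MENTION.finditer(text):
        fields = match.group("fields")
        mentions.append({
            "gene": "HLA-" + match.group("gene"),
            # Allele string only when numeric fields are present.
            "allele": match.group(0) if fields else None,
            # Number of colon-separated fields (0 for a gene-only mention).
            "resolution": len(fields.split(":")) if fields else 0,
        })
    return mentions


if __name__ == "__main__":
    # Invented sentence, used only to exercise the pattern.
    sentence = ("Typing identified HLA-A*02:01 and HLA-DQB1*06:02:01; "
                "the HLA-DRB1 locus was also sequenced.")
    for mention in find_hla_mentions(sentence):
        print(mention)
```

a gene-only recogniser would stop at the `gene` group; keeping the `fields` group is what lets allele-level information be retained. mentions that do not follow the conventional notation are not matched by this pattern and would need separate handling.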
we provide hla-spread (figure ) as a platform for integrated hla resources, developed using nlp to understand the complexity of this locus. the resource provides a platform to summarize hla related genomics knowledge as well as to design and develop new hypotheses. in this study, we used ~ million publicly available peer-reviewed abstracts. we extracted biomedical entities including hla alleles, diseases, snps, drugs and geographical locations. we also attempted to assign positive and negative relationships between diseases and alleles. this hla connectivity was then used to address biologically and clinically relevant objectives such as hla biomarkers and risk and protective alleles for various diseases.
material and methods
data retrieval
medline, which comprises more than million peer-reviewed articles from over scholarly journals, was used as the source of biomedical literature. bulk data was downloaded from the ftp server in xml format. hla alleles with nomenclature were downloaded from the ipd-imgt/hla database ( ). to maintain uniformity in disease names and their ids, we used mesh keywords from umls (unified medical language system). drugs associated with side effects were obtained from sider . and the allele frequency net database (afnd) ( , ). allele frequencies of hla alleles were also taken from afnd. extensive pre-processing was done on all the datasets before they were used in the pipeline.
pre-processing and keywords dictionary
pubmed parsing: a modified version of pubmed parser was used to extract pmid, title, abstract, publication date, journal, article type and authors' information from the medline biomedical literature dataset ( ).
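the parsing step can be sketched as follows. this is only a minimal illustration that uses python's standard `xml.etree.ElementTree` module instead of the modified pubmed parser used in the study; the sample record, the `parse_records` helper and the rule for skipping incomplete records are assumptions made for illustration, and only three of the extracted fields are shown.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written records in the MEDLINE/PubMed XML layout (illustrative only).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>000001</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>HLA alleles and an example disease</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Some background.</AbstractText>
          <AbstractText Label="RESULTS">An allele was associated.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>000002</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>Record without an abstract</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Return one dict per article that has a PMID, a title and an abstract."""
    records = []
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        journal = article.findtext("MedlineCitation/Article/Journal/Title")
        # Only the element text is joined, so the Label attribute of a
        # structured abstract (e.g. "BACKGROUND") is not carried into the text.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iter("AbstractText")
        ]
        abstract = " ".join(p for p in parts if p)
        if pmid and title and abstract:
            records.append(
                {"pmid": pmid, "title": title, "journal": journal, "abstract": abstract}
            )
    return records


records = parse_records(SAMPLE_XML)
print(records)
```

in this sketch the second record is dropped because it has no abstract, which mirrors the idea of keeping only complete records in a tabular structure.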
only records with the above information were considered for further analysis and stored in a tabular format. all the subheadings in the abstracts, viz. background, introduction, objective, method, experimental design, result, discussion, importance, setting, design, study objective, patients, participants and conclusion, were removed.
disease dictionary: mentions of disease keywords were identified using a dictionary created from umls mrconso.rrf ( ). umls is a set of biomedical vocabularies that includes data from omim, gene ontology, clinical repositories, medical subject headings (mesh) and ncbi taxonomy. in this study, we used mesh descriptors including entry term (et), main heading (mh), preferred entry term (pep), descriptor sort version (dsv) and machine permutation (pm). descriptor entry version (dev) was excluded because keywords belonging to this category were incomplete, e.g. abdominal injury was reported as abdominal inj. these descriptors are assigned a unique mesh id, which is stored in a hierarchical format with head categories along with a unique descriptor id. for our analysis, we termed the root form of the disease level-zero and top-level diseases level-one. multiple forms of a disease, such as diabetes insipidus, diabetes mellitus, type diabetes, juvenile-onset diabetes and others, are assigned the same mesh id. this dataset was also supplemented with keyword variants such as plural and lemmatised forms to increase the search space.
hla dictionary: keywords for hla alleles and their nomenclature were fetched from the centralized repository of the international immunogenetics project (imgt) database. imgt is updated quarterly with submissions or deletions of alleles and their nomenclature, and currently houses , alleles. many reports do not follow the conventional hla allele nomenclature, which makes mapping a strenuous task.
to capture as many hla alleles as possible, we created a dataset comprising all possible keywords, including forms with special characters removed wherever required. we also attempted to map all the old nomenclature to the current allele names. this dictionary also includes a few generic hla keywords such as hla class i, hla class ii, hla linked and hla associated. a few alleles based on old nomenclature belong to more than one antigenic group, hence they were put under the “broad antigen” category. a few haplotypes that were a combination of more than one hla allele were grouped in the “haplotype” category.
named entity recognition
keyword matching across abstracts
a python-based ner pipeline was implemented to filter abstracts using a dictionary matching approach with parallel multiprocessing. the disease and hla allele keyword dictionaries were used for initial screening. abstracts were converted to lower case with special characters removed, and if a match was found in either the title or the text, the abstract was split into sentences using the sentence tokenizer of the python natural language tool kit (nltk). we encountered a great extent of variability in disease keywords. most of it arose either from special characters like (-) and (‘) in the keyword or from plural and singular forms. to deal with the former, we also kept instances of sentences in which special characters were not removed; this increased the search space and enabled the capture of keywords such as stevens-johnson syndrome (stevens johnson syndrome) and graves' disease (graves disease). to tackle the latter, our disease dictionary was already enriched with plural and lemmatized forms of keywords.
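the dictionary matching described above can be sketched as follows. this is a simplified, single-process illustration and not the pipeline used in the study: the mini-dictionary and its ids are hypothetical placeholders, and a naive regular-expression splitter stands in for the nltk sentence tokenizer so that the example needs only the standard library.

```python
import re

# Hypothetical mini-dictionary: keyword -> identifier (IDs are placeholders).
DISEASE_DICT = {
    "stevens-johnson syndrome": "DIS_1",
    "graves' disease": "DIS_2",
    "graves disease": "DIS_2",
    "psoriasis": "DIS_3",
}


def strip_special(text):
    """Lower-case and replace every non-alphanumeric character with a space."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def split_sentences(text):
    """Naive splitter standing in for a proper sentence tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def match_keywords(sentence, dictionary):
    """Return the IDs of keywords found in either form of the sentence."""
    raw = sentence.lower()              # special characters kept
    cleaned = strip_special(sentence)   # special characters removed
    hits = set()
    for keyword, ident in dictionary.items():
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, raw) or re.search(pattern, cleaned):
            hits.add(ident)
    return hits


abstract = (
    "Carbamazepine-induced Stevens-Johnson syndrome was reported. "
    "Graves' disease was also studied."
)
sentences = split_sentences(abstract)
for sentence in sentences:
    print(sentence, match_keywords(sentence, DISEASE_DICT))
```

searching both the raw and the cleaned form of each sentence is what lets a hyphenated keyword and its stripped variant resolve to the same identifier.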
for hla allele keywords, word boundary-based regex matching was implemented to search for alleles in the sentences. sentences with at least a single mention of both hla allele and disease keywords were considered for further steps.
identification of tags: populations, drugs and snps
populations: the filtered abstracts were processed using the spacy nlp tagging algorithm (model: en_core_web_md) to search for mentions of populations in the text. of the two output tags, i.e. gpe (geo-political entities) and norp (nationalities or religious groups), we selected the keywords carrying the latter, because the gpe tag often reported scientific names of organisms as populations when applied to biomedical data, e.g. scientific names such as chlamydia spp. and chlamydomonas spp. were reported under gpe tags. the output was classified into countries and ethnic groups for further analysis with the help of an expert anthropologist. manual curation of the obtained list was also done to remove plural and inappropriate entries.
drugs: information on drugs with side effects was taken from the sider database (sider . ). we also added drugs from afnd whose information was missing in sider. the list of drugs was mapped across the dataset to check for occurrences in the selected hla related abstracts. there were many instances where a drug name was a subpart of a disease keyword, e.g. “insulin” was obtained as a false match wherever it was present as part of the disease name “insulin dependent diabetes mellitus”. a small python snippet was written to remove such false positives.
snps: snp ids were mapped across abstracts of the hla dataset using the regex module of python. the algorithm iteratively searched for all instances of rsids using the regular expression “[rr][ss][ - ]{ ,}”.
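the drug and snp tagging steps can be sketched as follows. this is a hedged illustration, not the snippet written for the study: the function names are hypothetical, the rsid pattern assumes one or more digits because the minimum digit count of the original expression is not recoverable from the text, and the rsids in the example are made-up placeholders.

```python
import re

# Assumption: "rs" in either case followed by one or more digits.
RSID_PATTERN = re.compile(r"\b[Rr][Ss][0-9]+\b")


def find_rsids(sentence):
    """Return all rsID-like tokens in order of appearance, lower-cased."""
    return [m.group(0).lower() for m in RSID_PATTERN.finditer(sentence)]


def drug_mentions(sentence, drugs, disease_keywords):
    """Return the drugs that are mentioned outside of any disease keyword."""
    text = sentence.lower()
    # Blank out disease keywords first (longest first) so that a drug name
    # embedded in a disease name is not counted as a drug mention.
    for disease in sorted(disease_keywords, key=len, reverse=True):
        text = text.replace(disease.lower(), " ")
    return [
        drug
        for drug in drugs
        if re.search(r"\b" + re.escape(drug.lower()) + r"\b", text)
    ]


diseases = ["insulin dependent diabetes mellitus"]
drugs = ["insulin", "carbamazepine"]

print(find_rsids("variants rs12345 and RS678 were typed within hours"))
print(drug_mentions("an allele was studied in insulin dependent diabetes mellitus",
                    drugs, diseases))
print(drug_mentions("patients on insulin and carbamazepine", drugs, diseases))
```

masking the disease keywords before searching for drugs is one simple way to avoid the “insulin” false match; other strategies, such as comparing match offsets, would work equally well.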
all the tags captured in the sentences of abstracts were stored as lists of strings along with their respective pmids to facilitate future access.
semantic assessment
n-gram evaluation and manual labelling
an n-gram is a contiguous sequence of n items (which can be syllables, letters or words) in a text, used to determine the context of those items in a sentence or paragraph. we used the nltk functions wordnetlemmatizer, wordpuncttokenizer and collocationfinder to create a corpus of n-grams (n= , and ) from the abstract dataset. after removal of stop words, which do not add significant meaning to the context, a subset consisting of all reported verb/adverb (n= ) and adverb-verb (n= , ) combinations above a frequency cut-off was filtered out using the part of speech (pos) tags of the tokenised words. we observed that n-grams for negative labels often gave misleading information, e.g. “hla-b negative” refers to the absence of the allele rather than a negative association between entities. hence, we used very stringent criteria for choosing negative labels. manual annotation of positive and negative labels was then carried out on this dataset, and a total of labels (supplementary table ) were categorised ( positive and negative) for labelling the sentences. we assign a positive label where the hla allele is positively associated with the disease, so that its presence makes individuals susceptible to the disease, whereas in negative statements the hla allele is negatively associated with the disease and is hence protective. we also considered negation words like “not, none, no” which, if present, can reverse the actual meaning of a sentence. instances of the above-mentioned three keyword sets (positive, negative and negation) were iteratively searched in all the sentences. further, a coding scheme was constructed using a binary layout to label sentences as positive, negative, complex or ambiguous.
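a binary coding scheme of this kind can be sketched as follows. the three keyword sets below are small hypothetical stand-ins for the manually curated labels, and the mapping from flag combinations to labels is an assumption made for illustration, in particular the choice to treat any single polarity combined with a negation word as ambiguous.

```python
import re

# Hypothetical keyword sets; the curated labels are in the supplementary table.
POSITIVE = {"associated with", "susceptibility", "risk"}
NEGATIVE = {"protective", "protection"}
NEGATION = {"not", "none", "no"}


def contains_any(sentence, keywords):
    """True if any keyword occurs in the sentence as a whole word or phrase."""
    text = sentence.lower()
    return any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)


def label_sentence(sentence):
    """Label a sentence from three binary flags: positive, negative, negation."""
    pos = contains_any(sentence, POSITIVE)
    neg = contains_any(sentence, NEGATIVE)
    negated = contains_any(sentence, NEGATION)
    if not pos and not neg:
        return "others"      # no match from either polarity set
    if pos and neg:
        return "complex"     # both polarities in one sentence
    if negated:
        return "ambiguous"   # one polarity plus a negation word
    return "positive" if pos else "negative"


examples = [
    "the allele was associated with disease",
    "the allele was protective against disease",
    "the allele was not associated with disease",
    "one allele conferred risk while another was protective",
    "allele frequencies were typed in patients",
]
for sentence in examples:
    print(label_sentence(sentence), "|", sentence)
```

keeping the three flags separate until the final decision makes it easy to change how a given combination is labelled without touching the keyword search.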
sentences having no match from either of the categories were labelled as others.
root-verb and associated adverbs using dependency parsing
dependency parsing refers to the formation of a tree layout based on the semantics of a sentence, where the root node is represented by a verb that relates the different entities of that sentence. the allele and disease keywords present in each sentence were replaced with @gene and @disease tags, and a parse tree was generated using the stanfordcorenlp python module (stanford-corenlp-full- - - package). the list of verbs obtained from the root nodes of all the sentences in the dataset was manually curated under positive and negative labels. we also added a category “studied/investigatory” for sentences that do not convey any positive or negative context but mention both entities together, e.g. “to investigate the association of hla-a, b, and drb alleles with leukaemia in the han population in hunan province”.
sentence annotation
we termed our approach to labelling sentences the “hybrid approach”, in which annotation was done using both the n-gram labels and the type of root verb. if a sentence had a positive n-gram label and a positive root verb, implying that the relationship between the entities is one of association or linkage, then the sentence was labelled as positive. the same approach was used for negative labelling. finally, the labelled sentences were grouped into different categories: 1) positive, 2) negative, 3) both positive and negative, referred to as complex sentences, 4) positive+negation, referred to as the ambiguous group, and 5) investigatory.
database and web server
the hla-spread database is built for quick and easy retrieval of information related to hla genes.
the web interface was coded in html , css , bootstrap & es . we used d3.js for data visualization and jquery datatables for table integration. the server was hosted using apache http server. the database uses a flat file system with data stored in an excel file. javascript handles the search queries and data visualizations.
results
mining medline literature for hla association
nlp based text mining of million publicly available biomedical abstracts provided abstracts with one or more sentences that describe the relationship between hla alleles and diseases. to understand the distribution of the various kinds of articles among the filtered abstracts, we studied the article type per year trend from to (figure ). we found research journal, comparative study and review articles to have the maximum numbers every year. in addition, there were papers corresponding to clinical trials (phase i, ii, iii and iv) and observational studies, highlighting the importance of this locus in translational studies.
hla genes, alleles and its distribution
there are , alleles, and we hypothesize that not all of them would be associated with a disease or pathological condition. while collating and analysing data on hla alleles, we observed a great extent of variability in the names used within articles, e.g. hla-b* : , a risk factor for dapsone hypersensitivity syndrome in multiple populations, was written as hla-b* : , hla-b* , b* , b(*) and b in different papers. in such instances, a user who wants to search for an allele and its related information must be aware of all possible formats of writing an allele, encompassing its current and previous nomenclature.
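the kind of spelling variability described above can be illustrated with a small normalisation sketch. the study itself relied on a dictionary of all possible keywords rather than on a parser, so the regular expression, the helper names and the restriction to two two-digit fields are assumptions made purely for illustration; the digits in the example are placeholders and do not correspond to any allele named in the text.

```python
import re

ALLELE_PATTERN = re.compile(
    r"""^(?:hla-?)?          # optional HLA prefix, with or without a hyphen
        ([a-z]+\d?)          # locus: letters, optionally one trailing digit
        \(?\*?\)?            # optional asterisk, possibly wrapped in brackets
        (\d{2}):?(\d{2})$    # two fields of two digits, colon optional
    """,
    re.IGNORECASE | re.VERBOSE,
)


def standardise(name):
    """Map a spelling variant to 'HLA-<LOCUS>*<f1>:<f2>', or None if unparsed."""
    match = ALLELE_PATTERN.match(name.strip())
    if match is None:
        return None
    locus, field1, field2 = match.groups()
    return f"HLA-{locus.upper()}*{field1}:{field2}"


def two_digit(standard_name):
    """Collapse a standardised name to its first field."""
    return standard_name.split(":")[0]


# Placeholder digits; the five spellings mirror the variant styles listed above.
variants = ["HLA-B*12:34", "HLA-B*1234", "B*1234", "B(*)1234", "B1234"]
standardised = [standardise(v) for v in variants]
print(standardised)
print(two_digit(standardised[0]))
```

a pattern like this only covers the simplest variants; older serological names and names with more than two fields would still need a curated lookup table.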
so, based on this, we converted all existing hla keywords to a standard allele name. we identified only ~ % of the total alleles to be associated with conditions like diseases, graft survival or drug reactions. to represent these alleles in the form of a graph, we collapsed the nomenclature to the two-digit level (figure ). the majority of the studies were on the hla-drb locus, followed by hla-b and hla-a, while fewer studies were on the hla-c locus. each hla allele, collapsed to its two-digit form, is linked to the afnd server, which shows its allele frequency. the focus of our present study was also to understand the semantics between alleles and diseases, wherein we noted that some alleles were reported as protective and some as risk alleles, e.g. some reports indicated that hla-drb * was protective for hiv and diabetes, whereas other studies reported it as a risk allele for pulmonary tuberculosis. we were also interested in exploring the individual effects of multiple alleles on a single disease. to address this, we listed articles (supplementary table ) highlighting the fact that, for a single disease, different alleles can have contrasting effects, e.g. hla-dqa * : and hla-dqb * : can be protective in artemisia pollen-induced allergic rhinitis while hla-dqa * : can be a risk factor ( ).
exploring diseases, its associated categories and other relevant information
to study the information systematically, the hla studies were divided into four broad categories: diseases, transplantations, sign and symptoms, and therapeutics/adrs. this grouping was done based on the mesh keywords identified in the abstracts.
there is a total of categories for diseases in mesh, ranging from c to c , and transplantation procedures are listed under e . keywords falling under c were grouped as “sign and symptoms”, and c . (gvhd) and e were grouped as “transplantations”. for “therapeutics/adrs”, we selected only those sentences that mentioned drug keywords, allele names and disease names together. we then retained them only if they satisfied any of three conditions: 1) the disease belongs to the drug adverse reactions category, 2) the sentence mentions keywords such as reactions or -induced (e.g. carbamazepine-induced), or 3) the disease keyword contains -induced (e.g. drug-induced liver injury). the remaining sentences were grouped as “diseases”. table shows the number of articles under each category.
to study the association with diseases, we analysed data from both the “diseases” and “transplantation” categories. inconsistency in writing disease names increases the effort needed to search for a specific query. to reduce this variability, the mesh id was used to summarise the obtained information, e.g. diseases like tumour, cancer, malignancy and neoplasm (malignant and benign) were mapped to a single entity, malignancy (d ). collapsing a large number of similar keywords to a single id reduces the complexity of searching for articles related to particular diseases. we observed a total of different disease terms mapping to unique mesh ids. figure represents a snapshot of common hla associated diseases. to examine the disease associations, we mapped them to level-one (level-zero) terms. diabetes mellitus type , rheumatoid arthritis, multiple sclerosis (autoimmune disease), melanoma and leukemia (neoplasms by histologic type), psoriasis (skin disease) and celiac disease (metabolic) were the topmost hla associated diseases.
in the analysed abstracts, the list of hla associated diseases/conditions indicates that some diseases were very frequently reported, whereas other diseases like down syndrome, guillain-barre syndrome and polymyalgia rheumatica were infrequently or rarely reported. supplementary table represents the distribution of both common and less explored hla associated diseases. to get an overall perspective of genes and diseases, we considered the diseases at level-one along with the hla gene. we observed the majority of reported associations with hla-drb , followed by hla-b and hla-a (figure ). we also listed details of individual allele-disease pairs for more information (supplementary table ). hla-drb was reported to be linked with disease conditions like rheumatoid arthritis, type diabetes, multiple sclerosis, melanoma and other diseases. hla-b association was reported with spondylitis, infections, hypersensitivities, psoriasis, drug allergies and other diseases, and hla-a was reported to be associated with melanoma, leukemia, influenza, haemochromatosis and other diseases.
the analysis also takes into consideration the diseases that require transplantation and includes the complications associated with it, both pre- and post-transplantation. as anticipated, we observed that individuals suffering from beta thalassemia and sickle cell anaemia (genetic and congenital disorders), multiple myeloma (an immunoproliferative disorder) and liver injury underwent transplantations of bone marrow, hematopoietic stem cells and renal tissue. however, additional details were included with the transplantation data, such as the disease history of patients before undergoing transplantation e.g.
psoriasis, graves’ disease, diabetic neuropathy and post-transplantation complications e.g. ischemia, necrosis, fibrosis, haemorrhage. such information, collated under one platform, may be of interest to a clinician for designing therapy modules. supplementary table presents details of transplantation related studies.
snps and hla diseases
hla loci have a repertoire of genetic variations, a large number of which have been linked to multiple diseases via genome-wide association studies (gwas). though gwas list information about snps in or associated with hla genes, a number of genetic variation studies go unnoticed, either because they are small cohort analyses or because they are not compiled in a single resource for systematic study. thus, to include the overlooked studies and missing information, this analysis reports information from all kinds of studies and includes abstracts mainly from journal articles, reviews, meta-analyses, letters and clinical trials. to acquire robust data, we retained only those hla variations that are present in sentences along with the disease and allele keywords. we identified unique snp mentions, and their details are compiled in supplementary table . the majority of snps mapped to intronic variants, followed by missense and intergenic variants. figure represents the genomic distribution of the mapped snps. a substantial number of variations also mapped to genes other than hla, indicating that they may be in linkage disequilibrium (ld) or frequently occur in conditions like transplantation success or adrs, for example. we observed top hits of snps mapping to infectious diseases like hiv and hepatitis, inflammatory conditions like psoriasis, complex diseases like asthma and diabetes, and hypersensitivity, largely attributable to drug adrs. snp association studies are also based on a proxy snp, which can be in ld with the causal variant, and ld values vary from one population to another.
to address this, we also added the population information of the studies whenever it was available in the abstract. the most studied snp, rs , associated with hepatitis b virus, has been studied across a large number of populations from asian and central asian countries like china, japan, asia, turkey, korea and indonesia.
geographical spread of hla literature across various ethnic groups and populations
genetic differences in hla genes across populations, and their link with biological conditions, make it imperative to consider geographical information while studying hla association with a particular condition. we assumed that the population/ethnic group name might not be present in the same sentences that mention hla and disease, so we used a flexible approach here and fetched the names of geographical locations present anywhere in the abstracts. in total, we found norp tags, mapping to unique geographical entities. these unique tags were binned into country-based populations and ethnic groups. figure represents the frequency distribution of these matched populations by country and ethnic group. japan, china, usa, india and italy are the major countries where hla gene-disease association studies have been reported, with disease groups as shown in supplementary table . along with this, the european subcontinent has been extensively studied ( unique reports) as a major ethnic group. apart from frequently studied areas, we also observed locations like new zealand, armenia and sri lanka that have a low number of reported studies.
this type of analysis can help researchers understand not only the extent of allele-disease associations among populations in the context of these immune players, but also the scope of research in their selected geographical location while planning their hypotheses.
response to therapeutics
hla genes are known to be associated with various hypersensitivities and drug reactions, a few of which, like stevens-johnson syndrome, can be life-threatening. because alleles differ at the individual and population level, these hypersensitivities vary, and thus studying these pharmacogenetic markers together with population information becomes important. for instance, we observed from our data that hla-a* : is associated with carbamazepine induced stevens-johnson syndrome in the european population, while hla-b* : is associated with it in chinese and indian populations. a meta resource like hla-spread can help in understanding such population-wise differences, which obstruct the design of therapy modules for adrs/hypersensitivities. to be more specific, this analysis focuses on drugs that are present in sentences along with the disease and allele keywords. we observed a total of abstracts mentioning unique drugs, of which mapped to the adr category. details of drugs and related information are listed in supplementary table . we also validated our results against afnd, a manually curated database that has information about adrs. out of drugs present, we were able to find in common. one of the drugs, “valproic acid”, mentioned in afnd, was not present in the actual cited article. the remaining drugs could not be captured because of the stringent criterion for drug mapping, i.e. the drug name should be present in the sentence along with the disease and allele keywords. figure lists the frequency-based distribution of the top drugs fetched from our analysis. interestingly, we also observed drugs that are not mentioned in the afnd database, e.g.
hla-b* : : allele was found to predict carbimazole/methimazole induced agranulocytosis, and hla-drb was associated with azathioprine induced pancreatitis in ibd patients. this analysis highlights how information can be missed in manual curation, apart from its time- and manpower-intensive nature.
insights from hla-spread: biomarker analysis
we demonstrate the usability of the database to address clinically relevant queries. multiple questions on the identification of hla alleles and diseases linked with hypersensitivity, allergy, genetic markers, prognosis and diagnosis can be addressed using hla-spread. as an example, we present an analysis to identify biomarkers in hla studies. to address this question, we used an n-gram based approach to identify the keywords most frequently occurring with “marker” in the sentences. supplementary table lists the most common keywords identified. we checked the details of such sentences and compiled the information (supplementary table ). a few of them, like abacavir hypersensitivity and sjs syndrome, were present in multiple papers. hla-g and hla-e were also reported to be markers for conditions like tumour, transplantation and heart diseases.
discussion
hla alleles are known to be associated with a large number of diseases. there is no existing repository that summarises this information in a systematic manner. manual curation is a cumbersome process, and one might also miss a lot of important information. the need for such a user-friendly platform increases significantly since hla alleles have been found to be clinically associated with a large number of conditions. nlp based text mining offers a way to fetch this information pragmatically.
nlp is instrumental in extracting information from unstructured data. this method has started assuming immense importance in the biomedical domain. a few resources like gwaskb and snp literature have used it for extracting information such as snps and related knowledge from biomedical data, whereas the monarch initiative has used it for studying phenotype information ( ). extracting information from hla related literature is very difficult owing to the large number of studies and the complex nomenclature. this project is an attempt to consolidate all the hla relevant information, such as snps, populations studied, adrs and associated diseases, into a structured database. this resource is also handy for user-specific advanced hla searches, like looking for biomarkers for toxicity-based studies and disease progression.
there were a few drawbacks of this analysis worth highlighting, primarily arising from the different formats of various journals. the initial tokenised data used in the analysis was based on english stop words. however, we observed that in a small set of papers the authors missed full stops or spaces, which led to the fusion of two sentences. the subheadings were present in different cases and were often followed by different special characters, which complicated their removal. also, prefixes such as settings, study design, etc. were observed in a few sentences, as those papers did not follow standard headings. apart from these, a few other parameters, like abbreviations at the end of sentences, the presence of roman letters in sentences and different bracket and quote styles in titles, caused errors during the tokenisation process. similarly, it was observed that when abstracts were updated in new releases, the previous incorrect entries were not removed, which led to duplication of information.
since hla-spread has catalogued information from diverse resources, in many instances it provides pieces of information that are more informative and exhaustive. for instance, besides information retrieved from databases like disgenet and omim (mendelian), which report information on a few diseases, we also used mesh, which is more comprehensive as it houses variant disease terms mapping to diseases. we also reduced the high variability in the way disease names are mentioned in various articles. on average, a disease has around names with one id, showing the wide spectrum of the disease dictionary required to capture all possible disease terms. in order to capture hla and adrs, we selected a list of drugs from sider . . however, not all drugs present in a side effect database will be associated with adrs. to get a more specific answer, we selected drugs from categories such as adverse drug reactions, hypersensitivity and toxicity. we were able to fetch a large number of studies and observed that the afnd database has missed quite a few drugs in the adr analysis. we thus added information from both afnd and sider to get heuristic information for a set of different drugs.
there were a few unique aspects that we could capture because of our approach. for instance, in transplantation studies, in addition to just listing different kinds of transplantations, we also observed the most common diseases that required transplantation and the drugs given during the process, with a few side effects. also, a unique aspect we added was a category called signs and symptoms for simplifying user searches.
for instance, some users may also be interested in knowing the context of hla alleles with conditions like inflammation, relapse, hypoxia, septic shock, diarrhoea, etc. we aim to add a few features in future updates, for example mapping the variants reported in dbsnp, omim and clinvar to the hla alleles. this would help in the seamless integration of high-throughput variation data with the wealth of hla information in the literature and the hla alleles reported in the imgt database. to summarise, this is a one-of-its-kind effort to integrate the diversity of hla information into a structured format for ease of query and analysis. it could also provide an informative resource for non-hla specialists initiating new studies in populations and diseases.
acknowledgements
the authors would like to acknowledge coe m/o ayush grant mlp- to mm and dd, the srf fellowship to dd from the department of biotechnology (dbt), and dr. yatender kumar (nsit) for permitting ak to work on this project. we would also like to acknowledge mr praveen sinha for designing and developing the webpage of hla-spread, dr. debasis dash, csir-igib, for critical reviewing of the work, dr. ganesh bagler and rudransh tunwani from iiitd for nlp discussions, dr. ganganath jha from hazaribagh university for qc of population curation and malika seth for qc of semantic annotations. the authors would also like to acknowledge mr. raghunandanan mv and mr. amit khulve at csir-igib for it support.
authors contributions
mm, dd designed the study and co-wrote the manuscript. dd and ak executed the entire work. uk helped in hla analysis, interpretation and manuscript writing.
List of figures

Figure . Workflow of HLA-SPREAD: an automated pipeline developed to extract information from ~ , HLA-related studies retrieved from over million abstracts. Structured information from these abstracts was created using natural language processing methods and developed into the database HLA-SPREAD. The various resources used at each step are indicated.

Figure . Nature and trends of HLA-related publications in PubMed annually from onwards: the stacked bar plot shows the distribution of PubMed articles in different categories. a) Diverse studies including clinical trials are reported, with maximum numbers represented in the “Journal Article” category. b) A subplot of (a) after removing the most frequent “Journal Article” type to visualise the trends in other categories.

Figure . The topmost reported HLA alleles associated with diseases: all the HLA alleles indicated have been grouped to their second digit and represented in the pie chart. HLA-A, HLA-B and HLA-DRB are the most studied amongst the HLA genes.

Figure . Diseases/conditions associated with HLA genes: the graph represents a three-level hierarchy of diseases. Each colour represents a level. There are major categories, represented in green, which are further divided into subcategories. Each disease name is matched to its MeSH ID and a normalised MeSH keyword. Autoimmune, neoplasms and joint disease are the most frequently associated diseases. As anticipated, significant numbers of studies related to transplantation are also observed.
Figure . Heatmap of HLA disease associations: the gradient heat map represents the number of diseases associated with HLA genes. The first column represents generic “HLA” studies where specific gene information is not mentioned. A large number of associations were also observed with non-classical (HLA-E, F, G) genes.

Figure . Genomic distribution of SNPs: pie chart representing the number of variations in genic regions, with the majority of them mapping to introns.

Figure . Geographical spread of HLA studies: identified geographical locations are binned to the nearest a) country, b) ethnic group. The colour gradient represents the count of various HLA alleles with respect to disease or ADR studies. China, Japan and the USA report the most studies, and European, Asian and African are the most studied ethnic groups.

Figure . Statistics of drug-related HLA studies: this bar plot includes the most common top drugs associated with ADRs identified using HLA-SPREAD.
List of tables

Table : Number of articles in broad categories

Categories            Number of PubMed abstracts
Diseases
Transplantation
Signs and symptoms
ADR

Supplementary tables: https://doi.org/ . /zenodo.

AdRoit: an accurate and robust method to infer complex transcriptome composition

Tao Yang, Nicole Alessandri-Haber, Wen Fury, Michael Schaner, Robert Breese, Michael LaCroix-Fralish, Jinrang Kim, Christina Adler, Lynn E. Macdonald, Gurinder S. Atwal, Yu Bai*

Affiliations: Regeneron Pharmaceuticals, Inc., Tarrytown, NY; Cellular Longevity, Inc., San Francisco, CA
*Corresponding author

Abstract

RNA sequencing technology promises an unprecedented opportunity in learning disease mechanisms and discovering new treatment targets. Recent spatial transcriptomics methods further enable transcriptome profiling at spatially resolved spots in a tissue section. In controlled experiments, it is often of immense importance to know the cell composition in different samples. Understanding the cell type content in each tissue spot is also crucial to the interpretation of spatial transcriptome data. Though single cell RNA-seq has the power to reveal cell type composition and expression heterogeneity in different cells, it remains costly and sometimes infeasible when live cells cannot be obtained or sufficiently dissociated. To computationally resolve the cell composition in RNA-seq data of mixed cells, we present AdRoit, an accurate and robust method to infer transcriptome composition. The method estimates the proportions of each cell type in the compound RNA-seq data using known single cell data of relevant cell types.
It uniquely uses an adaptive learning approach to correct the gene-wise bias due to the difference in sequencing techniques. AdRoit also utilizes cell type specific genes while controlling their cross-sample variability. Our systematic benchmarking, spanning from simple to complex tissues, shows that AdRoit has superior sensitivity and specificity compared to other existing methods. Its performance holds for multiple single cell and compound RNA-seq platforms. In addition, AdRoit is computationally efficient and runs one to two orders of magnitude faster than some of the state-of-the-art methods.

Introduction

RNA sequencing is a powerful tool to address the transcriptomic perturbations in disease tissues and to help understand the underlying mechanisms in order to develop treatments. Due to the presence of heterogeneous cell populations, the bulk tissue transcriptome only characterizes the averaged expression of genes over a mixture of different types of cells. The identity of individual cell types and their prevalence remain unresolved in the bulk data. However, knowledge of the cell type composition and of gene expression perturbation at the cell type level is often critical to identifying disease-manifesting cells and designing targeted therapies. For instance, the constitution of stromal and immune cells sculpts the tumor microenvironment that is essential in cancer progression and control. Excessive expression of cytokines in particular leukocyte types underlies the etiology of many chronic inflammatory diseases. Such information cannot be directly read out from bulk RNA-seq.
Recent breakthroughs in spatial transcriptomics methods enable characterizing transcriptome-wide gene expression at spatially resolved locations in a tissue section. However, it remains challenging to reach single cell resolution while measuring tens of thousands of genes transcriptome-wide. Some widely used technologies can achieve a resolution of - μm, equivalent to – cells depending on the tissue type. The transcripts therein may originate from one or more cell types. Unlike bulk RNA-seq, the profiling data at each spot contain substantial dropouts because merely a few cells are sequenced, imposing additional challenges in resolving the cell type content. We refer to bulk RNA-seq and spatial transcriptomics data at multi-cell resolution as compound RNA-seq data hereafter.

The rapid development of single-cell RNA-seq (scRNA-seq) technologies has allowed for cell-type specific transcriptome profiling. It provides the information missing from the compound RNA-seq data. Nevertheless, the technologies have low sensitivity and substantial noise due to the high dropout rate and the cell-to-cell variability. Consequently, scRNA-seq technologies require a large number of cells (thousands to tens of thousands) to ensure statistical significance in the results. In addition, the cells must remain viable during capture. These requirements render scRNA-seq technologies costly, prohibiting their application in clinical studies that involve many subjects or cannot allow real-time tissue dissociation and cell capture.
Furthermore, scRNA-seq technologies may not be well suited to characterizing cell-type proportions in solid tissues because the dissociation and capture steps can be ineffective for certain cell types.

As sequencing at the single cell level is not always feasible, in silico approaches have been developed to infer cell type proportions from compound RNA-seq data. The most common strategy is to conduct statistical inference through maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP) on a constrained linear regression framework, wherein the unobserved mixing proportions of a finite number of cell types are part of the latent variables to be optimized. The deconvolution methods are often applied to dissect the immune cell compositions in blood samples. However, their performance in more complex tissues, such as the nervous, ocular, respiratory and gastrointestinal organs, remains unclear. These tissues often contain many cell types ( - ), and the differences among related cells can be subtle, rendering the deconvolution a challenging task. For example, a recent study on the mouse nervous system contains more than cell clusters, many of which are highly similar neuronal subtypes.

Earlier works often utilized the transcriptome profiling of purified cell populations to estimate the gene expression per cell type (e.g. CIBERSORT). More recently, acquiring cell type specific expression from scRNA-seq data was shown to be an intriguing alternative. Though it provides higher throughput by measuring multiple cell types in one experiment, profiling at the single cell level is substantially noisy.
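The constrained linear regression view of deconvolution mentioned above can be illustrated with a small sketch. It is a generic non-negative least squares baseline solved by projected gradient descent, not the AdRoit model or any published tool; the function name `deconvolve_nnls` and the toy reference matrix are hypothetical and chosen only for illustration.

```python
def deconvolve_nnls(bulk, reference, n_iter=2000):
    """Estimate cell type proportions for one mixed sample (illustrative only).

    bulk:      list of G expression values for the mixed sample.
    reference: list of G rows, each holding K per-cell-type mean expressions.
    Returns K non-negative values normalised to sum to 1.
    """
    n_genes, n_types = len(reference), len(reference[0])
    # Gram matrix R^T R and vector R^T b of the objective 0.5 * ||R x - b||^2.
    gram = [[sum(reference[g][i] * reference[g][j] for g in range(n_genes))
             for j in range(n_types)] for i in range(n_types)]
    rtb = [sum(reference[g][i] * bulk[g] for g in range(n_genes))
           for i in range(n_types)]
    # The trace bounds the largest eigenvalue, so 1 / trace is a safe step size.
    step = 1.0 / sum(gram[i][i] for i in range(n_types))
    coef = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        grad = [sum(gram[i][j] * coef[j] for j in range(n_types)) - rtb[i]
                for i in range(n_types)]
        # Gradient step, then projection onto the non-negative orthant.
        coef = [max(0.0, c - step * g) for c, g in zip(coef, grad)]
    total = sum(coef)  # assumed > 0 for any non-trivial sample
    return [c / total for c in coef]


# Toy example: 4 genes, 2 cell types, noise-free mixture of 30% / 70%.
reference = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0], [8.0, 2.0]]
true_props = [0.3, 0.7]
bulk = [sum(r * p for r, p in zip(row, true_props)) for row in reference]
estimate = deconvolve_nnls(bulk, reference)
```

In this noise-free toy case the estimate recovers the mixing proportions; with real data, noise and platform differences between the reference and the mixture are exactly what make the problem harder.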
Deconvolution using scRNA-seq data as reference can be biased by noise unrelated to cell identities if not treated properly. Moreover, the platform difference between the compound data and the single cell data cannot be ignored.

To overcome these challenges, additional information from the data may be considered. A recent method that weighs genes according to their expression variances across samples greatly improved the accuracy, highlighting the importance of gene variability in inferring cell type composition. Some other methods and applications have pointed out the importance of cell type specific genes. In these works, the cell type specific expression was only used to select the input genes (e.g., markers). Nonetheless, it measures how informative a gene is in distinguishing cell types and thus can be incorporated as a part of the model. To address the platform difference between the compound data and the single cell data, it is usually assumed that there exists a single scaling factor or a linearly scaled bias for all genes that can be learned and corrected accordingly. This assumption rarely holds because the platform difference affects each gene differently. Though learning a uniform scaling factor would correct the difference in the majority of genes, a few genes that remain significantly biased can easily confound the estimation, especially under a linear model framework. Thus, a gene-wise correction should be considered.

In this work, we present a new deconvolution method, AdRoit, a unified framework that jointly models the gene-wise technology bias, genes’ cell type specificity and cross-sample variability.
The method estimates the cell type constitution in compound RNA-seq samples using relevant single cell data as a training source. Genes used for deconvolution are automatically selected from the single cell data based on their information richness. Uniquely, it uses an adaptive learning approach to estimate gene-wise scaling factors, addressing the issue that different platforms impact genes differently. The model of AdRoit is further regularized to avoid collinearity among closely related cell subtypes that are common in complex tissues. Over comprehensive benchmarking data sets with varying complexity of cell composition, AdRoit showed superior sensitivity and specificity to other existing methods. Applications to real bulk RNA-seq data and spatial transcriptomics data revealed strong and expected biologically relevant information. We believe AdRoit offers an accurate and robust tool for cell type deconvolution and will promote the value of bulk RNA-seq and spatial transcriptomics profiling.

Results

Overview of the AdRoit framework

AdRoit estimates the proportions of cell types from compound transcriptome data including, but not limited to, bulk RNA-seq and spatial transcriptome data. It directly models the raw reads without normalization, preserving the difference in total amounts of RNA transcript in different cell types. The method uses relevant pre-existing single cell RNA-seq data with cell identity annotation as reference. It selects informative genes, estimates the mean and dispersion of the expression of selected genes per cell type, and constructs a weighted regularized linear model to infer percent combinations (Fig. a).
Because sequencing platform bias impacts genes differently, a uniform scaling factor for all genes does not sufficiently eliminate such bias. A key innovation of AdRoit is that it adopts an adaptive learning approach, in which the bias is first estimated for each gene and then adjusted such that a more biased gene is corrected with a larger scaling factor (Fig. b).

We also attribute the success of AdRoit to the consideration of a comprehensive set of other relevant factors, including genes’ cross-sample variability, cell type specificity and collinearity of expression profiles among closely related cell types. The cross-sample variability of a gene confounds its biological expression variability due to the variety of cell types. The latter is referred to as the cell type specific expression, which helps identify the cell type. AdRoit down-weights genes with high cross-sample variability and up-weights those whose expression is highly specific to certain cell types. The definitions of cross-sample variability and cell type specificity also account for the overdispersed nature of count data. Lastly, AdRoit adopts a linear model to ensure the interpretability of the coefficients. At the same time, AdRoit includes a regularization term to minimize the impact of statistical collinearity. Each of these factors contributes an indispensable part to AdRoit, leading to an accurate and robust deconvolution method for inferring complex cell compositions.

To evaluate the performance, we compared AdRoit with MuSiC and NNLS for bulk data deconvolution, and with stereoscope for spatial transcriptomics data deconvolution.
When evaluating the algorithms, a common practice is to pool the single cell data to synthesize a “bulk” sample with known ground truth for the cell composition. We measured the performance by comparing the estimated cell proportions with the true proportions using four metrics: mean absolute difference (MAD), root mean squared deviation (RMSD) and two correlation statistics (i.e., Pearson and Spearman). Both correlations are included because Pearson reflects linearity, while Spearman avoids artificially high scores driven by outliers when the majority of estimates are tiny. Good estimations feature low MAD and RMSD along with high correlations. When estimating cell proportions for a synthetic sample, cells from this sample are excluded from the input single cell reference (i.e., leave-one-out) to avoid overfitting.

We further applied AdRoit to real bulk RNA-seq data and validated the results with available RNA fluorescence in-situ hybridization (RNA-FISH) data. The estimates were further confirmed by relevant biological knowledge of human pancreatic islets. We also used AdRoit to map cell types on spatial spots, and the accuracy was verified by in-situ hybridization (ISH) images from the Allen Mouse Brain Atlas.

AdRoit excels in datasets with both simple and complex cell constitutions

We started with a simple human pancreatic islets dataset that contains cells and four distinct endocrine cell types (i.e., alpha, beta, delta, and PP cells) (Extended Data Fig. a; Supplementary Table ). The synthesized bulk data were constructed by mixing the single cell data at known proportions.
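The four evaluation metrics named above (MAD, RMSD, Pearson and Spearman) can be computed from a vector of estimated proportions and a vector of true proportions. The following is a minimal sketch using only the standard library; the helper names and the example numbers are hypothetical and do not come from the benchmark itself.

```python
import math


def _ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def evaluate(estimated, true):
    """Return MAD, RMSD, Pearson and Spearman for one sample."""
    n = len(true)
    diffs = [e - t for e, t in zip(estimated, true)]
    return {
        "mad": sum(abs(d) for d in diffs) / n,
        "rmsd": math.sqrt(sum(d * d for d in diffs) / n),
        "pearson": _pearson(estimated, true),
        # Spearman is the Pearson correlation of the ranks.
        "spearman": _pearson(_ranks(estimated), _ranks(true)),
    }


# Hypothetical proportions for a four-cell-type sample.
metrics = evaluate([0.45, 0.35, 0.15, 0.05], [0.50, 0.30, 0.15, 0.05])
```

Because Spearman depends only on the ordering of the proportions, it is unaffected by one or two large cell types dominating the scale, which is the reason given above for reporting both correlations.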
Though all three methods achieved satisfactory performance according to the evaluation metrics, AdRoit performed slightly better, as reflected by scatterplots of estimated vs. true proportions (Extended Data Fig. b, Supplementary Table ). It has moderately lower MAD ( . vs. . for MuSiC and . for NNLS) and RMSD ( . vs. . for MuSiC and . for NNLS), and comparable correlations (Pearson: . vs . for MuSiC and . for NNLS; Spearman: . vs . for MuSiC and . for NNLS) (Extended Data Fig. c). This performance was expected because there were only four cell types with very distinct transcriptome profiles, so deconvoluting such data was a relatively easy task for all three methods.

We then tested the methods on a couple of complex tissues that are more challenging to deconvolute. One is the human trabecular meshwork (TM) tissue. We acquired published single cell data that contain cells and cell types from donors. The data include similar types of endothelial cells, types of Schwann cells and types of TM cells (Supplementary Fig. ; Supplementary Table ). Cells from each donor were pooled as a synthetic bulk sample. The cell type proportions vary from < % to %. These proportions were the ground truth cell composition and were compared head-to-head with the estimated proportions inferred by AdRoit, MuSiC and NNLS. For each synthetic bulk sample, estimations were performed using a reference built from cells of the other donors (i.e., leave-one-out). In each of the samples, the estimates made by AdRoit best approximated the true proportions. In particular, AdRoit had significantly lower MAD ( . ) and RMSD ( . ), and higher correlations (Pearson = . ; Spearman = . ), compared to MuSiC (MAD = . ; RMSD = . ; Pearson = . ; Spearman = . ) and NNLS (MAD = . ; RMSD = . ; Pearson = . ; Spearman = . ) (Fig. a). We further assessed the deviation of the estimates from the true proportions for each cell type. AdRoit consistently had the lowest deviations from the true proportions for all cell types, as well as the lowest variation among samples (Fig. b, blue dots), indicating higher robustness over various cell types and samples. Notably, AdRoit only missed one rare cell type (true proportion = . %) out of cell types in one sample, while MuSiC missed to cell types in of the samples, and NNLS missed to cell types in all samples (Supplementary Fig. , Supplementary Table ).

AdRoit has better sensitivity and specificity

We next systematically addressed the sensitivity and specificity of these algorithms. In the context of cell type deconvolution, a false negative occurs when the proportion of an existing cell type is predicted to be zero (or below a given threshold). Conversely, a non-zero prediction (or one above a given threshold) for an absent cell type results in a false positive. False negatives and false positives measure the sensitivity and specificity of a deconvolution algorithm, respectively. Both quantities are crucial to establishing the utility of the algorithm. In particular, in real-world applications it is often difficult to know a priori what cell types exist in a bulk sample, so users may ask the algorithm to consider more possible cell types than are actually in the sample. False positive predictions in this situation would make the algorithm unusable. We designed a simulation to test the sensitivity and specificity.
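The definitions of false negatives and false positives given above translate directly into a counting rule. The sketch below is one hedged reading of those definitions: a cell type counts as present when its true proportion is above zero and as detected when its estimate is strictly above a chosen threshold. The function name, the threshold value and the example proportions are hypothetical.

```python
def detection_counts(estimated, true, threshold=0.0):
    """Count detection outcomes over the cell types of one sample."""
    tp = fp = tn = fn = 0
    for est, tru in zip(estimated, true):
        present = tru > 0          # cell type really is in the sample
        detected = est > threshold  # algorithm reports it as present
        if present and detected:
            tp += 1
        elif present:
            fn += 1  # existing cell type predicted at or below the threshold
        elif detected:
            fp += 1  # absent cell type predicted above the threshold
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "sensitivity": sensitivity, "specificity": specificity}


# Hypothetical sample: two cell types present, two absent.
counts = detection_counts(
    estimated=[0.55, 0.00, 0.40, 0.05],
    true=[0.60, 0.40, 0.00, 0.00],
    threshold=0.10,
)
```

Repeating the count over a range of thresholds gives one (sensitivity, specificity) pair per threshold, which is the usual way to trace out a receiver operating characteristic curve.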
We selected out of the cell types, i.e., Schwann-cell like cell, TM , smooth muscle cell, melanocyte, macrophage and pericyte, from each donor sample and pooled them within that sample to synthesize new bulk samples. The unselected cell types are considered absent in the bulk samples. Some of the cell types that are present are highly similar to those that are absent, challenging the programs to pinpoint the right cell type present in the bulk among similar candidates. We provided the full list of single cell types as reference to the programs to estimate the cell type proportions. NNLS was excluded from this evaluation due to its low benchmarking performance observed earlier (Fig. a, b).

Consistently across samples, AdRoit had very accurate estimates for the present cell types, and zero or close-to-zero estimated values for the non-existing cell types in the synthetic bulk data. MuSiC was notably less accurate on the selected cell types, and it had many non-negligible values (> % for out estimates) for the cell types excluded from the synthetic samples (Fig. c, Supplementary Table ). For example, smooth muscle cells accounted for ~ % in donor but were largely missed (~ . %) by MuSiC. We noted that TM had false non-zero estimates from both methods though it was not included. This is because TM is easily mistaken for TM due to their high similarity. Nonetheless, AdRoit’s estimates of TM were consistently small across samples (< % for out of estimates), while MuSiC had significantly larger estimates of TM that occasionally even exceeded the TM estimates (donors and in Fig. c right).
For a systematic comparison, we constructed the receiver operating characteristic (ROC) curve by varying the threshold of detection (i.e., a cutoff below which the cell type was deemed undetected) (Fig. d). AdRoit had a significantly higher area under the curve (AUC) than MuSiC ( . vs. . ), implying consistently better sensitivity and specificity.

AdRoit outperforms in deconvoluting closely related subtypes

To further evaluate AdRoit when multiple cell subtypes are present in a complex tissue, we performed scRNA-seq experiments on mouse lumbar dorsal root ganglion (DRG) from five mice. Following the standard analysis pipeline (Methods), we obtained single cells after quality control procedures. After clustering and annotation, we discovered cell types including multiple subtypes of neuronal cells (Fig. a, Supplementary Table ). The heatmap of the top marker genes showed distinct patterns for the major cell types as well as similar patterns for the subtypes (Extended Data Fig. a), and the cell type proportions varied from . % to . % (Extended Data Fig. b). These cell types include subtypes of neurofilament-containing neurons (i.e., nf_calb , nf_pvalb, nf_ntrk .necab ), subtypes of non-peptidergic neurons (i.e., np_nts, np_mrgpra , np_mrgprd), and subtypes of peptidergic neurons (i.e., pep _dcn, pep _s a .tagln , pep _slc a .sstr , pep _htr a.sema a, pep _trpm ). Also discovered were tyrosine hydroxylase-containing neurons (TH), satellite glia and endothelial cells. Such complex compositions formed a challenging testing ground for evaluating the ability to distinguish closely related cell types. We again performed the leave-one-out deconvolution on five synthesized bulk samples.
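The leave-one-out construction of synthetic bulk samples used in these benchmarks can be sketched as follows. This is an illustration of the general procedure only: the record layout (dictionaries with `donor`, `cell_type` and `counts` keys), the function name and the toy cells are assumptions made for the example, not the data structures used in the study.

```python
from collections import Counter


def leave_one_out_bulk(cells, held_out):
    """Pool one donor's cells into a synthetic bulk sample.

    cells:    list of dicts with hypothetical keys 'donor', 'cell_type'
              and 'counts' (one raw count per gene).
    held_out: donor (or animal) whose cells form the synthetic bulk.
    Returns (bulk counts, true proportions, reference cells).
    """
    own = [c for c in cells if c["donor"] == held_out]
    n_genes = len(own[0]["counts"])
    # Summing raw counts over cells mimics a bulk measurement.
    bulk = [sum(c["counts"][g] for c in own) for g in range(n_genes)]
    # Ground truth is the fraction of pooled cells of each type.
    type_counts = Counter(c["cell_type"] for c in own)
    truth = {t: n / len(own) for t, n in type_counts.items()}
    # The reference is built only from the remaining donors.
    reference_cells = [c for c in cells if c["donor"] != held_out]
    return bulk, truth, reference_cells


# Toy data: two donors, two cell types, two genes.
cells = [
    {"donor": "d1", "cell_type": "A", "counts": [1, 0]},
    {"donor": "d1", "cell_type": "A", "counts": [2, 1]},
    {"donor": "d1", "cell_type": "B", "counts": [0, 5]},
    {"donor": "d2", "cell_type": "B", "counts": [1, 4]},
]
bulk, truth, reference_cells = leave_one_out_bulk(cells, "d1")
```

Keeping the held-out donor's cells out of the reference is what prevents the estimate from being evaluated on data it was trained on.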
adroit had highly accurate estimations on all cell types across samples (fig. b). it is worth mentioning that, for the rare cell types that account for less than %, adroit still produced estimates fairly close to the true proportions and never missed a single cell type, showing that adroit is very robust on rare cell types. for example, . % endothelial cells were predicted to be . %, and . % nf _ntrk .necab cells were predicted to be . % (supplementary fig. , supplementary table ). on the contrary, music and nnls were notably less accurate, especially for the cell types less than %, and missed multiple cell types including some large cell clusters accounting for ~ % (pep _slc a .sstr cells of sample ). we further examined the variability of the estimates in each individual sample. we computed the metrics to evaluate the performance on each of the synthetic samples and compared them head-to-head among the algorithms. this fine comparison showed adroit significantly outperformed music and nnls on every sample (fig. c). further, the performance metrics of adroit were highly consistent across samples with the lowest variability among the three methods.

adroit excels on simulated spatial transcriptomics data

given the promising performance on complex tissues, we continued to test adroit’s applicability to spatial transcriptomics data. spatial transcriptomics data differs from bulk rna-seq data in that each spot only contains transcripts from a handful of cells ( - ). some of the spots contain multiple cells of the same type, while others may have mixtures of cell types at varying mixing percentages (e.g., spatial spots at the boundary of different cell types).
also, because the mixture is a pool of only a few cells, the variations across spatial spots are expected to be greater than in bulk samples. we simulated a large number of spatial spots ( in total) by using sampled cells from the drg single cell data above (methods), then compared adroit with stereoscope over a range of simulation scenarios. we first tested whether the methods could correctly infer a single cell type when the spots contain cells from that same type. for each of the cell types from drg, we sampled cells and pooled them to form a spatial spot. we repeated the simulation times for robust testing, then used the full set of cell types as reference to deconvolute the simulated spots. both methods were able to identify the correct cell types with indistinguishable accuracy on the simulated cell types (i.e., estimates close to ) and comparably low estimated values (i.e., estimates close to zero) for other cell types not included when simulating the spots (extended data fig. ).

we then continued with a more difficult scenario where we sampled cells from the pep subtypes and mixed them. we created three simulation schemes for a comprehensive evaluation: 1) pep subtypes had the same percent of . ; 2) pep _dcn was . and the other were . ; 3) pep _s a .tagln and pep _dcn were . , pep _htr a.sema a and pep _slc a .sstr were . , and pep _trpm was . . again, each simulation scheme was repeated times. under each scheme, the estimates by adroit consistently centered around true proportions and the other cell types had very low estimated values (close to zero) (fig. a, supplementary table ).
in comparison, though the estimates for the other cell types were also generally close to zero, the estimates of the pep cells by stereoscope systematically deviated from the true proportions for all three simulated schemes except for pep _s a .tagln . we further expanded the simulated spatial spots to the mixture of np cell types and mixture of nf cell types. in addition, we sampled np_mrgpra cells and mixed them with other distinct cell types (i.e., th, satellite glia and endothelial), as well as nf_calb cells mixed with other distinct cell types, and pep _trpm mixed with other distinct cell types. for all these simulated spatial spots, adroit’s estimates were consistently centered at true proportions, whereas stereoscope’s estimates deviated in almost all simulated schemes (extended data fig. , supplementary table ). we speculate the main reason stereoscope underperformed at these simulated spots is that it normalizes the total umi counts to the same number for all cells. in the real world, a spatial spot is unlikely to be a pool of cells that have the same total rna transcripts sampled, especially when a spot contains different cell types (e.g., immune cells have about -fold less total umis than the neuronal cells or subtypes of neuronal cells). our simulation pooled the sampled cells by adding up the raw umi counts per gene, which we believe best mimics the real data.

next, we asked how sensitive the methods are in detecting rare cell populations. we simulated mixtures of pep subtypes (i.e., pep _slc a .sstr , pep _htr a.sema a, pep _trpm ) with a series of low percent pep _trpm (from . to . by . ), and the other two cell types sharing the rest percentage equally (methods).
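the pooling step behind these simulated spots can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' simulation code: the function names `simulate_spot` and `detection_rate`, sampling cells with replacement, and the argument layout are all hypothetical choices made for the example. it shows raw umi counts being summed per gene with no per-cell normalization, and a detection rate computed as the fraction of repeated simulations whose estimate exceeds a cutoff.

```python
import random


def simulate_spot(cells_by_type, proportions, n_cells, rng):
    """Pool raw UMI counts of sampled cells into one synthetic spot.

    cells_by_type: dict mapping cell type -> list of per-cell count vectors
    proportions:   dict mapping cell type -> fraction of the spot's cells
    n_cells:       total number of cells pooled into the spot
    rng:           a random.Random instance (sampling is with replacement,
                   which is an assumption of this sketch)
    """
    n_genes = len(next(iter(cells_by_type.values()))[0])
    pooled = [0] * n_genes
    for cell_type, fraction in proportions.items():
        n_pick = round(fraction * n_cells)
        for cell in rng.choices(cells_by_type[cell_type], k=n_pick):
            # Add up raw counts per gene; no library-size normalization.
            for gene, count in enumerate(cell):
                pooled[gene] += count
    return pooled


def detection_rate(estimates, cutoff):
    """Fraction of repeated simulations in which the estimate exceeds cutoff."""
    return sum(est > cutoff for est in estimates) / len(estimates)
```

because the counts are summed as-is, a cell type with many transcripts per cell contributes more umis to the spot than its cell fraction alone would suggest, which is the property the simulation is meant to preserve.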
at each given percent, the simulation was repeated times. we then checked how accurately the percent of pep _trpm cells was estimated. the medians of adroit’s estimates were always close to the true proportions (fig. b, red lines), whereas those of stereoscope’s estimates were largely lower than the true proportions. stereoscope also missed the majority of the pep _trpm cell type when the simulated proportion was below . . this comparison implied adroit is more advantageous in detecting low percent cells. for a complete comparison, we also simulated other types of cell mixtures in the same way. at each given low percent, we computed how many times out of the low percent cell component was detected (estimates > . ). adroit had systematically higher detection rates, as well as higher consistency across different cell mixtures (fig. c, supplementary table ). notably, at a simulated percent of %, adroit achieved a detection rate of > %, making it a powerful tool in detecting rare cells.

though music was not designed for deconvoluting spatial spots, theoretically it can also be applied to spatial transcriptomics data. we thus also compared adroit to music on the same sets of simulation data above. we observed adroit was also significantly more accurate over all simulation scenarios of spatial spots (fig. a, extended data fig. and , supplementary fig. ), and more sensitive when detecting low percent cells (fig. b, c, supplementary fig. ).

application to real bulk rna-seq data of human pancreatic islets

though using synthetic bulk data based on mixing of single cells is a useful benchmarking strategy, the bulk and single cell rna-seq often use distinct rna library preparation and sequencing protocols.
the capability of a method to deconvolute real bulk samples should be assessed to ensure it is useful in real-world applications. we acquired real human pancreatic islets bulk samples from published studies , , (supplementary table ) and used single cell data of the same tissue as reference to infer the percentages of endocrine cell types (i.e., alpha, beta, delta, pp). the bulk samples were collected from distinct donors, including healthy donors, and donors with type diabetes (t d). each donor contributed to replicated bulk rna samples. replicates from the same donor are expected to have similar compositions and thus were used to assess the reproducibility of the estimates from adroit. for all cell types, adroit had highly consistent estimates for the same donors (fig. a, supplementary table ). the average standard deviations did not exceed % for all cell types (i.e., alpha: . ; beta: . ; delta: . ; pp: . ). to seek an independent validation, we obtained cell sorting results by rna-fish for of the donors (supplementary table ). the estimated cell proportions of the were highly consistent with the percentages measured by rna-fish (fig. b), and the consistency held for both major cells (alpha and beta) and the minor cells (delta and pp). reproducibility and independent validation showed adroit is reliable in deconvoluting real bulk rna-seq data.

we then asked if adroit can detect known biological differences between healthy and t d donors. loss of functional insulin-producing beta cells is a prominent characteristic of t d – , typically reflected by elevated level of hemoglobin a c (hba c) , .
among the healthy donors, the majority of beta cell proportions estimated by adroit ranged from % to % (fig. c), in agreement with the known percent range of beta cells in human islets tissue , . a significant decrease in the estimated beta cell proportions was seen in t d patients (p value = . e- ). further, a linear regression of estimated beta cell proportions on hba c levels showed a statistically significant negative association (p value = . e- ). adroit adequately reflected the cell composition difference between healthy donors and t d patients.

application to mouse brain spatial transcriptomics

we lastly demonstrated an application to real spatial transcriptomics data. given that the molecular architecture of brain tissue has been well studied, we chose mouse brain spatial transcriptomics data generated by x genomics, containing spatial spots (methods). the reference single cell data were acquired from an independent study which contains a comprehensive set of nervous cell types in brain . we curated the cell types by merging highly similar clusters and came down to a consolidated set of distinct brain cell types (methods, supplementary table ).

the cell contents inferred by adroit per spot appear to accurately match the expected cell types at that location (extended data fig. , supplementary table ). for example, the three subtypes of cortex excitatory neurons each occupied a sub-area in the cerebral cortex region. as another example, the shape of the hippocampal region was delineated by the estimated percentages of dentate gyrus granule/excitatory neurons.
for an independent validation, we checked the consistency between the estimated cell types and the in-situ hybridization (ish) images from the allen mouse brain atlas . we chose genes highly expressed in brain regions respectively, i.e., spink for hippocampal field ca , c ql for dentate gyrus, clic for choroid plexus, and synpo for thalamus . the spots enriched with the cell types (i.e., hippocampal ca excitatory neuron type , dentate gyrus granule neuron type , choroid plexus cell, thalamus excitatory neuron type ), as mapped by adroit, precisely co-localized with the strong signals of the marker genes on the ish images respectively (fig. d). this agreement confirmed that the spatial mapping of cell types by adroit is reliable.

computational efficiency

besides accuracy and robustness, another major advantage of adroit is its order-of-magnitude higher computational efficiency. adroit uses a two-step procedure to do the inference. the first step prepares the reference on single cell data, where per-gene means and dispersions are estimated and cell type specificity is subsequently computed. the built reference can be saved and reused. we tested the running time of the reference building using the aforementioned mouse brain single cell dataset containing ~ , cells. it took about . minutes on a cpu that has cores ( used for parallel computing). the second step takes the built reference and the target compound data as input and does the estimation. deconvoluting ~ compound rna-seq samples took around minutes. therefore, adroit in total took less than minutes and ~ gb memory usage on a regular cpu. as a comparison, music took about hour and minutes on the same data using the same cpu.
stereoscope ran about hours continuously with the published parameter setting (-scb -sce -topn_genes -ste -lr . -stb -scb ) on a powerful v gpu with cores and g memory, which is prohibitive for seeking a quick turnaround.

discussion

in this work we have demonstrated that adroit is capable of deconvoluting the cell compositions from the compound rna-seq data with a leading accuracy, measured by the consistency between the true and predicted cell proportions. its advantage over the existing state-of-the-art methods was verified over a wide range of use cases. in particular, adroit excelled in complex tissues composed of more than ten different cell types with a wide range of cell proportions (e.g., trabecular meshwork, dorsal root ganglion). in both cases, adroit performed significantly better than the comparators music and nnls on deconvoluting bulk rna-seq data. adroit is also more accurate and sensitive than stereoscope in resolving spatial transcriptomics spots, especially in detecting low percent cells. previous benchmarking often assumed the cell types in the synthetic bulk data are neither more nor fewer than the cell types collected in the reference, and thus the only unknown was the proportion of each cell type. this assumption may not hold. missing existing cell types or false predictions of non-existing ones can hinder the utility of an algorithm. thus, besides the overall accuracy, we also examined the sensitivity and specificity of the algorithms. we observed a superior sensitivity and specificity in adroit, an important advantage for its use in practice.
the reference single cell data used by adroit came from different platforms, such as the x genomics chromium instrument (the mouse dorsal root ganglion), and the fluidigm c system (the human pancreatic islets data). adroit consistently exhibited excellent performance across all benchmarking datasets independent of their single cell sequencing technology platforms. more importantly, this statement holds not only for deconvoluting the synthesized bulk data, but also for the real bulk rna-seq data. the latter typically does not apply unique molecular barcoding and requires a significantly different cdna amplification procedure from what is used in single cell rna-seq (methods). besides, the sequencing depth, read mapping and gene expression quantification are dissimilar as well. the fact that adroit accurately dissected the cell compositions in the real bulk samples based on the single cell reference data further supports its cross-platform applicability.

we attribute the power of adroit to its comprehensive modeling of relevant factors. firstly, we think a common rescaling factor is not sufficient to correct the platform difference between single cells and the compound data. rather, the impact of the platform difference varies across genes and is hardly linearly scaled. correcting such differences entails rescaling factors specifically tailored to each gene. adroit uses an adaptive learning approach to estimate such gene-wise correcting factors and does the correction in a unified model. in addition, the
contribution of a gene in a cell type to the loss function is jointly weighted by its specificity and variability in a cell type, where specificity and variability are defined in a way that accounts for the overdispersion property of count data. our observations over the multiple benchmarking datasets also show that the coexistence of similar cell types may have induced a collinearity condition that negatively impacted the regression-based methods developed by others. being able to alleviate this problem gives adroit an edge. all these factors help adroit to distinguish similar cell clusters while remaining sensitive enough to separate rare cell types.

technically, the input profiles of individual cell types to adroit do not necessarily come from single cell rna-seq. bulk rna-seq profiles of individual isolated cell types can be used as well. nevertheless, using single cell rna-seq data as the reference has a few key advantages. it is a high throughput approach wherein multiple cell types can be interrogated simultaneously. prior knowledge of the cell types present as well as their specific gene markers is not required, which allows novel cell types to be identified. although detection of lowly expressed genes has been a challenge for single cell rna-seq, significant enhancements have been demonstrated. for example, the number of detectable genes currently can reach an order of , per cell and keeps improving . as adroit focuses on the informative genes whose expressions are generally high, the detection limit of single cell rna-seq does not impose a significant drawback. indeed, given the single cell reference profiles, adroit successfully deconvoluted the real bulk rna-seq data and spatial transcriptomics data. the results suggest that, besides enriching our understanding of the bulk transcriptome data, adroit can leverage the vast and continuously growing amount of single cell data as well.
adroit is a reference-based deconvolution algorithm. a comprehensive collection of the possible cell components is important. however, completeness may not always be guaranteed. even with single cell acquisition that is independent of prior knowledge, rare and/or fragile cell types may not survive the capture procedure and hence are excluded. it is also difficult to generate a solid reference profile for cells that vary from sample to sample (e.g., tumor cells). currently adroit deals implicitly with the components unknown to the reference. if an unknown cell type resembles one of the referenced ones, it may be considered as part of the known cell type and their joint population is predicted. such an outcome is acceptable, as treating two similar cell types as one is still biologically meaningful although the resolution of the system may be compromised. if the unknown component is dissimilar to all the known ones, it will be ignored by adroit because its representative markers are unlikely to be among the top weighted genes associated with the known components. at the same time, the distinct component is expected to have a unique gene expression pattern and thus is unlikely to interfere significantly with the gene expressions from the known cell types. therefore, adroit essentially deconvolutes the relative populations among the known cell components. for example, adroit was able to correctly uncover the populations of endocrine cell types from the human islet bulk data despite the absence of many other cell types such as macrophages, schwann cells and endothelial cells in the input single cell reference .
although under such a circumstance the absolute percentages of the cells remain obscure, we expect their relative proportions can still be studied and be valuable. a future improvement is to explicitly model the unknown cell types and estimate their percentages from the signals in the compound data that cannot be explained by the contribution from the known components.

methods

gene selection

adroit selects genes that contain information about cell type identity, excluding non-informative genes that potentially introduce noise. there are two ways of selecting such genes: 1) the union of the genes whose expression is enriched in one or more cell types in the single cell umi count matrix. these genes are referred to as marker genes; 2) the union of the genes that vary the most across all the cells in the single cell umi count matrix, referred to as the highly variable genes. for marker genes, we recommend selecting the top ~ genes (p value < . ), ranked by fold change, from each cell type for resolving complex compound transcriptome data. considering that some genes may mark more than one cell type, we further require selected markers to be present in no more than cell types to ensure specificity. we also suggest selecting a minimum of total unique genes for an accurate estimation. if this is not satisfied, one may consider expanding the number of top genes and/or loosening the p value cutoff. adroit also offers the option to use highly variable genes.
to avoid the selected highly variable genes being dominated by large cell clusters whilst under-representing small clusters, adroit first balances the cell types in the single cell umi count matrix by finding the median size among all cell clusters, then samples cells from each cluster to make them equal to this size. next, adroit computes the variance of each gene across the cells in the balanced single cell umi matrix. due to the well-known dispersion effect in rna-seq data, directly computing variances from the count matrix can result in overestimation. we thus compute variances on the data normalized by variance-stabilizing transformation (vst) . genes with the top largest variances are then selected. in both ways, mitochondrial genes were excluded as their expression does not carry information about cell identity. the results shown in the current paper were based on the marker genes as described above. but we also demonstrated that using the balanced highly variable genes yields comparably accurate estimations (supplementary fig. ).

estimate gene mean and dispersion per cell type

modeling single cell rna-seq data is challenging due to the cellular heterogeneity, technical sensitivity, and noise. while the expression of some genes can go undetected by chance, other genes may be found to be highly dispersed. these factors can lead to excessive variability even within the same cell type. adroit combats high noise and computational complexity by building models with estimated mean and dispersion per cell type. this strategy reduces the data complexity while preserving the cell type specific information.
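the balanced selection of highly variable genes described in the gene selection subsection can be sketched as follows. this is an illustration under stated assumptions rather than adroit's implementation: the function names are hypothetical, clusters smaller than the median are upsampled with replacement (the text does not say how they are handled), and `log1p` is used only as a simple stand-in for the variance-stabilizing transformation.

```python
import math
import random
import statistics


def balance_clusters(cells_by_type, rng):
    """Resample every cluster to the median cluster size.

    cells_by_type: dict mapping cell type -> list of per-cell count vectors
    rng:           a random.Random instance
    """
    target = int(statistics.median(len(c) for c in cells_by_type.values()))
    balanced = []
    for cells in cells_by_type.values():
        if len(cells) >= target:
            balanced.extend(rng.sample(cells, target))      # downsample
        else:
            # Assumption of this sketch: small clusters are upsampled
            # with replacement to reach the target size.
            balanced.extend(rng.choices(cells, k=target))
    return balanced


def top_variable_genes(cells, n_top):
    """Indices of the n_top genes with the largest variance across cells."""
    n_genes = len(cells[0])
    variances = []
    for gene in range(n_genes):
        # log1p is a stand-in for VST, not the transformation in the paper.
        values = [math.log1p(cell[gene]) for cell in cells]
        variances.append(statistics.pvariance(values))
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return ranked[:n_top]
```

balancing before ranking keeps a large cluster from dominating the variance estimate, so genes that vary mainly between small clusters can still reach the top of the list.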
although typical analyses of rna-seq data start with normalization, adroit does not normalize prior to the mean estimation. performing a normalization across all cell types forces every cell type to have the same amount of rna transcripts, measured by the total unique molecular identifier (umi) counts per cell. however, different cell types can have dramatically different amounts of transcripts. for example, the amount of rna transcripts in neuronal cells is about times that in glial cells. thus, normalization can falsely alter the relative abundance of cell types, misleading the estimation of cell type percentages. to avoid this problem, adroit models the means using the raw umi counts. studies have shown that umi counts follow a negative binomial distribution , . we therefore fit negative binomial distributions to the single cells of each cell type and build the model based on the estimated means and dispersions of the selected genes.

more specifically, let $X_{ik}$ be the set of single cell umi counts of gene $i \in 1, \dots, I$ for all cells in cell type $k \in 1, \dots, K$, where $I$ is the number of selected genes and $K$ denotes the number of cell types in the single cell reference. $X_{ik}$ follows a negative binomial distribution,

$$X_{ik} \sim NB(\lambda_{ik}, p_{ik}),$$ ( )

where $\lambda_{ik}$ is the dispersion parameter of gene $i$ in cell type $k$, and $p_{ik}$ is the success probability, i.e., the probability of gene $i$ in cell type $k$ getting one umi. the two parameters are estimated by mle. the likelihood function is

$$LH(\lambda_{ik}, p_{ik} \mid X_{ik}) = \prod_{j=1}^{n_k} f(X_{ik} \mid \lambda_{ik}, p_{ik}),$$ ( )

where $n_k$ is the number of cells in cell type $k$, and $f$ is the probability mass function of the negative binomial distribution.
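as an illustration of how such a likelihood can be maximized numerically, the sketch below fits a negative binomial to the raw counts of one gene in one cell type. it is a stand-in written for this example, not the `fitdist` routine the authors use: the function names, the search bounds and the golden-section search are assumptions. it relies on the standard fact that, for a fixed dispersion $\lambda$, the likelihood is maximized when the model mean $\lambda p/(1-p)$ equals the sample mean, so only $\lambda$ has to be searched.

```python
import math


def nb_loglik(counts, size, prob):
    """Log-likelihood under f(x) = C(x + size - 1, x) * prob**x * (1 - prob)**size."""
    total = 0.0
    for x in counts:
        total += (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
                  + x * math.log(prob) + size * math.log1p(-prob))
    return total


def fit_nb(counts, lo=1e-3, hi=1e3, iters=100):
    """Return (size, prob) maximizing the likelihood; the bounds are assumptions."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        raise ValueError("all counts are zero; the fit is degenerate")

    def profile(log_size):
        size = math.exp(log_size)
        # For fixed size, the optimal prob matches the sample mean.
        return nb_loglik(counts, size, mean / (mean + size))

    # Golden-section search for the maximum over log(size).
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if profile(c) > profile(d):
            b = d
        else:
            a = c
    size = math.exp((a + b) / 2.0)
    return size, mean / (mean + size)


def nb_moments(size, prob):
    """Mean and variance implied by the fitted parameters."""
    mu = size * prob / (1.0 - prob)
    var = size * prob / (1.0 - prob) ** 2
    return mu, var
```

working with raw counts here, rather than normalized values, is what keeps the fitted means comparable to the per-cell transcript abundance of each cell type.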
the mle estimates are then given by

$$(\hat{\lambda}_{ik}, \hat{p}_{ik}) = \arg\max_{\lambda_{ik},\, p_{ik}} LH(\lambda_{ik}, p_{ik} \mid X_{ik}).$$ ( )

once the success probability and dispersion are estimated, the mean and variance estimates can be computed numerically according to the properties of the negative binomial distribution,

$$\mu_{ik} = \frac{\hat{\lambda}_{ik}\, \hat{p}_{ik}}{1 - \hat{p}_{ik}},$$ ( )

$$\sigma_{ik}^{2} = \frac{\hat{\lambda}_{ik}\, \hat{p}_{ik}}{(1 - \hat{p}_{ik})^{2}}.$$ ( )

estimation using mle has been readily coded in many r packages. we choose the ‘fitdist’ function from the ‘fitdistrplus’ package for its fast computation speed and flexibility in selecting distributions. estimations are done for each selected gene in each cell type, resulting in an $I \times K$ matrix of cell type means.

cell type specificity of genes

genes with cell-type specific expression patterns better represent cell types, and thus are more important when used for resolving cell type composition. in line with this property, adroit weights genes with high specificity more than less specific ones. highly specific genes usually have consistently high expression and thus relatively low variance among cells within a cell type. to compute the cell type specificity of a gene, we first identify the cell type in which the gene has the highest expression (i.e., the most specifically expressed cell type), then define the specificity of this gene as the mean-to-variance ratio within that cell type. a high ratio renders high weight to the gene in the model. we use the estimated means and variances from the negative binomial fitting ($\mu_{ik}$ and $\sigma_{ik}^{2}$ in eq. and ). let $k^{*}$ be the index of the cell type that has the highest mean expression of gene $i$,

$$k^{*} = \arg\max_{k} \{\mu_{ik} \mid k \in 1, \dots, K\},$$ ( )

then the cell type specificity weight for gene $i$, denoted $w_i^{S}$, is given by

$$w_i^{S} = \frac{\mu_{ik^{*}}}{\sigma_{ik^{*}}^{2}},$$ ( )
and it is computed for each gene in the set of selected genes.

cross-sample gene variability

the variability of a gene reflects how stable the gene is across samples. the idea of weighting genes based on variability across samples was first explored by wang et al , where variability was defined as the cross-sample variance. by down-weighting the highly variable genes, the authors achieved a great advantage over the traditional unweighted method. genes with low cross-sample variability better represent the population, and hence are more trustworthy for learning the cell composition. adroit incorporates the same notion to weight the importance of genes, but defines the variability in a more sophisticated way. similar to how we define the cell type specificity, adroit utilizes the mean and variance, and computes the variance-to-mean ratio (vmr) to stand for cross-sample gene variability. but here the mean and variance are computed across samples. the vmr is better scaled than the simple variance, and it avoids underweighting genes that have low expression while avoiding overweighting genes that are hugely dispersed. in addition, adroit extends the method to fit the case where multiple samples are not available. we propose three ways to compute the vmr, depending on whether multi-sample data is available. typically, the compound transcriptome data to be deconvolved have multiple samples. in bulk rna-seq data, multiple samples are usually included to control for biological variability. in spatial transcriptome data, the spatial spots can be seen as multiple samples. therefore, we first consider computing the cross-sample gene variability from compound
transcriptome data. in case multi-sample compound data is not available, adroit utilizes the single cell reference, and synthesizes compound samples by pooling all cells belonging to the same sample. if multi-sample data is not available for either, adroit subsamples single cells and pools them to make pseudo samples. let $Y_{ij}$ denote the counts of sequences for gene $i$ in sample $j \in 1, \dots, J$, then

$$Y_{ij} \sim NB(\lambda_{i}, p_{i}),$$ ( )

where $\lambda_{i}$ is the dispersion parameter of gene $i$ across the samples, and $p_{i}$ is the success probability. again, we use mle to get the estimates $\hat{\lambda}_{i}$ and $\hat{p}_{i}$, following which the cross-sample mean and variance can be numerically computed:

$$\mu_{i} = \frac{\hat{\lambda}_{i}\, \hat{p}_{i}}{1 - \hat{p}_{i}},$$ ( )

$$\sigma_{i}^{2} = \frac{\hat{\lambda}_{i}\, \hat{p}_{i}}{(1 - \hat{p}_{i})^{2}},$$ ( )

and the cross-sample variability for gene $i$ is then defined as

$$VMR_{i} = \frac{\sigma_{i}^{2}}{\mu_{i}} = \frac{1}{w_i^{V}},$$ ( )

where $w_i^{V}$ is later used in the model. the cross-sample variability weight is computed for each gene in the set of selected genes.

gene-wise scaling factor to correct platform bias

when linking the compound data to the single cell data, a rescaling factor is often used to account for the library size and platform difference. the existing methods adopt a single rescaling factor for each unit of sample, i.e., all genes of a single sample are multiplied by the same factor , . this operation is based on a strong assumption that the impact of the platform difference on every gene is the same and linearly scaled among different cell types, which is hardly true.
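the two gene weights defined in the preceding subsections reduce to simple ratios once the negative binomial means and variances are available. the sketch below is only a worked illustration; the function names and the list-based inputs are hypothetical and not taken from the adroit package.

```python
def specificity_weight(means, variances):
    """Mean-to-variance ratio in the cell type where the gene is highest.

    means, variances: per-cell-type negative binomial mean and variance
    of a single gene, in the same cell type order.
    """
    k_star = max(range(len(means)), key=lambda k: means[k])
    return means[k_star] / variances[k_star]


def variability_weight(mean, variance):
    """Inverse of the cross-sample variance-to-mean ratio of a single gene."""
    return mean / variance
```

for instance, a gene with cell type means (2, 10, 1) and variances (4, 20, 1.5) is most expressed in the second cell type, so its specificity weight is 10 / 20 = 0.5; a gene with cross-sample mean 5 and variance 50 has a vmr of 10 and a variability weight of 0.1.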
in addition, because estimates from a linear model are easily affected by outliers, the estimation of cell proportions can be steered away from the truth by extremely highly expressed genes. therefore, applying a uniform scaling factor to all genes is inappropriate. to overcome this problem, adroit instead estimates gene-wise scaling factors via an adaptive learning strategy and rescales each gene with its respective scaling factor.

to proceed, we first take the mean gene expression from the compound samples ($\mu_i$, the cross-sample mean defined above) and the estimated mean of each cell type from the single cell data ($\mu_{ik}$), then apply a traditional non-negative least squares regression (nnls) to get a rough estimate of the proportion of each cell type, denoted $\tau_k$. for each gene, a predicted mean expression ($\sum_{k=1}^{K} \hat{\tau}_k \mu_{ik}$) is computed as the weighted sum of the cell type means, where the weights are the roughly estimated proportions. the regression equation is given by

$$\mu_i = A \cdot \Big( \sum_{k=1}^{K} \tau_k \mu_{ik} + \varepsilon \Big), \quad 0 < \tau_k, \quad \sum_{k=1}^{K} \tau_k = 1,$$

where $A$ is a constant that ensures the $\tau_k$ sum to 1 and $\varepsilon$ is the error term. we use the ‘nnls’ function in the ‘nnls’ package to estimate the $\tau_k$. next, we calculate the ratio between the mean expression from the compound samples and the predicted mean, and define the gene-wise rescaling factor as the logarithm of the ratio plus 1,

$$r_i = \log\Big( \frac{\mu_i}{\sum_{k=1}^{K} \hat{\tau}_k \mu_{ik}} + 1 \Big).$$

given the dispersion property of count data, the logarithm of the ratio is a more appropriate statistic as it results in relatively stable scaling factors. the addition of 1 avoids taking the logarithm of zero. by multiplying by the flexible gene-wise rescaling factor, the “outlier” genes will be
pushed toward the direction of the true regression line, while the genes around the true regression line are less affected (fig. b).

weighted and regularized model

we next designed a model that incorporates all these factors to do the actual estimation of cell type proportions. adroit builds upon the non-negative least squares regression model. it gives high weights to genes with high cell type specificity and low cross-sample variability. this is done by optimizing a weighted sum-of-squares loss function $L$, where the weights consist of two components: the cross-sample variability weight $w_i^{v}$ and the cell type specificity weight $w_i^{s}$. the gene-wise scaling factor $r_i$, tailored to each gene, corrects the bias due to the technology difference between compound samples and single cell data. in complex tissues (e.g., neural tissues) where many highly similar subtypes are common, closely related subtypes can have strong collinearity, leading to overestimation of some cell types and underestimation or omission of others. adroit handles this problem by including an l2 norm of the estimates as the regularization component.

denote $\beta_k$ as the unscaled coefficient for cell type $k$. for a compound transcriptome sample $j$, the loss function over the $I$ selected genes is given by

$$L(\beta_1, \dots, \beta_K \mid y_{ij}, w_i^{v}, w_i^{s}, r_i, \hat{\mu}_{ik}) = \sum_{i=1}^{I} w_i^{v} \cdot w_i^{s} \cdot \Big( y_{ij} - r_i \cdot \sum_{k=1}^{K} \beta_k \hat{\mu}_{ik} \Big)^2 + \sum_{k=1}^{K} \beta_k^2.$$

then the coefficients $\beta_k$ can be estimated by minimizing the loss function under the constraint $\beta_1, \dots, \beta_K \ge 0$,

$$\hat{\beta}_1, \dots, \hat{\beta}_K = \underset{\beta_1, \dots, \beta_K \ge 0}{\arg\min} \; L.$$

the estimation is done with the gradient projection method of byrd et al. we derive the gradient function by taking the partial derivative of the loss function with respect to $\beta_k$, written here up to the constant factor of 2 shared by both terms,

$$G_k = \nabla_{\beta_k} L = - \sum_{i=1}^{I} r_i \cdot \hat{\mu}_{ik} \cdot w_i^{v} \cdot w_i^{s} \cdot \Big( y_{ij} - r_i \cdot \sum_{k'=1}^{K} \beta_{k'} \hat{\mu}_{ik'} \Big) + \beta_k.$$
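the weighted, regularized loss and its constrained minimization can be sketched as follows. this is a minimal pure-python illustration, not adroit's code: it uses plain projected gradient descent with a conservative fixed step in place of the bound-constrained quasi-newton routine cited in the text, it differentiates the loss exactly (keeping the factor of 2), it takes the product of the two gene weights as a single input, and all function and argument names are hypothetical.

```python
def loss_and_grad(beta, y, mu, w, r):
    """Weighted squared loss with a ridge penalty, and its exact gradient.

    beta: K coefficients; y: I observed values for one compound sample;
    mu: I x K cell type means; w: I combined gene weights (product of the
    variability and specificity weights); r: I gene-wise scaling factors.
    """
    n_types = len(beta)
    loss = sum(b * b for b in beta)
    grad = [2.0 * b for b in beta]
    for y_i, mu_i, w_i, r_i in zip(y, mu, w, r):
        resid = y_i - r_i * sum(b * m for b, m in zip(beta, mu_i))
        loss += w_i * resid * resid
        for k in range(n_types):
            grad[k] -= 2.0 * w_i * r_i * mu_i[k] * resid
    return loss, grad


def fit_proportions(y, mu, w, r, n_iter=2000):
    """Projected gradient descent on beta >= 0, then rescale to sum to 1."""
    n_types = len(mu[0])
    # Conservative step: 1 / (upper bound on the largest Hessian eigenvalue),
    # using the trace of the weighted Gram matrix as the bound.
    bound = 2.0 + 2.0 * sum(
        w_i * r_i * r_i * sum(m * m for m in mu_i)
        for mu_i, w_i, r_i in zip(mu, w, r)
    )
    step = 1.0 / bound
    beta = [1.0] * n_types
    for _ in range(n_iter):
        _, grad = loss_and_grad(beta, y, mu, w, r)
        # Gradient step followed by projection onto the non-negative orthant.
        beta = [max(0.0, b - step * g) for b, g in zip(beta, grad)]
    total = sum(beta)
    if total == 0:
        raise ValueError("all coefficients are zero; proportions undefined")
    return [b / total for b in beta]
```

for a toy input with two genes, identity cell type means, unit weights and unit scaling factors, `fit_proportions([3.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [1.0, 1.0])` converges to coefficients (1.5, 0.5), i.e. proportions of 0.75 and 0.25. the fixed step is chosen for simplicity; a quasi-newton method would typically need far fewer iterations.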
adroit uses the function ‘optim’ from the r package ‘stats’ to do the estimation, providing the loss function and the gradient given above. to get the final estimates of cell type proportions, we rescale the coefficients $\hat{\beta}_k$ so that they sum to 1,

$$\theta_k = \frac{\hat{\beta}_k}{\sum_{k=1}^{K} \hat{\beta}_k}.$$

each compound sample $j$ is estimated independently by the model described above.

simulation of bulk rna-seq and spatial transcriptomics data

bulk rna-seq data used for benchmarking are synthesized by adding up the raw umi reads per gene from all single cells of a sample, regardless of cell type. denote $t_k$ as a cell in cell type $k$, with $t_k \in \{1, \dots, T_k\}$, where $T_k$ is the number of cells in cell type $k$. let $Y_{ij}^{\mathrm{bulk}}$ be the read count of gene $i$ in a synthesized bulk sample $j$, and $X_{i t_k}$ be the umi count of the gene in cell $t_k$, then

$$Y_{ij}^{\mathrm{bulk}} = \sum_{k=1}^{K} \sum_{t_k=1}^{T_k} X_{i t_k}.$$

the true proportion of cell type $k$ is given by

$$\theta_k^{0} = \frac{T_k}{\sum_{k=1}^{K} T_k}.$$

to simulate spatial transcriptomic spots, we first sample cells without replacement from each cell type and add them up, then mix them with designed proportions. for example, to simulate a spot with a proportion $p_k$ of cell type $k$, the read count $Y_{ij}^{\mathrm{spot}}$ of gene $i$ in a spatial spot $j$ is given by

$$Y_{ij}^{\mathrm{spot}} = \sum_{k=1}^{K} p_k \sum_{n=1}^{10} X_{ikn},$$

where $X_{ikn}$ is the umi count of gene $i$ in sampled cell $n$ of cell type $k$. for each mixing scheme, the simulation is repeated times.

evaluation statistics

we compared the estimated cell type proportions with the ground truth by calculating the following statistics. the mad and rmsd are given by

$$\mathrm{mAD} = \frac{\sum_{k=1}^{K} |\theta_k - \theta_k^{0}|}{K}, \qquad \mathrm{RMSD} = \sqrt{\frac{\sum_{k=1}^{K} (\theta_k - \theta_k^{0})^2}{K}}.$$

the pearson correlation coefficient is computed as

$$\rho_P = \frac{\sum_{k=1}^{K} (\theta_k - \bar{\theta})(\theta_k^{0} - \bar{\theta}^{0})}{\sqrt{\sum_{k=1}^{K} (\theta_k - \bar{\theta})^2} \, \sqrt{\sum_{k=1}^{K} (\theta_k^{0} - \bar{\theta}^{0})^2}},$$
where $\bar{\theta}$ and $\bar{\theta}^{0}$ are the means of the estimated proportions and true proportions, respectively. the spearman correlation coefficient is given by

$$\rho_S = \frac{\sum_{k=1}^{K} (r_k - \bar{r})(r_k^{0} - \bar{r}^{0})}{\sqrt{\sum_{k=1}^{K} (r_k - \bar{r})^2} \, \sqrt{\sum_{k=1}^{K} (r_k^{0} - \bar{r}^{0})^2}},$$

where $r_k$ is the rank of $\theta_k$ and $r_k^{0}$ is the rank of $\theta_k^{0}$.

single cell rna sequencing of mouse dorsal root ganglion

as described previously, lumbar drgs were isolated from adult c bl/ mice and transferred to a dissociation buffer (dulbecco's modified eagle's medium supplemented with % heat-inactivated fetal calf serum) (gibco; cat # a - ). to generate a single cell suspension, drgs were subjected to a step-enzymatic dissociation followed by a mechanical dissociation. in brief, drgs were first incubated with . % collagenase p from clostridium histolyticum (roche applied science; cat # ) for minutes in an eppendorf thermomixer c ( °c; intermittent rpm shaking for about sec every minutes). then, drgs were transferred to a hank's balanced salt solution (hbss, mg + and ca + free; invitrogen) supplemented with . % trypsin (worthington biochemical corp.; cat # lsoo ) and . % edta and incubated for minutes at °c in the eppendorf thermomixer c. trypsin was neutralized by the addition of . mg/ml mgso (sigma; cat #m- ) and drgs were triturated with pasteur pipettes. the resulting cell suspension was passed through a µm mesh filter to remove remaining chunks of tissue and centrifuged for minutes at rpm at room temperature. the pellet was resuspended in hbss (ca +, mg + free; invitrogen) and the cell suspension was run on a % percoll plus gradient (sigma ge - - ) to further remove debris. finally, cells were resuspended in pbs supplemented with .
% bsa at a concentration of cells/µl and cell viability was determined using the automated cell analyzer nucleocounter® nc- ™. the suspended single cells were loaded on a chromium single cell instrument ( x genomics) with about cells per lane to minimize the presence of doublets. - cells per lane were recovered. rna-seq libraries were constructed using chromium single cell ’ library, gel beads & multiplex kit ( x genomics). single end sequencing was performed on illumina nextseq . read starts with a -bp umi and cell barcode, followed by an -bp i sample index. read contains a -bp transcript read. sample de-multiplexing, alignment, filtering, and umi counting were conducted using cell ranger single-cell software suite ( x genomics, v . . ). mouse mm genome assembly and ucsc gene model were used for the alignment.

data preprocessing

drg single cell data

the umi data output from cell ranger single-cell software suite ( x genomics, v . . ) was analyzed using the seurat package to assess cell quality and identify cell types, similar to what was described previously. cells with a number of detected genes less than or over , or with a umi ratio of mitochondria-encoded genes versus all genes over . , were removed. the umi data were normalized by the ‘normalizedata’ method in seurat with default settings. to avoid potential sample-to-sample variation caused by technical variation at various experimental steps, we employed the seurat data integration method. the top variable genes of each of the samples were identified using ‘findvariablefeatures’ with selection.method=‘vst’. based on the union of these variable genes, the anchor cells in each sample were identified by ‘findintegrationanchors’.
all the samples were then integrated by ‘integratedata’. we subsequently scaled the integrated data (‘scaledata’) and performed dimension reduction (‘runpca’). cells were then clustered based on the first principal components by applying ‘findneighbors’ and ‘findclusters’ (resolution= . , algorithm= ). marker genes for each cluster were identified using ‘findallmarkers’. parameters were set such that these genes were expressed in at least % of the cells in the cluster, and were on average -fold higher than in the rest of the cells, with a multiple-testing adjusted wilcoxon test p value of less than . . the specificity of the canonical cell type-specific genes or cell cluster-specific genes was further examined by visualization (extended data fig. ) and used to define the cell type for each cluster. at the end, the original umi data from genes and cells that passed the quality control were organized into a matrix (genes as rows and cell identifiers as columns). this matrix, together with the cell type label for each cell therein, was loaded into adroit as reference profiles.

mouse brain single cell data

the scrna-seq reference data of the mouse brain were obtained from zeisel et al. among all the available data, we only retained , cells that were acquired from the brain regions, had a cell type assigned by the authors and had a minimal total umi of . these cells corresponded to clusters at the finest taxonomy level in the original study. as many of the clusters are highly similar, we decided to merge some of them to simplify the reference landscape.
first, the top cluster-enriched markers were derived using scanpy via the ‘rank_genes_groups’ function (method=‘wilcoxon’), following normalization (‘normalize_per_cell’), log transformation (‘log1p’) and regressing out (‘regress_out’) the variance associated with the total umi and the percentage of mitochondrial chromosome-encoded genes per cell. then, the pair-wise overlapping p-values among the clusters were calculated using the top marker genes, assuming the hypergeometric null distribution. last, clusters with overlapping p-values more significant than e- were merged, and new names were assigned by jointly considering the original annotation, the molecular features and the specificity to certain brain regions. a total of cell types were determined that cover all the brain regions and their important substructures (supplementary table ). to make the reference dataset more manageable in size and more balanced in the representation of cell types, we downsampled each cluster to no more than cells. a final set of , cells over cell types was used for the deconvolution of the mouse brain spatial transcriptome data.

human islets

we used the high-quality human islets single cell data and annotation from xin et al. the rpkm expression table was directly downloaded and used as is. the rna-fish data were also from this study. for the real bulk human pancreatic islets data, the read count table was deconvoluted. only data from donors with an hba1c level available were included in the regression of beta cell proportion on hba1c level (fig. c, supplementary table ).
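the regression of estimated beta cell proportion on hba1c, restricted to donors with a recorded hba1c value, can be sketched as an ordinary least squares fit. the text does not specify the fitting routine, so the simple linear model, the use of `None` to mark a missing hba1c value and the function name are assumptions made for illustration only.

```python
def regress_on_hba1c(proportions, hba1c):
    """Least-squares slope and intercept of proportion on HbA1c.

    Donors whose HbA1c entry is None (not available) are dropped first.
    """
    pairs = [(x, y) for x, y in zip(hba1c, proportions) if x is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two donors with HbA1c available")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("HbA1c values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

a negative slope from such a fit corresponds to lower estimated beta cell proportions at higher hba1c levels.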
trabecular meshwork

we downloaded the raw sequence data and followed the same analysis procedure as in patel et al for quality control and cell type identification.

mouse brain spatial transcriptomics data by x visium platform

the filtered cell matrix, tissue image and spatial coordinates of a coronal section of an adult c bl/ mouse brain were available for download from x genomics and were used as is.

mouse brain ish images

the ish images were directly downloaded from the allen mouse brain atlas by searching the gene names. the images were used without further editing except for cropping.

data availability

drg single cell data are deposited at ncbi geo (accession number: gse ). the bulk rna-seq and rna-fish data for human pancreatic islets were initially published as aggregated data, with the data processing and experimental procedures described therein. we acquired the individual sample data from the authors and release them along with the current study (supplementary table and supplementary table ). the other public data analyzed in this study are available from: geo (human pancreatic islets single cell data: gse ); ncbi (human trabecular meshwork single cell data: prjna ; mouse brain single cell data: srp ). mouse brain spatial transcriptomic data were downloaded from the x genomics website (https://support. xgenomics.com/spatial-gene-expression/datasets/ . . /v _adult_mouse_brain_coronal_section).

code availability

adroit’s source code is available on github (https://github.com/taoyang-dev/adroit).

software

the statistical analyses were done with r statistical software (v . . ) and python (v . . ). the packages used include seurat (v . . ), scanpy (v . . ), dplyr (v . . .
), doparallel (v . . ), data.table (v . . ), fitdistrplus (v . - ), nnls (v . ).

reference

wang, z., gerstein, m. & snyder, m. rna-seq: a revolutionary tool for transcriptomics. nature reviews genetics ( ) doi: . /nrg .
chu, g. c., kimmelman, a. c., hezel, a. f. & depinho, r. a. stromal biology of pancreatic cancer. journal of cellular biochemistry ( ) doi: . /jcb. .
bussard, k. m., mutkus, l., stumpf, k., gomez-manzano, c. & marini, f. c. tumor-associated stromal cells as key contributors to the tumor microenvironment. breast cancer research ( ) doi: . /s - - - .
munn, d. h. & bronte, v. immune suppressive mechanisms in the tumor microenvironment. current opinion in immunology ( ) doi: . /j.coi. . . .
gonzalez, h., hagerling, c. & werb, z. roles of the immune system in cancer: from tumor initiation to metastatic progression. genes and development ( ) doi: . /gad. . .
garner, h. & de visser, k. e. immune crosstalk in cancer progression and metastatic spread: a complex conversation. nature reviews immunology ( ) doi: . /s - - -z.
singh, u. p. et al. chemokine and cytokine levels in inflammatory bowel disease patients. cytokine ( ) doi: . /j.cyto. . . .
van lint, p. & libert, c. chemokine and cytokine processing by matrix metalloproteinases and its effect on leukocyte migration and inflammation. j. leukoc. biol. ( ) doi: . /jlb. .
zelová, h. & hošek, j. tnf-α signalling and inflammation: interactions between old
acquaintances. inflammation research ( ) doi: . /s - - - .
koelman, l., pivovarova-ramich, o., pfeiffer, a. f. h., grune, t. & aleksandrova, k. cytokines for evaluation of chronic inflammatory status in ageing research: reliability and phenotypic characterisation. immun. ageing ( ) doi: . /s - - - .
landskron, g., de la fuente, m., thuwajit, p., thuwajit, c. & hermoso, m. a. chronic inflammation and cytokines in the tumor microenvironment. journal of immunology research ( ) doi: . / / .
ståhl, p. l. et al. visualization and analysis of gene expression in tissue sections by spatial transcriptomics. science ( ) doi: . /science.aaf .
vickovic, s. et al. high-definition spatial transcriptomics for in situ tissue profiling. nat. methods ( ) doi: . /s - - -y.
tang, f. et al. mrna-seq whole-transcriptome analysis of a single cell. nat. methods ( ) doi: . /nmeth. .
denisenko, e. et al. systematic assessment of tissue dissociation and storage biases in single-cell and single-nucleus rna-seq workflows. genome biol. ( ) doi: . /s - - - .
nguyen, q. h., pervolarakis, n., nee, k. & kessenbrock, k. experimental considerations for single-cell rna sequencing approaches. frontiers in cell and developmental biology ( ) doi: . /fcell. . .
tanay, a. & regev, a. scaling single-cell genomics from phenomenology to mechanism. nature ( ) doi: . /nature .
abbas, a. r., wolslegel, k., seshasayee, d., modrusan, z. & clark, h. f. deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. plos one ( ) doi: . /journal.pone. .
newman, a. m. et al. robust enumeration of cell subsets from tissue expression profiles. nat. methods ( ) doi: . /nmeth. .
baron, m.
et al. a single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. cell syst. ( ) doi: . /j.cels. . . .
tsoucas, d. et al. accurate estimation of cell-type composition from gene expression data. nat. commun. ( ) doi: . /s - - -z.
wang, x., park, j., susztak, k., zhang, n. r. & li, m. bulk tissue cell type deconvolution with multi-subject single-cell expression reference. nat. commun. ( ) doi: . /s - - -x.
andersson, a. et al. single-cell and spatial transcriptomics enables probabilistic inference of cell type topography. commun. biol. , ( ).
newman, a. m. et al. determining cell type abundance and expression from bulk tissues with digital cytometry. nat. biotechnol. ( ) doi: . /s - - - .
myung, i. j. tutorial on maximum likelihood estimation. j. math. psychol. ( ) doi: . /s - ( ) - .
bassett, r. & deride, j. maximum a posteriori estimators as a limit of bayes estimators. math. program. ( ) doi: . /s - - - .
zhao, y. & simon, r. gene expression deconvolution in clinical samples. genome medicine ( ) doi: . /gm .
chiu, y. j., hsieh, y. h. & huang, y. h. improved cell composition deconvolution method of bulk gene expression profiles to quantify subsets of immune cells. bmc med. genomics ( ) doi: . /s - - - .
kang, k. et al. cdseq: a novel complete deconvolution method for dissecting heterogeneous samples using gene expression data. plos comput. biol. ( ) doi: . /journal.pcbi. .
qiao, w. et al. pert: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. plos comput. biol. ( ) doi: . /journal.pcbi. .
zaitsev, k., bambouskova, m., swain, a. & artyomov, m. n.
complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. nat. commun. ( ) doi: . /s - - - .
zeisel, a. et al. molecular architecture of the mouse nervous system. cell ( ) doi: . /j.cell. . . .
donovan, m. k. r., d’antonio-chronowska, a., d’antonio, m. & frazer, k. a. cellular deconvolution of gtex tissues powers discovery of disease and cell-type associated regulatory variants. nat. commun. ( ) doi: . /s - - - .
phipson, b., zappia, l. & oshlack, a. gene length and detection bias in single cell rna sequencing protocols. f research ( ) doi: . /f research. . .
chen, g., ning, b. & shi, t. single-cell rna-seq technologies and related computational data analysis. frontiers in genetics ( ) doi: . /fgene. . .
chen, d. & plemmons, r. j. nonnegativity constraints in numerical analysis. in the birth of numerical analysis ( ). doi: . / _ .
lein, e. s. et al. genome-wide atlas of gene expression in the adult mouse brain. nature ( ) doi: . /nature .
xin, y. et al. rna sequencing of single human islet cells reveals type diabetes genes. cell metab. ( ) doi: . /j.cmet. . . .
patel, g. et al. molecular taxonomy of human ocular outflow tissues defined by single-cell transcriptomics. proc. natl. acad. sci. , lp – ( ).
xin, y. et al. pseudotime ordering of single human b-cells reveals states of insulin production and unfolded protein response. diabetes ( ) doi: . /db - .
gutierrez, g. d. et al. gene signature of proliferating human pancreatic a cells. endocrinology ( ) doi: . /en. - .
cerf, m. e. beta cell dysfunction and insulin resistance. frontiers in endocrinology ( ) doi: . /fendo. . .
maedler, k. & donath, m. y.
beta-cells in type diabetes: a loss of function and mass. hormone research ( ).
donath, m. y. et al. mechanisms of β-cell death in type diabetes. diabetes ( ) doi: . /diabetes. .suppl_ .s .
calanna, s. et al. alpha- and beta-cell abnormalities in haemoglobin a c-defined prediabetes and type diabetes. acta diabetol. ( ) doi: . /s - - - .
kanat, m. et al. the relationship between β-cell function and glycated hemoglobin. diabetes care , lp – ( ).
nepton, s. beta-cell function and failure. in type diabetes ( ). doi: . / .
dolenšek, j., rupnik, m. s. & stožer, a. structural similarities and differences between the human and the mouse pancreas. islets ( ) doi: . / . . .
lein, e. s. et al. genome-wide atlas of gene expression in the adult mouse brain. nature , – ( ).
vieth, b., parekh, s., ziegenhain, c., enard, w. & hellmann, i. a systematic evaluation of single cell rna-seq analysis pipelines. nat. commun. ( ) doi: . /s - - - .
anders, s. & huber, w. differential expression analysis for sequence count data. genome biol. ( ) doi: . /gb- - - -r .
hafemeister, c. & satija, r. normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. genome biol. ( ) doi: . /s - - - .
svensson, v. droplet scrna-seq is not zero-inflated. nature biotechnology ( ) doi: . /s - - - .
delignette-muller, m. l. & dutang, c. fitdistrplus: an r package for fitting distributions. j. stat. softw. ( ) doi: . /jss.v .i .
mullen, katharine m., i. h. m. van s. nnls: the lawson-hanson algorithm for non-negative least squares (nnls). r packag. version . ( ).
byrd, r. h., lu, p., nocedal, j. & zhu, c.
a limited memory algorithm for bound constrained optimization. siam j. sci. comput. ( ) doi: . / .
the r core team. r: a language and environment for statistical computing. r foundation for statistical computing ( ).
alessandri-haber, n. et al. hypotonicity induces trpv -mediated nociception in rat. neuron ( ) doi: . /s - ( ) - .
zheng, g. x. y. et al. massively parallel digital transcriptional profiling of single cells. nat. commun. ( ) doi: . /ncomms .
stuart, t. et al. comprehensive integration of single-cell data. cell ( ) doi: . /j.cell. . . .
wolf, f. a., angerer, p. & theis, f. j. scanpy: large-scale single-cell gene expression data analysis. genome biol. ( ) doi: . /s - - - .
van rossum, g. & drake, f. l. python reference manual. scotts valley, ca ( ).
wickham, h. & francois, r. dplyr: a grammar of data manipulation. r packag. version . . . ( ).
weston, s., calaway, r. & tenenbaum, d. doparallel: foreach parallel adaptor for the parallel package. cran ( ).
dowle, m. & srinivasan, a. data.table: extension of ‘data.frame’. r package version . . . manual ( ).

acknowledgements

we thank yurong xin for pointing us to the relevant public data resource. we also thank gabor halasz and yuan zhu for their advice on algorithm design.

author contributions

t.y., y.b., w.f., n.a.-h., m.l.-f., l.e.m. and g.s.a. designed the research. t.y., y.b., and w.f. developed the algorithm. t.y., y.b., w.f. and j.k.
participated in the data analysis. m.s. and r.b. performed the drg tissue collection. c.a. performed the single cell library preparation and sequencing experiment. t.y., y.b., n.a.-h. and g.s.a. wrote the manuscript.

competing interests

t.y., y.b., w.f. and g.s.a. have filed a patent application relating to the adroit computational framework. m.l.-f. is an employee of cellular longevity. all other authors are employees and shareholders of regeneron pharmaceuticals, although the manuscript’s subject matter does not have any relationship to any products or services of this corporation.

figure legends

fig. : schematic representation of adroit computational framework. a, adroit takes as input bulk or spatial rna-seq data, single cell rna-seq data and cell type annotations. it first selects informative genes and estimates their means and dispersions, based on which the cell type specificity of genes is computed. depending on multi-sample availability, cross-sample gene variability is estimated from compound data or from single cell samples (dashed arrow). lastly, the gene-wise scaling factors are estimated using both compound data and single cell data. these computed quantities are fed to a weighted regularized model to infer the transcriptome composition. b, a mock example to illustrate the role of the gene-wise scaling factor. ideally, an accurate estimate of the slope (i.e., cell proportion) would be the slope of the green line; however, direct fitting would result in the red line due to the impact of the outlier genes. outlier genes can be induced by platform differences affecting genes differently.
adroit adopts an adaptive learning approach that first learns a rough estimate of the slope (red line), then moves the outlier genes toward it such that the more deviated genes are moved more toward the true line (i.e., longer arrows). after the adjustment, the newly estimated slope (blue line) is closer to the truth (green line) and is thus a more accurate estimate.

fig. : benchmark on simulated bulk data synthesized from trabecular meshwork (tm) single cell data. a, adroit has the closest estimation to the true cell proportion compared to music and nnls. each dot is a cell type from one donor. b, for each cell type in tm, adroit has the smallest differences from the true cell type proportion and the smallest variance of estimates across the donors. for each cell type, a dot on the graph denotes a donor, and the bars represent the . × interquartile ranges. estimation was done by using the single cell data as reference, leaving out the donor used for synthesizing bulk. c, adroit’s estimates are more accurate and specific than music’s estimates on synthetic bulk that contains partial cell types. the synthetic bulk was simulated by using only out of the cell types per donor, then estimated with the reference of cell types. adroit has notably fewer false positive estimates of the cell types not included, and more accurate estimates of the cell types used for synthesizing bulk. d, the receiver operating characteristic (roc) curve shows adroit has a significantly higher auc than music ( . vs . ), meaning better sensitivity and specificity.

fig. : benchmark on scrna-seq data from dorsal root ganglion (drg), where there exist many closely related subtypes of neuronal cells. a, cell types were identified from scrna-seq
samples of mice, including multiple subtypes of neurofilaments (nf), peptidergic (pep) and non-peptidergic (np) neurons. b, benchmarking with the synthetic data shows adroit’s estimates of cell type proportions are highly accurate. in particular, adroit achieves reasonably high accuracy when the cells are rare (e.g., < %). each dot represents a cell type from one sample. c, for each individual sample, mad, rmsd, pearson and spearman correlations were computed and compared across three methods. adroit has the lowest mad and rmsd, and the highest pearson and spearman correlations. in addition, adroit’s estimation is also the most stable across samples. each dot on the boxplot is a sample. estimation was done by using the single cell reference, leaving out the sample used for synthesizing bulk.

fig. : adroit is more accurate and sensitive than stereoscope on spatial spots simulated from real drg cells. a, adroit and stereoscope estimates on simulated spatial spots that contain pep neuron subtypes. true mixing proportions are denoted by the red dashed lines. three schemes were simulated: ) the proportions of pep cell types are the same and equal to . ; ) pep _dcn is . and the other are . ; ) pep _dcn and pep _s a .tagln are . , pep _slc a .sstr and pep _htr a.sema a . are . , and pep _trpm is . . in all simulation schemes, adroit’s estimates are more consistently centered around the true proportions than stereoscope’s estimates. b, adroit is more accurate in estimating rare cells in spatial spots. the spots were simulated as mixtures of pep cell types (i.e., pep _slc a .sstr , pep _htr a.sema a and pep _trpm ), with a series of low percentages of the pep _trpm cell type from % to % and the other two cell types sharing the remaining proportion equally. adroit’s estimates are systematically closer to the true simulated
the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . proportions than stereoscope’s estimates. c, adroit is consistently more sensitive than stereoscope in detecting low percent cells (estimates > . % deemed as detected) in simulated spots of ) low percent of nf_calb mixed with nf_pvalb and nf _ntrk .necab , ) low percent of np_mrgpra mixed with np_mrgprd and np_nts, ) low percent of pep _trpm mixed with pep _slc a .sstr and pep _htr a.sema a, ) low percent of nf_calb mixed with th, satellite glia and endothelial, ) low percent of np_mrgpra mixed with th, satellite glia and endothelial, and ) low percent of pep_trpm mixed with th, satellite glia and endothelial. fig. : applications to real bulk human islets rna-seq data and mouse brain spatial transcriptome data. a, adroit’s estimates on real human islets bulk rna-seq data were highly reproducible for the repeated samples from same donor. b, adroit estimated cell type proportions agreed with the rna-fish measurements. c, adroit estimated beta cell proportions in type diabetes patients are significantly lower than that in healthy subjects. in addition, the estimated proportions have a significant negative linear association with donors’ hba c level. d, the spatial mapping of mouse brain cell types is consistent with the ish images of marker genes from allen mouse brain atlas respectively. the genes, spink (marker of hippocampal field ca ), c ql (marker of dentate gyrus), clic (marker of choroid plexus), synpo (marker of thalamus) were identified as markers of corresponding tissues by zeisel et al . extended data fig. : benchmark three methods on human pancreatic islets data. a, human islets single cell data contains cell types from subjects including two major cell types alpha (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. 
the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . and beta cells, and two minor cells pp and delta cells . the cell proportion varies across different subjects. b, c, adroit achieves leading accuracy when applied to the bulk data synthesized from the single cell data. each dot on scatterplot is a cell type from one subject. estimation was done by using the single cell reference leaving out the subject used to synthesize bulk. extended data fig. : dorsal root ganglion single cell shows cell types including subtypes of neurofilament, subtypes of non-peptidergic neurons, and subtypes of peptidergic neurons. a, heatmap of top markers shows distinction between cell types as well as similarity between subtypes. b, the proportion of each cell type varies from . % to . % across different samples. extended data fig. : comparing the performance on estimated simulated spatial spots of pure cell type respectively. a, estimates by adroit and b, estimates by stereoscope are comparably accurate. simulations were done by sampling cells from the same cell type and adding up the read counts per gene. for each of the cell types of the drg tissue, we repeated the simulation times. the results shown were a summary of simulations for each cell type. for both methods, the median estimates of the sampled cell type were close to (red lines), whereas the cell type not sampled has zero or close-to-zero values. extended data fig. : the comparison of adroit and stereoscope on the simulated spots of additional cell mixing schemes. more types of mixed spatial spots were simulated: ) mixture (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . 
of neurofilaments (NF); 2) mixture of non-peptidergic (NP) cell types; 3) nf _ntrk .necab mixing with th, satellite glia and endothelial; 4) np_nts mixing with th, satellite glia and endothelial; and 5) pep _trpm mixing with th, satellite glia and endothelial. Each simulation was repeated times. Across all simulation schemes, AdRoit's estimates were consistently closer to the true simulated proportions (red lines), whereas Stereoscope's estimates deviated largely from the true proportions.

Extended Data Fig. : Spatial mapping of cell types with AdRoit quantitatively depicts the content in each spot. Spatial transcriptomics data were downloaded from 10x Genomics (https://support. xgenomics.com/spatial-gene- expression/datasets/ . . /v _adult_mouse_brain_coronal_section). The reference single cells were sampled from Zeisel et al and curated into cell types.
Learning association for single-cell transcriptomics by integrating profiling of gene expression and alternative polyadenylation

Guoli Ji, Wujing Xuan, Yibo Zhuang, Lishan Ye, Sheng Zhu, Wenbin Ye, Xi Wang and Xiaohui Wu*

Department of Automation, Xiamen University, Xiamen, China
Xiamen YLZ Yihui Technology Co., Ltd, Xiamen, Fujian, China
Xiamen Health and Medical Big Data Center, Xiamen, Fujian, China
National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China

Keywords: cell type clustering; alternative polyadenylation; single-cell RNA-seq; integrative analysis; software

Guoli Ji is a professor with the Department of Automation at Xiamen University. His research interests include bioinformatics, advanced control, data mining and information systems.
Wujing Xuan is a graduate student with the Department of Automation at Xiamen University. His research interests are bioinformatics and data mining.
Yibo Zhuang is an employee of Xiamen YLZ Yihui Technology Company. His research interests are software design, cloud computing and big data.
Lishan Ye is the director of Xiamen Health and Medical Big Data Center. Her research interests are cloud computing and healthcare big data.
Sheng Zhu is a Ph.D. candidate with the Department of Automation at Xiamen University. His research interests are bioinformatics and healthcare big data.
Wenbin Ye is a Ph.D. candidate with the Department of Automation at Xiamen University. Her research interests are bioinformatics and mRNA processing.
Xi Wang is a graduate student with the Department of Automation at Xiamen University. Her research interests are bioinformatics and data mining.
Xiaohui Wu is an associate professor with the Department of Automation at Xiamen University. Her research interests are mRNA processing, bioinformatics, and data mining.

* Corresponding author. E-mail: xhuister@xmu.edu.cn, Tel: +

Abstract

Single-cell RNA-sequencing (scRNA-seq) has enabled transcriptome-wide profiling of gene expression in individual cells. A myriad of computational methods have been proposed to learn cell-cell similarities and/or cluster cells; however, the high variability and dropout rate inherent in scRNA-seq confound reliable quantification of cell-cell associations based on the gene expression profile alone. Recently, bioinformatics studies have emerged that capture key transcriptome information on alternative polyadenylation (APA) from standard scRNA-seq and have revealed APA dynamics among cell types, suggesting the possibility of discerning cell identities with the APA profile. Complementary information at both layers of APA isoforms and genes creates great potential to develop cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scRNA-seq data without changing experimental technologies. We propose a toolkit called scLAPA for learning association for single-cell transcriptomics by combining single-cell profiling of gene expression and alternative polyadenylation derived from the same scRNA-seq data. We compared scLAPA with seven similarity metrics and five clustering methods using diverse scRNA-seq datasets. Comparative results showed that scLAPA is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. Moreover, with scLAPA we found two hidden subpopulations of peripheral blood mononuclear cells that were undetectable using the gene expression data alone.
As a comprehensive toolkit, scLAPA provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmenting gene expression profiles with polyadenylation information, and it can be incorporated into most existing scRNA-seq pipelines. scLAPA is available at https://github.com/bmilab/sclapa.

Introduction

Single-cell RNA-sequencing (scRNA-seq) has enabled transcriptome-wide profiling of gene expression in individual cells, which has great potential to reveal the cellular composition of tissues, transcriptional heterogeneity among cells and the structure of cell types [ ]. Cell-type identification is a critical step in most scRNA-seq data analyses, and a myriad of computational methods have emerged to detect novel cell types, previously unappreciated sub-types of cells and rare cells [ ]. Fundamentally, these clustering methods rely on cell-cell associations (or similarities) for categorizing individual cells into different clusters [ ]. A wide range of computational tools have been proposed to cluster cells, which implicitly or explicitly rely on a similarity concept [ ]. SIMLR (single-cell interpretation via multikernel learning) adapts k-means by simultaneously training a similarity measure based on multiple kernel learning [ ]. RaceID extends k-means with outlier detection to discover rare cell types [ ]. SC3 (single-cell consensus clustering) uses a consensus approach to combine multiple clustering solutions [ ]. PhenoGraph combines shared nearest-neighbour graphs and Louvain community detection to rapidly identify cell clusters [ ].
Despite considerable progress, there is no strong consensus on which clustering approach is best for defining cell types in all situations [ , , ]. In particular, the high variability and dropout rate inherent in scRNA-seq confound the reliable quantification of lowly and/or moderately expressed genes [ , ], resulting in an extremely sparse gene-cell count matrix. Consequently, there may be little overlap of observed genes among cells, hindering reliable quantification of cell-cell similarities based on the gene expression profile alone.

Recently, multi-omics methods that leverage additional aspects of the cell, such as the DNA methylome, open chromatin or proteome, have begun to appear [ ]. Seurat v3 [ ] harmonizes scRNA-seq and scATAC-seq data from a similar tissue to identify subpopulations of cells that are indistinguishable using the scATAC-seq data alone. LIGER [ ], a method based on integrative non-negative matrix factorization (iNMF), was proposed to classify cortical cells profiled by single-cell bisulfite sequencing by integrating scRNA-seq data. Additional modalities of individual cells provide valuable information about the phenotype and genetic cellular state not manifested by the transcriptome. However, not all scRNA-seq data are accompanied by data from other modalities. Even though multimodal omics data are gradually becoming available, integrative multimodal analysis is still in its infancy [ ]. It remains a challenge to reconcile the heterogeneity across modalities, as different modalities are normally profiled from cells sampled from the same tissue rather than from the same cells.
Although most scRNA-seq studies focus on gene expression profiling, key information on transcript isoforms, e.g., alternative splicing (AS) and/or alternative polyadenylation (APA), can be obtained, enabling multiple aspects of transcriptome information to be derived from standard scRNA-seq without changing experimental technologies [ - ]. Recently, several computational methods, such as scAPAtrap [ ], Sierra [ ] and scAPA [ ], have been proposed to identify APA sites in single cells from diverse 3′ tag-based scRNA-seq protocols, e.g., Drop-seq [ ], CEL-seq [ ] and 10x Genomics [ ]. Cell-to-cell heterogeneity in APA site usage has also been observed [ - ]. In particular, a previous study [ ] revealed that the APA profile, even that from non-differentially expressed genes, can distinguish mouse cells in different stages of sperm cell differentiation, suggesting the possibility of discerning cell identities with APA usage independently of gene expression. Recent efforts have pioneered methods to identify APA sites or explore APA dynamics across different cell types [ - , - ]; however, most studies profiled APA among cells with predefined cell type labels rather than discerning cell types in an unsupervised manner. Complementary information at both layers of APA isoforms and genes can be obtained from the same cells [ - ], which creates great potential to develop more sophisticated and cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scRNA-seq experiments.

Here we propose a toolkit called scLAPA for learning association for single-cell transcriptomics by combining single-cell profiling of gene expression and alternative polyadenylation.
scLAPA leverages the resolution and huge abundance of scRNA-seq data, boosting the gene-level analysis with an additional layer of APA information derived directly from the same scRNA-seq data. By employing the strategy of similarity network fusion, scLAPA effectively learns highly informative cell-cell associations from expression profiles of both genes and APA isoforms. We compared scLAPA with seven similarity metrics and five clustering methods, using diverse scRNA-seq data from different experimental technologies and species. Comparative results showed that scLAPA is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. Moreover, with scLAPA we found two hidden subpopulations of cells in peripheral blood mononuclear cells (PBMCs) that were undetectable using the gene expression data alone. As a comprehensive toolkit, scLAPA provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmenting gene expression profiles with polyadenylation information, and it can be incorporated into many other standard scRNA-seq pipelines for single-cell analyses.

Materials and methods

scRNA-seq datasets

We used five publicly available scRNA-seq datasets from animals and plants generated by 3′ tag-based scRNA-seq protocols (Table S ), spanning a wide spectrum of tissues, cell types and species. Raw data except for the PBMC data were downloaded from NCBI GEO (Gene Expression Omnibus). Cell types and cell labels of the amygdala, mammary and root data were obtained from the corresponding studies; cell labels of the hypothalamus data were obtained from PanglaoDB [ ].
The PBMC k dataset was downloaded from the 10x Genomics website (https://www. xgenomics.com/). For cell type annotation of PBMCs, we followed the tutorial of Seurat v3 [ ] to cluster cells on the basis of the gene-cell expression matrix. Specifically, cells with total read counts fewer than were discarded. The LogNormalize method was adopted for normalization. The top highly variable features were selected by the vst method. PCA (principal component analysis) was used for dimensionality reduction and the top principal components were retained. Finally, cells were clustered by Seurat's FindClusters with the argument 'resolution= . '. For cell type annotation of cell clusters, known marker genes of PBMCs were compiled from relevant studies (Table S ). Differentially expressed (DE) genes for each cell group were calculated with Seurat's FindAllMarkers. We also calculated, for each cell cluster, the number of cells in which a DE gene is expressed and the mean expression level of a DE gene. The cell type was assigned to a cell cluster according to the presence and expression level of marker gene(s).

Overview of scLAPA

scLAPA mainly consists of four modules (Figure S ): (i) the input module, (ii) cell-cell distance, (iii) distance fusion, (iv) cell type clustering. The input module prepares the input for scLAPA, including a poly(A) site expression matrix (hereinafter referred to as PA-matrix) and a gene expression matrix (hereinafter referred to as GE-matrix). The PA-matrix is generated from raw scRNA-seq data with scAPAtrap [ ]; it stores expression levels of poly(A) sites, with each row denoting a poly(A) site and each column denoting a cell.
The GE-matrix can be obtained from websites like NCBI GEO and 10x Genomics, or generated by various routine scRNA-seq analysis tools like Cell Ranger. In the cell-cell distance module, a cell-cell distance matrix is learned for the PA-matrix (called PA-dist) and the GE-matrix (called GE-dist), respectively. The distance fusion module employs similarity network fusion (SNF) [ ] to integrate the two distance matrices (PA-dist and GE-dist) into one cell-cell distance matrix. The cell type clustering module clusters cells based on the fused distance matrix with various clustering methods. scLAPA was implemented as an open source R package and is available at https://github.com/bmilab/sclapa. Scripts and data used in this study are also available at the GitHub website.

Identification of poly(A) sites from scRNA-seq

We followed the tutorial provided at the scAPAtrap website (https://github.com/bmilab/scapatrap) to identify poly(A) sites with scAPAtrap [ ]. It should be noted that alternative tools, such as Sierra [ ] and scAPA [ ], can also be used. Briefly, raw FASTQ reads were mapped with Cell Ranger . . (https://www. xgenomics.com/) and then uniquely mapped reads were obtained with SAMtools (http://samtools.sourceforge.net/). Then UMI-tools [ ] was employed to remove polymerase chain reaction (PCR) duplicates and extract unique molecular identifiers (UMIs). The findTails function in the scAPAtrap package was used to determine exact locations of poly(A) sites from reads with A/T stretches, and the findPeaks function was adopted to identify all potential peaks of poly(A) sites at the whole-genome level.
Finally, consensus poly(A) sites supported by both the peak and the tail evidence were used. The featureCounts function in the Subread toolkit [ ] was adopted to quantify the expression level of each poly(A) site. Poly(A) site annotation was performed with the movAPA package [ ], using the latest genome annotation of the respective species: TAIR for Arabidopsis, mm for mouse and GRCh for human. Briefly, poly(A) sites identified by scAPAtrap were annotated with rich information, such as genomic regions (i.e., 3′ UTR, 5′ UTR, coding sequence (CDS), intron, exon and intergenic) and gene ID. Similar to previous studies [ - ], annotated 3′ UTRs were extended by a length of bp to recruit intergenic sites that may originate from authentic 3′ UTRs.

Calculation of cell-cell distance

scLAPA learns a cell-cell distance matrix for the PA-matrix and the GE-matrix, respectively. Various distance metrics can be chosen, including Euclidean distance, Pearson correlation, two metrics of proportionality ($\rho_p$ and $\phi_s$) [ ], RAFSIL (random forest based similarity learning) [ ] and SIMLR [ ]. Euclidean distance and Pearson correlation are widely used in both single-cell and bulk transcriptomics. The two measures of proportionality were found to have strong performance according to a comprehensive benchmarking analysis of a large single-cell transcriptome compendium [ ]. RAFSIL is a random forest based approach that learns cell-cell similarities from scRNA-seq data and has two variants, RAFSIL1 and RAFSIL2. SIMLR learns a distance metric that fits the structure of the scRNA-seq data by combining multiple kernels corresponding to different informative representations of the data.
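As a minimal sketch of the two traditional metrics listed above, the following Python code computes a cell-cell Euclidean distance matrix and a Pearson-based distance from a features × cells matrix. The function names and the toy matrix are hypothetical, and the conversion of correlation to a distance as 1 − r is an assumption made for illustration; the text does not state how correlation is turned into a distance, and the toolkit itself is an R package.

```python
import numpy as np

def euclidean_dist(mat):
    """Pairwise Euclidean distance between cells (columns of mat)."""
    cells = np.asarray(mat, dtype=float).T          # cells x features
    diff = cells[:, None, :] - cells[None, :, :]    # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=2))

def pearson_dist(mat):
    """Correlation-based distance between cells, assumed here to be 1 - r."""
    # After transposing, each row passed to corrcoef is one cell.
    r = np.corrcoef(np.asarray(mat, dtype=float).T)
    return 1.0 - r

# Hypothetical toy matrix: 4 features (rows) x 3 cells (columns).
toy = np.array([[1, 2, 3],
                [2, 4, 7],
                [0, 1, 5],
                [3, 3, 3]])
ge_dist_euclid = euclidean_dist(toy)
ge_dist_pearson = pearson_dist(toy)
```

The same two functions could be applied to a PA-matrix and a GE-matrix in turn, giving one distance matrix per layer with cells in the same order.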
Euclidean distance and Pearson correlation were calculated by the dist and cor functions in the R package stats, respectively; the SIMLR metric was calculated by the SIMLR R package with the argument 'cores.ratio= '; the RAFSIL metric was calculated by the RAFSIL R package with the arguments 'nrep= , gene_filter=FALSE'; $\rho_p$ and $\phi_s$ were calculated by the perb and phis functions in the R package propr, respectively. For each distance metric, cell-cell distance matrices, PA-dist and GE-dist, can be learned for the PA-matrix and the GE-matrix, respectively. PA-dist represents the cell-cell similarity network learned from the APA isoform layer, whereas GE-dist reflects the network learned from the gene layer; each encapsulates complementary information about cell-cell associations that is absent in the other layer.

Distance fusion

After learning PA-dist and GE-dist, similarity network fusion (SNF) [ ] is used to flexibly integrate the two layers of cell-cell similarities into one similarity matrix. First, PA-dist and GE-dist are iteratively and gradually fused into a consensus network, using a non-linear method based on message-passing theory [ ]. Then weak similarities representing potential noise are discarded, and strong similarities are retained. By generating coherent cell-cell similarities from both the APA isoform and gene layers, SNF profiles a more comprehensive biological relationship among cells, beyond the scope of methods based solely on the GE-matrix.

Given a PA-matrix storing expression levels of $m$ poly(A) sites in $n$ cells or a GE-matrix recording expression levels of $m$ genes in $n$ cells, the corresponding cell-cell distance matrix (PA-dist or GE-dist) can be obtained using a selected distance metric.
The distance matrix can also be denoted as a graph $G = \langle V, E, W \rangle$, with vertices $V = \{c_1, \dots, c_n\}$ corresponding to cells, edges $E$ representing cell-cell links and edge weights $W_{[n \times n]}$ denoting the kernel representation of cell-cell similarities. The weight of an edge linking cells $c_i$ and $c_j$ is determined using a scaled exponential similarity kernel:

$$W_{ij} = \exp\left(-\frac{d_{ij}^{2}}{\mu \beta_{ij}}\right)$$

Here $d_{ij}$ represents the distance between cells $c_i$ and $c_j$ measured by a distance metric (e.g. Pearson correlation). $\mu$ is an empirical hyperparameter with a recommended value in a sizable range of [ . , . ] [ ]. $\beta_{ij}$ is a scaling factor defined as follows:

$$\beta_{ij} = \frac{d(c_i, N_i) + d(c_j, N_j) + d_{ij}}{3}$$

where $N_i$ are the neighboring cells of $c_i$ and $d(c_i, N_i)$ is the average distance of $c_i$ to its neighbors.

To obtain a fused network from PA-dist and GE-dist, a full and a sparse kernel on the vertex set $V$ are derived from the weight matrix $W$. The full kernel is a normalized weight matrix $\widehat{W}_{[n \times n]}$ which stores the full information of cell-cell similarities. The normalized weight between $c_i$ and $c_j$ is defined as:

$$\widehat{W}_{ij} = \begin{cases} \dfrac{W_{ij}}{2 \sum_{k \neq i} W_{ik}}, & \text{when } i \neq j \\ \dfrac{1}{2}, & \text{when } i = j \end{cases}$$

Another matrix $A_{[n \times n]}$ encodes the local affinity that measures similarities of a cell to its $K$ most similar cells:

$$A_{ij} = \begin{cases} \dfrac{W_{ij}}{\sum_{k \neq i} W_{ik}}, & \text{when } j \in N_i \\ 0, & \text{otherwise} \end{cases}$$

Here $N_i$ is the set of cell $c_i$ and its neighbors in the graph $G$. The network fusion initiates from $\widehat{W}$, using $A$ as the kernel matrix to capture the local structure of the graph.

To fuse the two distance matrices (PA-dist and GE-dist), first $W^{PA}$ and $W^{GE}$ were computed, respectively. Then the corresponding initial state matrices $\widehat{W}^{PA}$ and $\widehat{W}^{GE}$ were derived from the two similarity matrices, and the kernel matrices $A^{PA}$ and $A^{GE}$ were also computed.
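The three matrices defined above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's implementation: it assumes the squared distance in the exponent, the factor of three in the scaling term and the factor of two in the full kernel, as in the original SNF formulation; the function names, the defaults `k=3` and `mu=0.5`, and the toy distance matrix are hypothetical.

```python
import numpy as np

def scaled_exp_kernel(dist, k=3, mu=0.5):
    """W_ij = exp(-d_ij^2 / (mu * beta_ij)) with a locally scaled beta_ij."""
    d = np.asarray(dist, dtype=float)
    # Average distance of each cell to its k nearest other cells.
    others = d + np.diag(np.full(d.shape[0], np.inf))   # mask self-distances
    mean_nn = np.sort(others, axis=1)[:, :k].mean(axis=1)
    beta = (mean_nn[:, None] + mean_nn[None, :] + d) / 3.0
    return np.exp(-(d ** 2) / (mu * beta))

def full_kernel(w):
    """Normalised weights: off-diagonal entries of a row sum to 1/2, diagonal is 1/2."""
    off = w.copy()
    np.fill_diagonal(off, 0.0)
    w_hat = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(w_hat, 0.5)
    return w_hat

def local_kernel(w, k=3):
    """Keep each cell's k most similar cells, scaled by the row sum over k != i."""
    off = w.copy()
    np.fill_diagonal(off, 0.0)
    denom = off.sum(axis=1, keepdims=True)
    # The cell itself has weight 1, the maximum, so it is among the k kept.
    keep = np.argsort(-w, axis=1)[:, :k]
    a = np.zeros_like(w)
    rows = np.arange(w.shape[0])[:, None]
    a[rows, keep] = w[rows, keep] / denom
    return a

# Hypothetical toy distances: five cells placed on a line.
pos = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
toy_dist = np.abs(pos[:, None] - pos[None, :])
W = scaled_exp_kernel(toy_dist, k=2)
W_hat = full_kernel(W)
A = local_kernel(W, k=2)
```

In this sketch every row of `W_hat` sums to one, while `A` is sparse and retains only the strongest links of each cell, which is what lets it act as a local structure filter during fusion.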
Given the initial two status matrices at $t = 0$, $\widehat{W}^{PA}_{t=0}$ and $\widehat{W}^{GE}_{t=0}$, the fusion process iteratively updates the respective similarity matrix:

$$\widehat{W}^{PA}_{t+1} = A^{PA} \times \widehat{W}^{PA}_{t} \times (A^{PA})^{T}$$

$$\widehat{W}^{GE}_{t+1} = A^{GE} \times \widehat{W}^{GE}_{t} \times (A^{GE})^{T}$$

Then after $t$ iterations, the final status matrix is obtained:

$$\widehat{W} = \frac{\widehat{W}^{PA}_{t} + \widehat{W}^{GE}_{t}}{2}$$

$\widehat{W}$ is the fused cell-cell distance network incorporating cells' APA isoform and gene expression profiles. The corresponding cell-cell similarity matrix is $1 - \widehat{W}$. The distance or similarity matrix can be used for downstream cell type clustering.

Single cell clustering

Four widely used clustering methods are provided in scLAPA to cluster cells on the basis of the fused cell-cell similarity matrix: Louvain clustering [ ], hierarchical clustering (HC) [ ], spectral clustering (SC) [ ] and k-means. Louvain clustering was implemented by the cluster_louvain function in the R package igraph, with the arguments 'mode=undirected, weighted=TRUE, diag=TRUE'. Spectral clustering was implemented by the spectralClustering function in the R package SNFtool with default settings [ ]. Hierarchical clustering [ ] was performed by the flashClust function in the R package flashClust with default settings [ ]. K-means clustering was implemented by the kmeans function of the R package stats with the arguments 'iter.max= e+ , nstart= '.

Performance evaluation

We distinguished two scenarios, similarity learning and clustering, to evaluate our approach. For each scenario, we applied scLAPA to four scRNA-seq datasets with pre-annotated cell labels, and compared the results with those of competing approaches.
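The iterative fusion step from the distance fusion section can be sketched as below. The function name, the default of 20 iterations and the random toy inputs are hypothetical. With `cross=False` each status matrix is propagated through its own kernel, following the update as it is written in the text; `cross=True` instead swaps the status matrices between the two layers at every step, which is the cross-diffusion form of the original SNF formulation. Which of the two the package applies is not confirmed here, so treat the flag as an assumption.

```python
import numpy as np

def fuse(w_pa, w_ge, a_pa, a_ge, iters=20, cross=False):
    """Iterate the two status matrices and average them into one fused matrix."""
    for _ in range(iters):
        # Choose which status matrix each kernel diffuses in this step.
        src_pa, src_ge = (w_ge, w_pa) if cross else (w_pa, w_ge)
        w_pa, w_ge = a_pa @ src_pa @ a_pa.T, a_ge @ src_ge @ a_ge.T
    return (w_pa + w_ge) / 2.0

def random_symmetric(rng, n):
    """Hypothetical stand-in for an initial status matrix."""
    m = rng.random((n, n))
    return (m + m.T) / 2.0

def random_row_stochastic(rng, n):
    """Hypothetical stand-in for a local kernel matrix."""
    m = rng.random((n, n))
    return m / m.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_cells = 4
w_pa0, w_ge0 = random_symmetric(rng, n_cells), random_symmetric(rng, n_cells)
a_pa, a_ge = random_row_stochastic(rng, n_cells), random_row_stochastic(rng, n_cells)
fused = fuse(w_pa0, w_ge0, a_pa, a_ge, iters=5)
```

The sketch omits any per-iteration renormalisation or symmetrisation, which a production implementation may need for numerical stability over many iterations.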
For the scenario of similarity learning, we compared scLAPA with seven similarity measures, including three measures designed for scRNA-seq (RAFSIL1/2 and SIMLR), two measures of proportionality ($\rho_p$ and $\phi_s$) and two traditional similarity measures (Euclidean distance and Pearson correlation). Each of these measures was applied to a given GE-matrix to learn a cell-cell similarity matrix. For scLAPA, we applied each measure to learn two cell-cell similarity matrices from the PA-matrix and the GE-matrix and fused them into one matrix. We also applied different clustering methods, including Louvain, HC, SC and k-means, to the similarity matrix learned from each similarity measure to assess the similarity measures in the context of clustering.

For the scenario of clustering, we compared scLAPA with five state-of-the-art clustering methods for scRNA-seq data, including SC3 [ ], Seurat v3 [ ], SINCERA [ ], SNN-Cliq [ ] and the dynamic tree cut method (dynamicTreeCut) [ ]. None of these approaches provides an explicit similarity learning procedure; instead, they provide cell labels by unsupervised learning on the GE-matrix. Each approach was applied to a given GE-matrix for cell clustering and class labels of cells were obtained. For scLAPA, we applied each of the four methods (Louvain, HC, SC and k-means) to the fused similarity matrix to obtain clustering results.

Two internal validation metrics, the Dunn index [ ] and connectivity [ ], were employed for the first scenario to quantitatively assess the goodness of a clustering structure without relying on any clustering method or on external information about class labels.
the dunn index [ ] evaluates non-linear combinations of the between-group separation and the within-group compactness. connectivity reflects the extent to which observations are placed in the same group as their neighbors in the data space. the original value of connectivity ranges from zero to infinity, with a smaller value denoting better performance. here we used a transform, /log (connectivity + ), to make connectivity consistent with dunn: the larger the connectivity or dunn score, the better the separation. the r package clvalid [ ] was used to calculate the connectivity and the dunn index.

additionally, we used three popular metrics to evaluate the performance of sclapa in the context of clustering: the adjusted rand index (ari), the jaccard index and normalized mutual information (nmi). the value of ari ranges from - to , and the values of nmi and jaccard range from to , with a higher value indicating better performance. ari is a widely used metric for measuring the concordance between two clustering results. the jaccard index quantifies the similarity between two datasets. nmi is a variation of mutual information for evaluating clustering results, which corrects the bias of the consistency caused by chance. ari and jaccard were calculated using the adjustedrand function in the r package clues [ ]; nmi was obtained with the compare function in the r package igraph (https://igraph.org/r/).

bioinformatics analyses

umap [ ] was adopted to visualize the distributions of single cells; it employs a non-linear dimensional reduction technique to group similar cells in low-dimensional space.
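the two pair-counting metrics described above, ari and the jaccard index, can be sketched as follows. this is an illustration in plain python rather than the adjustedrand function of the clues package; the function names are hypothetical, and the jaccard index is taken here in its pair-counting form (pairs of cells grouped together in both partitions, divided by pairs grouped together in at least one).

```python
from collections import Counter
from math import comb


def pair_counts(truth, pred):
    """Count cell pairs grouped together in both, in truth, and in pred."""
    both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred).values())
    return both, in_truth, in_pred, comb(len(truth), 2)


def adjusted_rand_index(truth, pred):
    """Rand index corrected for the agreement expected by chance."""
    both, in_truth, in_pred, total = pair_counts(truth, pred)
    expected = in_truth * in_pred / total
    max_index = (in_truth + in_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put all cells in one group.
        return 1.0
    return (both - expected) / (max_index - expected)


def jaccard_index(truth, pred):
    """Pairs together in both partitions over pairs together in either."""
    both, in_truth, in_pred, _ = pair_counts(truth, pred)
    union = in_truth + in_pred - both
    return both / union if union else 1.0


# Toy example: identical grouping under different label names,
# and a grouping that splits every true group.
true_labels = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]
```

both metrics depend only on which cells are grouped together, so renaming the clusters does not change the score.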
umap was implemented with the calculateumap function in the scater r package [ ]. for the analysis of the arabidopsis root data, deseq [ ] was adopted to identify de genes and de poly(a) sites. first, the ge-matrix and the pa-matrix were normalized by the median ratio method provided in deseq . then the deseq function was applied for de detection. genes or poly(a) sites with log fold change >= . and adjusted p-value <= . were considered de.

results

single-cell polyadenylation profile distinguishes cells

recently, scrna-seq has emerged as a unique tool to explore cell-specific gene or isoform expression in plants [ - ]. a previous study [ ] used root-hair and nonhair cell types as models and revealed the potential of using scrna-seq data for inferring specific cells during cell-type differentiation. here we focused on the epidermal tissue and analyzed differential expression at both the gene and apa levels between root-hair and nonhair cells. a total of root-hair cells and nonhair cells were defined by the previous study [ ]. although both the ge-matrix and the pa-matrix were obtained from the same scrna-seq data, we still found four genes exclusively present in the pa-matrix (figure a). for example, at g , a wrky transcription factor gene, was absent from the single-cell ge-matrix, whereas it has one poly(a) site (coord: ) with a much higher expression level in nonhair than in hair cells according to the pa-matrix. interestingly, this poly(a) site is an annotated poly(a) site in the extended ′ utr, which was supported by bulk ′-seq data according to plantapadb [ ].
similarly, at g , a hypothetical protein-coding gene, is missing from the ge-matrix, whereas its one poly(a) site (coord: ) is expressed at a much higher level in nonhair cells than in hair cells. this poly(a) site was also annotated as a ′ utr site in plantapadb. moreover, genes possess at least one differentially used poly(a) site, among which genes were not de genes (table s ). for example, at g encodes a dnaj heat shock family protein expressed in root. although both at g and its one poly(a) site are expressed at a higher level in root hair cells than in nonhair cells, the difference between the two cell types characterized by the poly(a) profile is much more pronounced than that characterized by the gene profile (figure b). further, using only the ge-matrix, a subset of cells is indistinguishable between the hair and nonhair cell types (figure c). in contrast, cells from the two cell types were clearly separated on the basis of the pa-matrix, and two potential subpopulations of nonhair cells were observed (figure c). therefore, we anticipate that the poly(a) site expression profile may encode complementary information that is absent or insignificant in the gene expression profile, which could be useful for distinguishing cell types. there is great potential to develop integrative approaches for discerning cell identities that properly incorporate single-cell profiling of both gene expression and polyadenylation information.

learning cell-cell similarities with sclapa

we proposed the sclapa toolkit, which learns cell-cell similarities by taking advantage of the complementarities between the two layers of apa isoforms and genes.
here we compared the performance of the similarity metric learned by sclapa with that of seven other similarity metrics by analyzing four scrna-seq datasets. two metrics, dunn and connectivity, were adopted to quantitatively measure cell separation independently of clustering methods. generally, sclapa provides higher or comparable performance relative to the other metrics across all four datasets, whereas pearson correlation or euclidean distance has consistently lower performance (figures a and s ). in terms of both dunn and connectivity, sclapa and simlr perform significantly better than three other metrics. particularly, simlr outperforms sclapa on the hypothalamus data, whereas sclapa outperforms simlr on the mammary data. overall, sclapa performs better than at least six of the seven metrics in all four datasets and is never the worst in any case. according to the dunn index (figure a), even for datasets where sclapa is not the best, it is always a close match to the best. for example, the dunn score of sclapa on the hypothalamus data is . , which is very close to the best score ( . from simlr). next, we used a radar chart to compare the performance of these similarity metrics more intuitively. sclapa and simlr stand out as universally better than the others, and discrepancies in the performance of the other six metrics across datasets were observed (figure b). for example, the overall similarity based on the rafsil / metric is much higher on the mammary and hypothalamus data than on the other two datasets, revealing the instability of the performance of rafsil across different datasets.
for all four datasets, both euclidean distance and pearson correlation emerge as the worst similarity metrics. in contrast, sclapa provides a more robust result regardless of the dataset.

sclapa is integrative and flexible in that different distance metrics can be chosen to learn cell-cell similarities for distance fusion. we next examined the effect of using different distance metrics in sclapa. the performance of sclapa according to the dunn index is highly robust across all datasets regardless of the distance metric used (figure c). it is widely accepted that determining an optimal distance metric for profiling true cell-cell relationships from complex and heterogeneous scrna-seq data is highly challenging [ ]. however, the integrative framework of sclapa provides an effective solution through distance fusion: it assembles results from multiple data layers into one ensemble result, which can mitigate the limitations of individual similarity metrics or data layers and facilitate generalization and adaptation to different scrna-seq datasets. take the hypothalamus data as an example. the matrix with block structures obtained from sclapa showed higher consistency with the true labels than those obtained from the other similarity metrics (figure s ). block structures learned by simlr are indistinguishable from background signatures; block structures learned by pearson correlation, euclidean distance or the two measures of proportionality are also mixed with background signatures; block structures learned by rafsil are generally consistent with the true structures, except that cell types with a small number of cells are less distinguishable. overall, sclapa provides more divergent clusters with higher distinction, and individual clusters obtained by sclapa are more compact than those obtained by other similarity metrics. these results demonstrate the ability and robustness of sclapa in improving cell separation across numerous scrna-seq datasets.
cell type clustering with sclapa

cell-cell similarities learned by different similarity metrics can be adapted to other clustering methods that take similarities as inputs. here we performed extensive comparisons of sclapa with the seven other similarity metrics by applying different clustering methods for cell clustering. first, we applied louvain [ ], a graph-based method for community detection, to the different similarity metrics for clustering. according to the ari score, similarities learned by sclapa and simlr significantly outperform similarities obtained from euclidean distance, pearson correlation or rafsil / (figure a). overall, simlr shows performance similar to that of sclapa, whereas sclapa outperforms simlr in three of the four datasets. particularly, euclidean distance and pearson correlation present the worst performance in two datasets, mammary and root. similar results were obtained in terms of the other two indexes, nmi and jaccard (figure s ). in addition to louvain clustering, we also investigated three other popular clustering methods, hierarchical clustering [ ], spectral clustering [ ] and k-means [ ], to evaluate the robustness of the results when different clustering methods are applied to the same similarity metric (figures s - ). particularly, the performance of sclapa and rafsil / is robust regardless of the clustering method used, whereas sclapa consistently outperforms rafsil. in contrast, simlr, euclidean distance and pearson correlation are very sensitive to the clustering method applied (figure b). surprisingly, although simlr achieves performance comparable to that of sclapa with louvain clustering (figure a), its performance is the worst with k-means or spectral clustering (figure b).
taking the mammary data as an example, the ari score of simlr drops from . when using louvain clustering to an extremely low median value of . when using k-means. moreover, we noted that ari scores from individual runs of k-means clustering on simlr similarities varied greatly, revealing the relatively poor robustness of simlr with k-means clustering (figure s ). these results demonstrate that the cell-cell similarity matrix learned by sclapa is more effective and robust than competing similarity metrics in clustering cell subpopulations.

during the preparation of this manuscript, we noticed another method, scdapars [ ], which quantifies and recovers apa events in single cells using standard scrna-seq data. the authors also integrated apa information identified by scdapars with imputed gene expression by similarity network fusion to reveal novel cell subpopulations during human embryonic development. unlike scdapars, which employs the (imputed) percentage of distal poly(a) site usage index (pdui) to measure apa usage, sclapa directly uses the raw poly(a) expression profile. here we compared the performance of sclapa and scdapars by applying them to the four scrna-seq datasets in our benchmarking analysis. following the process in gao et al. [ ], we calculated pdui based on the pa-matrix and imputed apa profiles using scdapars. then we applied five similarity metrics to the scdapars-imputed apa profile and the ge-matrix to generate scdapars-dist and ge-dist, respectively. after fusing the two distance matrices with snf, we applied louvain clustering to the fused cell-cell similarities to cluster cells.
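to make the pdui input concrete, the sketch below derives a per-cell pdui profile from poly(a) site counts. it assumes the common definition of pdui as the fraction of a gene's reads assigned to the distal site; the function names and the input layout are hypothetical, the choice of the first and last site for genes with more than two sites is a simplification of this sketch, and the quantification and imputation performed by scdapars are not reproduced.

```python
def pdui_per_cell(proximal_counts, distal_counts):
    """PDUI of one gene in each cell; None where the gene has no reads."""
    profile = []
    for prox, dist in zip(proximal_counts, distal_counts):
        total = prox + dist
        profile.append(dist / total if total > 0 else None)
    return profile


def pdui_matrix(pa_sites):
    """Build a gene -> per-cell PDUI mapping from poly(A) site counts.

    pa_sites maps a gene name to a list of per-site count vectors ordered
    from proximal to distal (hypothetical layout). Genes with a single
    poly(A) site are skipped because the ratio is undefined for them.
    Assumption of this sketch: only the most proximal and most distal
    sites are compared when a gene has more than two sites.
    """
    result = {}
    for gene, sites in pa_sites.items():
        if len(sites) < 2:
            continue
        result[gene] = pdui_per_cell(sites[0], sites[-1])
    return result


# Toy example: gene_a has two sites in three cells, gene_b has one site.
toy_sites = {
    "gene_a": [[1, 0, 0], [3, 0, 2]],
    "gene_b": [[5, 1, 0]],
}
toy_pdui = pdui_matrix(toy_sites)
```

genes with a single poly(a) site drop out of the resulting profile, and cells without reads for a gene have no defined value, so the pdui profile is sparser than the site-level counts it is derived from.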
according to the ari score (figure ), sclapa significantly outperforms scdapars on all four datasets. particularly, the ari scores of scdapars varied greatly with different similarity metrics, whereas the performance of sclapa is robust to the choice of similarity metric (figure vs. figure c), revealing that the poly(a) expression profile used in sclapa is more efficient and robust than the pdui profile used in scdapars for clustering cells.

next, we expanded the benchmarking analysis by comparing the clustering results of sclapa with those of other single-cell clustering methods that directly take the gene-cell expression matrix as input without an explicit similarity learning procedure. specifically, we included five popular tools for comparison: sc [ ], seurat v [ ], sincera [ ], snn-cliq [ ] and dynamictreecut [ ]. according to the ari score, sclapa achieves generally higher or comparable performance relative to the other methods, whereas dynamictreecut provides consistently lower performance (figure ). similar results were observed using the jaccard or nmi indexes (figure s ). specifically, sclapa provides the best ari score in three of the four datasets (figure ). for the hypothalamus data, where sc performs the best, sclapa presents an ari score very close to that of sc (sclapa= . ; sc = . ). particularly, for three datasets (mammary, hypothalamus and root), the ari scores of individual sc runs varied greatly, indicating that the performance of sc may be unstable on some kinds of datasets. overall, the performance of sclapa is robust and consistently high across diverse scrna-seq datasets.
sclapa identifies hidden subpopulations of cells

we next applied sclapa to the human pbmc k dataset from x genomics for cell type clustering. first, we examined the cell type composition of the pbmcs by applying seurat to the gene-cell expression matrix (ge-matrix). ten distinct cell clusters were obtained (figure a). based on the expression of known markers (table s ), nine clusters were annotated. up to , poly(a) sites from genes were identified from the raw rna-seq data with scapatrap. we learned cell-cell similarities with sclapa by jointly considering the expression profiles of apa isoforms and genes. after applying louvain clustering to the cell-cell similarity matrix, cell clusters were obtained and clusters were successfully annotated. these clusters covered the nine clusters identified by seurat and contained two new small clusters (figure b). both subclusters were supported by the expression patterns of markers, suggesting that they represent distinct cell types. one subcluster was annotated as regulatory t cells on the basis of elevated expression of three markers, ccr , foxp and il ra (figure s ). based only on the gene expression profile, regulatory t cells were not well resolved and were indistinguishable from other t cells (figure a). although the gene expression of the marker ccr is sparse and weak among t cells, we could still clearly distinguish regulatory t cells from other t cell types according to the umap visualization of the gene expression profile (figure c). particularly, ccr has four annotated poly(a) sites according to apasdb [ ], whereas only one poly(a) site was identified from the scrna-seq data.
this is not unexpected, as the bulk ′-seq data cover more diverse tissue samples than the pbmc data, and scrna-seq data are generally too sparse to identify all poly(a) sites. however, we have shown that even a single poly(a) site can encapsulate useful information beyond the gene expression profile (figure ). the other subcluster, in which cell markers such as ppbp and pf are expressed, was annotated as megakaryocyte progenitors (figures d and s ). according to the pa-matrix, ppbp carries three poly(a) sites, and five poly(a) sites of ppbp were annotated in apasdb. these three poly(a) sites were all highly expressed in megakaryocyte progenitors (cluster ) (figure e). these results demonstrate that sclapa facilitates the capture and identification of hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone.

discussion

sclapa is an integrative framework for learning association for single-cell transcriptomics by leveraging the expression profiles of genes and apa isoforms in individual cells; it highlights the inclusion of polyadenylation signatures for improving cell type clustering and discovering new cell types. the effectiveness of sclapa for cell-cell similarity learning and cell type clustering is evidenced by comparisons with various similarity metrics and single-cell clustering methods on several scrna-seq datasets. sclapa has a number of desirable features. first, sclapa incorporates existing tools to extract and quantify poly(a) sites directly from scrna-seq data, which augments the gene-level analysis with an additional layer of apa information without altering the scrna-seq protocol or performing additional sequencing experiments.
second, by employing the strategy of similarity network fusion, sclapa jointly considers expression profiles at both the apa isoform and gene levels to learn highly informative cell-cell similarities. third, in contrast to many other methods that cluster cells without an explicit similarity learning step, sclapa provides two independent but connected modules for similarity learning and cell clustering, each with various methods for users to choose from. accordingly, users can freely combine different similarity metrics and clustering methods in sclapa to evaluate the clustering results for any given dataset. fourth, the framework of sclapa is highly flexible and can be seamlessly embedded into most existing scrna-seq pipelines or tools for downstream analyses, such as dimension reduction, cell type clustering and differential expression analysis. accordingly, existing tools, such as those designed for dropout imputation, normalization and similarity learning, can also be easily incorporated into sclapa.

with sclapa, distinct cell-cell similarity networks can be effectively learned from profiles of gene expression and polyadenylation separately by various similarity metrics. sclapa then employs the strategy of similarity network fusion for scalable and robust integration of the similarity networks learned from different data layers. this strategy has the advantage of exploiting complementarities in distinct data layers to fully profile the spectrum of the underlying data.
moreover, the consensus set of cell-cell interactions and associations from the apa layer and the gene layer can be learned from the given data, mitigating noise and dropouts in the conventional gene-cell expression profile and thus enhancing accuracy for downstream analyses. by combining the expression profiles of apa and genes through similarity network fusion, we found two hidden subpopulations of pbmcs that were undetectable using only gene expression data (figure ). moreover, the augmentation of gene expression profiles with polyadenylation information enhances single-cell clustering results and generates more discriminative cell types (figures - ). as a comprehensive toolkit, sclapa provides a unique strategy to improve cell type clustering and discover novel cell types by combining gene expression with polyadenylation information at single-cell resolution.

sclapa consists of three core function modules: learning cell-cell similarities, distance fusion and clustering. currently, numerous methods are available to learn cell-cell similarities or cluster cells with reasonable accuracy [ ]. however, each method has its own strengths and limitations, and it is extremely challenging, if not impossible, to determine an optimal method for all kinds of datasets, as different methods may exploit different characteristics of the data [ ]. moreover, some similarity metrics may be overly dependent on downstream clustering methods, exacerbating the difficulty of choosing a universally applicable combination of similarity and clustering methods. for
example, based on the ge-matrix alone, similarities learned by simlr provide overall high performance across datasets in terms of internal validation indexes (figure a). however, simlr is highly dependent on the downstream clustering method for single-cell clustering; it achieves high performance with louvain clustering (figure a), whereas its performance drops sharply with k-means or spectral clustering (figure b). in contrast, our benchmarking analyses showed that the performance of sclapa is robust and consistently high across diverse datasets regardless of the distance metrics or clustering methods selected in sclapa (figures - ). the unique strength of sclapa may be that it efficiently fuses the rich structures stored in the ge-matrix and the accompanying pa-matrix, and can thus amplify biological signals and augment cell-cell relationships.

sclapa is an easy-to-use and highly flexible framework. the input of sclapa is the ge-matrix and the pa-matrix, without any a priori biological information. even with raw scrna-seq data, it is easy to obtain the prerequisite ge-matrix and/or pa-matrix using various tools, e.g. cell ranger for the ge-matrix, and scapatrap and sierra for the pa-matrix. lately another tool, scdapars [ ], was proposed to quantify and recover apa usage from scrna-seq data; it uses the relative usage of the distal poly(a) site, called pdui, to measure a gene’s apa usage. with scdapars, gao et al. [ ] analyzed cell-type-specific apa regulation and discovered hidden cell subpopulations from cancer and human endoderm differentiation scrna-seq data. in sclapa, the input pa-matrix can be replaced with any other gene-cell-like matrix; thus the scdapars-imputed pdui matrix can be used readily in sclapa for downstream cell type clustering.
however, although the scdapars-imputed pdui profile seems to be effective in revealing apa dynamics among cell types in the previous study [ ], we found that, for cell type clustering, the performance with the pdui-matrix is much lower and less robust than that with sclapa’s pa-matrix (figure ). this may be due to several reasons. first, only genes with at least two ′ utr poly(a) sites can be used for scdapars’ pdui calculation; consequently, the pdui-matrix is much sparser than the pa-matrix, and information encoded in genes with a single poly(a) site is lost. second, although the pdui profile can be imputed with scdapars, the limited information in the highly sparse pdui-matrix confounds reliable imputation and may lead to propagation of errors or noise during the imputation process. third, unlike sclapa, which is specifically designed for learning cell-cell similarities and cell type clustering, the main function of scdapars is to analyze cell-type-specific apa dynamics and identify novel apa-related cell types. we anticipate that the pa-matrix used in sclapa may contain more comprehensive and reliable information than the pdui-matrix or the imputed pdui-matrix, which can significantly enhance the accuracy of cell type clustering. overall, the pa-matrix is simple but effective and can be easily obtained from scrna-seq data by various tools, making it more convenient to use sclapa for scrna-seq analyses.

for practical applications, the current version of sclapa implements seven similarity metrics and four clustering methods for users to choose from, which allows users to investigate their own strategies for evaluating the effect of different combinations of distance metrics and clustering methods.
moreover, sclapa is easily expandable in that additional distance metrics or clustering methods can be readily incorporated. meanwhile, scrna-seq preprocessing steps, such as dropout imputation and normalization, can also be easily applied before similarity learning. sclapa can also be used as a plug-in architecture for most existing scrna-seq pipelines for similarity learning and cell clustering.

supplementary data

the file of supplemental materials contains all the supplementary figures and tables.

funding

this work was supported by the national natural science foundation of china (nos. to x.w. and to g.j.) and xiamen ylz yihui technology co., ltd (xdht a).

references

. ziegenhain c, vieth b, parekh s et al. comparative analysis of single-cell rna sequencing methods, mol cell ; : - .e .
. kiselev vy, andrews ts, hemberg m. challenges in unsupervised clustering of single-cell rna-seq data, nature reviews genetics .
. skinnider ma, squair jw, foster lj. evaluating measures of association for single-cell transcriptomics, nature methods ; : - .
. wang b, zhu j, pierson e et al. visualization and analysis of single-cell rna-seq data by kernel-based similarity learning, nature methods ; : .
. grun d, lyubimova a, kester l et al. single-cell messenger rna sequencing reveals rare intestinal cell types, nature ; : - .
. kiselev vy, kirschner k, schaub mt et al. sc : consensus clustering of single-cell rna-seq data, nature methods ; : .
. levine jacob h, simonds erin f, bendall sean c et al. data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis, cell ; : - .
key points

we proposed a computational toolkit called sclapa for learning association for single-cell transcriptomics from scrna-seq data.
sclapa improves cell-cell similarity learning and cell type clustering by integrating single-cell profiling of gene expression and alternative polyadenylation.
objective benchmarking analyses using diverse scrna-seq datasets demonstrate higher performance and robustness of sclapa than competing methods in cell-cell similarity learning and cell type clustering.
sclapa discovers hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone.

. qi r, ma a, ma q et al. clustering and classification methods for single-cell rna-sequencing data, briefings in bioinformatics ; : - .
. petegrosso r, li z, kuang r. machine learning and statistical methods for clustering single-cell rna-sequencing data, briefings in bioinformatics ; : - .
. kharchenko pv, silberstein l, scadden dt. bayesian approach to single-cell differential expression analysis, nature methods ; : .
. grun d, kester l, van oudenaarden a. validation of noise models for single-cell transcriptomics, nat methods ; : - .
. stuart t, satija r. integrative single-cell analysis, nature reviews genetics ; : - .
. stuart t, butler a, hoffman p et al. comprehensive integration of single-cell data, cell ; : - .e .
. welch jd, kozareva v, ferreira a et al. single-cell multi-omic integration compares and contrasts features of brain cell identity, cell ; : - .e .
. wu x, liu t, ye c et al. scapatrap: identification and quantification of alternative polyadenylation sites from single-cell rna-seq data, briefings in bioinformatics .
. patrick r, humphreys dt, janbandhu v et al. sierra: discovery of differential transcript usage from polya-captured single-cell rna-seq data, genome biol ; : .
. levin m, zalts h, mostov n et al. gene expression dynamics are a proxy for selective pressures on alternatively polyadenylated isoforms, nucleic acids res ; : - .
. shulman ed, elkon r. cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data, nucleic acids res ; : - .
. arzalluz-luque a, conesa a. single-cell rnaseq for the study of isoforms-how is that possible?, genome biology ; : .
. song y, botvinnik ob, lovci mt et al. single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation, molecular cell ; : - .e .
. macosko ez, basu a, satija r et al. highly parallel genome-wide expression profiling of individual cells using nanoliter droplets, cell ; : - .
. hashimshony t, wagner f, sher n et al. cel-seq: single-cell rna-seq by multiplexed linear amplification, cell rep ; : - .
. zheng gx, terry jm, belgrader p et al. massively parallel digital transcriptional profiling of single cells, nat commun ; : .
. ye c, zhou q, hong y et al. role of alternative polyadenylation dynamics in acute myeloid leukaemia at single-cell resolution, rna biology ; : - .
. kim n, chung w, eum hh et al. alternative polyadenylation of single cells delineates cell types and serves as a prognostic marker in early stage breast cancer, plos one ; :e .
. velten l, anders s, pekowska a et al. single-cell polyadenylation site mapping reveals ' isoform choice variability, molecular systems biology ; : - .
. franzén o, gan l-m, björkegren jlm. panglaodb: a web server for exploration of mouse and human single-cell rna sequencing data, database ; .
. wang b, mezlini am, demir f et al. similarity network fusion for aggregating data types on a genomic scale, nature methods ; : .
. smith t, heger a, sudbery i. umi-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy, genome research ; : - .
. liao y, smyth gk, shi w. featurecounts: an efficient general purpose program for assigning sequence reads to genomic features, bioinformatics ; : - .
. ye w, liu t, fu h et al. movapa: modeling and visualization of dynamics of alternative polyadenylation across biological samples, bioinformatics .
. shen y, ji g, haas bj et al. genome level analysis of rice mrna '-end processing signals and alternative polyadenylation, nucleic acids research ; : - .
. wu x, liu m, downie b et al. genome-wide landscape of polyadenylation in arabidopsis provides evidence for extensive alternative polyadenylation, proceedings of the national academy of sciences, usa ; : - .
. zhao z, wu x, raj kumar pk et al. bioinformatics analysis of alternative polyadenylation in green alga chlamydomonas reinhardtii using transcriptome sequences from three different sequencing platforms, g : genes|genomes|genetics ; : - .
. wu x, gaffney b, hunt a et al. genome-wide determination of poly(a) sites in medicago truncatula: evolutionary conservation of alternative poly(a) site choice, bmc genomics ; : .
. pouyan mb, kostka d. random forest based similarity learning for single cell rna sequencing data, bioinformatics ; :i -i .
. pearl j. probabilistic reasoning in intelligent systems: networks of plausible inference. morgan kaufmann, .
. blondel vd, guillaume j-l, lambiotte r et al. fast unfolding of communities in large networks, journal of statistical mechanics: theory and experiment ; :p .
. eisen mb, spellman pt, brown po et al. cluster analysis and display of genome-wide expression patterns, proc natl acad sci u s a ; : - .
. ng ay, jordan m, weiss y. on spectral clustering: analysis and an algorithm. advances in neural information processing systems. , – .
. langfelder p, horvath s. fast r functions for robust correlations and hierarchical clustering, journal of statistical software ; : - .
. guo m, wang h, potter ss et al. sincera: a pipeline for single-cell rna-seq profiling analysis, plos comput biol ; :e -e .
. xu c, su z. identification of cell types from single-cell transcriptomes using a novel clustering method, bioinformatics ; : - .
. langfelder p, zhang b, horvath s. defining clusters from a hierarchical cluster tree: the dynamic tree cut package for r, bioinformatics ; : - .
. guy brock, vasyl pihur, susmita datta et al. clvalid, an r package for cluster validation, journal of statistical software ; : - .
. chang f, qiu w, zamar rh et al. clues: an r package for nonparametric clustering based on local shrinking, journal of statistical software ; : .
. mcinnes l, healy j, saul n et al. umap: uniform manifold approximation and projection, journal of open source software ; : .
. mccarthy dj, campbell kr, lun at et al. scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r, bioinformatics ; : - .
. love m, huber w, anders s. moderated estimation of fold change and dispersion for rna-seq data with deseq , genome biology ; : .
. jean-baptiste k, mcfaline-figueroa jl, alexandre cm et al. dynamics of gene expression in single root cells of arabidopsis thaliana, the plant cell ; : - .
. ryu kh, huang l, kang hm et al. single-cell rna sequencing resolves molecular relationships among individual plant cells, plant physiology ; : - .
. shahan r, hsu c-w, nolan tm et al.
a single cell arabidopsis root atlas reveals developmental trajectories in wild type and cell identity mutants. . . shulse cn, cole bj, ciobanu d et al. high-throughput single-cell transcriptome profiling of plant cell types, cell reports ; . . zhang t-q, xu z-g, shang g-d et al. a single-cell rna sequencing profiles the developmental landscape of arabidopsis root, molecular plant ; : - . . zhu s, ye w, ye l et al. plantapadb: a comprehensive database for alternative polyadenylation sites in plants, plant physiology ; : - . . kaufmann l, rousseeuw p. clustering by means of medoids. in: dodge y. (ed) statistical data analysis based on the l -norm and related methods. amsterdam: north-holland, , – . . gao y, li l, amos ci et al. dynamic analysis of alternative polyadenylation from single-cell rna-seq(scdapars) reveals cell subpopulations invisible to gene expression analysis, biorxiv : . . . . . you l, wu j, feng y et al. apasdb: a database describing alternative poly(a) sites and selection of heterogeneous cleavage sites downstream of poly(a) signals, nucleic acids research ; :d -d . . shirkhorshidi as, aghabozorgi s, wah ty. a comparison study on similarity and dissimilarity measures in clustering continuous data, plos one ; :e . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . figure legends figure . single-cell poly(a) profile in root hair and nonhair cells. (a) genes exclusively present in the pa-matrix. four genes (at g , at g , at g and at g ) were not present in the ge-matrix, whereas they had at least one poly(a) site according to the pa-matrix. for each gene, the violin plot shows expression levels of its poly(a) site in hair and nonhair cells and the umap visualization shows the d embeddings of poly(a) profile. 
(b) Two example genes (at g and at g ) that are not differentially expressed (DE) but possess at least one DE poly(A) site. The upper panel shows the violin plot and UMAP visualization of the poly(A) profile of the respective gene in hair and nonhair cells. The lower panel shows the gene profile. (c) Single-cell poly(A) profile distinguishes root hair and root nonhair cells. The left plot is the UMAP representation based on genes that are not DE but have at least one DE poly(A) site; the right plot is the UMAP representation based on the poly(A) profile of those genes.

Figure 2. Benchmarking of similarity learning with scLAPA on four published scRNA-seq datasets. (a) The Dunn index, an internal validation metric, was employed to measure the cell separation. (b) Radar chart showing the performance of different similarity metrics across datasets. Dataset names are shown near the vertices of the plot. Each vertex denotes the Dunn score of a metric on the respective dataset. The larger the area of a polygon in a radar chart, the higher the overall performance. (c) Radar chart showing the performance of scLAPA with different distance metrics for distance fusion. Each vertex denotes the Dunn score obtained with a given distance metric on the respective dataset.

Figure 3. Benchmarking of similarity learning with scLAPA in the context of clustering on four published scRNA-seq datasets. (a) ARI was employed to measure the concordance between inferred and true cluster labels. Louvain clustering was applied to the similarity matrices obtained from different methods. (b) Radar charts showing ARI scores obtained by applying different clustering methods to cell-cell similarities learned by each similarity metric. Each plot represents the results for one dataset. Clustering methods are shown near the vertices of the plot. Each vertex denotes the ARI score of applying a clustering method to the different metrics.
The larger the area of a polygon in a radar chart, the higher the overall performance. HC, hierarchical clustering; SC, spectral clustering.

Figure 4. Comparison of performance between scLAPA and scDaPars across four scRNA-seq datasets. Five similarity metrics were applied to the scDaPars-imputed PDUI profile and the GE-matrix to generate scDaPars-dist and GE-dist, respectively. After fusing the two distance matrices with SNF, Louvain clustering was applied to the fused cell-cell similarities to cluster cells. We did not include RAFSIL in this experiment due to its slow calculation speed. For scLAPA, Pearson correlation was used for similarity learning and Louvain was used for clustering.

Figure 5. ARI scores from six clustering methods across four scRNA-seq datasets. For scLAPA, Pearson correlation was used for similarity learning and Louvain was used for clustering.

Figure 6. scLAPA identifies hidden subpopulations of cells from human PBMCs. (a) UMAP representation of Seurat's clustering results based on the GE-matrix. Ten clusters were obtained and nine were annotated with known cell types: naive T cell ( ), CD + monocytes ( ), CD + T cell ( ), B cell ( ), CD + memory T ( ), NK cell ( ), CD + monocytes ( ), monocyte-derived dendritic ( , ) and plasmacytoid dendritic ( ). (b) UMAP representation of scLAPA's clustering results based on the GE-matrix and PA-matrix.
Fourteen clusters were obtained and clusters were annotated with known cell types: regulatory T cell ( ), naive T cell ( , ), plasmacytoid dendritic ( ), CD + memory T ( ), CD + T cell ( ), CD + monocytes ( ), monocyte-derived dendritic ( , , ), CD + monocytes ( ), megakaryocyte progenitors ( ), B cell ( ) and NK cell ( ). The two arrows mark two new subpopulations of cells identified by scLAPA. (c) Gene expression of CCR distinguishes regulatory T cells from other T cell types according to the UMAP visualization of the gene expression profile. The details in the dashed-line box are shown in the solid-line box. (d) Gene expression of PPBP distinguishes megakaryocyte progenitors from other cell types. (e) Three poly(A) sites of PPBP are all highly expressed in megakaryocyte progenitors.

[Figure images not recoverable from the extracted text. Surviving labels: panels (a) to (e); hair and nonhair cells; UMAP axes; PA coordinates; expression level by identity; datasets amygdala, hypothalamus, mammary and root; Dunn and ARI axes; similarity metrics Euclidean, Pearson, SIMLR, RAFSIL, scLAPA and scDaPars; clustering methods HC, SC, k-means, Louvain, SINCERA, SNN-Cliq, Seurat and dynamicTreeCut; regulatory T cell; megakaryocyte progenitors.]
Interpretable detection of novel human viruses from genome sequencing data

Jakub M. Bartoszewicz*, Anja Seidel and Bernhard Y. Renard*

Bioinformatics (MF ), Department of Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany; Department of Mathematics and Computer Science, Free University of Berlin, Berlin, Germany; Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, Potsdam, Brandenburg, Germany; and Digital Engineering Faculty, University of Potsdam, Potsdam, Brandenburg, Germany. Current address: Central Research Institute of Ambulatory Health Care, Berlin, Germany.

Abstract

Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision.
Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.

* To whom correspondence should be addressed. Tel: + ; Email: jakub.bartoszewicz@hpi.de, bernhard.renard@hpi.de

Introduction

Background

Within a globally interconnected and densely populated world, pathogens can spread more easily than ever before. As the recent outbreaks of Ebola and Zika viruses have shown, the risks posed even by these previously known agents remain unpredictable and their expansion hard to control ( ). What is more, it is almost certain that more unknown pathogen species and strains are yet to be discovered, given their constant, extremely fast-paced evolution and unexplored biodiversity, as well as increasing human exposure ( , ). Some of those novel pathogens may cause epidemics (similar to the SARS and MERS coronavirus outbreaks in 2002 and 2012) or even pandemics (e.g. SARS-CoV-2 and the "swine flu" H1N1/09 strain). Many have more than one host or vector, which makes assessing and predicting the risks even more difficult. For example, Ebola has its natural reservoir most likely in fruit bats ( ), but causes deadly epidemics in both humans and chimpanzees. As the state-of-the-art approach for open-view detection of pathogens is genome sequencing ( , ), it is crucial to develop automated pipelines for characterizing the infectious potential of currently unidentifiable sequences. In practice, clinical samples are dominated by host reads and contaminants, often with fewer than a hundred reads of the pathogenic virus ( ).
Metagenomic assembly is challenging, especially in time-critical applications. This creates a need for read-based approaches complementing or substituting assembly where needed.

Screening against potentially dangerous subsequences before their synthesis may also be used as a way of ensuring responsible research in synthetic biology. While potentially useful in some applications, engineering of viral genomes could also pose a biosecurity and biosafety threat. Two controversial studies modified the influenza A/H5N1 ("bird flu") virus to be airborne transmissible in mammals ( , ). The possibility of modifying coronaviruses to enhance their virulence triggered calls for a moratorium on this kind of research ( ). Synthesis of an infectious horsepox virus closely related to the smallpox-causing variola virus ( ) caused a public uproar and calls for intensified discussion on risk control in synthetic biology ( ).

© yyyy The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/ . /uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

bioRxiv preprint, doi: https://doi.org/ . / . . . ; this version posted January , . The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND . International license (http://creativecommons.org/licenses/by-nd/ . /).

Current tools for host range prediction

Several computational, genome-based methods make it possible to predict the host range of a bacteriophage (a bacteria-infecting virus).
A selection of composition-based and alignment-based approaches has been presented in an extensive review by Edwards et al. ( ). Prediction of eukaryotic host tropism (including humans) based on known protein sequences was shown for the influenza A virus ( ). Support-vector machines based on word2vec representations were shown to outperform homology searches with BLAST and HMMs in the same task, but lost their advantage when applied to nucleic acid sequences directly ( ). Two recent studies employ k-mer based k-NN classifiers ( ) and deep learning ( ) to predict host range for a small set of three well-studied species directly from viral sequences. While those approaches are limited to those particular species and do not scale to viral host-range prediction in general, the Host Taxon Predictor (HTP) ( ) uses logistic regression and support vector machines to predict whether a novel virus infects bacteria, plants, vertebrates or arthropods. Yet, the authors argue that it is not possible to use HTP in a read-based manner; it requires long sequences of at least , nucleotides. This is incompatible with modern metagenomic next-generation sequencing (NGS) workflows, where the DNA reads obtained are at least - times shorter. Another study used gradient boosting machines to predict reservoir hosts and transmission via arthropod vectors for known human-infecting viruses ( ).

Zhang et al. ( ) designed several classifiers explicitly predicting whether a new virus can potentially infect humans. Their best model, a k-NN classifier, uses k-mer frequencies as features representing the query sequence and can yield predictions for sequences as short as base pairs (bp). It also worked with bp-long reads from real DNA sequencing runs, although in this case the reads originated from viruses that were also present in the training set (and were therefore not "novel").
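To make the k-mer frequency features mentioned above concrete, the following is a minimal sketch of how a read can be turned into a fixed-length frequency vector. The function name `kmer_frequencies`, the choice of k, and the decision to skip k-mers containing N are illustrative assumptions, not details taken from Zhang et al.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies over the alphabet ACGT.

    The value of k is an illustrative assumption. K-mers containing any
    other symbol (e.g. N) are ignored, which is also an assumption.
    """
    seq = seq.upper()
    # Fixed, lexicographically ordered list of all 4**k possible k-mers.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


# Example: dinucleotide frequencies of a short read with one unknown base.
freqs = kmer_frequencies("ACGTACGTNACGT", k=2)
```

Because the feature vector has the same length for every input, it can be passed to any distance-based classifier such as k-NN regardless of read length.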
Deep learning for DNA sequences

While DNA sequences mapped to a reference genome may be represented as images ( ), a majority of studies use a distributed orthographic representation, where each nucleotide {A,C,G,T} in a sequence is represented by a one-hot encoded vector of length 4. An "unknown" nucleotide (N) can be represented as an all-zero vector. Chaos game representation (CGR) and its extension, the frequency matrix CGR (FCGR), are promising alternatives able to encode an arbitrary sequence in an image-like format. FCGR has been used to encode genomic inputs for deep learning approaches, including full bacterial genomes ( ) and coding sequences of HIV for the drug resistance prediction task ( ). In this study, we use one-hot encoding with Ns as zeroes, which was previously shown to perform well for raw NGS reads ( ) and abstract phenotype labels.

CNNs and LSTMs have been successfully used for a variety of DNA-based prediction tasks. Early works focused mainly on regulation of gene expression in humans ( , , , , ), which is still an area of active research ( , , ). In the field of pathogen genomics, deep learning models trained directly on DNA sequences were developed to predict host ranges of three multi-host viral species ( ) and to predict pathogenic potentials of novel bacteria ( ). DeepVirFinder ( ) and ViraMiner ( ) can detect viral sequences in metagenomic samples, but they cannot predict the host and focus on previously known species. For a broader view on deep learning in genomics, we refer to a recent review by Eraslan et al. ( ).

Interpretability and explainability of deep learning models for genomics are crucial for their widespread adoption, as they are necessary for delivering trustworthy and actionable results. Convolutional filters can be visualized by forward-passing multiple sequences through the network and extracting the most-activating subsequences ( ) to create a position weight matrix (PWM), which can be visualized as a sequence logo ( , ).
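The encoding and the PWM construction described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the helper names `one_hot` and `pwm_from_subsequences` are hypothetical, the most-activating subsequences are assumed to have been extracted already, and the matrix built here is a plain position frequency matrix without pseudocounts or background correction.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) array; N or any other symbol stays all-zero."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded


def pwm_from_subsequences(subseqs):
    """Build a position frequency matrix from equally long subsequences.

    Positions where every subsequence has an N get an all-zero row.
    """
    stacked = np.stack([one_hot(s) for s in subseqs])  # shape (n, L, 4)
    counts = stacked.sum(axis=0)                       # shape (L, 4)
    totals = counts.sum(axis=1, keepdims=True)         # shape (L, 1)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


x = one_hot("ACNT")
pwm = pwm_from_subsequences(["ACGT", "ACGA", "ANGT"])
```

Each row of `pwm` is a distribution over A, C, G and T at one motif position, which is the input a sequence logo tool expects.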
Direct optimization of input sequences is problematic, as it results in generating a dense matrix even though the input sequences are one-hot encoded ( , ). This problem can be alleviated with Integrated Gradients ( , ) or DeepLIFT, which propagates activation differences relative to a selected reference back to the input, reducing the computational overhead of obtaining accurate gradients ( ). If the bias terms are zero and a reference of all-zeros is used, the method is analogous to Layer-wise Relevance Propagation ( ). DeepLIFT is an additive feature attribution method, and may be used to approximate Shapley values if the input features are independent ( ). TF-MoDISco ( ) uses DeepLIFT to discover consolidated, biologically meaningful DNA motifs (transcription factor binding sites).

Contributions

In this paper, we first improve the performance of read-based predictions of the viral host (human or non-human) from next-generation sequencing reads. We show that reverse-complement (RC) neural networks ( ) significantly outperform both the previous state-of-the-art ( ) and the traditional, alignment-based algorithm BLAST ( , ), which constitutes a gold standard in homology-based bioinformatics analyses. We show that defining the negative (non-human) class is non-trivial and compare different ways of constructing the training set. Strikingly, a model trained to distinguish between viruses infecting humans and viruses infecting other chordates (a phylum of animals including vertebrates) generalizes well to evolutionarily distant non-human hosts, including even bacteria. This suggests that the host-related signal is strong and the learned decision boundary separates human viruses from other DNA sequences surprisingly well. Next, we propose a new approach for convolutional filter visualization using partial Shapley values to differentiate between simple nucleotide information content and the contribution of each sequence position to the final classification score.
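For reference, the "nucleotide information content" contrasted here with classification contributions is conventionally the quantity plotted in sequence logos: for a position with nucleotide frequencies p_b, it equals 2 + sum_b p_b log2 p_b bits. The sketch below computes only this standard quantity, without small-sample correction. It does not implement the partial Shapley value contributions, and the function name is a hypothetical one chosen for illustration.

```python
import numpy as np


def information_content(pfm, eps=1e-12):
    """Per-position information content (bits) of an (L, 4) frequency matrix.

    IC = log2(4) - H, where H is the Shannon entropy of each row.
    eps guards against log2(0); no small-sample correction is applied.
    """
    p = np.asarray(pfm, dtype=float)
    entropy = -np.sum(p * np.log2(p + eps), axis=1)
    return 2.0 - entropy


# A uniform position carries 0 bits; a fully conserved position carries 2 bits.
ic = information_content([[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]])
```

A position can have high information content and still contribute little to the classification score, which is why the two quantities are worth visualizing separately.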
To test the biological plausibility of our models, we generate genome-wide maps of "infectious potential" and nucleotide contributions. We show that those maps can be used to visualize and detect virulence-related regions of interest (e.g. genes) in novel genomes. As a proof of concept, we analyzed one of the viruses randomly assigned to the test set, the Taï Forest ebolavirus, which has a history of host-switching and can cause a serious disease. To show that the method can also be used for other biological problems, we investigated the networks trained by Bartoszewicz et al. ( ) and their predictions on a genome of a pathogenic bacterium, Staphylococcus aureus. The authors used this particular species to assess the performance of their method on real sequencing data. Finally, we studied the SARS-CoV-2 coronavirus, which emerged in December 2019, causing the COVID-19 pandemic ( ).

Materials and methods

Data collection and preprocessing

VHDB dataset

We accessed the Virus-Host Database ( ) on July , and downloaded all the available data. We note that all the reference genomes from NCBI Viral Genomes are present in VHDB, as well as their curated annotations from RefSeq. Additional, manually curated records in VHDB extend on metadata available in NCBI. More non-reference genomes are available, but considering multiple genomes per virus would skew the classifiers' performance towards the more frequently resequenced ones.
The original dataset contained , records comprising RefSeq IDs for viral sequences and associated metadata. Some viruses are divided into discontiguous segments, which are represented as separate records in VHDB; in those cases the segments were treated as contigs of a single genome in the further analysis. We removed records with unspecified host information and those confusing the highly pathogenic Variola virus with a similarly named genus of fish. Following Zhang et al. ( ), we filtered out viroids and satellites, which are classified as subviral agents and not bona fide viruses ( , ). Note that even though they require helper viruses for replication, this step did not affect ubiquitous adeno-associated viruses and large virophages, which are well established within the viral taxonomy in the families Parvoviridae and Lavidaviridae, respectively. Human-infecting viruses were extracted by searching for records containing "Homo sapiens" in the "host name" field. Note that VHDB contains information about multiple possible hosts for a given virus where appropriate. Any virus infecting humans was assigned to the positive class, even if other, non-human hosts exist. In total, the dataset contained , viruses (grouped in species), including , human viruses ( species). We considered both DNA and RNA viruses; RNA sequences were encoded in the DNA alphabet, as in RefSeq.

Defining the negative class

While defining a human-infecting class is relatively straightforward, the reference negative class may be conceptualized in a variety of ways. The broadest definition takes all non-human viruses into account, including bacteriophages (bacterial viruses). This is especially important, as most known bacteriophages are DNA viruses, while many important human (and animal) viruses are RNA viruses. One could expect that the multitude of available bacteriophage genomes dominating the negative class could lower the prediction performance on viruses similar to those infecting humans.
This offers an open-view approach covering a wider part of the sequence space, but may lead to misclassification of potentially dangerous mammalian or avian viruses. As they are often involved in clinically relevant host-switching events, a stricter approach must also be considered. In this case, the negative class comprises only viruses infecting Chordata (a group containing vertebrates and closely related taxa). Two intermediate approaches consider all eukaryotic viruses (including plant and fungal viruses), or only animal-infecting viruses. This amounts to four nested host sets: "All" ( , non-human viruses, species), "Eukaryota" ( , viruses, species), "Metazoa" ( , viruses, species) and "Chordata" ( , viruses, species). Auxiliary sets containing only non-eukaryotic viruses ("non-Eukaryota"), non-animal eukaryotic viruses ("non-Metazoa Eukaryota") etc. can be easily constructed by set subtraction.

For the positive class, we randomly generated a training set containing % of the genomes, and validation and test sets with % of the genomes each. Importantly, the nested structure was also kept during the training-validation-test split: for example, the species assigned to the smallest test set ("Chordata") were also present in all the bigger test sets. The same applied to other taxonomic levels, as well as the training and validation sets wherever applicable.

Read simulation

We simulated bp long Illumina reads following a modification of a previously described protocol ( ) and using the Mason read simulator ( ). First, we only generated the reads from the genomes of human-infecting viruses. Then, the same steps were applied to each of the four negative class sets. Finally, we also generated a fifth set, "Stratified", containing an equal number of reads drawn from genomes of the following disjoint host classes: "Chordata" ( %), "non-Chordata Metazoa" ( %), "non-Metazoa Eukaryota" ( %) and "non-Eukaryota" ( %).
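The class definitions above map directly onto set operations. The sketch below is illustrative only: the `records` structure, the toy virus names and the rule that a virus belongs to a host set if any of its annotated hosts falls within that taxon are assumptions made for this example, not the actual VHDB schema or the authors' code.

```python
# Hypothetical records: virus -> list of (host name, host lineage) pairs.
records = {
    "virus_A": [("Homo sapiens", ("Eukaryota", "Metazoa", "Chordata")),
                ("Sus scrofa", ("Eukaryota", "Metazoa", "Chordata"))],
    "virus_B": [("Gallus gallus", ("Eukaryota", "Metazoa", "Chordata"))],
    "virus_C": [("Apis mellifera", ("Eukaryota", "Metazoa"))],
    "virus_D": [("Solanum lycopersicum", ("Eukaryota",))],
    "phage_E": [("Escherichia coli", ("Bacteria",))],
}


def infects(virus, taxon):
    """True if any annotated host of the virus belongs to the given taxon."""
    return any(taxon in lineage for _, lineage in records[virus])


# Any virus with a human host is positive, even if it also has other hosts.
positive = {v for v in records
            if any(name == "Homo sapiens" for name, _ in records[v])}

# Four nested negative sets, from the broadest to the strictest definition.
all_neg = set(records) - positive
eukaryota = {v for v in all_neg if infects(v, "Eukaryota")}
metazoa = {v for v in all_neg if infects(v, "Metazoa")}
chordata = {v for v in all_neg if infects(v, "Chordata")}

# Disjoint auxiliary classes obtained by set subtraction.
non_eukaryota = all_neg - eukaryota
non_metazoa_eukaryota = eukaryota - metazoa
non_chordata_metazoa = metazoa - chordata
```

The four disjoint classes (`chordata`, `non_chordata_metazoa`, `non_metazoa_eukaryota`, `non_eukaryota`) partition the "All" set, which is what makes equal-sized sampling for a stratified set possible.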
in each of the evaluated settings, we used a total of million ( %) reads for training, . million ( %) reads for validation and . million ( %) paired reads as the held-out test set. read number per genome was proportional to genome length, keeping the coverage uniform on average. viruses with longer genomes were therefore represented by more reads than shorter viruses. on the other hand, their sequence diversity was covered at a similar level. this length-balancing step was previously shown to work well for bacterial genomes of different lengths ( , ). while the original datasets are heavily imbalanced, we generated the same number of negative and positive data points (reads) regardless of the negative class definition used. this protocol allowed us to test the impact of defining the negative class, while using the exactly same data as representatives of the positive class. we used three training and validation sets ("all", "stratified", and "chordata"), representing the fully open-view setting, a setting more balanced with regard to the host taxonomy, and a setting focused on cases most likely to be clinically relevant. in each setting, the validation set matched the composition of the training set. the evaluation was performed using all five test sets to gain a more detailed insight on the effects of negative class definition on the prediction performance. human blood virome dataset similarily to zhang et al. ( ), we used the human blood dna virome dataset ( ) to test the selected classifiers on real data. we obtained , , reads of bp and searched all of vhdb using blastn (with default parameters) to obtain high-quality reference labels. if .cc-by-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . 
a read's best hit was a human-infecting virus, we assigned it to the positive class; otherwise, we assigned it to the negative class. this procedure yielded , , "positive" and , "negative" reads.

virus-level and species-level predictions

in this study, we focus on predicting labels for reads originating from novel viruses. what constitutes a "novel" biological entity is an open question: a novel virus does not necessarily belong to a novel species ( ). if a given viral isolate clusters with a known group of isolates, it is considered to be the same virus; if it does not, it may be assigned a distinct name and considered novel ( ). this is separate from its putative taxonomic assignment. assigning a novel virus to a novel or a previously established species follows a wider set of criteria, and the criteria for delineating distinct species differ between viral families ( , , , ). in most cases, species are perceived as human constructs rather than biological entities, and host range is often explicitly one of the defining features ( , ), rendering reasoning based on cross-species homology searches inherently difficult. the most prominent example of this problem is the sars-cov- virus, which is a novel virus within a previously known species (severe acute respiratory syndrome–related coronavirus). other members of this species include the human-infecting sars-cov- , but also multiple related bat sarsr-cov viruses (e.g. sarsr-cov ratg or bat sars-like coronavirus wiv ). importantly, sars-cov- is not a strain of sars-cov- ; those two viruses share a common ancestor ( ). this echoes similar problems related to pathogenic potential prediction for novel bacterial pathogens.
a novel bacterium may be defined as a novel strain or a novel species ( ), and the classifiers must be trained according to the desired definition. as the pandemic has shown, different viruses of the same species can differ wildly in their infectious potential and the broader impact on human societies. therefore, threat assessment must be performed for novel viruses, not only novel taxa; different related viruses are non-redundant. at the same time, redundancy below this level (i.e. multiple instances of the same virus) must be eliminated from the dataset to ensure reliability of the trained classifier. vhdb tackles this problem by collecting and annotating reference genomes – each virus in the database is a separate entity with its own id in ncbi taxonomy. this virus-level approach was previously used by zhang et al. ( ). we show that homology-based algorithms underperform in this setting already, suggesting that machine learning is indeed required to accurately predict labels for novel viruses even if other members of the same species are present in the training database. nevertheless, a more difficult alternative – predictions for reads of viruses belonging to completely novel species – is a related and potentially equally important task. for bacterial datasets, species novelty can be modelled by selecting a single representative genome per species ( ). as the sars- cov- example shows, this is often not possible for viruses. to assess our approach in this stricter setup, we re-divided the vhdb dataset into training, validation and test sets ensuring that all viruses of a given species were assigned to only one of those subsets. this effectively models a "novel species" scenario while also reflecting within-species phenotype diversity. we recreated the species-wide versions of the "all" and "chordata" datasets by assigning %, % and % of the species to the training, validation and test datasets, respectively. 
we resimulated the reads as outlined above and compared the performance of the machine learning and homology-based approaches achieving the highest accuracy in the simpler "novel virus" setting (see section prediction performance). training we used the deepac package ( ) to investigate rc-cnn and rc-lstm architectures, which guarantee identical predictions for both forward and reverse-complement orientations of any given nucleotide sequence, and have been previously shown to accurately predict bacterial pathogenicity. here, we employ an rc-cnn with two convolutional layers with filters of size each, average pooling and fully connected layers with units each. the lstm used has units (fig. s ). we use dropout regularization in both cases, together with aggressive input dropout at the rate of . or . (tuned for each model). input dropout may be interpreted as a special case of noise injection, where a fraction of input nucleotides is turned to ns. representations of forward and reverse-complement strands are summed before the fully connected layers. as two mates in a read pair should originate from the same virus, predictions obtained for them can be averaged for a boost in performance. if a contig or genome is available, averaging predictions for constituting reads yields a prediction for the whole sequence. we used tesla p and tesla v gpus for training and an rtx ti for visualizations. we wanted the networks to yield accurate predictions for both bp (our data, modelling a sequencing run of an illumina miseq device) and bp long reads (as in the human blood virome dataset). as shorter reads are padded with zeros, we expected the cnns trained using average pooling to misclassify many of them. therefore, we prepared a modified version of the datasets, in which the last bp of each read were turned to zeros, mocking a shorter sequencing run while preserving the error model. then, we retrained the cnn which had performed best on the original dataset. 
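the input handling described in the training paragraph (ns and padding as zeros, input dropout turning nucleotides into ns, masking the end of a read, and averaging the two mates of a pair) can be sketched as follows. this is an illustrative sketch in plain python, not the deepac implementation; all function names are hypothetical, and the a, c, g, t column order is an assumption chosen so that the reverse complement is a simple flip of both axes.

```python
import random

ALPHABET = "ACGT"  # assumed column order; N and padding map to all-zero rows


def one_hot(read, length):
    """Encode a read as `length` rows of 4 values; N and padding are all zeros."""
    rows = []
    for i in range(length):
        base = read[i] if i < len(read) else "N"
        rows.append([1.0 if base == b else 0.0 for b in ALPHABET])
    return rows


def reverse_complement(encoded):
    """With A, C, G, T columns, reversing both axes yields the reverse complement."""
    return [row[::-1] for row in encoded[::-1]]


def input_dropout(encoded, rate, rng):
    """Turn each position into an N (all-zero row) with probability `rate`."""
    return [[0.0] * 4 if rng.random() < rate else row[:] for row in encoded]


def mask_tail(encoded, n):
    """Zero out the last `n` positions, mimicking a shorter sequencing run."""
    n = max(0, min(n, len(encoded)))
    keep = len(encoded) - n
    return [row[:] for row in encoded[:keep]] + [[0.0] * 4 for _ in range(n)]


def average_pair(score_mate1, score_mate2):
    """Average the predictions obtained for the two mates of a read pair."""
    return (score_mate1 + score_mate2) / 2.0
```

a reverse-complement network would process both `encoded` and `reverse_complement(encoded)` with shared weights; that part depends on the deep learning framework and is left out here.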
since in principle, the human blood virome dataset should not contain viruses infecting non-human chordata, a "chordata"-trained classifier was not used in this setting.

benchmarking

we compare our networks to the k-nn classifier proposed by zhang et al. ( ), the only other approach explicitly tested on raw ngs reads and detecting human viruses in a fully open-view setting (not focusing on a limited number of species). we use the real sequencing data that they used ( ) for an unbiased comparison. we trained the classifier on the "all" dataset as described by the authors, i.e. using non-overlapping, bp-long contigs generated from the training genomes (retraining on simulated reads is computationally prohibitive). we also tested the performance of using blast to search against an indexed database of labeled genomes. we constructed the database from the "all" training set and used discontiguous megablast to achieve high inter-species sensitivity. for ngs mappers (bwa-mem ( ) and bowtie ( )), the indices were constructed analogously. kraken ( ) was previously shown to perform worse than both blast and machine learning when faced with read-based pathogenic potential prediction for novel bacterial species ( ). its major advantage, assigning reads to lowest common ancestor (lca) nodes in ambiguous cases, turns into a problem in the infectivity prediction task, as transferring labels to lcas is often impossible ( ).
therefore, we focus on alignment-based approaches as the most accurate alternative to machine learning in this context. note that both alignment and k-nn can yield conflicting predictions for the individual mates in a read pair. what is more, blast and the mappers yield no prediction at all if no match is found. therefore, similarly to bartoszewicz et al. ( ), we used the "accept anything" operator to integrate binary predictions for read pairs and genomes. at least one match is needed to predict a label, and conflicting predictions are treated as if no match was found at all. missing predictions lower both true positive and true negative rates.

filter visualization

substring extraction

in order to visualize the learned convolutional filters, we downsample a matching test set to , reads and pass it through the network. this is modelled after the method presented by alipanahi et al. ( ). for each filter and each input sequence, the authors extracted a subsequence leading to the highest activation, and created sequence logos from the obtained sequence sets ("max-activation"). we used the deepshap implementation ( ) of deeplift ( ) to extract score-weighted subsequences with the highest contribution score ("max-contrib") or all score-weighted subsequences with non-zero contributions ("all-contrib"). computing the latter was costly and did not yield better quality logos. we use an all-zero reference. as reads from real sequencing runs are usually not equally long, shorter reads must be padded with ns; the "unknown" nucleotide is also called whenever there is not enough evidence to assign any other to the raw sequencing signal. therefore, ns are "null" nucleotides and are a natural candidate for the reference input. we do not consider alternative solutions based on gc content or dinucleotide shuffling, as the input reads originate from multiple different species, and the sequence composition may itself be a strong marker of both virus and host taxonomy ( ).
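the "accept anything" rule used in the benchmarking paragraph to integrate predictions for read pairs can be written down in a few lines. the sketch below only encodes the behaviour stated in the text (one match is enough, conflicts count as no match, and missing predictions reduce both rates); the function names and the use of `None` for a missing prediction are illustrative assumptions.

```python
def accept_anything(predictions):
    """Integrate binary predictions, e.g. for the two mates of a read pair.

    Each prediction is True (positive), False (negative) or None (no match).
    Returns the shared label if at least one prediction exists and none
    conflict; returns None if there is no match or the labels disagree.
    """
    labels = {p for p in predictions if p is not None}
    if len(labels) == 1:
        return labels.pop()
    return None


def true_rates(y_true, y_pred):
    """True positive and true negative rates; a missing prediction (None)
    counts against the rate of whichever class the sample belongs to."""
    positives = sum(1 for t in y_true if t)
    negatives = len(y_true) - positives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p is True)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and p is False)
    return tp / positives, tn / negatives
```

for example, `accept_anything([True, None])` keeps the single available label, while `accept_anything([True, False])` yields no prediction for the pair.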
we also avoid the weight-normalization suggested for zero-references ( ), as it implicitly models the expected gc content of all possible input sequences, and assumes no ns present in the data. finally, we calculate average filter contributions to obtain a crude ranking of feature importance with regard to both the positive and negative class.

partial shapley values

building sequence logos involves calculating the information content (ic) of each nucleotide at each position in a prospective dna motif. this can then be interpreted as a measure of evolutionary sequence conservation. however, high ic does not necessarily imply that a given nucleotide is relevant in terms of its contribution to the classifier's output. some sub-motifs may be present in the sequences used to build the logo, even if they do not contribute to the final prediction (or even a given filter's activation). to test this hypothesis, we introduce partial shapley values. intuitively speaking, we capture the contributions of a nucleotide to the network's output, but only in the context of a given intermediate neuron of the convolutional layer. more precisely, for any given feature $x_i$, intermediate neuron $y_j$ and the output neuron $z$, we aim to measure how $x_i$ contributes to $z$ while regarding only the fraction of the total contribution of $x_i$ that influences how $y_j$ contributes to $z$. although similarly named concepts were mentioned before as intermediate computation steps in a different context ( , ), we define and use partial shapley values to visualize contribution flow through convolutional filters. this differs from the recently introduced contribution weight matrices ( ), where feature attributions are used as a representation of an identified transcription factor binding site irreducible to a given intermediate neuron. using the formalism of deeplift's multipliers ( ) and their reinterpretation in shap ( ), we backpropagate the activation differences only along the paths "passing through" $y_j$. in eq.
, we define partial multipliers $\mu^{(y_j)}_{x_i z}$ and express them in terms of shapley values $\varphi$ and activation differences w.r.t. the expected activation values (reference activation). calculating partial multipliers is equivalent to zeroing out the multipliers $m_{y_k z}$ for all $k \neq j$ before backpropagating $m_{y_j z}$ further.

$$\mu^{(y_j)}_{x_i z} = m_{x_i y_j}\, m_{y_j z} = \frac{\varphi_i(y_j, x)\,\varphi_j(z, y)}{(x_i - E[x_i])\,(y_j - E[y_j])}$$

we define partial shapley values $\phi^{(y_j)}_i(z, x)$ analogously to how shapley values can be approximated by a product of multipliers and input differences w.r.t. the reference (eq. ):

$$\phi^{(y_j)}_i(z, x) = \mu^{(y_j)}_{x_i z}\,(x_i - E[x_i]) = \frac{\varphi_i(y_j, x)\,\varphi_j(z, y)}{y_j - E[y_j]}$$

from the chain rule for multipliers ( ), it follows that standard multipliers are a sum over all partial multipliers for a given layer $y$. therefore, shapley values as approximated by deeplift are a sum of partial shapley values for the layer $y$ (eq. ).

$$\varphi_i(z, x) = m_{x_i z}\,(x_i - E[x_i]) = \sum_j \phi^{(y_j)}_i(z, x)$$

once the contributions of the convolutional filters are calculated, $\phi^{(y_j)}_i(z, x)$ for the first convolutional layer of a network with one-hot encoded inputs and an all-zero reference can be efficiently calculated using weight matrices and filter activation differences (eq. - ). first, in this case we do not traverse any non-linearities and can directly use the linear rule ( ) to calculate the contributions of $x_i$ to $y_j$ as a product of the weight $w_i$ and the input $x_i$. second, the input values may only be 0 or 1.
$$\varphi_i(y_j, x) = w_i x_i = \begin{cases} w_i, & \text{if } x_i = 1 \\ 0, & \text{otherwise} \end{cases}$$

$$\phi^{(y_j)}_i(z, x) = \frac{w_i\,\varphi_j(z, y)}{y_j - E[y_j]}$$

resulting partial contributions can be visualized alongside the ic of each nucleotide of a convolutional kernel. to this end, we design extended sequence logos, where each nucleotide is colored according to its contribution. positive contributions are shown in red, negative contributions are blue, and near-zero contributions are gray. therefore, no information is lost compared to standard sequence logos, but the relevance of individual nucleotides and the filter as a whole can be easily seen. color saturation is limited by the reciprocal of a user-defined gain parameter, here set to $nm$, where $n$ equals the number of input features $x_i$ (sequence length) and $m$ equals the number of convolutional filters $y_j$ in a given layer.

genome-wide phenotype analysis

we create genome-wide phenotype analysis (gwpa) plots to analyse which parts of a viral genome are associated with the infectious phenotype. we split the genome into overlapping, bp long subsequences (pseudo-reads) without adding any sequencing noise. for the highest resolution, we use a stride of one nucleotide. for s. aureus, we used a stride of bp. we predict the infectious potential of each pseudo-read and average the obtained values at each position of the genome. analogously, we calculate average contributions of each nucleotide to the final prediction of the convolutional network. finally, we normalize raw infectious potentials into the [− . , . ] interval for a more intuitive graphical representation. we visualize the resulting nucleotide-resolution maps with igv ( ). for protein structures, we average the scores codon-wise to obtain contribution scores per amino acid and visualize them with pymol ( ). for well-annotated genomes, we compile a ranking of genes (or other genomic features) sorted by the average infectious potential within a given region.
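the pseudo-read averaging behind the gwpa plots can be sketched as below. this is a minimal illustration, not the authors' code: `predict` stands for any function mapping a subsequence to a score (a trained network in the paper, a toy function in the usage note), and the window length and stride are left as parameters because the actual values are not given in this text. the final normalization step is omitted for the same reason.

```python
def gwpa_scores(genome, window, stride, predict):
    """Average pseudo-read predictions at every position of a genome.

    The genome is cut into overlapping windows of length `window`, taken
    every `stride` nucleotides. Each window is scored with `predict`, and
    each position receives the mean score of all windows covering it.
    Positions not covered by any window are reported as None.
    """
    sums = [0.0] * len(genome)
    counts = [0] * len(genome)
    for start in range(0, len(genome) - window + 1, stride):
        score = predict(genome[start:start + window])
        for pos in range(start, start + window):
            sums[pos] += score
            counts[pos] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

as a toy usage example, `gwpa_scores("AAAACCCC", 4, 1, lambda s: s.count("C") / len(s))` scores each window by its fraction of c, so the per-position averages rise from the a-rich start to the c-rich end of the sequence. per-gene rankings then follow by averaging these values over annotated regions.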
in addition to that, we scan the genome with the learned filters of the first convolutional layer to find genes enriched in subsequences yielding non-zero filter activations. we use gene ontology to connect the identified genes of interest with their molecular functions and the biological processes they are engaged in.

results

negative class definition

choosing which viruses should constitute the negative class is application-dependent and influences the performance of the trained models. table s summarizes the prediction accuracy for different combinations of the training and test set composition. the models trained only on human and chordata-infecting viruses maintain similar, or even better performance when evaluated on viruses infecting a much broader host range, including bacteria. this suggests that the learned decision boundary separates human viruses from all the others surprisingly well. we hypothesize that the human host signal must be relatively strong and contained within the chordata host signal. a dropout rate of . resulted in the highest validation accuracy for cnnstr- and lstmstr. a rate of . was selected for the other models. adding more diversity to the negative class may still boost performance on more diverse test sets, as in the case of the cnn trained on the "all" dataset (cnnall). this model performs a bit worse on viruses infecting hosts related to humans, but achieves higher accuracy than the "chordata"-trained models and the best recall overall. rebalancing the negative class using the "stratified" dataset helps to achieve higher performance on animal viruses while maintaining high overall accuracy. the lstms are outperformed by the cnns, but they can be used for shorter reads without retraining (see sections training and prediction performance).

prediction performance

we selected lstmall and cnnall for further evaluation. we used a single consumer-grade rtx ti gpu to measure inference speed. the cnn classifies reads/s and the lstm reads/s.
analyzing ten million reads takes only minutes using the faster model; linear speed-ups are possible if more gpus are available. therefore, the trained models achieve the high throughput necessary to analyze ngs datasets. table presents the results of a benchmark using the "all" test set. low performance of the k-nn classifier ( ) is caused by frequent conflicting predictions for the individual reads in a read pair. in a single-read setting it achieves . % accuracy, while our best model achieves . % (table s ). although blast achieves high precision, it yields no predictions for over % of the samples. cnnall is the most sensitive and accurate. as expected, standard mapping approaches (bwa-mem and bowtie ) struggle with analysing novel pathogens: they are the most precise but the least sensitive. our approach outperforms them by - %. although we focus on the extreme case of read-based predictions, our method can also be used on assembled contigs and full genomes if they are available, as well as on read sets from pure, single-virus samples. we note that assembly itself does not yield any labels and a follow-up analysis (via alignment, machine learning or other approaches) is required to correctly classify metagenomic contigs in any case. we ran predictions on contigs without any size filtering with both k-nn and blast (table ). we present performance measures for both individual contigs and whole genome predictions based on contig-wise majority vote. we compare them to blast with read-wise majority vote ( ) and to read-wise average predictions of our networks, analogous to those presented previously for bacteria ( ). our method outperforms blast by . % and k-nn by . %, even though they have access to the full biological context (full sequences of all contigs in a genome), while we simply average outputs for short reads originating from the contigs.
table . classification performance in the fully open-view setting (all virus hosts), read pairs. acc. – accuracy, prec. – precision, rec. – recall, spec. – specificity. bowtie , bwa-mem and blast yield no predictions for over %, % and % of the samples, respectively. best performance in bold.

                    acc.    prec.   rec.    spec.
  cnnall (ours)     .       .       .       .
  lstmall (ours)    .       .       .       .
  k-nn              .       .       .       .
  bowtie            .       .       .       .
  bwa-mem           .       .       .       .
  blast             .       .       .       .

we benchmarked our models against the human blood virome dataset used by zhang et al. ( ). our models outperform their k-nn classifier. as the positive class massively outnumbers the negative class, all models achieve over % precision. cnnall- performs best (table ). however, the positive class is dominated by viruses which are not necessarily novel. the cnn was more accurate on training data, so we expected it to detect those viruses easily. finally, we repeated the analysis in the "novel species" scenario. classifying novel viral species when restricted to chordata-infecting viruses is too challenging for practical purposes (table s ). read-wise predictions are not much better than random guesses for both blast and cnns. low precision of blast shows that it often recovers wrong labels even when it does find a match: sequence similarity is not a reliable predictor of the infectious potential in this setting. even if a whole genome is available, overall accuracy is low. the picture is very different in the fully open-view scenario (table ).
the cnn trained on the species-wise division of the "all" dataset (cnnsp-all) outperforms blast by a wide margin on both reads and genomes. strikingly, cnnsp-all predictions based on a single read pair achieve higher accuracy than blast predictions using whole genomes, mainly due to their significantly higher recall. what is more, pooling predictions from all the reads originating from a given genome does not improve overall cnnsp-all accuracy any further. as cnnsp-all does not reliably outperform its chordata-trained analog on the "chordata" dataset (cnnsp-cho, table s ), we suspect that its relatively high accuracy on the "all" dataset is caused by its high sensitivity while maintaining good specificity on non-chordata viruses.

table . classification performance, all hosts. whole available genomes. negative class is the majority class. bacc. – balanced accuracy, rec. – recall, spec. – specificity. blast (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. k-nn (genome) and blast (genome) use contig-wise majority vote. k-nn (contigs) and blast (contigs) represent performance on individual contigs treated as separate entities. k-nn (reads) was not used, as high conflicting prediction rates made read-wise aggregation impracticable.

                    bacc.   aupr    rec.    spec.
  cnnall (ours)     .       .       .       .
  lstmall (ours)    .       .       .       .
  blast (reads)     .       n/a     .       .
  k-nn (genome)     .       n/a     .       .
  blast (genome)    .       n/a     .       .
  k-nn (contigs)    .       n/a     .       .
  blast (contigs)   .       n/a     .       .

table . classification performance on the human blood virome dataset. positive class is the majority class. bacc. – balanced accuracy, rec. – recall, spec. – specificity.

                    bacc.   aupr    rec.    spec.
  cnnall- (ours)    .       > .     .       .
  lstmall (ours)    .       > .     .       .
  k-nn              .       .       .       .

table . classification performance, novel species. top: paired reads (see table ). blast yields predictions for only . % of the pairs. bottom: whole available genomes or contigs; negative class is the majority class (see table ). bacc. – balanced accuracy (equal to accuracy for the balanced paired-read dataset), rec. – recall, spec. – specificity. blast (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. blast (genome) uses contig-wise majority vote. blast (contigs) represents performance on individual contigs treated as separate entities. note that low precision is heavily affected by class imbalance.

                    bacc.   prec.   rec.    spec.
  (top: paired reads)
  cnnsp-all (ours)  .       .       .       .
  blast             .       .       .       .
  (bottom: genomes or contigs)
  cnnsp-all (ours)  .       .       .       .
  blast (reads)     .       .       .       .
  blast (genome)    .       .       .       .
  blast (contigs)   .       .       .       .

filter visualization

over % of all contributing first-layer filters in cnnall have positive average contribution scores. we comment more on this fact in section nucleotide contribution logos. for cnnall, the average information content of our motifs is strongly correlated nucleotide-wise with the ic of deepbind-like logos (spearman's ρ> . , p< − for all contributing filter pairs except one). the difference in average ic is negligible ( . bit higher for "max-contrib", wilcoxon test, p< − ). therefore, our contribution logos represent analogous "motifs", while extracting additional, nucleotide-level interpretations. for exactly one filter, "max-contrib" and "max-activation" scores are not correlated. a deeper analysis reveals that this particular filter is activated by stretches of zeros (ns): it is the only filter with a positive bias, and almost all of its weights are negative (with one near-zero positive). therefore, an overwhelming majority of its maximum activations are in fact padding artifacts. on the other hand, regions of unambiguous nucleotide sequences result in high positive contributions, since they correspond to a lack of filter activation, where an activation is present for the all-n reference. in fact, for over .
% of the reads, positive contributions occur at every single position. we suspect that the filter works as an "ambiguity detector". since ns are modelled as all-zero vectors in the one-hot encoding scheme used here, the network represents "meaningful" (i.e. unambiguous) regions of the input as a missing activation of the filter. this is supported by the fact that the filter lacks any further preference for the specific non-zero nucleotide type. since the sequence logos presented here ignore ambiguous (i.e. noninformative) nucleotides, their ics for this filter are near-zero, preventing meaningful visualization. on the other hand, this ambiguity seems to play a role in the final classification decision, as contribution distributions are well-separated for both classes (fig. s ). we speculate that this could be caused by lower quality of the non-pathogen reference genomes, but understanding how exactly this information is used would require further investigation, including feature interactions at all layers of the network. importantly, only the contribution analysis reveals the relevance of the filter beyond simple activation and nucleotide overrepresentation. the choice of the reference input is crucial. in fig. , we present example filters, visualized as "max-contrib" sequence logos based on mean partial shapley values for each nucleotide at each position. all nucleotides of the filters with the second-highest (fig. a) and the lowest (fig.
b) score have relatively strong contributions in accordance with the filters’ own contributions. however, we observe that some nucleotides consistently appear in the activating subsequences, but the sign of their contributions is opposite to the filter’s (low-ic nucleotides of a different color, fig. c). those "counter-contributions" may arise if a nucleotide with a negative weight forms a frequent motif with others with positive weights strong enough to activate the filter. we comment on this fact in the section nucleotide contribution logos. some filters seem to learn gapped motifs resembling a codon structure (fig. c). we extracted this filter from the original deepac network predicting bacterial pathogenicity ( ) where the counter-contributions are common, but we find similar filters in our networks as well (fig. s ). we scanned a genome of s. aureus subsp. aureus (refseq assembly accession: gcf_ . ) with this filter and discovered that the learned motif is indeed significantly enriched in coding sequences (fisher exact test with benjamini-hochberg correction, q< − ). it is also enriched in a number of specific genes. the one with the most hits (srap, q< − ) is a serine-rich adhesin involved in the pathogenesis of infective endocarditis and mediating binding to human platelets ( ). the filter seems to detect serine and glycine repeats in this particular gene (fig. s ), but a broader, cross-species, multi-gene analysis would be required to fully understand its activation patterns. an analogous analysis revealed that the second-highest contributing filter (fig. a) is overall enriched in coding sequences in both taï forest ebolavirus (q< − , refseq accession: nc_ ) and sars-cov- coronavirus (q= . × − , refseq accession: nc_ . ). the top hits are the nucleocapsid (n) protein gene of sars-cov- and the vp ebolavirus gene encoding a polymerase cofactor suppressing innate immune signaling (q< − ). 
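the enrichment analysis above relies on fisher's exact test followed by benjamini-hochberg correction. a minimal sketch of both steps is given below using only the python standard library. it is not the authors' pipeline: the function names are hypothetical, and the one-sided ("greater") alternative and the 2×2 table layout (filter hits inside and outside a region versus non-hits) are assumptions about how such a test would typically be set up.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability
    of seeing at least `a` hits in the first row given fixed margins.
    """
    total = a + b + c + d
    row1 = a + b
    col1 = a + c
    denom = comb(total, row1)
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / denom


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

in such a setup, one p-value would be computed per gene (or per feature class, such as coding sequences) and the whole list passed to `benjamini_hochberg` to obtain the reported q-values.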
genome-wide phenotype analysis we created a gwpa plot for the taï forest ebolavirus genome. most genes ( out of ) can be detected with visual inspection by finding peaks of elevated infectious potential score predicted by at least one of the models (fig. a). intergenic regions are characterized by lower mean scores. noticeably, most nucleotide contributions are positive, and low non-negative contributions coincide with regions of negative predictions. taken together with the surprisingly good generalization of chordata-trained classifiers and a dominance of positive filters discussed above, this suggests that our networks work as positive class detectors, treating all other sequences as “negative” by default. indeed, the reference sequence of all ns is predicted to be "non-pathogenic" with a score of . we ran a similar analysis of s. aureus using the built-in deepac models ( ) and our interpretation workflow. while a viral genome contains usually only a handful of genes, by compiling a ranking of annotated genes of the analyzed s. aureus strain we could test if the high-ranking regions are indeed associated with pathogenicity (table s ). indeed, out of three top-ranking genes with known biological names and gene ontology terms, sarr and sspb are directly engaged in virulence, while hupb regulates expression of virulence- involved genes in many pathogens ( ). in contrast to the viral models, both negative and positive contributions are present (fig. s ), and the model’s output for the all-n reference is slightly above the decision threshold ( . ). even though the network architecture of the viral and the bacterial model are the same, the latter learns a "two-sided" view of the data. we assume this must be a feature of the dataset itself. fig. b presents a gwpa plot for the whole genome of the sars-cov- coronavirus, successfully predicted to infect humans, even though the data was collected at least months before its emergence. 
interestingly, its mean infectious potential ( . as scored by cnnall) is relatively close to the decision threshold, while its closest known relative, a bat-infecting sarsr-cov ratg , is actually falsely classified as a human virus with a slightly lower mean infectious potential ( . ). what is more, the gene encoding the spike protein, which plays a significant role in host entry ( ), has a mean score slightly above the threshold for sars-cov- ( . ) and below the threshold for ratg ( . ). as shown in the gwpa plots of both viruses (fig. b and fig. s ), regions that the network has learned to associate with the infectious phenotype are distributed non-uniformly and tend to cluster together. this suggests that the low-confidence mean prediction for those viruses is not a result of random guessing, but of genuine ambiguity present in the data, and the misclassification of ratg could be indicative of a general zoonotic potential of sars-related coronaviruses. in fig. b, we highlighted the score peaks aligning with the spike protein gene (s), as well as the e and n genes, which were scored the highest (apart from an unconfirmed orf of just aa downstream of n) by the cnn and the lstm, respectively. correlation between the cnn and lstm outputs is significant, but species-dependent and moderate ( . for ebola, . for sars-cov- ), which suggests they capture complementary signals. fig. c shows the nucleotide-level contributions in a small peak within the receptor-binding domain (rbd) of the s protein, crucial for recognizing the host cell. the domain location was predicted with cd-search ( ) using the default parameters. the maximum score of this peak is noticeably higher for sars-cov- ( . ) than for its analog in ratg
it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nd/ . / i i “output” — / / — : — page — # i i i i i i preprint, yyyy, vol. xx, no. xx (a) (b) (c) (d) (e) (f) figure . nucleotide contribution logos of example filters. a: second-highest mean contribution score (cnnall). error bars correspond to bayesian % confidence intervals. b: lowest mean contribution score (cnnall). c: gaps resembling a codon structure, extracted from bartoszewicz et al. ( ). consensus sequence: cawcnncnncnncnn. d- f: analogous logos created with the deepbind-like "max-activation" approach. our "max-contrib" logos visualize contributions of individual nucleotides, including counter-contributions. ( . ). fig. presents the rbd in the structural context of the whole s protein (pdb id: vsb, ( )), as well as in complex with a sars-neutralizing antibody cr (pdb id: w , ( )). the high score peak roughly corresponds to one of the regions associated with reduced expression of the rbd ( ), located in the core-rbd subdomain. it covers over % of the cr epitope, as well as the neighbouring site of the n glycan. the latter is present in the epitope of another core-rbd targeting antibody, s ( ). all the per-residue average contributions in the region are positive (fig. s ), even in the regions of lower pathogenicity score, in accordance with the results presented in fig. c. discussion accurate predictions from short dna reads compared to the previous state-of-the-art in viral host prediction directly from next-generation sequencing reads ( ), our models drastically reduce the error rates. this holds also for novel viruses not present in the training set. generalization of virus-level chordata models to other host groups is a sign of a strong, “human” signal. 
we suspect our classifiers detect the positive class, treating all other regions of the sequence space as “negative” by default, exhibiting traits of a one-class classifier even without being explicitly trained to do so. we find further support for this hypothesis: the networks learn many more “positive” than “negative” filters, and regions of near-zero nucleotide contributions (including the null reference sample) result in negative predictions. as this effect does not occur for bacteria, we expect it to be task- and data-dependent. while we ignore the simulated quality information here, investigating the role of sequencing noise will be an interesting follow-up study. although the data setup is crucial in general, the modelling step is also important, as shown by our comparison to the baseline k-nn model. the rc-nets are relatively simple, but they are invariant to reverse-complementarity and perform better than random forests, naïve bayes classifiers and standard nn architectures in another ngs task ( ). in the paired read scenario, the previously described k-nn approach fails, and standard, alignment-based homology testing algorithms cannot find any matches in more than % of the cases, resulting in relatively low accuracy. on a real human virome sample, where a main source of negative class reads is most likely contamination ( ), our method filters out non-human viruses with high specificity. in this scenario, the blast-derived ground-truth labels were mined using the complete database (as opposed to just a training set). in all cases, our results are only as good as the training data used; high-quality labels and sequences are needed to develop trustworthy models. ideally, sources of error should be investigated with an in-depth analysis of a model’s performance on multiple genomes covering a wide selection of taxonomic units.
this is especially important as the method assumes no mechanistic link between an input sequence and the phenotype of interest, and the input sequence constitutes only a small fraction of the target genome without a wider biological context. still, it is possible to predict a label even from those small, local fragments. a similar effect was also observed for image classification with cnns ( ). virulence arises as a complex interplay between the host and the virus, so the predictions reflect only an estimated potential of the infectious phenotype. this mirrors the caveats of bacterial pathogenic potential prediction ( ), including the considerations of balancing computational cost, reliability of error estimates, size and composition of the reference database. even though deep learning outperforms the standard homology-based methods, it is still an open question whether it captures "functional" signals, or just a more flexible sequence similarity function. by the very nature of machine learning and sequence comparison in general, we expect similar viruses to yield similar predictions; in principle this could be used to assess a risk of a host-switching event. the interpretability suite presented here aims at shedding some light on this question, but more research is needed.

figure . taï forest ebolavirus and sars-cov- coronavirus genomes. top: score predicted by lstmall. middle: score predicted by cnnall. heatmap: nucleotide contributions of cnnall. bottom, in blue: reference sequence. a: taï forest ebolavirus. genes that can be detected by at least one model are highlighted in black. b: whole genome and sequences encoding the spike protein (s), envelope protein (e) and nucleocapsid protein (n). c: spike protein gene, a small peak (positions , - , , dashed line in fig. b) within the receptor-binding domain (predicted by cd-search, positions , - , ). binding to the receptor is crucial for entry to the host cell. local host adaptation could help switch hosts between the animal reservoir and humans.

dual-use research and biosecurity

while we focused on the ngs-based prediction scenario, our models could in principle be used to screen dna synthesis orders for potentially dangerous sequences in the context of cyberbiosecurity in synthetic biology. since standard, homology-based approaches like blast are not enough to guarantee accurate screening at a reasonable cost ( , , ), machine learning methods are a promising solution. this has been suggested before for the bacterial deepac models ( ), and is applicable to the viral networks presented here as well. however, this line of research can raise questions about possible dual-use. o’brien and nelson ( ) suggested that while the intended purpose of pathogenicity potential prediction is to mitigate biosecurity threats, it could actually enable designing new pathogens to cause maximal harm. the importance of this concern is difficult to overstate and it must be addressed. if an ml-guided, genome-wide phenotype optimization tool existed, it would indeed be a classical dual-use technology not unlike more established computer-aided design approaches for synthetic biology: potentially dangerous, but offering tremendous benefits (e.g. in agriculture, medicine or manufacturing) as well. however, the models presented here do not allow biologically sensible optimization of target sequences.
for example, we find meaningless, low-complexity sequences of mononucleotide repeats corresponding to global maxima (infectious potential of . ). these artifacts highlight the fact that only some generally undefined regions of the theoretically possible sequence space are biologically relevant. what is more, we operate on short sequences constituting minuscule fractions of the whole genome with all its complexity.

figure . predicted infectious potentials plotted over the sars-cov- spike glycoprotein receptor-binding domain. a-c: top and side view of the spike protein. three receptor-binding domains (rbds) are colored in blue, white and red according to the predicted infectious potential of the corresponding genomic sequence. one of the domains is in the "up" conformation. red regions corresponding to the peak in fig. c are located in the core-rbd subdomain. d: rbd in complex with a sars-neutralizing antibody cr (green). the red region covers over % of the cr epitope, but spans also to the neighbouring fragments, including the site of the n glycan (carbohydrate in red stick representation). this is a part of the epitope of another neutralizing antibody, s . e: cartoon representation of fig. d. the red region is centered on two exposed α-helices surrounding the core β-sheet (lower score, white).
although successful deep learning approaches for both protein ( , , ) and regulatory sequence design ( , , , ) do exist, moving from read-based classification to genome-wide phenotype optimization would require considerable research effort, if possible at all. this would entail capturing a wealth of biological contexts well beyond the capabilities of even the best classification models currently available. nucleotide contribution logos visualizing convolutional filters may help to identify more complex filter structures and disentangle the contributions of individual nucleotides from their "conservation" in contributing sequences. counter-contributions suggest that the information content and the contribution of a nucleotide are not necessarily correlated. visualizing learned motifs by aligning the activating sequences ( ) would not fully describe how the filter reacts to presented data. it seems that the assumption of nucleotide independence – which is crucial for treating deeplift as a method of estimating shapley values for input nucleotides ( ) – does not hold in full. indeed, k-mer distribution profiles are frequently used features for modelling dna sequences (as shown also by the dimer-shuffling method of generating reference sequences proposed by shrikumar et al. ( )). however, deeplift’s multiple successful applications in genomics indicate that the assumption probably holds approximately. we see information content and deeplift’s contribution values as two complementary channels that can be jointly visualized for better interpretability and explainability of cnns in genomics. filter enrichment analysis enables even deeper insight in the inner workings of the networks. we generate activation data for hundreds to thousands of species, genes and filters. yet, aggregation and interpretation of those results beyond case studies is non-trivial, and a promising avenue for further research. 
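The two channels discussed above can be illustrated with a short sketch. It assumes, hypothetically, that for one filter we already have a matrix of nucleotide counts from the aligned contributing sequences and a matrix of mean contribution scores of the same shape; the function names, the `ACGT` column order and the array layout are illustrative choices and are not taken from the DeePaC codebase. Information content is computed as the standard sequence-logo quantity (log2 of the alphabet size minus the Shannon entropy), without the small-sample correction.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order for both input matrices


def information_content(counts):
    """Per-position information content in bits (no small-sample correction).

    counts: array of shape (positions, 4) with nucleotide counts.
    """
    counts = np.asarray(counts, dtype=float)
    freqs = counts / counts.sum(axis=1, keepdims=True)
    # Treat 0 * log2(0) as 0, as is conventional for entropy.
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(freqs > 0, freqs * np.log2(freqs), 0.0)
    entropy = -plogp.sum(axis=1)
    return np.log2(counts.shape[1]) - entropy


def summarize_filter(counts, contribs):
    """Pair the most frequent nucleotide at each position with its
    information content and its mean contribution score.

    contribs: array of shape (positions, 4), hypothetical mean contribution
    of each nucleotide at each position.
    """
    counts = np.asarray(counts, dtype=float)
    contribs = np.asarray(contribs, dtype=float)
    ic = information_content(counts)
    top = counts.argmax(axis=1)
    return [
        (ALPHABET[b], float(ic[i]), float(contribs[i, b]))
        for i, b in enumerate(top)
    ]
```

In such a summary, a position with high information content but a negative contribution for its most frequent nucleotide would correspond to what the text calls a counter-contribution: a conserved nucleotide that nevertheless pushes the prediction away from the positive class.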
genome-scale interpretability

mapping predictions back to a target genome can be used both as a way of investigating a given model’s performance and as a method of genome analysis. gwpa plots of well-annotated genomes highlight the sequences with erroneous and correct phenotype predictions at both genome and gene level, and nucleotide-resolution contribution maps help track those regions down to individual amino-acids. on the other hand, once a trusted model is developed, it can be used on newly emerging pathogens, such as the sars-cov- virus briefly analyzed in this work. therefore, we see gwpa applications in both probing the behaviour of artificial neural networks in pathogen genomics and finding regions of interest in weakly annotated genomes. what is more, the approach could be easily co-opted to genome-wide activation analyses of any arbitrary, intermediate neuron. the methods presented here may also be applied to other biological problems, and extending them to other hosts and pathogen groups, multi-class classification or gene identification is possible. however, experimental work and traditional sequence analysis are required to truly understand the biology behind host adaptation and distinguish true hits from false positives.

conclusion

we presented a new approach for predicting a host of a novel virus based on a single dna read or a read pair, cutting the error rates in half compared to the previous state-of-the-art.
for convolutional filters, we jointly visualize nucleotide contributions and information content. finally, we use gwpa plots to gain insights into the models’ behaviour and analyze a recently emerged sars-cov- virus. the approach presented here is implemented as a python package (see data availability) and a command line tool easily installable with bioconda ( ).

data availability

the datasets of simulated reads with associated metadata are hosted at https://doi.org/ . /zenodo. . the tool can be installed with bioconda (conda install deepacvir, requires setting up bioconda), docker (docker pull dacshpi/deepac) or pip (pip install deepacvir). detailed installation instructions, user guide and the main codebase (including the interpretability workflows presented here) are available at https://gitlab.com/dacs-hpi/deepac. source code of the plugin shipping the trained models, config files describing the architectures used and the models themselves are available at https://gitlab.com/dacs-hpi/deepac-vir.

acknowledgements

we gratefully acknowledge yong-zhen zhang and the scientists at the shanghai public health clinical center & school of public health, fudan university, who shared the sequence of the sars-cov- virus ahead of publication. we thank melania nowicka (max planck institute for molecular genetics) for inspiring discussions on efficient calculations of partial shapley values, vitor c. piro (hasso plattner institute) for discussions on traversing taxonomy graphs, lothar h. wieler (robert koch institute) for useful comments on the first draft of the manuscript and the anonymous reviewers for their suggestions and feedback.

funding

this work was supported by the german academic scholarship foundation (jmb), the bmbf computational life sciences initiative (project deepath, to byr) and the bmbf-funded de.nbi cloud within the german network for bioinformatics infrastructure (de.nbi) ( a b, a a, a a, a b, a a, a c, a a, a b).

references .
calvignac-spencer, s., schulze, j. m., zickmann, f., and renard, b. y. ( ) clock rooting further demonstrates that guinea ebov is a member of the zaïre lineage. plos currents, . . vouga, m. and greub, g. (january, ) emerging bacterial pathogens: the past and beyond. clinical microbiology and infection, ( ), – . . trappe, k., marschall, t., and renard, b. y. (september, ) detecting horizontal gene transfer by mapping sequencing reads across species boundaries. bioinformatics, ( ), i –i . . leendertz, s. a. j., gogarten, j. f., düx, a., calvignac-spencer, s., and leendertz, f. h. (mar, ) assessing the evidence supporting fruit bats as the primary reservoirs for ebola viruses. ecohealth, ( ), – . . lecuit, m. and eloit, m. ( ) the diagnosis of infectious diseases by whole genome next generation sequencing: a new era is opening. frontiers in cellular and infection microbiology, , . . calistri, a. and palù, g. ( ) editorial commentary: unbiased next-generation sequencing and new pathogen discovery: undeniable advantages and still-existing drawbacks. clinical infectious diseases: an official publication of the infectious diseases society of america, ( ), – . . andrusch, a., dabrowski, p. w., klenner, j., tausch, s. h., kohl, c., osman, a. a., renard, b. y., and nitsche, a. ( ) paipline: pathogen identification in metagenomic and clinical next generation sequencing samples. bioinformatics, ( ), i –i . . herfst, s., schrauwen, e. j. a., linster, m., chutinimitkul, s., wit, e. d., munster, v. j., sorrell, e. m., bestebroer, t. m., burke, d. f., smith, d. j., rimmelzwaan, g. f., osterhaus, a. d. m. e., and fouchier, r. a. m. (june, ) airborne transmission of influenza a/h n virus between ferrets. science, ( ), – . . imai, m., watanabe, t., hatta, m., das, s. c., ozawa, m., shinya, k., zhong, g., hanson, a., katsura, h., watanabe, s., li, c., kawakami, e., yamada, s., kiso, m., suzuki, y., maher, e. a., neumann, g., and kawaoka, y. 
(june, ) experimental adaptation of an influenza h ha confers respiratory droplet transmission to a reassortant h ha/h n virus in ferrets. nature, ( ), – . . lipsitch, m. and inglesby, t. v. (december, ) moratorium on research intended to create novel potential pandemic pathogens. mbio, ( ). . noyce, r. s., lederman, s., and evans, d. h. (january, ) construction of an infectious horsepox virus vaccine from chemically synthesized dna fragments. plos one, ( ), e . . thiel, v. ( ) synthetic viruses-anything new?. plos pathogens, ( ), e . . edwards, r. a., mcnair, k., faust, k., raes, j., and dutilh, b. e. ( ) computational approaches to predict bacteriophage-host relationships. fems microbiology reviews, ( ), – . . eng, c. l., tong, j. c., and tan, t. w. ( ) predicting host tropism of influenza a virus proteins using random forest. bmc medical genomics, ( ), s . . xu, b., tan, z., li, k., jiang, t., and peng, y. (july, ) predicting the host of influenza viruses based on the word vector. peerj, , e . . li, h. and sun, f. ( ) comparative studies of alignment, alignment-free and svm based approaches for predicting the hosts of viruses based on viral sequences. scientific reports, ( ), . . mock, f., viehweger, a., barth, e., and marz, m. ( , ) vidhop, viral host prediction with deep learning. bioinformatics, btaa . . gałan, w., bąk, m., and jakubowska, m. ( ) host taxon predictor - a tool for predicting taxon of the host of a newly discovered virus. scientific reports, ( ), . . babayan, s. a., orton, r. j., and streicker, d. g. (november, ) predicting reservoir hosts and arthropod vectors from evolutionary signatures in rna virus genomes. science, ( ), – . . zhang, z., cai, z., tan, z., lu, c., jiang, t., zhang, g., and peng, y. ( ) rapid identification of human-infecting viruses. transboundary and emerging diseases, ( ), – . . poplin, r., chang, p.-c., alexander, d., schwartz, s., colthurst, t., ku, a., newburger, d., dijamco, j., nguyen, n., afshar, p. t., gross, s. s., dorfman, l., mclean, c. y., and depristo, m. a. ( ) a universal snp and small-indel variant caller using deep neural networks. nature biotechnology, ( ), – . . rizzo, r., fiannaca, a., la rosa, m., and urso, a. (june, ) classification experiments of dna sequences by using a deep neural network and chaos game representation. in proceedings of the th international conference on computer systems and technologies new york, ny, usa: association for computing machinery compsystech ’ pp. – . . löchel, h. f., eger, d., sperlea, t., and heider, d. (january, ) deep learning on chaos game representation for proteins. bioinformatics, ( ), – . . bartoszewicz, j. m., seidel, a., rentzsch, r., and renard, b. y. ( , ) deepac: predicting pathogenic potential of novel dna with reverse-complement neural networks. bioinformatics, ( ), – . . alipanahi, b., delong, a., weirauch, m. t., and frey, b. j. ( ) predicting the sequence specificities of dna- and rna-binding proteins by deep learning. nature biotechnology, ( ), – . . zhou, j. and troyanskaya, o. g. ( ) predicting effects of noncoding variants with deep learning–based sequence model. nature methods, ( ), – . . zeng, h., edwards, m. d., liu, g., and gifford, d. k. ( ) convolutional neural network architectures for predicting dna–protein binding. bioinformatics, ( ), i –i . .
quang, d. and xie, x. ( ) danq: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences. nucleic acids research, ( ), e –e . . kelley, d. r., snoek, j., and rinn, j. l. ( ) basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. genome research, ( ), – . . greenside, p., shimko, t., fordyce, p., and kundaje, a. ( ) discovering epistatic feature interactions from neural network models of regulatory dna sequences. bioinformatics, ( ), i –i . . nair, s., kim, d. s., perricone, j., and kundaje, a. (july, ) integrating regulatory dna sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. bioinformatics, ( ), i –i . . avsec, Ž., weilert, m., shrikumar, a., alexandari, a., krueger, s., dalal, k., fropf, r., mcanany, c., gagneur, j., kundaje, a., and zeitlinger, j. (august, ) deep learning at base-resolution reveals motif syntax of the cis-regulatory code. biorxiv, p. . . mock, f., viehweger, a., barth, e., and marz, m. ( ) viral host prediction with deep learning. biorxiv, p. . . ren, j., song, k., deng, c., ahlgren, n. a., fuhrman, j. a., li, y., xie, x., and sun, f. (june, ) identifying viruses from metagenomic data by deep learning. arxiv: . [q-bio], arxiv: . . . tampuu, a., bzhalava, z., dillner, j., and vicente, r. (september, ) viraminer: deep learning on raw dna sequences for identifying viral genomes in human samples. plos one, ( ), e . . eraslan, g., avsec, Ž., gagneur, j., and theis, f. j. (july, ) deep learning: new computational modelling techniques for genomics. nature reviews genetics, ( ), – . . schneider, t. d. and stephens, r. m. (october, ) sequence logos: a new way to display consensus sequences. nucleic acids research, ( ), – . . crooks, g. e., hon, g., chandonia, j.-m., and brenner, s. e. (june, ) weblogo: a sequence logo generator. genome research, ( ), – . . lanchantin, j., singh, r., lin, z., and qi, y. 
( ) deep motif: visualizing genomic sequence classifications. corr, abs/ . . . lanchantin, j., singh, r., wang, b., and qi, y. ( ) deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. pacific symposium on biocomputing. pacific symposium on biocomputing, , – . . sundararajan, m., taly, a., and yan, q. ( ) gradients of counterfactuals. corr, abs/ . . . jha, a., aicher, j. k., singh, d., and barash, y. ( ) improving interpretability of deep learning models: splicing codes as a case study. biorxiv,. . shrikumar, a., greenside, p., and kundaje, a. (august, ) learning important features through propagating activation differences. in precup, d. and teh, y. w., (eds.), proceedings of the th international conference on machine learning, international convention centre, sydney, australia: pmlr vol. of proceedings of machine learning research, pp. – . . bach, s., binder, a., montavon, g., klauschen, f., müller, k.-r., and samek, w. (july, ) on pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. plos one, ( ), e . . lundberg, s. m. and lee, s.-i. ( ) a unified approach to interpreting model predictions. in guyon, i., luxburg, u. v., bengio, s., wallach, h., fergus, r., vishwanathan, s., and garnett, r., (eds.), advances in neural information processing systems , pp. – curran associates, inc. . shrikumar, a., tian, k., shcherbina, a., avsec, Ž., banerjee, a., sharmin, m., nair, s., and kundaje, a. (march, ) tf-modisco v . . . -alpha: technical note. arxiv: . [cs, q-bio, stat], arxiv: . . . altschul, s. f., gish, w., miller, w., myers, e. w., and lipman, d. j. ( ) basic local alignment search tool. journal of molecular biology, ( ), – . . camacho, c., coulouris, g., avagyan, v., ma, n., papadopoulos, j., bealer, k., and madden, t. l. (december, ) blast+: architecture and applications. bmc bioinformatics, ( ), . . 
wu, f., zhao, s., yu, b., chen, y.-m., wang, w., hu, y., song, z.- g., tao, z.-w., tian, j.-h., pei, y.-y., yuan, m.-l., zhang, y.-l., dai, f.-h., liu, y., wang, q.-m., zheng, j.-j., xu, l., holmes, e. c., and zhang, y.-z. (january, ) complete genome characterisation of a novel coronavirus associated with severe human respiratory disease in wuhan, china. biorxiv, p. . . . . . mihara, t., nishimura, y., shimizu, y., nishiyama, h., yoshikawa, g., uehara, h., hingamp, p., goto, s., and ogata, h. ( ) linking virus genomes with host taxonomy. viruses, ( ), . . king, a. m. q., adams, m. j., carstens, e. b., and lefkowitz, e. j., (eds.) ( ) virus taxonomy: ninth report of the international committee on taxonomy of viruses, academic press, london; waltham. . lefkowitz, e. j., dempsey, d. m., hendrickson, r. c., orton, r. j., siddell, s. g., and smith, d. b. (january, ) virus taxonomy: the database of the international committee on taxonomy of viruses (ictv). nucleic acids research, (d ), d –d . . holtgrewe, m. ( ) mason – a read simulator for second generation sequencing data. technical report fu berlin,. . deneke, c., rentzsch, r., and renard, b. y. ( ) paprbag: a machine learning approach for the detection of novel pathogens from ngs data. scientific reports, , . . moustafa, a., xie, c., kirkness, e., biggs, w., wong, e., turpaz, y., bloom, k., delwart, e., nelson, k. e., venter, j. c., and telenti, a. (march, ) the blood dna virome in , humans. plos pathogens, ( ), e . . gorbalenya, a. e., baker, s. c., baric, r. s., de groot, r. j., drosten, c., gulyaeva, a. a., haagmans, b. l., lauber, c., leontovich, a. m., neuman, b. w., penzar, d., perlman, s., poon, l. l. m., samborskiy, d. v., sidorov, i. a., sola, i., ziebuhr, j., and coronaviridae study group of the international committee on taxonomy of viruses (april, ) the species severe acute respiratory syndrome-related coronavirus : classifying -ncov and naming it sars-cov- . nature microbiology, ( ), – . . simmonds, p. 
and aiewsakun, p. (august, ) virus classification – where do you draw the line?. archives of virology, ( ), – . . van regenmortel, m. h. v. (january, ) chapter one - the species problem in virology. in kielian, m., mettenleiter, t. c., and roossinck, m. j., (eds.), advances in virus research, vol. , pp. – academic press. . li, h. and durbin, r. ( ) fast and accurate short read alignment with burrows–wheeler transform. bioinformatics, ( ), – . . langmead, b. and salzberg, s. l. ( - ) fast gapped-read alignment with bowtie . nature methods, ( ), – . . wood, d. e. and salzberg, s. l. ( ) kraken: ultrafast metagenomic sequence classification using exact alignments. genome biology, ( ), r . . nix, r. and kantarciouglu, m. (july, ) incentive compatible privacy-preserving distributed classification. ieee transactions on dependable and secure computing, ( ), – . . matejczyk, s. and michalak, t. ( ) solving influence maximization problem using methods from cooperative game theory., instytut podstaw informatyki pan, publication title: k . . thorvaldsdóttir, h., robinson, j. t., and mesirov, j. p. (march, ) integrative genomics viewer (igv): high-performance genomics data visualization and exploration. briefings in bioinformatics, ( ), – . . delano, w. l. and others ( ) pymol: an open-source molecular graphics tool. ccp newsletter on protein crystallography, ( ), – . .
profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs

tsung-yu lu, the human genome structural variation consortium, mark chaisson*
* corresponding author, mchaisso@usc.edu
department of quantitative and computational biology, university of southern california, california, usa

abstract
variable number tandem repeat sequences (vntrs) are composed of consecutive repeats of short segments of dna with hypervariable repeat count and composition. they include protein coding sequences and are associated with clinical disorders.
it has been difficult to incorporate vntr analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. we solve vntr mapping for short reads with a repeat-pangenome graph (rpgg), a data structure that encodes both the population diversity and repeat structure of vntr loci from multiple haplotype-resolved assemblies. we developed software to build an rpgg and to use the rpgg to estimate vntr composition with short reads. we used this to discover vntrs with length stratified by continental population, and novel expression quantitative trait loci, indicating that rpgg analysis of vntrs will be critical for future studies of diversity and disease.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license.

introduction
the human genome is composed of roughly % simple sequence repeats (ssrs) (international human genome sequencing consortium ), loci composed of short, tandemly repeated motifs. these sequences are classified by motif length into short tandem repeats (strs) with a motif length of six nucleotides or fewer, and variable-number tandem repeats (vntrs) for repeats of longer motifs. ssrs are prone to hyper-mutability through motif copy number changes due to polymerase slippage during dna replication (viguera, canceill, and ehrlich ). variation in ssrs is associated with tandem repeat disorders
(trds) including amyotrophic lateral sclerosis and huntington’s disease (gatchel and zoghbi ), and vntrs are associated with a wide spectrum of complex traits and diseases including attention-deficit disorder, type diabetes and schizophrenia (hannan ). while str variation has been profiled in human populations (mallick et al. ) and used to find expression quantitative trait loci (eqtl) (fotsing et al. ; gymrek et al. ), and variation at vntr sequences may be detected for targeted loci (bakhtiari et al. ; dolzhenko et al. ), the landscape of vntr variation in populations and its effects on human phenotypes have not yet been examined genome-wide. large-scale sequencing studies including the genomes project ( genomes project consortium et al. ), topmed (taliun et al. ) and dna sequencing by the genotype-tissue expression (gtex) project (gtex consortium ) rely on high-throughput short-read sequencing (srs) characterized by reads up to bases. alignment and standard approaches for detecting single-nucleotide variant (snv) and indel variation (insertions and deletions less than bases) using srs are unreliable in ssr loci (li et al., n.d.), and the majority of vntr structural variants (svs) are missed using sv detection algorithms with srs (chaisson et al. ). the full extent to which vntr loci differ has been made clearer by single-molecule long-read sequencing (lrs) and assembly. lrs assemblies have megabase-scale contiguity and accurate consensus sequences (koren et al. ; chin et al. ) that may be used to detect vntr variation. nearly % of insertions and deletions greater than bases discovered by lrs assemblies are in str and vntr loci (chaisson et al. ), accounting for up to mbp per genome. furthermore, lrs assemblies reveal how vntr sequences differ by kilobases in length and by motif composition (song, lowe, and kingsley ). here we propose using a limited number of human lrs genomes sequenced for population references and diversity panels (chaisson et al. ; audano et al. ; seo et al.
; shi et al. ) to improve how vntr variation is detected using srs. it has been previously demonstrated that vntr variation discovered by lrs assemblies may be genotyped using srs (hickey et al. ; audano et al. ). however, the genotyping accuracy for vntr svs is considerably lower than the accuracy for genotyping other svs, owing to the complexity of representing vntr variation and mapping reads to sv loci. most existing tools support a limited description of the complexity of tandem repeats using a single motif, such as gangstr (mousavi et al. ) and advntr (bakhtiari et al. ). while expansionhunter (dolzhenko et al. ) allows the repeat structure to be defined by a regular expression, it is mostly restricted to str genotyping and has not been extended to vntrs.
additionally, gangstr and advntr are designed to estimate the number of copies of a repeat unit, which leaves the variation in motif sequences unexplored. furthermore, traditional genotyping (chen et al. ) tests for the presence of a known variant, and does not reveal the spectrum of copy number variation that exists in tandem repeat sequences. repeat length estimation in tools specialized for tandem repeat genotyping allows more biologically meaningful analyses (gymrek et al. ; saini et al. ; gymrek et al. ). an alternative approach to the vntr genotyping problem is to use lrs assemblies as population-specific references that improve srs read mapping by adding sequences missing from the reference (du et al. ; shi et al. ). because missing sequences are enriched for vntrs (audano et al. ), haplotype-resolved lrs genomes may help improve alignment to vntr regions, as well as facilitate the development of a model to discover vntr variation by serving as a ground truth. the hypervariability of vntrs prevents a single assembly from serving as an optimal reference. instead, to improve both alignment and genotyping, multiple assemblies may be combined into a pangenome graph (pgg) (hickey et al. ; eggertsson et al. ; garrison et al. ; chen et al. ) composed of sequence-labeled vertices connected by edges such that haplotypes correspond to paths in the graph. sequences shared between haplotypes are stored in the same vertex, and genetic variation is represented by the structure of the graph. a conceptually similar construct is the repeat graph (pevzner, tang, and tesler ), in which sequences repeated multiple times in a genome are represented by the same vertex. graph analysis has been used to encode the elementary duplication structure of a genome (jiang et al. ) and for multiple sequence alignment of repetitive sequences with shuffled domains (raphael et al. ), making such graphs well-suited to represent vntrs that differ in both repeat count and composition.
here we propose the representation of human vntrs as a repeat-pangenome graph (rpgg) that encodes both the repeat structure and sequence diversity of vntr loci (figure c). the most straightforward approach that combines a pangenome graph and a repeat graph is a de bruijn graph, which was the basis of one of the earliest representations of a pangenome by the cortex method (iqbal, turner, and mcvean ; iqbal et al. ). the de bruijn graph has a vertex for every distinct sequence of length k in a genome (k-mer), and an edge connecting every two consecutive k-mers; thus k-mers occurring in multiple genomes or multiple times in the same genome are stored by the same vertex. while the cortex method stores entire genomes in a de bruijn graph, we construct a separate locus-rpgg for each vntr and store a genome as the collection of locus-rpggs, which deviates from the definition of a de bruijn graph because the same k-mer may be stored in multiple vertices.
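as an illustration of the locus-level graph described above, the following python sketch builds a de bruijn graph for a single locus from several haplotype sequences. it is a minimal sketch and not danbing-tk's implementation: the function name `build_locus_graph`, the toy sequences and the choice of k = 4 are assumptions made for the example, and reverse complements are ignored.

```python
from collections import defaultdict


def build_locus_graph(haplotypes, k):
    """Build a de Bruijn graph for one locus from several haplotype sequences.

    Vertices are distinct k-mers; an edge joins every pair of consecutive
    k-mers, so a k-mer seen in several haplotypes (or several times in one
    haplotype) is stored only once.
    """
    edges = defaultdict(set)
    for seq in haplotypes:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            edges[kmer]  # touching the key registers the vertex
        for a, b in zip(kmers, kmers[1:]):
            edges[a].add(b)
    return dict(edges)


# Hypothetical toy locus: two haplotypes with 2 and 3 copies of the motif "ACG".
locus_graph = build_locus_graph(["TTACGACGGG", "TTACGACGACGGG"], k=4)
```

in this toy locus both haplotypes yield the same seven vertices; the extra motif copy in the second haplotype only adds the edge from `GACG` back to `ACGA`, which closes a cycle. keeping one such graph per locus, for example in a dictionary keyed by locus, is what allows the same k-mer to appear in more than one vertex across a genome.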
we developed a toolkit, tandem repeat genotyping based on haplotype-derived pangenome graphs (danbing-tk), to identify vntr boundaries in assemblies, construct rpggs, align srs reads to the rpgg, and infer vntr motif composition and length in srs samples. this enables the alignment of srs datasets to an rpgg to discover population genetics of vntr loci, and to associate expression with vntr variation.

results

repeat pan-genome graph construction
our approach to building rpggs is to de novo assemble lrs genomes and build de bruijn graphs on the assembled sequences at vntr loci, using srs genomes to ensure graph quality. we used public lrs data for individuals with diverse genetic backgrounds, including genomes from individual genome projects (seo et al. ; zook et al. ), structural variation studies (chaisson et al. ), and diversity panel sequencing (audano et al. ) (figure a, supplementary table ). each genome was sequenced by either pacbio single long read (slr) between , or high-fidelity (hifi) sequencing between and -fold coverage, along with matched - -fold illumina sequencing (table ). this data reflects a wide range of technology revisions, sequencing depth, and data type; however, subsequent steps were taken to ensure accuracy of the rpgg through locus redundancy and srs alignments.
we developed a pipeline that partitions lrs reads by haplotype based on phased heterozygous snvs and assembles haplotypes separately by chromosome. when available, we used existing telomere-to-telomere snv and phase data provided by strand-seq and/or x genomics (porubsky et al. ; chaisson et al. ) with phase-block n size between . - . mb. for other datasets, long-read data were used to phase snvs. while this data has lower phase-block n (< . - mb), the individual locus-rpggs do not use long-range haplotype information and are not affected by phasing switch error. reads from each chromosome and haplotype were independently assembled using the flye assembler (kolmogorov et al. ) for a diploid assembly of . - . mb n , with the range of assembly contiguity reflecting the diversity of input data. in this study, the number of resolved vntr loci is a more accurate measurement of useful assembly contiguity than n because a disjoint rpgg is generated for each vntr locus. an initial set of , vntr intervals with motif size > bp, minimal length > bp and < k bp (mean length= bp in grch , methods, supplementary table ) was annotated by tandem repeats finder (trf) (benson ), and then mapped onto contig coordinates using pairwise contig alignments. long vntr loci tended to have fragmented trf annotation, which can cause erroneous length estimates in downstream analysis and a failure to interpret repeat structures as a whole, such as in advntr-nn (supplementary fig. ). during locus assignment, danbing-tk expands boundaries and merges loci to ensure boundaries of all vntrs are well-defined and harmonized across genomes (methods) (figure b). in practice, we found that , / , ( %) of the vntr loci are subject to boundary expansion, with an average expansion size of bp. the set of vntrs that can be properly annotated ranges from , - , depending on the assembly quality, with a final set of , loci (mean length= bp) across genomes (supplementary fig. ).
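the merging of fragmented annotations mentioned above can be pictured with a generic interval-merging routine. the sketch below is only an analogy under stated assumptions: the actual boundary expansion procedure is described in the methods and is not reproduced here, and the function name `merge_intervals`, the `max_gap` parameter and the coordinates are hypothetical.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge (start, end) intervals that overlap or lie within max_gap bases."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous interval instead of opening a new locus.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical fragmented annotations of one long VNTR plus a distant locus.
fragments = [(100, 180), (175, 260), (270, 400), (900, 950)]
loci = merge_intervals(fragments, max_gap=20)
```

here the first three fragments collapse into a single interval (100, 400), while the distant annotation stays a separate locus.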
the rpggs are constructed as disjoint bi-directional de bruijn graphs of each vntr locus and flanking bases from the haplotype-resolved assemblies. in a bi-directional de bruijn graph, each distinct sequence of length k (k-mer) and its reverse complement map to a vertex, and each sequence of length k + 1 connects the vertices to which the two composite k-mers map. there was little effect on downstream analysis for values of k between and , and so k = was used for all applications. to remove spurious vertices and edges from assembly consensus errors, srs from genomes matching the lrs samples were mapped to the rpgg, and k-mers not mapped by srs were removed from the graph (average of per locus). using the number of vertices as a proxy for sampled genetic diversity, we find that % ( , , new nodes) of the sequences novel with respect to grch ( , , nodes) are discovered after the inclusion of genomes, with diversity linearly increasing per genome after the first four genomes are added to the rpgg ( , , nodes, figure c). the alignment of a read to an rpgg may be defined by the path in the rpgg with a sequence label that has the minimum edit distance to the read among all possible paths. we used error-free bp paired-end reads simulated from six genomes (hg , hg , hg , hg , na and na ) to evaluate how reads are aligned to the rpgg.
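two ideas from this paragraph, mapping a k-mer and its reverse complement to one vertex and pruning k-mers that lack short-read support, can be sketched as follows. this is a simplified illustration, not the danbing-tk code: it assumes a canonical k-mer (the lexicographically smaller strand) as the vertex key, uses exact k-mer presence in the reads as a stand-in for "mapped by srs", and all names and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_set(seqs, k):
    """Collect the canonical k-mers of a list of sequences."""
    return {canonical(s[i:i + k]) for s in seqs for i in range(len(s) - k + 1)}


def prune(graph_kmers, read_kmers):
    """Keep only graph k-mers that are also observed in the short reads."""
    return graph_kmers & read_kmers


# Hypothetical locus: the second haplotype carries a consensus error (T -> A).
assembly = ["ACGTTGCA", "ACGTAGCA"]
reads = ["ACGTTG", "GTTGCA"]
kept = prune(kmer_set(assembly, 4), kmer_set(reads, 4))
```

the four k-mers that exist only because of the erroneous base have no read support and are dropped, leaving the five vertices of the correct haplotype.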
while several methods exist to find alignments that do not reuse cycles (garrison et al. ; rakocevic et al. ), alignment with cycles is a more challenging problem recently solved by the graphaligner method to map long reads to pangenome graphs (rautiainen, mäkinen, and marschall ). although > . % of the reads simulated from vntr loci were aligned, . % of reads matched with less than % identity, indicating misalignment. we developed an alternative approach tuned for rpgg alignments in danbing-tk (figure d) to realign all srs reads within a bam/fastq file to the rpgg in two passes: first by finding locus-rpggs that share a high number (> in each end) of k-mers with the reads, and next by threading the paired-end reads through the locus-rpgg, allowing for up to two edits (mismatch, insertion, or deletion) and requiring at least matched k-mers per read against the threaded path (methods). using danbing-tk, . % of vntr-simulated reads were aligned with > % identity. when reads from the entire genome are considered, for . % of the loci, danbing-tk can map > % of the reads back to their original vntr regions. misaligned reads from either other vntr loci or untracked regions target relatively few loci; . % ( , / , ) of loci have at least one read misaligned from outside the locus. the graph pruning step is the primary cause of missed alignments, and affects on average , loci per assembly. on real data, danbing-tk required . gb of memory to map base paired-end reads at . 
mb/sec on cores.

read-to-graph alignment in vntr regions
alignment of srs reads to the rpgg enables estimation of vntr length and motif composition. the counts of k-mers in srs reads mapped to the rpgg are reported by danbing-tk for each locus. for $M$ samples and $L$ vntr loci, the result of an alignment is $L$ count matrices of dimension $M \times N_i$, where $N_i$ is the number of vertices in the de bruijn graph on locus $i$, excluding flanking sequences. if srs reads from a genome were sequenced without bias, sampled uniformly, and mapped without error to the rpgg, the count of a k-mer in a locus mapped by an srs sample should scale by a factor of read depth with the sum of the counts of the k-mer from the locus in both assembled haplotypes for the same genome. the quality of alignment (aln-$r^2$) and sequencing bias were measured by comparing the k-mer counts from the matched illumina and lrs genomes (figure a). in total, % ( , / , ) of loci had a mean aln-$r^2$ ≥ . between srs and assembly k-mer counts, and were marked as “valid” loci to carry forward for downstream diversity and expression analysis (figure b). valid loci had an average length of bp, compared to bp in the entire database (figure c). vntr loci that did not align well (invalid) were enriched for sequences that map within alu ( , ), sva ( , ), and other , mobile elements (supplementary fig. ); loci with false mapping in the simulation experiment are also enriched in the invalid set (supplementary table ). specifically, . % ( , / , ) of loci with false-positive (fp) mapping and . % ( , / , ) of loci with false-negative (fn) mapping are not marked as valid. loci with false mapping that were retained in the final set have lower but still reasonable length prediction accuracy ( . versus . ). the complete rpgg on valid loci contains , , vertices, in contrast to the corresponding rpgg built only on grch (repeat-grch ), which has , , vertices.
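the per-locus count matrix of dimension $M \times N_i$ introduced at the start of this section can be made concrete with a small python sketch. it assumes that reads have already been assigned to the locus and oriented, and it counts exact k-mer matches only; the function name `count_matrix` and the toy reads are hypothetical.

```python
from collections import Counter


def count_matrix(samples_reads, locus_kmers, k):
    """Return an M x N_i matrix of k-mer counts for one locus.

    samples_reads: list of M lists of reads already assigned to this locus.
    locus_kmers:   ordered list of the N_i k-mers (vertices) of the locus graph.
    """
    matrix = []
    for reads in samples_reads:
        counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
        matrix.append([counts[kmer] for kmer in locus_kmers])
    return matrix


vertices = ["ACGA", "CGAC", "GACG"]        # N_i = 3 vertices
sample_a = ["ACGACG"]                      # one read
sample_b = ["ACGACG", "CGACGA"]            # two reads
m = count_matrix([sample_a, sample_b], vertices, k=4)
```

each row holds one sample and each column one vertex, so doubling the number of reads over the locus doubles every entry in that sample's row.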
we validate that the additional vertices in the rpgg are indeed important for accurately recruiting reads pertinent to a vntr locus, using the cacna c vntr as an example (figure d). it is known that the reference sequence at this locus is truncated compared to the majority of the populations ( bp in grch versus , bp averaged across genomes). the limited sequence diversity provided by repeat-grch at this locus failed to recruit reads that map to paths existing in the rpgg but missing or only partially represented in repeat-grch . a linear fit between the k-mers from mapped reads and the ground truth assemblies shows that there is a -fold gain in slope, or measured read depth, when using the rpgg compared to repeat-grch (figure e). the k-mer counts in the rpggs also correlate better with the assembly k-mer counts than those in repeat-grch (aln-$r^2$ = . versus . ). new genomes with arbitrary combinations of motifs and copy numbers in vntrs should still align to an rpgg as long as the motifs are represented in the graph. we used leave-one-out analysis to evaluate alignment of novel genomes to rpggs and estimation of vntr length.
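the linear fit between read-derived and assembly-derived k-mer counts can be sketched with ordinary least squares. the sketch assumes that aln-$r^2$ is the coefficient of determination of this fit and that the slope plays the role of measured read depth; the function name `linear_fit` and the counts are made up for illustration.

```python
def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


assembly_counts = [1, 2, 3, 4]       # k-mer counts summed over both haplotypes
read_counts = [30, 60, 90, 120]      # counts of the same k-mers in mapped reads
slope, intercept, r2 = linear_fit(assembly_counts, read_counts)
```

with unbiased sampling and error-free mapping, as in this idealized input, the points fall on a line through the origin whose slope equals the read depth. reads that fail to be recruited lower the slope, and uneven recruitment lowers $r^2$.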
in each experiment, an rpgg was constructed with one lrs genome missing. srs reads from the missing genome were mapped into the rpgg, and the estimated locus lengths were compared to the average diploid lengths of corresponding loci in the missing lrs assembly. the locus length is estimated as the adjusted sum of k-mer counts $kms$ mapped from srs sample $s$: $kms / (cov_s \times \hat{b})$, where $cov_s$ is the sequencing depth of $s$ and $\hat{b}$ is a correction for locus-specific sampling bias (lsb). because the srs datasets used in this study during pangenome construction were collected from a wide variety of studies with different biases, there was no consistent lsb in either repetitive or nonrepetitive regions for samples from different sequencing runs (supplementary fig. - ). however, principal component analysis (pca) of repetitive and nonrepetitive regions showed highly similar projection patterns (supplementary fig. ), which enabled using lsb in nonrepetitive regions as a proxy for finding the nearest neighbor of lsb in vntr regions (supplementary fig. ). leveraging this finding, a set of nonrepetitive control regions was used to estimate the lsb of an unseen srs sample (methods), giving a median length-prediction accuracy of . for unrelated genomes (figure a left, supplementary fig. ). the read depth of a repetitive region correlates with the locus length when aligning short reads to a linear reference
genome. however, estimation of vntr length from read depth has an accuracy of . (figure a left). we also compared the performance for length prediction using the rpgg versus repeat-grch , and observed a % improvement in accuracy ( . versus . , figure a left, supplementary fig. ). the overall error rate, measured with mean absolute percentage error (mape), of all loci (n= , ) is also significantly lower when using rpggs (mape= . , figure a right) compared with repeat-grch ( . , paired t-test p = . ⨉ - ) or reference-aligned read depth ( . , paired t-test p = . ⨉ - ). furthermore, a % reduction in error size is observed for the , loci poorly genotyped (mape > . ) using repeat-grch (figure b, mape= . versus . ).

profiling vntr length and motif diversity
to explore global diversity of vntr sequences and potential functional impact, we aligned reads from , individuals from diverse populations sequenced at -fold coverage by the -genomes project ( kgp) (fairley et al. ; genomes project consortium et al. ), and gtex genomes (gtex consortium ) to the rpgg. the fraction of reads from these datasets that align to the rpgg ranges from . %- . %, similar to the matched lrs/srs data ( . %). pca on the lsb of both datasets showed the kgp and gtex genomes as separate clusters in both repetitive and nonrepetitive regions (supplementary fig. ), indicating experiment-specific bias that prevents cross-dataset comparisons.
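the length estimate $kms / (cov_s \times \hat{b})$ and the mape used to score it can be written out as a short python sketch. all numbers below are invented for illustration, the bias values are simply given rather than estimated from control regions, and the function names are hypothetical.

```python
def estimate_length(kmer_counts, depth, bias):
    """Locus length estimate: summed k-mer counts / (depth * bias)."""
    return sum(kmer_counts) / (depth * bias)


def mape(predicted, truth):
    """Mean absolute percentage error, returned as a fraction."""
    return sum(abs(p - t) / t for p, t in zip(predicted, truth)) / len(truth)


# Two hypothetical loci from one sample sequenced at 30-fold depth.
predicted = [
    estimate_length([300, 330, 270], depth=30, bias=1.0),   # unbiased locus
    estimate_length([620, 580], depth=30, bias=0.8),        # under-sampled locus
]
truth = [30.0, 40.0]    # lengths taken from the matching assembly
error = mape(predicted, truth)
```

the first locus is recovered exactly, while the second is overestimated (50 against 40), which gives a mape of 0.125 over the two loci.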
consistent with the finding in the previous leave-one-out analysis, genomes from the same study cluster together in the pca plot of lsb, and so within each dataset and locus, k-mer counts from srs reads normalized by sequencing depth were used to compare vntr content across genomes. the k-mer dosage, $kms/cov$, was used as a proxy for locus length to compare tandem repeat variation across populations in the kgp genomes. the kgp samples contain individuals from african ( . %), east asian ( . %), european ( . %), admixed american ( . %), and south asian ( . %) populations. when comparing the average population length to the global average length, . % ( , / , ) of loci have differential length between populations (fdr= . on anova p values), with similar distributions of differential length when loci are stratified by the accuracy of length prediction (figure a). population stratification was calculated using the $V_{ST}$ statistic (redon et al. ) on vntr length (figure b). previous studies have used > standard deviations above the mean to define highly stratified copy number variants (sudmant et al. ). under this measure, variants are highly stratified, including that overlap genes; however, this is not significantly enriched (p= . , one-sided permutation test). two of the top five loci ranked by $V_{ST}$ are intronic: a base vntr in plcl ($V_{ST}$ = . ), and a base locus in spata ($V_{ST}$ = . ) (figure c,d).
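for readers unfamiliar with $V_{ST}$, the sketch below computes it on k-mer dosage values, assuming the usual definition $V_{ST} = (V_T - V_S)/V_T$, where $V_T$ is the variance over all individuals and $V_S$ is the size-weighted average of the within-population variances. the use of population (rather than sample) variance, the function name `v_st` and the dosage values are assumptions of this example.

```python
from statistics import pvariance


def v_st(groups):
    """V_ST = (V_T - V_S) / V_T for a dict {population: [dosage values]}."""
    all_values = [v for values in groups.values() for v in values]
    v_t = pvariance(all_values)
    # Within-population variance, weighted by population size.
    v_s = sum(len(values) * pvariance(values) for values in groups.values()) / len(all_values)
    return (v_t - v_s) / v_t


# Hypothetical k-mer dosages (kms / cov) at one locus in two populations.
dosage = {
    "pop_a": [10.0, 11.0, 9.0, 10.0],
    "pop_b": [20.0, 21.0, 19.0, 20.0],
}
score = v_st(dosage)
```

because almost all of the variance here lies between the two populations, the score is close to 1; a noisier dosage estimate inflates the within-population term and pulls the score toward 0.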
these values for $V_{ST}$ are lower than what is observed for large copy number variants (redon et al. ) and may be the result of neutral variation; however, this may be affected by the high variance of the length estimate, as $V_{ST}$ decreases as the variance of the copy number/dosage values increases (supplementary methods). vntr loci that are unstable may undergo hyper-expansion and are implicated as a mechanism of multiple diseases (hannan ). to discover new potentially unstable loci, we searched the kg genomes for evidence of rare vntr hyper-expansion. loci were screened for individuals with extreme (> standard deviations) variation, and then filtered for deletions or unreliable samples (methods) to characterize loci as potentially unstable. these loci are inside genes and are significantly reduced from the number expected by chance (p< ⨉ - , one-sided permutation test; n= , ). of these loci, have an individual with > standard deviations above the mean, of which two overlap genes, kcna and grm (supplemental fig. ). alignment to an rpgg provides information about motif usage in addition to estimates of vntr length because genomes with different motif composition will align to different vertices in the graph. to detect differential motif usage, we searched for loci with a k-mer that was more frequent in one population than another and not simply explained by a difference in locus length, comparing african (afr) and east asian (eas) populations for maximal genetic diversity. lasso regression against locus length was used to find the k-mer with the most variance explained (vex) in eas genomes, denoted as the most informative k-mer (mi-kmer). two
statistics are of interest when comparing the two populations: the difference in the count of mi-kmers ($kmc_d$) and the difference between the proportions of vex ($r^2_d$) by mi-kmers. $kmc_d$ describes the usage of an mi-kmer in one population relative to another, while $r^2_d$ indicates the degree to which the mi-kmer is involved in repeat contraction or expansion in one population relative to another. we observe that , loci have significant differences in the usage of mi-kmers between the two populations (two-sided p < . , bootstrap, supplementary fig. ). among these, the mi-kmers of , loci are crucial to length variation in the eas but not in the afr population (two-sided p < . , bootstrap) (figure e, supplementary fig. ). a top example of these loci with $r^2$ at least . in the eas population was visualized with a heatmap of relative k-mer count from both populations, and clearly showed differential usage of cycles in the rpgg (figure f).

association of vntr with nearby gene expression
because the danbing-tk length estimates showed population genetic patterns expected for human diversity, we assessed whether danbing-tk alignments could detect vntr variation with functional impact. genomes from the gtex project were mapped into the rpgg to discover loci that have an effect on nearby gene expression in a length-dependent manner. a total of / genomes with matching expression data passed quality filtering (methods). similar to the population analysis, the k-mer dosage was used as a proxy for locus length. methods previously used to discover eqtl using str genotyping (fotsing et al. ) were applied to the danbing-tk alignments.
in sum, , vntrs within kb to , gtex gene-annotations (including genes, lncrna, and other transcripts) were tested for association, with a total of , tests and approximately . vntrs tested per gene. using a gene-level fdr cutoff of %, we find eqtl (evntrs) (figure a), among which ( . %) discoveries are novel (supplementary table ), indicating that the spectrum of association between tandem repeat variation and expression extends beyond the lengths and the types of ssr considered in previous str (mousavi et al.) and vntr (bakhtiari et al.) studies. both positive and negative effects were observed among evntrs (figure b). more evntrs with a positive effect size were found than with a negative effect size ( versus , binomial test p = . ), with an average effect of + . (from + . to + . ) versus − . (from − . to − . ), respectively. evntrs tend to be closer to telomeres relative to all vntrs (mann–whitney u test p = . ⨉ - , supplementary fig. ). because many exons contain vntr sequences, expression measured by read depth should increase with length of the vntr, and there is an . -fold enrichment of evntrs in coding regions as expected.

the evntrs have the potential to yield insight into disease.
in one example, an intronic evntr at chr : , , - , , flanks exon of erap (figure d, supplementary fig. ). the evntr has a - . effect size and was reported across tissues. it colocalizes with a regulatory hotspot with peaks of histone markers, dnase and different chip signals. the protein product of erap , or endoplasmic reticulum aminopeptidase , is a zinc metalloaminopeptidase involved in class i mhc mediated antigen presentation and the innate immune response. it has been reported to be associated with several diseases including ankylosing spondylitis (wellcome trust case control consortium et al.) and crohn’s disease (franke et al.). abnormal expansion of the vntr might increase autoimmune disease risk through reducing erap expression, leaving longer and more antigenic peptides, yet potentially confer higher fitness against virus infection (ye et al.). this vntr is a unique sequence in grch that is a bp tandem duplication in / of the haplotypes.

another example is an intergenic vntr at chr : , , - , , that associates with the expression of kansl ~ kb upstream (figure c, supplementary fig. ). the evntr has a maximal effect size of + . and is significant across tissues. the protein product of kansl , or kat regulatory nsl complex subunit , is a part of the histone acetylation machinery. deletion of this gene is linked to koolen-de vries syndrome (koolen et al.), and the locus is associated with parkinson disease (witoelar et al.). the evntr colocalizes with strong chip signals. the association of this vntr with the epigenetic landscape warrants further investigation.
discussion

previous commentaries have proposed that variation in vntr loci may represent a component of undiagnosed disease and missing heritability (hannan), which has remained difficult to profile even with whole genome sequencing (mousavi et al.). to address this, we have proposed an approach that combines multiple genomes into a pangenome graph that represents the repeat structure of a population. this is supported by the software danbing-tk and its associated rpgg. we used danbing-tk to generate a pangenome from haplotype-resolved assemblies, and applied it to detect vntr variation across populations and to discover eqtl.

the structure of the rpgg can help to organize the diversity of assembled vntr sequences with respect to the standard reference. in particular, % of the graph structure is novel after the addition of genomes to the rpgg relative to repeat-grch . combined with the observation that using the -genome rpgg gives a % decrease in length prediction error, this indicates that the pan-genomes add detail for the missing variation. with the availability of additional genomes sequenced through the pangenome reference consortium ( https://humanpangenome.org/ ) and the hgsvc ( https://www.internationalgenome.org ), combined with advanced haplotype-resolved assembly methods (porubsky et al.), the spectrum of this variation will be revealed in the near future. while we anticipate that eventually the full spectrum of vntr diversity will be revealed through lrs of the entire kg, the rpgg analysis will help organize analysis by characterizing repeat domains.
for example, with our approach, we are able to detect , loci with differential motif usage between populations, which could be difficult to characterize using an approach such as multiple-sequence alignment of vntr sequences from assembled genomes.

there are several caveats to our approach. in contrast to other pangenome approaches (garrison et al.; rakocevic et al.), danbing-tk does not keep track of a reference (e.g. grch ) coordinate system. furthermore, because it is often not possible to reconstruct a unique path in an rpgg, only counts of mapped reads are reported rather than the order of traversal of the rpgg. an additional caveat of our approach is that genotype is calculated as a continuum of k-mer dosage rather than discrete units, prohibiting direct calculation of linkage disequilibrium for fine-scale mapping (lapierre et al., n.d.). finally, this approach only profiles loci where k-mer counts from reads and assemblies are correlated; loci for which every k-mer appears the same number of times are excluded from analysis (on average , / , per genome).

the rich data provided by danbing-tk and pangenome analysis provide the basis for additional association studies.
while most analysis in this study focused on the diversity of vntr length or the association of length and expression, it is possible to query differential motif usage using the rpgg. the ability to detect motifs that have differential usage between populations brings the possibility of detecting differential motif usage between cases and controls in association studies. this can help distinguish stabilizing versus fragile motifs (braida et al.), or resolve some of the problem of missing heritability by discovering new associations between motif and disease (song, lowe, and kingsley).

finally, this work is a part of ongoing pangenome graph analysis (paten et al.; li, feng, and chu), and represents an approach to generating pangenome graphs in loci that have difficult multiple sequence alignments or degenerate graph topologies. additional methods may be developed to harmonize danbing-tk rpggs with genome-wide pangenome graphs constructed from other methods.

figure . sequence diversity of vntrs in human populations. a, global diversity of sms assemblies. b, dot-plot analysis of the vntr locus chr : - (ski intron vntr) in genomes that demonstrate varying motif usage and length. c, diversity of rpgg as genomes are incorporated, measured by the number of k-mers in the , vntr graphs. total graph size built from grch and an average genome are also shown. d, danbing-tk workflow analysis.
(top) vntr loci defined from the reference are used to map haplotype loci. each locus is converted to a de bruijn graph, from which the collection of graphs is the rpgg. the de bruijn graphs shown illustrate sequences missing from the rpgg built only on grch . the alignments may be either used to select which loci may be accurately mapped in the rpgg using srs that match the assemblies (red), or may be used to estimate lengths on sample datasets (blue).

genome  continental population  study  cov  assembly n (mb)  fraction of vntr annotated  ancestry
ak  eas  kg  . .  korean
hg  eur  dp  . .  finnish
hg  eas  hgsvg  . .  han chinese
hg  eas  hgsvg  . .  han chinese
hg  eas  hgsvg  . .  han chinese
hg  amr  hgsvg  . .  puerto rican
hg  amr  hgsvg  . .  puerto rican
hg  amr  hgsvg  . .  puerto rican
hg  amr  dp  . .  colombian
hg  eas  dp  . .  vietnamese
hg  amr  dp  . .  peruvian
hg  afr  dp  . .  gambian
hg  sas  dp  . .  telugu
na  eur  dp  . .  central european
na  afr  hgsvg  . .  yoruba
na  afr  hgsvg  . .  yoruba
na  afr  hgsvg  . .  yoruba
na  afr  dp  .  luhya
na  eur  giab  . .  ashkenazim

table . source genomes for rpgg. continental populations represented are east asian (eas), european (eur), admixed amerindian (amr), south asian (sas), and african (afr). coverage is estimated diploid coverage based on alignment to grch . assembly n is of haplotype-resolved assemblies. the fraction of vntr annotated are all vntr with at least flanking bases assembled.

figure . mapping short reads to repeat-pangenome graphs. a, an example of evaluating the alignment quality of a locus mapped by srs reads. the alignment quality is measured by the $r^2$ of a linear fit between the k-mer counts from the ground truth assemblies and from the mapped reads (methods). b, distribution of the alignment quality scores of , loci. loci with alignment quality less than . when averaged across samples are removed from downstream analysis (methods). c, distribution of vntr lengths in grch removed or retained for downstream analysis. d-e, comparing the read mapping results of the cacna c vntr using rpgg or repeat-grch . the k-mer counts in each graph and the differences are visualized with edge width and color saturation (d). the k-mer counts from the ground truth assemblies are regressed against the counts from reads mapped to the rpgg (red) and repeat-grch (blue), respectively (e).

figure . vntr length prediction. a, accuracies of vntr length prediction measured for each genome (left) and each locus (right). mean absolute percentage error (mape) in vntr length is averaged across loci and genomes, respectively. lengths were predicted based on repeat-pangenome graphs (rpgg), repeat-grch (rhg) or naive read depth method (rd), respectively.
boxes span from the lower quartile to the upper quartile, with horizontal lines indicating the median. whiskers extend to points that are within . interquartile range (iqr) from the upper or the lower quartiles. b, relative performance of rpgg versus repeat-grch . loci are ordered along the x-axis by genotyping accuracy in repeat-grch . the y-axis shows the decrease in mape using rpgg versus repeat-grch . the subplot shows loci poorly genotyped (mape > . ) in repeat-grch . the red dotted line indicates the baseline without any improvement.

figure . population properties of vntr loci. a, ratios of median length between populations for loci with significant differences in average length. loci are stratified by prediction accuracy: low (< . ), medium ( . - . ), and high ( . +). b, manhattan plot of v_st values. c-d, the distribution of estimated length via k-mer dosage in continental populations for plcl and spata vntr loci, selected to visualize the distribution of dosage in different populations. each point is an individual. e, differential usage and expansion of motifs between the eas and afr populations. for each locus, the proportion of variance explained by the most informative k-mer in the eas is shown for the eas and afr populations on the x and y axes, respectively. points are colored by the difference in normalized k-mer counts, with red and blue indicating k-mers more abundant in eas and afr populations, respectively. f, an example vntr with differential motif usage. edges are colored if the k-mer count is biased toward a certain population.
the black arrow indicates the location of the k-mer that explains the most variance of vntr length in the eas population.

figure . cis-eqtl mapping of vntrs. a, evntr discoveries in human tissues. the quantile-quantile plot shows the observed p value of each association test versus the p value drawn from the expected uniform distribution. black dots indicate the permutation results from the top % associated (gene, vntr) pairs in each tissue. the regression plots for erap and kansl are shown in c and d. b, effect size distribution of significant associations from all tissues. c-d, genomic view of disease-related (egene, evntr) pairs ( erap , chr : - ) (c) and ( kansl , chr : - ) (d) are shown. red boxes indicate the location of egenes and evntrs.

materials and methods

pangenome construction

initial discovery of tandem repeats: trf v . (option: -f -d -h) (benson) was used to roughly annotate the ssr regions of five pacbio assemblies (ak , hg , hg , na , na ). the scope of this work focuses on vntrs that cannot be resolved by typical short read sequencing methods. we selected the set of ssr loci with a motif size greater than bp and a total length greater than bp and less than kbp.
for each haplotype, the selected vntr loci were mapped to the grch reference genome to identify homologous vntr loci. to maintain data quality, vntr loci that could not be assigned homology were removed from datasets.

boundary expansion of vntrs: the biological boundaries of a vntr are ill-defined; vntrs with sparse recurring motifs, transitions between different motifs, or a nested motif structure often fail to be fully annotated by trf. a misannotation of vntr boundaries can cause erroneous length estimates. to avoid the propagation of this error to downstream analysis, we developed a multiple boundary expansion algorithm to recover the proper boundary for each vntr across all haplotypes, including the remaining genomes (hg , hg , hg , hg , hg , hg , hg , hg , hg , hg , na , na , na and na ). the algorithm maintains an invariant: the flanking sequence in any of the haplotypes does not share k-mers with the vntr regions from all haplotypes. vntr boundaries in each haplotype are iteratively expanded until the invariant is true or the expansion exceeds kbp in either ’ or ’ direction. the size of the flanking regions is chosen to be bp, which is approximately the upper bound of the insert size of typical srs reads. the following qc step removes a haplotype if its vntr annotation is within bp to breakpoints or if the orthology mapping location to grch is different from the majority of haplotypes. a vntr locus with the number of supporting haplotypes less than % of the total number of haplotypes is also removed. adjacent vntr loci within bp to each other in any of the haplotypes will
induce a merging step over all haplotypes. haplotypes with distance between adjacent loci inconsistent with the majority of haplotypes are removed. finally, vntr loci with the number of supporting haplotypes less than % of the total number of haplotypes are removed, leaving , of the initial , loci.

read-to-graph alignment: for the two haplotypes of an individual, three data structures are used to encode the information of all vntr loci, including vntrs and their bp flanking sequences. the first data structure allows fast locus lookup for each k-mer ( k = ) by hashing each canonical k-mer in the vntrs and the flanking sequences to the index of the original locus. the second data structure enables graph threading by storing a bi-directional de bruijn graph for each locus. the third data structure is used for counting k-mers originating from vntrs. the read mapping algorithm maps each pair of illumina paired-end reads to a unique vntr locus in three phases: (1) in the k-mer set mapping phase, the read pair is converted to a pair of canonical k-mer multisets. the vntr locus with the highest count of intersected k-mers is detected with the first data structure. (2) in the threading phase, the algorithm tries to map the k-mers in the read pair to the bi-directional de bruijn graph such that the mapping forms a continuous path/cycle. to account for sequencing and assembly errors, the algorithm is allowed to edit a limited number of nucleotides in a read if no matching k-mer is found in the graph. the read pair is determined feasible to map to a vntr locus if the number of mapped k-mers is above an empirical threshold. (3) in the k-mer counting phase, canonical k-mers of the feasible read pair are counted if they exist in the vntr locus. finally, the read mapping algorithm returns the k-mer counts for all loci as mapped by srs reads. alignment timing was conducted on an intel xeon e - v . ghz node.
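the k-mer set mapping phase described above can be illustrated with a minimal python sketch. it is not the danbing-tk implementation: the function names, the toy loci and the toy k = 5 are hypothetical, a plain dictionary stands in for the first data structure, and a k-mer shared by several loci is simply assigned to the first locus seen. the threading and error-correction phases are omitted.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(seq, k):
    """List the canonical k-mers of a sequence in order, repeats included (a multiset)."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_locus_index(locus_seqs, k):
    """Hash every canonical k-mer of each locus (VNTR plus flanks) to its locus index.

    Simplification for this sketch: a k-mer seen in several loci keeps the first one.
    """
    index = {}
    for locus_id, seq in enumerate(locus_seqs):
        for kmer in canonical_kmers(seq, k):
            index.setdefault(kmer, locus_id)
    return index


def best_locus(read1, read2, index, k):
    """Phase 1 only: return (locus_id, hits) for the locus sharing most k-mers with the pair."""
    hits = Counter()
    for read in (read1, read2):
        for kmer in canonical_kmers(read, k):
            if kmer in index:
                hits[index[kmer]] += 1
    return hits.most_common(1)[0] if hits else None


loci = ["ACGTACGTAGCTAG", "TTTGGGCCCAAATT"]  # hypothetical toy loci
index = build_locus_index(loci, k=5)
print(best_locus("TTTGGGCCC", "AATTTGGGCC", index, k=5))
```

using canonical k-mers makes the lookup independent of read strand, which is why the second read of the pair can be given in reverse-complement orientation.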
graph pruning and merging: pan-genome representation provides a more thorough description of vntr diversity and reduces reference allele bias, which effectively improves the quality of read mapping and downstream analysis. because haplotypes assembled from long read datasets are error prone in vntr regions, it is necessary to prune the graphs/k-mers before merging them as a pan-genome. we ran the read mapping algorithm with error correction disabled so as to detect k-mers unsupported by srs reads. the three data structures were updated by deleting all unsupported k-mers for each locus. by pooling and merging the reference regions corresponding to the vntr regions in all individuals, we obtained a set of “pan-reference” regions, each indicating a location in grch that is likely to map to a vntr region in any other unseen haplotype. by referencing the mapping relation of vntr loci across individuals, we encoded the variability of each vntr locus by merging the three data structures across individuals.

alignment quality analysis: to evaluate the quality of the haplotype assemblies and the performance of the read mapping algorithm, vntr k-mer counts in the original assemblies were regressed against those mapped from srs reads. the $r^2$ of the linear fit was used as the alignment quality score (referred to as aln-$r^2$). to measure alignment quality in the pan-genome setting, only the k-mer set derived from the genotyped individual was retained as the input for regression.

data filtering: a final set of , vntr regions was called by filtering based on aln-$r^2$.
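as an illustration of the alignment quality score, the sketch below computes the $r^2$ of a simple linear fit between assembly-derived and read-derived k-mer counts for one locus. the function name and the handling of constant counts (returning None) are assumptions of this sketch, not part of danbing-tk. for a one-variable least-squares fit with an intercept, $r^2$ equals the squared pearson correlation, so the value does not depend on which of the two count vectors is treated as the response.

```python
def aln_r2(assembly_counts, read_counts):
    """r^2 of a least-squares line relating read-derived to assembly-derived k-mer counts.

    Returns None when either vector is constant, because r^2 is undefined there
    (an assumption of this sketch).
    """
    n = len(assembly_counts)
    if n != len(read_counts) or n < 2:
        raise ValueError("need two equally long vectors with at least two k-mers")
    mean_x = sum(read_counts) / n
    mean_y = sum(assembly_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in read_counts)
    syy = sum((y - mean_y) ** 2 for y in assembly_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(read_counts, assembly_counts))
    if sxx == 0 or syy == 0:
        return None
    return (sxy * sxy) / (sxx * syy)


# Hypothetical counts for one locus: reads recover the assembly counts up to a scale factor.
print(aln_r2([1, 2, 3, 4], [10, 20, 30, 40]))
```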
the quality of a locus was measured by the mean aln-$r^2$ across individuals. loci with mean aln-$r^2$ below . were removed from the final call set. the final pan-genome graphs were used to genotype large illumina datasets, measure length prediction accuracy, analyze population structures and map eqtl.

predicting vntr lengths: read depths at vntr regions usually vary considerably from locus to locus. furthermore, the sampling bias also differs between sequencing runs, which limits our ability to genotype the accurate length of vntrs. to account for this, we compute locus-specific biases (lsbs) $b_s$ for each sample $s$, a tuple of (genome $g$, sequencing run), as follows:

$$b_s = \dfrac{kms_s}{cov_s \times L_g}$$

where $L_g$ is the ground truth vntr lengths of , loci in genome $g$; $kms_s$ is the sum of k-mer counts in each locus mapped by sample $s$; $cov_s$ is the global read depth of sample $s$ estimated by averaging the read depths of unique regions without any types of repeats or duplications.
the ground truth vntr length of a locus $l$ in genome $g$ is averaged across haplotypes:

$$L_{g,l} = \dfrac{1}{H}\sum_{h=1}^{H} L_{g,h,l}$$

where $H$ is the number of haplotype(s) in genome $g$, i.e. 2 for normal individuals and 1 for complete hydatidiform mole (chm) samples. with the above bias terms, the vntr length of locus $l$ in sample $s$ can be computed by:

$$L_{s,l} = \dfrac{kms_{s,l}}{cov_s \times b_{\hat{s}}}$$

where $cov_s$ is the same as described above; $b_{\hat{s}}$ is the estimated lsbs computed from sample $\hat{s}$ with ground truth vntr lengths; $kms_{s,l}$ is the sum of k-mer counts of locus $l$ mapped by sample $s$. we assume the lsbs that best approximate $b_s$ come from samples within the same sequencing run. without prior knowledge on the ground truth vntr lengths of $s$ and therefore $b_s$, we determine the “closest” sample $\hat{s}$ w.r.t. $s$ based on the $r^2$ between the read depths, $RD$, of the unique regions as follows:

$$\hat{s} = \operatorname*{argmax}_{s' \in GT,\ s' \neq s} r^2(RD_{s'}, RD_s)$$

where $GT$ is the set of samples with ground truths and within the same sequencing run as $s$. we cross-validate our approach by leaving one sample out of the pan-genome database and evaluating the prediction accuracy on the excluded sample.

for comparison, vntr lengths were also estimated by a read depth method. for each vntr region, the read depth, computed with samtools bedcov -j, was divided by the global read depth, computed from the nonrepetitive regions, to give the length estimate.
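the length prediction steps above can be sketched in a few lines of python. this is an illustration under stated assumptions rather than the danbing-tk code: all function names and the toy numbers are hypothetical, the bias is stored as one value per locus, and $r^2$ is taken as the squared pearson correlation between the read-depth vectors of the unique regions.

```python
def locus_specific_bias(kms, cov, true_lengths):
    """b_s for each locus of a sample with ground truth: kms_s / (cov_s * L_g)."""
    return [k / (cov * length) for k, length in zip(kms, true_lengths)]


def r2(x, y):
    """Squared Pearson correlation between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def closest_sample(rd_s, rd_by_reference):
    """Pick the ground-truth sample whose unique-region read depths best match sample s."""
    return max(rd_by_reference, key=lambda name: r2(rd_by_reference[name], rd_s))


def predict_lengths(kms, cov, bias):
    """L_{s,l} = kms_{s,l} / (cov_s * b_{s_hat}) for every locus l."""
    return [k / (cov * b) for k, b in zip(kms, bias)]


# Hypothetical reference samples from the same sequencing run, two loci each.
bias_by_reference = {
    "ref_a": locus_specific_bias(kms=[3300, 5400], cov=30, true_lengths=[100, 200]),
    "ref_b": locus_specific_bias(kms=[2500, 7000], cov=25, true_lengths=[100, 200]),
}
rd_by_reference = {"ref_a": [1, 2, 3, 4], "ref_b": [4, 1, 3, 2]}

s_hat = closest_sample([2, 4, 6, 8.5], rd_by_reference)
lengths = predict_lengths(kms=[4400, 1800], cov=20, bias=bias_by_reference[s_hat])
print(s_hat, lengths)
```

borrowing the bias from the most similar sample in the same run is what lets the method estimate lengths for samples that have no assembly of their own.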
comparing with graphaligner: the compact de bruijn graph of each vntr locus was generated with bcalm v . .
(option: -kmer-size -abundance-min ) using the vntr sequences from all assemblies as input. gfa files were then reindexed and concatenated to generate the rpggs for , loci. error-free paired-end reads were simulated from all vntr regions at x coverage with bp read length and bp insert size ( bp gap between each end). reads were aligned to the rpgg using graphaligner v . . with option -x dbg --seeds-minimizer-length . reads with alignment identity > % were counted from the output gam file. to compare in a similar setting, danbing-tk was run with option -gc -thcth -k -cth -rth . to assert > % identity for all reads aligned, given that (read_length - kmer_size + ) × . = .

v_st calculation: v_st was calculated according to (redon et al.):

$$v_{st} = \max\left(0,\ \dfrac{var_{all} - \frac{1}{n}\sum_{p \in P} var_p \times n_p}{var_{all}}\right)$$

where $var_{all}$ is the variance of dosage across all individuals, $var_p$ and $n_p$ are the variance and the number of individuals within population $p \in P$, and $n$ is the total number of individuals. top v_st loci were considered as the sites with v_st at least three standard deviations above the mean.

identifying unstable loci: a locus was annotated as a candidate for being unstable if at least one individual had outlying k-mer dosage ≥ six standard deviations above the mean, using population and locus specific summary statistics on data discarding individuals with zero no individuals had dosage less than or a bimodal distribution was not detected (diptest v . - , p > . ). among this set, the number of times each genome appeared as an outlier was used to select a set of genomes with an overabundant contribution to fragile loci. any candidate locus with an individual that was an outlier in at least four other loci was removed from the candidate list. the loci were compared to gencode v , excluding readthrough, pseudogenes, noncoding rna, and nonsense transcripts.
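the v_st statistic used in this section, following redon et al., compares the variance of dosage across all individuals with the size-weighted average of the within-population variances, floored at zero. a minimal python sketch for a single locus is given below. the function name and the toy populations are hypothetical, and the use of population (rather than sample) variance is an assumption of this sketch.

```python
from statistics import pvariance


def v_st(dosage_by_population):
    """max(0, (var_all - sum_p(var_p * n_p) / n) / var_all) for one locus."""
    all_values = [v for values in dosage_by_population.values() for v in values]
    n = len(all_values)
    var_all = pvariance(all_values)
    if var_all == 0:
        return 0.0  # no variation at all, so no population stratification
    within = sum(pvariance(values) * len(values)
                 for values in dosage_by_population.values()) / n
    return max(0.0, (var_all - within) / var_all)


# Hypothetical dosages: fully stratified versus identical populations.
print(v_st({"pop_a": [1, 1, 1], "pop_b": [3, 3, 3]}))
print(v_st({"pop_a": [1, 3], "pop_b": [1, 3]}))
```

a value near 1 means that almost all of the variance lies between populations, while a value near 0 means that the populations are indistinguishable at that locus.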
identifying differential motif usage and expansion: sample outliers in the genomes were detected from the read sampling biases over control regions and the tr dosages over , loci using dbscan. a total of / , samples were removed from downstream analysis. we use the eas population as the reference for measuring differential motif usage and expansion. initially, a lasso fit using the statsmodels.api.ols function in python statsmodels v . . (seabold and perktold) was performed for each locus to identify the k-mer with the most variance explained (vex) in vntr lengths using the following formula:

$$y = Xb + \epsilon$$

where $y \in \mathbb{R}^n$ is the vntr length of $n$ individuals in the eas population; $X \in \mathbb{R}^{n \times m}$ is the k-mer dosage matrix for $n$ individuals with $m$ k-mers; $b \in \mathbb{R}^m$ is the model coefficient, and $\epsilon \sim N(0, \sigma^2)$ is the error term. the lasso penalty weight $\alpha$ was scanned starting at . with a step size of − . until at least one covariate has a positive weight or $\alpha$ is below . . the k-mer with the highest weight is denoted as the most informative k-mer (mi-kmer) for the locus.

to identify loci with differential motif usage between populations, we subtracted the median count of the mi-kmer of the afr from the eas population for each locus, denoted as $kmc_d$. the null distribution of $kmc_d$ was estimated by bootstrap. specifically, eas individuals were sampled with replacement $n_{eas} + n_{afr}$ times, matching the sample sizes of the eas and afr populations, respectively. the bootstrap statistics, $kmc_d^*$, were computed by subtracting the median count of the mi-kmer of the last $n_{afr}$ from the first $n_{eas}$ bootstrap samples for each locus.
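the bootstrap for $kmc_d$ can be sketched as follows for a single locus. the function names, the number of bootstrap replicates and the toy counts are hypothetical, and the empirical two-sided p-value with a +1 correction is one common convention assumed here, since the text above does not spell out how the threshold is derived from the null distribution.

```python
import random
from statistics import median


def kmc_d(eas_counts, afr_counts):
    """Observed statistic: median mi-kmer count in EAS minus median in AFR."""
    return median(eas_counts) - median(afr_counts)


def bootstrap_null(eas_counts, n_eas, n_afr, n_boot, rng):
    """Null kmc_d* values: both pseudo-populations are resampled from EAS only."""
    null = []
    for _ in range(n_boot):
        draw = rng.choices(eas_counts, k=n_eas + n_afr)  # sampling with replacement
        null.append(median(draw[:n_eas]) - median(draw[n_eas:]))
    return null


def two_sided_p(observed, null):
    """Empirical two-sided p-value with a +1 correction (an assumption of this sketch)."""
    extreme = sum(1 for value in null if abs(value) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)


# Hypothetical mi-kmer counts at one locus.
eas = [10, 11, 12, 9, 10, 11, 10, 12, 9, 11]
afr = [2, 3, 2, 1, 3, 2, 2, 3]

observed = kmc_d(eas, afr)
null = bootstrap_null(eas, len(eas), len(afr), n_boot=200, rng=random.Random(0))
print(observed, two_sided_p(observed, null))
```

because both halves of every bootstrap draw come from the eas population, the null distribution reflects sampling noise alone. the same procedure applies to $r^2_d$ by replacing the median count with the proportion of vex.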
the estimated null distribution is then used to determine the threshold for calling a locus having significant differential motif usage between populations (two-sided p < . ). to identify loci with differential motif expansion between populations, we subtracted the proportion of vex by mi-kmer in the afr from the eas population, denoted as $r^2_d$. the null distribution of $r^2_d$ was estimated by bootstrap in a similar sampling procedure as for the motif usage statistic $kmc_d$, except for subtracting the proportion of vex by the mi-kmer in the last $n_{afr}$ from the first $n_{eas}$ bootstrap samples for each locus. the estimated null
distribution is used to determine the threshold for calling a locus having significant differential motif expansion between populations (two-sided p < . ).

eqtl mapping

retrieving datasets: wgs datasets of individuals, normalized gene expression matrices and covariates of all tissues are accessed from the gtex analysis v (dbgap accession phs .v .p ).

genotype data preprocessing: vntr lengths are genotyped using danbing-tk with options: -gc -thcth -cth -rth . . all the k-mer counts of a locus are summed and adjusted by global read depth and ploidy to represent the approximate length of a locus. sample outliers were detected from the read sampling biases over control regions and the tr dosages over , loci using dbscan. a total of / samples were removed from downstream analysis. adjusted values are then z-score normalized as input for eqtl mapping.

expression data preprocessing: the downloaded expression matrices are already preprocessed such that outliers are rejected and expression counts are quantile normalized to a standard normal distribution. confounding factors such as sex, sequencing platform, amplification method, technical variations and population structure are removed prior to eqtl mapping to avoid spurious associations. technical variations are corrected with the covariates, including peer factors, provided by the gtex consortium. population structure is corrected with the top principal components (pcs) from the snp matrix of all samples. in particular, principal component analysis (pca) was performed jointly on the intersection of the snp sets from gtex samples and kgp omni . snp genotyping arrays (ftp://ftp. genomes.ebi.ac.uk/vol /ftp/release/ /supporting/hd_genotype_chip/all.chip.omni_broad_sanger_combined.
.snps.genotypes.vcf.gz). this is done by first using crossmap v . . to liftover the snp sites from omni . arrays to grch , followed by extracting the intersection of the two snp sets using vcftools isec. the snp set is further reduced by ld-pruning with plink v . b . using the options: --indep , leaving a total of , sites. finally, pca on the joint snp matrix was done by smartpca v . the normalized expression matrices are residualized with the above covariates using the following formula: $Y = (I - H)Y'$, $H = C(C^{T}C)^{-1}C^{T}$, where $Y$ is the residualized expression matrix; $Y'$ is the normalized expression matrix; $H$ is the projection matrix; $I$ is the identity matrix; $C$ is the covariate matrix where each column corresponds to a covariate mentioned above. the residualized expression values are z-score normalized as the input of eqtl mapping.

association test: vntrs within kb to a gene are included for eqtl mapping. linear regression was done using the statsmodels.api.ols function in python statsmodels v . . (seabold and perktold ) with expression values as the dependent variable and genotype values as the independent variable. nominal p values are computed by performing t tests on the slope. adjusted p values are computed by bonferroni correction on nominal p values. under the assumption of at most one causal vntr per gene, we control the gene-level false discovery rate at %. specifically, the adjusted p values of the lead vntr for each gene are taken as input for the benjamini-hochberg procedure using statsmodels.stats.multitest.fdrcorrection v . . . lead vntrs that passed the procedure are identified as evntrs.
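the two-stage multiple-testing correction can be sketched in plain python as follows. this is an illustration, not the pipeline's code: the function names and input layout are hypothetical, the bonferroni factor is assumed to be the number of vntrs tested for each gene, and the benjamini-hochberg step is written out by hand rather than calling statsmodels so that the logic is visible. the fdr level is a parameter because its value is not recoverable here.

```python
def lead_vntr_per_gene(nominal_p):
    """{gene: {vntr: nominal p}} -> {gene: (lead vntr, Bonferroni-adjusted p)}."""
    lead = {}
    for gene, tests in nominal_p.items():
        vntr, p = min(tests.items(), key=lambda item: item[1])
        # Assumed correction: multiply by the number of VNTRs tested for this gene.
        lead[gene] = (vntr, min(1.0, p * len(tests)))
    return lead


def benjamini_hochberg(p_by_gene, fdr):
    """Genes passing the Benjamini-Hochberg step-up procedure at level fdr."""
    ranked = sorted(p_by_gene.items(), key=lambda item: item[1])
    n_tests = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= fdr * rank / n_tests:
            cutoff = rank  # keep the largest rank that satisfies the bound
    return {gene for gene, _ in ranked[:cutoff]}
```

taking only the lead vntr of each gene into the second stage is what makes the false discovery rate a gene-level quantity.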
data availability the overall analysis pipeline is delivered in a software package at https://github.com/chaissonlab/danbing-tk . genomes acknowledgement: .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://www.codecogs.com/eqnedit.php?latex=y% d(i-h)y% # https://www.codecogs.com/eqnedit.php?latex=h% dc(c% etc)% e% b- % dc% et# https://www.codecogs.com/eqnedit.php?latex=y# https://www.codecogs.com/eqnedit.php?latex=y% # https://www.codecogs.com/eqnedit.php?latex=h# https://www.codecogs.com/eqnedit.php?latex=i# https://www.codecogs.com/eqnedit.php?latex=c# https://github.com/chaissonlab/danbing-tk https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / the following cell lines/dna samples were obtained from the nigms human genetic cell repository at the coriell institute for medical research: [na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na . na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na ,, na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na ]. these data were generated at the new york genome center with funds provided by nhgri grant um hg - s . data accession ids are given in supplementary table s . references. genomes project consortium, adam auton, lisa d. brooks, richard m. durbin, erik p. garrison, hyun min kang, jan o. korbel, et al. . “a global reference for human genetic variation.” nature ( ): – . 
audano, peter a., arvis sulovari, tina a. graves-lindsay, stuart cantsilieris, melanie sorensen, annemarie e. welch, max l. dougherty, et al. . “characterizing the major structural variant alleles of the human genome.” cell ( ): – .e . bakhtiari, mehrdad, jonghun park, yuan-chun ding, sharona shleizer-burko, susan l. neuhausen, bjarni v. halldórsson, kári stefánsson, melissa gymrek, and vineet bafna. . “variable number tandem repeats mediate the expression of proximal genes.” biorxiv . https://doi.org/ . / . . . . bakhtiari, mehrdad, sharona shleizer-burko, melissa gymrek, vikas bansal, and vineet bafna. . “targeted genotyping of variable number tandem repeats with advntr.” genome research ( ): – . benson, g. . “tandem repeats finder: a program to analyze dna sequences.” nucleic acids research . https://doi.org/ . /nar/ . . . braida, claudia, rhoda k. a. stefanatos, berit adam, navdeep mahajan, hubert j. m. smeets, florence niel, cyril goizet, et al. . “variant ccg and ggc repeats within the ctg expansion dramatically modify mutational dynamics and likely contribute toward unusual symptoms in some myotonic dystrophy type patients.” human molecular genetics ( ): – .
chaisson, mark j. p., ashley d. sanders, xuefang zhao, ankit malhotra, david porubsky, tobias rausch, eugene j. gardner, et al. . “multi-platform discovery of haplotype-resolved structural variation in human genomes.” nature communications ( ): . chen, sai, peter krusche, egor dolzhenko, rachel m. sherman, roman petrovski, felix schlesinger, melanie kirsche, et al. . “paragraph: a graph-based structural variant genotyper for short-read sequence data.” genome biology ( ): . chin, chen-shan, paul peluso, fritz j. sedlazeck, maria nattestad, gregory t. concepcion, alicia clum, christopher dunn, et al. .
“phased diploid genome assembly with single-molecule real-time sequencing.” nature methods ( ): – . consortium, gtex, and gtex consortium. . “genetic effects on gene expression across human tissues.” nature . https://doi.org/ . /nature . consortium, international human genome sequencing, and international human genome sequencing consortium. . “initial sequencing and analysis of the human genome.” nature . https://doi.org/ . / . dolzhenko, egor, viraj deshpande, felix schlesinger, peter krusche, roman petrovski, sai chen, dorothea emig-agius, et al. . “expansionhunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions.” bioinformatics ( ): – . du, zhenglin, liang ma, hongzhu qu, wei chen, bing zhang, xi lu, weibo zhai, et al. . “whole genome analyses of chinese population and de novo assembly of a northern han genome.” genomics, proteomics & bioinformatics ( ): – . eggertsson, hannes p., snaedis kristmundsdottir, doruk beyter, hakon jonsson, astros skuladottir, marteinn t. hardarson, daniel f. gudbjartsson, kari stefansson, bjarni v. halldorsson, and pall melsted. . “graphtyper enables population-scale genotyping of structural variation using pangenome graphs.” nature communications . https://doi.org/ . /s - - - . fairley, susan, ernesto lowy-gallego, emily perry, and paul flicek. . “the international genome sample resource (igsr) collection of open human genomic variation resources.” nucleic acids research (d ): d – . fotsing, stephanie feupe, jonathan margoliash, catherine wang, shubham saini, richard yanicky, sharona shleizer-burko, alon goren, and melissa gymrek. . “the impact of short tandem repeat variation on gene expression.” nature genetics ( ): – . franke, andre, dermot p. b. mcgovern, jeffrey c. barrett, kai wang, graham l. radford-smith, tariq ahmad, charlie w. lees, et al. . “genome-wide meta-analysis increases to the number of confirmed crohn’s disease susceptibility loci.” nature genetics ( ): – . 
garrison, erik, jouni sirén, adam m. novak, glenn hickey, jordan m. eizenga, eric t. dawson, william jones, et al. . “variation graph toolkit improves read mapping by representing genetic variation in the reference.” nature biotechnology ( ): – . gatchel, jennifer r., and huda y. zoghbi. . “diseases of unstable repeat expansion: mechanisms and common principles.” nature reviews. genetics ( ): – . gymrek, melissa, thomas willems, audrey guilmatre, haoyang zeng, barak markus, stoyan georgiev, mark j. daly, et al. . “abundant contribution of short tandem repeats to gene expression variation in humans.” nature genetics ( ): – . gymrek, melissa, thomas willems, david reich, and yaniv erlich. . “interpreting short tandem repeat variations in humans using mutational constraint.” nature genetics . https://doi.org/ . /ng. . hannan, anthony j. . “tandem repeat polymorphisms: modulators of disease susceptibility and candidates for ‘missing heritability.’” trends in genetics . https://doi.org/ . /j.tig. . . . hannan, anthony j. . “tandem repeats mediating genetic plasticity in health and disease.” nature reviews. genetics ( ): – . hickey, glenn, david heller, jean monlong, jonas a. sibbesen, jouni sirén, jordan eizenga, eric t. dawson,
erik garrison, adam m. novak, and benedict paten. . “genotyping structural variants in pangenome graphs using the vg toolkit.” genome biology ( ): . iqbal, zamin, mario caccamo, isaac turner, paul flicek, and gil mcvean. .
“de novo assembly and genotyping of variants using colored de bruijn graphs.” nature genetics ( ): – . iqbal, zamin, isaac turner, and gil mcvean. . “high-throughput microbial population genomics using the cortex variation assembler.” bioinformatics . https://doi.org/ . /bioinformatics/bts . jiang, zhaoshi, haixu tang, mario ventura, maria francesca cardone, tomas marques-bonet, xinwei she, pavel a. pevzner, and evan e. eichler. . “ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution.” nature genetics ( ): – . kolmogorov, mikhail, jeffrey yuan, yu lin, and pavel a. pevzner. . “assembly of long, error-prone reads using repeat graphs.” nature biotechnology ( ): – . koolen, d. a., a. j. sharp, j. a. hurst, h. v. firth, s. j. l. knight, a. goldenberg, p. saugier-veber, et al. . “clinical and molecular delineation of the q . microdeletion syndrome.” journal of medical genetics ( ): – . koren, sergey, brian p. walenz, konstantin berlin, jason r. miller, nicholas h. bergman, and adam m. phillippy. . “canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.” genome research ( ): – . lapierre, nathan, kodi taraszka, helen huang, rosemary he, farhad hormozdiari, and eleazar eskin. n.d. “identifying causal variants by fine mapping across multiple studies.” https://doi.org/ . / . . . . li, heng, jonathan m. bloom, yossi farjoun, mark fleharty, laura gauthier, benjamin neale, and daniel macarthur. n.d. “new synthetic-diploid benchmark for accurate variant calling evaluation.” https://doi.org/ . / . li, heng, xiaowen feng, and chong chu. . “the design and construction of reference pangenome graphs with minigraph.” genome biology ( ): . mallick, swapan, heng li, mark lipson, iain mathieson, melissa gymrek, fernando racimo, mengyao zhao, et al. . “the simons genome diversity project: genomes from diverse populations.” nature ( ): – . 
mousavi, nima, sharona shleizer-burko, richard yanicky, and melissa gymrek. . “profiling the genome-wide landscape of tandem repeat expansions.” nucleic acids research ( ): e . paten, benedict, adam m. novak, jordan m. eizenga, and erik garrison. . “genome graphs and the evolution of genome inference.” genome research ( ): – . pevzner, pavel a., haixu tang, and glenn tesler. . “de novo repeat classification and fragment assembly.” genome research ( ): – . porubsky, david, shilpa garg, ashley d. sanders, jan o. korbel, victor guryev, peter m. lansdorp, and tobias marschall. . “dense and accurate whole-chromosome haplotyping of individual genomes.” nature communications ( ): . porubsky, david, human genome structural variation consortium, peter ebert, peter a. audano, mitchell r. vollger, william t. harvey, pierre marijon, et al. . “fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.” nature biotechnology . https://doi.org/ . /s - - - . rakocevic, goran, vladimir semenyuk, wan-ping lee, james spencer, john browning, ivan j. johnson, vladan arsenijevic, et al. . “fast and accurate genomic analyses using genome graphs.” nature genetics . https://doi.org/ . /s - - - . raphael, benjamin, degui zhi, haixu tang, and pavel pevzner. . “a novel method for multiple alignment of sequences with repeated and shuffled elements.” genome research ( ): – . rautiainen, mikko, veli mäkinen, and tobias marschall. . “bit-parallel sequence-to-graph alignment.” bioinformatics ( ): – .
redon, richard, shumpei ishikawa, karen r. fitch, lars feuk, george h. perry, t. daniel andrews, heike fiegler, et al. . “global variation in copy number in the human genome.” nature ( ): – .
saini, shubham, ileena mitra, nima mousavi, stephanie feupe fotsing, and melissa gymrek. . “a reference haplotype panel for genome-wide imputation of short tandem repeats.” nature communications ( ): . seo, jeong-sun, arang rhie, junsoo kim, sangjin lee, min-hwan sohn, chang-uk kim, alex hastie, et al. . “de novo assembly and phasing of a korean human genome.” nature ( ): – . shi, lingling, yunfei guo, chengliang dong, john huddleston, hui yang, xiaolu han, aisi fu, et al. . “long-read sequencing and de novo assembly of a chinese genome.” nature communications (june): . song, janet h. t., craig b. lowe, and david m. kingsley. . “characterization of a human-specific tandem repeat associated with bipolar disorder and schizophrenia.” american journal of human genetics ( ): – . sudmant, peter h., swapan mallick, bradley j. nelson, fereydoun hormozdiari, niklas krumm, john huddleston, bradley p. coe, et al. . “global diversity, population stratification, and selection of human copy-number variation.” science ( ): aab . taliun, daniel, daniel n. harris, michael d. kessler, jedidiah carlson, zachary a. szpiech, raul torres, sarah a. gagliano taliun, et al. . “sequencing of , diverse genomes from the nhlbi topmed program.” biorxiv . https://doi.org/ . / . viguera, e., d. canceill, and s. d. ehrlich. . “replication slippage involves dna polymerase pausing and dissociation.” the embo journal ( ): – . wellcome trust case control consortium, australo-anglo-american spondylitis consortium (tasc), paul r. burton, david g. clayton, lon r. cardon, nick craddock, panos deloukas, et al. . “association scan of , nonsynonymous snps in four diseases identifies autoimmunity variants.” nature genetics ( ): – . witoelar, aree, iris e. jansen, yunpeng wang, rahul s. desikan, j. raphael gibbs, cornelis blauwendraat, wesley k. thompson, et al. . “genome-wide pleiotropy between parkinson disease and autoimmune diseases.” jama neurology ( ): – . 
ye, chun jimmie, jenny chen, alexandra-chloé villani, rachel e. gate, meena subramaniam, tushar bhangale, mark n. lee, et al. . “genetic analysis of isoform usage in the human anti-viral response reveals influenza-specific regulation of transcripts under balancing selection.” genome research ( ): – . zook, justin m., nancy f. hansen, nathan d. olson, lesley chapman, james c. mullikin, chunlin xiao, stephen sherry, et al. . “a robust benchmark for detection of germline large deletions and insertions.” nature biotechnology , june. https://doi.org/ . /s - - - .

author contributions. t.y.l. and m.j.p.c. performed data analysis and wrote the manuscript. m.j.p.c. supervised the work. hgsvc generated sequencing data.
a read count-based method to detect multiplets and their cellular origins from snatac-seq data

asa thibodeau*, alper eroglu*, nathan lawlor, djamel nehar-belaid, romy kursawe, radu marches, george a. kuchel, jacques banchereau, michael l. stitzel, a. ercument cicek, duygu ucar

the jackson laboratory for genomic medicine, farmington, ct, , usa university of connecticut center on aging, uconn health center, farmington, ct, , usa department of genetics and genome sciences, university of connecticut health center, farmington, ct, , usa institute for systems genomics, university of connecticut health center, farmington, ct, , usa.
computer engineering department, bilkent university, ankara, , turkey computational biology department, carnegie mellon university, pittsburgh, pa, , usa * these authors contributed equally to this work. correspondence: duygu.ucar@jax.org abstract similar to other droplet-based single cell assays, single nucleus atac-seq (snatac-seq) data harbor multiplets that confound downstream analyses. detecting multiplets in snatac-seq data is particularly challenging due to its sparsity and trinary nature ( reads: closed chromatin, : open in one allele, : open in both alleles), yet offers a unique opportunity to infer multiplets when > uniquely aligned reads are observed at multiple loci. here, we implemented the first read count-based multiplet detection method, atac-doubletdetector, that detects multiplets independently of cell-type. using pbmc and pancreatic islet datasets, atac-doubletdetector captured simulated heterotypic multiplets (different cell-types) with ~ . recall, showing ~ % improvement over state of the art. atac-doubletdetector detected homotypic multiplets with ~ . recall, representing the first method to detect multiplets originating from the same cell type. using our novel clustering-based algorithm, multiplets were annotated to their cellular origins with ~ % accuracy. application of atac-doubletdetector will improve downstream analysis of snatac-seq. (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . main single nucleus atac-seq (snatac-seq) – technology is widely used to study epigenomes of diverse cells and tissues with increased resolution , . however, as with other droplet based single cell technologies, snatac-seq data harbor multiplet nuclei . 
the presence of multiplets can confound downstream analyses by introducing combined epigenomic profiles that originate from two or more nuclei, increasing the difficulty of clustering and comparing different cell types within a sample. compared to other single cell assays, the difficulty of detecting multiplets in snatac-seq is further increased due to data sparsity and the trinary nature of chromatin accessibility levels (e.g., 0 reads: closed chromatin, 1: open in one allele, 2: open in both alleles). the current state-of-the-art approaches for detecting multiplets in snatac-seq data adapt detection methods developed for single cell rna-seq (scrna-seq). notably, two snatac-seq data analysis packages, snapatac and archr, either employ or implement a method similar to multiplet detection methods for scrna-seq (i.e., doubletfinder and scrublet). in these methods, synthetic heterotypic multiplets (i.e., originating from different cell types) are simulated by combining profiles of two or more cells, which are then used to detect putative multiplets based on cluster similarity. such algorithms assume that multiplets and singlets exhibit distinct genomic profiles, which becomes problematic when true singlets share genomic profiles with two or more cell types. under this assumption, these methods will fail to detect homotypic multiplets (i.e., originating from the same cell type) since their overall genomic profile is considered to be similar to that of the underlying cell type. however, homotypic multiplets are characterized by increased read counts compared to singlets, suggesting that new methods that utilize read counts can detect them. to overcome the limitations of existing methods and detect both homotypic and heterotypic multiplets, we developed a novel multiplet detection method, atac-doubletdetector, that exploits read count distributions to infer multiplets in snatac-seq data.
atac-doubletdetector’s efficacy was tested in two snatac-seq datasets generated from peripheral blood mononuclear cell (pbmc) samples (n= ) and pancreatic islet tissues (n= ). we identified multiplets in these tissues and quantified the algorithm’s efficacy using simulated homotypic and heterotypic multiplets. we found that when snatac-seq samples were adequately sequenced (e.g., > k valid read pairs per cell), atac-doubletdetector proved very effective for detecting both homotypic and heterotypic multiplets (recall ranging from . - . in pbmcs). in addition, atac-doubletdetector includes a novel clustering-based algorithm that accurately annotates the cellular origins of detected multiplets ( % average accuracy in our simulations), providing further data quality insights. atac-doubletdetector is provided as a user-friendly computational framework with documentation and source code freely available at: https://github.com/ucarlab/atac-doubletdetector.

results

atac-doubletdetector leverages the fact that the expected number of uniquely aligned reads for a given locus ranges from 0 to 2 per nucleus in snatac-seq data: 0 = closed chromatin, 1 = open in one allele (i.e., from either maternal or paternal chromosomes), 2 = open in two alleles (i.e., both maternal and paternal chromosomes) (fig. a). a locus can have more than two reads (>2) when: 1) it contains repetitive sequences; 2) there are sequencing or alignment errors; or 3) reads stem from multiplet nuclei. in the case of multiplets, we expect to observe many loci with >2 reads since their epigenomic profiles are derived from two or more nuclei, resulting in increased accessible dna.
atac-doubletdetector identifies all loci with >2 reads for each cell/nucleus (fig. b) by utilizing sorted read alignments to detect their overlapping read intervals ( - bp on average across all samples). a unified list of these loci across all nuclei is then generated to quantify the number of occurrences where >2 reads align to a locus in a given nucleus (fig. c). as a proof of concept, highly significant multiplets (p-values < - ) clearly harbor many more loci with >2 reads ( - loci) than average (~ loci per nucleus) (extended data fig. ). random occurrences of loci with >2 reads (i.e., due to sequencing or alignment errors) were modeled with the poisson cumulative distribution function using the mean number of overlaps detected across all cells. nuclei that harbor significantly more loci with >2 reads are identified as multiplets based on their deviations from this distribution, using the false discovery rate (fdr) (fig. c). to trace multiplets back to their cellular origins, we employed a clustering-based algorithm as part of the atac-doubletdetector framework. marker peaks are detected to generate reference accessibility profiles for each cell type using single cell clustering. epigenomic similarity scores at marker peaks are then used to compare multiplet profiles with singlet profiles to differentiate between heterotypic and homotypic multiplets and annotate them. we demonstrate the utility and performance of our computational framework by applying our methods to pbmc and islet sample datasets (fig. d). first, we simulated artificial multiplets in pbmc and islet samples and quantified atac-doubletdetector’s ability to identify and annotate these multiplets. second, we compared atac-doubletdetector to archr, measuring their overall performances and their ability to detect simulated heterotypic and homotypic multiplets. finally, we measured the efficacy of our annotation method and analyzed multiplet cellular origins to understand whether cell type influences the rate of multiplet occurrences.

atac-doubletdetector detects heterotypic and homotypic multiplets in pbmc and islet samples.

we generated snatac-seq libraries from two human pbmc and two human pancreatic islet samples using the 10x genomics chromium platform. sequence reads were preprocessed using the cell ranger atac pipeline (methods), resulting in an average of , and , nuclei per sample and an average of , and , valid read pairs per cell for pbmc and islet samples respectively (fig. a). valid read pairs refer to all pairs of paired-end reads that align to autosomes and pass quality control flags/thresholds (methods). despite deeper sequencing for islet samples, fewer valid read pairs were observed in islet samples compared to pbmc samples (fig. b), which can be explained by increased mitochondrial reads in islets ( , , and , , total reads aligned to chrm) compared to pbmcs ( , , and , total reads aligned to chrm). nuclei clustering using an in-house implementation (methods) of a two-pass clustering method for snatac-seq data identified and clusters for pbmc and pbmc . correlating pseudo-bulk accessibility profiles of these clusters with accessibility maps from sorted bulk atac-seq data (extended data fig. a,b) grouped them into major cell types: myeloid (including cd +, cd monocytes and conventional dendritic cells), b, cd + t, cd + t, and nk cells (extended data fig. c,d). these annotations were confirmed based on chromatin accessibility patterns at cell-specific marker genes (extended data fig. a,b).
the same clustering procedure identified and distinct clusters for islet and islet , which were then annotated as alpha, beta, delta, and ductal cells by integrating their accessibility profiles with in-house islet scrna-seq data (extended data fig. a,b). these annotations were confirmed by analyzing the chromatin accessibility patterns at known cell-specific marker genes (extended data fig. c,d). we applied atac-doubletdetector to pbmc and human islet samples using an fdr cutoff of . (methods). nuclei detected as multiplets were distributed throughout all clusters (fig. c-d, extended data fig. ) and in one case (pbmc ) multiplets formed their own distinct cluster (see selected multiplets in fig. d). the percentage of detected multiplets was higher in pbmcs ( %, . %) compared to islets ( % for both samples) (fig. e), which is likely due to the lower number of valid read pairs per nucleus in islets, as previously mentioned (fig. b). to further study the biological relevance of these detected multiplets, we selected a cluster which exclusively encompassed multiplets (fig. d; pbmc selected multiplets) and analyzed their chromatin accessibility profiles (fig. f). the selected multiplets were characterized by high chromatin accessibility at the promoters of both cd g (t cell marker gene) and lyz (monocyte marker gene), suggesting t cell-monocyte multiplets. these results demonstrate how read count distribution information from snatac-seq can be used to effectively detect multiplets.

atac-doubletdetector effectively detects simulated heterotypic and homotypic multiplets.
to quantify the efficacy of atac-doubletdetector, we generated artificial multiplets by randomly selecting % of nuclei in a sample and pairing them together to artificially form multiplets (repeated times per sample). this resulted in artificial multiplets at . % of the total number of nuclei within a sample. these artificial multiplets serve as positive multiplet examples and enable us to measure recall (i.e., the fraction of detected artificial multiplets among all artificial multiplets introduced in the sample). we first evaluated atac-doubletdetector’s ability to detect heterotypic multiplets, homotypic multiplets, and a combination of both multiplet types. we then compared its performance with that of another method, archr. atac-doubletdetector detected heterotypic multiplets introduced in pbmc samples with high recall (average recall . for pbmc and . for pbmc over runs), outperforming archr ( . and . respectively) (fig. a). average recall for atac-doubletdetector was lower in islet and islet than in pbmcs ( . and . average recall respectively), whereas the average recall showed improvement for archr ( . and . average recall respectively). the decreased performance of atac-doubletdetector in islets can be explained by the low number of valid read pairs per nucleus in islet samples compared to pbmcs (fig. b). notably, atac-doubletdetector was equally effective for detecting homotypic multiplets (average recall . and . for pbmc and pbmc , . and . for islet and islet ) (fig. b), demonstrating the utility of using read counts to detect multiplets. as expected, archr had low recall for detecting homotypic multiplets (average between . and . for all samples), as this algorithm identifies multiplets with genomic profiles distinct from singlets. finally, we measured the efficacy of simultaneously detecting both types of multiplets by introducing a more realistic : ratio of heterotypic to homotypic multiplets (extended data fig. a).
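The simulation and recall measurement described above can be sketched in a few lines. This is only an illustration under stated assumptions: each nucleus is represented as a list of reads keyed by barcode, and the names `make_artificial_multiplets`, `recall`, and the `fraction` parameter are hypothetical, not part of the atac-doubletdetector code.

```python
import random

def make_artificial_multiplets(nuclei, fraction, seed=0):
    """Pair a random subset of nuclei and merge each pair's reads.

    nuclei: dict mapping barcode -> list of reads (any objects).
    fraction: share of nuclei drawn for pairing (hypothetical parameter;
              the value used in the paper is not reproduced here).
    Returns a dict mapping a synthetic barcode -> merged read list.
    """
    rng = random.Random(seed)
    barcodes = sorted(nuclei)
    n_pick = int(len(barcodes) * fraction)
    n_pick -= n_pick % 2              # an even number is needed to form pairs
    picked = rng.sample(barcodes, n_pick)
    multiplets = {}
    for a, b in zip(picked[0::2], picked[1::2]):
        multiplets[f"{a}+{b}"] = nuclei[a] + nuclei[b]
    return multiplets

def recall(detected, artificial):
    """Fraction of artificial multiplets that appear among the detected calls."""
    artificial = set(artificial)
    if not artificial:
        return 0.0
    return len(artificial & set(detected)) / len(artificial)
```

Any detector that returns a set of flagged barcodes can be scored this way, which is why the sketch keeps the detector itself out of the picture.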
as expected, the average recall values of atac-doubletdetector were similar ( . and . for pbmc and pbmc , . and . for islet and islet respectively), while those of archr were lower ( . and . for pbmc and pbmc , . and . for islet and islet ), likely due to its poor homotypic multiplet detection performance. to further study how the number of valid read pairs influences atac-doubletdetector’s performance, we generated artificial multiplets using cells with a range of read counts per nucleus (fig. c-d, extended data fig. b). we observed a noticeable increase in average recall (> . recall) for atac-doubletdetector when the number of valid read pairs was above . k, corresponding to an average of . k valid read pairs per nucleus. in contrast, archr did not show significant differences in performance with respect to the number of valid read pairs per nucleus (extended data fig. b), as it relies more on genomic profile similarity to detect multiplets. more exhaustive analyses of repetitions per sample further confirmed that the majority ( %, % for pbmc and pbmc and %, % for islet and islet ) of multiplets with > k valid read pairs (i.e., multiplets formed from nuclei with k valid read pairs each) were detected with this method (extended data fig. ). together, these analyses suggest that when > k valid read pairs are captured per nucleus, atac-doubletdetector is very effective in detecting both homotypic and heterotypic multiplets from snatac-seq data. to compare atac-doubletdetector and archr performances, we ran archr with recommended parameter settings (i.e., k= nearest neighbors and . filter ratio). only to multiplets across all samples were detected by both methods (fig. e-f, extended data fig. , extended data fig.
a-b), and the majority of these multiplets were among the ones that formed their own clusters (i.e., heterotypic multiplets). for example, the majority of selected multiplets detected in cluster in fig. d were detected by both methods (extended data fig. ); these multiplets have unique epigenomic profiles and are hence easier to detect with the synthetic multiplet-based method employed by archr. notably, . % of delta cells were identified as multiplets by archr for islet (fig. f, extended data fig. ). delta cells resemble both alpha and beta cells in their genomic profile, hence these cells were mistakenly detected as multiplets by archr, demonstrating a pitfall of synthetic multiplet-based methods. multiplets are expected to have higher read counts than singlets since they combine chromatin accessibility profiles of more than one nucleus. in alignment with this, multiplets detected by atac-doubletdetector had significantly higher valid read pair counts compared to singlets (average valid read pairs of , for multiplets and , for singlets for all samples) (p-values < . x - ). in contrast, read counts for archr multiplets were significantly lower (average p-values < . x - ) than those of atac-doubletdetector multiplets, with read counts closer to those of singlets (average read count per cell , for archr multiplets and , for singlets) (extended data fig. c). in summary, these analyses showed that when there is a sufficient number of valid read pairs per cell (> k), count-based methods are advantageous over synthetic multiplet-based methods as they can accurately detect both homotypic and heterotypic multiplets.

marker peaks can effectively annotate cellular origins of multiplets.
cellular origin annotations of multiplets were inferred using a three-step algorithm (fig. a). first, nuclei were clustered and annotated to their respective cell types. second, marker peaks were detected for each cluster/cell type. third, we calculated the epigenomic similarity of each multiplet to different cell types by counting marker peak reads for the multiplet and the k= nearest neighbor nuclei (methods). cluster similarity scores were then used to annotate multiplets. for example, in pbmcs, for each multiplet we calculated scores, where each score represents the similarity of the multiplet epigenome to that of one of the five studied clusters (fig. b). the distribution of these similarity scores is first used to distinguish heterotypic and homotypic multiplets, by comparing their profiles to annotated singlets (methods). for example, in pbmc , nuclei in the b cell cluster (cluster ) had a high similarity score for b cell marker peaks and low scores for all other cell types (fig. b). in contrast, nuclei in cluster had high similarity scores for nk, cd + t, cd + t and myeloid cells, a signature of heterotypic multiplets (fig. b). once the multiplet type is identified, the cellular origins are annotated using the highest scoring cell type(s). we evaluated the efficacy of this annotation pipeline using artificial multiplets, where cells were randomly selected and paired together to form both heterotypic and homotypic multiplets. using these artificial multiplets, we categorized multiplets as homotypic or heterotypic and annotated multiplets with respect to the number of cell types associated with them. we identified the cellular origins of both types of multiplets with an average accuracy of . %, . % in pbmc , pbmc and . %, . % in islet , islet (fig. c). for example, in pbmc , % of all simulated b and myeloid multiplets were correctly annotated.
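The scoring and annotation steps above can be illustrated with a small sketch. It assumes similarity scores are already available as a mapping from cell type to score, takes the homotypic/heterotypic decision as an input rather than fitting a mixture model, and assumes one origin for homotypic and two for heterotypic multiplets; the function names and the `n_types` parameter are hypothetical.

```python
import math

def mean_profile(score_dicts):
    """Average a list of {cell_type: score} dicts into one singlet profile."""
    keys = score_dicts[0].keys()
    return {t: sum(s[t] for s in score_dicts) / len(score_dicts) for t in keys}

def distance_to_top_type(scores, singlet_profiles):
    """Euclidean distance between a nucleus and the singlet profile of its
    highest-scoring cell type; small distances suggest a homotypic origin."""
    top = max(scores, key=scores.get)
    ref = singlet_profiles[top]
    return math.sqrt(sum((scores[t] - ref[t]) ** 2 for t in scores))

def annotate(scores, heterotypic, n_types=2):
    """Return the highest-scoring cell type(s) as the inferred origin.

    n_types is an assumed number of origins for heterotypic multiplets.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: n_types if heterotypic else 1]
```

In this sketch a multiplet whose scores sit close to one singlet profile keeps a single label, while one far from that profile is labeled with its top-scoring types.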
cell types that have similar functions, and hence similar epigenomes, showed lower annotation accuracies, such as % for simulated nk and cd + t cells. our annotations were equally effective for both homotypic and heterotypic multiplets, showing . % accuracy on average for homotypic multiplets and . % accuracy for heterotypic multiplets.

multiplet cell-type compositions reflect cellular compositions of the underlying tissue.

using atac-doubletdetector’s annotation pipeline, we annotated all detected multiplets in pbmcs and islets. inspection of aggregate accessibility profiles at marker gene promoters (ms a , cd g, cd , cd a, trem , nkg , and klrf ) for each cell type in pbmc (fig. a) revealed that annotated multiplets have accessibility at relevant marker gene promoters. for instance, homotypic b cell multiplets had strong signal at the promoter of b cell marker gene ms a , whereas heterotypic multiplets originating from cd + t cells and b cells had high accessibility signals for both b cell marker gene ms a and cd + t cell marker gene cd a. as expected, homotypic multiplets clustered together with the underlying cell type, whereas heterotypic multiplets typically formed their own clusters (fig. b-c, extended data fig. a-b). the majority of heterotypic multiplets for islet were found between major cell type clusters and near the delta cell cluster, while homotypic multiplets resided within the boundaries of single cell type clusters (fig. d). for pbmc , the majority of multiplets resided within the multiplet cluster we previously identified and as a subcluster of cd + t cells (fig. e). as before, homotypic multiplets were found within corresponding cell type clusters.
overall, the majority of detected multiplets were homotypic ( . - . % in islets, - . % in pbmcs), with cell types being distributed with respect to their cell proportions for both homotypic and heterotypic multiplet types (fig. d-e, extended data fig. c-d). indeed, in both tissues, the propensity of a cell type to form a multiplet was positively correlated with the percent of that cell type within the tissue (pearson’s r = . , . , p-value < . , . for pbmc and pbmc , pearson’s r = . , . p-value < . , . for islet and islet ) (fig. f-g, extended data fig. e-f), suggesting that snatac-seq multiplets are more likely to occur randomly than through specific interactions between nuclei. for example, the most abundant cell type in islet was beta cells ( . % of the cell population), which contributed to . % of multiplets (fig. f). heterotypic multiplet annotations in islet samples mostly originated from alpha, beta and delta cells. in pbmcs, the most frequent heterotypic multiplets were the ones stemming from cd + t and cd + t cells (fig. f, extended data fig. e).

discussion

detecting and discarding multiplets from snatac-seq data is a critical step for improving data quality, as multiplets can form their own clusters and can confound downstream analyses. atac-doubletdetector exploits read count distributions for a given nucleus to effectively detect and eliminate multiplets without requiring prior knowledge of cell-type information. it accomplishes this by first efficiently counting loci with >2 uniquely aligned reads per nucleus and then identifying nuclei with read count distributions deviating from expectations.
unlike other methods that utilize artificial multiplet examples to identify putative multiplets (e.g., archr), atac-doubletdetector is capable of detecting both homotypic multiplets (i.e., multiplets originating from the same cell type) and heterotypic multiplets (i.e., multiplets originating from different cell types). eliminating heterotypic multiplets is essential for improved clustering and differential analyses between clusters and samples, whereas homotypic multiplets introduce bias in allele-specific analyses. hence, detecting and removing both types of multiplets will improve downstream analyses. the number of valid read pairs per cell is the most important factor affecting the performance of atac-doubletdetector. when read depth per nucleus is sufficiently high (e.g., > k read pairs per nucleus), atac-doubletdetector is very effective in detecting both heterotypic and homotypic multiplets (average recall = . to detect artificial multiplets in pbmcs). since atac-doubletdetector does not depend on artificial multiplet examples, it is not inherently biased towards cell types that resemble others. for example, in islets, delta cells transcriptionally resemble alpha and beta cells, hence artificial multiplets generated by combining alpha and beta cells have genomic profiles that resemble delta cells. these instances are particularly challenging for methods that depend on artificial multiplet examples (e.g., archr for snatac-seq, doubletfinder and scrublet for scrna-seq). in alignment with this, archr categorized . % of delta cells as multiplets in islet . given the success of atac-doubletdetector in identifying multiplets from snatac-seq data with enough reads per nucleus, it may also be effective in detecting and eliminating multiplets in recent multi-ome transcriptome and epigenome assays. epigenomic signal at marker peaks is an effective way to annotate cellular origins of multiplets, where we achieved . % accuracy on average in simulations.
annotations of detected multiplets showed that the majority are homotypic. furthermore, the propensity of nuclei to form multiplets was positively correlated with the abundance of that cell type within the tissue. since cells are lysed and nuclei are profiled in snatac-seq protocols, these assays are unlikely to be prone to biological multiplets arising from cell-cell interactions. therefore, snatac-seq multiplets likely occur randomly among all cells; hence the most abundant cells are the most likely to form multiplets. quantifying the efficacy of multiplet detection methods is a challenging task since true examples of singlets and multiplets are not known. to overcome this challenge, we evaluated atac-doubletdetector’s ability to capture multiplets by simulating artificial multiplets, enabling us to measure recall. atac-doubletdetector identified - . % of cells as multiplets in islet and pbmc samples, which was in alignment with expectations. hence, we believe false positive calls are also limited in our method. although we quantified our method by forming artificial multiplets, the atac-doubletdetector pipeline can be easily extended to capture and annotate multiplets that include data from multiple nuclei. multiplets are inevitable in single cell sequencing, and performing better data analyses calls for their removal. atac-doubletdetector introduces a novel and effective count-based solution for detecting multiplets and provides a framework for annotating their cellular origins, improving future downstream analyses. atac-doubletdetector code and documentation are freely available at https://github.com/ucarlab/atac-doubletdetector, providing an easy-to-use interface for users of all backgrounds.
our multiplet detection algorithm is fast and can be incorporated into data analysis pipelines, where processing of an average library (i.e., ~ , cells at ~ , valid read pairs per cell) takes < minutes.

methods

snatac-seq cell labeling, capture, library preparation, and sequencing.

for single nucleus atac sequencing (snatac-seq) experiments, viable single cell suspensions from each sample were used to generate snatac-seq data using the 10x chromium platform according to the manufacturer’s protocols (demonstrated protocol nuclei isolation for atac sequencing document cg ; chromium single cell atac_user guide revb document cg ). briefly, > , cells of interest were centrifuged, the supernatant was removed without disrupting the cell pellet, and lysis buffer was added for minutes on ice to generate isolated and permeabilized nuclei, followed by quenching by dilution with wash buffer. after centrifugation to pellet the washed nuclei, diluted nuclei buffer was used to re-suspend nuclei at the desired nuclei concentration, as determined using a countess ii fl automated cell counter, and the nuclei were combined with atac buffer and atac enzyme to form a transposition mix. transposed nuclei were immediately combined with barcoding reagent, reducing agent b and barcoding enzyme and loaded onto a 10x chromium chip e for droplet generation, followed by library construction. the barcoded sequencing libraries were subjected to bead clean-up and checked for quality on an agilent tapestation, quantified by qpcr (kapa biosystems library quantification kit for illumina platforms), and pooled for sequencing on an illumina novaseq s flow cell (paired-end libraries x bp).
human islet isolation

human islets were obtained through partnerships with the integrated islet distribution program (iidp, http://iidp.coh.org/). assessment of human islet function was performed by islet gsis static incubation assay on the day after arrival, following the iidp protocol. primary human islets were cultured in prodo media (pim-s + supplements pim-g + pim-abs) in % co2 at °c for ~ hours prior to beginning studies. in preparation of the single cell suspension for the 10x platform, human islets were dispersed with stempro accutase (thermo fisher scientific) ml/ ieq for min at °c. the islet single cell suspension was washed three times in pbs- . % bsa and the cell number was determined using a countess ii fl automated cell counter (life tech). nuclei isolation for single cell atac sequencing was performed following the 10x protocol (https://assets.ctfassets.net/an im xiti/ g d ngcw ab dfqppho/ a fb ea a c cb d /cg _demonstratedprotocol_nucleiisolation_atac_sequencing_revd.pdf, based on the omni nucleiprep by corces et al. ).

identifying snatac-seq loci with >2 reads.

position-sorted paired-end read alignments from snatac-seq data are compared to detect all loci with >2 unique reads per nucleus. to avoid instances where reads overlap due to technical reasons, we retained only read pairs that meet all of the following criteria, as reported by the htsjdk library: 1) readpairedflag = true, 2) readunmappedflag = false, 3) mateunmappedflag = false, 4) secondaryorsupplementary = false, 5) duplicatereadflag = false, and 6) referenceindex equal to matereferenceindex (i.e., read pairs map to the same chromosome).
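The flag-based criteria listed above can be expressed as a short predicate. The sketch below reads the list as conditions a read must meet to be kept, which is an interpretation of the text, and works directly on the integer FLAG field using the bit values defined in the SAM format specification; `keep_read` is a hypothetical helper, not an htsjdk or atac-doubletdetector function.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

# Any of these bits disqualifies a read in this sketch.
REJECT_MASK = UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY | DUPLICATE

def keep_read(flag, ref_index, mate_ref_index):
    """Return True when a read passes the flag-based filters.

    flag: integer SAM FLAG of the read.
    ref_index / mate_ref_index: reference (chromosome) indices of the read
    and its mate; both must match for the pair to be kept.
    """
    if not flag & PAIRED:
        return False
    if flag & REJECT_MASK:
        return False
    return ref_index == mate_ref_index
```

Mapping quality and insert size thresholds are applied separately and are left out here because their values are not stated in this text.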
to reduce overlaps due to alignment errors, reads are excluded based on i) mapping quality scores less than or equal to , and ii) insert sizes (i.e., the end to end distance between ’ and ’ read positions) greater than bp (~ nucleosomes) in length. to identify instances of >2 reads overlapping at any specific locus, all intervals are identified for which an overlap was observed for at least two valid read pairs. reads defining each interval are then compared to one another to identify all subintervals that exceed the specified overlap threshold (i.e., 2). to efficiently identify these subintervals, for each subset of reads, interval breakpoints were defined at the start and end positions of each paired-end read. an integer value of 1 was assigned to all breakpoints originating from a start position, and -1 to all breakpoints originating from an end position. interval breakpoints are then visited in position-sorted order to generate a cumulative sum based on the assigned values at each breakpoint. the cumulative sum indicates the total number of overlaps between two interval breakpoints and efficiently identifies all subintervals with a number of overlaps greater than the specified threshold. once all subintervals satisfying the threshold are identified for a subset of reads, the algorithm repeats this process for the remaining paired-end read subsets. each step is performed using a linear time algorithm (i.e., o(n), where n is the total number of reads), with an additional o(log(m)) overhead for each read (where m is the number of nuclei) to identify its nucleus of origin, resulting in o(n*log(m)) runtime. the runtime can be reduced to an expected o(n) by instead using an appropriate hash function for cell identifiers/barcodes. note that this algorithm assumes that reads are sorted beforehand; otherwise, the runtime is dominated by the time it takes to sort reads by their chromosome and start positions (i.e., o(n*log(n))).
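The breakpoint procedure above is a sweep over start and end events with a running sum. The following is a minimal sketch for a single nucleus and chromosome, assuming half-open (start, end) spans and a threshold of 2; the function name is hypothetical, and the sketch sorts the events itself for simplicity rather than relying on pre-sorted input as the described method does.

```python
def loci_with_excess_reads(intervals, threshold=2):
    """Return maximal sub-intervals covered by more than `threshold` reads.

    intervals: iterable of (start, end) half-open read-pair spans.
    threshold: depth that must be exceeded (2 mirrors the >2 reads rule).
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))     # a read pair begins here
        events.append((end, -1))      # a read pair ends here
    # Sort by position. At equal positions, -1 sorts before +1, so reads
    # that merely touch end-to-start are not counted as overlapping.
    events.sort()

    loci = []
    depth = 0                         # running sum of the +1 / -1 values
    open_start = None                 # start of the current excess region
    for pos, delta in events:
        depth += delta
        if depth > threshold and open_start is None:
            open_start = pos
        elif depth <= threshold and open_start is not None:
            loci.append((open_start, pos))
            open_start = None
    return loci
```

For instance, the spans (0, 10), (5, 15), (8, 20) and (30, 40) give a single excess region from 8 to 10, the only stretch covered by three reads.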
detecting significant multiplets from snatac overlap counts.

loci with >2 reads were first filtered using simple repeats, segmental duplications, repeat masker and blacklist regions obtained from the ucsc genome browser and encode. next, filtered regions from all nuclei were merged if they overlapped by at least one base pair. using this unified list of loci, a binary matrix was generated where rows in the matrix represent loci with >2 reads for at least one nucleus, and the columns represent the individual cells within the sample. values within the matrix were set to 1 if the cell and genomic region combination observed >2 reads overlapping, and 0 otherwise. from this matrix, multiplets can be detected using column sums (i.e., the total number of >2 read overlap instances for each nucleus), while repetitive element sequences can be inferred using row sums (i.e., the total number of cells observing >2 reads at the same locus). the events of observing >2 reads overlapping within the same region for multiple cells or across multiple regions within the same cell can be modeled using the poisson distribution. occurrences of these events are independent, counted within set intervals (i.e., counting regions across the entire genome within cells or counting cells within the same genomic regions), are either present or not within these intervals, and have a constant average rate of occurring, satisfying the assumptions of the poisson distribution. we therefore detected significant multiplets and inferred repetitive sequences using the poisson cumulative distribution function, using respective mean row and column sum counts as the expected values to calculate poisson probabilities.
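The column-sum test can be sketched with the standard library alone. The example below assumes the upper-tail probability P(X >= k) is the quantity of interest, uses the mean count across nuclei as the Poisson rate, and applies a Benjamini-Hochberg adjustment for multiple testing; the function names and the `fdr` cutoff are hypothetical, and the direct summation is only suitable for moderate means and cannot resolve tail probabilities below floating-point precision.

```python
import math

def poisson_sf(k, mean):
    """P(X >= k) for X ~ Poisson(mean), computed as 1 - CDF(k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)            # P(X = 0); underflows for very large means
    cdf = term
    for i in range(1, k):
        term *= mean / i              # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):      # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def call_multiplets(counts, fdr):
    """counts: dict barcode -> number of loci with >2 reads (a column sum)."""
    barcodes = list(counts)
    mean = sum(counts.values()) / len(barcodes)
    pvals = [poisson_sf(counts[b], mean) for b in barcodes]
    qvals = benjamini_hochberg(pvals)
    return {b for b, q in zip(barcodes, qvals) if q < fdr}
```

The same two helpers apply to row sums when screening for repetitive loci, with cells counted per region instead of regions per cell.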
in this process, we first use poisson probabilities to infer repetitive sequences, where a significant number of nuclei have > reads at the same genomic region. all inferred repetitive sequence loci are removed from further analysis. next, we calculate the poisson probability of observing more loci with > reads than expected in a nucleus (i.e., multiplets) using column sums. poisson probabilities for both repetitive sequence inference and multiplet detection were corrected using the benjamini-hochberg procedure to adjust for multiple hypothesis testing. repetitive sequences and multiplets were called by selecting regions or cells with adjusted poisson probabilities less than . .
multiplet annotation pipeline.
detected multiplets are annotated using clusters identified for snatac-seq samples, merging them with respect to specific cell types present in the cell population. in our study, pbmc clusters were merged to represent cd +t, cd +t, natural killer (nk), myeloid and b cells, and islet clusters were merged to represent alpha, beta, delta and ductal cells. marker peaks for all cell type clusters with at least cells were identified with the findmarkers function in seurat , using the logistic regression setting. for consistency, the top marker peaks are then identified for each cell type cluster based on bonferroni-adjusted p-value of average log fold changes. to account for data sparsity in snatac-seq data, aggregate read profiles are calculated for each cell and marker peak. aggregate read profiles are found by taking average read counts for each cell’s nearest neighbors using the top singular value decomposition (svd) components.
the empirical cumulative distribution function in r (i.e., ecdf) is then used to find the abundance of reads for each cluster’s marker peaks. distribution scores represent the percent of each cell type’s accessibility profiles present within the cell. to distinguish multiplet types (i.e., heterotypic or homotypic), singlet profiles were calculated for each cell type in the sample. for each cell type’s singlet cells, abundance scores at every marker peak were averaged to find the representative abundance score profile for that cell type. multiplets whose profile is close to the singlet profile of their most abundant cell type were classified as homotypic. euclidean distance was used to measure the similarity between the profiles of multiplets and singlets. mixture models were then fitted to the distances with the mclust r package to group multiplets by their closeness to the corresponding cell type’s singlet profile. multiplets in the group with the largest distance to the singlet profile are considered heterotypic. multiplets are then annotated using the top (for homotypic) or (for heterotypic) abundance scores.
snatac-seq nuclei clustering.
to cluster nuclei from snatac-seq data, we employed an in-house implementation (https://github.com/ucarlab/snatacclusteringpipeline) of a two-pass clustering method previously described, with notable differences. first, we restrict the number of . kb bins in the first-pass clustering to the top k bins, up from k bins. for second-pass clustering, we increase the number of peaks to include all peaks identified in pass up to k.
integration of scrna-seq and snatac-seq data.
integrative clustering and analysis of single cell transcriptomes and single nucleus epigenomes was performed using the r package seurat , . first, gene activity scores were derived from the resultant snatac-seq peak count matrix using the creategeneactivitymatrix function with default parameters. next, single nuclei with < , total read counts were discarded from analyses.
the resultant single nuclei and gene activity scores were log normalized and scaled. using the processed scrna-seq data (also analyzed with seurat), we identified anchors between the snatac-seq gene activity score matrix and scrna-seq gene expression matrix following the methodology described by butler et al. ( ). after identifying anchors between the datasets, cell-type labels from the scrna-seq dataset were transferred to the snatac-seq dataset, and a prediction and confidence score were assigned to each cell.
simulating artificial multiplets to measure multiplet detection performances.
to measure recall for detecting multiplets, artificial multiplets were simulated by combining accessibility profiles of nuclei within each sample population tested. for each sample, a number of cells equal to % of the total cell population was randomly selected and paired together to introduce artificial multiplets equivalent to . % of the total population. introducing . % artificial multiplets ensured that they were not the majority compared to real multiplets ( - % of cells across all samples) present in the data. cell pairs were randomly reselected until they formed heterotypic multiplets, homotypic multiplets, or a : ratio of heterotypic and homotypic multiplets based on cell type annotations. simulations measuring the number of valid read pairs per nucleus did not have restrictions based on cell type; cell pairs were selected based on read depth when stratifying by number of valid read pairs (i.e., fig. c-d, extended data fig. b) or completely at random (i.e., extended data fig. ).
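the pair selection with reselection described above can be sketched in python as follows. this is a hedged illustration only: the function name, the `mode` argument, the fixed random seed and the attempt limit are hypothetical, and the number of pairs is left as a parameter because the percentages are elided in the text. the mixed-ratio setting and the read-depth stratification are not covered.

```python
import random


def sample_multiplet_pairs(cell_types, n_pairs, mode, seed=0):
    """Draw `n_pairs` disjoint cell pairs of the requested composition.

    `cell_types` maps a barcode to its annotated cell type; `mode` is
    'heterotypic' (different types) or 'homotypic' (same type).
    """
    if mode not in ("heterotypic", "homotypic"):
        raise ValueError("mode must be 'heterotypic' or 'homotypic'")
    rng = random.Random(seed)
    available = sorted(cell_types)  # sorted for reproducibility
    pairs = []
    attempts = 0
    max_attempts = 1000 * max(n_pairs, 1)  # guard against unsatisfiable requests
    while len(pairs) < n_pairs:
        if len(available) < 2 or attempts >= max_attempts:
            raise ValueError("could not draw enough pairs of the requested kind")
        attempts += 1
        first, second = rng.sample(available, 2)
        same_type = cell_types[first] == cell_types[second]
        if same_type != (mode == "homotypic"):
            continue  # reselect until the pair has the requested composition
        pairs.append((first, second))
        # each cell contributes to at most one artificial multiplet
        available.remove(first)
        available.remove(second)
    return pairs
```

each returned pair lists the nucleus whose identifier is kept first, so the pairs can be passed directly to whatever step rewrites barcodes.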
once cell pairs were identified, artificial multiplets were introduced by generating modified barcode mappings (for atac-doubletdetector) or barcodes in fragment files (for archr ), which assigned artificial multiplet reads to the same cell identifier (i.e., the first nucleus in the pair). artificial multiplets were simulated for or runs depending on the analysis.
code availability
atac-doubletdetector is provided as a user-friendly computational framework with documentation and source code freely available at: https://github.com/ucarlab/atac-doubletdetector.
references
buenrostro, j. d. et al. single-cell chromatin accessibility reveals principles of regulatory variation. nature , – ( ).
cusanovich, d. a. et al. multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. science , – ( ).
satpathy, a. t. et al. massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral t cell exhaustion. nat. biotechnol. , – ( ).
rai, v. et al. single-cell atac-seq in human pancreatic islets and deep learning upscaling of rare cells reveals cell-specific type diabetes regulatory signatures. mol. metab. , – ( ).
lareau, c. a., ma, s., duarte, f. m. & buenrostro, j. d. inference and effects of barcode multiplets in droplet-based single-cell assays. nat. commun. , ( ).
fang, r. et al. snapatac: a comprehensive analysis package for single cell atac-seq. https://www.biorxiv.org/content/ . / v ( ).
granja, j. m. et al. archr: an integrative and scalable software package for single-cell chromatin accessibility analysis. http://biorxiv.org/lookup/doi/ . / . . . ( ) doi: . / . . . .
mcginnis, c. s., murrow, l. m. & gartner, z. j. doubletfinder: doublet detection in single-cell rna sequencing data using artificial nearest neighbors. cell syst. , - .e ( ).
wolock, s. l., lopez, r. & klein, a. m. scrublet: computational identification of cell doublets in single-cell transcriptomic data. cell syst. , - .e ( ).
ucar, d. et al. the chromatin accessibility signature of human immune aging stems from cd + t cells. j. exp. med. , – ( ).
lawlor, n. et al. single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type diabetes. genome res. , – ( ).
ma, s. et al. chromatin potential identified by shared single-cell profiling of rna and chromatin. cell , - .e ( ).
corces, m. r. et al. an improved atac-seq protocol reduces background and enables interrogation of frozen tissues. nat. methods , – ( ).
li, h. et al. the sequence alignment/map format and samtools. bioinforma. oxf. engl. , – ( ).
haeussler, m. et al. the ucsc genome browser database: update. nucleic acids res. , d – d ( ).
encode project consortium. an integrated encyclopedia of dna elements in the human genome. nature , – ( ).
davis, c. a. et al. the encyclopedia of dna elements (encode): data portal update. nucleic acids res. , d –d ( ).
butler, a., hoffman, p., smibert, p., papalexi, e. & satija, r. integrating single-cell transcriptomic data across different conditions, technologies, and species. nat. biotechnol. , – ( ).
scrucca, l., fop, m., murphy, t. b. & raftery, a. e. mclust : clustering, classification and density estimation using gaussian finite mixture models. r j. , – ( ).
stuart, t. et al. comprehensive integration of single-cell data. cell , - .e ( ).
fig. : overview of detecting multiplets in snatac-seq. a, tn transposase cleaves accessible dna at maternal and paternal chromosomes. the number of atac-seq reads per locus per nucleus is expected to be , , or . b, instances where more than (> ) reads are observed for any locus in a cell are identified using an efficient algorithm for counting the number of overlapping reads. c, the poisson cumulative distribution function is used to detect multiplets based on deviations from the expected number of loci with > reads. d, overview of downstream analyses: ) quantification of multiplet detection performances using artificial multiplets, ) comparison of atac-doubletdetector to the alternative method archr, ) annotation of cellular origins of multiplets using a clustering-based method.
fig. : atac-doubletdetector identifies heterotypic and homotypic multiplets in human pbmc snatac-seq data. a, summary of snatac-seq samples generated and used in this study from human pbmc and islets. b, valid read pair distributions for pbmc and islet snatac-seq samples. c, pbmc clusters were annotated based on their correlations with sorted bulk atac-seq data (see extended data fig. ). d, all multiplets (heterotypic and homotypic) detected by atac-doubletdetector in pbmc . selected multiplets refer to multiplets for which aggregated profiles are shown in panel f of this figure.
e, the number of cells and percentage of multiplets detected by atac-doubletdetector in pbmc and islet samples. f, chromatin accessibility profiles of cd + t cells, myeloid cells, and selected multiplets around a t cell marker gene (cd g) and a myeloid cell marker gene (lyz). cd + t and myeloid cells show strong accessibility signals for their relevant marker genes, while selected multiplets have accessible chromatin for both marker genes.
fig. : atac-doubletdetector detects multiplets with high recall when read depth is sufficient. a-b, recall for detecting heterotypic (a) and homotypic (b) artificial multiplets. atac-doubletdetector consistently detected both heterotypic and homotypic multiplets with similar recall, while archr was only effective for predicting heterotypic multiplets for data with high heterogeneity. c-d, performance of detecting artificial multiplets at increasing valid read pair (insertions) distributions for pbmc (c) and islet (d). atac-doubletdetector effectively detects multiplets at > k valid read pairs per nucleus. archr’s performance did not show the same dependence on read depth. e, reference annotations for islet . islet annotations correspond to alpha, beta, delta and ductal cell types. f, representative umap plots for multiplets detected by atac-doubletdetector and archr for islet (other samples shown in extended data fig. ). we identified islet clusters for alpha, beta, delta, and ductal cells. the majority of multiplets detected were not shared between the two methods. heterotypic multiplets were the most common. note: archr detected the majority of delta cells as multiplets.
fig. : multiplet cell-type origins are predicted with high accuracy. a, overview of the cell origin annotation pipeline. first, cells are clustered. second, marker peaks are identified. third, multiplets and their k-nearest neighbor cells are used to generate cluster similarity scores. b, example of aggregate cluster profiles for predicting cell origin annotations. clusters corresponding to cell types show strong signal for their respective cell types (e.g., cluster ), while clusters corresponding to multiplets show a mixed profile of cell types (e.g., cluster ). c, heatmaps of cell origin annotation accuracies for predicting artificial multiplets derived from cells of the specific cell type pairings. multiplet annotations showed high accuracies for the majority of cell type compositions.
fig. : majority of multiplets are homotypic and correspond to cell type proportions. a, accessibility maps for cell origin annotations for multiplets identified in pbmc . homotypic multiplets show strong signal for their respective marker genes. heterotypic multiplets show a combined signal at the marker genes corresponding to the respective annotated cell types. b-c, umap clustering for heterotypic and homotypic multiplet annotations in pbmc (b) and islet (c). heterotypic multiplets are found between major cell type clusters. homotypic multiplets are observed on the periphery of major cell type clusters. d-e, heterotypic and homotypic multiplet cell distributions (left bars).
homotypic cell type annotations (right bars) for pbmc (d) and islet (e) samples. the majority of multiplets are annotated as homotypic. homotypic cell type distributions are similar to the overall proportions of each cell type in their respective samples. f-g, cell and multiplet proportions for pbmc (f) and islet (g). multiplet cell type proportions are highly correlated with overall cell proportions.
extended data fig. : multiplets observe many loci with > reads. the binary matrix of loci with > reads per cell reveals high confidence multiplets (marked by arrows) that harbor many loci with > reads. these multiplets can be clearly seen compared to the other cells in the subset.
extended data fig. : pseudo-bulk snatac-seq profile correlations with sorted bulk atac-seq revealed major cell types. a, b, spearman correlation heatmaps between pseudo-bulk (snatac) and sorted bulk atac-seq accessibility profiles for pbmc (a) and pbmc (b). pseudo-bulk profiles cluster with four major cell types: myeloid, b, cd + t, cd + t and natural killer (nk). c, d, annotated umap clusters for pbmc (c) and pbmc (d). myeloid and b cells form distinct clusters for both samples. cd +t, cd +t and nk cell types share more accessible loci and tend to cluster more closely to one another.
extended data fig. : annotated snatac-seq clusters reflect accessibility at cell specific promoters. a, b, annotated umaps for pbmc (a) and pbmc (b) at the promoters of cd g (t-cell marker), cd (cd + t cell marker), cd a (cd + t cell marker), ms a (b cell marker), nkg (nk cell marker), and trem (myeloid cell marker). accessibility was binarized to or based on the presence or absence of a read within these promoters. using these markers, b and myeloid cell types are clearly annotated with their respective markers. cd + t and cd + t cells can be observed by combining cd g with cd and cd a markers respectively, whereas nk cells can be seen using nkg and excluding nuclei with accessibility at the cd g promoter.
extended data fig. : islet snatac-seq clusters correspond to scrna-seq and cell marker annotations. a, b, umap clusters of snatac-seq data for islet (a) and islet (b) annotated as alpha, beta, delta or ductal cells via integration with annotated scrna-seq data. four distinct clusters are observed with these cell types. c, d, cell specific clusters correspond to their respective marker peaks for both islet (c) and islet (d). accessibility was binarized to or based on the presence or absence of a read within these promoters. alpha, beta, delta and ductal cells are clearly identified with their respective marker genes: gcg (alpha), ins (beta), sst (delta), and krt (ductal).
extended data fig. : multiplets are distributed throughout snatac-seq clusters. multiplet annotated umap clustering of pbmc , pbmc , islet and islet reveals that multiplets are distributed throughout all identified clusters and in some cases form their own multiplet clusters (i.e., center cluster in pbmc ). multiplets between major cell type clusters are likely to be heterotypic, whereas multiplets at the periphery of annotated clusters are likely to be homotypic.
extended data fig. : atac-doubletdetector detects both homotypic and heterotypic multiplets at high read depth. a, recall for detecting both homotypic and heterotypic artificial multiplets at a : ratio. atac-doubletdetector did not show noticeable differences in performance, owing to its robustness for detecting both multiplet types. archr showed reduced performance compared to heterotypic multiplet only detection due to the inclusion of homotypic multiplets. b, recall for multiplets stratified by read count distributions (top for each sample) and valid read pair distributions for each multiplet subset (bottom for each sample). atac-doubletdetector performance increased when the number of valid read pairs exceeded ~ k valid read pairs per nucleus, suggesting multiplets can be reliably detected when nuclei have > k valid read pairs each. archr did not show significant differences in performance due to read depth.
extended data fig. : artificial multiplets are detected when combined valid read pairs exceed k. for each sample, multiplets were detected (top left for each sample) or not detected (top right for each sample), depending on whether one or both nuclei exceeded k valid read pairs. histograms of combined profiles revealed that the majority of detected multiplets (bottom left for each sample) had at least k valid read pairs, while multiplets not detected were those with less than k valid read pairs (bottom right for each sample). when nuclei are sequenced for k valid reads per nucleus, multiplets will harbor k valid read pairs and can be detected by atac-doubletdetector.
extended data fig. : atac-doubletdetector and archr identify different multiplet subsets. umap clusters annotating atac-doubletdetector multiplets (green), archr multiplets (orange), or their intersection (black). the majority of multiplets detected by both atac-doubletdetector and archr were between major cell type clusters (i.e., heterotypic multiplets).
extended data fig. : atac-doubletdetector and archr multiplets comparisons reveal nature of their underlying algorithms. a, venn diagrams and total number of multiplets detected by atac-doubletdetector and archr. only a small subset of multiplets is detected by both methods.
b, total number of nuclei and multiplets detected by each method. differences in number of nuclei are due to differences in inputs (i.e., alignment (bam) files for atac-doubletdetector and fragment files (cell ranger output) for archr). overall, archr detects more multiplets using default parameters than atac-doubletdetector. c, valid read pair distributions between multiplets and singlets detected by atac-doubletdetector and archr. differences in the number of valid read pairs between multiplets and singlets were more significant for atac-doubletdetector than for archr, while the number of valid read pairs for atac-doubletdetector multiplets was significantly greater than for archr multiplets.
extended data fig. : multiplet annotations correspond to cell proportions. a-b, umap clustering for heterotypic and homotypic multiplet annotations in pbmc (a) and islet (b). heterotypic multiplets are found between major cell type clusters. homotypic multiplets are observed on the periphery of major cell type clusters. c-d, heterotypic cell type annotations for pbmc (d) and islet (e) samples. the majority of multiplets are annotated as homotypic. f-g, cell and multiplet proportions for pbmc (f) and islet (g). multiplet cell type proportions are highly correlated with overall cell proportions. islet observed more beta cell multiplets than other cell types/samples, reducing correlation and significance for islet .
structural genetics of circulating variants affecting the sars-cov- spike / human ace complex
francesco ortuso , , daniele mercatelli , pietro hiram guzzi , federico manuel giorgi ,*
department of health sciences, university “magna græcia” of catanzaro, catanzaro, italy
net science srl, c/o university “magna græcia” of catanzaro, catanzaro, italy
department of pharmacy and biotechnology, university of bologna, bologna, italy
department of surgical and medical sciences, university “magna græcia” of catanzaro, catanzaro, italy
* corresponding author e-mail: federico.giorgi@unibo.it (fmg)
orcids
francesco ortuso: - - -
daniele mercatelli: - - -
pietro hiram guzzi: - - -
federico manuel giorgi: - - -
classification
biophysics and computational biology
keywords
sars-cov- , covid- , mutations, spike, ace
author contributions
fmg, phg and fo designed the study. fo designed and performed the structural analysis. fmg designed the genetics analysis. fmg and dm performed the genetics analysis. fmg financially supported the study. phg drafted the manuscript and performed literature search. all authors contributed to the writing of the final version of the manuscript.
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).
abstract
sars-cov- entry in human cells is mediated by the interaction between the viral spike protein and the human ace receptor. this mechanism evolved from the ancestor bat coronavirus and is currently one of the main targets for antiviral strategies.
however, there currently exist several spike protein variants in the sars-cov- population as the result of mutations, and it is unclear if these variants may exert a specific effect on the affinity with ace which, in turn, is also characterized by multiple alleles in the human population. in the current study, the gbpm analysis, originally developed for highlighting host-guest interaction features, has been applied to define the key amino acids responsible for the spike/ace molecular recognition, using four different crystallographic structures. then, we intersected these structural results with the current mutational status, based on more than , sequenced cases, in the sars-cov- population. we identified several spike mutations interacting with ace and mutated in at least distinct patients: s n, n k, n y, y f, e k, k n, s i and g s. among these, mutation n y in particular is one of the events characterizing sars-cov- lineage b. . . , which has recently risen in frequency in europe. we also identified five ace rare variants that may affect interaction with spike and susceptibility to infection: s p, e k, m i, e g and g v. significance statement we developed a method to identify key amino acids responsible for the initial interaction between sars-cov- (the covid- virus) and human cells, through the analysis of spike/ace complexes. we further identified which of these amino acids show variants in the viral and human populations. our results will facilitate scientists and clinicians alike in identifying the possible role of present and future spike and ace sequence variants in cell entry and general susceptibility to infection. 
abbreviations
aa: amino acid
ace : angiotensin-converting enzyme
covid- : coronavirus disease
gbpm: grid based pharmacophore model
iep: interaction energy point
mifs: molecular interaction fields
orf: open reading frame
pdb: protein data bank
rbd: spike receptor binding domain with ace
rmsd: root mean square deviation
sars-cov- : severe acute respiratory syndrome coronavirus
main text
introduction
the severe acute respiratory syndrome coronavirus (sars-cov- ) emerged in late ( ) as the etiological cause of a pandemic of severe proportions dubbed coronavirus disease (covid- ). the disease has reached virtually every country in the globe ( ), with more than , , confirmed cases and more than , , deaths (source: world health organization). sars-cov- is characterized by a , -long single stranded rna genome, densely packed in open reading frames (orfs); the orf encodes for a polyprotein which is further split into proteins, for a total of proteins ( ). the second orf encodes for the spike (s) protein, which is the key protagonist in the viral entry into host cells, through its interaction with human epithelial cell receptors angiotensin converting enzyme (ace ) ( ), transmembrane serine protease (tmprss ) ( ), furin ( ) and cd ( ). investigators have focused their attention on the spike/ace interaction, trying to disrupt it as a potential anti-covid- therapy, using small drugs ( ) or spike fragments ( ). using x-ray crystallography, some models of the spike/ace complex have been generated ( – ), providing a structural instrument for the analysis of this key interaction.
these models showed that the receptor binding domain (rbd) of spike, directly interacting with ace , is a compact structure of ~ amino acids (aas) over a total of aas of the full-length spike. the sars-cov- spike protein adapted through successive mutations from a wild bat beta-coronavirus ( ) to exploit the n-terminal ace peptidase domain conformation. as a result, sars-cov- spike can establish a strong interaction with the human cell surface, allowing the virus to fuse its membrane with that of the host cell, releasing its proteins and genetic material and starting its replication cycle ( ). while sars-cov- shows low mutability ( ), with less than predicted events/year ( ), the virus is in continuous evolution from the original wuhan reference sequence (nc_ . ) ( ), and there are currently at least major variants circulating in the population ( , ). some of these strains are characterized by a mutation in spike, at aa , where an aspartic acid (d) is substituted by a glycine (g) ( ). in fact, the spike d g mutation gives the name to the most frequent viral clade (g), which was first detected in europe at the end of january , and is currently present in all continents, with increasing frequency over time ( ). d g does not fall within the putative rbd (aa ~ - ), but some studies suggest it may have a clinically relevant role: d g is positively correlated with increased case fatality rate ( ), and it shows increased transmissibility and infectivity compared to the reference genome ( ). in vitro studies show that viruses carrying the d g spike mutation have an increased viral load and cytopathic effect in cultured vero cells ( ). despite these preliminary observations, there are still several doubts about the molecular effects of the d g variant ( ).
other recurring spike mutations have been observed in the population worldwide, although at frequencies of % or below ( ); some of these mutations fall within the rbd and therefore may have a direct role in ace interaction. on the other hand, genetic variants of ace in the human population may influence susceptibility or resistance to sars-cov- infection, possibly contributing to the difference in clinical features observed in covid- patients ( ). the ace gene is located on chromosome xp . and consists of exons, coding for an aas long protein exposed on the cell surface of a variety of human organs, including kidneys, heart, brain, gastrointestinal tract, and lungs ( ). it is unclear whether tissue-expression patterns of ace may be linked to the severity of symptoms or outcomes of sars-cov- infections; however, ace levels in lungs were found to be increased in patients with comorbidities associated with severe covid- clinical manifestations ( ), whereas polymorphisms of ace have already been described to play a role in hypertension and cardiovascular diseases ( ), particularly in association with type diabetes ( ), all conditions predisposing to an increased risk of dying from covid- ( ). despite early studies, the presence of spike mutations potentially altering the binding with ace is still largely under-investigated, as is the role of ace variants in the human population in determining patient-specific molecular interactions between these two proteins.
in the present study, we aim to detect which spike and ace aas are the most important in determining the sars-cov- entry interaction and to analyze which ones have already mutated in the population. the task is clinically relevant, providing a functional characterization of present and future mutations targeting the ace /spike binding and detected by sequencing sars-cov- on a patient-specific basis. the variability of both proteins must be taken into consideration in the process of developing anti-covid- strategies, such as the spike-based vaccine currently deployed by the national institute of allergy and infectious diseases and moderna ( ).

results
we set out to analyze the key aas involved in the spike/ace interaction, in order to highlight which ones may alter the binding affinity and therefore the etiological and clinical properties of different sars-cov- variants in different patients. following that, we determined which spike and ace aa variations relevant for this interaction have been observed in the sars-cov- and human population, respectively.

structural analysis of spike/ace interaction
we obtained structural models of the sars-cov- spike interacting with the human ace from three recent x-ray structures, deposited in the protein data bank: lzg ( ), m j ( ) and vw ( ). for vw , two spike/ace complexes were available, so we report results for both as vw -a and vw -b, separately. all models show the core domains of interaction, located in the region of aa - for spike and in the region aa - of ace . full length proteins would be aas (spike only known isoform, from reference sars-cov- genome nc_ . ) and aas (ace isoform , uniprot id q byf - ). selected pdb entries are wild type and their primary sequence and the higher order structures were identical. residues - were missing in vw -b.
to investigate conformational variability, pdb complexes were aligned by backbone and the root mean square deviation (rmsd) was computed on all equivalent non-hydrogen atoms. the rmsd data showed some conformational flexibility, which supported our decision to take all pdb structures into account in the subsequent investigation (fig ). the gbpm method was originally developed for identifying and scoring pharmacophore and protein-protein interaction key features by combining grid molecular interaction fields (mifs) according to the grab tool algorithm ( ). in the present study, gbpm has been applied to all selected complex models considering spike and ace either as host or guest. dry, n and o grid probes were considered for describing hydrophobic, hydrogen bond donor and hydrogen bond acceptor interactions. for each probe, a cut-off, required for highlighting the most relevant mif points, was fixed at % above the corresponding global minimum interaction energy value. with respect to the known gbpm application, where pharmacophore features are used for virtual screening purposes, here these data guided us in identifying the aas that stabilize the complex. in fact, spike or ace- residues within Å of gbpm points were marked as relevant in the host-guest recognition and were qualitatively scored by assigning them the corresponding gbpm energy. if a certain residue was suggested by more than one gbpm point, its score was computed as the sum of the energies of the related gbpm points (fig. ).
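the distance-based residue scoring described above can be sketched in a few lines of python. this is only a minimal illustration under stated assumptions, not the authors' implementation: the input layouts (`gbpm_points` as coordinate/energy pairs, `atoms` as residue/coordinate pairs), the function name `score_residues` and the `max_dist` argument are hypothetical, and the distance threshold has to be supplied by the caller.

```python
import math
from collections import defaultdict


def score_residues(gbpm_points, atoms, max_dist):
    """Sum the energies of the GBPM points lying close to each residue.

    gbpm_points: list of ((x, y, z), energy) pairs (hypothetical layout).
    atoms: list of (residue_id, (x, y, z)) pairs, one entry per atom.
    max_dist: distance threshold in angstrom, chosen by the caller.
    """
    scores = defaultdict(float)
    for point_xyz, energy in gbpm_points:
        # Residues with at least one atom within max_dist of this point;
        # the set keeps a residue from being counted twice for one point.
        near = {res for res, xyz in atoms
                if math.dist(xyz, point_xyz) <= max_dist}
        for res in near:
            # A residue suggested by several points accumulates their energies.
            scores[res] += energy
    return dict(scores)
```

in this sketch a residue close to two gbpm points receives the sum of both energies, while a residue far from every point receives no score at all.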
finally, for each selected residue, the score averaged over the four models was considered for estimating its role in complex stabilization. taking into account their average scores, spike and ace aas were divided by quartiles to facilitate the interpretation of the results: quartile (q ) includes the strongest complex stabilization contributors; quartile (q ) contains residues less important than those reported in q but more relevant than those included in quartile (q ); quartile (q ) indicates the weakest predicted interacting aas. such an extension of the original approach allowed us to highlight known relevant interaction residues of both spike (table ) and ace- (table ). essentially the same number of aas was highlighted for spike ( aas) and ace ( aas). the average score was also in the same range. spike reported a population of q larger than ace : and aas, respectively. the opposite scenario was observed in q , which accounted for residues for spike and for ace . no remarkable difference emerged from the spike/ace comparison of q and q . we reasoned that mutations and variants in q residues could have a more relevant impact on the complex stability. the analysis of all designed gbpm models suggested that the spike-ace molecular recognition is largely sustained by polar interactions, such as hydrogen bonds, and by very few putative hydrophobic contributions (table ).

mutational analysis of sars-cov- spike
we analyzed , publicly available sars-cov- full-length genome sequences collected worldwide and deposited in the gisaid database on december , ( ). from these, we obtained , samples containing at least one aa-changing mutation in the spike protein. a total of , different aa-changing mutations were detected in the , aa-long spike sequence. however, many of these are unique events (or possibly even sequencing errors), as only , mutations were found in more than one sample, were found in more than ten samples, and in more than one hundred samples (supplementary file ).
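the recurrence tally described above (how many distinct mutations appear in more than one, ten or one hundred samples) can be illustrated with a short python sketch. the data layout (a dictionary from sample identifier to a list of mutation labels) and the function name `recurrence_summary` are assumptions made for illustration; the actual pipeline and file formats are not specified in this section.

```python
from collections import Counter


def recurrence_summary(sample_mutations, thresholds=(1, 10, 100)):
    """Count distinct mutations observed in more than t samples, for each t.

    sample_mutations: dict mapping a sample id to an iterable of mutation
    labels (hypothetical layout).
    thresholds: sample-count cut-offs; the defaults mirror the
    "more than one / ten / one hundred samples" wording of the text.
    """
    samples_per_mutation = Counter()
    for mutations in sample_mutations.values():
        # set() makes each mutation count once per sample, even if listed twice
        samples_per_mutation.update(set(mutations))
    return {t: sum(1 for n in samples_per_mutation.values() if n > t)
            for t in thresholds}
```

mutations seen in a single sample are the "unique events" of the text; they are counted in none of the bins because every threshold uses a strict "more than" comparison.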
we then focused on mutations located in the spike rbd (aa - ) with predicted interaction contribution, as assessed by our gbpm method. the majority of mutations here are found in only a handful of samples (table and fig a), with a few notable exceptions. the mutations s n and n k are the most frequent in the current population and were identified in , patients ( . %) and , patients ( . %) respectively. these two variants (n k and s n) are also amongst the top most frequent in the population and involve two positions productively contributing to the interaction between spike and ace , according to gbpm (see table and fig for locations and ). graphical inspection of the pdb structures revealed that spike asparagine (n) , ranked in gbpm q , is mainly involved in intra-protein interactions. in fact, by means of its backbone sp oxygen atom, n accepts one hydrogen bond from the spike serine sidechain and, by its sidechain amide group, donates one hydrogen bond to the spike proline backbone: all these aas are located in a random coil loop of spike, so the n k mutation could minimally modify the spike-ace recognition. on the other hand, after the theoretical mutation of the asparagine to a lysine, it is possible to predict a productive electrostatic interaction between the new net positively charged residue and the ace glutamate . such a long-distance interaction could improve the stabilization of the complex with respect to the spike wild type (figure s ). a similar effect could be attributed to the mutation at position . serine (s) is a weak contributor to the complex interaction.
in all pdb entries we selected, serine is located in a solvent exposed random coil loop. no interaction with ace or spike residues can be observed. indeed, the gbpm analysis included this residue in q . conversely, its mutation to asparagine (s n), in our in silico model, revealed the possibility of establishing a hydrogen bond with the ace serine , which can result in a stabilization of the complex (figure s ). moreover, position is also affected by three other events with lower occurrence: s i, s r and s g, with , and observations (table ). among them, s r could be the most interesting one: a net positively charged residue, such as arginine (r), can establish a weak electrostatic interaction with ace glutamate , as suggested by a theoretical model we built. s i and s g could modify the conformation of a random coil segment, so they do not appear very relevant. conversely, s n and s r could productively contribute to the spike-ace complex stabilization. of course, deeper theoretical and experimental investigations should be carried out to confirm this hypothesis. unfortunately, full-scale simulations cannot be rigorously performed today because the available d structural models report only fragments of the complex between spike and ace . the third most common mutation, n y (fig ), targets an aa predicted to have a strong role in the interaction in all four models, sitting in gbpm q . n y was detected in , patients ( . % of the dataset), the majority of which were located in the united kingdom ( ). from a structural point of view, we predict that a substitution, at position , of an asparagine (n) with a tyrosine (y) may have an effect: their total polar surface areas (tpsa), equal to . and . Å respectively, are different; however, both sidechains can donate/accept a hydrogen bond. therefore, their contribution to complex stabilization may be slightly different, also taking into account the chemical environment.
in fact, the wild type asparagine donates one hydrogen bond to ace tyrosine : such an interaction could also be possible for the n y mutant or, as we observed in our theoretical model, it could be replaced by pi-pi stacking (figure s ). a rapid increase in the frequency of mutation n y has recently been observed in the united kingdom and other countries, as it is one of the variants characterizing lineage b . . ( ). the asparagine/tyrosine substitution in spike position could contribute to an evolutionary advantage for this lineage, based on differential affinity for the human receptor ace ( , ). a less frequent mutation amongst those predicted to contribute to the ace /spike interaction is g s, detected in samples ( . %), and supported by three out of four structural models (table , fig b). the glycine (g) was included by the gbpm analysis in q : its contribution to the complex stabilization is weak. unlike the other mutations described here, the replacement of glycine with a serine (s) could have more evident effects on spike-ace molecular recognition. in fact, in all pdb entries, the alpha carbon of this glycine is very close, about Å, to the sidechain amide group of the ace glutamine . no productive interaction can be established between these two aas, but the substitution of the spike glycine with a serine could allow one inter-protein hydrogen bond to ace glutamine .
moreover, g s could establish the same interaction with spike glutamine , which could stabilize the conformation of a random coil segment of the viral protein, resulting in a better pre-organization for ace recognition (figure s ). another spike residue predicted by our analysis to play a relevant role in ace recognition is the glutamine (table ). the gisaid data revealed that this amino acid is rarely replaced by a leucine (q l) or by an arginine (q r). these mutations could affect the recognition of ace in opposite ways. spike glutamine is involved in a hydrogen bond with ace glutamate . the q l mutant cannot establish such a productive contribution and could only interact hydrophobically with spike leucine . conversely, q r could locate its net positively charged sidechain in an ace pocket delimited by aspartate , histidine and glutamate . such a positioning could produce a remarkable electrostatic stabilization of the complex (figure s ). in general, we observed that aas with the strongest evidence for interaction contribution in the spike/ace interface tend not to diverge from the reference (fig b), which may indicate a solid evolutionary constraint to maintain the interface residues unchanged. for example, one of the most relevant st quartile aas in the ace /spike interaction, glutamine (q) , is rarely mutated, with cases of q l, of q * (the substitution of q with a stop codon), of q k, and of q r and q h. one possible exception is the aforementioned spike mutation n y, located in the strongest st quartile gbpm-predicted aa for ace binding, which was found in the considerable number of different patients.

mutational analysis of human ace
we also investigated the variants of human ace , since these could constitute the basis for patient-specific covid- susceptibility and severity. the ace protein sequence is highly conserved across vertebrates ( ) and also within the human species ( ), with the most frequent missense mutation (rs , n d) present in .
% of the world population (supplementary file ). our analysis shows that only variants of ace detected in the human population are also located in the ace /spike direct binding interface (table and fig ). of these, rs (causing a s p aa variant) is both the most frequent in the population ( . %) and the most relevant in the interaction with the viral protein, with a gbpm score of - . (q ) and support from all models (table ). the rs snp frequency is higher in the population of african descent ( . %). the second snp, rs (e g, table ) is a very rare allele ( . %) in the european (non-finnish) asian population. the rs (m i) snp is also a very rare allele ( . %) found in the african population. e k (rs ) is more frequent in the finnish ( . %) and g v (rs ) in the european non-finnish ( . %) population. none of these five snps has a reported clinical significance, according to dbsnp and a literature search ( ). it must be mentioned that m i, together with s p, has been predicted to adversely affect ace stability ( ). m i, together with e g, has been simulated to increase binding affinity with spike when compared to wild type ace , suggesting greater susceptibility to sars-cov- for patients carrying these variants ( ). instead, e k ( ) and g v ( ) were predicted to possess a lower affinity for spike, suggesting lower susceptibility to the infection. however, while offering potential explanations for the existence of a possible predisposing genetic background to infection, all these studies remain inconclusive in linking allele variants to covid- susceptibility.
structurally, the s p variant may greatly differ from the reference sequence in its interaction with spike: serine (s) is a polar residue, able to accept and donate a hydrogen bond by means of its side chain alcoholic group. proline (p), on the other hand, cannot be involved in hydrogen bonding, and therefore should establish a weaker interaction with spike. in fact, the ace serine sidechain donates a hydrogen bond to the spike alanine backbone (figure s ) and could potentially establish the same interaction with spike glycine (g) , which could also be mutated (table ). both methionine (m) and glutamate (e) are in q , contributing minimally to spike-ace recognition (figures s and s ). they are located within two alpha helices, so their mutation could modify the secondary structure of ace , resulting in a different affinity for spike. such a possibility should be more evident in the case of e g because the glutamate sidechain is involved in a hydrogen bond with ace- glutamine .

discussion
sars-cov- spike evolved through a series of adaptive mutations that increased its affinity for the human ace receptor ( ). there is no reason to believe that the evolution and adaptation of the virus will stop, making continuous sequencing and mutational tracking studies of paramount importance to strategically contain covid- ( ). in our study, we highlighted which specific locations of spike can influence the ace molecular recognition, required for the viral entry into the host cell ( ). we further showed that some mutations that may weakly affect the interaction with the human receptor are already present in the sars-cov- population, specifically spike n k, s n and n y. these mutations are rising in the viral population (> %) and in particular n y is one of the key mutations characterizing lineage b. . . ( ), which has seen a recent dramatic increase in frequency in the united kingdom ( ).
having identified this mutation proves that our combination of targeted mutation frequency and gbpm is a useful pipeline to monitor events in the key region used by sars-cov- to recognize and enter human bronchial cells. the same approach can be used to monitor, in the future, whether any of these events will increase in frequency, suggesting an adaptation to the human host leveraging a higher affinity for ace . on the other hand, we studied the variants of ace in the human population, identifying loci that can affect the binding with sars-cov- spike. they are all rare variants, with the most frequent, s p, present in . % of the population, and with no known clinical significance. however, other in silico studies have predicted their role in decreasing ace stability (s p and m i) ( ), and in altering the affinity for spike (increasing it: m i and e g ( ); decreasing it: e k ( ) and g v ( )). the most common ace variant, rs (n d), is not located in the binding region, and so far its predicted effects on the etiopathology of covid- are still largely conjectural and associated with neurological complications via mechanisms probably independent from direct interaction with spike ( ). it remains to be seen whether, in the future, the combination of spike and ace sequences will produce novel and unexpected covid- specificities that will require granular efforts in developing wider-spectrum anti-sars-cov- strategies, such as vaccines or antiviral drugs.
so far, our analysis has shown a location on the spike/ace complex where both proteins vary in the viral/human population, specifically on ace s and spike a /g . while, as described in our results, these mutations on spike are not likely to strongly affect the interaction surface, future combinations of ace /spike variants may have peculiar effects that will require constant mutation monitoring. identifying single or multiple aas involved in this viral entry interaction will allow for personalized diagnosis and clinical prediction based on the specific combination of sars-cov- strain and ace variant. personalized covid- treatment will require targeted sequencing of the patient's ace and spike, to identify the combination causing the specific case. this technical obstacle can be further complicated by the intra-host genetic variability of sars-cov- , which has recently been reported from rna-sequencing studies ( ). structural investigation will benefit, in the near future, from the availability of experimental structural models reporting the complete sequence of both spike and ace , or at least spike. this will allow more rigorous computational analyses (e.g. molecular dynamics simulation, free energy perturbation) of the effect of mutations on the spike/ace recognition. beyond the complex investigated in this manuscript, our approach can be fully extended to any other partners in the sars-cov- /human interactome, for example the recently discovered interaction between viral protease nsp ( ) and human histone deacetylase hdac ( ), which is indirectly responsible for the transcriptional activation of pro-inflammatory genes. our approach can also be extended to other viruses exploiting human receptors as an entry mechanism, such as cd for the human immunodeficiency virus (hiv) or tim- for the ebola virus ( ).

materials and methods

structural analysis
the pdb ( ) was searched for high resolution spike/ace complexes.
pdb entries lzg ( ), m j ( ) and vw ( ), reporting the spike rbd interacting with ace , were retrieved and taken into account for our gbpm analysis ( ). this computational approach compares grid ( ) molecular interaction fields (mifs) computed on a generic complex (a) and on its host (b) and guest (c) components, separately. mifs describe the interaction between a certain probe and a certain target. if the target is a complex, depending on the selected area, the mif energies can refer to the interaction between the probe and one of the complex subunits or, at the host/guest interface, with both of them. the gbpm analysis objectively highlights the latter. five steps are required: ( ) the complex a is disassembled into its subunits b and c; ( ) mifs are computed on a, b and c by using the most appropriate grid probes. a hydrogen bond acceptor/donor and a generic hydrophobic probe can describe the basic interactions. because grid mifs are stored as a d matrix of interaction energy points (iep), the same box dimensions are adopted in all calculations; ( ) each iep of b is compared with the equivalent point of a, generating a new mif named d. the following algorithm, available in the grab tool, is applied:
if iep(a) > 0 and iep(b) > 0 then iep(d) = 0;
if iep(a) > 0 and iep(b) < 0 then iep(d) = iep(b);
if iep(a) < 0 and iep(b) > 0 then iep(d) = -iep(a);
if iep(a) < 0 and iep(b) < 0 then iep(d) = iep(a) - iep(b).
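the point-by-point comparison rules listed above can be written as a small python function. this is a sketch under stated assumptions rather than the grab tool itself: the comparisons are assumed to be against zero (the sign of the interaction energy), the names `grab_combine` and `grab_grid` are hypothetical, the grids are represented as flat lists of energies, and the handling of exactly-zero energies, which the stated rules do not cover, is an assumption.

```python
def grab_combine(iep_a, iep_b):
    """Combine one interaction energy point of A with the equivalent point of B."""
    if iep_a > 0 and iep_b > 0:
        return 0.0                 # unfavourable in both fields: no contribution
    if iep_a > 0 and iep_b < 0:
        return iep_b
    if iep_a < 0 and iep_b > 0:
        return -iep_a
    if iep_a < 0 and iep_b < 0:
        return iep_a - iep_b
    # Energies exactly equal to zero are not covered by the stated rules;
    # returning 0.0 here is an assumption of this sketch.
    return 0.0


def grab_grid(mif_a, mif_b):
    """Apply grab_combine to two MIFs stored as flat lists on the same grid box."""
    if len(mif_a) != len(mif_b):
        raise ValueError("both MIFs must be computed with the same box dimensions")
    return [grab_combine(a, b) for a, b in zip(mif_a, mif_b)]
```

the length check reflects the requirement that the same box dimensions are used in all calculations, so that every point of one field has an equivalent point in the other.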
the resulting mif d reports as negative energy values the productive interaction between the grid probe and b and the interface a and b; ( ) in order to obscure the interaction between the probe and b, mifs d and c are compared, by using the grab approach, producing a new mif e; ( ) the most relevant interaction points (gbpm features) of the mif e are, finally, selected taking into account an energy cutoff % above the global minimum. supplementary figures focusing on the most relevant mutations are available in supplementary file . before starting the gbpm analysis, co-crystallized water molecules were removed from the pdb structures. in vw , showing two spike-ace complexes, namely chains a-e and b-f, both structures were investigated and further reported as model a and b, respectively. all selected complexes were conformationally compared with one another by alignment and by computing the rmsd on the cartesian coordinates of equivalent non-hydrogen atoms. dry, n and o original grid probes were used to highlight hydrophobic, hydrogen bond donor and hydrogen bond acceptor areas. in order to identify the most relevant residues of both spike and ace , we conceptually and technically extended the gbpm algorithm, originally designed for drug/target interactions ( ). in the gbpm analysis presented here, the two interacting proteins were considered either as host or guest units, and relevant aas were selected if their distance from gbpm features was lower than or equal to Å. for each pdb model, the selected residues were scored as the sum of the interaction energies of the corresponding gbpm features. in order to prevent unrealistic distortion of the spike-ace complex, due to the use of structures not covering the full length of the interacting proteins, the effect of the mutations was qualitatively estimated by means of the mutagenesis tool implemented in the pymol software ( ).
wild type residues were replaced by the mutated ones and the new sidechain conformations were optimized taking into account the neighboring aas. the graphical analysis was carried out on the predicted most populated rotamers. on the basis of its better x-ray resolution, the m j pdb structure was selected for the above reported investigation.

genetic analysis
sars-cov- genome sequences from human hosts, accounting for a total of , submissions, were obtained from the gisaid database on october ( ). low quality (with more than % uncharacterized nucleotides) and incomplete (< , nucleotides, based on a total reference length of , ) sequences were removed. the resulting , genome sequences were aligned on the reference sars-cov- wuhan genome (ncbi entry nc_ . ) using the nucmer algorithm ( ). position-specific nucleotide differences were merged for neighboring events and converted into protein mutations using the coronapp annotator ( ). the results were further filtered for aa-changing mutations targeting the spike protein. ace variants in the human population were extracted from the gnomad database, v , july ( ). we considered only missense variants affecting specific aas in the protein sequence, for a total of entries (supplementary file ). graph generation was performed with the r statistical software and the corto package v . . ( ).

acknowledgments
we thank the italian ministry of education and research for their financial support under the montalcini initiative. we thank prof. giovanni perini for his continued support and scientific enthusiasm, prof.
massimo battistini for his lessons on logic and writing, prof. elena bacchelli for her suggestions on the use of gnomad, and prof. stefano alcaro who provided the computational resources required by the gbpm analysis. finally, we thank mr. george wolf for the final proofreading of the manuscript.

figures and tables

figure . conformational comparison of spike-ace pdb complexes: (a) alignment of pdb entries, where spike and ace are respectively surrounded by cyan and orange fog, and (b) bar graph showing rmsd (in Å) computed on structures aligned without hydrogen atoms.
[panel b axes: rmsd (Å) against pdb entries lzg, m j, vw -a and vw -b; plotted values not recoverable]

figure . summary of the pipeline adopted by gbpm to identify key residues contributing to the sars-cov- spike / human ace interface. spike is depicted in cyan, and ace in orange, based on the lzg pdb model ( ). residues highlighted by gbpm are then tested for mutation frequency in the worldwide sars-cov- population.

figure . d ribbon representation of the interaction domains of sars-cov- spike (left, orange) and human ace (right, green), based on the crystal structure lzg deposited in the protein data bank and produced by ( ). the positions of the three most frequent spike mutations in the interacting region (aa - ) with a non-zero gbpm score are indicated: n k, n y and s n.

figure . (a) occurrence of aa-changing variants on sars-cov- spike protein. x-axis indicates the position of the affected aa. y-axis indicates the log of the number of occurrences of the variant in the sars-cov- dataset. labels indicate variants affecting ace /spike binding and detected in at least sars-cov- sequences. vertical dashed lines indicate the crystallized region analyzed (aa – ). the d g variant, located outside the rbd, is also indicated.
(b) scatter plot indicating the occurrence of the variant in the population (x-axis) and the gbpm score of the reference aa in the model (y-axis). mutations with non-zero gbpm score are indicated. cc indicates the pearson correlation coefficient and p indicates the p-value of the cc.

figure . frequency of mutations on ace . the x-axis indicates the aa position in isoform (uniprot q byf - ). the y-axis indicates the allele frequency in the global population according to the gnomad v database. labels indicate aa changes observed in the human population with non-zero gbpm average score in the ace /spike interaction models. vertical dashed lines indicate the crystallized region analyzed in this study (aa – ).

table . gbpm scores, average values, and quartile distribution of spike relevant aas in three pdb models. gbpm scores and average values are reported in kcal/mol.
columns: residue #; pdb entries (lzg, m j, vw -a, vw -b); gbpm average score; quartile
lys - . - . . . - . q
asn . . - . - . - . q
gly - . - . . - . - . q
gly - . . . . - . q
tyr - . - . - . - . - . q
tyr . . - . - . - . q
leu - . - . - . - . - . q
phe - . - . - . - . - . q
ala - . - . - . - . - . q
gly - . . - . - . - . q
ser - . . - .
- . - . q
glu - . - . . . - . q
phe - . - . - . - . - . q
asn - . - . - . - . - . q
tyr - . - . - . - . - . q
phe - . - . - . - . - . q
gln - . - . - . - . - . q
gly - . - . - . - . - . q
phe - . . - . - . - . q
gln - . - . - . . - . q
pro . . . - . - . q
thr . - . - . - . - . q
asn - . - . - . - . - . q
gly - . - . - . - . - . q
val . - . - . - . - . q
tyr - . - . - . - . - . q

table . gbpm scores, average values, and quartile distribution of ace relevant aas in three pdb models. gbpm scores and average values are reported in kcal/mol.
columns: residue #; pdb entries (lzg, m j, vw -a, vw -b); gbpm average score; quartile
ser - . - . - . - . - . q
gln - . - . - . - . - . q
thr - . - . - . - . - . q
phe - . - . - . - . - . q
asp . - . . . - . q
lys - . - . - . - . - . q
his . - . - . - . - . q
glu - . . . - . - . q
glu - . - . - . - . - . q
asp - . - . - . - . - . q
tyr - . - . - . - . - . q
gln - . - . - . - . - . q
leu - . - . . - . - . q
leu . . . - . - . q
met . . - . - . - . q
tyr - . - . - . - . - . q
glu . . . - . - . q
asn - . - . - . - . - . q
gly - . - . - . - . - . q
lys - . - . - . - . - . q
gly - . - . - . - . - . q
asp - . - . - . - . - . q
arg . - . . . - . q
ala . . - . . - . q
arg . . - . . - . q
table . composition of the gbpm models designed. hbd = hydrogen bond donor; hba = hydrogen bond acceptor; # = number of features; aie = average interaction energy (in kcal/mol).
columns: gbpm feature; # and aie for each of lzg, m j, vw -a, vw -b; host/guest
host/guest spike/ace:
hydrophobic - . - . - . - .
hbd - . - . - . - .
hba - . - . - . - .
host/guest ace /spike:
hydrophobic - . - . - . - .
hbd - . - . - . - .
hba - . - . - . - .

table . spike mutations located within the rbd (aa - ) with at least two cases in the population and non-zero gbpm average score in the ace /spike interaction models. the asterisk (*) indicates a stop codon. a lower gbpm score indicates a stronger effect on the ace /spike interaction.
columns: mutation; position; abundance; frequency; gbpm average score; quartile
s n . - . q
n k . - . q
n y . - . q
y f . - . q
e k . - . q
k n . - . q
s i . - . q
g v . - . q
f s . - . q
s r . - . q
n t . - . q
l f . - . q
g s . - . q
e q . - . q
a v . - . q
f l . - . q
f l . e- - . q
yq wk . e- - . q
q l . e- - . q
v f . e- - . q
e a . e- - . q
g s . e- - . q
e d . e- - . q
q * . e- - . q
y w . e- - . q
g a . e- - . q
s g . e- - . q
f l . e- - . q
v i . e- - . q
y f . e- - . q
table . ace variants with non-zero gbpm score in the spike interaction model.
columns: variant; rsid; allele frequency; gbpm average score; quartile
s p rs . - . q
e g rs . e- - . q
m i rs . e- - . q
e k rs . e- - . q
g v rs . e- - . q

supplementary files description

supplementary file : table of sars-cov- spike mutations (source: gisaid database, december ), indicating position, frequency in the sequenced sars-cov- genome and gbpm score (lower: predicted stronger effect in the spike/ace interaction).

supplementary file : table of human ace variants (source: gnomad database, v , july ), indicating position, frequency in the sequenced sars-cov- genome and gbpm score (lower: predicted stronger effect in the spike/ace interaction).

supplementary file : supplementary figures focusing on the most relevant mutations described in this study, with structural, chemical and positional considerations.

title
taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships

authors
andrzej zielezinski ,*, jakub barylski , wojciech m.
karlowski

author affiliations:
department of computational biology, faculty of biology, adam mickiewicz university poznan, uniwersytetu poznanskiego , - , poznan, poland
molecular virology research unit, faculty of biology, adam mickiewicz university poznan, uniwersytetu poznanskiego , - , poznan, poland
* address correspondence to: andrzej zielezinski: andrzejz@amu.edu.pl

abstract

motivation: similar regions in virus and host genomes provide strong evidence for phage-host interaction, and blast is one of the leading tools to predict hosts from phage sequences. however, blast-based host prediction has three limitations: (i) top-scoring prokaryotic sequences do not always point to the actual host, (ii) mosaic phage genomes may produce matches to many, typically related, bacteria, and (iii) phage and host sequences may diverge beyond the point where their relationship can be detected by a blast alignment.

results: we created an extension to blast, named phirbo, that improves host prediction quality beyond what is obtainable from standard blast searches. the tool harnesses information concerning sequence similarity and bacteria relatedness to predict phage-host interactions. phirbo was evaluated on two benchmark sets of known phage-host pairs, and it improved precision and recall by percentage points, as well as the discriminatory power for the recognition of phage-host relationships by percentage points (area under the curve = . ). phirbo also yielded a mean host prediction accuracy of % and % at the genus and family levels, respectively, representing a % improvement over blast. when using only a fraction of phage genome sequences ( kb), the prediction accuracy of phirbo was - % higher than blast at all taxonomic levels.

conclusion: our results suggest that phirbo is an effective, unsupervised tool for predicting phage-host relationships.

availability: phirbo is available at https://github.com/aziele/phirbo.
keywords
phage-host prediction, phage, prokaryote, bacteria, virus, genome sequence

biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).

introduction

prokaryotic viruses (phages) are the most abundant entities across all habitats and represent a vast reservoir of genetic diversity [ ]. phages mediate horizontal gene transfer and constitute a major selection pressure that shapes the evolution of bacteria [ ]. prokaryotic viruses also affect biogeochemical cycles and ecosystem dynamics by controlling microbial growth rates and releasing the contents of microbial cells into the environment [ , ]. moreover, phages play a key role in shaping the composition and function of the human microbiome in health and disease [ – ]. recently, there has been renewed interest in phage therapy and phage-based biocontrol of harmful bacteria [ , ] in medical treatment [ , ] and the food industry [ , ]. hence, characterizing phage-host interactions is critical to understanding the factors that govern phage infection dynamics and their subsequent ecological consequences [ ].

the scope of phage-host interactions is poorly understood, although it has been hypothesized that all prokaryotic organisms fall prey to viral attacks [ ]. methods for studying phage-host interactions primarily rely on cultured virus-host systems; however, recent in silico approaches suggest a much broader range of hosts may be susceptible to viral infections [ ].
these methods predict prokaryotic hosts based on sequence composition [ , ], direct sequence similarity between phages and hosts [ ], analysis of crispr spacers or trnas [ , ], as well as supervised approaches that integrate several sequence-based methods [ , ].

despite significant progress in phage-host predictions, the classic blast [ ] algorithm is currently the most effective, unsupervised method for identifying phage-host interactions [ , ]. depending on the dataset, the tool finds the correct genus-level host for - % of phages [ , ]. the task of finding a host for a given phage using blast is conceptualized as obtaining the host sequence with the highest similarity to the query phage sequence. however, restricting host predictions to the top-scoring prokaryotic sequence has three limitations. first, the true host may not be the top-scoring match in the blast results. second, selecting a prokaryotic host based on the first sequence assumes that a phage infects a single host. although phages are generally host-specific, some may infect multiple host species [ , ]. finally, many distantly related prokaryotic species may obtain a comparable blast score for a query phage due to spurious alignments. these ambiguous host predictions require further manual curation of the taxonomic or phylogenetic relationship between the top-scoring prokaryotic species to select the true host(s).

we have addressed these issues by developing a simple extension to blast, named phirbo, that exploits the information contained in the full blast results, rather than only its top-ranking matches. phirbo improved the accuracy of finding hosts, beyond what is found from the best blast match, by relating phage and host sequences through intermediate, common reference sequences that are potentially homologous to both phage and host queries. subsequent quantification of the overlapping signals allows for the reliable prediction of phage-host interactions without the need
for direct comparisons between the phage and host sequences and without any prior knowledge of their phylogenetic or taxonomic context.

results

phirbo algorithm overview

this algorithm is based on the assumption that the degree of similarity between phage and host sequences is proportional to the overlap between ranked similarity matches of each sequence to the same reference data set of prokaryotic sequences. specifically, to compare a pair of phage (p) and host (h) sequences, we first perform two independent blast searches against the reference database of prokaryotic genomes (d): one blast search for the phage query and the other for the host query (fig. a). the two lists of blast results (fig. b), p → d and h → d, contain prokaryotic genomes ordered by decreasing sequence similarity (i.e., bit-score). to avoid a taxonomic bias due to multiple genomes of the same prokaryote species, we rank prokaryotic species according to their first appearance in the blast list (fig. c). in this way, both lists represent phage and host profiles consisting of the ranks of top-scoring prokaryotic species. the properties of these lists (fig.
c) closely resemble the outcome of an internet search and can be characterized by four features: (i) species listed at the top of each ranking are more important (similar) to the query than those listed at the bottom; (ii) the lists may not be conjoint (some species may appear in one ranking but not in the other); (iii) the ranking lists may vary in length (blast may return few prokaryotic matches in response to virus sequences in contrast to thousands of matches in cases of multiple-species prokaryotic families); (iv) two or more species from the database may achieve the same blast score and, therefore, occupy the same position on the ranking list (fig. c). a recently introduced similarity measure used for comparing the rankings of web search engine results [ ], the rank-biased overlap (rbo), satisfies these four conditions.

the rbo algorithm starts by scoring the overlap between the sub-lists containing the single top-ranked item of each list. it then proceeds by scoring the overlaps between sub-lists formed by the incremental addition of items further down the original lists. each consecutive iteration has less impact on the final rbo score because the weights follow a geometric progression, which places heavier weights on higher-ranking items and down-weights the contribution of overlaps at lower ranks (see ‘methods’). an overall rbo score falls between 0 and 1, where 0 signifies that the lists are disjoint (have no items in common) and 1 means the lists are identical in content and order. our results indicate that the extent of the phage-host relationship can be estimated by the application of an rbo measurement to the ranking lists generated from blast results (fig. d).

phirbo differentiates between interacting and non-interacting phage-host pairs

to assess the discriminatory power of phirbo to recognize phage-host interactions, we used two published reference data sets: edwards et al. ( ) [ ], which contains , complete bacterial genomes and phages with reported hosts, and galiez et al.
( ) [ ], which has , complete prokaryotic genomes and , phage genomes. for each data set, we compared the distribution of phirbo scores between all known phage-host interaction pairs and the same number of randomly selected non-interacting phage-prokaryote pairs (fig. ). the scores obtained by phirbo in both data sets separated the interacting from non-interacting phage-host pairs more clearly than the blast scores. the median phirbo score across interacting phage-host pairs was nearly , times greater than for non-interacting pairs, while the median blast score was three times higher for interacting pairs than non-interacting pairs (supplementary table ). both methods, however, differentiated between interacting and non-interacting phage-host pairs with higher accuracy than wish, the state-of-the-art, alignment-free host prediction tool [ ].

to further examine the discriminatory power of phirbo across all possible phage-prokaryote pairs, we used receiver operating characteristic (roc) curves (fig. a,b). the area under the roc curve (auc), which measured the discriminative ability between interacting and non-interacting phage-host pairs, was higher for phirbo (auc = . ) in the edwards et al. and galiez et al. data sets than for blast (auc = . ) and wish (auc = . - . ). an additional advantage of phirbo was its capacity to score phage-host pairs whose sequence similarity could not be established by a direct blast comparison but, instead, through other, ‘intermediate’ prokaryotic sequences that were detectably similar to both phage and host query sequences.
for example, blast did not provide scores for % of the interacting phage-host pairs in the edwards et al. and galiez et al. data sets due to alignment score thresholds (supplementary table ). using the same blast lists, phirbo evaluated % of the interacting phage-host pairs. this high coverage indicated that nearly every pair of phage-prokaryote sequences could be related by at least one common prokaryotic sequence detectably similar to both the phage and host sequences.

phirbo has the highest host prediction performance

to evaluate host prediction performance, we used precision-recall (pr) curves, which provide more reliable information than roc curves when benchmarking imbalanced data sets for which the non-interacting pairs vastly outnumber the interacting pairs [ , ]. accordingly, we plotted pr curves for phirbo, blast, and wish predictions obtained from the edwards et al. (fig. a) and galiez et al. (fig. b) data sets. overall, phirbo performed better at host prediction at the species level than blast and wish, regardless of the data set. the area under the pr curve (aupr), which summarized overall performance, was higher for phirbo by percentage points (aupr = . - . ) than for blast (aupr = . - . ). phirbo also reported the highest f1 score (the harmonic mean of precision and recall [see ‘methods’]) in the edwards et al. and galiez et al. data sets (fig. ). specifically, the precision and recall of phirbo were - % and - %, respectively, while blast had precision and recall in the range of - % (fig. ). furthermore, phirbo yielded slightly higher specificity ( . - . %) and accuracy ( . - . %) than blast or wish.

phirbo preserves blast top-ranked host predictions
we further evaluated the host prediction accuracy of phirbo by selecting a top-scoring prokaryotic sequence for each phage [ – , ]. briefly, host prediction accuracy is calculated as the percentage of phages whose predicted hosts have the same taxonomic affiliation as their respective known hosts (if multiple top-scoring hosts are present, the prediction is scored as correct if the true host is among the predicted hosts). phirbo recovered all hosts predicted by blast in the data sets by edwards et al. and galiez et al., achieving the same prediction accuracy as blast across all taxonomic levels (table ). of note, blast found multiple different host species with equal scores for phage genomes. this was observed in phages infecting bacteria from the enterobacteriaceae family and the rhodococcus and bacillus genera. however, phirbo assigned the highest score to the correct host species (supplementary table ). additionally, it refined the host prediction for the cronobacter phage ent sequence, which blast assigned to the escherichia coli genome. phirbo revealed cronobacter sakazakii as the primary host species, as the blast list of the cronobacter phage is more similar in content and order to the blast list of c. sakazakii (phirbo score = . ) than to that of e. coli (phirbo score = . ) (figure s ).

as phirbo links phage to host through common sequences, the content of the sequence database was the main factor defining host prediction quality. since the similarity between viruses may indicate a common host [ , ], we expanded the two blast databases of prokaryotic sequences obtained from edwards et al. and galiez et al. with phage sequences (n = and n = , respectively), and recalculated phirbo scores between every phage-prokaryote pair.
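The accuracy measure defined above, including its rule for tied top scores, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function names, the dictionary layout of the scores, and the example taxa are all hypothetical.

```python
from typing import Mapping, Set


def top_hosts(scores: Mapping[str, float]) -> Set[str]:
    """Return every candidate host that shares the highest score for one phage."""
    if not scores:
        return set()
    best = max(scores.values())
    return {host for host, score in scores.items() if score == best}


def prediction_accuracy(
    scores_by_phage: Mapping[str, Mapping[str, float]],
    known_taxon: Mapping[str, str],
    taxon_of_host: Mapping[str, str],
) -> float:
    """Percentage of phages whose top-scoring hosts include the known taxon.

    A phage with tied top-scoring hosts counts as correct if any of them
    belongs to the known taxon; a phage with no scored host counts as wrong.
    """
    if not known_taxon:
        return 0.0
    correct = 0
    for phage, true_taxon in known_taxon.items():
        predicted = top_hosts(scores_by_phage.get(phage, {}))
        if true_taxon in {taxon_of_host[host] for host in predicted}:
            correct += 1
    return 100.0 * correct / len(known_taxon)


# Hypothetical genus-level example with one tie (p1) and one unscored phage (p3).
example_scores = {
    "p1": {"h1": 0.9, "h2": 0.9},
    "p2": {"h1": 0.2, "h3": 0.7},
    "p3": {},
}
example_taxa = {"h1": "Escherichia", "h2": "Cronobacter", "h3": "Bacillus"}
example_known = {"p1": "Cronobacter", "p2": "Escherichia", "p3": "Vibrio"}
example_accuracy = prediction_accuracy(example_scores, example_known, example_taxa)
```

Passing a different `taxon_of_host` mapping (species, genus, family) would give the accuracy at the corresponding taxonomic level.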
the phage-host linkage through homologous prokaryotic and phage sequences increased the host prediction accuracy of phirbo at all taxonomic levels, allowing correct identification of hosts at the genus level for - % of phages (table ). specifically, phirbo refined blast mis-predictions for phage genomes whose sequences showed low similarity to the sequences of their host species. the direct blast alignments of these phage sequences and the sequences of their corresponding hosts obtained significantly lower scores than alignments obtained by the other known phage-host pairs (p = . × - , mann–whitney u test). notably, phirbo also assigned correct host species for phages whose hosts were not reported in the blast results, mainly chlamydia species, vibrio cholerae, and the opportunistic pathogen acinetobacter baumannii.

phirbo is suitable for incomplete phage sequences

we tested the robustness of our host prediction algorithm to fragmentation of the phage sequence. following earlier studies [ , , ], phage genomes from the edwards et al. and galiez et al. data sets were randomly subsampled to generate contigs of different lengths ( kb, kb, kb, kb, and kb) with replicates. host prediction accuracy was calculated as the mean percentage of phages whose predicted hosts had the same taxonomic affiliation as their respective known hosts (fig. ). although phirbo achieved equal host prediction accuracy with blast across all contig lengths, it had substantially higher overall performance in terms of auc and aupr (figure s ; p < − , wilcoxon signed-rank test). surprisingly, blast-based methods obtained higher host
prediction accuracy across all contig lengths compared to wish, a tool designed to predict the hosts of short viral contigs (fig. ).

the host prediction accuracy of phirbo was also examined using the expanded blast database of both prokaryotic and phage full-length sequences. to ensure fairness, for each tested phage contig we removed its corresponding full-length sequence from the blast database and recalculated phirbo scores between the phage contig and every prokaryotic sequence. this approach outperformed blast at every contig length across all taxonomic levels in both data sets (fig. ). generally, the host prediction accuracy of phirbo improved by - percentage points compared to the blast results. for example, when the contig length was kb, the prediction accuracy of phirbo was - % higher than blast at the family level, and - % higher than wish (fig. ; supplementary table ). phirbo also achieved the highest auc and aupr scores when discriminating between interacting and non-interacting phage-host pairs (figure s ).

phirbo uses multiple protein and non-coding rna signals for host prediction

we investigated the sequence information used by blast and phirbo for host prediction. for each phage that was correctly assigned to the host species by both tools (n = ), we calculated the fraction of the phage genome that was included in the segments aligned with prokaryotic sequences (sequence coverage). this analysis revealed that our tool used three times more phage sequence (median sequence coverage: %) than blast ( %) (figure s ; p < - , wilcoxon signed-rank test). this increased sequence coverage indicates that different genome regions of the phages map to the genomes of prokaryotic species other than the host species. for of the phages, more than half of their genomes were aligned to genomes of their host species (supplementary table ).
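The sequence coverage defined above is the fraction of a phage genome that falls inside at least one aligned segment, so overlapping alignments must not be counted twice. The following Python sketch shows one way to compute it by merging intervals; the function name and the assumption that segments are 1-based, inclusive query coordinates are illustrative choices rather than details taken from the study.

```python
from typing import Iterable, Tuple


def sequence_coverage(genome_length: int, segments: Iterable[Tuple[int, int]]) -> float:
    """Fraction of a genome covered by the union of aligned segments.

    Segments are assumed to be 1-based, inclusive (start, end) coordinates on
    the phage genome; reversed pairs are normalized before merging.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted((min(s, e), max(s, e)) for s, e in segments):
        if current_end is None or start > current_end + 1:
            # A gap separates this segment from the merged block: close the block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or directly adjacent segment: extend the merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length


# Two overlapping segments (1-20 after merging) plus a separate one (31-40).
example_coverage = sequence_coverage(100, [(1, 10), (5, 20), (31, 40)])
```

Pooling the segments from all prokaryotic matches of a phage before calling the function gives the coverage over the whole hit list, whereas passing only the segments of the top match gives the coverage of a single alignment partner.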
such large regions of homology are likely prophages or phage debris left by large-scale recombination events during phage replication. the observed high sequence coverage points to virus taxa known for their temperate lifestyle and frequent recombination with host genomes (i.e., the siphoviridae family as well as the peduovirinae and sepvirinae subfamilies).

to further examine the properties of sequences that may be exchanged between a phage and its host, we selected a population of phages with sequence coverage below % (n = ). these phages, which are less likely to represent complete prophages, belong to viral families (supplementary table ). next, we re-annotated the genomic sequences of the phages to find putative protein and non-coding rna (ncrna) genes. phage sequence regions used by phirbo for host predictions were significantly enriched (p < - ) in more than a hundred protein families of known or probable function. in contrast, only half of the protein families were used in blast-based host predictions (supplementary table ). the protein families used by phirbo covered most of the processes of the viral life cycle including dna replication, cell lysis, recombination, and packaging of the phage genome (fig. ). in contrast to blast, phirbo also exploited the information contained in phage ncrnas while assigning phages to host genomes. the vast majority of these ncrnas (> %) were trnas, which showed significant overrepresentation in the phage sequence fragments used by phirbo (p = × - ) (supplementary table ).
the remaining ncrnas belonged to group i introns ( %), rnas associated with genes associated with twister and hammerhead ribozymes ( %), skipping-rope rna motifs ( %), and less abundant rna families.

implementation and availability

predicting hosts from phage sequences using blast is accomplished by querying phage sequences against a database of candidate hosts. however, phirbo also uses information about sequence relatedness among prokaryotic genomes. therefore, it requires ranked lists of prokaryote species generated by blast for the phage and host genomes. the computational cost of querying every host sequence against the database of all candidate hosts using blast may still be a limiting factor. however, for mass host searches, the computational cost of all-versus-all host comparisons becomes marginal, as it must be done only once. after the relatedness among host genomes is established, the time required for phirbo host predictions is negligibly higher than the time for typical blast-based host predictions. for example, running phirbo between ranked lists of host species for , phages and , candidate hosts from galiez et al. (resulting in ~ . million phage-host comparisons) took minutes on a -core . ghz intel xeon.

as phirbo operates on rankings, blast can be replaced by an alternative sequence similarity search tool to reduce the time to estimate homologous relationships between host genomes. for instance, mash [ ] computed host relationships in minutes for the edwards et al. and galiez et al. data sets (see ‘methods’). the host prediction performance of phirbo using blast-based rankings for phages and mash-based rankings for host genomes is high compared to the performance of phirbo predictions using blast rankings for both phage and host genomes (supplementary table ).

we envisage phirbo as a natural extension to standard blast-based host predictions. the phirbo tool is written in python and freely available at https://github.com/aziele/phirbo/.
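As a rough illustration of the scoring step that operates on ranked lists, the sketch below collapses a hit list into a species ranking (first appearance by decreasing bit-score) and compares two rankings with an extrapolated form of rank-biased overlap. It is a minimal sketch and not the phirbo implementation: the function names, the (species, bit-score) input layout, and the persistence parameter `p = 0.75` are assumptions made for the example, and tied ranks are deliberately not handled.

```python
from typing import Iterable, List, Sequence, Tuple


def species_ranking(hits: Iterable[Tuple[str, float]]) -> List[str]:
    """Order species by the bit-score of their first (best) hit.

    `hits` holds hypothetical (species, bit_score) pairs; ties are not grouped.
    """
    ranking: List[str] = []
    seen = set()
    for species, _score in sorted(hits, key=lambda hit: hit[1], reverse=True):
        if species not in seen:
            seen.add(species)
            ranking.append(species)
    return ranking


def rbo_ext(a: Sequence[str], b: Sequence[str], p: float = 0.75) -> float:
    """Extrapolated rank-biased overlap of two tie-free rankings.

    Returns 0.0 for disjoint lists and 1.0 for identical lists; `p` (an
    assumed default) sets how quickly deeper ranks lose weight.
    """
    if not a or not b:
        return 0.0
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    n_short, n_long = len(short), len(long_)
    seen_short, seen_long = set(), set()
    overlap = 0           # items shared by the two prefixes at the current depth
    overlap_at_short = 0  # overlap at the depth where the shorter list ends
    total = 0.0
    for d in range(1, n_long + 1):
        item_long = long_[d - 1]
        if d <= n_short:
            item_short = short[d - 1]
            if item_long == item_short:
                overlap += 1
            else:
                overlap += (item_long in seen_short) + (item_short in seen_long)
            seen_short.add(item_short)
        else:
            overlap += item_long in seen_short
        seen_long.add(item_long)
        total += overlap / d * p ** d
        if d == n_short:
            overlap_at_short = overlap
        if d > n_short:
            # Assume the shorter list would keep agreeing at its final rate.
            total += overlap_at_short * (d - n_short) / (n_short * d) * p ** d
    tail = ((overlap - overlap_at_short) / n_long + overlap_at_short / n_short) * p ** n_long
    return (1 - p) / p * total + tail
```

Under these assumptions, a single phage-host score would be `rbo_ext(species_ranking(phage_hits), species_ranking(host_hits))`. Because the host rankings do not depend on the phage, they can be computed once and reused for every phage query, which is the reason the all-versus-all host comparison is a one-time cost.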
discussion

the identification of similar sequence regions between host and phage genomes using blast has been a baseline for the identification of putative virus-host connections in numerous metagenomic projects [ , , ]. however, a blast search requires regions with significant similarity between the query phage and host [ – ]. yet, many phage and host sequences lack sufficient similarity and escape detection with standard blast searches. to tackle this issue, alignment-free tools have been developed to predict hosts from phage sequences [ – , ]. the rationale behind these tools is based on the observation that viruses tend to share similar patterns in codon usage or short sequence fragments with their hosts [ – ]. as virus replication is dependent on the translational machinery of its host, some phages adapt their codon usage to match the availability of trnas during viral replication in the host cell [ – ]. similar oligonucleotide frequency use may be driven by evolutionary pressure on the virus to avoid recognition by host restriction enzymes and crispr/cas defense systems [ , ]. although state-of-the-art alignment-free tools (i.e., wish [ ] and virushostmatcher [ ]) can rapidly assess sequence similarity between any pair of phage and prokaryote sequences, they are less accurate for host prediction than blast [ , ].

the relatively high accuracy of blast suggests that localized similarities of genetic material may be a stronger indication of phage-host interactions than global convergence of their genomic composition.
this evidence comes in the form of protein-coding dna fragments and non-coding rnas. the latter group is dominated by trna genes, which are strongly over-represented in direct blast alignments between phages and their hosts, and are even more prevalent among indirect connections used by phirbo. this may be important, as previous studies have shown that not all phage trna genes come directly from their hosts. some appear to be derived from genomes of other, often distantly related, bacteria and may be the result of earlier evolutionary events [ ]. for protein-coding genes, a more diverse picture emerges. proteins rich in phage-host blast alignments can be assigned into different functional categories including phage virion components, replication-related proteins, regulatory factors, and proteins involved in the metabolism of the host. the transfer of some over-represented families in phages and/or prophages has been previously reported (e.g., lytic proteins, dna replication and recombination proteins, and enzymes involved in nucleotide and energy metabolisms [ ]) and some of these genes are connected with the phage-host range [ , ]. however, no clear pattern emerges after analyzing the functions of the remaining, over-represented proteins. in this study, we attempted to expand the information content of a single local alignment of phage and host sequences by incorporating the results of multiple local alignments between a phage sequence and different prokaryotic genomes. this approach may more closely resemble a manual assignment of phage-host pairs, where an expert analyst not only considers a top-ranked matching prokaryote in the blast results, but also uses the information contained in other, less significant, matches and their sequence and taxonomic similarity. through a taxonomically-aware stratification scheme, this approach tracks the multilateral dynamics of horizontal gene transfer. 
therefore, we propose to relate phage and host sequences through multiple intermediate sequences that are detectably similar to both the phage and host sequences. by linking phage and host sequences through similar sequences, phirbo achieved a more comprehensive list of phage-host interactions than blast. simultaneously, phirbo was capable of assessing almost all phage-host pairs, bringing the method closer to alignment-free tools, which compute scores between all possible phage and host pairs. thus, our approach can be directly applied to different phage and prokaryote data sets without training or optimizing the underlying rbo algorithm. we intentionally avoided machine learning components in phirbo to ensure the general applicability of the approach and avoid possible overfitting.

our results show that expanding the information obtained from plain similarity comparisons by incorporating taxonomically-grounded measurements of phage-host similarity leads to improved accuracy of phage-host predictions. the phirbo method provides the phage research community with an easy-to-use tool for predicting the host genus and species of query phages, which is usable when searching for phages with appropriate host specificity and for correlating phages and hosts in ecological and metagenomic studies.

methods

virus and prokaryotic host data sets

the data sets analyzed in this study were retrieved from two previously published phage-host studies [ , ]. the first set (edwards et al.
[ ]) contained , complete bacterial genomes obtained from ncbi refseq and refseq genomes of phages for which the host was reported. the data set encompassed , known virus-host interaction pairs and , , pairs for which interaction was not reported (non-interacting phage-host pairs). the second data set (galiez et al. [ ]) contained , complete prokaryotic genomes of the kegg database and phages for which host species were reported in the refseq virus database. the data set consisted of , interacting and , , non-interacting virus-host pairs.

phirbo score

the interaction score for a given phage-host pair was calculated using the rbo metric. rbo [ ] is a measure of rank similarity that compares two lists of different lengths (giving more attention to high ranks on the lists). rbo ranges from 0 to 1, where a greater value indicates greater similarity between lists. the rbo value between two ranking lists, $S$ and $T$, was calculated as

$$\mathrm{RBO}(S, T, p) = (1 - p) \sum_{d=1}^{n} p^{d-1} A(S, T, d)$$

where the parameter $p$ ($0 < p < 1$) determines how steeply the weight declines (the smaller the $p$, the more top results are weighted). when $p = 0$, only the top-ranked item is considered, and the rbo score is either zero or one. in this study, we set p to . , which assigned ~ % of the weight to the first hosts. $n$ is the number of distinct ranks on the ranking list, and $A(S, T, d)$ is the value of overlap between the two ranking lists, $S$ and $T$, up to rank $d$:

$$A(S, T, d) = \frac{|S_{:d} \cap T_{:d}|}{|S_{:d} \cup T_{:d}|}$$

where $S_{:d}$ and $T_{:d}$ represent the elements present in the first $d$ ranks of lists $S$ and $T$, respectively.
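as a minimal sketch of the two formulas above (not the phirbo implementation), rbo can be computed over rankings stored as lists of sets, one set per rank, so that tied species share a rank. the function name `rbo`, the input format, and the choice of n as the length of the longer ranking are assumptions; the toy rankings are hypothetical.

```python
def rbo(s, t, p):
    """Rank-biased overlap between two rankings.

    s, t: lists of sets; element d-1 holds the items tied at rank d.
    p:    weight parameter, 0 <= p < 1 (smaller p favours top ranks).
    """
    n = max(len(s), len(t))  # assumption: sum up to the longer ranking
    seen_s, seen_t = set(), set()
    total = 0.0
    for d in range(1, n + 1):
        # Accumulate everything present in the first d ranks of each list.
        if d <= len(s):
            seen_s |= s[d - 1]
        if d <= len(t):
            seen_t |= t[d - 1]
        union = seen_s | seen_t
        # A(S, T, d): size of the intersection over size of the union.
        overlap = len(seen_s & seen_t) / len(union) if union else 0.0
        total += p ** (d - 1) * overlap
    return (1 - p) * total


# Hypothetical rankings for a phage and a candidate host.
phage = [{"E. coli"}, {"S. boydii"}, {"Y. rohdei", "Y. ruckeri"}]
host = [{"S. boydii"}, {"E. coli"}, {"S. flexneri"}]
score = rbo(phage, host, 0.5)
```

because the sum is truncated at n, two identical rankings score below 1 in this sketch (0.75 for two ranks at p = 0.5).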
host prediction tools

the host prediction tools blast [ ], wish [ ], and phirbo were run separately on the edwards et al. and galiez et al. data sets. for each tool, sequence similarity scores were calculated across all combinations of phage-host pairs. blast . . + [ ] was run with default parameters (task: blastn, e-value threshold = ) to query each phage sequence against a database of candidate host genomes. for every phage-host pair, the highest bit-score among its blast alignments was reported (for phage-host pairs that were absent in the blast results, a bit-score of was assigned). for rbo host prediction, an additional blast search was performed to establish ranked lists of genetically similar host genomes. specifically, a nucleotide blast was run with default parameters to query each host sequence against a database of candidate host genomes. as an alternative to blast, mash . [ ] was used with default parameters (k-mer size = , sketch size = , ) to establish ranked lists for each host by comparing its sequence against the database of candidate host genomes. rbo scores were calculated between all pairwise combinations of phage and host ranking lists. wish . [ ] was used with default parameters to calculate log-likelihood scores between all pairwise combinations of phage-host sequences.

evaluation metrics

the metrics of host prediction performance (i.e., auc, aupr, recall, precision, specificity, and accuracy) were calculated using sklearn [ ]. the score thresholds used to calculate recall, precision, specificity, and accuracy were chosen to maximize the f score, an accuracy metric defined as the harmonic mean of precision and recall. host prediction accuracy was evaluated analogously to previous studies [ , , ]. specifically, for each query phage, the host with the highest score to the query virus was selected as the predicted host.
in cases where multiple hosts were predicted, the prediction was scored as correct if the correct host was among the predictions. the prediction accuracy was calculated at each taxonomic level as the percentage of viruses whose predicted hosts shared a taxonomic affiliation with known hosts.

phage genome annotation

to define phage genes potentially exchanged between phage and host genomes, we re-annotated phage genomes that were correctly assigned to host species by both phirbo and blast. the genes were classified into predefined pvogs (prokaryotic virus orthologous groups) [ ] and rna families [ ]. briefly, open reading frames (orfs) in the analyzed phage genomes were identified using transeq from emboss [ ]. the orfs were then assigned to the respective orthologue group by hmmsearch (e-value < - ) against the database of hidden markov models (hmms) created for each of the , pvog alignments using hmmbuild of hmmer v . . [ ]. non-coding rnas (ncrnas) were predicted in the phage genomes (e-value < - ) using rfam covariance models v . [ ] and the infernal tool v . . [ ]. we counted the number of times each pvog and rfam term was present in phage sequences used by blast and phirbo during host prediction. to determine whether the observed level of pvog/rfam counts was significant within the context of all the terms within the phage genome, we calculated the p-value using the hypergeometric distribution implemented in scipy [ ].

acknowledgments

we thank bas dutilh, rob edwards, clovis galiez, and johannes söding for providing us with the benchmark data sets used in their studies.
we likewise acknowledge william webber for assistance with modifying the rbo formula to account for tied ranks. the computations were performed at the poznan supercomputing and networking center.

author contributions

az conceived the project and designed the experiments. az and jb wrote phirbo and tested its performance. wmk provided the conceptual framework for sequence comparisons through intermediate sequences and reviewed the software and manuscript. az and jb analyzed the results and wrote the paper. all authors read and approved the final manuscript.

figure legends

figure . calculation of the interaction score between phage and host sequences. a. the blast search of phage and prokaryote sequences against a reference dataset results in b. two blast lists containing prokaryote matches ordered by decreasing similarity (i.e., bit-score). c. blast lists were converted into rankings of prokaryote species. the ranked lists differ in content: yersinia rohdei and y. ruckeri are present in the first ranking list but absent in the second list, while shigella dysenteriae and erwinia toletana are only present in the second list. two species, y. rohdei and y. ruckeri, from the first blast search have the same scores and are consequently tied for the same rank. d. an interaction score was calculated between two ranking lists using rank-biased overlap.

figure . discriminatory power of phirbo, blast, and wish scores to differentiate between interacting and non-interacting phage-host pairs. phage-host pairs were obtained from a. edwards et al. and b. galiez et al. data sets.
box plots show the distribution of scores for all interacting phage-host pairs (n = , and n = , in edwards et al. and galiez et al., respectively) and the same number of randomly selected, non-interacting phage-host pairs. the horizontal line in each box displays the median; boxes display the first and third quartiles; whiskers depict lowest and highest non-outlier scores (details of distributions including outliers are provided in supplementary table ). receiver operating characteristic curves and the corresponding area under the curve (auc) display the classification accuracy of phage-host predictions across all possible phage-host pairs. dashed lines represent the levels of discrimination expected by chance.

figure . host prediction performance of phirbo, blast, and wish. the performance is provided by precision-recall (pr) curves and statistical measures (i.e., f score, precision, recall, specificity, and accuracy) separately for a. edwards et al. and b. galiez et al. data sets. dashed lines in the pr-curve plots represent the levels of discrimination expected by chance. score cut-offs for each tool were set to ensure the highest f score.

figure . host prediction accuracy over phage contig length. prediction accuracy is provided separately for a. edwards et al. and b. galiez et al. data sets. each complete virus genome was randomly subsampled times for different sequence lengths (i.e., kb, kb, kb, kb, and kb). hosts were predicted on each subsampling replicate by selecting a prokaryotic sequence with the highest similarity to the query viral sequence. points indicate the average of the resulting accuracies for all the viruses at a given subsampling length and host taxonomic level (i.e., species, genus, and family). an extended version of this figure containing host prediction accuracy values is provided in supplementary table .
figure . functional classification of phage coding sequences used by phirbo for host prediction. protein families (pvogs) were classified into functions related to the phage cycle (e.g., dna replication, transcription). numbers in the dark circles indicate the number of different pvogs related to a given function. an extended version of this figure containing the list of pvogs is provided in supplementary table .

tables

table . host prediction accuracies (%) for phage and host genomes from the data sets by edwards et al. [ ] and galiez et al. [ ].

dataset; method; species; genus; family; order; class; phylum
edwards et al. ( ): wish; blast; phirbo*; phirbo (+phages)†
galiez et al. ( ): wish; blast; phirbo*; phirbo (+phages)†

the highest accuracies among the methods for each taxonomic level are in bold.
* interaction scores were calculated using rank-biased overlap (rbo) between blast lists containing prokaryotic sequences. specifically, the blast database contained , sequences of bacterial genomes in the edwards et al. data set, and , sequences of bacterial and archaeal genomes in the galiez et al. data set.
† interaction scores were calculated using rbo between blast lists containing both prokaryotic and phage sequences.
supplementary figures

supplementary figure . host predictions for cronobacter phage ent (refseq accession: nc_ ) using a. blast and b. phirbo. querying the cronobacter phage sequence with a blast search against the host database returned the genomic sequence of escherichia coli (nc_ ) as the best match (bit-score = , ), and cronobacter sakazakii (nc_ ) as the second-best match (bit-score = , ). phirbo predicted cronobacter sakazakii as the top-scoring host for the cronobacter phage due to the highest extent of overlap between the top-ranking blast matches of each sequence (nc_ and nc_ ) of the same database. for clarity, only the first ten blast matches are shown.

supplementary figure . host prediction performance of phirbo, blast and wish over phage contig length in terms of a. area under the curve (auc) and b. area under the precision-recall curve (aupr). bars indicate the auc or aupr averaged across replicates at a given subsampling length of phage sequence.

supplementary figure . scatter plot of the phage sequence coverage used in host predictions of phirbo versus that of blast. each dot represents a phage genome.

supplementary tables

supplementary table .
distribution of phirbo, blast and wish scores among interacting and non-interacting phage-host pairs obtained from edwards et al. and galiez et al. data sets. score ranges were summarized separately for , interacting and non-interacting phage-host pairs from edwards et al., and , interacting and non-interacting phage-host pairs from galiez et al.

supplementary table . number of phage-host pairs evaluated by phirbo, blast, and wish in edwards et al. and galiez et al. data sets.

supplementary table . phages assigned by blast to multiple, equally-scored host species. phirbo differentiated between host species and provided the highest score to primary host species.

supplementary table . host prediction accuracy of phirbo, blast, and wish over phage contig length.

supplementary table . phage sequence coverage of phages correctly assigned by blast and phirbo to their host species. sequence coverage was calculated for each phage as the sum of the lengths of its non-overlapping high scoring pairs to the genome of the correct host species, divided by the size of the query-phage genome. prophages were assumed to have sequence coverage greater than or equal to %.

supplementary table . summary of taxonomic affiliations of phages that had sequence coverage < % with the host species genomes.

supplementary table . protein families present in sequence regions of phage genomes that were used by blast and/or phirbo in host prediction. the table provides information on each protein family (prokaryotic virus orthologous group (pvog)) used by blast and phirbo, including: (i) pvog description and functional assignment (manually curated), (ii) pvog count (number of times a given pvog was present in the phage genome, as well as in sequences used by blast or phirbo), (iii) pvog percentage (pvog count divided by pvog count in the genome), and (iv) p-value of pvog enrichment.

supplementary table .
rna families present in sequence regions of phage genomes that were used by blast and phirbo in host prediction. the table provides information on each rfam family used by blast and phirbo.

supplementary table . comparison of phirbo’s host prediction performance between blast-based and mash-based rankings of prokaryotic species.

references

suttle ca. marine viruses--major players in the global ecosystem. nat rev microbiol. ; : – .
breitbart m, bonnain c, malki k, sawaya na. phage puppet masters of the marine microbial realm. nat microbiol. ; : – .
roux s, brum jr, dutilh be, sunagawa s, duhaime mb, loy a, et al. ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. nature. ; : – .
norman jm, handley sa, baldridge mt, droit l, liu cy, keller bc, et al. disease-specific alterations in the enteric virome in inflammatory bowel disease. cell. ; : – .
manrique p, bolduc b, walk st, van der oost j, de vos wm, young mj. healthy human gut phageome. proc natl acad sci u s a. ; : – .
meyer jr. sticky bacteriophage protect animal cells. proceedings of the national academy of sciences of the united states of america.
proceedings of the national academy of sciences; . pp. – .
reardon s. phage therapy gets revitalized. nature. ; : – .
salmond gpc, fineran pc. a century of the phage: past, present and future. nat rev microbiol. ; : – .
svoboda e. bacteria-eating viruses could provide a route to stability in cystic fibrosis. nature. ; : s –s .
dedrick rm, guerrero-bustamante ca, garlena ra, russell da, ford k, harris k, et al. engineered bacteriophages for treatment of a patient with a disseminated drug-resistant mycobacterium abscessus. nat med. ; : – .
samson je, moineau s. bacteriophages in food fermentations: new frontiers in a continuous arms race. annu rev food sci technol. ; : – .
sulakvelidze a. using lytic bacteriophages to eliminate or significantly reduce contamination of food by foodborne bacterial pathogens. j sci food agric. ; : – .
paez-espino d, eloe-fadrosh ea, pavlopoulos ga, thomas ad, huntemann m, mikhailova n, et al. uncovering earth’s virome. nature. ; : – .
edwards ra, mcnair k, faust k, raes j, dutilh be. computational approaches to predict bacteriophage–host relationships. fems microbiol rev. ; : – .
ahlgren na, ren j, lu yy, fuhrman ja, sun f. alignment-free d_ ^* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. nucleic acids res. ; : – .
galiez c, siebert m, enault f, vincent j, söding j. wish: who is the host? predicting prokaryotic hosts from metagenomic phage contigs. bioinformatics. ; : – .
andersson af, banfield jf.
virus population dynamics and acquired virus resistance in natural microbial communities. science. ; : – .
wang w, ren j, tang k, dart e, ignacio-espinoza jc, fuhrman ja, et al. a network-based integrated framework for predicting virus-prokaryote interactions. nar genom bioinform. ; : lqaa .
zhang m, yang l, ren j, ahlgren na, fuhrman ja, sun f. prediction of virus-host infectious association by supervised learning methods. bmc bioinformatics. ; : .
altschul sf, madden tl, schäffer aa, zhang j, zhang z, miller w, et al. gapped blast and psi-blast: a new generation of protein database search programs. nucleic acids res. ; : – .
lima-mendez g, faust k, henry n, decelle j, colin s, carcillo f, et al. ocean plankton. determinants of community structure in the global plankton interactome. science. ; : .
flores co, meyer jr, valverde s, farr l, weitz js. statistical structure of host-phage interactions. proc natl acad sci u s a. ; : e - .
webber w, moffat a, zobel j. a similarity measure for indefinite rankings. acm trans inf syst. ; : – .
saito t, rehmsmeier m. the precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. plos one. ; : e .
davis j, goadrich m. the relationship between precision-recall and roc curves. proceedings of the rd international conference on machine learning - icml ’ . new york, new york, usa: acm press; . doi: . / .
villarroel j, kleinheinz ka, jurtz vi, zschach h, lund o, nielsen m, et al. hostphinder: a phage host prediction tool. viruses. ; . doi: . /v
ondov bd, treangen tj, melsted p, mallonee ab, bergman nh, koren s, et al. mash: fast genome and metagenome distance estimation using minhash. genome biol. ; . doi: . /s - - -x
gao nl, zhang c, zhang z, hu s, lercher mj, zhao x-m, et al. mvp: a microbe–phage interaction database. nucleic acids res. ; : d –d .
paez-espino d, roux s, chen i-ma, palaniappan k, ratner a, chu k, et al. img/vr v. . : an integrated data management and analysis system for cultivated and environmental viral genomes. nucleic acids res. ; : d –d .
roux s, hallam sj, woyke t, sullivan mb. viral dark matter and virus-host interactions resolved from publicly available microbial genomes. elife. ; . doi: . /elife.
lawrence jg, ochman h. amelioration of bacterial genomes: rates of change and exchange. j mol evol. ; : – .
pride dt, wassenaar tm, ghose c, blaser mj. evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. bmc genomics. ; : .
carbone a. codon bias is a major factor explaining phage evolution in translationally biased hosts. j mol evol. ; : – .
sharp pm, rogers ms, mcconnell dj. selection pressures on codon usage in the complete genome of bacteriophage t . j mol evol. ; : – .
morgado s, vicente ac. global in-silico scenario of trna genes and their organization in virus genomes. viruses. ; : .
sousa jam de, pfeifer e, touchon m, rocha epc. genome diversification via genetic exchanges between temperate and virulent bacteriophages. biorxiv. biorxiv; . doi: . / . . .
shapiro jw, putonti c. gene co-occurrence networks reflect bacteriophage ecology and evolution. mbio. ; . doi: . /mbio. -
hernandes coutinho f, zaragosa-solas a, lópez-pérez m, barylski j, zielezinski a, dutilh be, et al. rafah: a superior method for virus-host prediction. biorxiv. biorxiv; . doi: . / . . .
camacho c, coulouris g, avagyan v, ma n, papadopoulos j, bealer k, et al.
blast+: architecture and applications. bmc bioinformatics. ; : .
pedregosa f, varoquaux g, gramfort a, michel v, thirion b, grisel o, et al. scikit-learn: machine learning in python. j mach learn res. ; : – .
grazziotin al, koonin ev, kristensen dm. prokaryotic virus orthologous groups (pvogs): a resource for comparative genomics and protein family annotation. nucleic acids res. ; : d –d .
kalvari i, nawrocki ep, ontiveros-palacios n, argasinska j, lamkiewicz k, marz m, et al. rfam : expanded coverage of metagenomic, viral and microrna families. nucleic acids res. . doi: . /nar/gkaa
rice p, longden i, bleasby a. emboss: the european molecular biology open software suite. trends genet. ; : – .
finn rd, clements j, eddy sr. hmmer web server: interactive sequence similarity searching. nucleic acids res. ; : w - .
nawrocki ep, eddy sr. infernal . : -fold faster rna homology searches. bioinformatics. ; : – .
virtanen p, gommers r, oliphant te, haberland m, reddy t, cournapeau d, et al. scipy . : fundamental algorithms for scientific computing in python. nat methods. ; : – .

[figure: score calculation schematic, panels a to d. labels: phage dna sequence (p); host dna sequence (h); blast; reference prokaryote dna database (d); match and score lists; rank and species lists; compare rankings; rank-biased overlap (rbo) score.]

[figure: panels a and b. box plots of similarity score for interaction and non-interaction pairs (phirbo, blast, wish); roc curves of true positive rate against false positive rate, with auc for each tool.]

[figure: panels a and b. precision-recall curves (precision against recall) with aupr for each tool; f score, recall, precision, specificity, accuracy, and score cut-off for phirbo, blast, and wish.]

[figure: panels a and b. prediction accuracy (%) against sequence length (kb) at the species, genus, and family levels; series: phirbo (+phages), blast, phirbo, wish.]

[figure: functional categories: capsid, head, collar, tail, baseplate, fiber, spike, amino acid metabolism, dna replication, genome packaging, transcription, cell lysis, host defence systems, energy metabolism, nucleotide metabolism, bacterial chromosome integration / recombination, other functions, antibiotic resistance, full phage assembly.]
biorxiv preprint, not certified by peer review; this version posted in january and made available under a cc-by-nc international license.

human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of sars-cov- genomes

yuki iwasaki, takashi abe, toshimichi ikemura

department of bioscience, nagahama institute of bio-science and technology, shiga, japan
graduate school of science and technology, niigata university, niigata, japan

abstract

background

when a virus that has grown in a nonhuman host starts an epidemic in the human population, human cells may not provide growth conditions ideal for the virus. therefore, the invasion of severe acute respiratory syndrome coronavirus- (sars-cov- ), which is usually prevalent in the bat population, into the human population is thought to have necessitated changes in the viral genome for efficient growth in the new environment. in the present study, to understand host-dependent changes in coronavirus genomes, we focused on the mono- and oligonucleotide compositions of sars-cov- genomes and investigated how these compositions changed time-dependently in the human cellular environment. we also compared the oligonucleotide compositions of sars-cov- and other coronaviruses prevalent in humans or bats to investigate the causes of changes in the host environment.

results

time-series analyses of changes in the nucleotide compositions of sars-cov- genomes revealed a group of mono- and oligonucleotides whose compositions changed in a common direction for all clades, even though viruses belonging to different clades should evolve independently. interestingly, the compositions of these oligonucleotides
changed towards those of coronaviruses that have been prevalent in humans for a long period and away from those of bat coronaviruses.

conclusions

clade-independent, time-dependent changes are thought to have biological significance and should relate to viral adaptation to a new host environment, providing important clues for understanding viral host adaptation mechanisms.

keywords

“covid- ”, “sars-cov- ”, “oligonucleotide composition”, “time-series analysis”, “big data”, “zoonotic virus”, “rna virus”, “viral adaptation”, “coronavirus”

background

severe acute respiratory syndrome coronavirus- (sars-cov- ), an rna virus belonging to the betacoronavirus genus, began to spread in the human population in . this viral strain is believed to have been originally prevalent in bats and transferred to the human population through intermediate hosts [ ]. viral growth requires a wide variety of host factors (nucleotide pools, proteins, rna, etc.), and the virus must evade the diverse antiviral mechanisms of host cells (antibodies, killer t cells, interferon, rna interference, etc.) [ - ]. since ancestral sars-cov- strains are thought to be endemic in bats, they should be well adapted to their host environment; when the virus invades the human population, human cells may not provide growth conditions ideal for the virus. for efficient growth and rapid spread of the infection, changes in the viral genome are likely to be required. analyses of time-dependent changes in sars-cov- in the human population can be used to characterize how and why viral genomes change to adapt to a new host environment.

due to the great threat of covid- and remarkable development of sequencing technology, a massive number of sars-cov- genome sequences are
available in databases, even though the epidemic has lasted for approximately months. these sequence data have provided a wide range of insights into sars-cov- [ , ].

phylogenetic methods based on sequence alignment have been widely used in molecular evolution studies [ , ], and these methods are well refined and essential for studying phylogenetic relationships between different viral species and variations in the same viral species at the single-nucleotide level. however, when dealing with a massive number of genome sequences, methods based on sequence alignment become problematic because they require a large amount of computational resources. we have continued to develop sequence alignment-free methods focused on the oligonucleotide compositions of genome sequences [ - ]. notably, oligonucleotide composition varies widely among species, including viruses, and is designated a genome signature [ ]. these compositions can be treated as numerical data, and a massive amount of sequence data can easily be subjected to various statistical analyses. furthermore, even genomic fragments without orthologous and/or paralogous pairs can be compared [ , , - ].
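As a minimal sketch of how a genome can be turned into the numerical composition data described above, the following function computes the percentage of every k-mer in a sequence. The function name `kmer_composition`, the handling of ambiguous bases (windows containing them are skipped) and the toy sequence are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGU"  # RNA alphabet; T in database sequences is mapped to U


def kmer_composition(seq: str, k: int) -> Dict[str, float]:
    """Return the composition (%) of every k-mer over the unambiguous windows of seq."""
    seq = seq.upper().replace("T", "U")
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows containing N or other ambiguity codes are skipped.
        if all(base in ALPHABET for base in window):
            counts[window] += 1
    total = sum(counts.values())
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Every possible k-mer gets an entry, so genomes yield vectors of equal length.
    return {kmer: (100.0 * counts[kmer] / total if total else 0.0) for kmer in kmers}


if __name__ == "__main__":
    toy = "AUGGAGGAUUCN"  # toy sequence, not real data
    print(kmer_composition(toy, 1))         # mononucleotide composition (%)
    print(kmer_composition(toy, 3)["GAG"])  # one trinucleotide composition (%)
```

Because each genome maps to a fixed-length vector (4 values for k = 1, 16 for k = 2, 64 for k = 3), genomes can be compared without alignment.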
specifically, our previous work on influenza a-type virus genomes found that the oligonucleotide compositions of the viral genomes differed between hosts (e.g., humans and birds), even for viruses within the same subtype (e.g., h n and h n of type a) [ , , ]; we also examined changes in the oligonucleotide compositions of influenza h n / , which have been epidemic in humans beginning in , and found that their compositions changed to approach those of the seasonal flu strains h n and h n [ ]. furthermore, although epidemics of the h n and h n strains began several decades apart, these strains showed highly similar chronological changes from the start of these epidemics. these evolutionary yet reproducible changes suggest that mutations to adapt to a new host environment inevitably accumulate when the host species of a virus changes, and these changes can be efficiently detected by analyzing oligonucleotide compositions.

several groups, including ours, have examined changes in sars-cov- genomes during the early stages of the sars-cov- epidemic and found clear directional changes in a group of mono- and oligonucleotides detectable on even a monthly basis [ , , ]. these directional changes will allow us to predict changes in the near future. notably, near-future prediction and verification should be the most direct ways to test the reliability of the obtained results, models and ideas (e.g., those discovered for influenza viruses), providing a new paradigm for molecular evolutionary studies.
in this context, the present study analyzed the genome sequences of over seventy thousand sars-cov- strains isolated from december to september .

results

directional changes in the mononucleotide compositions (%) of sars-cov-

for fast-evolving rna viruses, diversity within the viral population arises rapidly as the epidemic progresses and subpopulation structure forms; the gisaid consortium has defined at least seven main clades (g, gh, gr, l, v, s and others). notably, the elementary processes of molecular evolution are based on random mutations, and strains belonging to different clades are thought to have evolved independently. therefore, highly similar time-dependent changes observed independently of clade are likely to have biological meaning and may be inevitable for efficient growth in human cells.

from this perspective, we first examined time-dependent changes in the mononucleotide compositions (%) of sars-cov- strains isolated from december to september . among the seven clades (g, gh, gr, l, v, s and others) reported by the gisaid consortium, we used six clades (g, gh, gr, l, v and s), excluding others, in the analysis. for the time-series analysis, we calculated the average mononucleotide compositions (%) of the genomes in each clade collected monthly; in fig. a, the mononucleotide composition of each clade is shown as a colored line, while that for the monthly collected genomes belonging to all clades is shown as a dashed line.
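The monthly averaging described above can be sketched as follows. The record layout, the function names and the `min_strains` cut-off are hypothetical: the methods state that sparsely sampled months were excluded, but the threshold value is not legible in this copy, so the default used here is only a placeholder.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# One record per genome: (clade, elapsed month, composition in %) for one nucleotide.
Record = Tuple[str, int, float]


def monthly_clade_means(records: Iterable[Record],
                        min_strains: int = 10) -> Dict[Tuple[str, int], float]:
    """Average composition per (clade, month); one colored line per clade."""
    groups = defaultdict(list)
    for clade, month, value in records:
        groups[(clade, month)].append(value)
    # Placeholder cut-off: drop (clade, month) cells with too few strains.
    return {key: mean(vals) for key, vals in sorted(groups.items())
            if len(vals) >= min_strains}


def monthly_overall_means(records: Iterable[Record],
                          min_strains: int = 10) -> Dict[int, float]:
    """Average composition per month over all clades; the dashed line."""
    groups = defaultdict(list)
    for _clade, month, value in records:
        groups[month].append(value)
    return {month: mean(vals) for month, vals in sorted(groups.items())
            if len(vals) >= min_strains}


if __name__ == "__main__":
    toy = [("G", 1, 20.0), ("G", 1, 22.0), ("S", 1, 21.0), ("G", 2, 23.0)]  # toy values
    print(monthly_clade_means(toy, min_strains=2))
    print(monthly_overall_means(toy, min_strains=2))
```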
regardless of clade, the composition of c decreased, while that of u increased in a time-dependent manner, but the changes in a and g composition were less clear (fig. a). correlation coefficients between the mononucleotide composition and month from the start of the epidemic showed a high negative correlation for c and a high positive correlation for u for all clades, but there was no clear directionality for a and g (fig. a and tables , ). these results indicate that the mononucleotide composition of this virus may be prone to biased mutations that reduce c and increase u, or that the mutated strains tend to be more favorable for growth in human cells.

directional changes in short oligonucleotide compositions

oligonucleotides are known to act as functional motifs, such as binding sites for a wide variety of proteins and target sites for rna modifications. therefore, directional changes in some oligonucleotides independent of clade may relate to certain processes for adaptation to the new host environment. our previous work on influenza a viruses found that their oligonucleotide compositions varied among prevalent hosts [ , ]; notably, although influenza virus isolated from humans tended to prefer a and u (but not g and c) more than viruses isolated from birds, the human viruses showed a preference for ggcg and gggg, which are g- or c-rich. importantly, there are various examples of oligonucleotides whose changes in composition cannot be explained by changes in mononucleotide composition alone, and these changes may relate to the molecular mechanisms of viral adaptation to a new host.

from this perspective, we next analyzed time-dependent changes in di- and trinucleotide compositions and found that a group of di- and trinucleotides showed a highly positive or negative correlation (figs. b, s and tables , ). interestingly, a
group of a- or g-rich oligonucleotides, such as gag and gga, showed a high positive correlation independent of clade, which was not expected from the changes in mononucleotide compositions alone. to confirm the extent of these changes, we also calculated the fold change in composition between the first isolated month and the last examined month (fig. ) and found clear increases and decreases in mono- and oligonucleotide compositions common among the six clades, which supports the result presented in fig. and tables and .

changes towards the sequences of other coronaviruses prevalent in humans

in a previous study of sars-cov- [ ], we analyzed mono- and dinucleotide compositions for the first four epidemic months without separating the sequences by clade. notably, the directional changes shown in figs. and and tables and were fully consistent with the previous results, even when the six clades were separately analyzed. in the previous study, time-series analysis of ebolavirus at the beginning of the epidemic in west africa in also showed directional changes in a group of mono- and dinucleotide compositions, but these directional increases/decreases tended to slow approximately months after the start of the epidemic. the increase/decrease trend for sars-cov- is far from slowing after months, and the next important questions are how long these directional changes in this virus will last and whether there are possible goals to these changes. to conduct this near-future prediction, the following information concerning influenza viruses should be useful.
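The two summary statistics used in the results above, the correlation between composition and elapsed month and the fold change between the first and last month, can be sketched as follows. The text says only "correlation coefficients", so the use of Pearson's coefficient is an assumption, and the numbers in the usage example are toy values rather than results.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumed; the text does not name the coefficient)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def fold_change(monthly_means: Sequence[float]) -> float:
    """Composition in the last analyzed month divided by that in the first month."""
    return monthly_means[-1] / monthly_means[0]


if __name__ == "__main__":
    months = [0, 1, 2, 3]
    toy_c = [18.4, 18.3, 18.3, 18.2]  # toy monthly means (%), not real data
    print(pearson(months, toy_c))     # negative for a decreasing trend
    print(fold_change(toy_c))         # below 1 for a decrease
```

A coefficient near -1 or +1 for all six clades corresponds to the clade-independent directional change described in the text, while the fold change gives its magnitude.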
as mentioned before, mono- and oligonucleotide compositions in influenza h n / changed towards those of seasonal influenza strains such as the h n and h n subtypes [ ]. furthermore, all the human subtypes showed directional changes away from the compositions of all avian influenza a subtypes and closer to those of the human influenza b type, which has been prevalent only in humans [ ]. if we assume that changes similar to those in the influenza virus will occur, the mono- and oligonucleotide compositions of interest for sars-cov- are expected to change towards those of other coronaviruses that have been prevalent in humans and away from those of coronaviruses prevalent in bats. to test this hypothesis, we analyzed the following coronaviruses: human-cov strains (alphacoronaviruses e and nl ; betacoronaviruses hku and oc ) and bat-cov strains (alphacoronaviruses and betacoronaviruses, including the sars virus).

as shown in fig. a, we compared the mononucleotide compositions of sars-cov- with those of the human- and bat-cov strains; the data for bat sars among bat-cov strains, which is thought to be the original strain that caused the current covid- pandemic, are marked in pink. interestingly, concerning the human- and bat-cov strains, differences in mononucleotide composition were more pronounced between hosts than between the alpha and beta lineages, and the levels for all six clades of sars-cov- were between those for the two hosts. fig. b shows the results of di- and trinucleotides, for which the directional, time-dependent changes were primarily common among the six clades.
the increases and decreases in nucleotide composition observed for sars-cov- in figs. and are indicated by hollow up and down arrows, respectively. interestingly, all changes of interest tended to move away from the compositions of bat sars and approach those of human-cov, supporting the view that the directional changes of interest have biological significance and are possibly inevitable, as observed for influenza viruses. assuming that approaching the levels in human-cov strains is the hypothetical goal of the directional change of sars-cov- , the current compositions are far from this hypothetical goal (fig. ); therefore, we predict that directional changes of interest will continue in the near future.

then, assuming that the average value for all human-cov strains is a hypothetical goal, we investigated how sars-cov- has approached this possible goal. specifically, we calculated the square of the difference between the composition of each nucleotide in sars-cov- and the average value for human-cov strains and plotted the values of the difference according to the elapsed month for each nucleotide. changes in the compositions of both c and u clearly reduced this difference, as the compositions of these nucleotides approached the hypothetical goal (fig. a); their linear reduction supports the prediction that directional changes in the composition of c and u will continue for the foreseeable future.
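The distance to the hypothetical goal described above is simple enough to write out directly. In this sketch, the function name and the input layout (a list of monthly mean compositions for one nucleotide and a list of per-strain human-cov compositions for the same nucleotide) are assumptions, and the numbers are toy values.

```python
from typing import List, Sequence


def squared_difference_to_goal(monthly_means: Sequence[float],
                               human_cov_values: Sequence[float]) -> List[float]:
    """Squared difference between each monthly composition and the human-cov average."""
    if not human_cov_values:
        raise ValueError("at least one human-cov value is required")
    goal = sum(human_cov_values) / len(human_cov_values)  # hypothetical goal
    return [(value - goal) ** 2 for value in monthly_means]


if __name__ == "__main__":
    toy_monthly = [20.0, 19.5, 19.0]  # toy monthly means (%) for one nucleotide
    toy_human = [16.0, 18.0]          # toy human-cov compositions (%)
    print(squared_difference_to_goal(toy_monthly, toy_human))
```

A series that shrinks month by month indicates movement towards the human-cov level; a growing series indicates movement away from it.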
in contrast, a and g did not show directional changes in composition, which is most likely due to the absence of clear differences in the a and g compositions of human- and bat-cov, i.e., there is no possible target (fig. a). fig. b shows examples of di- and trinucleotides whose compositions have moved towards the hypothetical goal, but fig. c shows a few exceptional nucleotides whose compositions have not changed towards the hypothetical goal but have changed with a common directionality among the six clades. in fig. d, correlation coefficients between the above difference and the elapsed month are presented. most nucleotides of interest showed a negative coefficient (i.e., a directional change towards human-cov), but three oligonucleotides, gg, agc and cau, showed positive coefficients indicating an increase in the difference (i.e., moving away from the human-cov level). for these opposing directional changes, certain causes specific to sars-cov- may be assumed.

motifs for rna-binding proteins

next, we considered the mechanisms that move oligonucleotide compositions away from those of bat coronaviruses and closer to those of human coronaviruses. certain human cellular factors involved in viral growth may be candidates in such mechanisms. when considering possible protein factors, oligonucleotides longer than trinucleotides should be a focus. as a first attempt, we focused here on host rna-binding proteins because their binding to hepatitis c virus is known to be involved in the growth of this rna virus [ ]. we thus searched for motifs for human rna-binding proteins in coronavirus genomes (see methods section) and found multiple loci with binding motifs
for each protein. table (and table s ) lists the motifs for which a directional time-dependent change was primarily common among six clades. table and fig. a show that only elavl showed a positive correlation, but the other nine proteins in table showed a negative correlation for almost all clades; the results for other motifs are presented in table s .

we next compared the numbers of these motifs in sars-cov- with the numbers of human- and bat-cov motifs (fig. b). of the ten proteins shown in table , the only elevated motif, that for elavl binding, was found in a significantly higher number of loci in human-cov than in bat-cov, but motifs for pcbp and srsf binding, which tended to decrease (table ), were found in significantly fewer loci in human-cov. these observations appear to be consistent with the features found in the mono-, di- and trinucleotide compositions of interest. however, unlike these changes, there was significant diversity within even a single clade, which appears to be greater than the differences between hosts, with the possible exception of elavl . long oligonucleotides should carry out a variety of functions, and mutations that accumulate in their functional motifs may have complex effects on the presence of functional motif sequences, so an analysis from a new perspective will likely be needed.

discussion

we first discuss possible molecular mechanisms related to time-dependent directional changes in mononucleotide composition. fig. a shows that the frequency of c tended to decrease in sars-cov- , while that of u tended to increase. since a similar change was previously found for mers and all a-type influenza subtypes [ , ], these changes may have biological significance for a wide range of rna viruses that invade from nonhuman hosts.
one possible mechanism is the host rna-editing function; simmonds ( ) proposed that the c→u hypermutation in sars-cov- may be due to the influence of apobec family proteins in humans [ ]. apobec is an antiviral protein in various animal species, including humans, that can convert c to u by the deamination of c [ - ]. such rna editing is also known to act as a defense mechanism against various viruses, including retroviruses [ ]. the apobec gene family has generated various paralogs during mammalian evolution, with seven known apobec genes in humans and ten in bat families [ - ]. the prevalent c→u change in sars-cov- upon transfer of its host environment from bats to humans suggests that these changes may be due to human-specific apobec genes.

we next discuss changes in short oligonucleotides. directional changes in some oligonucleotides, such as gag and gga, cannot be explained by apobec-induced c→u mutations alone. although the evidence is weak, these oligonucleotides are part of the binding motifs of several rna-binding proteins, such as srsf and pcbp (table s ); the number of loci for these motifs has decreased independently of clade. in contrast, the number of motif loci for only elavl among the ten proteins listed in table has increased independently of clade. as an rna-binding protein that binds a- or u-rich elements, elavl binding to mrna is known to contribute to rna stability [ , ]; sars-cov- and human-cov, which are prevalent in humans, may contain increased binding motifs for elavl for efficient growth in the human cellular environment.
however, for further analysis, information on rna-binding proteins in bat cells is needed.

conclusions

in the present study, we found that the compositions of a group of mono- and oligonucleotides in sars-cov- genomes have changed in a host cell-dependent manner. this is fully consistent with our previous findings for influenza a and b viruses [ , , ], supporting the previous prediction that the host-dependent directional changes of various mono- and oligonucleotides should inevitably occur in zoonotic rna viruses that have invaded from nonhuman hosts. phylogenetic methods based on sequence alignment [ , ] are well refined and undoubtedly essential for studying the phylogenetic relationships between viruses. the present alignment-free method to analyze mono- and oligonucleotide compositions can also serve as a powerful tool for molecular evolutionary studies of viruses, revealing directional changes in viruses and predicting the possible goals of these changes.

methods

sars-cov- genome sequences

human sars-cov- genome sequences were downloaded from the gisaid database (https://www.gisaid.org/); sequences that were complete, showed high coverage and had been isolated from humans were downloaded on sep , . among the acquired sequences, strains with an unknown isolation month were excluded from the analysis, and the polya tail was removed. a list of all , strains used is provided in table s .
genome sequences of coronaviruses prevalent in humans or bats

the complete sequences of two types of human coronavirus (human-cov) strains, alphacoronaviruses ( e and nl strains) and betacoronaviruses ( hku and oc strains), were obtained from the ncbi virus database (https://www.ncbi.nlm.nih.gov/labs/virus/). the complete genome sequences of two types of bat coronavirus (bat-cov) strains, alphacoronaviruses ( strains) and betacoronaviruses ( strains, including sars-cov), isolated from three types of bats (chiroptera, vespertilionidae and rhinolophidae) were obtained from the ncbi virus database (https://www.ncbi.nlm.nih.gov/labs/virus/), and the polya tail of each sequence was removed. the strains are listed in table s .

time-series analysis of changes in oligonucleotide compositions

in the time-series analysis, the average mono- and oligonucleotide compositions (%) of viruses collected in each month were calculated for each clade. to avoid statistical fluctuations due to the small sample size, months in which fewer than strains had been collected were excluded from the monthly analysis.

rna-binding motif analysis

rna-binding motifs were obtained from the attract database [ ]. in this database, multiple binding motifs are registered as corresponding to one rna-binding protein; we calculated the total number of loci containing the binding motifs for each protein in the viral genomes.
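The motif-locus count described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: motifs are treated as plain sequence strings, a locus is taken to be a start position of a motif occurrence (overlapping occurrences included), and the total for a protein is the sum over its motifs. The example motifs are made up and are not database entries.

```python
import re
from typing import Iterable


def count_motif_loci(genome: str, motifs: Iterable[str]) -> int:
    """Total number of start positions at which any motif of one protein occurs."""
    genome = genome.upper().replace("T", "U")
    total = 0
    for motif in motifs:
        motif = motif.upper().replace("T", "U")
        # A zero-width lookahead lets finditer report overlapping occurrences.
        pattern = re.compile("(?=" + re.escape(motif) + ")")
        total += sum(1 for _ in pattern.finditer(genome))
    return total


if __name__ == "__main__":
    toy_genome = "UUUAUUUAUUU"                 # toy sequence, not real data
    toy_motifs = ["UUUAUUU", "AUUUA"]          # hypothetical motifs for one protein
    print(count_motif_loci(toy_genome, toy_motifs))
```

Counting per motif and summing means that one stretch of sequence matched by two different motifs contributes twice; whether the original analysis merged such positions is not stated in the text.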
list of abbreviations

sars-cov- : severe acute respiratory syndrome coronavirus-
human-cov: human coronavirus
bat-cov: bat coronavirus

ethics approval and consent to participate

not applicable

consent for publication

not applicable

availability of data and materials

the sequence datasets analyzed in this study are stored in gisaid. other data are available from yi.

competing interests

the authors declare that there are no conflicts of interest.

funding

this work was supported by jsps kakenhi grant number k , by amed under grant number jp he and by covid- counterplan research project (supervised by prof. tatsumi hirata, nig) from the research organization of information and systems (rois).

authors' contributions

yi conceived the approach and conducted this analysis. ta developed the algorithm. ti supervised this study.

acknowledgements

we gratefully acknowledge the authors who submitted their sequences to the gisaid database and also the valuable comments of dr. yashushi hiromi of national institute of genetics (mishima). we thank springer nature author services for editing this manuscript for english language.

figure legends

fig. . time-dependent directional changes in nucleotide compositions. (a) average mononucleotide compositions (%) in the sars-cov- genomes of each clade isolated in each month are plotted against the elapsed month. to compare the four
mononucleotides, the scale widths on the vertical axis are set to the same values. the colored lines distinguishing the clades (g, gh, gr, l, v and s) are shown at the bottom of the figure. the dashed line shows the averaged compositions for all strains isolated in each month. (b) the average di- and trinucleotide compositions that primarily undergo common directional changes among the six clades are plotted against the elapsed month.

fig. . fold changes in nucleotide composition between the epidemic start and the last month of analysis. a bar plot shows the fold change in composition of each mono- or oligonucleotide; this value was calculated by dividing the nucleotide composition in the last month of analysis by that at the start of the epidemic. each bar is colored to indicate the clade, as described in fig. . since we analyzed strains belonging to different clades separately, data from the first or last month differed among clades; see also the methods section.

fig. . nucleotide compositions of human and bat coronavirus sequences. a boxplot shows the nucleotide compositions in human-cov (alpha e, alpha nl , beta hku and beta oc ), bat-cov (bat sars, alphacoronavirus and betacoronavirus) and sars-cov- strains. bat sars are marked pink. a hollow arrow indicates the direction of change in oligonucleotide composition observed for sars-cov- in figs. and . (a) mononucleotides. to compare the four mononucleotides, the scale widths on the vertical axis are set to the same values. (b) di- and trinucleotides.

fig. . differences in nucleotide composition between sars-cov- and human-cov. (a) values for the square of the difference in mononucleotide composition between sars-cov- isolated in each month and human-cov are plotted against the
elapsed month. the data are presented as colored or dashed lines, as described in fig. . (b and c) oligonucleotide compositions that approach and move away from those of human-cov are presented, respectively. (d) the correlation coefficients between the elapsed month from the start of the epidemic and the above differences in mono- and oligonucleotides whose directionality of change is common among six clades are presented. the results for a and g mononucleotides, which show nondirectional change, are also presented.

fig. . time-dependent changes in the numbers of rna-binding motif loci. (a) the numbers of loci containing rna-binding motifs per genome are plotted against the elapsed month. here, we selected rna-binding proteins for which the number of motif loci increased or decreased by at least one for all six clades from the epidemic start. the data are presented as colored or dashed lines, as described in fig. a. (b) a boxplot shows the number of loci containing rna-binding motifs in human-cov (alpha e and nl ; beta hku and oc ), bat-cov (bat sars, alphacoronavirus and betacoronavirus) and sars-cov- strains. bat sars are marked pink. a hollow arrow indicates the direction shown in fig. a with which the oligonucleotide compositions of sars-cov- changed.

table . correlation coefficients for time-dependent changes in mono- and oligonucleotide compositions in sars-cov- that have increased.

table . correlation coefficients for time-dependent changes in mono- and oligonucleotide compositions in sars-cov- that have decreased.
table . the number of motif-containing loci for rna-binding proteins whose occurrences have increased or decreased between strains of the first and last month of the analysis.
additional file
fig. s : average di- and trinucleotide compositions (a and b) for sars-cov-2 strains collected in each elapsed month.
fig. s : oligonucleotide compositions of human and bat coronavirus sequences.
fig. s : differences in oligonucleotide composition between sars-cov-2 and human-cov.
additional file
table s : list of sars-cov-2 strains used in the analysis.
table s : list of human- and bat-cov strains used in the analysis.
table s : number of sars-cov-2 strains in each clade isolated in each elapsed month.
table s : average oligonucleotide compositions for sars-cov-2 strains in each clade isolated in each elapsed month.
table s : correlation coefficients for time-dependent changes in oligonucleotide compositions of sars-cov-2.
table s : fold change in compositions between strains of the first and last month of the analysis.
table s : distance between the oligonucleotide composition of sars-cov-2 isolated in each elapsed month and that of human-cov.
table s : correlation coefficients for time-series changes in the distance between oligonucleotide compositions of sars-cov-2 and human-cov.
table s : list of rna-binding motifs.
table s : numbers of motif-containing loci for rna-binding proteins whose abundance increases or decreases between strains of the first and last month of the analysis.
table s : p-value from t-test to analyze the number of rna-binding motif loci whose abundance increases or decreases between strains of the first and last month of the analysis.
table s : correlation coefficients for time-dependent changes in the number of loci containing rna-binding motifs.
table
        clade g   clade gh  clade gr  clade l   clade v   clade s
u       .         .         .         .         .         .
ua      .         .         .         .         .         .
auu     .         .         .         .         .         .
cau     .         .         .         .         .         .
ugu     .         .         .         .         .         .
uua     .         .         .         .         .         .
uug     .         .         .         .         .         .
uuu     .         .         .         .         .         .

table
        clade g   clade gh  clade gr  clade l   clade v   clade s
c       -.        -.        -.        -.        -.        -.
ag      -.        -.        -.        -.        -.        -.
ca      -.        -.        -.        -.        -.        -.
cc      -.        -.        -.        -.        -.        -.
cu      -.        -.        -.        -.        -.        -.
ga      -.        -.        -.        -.        -.        -.
gg      -.        -.        -.        -.        -.        -.
uc      -.        -.        -.        -.        -.        -.
agc     -.        -.        -.        -.        -.        -.
ccc     -.        -.        -.        -.        -.        -.
gac     -.        -.        -.        -.        -.        -.
gag     -.        -.        -.        -.        -.        -.
gga     -.        -.        -.        -.        -.        -.
table
        clade g   clade gh  clade gr  clade l   clade v   clade s
ptbp    -.        -.        -.        -.        -.        .
hnrnpl  -.        -.        -.        -.        -.        .
nova    -.        -.        -.        -.        -.        .
srsf    -.        -.        -.        -.        -.        .
zfp     .         -.        -.        -.        -.        .
hnrnpa  -.        -.        -.        -.        -.        .
elavl   .         .         .         .         .         .
tia     -.        -.        -.        -.        -.        .
pcbp    -.        -.        -.        -.        -.        .
srsf    -.        -.        -.        -.        -.        .
title: debar, a sequence-by-sequence denoiser for coi-5p dna barcode data

authors
cameron m. nugent*, tyler a. elliott, sujeevan ratnasingham, paul d. n. hebert, sarah j. adamowicz
department of integrative biology, university of guelph, guelph, ontario, canada
centre for biodiversity genomics, university of guelph, guelph, ontario, canada
*corresponding author: nugentc@uoguelph.ca

biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).

abstract
dna barcoding and metabarcoding are now widely used to advance species discovery and biodiversity assessments. high-throughput sequencing (hts) has expanded the volume and scope of these analyses, but elevated error rates introduce noise into sequence records that can inflate estimates of biodiversity.
denoising of barcode and metabarcode data, the separation of biological signal from instrument (technical) noise, currently employs abundance-based methods which do not capitalize on the highly conserved structure of the cytochrome c oxidase subunit i (coi) region employed as the animal barcode. this manuscript introduces debar, an r package that utilizes a profile hidden markov model to correct indel errors introduced into coi sequences by instrument error. in silico studies demonstrated that debar recognized % of artificially introduced indels in coi sequences. when applied to real-world data, debar reduced indel errors in circular consensus sequences obtained with the sequel platform by %, and those generated on the ion torrent s5 by %. the false correction rate was less than . %, indicating that debar is receptive to the majority of true coi variation in the animal kingdom. in conclusion, the debar package improves dna barcode and metabarcode workflows by aiding the generation of more accurate sequences, which in turn aids the characterization of species diversity.
keywords: coi, dna barcode, metabarcode, denoising, markov model, biodiversity

introduction
motivated by global biodiversity decline, conservation policies and strategies are being implemented to mitigate extinction rates (driscoll et al. ; baynham-herd et al. ). accurate assessments of biodiversity and its change over time are critical to support conservation strategies, to remediate environmental damage, and to manage natural resources, but this information is lacking for most ecosystems (sogin et al. ; hajibabaei et al.
; hebert et al. ; d’souza & hebert ). dna barcoding provides a technological solution to the problem of identifying organisms and characterizing biodiversity (hebert et al. ; hubert & hanner ). instead of identifying specimens through morphological study, standardized dna regions, termed dna barcodes, are used to identify specimens belonging to known species and to recognize new taxa. reflecting advances in sequencing technology, dna barcode studies are expanding in scale from analyzing single specimens to characterizing bulk samples, an approach termed metabarcoding, as well as multi-marker and metagenomics approaches (taberlet et al. ; cristescu ; hajibabaei et al. ; wilson et al. ). these advances are providing newly detailed information on species diversity in different geographic regions and habitats (hajibabaei et al. ; hebert et al. ; delabye et al. ; lopez-vaamonde et al. ) while also aiding the identification of invasive species (brown et al. ; xu et al. ), food web analysis (wirta et al. ; kanuisto et al. ), and environmental monitoring (hajibabaei et al. ; stat et al. ; cordier et al. ).
despite the broad adoption of dna barcoding and metabarcoding, a fundamental problem persists. efforts to quantify biodiversity from barcode and metabarcode data can be strongly affected by analytical methodology (clare et al. ; braukmann et al. ). for example, if high-throughput sequence (hts) data are cleaned suboptimally, the estimated number of taxa may be grossly inflated because variation introduced by sequencing (technical) errors is interpreted as biological variation (hardge et al.
). to reduce the impact of technical errors, sequence reads are often clustered into operational taxonomic units (otus) at specific identity thresholds (elbrecht et al. ). several software packages have attempted to increase the accuracy of this otu method by separating biological signal from technical noise (rosen et al. ; callahan et al. ; edgar ; amir et al. ; elbrecht et al. ; kumar et al. ; nearing et al. ). many standard denoisers, such as dada2 (callahan et al. ), deblur (amir et al. ), and unoise (edgar ), utilize cluster-based approaches, custom error models, or pre-clustering algorithms to account for and correct technical errors. comparative studies have shown that all three of these methods outperform threshold-based otu-clustering approaches (nearing et al. ). it has also been shown that they produce similar estimates of species richness and relative abundance, but significantly different values for alpha diversity (intra-habitat diversity) and the number of unique exact sequence variants (esvs) (nearing et al. ).
when a highly conserved protein-coding region, such as cytochrome c oxidase subunit i (coi), is employed as the barcode, structural information can be leveraged to improve denoising. the adoption of this approach can improve the accuracy of alpha-diversity estimates and the quality of identified barcode sequences by ensuring barcodes conform to biological reality. additionally, rare sequences or important intra-species variants need not be discarded based solely on their abundance and can be retained with higher confidence if they conform to the expected gene structure. this latter benefit will be particularly valuable for work on hyper-diverse communities
(e.g. tropical insects) and for analyses of metabarcode data, where uneven sampling is often the norm and the resolution of intra-species variation is challenging (elbrecht et al. ; nearing et al. ; braukmann et al. ; zizka et al. ).
hidden markov models (hmms) are probabilistic representations of sequences that allow unobserved (hidden) states to be inferred from a series of observed (non-hidden) symbols (durbin et al. ; wilkinson ). hmms have been applied widely in the analysis of biological sequences, in areas such as sequence alignment and annotation (durbin et al. ; eddy ). profile hidden markov models (phmms) are a variant well suited for the representation of biological sequences with a shared evolutionary origin (durbin et al. ; eddy , ). they are probabilistic models that contain position-specific information about the likelihood of potential characters (nucleotides or amino acid residues) at a given position in the sequence (emission probabilities) and about the likelihood of moving from one hidden state to the next, for example from a match state to an insert or delete state (transition probabilities). once a phmm is trained on a set of sequences, the viterbi algorithm can be used to obtain the path of hidden states that aligns a novel sequence to the phmm (durbin et al. ). the viterbi path is composed of hidden match states (indicating that the observed character matches a position in the phmm) and non-match states: either inserts or deletions. in the context of error correction, hidden non-match states identify the most likely positions at which novel sequences deviate from the phmm’s statistical profile. in this manner, individual sequences can be queried for evidence of insertion or deletion (indel) errors and adjusted in a statistically informed manner.
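to make the idea of a viterbi path concrete, the following is a minimal sketch of log-space viterbi decoding on a toy two-state hmm. it is an illustration under stated assumptions, not debar's implementation: a real phmm has position-specific match, insert and (silent) delete states, whereas this toy model only labels each observed base as "match" or "insert". the function name `viterbi`, the state names and all probabilities are hypothetical values chosen for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for a discrete HMM (log-space Viterbi)."""
    if not observations:
        return []
    # scores[t][s]: best log-probability of any state path ending in s at step t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backptr = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        backptr.append({})
        for s in states:
            # best predecessor state for s at step t
            prev = max(states,
                       key=lambda p: scores[t - 1][p] + math.log(trans_p[p][s]))
            scores[t][s] = (scores[t - 1][prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][observations[t]]))
            backptr[t][s] = prev
    # trace back from the best final state
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return path[::-1]


# Toy parameters (illustrative only, not debar's trained values).
STATES = ["match", "insert"]
START_P = {"match": 0.9, "insert": 0.1}
TRANS_P = {"match": {"match": 0.9, "insert": 0.1},
           "insert": {"match": 0.5, "insert": 0.5}}
# The "match" state strongly expects A; the "insert" state emits any base.
EMIT_P = {"match": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
          "insert": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

print(viterbi("AACA", STATES, START_P, TRANS_P, EMIT_P))
# -> ['match', 'match', 'insert', 'match']
```

in this toy run the unexpected c is assigned to the insert state, which is the kind of non-match signal that a phmm-based denoiser can act on.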
the conserved protein-coding structure of the most common animal barcode gene, coi, and the wealth of available training sequences (ratnasingham & hebert ) for this region have allowed phmms to be successfully applied in the detection of technical errors in novel barcode sequences (nugent et al. ). correction of technical indel errors in data from protein-coding barcode sequences is an important development as it maximizes the likelihood that both the nucleotide and amino acid sequences correspond to the true biological sequence. mitigation of indels arising due to technical errors also makes sequence reads from a given specimen more directly comparable, allowing low-frequency point mutations to be eliminated when multiple reads are available for a given biological sequence. here, we aim to extend the use of phmms in coi data processing to allow for the sequence-by-sequence correction (denoising) of technical errors.
this study had four primary goals: (1) design a denoising tool for coi barcode data that utilizes phmms to identify and correct insertion and deletion errors resulting from technical error; (2) test the tool’s performance and optimize its default parameters by denoising a set of , barcode sequences with artificially introduced indel errors; (3) develop, implement, and evaluate a workflow for denoising dna barcode data produced through single-molecule, real-time (smrt) sequencing of , specimens on the sequel platform (pacific biosciences); and (4) denoise a dna metabarcode mock community data set using debar and evaluate the improvement in quality of consensus sequences and the ability to resolve intra-otu haplotype variation. the denoiser resulting from this work, debar (denoising barcodes), is a free, publicly available package written in r that is available through cran (https://cran.r-project.org/package=debar) and github (https://github.com/cnuge/debar).

materials and methods
implementation
the debar utility includes several customizable steps which denoise dna barcode and metabarcode data (figure ; supplementary file ). corrections with debar are based upon the comparison of input sequences with a nucleotide-based profile hidden markov model (phmm) (model training detailed in nugent et al. ) using the viterbi algorithm (durbin et al. ).
briefly, debar’s phmm was trained using a curated set of , coi-5p barcode sequences obtained from the barcode of life data systems (bold: www.boldsystems.org) public database that were checked to ensure: (i) the sequence was > bp in length, (ii) taxonomy was known to a genus level, (iii) there were no missing base pairs, (iv) the amino acid sequence did not contain stop codons, and (v) bold’s internal check for contaminants was negative (nugent et al. ). the viterbi path produced through alignment of the sequence to the phmm is used to match the input sequence to the phmm (by finding the first set of consecutive match states, which indicate the absence of indels for the given base pairs). the read is then adjusted to account for detected insertions or deletions (figure ). three consecutive nucleotide insertions or deletions are permitted (not adjusted) as sequences of this kind are more likely to reflect true biological variants than technical errors (they do not result in reading frame shifts and may reflect an insertion or deletion of an amino acid in a functional protein-coding gene). the probability of such changes through sequencing error is relatively low (e.g. for the pacific biosciences sequel platform the baseline probability of three consecutive deletions would be . % (baseline delete probability) cubed, or . %).
the denoising of sequences with debar is controlled using a suite of parameters (figure ). the censorship parameter is the most important, as it controls the size of the masks (replacement of nucleotides with placeholder n characters) applied around sequence adjustments. this option is designed to prevent the introduction of errors that would be caused if the denoising process
deleted the wrong base pair or inserted a placeholder in the incorrect position. derivation of the default value for the censorship parameter is detailed in the methods and results sections. the package also enables the translation of denoised sequences to amino acids to confirm that denoised outputs conform to the expected properties of the protein-coding gene region. because debar can interface directly with fasta and fastq files, it enables file-to-file denoising in addition to denoising within an r programming environment. the default phmm used for denoising by debar represents the complete bp barcode region of coi. the package also permits the use of customized phmms provided by a user, which allows the denoiser to be applied to data from other gene regions or to be targeted to a specific user-defined subsection of the coi barcode. training of a phmm for a new barcode or gene is supported by the r package aphid (wilkinson ), while sub-setting of debar’s default phmm is enabled by the r package coil (nugent et al. ). details of the package’s components, together with a demonstration of its implementation, are available in the package’s vignette (supplementary file ).

quantification of package performance
simulated error data
the debar package was tested using a phylogenetically stratified random sample of publicly available coi-5p sequences with artificially introduced indels. this test was designed to assess the accuracy of sequence corrections and to obtain a quantitatively informed set of default parameters for the denoising process. a random sample of , animal coi-5p sequences (excluding those used in phmm model training) was obtained from bold and cleaned using the steps described in nugent et al. (methods section: bold data acquisition).
errors were introduced into each sequence in accordance with the statistical error profile of the pacific biosciences sequel, based upon the error profile for the coi barcode region in hebert et al. ( ). this profile indicated a baseline indel rate of . % (insertions and deletions equally likely), a baseline substitution rate of . %, and an elevated indel rate for long homopolymers (repeat length of , , and + with indel probabilities of . %, . %, and . %, respectively) (hebert et al. ). the location of all errors was recorded so that the accuracy of subsequent corrections could be evaluated. sequences were iteratively processed, and errors were limited to a single insertion or deletion error of one base pair in length (with the error introduction process being repeated for the original sequence when more than one indel occurred), which allowed the accuracy of corrections to be assessed without the need to consider interaction effects. the resultant sequences, each with one indel, were then denoised with debar (‘denoise’ function, using the parameter censor_length = ). the outputs of the denoise function were queried to determine the number and location of indel corrections applied by debar.
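the error-introduction step described above can be sketched as follows. this is a hedged illustration rather than the authors' code: the function names, the rate values and the homopolymer thresholds are hypothetical placeholders (the actual sequel error-profile values are not reproduced here), substitutions are omitted, and redrawing whenever the draw does not yield exactly one indel (including zero indels) is an assumption.

```python
import random


def run_length_at(seq, i):
    """Length of the homopolymer run that contains position i."""
    left = i
    while left > 0 and seq[left - 1] == seq[i]:
        left -= 1
    right = i
    while right + 1 < len(seq) and seq[right + 1] == seq[i]:
        right += 1
    return right - left + 1


def indel_probability(seq, i, base_rate, homopolymer_rates):
    """Per-position indel probability; long homopolymers get an elevated rate.

    homopolymer_rates maps a minimum run length to a rate (hypothetical values).
    """
    run = run_length_at(seq, i)
    reached = [min_len for min_len in homopolymer_rates if run >= min_len]
    return homopolymer_rates[max(reached)] if reached else base_rate


def introduce_single_indel(seq, rng, base_rate, homopolymer_rates, max_tries=10000):
    """Return (mutated_seq, position, kind) containing exactly one 1-bp indel."""
    for _ in range(max_tries):
        hits = [i for i in range(len(seq))
                if rng.random() < indel_probability(seq, i, base_rate, homopolymer_rates)]
        if len(hits) != 1:
            continue  # redraw, always starting again from the original sequence
        pos = hits[0]
        if rng.random() < 0.5:  # insertions and deletions equally likely
            return seq[:pos] + rng.choice("ACGT") + seq[pos:], pos, "insertion"
        return seq[:pos] + seq[pos + 1:], pos, "deletion"
    raise RuntimeError("no draw produced exactly one indel")


# Usage with made-up rates: 0.5% baseline, 2% inside runs of six or more bases.
rng = random.Random(1)
original = "ACGTTTTTTACGATCGGGGGGTACGATC" * 5
mutated, pos, kind = introduce_single_indel(original, rng, 0.005, {6: 0.02})
print(kind, pos, len(original), len(mutated))
```

recording `pos` and `kind` alongside each mutated sequence gives the ground truth against which later corrections can be scored.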
this information was compared to the recorded ground truth error locations to quantify the following: (1) the frequency with which debar located and exactly corrected indels, (2) the miss distance (number of nucleotide positions) between introduced errors and corrections applied in instances where debar did not correct the indel errors in exactly the correct position, and (3) the frequency at which debar applied an incorrect number of sequence corrections (i.e. 0 corrections or 2+ corrections). if one correction was made and the distance between the correction and true indel position was 0, then the correction was considered accurate. corrections were also considered accurate if all base pairs between the correction location and the true indel position were the same (e.g. if one t in the homopolymer "ttttt" was an insertion but a different t in the same run was removed by debar, this is functionally an exact correction as the true sequence is restored). all other corrections at inexact positions were considered inaccurate, and the distance (number of positions) between the correction and true indel location was recorded. the mean and standard deviation of the miss distance were determined and used to select the default censor_length parameter for the debar package, equal to the mean miss distance plus standard deviations (censor_length = ceiling(μ_miss_distance + ( × σ_miss_distance))). this value was selected as it would be expected to avoid the introduction of an error for > % of inexact corrections.
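the scoring rule and the derivation of the mask size can be sketched in a few lines. this is an illustrative sketch only: the helper names are hypothetical, the multiplier on the standard deviation is left as the parameter `n_sd` because its value is not given in the text above, the use of the sample standard deviation is an assumption, and the symmetric mask that also covers the corrected position is an assumption about how a censorship mask might be applied rather than a description of debar's internals.

```python
import math
import statistics


def is_functionally_exact(seq, true_pos, corrected_pos):
    """True if every base between the two positions (inclusive) is identical.

    In that case editing either position yields the same sequence, so the
    correction restores the true sequence even though the index differs.
    """
    lo, hi = sorted((true_pos, corrected_pos))
    return len(set(seq[lo:hi + 1])) == 1


def default_censor_length(miss_distances, n_sd):
    """ceiling(mean + n_sd * sd) over the miss distances of inexact corrections."""
    mean = statistics.mean(miss_distances)
    sd = statistics.stdev(miss_distances)  # sample standard deviation (assumption)
    return math.ceil(mean + n_sd * sd)


def apply_mask(seq, position, censor_length):
    """Replace bases within censor_length of position with placeholder N."""
    start = max(0, position - censor_length)
    end = min(len(seq), position + censor_length + 1)
    return seq[:start] + "N" * (end - start) + seq[end:]


# Usage with made-up numbers.
print(is_functionally_exact("ACTTTTTGA", 2, 6))        # both positions lie in the T run
print(default_censor_length([1, 2, 3, 4, 5], n_sd=2))  # ceiling(3 + 2 * 1.58...)
print(apply_mask("ACGTACGTAC", 4, 2))
```

a larger mask hides more true bases but makes it less likely that a slightly misplaced correction leaves an incorrect base unmasked, which is the trade-off the default value is meant to balance.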
Sequences where no corrections or multiple corrections were made had their outputs inspected further to determine whether other parts of the denoising pipeline (e.g. the check for stop codons in the translated amino acid sequence or the trimming of sequence edges in the framing process) removed the error or led to the complete rejection of the sequence.

False correction rate

The performance of debar on sequences with no indel errors was also quantified to determine the frequency and cause of erroneous corrections applied to cleaned, publicly available COI- P barcode sequences with no known technical errors. A random sample of , sequences was obtained from all the animal COI- P barcode sequences available on BOLD (Supplementary File ), meeting the following criteria: 1) the barcode was publicly available on the BOLD database, 2) the barcode was > bp in length, 3) the barcode did not contain missing characters (“N”) in the Folmer region, 4) the corresponding amino acid sequence did not contain stop codons, 5) the result of BOLD’s internal check for contaminants was negative, and 6) the sequence was not used in PHMM training or in the simulated error dataset. Sequences were processed using debar’s denoise function (censor_length = ). All sequences that had corrections applied, or that were flagged for rejection, were counted and examined in detail to search for evidence of the proximal cause of the false correction. To search for evidence of taxonomic bias, the taxonomy associated with all falsely corrected sequences was tallied at the order level and manually examined for evidence of bias.
Denoising PacBio Sequel data

We quantified the performance of debar on raw DNA barcode sequence data by interfacing with the existing mBRAVE workflow (http://www.mbrave.net) used to process DNA barcode circular consensus sequences (CCS) obtained with the Sequel platform. A custom analysis pipeline (Supplementary File ) was constructed to analyze and denoise the final set of CCS barcodes produced by the mBRAVE workflow (one CCS per OTU) (Figure ). The pipeline was designed to search the final barcodes produced by mBRAVE for evidence of indel errors (by considering the translated amino acid sequence with the R package coil (Nugent et al. )), denoise all the associated CCS with detected errors using the debar package, and then regenerate a consensus barcode sequence using the denoised data to produce a final, denoised barcode sequence for each specimen (Figure ). The outputs of this analysis were examined to determine whether the debar pipeline decreased the number of technical errors in the barcode sequences and whether those barcode sequences yielded likely amino acid sequences when translated. Initial quantification of the improvement was conducted by comparing the number of barcode sequences whose amino acid sequences were flagged by the R package coil (Nugent et al. , default parameters) before and after denoising. Barcodes are flagged by coil when they possess a stop codon when translated to amino acids or when the resultant amino acid sequence is improbable, both indicating that the sequence likely possesses an indel error.
Since the coil and debar packages both employ the same nucleotide profile hidden Markov model (coil also utilizes an amino acid PHMM), an independent test of pipeline effectiveness was also conducted. The effectiveness of the denoising pipeline was quantified by submitting both the original and denoised barcode sequences to BOLD. BOLD was used to determine the number of original and denoised barcodes with evidence of stop codons after aligning the sequences using BOLD’s hidden Markov model (a model developed independently of the debar PHMM) and translating the sequence using the translation table corresponding to the taxonomic information accompanying the sequence record. Comparison of these numbers made it possible to quantify the increase in barcode-compliant sequences (i.e. those with no stop codon) produced by debar. Additionally, the sequence quality report on BOLD was examined to determine the number of unknown nucleotides (“N”) in the barcode sequences after denoising. The report categorizes barcode quality as: high (< % Ns), medium (< % Ns), low (< % Ns), or unreliable (> % Ns), and the number of barcodes in these different categories was recorded.

Denoising metabarcode data

To characterize debar’s performance on metabarcode data, we analyzed a metabarcode dataset for a mock arthropod community (Braukmann et al. ). These data derived from a single sequencing run on an Ion Torrent S on COI amplicons generated from pooled DNA extracts from abdomens from single specimens of arthropod species (methods described in detail in Braukmann et al. ). Sequences were from a bp fragment of the COI barcode region targeted using the primers MLepF and LepR (Hebert et al. ; Braukmann et al. ). Following amplification and sequencing on the Ion S , quality control, sequence dereplication,
chimeric read filtering, matching to reference sequences, and clustering were performed on mBRAVE (Braukmann et al. ). Two sets of data resulted from this process: a set of , unique sequences that were assigned to different Barcode Index Numbers (BINs) (Ratnasingham and Hebert ) through comparison to reference sequences (matched at > % similarity), and a set of , unique sequences not matching available references that were clustered into an additional , OTUs at a % similarity threshold (using the clustering algorithm described in Braukmann et al. ). All sequences were denoised using debar’s denoise_list function and a custom nucleotide PHMM. The custom PHMM was a bp subset of the complete COI PHMM (PHMM profile positions – ), corresponding to a segment of the Folmer region (Folmer et al. ) targeted by the metabarcoding primers. The PHMM was created using coil’s ‘subsetphmm’ function (Nugent et al. ). After denoising, two tests were conducted to determine whether denoising improved the quality of the metabarcode pipeline’s output data. First, for each BIN and OTU, consensus sequences were generated using the denoised sequences and the debar function ‘consensus_sequence’. These consensus sequences were assessed for evidence of stop codons using coil and the same custom PHMMs used in denoising (function coi p_pipe with the additional parameter: trans_table = ). This test revealed the number of denoised consensus sequences that contained a stop codon when translated to amino acids, indicating that an indel error persisted in the nucleotide sequence.
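The logic behind using stop codons as evidence of a persisting indel can be shown with a short Python sketch. It does not reproduce coil, which establishes the reading frame with a PHMM and also scores amino acid likelihood; here the frame is supplied by the caller. The function name is hypothetical, and the invertebrate mitochondrial code (NCBI translation table 5, whose stop codons are TAA and TAG) is an assumed choice for illustration.

```python
# Stop codons of NCBI translation table 5 (invertebrate mitochondrial).
# The table used in the study is not restated here; this is an assumption.
STOP_CODONS_INVERT_MITO = frozenset({"TAA", "TAG"})


def has_stop_codon(seq, frame=0, stop_codons=STOP_CODONS_INVERT_MITO):
    """True if any complete codon read from offset `frame` is a stop codon."""
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in stop_codons:
            return True
    return False


in_frame = "ATGCTAAGGCTT"              # codons: ATG CTA AGG CTT
with_deletion = "ATG" + in_frame[4:]   # one base lost: ATG TAA GGC ...

print(has_stop_codon(in_frame))        # no stop codon in the correct frame
print(has_stop_codon(with_deletion))   # the frameshift exposes TAA
```

A single-base deletion shifts every downstream codon, so a stop codon that was previously out of frame is read in frame, which is the signal the assessment relies on.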
The centroid sequences for the BINs and OTUs were used as a baseline metric for the number of barcode-compliant sequences. For each BIN, centroid sequences were obtained by clustering the sequences in the group using the R package kmer’s ‘otu’ function (parameters: k = , threshold = . ) (Wilkinson , version . . ). For the OTUs, centroids were obtained from data generated by mBRAVE. All centroids were assessed with coil (Nugent et al. , version . ), and the number of barcode-compliant representative sequences for the original centroids and the final consensus sequences was compared. Second, the individual sequences within each BIN and OTU were analyzed with coil to determine the number that were likely error free, as evidenced by the absence of stop codons after translation. This assessment was repeated on the denoised reads to determine the effectiveness of debar in correcting errors in individual sequences and to reveal whether the denoising process improved the resolution of ESVs for subsequent analysis of intra-species genetic variation by placing the ESVs in reading frame and reducing the frequency of identified indel errors.

Results

Quantification of package performance

Simulated error data

debar was used to correct , barcodes, each with a single indel error (Supplementary File ). The denoised sequences and associated data were compared to the ground truth error locations to determine the accuracy of corrections applied by debar (Figure ). For , sequences ( . 
%), a single correction was applied by debar, indicating that the package correctly identified the type of error in these sequences. However, debar either failed to recognize an indel or made too many corrections ( +) in the other sequences. No correction was made for most ( ) of these sequences, meaning that debar’s PHMM did not identify the indel error. The overlooked indels were largely restricted to the terminal regions of the sequence; % ( / ) of them were positioned within base pairs of the read termini (Figure ), regions that comprised only % ( bp/ bp) of the sequences. This occurs because the debar denoising algorithm uses the first observation of consecutive bp matching the PHMM to establish the corrective window. Errors on the periphery of sequences therefore lead to trimming of the sequence (via the keep_flanks parameter) instead of indel correction. A substantial fraction of the remaining uncorrected indel errors ( ) occurred between positions to (Figure ), a region associated with a bp indel present in some animal groups and absent in others. Its presence reduced the PHMM’s indel detection ability in this region due to greater true variability. Not all unidentified indels were retained in the final output sequences, as debar’s double checks (employing the keep_flanks and aa_check parameters) identified many ( / – %) of the uncorrected sequences and either omitted the problem region or flagged the sequence as likely to contain an error. debar’s double checks therefore allow many false negatives to be trimmed or flagged as problematic. For sequences ( . 
%), two or more corrections were applied by debar when only a single indel existed (Figure ). In contrast to the false negatives, debar’s double checks captured only three of the false positives. Many of the false corrections appeared to be caused by indels near codons that are not present in all animals. Due to true biological variation in the training data, these regions of the PHMM have higher probabilities of transitioning from a match state to an insert or delete state, and indels in these locations are therefore sometimes handled incorrectly (e.g. the sequence is characterized as having two deleted base pairs when there was a bp insertion). Because false corrections of this type result in sequences that conform to the structure of the protein-coding gene region (i.e. a lack of stop codons in the amino acid sequence), they are not identified by debar’s aa_check function.

The , sequences for which the presence of a single indel was correctly identified were further analyzed to determine how accurately the indels were located (Figure ). The analysis showed that debar exactly located and corrected , ( . % of sequences in the single correction category) of the indel errors in the dataset. For the other , sequences ( . % of the single correction category), the indel corrections were not placed in exactly the correct position (Figure ). For these sequences, the average distance between the true indel location and the applied correction was . base pairs (standard deviation = . ).
These results were used to select a default censorship value for debar to ensure that inexactly identified indel errors are masked in most sequences (Figure ). A default censorship length of (the average miss distance plus two times the standard deviation, rounded up) was selected in order to mask the true error in > % of instances where inexact corrections were applied. This successfully denoises sequences, albeit with some associated loss of information, which can be overcome by building a consensus sequence when multiple reads are available for an individual. Overall, denoising of the , barcodes with the default censorship parameter (censor_length = ) resulted in , / , ( . %) of sequences with errors being successfully denoised. The additional double check parameters (aa_check = TRUE, keep_flanks = FALSE) captured, but did not correct, ( . %) errors. The debar package thereby corrected or removed . % of sequences with indel errors (Figure ).

False correction rate

A set of , barcode sequences with no known indel errors was analyzed with debar to determine the incidence of erroneous corrections. Nearly all sequences ( . %) were neither altered nor flagged as erroneous. Nine sequences were erroneously corrected, and none were flagged for rejection. These sequences included a single sequence from each of five orders and four sequences from the order Diptera (flies). Interestingly, the four Diptera sequences that were incorrectly altered all belonged to the same genus: Culicoides.
They represented / of all sequences from the family Ceratopogonidae that were in the dataset, indicating that the performance issue was isolated to this single genus. These results indicate that debar deals well with variation in COI sequences across most of the animal kingdom, but that it displays some taxonomic bias in performance. This is a limitation of debar, as any genus with a COI profile that systematically deviates from the COI PHMM used in debar will be erroneously denoised. The benefit of the conservative censorship approach used in the package is that although these reads are erroneously adjusted, the corrections made are masked by Ns, and the entire sequence is not rejected. Rather, only a small section of the sequence is lost, as if it contained an indel error. Most of any falsely corrected sequence can thereby be recovered, and in most instances this would be sufficient to identify the associated taxonomy and inform biological conclusions.

Denoising PacBio Sequel data

We applied debar in the analysis of real DNA barcode data by developing a processing pipeline (Figure ; hereafter ‘the debar pipeline’) and compared the amount of technical noise in the barcodes before and after processing. A set of , consensus barcode sequences derived from processing data from four Sequel runs was obtained from mBRAVE and re-processed with the debar pipeline (Table ). Analysis of the consensus barcodes with coil (step ii. of the debar pipeline) flagged , ( . 
% of total) of consensus sequences due to the detection of a stop codon in the translated sequence or due to the presence of an unexpected amino acid (log likelihood score below the default threshold). The large number of flagged sequences likely includes false positives (sequences that lack indel errors but were flagged by coil because the reading frame was established incorrectly). In fact, , sequences ( . % of total, . % of flagged sequences) were flagged due to the presence of a stop codon, and , of them ( . % of total, . % of flagged sequences) contained a stop codon in all three forward reading frames, providing extremely strong evidence of an indel error (i.e. a low likelihood of being a false positive). After denoising, the output sequences were again assessed with coil (step viii. of the debar pipeline), and this analysis revealed that debar had corrected many indel errors (Table , Table ). Only , ( . %) of the final barcode sequences were flagged by coil’s coi p_pipe function, suggesting that . % ( , ) of the flagged sequences were successfully denoised. When comparison was restricted to the , sequences with stop codons, only were still flagged as containing stop codons, indicating that . % ( , / , ) of the sequences in this subcategory were effectively denoised. A more conservative estimate of correction success was provided by the subset of flagged sequences with stop codons in all reading frames. Of these sequences, / ( . %) passed the coil check following denoising, suggesting the successful correction of an indel error and improved representation of the true sequence. External quantification of the debar pipeline’s denoising ability was obtained by submitting pre- and post-pipeline barcode sequences to BOLD (http://www.boldsystems.org). The sample size for this test was smaller, as BOLD requires taxonomic designations and this information was provided by mBRAVE for only , sequences. The total number of original
sequences flagged by BOLD due to its detection of a stop codon was , ( . %), a considerably lower frequency than reported by coil on the initial pipeline inputs. Of the , sequences with initial evidence of stop codons, were rejected outright by the debar pipeline, were flagged but not successfully corrected, were unflagged and not corrected, and , had no evidence of errors following denoising (Table ). Based on this assessment with BOLD, the debar pipeline produced a % reduction in the number of errors in the dataset, from . % ( , ) to . % ( ). Of the remaining errors, the majority ( ) were detected as problematic and flagged as erroneous by debar. As a consequence, the debar pipeline reduced the number of unidentified errors by > % (from , to ) in the barcode dataset (Table ). Denoising of the barcodes with the debar pipeline did not result in sequences with large amounts of missing information. Of the , output barcodes, , were high quality (< % Ns), were medium quality (< % Ns), were low quality (< % Ns), and were unreliable (> % Ns). There was a strong negative relationship between the number of CCS available for a sample and the amount of missing information in the final barcode sequence (Figure ).

Denoising metabarcode data

Consensus sequence quality

Metabarcode data from a mock arthropod community were also denoised, and the original sequences were compared to the denoised consensus sequences to determine whether debar improved sequence quality (Table ). Of the original centroid sequences for the BINs, / ( . %) contained evidence of indel errors when analyzed with coil.
Following denoising and consensus sequence generation via debar, the number of barcode-compliant outputs was considerably higher, with only / ( . %) displaying evidence of indel errors. Four BINs had all their component sequences rejected by debar, so no consensus sequences were generated for them. The rate of apparent indel errors was higher in the centroids of the OTUs; ( %) displayed evidence of a stop codon when analyzed with coil, suggesting the presence of indels in more than half of the sequences representing the OTUs. The consensus sequences produced through denoising and consensus sequence generation with debar were of apparently higher quality, as only ( . %) displayed evidence of a stop codon when analyzed with coil. An additional OTUs ( . %) failed to produce a valid consensus sequence after denoising because all their component sequences were rejected by debar. The corrections did cause some loss of information; / ( . %) of the consensus sequences for the BIN groups contained at least one ‘N’ due to ambiguous or censored base pairs in their component reads, and / ( . %) of the OTU consensus sequences contained at least one ‘N’. The number of ‘N’s per sequence was generally low for the BINs (median = ; sequences with or more ‘N’s) but was higher for the OTUs (median number of ‘N’s = ), indicating that there was on average one correction per OTU (correction of an indel plus the seven bp mask in either direction results in (insertion) or (deletion) consecutive ‘N’s).
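The masking arithmetic in the last sentence can be made concrete with a small Python sketch. The function below is hypothetical and is not debar's implementation: it assumes that a corrected insertion is represented by removal of the extra base, that a corrected deletion is represented by an ‘N’ placeholder already present in the sequence, and that the mask is truncated at the sequence ends.

```python
def censor_around(seq, pos, censor_length=7, placeholder=False):
    """Mask censor_length bases on each side of a correction with 'N'.

    placeholder=False: an inserted base was removed, so the correction is
    the junction just before index pos.
    placeholder=True: a deleted base was filled with an 'N' at index pos,
    and that placeholder is included in the masked run.
    """
    start = pos - censor_length
    end = pos + censor_length + (1 if placeholder else 0)
    start, end = max(start, 0), min(end, len(seq))
    return seq[:start] + "N" * (end - start) + seq[end:]


read = "ACGT" * 10                                  # toy 40 bp read
after_insertion_fix = censor_around(read, 20)
after_deletion_fix = censor_around(read[:20] + "N" + read[20:], 20,
                                   placeholder=True)
print(after_insertion_fix.count("N"), after_deletion_fix.count("N"))
```

Under these assumptions a correction far from the read ends masks twice the censor length for a removed insertion and one position more for a filled deletion; near the ends the mask is shorter.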
There was a positive relationship between the number of sequences within an OTU and the completeness of information in the final consensus sequence.

ESV data quality

Data analysis on mBRAVE revealed BINs represented by , unique dereplicated reads as well as OTUs lacking taxonomic assignment that were represented by unique sequence reads. When the original sequences were checked with coil, it indicated that , / , ( . %) of BIN sequences and / ( . %) of the OTU sequences displayed strong evidence of an indel error, as they contained a stop codon when translated. By contrast, following denoising with debar the incidence of stop codons was far lower, as just / , ( . %) of the BIN sequences and / , ( . %) of the OTU sequences had evidence of indels. This result indicated that denoising of individual sequences reduced the incidence of apparent indel errors by over % for the BINs ( , fewer indel errors) and by % for the OTUs ( fewer indel errors). Most sequences were subjected to at least one indel correction by debar, with , / , ( . %) of the final BIN sequences and / ( . %) of the final OTU sequences containing at least one ‘N’ character. Low-abundance OTUs in the data set represented by biologically valid sequences need not be discarded solely due to their low abundance and could be further inspected for putative evidence of rare community members.
Discussion

This manuscript introduces debar, a PHMM-based denoiser, and demonstrates how it can improve the quality of sequence data used for both DNA barcode library construction and metabarcode studies by correcting indels introduced by sequencing error. We first evaluated its effectiveness through an in silico study that tested its capacity to recognize and repair reference barcodes with artificially introduced indels. debar was shown to be effective, as it corrected > . % of the errors and applied erroneous adjustments to less than . % of correct sequences. This strong performance extended to real-world data sets. debar reduced the rate of frameshift indels by % in sequence records generated by the long-read Sequel platform, generating more barcode-compliant sequences, most with little or no missing information. debar also improved the quality of metabarcode data generated by the Ion S , allowing ESVs to be considered with higher confidence and higher-quality representative sequences to be recovered for OTUs.

Denoising sequences with artificial errors and known ground truths showed that the corrections performed by debar were imperfect, with the exact indel location being identified only . % of the time. The application of a default bp censorship on both sides of putative indel corrections proved to be an effective means of masking most errors, improving the denoiser’s error removal rate to > . %. This high error removal rate involves a tradeoff, as sequence adjustments are accompanied by a loss of base pairs of information. This information loss is an acceptable cost, as it ensures that all remaining base pairs can be considered with high confidence.
The nature of high-throughput sequence data, namely that multiple sequencing reads are usually available for a given specimen, can help mitigate the loss of information. Corrected sequences from a specimen or OTU can be used in conjunction with one another, filling in the different censored locations and overcoming the loss of information. The censorship of bases adjacent to indel corrections is an optional parameter that users may alter to suit their needs. Smaller censorship values, or no censorship at all, would result in less loss of information per sequence, but at the cost of more errors remaining in the final data.

Denoising of real DNA barcode data obtained from sequencing of specimens on the Pacific Biosciences Sequel platform resulted in higher-quality output sequences. An exact metric quantifying the improvement is, however, difficult to state with certainty, as the ground truth of the sequences is not known. The independent test of the sequences, through submission of consensus sequences to BOLD before and after denoising, provided a conservative estimate of the debar package’s effectiveness. This test showed a % reduction in the number of barcode sequences with technical indel errors after application of the debar pipeline and a low false negative rate ( unidentified errors out of , total putative errors). This is an important improvement because the Pacific Biosciences Sequel platform is used at the Centre for Biodiversity Genomics to produce high-quality reference barcodes for the barcoding research community (Hebert et al. ).
Accuracy of these sequences is therefore important; the debar package is shown to improve sequence quality, yielding more biologically likely and therefore more reliable outputs. The generation of barcode sequences is also made more efficient: by increasing the rate of barcode-compliant outputs from . % to %, fewer samples require reprocessing or resequencing.

Understanding within-species patterns of genetic diversity is essential for characterizing community health. High intra-species genetic diversity is assumed to indicate healthy ecosystems, composed of large and stable populations with the standing genetic variation needed to survive environmental stressors (Zizka et al. ). The characterization of ESVs within OTUs can provide intra-species diversity measures for member species of a community (Frøslev et al. ). The initial check of the sub-OTU sequence data from the mock community sequenced with Ion Torrent revealed a high rate of putative indel errors ( % of sequences), which would lead to a gross overestimation of the number of ESVs within the OTUs. The reduction of the error rate after denoising with debar allows for a more accurate examination of intra-OTU ESVs and therefore for more accurate assessments of intra-species diversity and community health, despite the fact that debar is not capable of eliminating non-indel errors from sequences.
Even with the improvements to ESV quality made by debar, intra-species diversity estimates will likely remain inflated to some extent, as the sequence-by-sequence corrections applied by debar account exclusively for indel errors, while substitution errors could persist within the data.

We have demonstrated that debar is an effective means of reducing technical errors in DNA barcode and metabarcode data, but the package is not without limitations. The package is designed to correct insertion and deletion errors, but these are not the only technical issues that can lead to inflated biodiversity estimates. The program is not an effective means of identifying or correcting chimeric sequences or non-animal COI biological contaminants, and should these exist within an input data set they are likely to go uncorrected. Additionally, debar cannot correct substitution errors on a sequence-by-sequence basis. Because of indel correction, denoised sequences are aligned, and nucleotide positions become directly comparable across different sequences from a given specimen or OTU. Random point substitution errors can thereby be corrected in consensus sequence generation through the ‘majority rule’ approach debar uses in base calling. However, if systematic errors exist (i.e. most sequences possess the same substitution), if few sequences are available for consensus sequence generation, or if ESVs are being examined, then substitution errors may persist in the data. An additional source of error unaccounted for by debar is contaminant sequences.
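How a majority-rule consensus can both fill censored positions and outvote random substitutions is illustrated by the Python sketch below. It is not debar's ‘consensus_sequence’ function: the name majority_rule_consensus is hypothetical, the reads are assumed to be already aligned and of equal length, censored positions are assumed to carry no vote, and ties are assumed to resolve to ‘N’.

```python
from collections import Counter


def majority_rule_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, aligned reads.

    'N' characters (censored or unknown bases) are ignored; a column with
    no informative base, or with a tie for first place, is reported as 'N'.
    """
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "N")
        if not counts:
            consensus.append("N")
            continue
        (top, n_top), *rest = counts.most_common(2)
        tie = bool(rest) and rest[0][1] == n_top
        consensus.append("N" if tie else top)
    return "".join(consensus)


reads = [
    "ACGTNNNA",  # censored near one correction
    "ACNNGTCA",  # censored near a different correction
    "ACGTGTCA",
    "ATGTGTCA",  # carries a single substitution at the second position
]
print(majority_rule_consensus(reads))
```

In this toy example every censored position is recovered from the other reads and the lone substitution is outvoted; a substitution shared by most reads would survive, which is the systematic-error caveat raised in the text.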
it has been demonstrated previously that the phmm utilized in debar is not an effective means of separating animal barcode sequences from off-target barcodes derived from bacteria, plants, fungi, or other origins (nugent et al. ). taken together, these limitations show that debar cannot single-handedly address the technical challenges associated with dna barcoding. the tool is likely most effective when applied in conjunction with existing barcode and metabarcode workflows, and it improves the quality of final sequences if the inputs have been filtered based on quality, had primers removed, and been cleaned of chimeric and contaminant sequences.
the sequence-by-sequence denoising approach of debar makes it a flexible tool capable of integrating into analysis pipelines for sequencing data from various sources. application of debar in tandem with conventional, clustering-based denoising tools would likely lead to the highest quality assessment of biodiversity. following otu generation with other tools, using debar to denoise all reads within a given otu prior to consensus sequence generation would maximize accuracy of the consensus sequence while conforming to the conserved structure of the coi barcode region. the removal of intra-otu noise can also improve the accuracy of alpha-diversity estimates. additionally, applying debar to rare, low-abundance sequences not present in the otus would allow these data to be examined with higher confidence, revealing biological insights that would be overlooked in conventional workflows.
the phmm denoising technique used by debar is an effective barcode-focused framework that can be extended to fit a variety of needs. data from only two sequencing platforms were tested in this study: the pacific biosciences sequel and thermo iontorrent s . since the phmm used in debar is barcode specific and not sequencer specific, debar can be applied in denoising of barcode data obtained from any sequencing platform. however, the effectiveness of the denoiser will depend on the types and rates of technical errors associated with a given platform. when applied to data from sequencers such as the illumina miseq, the rate of technical errors corrected by debar will be lower, as this platform is more prone to introducing substitution errors than indel errors (schirmer et al. ). although the debar package contains a phmm for only the common animal barcode coi, the denoising algorithm can in the future be extended and applied in the correction of data for other dna barcodes with conserved structures.
conclusion
this study has described debar, an r package for denoising dna barcode data, and demonstrated its ability to correct indels caused by instrument error in both barcode and metabarcode sequences. in each dataset, debar improved sequence quality. it reduced the apparent number of indels by % in data generated by sequel, increasing the proportion of sequences that met the quality standards required to qualify as a reference barcode. the merits of debar for metabarcode analysis were twofold: it allowed more likely consensus sequences to be obtained for otus, and intra-otu variation to be quantified with higher confidence. overall, debar is a robust utility for identifying deviations from the highly conserved protein-coding sequence of the coi barcode region. corrections informed by its use improve the separation of true biological variation from technical noise, with low frequencies of false corrections.
integration of debar into the workflows for processing barcode and metabarcode data will allow biological variation to be characterized with higher accuracy.
acknowledgements
this research was supported by grants from genome canada through ontario genomics and from the ontario ministry of economic development, job creation and trade. the funders played no role in study design or decision to publish. this research was enabled in part by resources provided by compute canada (www.computecanada.ca). we thank tony kuo and thomas braukmann for aid with data acquisition and interpretation and tony for helpful comments on the manuscript.
references
amir, a., mcdonald, d., navas-molina, j. a., kopylova, e., morton, j. t., xu, z. z., ... & knight, r. ( ).
deblur rapidly resolves single-nucleotide community sequence patterns. msystems, ( ). baynham-herd, z., amano, t., sutherland, w. j., & donald, p. f. ( ). governance explains variation in national responses to the biodiversity crisis. environmental conservation, ( ), - . braukmann, t. w., ivanova, n. v., prosser, s. w., elbrecht, v., steinke, d., ratnasingham, s., ... & hebert, p. d. n. ( ). metabarcoding a diverse arthropod mock community. molecular ecology resources, ( ), - . brown e.a., chain, f. j., zhan, a., macisaac, h. j., & cristescu, m. e. ( ). early detection of aquatic invaders using metabarcoding reveals a high number of non‐indigenous species in canadian ports. diversity and distributions, ( ), - . callahan, b. j., mcmurdie, p. j., rosen, m. j., han, a. w., johnson, a. j. a., & holmes, s. p. ( ). dada : high-resolution sample inference from illumina amplicon data. nature methods, ( ), . clare, e. l., chain, f. j., littlefair, j. e., & cristescu, m. e. ( ). the effects of parameter choice on defining molecular operational taxonomic units and resulting ecological analyses of metabarcoding data. genome, ( ), - . cordier, t., lanzén, a., apothéloz-perret-gentil, l., stoeck, t., & pawlowski, j. ( ). embracing environmental genomics and machine learning for routine biomonitoring. trends in microbiology, ( ), - . cristescu, m. e. ( ). from barcoding single individuals to metabarcoding biological communities: towards an integrative approach to the study of global biodiversity. trends in ecology & evolution, ( ), - . delabye, s., rougerie, r., bayendi, s., andeime-eyene, m., zakharov, e. v., dewaard, j. r., ... & mavoungou, j. f. ( ). characterization and comparison of poorly known moth communities through dna barcoding in two afrotropical environments in gabon. genome, ( ), - . durbin, r., eddy, s. r., krogh, a., & mitchison, g. ( ). biological sequence analysis: probabilistic models of proteins and nucleic acids. cambridge university press. driscoll, d. 
a., bland, l. m., bryan, b. a., newsome, t. m., nicholson, e., ritchie, e. g., & doherty, t. s. ( ). a biodiversity-crisis hierarchy to evaluate and refine conservation indicators. nature ecology & evolution, ( ), - . eddy, s. r. ( ). profile hidden markov models. bioinformatics (oxford, england), ( ), - . eddy, s. r. ( ). a new generation of homology search tools based on probabilistic inference. in genome informatics : genome informatics series vol. (pp. - ). edgar, r. c. ( ). unoise : improved error-correction for illumina s and its amplicon sequencing. biorxiv, elbrecht, v., vamos, e. e., steinke, d., & leese, f. ( ). estimating intraspecific genetic diversity from community dna metabarcoding data. peerj, , e . folmer, o., black m., hoeh w., lutz r, vrijenhoek, r. ( ). dna primers for amplification of mitochondrial cytochrome c oxidase subunit i from diverse metazoan invertebrates. mol mar biol biotechnol, ( ), - . frøslev, t. g., kjøller, r., bruun, h. h., ejrnæs, r., brunbjerg, a. k., pietroni, c., & hansen, a. j. ( ). algorithm for post-clustering curation of dna amplicon data yields reliable biodiversity estimates. nature communications, ( ), - . hajibabaei, m., spall, j. l., shokralla, s., & van konynenburg, s. ( ). assessing biodiversity of a freshwater benthic macroinvertebrate community through non-destructive environmental barcoding of dna from preservative ethanol. bmc ecology, ( ), . hajibabaei, m., baird, d. j., fahner, n. a., beiko, r., & golding, g. b. ( ).
a new way to contemplate darwin’s tangled bank: how dna barcodes are reconnecting biodiversity science and biomonitoring. philosophical transactions of the royal society b: biological sciences, ( ), . hebert, p. d. n., cywinska, a., ball, s. l., & dewaard, j. r. ( ). biological identifications through dna barcodes. proceedings of the royal society of london. series b: biological sciences, ( ), - . hebert, p. d. n., ratnasingham, s., zakharov, e. v., telfer, a. c., levesque-beaudin, v., milton, m. a., ... & dewaard, j. r. ( ). counting animal species with dna barcodes: canadian insects. philosophical transactions of the royal society b: biological sciences, ( ), . hebert, p. d. n., braukmann, t. w., prosser, s. w., ratnasingham, s., dewaard, j. r., ivanova, n. v., ... & zakharov, e. v. ( ). a sequel to sanger: amplicon sequencing that scales. bmc genomics, ( ), . hubert, n., & hanner, r. ( ). dna barcoding, species delineation and taxonomy: a historical perspective. dna barcodes, ( ), - . kaunisto, k. m., roslin, t., sääksjärvi, i. e., & vesterinen, e. j. ( ). pellets of proof: first glimpse of the dietary composition of adult odonates as revealed by metabarcoding of feces. ecology and evolution, ( ), - . kumar, v., vollbrecht, t., chernyshev, m., mohan, s., hanst, b., bavafa, n., ... & golden, m. ( ). long-read amplicon denoising. nucleic acids research, ( ), e -e . lopez-vaamonde, c., sire, l., rasmussen, b., rougerie, r., wieser, c., allaoui, a. a., ... & lees, d. c. ( ). dna barcodes reveal deeply neglected diversity and numerous invasions of micromoths in madagascar.
genome, ( ), - . nearing, j. t., douglas, g. m., comeau, a. m., & langille, m. g. ( ). denoising the denoisers: an independent evaluation of microbiome sequence error-correction approaches. peerj, , e . nugent, c. m., elliott, t. a., ratnasingham, s., & adamowicz, s. j. ( ). coil: an r package for cytochrome c oxidase i (coi) dna barcode data cleaning, translation, and error evaluation. genome. ( ): - . ratnasingham, s., & hebert, p. d. n. ( ). a dna-based registry for all animal species: the barcode index number (bin) system. plos one, ( ). rosen, g., garbarine, e., caseiro, d., polikar, r., & sokhansanj, b. ( ). metagenome fragment classification using 𝑁-mer frequency profiles. advances in bioinformatics, . schirmer, m., ijaz, u. z., d’amore, r., hall, n., sloan, w. t., & quince, c. ( ). insight into biases and sequencing errors for amplicon sequencing with the illumina miseq platform. nucleic acids research, ( ), e -e . sogin, m. l., morrison, h. g., huber, j. a., welch, d. m., huse, s. m., neal, p. r., … & herndl, g. j. ( ). microbial diversity in the deep sea and the underexplored “rare biosphere”. proceedings of the national academy of sciences, ( ), - . stat, m., huggett, m. j., bernasconi, r., dibattista, j. d., berry, t. e., newman, s. j., ... & bunce, m. ( ). ecosystem biomonitoring with edna: metabarcoding across the tree of life in a tropical marine environment. scientific reports, ( ), - . taberlet, p., coissac, e., hajibabaei, m., & rieseberg, l. h. ( ). environmental dna. molecular ecology, ( ), - . wilkinson sp. ( ) kmer: an r package for fast alignment-free clustering of biological sequences. r package version . . . https://cran.r-project.org/package=kmer wilkinson, s. p. ( ). aphid: an r package for analysis with profile hidden markov models. bioinformatics, ( ), - .
wilson, j. j., brandon-mong, g. j., gan, h. m., & sing, k. w. ( ). high-throughput terrestrial biodiversity assessments: mitochondrial metabarcoding, metagenomics or metatranscriptomics?. mitochondrial dna part a, ( ), - . wirta, h. k., hebert, p. d. n., kaartinen, r., prosser, s. w., várkonyi, g., & roslin, t. ( ). complementary molecular information changes our perception of food web structure. proceedings of the national academy of sciences, ( ), - . zizka, v. m., weiss, m., & leese, f. ( ). can metabarcoding resolve intraspecific genetic diversity changes to environmental stressors? a test case using river macrozoobenthos. biorxiv.
data accessibility statement
dna barcode sequences used in training of the profile hidden markov models are available in the supplementary data of the following paper: https://doi.org/ . /gen- - . dna barcode sequences used in model testing are available in this manuscript’s supplementary files. the r source code for the debar package is available on github: https://github.com/cnuge/debar. additional data and code are available on request from the authors.
author contributions
the study was conceived and designed by sja, pdnh, sr, and cmn.
the programming of the debar package was performed by cmn. analyses of package performance were performed by cmn with resources, design, and other assistance provided by tae, sr, and sja. the initial draft of the manuscript was written by cmn and sja. all authors contributed to the editing of the manuscript.
tables and figures
table . summary of the results for the , barcode sequences (produced from pacbio sequel data analyzed using the mbrave platform) after processing with the debar pipeline.
pacbio sequel run run run run run total
consensus sequences generated , , , , ,
consensus sequences flagged by coil for indel error , ( . %)
rejected by debar denoising ( . %)
sequences flagged by coil post-denoising , ( . %)
sequences corrected , ( . % of flagged sequences)
table . assessment of the correction ability of the debar pipeline for the subset of sequences in the high-confidence error set. this set of sequences was flagged by coil and produced a stop codon when translated within all reading frames.
the top half of the table indicates the number of sequences flagged by coil as likely to be erroneous, based on the log likelihood values of the sequences. results are shown for sequences both before and after the denoising process. the bottom half of the table contains the number of sequences flagged by coil as likely to be erroneous, based on the presence of a stop codon in the amino acid sequence resulting from the censored translation of the framed nucleotide sequence. the high success for the stop-codon metric ( . % of errors removed) indicates that the pipeline is an effective means of correcting frameshift-causing insertion or deletion errors. the relatively lower success at correcting sequences with low log likelihood values suggests that frameshift-causing errors are not the only errors being flagged by coil, and that non-frameshift errors are not effectively corrected by the debar pipeline.
pacbio sequel run run run run run total
original flagged ,
flagged post-denoising ,
corrected . % . % . % . % . %
pacbio sequel run run run run run total
original stop codon ,
stop codon post-denoising
corrected stop codons . % . % . % . % . %
table . result of the bold data system evaluation of the debar denoising workflow’s effectiveness. the number of sequences identified by bold as containing stop codons, before and after processing with the denoising pipeline (figure ).
only the , specimens with barcodes and taxonomic information produced through the processing of pacbio sequel data on the mbrave platform were considered, as bold requires taxonomic information for assessing the presence of stop codons. the rows break the sequences down into categories, which indicate the source of the post-denoising sequence that was submitted to bold for assessment.
columns: sequence category; total sequence count; stop codon count (original, post-denoising); percent error reduction
unaltered , † -
denoised, altered , , † %
flagged for potential error, unaltered * -
flagged and rejected -
labelled as wolbachia by mbrave -
total , , ( . %) ( . %) . %
total, non-flagged only , , ( . %) ( . %) . %
† the sum of these categories (shown in the final row of the column) represents the false negative rate for the denoising pipeline. these are the . % ( / , ) of sequences that appear to contain true stop codons that were not flagged for denoising, or that were denoised unsuccessfully and not flagged as potential errors.
* the false positive rate of the denoising pipeline is the number of sequences in this category that do not in fact contain a stop codon. there is a total of ( - ) false positives and an overall false positive rate of . % ( / , ). because the sequences in this set are flagged for potential errors, rather than rejected outright, additional inspection of sequences in this category can separate the unsuccessfully denoised sequences with true errors from those that do not contain an error.
table .
assessment of the sequence quality for data from a mock community of arthropods sequenced in bulk using a thermo fisher ion torrent and processed on the mbrave platform. sequencing and processing results in two sets of data: groups of sequences assigned to bins and groups of sequences clustered into otus. the representative sequences (centroids before denoising, consensus after denoising) and all individual sequences were checked with the r package coil for evidence of frameshifts (stop codons in the amino acid sequence) before and after denoising, to see whether processing the data with the debar package resulted in higher quality barcode sequences.
columns: sequences analyzed; sequence data source; original (total count, stop codon count); after debar denoising (total count, stop codon count)
representative sequences, assigned to bins ( . %) ( . %)
representative sequences, otus , ( %) , ( . %)
esvs, assigned to bins , ( . %) , ( . %)
esvs, otus , ( . %) , ( . %)
figure . diagram demonstrating the debar package’s denoising workflow. blue indicates nucleotides that are part of the barcode region and orange nucleotides in bold font indicate technical errors or sequence from outside of the barcode region.
a. the debar package operates on a sequence-by-sequence basis, taking each input and constructing a custom dnaseq object. a dnaseq object can receive a dna sequence, an identifier, and optionally a sequence of corresponding phred quality scores. although the phred scores are not used in the denoising, indel-correcting adjustments to the sequence are applied to them as well, so that quality information can be carried from input to output.
b.
following dnaseq object construction, the sequence is compared to the phmm using the viterbi algorithm. by default, the full length ( bp) coi- p phmm contained in debar is used to evaluate the sequence. when required, a user may pass a custom phmm corresponding to a subsection of the coi- p region (specified using the coil package’s subsetphmm function) or a custom phmm trained on user-defined data (wilkinson ). the frame function isolates the correction window, which is the section of the sequence matching the phmm (the first consecutive base pairs matching to the phmm on the leading and trailing edges of the sequence establish the section of the input on which subsequent corrections are applied).
c. the adjust function traverses the section of the sequence and viterbi path defined by the frame function. when evidence of an inserted base pair (‘ ’ label in the viterbi path) is encountered, the corresponding base pair is removed. when evidence of a deleted base pair is encountered (a ‘ ’ label in the viterbi path) a placeholder ‘n’ nucleotide is inserted. exceptions are made for triple inserts or triple deletes (three consecutive ‘ ’ or ‘ ’ labels), which are skipped by the adjustment algorithm, as they are indicative of mutations that would not have a large impact on the structure of the protein-coding gene region and could reflect biological amino acid indels.
the total number of adjustments made by debar is limited by the parameter ‘adjust_limit’ (default = ); sequences requiring adjustments in excess of this number are flagged for rejection, as such a high frequency of indels is likely not the result of technical error but of other sources of noise, such as pseudogenes. following adjustment, a mask of placeholder ‘n’ nucleotides is applied to base pairs flanking the corrected indel (default is bp in each direction, see figure . for derivation of default). masking of bp flanks adjacent to each correction allows imprecise corrections to effectively correct sequence length and also mask true indel locations in the majority of instances.
d. following adjustment, the denoised sequences are output by debar. by default, the outputs will include trailing sequence outside of the correction window. leading information outside of the correction window is dropped, so that sequences are aligned with a common starting position. a user can choose to keep only the correction window, or have both flanking regions appended back on to the sequence output.
e. if multiple denoised sequences are available (for either a given specimen in the case of barcoding or a given otu in metabarcoding) then the consensus of the denoised sequences can be taken. the consensus function assumes the sequences have been denoised and their left flanks removed; as a result, they are aligned to one another. the modal base pair for each position is then taken to generate a consensus sequence, and in the case of ties, a placeholder “n” character is added to the consensus.
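the adjustment and masking steps described in panel c can be sketched in a few lines. the python below is an illustration only, not debar's r implementation: the path labels 'M', 'I' and 'D', the function name `adjust`, and the values given to `adjust_limit` and `mask_width` are all assumptions made for the example, and only runs of exactly three inserts or deletes are treated as 'triple' events.

```python
from itertools import groupby


def adjust(seq, path, adjust_limit=5, mask_width=2):
    """Correct indels along a match/insert/delete path.

    Hypothetical sketch: labels are 'M' (match), 'I' (insert), 'D' (delete).
    Returns (corrected_sequence, rejected). Defaults are placeholders.
    """
    out = []     # corrected bases, built left to right
    spans = []   # (start, end) positions in `out` touched by a correction
    pos = 0      # next unread base in `seq`
    n_adjust = 0
    for label, group in groupby(path):
        length = len(list(group))
        if label == "M":                      # matched bases are copied
            out.extend(seq[pos:pos + length])
            pos += length
        elif label == "I":
            if length == 3:                   # triple insert: left untouched
                out.extend(seq[pos:pos + length])
            else:                             # inserted bases are dropped
                spans.append((len(out), len(out)))
                n_adjust += length
            pos += length
        elif label == "D":
            if length != 3:                   # triple delete: left untouched
                spans.append((len(out), len(out) + length))
                out.extend("N" * length)      # placeholder restores the frame
                n_adjust += length
        else:
            raise ValueError(f"unknown path label: {label!r}")
    if n_adjust > adjust_limit:
        return None, True                     # too many indels: reject
    for start, end in spans:                  # mask flanks of each correction
        lo = max(0, start - mask_width)
        hi = min(len(out), end + mask_width)
        for k in range(lo, hi):
            out[k] = "N"
    return "".join(out), False
```

in this sketch a single inserted base is removed and the bases on either side of the join are masked, so a correction that lands a position or two away from the true error still yields a sequence of the right length with the uncertain region hidden.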
figure . diagram of the denoising workflow used to improve the quality of barcodes produced by processing pacific biosciences sequel circular consensus data on the mbrave platform.
(i) pacific biosciences sequel data are processed on the mbrave platform, and an initial set of barcode sequences is produced.
(ii) the set of consensus barcode sequences produced by the mbrave platform is obtained and analyzed with the coil package, using the ‘coi p_pipe’ function (default parameters). sequences displaying evidence of an indel (either the presence of a stop codon when translated to amino acids or an amino acid sequence with a low likelihood score) are retained for further denoising.
(iii) for each barcode with evidence of an error, all component ccs reads (and associated metadata) derived from the given specimen are obtained from mbrave.
(iv) based on the mbrave metadata, sequences are trimmed to remove primers, mid tags, and adapter sequence. the reverse complement of a read is taken when required.
(v) the ‘denoise_list’ function of debar is used to denoise all ccs reads (options: dir_check = false, keep_flanks = ‘right’, censor_length = ). rejected reads (those flagged by the denoise_list function) are removed from the dataset.
(vi) for each specimen, the reads are clustered into otus using the r package kmer (clustering threshold = . ). this is done to mitigate the influence of outlier ccs or contaminant sequences.
(vii) for each otu, a consensus sequence is generated using debar’s ‘consensus’ function. for each specimen, otus are ranked based on the number of component ccs reads they contain.
(viii) the consensus sequences are reassessed with coil. if the top-ranked consensus sequence now passes the coil check, it is deemed to have been successfully denoised, and it is selected as the output barcode. if not, the check is repeated for the second-ranked consensus sequence (when available), and this output is retained if it is barcode compliant.
if neither the first nor the second highest ranked consensus sequence passes the coil check, then the original (pre-denoising) barcode is retained, as no meaningful improvement was made. in this situation the barcode is flagged as likely to contain an error.
figure . the debar package’s denoising of , coi sequences containing single insertion or deletion errors. so that exact error positions were known, errors were artificially introduced in accordance with known probabilities for coi dna barcode data from the pacbio sequel platform (hebert et al. ). denoising was accomplished by altering sequences in accordance with the viterbi path yielded by comparison to the phmm. the correct number of adjustments was made for , sequences, and . % of these corrections located the indel exactly. masking of bp flanks adjacent to each correction allowed imprecise corrections to correct sequence length and mask the true indel location % of the time. for the instances where an incorrect number of adjustments was made, were caught through query of the amino acid sequence for stop codons and the trimming of spurious matches at the edge of sequences. overall, . % of errors were effectively corrected or identified as erroneous.
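the stop codon query and the ranked fallback used at the end of the sequel denoising workflow can be sketched as follows. this is a simplified python illustration, not the coil or debar code: `has_stop_codon`, `stop_in_all_frames` and `select_barcode` are hypothetical names, the check ignores the likelihood score that coil also uses, and it assumes the invertebrate mitochondrial genetic code (ncbi translation table 5), in which TAA and TAG are the stop codons.

```python
# Assumption: invertebrate mitochondrial code (NCBI table 5); other animal
# mitochondrial codes use different stop codon sets.
STOP_CODONS = {"TAA", "TAG"}


def has_stop_codon(seq, frame=0):
    """True if an in-frame stop codon occurs when reading from `frame`."""
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    return False


def stop_in_all_frames(seq):
    """True if every one of the three forward reading frames has a stop."""
    return all(has_stop_codon(seq, frame) for frame in range(3))


def select_barcode(original, ranked_consensus, passes_check):
    """Pick the output barcode from consensus sequences ranked by read count.

    Only the two top-ranked candidates are tried; if neither passes, the
    original barcode is returned and flagged as likely erroneous.
    Returns (sequence, flagged).
    """
    for candidate in ranked_consensus[:2]:
        if passes_check(candidate):
            return candidate, False
    return original, True
```

passing the check as a function keeps the fallback logic separate from whatever quality test is applied, so the simplified stop codon test here could be swapped for a fuller one without changing the selection step.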
figure . histogram indicating the position in the coi- p region of the uncorrected indel errors from the , -sequence artificial error dataset. the x axis indicates the base pair position in the coi- p profile, and the y axis displays the number of sequences that contained an uncorrected error at the given range of positions (bins of base pair positions).
figure . histogram showing number of base pairs between inexact corrections applied by debar and the ground truth error location for the given sequence. in total , sequences ( . %) had errors that were denoised inexactly, and corrections were an average of . bp (sd = . ) away from the exact ground truth error location.
figure . relationship between the amount of missing data in the final denoised barcode sequences (number of ns divided by the total length of the sequence) and the number of ccs reads that contributed to the generation of the barcode. the figure displays only the , denoised barcode sequences submitted to bold that contained at least one “n” (the remaining , barcode sequences in the bold submission did not contain an “n”).
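the missing-data measure in the last caption, and the majority-rule consensus that can leave placeholder ns in a barcode, can be sketched as below. this python is an illustration under stated assumptions rather than debar's r code: the function names are hypothetical, reads are assumed to share a common start position, and every character (including a masked 'N') is counted as a vote.

```python
from collections import Counter


def consensus(seqs):
    """Modal base at each position; a tie for the top count gives 'N'.

    Assumes the reads are already aligned to a common starting position.
    Shorter reads simply stop contributing votes past their end.
    """
    length = max(len(s) for s in seqs)
    out = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs if i < len(s))
        top = counts.most_common(2)
        if len(top) == 2 and top[0][1] == top[1][1]:
            out.append("N")        # tie between the two most common bases
        else:
            out.append(top[0][0])  # clear majority (or a single candidate)
    return "".join(out)


def missing_fraction(seq):
    """Number of Ns divided by the total length of the sequence."""
    return seq.upper().count("N") / len(seq)
```

with three reads a single discordant base is outvoted, whereas with two reads the same disagreement is a tie and becomes an n, which is one way that barcodes built from fewer reads can end up with more missing data.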
supplementary information

supplementary file ('s -single_errors_in_ k_sequences.csv'): the , coi barcode sequences with single introduced indel errors that were used to test debar and calibrate the default parameters.

supplementary file ('s -control_denoising_no_errors.csv'): the , coi barcode sequences with no known indel errors used to assess the false correction rate of debar.

supplementary file ('s -single_file_pipeline'): scripts and example data for the denoising pipeline developed to process coi dna barcode sequence data produced using the pacific biosciences sequel sequencer and mbrave platform.

supplementary file : scripts and example data for the denoising pipeline developed to process coi dna metabarcode sequence data produced using the iontorrent s sequencer and the mbrave platform.

supplementary file : vignette demonstrating the functionality of the debar package. the vignette is also available as part of the r package (https://github.com/cnuge/debar/tree/master/vignettes).
genetic epidemiology of variants associated with immune escape from global sars-cov- genomes

bani jolly , ,$, mercy rophina , ,$, afra shamnath , mohamed imran , , rahul c. bhoyar , mohit kumar divakar , , pallavali roja rani , gyan ranjan , , paras sehgal , , pulala chandrasekhar , s. afsar , j. vijaya lakshmi , a. surekha , sridhar sivasubbu , , vinod scaria , ,*

csir-institute of genomics and integrative biology (csir-igib), new delhi, india
academy of scientific and innovative research (acsir), csir-hrdc ghaziabad, uttar pradesh, india
kurnool medical college, kurnool, andhra pradesh, india

$authors contributed equally and would like to be known as joint first authors
*address for correspondence: vinod scaria, vinods@igib.in

abstract

many antibody and immune escape variants in sars-cov- are now documented in the literature. the availability of sars-cov- genome sequences enabled us to investigate the occurrence and genetic epidemiology of these variants globally. our analysis suggests that a number of genetic variants associated with immune escape have emerged in global populations.

keywords: covid- , sars-cov- , antibody, mutations, epidemiology

text

antibodies are one of the emerging therapeutic approaches being explored in covid- . these antibodies typically target the receptor-binding motif or structural domains of the spike protein of sars-cov- , in an attempt to inhibit binding of the spike protein to the host receptors. cocktails of antibodies that target distinct structural and functional domains of the spike protein are also being developed, providing redundant mechanisms of targeting the virus and thereby minimising escape. genomic documentation of the spread of sars-cov- across the globe has provided unique insights into the genetic variability and variants of functional consequence.
in-depth studies in recent months have unravelled a wealth of information on the immune response in covid- and offered insights into the development of therapeutics. recent investigations suggest a number of genetic variants in sars-cov- are associated with immune escape and/or resistance to antibodies. their structural and functional features and mechanisms of immune evasion are also being extensively studied ( ). the natural occurrence and genetic epidemiology of these variants across global populations are poorly understood.

we were motivated by the wide availability of sars-cov- genomes from across the world and the increasing numbers of genetic variants suggested to contribute to escape from antibody inhibition. we analysed a comprehensive compendium of genetic variants associated with immune escape, curated by our group from the literature and preprint servers ( ). this compendium included unique variants reported in the literature. to understand the genetic epidemiology of these variants in the global compendium of genomes, we compiled a dataset of , sars-cov- genomes from gisaid (as of december ) ( ), in addition to , genomes sequenced in-house (bioproject id: prjna ). genome sequences with more than % ns, more than ambiguous nucleotides, higher than expected divergence and mutation clusters were excluded from the analysis. after quality control, the final dataset encompassed , genomes from countries. only countries with at least good quality genome submissions were considered for the analysis. of the genetic variants associated with immune escape were found in a total of , genomes from countries (figure a), out of which variants had > % frequency in the respective countries.
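The quality-control and per-country frequency steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the thresholds `MAX_N_FRACTION`, `MAX_AMBIGUOUS` and `MIN_GENOMES_PER_COUNTRY` are hypothetical placeholders (the actual cut-offs are not legible in the text), the record layout and the variant name `S:X1` are invented for the example, and the divergence and mutation-cluster filters are not reproduced.

```python
from collections import defaultdict

# Hypothetical thresholds: the cut-offs used in the study are not legible
# in the source text, so these values are placeholders for illustration only.
MAX_N_FRACTION = 0.05
MAX_AMBIGUOUS = 10
MIN_GENOMES_PER_COUNTRY = 2


def passes_qc(sequence):
    """Return True if a genome passes the illustrative N / ambiguity filters."""
    sequence = sequence.upper()
    if not sequence:
        return False
    n_fraction = sequence.count("N") / len(sequence)
    # Anything other than A, C, G, T, N or a gap is counted as ambiguous.
    ambiguous = sum(1 for base in sequence if base not in "ACGTN-")
    return n_fraction <= MAX_N_FRACTION and ambiguous <= MAX_AMBIGUOUS


def variant_frequencies(genomes):
    """Compute per-country variant frequencies among QC-passing genomes.

    genomes: iterable of dicts with keys 'country', 'sequence', 'variants'.
    Returns {country: {variant: fraction of that country's genomes}} for
    countries with at least MIN_GENOMES_PER_COUNTRY QC-passing genomes.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for genome in genomes:
        if not passes_qc(genome["sequence"]):
            continue
        totals[genome["country"]] += 1
        for variant in set(genome["variants"]):
            counts[genome["country"]][variant] += 1
    return {
        country: {v: c / totals[country] for v, c in counts[country].items()}
        for country in totals
        if totals[country] >= MIN_GENOMES_PER_COUNTRY
    }
```

Counting each variant at most once per genome (via `set`) keeps the frequency interpretable as the fraction of genomes carrying the variant.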
phylogenetic analysis was performed following the nextstrain protocol for a total of , genomes, including , randomly selected genomes having these variants (figure b) ( ). homoplasies were identified in the phylogeny using homoplasyfinder ( ). out of , variant sites were found to be homoplasic, suggesting they could emerge independently in different genetic lineages, out of which were found to be at > % frequency in at least one of the countries analysed.

out of , genomes analysed from australia, immune escape associated variants mapped to , genomes ( %). of significant frequency was the s:s n variant, which was present in , genomes ( %) from australia. a high frequency of this variant was also found in a number of other countries, particularly in europe. s:n k was also found at high frequencies in genomes from a number of countries in europe ( ). s:n y, one of the variants in the recently reported emergent sars-cov- lineage from the united kingdom, was present in a total of genomes, including genomes from the united kingdom, australia, south africa, usa, denmark and brazil ( , ). all genomes from south africa having s:n y also had the s:e k variant, and s:k n was present in of these genomes ( ). the orf a:g v variant was also found to be prevalent across global genomes, with the highest frequencies in hong kong and south korea. this variant is also one of the defining variants for the nextstrain clade a a (gisaid clade v) (figure b).

of the genetic variants were found in genomes from india (supplementary figure). the s:n k variant was found to have a frequency of . % in india and a high prevalence in the state of andhra pradesh ( . % of genomes). the variant site was homoplasic and the variant was found in genomes belonging to different clades and haplotypes. time-scale analysis suggested the variant emerged in recent months (figure c). the s:n k variant was also reported in a case of covid- reinfection from north india ( ).
taken together, our analysis suggests that a number of genetic variants associated with immune escape have emerged in global populations; some of them are polymorphic in many global datasets, and a subset has become highly frequent in some countries. homoplasy of the variant sites suggests that there could be a potential selective advantage to these variants. further data and analysis would be needed to investigate the potential impact of such variants on the efficacy of different vaccines in these regions.

acknowledgements

authors acknowledge disha sharma and abhinav jain for the analysis of in-house genomes and the researchers, originating and submitting laboratories of the sequences retrieved from gisaid (https://doi.org/ . /m .figshare. .v ). bj and mkd acknowledge a research fellowship from the council of scientific and industrial research (csir india). the funders had no role in the study design or the decision to publish.

references

weisblum y, schmidt f, zhang f, dasilva j, poston d, lorenzi jcc, et al. escape from neutralizing antibodies by sars-cov- spike protein variants. oct [cited dec ]; https://elifesciences.org/articles/
rophina m, pandhare k, mangla m, shamnath a, jolly b, sethi m, et al. favicov - a comprehensive manually curated resource for functional genetic variants in sars-cov- . nov https://doi.org/ . /osf.io/wp tx
yuelong shu jm. gisaid: global initiative on sharing all influenza data – from vision to reality. eurosurveillance [internet]. mar [cited dec ]; ( ). https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /
nextstrain [internet]. [cited dec ]. https://nextstrain.org/sars-cov- /
crispell j, balaz d, gordon sv.
homoplasyfinder: a simple tool to identify homoplasies on a phylogeny. microbial genomics [internet]. jan [cited dec ]; ( ). https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /
hodcroft eb, zuber m, nadeau s, crawford khd, bloom jd, veesler d, et al. emergence and spread of a sars-cov- variant through europe in the summer of . medrxiv : the preprint server for health sciences [internet]. nov [cited dec ]; https://pubmed.ncbi.nlm.nih.gov/ /
rambaut a, loman n, pybus o, barclay w, barrett j, carabelli a, et al. preliminary genomic characterisation of an emergent sars-cov- lineage in the uk defined by a novel set of spike mutations [internet]. [cited dec ]. https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov- -lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/
shang e, axelsen ph. the potential for sars-cov- to evade both natural and vaccine-induced immunity [internet]. cold spring harbor laboratory. [cited dec ]. p. . . . . https://www.biorxiv.org/content/ . / . . . v .abstract
emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus (sars-cov- ) lineage with multiple spike mutations in south africa [internet]. [cited dec ]. https://www.krisp.org.za/publications.php?pubid=
gupta v, bhoyar rc, jain a, srivastava s, upadhayay r, imran m, et al. asymptomatic reinfection in two healthcare workers from india with genetically distinct sars-cov- . clin infect dis [internet]. [cited dec ]; https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /
figure . (a) variant frequencies of the immune escape variants in genomes of sars-cov- . the total number of genomes analyzed from each country is specified. variants with frequency > % in the respective countries are highlighted in red. (b) global phylogenetic context of the variants. the vertical bar indicates the clade assigned according to the nextstrain nomenclature. (c) time-series data on prevalence for the genetic variants showing the region-wise proportion of genomes per month for the variants.

supplementary figure. variant frequencies of the immune escape variants in genomes isolated from different states in india.

bioinformatics
original paper
topic area: biomedical informatics

predicting chemotherapy response using a variational autoencoder approach

qi wei ∗ and stephen a. ramsey ∗
school of eecs, oregon state university, corvallis, oregon , usa
department of biomedical sciences and school of eecs, oregon state university, corvallis, oregon , usa.
∗to whom correspondence should be addressed.

abstract

motivation: multiple studies have shown the utility of transcriptome-wide rna-seq profiles as features for machine learning-based prediction of response to chemotherapy in cancer. while tumor transcriptome profiles are publicly available for thousands of tumors for many cancer types, a relatively modest number of tumor profiles are clinically annotated for response to chemotherapy. the paucity of labeled examples and high dimension of the feature data limit performance for predicting therapeutic response using fully-supervised classification methods. recently, multiple studies have established the utility of a deep neural network approach, the variational autoencoder (vae), for generating meaningful latent features from original data. here, we report the first study of a semi-supervised approach using vae-encoded tumor transcriptome features and regularized gradient boosted decision trees (xgboost) to predict chemotherapy drug response for five cancer types: colon adenocarcinoma, pancreatic adenocarcinoma, bladder carcinoma, sarcoma, and breast invasive carcinoma.
results: we found: ( ) vae-encoding of the tumor transcriptome preserves the cancer type identity of the tumor, suggesting preservation of biologically relevant information; and ( ) as a feature-set for supervised classification to predict response-to-chemotherapy, the unsupervised vae encoding of the tumor’s gene expression profile leads to better area under the receiver operating characteristic curve (auroc) classification performance than either the original gene expression profile or the pca principal components of the gene expression profile, in four out of five cancer types that we tested.

availability: github.com/athed/vae_for_chemotherapy_drug_response_prediction

contact: ramseyst@oregonstate.edu

supplementary information: supplementary data are available at bioinformatics online.

introduction

although chemotherapy is a mainstay of treatment for aggressive cancers, many agents have serious side effects (airley, ). whether or not chemotherapy will provide a net benefit to a patient depends in large part on whether the malignancy responds to the treatment. chemotherapy is often administered in cycles (skeel, ), leading to multiple opportunities where treatment appropriateness may be (re-)assessed (chabner and longo, ). currently, the medical cost-benefit of chemotherapy (versus a non-pharmaceutical approach) is assessed in light of patient health status, expected therapeutic tolerance, and tumor pathological classification (kaestner and sewell, ; gurney, ). for many cancer types, there is a broad spectrum of cases where the decision of whether or not to undergo or continue chemotherapy is difficult (corrie, ; whelan et al., ; malfuson et al., ). the development of a quantitative model that could predict, based on a specific tumor’s molecular signature, whether or not the tumor will respond to chemotherapy would have significant clinical utility and would potentially improve patient quality-of-life.
moreover, an advance in machine-learning methods for the response-to-chemotherapy prediction problem (chiu et al., ; geeleher et al., ) would have potential crossover benefits for other prediction problems in precision medicine.

© the author . published by oxford university press. all rights reserved. for permissions, please e-mail: journals.permissions@oup.com. biorxiv preprint (which was not certified by peer review); this version posted january , . the copyright holder for this preprint is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license.

oncogenesis is driven by alterations in the somatic genome and epigenome in cancer cells (weir et al., ); however, the somatic genetic or epigenetic determinants of response to chemotherapy are also thought to exert measurable effects on gene expression in the tumor. consistent with this theory, studies of various cancer types have demonstrated that biomarkers identified from systematic measurement of the patient’s cancer transcriptome or proteome correlate with the probability that a tumor will respond to chemotherapy, for example, a five-protein signature in breast cancer (gámez-pozo et al., ), - and -gene signatures in rectal cancer (casado et al., ; del rio et al., ), and a -gene signature in liver cancer (kurokawa et al., ). taken together, the findings from such “omics” biomarker studies suggest that rna sequencing (rna-seq; wang et al., )-based transcriptome measurements of tumor samples labeled with clinical response can be used to train machine-learning classifiers for predicting response to chemotherapy.
however, the accuracy of such models is presently limited by the small number of available training cases that are labeled for clinical outcome, given the large size of the transcriptome (∼ k genes; frankish et al., ) and the significant intertumoral variance of gene expression. for typical cancers, most of the profiled tumor transcriptomes are not labeled for chemotherapeutic response; the ratio of such unlabeled to labeled tumor datasets in the cancer genome atlas (tcga) dataset (hutter and zenklusen, ) ranges from – , depending on the cancer type. while using (exclusively) supervised learning methods for the response-to-chemotherapy prediction problem has been a sensible first step, unlabeled data are a substantial resource that could, in the context of a semi-supervised approach, reveal multivariate structure or patterns that could ultimately improve predictive accuracy. semi-supervised approaches that fuse unsupervised data reduction methods (such as principal components analysis, or pca) for low-dimensional embedding with supervised methods (such as decision trees) for prediction have proved beneficial in problems where large unlabeled datasets are available; for example, a pca-xgboost method has been previously used in finance (wen and huang, ), and an independent components analysis-based method has been used to classify electroencephalographic signals (qin et al., ).

multiple studies (an and cho, ; li and she, ; bouchacourt et al., ; kipf and welling, ) have established the power of the variational autoencoder (vae; kingma and welling ( ); jimenez rezende et al. ( )), an unsupervised nonlinear data embedding model with two deep neural networks oppositely connected through a low-dimensional probabilistic latent space, for finding meaningful and useful latent features in high-dimensional data.
in the context of cancer bioinformatics, vaes have been variously used to (i) model cancer gene expression and capture biologically-relevant features using the tcga pan-cancer project rna-seq dataset (way and greene, ); (ii) find encodings that correlate with biological features such as patient sex and tumor type (titus et al., ); (iii) find encodings that can be used to predict gene inactivation in cancer (way and greene, ); and (iv) obtain an encoding that is predictive of chemotherapy resistance (george and lio, ). based on their exploration of multiple vae architectures for predicting gene inactivation in a pan-cancer dataset, way & greene reported ( ) biological insights obtained from the latent-space embeddings learned by vaes. george and lio ( ) used a vae-based, fully unsupervised approach to encode ovarian tumor transcriptomes and obtained latent-space features that were associated with response to chemotherapy. these studies suggest that a tumor transcriptome vae may be broadly useful for the response-to-chemotherapy prediction problem and they set the stage for the present multi-cancer investigation of the utility of the tumor transcriptome vae in precision oncology.

given previous reports of success using a vae to obtain useful low-dimensional encodings of transcriptome data (dong et al., ; way and greene, ; way and greene, ), in this work, we first sought to ascertain to what extent a vae encoding of tumor transcriptome data would preserve biological characteristics (spanning multiple genes at a time that have coordinated variation across tumors) that are associated with distinct cancer types. to answer this question, we trained a pan-cancer transcriptome vae and used it to encode tcga tumor rna-seq data from , tumors comprising different cancer types, focusing on the top , most variable genes.
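Selecting the most variable genes, as described above, amounts to ranking genes by their variance across tumors and keeping the top k. The sketch below is only an illustration under assumptions: the function name `top_variable_genes`, the list-of-rows data layout, and the use of population variance are choices made for this example and are not taken from the paper.

```python
from statistics import pvariance


def top_variable_genes(expression, gene_names, k):
    """Return the k gene names with the largest across-tumor variance.

    expression: list of per-tumor rows; each row lists expression values in
                the same order as gene_names (an assumed layout).
    """
    if not expression:
        return []
    ranked = []
    for j, name in enumerate(gene_names):
        column = [row[j] for row in expression]
        ranked.append((pvariance(column), name))
    # Sort by variance, largest first; ties are broken by gene name so the
    # result is deterministic.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in ranked[:k]]
```

In practice the same ranking would typically be done on a numeric array for speed; the pure-Python version is used here only to keep the example self-contained.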
we trained the vae using an efficient contemporary optimization engine (adam) to find the vae coefficient values that together balance reconstruction loss and desired latent-space distributional shape. we applied an unsupervised two-dimensional embedding method (t-distributed stochastic neighbor embedding, or t-sne) directly to the tumor transcriptome data and to the vae-embedded tumor transcriptome data, and mapped clusters of tumors by cancer type across the two t-sne embeddings. we found (sec. . ) that the vae preserves the clustering of tumors of the same cancer type, suggesting biological fidelity in the components of the vae embedding.

next, to set the stage for a semi-supervised approach for predicting cancer response to chemotherapy, we selected five cancer types (breast, bladder, colon, pancreatic, and sarcoma) based on sufficient availability of clinically labeled data and then defined three different vae architectures: vae- , which we used to obtain feature data for bladder, breast, and pancreatic cancer; vae- , for sarcoma; and vae- , for colon cancer. in order to train a vae, it is necessary to specify a reconstruction loss function; both l and l reconstruction loss have been used for training vaes in machine-learning, and we sought to clarify which is best for this application. thus, we trained each of the three vae architectures on , tumor transcriptomes from tcga, in an unsupervised fashion, separately using l loss and l loss. next, in order to label tumors for response to chemotherapy, we analyzed the available tcga clinical data regarding the outcome of pharmaceutical therapy (in most cases including chemotherapy) for each of the patients, and thereby assigned a label “responded” or “progressive” to out of the , tumors (sec. . ); the remainder of the tumors were unlabeled and thus used only during vae training.
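The balance between reconstruction loss and latent-space shape mentioned above can be made concrete with a small sketch of a per-sample VAE objective. Two assumptions are made here and should be read as such: the two reconstruction losses are taken to be the usual absolute-error (L1) and squared-error (L2) losses, since the digits are not legible in the text, and the latent prior is taken to be a standard normal with a diagonal-Gaussian encoder. All function names are illustrative and not from the paper's code.

```python
import math
import random


def reparameterize(mu, sigma, rng=random):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def reconstruction_loss(x, x_hat, kind="l1"):
    """Sum of absolute errors ('l1') or squared errors ('l2')."""
    if kind == "l1":
        return sum(abs(a - b) for a, b in zip(x, x_hat))
    if kind == "l2":
        return sum((a - b) ** 2 for a, b in zip(x, x_hat))
    raise ValueError("kind must be 'l1' or 'l2'")


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


def vae_loss(x, x_hat, mu, sigma, kind="l1"):
    """Reconstruction term plus KL term for one sample (unweighted sum)."""
    return reconstruction_loss(x, x_hat, kind) + kl_to_standard_normal(mu, sigma)
```

Switching `kind` between `"l1"` and `"l2"` is the only change needed to train under the two reconstruction losses being compared; any relative weighting of the two terms would be an additional hyperparameter not shown here.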
for the labeled tumors, we used the vae-encoded latent vectors as feature data for supervised prediction of the binary label using gradient boosted decision trees (xgboost; chen and guestrin ( )). using this semi-supervised “vae-xgboost” approach, we found (sec. . ) that a vae trained using l reconstruction loss yields features that result in better classification performance (by area under the receiver operating characteristic, auroc) than a vae trained using l . in the main part of this work, using xgboost, we measured response-to-chemotherapy prediction performance for each of three tumor transcriptome-derived feature sets: (i) expression levels of the top % of genes, by intertumoral variance (a fully supervised approach); (ii) the first principal components of expression levels of “top %” genes (“semi-supervised pca-xgboost”); and (iii) vae-encoded expression levels of the top % genes (“semi-supervised vae-xgboost”, our new method, fig. ). within a cross-validation framework for auroc performance evaluation, we found (sec. . ) that for four out of five cancer types, the semi-supervised vae-xgboost approach outperformed the fully-supervised approach. moreover, for four out of the five cancer types, semi-supervised vae-xgboost outperformed semi-supervised pca-xgboost. finally, for the one cancer type for which pca-xgboost outperformed vae-xgboost, we investigated their relative performance through the lens of xgboost feature importance (sec. . ). below, we describe our results (sec. ) and the vae-xgboost method in detail (sec. ).

results

vae encoding preserves cancer type features

given multiple reports (dolezal et al., ; esteva et al., ) that t-sne can be used to visualize the grouping of cancer types from high-dimensional molecular tumor data, we investigated the extent to which
vae encoding of tumor transcriptomes preserves data-space features that determine cancer type-specific groupings.

[figure: schematic of the vae-xgboost workflow. elements: gene expression data (original input, x); encoder network; mean vector (μ); variance vector (σ); reparameterized sampling; sampled latent vector (z); decoder network; reconstructed gene expression data (output, x̃); labeled input (y); latent vector + label as input (z, y); xgboost classifier; probability of predicted label p(ỹ | z).]

fig. : overview of the vae-xgboost method that we used for predicting tumor response to chemotherapy. for each tumor t, the encoder’s input vector xt contains the levels of the top % of genes by intertumoral gene expression variance (sec. . ). each network has multiple fully connected dense layers (sec. . ). the encoder outputs two vectors of configurable latent variable dimension h ≪ m (sec. . ): a vector of means µ and a vector of standard deviations σ that parameterize the multivariate normal latent-space vector z|xt (sec. . ). the sampled encoding z|xt = zt is passed to the decoding neural network (decoder), whose architecture is identical to (with inversion) that of the encoder network. the sampled latent-space vector zt is passed to xgboost for supervised classification to predict response to chemotherapy (training label y, prediction ỹ).

in order to do so, we obtained (sec. . ) from the tcga data portal rna-seq transcriptome data for , tumors labeled for different cancer types (listed in fig. ). as a baseline view of transcriptome-based cancer type groupings, we generated a two-dimensional embedding of the , tumor samples by applying t-sne (sec. . 
) to the expression levels of the top , most variable genes, yielding distinct clusters (fig. a). next, we trained (sec. . ) a vae to encode the expression levels of the , most variable genes in each of , tumors into , points in a -dimensional latent space. an unsupervised t-sne visualization (fig. b) of the vae-encoded tumor transcriptome data was remarkably similar in structure to the t-sne visualization of the , -dimensional original dataset, with intercluster distances for all pairs of clusters correlated between the two t-sne plots (r = . ; see fig. s ). this analysis indicated that the vae encoding preserves data-space features that distinguish individual cancer types.

obtaining a labeled tumor transcriptome dataset

having demonstrated that the vae can efficiently encode tumor transcriptomes while preserving features that distinguish different cancer types, and to set the stage for implementing a semi-supervised approach for predicting response to chemotherapy, we obtained a five-cancer-type tumor transcriptome dataset with a significant subset of the tumors labeled for “response to chemotherapy”, as described below. we obtained transcriptomes of tumors across five cancer types [colon adenocarcinoma (coad), pancreatic adenocarcinoma (paad), bladder carcinoma (blca), sarcoma (sarc), and breast invasive carcinoma (brca); see table ] that we selected based on availability of a sufficient amount of labeled data in tcga (see sec. . ) and generated binary clinical labels for them corresponding to “responded” or “progressive” (see sec. . ). among these tumors, the class balance ratio, i.e., the ratio of responding tumors to progressive disease tumors, ranged from a low of . for pancreatic cancer to a high of . for breast cancer.
l loss is better than l loss for this application

having obtained , tumor transcriptomes across five cancer types with of the tumors labeled for response to chemotherapy, we next sought to determine which type of vae reconstruction loss function, l loss or l loss, would yield transcriptome encodings that are most amenable to accurate xgboost-based prediction of response to chemotherapy. on the , tumor transcriptomes, we trained two sets of cancer type-specific vaes (see sec. . ) using l and l loss functions, respectively. we used the l and l vaes to encode the labeled tumor transcriptomes (the top % most variable genes in each cancer type, merged across the five cancers, for a total of , genes) spanning the five cancer types, yielding (for each cancer type) two feature matrices (one for l loss and one for l loss) that we separately evaluated for xgboost prediction (sec. . ) of the binary response-to-chemotherapy class label. by test-set area under the receiver operating characteristic (auroc; sec. . ), averaged across the five cancers, we found (fig. ) that the features that were generated by the l vaes led to . % better (p < − , welch’s t-test) classification performance than the features generated by the l vaes, and thus, for all subsequent analyses, we used vaes trained with l loss.

chemotherapy drug response classification result

having selected l reconstruction loss for training vaes to encode tumor transcriptomes for predicting response-to-chemotherapy, we focused on the key question of whether (and to what extent) a semi-supervised approach using the vae can outperform (in terms of predictive accuracy) a fully supervised approach or a semi-supervised approach based on a traditional dimensional reduction technique (principal components analysis, pca).
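The two evaluation quantities used in this section, test-set AUROC and Welch's t-test on sets of AUROC values, can be sketched in plain Python as below. This is an illustrative sketch under assumptions, not the authors' evaluation code: labels are assumed to be coded 1 (for example "responded") and 0 (for example "progressive"), and only the Welch t statistic and its Welch-Satterthwaite degrees of freedom are computed; obtaining a p-value would additionally require the t distribution from a statistics library.

```python
import math
from statistics import mean, variance


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for
    two samples with possibly unequal variances."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A typical use would be to collect one AUROC per cross-validation split for each feature set and pass the two resulting lists to `welch_t`.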
in brief, our vae-based semi-supervised approach involves three steps: (i) training a vae to encode clinically unlabeled tumor transcriptomes (for the top % most variable genes) for a single cancer type, into a low-dimensional space (sec. . ); (ii) using that vae to obtain latent-space encodings for the tumor transcriptomes that are labeled for a relevant clinical endpoint (in this work, response to chemotherapy); and (iii) training and testing a supervised classifier (in this work, xgboost binary classification) using the latent-space encodings as feature data.

to address the question of whether this vae-based, semi-supervised (vae-xgboost) approach can outperform a fully supervised approach, we compared the performance of the vae-xgboost method to a fully supervised approach in which we applied xgboost directly to the tumor expression levels of the top % most variable genes ( , genes) as feature data. in the same analysis, to address the question of whether the vae-xgboost method could outperform a semi-supervised approach based on pca dimensionality reduction, we compared the vae-xgboost method to the pca-xgboost method. we carried out this analysis for each of the five cancer types separately, using the set of cancer type-specific labeled tumors (totaling labeled tumors). we measured performance using test-set auroc in a cross-validation framework (sec. . ).

the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /). biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , .

table . numbers of samples with a chemotherapy response record for each cancer type (n.b., the total number of labeled tumor samples exceeds the total number of patients because some patients had multiple tumors). after each cancer type, its tcga abbreviation is shown in parentheses.
columns: cancer type | total number of samples (labeled and unlabeled) | number of labeled samples | proportion of labeled samples | class balance ratio (responding/progressive)
breast invasive carcinoma (brca) , . .
colon adenocarcinoma (coad) . .
bladder carcinoma (blca) . .
pancreatic adenocarcinoma (paad) . .
sarcoma (sarc) . .
sum ,

table . quantitative auroc performances of xgboost (“raw data”), pca-xgboost (“pca”), and vae-xgboost (“vae”), along with pairwise comparisons.
columns: cancer type | auroc (mean): vae, pca, raw data | p (welch’s t-test): vae versus raw data, vae versus pca | p (wilcoxon signed-rank test): vae versus raw data, vae versus pca
brca . . . . × − . × − . × − . × −
coad . . . . × − . × − . × − . × −
blca . . . . × − . × − . × − . × −
paad . . . . × − . × − . × − . × −
sarc . . . . × − . × − . × − . × −

for four out of five cancer types (breast, colon, pancreatic, and sarcoma), in terms of test-set auroc, the vae-xgboost approach outperformed the fully-supervised approach of applying xgboost directly to the expression levels of the tumors’ top % most variable genes (fig. ), by both welch’s t-test and wilcoxon’s signed-rank test (table ); for blca, the semi-supervised vae-xgboost and fully-supervised models’ performances were statistically indistinguishable. additionally, for four out of five cancer types (bladder, breast, pancreatic, and sarcoma), the semi-supervised vae-xgboost method significantly outperformed the semi-supervised pca-xgboost method (fig. and table ). the five-cancer average auroc for vae-xgboost was . , a performance gain of . % over the five-cancer average auroc for pca-xgboost ( . ) and a gain of . % over the fully-supervised model’s average ( . ). notably, a single deep vae architecture (vae- , which had a -dimensional latent space and six layers in the encoder; see sec. .
) yielded latent-space encodings that outperformed semi-supervised pca-xgboost for three cancer types (bladder, breast, and pancreatic).

pca & vae feature importance scores, for coad

having established that the semi-supervised vae-xgboost outperforms the semi-supervised pca-xgboost approach for tumor transcriptome-based prediction of response to chemotherapy for four out of five cancer types, we sought to understand the basis for the higher performance of pca-xgboost over vae-xgboost on the fifth cancer type, colon adenocarcinoma (coad). specifically, we investigated whether the strong performance of pca-xgboost on coad is attributable to differences in the distributions of xgboost feature importance scores (sec. . ) of the pca features versus vae latent-space features. we found that the distribution of feature importance scores (as a function of rank) was more sharply peaked at the first few (top-ranked) features in the vae than in the pca (fig. ), suggesting that the performance gain with pca reflects a broader spectrum of informative features for that particular cancer type.

discussion

as far as we are aware, this work is the first report of a broad (five-cancer) investigation of the potential for a vae-based, semi-supervised approach for predicting response to chemotherapy. across the five cancer types that we studied, the ratio of responding tumors to progressive disease tumors ranged from a low of . for pancreatic cancer to a high of . for breast cancer, reflecting a broad range of resistances to standard-of-care chemotherapy. our results clearly demonstrate the utility of the vae for compressing high-dimensional data to a continuous, low-dimensional latent space while retaining features that are essential for distinguishing different cancer types and for predicting response to chemotherapy. nevertheless, three limitations of this work bear noting. the first limitation concerns the type(s) of tumor “omics” data from which features are derived for the predictive model.
while in this work we focused on tumor transcriptome data, which can be measured with high precision over a wide dynamic range of transcript abundances by rna-seq, we note that tcga datasets of tumor somatic mutations and copy number alteration events are also available (hutter and zenklusen, ). given the voluminous literature on the use of tumor somatic genomic data for precision cancer diagnosis (mitchel et al., ; zhang et al., ; lee et al., ), tumor dna datasets are fertile ground for developing a semi-supervised, multi-omics model for predicting response to chemotherapy.

second, we noted that, for decision tree-based response-to-chemotherapy prediction, the performance of vae-encoded transcriptome features is somewhat sensitive to the type of normalization used for the input data (data not shown). we explored various types of normalization for the rna-seq data, including standardization of log counts and using fpkm data; we ultimately chose min-max-normalized log total-count-normalized counts (sec. . ) for the gene expression levels to be used to derive features. however, there are additional transcript quantification methods (evans et al., ) that could be explored in the context of finding optimal tumor transcriptome vae encodings for precision oncology. a similar comment applies to the specific form of the reconstruction loss function: in our analysis, features from the vae trained with l loss clearly (across five cancers) outperformed those from the vae trained with l loss, and thus, consistent with way and greene ( ), we used l loss for the vae that we used to address the main question of this work (sec. . ) as well as the pan-cancer t-sne analysis (sec. . ).
fig. : marks represent tumor transcriptomes visualized using t-sne, with colors representing cancer types. (a) original gene expression data of the top , most variable genes. (b) vae compressed gene expression data. red rectangles denote the five cancer types selected for chemotherapy response classification (sec. . ).

fig. : average auroc results over five different types of cancer, by loss type (x-axis: l _loss, l _loss; y-axis: auroc). squares, mean values; bars, % confidence interval (c.i.).

the third limitation relates to the vae architecture. while it is promising that a single deep vae architecture (vae- , with a -dimensional latent space and six fully-connected layers) yielded features that outperformed pca and the original rna-seq feature data for three different cancer types (bladder, breast, and pancreatic), for colon cancer and sarcoma, it was necessary to use shallower (two-layer) vae architectures with bigger latent space dimensions ( and , respectively). of the five cancers studied, colon cancer and sarcoma had the lowest proportions of labeled samples ( . and . respectively; see table ). our findings suggest that for some cancers, a deep, low-latent-dimension vae architecture yields optimal features for predicting response, while for other cancers, a shallow, medium-sized-latent-dimension vae architecture is more effective. more study with larger datasets will be required in order to determine whether a single vae architecture could be successfully used for general-purpose tumor transcriptome feature extraction for precision oncology.
while our results show promise for the vae in the context of a semi-supervised approach for response-to-chemotherapy prediction, for colon cancer, the vae-xgboost method did not outperform pca-xgboost (though it did outperform the fully supervised approach of xgboost trained on the unencoded gene expression data). one possible explanation for the colon cancer-specific superior performance of pca features over vae features for predicting response to chemotherapy is that, while (for coad) feature importance for the vae features is sharply peaked for the first few features and falls off fairly rapidly with feature rank, the pca features have a much flatter distribution of relative feature importance (fig. ). follow-on studies with larger datasets will be required to delineate under what circumstances transcriptome vae encodings will prove superior to linear principal components.

conclusions

for four of the five cancer types that we studied, the semi-supervised vae-xgboost approach significantly outperformed a semi-supervised pca-xgboost approach for tumor transcriptome-based prediction of response to chemotherapy, reaching a top auroc of . for pancreatic adenocarcinoma. for four out of five cancer types, the semi-supervised vae-xgboost approach significantly outperformed a fully-supervised approach consisting of xgboost applied to the expression levels of the top % most variably expressed genes. given high-dimensional “omics” data, the vae is a powerful tool for obtaining a nonlinear low-dimensional embedding; it yields features that retain biological patterns that distinguish between different types of cancer and that enable more accurate tumor transcriptome-based prediction of response to chemotherapy than would be possible using the original data or their principal components.
fig. : test-set performance of the three models for predicting response to chemotherapy, across five cancer types (panels: sarc, coad, paad, blca, brca; y-axis: auroc). group abbreviations: “pca( )”, the pca-xgboost semi-supervised method ( : number of principal components used as features); “raw( , )”, the fully-supervised xgboost method ( , : number of genes used as features); and “vae(n)”, the vae-xgboost semi-supervised method (n: dimension of the latent feature space). marks correspond to individual replications of five-fold cross-validation; solid squares denote mean; bars indicate % c.i.; colors denote the type of feature-set (sec. . ): red, “pca”; olive, “raw”; cyan, vae- ; magenta, vae- ; green, vae- .

methods

we carried out all data processing and machine-learning tasks on a dell xps workstation equipped with an nvidia titan rtx gpu and running the ubuntu gnu/linux operating system version . . all of the analysis code that we implemented was executed in python version . . , except that we used r version . . for statistical analysis of auroc values (sec. . ), gene-level mad calculations (sec. . ) and plotting (sec. . ). we carried out all statistical tests using the r computing environment (version . . ) and using the r software package stats version . . .
gene expression data

from the xena data portal (goldman et al., ), we obtained tcga level tumor rna-seq transcriptome data of cancer types (totaling , tumors) and, for the response-to-chemotherapy prediction problem, five cancer types [colon adenocarcinoma (coad), pancreatic adenocarcinoma (paad), bladder carcinoma (blca), sarcoma (sarc), and breast invasive carcinoma (brca)] totaling , tumors. we selected the five cancer types based on two criteria: (i) a sufficient number (at least ) of paired tumor-transcriptome and clinical data samples available for the cancer type; and (ii) a sufficient number (at least ) of tumor transcriptome samples available (regardless of the clinical data availability) for the cancer type. we obtained both the rna-seq (gene-level) total-read-count-normalized log ( +c) read counts and normalized (fragments per kilobase of transcript per million mapped reads, fpkm (dillies et al., )) expression data for , human genes.

fig. : bars indicate the sum (over replications) of xgboost feature importance scores (x-axis: sum of importance; y-axis: rank of features). “group” indicates the low-dimensional embedding method used (vae or pca). bars separately ordered from highest to lowest (only top most important features shown).

to focus the machine-learning on the portion of the tumor transcriptome that had the most variation from tumor to tumor, we identified the top % most variable genes as measured by the median absolute deviation (mad) across tumors, of gene expression in terms of fpkm (we used fpkm for this purpose in order to mitigate bias due to read length and tumor-specific depth of sequencing). for deriving feature-sets for xgboost prediction directly based on transcript abundances or based on vae- or pca encoding, the % criterion applied to each of the five cancer types yielded a set of , genes. we computed mad using the r package stats version . . (r core team, ) with default parameters.
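as a rough illustration of the mad-based filtering described above, the following python sketch ranks genes by median absolute deviation and keeps a top fraction. the function names, the `top_fraction` argument, and the toy expression values are hypothetical and are not taken from the analysis code; the 1.4826 scaling constant mirrors the default of r's `mad` function and does not change the ranking.

```python
from statistics import median

def mad(values, constant=1.4826):
    """Median absolute deviation, scaled like the default of R's stats::mad."""
    center = median(values)
    return constant * median(abs(v - center) for v in values)

def top_variable_genes(expr_by_gene, top_fraction):
    """Rank genes by MAD across tumors and keep the top `top_fraction` of them."""
    ranked = sorted(expr_by_gene, key=lambda g: mad(expr_by_gene[g]), reverse=True)
    n_keep = max(1, round(top_fraction * len(ranked)))
    return ranked[:n_keep]

# Toy FPKM-like values (made up): one list of per-tumor values per gene.
toy_expression = {
    "geneA": [1.0, 1.1, 0.9, 1.0],
    "geneB": [0.0, 5.0, 10.0, 20.0],
    "geneC": [2.0, 2.5, 3.5, 4.0],
    "geneD": [7.0, 7.0, 7.0, 7.0],
}

# Keep the most variable half of the genes (the fraction here is arbitrary).
selected = top_variable_genes(toy_expression, top_fraction=0.5)
print(selected)
```

because mad is computed from medians, a single outlying tumor has limited influence on which genes are retained, which is one plausible reason to prefer it over the variance for this kind of filtering.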
after the variance-filtering step, we used the log ( + c) of total-count-normalized count values for the top- % highest-variance genes (that were selected as described above) to obtain (or encode) feature values. we compared the performance, in terms of minimizing the vae reconstruction loss (see sec. . ), of different feature scaling methods (no scaling, min-max normalization, and standardization (kreyszig et al., )) and selected min-max normalization as the method that we used to rescale gene-level count data for input into the vae.

t-distributed stochastic neighbor embedding (t-sne)

we computed t-sne embedding components of the tumors using the function sklearn.manifold.TSNE from the python software package scikit-learn version . . with parameters init = "pca", perplexity = , learning_rate = , and n_iter = . for plotting the tumor transcriptome t-sne embeddings, we used the r software package ggplot version . . .

variational autoencoder (vae)

an autoencoder is a type of model that combines “encoder” and “decoder” neural networks to learn a low-dimensional continuous data encoding from which the input signal can be approximately reconstructed (kramer, ). a key advantage of an autoencoder is that it is unsupervised, i.e., it can be trained without labeled examples. unlike classical autoencoders (e.g., sparse or denoising autoencoders), the variational autoencoder (vae) is a generative probabilistic model which maps an input vector to a latent-space random variable (r.v.). below, we mathematically define the vae.
let $\mathcal{T}$ denote the set of tumors for which the vae is to be fit to the tumor transcriptomes (with $n \equiv |\mathcal{T}|$) and let $m$ denote the number of genes for which transcript abundances are used to represent the tumor transcriptome. after min-max transformation of the tumor transcriptome measurements (sec. . ), each tumor’s transcriptome is represented as a vector $x \in [0,1]^m$. let $X$ denote the random variable representing the population distribution from which tumor transcriptomes are sampled, and let $\mathbf{X} \in [0,1]^{m \times n}$ represent the composite matrix of all sampled tumor transcriptomes. we aim to learn a vae that will comprise an encoder and decoder, with the encoder consisting of mean and variance functions $\mu : [0,1]^m \to \mathbb{R}^h$ and $\sigma : [0,1]^m \to \mathbb{R}^h_+$, respectively. together, $\mu$ and $\sigma$ map the tumor transcriptome vector $x_t$ to an $h$-dimensional r.v. $Z|x_t$,

$$Z|x_t \sim \mathcal{N}\left(\mu(x_t),\ \mathrm{diag}(\sigma(x_t))\right),$$ ( )

where $\mathrm{diag}(\mathbf{m})$ is a matrix whose diagonal elements are the elements of the vector $\mathbf{m}$. the decoder is a function $g : \mathbb{R}^h \to [0,1]^m$ that, for an outcome $Z|x_t = z_t \in \mathbb{R}^h$, maps

$$g : z_t \mapsto g(z_t) \equiv \tilde{x}_t;$$ ( )

the tilde on $\tilde{x}_t$ denotes that it is the decoded data for the tumor transcriptome $x_t$. a good autoencoder should have low reconstruction error $L$, which is convenient to define in terms of the $p$-norm of the difference between the tumor transcriptome data $x_t$ and the reconstructed data $\tilde{x}_t$, i.e., $\|x_t - \tilde{x}_t\|_p^p$, where $\|\cdot\|_p$ denotes the $p$-norm. however, this definition of the reconstruction error is only deterministic in the context of a specific outcome $Z|x_t = z_t$. thus, it is conventional to define the reconstruction error as an expectation value over outcomes of $Z|x_t$,

$$L|(X = x_t) \equiv \mathbb{E}_{Z|x_t = z_t}\left(\|x_t - g(z_t)\|_p^p\right),$$ ( )

where $\mathbb{E}_\Omega$ represents an expectation value over a space of outcomes $\Omega$. it should be noted that the above representation of the reconstruction error is in terms of the outcome, $z_t$, of a r.v. ($Z|x_t$) whose distributional parameter functions $\mu$ and $\sigma$ have parameters (neural network coefficients) that will be fitted.
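the definitions above can be made concrete with a small python sketch that draws one latent outcome, decodes it, and evaluates the p-norm reconstruction error for that single outcome. everything here is an illustrative assumption: `toy_decoder` is a hypothetical stand-in for the decoder g (not a trained network), and the values of `x_t`, `mu_t`, and `sigma_t` are made up. as in the text, `sigma_t` is treated as a vector of variances.

```python
import math
import random

def p_norm_error(x, x_rec, p):
    """Return ||x - x_rec||_p^p, the p-th power of the p-norm of the residual."""
    return sum(abs(a - b) ** p for a, b in zip(x, x_rec))

def sample_latent(mu, sigma, rng):
    """Draw one outcome z ~ N(mu, diag(sigma)); sigma holds variances."""
    return [rng.gauss(m, math.sqrt(s)) for m, s in zip(mu, sigma)]

def toy_decoder(z):
    """Hypothetical stand-in for g: maps a 2-d latent point into [0, 1]^3."""
    def squash(v):
        return 1.0 / (1.0 + math.exp(-v))
    return [squash(z[0]), squash(z[1]), squash(z[0] + z[1])]

rng = random.Random(0)
x_t = [0.2, 0.7, 0.5]        # made-up min-max-scaled transcriptome (m = 3)
mu_t = [-1.0, 0.5]           # made-up encoder mean, mu(x_t) (h = 2)
sigma_t = [0.1, 0.2]         # made-up encoder variances, sigma(x_t)

z_t = sample_latent(mu_t, sigma_t, rng)   # one outcome of Z | x_t
x_rec = toy_decoder(z_t)                  # decoded data for this outcome
err_p1 = p_norm_error(x_t, x_rec, p=1)
err_p2 = p_norm_error(x_t, x_rec, p=2)
print(err_p1, err_p2)
```

a different random outcome `z_t` would give a different error, which is why the reconstruction error is defined as an expectation over outcomes rather than for a single draw.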
because eq. is ill-suited to backpropagation, it is helpful to recast it in terms of a new random variable $E_t$ that depends on $Z|x_t$ by

$$E_t \equiv \left(\mathrm{diag}(\sigma(x_t))\right)^{-1/2}\left(Z|x_t - \mu(x_t)\right).$$ ( )

it follows from eq. and eq. that $E_t$ is standard multivariate normal,

$$E_t \sim \mathcal{N}(0, I),$$ ( )

where $I$ is the $h \times h$ identity matrix, and thus, $E_t$ does not depend on $\mu$, $\sigma$, or $t$. we therefore drop the subscript $t$ and simply denote the rescaled latent-space random variable as $E$. solving eq. for $Z|x_t$ and applying it to eq. , the reconstruction error $L|(X = x_t)$ can be represented by

$$L|(X = x_t) = \mathbb{E}_{E}\left(\left\|x_t - g\left(\mu(x_t) + \sqrt{\mathrm{diag}(\sigma(x_t))}\,E\right)\right\|_p^p\right),$$ ( )

which is amenable to backpropagation because the only r.v. in it is $E$, whose distributional parameters do not depend on the neural network coefficients that we will be varying. in practice, rather than computing the multivariate integral over outcomes of $E$, $L|(X = x_t)$ is typically approximated by averaging over a limited number $J$ of samples from $E$,

$$L|(X = x_t) \simeq \left\langle \left\|x_t - g\left(\mu(x_t) + \sqrt{\mathrm{diag}(\sigma(x_t))}\,\epsilon_j\right)\right\|_p^p \right\rangle_j,$$ ( )

where $\langle\cdot\rangle_j$ denotes the average over $j \in \{1, \ldots, J\}$ and $\epsilon_j$ is sample $j$ from $E$. following way and greene ( ), we used a number of samples that is equivalent to the dimension of the transcriptome, i.e., $J = m$.

for the case of p = (i.e., l norm), minimizing $L|(X = x_t)$ as defined above is equivalent to maximizing the expectation value of the log-likelihood $\log(p(g(Z) = x_t \mid X = x_t))$. however, following way and greene ( ) and consistent with empirical evidence (sec. . ), for our five-cancer study of the utility of a vae-based approach for response-to-chemotherapy prediction, as well as for the pan-cancer t-sne analysis (sec. . ), we chose to use l reconstruction loss, i.e., p = in eq. . the reconstruction loss measures bias error, whose minimization must be balanced against the simultaneous goal of controlling variance error through regularization.
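the change of variables described above can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the authors' implementation: `toy_decoder` is a hypothetical stand-in for g, all numeric values are made up, and `p` and `n_samples` are free arguments rather than the values used in the study. the sketch compares direct sampling of z with the reparameterized form (mean plus square-root variance times a standard normal draw) and then averages the p-norm error over several draws.

```python
import math
import random

def draw_direct(mu, sigma, rng):
    """Sample z ~ N(mu, diag(sigma)) directly; sigma holds variances."""
    return [rng.gauss(m, math.sqrt(s)) for m, s in zip(mu, sigma)]

def draw_reparameterized(mu, sigma, rng):
    """Sample z = mu + sqrt(sigma) * e with e ~ N(0, I); randomness lives only in e."""
    return [m + math.sqrt(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def mc_reconstruction_loss(x, mu, sigma, decode, p, n_samples, rng):
    """Average ||x - g(mu + sqrt(sigma) * e_j)||_p^p over n_samples draws of e."""
    total = 0.0
    for _ in range(n_samples):
        x_rec = decode(draw_reparameterized(mu, sigma, rng))
        total += sum(abs(a - b) ** p for a, b in zip(x, x_rec))
    return total / n_samples

def toy_decoder(z):
    """Hypothetical stand-in for g: maps a 2-d latent point into [0, 1]^3."""
    def squash(v):
        return 1.0 / (1.0 + math.exp(-v))
    return [squash(z[0]), squash(z[1]), squash(z[0] + z[1])]

x_t = [0.2, 0.7, 0.5]                       # made-up scaled transcriptome
mu_t, sigma_t = [-1.0, 0.5], [0.1, 0.2]     # made-up mu(x_t) and variances sigma(x_t)

# With the same seed, both sampling routes should agree up to rounding.
z_direct = draw_direct(mu_t, sigma_t, random.Random(0))
z_reparam = draw_reparameterized(mu_t, sigma_t, random.Random(0))

loss = mc_reconstruction_loss(x_t, mu_t, sigma_t, toy_decoder,
                              p=1, n_samples=200, rng=random.Random(1))
# With zero variance, every draw equals mu, so the estimate is the error at mu.
loss_at_mean = mc_reconstruction_loss(x_t, mu_t, [0.0, 0.0], toy_decoder,
                                      p=1, n_samples=5, rng=random.Random(1))
print(z_direct, z_reparam, loss, loss_at_mean)
```

in a framework with automatic differentiation, the practical benefit of the second form is that gradients with respect to the encoder outputs pass through a deterministic expression, with the random draw treated as a fixed input.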
in the vae, regularization requires incentivizing (in the learning of $\mu$, $\sigma$, and $g$) the latent space distributions of $Z|X$ to be close to standard multivariate normal. this is accomplished by assigning a penalty based on the kullback-leibler divergence between the distribution of $Z|x_t$ and the target distribution $E$, represented by $D_{KL}\left(p(Z|x_t)\,\|\,p(E)\right)$. this regularization is analytically tractable (duchi, ), and for a given tumor $t$ yields (see supplementary note, eq. s ) the following regularization function:

$$D_{KL}\left(p(Z|x_t)\,\|\,p(E)\right) = \tfrac{1}{2}\|\mu(x_t)\|_2^2 + \tfrac{1}{2}\|\sigma(x_t)\|_1 - \tfrac{1}{2}\mathbf{1}^{\top}\log(\sigma(x_t)) - \tfrac{h}{2},$$ ( )

where $\log(\sigma(x_t))$ denotes an element-wise log, $\|\cdot\|_1$ is the l1 norm, $\|\cdot\|_2$ is the l2 norm, and $\mathbf{1}$ is the all-ones vector of length $h$. fitting the vae to $\mathbf{X}$ requires selecting $\mu$, $\sigma$, and $g$ from their respective function spaces; in practice, we search over functions that can be represented using a neural network for $\mu$ and $\sigma$ (parameterized by the vector $\theta$) and a neural network for the function $g$ (parameterized by the vector $\phi$). exploring the space of functions $\mu_\theta$, $\sigma_\theta$, and $g_\phi$ corresponds to computationally searching for the vector pair $(\hat{\theta}, \hat{\phi})$ that together minimize the joint (over all tumors) sum of the tumor-specific reconstruction loss and the regularization penalty,

$$(\hat{\theta}, \hat{\phi}) = \underset{(\theta,\phi)}{\operatorname{argmin}} \sum_{t \in \mathcal{T}} \left[ L|(X = x_t) + D_{KL}\left(p(Z|x_t)\,\|\,p(E)\right) \right].$$ ( )

applying eqs. , , and , and setting p = as discussed above, we obtain the explicit formula for fitting a vae to $\mathbf{X}$ (written here with $p$ standing for the chosen value),

$$(\hat{\theta}, \hat{\phi}) = \underset{(\theta,\phi)}{\operatorname{argmin}} \sum_{t \in \mathcal{T}} \left[ \frac{1}{J}\sum_{j=1}^{J} \left\|x_t - g_\phi\left(\mu_\theta(x_t) + \sqrt{\mathrm{diag}(\sigma_\theta(x_t))}\,\epsilon_j\right)\right\|_p^p + \tfrac{1}{2}\|\mu_\theta(x_t)\|_2^2 + \tfrac{1}{2}\|\sigma_\theta(x_t)\|_1 - \tfrac{1}{2}\mathbf{1}^{\top}\log(\sigma_\theta(x_t)) - \tfrac{h}{2} \right].$$ ( )

we implemented eq. in tensorflow version . . with keras version . . as the model-level library. we solved eq. using the adam optimization algorithm (kingma and ba, ) (with batch normalization) from the python package keras-gpu version . . with parameters learning_rate = × − , beta_ = . , and beta_ = . , to obtain $(\hat{\theta}, \hat{\phi})$.
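for readers who want to check the regularization term numerically, the sketch below evaluates the standard closed-form kullback-leibler divergence between a gaussian with diagonal covariance and a standard multivariate normal, and adds it to a reconstruction loss to form one tumor's contribution to the objective. the closed form used here is the textbook result and is an assumption about the form of the penalty, not a transcription of the paper's equation; the function names and test values are hypothetical, and `sigma` is treated as a vector of variances.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma)) || N(0, I) ), with sigma a vector of variances.

    Standard closed form: 0.5 * sum_i (mu_i^2 + sigma_i - log(sigma_i) - 1).
    """
    return 0.5 * sum(m * m + s - math.log(s) - 1.0 for m, s in zip(mu, sigma))

def per_tumor_objective(reconstruction_loss, mu, sigma):
    """One tumor's term in the fitting objective: reconstruction loss plus KL penalty."""
    return reconstruction_loss + kl_to_standard_normal(mu, sigma)

# The penalty vanishes when the encoder output already is a standard normal.
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Shifting the mean away from zero is penalized quadratically.
kl_shifted = kl_to_standard_normal([1.0, -2.0], [1.0, 1.0])
# Shrinking the variance below one is also penalized.
kl_narrow = kl_to_standard_normal([0.0], [0.25])
# Made-up reconstruction loss combined with the penalty for one tumor.
objective = per_tumor_objective(0.8, [1.0, -2.0], [1.0, 1.0])
print(kl_zero, kl_shifted, kl_narrow, objective)
```

the penalty is zero only at zero mean and unit variance, which is what pulls the per-tumor latent distributions toward the standard normal target while the reconstruction term pulls them apart enough to remain informative.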
then, for each tumor t, we used a single sample z|xt = zt from the distribution n(µθ̂(xt), diag(σθ̂(xt))) as the final latent-space encoding of the tumor to be used for supervised learning (sec. . ).

note: the functions µ and σ are just two different outputs of the encoding neural network, differing only at the final layer, and thus for simplicity of notation we represent them as having a common parameter vector θ.

labeling tumors based on response to chemotherapy

from xena and cbioportal (cerami et al., ; gao et al., ), we obtained and combined tcga clinical data (where available) for the patients whose tumor transcriptomes we acquired (see sec. . ). from xena, we extracted the variables submitter_id.samples, therapy_type, and measure_of_response; from cbioportal, we extracted the variables sample_id, disease.free.status, and pharmaceutical.therapy.indicator. we co-analyzed the xena- and cbioportal-obtained clinical data to label tumors “responded” (y = ) or “progressive” (y = ). we assigned y = when the clinical record had the value complete response or partial response in the measure_of_response column of the xena clinical data, or the value diseasefree in the disease.free.status column of the cbioportal clinical data, with the therapy type recorded as chemotherapy in both.
we assigned y = to tumors whose clinical records had values radiographic progressive disease, clinical progressive disease, or stable disease in the xena clinical data column measure_of_response, or had value recurred/progressed in the cbioportal data column disease.free.status, with the therapy_type recorded as chemotherapy in both files. this yielded labeled tumors out of , total. a total of different drugs were used to treat the patients (see supplementary note, table s ).

vae model architectures

we trained six transcriptome-encoding vaes based on four vae architectures: the pan-cancer vae architecture (for the -cancer unsupervised analysis, see sec. . ) and three cancer type-specific vae architectures for response-to-chemotherapy prediction (sec. . ), one of which was used for three different cancer types (blca, brca, and paad) and the others of which were specific to coad and sarc, respectively. for the pan-cancer vae, we used a latent space dimension h = and three fully connected layers each for the encoder and decoder. for the cancer type-specific vae architectures, we again used the same number of fully-connected layers in the encoder as in the decoder (table ).

table . vae architectures used for predicting chemotherapy response (h, latent space dimension; “layers”, # of layers used in the encoder/decoder).
name | cancer types | h | layers
vae- | blca, brca, paad | | six
vae- | coad | | two
vae- | sarc | | two

regularized gradient boosted decision trees (xgboost)

for predicting whether or not (based on its transcriptome-derived feature-set: raw, pca, or vae) a tumor would respond to chemotherapy, we used xgboost (chen and guestrin, ), an efficient implementation of regularized gradient boosted decision trees. we used the binary classifier function XGBClassifier from the python software package xgboost version . , with gamma= .
we tuned eight hyperparameters (table ) by exhaustive grid-search with five-fold cross-validation, using sklearn.model_selection.GridSearchCV from scikit-learn version . . . to obtain feature importance scores, we used get_score with importance_type = "cover".

area under roc curve (auroc)

for computing the auroc (i.e., the area under the curve of sensitivity versus false positive rate), we used the function metrics.roc_auc_score from the python software package scikit-learn version . . with parameter average="weighted". we logit-transformed auroc values before testing (using two-tailed welch’s t-test and the wilcoxon signed-rank test). for the l vs. l analysis (fig. ), we carried out replications of five-fold cross-validation; within each replication, across the five folds, we obtained prediction scores for each tumor from the fold in which the tumor was in the test set, enabling us to compute an overall auroc within each replication. for each training data set, we carried out replications of five-fold cross-validation by altering the random seed used to assign the data splits during cross-validation. we conducted the same procedure for the five cancer types (blca, brca, coad, paad, sarc), as shown in the panel names of figure .

principal component analysis (pca)

for pca, we used the function decomposition.PCA (with parameters svd_solver = "full" and n_components = . ; % variance, yielding components) from the python package scikit-learn version . . . for plotting, we used matplotlib version . . .

funding

sar acknowledges support from the animal cancer foundation.

references

airley, r. ( ). cancer chemotherapy. wiley-blackwell, ny, ny. an, j. and cho, s. ( ). variational autoencoder based anomaly detection using reconstruction probability. technical report snudm- tr- - , seoul national university. bouchacourt, d. et al. ( ). multi-level variational autoencoder: learning disentangled representations from grouped observations. arxiv: . . casado, e. et al. ( ).
a combined strategy of sage and quantitative pcr provides a -gene signature that predicts preoperative chemoradiotherapy response and outcome in rectal cancer. plos one, , – . cerami, e. et al. ( ). the cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. cancer discovery, , . chabner, b. a. and longo, d. l. ( ). cancer chemotherapy and biotherapy: principles and practice. lippincott williams & wilkins, philadelphia, pa, fourth edition. chen, t. and guestrin, c. ( ). xgboost: a scalable tree boosting system. arxiv: . . chiu, y.-c. et al. ( ). predicting drug response of tumors from integrated genomic profiles by deep neural networks. bmc medical genomics, ( ), . corrie, p. g. ( ). cytotoxic chemotherapy: clinical aspects. medicine, ( ), – . del rio, m. et al. ( ). gene expression signature in advanced colorectal cancer patients select drugs and response for the use of leucovorin, fluorouracil, and irinotecan. journal of clinical oncology: official journal of the american society of clinical oncology, ( ), – . dillies, m.-a. et al. ( ). a comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. briefings in bioinformatics, ( ), – . dolezal, j. m. et al. ( ). diagnostic and prognostic implications of ribosomal protein transcript expression patterns in human cancers. bmc cancer, ( ), . dong, h. et al. ( ). variational autoencoder for anti-cancer drug response prediction. arxiv: . . duchi, j. ( ). derivations for linear algebra and optimization. technical report, stanford university.

table . xgboost classification algorithm hyperparameters and hyperparameter ranges used in grid-search tuning.
hyperparameter name | hyperparameter description | hyperparameter range
n_estimators | number of trees to fit | ( , , , . . ., )
max_depth | maximum tree depth | ( , , , . . ., )
learning_rate | boosting learning rate | ( . , . , . , . , . , . )
min_child_weight | minimum sum of instance weight needed in a child | ( , , , . . ., )
subsample | sub-sample ratio of the training instance | ( . , . , . , . . ., . )
colsample_bytree | sub-sample ratio of columns when constructing each tree | ( . , . , . , . . ., . )
reg_alpha | coefficient of l regularization for the node weights | ( , , , )
reg_lambda | coefficient of l regularization for the node weights | ( , , . . ., )

esteva, a. et al. ( ). dermatologist-level classification of skin cancer with deep neural networks. nature, ( ), – . evans, c. et al. ( ). selecting between-sample rna-seq normalization methods from the perspective of their assumptions. briefings in bioinformatics, ( ), – . frankish, a. et al. ( ). gencode reference annotation for the human and mouse genomes. nucleic acids research, , d –d . gao, j. et al. ( ). integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. science signaling, , . geeleher, p. et al. ( ). clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. genome biology, ( ), r . george, t. m. and lio, p. ( ). unsupervised machine learning for data encoding applied to ovarian cancer transcriptomes. biorxiv; doi: . / . goldman, m. et al. ( ). the ucsc xena platform for public and private cancer genomics data visualization and interpretation. biorxiv; doi: . / . gurney, h. ( ). how to calculate the dose of chemotherapy. british journal of cancer, , – . gámez-pozo, a. et al. ( ).
prediction of adjuvant chemotherapy response in triple negative breast cancer with discovery and targeted proteomics. plos one, , . hutter, c. and zenklusen, j. c. ( ). the cancer genome atlas: creating lasting value beyond its data. cell, ( ), – . jimenez rezende, d. et al. ( ). stochastic backpropagation and approximate inference in deep generative models. arxiv: . . kaestner, s. a. and sewell, g. j. ( ). chemotherapy dosing part i: scientific basis for current practice and use of body surface area. clinical oncology, , – . kingma, d. p. and ba, j. ( ). adam: a method for stochastic optimization. arxiv: . . kingma, d. p. and welling, m. ( ). auto-encoding variational bayes. arxiv, page arxiv: . . kipf, t. n. and welling, m. ( ). variational graph auto-encoders. arxiv: . . kramer, m. a. ( ). nonlinear principal component analysis using autoassociative neural networks. aiche journal, ( ), – . kreyszig, e. et al. ( ). advanced engineering mathematics. wiley, hoboken, nj, tenth edition. kurokawa, y. et al. ( ). molecular prediction of response to - fluorouracil and interferon-α combination chemotherapy in advanced hepatocellular carcinoma. aacr, ( ), – . lee, k. et al. ( ). cpem: accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network. scientific reports, ( ), . li, x. and she, j. ( ). collaborative variational autoencoder for recommender systems. in proceedings of the rd acm sigkdd international conference on knowledge discovery and data mining, pages – , new york, ny. acm. malfuson, j.-v. et al. ( ). risk factors and decision criteria for intensive chemotherapy in older patients with acute myeloid leukemia. haematologica, ( ), – . mitchel, j. et al. ( ). a translational pipeline for overall survival prediction of breast cancer patients by decision-level integration of multi- omics data. in ieee international conference on bioinformatics and biomedicine (bibm), pages – . qin, j. et al. ( ). 
ica based semi-supervised learning algorithm for bci systems. in j. rosca, d. erdogmus, j. c. príncipe, and s. haykin, editors, independent component analysis and blind signal separation, pages – , berlin. springer. r core team ( ). r: a language and environment for statistical computing. r foundation, vienna, austria. isbn - - - . skeel, r. t. ( ). handbook of cancer chemotherapy. lippincott williams & wilkins, philadelphia, pa, sixth edition. titus, a. j. et al. ( ). unsupervised deep learning with variational autoencoders applied to breast tumor genome-wide dna methylation data with biologic feature extraction. biorxiv; doi: . / . wang, z. et al. ( ). rna-seq: a revolutionary tool for transcriptomics. nature reviews genetics, ( ), – . way, g. p. and greene, c. s. ( ). evaluating deep variational autoencoders trained on pan-cancer gene expression. arxiv: . . way, g. p. and greene, c. s. ( ). extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. pacific symposium on biocomputing, , – . weir, b. et al. ( ). somatic alterations in the human cancer genome. cancer cell, ( ), – . wen, h. and huang, f. ( ). personal loan fraud detection based on hybrid supervised and unsupervised learning. in th ieee international conf. on big data analytics (icbda), pages – . whelan, t. et al. ( ). helping patients make informed choices: a randomized trial of a decision aid for adjuvant chemotherapy in lymph node-negative breast cancer. jnci: journal of the national cancer institute, ( ), – . zhang, y. et al. ( ). a novel xgboost method to identify cancer tissue- of-origin based on copy number variations. front genet, , . .cc-by . international licensereview) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a the copyright holder for this preprint (which was not certified by peerthis version posted january , . ; https://doi.org/ . / . . . 
full-length de novo protein structure determination from cryo-em maps using deep learning

jiahua he and sheng-you huang∗
school of physics, huazhong university of science and technology, wuhan, hubei , p. r. china

abstract

advances in microscopy instruments and image processing algorithms have led to an increasing number of cryo-em maps. however, building accurate models for the em maps at - å resolution remains a challenging and time-consuming process. with the rapid growth of deposited em maps, there is an increasing gap between the maps and reconstructed/modeled -dimensional ( d) structures. therefore, automatic reconstruction of atomic-accuracy full-atom structures from em maps is pressingly needed. here, we present a semi-automatic de novo structure determination method using a deep learning-based framework, named deepmm, which builds atomic-accuracy all-atom models from cryo-em maps at near-atomic resolution. in our method, the main-chain and cα positions as well as their amino acid and secondary structure types are predicted in the em map using densely connected convolutional networks. deepmm was extensively validated on simulated maps at å resolution and experimental maps at . - .
å resolution as well as an emdb-wide data set of experimental maps at . - . å resolution, and compared with state-of-the-art algorithms including rosettaes, mainmast, and phenix. overall, our deepmm algorithm obtained a significant improvement over existing methods in terms of both accuracy and coverage in building full-length protein structures on all test sets, demonstrating the efficacy and general applicability of deepmm.

availability: https://github.com/jiahuahe/deepmm
supplementary information: supplementary data are available.
∗email: huangsy@hust.edu.cn; phone: + - - ; fax: + - -

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).

introduction

cryo-electron microscopy (cryo-em) has become a widely used technique for determining macromolecular structures over the past decade – . advances in microscopy instruments and image processing algorithms have led to a rapid increase in the number of solved em maps – . the ‘resolution revolution’ in cryo-em has paved the way for the determination of high-resolution structures of previously intractable biological systems – . according to the statistics of the electron microscopy data bank (emdb) , there were maps deposited in , which are almost times the maps released in . with the rapid growth of deposited em maps, there is an increasing gap between the maps and reconstructed/modeled -dimensional ( d) structures. as of april , , there were emdb maps, but only associated structures were deposited in the protein data bank (pdb) . for those maps determined at near-atomic resolution ( . ∼ .
å), it is difficult to build high-resolution models with conventional software designed for x-ray crystallography. given that near-atomic resolution maps make up the majority of current and forthcoming maps , tools that can reconstruct structures de novo from em maps without using known structures as templates , are pressingly needed. as such, some algorithms like em-fold , gorgon , rosetta , , pathwalking – , phenix – , and mainmast , , have been recently presented for constructing and/or assembling structure fragments from cryo-em maps.

despite the present progress in de novo structure building for cryo-em maps, current approaches have various limitations. they can either only build structural fragments , , or have low accuracy in terms of coverage and/or sequence reproduction , , . it remains challenging to automatically build an accurate all-atom structure from em maps at near-atomic resolution. recently, machine learning has been actively applied in structure determination for em maps, such as single particle picking , tomogram annotation , secondary structure prediction , and backbone tracing . however, applying deep learning to build full-length protein structures for near-atomic resolution em maps remains challenging.

here, we have developed a semi-automatic de novo atomic-accuracy structure reconstruction method for em maps at near-atomic resolution through densely connected convolutional networks (densenets) using a deep learning-based framework, named deepmm.
instead of tracing the protein main-chain on the raw em density map, deepmm first predicted the probability of main-chain atoms (n, c, and cα) and cα positions near each grid point using one densenet . then, the method traced the main-chain according to the predicted main-chain probability map. the amino acid and secondary structure types were predicted by a second densenet. finally, the protein sequence was aligned to the main-chain according to the predicted cα probabilities, amino acid types, and secondary structure types for all-atom structure building.

methods

workflow of deepmm

the workflow of deepmm is illustrated in figure a. specifically, starting from a cryo-em map and the target protein sequence, deepmm first standardizes the axis order and interpolates the grid interval to . å. then, deepmm cuts the entire map into small voxels of size å × å × å. afterwards, one densenet (say densenet a) is used to predict the main-chain and cα probability on each of the voxels. all the predicted probability values form a d probability map. next, possible main-chain paths are generated in the predicted main-chain probability map using a main-chain tracing algorithm . the cα probability values of main-chain points are interpolated from the predicted d cα probability map. afterwards, the amino acid and secondary structure types are predicted for each main-chain point through the second densenet (say densenet b). with the predicted cα probability, amino acid type, and secondary structure type for each main-chain point, the target protein sequence is then aligned to the main-chain paths based on the smith-waterman dynamic programming (dp) algorithm . the resulting cα models are ranked by their alignment scores. finally, the all-atom structures are constructed from the top cα models using the ctrip program in the jackal modeling package , and refined by energy minimization using amber .
training the densenets of deepmm

two densely connected convolutional networks (densenets) are embedded into our deepmm algorithm. figure b illustrates the architecture of the networks. densenet is a feed-forward multi-layer network which uses additional paths between earlier and later layers in a dense block. densenets have several compelling advantages. they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters . deepmm also employs a hard parameter-sharing multi-task learning method, which can greatly reduce the risk of overfitting . the first network (i.e. densenet a) is used to simultaneously predict the main-chain probability and cα probability of a grid point. the second network (i.e. densenet b) is used to predict the amino acid type and secondary structure type of a main-chain local dense point (ldp). the inputs to densenet a are voxels of size å × å × å. the second network (densenet b) takes voxels of size å × å × å as input because main-chain points are not always on the integer grid after mean shift. for each voxel, the density values are normalized to the range of [ , ] according to the maximum and minimum density values in the voxel. d convolutions and d pooling layers are used instead of their d counterparts used in traditional image processing because the density maps have three dimensions. several dense blocks are used in both networks, each of which consists of eight densely connected layers.
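the per-voxel normalization described above can be sketched as follows. this is a minimal illustration rather than the authors' code: the function name `normalize_voxel` is hypothetical, the target range is assumed to be [0, 1] (the bounds are not legible in this text), and mapping a constant voxel to zeros is an added assumption to avoid division by zero.

```python
import numpy as np


def normalize_voxel(voxel):
    """Min-max scale one density voxel using its own minimum and maximum.

    Assumption: the target range is [0, 1]; the source does not show the bounds.
    """
    voxel = np.asarray(voxel, dtype=float)
    lo, hi = voxel.min(), voxel.max()
    if hi == lo:
        # Assumed convention for a flat voxel (not specified in the text).
        return np.zeros_like(voxel)
    return (voxel - lo) / (hi - lo)


# Toy 3D voxel with arbitrary density values (hypothetical data).
rng = np.random.default_rng(0)
example = rng.normal(size=(4, 4, 4))
scaled = normalize_voxel(example)
```

because each voxel is scaled by its own extrema, the network input does not depend on the absolute density scale of the map, which differs between depositions.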
for densenet a, the first two dense blocks are shared by both tasks, whereas for densenet b, only one shared block is adopted. after the shared blocks, each task employs two task-specific blocks and gives the final prediction. the details of the network architecture are provided in supplementary table .

all the training parameters and procedures used for simulated em maps are essentially the same as those used for experimental em maps unless otherwise specified. for densenet a, all the grid points above a density value d were used for training, where d was set to . for simulated maps at . å resolution. for experimental maps, d was set to / of its recommended contour level. the labels (main-chain probability and cα probability) of a grid point $\vec{a}$ were calculated as follows:

$$p_{\vec{x}}(\vec{a}) = \min\left\{ e^{-\|\vec{a}-\vec{x}\| / r},\ \forall\, \vec{x} \in \|\vec{a}-\vec{x}\| < r_{cut} \right\} \quad ( )$$

where x stands for the n, c, or cα atoms, and r is the radius at which the probability drops to /e. if no atom is within rcut of a grid point, the corresponding probability is set to . a total of voxels were trained in one batch and epochs were trained for the whole data set. the adam optimizer with an initial learning rate of . was used to minimize the mean absolute error (mae). learning rate decay was adopted, where the learning rate was reduced to / of the current value after every epochs. to avoid over-fitting, the weight decay parameter of the adam optimizer was set to e- as the l regularization.

for densenet b, one point was randomly sampled within . å for every main-chain atom in the training set.
the corresponding amino acid type and secondary structure type assigned by stride were given to each point. twenty types of amino acids were grouped into four classes according to their sizes, shapes and distributions in their em density maps , as illustrated in figure d. specifically, gly, ala, ser, cys, val, thr, ile and pro are grouped as class i. leu, asp, asn, glu, gln and met are grouped as class ii. lys and arg are grouped as class iii. his, phe, tyr and trp are grouped as class iv. residues that have structure codes of h, g, or i by stride were labelled as “helix”, those with codes of b/b or e were labelled as “sheet”, and the other residues were labelled as “coil”. all the training parameters were identical to those for densenet a except for using crossentropyloss as the loss function.

tracing the main-chain path

the main-chain tracing algorithm in mainmast was used to trace the main-chain path in our predicted main-chain probability map. in brief, local dense points (ldps) are first identified using the mean shift algorithm, which iteratively shifts the initial grid points towards the local highest probability by computing the weighted average of probability values. then, the shifted points that are within a threshold distance of . å are clustered, and the point with the highest probability in the cluster is chosen as the representative, called ldp. the next step is to connect ldps into a minimum spanning tree (mst) and iteratively refine the tree structure with a tabu search method. after multiple steps of tabu search, the longest path of the refined tree is traced as the main-chain path. the details of the algorithm can be found in the mainmast study .

aligning target sequence to main-chain path

the smith-waterman dynamic programming (dp) algorithm is used to align the target sequence to the predicted main-chain path. the predicted cα probability value, amino acid type, and secondary structure type are assigned to each point of the main-chain.
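the four residue classes and the three secondary-structure labels listed above can be written as simple lookup tables. this is an illustrative sketch only: the names `AA_CLASSES`, `AA_TO_CLASS` and `stride_to_label` are hypothetical, three-letter residue codes are an arbitrary choice, and the bridge codes are assumed to be stride's upper- and lower-case "B"/"b" (letter case is not preserved in this text).

```python
# Residue classes as listed in the text (class I to class IV).
AA_CLASSES = {
    "I": ("GLY", "ALA", "SER", "CYS", "VAL", "THR", "ILE", "PRO"),
    "II": ("LEU", "ASP", "ASN", "GLU", "GLN", "MET"),
    "III": ("LYS", "ARG"),
    "IV": ("HIS", "PHE", "TYR", "TRP"),
}

# Reverse lookup: residue name -> class label.
AA_TO_CLASS = {
    aa: cls for cls, members in AA_CLASSES.items() for aa in members
}


def stride_to_label(code):
    """Collapse a one-letter STRIDE code into helix / sheet / coil."""
    if code in ("H", "G", "I"):
        return "helix"
    if code in ("B", "b", "E"):  # assumed case-sensitive bridge codes
        return "sheet"
    return "coil"


labels = [stride_to_label(c) for c in "HGEbTC"]
```

keeping the grouping in one table makes it easy to check that all twenty residue types are covered exactly once.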
instead of using the twenty individual amino acid types, amino acids are grouped into four classes according to their sizes, shapes, and distributions in em density maps (figure d). secondary structures are categorized into three types: helix, sheet, and coil. the match between the target sequence and main-chain path is evaluated by two scoring matrices for amino acid and secondary structure, respectively (figure b). namely, a target residue is more likely to be aligned to a main-chain point with the same amino acid type, the same secondary structure type, and a higher cα probability, and vice versa.

the detailed alignment protocol is shown in figures a, b and c. the n residues {a_i (i = , ..., n)} in the protein are aligned to m ldps {l_j (j = , ..., m)} in the main-chain path. the matching score m(i, j) for a pair of a_i and l_j is computed as follows.

$$m(i, j) = w_{aa}\, m_{aa}\big(t_{aa}(a_i), t_{aa}(l_j)\big) + w_{ss}\, m_{ss}\big(t_{ss}(a_i), t_{ss}(l_j)\big) \quad ( )$$

where m_aa and m_ss are the scoring matrices for amino acid and secondary structure matching , , respectively. for a residue a_i, the amino acid type is one of the four amino acid classes (t_aa(a_i) = , , , ). the predicted amino acid type for an ldp l_j is also one of the four amino acid classes (t_aa(l_j) = , , , ). similarly, the secondary structure matching score is calculated using the secondary structure type predicted from the sequence (t_ss(a_i) = , , ) by spider and the secondary structure type predicted on ldps (t_ss(l_j) = , , ). the scoring matrices m_aa and m_ss used in the alignment are shown in figure b. w_aa and w_ss are the weights for the corresponding matching scores and are set to . and .
, respectively. with the calculated matching score m(i, j), the dp matrix f is filled according to the following rule:

$$f(i, j) = \max \begin{cases} f(i-1, j) + \mathrm{gap} \\ f(i-1, j-1) - w_{c\alpha-c\alpha}\,|d_{std} - d| + w_{c\alpha}\, p_{c\alpha}(j) + m(i, j) \\ f(i, j-1) \end{cases} \quad ( )$$

where gap is the gap penalty for unassigned residues in the protein sequence. to ensure a full-length structure reconstruction, gap is set to − . so as to forbid skipped residues. |d_std − d| is the penalty score for the cα-cα distance, where d_std is the standard cα-cα distance and d is the distance between ldp l_j and the last aligned ldp. p_cα(j) is the predicted cα probability for ldp l_j. w_cα−cα and w_cα are the weights for the corresponding scores. here, w_cα is set to . , and w_cα−cα is set to . , . , and . for “helix”, “sheet”, and “coil”, respectively. for each combination of parameters in the main-chain tracing procedure, cα models are generated. finally, all the generated cα models are ranked by their alignment scores.

parameter settings of deepmm

the parameters of mean-shift, mst construction, and tabu search are set to be the same as those in mainmast , unless otherwise specified. deepmm employs several parameter combinations to generate multiple cα models for one em map. for each combination of parameters, trajectories of tabu search are carried out, yielding main-chain paths. since deepmm starts from the main-chain probability map, fewer parameter combinations are needed to reconstruct reliable d structures.
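the dp recurrence from the alignment section above can be sketched in a few lines. this is a simplified illustration under stated assumptions, not the deepmm implementation: the function name `fill_dp_matrix` and all numerical values are hypothetical, the first row and column of f are assumed to be zero (the boundary conditions are not given in the text), a single weight is used instead of per-secondary-structure weights, and d is approximated by the distance between consecutive ldps on the path rather than the distance to the last aligned ldp.

```python
import numpy as np


def fill_dp_matrix(match, p_ca, ldp_xyz, d_std, w_cc, w_ca, gap):
    """Fill f for n residues (rows) against m LDPs (columns).

    match[i][j] : matching score m(i, j) of residue i and LDP j
    p_ca[j]     : predicted C-alpha probability of LDP j
    ldp_xyz[j]  : coordinates of LDP j along the traced path
    """
    match = np.asarray(match, dtype=float)
    ldp_xyz = np.asarray(ldp_xyz, dtype=float)
    n, m = match.shape
    f = np.zeros((n + 1, m + 1))  # assumed boundary: zeros
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Simplification: distance to the previous LDP on the path.
            if j > 1:
                d = np.linalg.norm(ldp_xyz[j - 1] - ldp_xyz[j - 2])
            else:
                d = d_std
            diag = (f[i - 1, j - 1] - w_cc * abs(d_std - d)
                    + w_ca * p_ca[j - 1] + match[i - 1, j - 1])
            f[i, j] = max(f[i - 1, j] + gap, diag, f[i, j - 1])
    return f


# Toy example with two residues and two LDPs (hypothetical values).
match = [[1.0, 0.0], [0.0, 1.0]]
p_ca = [0.5, 0.5]
ldp_xyz = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
f = fill_dp_matrix(match, p_ca, ldp_xyz,
                   d_std=3.0, w_cc=1.0, w_ca=1.0, gap=-10.0)
```

in this toy case the diagonal path wins at every step, because the large negative gap makes skipping a residue far more costly than skipping an ldp, which is the behaviour the text describes for full-length reconstruction.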
for both simulated and experimental maps, the thresholds of probability (Φthr) and normalized probability (θthr) are both set to . for the simulated maps, only one parameter combination is adopted. specifically, the maximum number of tabu search steps (nround) is set to , the sphere radius of local mst (rlocal) is set to . å, and the constraint for the length (dkeep) is set to . å. for the experimental maps, we employ the following combinations of parameters: the sphere radius of local mst (rlocal= . , . , . å), the edge weight threshold (dkeep= . , . , . å), and the maximum number of the tabu search steps (nround= , , ). for the extended emdb-wide test set of maps, we employ fewer combinations of parameters so as to save computational cost: the edge weight threshold (dkeep= . , . å) and the maximum number of the tabu search steps (nround= , ). the sphere radius of local mst (rlocal) is set to å. for each generated main-chain path, cα models are generated using different standard cα-cα distances (dstd= . , . , . , . , . , . , . , . å) in both sequence directions. namely, models ( models for each of the trajectories) are constructed for each parameter combination. the cα models are ranked by their alignment scores, and then an rmsd cutoff of å is used to remove the lower-scoring model of any two similar structures. finally, the top scored protein cα models are selected to build the all-atom structures.

datasets used
training sets

two data sets, a simulated em map set and an experimental em map set, were used to train our deepmm method for simulated maps and experimental maps, respectively.

for simulated em maps, representative structures for different superfamilies in the scope database were taken from emap sec as the training set. structures were removed from the training set if they had a tm-score of over . with any structure in the test set. to save computational cost, only randomly selected structures from the training set were retained. next, we used the e pdb mrc.py program from the eman package (version . ) to generate the simulated em maps at . å resolution and . å grid interval for each structure in the training and test sets. the training scope entries used in this study are listed in supplementary table .

for experimental em maps, all the em density maps at - å resolution that have associated pdb models were downloaded from the emdb. as of december , , em maps were collected. any pdb structure and its corresponding em map that met any of the following criteria were removed: (i) including nucleic acids, (ii) missing side-chain atoms, (iii) including “hetatm” residues, (iv) including “unk” residues, (v) including more than subunit (model), and (vi) including less than or more than residues. then, chains from the remaining experimental em maps were clustered with % sequence identity using cd-hit , yielding a total of chains. to ensure a valid evaluation, chains were removed from the training set if they had over % sequence identity with any chain in the test set. each protein chain was zoned out from the whole map using a distance of . å . for good quality maps, a protein chain and its associated map should have sufficient structural agreement. the cross-correlation between the experimental map and a simulated density map generated from the structure at the same resolution as the experimental map was calculated using the ucsf chimera .
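as a rough illustration of the map-to-model agreement check described above, the correlation between two density grids can be computed as below. this sketch is not the ucsf chimera routine: the function name `density_correlation` is hypothetical, a pearson-type correlation about the mean is assumed, and both maps are assumed to be sampled on the same grid.

```python
import numpy as np


def density_correlation(map_a, map_b):
    """Correlation about the mean between two density grids of equal shape."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must be sampled on the same grid")
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A flat map carries no signal; report zero correlation in that case.
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy grids (hypothetical data): a map, a noisy copy, and its negative.
rng = np.random.default_rng(1)
experimental = rng.random((8, 8, 8))
simulated = experimental + 0.05 * rng.normal(size=experimental.shape)
cc_self = density_correlation(experimental, experimental)
cc_noisy = density_correlation(experimental, simulated)
cc_neg = density_correlation(experimental, -experimental)
```

a chain would then be kept or discarded by comparing this value with the chosen cross-correlation threshold.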
only the chains with a cross-correlation of over . were kept . the final training set consists of non-redundant protein chains. the grid intervals for experimental maps were unified to . å using trilinear interpolation. the training em maps and their corresponding pdb chains used in this study are listed in supplementary table .

test sets

three test sets were used to evaluate our deepmm approach for its accuracy and general applicability, including one simulated map set and two experimental map sets. the simulated map set was taken from the test set of simulated maps used by mainmast . the maps were generated at . å resolution with a grid spacing of . å using the e pdb mrc.py program in the eman package .

the first experimental test set is the benchmark of em maps at . - . å resolution, which has been used to evaluate mainmast . the corresponding em maps were downloaded from the emdb. for each em map, a single subunit was zoned out from the whole density map at a distance cutoff of . å.

in addition, to evaluate the accuracy and general applicability of deepmm, we have also constructed a large test set of emdb-wide experimental maps. the generation procedure of this set was similar to that for the experimental training set. specifically, for each chain of an em pdb structure at . - . å resolution with no more than one subunit (model) from the emdb, a single density patch was zoned out from the whole density map at a distance cutoff of . å.
any protein chain and its corresponding em map patch that met any of the following criteria were removed: (i) including nucleic acids, (ii) missing side-chain atoms, (iii) including “hetatm” residues, (iv) including “unk” residues, (v) including less than or equal or more than residues, (vi) having over % sequence identity to any chain in the training set. the cross-correlation between the experimental map and the simulated density map at the same resolution generated from the structure should be over . . each protein chain was zoned out from the whole map using a distance of . å . the final test set consists of protein chains, which are listed in supplementary table .

results

model reconstruction for simulated em maps

we first evaluated the performance of our deepmm algorithm on the test set of simulated density maps at å resolution. deepmm traced the main-chain of the protein on the predicted main-chain probability map rather than the raw em density map. thus, the cα models generated by deepmm are closer to the native structures with fewer search trajectories and steps compared to mainmast. for each of the maps, deepmm built cα models, which were ranked by their alignment scores. the top-ranked model was selected as the predicted structure.

figure shows a comparison of the predicted cα models for the protein chains of different lengths by deepmm and mainmast. the detailed results are provided in supplementary table . it can be seen from the figure that our deepmm method obtained a much better performance than mainmast.
as shown in figure a, deepmm built significantly more accurate cα models, and achieved an average cα rmsd of . å when the top scored model was considered, compared to . å for mainmast. deepmm also generated high-quality models with less than . å cα rmsd for all of the maps, compared with only one such model by mainmast. moreover, deepmm achieved high-accuracy models with less than . å rmsd for of maps, whereas mainmast failed to generate any model with < . å rmsd (figure a). the program click was also used to evaluate the accuracy of the cα models built by deepmm and mainmast. the corresponding results are shown in figure b. similar to the results of the cα rmsd comparison, deepmm generated many more high-quality models according to the click rmsd criterion and achieved an average click rmsd of . å when the top model was considered, compared to . å for mainmast.

in addition, deepmm also achieved a significantly higher structure overlap than mainmast (figure c). except for two top scored models with . % and . % structure overlap, the remaining top models generated by deepmm all have a % structure overlap. on average, deepmm obtained a high structure overlap of . %, compared to . % for mainmast. figure also reveals that deepmm generated consistently high-accuracy models for proteins of all lengths, whereas mainmast tended to perform worse with the increasing number of residues in the protein, suggesting that deepmm is more robust than mainmast.

model reconstruction for experimental em maps

our deepmm method was further tested on the benchmark of experimental density maps at . - . å resolution. for each of the experimental density maps, deepmm built protein cα models, which were then ranked by their alignment scores.

figure a shows a comparison of the cα rmsds for the models built by deepmm and mainmast. the corresponding data are provided in supplementary table . it can be seen from the figure that deepmm generated significantly more accurate models than mainmast. on average, deepmm obtained a cα rmsd of . å for the top scored models, which is much better than . å by mainmast. moreover, deepmm predicted a model of < å for out of top scored models, of which models are within . å cα rmsd. by contrast, only and models are within . å and . å for mainmast, respectively.

figure b shows a comparison of the results for the models predicted by deepmm and rosettaes. it can be seen from the figure that deepmm performed much better and generated many more accurate models than rosettaes. compared to models within å rmsd by deepmm, only six models were predicted within . å rmsd by rosettaes for the top predictions. on average, rosettaes obtained a cα rmsd of . å, which is much higher than . å for deepmm.

further examination of the predicted results also reveals that the model accuracy depends more on the quality than on the resolution of a map. namely, compared to maps with relatively higher resolution but lower quality like emd- a/b ( . å) and emd- ( . å), maps with relatively lower resolution but higher quality like emd- ( . å) and emd- ( . å) are more likely to yield a correct model (supplementary table ). this phenomenon can be attributed to the fact that resolution is a global estimate and resolvability is not necessarily uniform throughout the whole map .

figure gives two examples of successfully reconstructed structures by deepmm.
one example, emd- , which is a nucleoprotein at . å resolution, was successfully reconstructed by deepmm, as shown in figure a. it can be seen from the figure that the main chain predicted by deepmm overlaps well with that of the deposited structure. accordingly, the predicted model shows atomic accuracy with a cα rmsd of . å. figure b shows the results of another example, emd- , which is the bovine rotavirus vp at . å resolution. because of the high resolution of the map, deepmm predicted a highly accurate model with a small cα rmsd of . å. correspondingly, the full-atom model constructed by deepmm shows an excellent overlap with the deposited structure (figure b).

evaluation of deepmm on the emdb-wide data set

to investigate the accuracy and general applicability of our deepmm method, we have further evaluated the performance of deepmm on a large test set of emdb-wide experimental maps. this large test set consists of diverse em maps with . - . å resolutions from the emdb that have associated structures in the pdb (see the methods section). for each of the test cases, deepmm was run to reconstruct structures using four combinations of parameters, yielding models for each case. figure shows a summary of the results predicted by deepmm. the corresponding data are provided in supplementary table . two metrics, rmsd and tm-score, were used to evaluate the overall accuracy of predicted models. on average, deepmm achieved a cα rmsd of . å for the top prediction and . å for the top predictions on this test set of maps. the corresponding average tm-scores are .
and . for top and top predictions, suggesting the high accuracy of our deepmm approach. figure a shows the percentage of the predicted models at different cα rmsd cutoffs. it can be seen from the figure that . % of the top models built by deepmm are within å cα rmsd. for the top scored predictions, . % of the cases have an rmsd of less than å. the percentages of the models at different tm-score cutoffs are shown in figure b. it can be seen from the figure that . % of the top models built by deepmm have a tm-score of > . . when the top models were considered, the corresponding percentage increased to . %. comparing the results in figures a and b also reveals that the percentages for tm-score are significantly higher than those for cα rmsd, suggesting that the models built by deepmm still share the same fold with the native structure even if they have a large cα rmsd.

figure c shows the percentage of correctly predicted top models (i.e. within å cα rmsd) at different resolutions. for em maps at . - . å resolution, deepmm performed very well in reconstructing a correct model, achieving a success rate of . % and . % for the top and scored models, respectively. the performance of deepmm decreased with decreasing map resolution. specifically, for the em maps with a resolution of . - . å, . - . å, and . - . å, deepmm obtained a success rate of . %/ . %, . %/ . %, and . %/ . % for the top / predictions, respectively. for em maps with a resolution of . å or worse, it is challenging for deepmm to build correct models.
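the success rates quoted above for the top-ranked model and for the top-n models reduce to a simple tally over test cases. the snippet below is a minimal sketch of that tally, not the authors' evaluation script: the function name `success_rate`, its arguments, and the example scores and cutoff are hypothetical, and it assumes every test case comes with a list of model scores already ordered by rank.

```python
from typing import Callable, Sequence


def success_rate(
    cases: Sequence[Sequence[float]],
    top_n: int,
    is_correct: Callable[[float], bool],
) -> float:
    """Percentage of test cases with at least one correct model among the top_n.

    cases      -- one sequence of scores per test case, ordered by model rank
    top_n      -- how many top-ranked models to consider per case
    is_correct -- predicate deciding whether a single score counts as correct
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    if not cases:
        raise ValueError("at least one test case is required")
    hits = sum(1 for scores in cases if any(is_correct(s) for s in scores[:top_n]))
    return 100.0 * hits / len(cases)


# Hypothetical Cα RMSD values (Å) for three test cases, ranked by alignment score.
rmsd_by_case = [[1.2, 0.9], [6.5, 2.0], [8.1, 7.4]]
rmsd_cutoff = 3.0  # illustrative cutoff, not taken from the paper

top1 = success_rate(rmsd_by_case, 1, lambda r: r < rmsd_cutoff)
top2 = success_rate(rmsd_by_case, 2, lambda r: r < rmsd_cutoff)
```

the same function covers a tm-score criterion by passing a predicate such as `lambda t: t > cutoff`, since only the direction of the comparison changes.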
on average, for the maps at - å resolution, deepmm gave a success rate of . % and . % in reconstructing a correct model within å cα rmsd for the top and predictions, respectively. figure d shows the percentage of correctly predicted top models using the criterion of tm-score > . in different resolution ranges. trends similar to those in figure c can be observed in figure d. specifically, for the maps with a resolution of . - . å, . - . å, . - . å, . - . å, and . - . å, deepmm achieved correct models with a tm-score of > . for . %/ . %, . %/ . %, . %/ . %, . %/ . %, and . %/ . % of the test cases when the top / predictions were considered, respectively. on average, for the maps at - å resolution, deepmm obtained a success rate of . % and . % in building a model with tm-score > . for the top and predictions, respectively.

next, deepmm was compared with phenix on this test set, where the phenix models were generated using the phenix.map_to_model tool in the phenix package (version . . - ). two metrics calculated by phenix.chain_comparison were used to evaluate the accuracy of a model. one is the fraction of the cα atoms in one model matching the cα atoms in another model within . å regardless of their residue names (i.e. coverage or residue match). the other is the percentage of the sequence in the target structure reproduced by the query model (i.e. specificity of sequence match). it should be mentioned that our sequence match is conducted using types of amino acids. a model with a high percentage of residue match may have a very low percentage of sequence match because of mismatched residue names. figures a and b show the percentages of protein residues and the sequence reproduced by deepmm and phenix at different resolutions. figures c and d give the histograms of the corresponding average values at different resolutions.
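to make the two metrics defined above concrete, the sketch below computes a residue match and a sequence match from two lists of cα atoms. it is an illustration under stated assumptions, not a reimplementation of phenix.chain_comparison: the greedy nearest-neighbour pairing, the function name `match_metrics`, the `(residue_name, (x, y, z))` tuple layout, and the default distance cutoff are all hypothetical choices. the sequence match is taken here as the share of paired residues whose names agree, which is one reading of the description above.

```python
import math
from typing import List, Tuple

Atom = Tuple[str, Tuple[float, float, float]]  # (residue name, Cα coordinates in Å)


def match_metrics(
    target: List[Atom], model: List[Atom], cutoff: float = 3.0
) -> Tuple[float, float]:
    """Return (residue match %, sequence match %) of a model against a target.

    Each target Cα is paired with the closest still-unused model Cα that lies
    within `cutoff` Å (greedy pairing; residue names are ignored at this stage).
    Residue match  = paired target residues / all target residues.
    Sequence match = paired residues with identical names / paired residues.
    """
    if not target:
        raise ValueError("target must contain at least one residue")
    used = set()
    paired = 0
    same_name = 0
    for t_name, t_xyz in target:
        best_j, best_d = None, cutoff
        for j, (_, m_xyz) in enumerate(model):
            if j in used:
                continue
            d = math.dist(t_xyz, m_xyz)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            paired += 1
            if model[best_j][0] == t_name:
                same_name += 1
    residue_match = 100.0 * paired / len(target)
    sequence_match = 100.0 * same_name / paired if paired else 0.0
    return residue_match, sequence_match


# Toy example: four target residues; the model places three Cα atoms nearby
# but names only two of them correctly.
target = [("ALA", (0.0, 0.0, 0.0)), ("GLY", (3.8, 0.0, 0.0)),
          ("SER", (7.6, 0.0, 0.0)), ("LEU", (11.4, 0.0, 0.0))]
model = [("ALA", (0.5, 0.0, 0.0)), ("VAL", (4.0, 0.5, 0.0)),
         ("SER", (7.0, 0.0, 0.0))]
residue_pct, sequence_pct = match_metrics(target, model)
```

a model that traces the chain well but assigns wrong residue names scores high on the first number and low on the second, which is why the two values are reported separately.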
it can be seen from the figure that deepmm achieved a significantly better performance than phenix in both residue match and sequence match, especially for those maps at low resolutions. for the maps at resolutions better than . å, . % of protein residues in the deposited structures were reproduced by our deepmm method, compared to . % by phenix. the corresponding average sequence match is . % for our deepmm approach, which is much higher than . % for phenix. for the maps at - å resolution, the average residue match for deepmm is . %, compared with . % for phenix. the corresponding average sequence match is . % for deepmm, which is much higher than . % for phenix. given that the prediction of sequence match is much more challenging than that of residue match, the much better performance of deepmm than phenix in sequence match demonstrates the atomic accuracy of the models built by deepmm.

it is worth mentioning that deepmm can build fully connected, full-length all-atom protein models, whereas phenix is designed to build initial models of structure fragments. figure shows the protein models built by deepmm and phenix for one example, chain a of dw , part of a gabaa receptor at . å resolution. the deposited structure with its associated em density map (emd- ) is displayed in panel a. figures b and c show the phenix model and its superimposition with the deposited structure, respectively. it can be seen from the figures that the model built by phenix consists of multiple fragments without showing any secondary structures, as expected.
the model predicted by phenix for this map had a residue match of . %, but gave a very low sequence match of . %. therefore, although phenix recovered most parts of the target protein structure from the em density map, it assigned wrong residue names to most of the modeled fragments, as reflected by its low sequence match and shown in figure c. in contrast, deepmm built an excellent all-atom structure for this map, with a near-perfect residue match of . % and a high sequence match of . %. the model predicted by deepmm reproduced most of the secondary structures and had an almost identical chain trace to the deposited structure (figure d). the corresponding amino acid names were also assigned correctly by our deepmm approach (figure e).

conclusion

in summary, we have developed a semi-automatic de novo structure determination method for near-atomic resolution cryo-em maps using a deep learning-based framework, named deepmm. our deepmm approach can reconstruct complete all-atom protein structures for em maps with atomic accuracy. deepmm was extensively validated on diverse benchmarks and compared with state-of-the-art approaches including rosettaes, mainmast, and phenix. deepmm has also been evaluated on an emdb-wide large test set of experimental maps at . - . å resolution. overall, deepmm was able to reconstruct protein models with tm-score > . for over % of the test cases. deepmm is fast and able to reconstruct an all-atom structure from an em map within hr on a single-gpu machine for an average-length protein chain of amino acids. given the high computational efficiency and all-atom accuracy, it is anticipated that deepmm will serve as an indispensable tool for semi-automatic atomic-accuracy structure determination for near-atomic-resolution cryo-em maps.

acknowledgements

the authors acknowledge professor daisuke kihara and his students genki terashi and sai raghavendra maddhuri venkata subramaniya from purdue university for providing their datasets. this work was supported by the national natural science foundation of china (grant nos. and ) and the startup grant of huazhong university of science and technology.

competing interests

the authors declare no competing interests.

references

( ) nogales e. the development of cryo-em into a mainstream structural biology technique. nat methods. ; ( ): - .
( ) frank j. advances in the field of single-particle cryo-electron microscopy over the last decade. nat protoc. ; ( ): - .
( ) cheng y. single-particle cryo-em-how did it get here and where will it go. science. ; ( ): - .
( ) raunser s. cryo-em revolutionizes the structure determination of biomolecules. angew chem int ed engl. ; ( ): - .
( ) safdari ha, pandey s, shukla ak, dutta s. illuminating gpcr signaling by cryo-em. trends cell biol. ; ( ): - .
( ) luque d, castón jr. cryo-electron microscopy for the study of virus assembly. nat chem biol. ; ( ): - .
( ) li x, mooney p, zheng s, booth cr, braunfeld mb, gubbens s, agard da, cheng y. electron counting and beam-induced motion correction enable near-atomic-resolution single-particle cryo-em. nat methods. ; ( ): - .
( ) punjani a, rubinstein jl, fleet dj, brubaker ma.
cryosparc: algorithms for rapid unsupervised cryo-em structure determination. nat methods. ; ( ): - .
( ) scheres sh. relion: implementation of a bayesian approach to cryo-em structure determination. j struct biol. ; ( ): - .
( ) adams pd, afonine pv, bunkóczi g, chen vb, davis iw, echols n, headd jj, hung lw, kapral gj, grosse-kunstleve rw, mccoy aj, moriarty nw, oeffner r, read rj, richardson dc, richardson js, terwilliger tc, zwart ph. phenix: a comprehensive python-based system for macromolecular structure solution. acta crystallogr d biol crystallogr. ; (pt ): - .
( ) zhang b, zhang x, pearce r, shen hb, zhang y. a new protocol for atomic-level protein structure modeling and refinement using low-to-medium resolution cryo-em density maps. j mol biol. ; ( ): - .
( ) xie r, chen yx, cai jm, yang y, shen hb. spread: a fully automated toolkit for single-particle cryogenic electron microscopy data d reconstruction with image-network-aided orientation assignment. j chem inf model. ; ( ): - .
( ) yin s, zhang b, yang y, huang y, shen hb. clustering enhancement of noisy cryo-electron microscopy single-particle images with a network structural similarity metric. j chem inf model. ; ( ): - .
( ) yang yj, wang s, zhang b, shen hb. resolution measurement from a single reconstructed cryo-em density map with multiscale spectral analysis. j chem inf model. ; ( ): - .
( ) kim dn, gront d, sanbonmatsu ky. practical considerations for atomistic structure modeling with cryo-em maps. j chem inf model. ; ( ): - .
( ) joseph ap, lagerstedt i, jakobi a, burnley t, patwardhan a, topf m, winn m. comparing cryo-em reconstructions and validating atomic model fit using difference maps. j chem inf model. ; ( ): - .
( ) patwardhan a. trends in the electron microscopy data bank (emdb). acta crystallogr d struct biol. ; (pt ): - .
( ) berman hm, westbrook j, feng z, gilliland g, bhat tn, weissig h, shindyalov in, bourne pe. the protein data bank. nucleic acids res. ; ( ): - .
( ) alnabati e, kihara d. advances in structure modeling methods for cryo-electron microscopy maps. molecules. ; ( ): .
( ) lindert s, staritzbichler r, wötzel n, karakaş m, stewart pl, meiler j. em-fold: de novo folding of alpha-helical proteins guided by intermediate-resolution electron microscopy density maps. structure. ; ( ): - .
( ) baker ml, abeysinghe ss, schuh s, coleman ra, abrams a, marsh mp, hryc cf, ruths t, chiu w, ju t. modeling protein structure at near atomic resolutions with gorgon. j struct biol. ; ( ): - .
( ) wang ry, kudryashev m, li x, egelman eh, basler m, cheng y, baker d, dimaio f. de novo protein structure determination from near-atomic-resolution cryo-em maps. nat methods. ; ( ): - .
( ) frenz b, walls ac, egelman eh, veesler d, dimaio f. rosettaes: a sampling strategy enabling automated interpretation of difficult cryo-em maps. nat methods. ; ( ): - .
( ) baker mr, rees i, ludtke sj, chiu w, baker ml. constructing and validating initial cα models from subnanometer resolution density maps with pathwalking. structure. ; ( ): - .
( ) chen m, baldwin pr, ludtke sj, baker ml. de novo modeling in cryo-em density maps with pathwalking. j struct biol. ; ( ): - .
( ) chen m, baker ml. automation and assessment of de novo modeling with pathwalking in near atomic resolution cryoem density maps. j struct biol. ; ( ): - .
( ) terwilliger tc, adams pd, afonine pv, sobolev ov. a fully automatic method yielding initial models from high-resolution cryo-electron microscopy maps. nat methods. ; ( ): - .
( ) terwilliger tc, adams pd, afonine pv, sobolev ov.
cryo-em map interpretation and protein model-building using iterative map segmentation. protein sci. ; ( ): - .
( ) afonine pv, poon bk, read rj, sobolev ov, terwilliger tc, urzhumtsev a, adams pd. real-space refinement in phenix for cryo-em and crystallography. acta crystallogr d struct biol. ; (pt ): - .
( ) terashi g, kihara d. de novo main-chain modeling for em maps using mainmast. nat commun. ; ( ): .
( ) terashi g, kagaya y, kihara d. mainmastseg: automated map segmentation method for cryo-em density maps with symmetry. j chem inf model. ; ( ): - .
( ) tegunov d, cramer p. real-time cryo-electron microscopy data preprocessing with warp. nat methods. ; ( ): - .
( ) chen m, dai w, sun sy, jonasch d, he cy, schmid mf, chiu w, ludtke sj. convolutional neural networks for automated annotation of cellular cryo-electron tomograms. nat methods. ; ( ): - .
( ) maddhuri venkata subramaniya sr, terashi g, kihara d. protein secondary structure detection in intermediate-resolution cryo-em maps using deep learning. nat methods. ; ( ): - .
( ) si d, moritz sa, pfab j, hou j, cao r, wang l, wu t, cheng j. deep learning to predict protein backbone structure from high-resolution cryo-em density maps. sci rep. ; ( ): .
( ) huang g, liu z, van der maaten l, weinberger kq. densely connected convolutional networks. ieee conference on computer vision and pattern recognition (cvpr), honolulu, hi, , - .
( ) smith tf, waterman ms. identification of common molecular subsequences. j mol biol. ; ( ): - .
( ) xiang z, honig b. extending the accuracy limits of prediction for side-chain conformations.
j mol biol. ; ( ): - .
( ) petrey d, xiang z, tang cl, xie l, gimpelev m, mitros t, soto cs, goldsmith-fischman s, kernytsky a, schlessinger a, koh iy, alexov e, honig b. using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. proteins. ; suppl : - .
( ) case da, cheatham te rd, darden t, gohlke h, luo r, merz km jr, onufriev a, simmerling c, wang b, woods rj. the amber biomolecular simulation programs. j comput chem. ; ( ): - .
( ) ruder s. an overview of multi-task learning in deep neural networks. arxiv preprint. jun ;arxiv: . .
( ) heinig m, frishman d. stride: a web server for secondary structure assignment from known atomic coordinates of proteins. nucleic acids res. ; (web server issue):w - .
( ) ho cm, li x, lai m, terwilliger tc, beck jr, wohlschlegel j, goldberg de, fitzpatrick awp, zhou zh. bottom-up structural proteomics: cryoem of protein complexes enriched from the cellular milieu. nat methods. ; ( ): - .
( ) wen z, he j, huang sy. topology-independent and global protein structure alignment through an fft-based algorithm. bioinformatics. ; ( ): - .
( ) heffernan r, dehzangi a, lyons j, paliwal k, sharma a, wang j, sattar a, zhou y, yang y. highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins. bioinformatics. ; ( ): - .
( ) fox nk, brenner se, chandonia jm. scope: structural classification of proteins–extended, integrating scop and astral data and classification of new structures. nucleic acids res. ; (database issue):d - .
( ) zhang y, skolnick j. tm-align: a protein structure alignment algorithm based on the tm-score. nucleic acids res. ; ( ): - .
( ) tang g, peng l, baldwin pr, mann ds, jiang w, rees i, ludtke sj. eman : an extensible image processing suite for electron microscopy. j struct biol. ; ( ): - .
( ) fu l, niu b, zhu z, wu s, li w. cd-hit: accelerated for clustering the next-generation sequencing data. bioinformatics.
; ( ): - .
( ) pettersen ef, goddard td, huang cc, couch gs, greenblatt dm, meng ec, ferrin te. ucsf chimera–a visualization system for exploratory research and analysis. j comput chem. ; ( ): - .
( ) nguyen mn, tan kp, madhusudhan ms. click–topology-independent comparison of biomolecular d structures. nucleic acids res. ; (web server issue):w - .
( ) pintilie g, zhang k, su z, li s, schmid mf, chiu w. measurement of atom resolvability in cryo-em maps with q-scores. nat methods. ; ( ): - .

[figure , panel a (flowchart): preprocess cryo-em map; cut map into voxels; predict main-chain and cα probability of each voxel; main-chain tracing; predict amino acid type and secondary structure of main-chain points; align protein sequence to cα main-chain path; construct all-atom protein model. panel b (densenet): input voxel; shared blocks (shared layers); task a and task b blocks (specific layers); prediction for task a and prediction for task b.]

figure : workflow of our deepmm method. (a) the flowchart of deepmm. deepmm first predicts the main-chain and cα probability of each density voxel using a densely connected convolutional network (densenet), and then traces the protein's main-chain path on the predicted main-chain probability map. next, the amino acid and secondary structure types for each main chain point are predicted by a second densenet. the cα models are generated by aligning the target sequence to the main-chain paths.
finally, the all-atom structures are constructed from the cα models using the ctrip program and refined by an amber energy minimization. (b) the multi-task deep densenet architecture used in deepmm. starting from an input em density voxel, two dense blocks are shared by both tasks in densenet a, while only one dense block is shared by both tasks in densenet b. each prediction task employs two task-specific dense blocks and gives the final prediction.

[figure , panel a: target sequence with its predicted secondary structure, aligned to a main-chain path whose points are labeled by secondary structure (helix, sheet, coil) and amino acid class (i-iv). panel b: scoring matrices for amino acid classes (aa: i, ii, iii, iv) and secondary structure (ss: helix, sheet, coil). panel c: alignment result, cα models listed with their scores. panel d: the twenty amino acids (gly, ala, ser, cys, val, thr, ile, pro, leu, asp, asn, glu, gln, met, lys, arg, his, phe, tyr, trp) grouped into classes i-iv.]

figure : alignment protocol between the target sequence and the predicted main-chain for deepmm. (a) deepmm runs alignments of the target sequence of the em map against each candidate main-chain path. each sphere represents a predicted local dense point (ldp) on the main-chain path. predicted information including the cα probability (on the top), secondary structure (in the middle) and amino acid class (at the bottom) of ldps is utilized during alignment.
for the target sequence, its secondary structure is predicted by the spider program, as illustrated in the sequence colored in azure under the amino acid sequence. (b) scoring matrices for amino acid type matching and secondary structure matching. (c) the generated cα models are ranked by their alignment score. (d) twenty amino acids are grouped into four classes according to the similarity of their side-chain em densities.

[figure : (a) cα rmsd (å), (b) click rmsd (å) and (c) structure overlap (%) plotted against protein length (aa) for deepmm and mainmast.]

figure : comparison of the results by deepmm and mainmast for the protein chains with different lengths. (a) the cα rmsds of the top predicted models. (b) the rmsds of matched cα atoms within . å by the structure alignment tool click. (c) the structure overlap calculated by click, which is defined as the fraction of matched cα atoms.

[figure : (a) mainmast rmsd (å) versus deepmm rmsd (å); (b) rosetta rmsd (å) versus deepmm rmsd (å).]

figure : comparison of the top models for deepmm and two other approaches on the test set of experimental maps.
the solid line in the figure is the plot of y = x, and the dashed line stands for y = . (a) comparison of the models by deepmm and mainmast in terms of cα rmsd. (b) comparison of the models by deepmm and rosetta in terms of cα rmsd.

figure : examples of the models generated by deepmm for experimental em maps. the em density map (transparent grey) and its associated native protein structure (green) are displayed on the left side. the cα chains of the deepmm model (red) and the native structure (green) are shown in ball-and-stick format on the predicted main-chain probability map (transparent yellow) in the middle. the full-atom structure generated by deepmm (red) and the native protein structure (green) are displayed on the right side. (a) the nucleoprotein at . å map resolution (emd- ). the top ranked model by deepmm has a cα rmsd of . å. (b) the bovine rotavirus vp at . å map resolution (emd- ). the top model by deepmm has a cα rmsd of . å.

[figure : (a) percentage (%) versus rmsd (å); (b) percentage (%) versus tm-score; (c) % of rmsd < å versus resolution (å); (d) % of tm-score > . versus resolution (å); each panel shows the top and top predictions.]

figure : test results of deepmm on the experimental test cases. (a) the percentage of the top scored models at different cα rmsd cutoffs. (b) the percentage of the top scored models at different tm-score cutoffs. (c) the percentages of top scored models within å rmsd in different map resolution ranges. (d) the percentages of the top scored models with a tm-score above . in different map resolution ranges.

[figure : (a) residue match (%) and (b) sequence match (%) versus resolution (å); (c) average residue match (%) and (d) average sequence match (%) versus resolution (å); deepmm and phenix in each panel.]

figure : comparison of the models by deepmm and phenix on the large test set of experimental maps at different resolutions. the results for phenix are colored in orange, and those for deepmm are colored in royal blue. (a) percentages of the protein residues in the deposited structures reproduced by deepmm and phenix. (b) percentages of the sequence of the deposited structure reproduced by deepmm and phenix. (c) average percentage of residue match by deepmm and phenix. (d) average percentage of sequence match by deepmm and phenix.
[figure : panels a-e, phenix and deepmm models.]

figure : protein models reconstructed by deepmm and phenix for chain a of dw and its associated em density map at . å resolution (emd- ). (a) the native structure overlapped with its associated em density map. (b) the model predicted by phenix, which has a residue match of . % and a sequence match of . %. (c) the phenix model (orange) overlapped with the native structure (green). the enlarged box on the right side shows that the residue names assigned by the phenix model are different from those of the native structure. (d) the model predicted by deepmm, which has a residue match of . % and a sequence match of . %. (e) the deepmm model (royal blue) overlapped with the native structure (green). the enlarged view of the top region of the protein on the right side shows that the sequence assigned by deepmm is close to that of the native structure.

introduction
methods
workflow of deepmm
training the densenets of deepmm
tracing the main-chain path
aligning target sequence to main-chain path
parameter settings of deepmm
datasets used
training sets
test sets
results
model reconstruction for simulated em maps
model reconstruction for experimental em maps
evaluation of deepmm on the emdb-wide data set
conclusion

regtools: integrated analysis of genomic and transcriptomic data for the discovery of splicing variants in cancer

kelsy c.
cotto , ,†, yang-yang feng ,†, avinash ramu , zachary l. skidmore , , jason kunisaki , megan richters , , sharon freshour , , yiing lin , william c. chapman , ravindra uppaluri , , ramaswamy govindan , , obi l. griffith , , , *, malachi griffith , , , *

† denotes co-first authors. * denotes corresponding authors. correspondence to obi l. griffith (obigriffith@wustl.edu) and malachi griffith (mgriffit@wustl.edu).

affiliations:
. division of oncology, department of medicine, washington university school of medicine, st. louis, mo, usa
. mcdonnell genome institute, washington university school of medicine, st. louis, mo, usa
. department of genetics, washington university school of medicine, st. louis, mo, usa
. department of surgery, washington university school of medicine, st. louis, mo, usa
. department of surgery, brigham and women's hospital, boston, ma, usa
. department of medical oncology, dana-farber cancer institute, boston, ma, usa
. siteman cancer center, washington university school of medicine, st. louis, mo, usa

abstract

somatic mutations in non-coding regions and even in exons may have unidentified regulatory consequences which are often overlooked in analysis workflows. here we present regtools (www.regtools.org), a free, open-source software package designed to integrate analysis of somatic variants from genomic data with splice junctions from transcriptomic data to identify variants that may cause aberrant splicing. regtools was applied to over , tumor samples with both tumor dna and rna sequence data. we discovered , events where a variant significantly increased the splicing of a particular junction, across , unique variants and , unique junctions. to characterize these somatic variants and their associated splice isoforms, we annotated them with the variant effect predictor (vep), spliceai, and genotype-tissue expression (gtex) junction counts and compared our results to other tools that integrate genomic and transcriptomic data.
while certain events can be identified by the aforementioned tools, the unbiased nature of regtools has allowed us to identify novel splice variants and previously unreported patterns of splicing disruption in known cancer drivers, such as tp , cdkn a, and b m, as well as in genes not previously considered cancer-relevant, such as rnf .

biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

introduction

alternative splicing of messenger rna allows a single gene to encode multiple gene products, increasing a cell’s functional diversity and regulatory precision. however, splicing malfunction can lead to imbalances in transcriptional output or even the presence of novel oncogenic transcripts. the interpretation of variants in cancer is frequently focused on direct protein-coding alterations. however, most somatic mutations arise in intronic and intergenic regions, and exonic mutations may also have unidentified regulatory consequences. for example, mutations can affect splicing either in trans, by acting on splicing effectors, or in cis, by altering the splicing signals located on the transcripts themselves. increasingly, we are identifying the importance of splice variants in disease processes, including in cancer. however, our understanding of the landscape of these variants is currently limited, and few tools exist for their discovery. one approach to elucidating the role of splice variants has been to predict the strength of putative splice sites in pre-mrna from genomic sequences, such as the method used by the spliceai tool.
with the advent of efficient and affordable rna-seq, we are also seeing the complementary approach of evaluating alternative splicing events (ases) directly from rna sequencing data. various tools exist which allow the identification of significant ases from transcript-level data within sample cohorts, including suppa and spladder , . many of these tools have also evaluated the role of trans-acting splice mutations . however, few tools are directed at linking specific aberrant rna splicing events to specific genomic variants in cis to investigate the splice regulatory impact of these variants. those few relevant tools that do exist have significant limitations that preclude them from broad applications. the sqtl-based approach taken by leafcutter and other tools is designed for relatively frequent single-nucleotide polymorphisms. it is thus ill-suited to studying somatic variants, or any case in which the frequency of a particular variant is very low (often unique) in a given sample population – . recent tools that have been created for large-scale analysis of cancer-specific data, such as misplice and veridical, ignore certain types of ases, are tailored to specific analysis strategies and sets of hypotheses, or are otherwise inaccessible to the end-user due to issues such as lack of documentation, difficulty with installation and integration with existing pipelines, limited computing efficiency, or licensing issues – . to address these needs, we have developed regtools, a free, open-source (mit license) software package that is well-documented, modularized for ease of use, and designed to efficiently identify potential cis-acting splice-relevant variants in tumors (www.regtools.org). regtools is a suite of tools designed to aid users in a broad range of splicing-related analyses. 
at the highest level, it contains three sub-modules: a variants module to annotate variant calls with respect to their potential splicing relevance, a junctions module to analyze aligned rna-seq data and associated splicing events, and a cis-splice-effects module that integrates genomic variant calls and transcriptomic sequencing data to identify potential splice-altering variants. each sub-module contains one or more commands, which can be used individually or integrated into regulatory variant analysis pipelines. to demonstrate the utility of regtools in identifying potential splice-relevant variants from tumor data, we analyzed a combination of data available from the mcdonnell genome institute (mgi) at washington university school of medicine and the cancer genome atlas (tcga) project. in total, we applied regtools to , samples across cancer types. we contrasted our results with other tools that integrate genomic and transcriptomic data to identify potential splice altering variants, specifically veridical, misplice, and savnet. novel junctions identified by regtools were compared to data from the genotype-tissue expression (gtex) project to assess whether these junctions are present in normal tissues. variants significantly associated with novel junctions were processed through vep and illumina’s spliceai tool to compare our findings with splicing consequences predicted based on the variant information alone.
with this additional analysis, we were able to more easily identify both variants in known cancer drivers, whose splicing consequences have not been previously reported in the literature, and potentially novel cancer drivers, whose disruption relies on splice-altering mutations results the regtools tool suite supports splice regulatory variant discovery by the integration of genome and transcriptome data. regtools is a suite of tools designed to aid users in a broad range of splicing-related analyses. the variants module contains the annotate command. the variants annotate command takes a vcf of somatic variant calls and a gtf of transcriptome annotations as input. regtools does not have any particular preference for variant callers or reference annotations. each variant is annotated by regtools with known overlapping genes and transcripts, and is categorized into one of several user-configurable “variant types”, based on position relative to the edges of known exons. the variant type annotation depends on the stringency for splicing-relevance that the user sets with the “splice variant window” setting. by default, regtools marks intronic variants within bp of the exon edge as “splicing intronic”, exonic variants within bp as “splicing exonic”, other intronic variants as “intronic”, and other exonic variants simply as “exonic.” regtools considers only “splicing intronic” and “splicing exonic” as important. to allow for discovery of an arbitrarily expansive set of variants, regtools allows the user to customize the size of the exonic/intronic windows individually (e.g. -i -e for intronic variants bp from an exon edge and exonic variants bp from an exon edge) or even consider all exonic/intronic variants as potentially splicing-relevant (e.g. -e or -i) (figure a). the junctions module contains the extract and annotate commands. 
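The window-based variant typing described above can be sketched in a few lines of Python. This is a minimal illustration and not the regtools implementation: the `Exon` class, the `classify_variant` function, the 1-based inclusive coordinates, and the window sizes in the usage lines are all assumptions made for the example (they are not taken from the text), and every exon edge is treated as a splice site for simplicity.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon record with 1-based inclusive coordinates."""
    start: int
    end: int


def classify_variant(pos: int, exons: Iterable[Exon],
                     intronic_window: int, exonic_window: int) -> str:
    """Assign one of the four variant types to a position.

    Simplifications (assumptions of this sketch): a single transcript,
    at least one exon, and any position outside an exon counts as intronic.
    """
    exons = list(exons)
    for exon in exons:
        if exon.start <= pos <= exon.end:
            # Count the edge base itself as 1 base from the edge.
            bases_from_edge = min(pos - exon.start, exon.end - pos) + 1
            if bases_from_edge <= exonic_window:
                return "splicing_exonic"
            return "exonic"
    # Not inside any exon: distance to the nearest exon edge.
    distance = min(
        exon.start - pos if pos < exon.start else pos - exon.end
        for exon in exons
    )
    if distance <= intronic_window:
        return "splicing_intronic"
    return "intronic"


# Toy transcript and illustrative window sizes (hypothetical values).
exons = [Exon(100, 200), Exon(300, 400)]
for position in (150, 199, 201, 250):
    print(position, classify_variant(position, exons,
                                     intronic_window=2, exonic_window=3))
```

Widening either window in this sketch only moves positions from the plain "intronic"/"exonic" labels into the corresponding "splicing" label, which mirrors how a larger splice variant window enlarges the set of candidate variants.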
the junctions extract command takes an alignment file containing aligned rna-seq reads, infers the exon-exon boundaries based on the cigar strings, and outputs each “junction” as a feature in bed format. the junctions annotate command takes a file of junctions in bed format (such as the one output by junctions extract), a fasta file containing the reference genome, and a gtf file containing reference transcriptome annotations and generates a tsv file, annotating each junction with: the number of acceptor sites, donor sites, and exons skipped, and the identities of known overlapping transcripts and genes. we also annotate the “junction type”, which denotes if and how the junction is novel (i.e. different compared to provided transcript annotations). if the donor is known, but the acceptor is not or vice-versa, it is marked as “d” or “a”, respectively. if both are known, but the pairing is not known, it is marked as “nda”, whereas if both are unknown, it is marked as “n”. if the junction is not novel (i.e. it appears in at least one transcript in the supplied gtf), it is marked as “da” (figure b). the cis-splice-effects module contains the identify command, which identifies potential splice-altering variants from sequencing data. the following are required as input: a vcf file containing variant calls, an alignment file containing aligned rna-sequencing reads, a reference genome fasta file, and a reference transcriptome gtf file.
the identify pipeline internally relies on variants annotate, junctions extract, and junctions annotate to output a tsv containing junctions proximal to putatively splicing-relevant variants. the identify pipeline can be customized using the same parameters as in the individual commands. briefly, cis-splice-effects identify first performs variants annotate to determine the splicing-relevance of each variant in the input vcf. for each variant, a “splice junction region” is determined by finding the largest span of sequence space between the exons that flank the exon associated with the variant. from here, junctions extract identifies splicing junctions present in the rna-seq bam. next, junctions annotate labels each extracted junction with information from the reference transcriptome as described above and its associated variants based on splice junction region overlap (figure c). for our analysis, we annotated the pairs of associated variants and junctions identified by regtools, which we refer to as “events”, with additional information such as whether this association was identified by a comparable tool, the junction was found in gtex, and whether the event occurred in a cancer gene according to cancer gene census (cgc) (figure c). finally, we created igv sessions for each event identified by regtools that contained a bed file with the junction, a vcf file with the variant, and an alignment (bam) file for each sample that contained the variant. these igv sessions were used to manually review candidate events to assess whether the association between the variant and junction makes sense in a biological context. regtools is designed for broad applicability and computational efficiency. by relying on well-established standards for sequence alignments, annotation files, and variant calls and by remaining agnostic to downstream statistical methods and comparisons, our tool can be applied to a broad set of scientific queries and datasets.
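The junction type labels (DA, NDA, D, A, N) follow a simple decision rule, which the sketch below spells out. It is an illustration under stated assumptions rather than the regtools code: `junction_type` is a hypothetical function name, junctions are reduced to bare `(donor, acceptor)` coordinate pairs, and chromosome and strand handling are omitted.

```python
from typing import Set, Tuple

Junction = Tuple[int, int]  # (donor position, acceptor position)


def junction_type(donor: int, acceptor: int,
                  known_junctions: Set[Junction]) -> str:
    """Label a junction relative to a set of annotated junctions."""
    known_donors = {d for d, _ in known_junctions}
    known_acceptors = {a for _, a in known_junctions}

    if (donor, acceptor) in known_junctions:
        return "DA"   # junction itself is annotated, so not novel
    donor_known = donor in known_donors
    acceptor_known = acceptor in known_acceptors
    if donor_known and acceptor_known:
        return "NDA"  # both ends known, but never paired in the annotation
    if donor_known:
        return "D"    # known donor, novel acceptor
    if acceptor_known:
        return "A"    # novel donor, known acceptor
    return "N"        # neither end is annotated


# Toy annotation with two known junctions (hypothetical coordinates).
known = {(100, 200), (300, 400)}
for d, a in [(100, 200), (100, 400), (100, 250), (150, 400), (150, 250)]:
    print((d, a), junction_type(d, a, known))
```

Checking for the exact pair first matters: a junction whose ends are both known is only "NDA" when that specific pairing is absent from the annotation.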
moreover, performance tests show that cis-splice-effects identify can process a typical candidate variant list of , , variants and a corresponding rna-seq bam file of , , reads in just ~ minutes (supplementary figure ).

pan-cancer analysis of tumor types identifies somatic variants that alter canonical splicing

regtools was applied to , samples over cancer types. of these cohorts came from tcga while the remaining three were obtained from other projects being conducted at mgi. cohort sizes ranged from to , samples. in total, , , variants (figure a) and , , , junction observations (figure b) were analyzed by regtools. by comparing the number of initial variants per cohort to the number of statistically significant variants, we were able to show that regtools produces a prioritized list of potential splice relevant variants (supplementary figure ). additionally, when analyzing the junctions within each sample, we found that junctions present in the reference transcriptome are frequently seen within gtex data while junctions observed from a sample’s own transcriptome data that were not present in the reference are rarely seen within gtex (supplementary figure ). , significant variant junction pairings were found for junctions that use a known donor and novel acceptor (d), novel donor and known acceptor (a), or novel combination of a known donor and a known acceptor (nda), with novel here meaning that the junction was not found in the reference transcriptome (methods, figure c, supplemental files and ).
while our analysis primarily focuses on variants in relation to novel splice events because of the potential importance of these events within tumor processes, we also wanted to assess how often a variant was significantly associated with a known junction. , variant junction pairings were found for junctions known to the reference (da junctions) (supplemental files and ). this finding indicates that while splice variants usually result in a novel junction occurring, they sometimes alter the expression of known junctions. generally, significant events were evenly split among each of the novel junction types considered (d, a, and nda). the number of significant events increased as the splice variant window size increased, with both the e and i results being comparable in number. notably, hepatocellular carcinoma (hcc) was the only cohort that had whole genome sequencing (wgs) data available and, as expected, it exhibited a marked increase in the number of significant events for its results within the “i” splice variant window. this observation highlights the low sequence coverage of intronic regions that occurs with wes which subsequently leads to underpowered discovery of potential splice altering variants within introns. variants were analyzed across tumor types for how often they result in either a single or multiple novel junctions (figure a). while a single variant resulting in a single novel junction is most commonly observed ( . - . %), a single variant also commonly results in multiple junctions being created, either of the same type ( . - . %) or of different types ( . - . %) (figure b). variants that are associated with multiple novel junctions of different types were further investigated to identify how often a particular junction type occurred with another (figure c). most commonly, we observed an alternate donor or acceptor site being used in conjunction with an exon skipping event. 
these events were particularly common within the default window ( intronic bases or exonic bases from the exon edge), as an snv or indel within these positions has a high probability of disrupting the natural splice site, thus causing the splicing machinery to use a cryptic splice site nearby or skip the splice site entirely. the next most common event was an alternate donor site and an alternate acceptor site both being used as the result of a single variant. the combination of a novel acceptor site and novel donor site being used in conjunction with an exon-skipping event occurred the least, and the occurrence of this type of event remains fairly low, even as the search space increases within the larger splice variant windows. this finding indicates the low likelihood of a single variant resulting in simultaneous disruption of a splice acceptor and donor as well as complete skipping of an exon. overall, this analysis highlights that there is evidence that a single variant can lead to multiple novel junctions being expressed. tools that only allow for a single junction to be predicted or associated with a variant therefore may not be completely describing the effect of the variant in question in up to ~ % of cases.

regtools identifies splice altering variants missed by other splice variant predictors and annotators

to evaluate the performance of regtools, we compared our results to those of savnet, misplice, veridical, vep, and spliceai. these tools vary in their inputs and methodology for identifying splice altering variants (figure a).
both vep and spliceai only consider information about the variant and its genomic sequence context and do not consider information from a sample’s transcriptome. a variant is considered to be splice relevant according to vep if it occurs within - bases on the exonic side or - bases on the intronic side of a splice site. spliceai does not have restrictions on where the variant can occur in relation to the splice site but by default, it predicts one new donor and acceptor site within bp of the variant, based on reference transcript sequences from gencode. like regtools, savnet, misplice, and veridical integrate genomic and transcriptomic data in order to identify splice altering variants. misplice only considers junctions that occur within bp of the variant. additionally, savnet, misplice, and veridical filter out any transcripts found within the reference transcriptome. savnet, misplice, and veridical employ different statistical methods for the identification of splice altering variants. in contrast to regtools, none of the mentioned tools allow the user to set a custom window in which they wish to focus splice altering variant discovery (e.g. around the splice site, all exonic variants, etc.). these tools have different levels of code availability. misplice is available via github as a collection of perl scripts that are built to run via load sharing facility (lsf) job scheduling. to run misplice without an lsf cluster, the authors mention code changes are required. veridical is available via a subscription through cytognomix’s mutationforecaster. similar to regtools, savnet is available via github or through a docker image. however, savnet relies on splicing junction files generated by star whereas regtools can use rna-seq alignment files from hisat , tophat , or star, thus allowing it to be integrated into bioinformatic workflows more easily. 
in their recent publications, savnet, misplice, and veridical also analyzed data from tcga, with only minor differences in the number of samples included for each study. vep and spliceai results were obtained by running each tool on all starting variants for the cohorts included in this study. in order to efficiently compare this data, an upset plot (figure b) was created. only variants are identified as splice altering by all six tools. comparatively, misplice and savnet find few splice altering variants, potentially indicating that these tools are overlooking the complete set of variants that have an effect on splicing. in contrast, veridical identifies by far the most splice altering variants across all tools, with . percent of its calls being found by it alone. spliceai and vep called a large number of variants, either alone or in agreement, that none of the tools that integrate transcriptomic data from samples identify. this highlights a limitation of using tools that only focus on genomic data, particularly in a disease context where transcripts are unlikely to have been annotated before. regtools addresses these shortcomings by identifying what pieces of information to extract from a sample’s genome and transcriptome in a very basic, unbiased way that allows for generalization. other tools either only analyze genomic data, focus on junctions where either the canonical donor or acceptor site is affected (missing junctions that result from complete exon skipping), or consider only those variants within a very narrow distance from known splice sites.
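The comparison behind an UpSet plot reduces to counting, for every combination of tools, the variants called by exactly that combination. The following sketch shows that bookkeeping on toy data. The function name `exclusive_intersections` and the variant identifiers are hypothetical, and the counts bear no relation to the results reported here.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set


def exclusive_intersections(calls: Dict[str, Set[str]]) -> Counter:
    """Count variants by the exact set of tools that called them.

    `calls` maps a tool name to the set of variant identifiers it flagged.
    Each variant contributes to exactly one combination, which is the
    quantity drawn as one bar of an UpSet plot.
    """
    membership: Dict[str, Set[str]] = {}
    for tool, variants in calls.items():
        for variant in variants:
            membership.setdefault(variant, set()).add(tool)
    return Counter(frozenset(tools) for tools in membership.values())


# Toy call sets (made-up identifiers, for illustration only).
calls = {
    "regtools": {"v1", "v2", "v3"},
    "vep": {"v2", "v3", "v4"},
    "spliceai": {"v3", "v4", "v5"},
}
for tools, n in exclusive_intersections(calls).items():
    print(sorted(tools), n)
```

Counting exclusive combinations, rather than pairwise overlaps, is what makes statements such as "called by one tool alone" or "identified by all tools" directly readable from the result.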
regtools can include any kind of junction type, including exon-exon junctions that have ends that are not known donor/acceptor sites according to the gtf file (n junction according to regtools), any distance size to make variant-junction associations, and any window size in which to consider variants. due to these advantages, regtools identified events missed by one or multiple of the tools to which we compared (figure b; supplementary figures and ).

pan-cancer analysis reveals novel splicing patterns within known cancer genes and potential cancer drivers

while efforts have been made to associate variants with specific cancer types, there has been little focus on identifying such associations in splice-altering variants, even those in known cancer genes. tp is a rare example whose splice-altering variants are well characterized in numerous cancer types. as such, we further analyzed significant events to identify genes that had recurrent splice altering variants. within each cohort, we looked for recurrent genes using two separate metrics: a binomial test p-value and the fraction of samples (see methods). for ranking and selecting the most recurrent genes, each metric was computed by pooling across all cohorts. for assessing cancer-type specificity, each metric was then also computed using only results from a given cancer cohort. since the mechanisms underlying the creation of novel junctions versus the disruption of existing splicing patterns may be different, analysis was performed separately for d/a/nda junctions (figure , supplementary figure , supplementary file ) and da junctions (supplementary figure , supplementary file ), which allowed multiple test correction in accordance with the noise of the respective data. we identified , genes in which there was at least one variant predicted to influence the splicing of a d/a/nda junction.
the th percentile of these genes, when ranked by either metric, are significantly enriched for known cancer genes, as annotated by the cgc (p= . e- , ranked by binomial p-values, p= . e- , ranked by fraction of samples; hypergeometric test). we also identified , genes in which there was at least one variant predicted to influence the splicing of a da (known) junction. the th percentile of these genes, when ranked by either metric, are also significantly enriched for known cancer genes, as annotated by the cancer gene census (p= . e- , ranked by binomial p-values, p= . e- , ranked by fraction of samples; hypergeometric test). we also performed the same analyses using either the tcga or mgi cohorts alone. the tcga-only analyses gave very similar results to the combined analyses, with the th percentile of genes found in the d/a/nda and da analyses again being enriched for cancer genes (supplementary figures and ; supplemental files and ). due to small cohort sizes, in the mgi-only analyses, we identified only and genes in the d/a/nda and da analyses, respectively. the th percentile of genes from these analyses, respectively, were not significantly enriched for cancer genes (supplementary figures and ; supplemental files and ). when analyzing d, a, and nda junctions, we saw an enrichment for known tumor suppressor genes among the most splice disrupted genes, including several examples where splice disruption is a known mechanism such as tp , pten, cdkn a, and rb .
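The enrichment p-values quoted above come from a hypergeometric test. The sketch below shows one common way to compute an upper-tail hypergeometric p-value using only the Python standard library. How the authors parameterized their test is not stated in this section, so the mapping of arguments (all genes with an event, cancer genes among them, size of the top-ranked set, cancer genes within that set) is an assumption, and the numbers in the usage lines are toy values.

```python
from math import comb


def hypergeom_upper_tail(k: int, population: int,
                         successes: int, draws: int) -> float:
    """P(X >= k) for a hypergeometric random variable X.

    population: total number of genes considered
    successes:  number of those genes that are known cancer genes
    draws:      number of genes in the top-ranked set
    k:          number of known cancer genes observed in that set
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Toy example: 10 genes, 4 of them cancer genes, top set of 3 genes,
# 2 of which are cancer genes.
p_value = hypergeom_upper_tail(k=2, population=10, successes=4, draws=3)
print(p_value)
```

The upper tail is used because enrichment asks how likely it is to see at least the observed number of cancer genes in the top-ranked set by chance.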
specifically, in the case of tp , we identified variants that were significantly associated with at least one novel splicing event. one such example is the intronic snv (grch , chr :g. c>a) that was identified in an oscc sample and was associated with an exon skipping event and an alternate acceptor site usage event, with and reads of support, respectively (supplemental figure ). the cancer types in which we find splice disruption of tp and other known cancer genes is in concordance with associations between genes and cancer types described by cgc and chasmplus , . our analysis’s recovery of known drivers, many of which with known susceptibilities to splicing dysregulation in cancer, indicates the ability of our method to identify true splicing effects that are likely cancer-relevant. another cancer gene that we found to have a recurrence of splicing altering variants was b m. specifically, we identified six samples with intronic variants on either side of exon (figure ). while mutations have been identified and studied within exon , we did not find literature that specifically identified intronic variants near exon as a mechanism for disrupting b m . these mutations were identified by vep to be either splice acceptor variant or a splice donor variant and were also identified by veridical. misplice was able to predict one of the novel junctions for each variant but failed to predict additional novel junctions due to the limitation of that tool to only predict one novel acceptor and donor site per variant. notably, out of the samples that these variants were found in are msi-h (microsatellite instability-high) tumors . mutations in b m, particularly within colorectal msi-h tumors, have been identified as a method for tumors to become incapable of hla class i antigen-mediated presentation . furthermore, in a study of patients treated with immune checkpoint blockade (icb) therapy, defects to b m were observed in . % of patients with progressing disease . 
in the same study, b m mutations were exclusively seen in pretreatment samples from patients who did not respond to icb or in post-progression samples after initial response to icb. there are several genes that are responsible for the processing, loading, and presentation of antigens, and have been shown to be mutated in cancers. however, no proteins can be substituted for b m in hla class i presentation, thus making the loss of b m a particularly robust method for icb resistance. we also observe exonic variants and variants further in intronic regions that disrupt canonical splicing of b m. these findings indicate that intronic variants that result in alternative splice products within b m may be a mechanism for immune escape within tumor samples. we also identify recurrent splice altering variants in genes not known to be cancer genes (according to cgc), such as rnf . regtools identified a recurrent single base pair deletion that results in an exon skipping event of exon (supplementary figure ). this gene is a paralog of rnf , which has been found to be mutated in several cancer types. this variant junction association was found in stad, ucec, coad, and esca tumors, all of which are considered to be msi-h tumors. after analyzing the effect of the exon skipping event on the mrna sequence, we concluded that the reading frame remains intact, possibly leading to a gain of function event. additionally, the skipping of exon leads to the removal of a transmembrane domain and a phosphorylation site, s , which could be important for the regulation of this gene. based on these findings, rnf may play a role similar to rnf and may be an important driver event in certain tumor samples.

while most of our analysis focused on splice altering variants that resulted in d, a, and nda junctions, we also wanted to investigate variants that shifted the usage of known donor and acceptor sites. through this analysis, we identified cdkn a, a tumor suppressor gene that is frequently mutated in numerous cancers, to have several variants that led to alternate donor usage (supplementary figure ). when these variants are present, an alternate known donor site is used that leads to the formation of the transcript enst . instead of enst . , the transcript that encodes for p ink a, a known tumor suppressor. the transcript that results from use of this alternate donor site is missing the last twenty-eight amino acids that form the c-terminal end of p ink a. notably, this removes two phosphorylation sites within the p protein, s and s , which when phosphorylated promote the association of p ink a with cdk . this finding highlights the importance of including known transcripts in alternative splicing analyses as variants may alter splice site usage in a way that results in a known but pathogenic transcript product.

discussion

splice associated variants are often overlooked in traditional genomic analysis. to address this limitation, we created regtools, a software suite for the analysis of variants and junctions in a splicing context. by relying on well-established standards for analyzing genomic and transcriptomic data and allowing flexible analysis parameters, we enable users to apply regtools to a wide set of scientific methodologies and datasets. to ease the use and integration of regtools into analysis workflows, we provide documentation and example workflows at regtools.org and provide a docker image with all necessary software installed.
in order to demonstrate the utility of our tool, we applied regtools to , tumor samples across tumor types to profile the landscape of this category of variants. from this analysis, we report , variants that cause novel splicing events that were missed by vep or spliceai. only . percent of these mutations were previously discovered by similar attempts, while . percent are novel findings. we demonstrate that there are splice altering variants that occur beyond the splice site consensus sequence, shift transcript usage between known transcripts, and create novel exon-exon junctions that have not been previously described. specifically, we describe notable findings within b m, rnf , and cdkn a. these results demonstrate the utility of regtools in discovering novel splice-altering mutations and confirm the importance of integrating rna and dna sequencing data in understanding the consequences of somatic mutations in cancer. to allow further investigation of these identified events, we make all of our annotated result files (supplemental files - ) and recurrence analysis files (supplemental files - ) available. understanding the splicing landscape is crucial for unlocking potential therapeutic avenues in precision medicine and elucidating the basic mechanisms of splicing. the exploration of novel tumor-specific junctions will undoubtedly lead to translational applications, from discovering novel tumor drivers, diagnostic and prognostic biomarkers, and drug targets, to identifying a previously untapped source of neoantigens for personalized immunotherapy. while our analysis focuses on splice altering variants within cancers, we believe regtools will play an important role in answering this broad range of questions by helping users extract splicing information from transcriptome data and linking it to somatic (or germline) variant calls. the computational efficiency of regtools and increasing availability and size of such datasets may also allow for improved understanding of splice regulatory motifs that have proven difficult to accurately define such as exonic and intronic splicing enhancers and silencers. any group with paired dna and rna-seq data for the same samples stands to benefit from the functionality of regtools.

methods

software implementation

regtools is written in c++. cmake is used to build the executable from source code. we have designed the regtools package to be self-contained in order to minimize external software dependencies. a unix platform with a c++ compiler and cmake is the minimum prerequisite for installing regtools. documentation for regtools is maintained as text files within the source repository to minimize divergence from the code. we have implemented common file handling tasks in regtools with the help of open-source code from samtools/htslib and bedtools in an effort to ensure fast performance, consistent file handling, and interoperability with any aligner that adheres to the bam specification. statistical tests are conducted within regtools using the rmath framework. travis ci and coveralls are used to automate and monitor software compilation and unit tests to ensure software functionality. we utilized the google test framework to write unit tests. regtools consists of a core set of modules for variant annotation, junction extraction, junction annotation, and gtf utilities. higher level modules such as cis-splice-effects make use of the lower level modules to perform more complex analyses.
We hope that bioinformaticians familiar with C/C++ can re-use or adapt the RegTools code to implement similar tasks.

Benchmarking

Performance metrics were calculated for all RegTools commands. Each command was run with default parameters on a single blade server (Intel(R) Xeon(R) CPU E - v @ . GHz) with GB of RAM and replicates for each data point (Supplementary Figure ). Specifically for cis-splice-effects identify, we started with random selections of somatic variants, ranging from , - , , , across data subsets. Using the output from cis-splice-effects identify, variants annotate was run on somatic variants from the subsets (range: - , ) predicted to have a splicing consequence. The function junctions extract was run on the HCC tumor RNA-seq data aligned with HISAT to GRCh and randomly downsampled at intervals ranging from - %. Using output from junctions extract, junctions annotate was run on data subsets ranging from , - , randomly selected junctions.

Benchmark tests revealed approximately linear performance for all functions. The difference between real and CPU time is highly dependent on the I/O speed of the disk being written to, and real time values may be artificially inflated when multiple jobs write to the same disk at once. The most computationally expensive function in a typical analysis workflow was junctions extract, which on average processed , reads/second (CPU) and took an average of . real vs . CPU minutes to run on a full BAM file ( , , reads total).
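As an illustration of how an "approximately linear" scaling claim can be checked, the sketch below fits a straight line to (input size, CPU time) pairs and derives a throughput from the slope. The measurements are made up for the example and are not values from this study, and the helper names `fit_line` and `r_squared` are hypothetical rather than part of RegTools.

```python
from statistics import mean


def fit_line(sizes, seconds):
    """Ordinary least-squares fit of seconds ~ slope * size + intercept."""
    mx, my = mean(sizes), mean(seconds)
    sxx = sum((x - mx) ** 2 for x in sizes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(sizes, seconds))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def r_squared(sizes, seconds, slope, intercept):
    """Coefficient of determination; values near 1 indicate a good linear fit."""
    my = mean(seconds)
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(sizes, seconds))
    ss_tot = sum((y - my) ** 2 for y in seconds)
    return 1.0 - ss_res / ss_tot


# Hypothetical benchmark points (NOT from the paper):
# number of reads in each downsampled BAM file and the CPU seconds used.
reads = [1_000_000, 2_000_000, 4_000_000, 8_000_000]
cpu_seconds = [12.0, 22.0, 42.0, 82.0]

slope, intercept = fit_line(reads, cpu_seconds)
r2 = r_squared(reads, cpu_seconds, slope, intercept)

# The slope is CPU seconds per read, so its inverse is reads per CPU second
# once the fixed start-up cost (the intercept) is set aside.
throughput = 1.0 / slope

print(f"slope={slope:.3e} s/read, intercept={intercept:.2f} s, R^2={r2:.4f}")
print(f"throughput ~ {throughput:,.0f} reads per CPU second")
```

Fitting CPU time rather than real time keeps the estimate independent of disk contention, which is the effect the text above attributes to inflated real-time values.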
The function junctions annotate was the next most computationally intensive function and took an average of . real/ . CPU minutes to run on , junctions, processing junctions/second (CPU). The other functions were comparatively faster, with cis-splice-effects identify and variants annotate able to process , and variants per second (CPU), respectively. Processing a typical candidate variant list of , , variants and a corresponding RNA-seq BAM file of , , reads with cis-splice-effects identify takes ~ . real/ . CPU minutes (Supplementary Figure ).

Performance metrics were also calculated for the statistics script and its associated wrapper script, which divides the variants into smaller chunks for processing to limit RAM usage. This command, compare_junctions, was benchmarked in January using Amazon Web Services (AWS) on a m . xlarge instance, based on the Amazon Linux AMI, with GB of RAM, vCPUs, and a mounted TB SSD EBS volume with IOPS. These data were generated by running compare_junctions on each of the included cohorts, with the largest being our BRCA cohort ( sample), which processed . events per second (CPU).

Using RegTools to identify cis-acting, splice altering variants

RegTools contains three sub-modules: “variants”, “junctions”, and “cis-splice-effects”. For complete instructions on usage, including a detailed workflow for how to analyze cohorts using RegTools, please visit regtools.org.

Variants annotate

This command takes a list of variants in VCF format. The file should be gzipped and indexed with tabix. The user must also supply a GTF file that specifies the reference transcriptome used to annotate the variants. The INFO column of each line in the VCF is populated with comma-separated lists of the variant-overlapping genes, variant-overlapping transcripts, the distance between the variant and the associated exon edge for each transcript (i.e.
each start or end of an exon whose splice variant window included the variant) defined as min(distance_from_start_of_exon, distance_from_end_of_exon), and the variant type for each transcript. Internally, this function relies on HTSlib to parse the VCF file and search for features in the GTF file which overlap the variant. The splice variant window size (i.e. the maximum distance from the edge of an exon used to consider a variant as splicing-relevant) can be set by the options “- e