key: cord-0316254-ng8rtgc9 authors: Chen, Yaowen; He, Zhen; Men, Yahui; Dong, Guohua; Hu, Shuofeng; Ying, Xiaomin title: MetaLogo: a generator and aligner for multiple sequence logos date: 2021-08-13 journal: bioRxiv DOI: 10.1101/2021.08.12.456038 sha: c5e6473a7028bcd5352827e3c6442c49efe4af12 doc_id: 316254 cord_uid: ng8rtgc9

Sequence logos are used to visually display sequence conservation and variation. They can indicate fixed patterns or conserved motifs in a batch of DNA or protein sequences. However, most popular sequence logo generators can only draw logos for sequences of the same length, let alone for groups of sequences that differ in characteristics other than length. To solve these problems, we developed MetaLogo, which can draw sequence logos for sequences of different lengths or from different groups in a single plot and align multiple logos to highlight the sequence pattern dynamics across groups, thus allowing users to investigate functional motifs from a more fine-grained and dynamic perspective. We provide users with a public MetaLogo web server (http://metalogo.omicsnet.org), a standalone Python package (https://github.com/labomics/MetaLogo), and a built-in web server available for local deployment. Using MetaLogo, users can draw informative, customized, aesthetic, and publishable sequence logos without any programming experience.

The sequence logo was first proposed by Schneider and Stephens in 1990 [1] and has since been used thousands of times for sequence pattern visualization in academic research. Each position of a sequence logo is a stack of amino acids or nucleotides, with the height of each letter indicating its degree of conservation at that position. Commonly used sequence logo generators include WebLogo [2], Seq2Logo [3], ggseqlogo [4], Logomaker [5], and RaacLogo [6], which span web servers and Python and R packages.
These tools greatly accelerate researchers' exploration of sequence patterns and motifs. However, most sequence logo tools only accept equal-length sequences as input. When studying a sequence set, users must therefore filter the input down to sequences of a single representative length, which is generally too simplified to represent the whole set. One solution is to perform multiple sequence alignment (MSA) in advance and use the gapped, aligned sequences as input to construct a single sequence logo, an approach supported by several tools. However, with MSA results it is difficult to tell whether sequences of different lengths carry different patterns or motifs. Take B cell receptor (BCR) sequences as an example. It is known that CDR3s of different lengths may have different affinities for certain antigens. Distinguishing CDR3s by length when examining their motifs is therefore essential for immune repertoire analysis. In addition to separately studying motifs for sequences of different lengths, we may also need multiple sequence logos for sequences of the same length but from different groups, where the groups could be defined by sample source or clustering results. All of the above calls for a convenient tool that takes multiple sets of sequences as input, draws their sequence logos together, and aligns them at the logo level to display pattern dynamics across groups, so that the sequence characteristics of a sample can be understood at a finer level of detail. Besides CDR3 analysis, other motif-related studies, including transcription factor motif analysis, CRISPR array analysis, and the analysis of evolutionarily conserved sequences, have the same requirements.
To solve these problems, we developed MetaLogo, which accepts variable-length or multi-group sequences as input, performs multiple logo alignment, and produces aesthetic, highly customizable figures in multiple forms. MetaLogo is available both as a public web server (also locally deployable) and as a stand-alone Python package. Users can input files in FASTA or FASTQ format and specify grouping by length or by a group id indicated in the sequence names. MetaLogo draws a separate sequence logo for each group and then aligns the sequence logos in a local or global mode, according to the user's choice.

For each set of sequences, MetaLogo first calculates the information content of the amino acids or nucleotides at each position in bits [7]. In order to align different sequence logos, we need to measure the similarity between positions of different logos. Let $P = (P_1, \dots, P_n)$ and $Q = (Q_1, \dots, Q_n)$ be the bit arrays of two positions. To measure the similarity between $P$ and $Q$, MetaLogo provides the Dot Product (DP) and the Cosine Similarity (COS) for users to choose from, which are commonly used as similarity measures and defined as follows:

$$\mathrm{DP}(P, Q) = \sum_{i=1}^{n} P_i Q_i$$

$$\mathrm{COS}(P, Q) = \frac{\sum_{i=1}^{n} P_i Q_i}{\sqrt{\sum_{i=1}^{n} P_i^2}\,\sqrt{\sum_{i=1}^{n} Q_i^2}}$$

Besides bit arrays, we could also use frequency arrays to measure the similarity between positions. For each amino acid at a position, its frequency can be treated as the probability that a sequence has it at that position. Thus, we can use similarity measurements designed for probability distributions. MetaLogo allows users to choose the Jensen-Shannon divergence (JSD) [8] as the similarity measurement. The JSD measures the similarity between two probability distributions and is a symmetrized version of the Kullback-Leibler (KL) divergence [9]. Note that in the following, $P$ and $Q$ represent discrete probability distributions that sum to one. JSD is defined as follows:

$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q)$$

where $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{n} P_i \log \frac{P_i}{Q_i}$ is the KL divergence.

The Bhattacharyya Coefficient (BC) [10], $\mathrm{BC}(P, Q) = \sum_{i=1}^{n} \sqrt{P_i Q_i}$, could also be used as a similarity measurement for two statistical samples.
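The position-level measures named above (DP, COS, JSD, BC) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MetaLogo's own code. The function names are hypothetical. The bit array is computed as frequency times (log2 n minus entropy), without the small-sample correction. Base-2 logarithms are used throughout. Note that JSD is a divergence, so a smaller value means more similar positions.

```python
import math
from collections import Counter


def frequencies(column, alphabet):
    """Relative frequency of each alphabet letter in one logo column.

    Characters outside the alphabet (e.g. gaps) are ignored; the column is
    assumed to contain at least one alphabet letter.
    """
    counts = Counter(column)
    total = sum(counts[a] for a in alphabet)
    return [counts[a] / total for a in alphabet]


def bit_array(freqs):
    """Letter heights in bits: f_i * (log2(n) - H).

    Assumption: no small-sample correction is applied in this sketch.
    """
    n = len(freqs)
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    info = math.log2(n) - entropy
    return [f * info for f in freqs]


def dot_product(p, q):
    """DP: sum of element-wise products."""
    return sum(a * b for a, b in zip(p, q))


def cosine_similarity(p, q):
    """COS: dot product divided by the product of the Euclidean norms."""
    norm = math.sqrt(dot_product(p, p)) * math.sqrt(dot_product(q, q))
    return dot_product(p, q) / norm if norm > 0 else 0.0


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute zero."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """JSD: average KL divergence of p and q from their midpoint m."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_coefficient(p, q):
    """BC: sum of square roots of element-wise products."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


if __name__ == "__main__":
    alphabet = "ACGT"
    p = frequencies("AAAAAAAC", alphabet)  # a strongly conserved column
    q = frequencies("AAAACCGT", alphabet)  # a more variable column
    print("DP  (bits): ", dot_product(bit_array(p), bit_array(q)))
    print("COS (bits): ", cosine_similarity(bit_array(p), bit_array(q)))
    print("JSD (freqs):", jensen_shannon_divergence(p, q))
    print("BC  (freqs):", bhattacharyya_coefficient(p, q))
```

DP and COS are applied here to bit arrays, and JSD and BC to frequency arrays, mirroring the distinction drawn in the text.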
Since a probability array does not indicate conservation as a bit array does, MetaLogo provides an entropy (H) [11] adjusted Bhattacharyya Coefficient (EBC) as a choice for measuring probability array similarity, which is defined using BC, the entropy H of the probability arrays, and the max entropy for an $n$-dimensional probability vector ($\log n$). Among these measurements, COS and JSD consider both non-conservative and conservative patterns, while DP and EBC only value conservative patterns among groups.

The alignment between sequence logos is based on the Needleman-Wunsch algorithm, a classic global sequence alignment algorithm. MetaLogo offers two alignment modes: pairwise alignment between adjacent sequence groups (Figure 1A) and global logo alignment among all sequence groups (Figure 1B). For global multi-logo alignment, MetaLogo adopts progressive alignment construction [12]. The closest pair of sequence logos is aligned first, and then the logo closest to the already aligned set is added, one at a time. The gaps and inserts introduced by each alignment are retained for subsequent alignments until all logos are aligned. For the alignment process, users need to specify one of the similarity metrics mentioned above, as well as the penalty for inserts and gaps. After padding and alignment, MetaLogo can visually highlight highly similar pairs of positions between groups by connecting them with colored strips.

MetaLogo supports four different logo layouts: horizontal, circular, radial, and 3D. As shown in Figure 1A-E, these layouts suit different scenarios.
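As an aside on the pairwise alignment mode described above, before turning to the individual layouts, a Needleman-Wunsch alignment over logo positions can be sketched as follows. This is a minimal illustration under assumptions rather than MetaLogo's implementation. The function name `align_logos`, the single `gap_penalty` parameter (used for both inserts and gaps), and the representation of a logo as a list of per-position vectors are all hypothetical. The score is maximized, so a divergence such as JSD would first need to be converted into a similarity.

```python
def dot_product(p, q):
    """Default position similarity: sum of element-wise products."""
    return sum(a * b for a, b in zip(p, q))


def align_logos(logo_a, logo_b, similarity=dot_product, gap_penalty=1.0):
    """Globally align two logos with the Needleman-Wunsch recurrence.

    Each logo is a list of per-position vectors (e.g. bit arrays).
    Returns (score, pairs), where pairs holds (i, j) index tuples and
    None marks a gap in the corresponding logo.
    """
    n, m = len(logo_a), len(logo_b)
    # score[i][j]: best score for the first i positions of a vs first j of b
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # move[i][j]: which step produced score[i][j], used for the traceback
    move = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_penalty
        move[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_penalty
        move[0][j] = "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = similarity(logo_a[i - 1], logo_b[j - 1])
            candidates = [
                (score[i - 1][j - 1] + sim, "diag"),      # align i with j
                (score[i - 1][j] - gap_penalty, "up"),    # gap in logo_b
                (score[i][j - 1] - gap_penalty, "left"),  # gap in logo_a
            ]
            # max() keeps the first maximum, so ties prefer a match
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right corner to recover the alignment
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "up":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    # Toy logos with fully conserved positions written as one-hot vectors
    A, C, G = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]
    print(align_logos([A, C, G], [C, G], gap_penalty=0.5))
```

Storing the chosen move for each cell avoids comparing floating-point scores during the traceback. A progressive multi-logo alignment would repeat this pairwise step while carrying forward the gaps already introduced, which this sketch does not attempt.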
The horizontal layout is the default and handles most scenarios; the circular layout shows conservation across multiple sequence groups more clearly; the radial layout is suited to sequences whose conserved motifs lie in the middle or at the end rather than at the beginning; and the 3D layout offers a more varied and aesthetic presentation. MetaLogo allows customization of most elements in the figure, including figure size, tick size, label size, labels, title, grids, margins between items, and item colors. Users can also choose whether to display the axes, ticks, labels, group ids, etc. Figures can be exported in multiple formats, including PNG, PDF, SVG, PS and EPS. Users can directly access our public web server (http://metalogo.omicsnet.org, Figure 1F).

MetaLogo is a new generator for aesthetic, customized and informative sequence logos. Unlike existing tools, MetaLogo can draw multiple sequence logos for sequences of different lengths or from different groups in one figure and align the sequence logos to reveal the pattern dynamics across groups. MetaLogo provides a free web server for public use, as well as a stand-alone Python package and a Docker web service for local deployment. We value suggestions and comments from users and will continue to maintain and update the code as a contribution to the community.

MetaLogo is a new sequence logo generator for variable-length or multi-group sequences. MetaLogo performs pairwise and global sequence logo alignment to highlight the sequence pattern dynamics across different sequence groups. MetaLogo provides a public web server, a locally deployable web server with Docker, and a standalone Python package for making highly customized sequence logos.

This work was supported by National Science and Technology Major Project grant

The authors have declared no competing interests.
[1] Sequence logos: a new way to display consensus sequences
[2] WebLogo: a sequence logo generator
[3] Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion
[4] ggseqlogo: a versatile R package for drawing sequence logos
[5] Logomaker: beautiful sequence logos in Python
[6] RaacLogo: a new sequence logo generator by using reduced amino acid clusters
[7] Information content of binding sites on nucleotide sequences
[8] A new metric for probability distributions
[9] On Information and Sufficiency
[10] On a Measure of Divergence between Two Multinomial Populations
[11] A mathematical theory of communication
[12] Progressive sequence alignment as a prerequisite to correct phylogenetic trees
[13] A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome
[14] Dynamics of B cell repertoires and emergence of crossreactive responses in patients with different severities of COVID-19

We thank colleagues in our lab, including Pu Liu, Chao Feng, Sijing An and Runyan Liu, for their careful reviews of and feedback on the MetaLogo web server.