key: cord-1034829-io7ggp1v
authors: González, Ricardo D.; Figueiredo, Pedro R.; Carvalho, Alexandra T. P.
title: Cell delivery peptides for small interfering RNAs targeting SARS-CoV-2 new variants through a bioinformatics and deep learning design
date: 2022-02-10
journal: bioRxiv
DOI: 10.1101/2022.02.09.479755
sha: 1379c1e7b0bec68204f41567abae63856d26754c
doc_id: 1034829
cord_uid: io7ggp1v

Nucleic acid technologies with designed delivery systems have emerged as one of the most promising therapies of the future, owing to their contribution to combating severe SARS-CoV-2 disease. Nevertheless, the emergence of new variants of concern still represents a real threat in the years to come. It is here that the use of small interfering RNA sequences to inhibit gene expression and, thus, protein synthesis, may complement the already developed vaccines, with faster design and production. Here, we have designed new sequences targeting COVID-19 variants and other related viral diseases through bioinformatics, while also addressing the limited number of delivery peptides by a deep learning approach. Two sequence databases were produced: from the first, 62 sequences were able to target the viral mRNA, and from the second, ten displayed properties present in delivery peptides, which we compared with the widely used TAT delivery peptide.

Severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) is the virus that caused the coronavirus disease 2019 (COVID-19) pandemic, which originated in Wuhan, China [1]. The success of the new, rapidly developed vaccines has propelled the interest in nucleic acid-based therapies as an approach to treat several systemic disorders, with a particular interest in ribonucleic acids (RNA) [1].
However, new variants of concern have emerged, such as B.1.617.2 (Delta) and, more recently, B.1.1.529 (Omicron), which have become dominant internationally, with higher transmissibility rates and consequently a higher influx of patients into health systems [2]. Given that, new approaches to reduce the viral load in infected individuals are needed to diminish infection rates and, possibly, symptoms. In that regard, small interfering RNAs (siRNA) have been extensively studied and are becoming an accepted modality of pharmacotherapy [3]. These short strands of RNA induce selective gene suppression by binding to sequence-specific messenger RNAs (mRNA), leading to their cleavage and degradation and thus preventing protein synthesis [4]. Nevertheless, nucleic acids pose challenges such as efficient cell delivery. New delivery methods focus on the design of nanoparticles that protect the RNA from degradation and allow it to cross cell membranes for intracellular delivery, such as lipid nanoparticles, polymer-based nanocarriers (nanoparticles, micelle-based) and carbon-based nanostructures [5]. Protein transduction domains, most commonly known as cell penetrating peptides (CPP), are small peptides (6 to around 30 amino acids) that are able to carry cargos across cellular membranes in an intact and functional fashion [6]. They have been used as a complement to the aforementioned methods, enhancing endosomal escape when conjugated to the surface of those structures (nanoparticles, polymeric micelles, or liposomes). These complex constructs need thorough design and assembly, which may hinder the development of stable RNA therapies and delivery. Nucleic acids have been successfully delivered to cells by CPPs [7], but how this occurs remains unclear.
They are mainly cationic in nature but can also have amphiphilic properties [8], though no clear rules have been established to distinguish whether a peptide is a CPP or not, which limits the number of described peptides. Even for the same CPP, the transduction mechanism varies depending on the cargo, cell environment and peptide concentration. CPPs have been described as being mostly internalized by endocytosis, even if conflicting results have made it difficult to indicate with precision which endocytic pathway is involved [9, 10]. The most used CPPs, namely the trans-activator of transcription (TAT) protein of HIV-1, penetratin and transportan [9], have been used in preclinical and clinical studies on Alzheimer's disease, cancer, or cerebral ischemia for the delivery of therapeutics, but caution is required because of cytotoxic effects [11]. Even if their mechanism is not clear, CPPs seem to be a promising approach for the delivery of therapeutics, namely siRNAs, owing to their smaller size [3].

Here, we have designed siRNA sequences for COVID-19, considering the new variants of concern that have been spreading worldwide. We have also addressed the limited number of CPPs by applying a deep learning approach for the design of new peptides capable of delivering siRNAs to inhibit RNA activity, thus reducing viral replication. Moreover, the absorption, distribution, metabolism, and excretion (ADME)/toxicity of the peptides was assessed in comparison with the most used delivery peptide, TAT.

An siRNA sequence library was constructed from the SARS-CoV-2 reference genome (NCBI, NC_045512), based on the protocol proposed by Medeiros et al. [12]. A 21-nt sliding-window approach with a step of 1 was implemented to obtain sequences of 21 nucleotides (nt). These sense sequences were then converted to their antisense counterparts via shell scripting to allow proper targeting, followed by filtering steps.
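The sliding-window and antisense steps described above can be illustrated with a minimal Python sketch. The authors report using shell scripting, so this is a swapped-in illustration rather than their implementation; the function names `sliding_windows` and `antisense_rna` are hypothetical, and the sketch assumes the genome is available as a plain DNA string over A, C, G and T.

```python
from typing import Iterator, Tuple

# Assumption: input is DNA (A/C/G/T); the antisense strand is written as RNA.
_DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def sliding_windows(genome: str, size: int = 21, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (0-based start, window) for every full-length window."""
    for start in range(0, len(genome) - size + 1, step):
        yield start, genome[start:start + size]


def antisense_rna(sense_dna: str) -> str:
    """Return the antisense RNA strand, written 5'->3' (reverse complement)."""
    return sense_dna.upper().translate(_DNA_TO_RNA_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # First 21 nt of a sense strand listed in the paper's sequence table.
    sense = "GGCATACTAATTGTTACGACT"
    guide = antisense_rna(sense)
    print("5'-" + guide + "-3'")
    # Reversing the string gives the 3'->5' notation used in the table.
    print("3'-" + guide[::-1] + "-5'")
```

With a step of 1, a sequence of length n yields n - 21 + 1 windows; the reversed output for the example strand reproduces the antisense entry paired with it in the paper's sequence table.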
As our purpose was to obtain sequences with siRNA features, a proper selection was necessary. Biological activity was controlled by restricting GC-content to 30-50% and by removing sequences capable of forming hairpins or of self-annealing, whereas toxicity was evaluated by analyzing the presence of poly(U/T/A) and GCCA motifs [13]. We wanted these sequences to be able to target SARS-CoV-2 and its Delta and Omicron variants, but also other related viral diseases, such as SARS-CoV, Middle East Respiratory Syndrome-related coronavirus (MERS-CoV), and influenza (H1N1), with no off-target effects in humans. To do that, we retrieved the following genomes in FASTA format: 1) the human genome, coding and non-coding transcriptome (GRCh38, from NCBI and ENSEMBL, respectively); 2) the reference genomes of SARS-CoV-2 (NC_045512. [14]. Alignment to those genomes was performed using the short-read aligner Bowtie v1.1.0 [15], reporting all valid alignments per read, with the following parameters: maximum number of attempts to match an alignment = 4, and maximum number of mismatches in the "seed" = 3, with "seed" length = 7. The analysis of these features allowed for the selection of the most promising sequences capable of targeting different genes of interest in COVID-19, acting as siRNAs, while also being able to target related viral diseases, which broadens the targeting scope of these sequences.

The design of CPPs was performed according to the protocol for antimicrobial peptides described by Tucs et al. [16]. This method relies on deep learning generative adversarial networks (GAN), which control the probability distribution of the newly generated sequences, ranking them into positive and negative classes. The underlying criteria include physicochemical descriptors such as charge, hydrophobicity, and molecular weight.
Data were retrieved from the publicly available databases CPPsite 2.0 [17], a webserver with deposited and curated CPPs (data available as of 1 October 2021), and machine learning CPP (MLCPP) [18], a two-layer framework for machine-learning-based prediction of cell penetrating peptides. Additionally, because the work by Tucs and colleagues [16] on antimicrobial peptides showed that most of their new peptides were cationic and amphiphilic, which are features found in CPPs, sequences from their datasets (retrieved from the APD, CAMP, LAMP and DBAASP databases) were also collected to enrich the training sets for the GAN. Sequences with up to 52 amino acids were used for training. Redundant sequences were removed. The final dataset contained 14,778 positive sequences (CPPs and antimicrobial peptides) and 6,664 negative sequences (non-CPPs). All the sequences were used during the training of the model (for activity prediction), followed by GAN training on the positive dataset.

Absorption, Distribution, Metabolism, Excretion and Toxicity (ADME-Tox) studies were conducted using the variable nearest neighbor (vNN) webserver for ADME prediction [19], which allows for the retrieval of a range of properties, such as cardio- and cytotoxicity, and the likelihood of causing liver injury. This in silico method permits a first scan of compounds before new molecules are taken to the lab, and it has great potential because its learning algorithms rely on available experimental data.

Before the advent of RNA vaccines, the regulatory mechanism achieved by RNA interference (RNAi) was already known and used for RNAi-based therapies [3]. These usually short RNA strands, such as siRNA, cause sequence-specific gene suppression with several advantages over other therapies: the sequence specificity allows for the targeting of oncogenes and growth factors, and even of single nucleotide polymorphisms [20].
Furthermore, upon proper delivery, siRNAs can be delivered to the brain, extending the scope of RNA interference to the treatment of neurodegenerative pathologies, and to viral targets such as HIV and Hepatitis C, by inhibition of the viral RNA [21]. The vast potential of this technology to target viral diseases is of utmost importance in the current worldwide COVID-19 pandemic, and a great opportunity to further develop therapies for diseases that are based on differential gene expression. It is noteworthy that RNA technologies allow for faster development than classical viral vaccines and are more easily modified to follow new variants.

The one-step sliding-window approach produced 29,880 21-nt-long sequences that could target some or most of the genomes under study (available upon request). The alignment of those sequences to the genome of SARS-CoV-2 allowed for the annotation of 29,197 sequences to several genes of interest (Figure 2 and Supplementary Information, Table S1).

Figure 2. Distribution of annotated genes that the produced 21-nt sequences are able to target. Pp, polyprotein; nsp, non-structural protein; ExoN, exoribonuclease; Pol, polymerase; Hel, helicase. Genes with less than 3.5% representation were pooled in the "others" group and are described in Supplementary Information (Table S1).

The SARS-CoV-2 genome is similar to that of other coronaviruses. It is a ssRNA that encodes 27 proteins from 14 open reading frames. Eight accessory proteins and the four main structural proteins (nucleocapsid, envelope, membrane, and spike proteins) are encoded by the 3'-terminus, while polyproteins (pp) 1ab and 1a (upon cleavage of pp1ab) are encoded by the 5'-terminus. These are responsible for coding non-structural proteins (nsp) 1 to 10, whereas nsp11 is exclusively encoded by pp1a, and nsp13-16 by pp1ab [22]. Most of the produced siRNA sequences (≈ 20.0%) target the pp1ab gene and those it encodes (pp1a and nsp3).
Other than spike proteins, most of the generated sequences target this 790 kDa polyprotein, as it is present in most gene sets (73.12%). Pp1ab and pp1a are replicases processed by two viral proteases (papain-like and 3C-like protease), important in viral replication and transcription, which makes them a vital target to inhibit viral activity [23]. On the other hand, only around 7.0% of the sequences targeted the spike glycoprotein S1, and 6.0% targeted S2, totaling 13.0% of sequences against the spike protein. This protein is responsible for binding to the host cell receptor and for inducing membrane fusion, which makes it a vital player in the host invasion process, mainly through its high affinity for the angiotensin converting enzyme (ACE) 2 receptor [24]. This is an interesting finding, as most current approaches to treat COVID-19 have relied on targeting the latter, highly mutation-prone protein, while it seems that the pp1ab gene offers more targeting capability while ensuring the inhibition of replication and transcription, without which no host invasion is possible.

To make sure that the designed siRNAs would have an activity close to that of sequences retrieved from experimental settings, we applied several filters to the obtained sequences. GC-content is an important factor for biological activity, and it has been found that a GC-content ranging from 30.0 to 50.0% enables better activity [25]. We applied that scheme and obtained 19,731 sequences with a GC-content of 33.0 to 48.0%, targeting the same annotated genes. Next, poly(T/U) and poly(A) with more than four nucleotide repeats should be avoided, as these tags can act as termination signals for RNA polymerase III [26]. Thus, we excluded sequences with 4 or more of these repeats, ending up with 15,681 siRNA sequences. Toxicity is also a major concern regarding new therapeutics, whether they are drugs or nucleic acid-like therapeutics. According to Fedorov et al.
[27], to limit toxicity, these sequences must have at least three mismatches compared to the human genome, and they should not possess poly(T/U) and GCCA tags (the former was resolved in the previous filtering step). We then removed sequences with GCCA motifs and those that had fewer than three mismatches when aligned to the human genome and the human coding and non-coding transcriptome. These filtering steps were performed by shell scripting and spreadsheet conditionals, and allowed for an 87.0% reduction in the number of siRNA sequences (1,993; Supplementary Information, Table S2). This last filtering step was strongly influenced by the number of matches to the COVID-19 Omicron variant of concern, as there were not as many complete sequences with high coverage deposited in the GISAID platform, hence the number of full matches to the 21-nt sequences was lower. Despite the significant reduction of the original set of siRNA sequences, the relative targeting of the genes of interest remains the same as in the original data set (around 20.0% for the pp1ab, pp1a and nsp3 cluster, Figure 3). It is important to note that a general first step of siRNA sequence generation without filtering enables researchers to apply their own filters to select the sequences that best fit their needs.

The delivery of therapeutics to cells is a major focus in today's therapy research, so as to ensure the proper localized activity of those moieties without raising off-target effects. In that regard, several nanostructures have been suggested to protect the carried drug cargos; however, nucleic acid-like therapies rely on base pairing rather than on structural features, such as those that are screened for drug repurposing. In that sense, it is possible to use simpler but effective transporters to deliver them.
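The sequence-level filters discussed earlier in this section (GC-content window, poly(A) and poly(T/U) runs, and the GCCA motif) can be sketched as a single predicate. This is a hedged illustration in Python, not the authors' shell and spreadsheet workflow: the function names are hypothetical, the run threshold of four follows the exclusion rule stated in the text, the choice of strand on which GCCA is checked is an assumption, and the hairpin, self-annealing and Bowtie mismatch filters are deliberately left out.

```python
import re

# Assumption: a "repeat" is a run of 4 or more A, or 4 or more T/U.
_HOMOPOLYMER_RUN = re.compile(r"A{4,}|[TU]{4,}")


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def passes_sequence_filters(seq: str, gc_min: float = 30.0, gc_max: float = 50.0) -> bool:
    """Apply the GC-content, homopolymer-run and GCCA-motif filters only."""
    seq = seq.upper()
    if not gc_min <= gc_percent(seq) <= gc_max:
        return False  # outside the 30-50% GC window
    if _HOMOPOLYMER_RUN.search(seq):
        return False  # poly(A) or poly(T/U) run
    if "GCCA" in seq:
        return False  # motif associated with toxicity in the text
    return True


if __name__ == "__main__":
    for candidate in ("GGCATACTAATTGTTACGACT", "ATGCCATGCATGCATGCATGA"):
        print(candidate, round(gc_percent(candidate), 1), passes_sequence_filters(candidate))
```

For a 21-nt sequence, only 7 to 10 G or C bases fall inside the 30-50% window (7/21 ≈ 33.3% and 10/21 ≈ 47.6%), which is consistent with the 33.0 to 48.0% range reported for the retained sequences.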
CPPs have the ability to cross membranes in a noninvasive manner, allowing for the maintenance of cell integrity while also being considered safe and highly efficient, showing low cytotoxicity and no immunological responses [8, 11]. However, the number of CPPs described in the literature is small, and cell delivery is usually achieved with the widely used CPPs TAT, penetratin and transportan [9]. Thus, we have here turned to deep learning methods to design new delivery peptides. By controlling the probability distribution of the new peptide sequences, the deep learning GAN method allowed for the sorting of those sequences into positive (CPP) and negative (non-CPP) classes, considering physicochemical descriptors such as charge, hydrophobicity, and molecular weight of the data from which it learned. A total of 9,984 new sequences were created (data available upon request), ranging from 5 to 52 aa. From those, we removed four sequences that contained the ambiguous Asx (B) code (either aspartic acid or asparagine is possible at those positions), leaving 9,980 new sequences. This step facilitated the use of the servers described next, as they only accept sequences composed of the standard amino acids.

Despite the learning capability of these methods, we wanted to further filter our data. We therefore resorted to the Machine Learning CPP (MLCPP) server, from which we had initially extracted data for our learning datasets. This tool discriminates between CPPs and non-CPPs, scoring the sequences based on their amino acid composition. It classified 3,866 sequences as CPP and 6,114 as non-CPP. Alongside the probability score of being a CPP or not, which we used as a classifier, the server also provides a probability score for uptake efficiency. Both are important parameters for the determination of effective delivery peptides, so we ranked the sequences according to both conditions.
Interestingly, at least among the 15 best-scored sequences of each classifier, there were no sequences present in both. This strongly suggests that being positively scored as a CPP, owing to physicochemical properties, does not imply a high uptake efficiency score. For further analysis, we selected the top 5 sequences from each score classifier and performed a BLAST alignment to evaluate whether these sequences were already deposited elsewhere in any genome. As no match was found, we assumed these sequences to be new potential CPPs, and compared them with the most used CPP, the TAT peptide, first by sequence alignment (Clustal Omega, Figure S1) [29] and then by score analysis (Table 1). 57.37% versus TAT's 59.71%). The sequence with the highest uptake score (seq10) is up to 1.5-fold better than TAT. These scores, however, do not seem to be related to the length of the sequences, as no pattern could be found. They may rely, more than on sequence size, on amino acid composition and physicochemical features. The latter were then calculated using the HeliQuest server [30] (Supplementary Information, Table S3), from which it was also possible to retrieve the peptides' helical wheel representations (Figure 4). None of the sequences, including TAT, presents highly amphipathic helices, with hydrophobicity below 0.6. TAT is the peptide with the highest polar residue content (91.67%), while the new peptides averaged 61.07%. However, five of the new sequences (seq2, seq3, seq7, seq9, and seq10) have hydrophobic faces, which are said to be features of biologically active peptides, as these hydrophobic residues may insert between the membrane lipid acyl chains, allowing for binding to and crossing of the membrane [31].

To obtain new effective and safe drugs to be used in clinical settings, the compounds need to be screened and characterized regarding their cytotoxicity.
One of the most common studies performed to obtain this information is the ADME-Tox assay, which allows for the prediction of what will occur after administration in the human body. Despite being largely an in vitro process, several in silico pipelines have been created based on experimental results to help researchers develop and characterize new products more easily. One such example is the variable nearest neighbor (vNN) webserver for ADME prediction [19], which allows for the retrieval of a range of properties, such as cardiotoxicity, cytotoxicity, and the likelihood of causing liver injury. We scanned our new sequences and the TAT control in the server, to observe how they compared with the latter, a broadly accepted delivery peptide (Table 2). Regarding cytotoxicity, none of the new sequences was predicted to cause damage to cells, considering an IC50 ≤ 10.0 µM. However, three peptides (seq2, seq8 and TAT) have a high risk of inducing liver injury, which would preclude their use. These same peptides, plus seq9 and seq7, were predicted to be rapidly metabolized by the liver (half-life lower than 30 minutes). All the others, on the other hand, may be CYP3A4 inhibitors, which may explain why their half-life is longer than 30 minutes, as they may be able to circumvent xenobiotic metabolism by this CYP enzyme. TAT and seq8 may be inhibitors of P-glycoprotein, while peptides seq1, seq4, seq3 and seq10 may be substrates. P-glycoprotein is an essential membrane protein whose role is to transport foreign compounds out of the cell. Given that, the latter set of peptides may not be ideal for cell delivery, as they could be exported from the cell through the P-glycoprotein pathway despite being able to cross the cellular membranes.
Interestingly, the TAT peptide has been reported as capable of crossing the blood-brain barrier in several studies [32], but the server could not correctly predict this feature when stronger filters were applied, only when weaker filters were used. Furthermore, the peptide seq1 appears to have some cardiotoxicity (hERG). According to the server's training data for this feature, this sequence must present an IC50 ≤ 10.0 µM. This peptide should then be avoided to prevent the blockage of a potassium ion channel involved in normal cardiac repolarization. None of the sequences showed mitochondrial toxicity or mutagenicity (through mitochondrial membrane potential and mutagenicity probability (Ames test), respectively). Lastly, it was important to verify the maximum recommended therapeutic dose (MRTD), that is, the daily dose that is considered safe for patients to take. Remarkably, the TAT peptide has the smallest MRTD value, 16 mg/day, while seq9 has the highest MRTD, 2,994 mg/day, 188-fold better than the control.

These data suggest that the seq7 and seq5 peptides may be the safest ones for cell delivery purposes. While the former has a predicted half-life of up to 30 minutes, it satisfies all the other parameters and has the second-highest MRTD (1,817 mg/day), which could make up for the shorter half-life. This peptide has a higher probability score of being a CPP (99.46% versus TAT's 99.27%) and its probability score for uptake efficiency is close to that of TAT (57.37% versus TAT's 59.71%). Seq5 also passes all the parameters except for the CYP3A4 inhibitor prediction. This parameter should be taken into account, as the inhibition of CYP3A4 activity may lead to an increased concentration of the compound in the liver and the intestine, which could cause toxicity [33]. Nevertheless, no cyto- or cardiotoxicity was predicted for this peptide, and it has the third-highest MRTD value (1,747 mg/day), which could point to its general safety.
The emergence of COVID-19 has paved the way for the development of better and more personalized medicines, with a highlight on RNA therapeutics. This is a great time for delving into these new technologies and broadening their scope to other pathologies. Nevertheless, therapeutics directed at nucleic acid sequences require careful design in order to enhance their efficiency and also assure their safety and the lack of off-target effects that may interfere with the expression of other genes and, consequently, protein synthesis and function. In that sense, computational tools ranging from bioinformatics analysis to the more computationally expensive machine and deep learning have made their way into the routine of researchers prior to extensive and comprehensive experimental evaluations. Here, we have explored the genomics of SARS-CoV-2 to produce sequences able to bind to its genes and produce inhibitory effects, preventing protein synthesis through the RNA interference pathway, which leads to RNA degradation. The availability of data, in part caused by the huge demand for COVID-19 data, allowed us to define sequences with a broad general use. These Also, despite the current focus on nanostructures, delivery peptides are a promising delivery system, although the small number of described delivery peptides has restricted their use to the most common CPPs. Here, we have resorted to the power of deep learning to design new potential cell delivery peptides and explored their physicochemical properties to predict their activity. The emergence of in silico tools and servers for these analyses should encourage researchers to better select and design peptide and nucleic acid sequences that could have better results in wet lab experiments, while helping reduce the cost of those tests via prior screening and selection.
Thus, we designed 62 siRNA sequences targeting the most important SARS-CoV-2 genes, mainly pp1ab and the spike protein, which have vital roles in replication and host invasion, respectively. Additionally, we came up with 10 new putative delivery peptide sequences that could be able to cross the cellular membrane. Some of these sequences displayed better features than the TAT control peptide, especially regarding the maximum recommended dose, which could point to better safety and lower toxicity when compared with this control. The emergence of new variants and diseases may push the need for new therapies, but RNA and data analysis may lead the way to answer these needs.

Figure S1. Sequence alignment of the new peptide sequences to the TAT cell penetrating peptide.

[1] Nucleic Acid-Based Therapy for Coronavirus Disease
[2] Omicron and Delta Variant of SARS-CoV-2: A Comparative Computational Study of Spike Protein
[3] Therapeutic SiRNA: State of the Art
[4] The Many Pathways of RNA Degradation
[5] Delivery Systems for Nucleic Acids and Proteins: Barriers, Cell Capture Pathways and Nanocarriers
[6] Cell Penetrating Peptides, Novel Vectors for Gene Therapy
[7] Peptide Nucleic Acid (PNA) Cell Penetrating Peptide (CPP) Conjugates as Carriers for Cellular Delivery of Antisense Oligomers
[8] A Comprehensive Model for the Cellular Uptake of Cationic Cell-Penetrating Peptides
[9] Twenty Years of Cell-Penetrating Peptides: From Molecular Mechanisms to Therapeutics: Peptide-Based Drug Delivery Technology
[10] Internalization Mechanisms of Cell-Penetrating Peptides
[11] Langel, Ü. Cargo-Dependent Cytotoxicity and Delivery Efficacy of Cell-Penetrating Peptides: A Comparative Study
[12] A Small Interfering RNA (SiRNA) Database for SARS-CoV-2
[13] Precise and Efficient SiRNA Design: A Key Point in Competent Gene Silencing
[14] Disease and Diplomacy: GISAID's Innovative Contribution to Global Health: Data, Disease and Diplomacy
[15] Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome
[16] Generating Ampicillin-Level Antimicrobial Peptides with Activity-Aware Generative Adversarial Networks
[17] CPPsite 2.0: A Repository of Experimentally Validated Cell-Penetrating Peptides
[18] Machine-Learning-Based Prediction of Cell-Penetrating Peptides and Their Uptake Efficiency with Improved Accuracy
[19] VNN Web Server for ADMET Predictions
[20] SiRNA-Based Approaches in Cancer Therapy
[21] Current Development of SiRNA Bioconjugates: From Research to the Clinic
[22] Genome Composition and Divergence of the Novel Coronavirus (2019-NCoV) Originating in China
[23] Expression and Functions of SARS Coronavirus Replicative Proteins
[24] Much More Than Just a Receptor for SARS-COV-2. Front
[25] Strategies for Improving SiRNA-Induced Gene Silencing Efficiency
[26] Functional Anatomy of SiRNAs for Mediating Efficient RNAi in Drosophila Melanogaster Embryo Lysate
[27] Off-Target Effects by SiRNA Can Induce Toxic Phenotype
[28] Optimization of Duplex Stability and Terminal Asymmetry for ShRNA Design
[29] Scalable Generation of High-quality Protein Multiple Sequence Alignments Using Clustal Omega
[30] HELIQUEST: A Web Server to Screen Sequences with Specific -Helical Properties
[31] Lipid-Protein Interactions in Biological Membranes: A Structural Perspective
[32] Cell-Penetrating Peptide-Mediated Therapeutic Molecule Delivery into the Central Nervous System
[33] The Modulatory Role of CYP3A4 in Dictamnine-Induced Hepatotoxicity

5'-GGCATACTAATTGTTACGACTAT-3' 3'-CCGUAUGAUUAACAAUGCUGA-5' pp1ab
5'-AGATTTGACACTAGAGTGCTATC-3' 3'-UCUAAACUGUGAUCUCACGAU-5' pp1ab
5'-GGTATTGCTACTGTACGTGAAGT-3' 3'-CCAUAACGAUGACAUGCACUU-5' pp1ab
5'-GACTTATTTAGAAATGCCCGTAA-3' 3'-CUGAAUAAAUCUUUACGGGCA-5' pp1ab, nsp16
5'-TGTTACTAATGTGAATGCGTCAT-3' 3'-ACAAUGAUUACACUUACGCAG-5' pp1ab
5'-CGATTATGACTACTATCGTTATA-3' 3'-GCUAAUACUGAUGAUAGCAAU-5'
5'-TAAGTTTAGAATAGACGGTGACA-3' 3'-AUUCAAAUCUUAUCUGCCACU-5'
5'-AAACTAATTGTTGTCGCTTCCAA-3' 3'-UUUGAUUAACAACAGCGAAGG-5'
5'-GTGATTTCATACAAACCACGCCA-3' 3'-CACUAAAGUAUGUUUGGUGCG-5'
5'-GGTAACTGGTATGATTTCGGTGA-3' 3'-CCAUUGACCAUACUAAAGCCA-5' pp1ab, pp1a
5'-AGATTTACTCATTCGTAAGTCTA-3' 3'-UCUAAAUGAGUAAGCAUUCAG-5'
5'-GATTTACTCATTCGTAAGTCTAA-3' 3'-CUAAAUGAGUAAGCAUUCAGA-5'
5'-GGTACAAGTAACTTGTGGTACAA-3' 3'-CCAUGUUCAUUGAACACCAUG-5'
5'-GGTACAACTACACTTAACGGTCT-3' 3'-CCAUGUUGAUGUGAAUUGCCA-5' pp1ab, pp1a
5'-GGTGAAACATTTGTCACGCACTC-3' 3'-CCACUUUGUAAACAGUGCGUG-5' pp1ab, pp1a
5'-TAGTTTCAACTATACAGCGTAAA-3' 3'-AUCAAAGUUGAUAUGUCGCAU-5'
5'-GGAATTTGCGAGAAATGCTTGCA-3' 3'-CCUUAAACGCUCUUUACGAAC-5'
5'-GAATACTGTTAAGAGTGTCGGTA-3' 3'-CUUAUGACAAUUCUCACAGCC-5'
5'-GTAATAGAGCAACAAGAGTCGAA-3' 3'-CAUUAUCUCGUUGUUCUCAGC-5'
5'-GAATTTGCGAGAAATGCTTGCAC-3' 3'-CUUAAACGCUCUUUACGAACG-5' pp1ab, pp1a
5'-TGCTATTACCTCTTACGCAATAT-3' 3'-ACGAUAAUGGAGAAUGCGUUA-5' pp1ab, pp1a
Spike protein S1
5'-GGAACAAATACTTCTAACCAGGT-3' 3'-CCUUGUUUAUGAAGAUUGGUC-5'
Spike protein S2
5'-GGTTTAATGGTATTGGAGTTACA-3' 3'-CCAAAUUACCAUAACCUCAAU-5'

The authors acknowledge computing resources made available by the Minerva HPC from the Coimbra Institute of Engineering (ISEC); the National Distributed Computing Infrastructure

The data underlying this article were accessed from public databases. Genomes were obtained from NCBI, ENSEMBL and GISAID (https://gisaid.org). Peptide data were obtained from CPPsite2.0 (http://crdd.osdd.net/raghava/cppsite), MLCPP (http://thegleelab.org/MLCPP), APD (http://aps.unmc.edu), CAMP (http://camp.bicnirrh.res.in), DBAASP (https://dbaasp.org), and LAMP (http://biotechlab.fudan.edu.cn/database/lamp). The derived data generated in this research will be shared on reasonable request to the corresponding author.