key: cord-0817691-uz9b9mpx authors: Medeiros, Inácio Gomes; Khayat, André Salim; Stransky, Beatriz; dos Santos, Sidney Emanuel Batista; de Assumpção, Paulo Pimentel; de Souza, Jorge Estefano Santana title: A small interfering RNA (siRNA) database for SARS-CoV-2 date: 2020-10-01 journal: bioRxiv DOI: 10.1101/2020.09.30.321596 sha: a8127c85fb5181865a35f5203c56f56ed2bc6066 doc_id: 817691 cord_uid: uz9b9mpx

Coronavirus disease 2019 (COVID-19) rapidly transformed into a global pandemic, creating a demand for antivirals capable of targeting the SARS-CoV-2 RNA genome and blocking the activity of its genes. In this work, we propose a database of SARS-CoV-2 targets for siRNA approaches, aiming to speed up the design process by providing a broad set of possible targets and siRNA sequences. Beyond target sequences, it also displays more than 170 features, including thermodynamic information, base context, target genes, and alignment information of sequences against the human genome and diverse SARS-CoV-2 strains, to assess whether siRNA targets bind off-target sequences. This dataset is available as a set of four tables in a single spreadsheet file, each table corresponding to sequences of 18, 19, 20, and 21 nucleotides in length, respectively, to accommodate the diversity of technology and expertise among labs around the world in designing siRNAs of varied sizes, specifically between 18 and 21 nt. We hope that this database helps to speed up the development of new target antivirals for SARS-CoV-2, contributing to more rapid and effective responses to the COVID-19 pandemic.

Starting in late December 2019, coronavirus disease 2019 (COVID-19) rapidly transformed into a global pandemic, with more than 30M cases and almost 1M deaths around the world as of September 2020, and with a strongly negative impact on the global economy (1).
This circumstance created a huge demand for antivirals capable of targeting the SARS-CoV-2 RNA genome, and RNA interference (RNAi) approaches (2-4) emerged as a possible solution. Small interfering RNAs (siRNAs) are RNA sequences about 20 nt long that, together with the RNA-Induced Silencing Complex (RISC) (6), bind mRNA molecules of interest (4, 5), inhibiting their translation and expression. RNAi approaches have been employed for SARS-CoV (6-8), with reports of decreased viral levels (9), and recent works claim that they may also work for SARS-CoV-2 (10, 11). The authors of (12) used the Immune Epitope Database and Analysis Resource (IEDB) to find potential regions in diverse coronaviruses with matches to SARS-CoV-2, identifying many of them in SARS-CoV, the closest homolog. Chen et al. (13) apply a window of 3000 nucleotides with a step of 1500 over the reference SARS-CoV-2 genome (MN908947), seeking 1-25nt regions called "free segments". In addition, siRNA databases targeting a broad range of viruses (14-16) have been developed. Recently, researchers developed a SARS-CoV-2 oligonucleotide sequence database to improve SARS-CoV-2 detection and treatment methods, providing sequences with the lowest and highest conservation levels (17).

In this work, we propose a SARS-CoV-2 target database to support siRNA approaches, aiming to speed up RNAi design by providing a set of possible targets and siRNA sequences with the information required for choosing the most appropriate targets for new siRNAs. Unlike the cited databases, which are manually curated, we apply a sliding-window approach to cover the whole SARS-CoV-2 genomic space, extracting every possible siRNA sequence of 18, 19, 20, and 21 nucleotides and enabling researchers to assess solutions capable of targeting any region of the virus.
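The sliding-window extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `sliding_windows` and the short toy sequence are hypothetical, and the real input would be the reference genome sequence.

```python
def sliding_windows(genome, length, step=1):
    """Yield (start, subsequence) for every window of `length` nt.

    `start` is a 0-based offset into `genome`.
    """
    for start in range(0, len(genome) - length + 1, step):
        yield start, genome[start:start + length]


# Toy sequence for illustration only; NOT the SARS-CoV-2 genome.
toy_genome = "AUUAAAGGUUUAUACCUUCCCAGGUAACAAACC"

# One list of candidate target regions per siRNA length (18 to 21 nt).
targets = {
    k: [seq for _, seq in sliding_windows(toy_genome, k)]
    for k in (18, 19, 20, 21)
}

for k, seqs in targets.items():
    print(k, len(seqs))
```

With a step of 1, a sequence of length N yields N - k + 1 windows of length k, so the four tables differ in size by one row per extra nucleotide.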
The database has more than 170 features, including thermodynamic information, base context, target genes, and alignment information against diverse SARS-CoV-2 strains, together with scores and predictions collected from three siRNA efficiency prediction tools. Together, this information enables users to select, with higher confidence, targets that best match a broad set of conditions for designing more efficient siRNAs.

Although siRNA length can vary from 18 to 25 nucleotides (18), synthetic ones should range from 19 to 21 nt (19), according to the ThermoFisher siRNA Design Guidelines. Thus, the proposed database provides information about each possible 18 to 21 nucleotide siRNA target region from SARS-CoV-2, with one table for each length. Moreover, the tools employed for assessing siRNA efficiency (20-22) operate on sequences in that range, which reinforces our choice. Since all tables present the same columns, we explain here the development process only for the 21-length table. The SARS-CoV-2 reference genome was collected from NCBI (code NC_045512), and a sliding window 21 nt long with a step of 1 was used to traverse the genome. Table 1 indicates the total number of sequences obtained for each length. Seven new sequence sets were then generated from the obtained sequence set (called target region), following the aforementioned ThermoFisher guidelines, and suggestions

The proposed database displays a total of 119,526 siRNAs divided into four different sizes ranging from 18 to 21 nucleotides (see Table 1 for the number of siRNAs of each length). As stated, we applied three siRNA efficiency prediction tools to them to assess their inhibition power. Figure 1 illustrates the number of 21nt antisense siRNA sequences predicted as effective by each single predictor, and the quantities predicted by more than one. It can be seen that no siRNA was unanimously considered effective, while approximately 53% of them (15, (Figures 2b-d) .
Finally, it can be observed that while all 18nt and 20nt siRNAs match some regions from MERS, SARS, and H1N1 using at least six mismatches, the number of mismatches increases to seven for 19nt and 21nt siRNAs.

The proposed database is distributed as a spreadsheet file containing four tabs, each one corresponding to target region sequences of a specific length. Here we present how a researcher can use this database with an illustrative example. Suppose a user wants to select siRNAs 21 nucleotides in length. In this case, the user will access the tab "21 bases" in the spreadsheet file. After opening it in a spreadsheet editor, the next step is selecting siRNAs whose properties match the user's requirements. Assume that the user wants a siRNA that has little or no homology with the human genome, can act on as many British SARS-CoV-2 strains as possible, and has AA as its first dinucleotide. This last requirement is met by applying a filter over column P to show only lines with value 1 (see Supplementary Text 1), decreasing the number of siRNA candidates from 29880 to 2858. For little or no homology with the human genome, the number of mismatches against human sequences must be at least three (25). Filtering table to display lines with at least a value of three at columns BO, BP, BQ, CG, CH, CI, can be filtered to display only the three highest values, for example, which reduces candidates from 999 to 10 candidates. Such a reduction not only saves wet-lab testing costs but also ensures that the selected siRNAs meet the main user requirements.

Designing siRNAs is a challenging procedure, because sometimes minor changes in the nucleotide sequence of a siRNA can alter its functionality (26). As reported in (27), the specificity, potency, and efficacy of siRNA-mediated gene silencing can be determined by analyzing the siRNA nucleotide sequence; hence, its inability to bind to unintended regions (off-targets) is an important factor that must be taken into careful consideration.
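The spreadsheet walk-through above (AA start, at least three mismatches against human sequences, then the highest strain coverage) could also be scripted. The following is a minimal sketch, assuming the "21 bases" tab has been exported to a list of records. The field names `starts_with_AA`, `min_human_mismatches`, and `uk_strains_matched` are hypothetical stand-ins for column P, for the smallest value across the human-alignment columns, and for a British-strain coverage column; the toy rows are invented for illustration and do not come from the database.

```python
def select_candidates(rows, min_human_mismatches=3, top_n=3):
    """Apply the three filters of the walk-through to a list of dict rows."""
    # Filters 1 and 2: AA as first dinucleotide, and enough mismatches
    # against human sequences.
    kept = [
        r for r in rows
        if r["starts_with_AA"] == 1
        and r["min_human_mismatches"] >= min_human_mismatches
    ]
    # Filter 3: keep rows whose strain coverage is among the `top_n`
    # highest distinct values (several rows may share a value).
    top_values = sorted({r["uk_strains_matched"] for r in kept}, reverse=True)[:top_n]
    return [r for r in kept if r["uk_strains_matched"] in top_values]


# Invented toy rows (hypothetical field names, see lead-in).
rows = [
    {"id": "t1", "starts_with_AA": 1, "min_human_mismatches": 3, "uk_strains_matched": 120},
    {"id": "t2", "starts_with_AA": 0, "min_human_mismatches": 4, "uk_strains_matched": 130},
    {"id": "t3", "starts_with_AA": 1, "min_human_mismatches": 2, "uk_strains_matched": 125},
    {"id": "t4", "starts_with_AA": 1, "min_human_mismatches": 5, "uk_strains_matched": 118},
    {"id": "t5", "starts_with_AA": 1, "min_human_mismatches": 3, "uk_strains_matched": 90},
    {"id": "t6", "starts_with_AA": 1, "min_human_mismatches": 4, "uk_strains_matched": 120},
    {"id": "t7", "starts_with_AA": 1, "min_human_mismatches": 3, "uk_strains_matched": 60},
]

selected = select_candidates(rows)
print([r["id"] for r in selected])
```

Keeping the three highest distinct values, rather than the three top rows, mirrors a spreadsheet value filter, which is why more than three candidates can remain.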
Therefore, we proposed a SARS-CoV-2 targeted siRNA database with sequence and thermodynamic stability information, to help evaluate important factors related to siRNA efficacy and to optimize the decision process for choosing the best ones as target antiviral solutions. Considering that each laboratory has its own technology context and expertise in designing siRNAs of specific lengths, we provide a list of siRNAs varying from 18 to 21 nucleotides in length, aiming to cover the range of lengths used in the design process. Numerous works have proposed methods and guidelines for choosing the best siRNAs by analysing their sequence characteristics (28-31), for which two broad reviews are available (26, 32). Our proposed database provides information regarding base, GC, and AU context, as well as the counts of each RNA nitrogenous base in the sequences, besides information about the presence of UUUU and GCCA, considered toxic motifs (33). Thus, any user with their own efficacy evaluation method (or one provided in the literature) can easily evaluate siRNAs with this database at their disposal. It also provides thermodynamic information collected by applying three predictors (20-22), enabling users to take a deeper look at siRNA properties and choose the best ones according to their needs. As can be seen in Figure 1, the predictors diverge strongly when classifying a siRNA as efficient or not, which suggests that they should be used in a complementary way.

Due to the genetic diversity and variability of SARS-CoV-2 (34), a siRNA that is highly efficient against one strain may not be when applied to another. Hence, we also provide similarity information with strains from diverse countries, so that users can add geographical specificity and further customization to their decision process.
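To make the sequence-composition features mentioned above concrete (base counts, GC and AU content, and the UUUU and GCCA motifs), here is a small illustrative sketch. The function name `composition_features`, the returned field names, and the example sequence are assumptions made for illustration; they are not the column names or code of the database.

```python
from collections import Counter

# Motifs named in the text as toxic (33).
TOXIC_MOTIFS = ("UUUU", "GCCA")


def composition_features(seq):
    """Return simple composition features for one RNA sequence."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = Counter(seq)
    n = len(seq)
    return {
        "length": n,
        "counts": {base: counts[base] for base in "ACGU"},
        "gc_fraction": (counts["G"] + counts["C"]) / n,
        "au_fraction": (counts["A"] + counts["U"]) / n,
        "starts_with_AA": seq.startswith("AA"),
        "toxic_motifs": [m for m in TOXIC_MOTIFS if m in seq],
    }


# Invented 21-nt example sequence, for illustration only.
feats = composition_features("AAGCCAUUUUGCAUGCAUGCA")
print(feats)
```

A user-defined efficacy rule could then be expressed as a predicate over such a dictionary, for example rejecting any sequence with a non-empty `toxic_motifs` list.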
Ensuring that siRNAs cannot target human sequences (off-targets) is another important requirement, for which a minimum of three mismatches is necessary (25). Thus, similarity information with the human genome and with the coding and non-coding transcriptomes is also available in our database. As shown in the Database Analysis & Statistics section, virtually all 18nt-long siRNAs matched this genome and these transcriptomes with at least three mismatches, corroborating the aforementioned statement from the literature. To the best of our knowledge, this is the first database to feature siRNA similarity information against human coding and non-coding transcriptomes, giving users even more confidence in siRNA specificity.

We hope that, with this database, the development of new target antivirals for SARS-CoV-2 using RNAi technology can be not only eased and accelerated, but also made capable of identifying even more efficient solutions for silencing the virus, contributing to the control of the pandemic. We made the proposed database available as a spreadsheet file given the urgency of providing this information to the scientific community that is developing effective therapeutics for SARS-CoV-2. Additionally, we intend to build a webpage for more user-friendly and interactive access to the data. Moreover, we also intend to replicate the approach employed in this work to explore the genomic space of other viruses, including those that may pose a threat of new pandemic events.

The spreadsheet file containing the database tables is available at http://www.bioinformaticsbrazil.org/siRNAdb/sirnas_cov_db.xlsx.

The header of each column displays the total number of strains from each country that sequences were aligned to. In order to apply it to a huge volume of sequences, we have translated the JavaScript code of the webserver related to the calculation of thermodynamic information into in-house Python (https://www.python.org) scripts.
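The three-mismatch off-target criterion discussed above can be illustrated with a brute-force mismatch count. This is a toy sketch only: it compares a candidate against every same-length window of a reference string, whereas a genome-scale search would rely on a dedicated short-read aligner. The function names and the example sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_mismatches(query, reference):
    """Smallest mismatch count of `query` against any window of `reference`.

    Returns None when the reference is shorter than the query.
    """
    k = len(query)
    if len(reference) < k:
        return None
    return min(
        hamming(query, reference[i:i + k])
        for i in range(len(reference) - k + 1)
    )


def passes_off_target_rule(query, references, min_required=3):
    """True if every reference is at least `min_required` mismatches away."""
    results = (min_mismatches(query, ref) for ref in references)
    return all(m is None or m >= min_required for m in results)


# Invented toy sequences, for illustration only.
query = "ACGUACGU"
print(min_mismatches(query, "GGACGUACGAUU"))
print(passes_off_target_rule(query, ["GGACGUACGAUU"]))
print(passes_off_target_rule(query, ["UUUUUUUUUUUU"]))
```

The cost grows with the product of reference length and query length, so this form is only suitable for small checks or for explaining what a reported mismatch count means.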
Columns FA to FJ contain the efficiency score and thermodynamic information provided by the ThermoComposition21 program (4) for the natural sense sequence, namely, predicted effectiveness, the number of GG dinucleotides present in the sequence, and a set of eight thermodynamic indexes (measured in ΔG) calculated by the tool, which are used for effectiveness prediction. Columns FM to FV replicate this information set for the synthetic sense sequence, and columns FY to GH for the antisense sequence. Column GJ provides the predicted efficiency from the SSD program (5) for the natural sense sequence, and columns GL to GO a set of four thermodynamic indexes (measured in ΔG) calculated by the tool for the natural sense sequence, which are used for effectiveness prediction. Columns GR to GU replicate this information set for the synthetic sense sequence, and columns GX to HA for the antisense sequence. Finally, columns HD to HF contain the efficiency prediction and thermodynamic information provided by the si-shRNA Selector program (6) for the natural sense sequence, namely, predicted efficiency and a set of two thermodynamic indexes (measured in ΔG) calculated by the tool, which are used for effectiveness prediction. Columns HI to HK replicate this information set for the synthetic sense sequence, and columns HN to HP for the antisense sequence.

1. A Review of SARS-CoV-2 and the Ongoing Clinical Trials
2. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans
3. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14
4. A review on current status of antiviral siRNA
5. Oligonucleotide antiviral therapeutics: antisense and RNA interference for highly pathogenic RNA viruses
6. Inhibition of genes expression of SARS coronavirus by synthetic small interfering RNAs
7. Identification of Effective siRNA Blocking the Expression of SARS Viral Envelope E and RDRP Genes
8. Inhibition of severe acute respiratory syndrome virus replication by small interfering RNAs in mammalian cells
9. Using siRNA in prophylactic and therapeutic regimens against SARS coronavirus in Rhesus macaque
10. siRNA could be a potential therapy for COVID-19 (2020)
11. Research and Development on Therapeutic Agents and Vaccines for COVID-19 and Related Human Coronavirus Diseases
12. A Sequence Homology and Bioinformatic Approach Can Predict Candidate Targets for Immune Responses to SARS-CoV-2 (2020)
13. Computational Identification of Small Interfering RNA Targets in SARS-CoV-2
14. VIRsiRNAdb: a curated database of experimentally validated viral siRNA/shRNA
15. PVsiRNAdb: a database for plant exclusive virus-derived small interfering RNAs
16. HIVsirDB: a database of HIV inhibiting siRNAs
17. CoV2ID: Detection and Therapeutics Oligo Database for SARS-CoV-2
18. siRNA, miRNA and HIV: promises and challenges
19. Asymmetric siRNA targeting the bcl-2 gene inhibits the proliferation of cancer cells in vitro and in vivo
20. Computational models with thermodynamic and composition features improve siRNA design
21. Optimization of duplex stability and terminal asymmetry for shRNA design
22. SSD - a free software for designing multimeric mono-, bi- and trivalent shRNAs
23. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
24. OligoCalc: an online oligonucleotide properties calculator
25. Accelerated off-target search algorithm for siRNA
26. Precise and efficient siRNA design: a key point in competent gene silencing
27. siRNA and RNAi optimization (2016)
28. Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate
29. An Effective Method for Selecting siRNA Target Sequences in Mammalian Cells
30. Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference
31. Rational siRNA design for RNA interference
32. A comparison of siRNA efficacy predictors (2004)
33. Off-target effects by siRNA can induce toxic phenotype
34. Genetic variation in SARS-CoV-2 may explain variable severity of COVID-19

1. OligoCalc: an online oligonucleotide properties calculator
2. Predicting DNA duplex stability from the base sequence
3. Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes
4. Computational models with thermodynamic and composition features improve siRNA design
5. SSD - a free software for designing multimeric mono-, bi- and trivalent shRNAs
6. Optimization of duplex stability and terminal asymmetry for shRNA design

We acknowledge the Pró-Reitoria de Pesquisa from Universidade Federal do Rio Grande do Norte and Pró-Reitoria de Pesquisa from Universidade Federal do Pará. We also acknowledge the Bioinformatics Multidisciplinary Environment (BioME)
at UFRN and Bioinformatics Graduate Program, IMD/UFRN for the provision of computational resources. Not applicable.