key: cord-0874890-35vc7s0x authors: Sanchis, Pablo; Lavignolle, Rosario; Abbate, Mercedes; Lage-Vickers, SofĂ­a; Vazquez, Elba; Cotignola, Javier; Bizzotto, Juan; Gueron, Geraldine title: Analysis workflow of publicly available RNA-sequencing datasets date: 2021-06-18 journal: STAR Protoc DOI: 10.1016/j.xpro.2021.100478 sha: dd0149d92c791991def29ecae6ea387d96761acc doc_id: 874890 cord_uid: 35vc7s0x Differential gene expression analysis is widely used to study changes in gene expression profiles between two or more groups of samples (e.g., physiological versus pathological conditions, pre-treatment versus post-treatment, and infected versus non-infected tissues). This protocol aims to identify gene expression changes in a pre-selected set of genes associated with severe acute respiratory syndrome coronavirus 2 viral infection and host cell antiviral response, as well as subsequent gene expression association with phenotypic features using samples deposited in public repositories. For complete details on the use and outcome of this informatics analysis, please refer to Bizzotto et al. (2020). Differential gene expression analysis is widely used to study changes in gene expression profiles between two or more groups of samples (e.g., physiological versus pathological conditions, pre-treatment versus post-treatment, and infected versus non-infected tissues). This protocol aims to identify gene expression changes in a pre-selected set of genes associated with severe acute respiratory syndrome coronavirus 2 viral infection and host cell antiviral response, as well as subsequent gene expression association with phenotypic features using samples deposited in public repositories. For complete details on the use and outcome of this informatics analysis, please refer to Bizzotto et al. (2020) . Timing: 1 h 1. R is a free software environment for statistical computing and graphics. It runs on UNIX, Windows and MacOS. a. To download and install R go to https://www.r-project.org/ (R Core Team, 2013) . The current pipeline was performed using R version 3.6.2. 2. RStudio is an integrated development environment (IDE) for R. It allows to easily execute the R codes, plot graphics, and manage the workspace in a multipanel interphase. a. To download and install RStudio go to https://rstudio.com/products/rstudio/ (RStudio Team, 2020) . Pause point: Since this protocol follows a bioinformatic pipeline exclusively, by saving the executed steps the protocol can be paused at any time. Timing: 1 h 3. Users must first download the required packages (listed in the key resources table). They can be downloaded through Bioconductor, which provides tools for the analysis and comprehension of high-throughput genomic data. BiocManager::install() is the recommended command to install packages (for detailed information on why BiocManager::install() is preferred to the standard R packages installation please read https://www.bioconductor.org/install/#whybiocmanagerinstall): a. Install the packages needed for the analysis. See troubleshooting 1: b. Once all packages are installed, they need to be loaded: Timing: 2 days 4. When using datasets from public repositories, the key step is to identify a dataset (or datasets) that comply with the eligibility criteria and that contains the sample information required for the analysis. a. We suggest browsing Gene Expression Omnibus (GEO: https://www.ncbi.nlm.nih.gov/gds, (Barrett et al., 2012) ) and ArrayExpress (https://www.ebi.ac.uk/arrayexpress/, (Athar et al., 2019) ) repositories because they gather multiple high-throughput genomics datasets. However, there are several public repositories that might be more suitable for other types of studies. These repositories allow to download raw sequencing data (.fastq sequencing files) and/or pre-processed files (tab-delimited.txt files containing matrices with sequence read counts after trimming and alignment to the reference genome). The pre-processed files may contain a raw-counts matrix (non-normalized) or a normalized counts matrix (see below for more details). Sample information is also available to download. Finally, the platform used, and pre-processing algorithm (when data are pre-processed) are specified. We strongly recommend researchers to thoroughly evaluate the type of data submitted, study design, number of samples and any other relevant information that might help to analyze the samples and draw statistically valid conclusions. Troubleshooting 2. b. Our eligibility criteria for (Bizzotto et al., 2020) was: (i) publicly available transcriptome data; ii) detailed sample/patient information; (iii) detailed protocol information; (iv) R 60 samples. We selected the GSE152075 dataset from GEO which contained RNA-seq data from 430 SARS-CoV-2 positive and 54 negative patients (Lieberman et al., 2020) . We downloaded the gene expression aligned data matrix (tab-delimited .txt file with reads pseudo-aligned to the human reference transcriptome). Clinico-pathological information included age, gender, and viral load (expressed as cycle threshold (Ct) by RT-qPCR for the N1 viral gene at time of diagnosis). The interpretation for viral load was as follows: the lower the Ct, the higher the viral load. This phenotypic data can be downloaded directly in RStudio as explained in the ''RNA-seq data organization and counts normalization'' section. Note: detailed information on the usage of the different packages may be found in the links provided in the identifier column. For this bioinformatics analysis we used a laptop with an Intel Core i5 8th generation processor, 8 GB RAM memory and Windows 10. No high-performance computing clusters were needed for the analysis of the data. Internet connection is required for downloading R packages and data matrixes. The flow chart for data processing is included in Figure 1 . REAGENT Download and prepare the data matrix for analysis You can download the experiment information and clinical data directly from GEO using the GEOquery package: 1. The series matrix file is a text file that includes a tab-delimited value-matrix for each sample containing the phenotypic/clinical and experimental data of a given dataset. In the GEO webtool, there is a hyperlink to the series matrix, called ''Series Matrix File(s)''. To download the series matrix file directly to the R environment use the getGEO command: 2. You may now extract the phenotypic/clinical data matrix from the series matrix: 3. Download and save on your computer the raw-counts matrix from GEO website. This matrix is a tab-delimited txt. file containing the counts for every gene aligned from a RNA-seq experiment. After downloading it, load the matrix into RStudio: Note: The sequencing data for GSE152075 was submitted as raw-counts in a separate file from the experimental and clinical data. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE152075; GSE152075_raw_counts_GEO.txt.gz); therefore, it was downloaded separately. For datasets with pre-processed/normalized data, the counts matrix might be included in the series matrix file downloaded in step 1. Troubleshooting 3. Timing: 1 h Before performing differential gene expression analysis, it is required to normalize the read counts if the raw-counts matrix was downloaded. This normalization step allows to compare gene expression (read counts) among samples (Evans et al., 2018) . It is also recommended to correct for batch effect if multiple batches of experiments were performed. If the user's dataset is already normalized, then go directly to step 6. 4. Gene expression normalization: a. Before sample normalization, data should be converted and organized to the format required for further analysis (data format and organization might vary for different packages). Troubleshooting 4. b. Make sure that the grouping variables are factors. We also changed the original names of the columns containing the relevant variables to make them shorter and easier to work with. Note: There are alternative methods to normalize and extract the counts such as rlog and vst that would fit this analysis (Love et al., 2014) . However, this STAR Protocol replicates the bioinformatics analysis described in Bizzotto et al. (Bizzotto et al., 2020) where the command ''normalized=TRUE'' was used. 5. For our study, we converted the continuous variables (e.g., age, viral load) into factor/strata variables (e.g., 10-year age ranges, low/medium/high viral load). The code below shows an example on how to stratify the viral load and age (after duplicating the original variable in order to not overwrite the original data). STAR Protocols 2, 100478, June 18, 2021 7 Note: This is an optional step depending on your variables and analysis. Timing: 1 day Note: As mentioned on Bizzotto et al. (Bizzotto et al., 2020) , some samples were removed from the analysis since they were considered to have low quality read sequencing (>70% genes with 0 counts). Because this might not apply to all protocols, we did not include the code for this filtering in the main manuscript, but it was included in the troubleshooting 5. Therefore, the outcomes for this protocol might slightly differ from the outcome published on Bizzotto et al. (Bizzotto et al., 2020) , but this omission does not change the results and interpretation of the study. This step aims to compare gene expression across different strata. Below we show, as an example, the differential MX1 expression analysis for SARS-CoV-2 positive vs. negative patients. 6. As described in our publication, we selected specific genes that could potentially be linked to SARS-CoV-2 infection. For this section, we selected MX1 as an example to show. Plot MX1 expression for SARS-CoV-2 positive and negative patients (Figure 2A .i) using the following code: CRITICAL: For the y argument you must specify a data frame containing the samples in the rows and the variables (genes) in the columns; therefore, we used the t() argument to transpose the data frame ''norm_counts''. Gene expression is expressed as the log 2 (counts); therefore, and because some samples have 0 counts, it is necessary to add 1 count to all genes for all samples to avoid errors due to log 2 (0). 8. Perform a Wilcoxon test to assess the statistical significance of MX1 expression differences between SARS-CoV-2 positive and negative patients (Figure 2A .ii): Note: Please note that Limma and DESeq2 would be more appropriate statistical packages to run when analyzing whole transcriptomes. In (Bizzotto et al., 2020) , we analyzed a pre-selected set of genes; therefore, the Wilcoxon test can be used to analyze mean differences between groups. 9. In addition, we performed the Jonckheere-Terpstra (Arif et al., 2015) trend test to evaluate gene expression trends across ordered strata (e.g., age stratified by 10-year ranges). As an example, here we show the trend test for MX1 expression across age categories in SARS-CoV-2 negative and positive patients ( Figure 2B .ii, left and right panels, respectively). The continuous age variable was stratified in 10-year ranges as described in''RNA-seq data organization and counts normalization -step 5'' section. Note: In this code the column ''clindata$age_cat'' contains the age stratified by 10-years ranges previously created. Note: to perform this test in SARS-CoV-2 negative patients, then replace ''COVID19'' with ''HEALTHY''. Note: The argument alternative could be ''two.sided'', ''increasing'' or ''decreasing''. Select the best argument for your hypothesis testing. Timing: 1 day The aim of this step is to analyze gene expression correlation in the different categorical variables. We also provide the code to perform a pairwise gene expression correlation including the viral load as a third variable plotted in a color scale. 10. Spearman correlation analysis between gene expression levels: a. Calculate the Spearman coefficients and plot all pairwise correlations. The example below shows the correlation analysis for four genes (MX1, MX2, ACE2 and TMPRSS2) in SARS-CoV-2 positive and negative patients ( Figure 3A ): Note: Any continuous variable (e.g. viral load expressed as Ct) could be included in the analysis of correlation with gene expression (Bizzotto et al., 2020) . 11. Plot gene expression correlation between two genes and include viral load (expressed as Ct) in a color scale ( Figure 3B .i) and calculate the Spearman correlation coefficient ( Figure 3B .ii) Principal-component analysis (PCA) is a dimensionality reduction technique used to increase the interpretability of a given dataset, minimizing information loss and maximizing variance. 12. Principal component analysis based on disease status: a. Calculate the principal component on the log 2 transformed gene expression data. The output is a table as shown in Figure 4A , containing the weight of each gene in the variance of the samples for Principal Component 1 (PC1; the largest component of variance in the data set) and Principal Component 2 (PC2; the second most important component influencing the variance): Note: For PCA we only considered expression of the candidate genes, but you might also include any independent variable you consider to affect the variability among samples. b. Plot PC1 vs. PC2 ( Figure 4B ): 13. 3D graphs showing expression for the selected genes can be plotted as follows. The output is shown in Figure 4C : Plots in Figure 2 depict dot plots of gene expression separating patients according to different phenotypic data, such as disease status ( Figure 2A ) and age ( Figure 2B) , and the output of the ll OPEN ACCESS STAR Protocols 2, 100478, June 18, 2021 statistical tests. The color code for these analyses is: red = SARS-CoV-2 negative patients, and blue = SARS-CoV-2 positive patients. In order to evaluate the association between different parameters, such as gene expression and viral load, we performed pairwise correlation analysis explained in the ''Correlation Analysis'' section. The outcome is a scatter plot and the Spearman coefficient (rho) with the associated P-value for each comparison. Figure 3A shows the pairwise correlation plots and statistics. Figure 3B shows the scatter plot for the correlation between two genes and viral load as color-coded dots. Finally, it is possible to visualize patient segregation by PCA using the selected genes (Bizzotto et al., 2020) as independent variables ( Figures 4A and 4B ). 3D scatter plots can be done. These plots depict gene expression and 95% confidence ellipsoids ( Figure 4C ). Eligibility criteria, statistical tests and software used for this protocol are properly described in the ''before you begin'' and ''step-by-step methods details'' sections. This protocol relies on the accuracy of the clinical records submitted to the public repository by the original authors. For some datasets, RNA-sequencing raw counts data may not be available, and data could have been pre-processed by the original authors, thus limiting the decision making of which sequence quality will comply with the own standards, usage of different alignment software and algorithms, reference genome, etc. However, there are still some quality controls that might be done. As we mentioned in (Bizzotto et al., 2020) , we detected that some samples had 0 counts for most genes (>70% genes), suggesting a very low quality RNA isolation and/or sequencing. Therefore, we were able to remove them from the analysis. We strongly suggest performing all possible quality controls before analyzing the data. It is important that when working in R, and following this protocol, take into consideration the critical statements and notes for each step provided in the method details. When running a piece of code, the output is not the expected (i.e., error message in the console or an unexpected feature in the plot). For the proper execution of the piece of code, the syntax must be accurate, so it is important to check whether there are no syntax mistakes, such as: Forgotten comma. Unclosed bracket, parenthesis, or quotes. Misspelled function or filenames. There is no count matrix available. The only accessible information are the FASTQ files thus requiring additional processing (e.g., alignment to reference transcriptome) which is not described in the current protocol. Since there are already publications explaining this matter further, we suggest visiting the manuals which describe how to process and align sequencing data before performing expression analyses using HISAT2 (Kim et al., 2019) , TopHat2 (Kim et al., 2013) , STAR (Dobin et al., 2013) and Salmon (Patro et al., 2017) . The raw counts for a specific dataset are available as a separate file, but it is not clear for the user whether they are raw counts or pre-processed data. It is important to verify that the data represent raw counts. This can be checked by reading the GEO submission description and the Methods section within the corresponding citation. The phenotypic and count matrices do not contain the same list of samples. It is important to verify that the phenotypic and counts matrices contain the same list of samples in order to merge phenotypic and gene expression data. If not, discard incomplete samples. This can be achieved using the following command, which creates a table only with the samples included in both matrices: There is a great number of genes with 0 counts. Samples considered to have low quality read sequencing should be removed from the analysis since they may introduce noise to the results. The following code is an example of how this can be achieved in RStudio when samples in the dataset have >70% genes with 0 counts: Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Geraldine Gueron, ggueron@iquibicen.fcen.uba.ar. This study did not generate new unique reagents. The code generated during this study is available at GitHub repository accessible using the following link: https://github.com/lab-inflamacionycancer/STAR-protocol-Sanchis.et.al. Non-parametric test for ordered medians: the jonckheere terpstra test ArrayExpress update -from bulk to single-cell expression data NCBI GEO: archive for functional genomics data sets-update SARS-CoV-2 infection boosts MX1 antiviral effector in COVID-19 patients GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor STAR: ultrafast universal RNA-seq aligner Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions 17 factoextra: Extract and Visualize the Results of Multivariate Data Analyses Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 canvasXpress: Visualization Package for CanvasXpress in R Salmon provides fast and biasaware quantification of transcript expression R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing) RStudio: Integrated Development for R. RStudio clinfun: Clinical Trial Design and Data Analysis Functions (R package) ggplot2: Elegant Graphics for Data Analysis GGally: Extension to ''ggplot2 This work was supported by grants from AGENCIA-PICT-2016-0056 (Argentina) and AGENCIA-PICT-RAICES-2018-02639 (Argentina). The authors declare no competing interests.