key: cord-0877976-dhh73mtt
authors: Macedo-da-Silva, Janaina; Coutinho, João Victor Paccini; Rosa-Fernandes, Livia; Marie, Suely Kazue Nagahashi; Palmisano, Giuseppe
title: Exploring COVID-19 pathogenesis on command-line: A bioinformatics pipeline for handling and integrating omics data
date: 2022-05-12
journal: Adv Protein Chem Struct Biol
DOI: 10.1016/bs.apcsb.2022.04.002
sha: 9ec4ef1e7101f666164195263601784a3b6a0aeb
doc_id: 877976
cord_uid: dhh73mtt

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in late 2019 in Wuhan, China, and has proven to be highly pathogenic, making it a global public health threat. The immediate need to understand the mechanisms and impact of the virus made omics techniques stand out, as they can offer a holistic and comprehensive view of thousands of molecules in a single experiment. Mastering bioinformatics tools to process, analyze, integrate, and interpret omics data is a powerful knowledge to enrich results. We present a robust and open access computational pipeline for extracting information from quantitative proteomics and transcriptomics public data. We present the entire pipeline from raw data to differentially expressed genes. We explore processes and pathways related to mapped transcripts and proteins. A pipeline is presented to integrate and compare proteomics and transcriptomics data using also packages available in the Bioconductor and providing the codes used. Cholesterol metabolism, immune system activity, ECM, and proteasomal degradation pathways increased in infected patients. Leukocyte activation profile was overrepresented in both proteomics and transcriptomics data. Finally, we found a panel of proteins and transcripts regulated in the same direction in the lung transcriptome and plasma proteome that distinguish healthy and infected individuals. This panel of markers was confirmed in another cohort of patients, thus validating the robustness and functionality of the tools presented.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in late 2019 in Wuhan, China, and has proven to be highly pathogenic, making it a global public health threat (Mahalmani et al., 2020) . There are a variety of manifestations, from asymptomatic to severe cases, that include symptoms such as muscle damage, prolonged tiredness, shortness of breath, loss of taste/smell, and severe pneumonia (Dixon et al., 2021; Nalbandian et al., 2021) . Cumulative reports also associate post-acute effects of the infection, including pulmonary, hematological, cardiovascular, neuronal, renal, endocrine, and gastrointestinal sequelae (Hayes, Ingram, & Sculthorpe, 2021; Nalbandian et al., 2021) . The quest to understand the infectious mechanisms of SARS-CoV-2, diagnostic alternatives, and treatments have been approached by several techniques (Das, Ahmed, Akhtar, Begum, & Banu, 2021) . Among them, omics sciences generate large amounts of information to enable holistic understanding of cell, tissue or organism function and reaction against a disease. These are highperformance techniques able to explore an organism at the level of genes (genomics), proteins (proteomics), metabolites (metabolomics), and lipids (lipidomics) . Integration of this vast information provides a comprehensive and powerful tool for exploring the infection mechanism of SARS-CoV-2 (Overmyer et al., 2021; Wu et al., 2021) .

In the last 10 years, with the advancement of technologies and processing capacity, proteomics and transcriptomics were readily implemented in several laboratories (Martens & Vizcaíno, 2017) , while metabolomics and lipidomics were integrated along the years (Wenk, 2005) . The more sophisticated data becomes, the greater the demand for new computational methods to deal with them. Currently, numerous tools can assess protein-protein interaction, calculate correlation between expression of different genes, identify enriched pathways and biological processes, and thus, drive data to find the most relevant points (Mangul et al., 2019; Mishra et al., 2021) .

Omics sciences combined with bioinformatics technologies, in particular, contributed to the identification of therapeutic targets (Bojkova et al., 2020; Li et al., 2021) ; to the elucidation of virus pathophysiological mechanisms (Desai et al., 2020; Rosa-Fernandes et al., 2019; Wu et al., 2020) ; pathogen-host interactions (Terracciano et al., 2021) ; detection of patients predisposed to manifestations of severe symptoms , selection of prognostic markers Terracciano et al., 2021) and to determine differential gene/protein/metabolite expression profiles (Desai et al., 2020; Leng et al., 2020; Wu et al., 2020) . In addition to these fundamental applications, omics tools are also helpful in stratifying clinical manifestations in viral infections (Macedoda-Silva et al., 2020) . Furthermore, integrating data from different biological matrices may offer a comprehensive overview of disease pathogenesis, helping in prioritizing biomarkers and therapeutic targets Zhu et al., 2022) . Storing, analyzing, and interpreting these data can be a big challenge for researchers. However, it is a worthwhile task to overcome as omics approach enable answering multiple biological questions.

Notably the generated raw data in previous studies are cumulatively stored and accessible in public repositories, such as PRIDE (Vizcaíno et al., 2013) and SRA (Kodama et al., 2012) , allowing reanalysis by different research groups. Therefore, mastering bioinformatics tools to process, analyze, integrate, and interpret these omic data is a powerful knowledge to build and validate hypotheses. The objective of this study is to present bioinformatics tools that can be used to reanalyze transcriptomic and proteomic data deposited in public databases and integrate them to increase the understanding of the pathophysiology of SARS-CoV-2.

Four datasets, including proteomics and transcriptomics, were selected from public repositories. Shu et al. (PXD019106) evaluated plasma samples from patients infected with SARS-CoV-2, subdivided into fatal (FT, n ¼ 5), severe (SV, n ¼ 7), and moderate (MD, n ¼ 10) groups by proteomics. In addition, the study included samples from healthy patients (HE, n ¼ 8) who tested negative on throat swab and serological tests (Shu et al., 2020) . determined the plasma proteome profile of 50 individuals with mild and moderate disease. Collections were performed in the first 24 h (D0) after the positive PCR test and after 14 days (D14) with negative PCR result (Zhong et al., 2021) . Kwan et al. (PRJNA692253) conducted RNA sequencing analyses of whole blood samples from 64 patients, of whom 45 tested positive for COVID-19 (INF) and 19 healthy participants not exposed to the virus (CTRL) (Kwan et al., 2021) . Wu et al. (PRJNA646224) explored the lung and colon tissue transcriptome of patients who died due to COVID-19 (INF, n ¼ 8) and healthy counterparts were obtained from cancer patients who underwent surgical resection or biopsy (CTRL) .

The analyzes presented here were performed on an Ubuntu 20.04.3 LTS system (https://www.ubuntu.com/), with a total memory of 62GB, 40 CPUs, 2 threads per core, 10 cores per socket, and 2 sockets. The local storage of raw files required a large amount of space. However, the size of the files varied substantially as it depended on the quality of the data and equipment used.

In this protocol, we did not address the processing of raw files from proteomic approaches. The protein quantification tables provided in the selected manuscripts were downloaded and analyzed. The detailed pipelines for processing proteomic raw data are available in Kong, Leprevost, Avtonomov, Mellacheruvu, and Nesvizhskii (2017) , Carvalho et al. (2016) , Guangcan et al. (2021) , and Tyanova, Temu, and Cox (2016) .

For transcriptomic analysis, the fastq raw files were downloaded through the SRA Explorer platform (https://sra-explorer.info/) with the access numbers PRJNA692253 and PRJNA646224. Information regarding each file, such as SRA Accession, instrument, and total bases (Mb) were accessed. The SRA repository stores raw high-throughput sequencing data from different search fields. SRA files consist of raw data and require quality scores per basis for the submitted data. Tools like the sratoolkit download sequencing data from this database. Fastq files are raw data that have a definition line (defline) that contains a read identifier and nucleotide base information, all in text form.

After selection, all files from PRJNA692253 were downloaded by clicking on "Add to collection." A total of 64 SRA files were added. The platform offers the option of downloading .fastq files directly, but also .sra files and metadata. In the "FastQ Downloads" tab, the corresponding URL for all files could be downloaded. Kwan et al. used paired-end sequencing, resulting in 128 fastq files. Accession number PRJNA646224 provides the possibility to download colon and lung transcriptome data. Here, we downloaded only data referring to the lung (CTRL and INF), resulting in a total of 19 fastq files since Wu et al. used single-end sequencing. After creating a directory named "fastq_files" in the terminal console, files were downloaded using the "wget" command.

FastQC is a highly recommended software for checking the quality of raw data coming from high-throughput sequencing pipelines. To download the software via the terminal console, "wget" command must be used.

The download is also possible by using directly the web browser from the Brabraham Institut website (https://www.bioinformatics.babraham.ac.uk/ projects/fastqc/). After unzipping the file, turned the tool executable and added FastQC to the path. A directory to allocate FastQC outputs files was created and then the software was executed.

At the end of the process, two output files will be created for each fastq file. By opening the generated .html file, the user can access the sample quality control report. The first module presents the results of basic statistics, including Total Sequences, Sequences flagged as poor quality, Sequence length, and %GC. Below is a BoxWhisker chart, which on the y-axis indicates the quality scores and on the x-axis the position in the read. The background indicates the quality of the sequences: green (very good), orange (fair), and red (poor). The "Per sequence quality scores" graph shows the average quality score on the x-axis and the number of sequences with that average on the y-axis. Generally, most samples' readings are expected to have a high-quality score. The parameter "Per base sequence content" reveals the proportion of each position of a base. If the difference between A and T or G and C bases is greater than 10%, the software will report a warning. If the difference is greater than 20%, this module will fail. In "Per sequence GC content," the GC content measurement is reported. A normal distribution profile is expected. In "Overrepresented Sequences" frequent sequences are reported, indicating that it has some biological relevance or that the library is not diverse or contaminated.

Trim Galore! is a platform that performs sample quality control by efficiently removing poor-quality parts of the readings and trimming the adapter. This tool requires the previous installation of cutadapt and FastQC tools. The step of removing adapter sequences may or not be adopted in a transcriptome data analysis pipeline. In the selected datasets, Wu et al. chose to apply Trim Galore! while Kwan et al. started the sequence alignment right after running FastQC. In this pipeline, we applied the tool only to the dataset from Wu et al. study, based on the obtained FastQC report. The command used followed the structure for single-end sequencing data.

In the above command, the "-j" option indicates the number of cores to be used for trimming; "-e" corresponds to the maximum allowed error rate; "-q" allows the trimming of low-quality ends from reads in addition to adapter removal; "--stringency" points out the overlap with adapter sequence required to trim a sequence; the "-O" option creates a directory to save the output files. The command "-a" is used to add the adapter sequence. The Trim Galore! can automatically detect whether the Illumina universal, Nextera transposase, or Illumina small RNA adapter sequence was used. However, it is possible to add the adapter sequence manually. When the command is missing, the software auto-detects the adapter.

Before starting the sequence alignment, it is necessary to download the reference genome. As indicated in the original manuscripts, Wu et al. used the h19 genome and Kwan et al. used the GRCh38 genome. The downloaded file will be allocated in the created directory. Then the HISAT2 sequence alignment software will be downloaded, unzipped, and added to the path. The genome will be indexed using the "hisat2-build" script, allowing HISAT2 to perform the read alignment faster in the next step. At the end of the indexing, eight .ht2 files will be created.

Sequence alignment was performed with HISAT2 software (Kim, Paggi, Park, Bennett, & Salzberg, 2019) . Other programs are available for free, such as STAR (Dobin et al., 2013) , Bowtie2 (Langmead & Salzberg, 2012) , and TopHat2 (Kim et al., 2013) . A comparison between the methods was recently reported and can be accessed at Schaarschmidt, Fischer, Zuther, and Hincha (2020) and Musich, Cadle-Davidson, and Osier (2021) .

The "-p" command indicates the number of threads to perform the alignment; "-x" is used to enter the path of previously generated reference files. It is important not to add the file extension (e.g., .ht2). The "-U" option is used to show the path of unpaired files. To perform paired-end sequencing alignment, the synthesis used is "-1" and "-2" to assign files _1 and _2. The "-S" command designates the .sam file that will be generated. SAM files store aligned sequences and take up much space, so these files can be converted to compressed version .BAM by the samtools software.

Samtools is a set of programs for interacting with high-throughput sequencing data. It is helpful for converting SAM, BAM and CRAM files. One of the most used commands is the "samtools view," which takes .BAM/.SAM files as input and converts them to .SAM/.BAM, respectively. The "-S" and "-b" commands are used. The alignment of fastq files occurs in random order with the position in the reference genome. Therefore, in "samtools sort," the BAM files sorting is performed. The "-o" command indicates the output file.

FeatureCounts is a platform that performs reading counting, also called reading summarization, by gene (Liao, Smyth, & Shi, 2014) . The htseq-count tool (Anders, Pyl, & Huber, 2015) , and featureCounts, are widely used for this purpose. Both share the same file input format (BAM or SAM) and need an annotation file that includes the chromosomal coordinates of features. Among the mandatory arguments required to run the software are "-a" and "-o," which specifies the name/path of an annotation file and the name of the output file including read counts. Among the optional arguments, we used the "-T" command to indicate the number of threads to be used in the analysis; the "-t" option specifies the feature type in GTF annotation, and "-g" specifies the attribute type in the GTF annotation.

After obtaining the read counts for each sample and downloading the quantitative proteomics tables, a set of R packages were used to handle the data and identify the relevant biological findings. The packages used are: limma, Glimma, edgeR, Homo.sapiens, metboAnalystR, Enhancedvolcano, combiroc, pROC, clusterProfiler, flashClust, uwot, NbClust, richplot, ggplot2, WCGNA, gprofiler2, ggvenn, DOSE, wordcloud, FactoMineR, factoextra, ggstatsplot, among others. The codes used to generate the data and figures will be available in Supplementary File 1 in the online version at https://doi.org/10.1016/bs.apcsb.2022.04.002.

The gene expression count files for each sample mapped by the transcriptomics pipeline were gathered in a single directory and, with the "readDGE" function available in the edgeR package, they were merged into a single object DGEList class. The raw counts were transformed into counts per million (CPM) so that the size of the libraries was taken into account and then scaled to log (log CPM). Most samples genes that did not have sufficiently large counts were removed with the "filterByExpr" function. The resulting matrix was normalized by a trimmed mean of M-values (TMM) (Law et al., 2016) .

The differentially expressed genes and proteins were determined by applying the standard analysis pipeline of the limma package, with p-value correction by the Benjamini-Hochberg method (q-value). The Homo.sapiens package performed the transformation of ENSMBL ID into Gene symbol. Interactive MAplots were plotted with the Glimma package. The table resulting from the analysis of differentially regulated genes/proteins was used as input to visualize of volcano plots by Enhancedvolcano package. The x and y values were the columns logFC and adjusted p-value, respectively. Principal component analysis (PCA) was conducted by applying the FactoMineR package using the "PCA" function. The "fviz_pca_ind()" function, available in the factoextra package, was applied to visualize the result. Samples were identified as outliers using the MetaboAnalystR package, with the "PlotRF.Outlier" function. The input data used were filtered and normalized matrices. Venn diagrams were built with the ggvenn package. The gprofiler2 package was used to perform gene ontology (GO). The "gost" function was applied to access functional enrichments of gene lists by applying a threshold of q-value < 0.05 and "gostplot" to visualize the results, with the capped and interactive options set to TRUE. Only ontologies with 5 or more proteins/genes were considered. The wordcloud package was used to build the word cloud based on the name frequency of ontologies identified by gprofiler2. Bubble plots were built with the ggplot package. The x-axis shows the gene count in each ontology shown on the y-axis. The size of the bubbles indicates the -log of the q-value and the color indicates the ontology category. To access pathways related to differentially expressed genes/proteins, the clusterProfiler and enrichplot packages were used. The "GSA" function with the options minGSSize set to 5 and maxGSSize to 100. The p-value correction method was indicated as "BH" (Benjamini-Hochberg). The regulated pathways network was built with the "cnetplot" function.

Data integration was performed by applying the "merge" function to select Gene symbols common between lung transcriptomics and plasma proteomics normalized tables. The "boxplot" function was applied to verify data distribution. Then the "removeBatchEffect" function, available in limma, was applied. The data were then normalized by the "scale" function (z-score). The "umap" function, available in the uwot package, was used to produce a low-dimensional embedding that summarizes the overall structure of high-dimensional data. The parameters adopted were min_dist -¼0.01, n_neighbors ¼ 20, n_epochs ¼ 10,000, verbose ¼ 2 and spread ¼ 2. The NbClust package was applied to determine the best number of clusters. Then the data were clustered by the "eclust" function. The cluster number was set to k ¼ 8. The "silhouette" function, available in the cluster package, was applied to assess the clustering quality. Genes and proteins from the same cluster, which have logFC > j 1 j (same direction) and q-value < 0.05 were selected and applied to construct ROC curves based on data from other datasets. The WGCNA package was applied to the expression data from Wu et al. and Kwan et al. The intersection of the ENSEMBL IDs mapped in the two datasets was performed by the "intersect" function. Correlation and connectivity between two datasets were determined using the softPower parameter set to 5. The number of modules was set to 4 and minClusterSize to 30. The "combiroc" package was used to determine the best marker combinations, and the "pROC" package was applied to plot and build the ROC curves.

Given the increasing availability of public omics data, we present a robust and open access pipeline for extracting insights from quantitative biomolecular data (Fig. 1) . Initially, raw files generated by RNA sequencing were downloaded and processed. The proteomics and transcriptomics data were then analyzed to find differently regulated features between the groups. In addition, outlier samples, biological processes, cellular components, molecular functions, and pathways of interest were identified. Transcriptomics data from different matrices were correlated, and proteomics and transcriptomics data from different biological samples were fitted and integrated. Finally, we evaluated through ROC curves the ability to separate samples from infected and healthy individuals based on the panel of relevant proteins and transcripts. Proteins and transcripts with better resolution between the clinical conditions were prioritized and discussed.

After filtering the original data provided by the authors, the PCA plot demonstrated that the plasma proteome of infected patients presented a different pattern than that of healthy patients (HE). The groups SV (severe) and FT (fatal) showed a very similar pattern, which differed from the MD (moderate) group ( Fig. 2A) . The protein profile of plasma from the same patient collected on day 0 (D0) of infection (PCR positive) and day Fig. 1 Computational workflow and tools adopted to explore quantitative proteomics and transcriptomics data from plasma, whole blood, and lung of healthy individuals and patients infected with SARS-CoV-2. 14 (D14) (PCR negative) showed similarity in the majority of cases, however few presented distinct recovery pattern (Fig. 2B) . A total of 453 proteins were differentially regulated in the comparison between FT and HE (Fig. 2D) ; 424 between SV and HE (Fig. 2E) , and 403 between MD and HE (Fig. 2F) . Among patients in groups D0 and D14, 421 regulated proteins were identified (Fig. 2G ). Comparing the proteins upregulated and downregulated in all comparisons ( Fig. 2H and I), 6 proteins (C1QA, CD93, FLT4, INHBC, ENPP2, and NID1) are identified upregulated in the infected groups, indicating a possible role in the SARS-CoV-2 pathology (Fig. 3) .

The boxplot plots (Fig. 3) showed that in the three degrees of infectious disease severity (FT, SV, MD) the expression level of the six proteins were higher than the healthy group (HE). The intra-patient comparisons (Fig. 3B) showed that after 14 days (D14) of disease onset (negative PCR), the levels of these proteins were lower than at the time of diagnosis (D0).

Enriched GO terms were associated to the dysregulation of immune system, proteasome, cytoskeleton, cell adhesion, and lipoprotein metabolism (Fig. 4A ). Upregulated proteins in the infected groups (FT, SV, MD, and D0) are related to the activation of the immune system, especially with the response mediated by leukocytes. Cytoskeleton dynamics and proteasomal degradation were also upregulated (Fig. 4B) . Processes related to lipid metabolism and cell adhesion and migration are downregulated (Fig. 4C ). Pathway analysis was performed using differentially regulated proteins and confirmed findings from GO analysis ( Fig. 4D and E) . In fact, downregulated proteins in the infected groups (FT, SV, MD, and D0) are related to cholesterol and lipoprotein metabolism. The ECM-proteoglycans pathway and cell adhesion molecules were also associated with downregulated features. On the other hand, the pathways of infection, immune system, apoptosis, and extracellular matrix (ECM) organization are upregulated.

PCA analysis based on blood transcriptome from CTRL and INF patients with different degrees of symptoms did not show a clear separation (Fig. 5A) . On the other hand, analysis of lung tissue from patients with fatal COVID-19 (INF) and healthy counterparts (CTRL) showed separation between groups (Fig. 5B) . A total of 1039 genes were identified differentially expressed in the blood transcriptome (Fig. 5C ) and 1034 in the lung tissue analysis (Fig. 5D ) (logFc j1 j and q-value < 0.05). A higher number of enriched GO terms (q value < 0.05) were found in the lung tissue compared to the blood transcriptome giving more information about the COVID-19 pathophysiology. Immune response mediated by leukocytes, apoptotic processes, collagen degradation, and viral life cycle were among the upregulated enriched GO terms corroborating the findings of plasma proteomics. On the other hand, there was an enrichment associated with misfolded proteins that were not identified in plasma (Fig. 5E ). Ontologies related to downregulated genes showed greater differences with respect to proteomic findings, although processes such as cell adhesion are common to both. Terms related to cell development and differentiation, especially of the brain components, are highlighted (Fig. 5F ). Processes related to lipid and cholesterol metabolism were not identified.

Weighted Gene Coexpression Network Analysis (WGCNA) is a popular method applied to identify correlation patterns between genes in samples. Integration of the 5000 common genes between lung and blood transcriptome did not show high modulus preservation between studies (Fig. 6A) . However, the green (Fig. 6B) and turquoise (Fig. 6C ) modules have 5 < Z < 10, indicating moderate preservation. By selecting and working with the 400 key genes (labeled on the basis of kME) differentially expressed in the assessed groups of the turquoise and green modules, we identified enriched biological processes and pathways. The green module has genes related to leukocyte activation and proteasome activity, which was also identified in the turquoise module (Fig. 6D) . However, it was also possible to verify processes linked to the virus, including proteins associated with spike viral glycoprotein maturation and deregulation of host-glycosylation (Fig. 6E) , a key process during infection. The clustering of 421 features common to lung transcriptome and plasma proteome data from healthy (CTRL and HE) and fatal (INF and FT) COVID-19 patients resulted in 8 clusters ( Fig. 7A and B ). The silhouette plot indicates genes possibly associated with the wrong cluster (n ¼ 27) (Fig. 7C) . These genes were disregarded in the enrichment analyses. Clusters 3, 4, 6, 7, and 8 show the highest correlation between the LogFC of the proteome and transcriptome (Fig. 7D) . Considering only genes regulated in the same direction in both matrices, we identified enriched biological pathways and processes. Cluster 1 highlights terms associated with ECM; cluster 2, terms linked to the immune system and enzymatic activity; cluster 3, cytoskeletal activities; cluster 4, lipid metabolism; cluster 5, immune system and leukocyte activation; cluster 6, Fig. 6 Weighted gene co-expression network analysis (WGCNA). (A) Z-score for modules preserved between lung transcriptome and whole blood; (B-C) Scatter plot indicating the LogFC of the 400 genes comprising the green and turquoise modules (Z > 5), respectively, selected by the topGenesKME function; (D-E) Network indicating differentially expressed proteins from the turquoise and green modules. The donut plot indicates enriched processes and pathways (q-value < 0.05). proteasomal degradation, immune system, and enzyme activity; cluster 7, complement system and cluster 8, glucose metabolism (Fig. 7E) .

The transcripts and proteins differentially regulated with a logFC > j 1j in the same direction (upregulation or downregulation) in both the proteome and the transcriptome were selected for application of the ROC curve (Fig. 8) . The targets with the highest specificity and sensitivity were selected based in the integrated expression dataset (plasma proteome and lung transcriptome) (Fig. 8A) . Additionally, the predictive power of these targets to stratify infected and healthy conditions was tested in another cohort (plasma proteome of D0 and D14 and whole blood transcriptome from INF and CTRL).

Three biomolecules gave the best AUC: TIMP1, a natural inhibitor of the matrix metalloproteinases involved in extracellular matrix remodeling, cell proliferation and regulated in response to cytokines; TNXB, tenascin XB, an extracellular matrix glycoprotein involved in wound healing, and PTPRS, protein tyrosine phosphatase receptor type S, the cell surface receptor that binds to glycosaminoglycans and involved in cell-cell interaction and cell growth and differentiation (Fig. 8B) . The combination of the three showed an AUC greater than 0.8. At the gene level, the same combination resulted in AUC higher than 0.9 (Fig. 8C ). Other combinations that showed AUC values close to 1 were PTPRS and TIMP1; and PTPRS and SPP1, a sialoprotein related to lymphocytic activation ( Fig. 8D and E).

To explore and understand the mechanisms of SARS-CoV-2 infection, the virus responsible for the death of more than 5 million people worldwide (https://covid19.who.int/), we present here an optimized pipeline for omics data analysis (Fig. 1) , as well as extract meaningful information from high-throughput techniques. Dealing with codes in command line may not be a common expertise; however, being able to access and interpret public data is of great advantage for performing in silico validations and designing experiments in the wet lab to test hypotheses. We present the application of computational tools to analyze raw transcriptomics data and funnel the findings to access key disease processes. In addition, we integrated quantitative proteomics data to improve our understanding of COVID-19.

The rapid need to understand the mechanisms and impacts of SARS-CoV-2 makes omics strategies stand out as they can offer a holistic and comprehensive view of thousands of molecules (Colinge & Bennett, 2007; Haas, Muralidharan, Krogan, Kaake, & H€ uttenhain, 2021) . The ability to study alterations of a large number of proteins and transcripts in a single experiment is attractive (Al-Amrani, Al-Jabri, Al-Zaabi, Alshekaili, & Al-Khabori, 2021). Furthermore, accessing the modulation of molecules based on a clinical status can answer many questions about a given disease (Amiri-Dashatan, Koushki, Abbaszadeh, Rostami-Nejad, & Rezaei-Tavirani, 2018) . However, to access these countless possibilities offered by omics, it is necessary to conduct information processing to reduce noisy data and focus only on what is relevant. Computational and bioinformatic analyses complement the sample preparation and data acquisition steps allowing a comprehensive visualization of dysregulated pathways (Cannataro, 2007) .

The mechanisms altered by SARS-CoV-2 during infection can be accessed through the application of shotgun proteomics (Badua, Baldo, & Medina, 2021; Rais, Fu, & Drabovich, 2021) . In addition, this approach can provide the quantification of viral proteins in complex clinical samples (Bojkova et al., 2020; Lach en-Montes, Corrales, Fernández-Irigoyen, & Santamaría, 2020; Lazari et al., 2021) . Reports of sequelae resulting from viral infection are increasingly frequent. Chen et al. (2021) applied a proteomic approach monitoring the health status of patients who developed COVID-19. The study showed changes in proteins related to cholesterol metabolism and heart disease. Our in silico analyses resulting from the application of the deft pipeline detailed here showed an alteration in processes linked to lipoproteins and cholesterol. These results are in agreement with changes in the HDL proteome associated with COVID-19 complications (Souza Junior et al., 2021) . The data obtained in this manuscript show that the plasma proteome reflects the immunological status of infected patients, since pathways and proteins were identified as increased in these groups. The intra-patient comparison at different time points showed there is an activation of the immune system, especially of pathways linked to leukocytes in the first 24 h. However, these responses are diminished after 14 days. Other studies also reported the ability of the plasma and serum proteome to reflect the action of the immune system in the infection (Arthur et al., 2021; Kumar, 2021; Villar et al., 2021; Zhong et al., 2021) . Among the immune system proteins that stood out were C1QA, which is crucial in the activation of the classical pathway of the complement system (Ghebrehiwet, Hosszu, Valentino, & Peerschke, 2012; Kouser et al., 2015; Nayak, Pednekar, Reid, & Kishore, 2012) , and CD93, involved in the regulation of phagocytosis of apoptotic cells in vivo. Transcriptomics approaches have also been extensively applied to study SARS-CoV-2 infection (Butler et al., 2021; Chakraborty, Sharma, Bhattacharya, Zayed, & Lee, 2021; Islam et al., 2021; O'Donnell et al., 2021; Sun et al., 2020; Wong et al., 2021) . We identified regulated pathways linked to misfolded proteins in the lung of infected patients, as previously reported . Moreover, we identified an increase in the enzymes involved ECM remodeling, extravasation of intracellular contents and activation of the immune system (Leeming et al., 2021; Leng et al., 2020) .

Leukocyte activation was recurrent in our analyses of the integrated proteomic and transcriptomic data (Alon et al., 2021; Coradi & Vieira, 2021) . Regarding our data integration approach, we saw that there is little correlation between transcripts mapped in lung and whole blood transcriptome. However, looking at moderately conserved transcripts, we found processes linked to leukocyte activation and host glycosylation. Our integrated data also mapped transcripts and proteins linked to proteasome activity in the infected groups. In fact, other groups have identified the importance of the proteasome in COVID-19 from the application of other techniques (Chatterjee et al., 2020; Longhitano et al., 2020; Wang et al., 2021) .

We conducted integrated omic data analysis with focus on open tools to analyze and interpret this data type. We presented the entire pipeline for handling from raw files to differentially expressed features using the FastQC, Trim Galore!, HISAT2, samtools, featureCounts and R softwares. We also looked at useful tools to explore processes and pathways related to mapped transcripts and proteins. An important point was establishing a pipeline to integrate proteomics and transcriptomics data, since adjusting these data to make them comparable can be a big challenge. We also consistently reported applying various packages available in the bioconductor, providing the codes used. Regarding the biological findings, we evidenced increase in the cholesterol metabolism, immune system activity, ECM and proteasome degradation increased in infected patients. We also noticed a leukocyte activation profile in both proteomics and transcriptomics data. Finally, we identified a panel of proteins and transcripts that are regulated in the same direction in the lung transcriptome and plasma proteome that have a great ability to distinguish between healthy and infected groups. This panel of markers was applied to another cohort of patients and showed good results, corroborating the robustness and usefulness of the computational tools presented in this manuscript.

(J.M.D.S.)); by the Conselho Nacional de Desenvolvimento Científico e Tecnológico ("Bolsa de Produtividade" (S.K.N.M. and G.P.)); by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (process n°88882.463201/2019-01 (L.R.F.)).

The codes used in this manuscript were made available in the Supplementary File 1 in the online version at https://doi.org/10.1016/bs.apcsb.2022.04.002. The input and output data are available in Supplementary File 2 in the online version at https://doi.org/10.1016/bs. apcsb.2022.04.002.

Proteomics: Concepts and applications in human medicine

Leukocyte trafficking to the lungs and beyond: Lessons from influenza for COVID-19

Proteomics applications in health: Biomarker and drug discovery and food industry

HTSeq-A Python framework to work with high-throughput sequencing data

Cellular and plasma proteomic determinants of COVID-19 and non-COVID-19 pulmonary diseases relative to healthy aging

Genomic and proteomic mutation landscapes of SARS-CoV-2

Proteomics of SARS-CoV-2-infected host cells reveals therapy targets

Shotgun transcriptome, spatial omics, and isothermal profiling of SARS-CoV-2 infection reveals unique host responses, viral diversification, and drug interactions

Computational proteomics: Management and analysis of proteomics data

Integrated analysis of shotgun proteomic data with PatternLab for proteomics 4.0

Understanding gene expression and transcriptome profiling of COVID-19: An initiative towards the mapping of protective immunity genes against SARS-CoV-2 infection

Targeted intracellular degradation of SARS-CoV-2 via computationally optimized peptide fusions

Proteomic analysis identifies prolonged disturbances in pathways related to cholesterol metabolism and myocardium function in the COVID-19 recovery stage

Introduction to computational proteomics

Alterações leucocitárias em pacientes com COVID-19 observadas em extensão de sangue perif erico

An overview of basic molecular biology of SARS-CoV-2 and current COVID-19 prevention strategies

Temporal and spatial heterogeneity of host response to SARS-CoV-2 pulmonary infection

Symptoms and symptom clusters associated with SARS-CoV-2 infection in community-based populations: Results from a statewide epidemiological study

STAR: Ultrafast universal RNA-seq aligner

The C1q family of proteins: Insights into the emerging non-traditional functions

How to use open-pFind in deep proteomics data analysis?-A protocol for rigorous identification and quantitation of peptides and proteins from mass spectrometry data

Proteomic approaches to study SARS-CoV-2 biology and COVID-19 pathology

More than 100 persistent symptoms of SARS-CoV-2 (long COVID): A scoping review

Transcriptome of nasopharyngeal samples from COVID-19 patients and a comparative analysis with other SARS-CoV-2 infection models reveal disparate host responses against SARS-CoV-2

Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

The sequence read archive: Explosive growth of sequencing data

MSFragger: Ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics

Emerging and novel functions of complement protein C1q

Can proteomics-based approaches further help COVID-19 prevention and therapy? Expert Review of Proteomics

A blood RNA transcriptome signature for COVID-19

Proteomics insights into the molecular basis of SARS-CoV-2 infection: What we can learn from the human olfactory axis

Fast gapped-read alignment with Bowtie 2

RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000 Research

Prognostic accuracy of MALDI-TOF mass spectrometric analysis of plasma in COVID-19

Can biomarkers of extracellular matrix remodelling and wound healing be used to identify high risk patients infected with SARS-CoV-2?: Lessons learned from pulmonary fibrosis

Pathological features of COVID-19-associated lung injury: A preliminary proteomics report based on clinical samples

Multi-platform omics analysis reveals molecular signature for COVID-19 pathogenesis, prognosis and drug target discovery

featureCounts: An efficient general purpose program for assigning sequence reads to genomic features

Proteasome inhibitors as a possible therapy for SARS-CoV-2

Serum proteomics reveals alterations in protease activity, axon guidance, and visual phototransduction pathways in infants with in utero exposure to zika virus without congenital zika syndrome

COVID-19 pandemic: A review based on current evidence

Systematic benchmarking of omics computational tools

A golden age for working with public proteomics data

Network analysis of transcriptomics data for the prediction and prioritization of membrane-associated biomarkers for idiopathic pulmonary fibrosis (IPF) by bioinformatics approach

Comparison of short-read sequence aligners indicates strengths and weaknesses for biologists to consider

Post-acute COVID-19 syndrome

Complement and non-complement activating functions of C1q: A prototypical innate immune molecule

Pathogenic and transcriptomic differences of emerging SARS-CoV-2 variants in the Syrian golden hamster model. eBioMedicine, 73, 103675

Large-scale multi-omic analysis of COVID-19 severity

Mass spectrometry-based proteomics in basic and translational research of SARS-CoV-2 coronavirus and its emerging mutants

Cellular imprinting proteomics assay: A novel method for detection of neural and ocular disorders applied to congenital zika virus syndrome

Zika virus impairs neurogenesis and synaptogenesis pathways in human neural stem cells and neurons

SARS-CoV-2 activates ER stress and Unfolded protein response (preprint)

Evaluation of seven different RNA-seq alignment tools based on experimental data from the model plant Arabidopsis thaliana

Plasma proteomics identify biomarkers and pathogenesis of COVID-19

HDL proteome remodeling associates with COVID-19 severity

Comparative transcriptome analysis reveals the intensive early stage responses of host cells to SARS-CoV-2 infection

Mapping the SARS-CoV-2-host protein-protein interactome by affinity purification mass spectrometry and proximity-dependent biotin labeling: A rational and straightforward route to discover host-directed anti-SARS-CoV-2 therapeutics

The MaxQuant computational platform for mass spectrometry-based shotgun proteomics

Characterization by quantitative serum proteomics of immune-related prognostic biomarkers for COVID-19 symptomatology

The PRoteomics IDEntifications (PRIDE) database and associated tools: Status in 2013

Identification of proteasome and caspase inhibitors targeting SARS-CoV-2 Mpro

The emerging field of lipidomics

Transcriptome network analyses in human coronavirus infections suggest a rational use of immunomodulatory drugs for COVID-19 therapy

The trans-omics landscape of COVID-19

Transcriptional and proteomic insights into the host response in fatal COVID-19 cases

Next generation plasma proteome profiling of COVID-19 patients with mild to moderate symptoms. eBioMedicine, 74, 103723

A comprehensive review of the analysis and integration of omics data for SARS-CoV-2 and COVID-19

We are grateful for the financial support provided by the São Paulo Research Foundation (FAPESP, grants processes n°2018/18257-1 (G.P.), 2018/15549-1 (G.P.), 2021/00140-3