key: cord-0752642-gd3d3meu
authors: Bhuvaneshwar, Krithika; Madhavan, Subha; Gusev, Yuriy
title: Computational genomic analysis of the lung tissue microenvironment in COVID-19 patients
date: 2021-05-30
journal: bioRxiv
DOI: 10.1101/2021.05.28.446250
sha: 6d4078808eaf5e5936dac741d90343e3401a5403
doc_id: 752642
cord_uid: gd3d3meu

The coronavirus disease 2019 (COVID-19) pandemic caused by the SARS-CoV-2 virus has affected over 170 million people, and caused over 3.5 million deaths throughout the world as of May 2021. Although over 150 million people around the world have recovered from this disease, the long term effects of the disease are still under study. A year after the start of the pandemic, data from COVID-19 recovered patients show multiple organs affected with a broad spectrum of manifestations. Long term effects of SARS-CoV-2 infection include fatigue, chest pain, cellular damage, and a robust innate immune response with inflammatory cytokine production. More clinical studies and clinical trials are needed not only to document, but also to understand and determine the factors that predispose certain people to the long term side effects of this infection. In this manuscript, our goal was to explore the multidimensional landscape of the infected lung tissue microenvironment to better understand the complex interactions between SARS-CoV-2 viral infection, the immune response and the lung microbiome of COVID-19 patients. Each sample was analyzed with several machine learning tools allowing simultaneous detection and quantification of viral RNA at the genome and gene level, human gene expression and fractions of major types of immune cells, as well as metagenomic analysis of bacterial and viral abundance. To contrast and compare the specific viral response to SARS-CoV-2, we analyzed deep sequencing data from an additional cohort of patients infected with the NL63 strain of coronavirus. Our correlation analysis of three types of measurements in patients, i.e.
fraction of viral RNA (at genome and gene level), human RNA (transcript and gene level) and bacterial RNA (metagenomic analysis), showed that viral load and the expression level of specific viral genes correlated significantly with the fractions of immune cells present in lung lavage and with the abundance of major fractions of the lung microbiome in COVID-19 patients. Our exploratory study has provided novel insights into complex regulatory signaling interactions and correlative patterns between the viral infection, inhibition of the innate and adaptive immune response, and the microbiome landscape of the lung tissue. These initial findings could provide a better understanding of the diverse dynamics of the immune response and the side effects of SARS-CoV-2 infection.

The coronavirus disease 2019 (COVID-19) pandemic caused by the SARS-CoV-2 virus has affected over 170 million people, and caused more than 3.5 million deaths throughout the world as of May 2021 [1]. The majority of infected patients were reported to have mild disease (about 75-80%), about 15-20% of patients need hospitalization, and about 5-10% need critical care [2, 3].

Scientists have been studying the molecular underpinnings of this infection. Several large multi-institutional efforts have been started with the goal of understanding the underlying mechanisms of this infection.
One such effort is the COVID Human Genetic Effort, which explores the genetic and immunological reasons for the various clinical severities of this disease [4, 5]. Some reports found that patients who develop severe disease lacked an effective immune response, due to either a mutation or the lack of an effective antiviral response needed to overcome severe disease [6]. This effort also highlighted that inborn errors of type I interferon (IFN) immunity, or auto-antibodies to type I interferons, are associated with COVID-19 pneumonia [6], as well as the therapeutic applications of this finding [7]. Another effort is the National COVID Cohort Collaborative (N3C), organized by the NIH National Center for Advancing Translational Sciences (NCATS), a large effort for collecting and harmonizing data derived from electronic health records (EHRs) from different institutions for collaborative research [8]. Other large efforts include the COVID Symptom Study, an app for tracking symptoms [9], and the NIH COVID digital pathology repository, which collects whole slide images (WSI) of COVID-related pathology [10]. The National Heart, Lung, and Blood Institute (NHLBI) is conducting studies to test if medications are safe and can help individuals recover [...]

Such studies would enable researchers to understand this disease from a molecular perspective, with the ultimate goal of finding better treatments. Researchers would be able to use the big data collected from one of these large initiatives to build learning-based models and determine the factors that predispose certain people to the long term side effects of this infection.

In this manuscript, our goal was to explore the multidimensional landscape of the infected lung tissue microenvironment to better understand the complex interactions between SARS-CoV-2 viral infection, the immune response and the lung microbiome of COVID-19 patients.
We report genomic analysis of deep sequencing data from publicly available RNA [...] pipeline that uses the Bowtie2 aligner [28] to find the viruses detected in the two datasets. Bowtie2 is an alignment algorithm that combines index-assisted seed alignment with single-instruction multiple-data (SIMD) accelerated dynamic programming to achieve fast, accurate and sensitive alignment of sequencing data [28]. We also corroborated these results with an additional pipeline, CENTRIFUGE [29], to detect and quantitate the abundance of viral species. Then we applied our algorithm for quantifying viral RNA at the gene/CDS level, which is part of the viGEN pipeline [27]. This produced gene counts of viral RNA for the input datasets.

Analysis of the immune environment in the RNA-seq data

We applied our immuno-genomics pipeline to the SARS-CoV-2 and NL63-CoV datasets. [...]

Our goal was to explore the multidimensional landscape of the infected lung tissue microenvironment to better understand the complex interactions between the virus, the immune response and the microbiome in the lungs of COVID-19 patients. By utilizing three types of bioinformatics workflows and tools, we were able to detect and quantitate three different fractions of short reads from RNA-seq data files: [...]

The metagenomic analysis of the patients in the SARS-CoV-2 dataset (Table 1) shows the abundance of the top 20 bacterial species in the lung microbiome.

[...] T cells CD4 Naïve were negatively correlated with genome counts. Figure 4B represents statistically significant correlations between viral gene expression (gene level) and fraction of immune cells. Activated NK cells and monocytes were found to be inversely correlated with the gene counts of the 3' and 5' UTR regions, respectively.
Monocytes were also found to be inversely correlated with other regions of the SARS-CoV-2 virus genome, including the membrane glycoprotein, envelope protein, nucleocapsid phosphoprotein and more.

[Table columns: Name of genome species; Immune cell type]

Next, we assessed the immuno-profile of samples from the NL63-CoV dataset (Figure 2B). It indicates T cells CD4 memory as the dominant immune cells.

The metagenomic analysis of the patients in the NL63-CoV dataset ( [...]

[Table of bacterial species and read counts. Species names legible in the source include Streptococcus pseudopneumoniae, Prevotella jejuni, Haemophilus haemolyticus, Veillonella dispar, Streptococcus salivarius, Veillonella atypica and Neisseria meningitidis.]

We did not find many significant correlations between viral load, viral gene expression and the immune profile of the patients in the NL63-CoV dataset. This may be attributable to the challenges the owners of this dataset faced with partial genome sequences obtained during sequencing by NGS methods. While there was no significant correlation between viral load, viral gene expression and immune profile for the NL63 coronavirus species, we did find some significant correlations with another similar coronavirus species, the Human Coronavirus 229E.
Figure 5 shows the summary of the statistically significant correlations between viral gene expression and immunological cell types for the NL63-CoV dataset. The full correlation matrix is provided as Supplementary File 5A (genome level) and Supplementary File 5B (gene level). Figure 5A represents the correlation at the genome level between viral load and fraction of immune cells. Figure [...]

Monocytes and Mast Cells resting were negatively correlated with many bacterial species from genera including Prevotella, Streptococcus and Veillonella (Figure 6).

[Table columns: Name of genome species; Immune cell type. First row: NL63 coronavirus; N/A; remaining cells truncated in the source.]

[...] analysis showed that viral load of the SARS-CoV-2 genome was inversely correlated with natural killer (NK) cell activation (Figure 4A). In other words, in our analysis of the SARS-CoV-2 dataset, a higher viral load was associated with a lower fraction of activated NK cells (Figure 4A).

[...] included Enterobacter species, Pseudomonas, Streptococcus pneumoniae were also found in our results (Figure 6). The Enterobacteriaceae species was found to be resistant to antibiotics in some COVID-19 patients [63]. Although not common, such infections in COVID-19 patients were complex to treat, since it was not easy to distinguish bacterial co-infections from viral infections of the respiratory tract. Our correlation results also show a high correlation of these bacteria with immune markers including Eosinophils and activated NK cells (Figure 5).

Immune landscape and correlation analysis in the NL63-CoV dataset

In our analysis of the NL63-CoV dataset, we saw the dominance of mast cells and CD4 memory resting T cells (Figure 9).
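The correlation analyses reported above pair a per-sample viral measurement (genome or gene counts) with a per-sample immune cell fraction. The excerpt does not state which correlation coefficient was used, so the following is only a minimal sketch assuming a Spearman rank correlation, written in plain Python. The function names and the example values are hypothetical and are not taken from the study data.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample values (assumption, for illustration only).
viral_genome_counts = [120, 450, 900, 3000, 15000]
activated_nk_fraction = [0.09, 0.07, 0.05, 0.04, 0.01]

rho = spearman(viral_genome_counts, activated_nk_fraction)
print(f"Spearman rho = {rho:.3f}")
```

A negative rho in such a comparison corresponds to the "inversely correlated" findings described in the text. With only a handful of samples per dataset, any such coefficient should be read together with a significance test and a correction for the number of cell types and genes compared.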
Relevance of machine learning tools and algorithms

As COVID-19 cases continue to rise around the world, researchers are harnessing the computational power of machine learning and artificial intelligence (AI) tools not only to create prediction and diagnostic tools for COVID-19 [69, 70] but also to improve outcomes [71]. In this paper, we described the application of machine learning tools to process the raw sequencing data generated by NGS technology, and to explore the immune, viral and bacterial landscape of the SARS-CoV-2 and NL63-CoV datasets.

While traditional laboratory techniques allow direct detection of immune cells in the blood, this is more difficult for other types of tissue, where immune cells are detected using flow cytometry or image cytometry methods. Both of these techniques are difficult, labor-intensive and time-consuming.

RNA-sequencing (RNA-seq) technology, in conjunction with machine learning based virtual flow cytometry tools, could be considered a potential alternative. Such an in-silico process would enable researchers not only to estimate the immune cell environment, but also to work towards new hypotheses and therapies that would mediate an appropriate immune response using T cells for long term immunity, and help to minimize adverse side effects from SARS-CoV-2 infection [72, 73].

CONCLUSION

In this paper, we applied multiple machine learning tools to NGS data analysis of lung tissue samples from COVID-19 patients. We explored the SARS-CoV-2 genome and compared it with the genome of another, endemic coronavirus, NL63. Finally, we explored the immunological landscape of the lung microenvironment from the SARS-CoV-2 dataset and the nasopharyngeal microenvironment from the NL63-CoV dataset.
Our exploratory study has provided novel insights into complex regulatory signaling interactions and correlative patterns between the viral infection, inhibition of the innate and adaptive immune response, and the microbiome landscape of the lung tissue. Many of our findings from the analysis of the immune landscape of these two datasets, along with the correlation analysis, have been corroborated in published literature, suggesting that the study of the immune system warrants further analysis and exploration.

The study of how the SARS-CoV-2 virus interacts with the immune system, and comparing and contrasting the immune system in patients affected by endemic viruses, could offer important insights into protection against SARS-CoV-2 and shed light on new therapies to combat severe COVID-19 disease.

These initial findings on a small group of samples could provide a better understanding of the diverse dynamics of the immune response and the side effects of SARS-CoV-2 infection, but require further validation on a larger cohort of samples.
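To make the "virtual flow cytometry" idea from the section on machine learning tools more concrete, the sketch below estimates immune cell fractions from a bulk expression profile and a signature matrix of cell-type-specific expression. CIBERSORT, the tool cited in the references, is based on ν-SVR; this sketch deliberately swaps in a simpler technique, non-negative least squares solved by projected gradient descent, and should not be read as the pipeline used in the study. The signature matrix, cell type labels and fractions are hypothetical toy values.

```python
def deconvolve_nnls(signature, mixture, n_iter=2000):
    """Estimate non-negative cell fractions f minimizing ||S f - m||^2.

    signature: rows are genes, columns are cell types (S).
    mixture:   bulk expression value per gene (m).
    Returns fractions normalized to sum to 1.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # Step size 1/L, where L = trace(S^T S) bounds the largest eigenvalue
    # of S^T S, which keeps the gradient steps stable.
    lipschitz = sum(v * v for row in signature for v in row)
    step = 1.0 / lipschitz
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(signature[g][k] * weights[k] for k in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        gradient = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * d) for w, d in zip(weights, gradient)]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Hypothetical signature matrix: 4 genes x 3 cell types (toy values).
cell_types = ["NK cells activated", "Monocytes", "T cells CD4 memory resting"]
signature = [
    [10.0, 1.0, 0.0],
    [1.0, 8.0, 1.0],
    [0.0, 2.0, 9.0],
    [2.0, 2.0, 2.0],
]
true_fractions = [0.5, 0.3, 0.2]
# Simulate a noise-free bulk sample as a weighted sum of the signatures.
mixture = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]

estimated = deconvolve_nnls(signature, mixture)
for name, value in zip(cell_types, estimated):
    print(f"{name}: {value:.3f}")
```

On real RNA-seq data the mixture is noisy and the signature genes are far more numerous, which is part of the motivation for the more robust regression used by dedicated deconvolution tools.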
• Supplementary File 1A: shows the estimated copy number of viral genomes detected in lung lavage samples of the SARS-CoV-2 dataset obtained using the viGEN pipeline
• Supplementary File 1B: CENTRIFUGE metagenomics pipeline results on the SARS-CoV-2 dataset
• Supplementary File 1C: shows the estimated viral gene expression counts
• Supplementary File 2A: Correlation between genome level abundance and immunological data for the SARS-CoV-2 dataset
• Supplementary File 2B: Correlation between gene level abundance and immunological data for the SARS-CoV-2 dataset
• Supplementary File 2C: Correlation between bacterial abundance and immunological cell types for the SARS-CoV-2 dataset
• Supplementary File 2D: Correlation between genomic viral load and bacterial abundance for the SARS-CoV-2 dataset
• Supplementary File 3A: shows the estimated viral genome copy numbers in 5 pediatric patients from the NL63-CoV dataset obtained using the viGEN pipeline
• Supplementary File 3B: CENTRIFUGE metagenomics pipeline results on the NL63-CoV dataset
• Supplementary File 3C: shows the NL63 viral gene/CDS counts in the nasopharyngeal microenvironment of the NL63-CoV dataset
• Supplementary File 4: shows the microbiome profile of pediatric patients from the nasopharyngeal microenvironment in the NL63-CoV dataset, represented as a Sankey diagram visualization of the bacterial species
• Supplementary File 5A: Correlation between genome level abundance and immunological data for the NL63-CoV dataset
• Supplementary File 5B: Correlation between gene level abundance and immunological data for the NL63-CoV dataset
• Supplementary File 5C: Correlation between bacterial abundance and immunological cell types for the NL63-CoV dataset
• Supplementary File 5D: Correlation between genomic viral load and bacterial abundance for the [...]

The coronavirus disease 2019 (COVID-19)
SARS coronavirus 2 (SARS-CoV-2)
SARS coronavirus
NIH National Center for Advancing Translational Sciences (NCATS)
Electronic health records (EHRs)
Whole slide images (WSI)
Post-acute sequelae of SARS-CoV-2 infection (PASC)
Human endemic coronaviruses (hCoV)
Next generation sequencing (NGS)
Severe acute respiratory infection (SIRS)
Single-instruction multiple-data (SIMD)
Seven Bridges (SB)
Cancer Genomics Cloud (CGC) Platform
Nu-linear support vector regression (ν-SVR)
Support vector machine (SVM)
[...] (FM) index
Natural Killer (NK) cells
Artificial intelligence (AI)
RNA-sequencing (RNA-seq)

We found a public dataset from 5 pediatric patients with severe lower respiratory infection by NL63 coronavirus, with deep sequencing performed on the Illumina HiSeq platform. The downloaded data were raw sequences in the form of [...]

COVID-19 Coronavirus Pandemic. 2021
The Post-acute COVID-19 Syndrome (Long COVID)
Distribution-and-Critical-Illness-Severity-of. Last Accessed March 30, 2021.
4. The COVID Human Genetic Effort.
A Global Effort to Define the Human Genetics of Protective Immunity to SARS-CoV-2 Infection
Inborn errors of type I IFN immunity in patients with life-threatening COVID-19
Interferon-beta Therapy in a Patient with Incontinentia Pigmenti and Autoantibodies against Type I IFNs Infected with SARS-CoV-2
Study of COVID-19 Risk and Long-Term Effects Underway at 37 19-risk-and-long-term-effects-underway-37-academic-medical-centers Last Accessed
Collaborative Cohort of Cohorts for COVID-19 Research (C4R) Study: Study Design. medRxiv 2021
14 Consideration of prevention and management of long-term consequences of post-acute respiratory distress syndrome in patients with COVID-19
Predicting 'Long COVID Syndrome' with Help of a Smartphone App.
US health agency will invest $1 billion to investigate 'long COVID'.
Understanding Human Coronavirus HCoV-NL63. Open Virol
Circulating CD4 T Cells Elicited by Endemic Coronaviruses Display Vast Disparities in Abundance and Functional Potential Linked to
NCBI SRA SRP249613. study=SRP249613. Last Accessed
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Complete Genome Sequences of Five Human Coronavirus NL63 Strains Causing Respiratory Illness in Hospitalized Children in China
(human),%20taxid:9606&SourceDB_s=RefSeq&Completeness_s=complete. Last Accessed
An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors
Fast gapped-read alignment with Bowtie 2
Centrifuge: rapid and sensitive classification of metagenomic sequences
Correction: The Cancer Genomics Cloud: Collaborative, Reproducible, and Democratized-A New Paradigm in Large-Scale Computational Research
The Cancer Genomics Cloud: Collaborative New Paradigm in Large-Scale Computational
Profiling Tumor Infiltrating Immune Cells with CIBERSORT
Robust enumeration of cell subsets from tissue expression profiles
Breitwieser FP, Salzberg SL. Pavian: interactive analysis of metagenomics data for microbiome studies and pathogen identification
R: A language and environment for statistical computing
Severe acute respiratory syndrome (SARS)
Using Sankey diagrams to map energy flow from primary fuel to end use
Monocytes and macrophages in COVID-19: Friends and foes
Cytokine storm induced by SARS-CoV-2
Distinct uptake, amplification, and release of SARS-CoV-2 by M1 and M2 alveolar macrophages
Pyroptotic macrophages stimulate the SARS-CoV-2-associated cytokine storm
Rogue antibodies could be driving severe COVID-19
The analysis of the long-term impact of SARS-CoV-2 on the cellular immune system in individuals recovering from COVID-19 reveals a profound NKT cell impairment
T cell and antibody kinetics delineate SARS-CoV-2 peptides mediating long-term immune responses in COVID-19 convalescent individuals
Delayed bystander CD8 T cell activation, early immune pathology and persistent dysregulation characterise severe COVID-19
Long-Term SARS-CoV-2-Specific Immune and Inflammatory Responses Across a Clinically Diverse Cohort of Individuals Recovering from COVID-19
Adaptive immunity to human coronaviruses is widespread but low in magnitude
Circulating CD4 T cells elicited by endemic coronaviruses display vast disparities in abundance and functional potential linked to both antigen specificity and age
National Research Project for SARS BG. The involvement of natural killer cells in the pathogenesis of severe acute respiratory syndrome
Natural killer cells associated with SARS-CoV-2 viral RNA shedding, antibody response and mortality in COVID-19 patients
Boosting Natural Killer Cells for the Treatment of COVID-19.
COVID-19 with Multiple Bacterial Co-infections: A Case Report
Bacterial co-infections and antibiotic resistance in patients with COVID-19
Bacterial co-infection and secondary infection in patients with COVID-19: a living rapid review and meta-analysis
Exclusion of bacterial co-infection in COVID-19 using baseline inflammatory markers and their response to antibiotics
T Cell Memory: Understanding COVID-19
T cells found in COVID-19 patients 'bode well' for long-term immunity.
Identification of PBMC-based Molecular Signature Associational with COVID-19 Disease Severity. Last Accessed
Machine learning-based prediction of COVID-19 diagnosis based on symptoms
AI, Machine Learning Tools Help Predict COVID-19 Outcomes. Last Accessed
Immune cell profiling in cancer: molecular approaches to cell-specific identification
Testing a new COVID-19 test: How T-cells beat antibodies in helping to detect past infections.

All authors declare no conflict of interest.

YG designed the analysis and study and performed interpretation of the results. KB performed the analysis. KB, YG and SM wrote and edited the paper.

The authors would like to acknowledge the CGC Seven Bridges team for enabling some of the high-throughput analysis using the CENTRIFUGE metagenomic tool.

The datasets used in this manuscript are available online.