key: cord-0869132-pwsj22zq authors: Guo, Qian; Li, Mo; Wang, Chunhui; Guo, Jinyuan; Jiang, Xiaoqing; Tan, Jie; Wu, Shufang; Wang, Peihong; Xiao, Tingting; Zhou, Man; Fang, Zhencheng; Xiao, Yonghong; Zhu, Huaiqiu title: Predicting Hosts Based on Early SARS-CoV-2 Samples and Analyzing Later World-wide Pandemic in 2020 date: 2021-03-22 journal: bioRxiv DOI: 10.1101/2021.03.21.436312 sha: 703e11027187a8b0a728ea8748b3d7ad7ec0ebc9 doc_id: 869132 cord_uid: pwsj22zq

The SARS-CoV-2 pandemic has raised concern about identifying the hosts of the virus since the early stage of the outbreak. To address this problem, we proposed a deep learning method, DeepHoF, which automatically extracts viral genomic features to predict host likelihood scores on five host types (plant, germ, invertebrate, non-human vertebrate and human) for novel viruses. DeepHoF made up for the lack of an accurate tool applicable to any novel virus and overcame the limitation of sequence similarity-based methods, reaching an AUC of 0.987 on the five-class classification task. Additionally, to fill the gap in the efficient inference of host species for SARS-CoV-2 using existing tools, we conducted an in-depth analysis of the host likelihood profile calculated by DeepHoF. Using the isolates sequenced in the earliest stage of COVID-19, we inferred that minks, bats, dogs and cats were potential hosts of SARS-CoV-2, while minks might be one of the most noteworthy hosts. Several genes of SARS-CoV-2 demonstrated their significance in determining the host range. Furthermore, the large-scale genome analysis, based on DeepHoF's computation for the later world-wide pandemic in 2020, revealed the uniformity of host range among SARS-CoV-2 samples and the strong association of SARS-CoV-2 between humans and minks.
The global COVID-19 pandemic caused by severe acute respiratory syndrome [...]

Host prediction of SARS-CoV-2

The accurate prediction of the hosts of the earliest detected isolates can assist the public health system in taking more appropriate preventive measures at the early stage of a pandemic outbreak. In view of this, we focused on prediction with SARS-CoV-2 isolates sequenced in the earliest stage of COVID-19 detection, which are closer to the most recent common ancestor of SARS-CoV-2. Prior to this paper, we reported the prediction for the six earliest sequenced SARS-CoV-2 isolates using our algorithm on 21 January 2020 [13]. In this study, we further strengthened the prediction of hosts [...] replication [25] [26] [27] suggests the rationality of our findings. It is noteworthy that a linear correlation between the lengths and the host likelihood scores of genes does not hold (Supplemental Figure S2). This shows that the importance of ORF1ab is not [...] Table S2). The contributions of these genes were represented by their host likelihood scores on human. We found that ORF1ab was relatively important in the prediction for all these viruses, possibly owing to its functions in viral replication and host survival [27]. The structural genes (S, M, N, and E genes) in these three viruses contributed differently to the human host type, indicating that these genes function inconsistently across these viruses. Specifically, S gene, [...]

It is disappointing that host determination for SARS-CoV-2 is extremely difficult due to the limited knowledge of the virus world. Therefore, the sequences and host information of viruses contained in public databases should be valued and fully utilized.
To fill the gap in the efficient inference of host species for SARS-CoV-2 with state-of-the-art tools, we analyzed in depth the host likelihood profiles of viruses output by DeepHoF to seek specific vertebrate hosts of the early-stage SARS-CoV-2 isolates. In this study, we hypothesized that viruses with the same host species have host likelihood score profiles that lie close together in the five-dimensional space. Based on this assumption, we compared the host likelihood score profile of SARS-CoV-2 with [...] Denmark, respectively, which accelerated the culling of minks and devastated the fur industry in the two countries. On 9 October 2020, at least 10,000 minks were reported dead at Utah and Wisconsin mink farms in the USA, and they were believed to have been infected by SARS-CoV-2 [5] (Table 2).

When evaluating the contributions of the 11 genes of SARS-CoV-2 in determining mink as the most probable host, we found that ORF1ab and ORF8 contributed the most (Supplemental Table S4), suggesting that genes contribute differently when determining different hosts. The rationality of this result is supported by the roles of ORF1ab in viral replication and host survival [27], and the role of ORF8 in immune evasion [30]. However, the interaction between these two genes and the mink cell merits further attention and investigation.

Additionally, novel coronaviruses with high sequence similarity to SARS-CoV-2 were found in pangolins [2, 3] in China. Even though these pangolin-associated coronaviruses were assigned host likelihood score profiles similar to those of early-stage SARS-CoV-2 isolates, our analysis demonstrated that the similarity of profiles between SARS-CoV-2 and pangolin-associated coronaviruses was lower than that between SARS-CoV-2 and certain viruses of mink and Chinese rufous horseshoe bat.
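The profile comparison described above can be illustrated with a minimal sketch. This section does not state which distance or similarity measure was used, so Euclidean distance is an assumption here, as are the function names `profile_distance` and `rank_candidate_hosts`. The species labels and score values are invented placeholders for illustration only; they are not results from the study.

```python
import math

# The five host types scored by the model, in a fixed order.
HOST_TYPES = ("plant", "germ", "invertebrate", "non-human vertebrate", "human")


def profile_distance(a, b):
    """Euclidean distance between two five-dimensional score profiles.

    Assumption: the original metric is not given in this section.
    """
    if len(a) != len(HOST_TYPES) or len(b) != len(HOST_TYPES):
        raise ValueError("each profile needs one score per host type")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidate_hosts(query, references):
    """Rank host species by the distance from the query profile
    to the closest reference virus known to infect that species."""
    best = {}
    for host_species, profile in references:
        d = profile_distance(query, profile)
        if host_species not in best or d < best[host_species]:
            best[host_species] = d
    return sorted(best.items(), key=lambda item: item[1])


# Placeholder profiles (hypothetical values, not data from the paper).
query = (0.01, 0.02, 0.05, 0.90, 0.85)
references = [
    ("species_a", (0.01, 0.02, 0.06, 0.91, 0.84)),
    ("species_b", (0.02, 0.03, 0.05, 0.88, 0.80)),
    ("species_c", (0.05, 0.02, 0.10, 0.80, 0.70)),
]
ranking = rank_candidate_hosts(query, references)
for species, distance in ranking:
    print(f"{species}: {distance:.4f}")
```

Taking the minimum distance per species is one possible design choice; averaging over all reference viruses of a species would be an equally plausible alternative.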
In April 2020, farmed minks in the Netherlands were found to be infected by SARS-CoV-2 following abnormal mortality [4]. Even though all the mink farms in the Netherlands have been screened mandatorily since 28 May 2020, the transmission of the coronavirus among the mink population did not seem to cease. Thus, a million farmed minks were culled in the Netherlands, followed by a plan to cull 2.5 million farmed minks in Denmark.

Characterizing SARS-CoV-2 isolates by their host likelihood score profiles, we found that the isolates detected in humans and minks in the Netherlands were distributed in a consistent pattern, in which both groups were divided into a major cluster and a divergent group (Figure 3D [...] Table S5). Among these unique high-frequency variants in Dutch human-derived SARS-CoV-2, two were found in Dutch mink-derived SARS-CoV-2, which supports the circulation of SARS-CoV-2 between humans and minks in the Netherlands. Notably, our findings are supported by the conclusions of a research team in the Netherlands, who used more detailed information about patients and related mink farms [12]. In the 2020 world-wide pandemic, minks were the only animals reported to transmit SARS-CoV-2 to humans [11, 12]. We further compared the high-frequency variants of SARS-CoV-2 isolates in humans and minks in the Netherlands. Apart from four common variants, SARS-CoV-2 isolates derived from minks had 23 unique high-frequency variants, six of which were found on the S protein, which is related to the virus-host fusion process. This result indicates that the virus might have gained higher diversity after intra-species circulation within mink herds and inter-species circulation between minks and humans.
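The comparison of high-frequency variants between two groups of isolates amounts to set operations on per-group variant calls. The sketch below is illustrative only: the frequency threshold (`min_fraction`), the string encoding of variants, the function names, and the toy isolates are all assumptions, since this section does not give the cut-off or the data format used.

```python
from collections import Counter


def high_frequency_variants(isolate_variants, min_fraction):
    """Return the variants carried by at least min_fraction of the isolates.

    isolate_variants: list of sets, one set of variant labels per isolate.
    """
    if not isolate_variants:
        return set()
    counts = Counter(v for variants in isolate_variants for v in set(variants))
    n = len(isolate_variants)
    return {v for v, c in counts.items() if c / n >= min_fraction}


def compare_groups(group_a, group_b, min_fraction=0.5):
    """Split high-frequency variants into common and group-specific sets."""
    hf_a = high_frequency_variants(group_a, min_fraction)
    hf_b = high_frequency_variants(group_b, min_fraction)
    return {
        "common": hf_a & hf_b,
        "unique_a": hf_a - hf_b,
        "unique_b": hf_b - hf_a,
    }


# Toy variant calls (hypothetical labels, not variants from the study).
group_a = [
    {"100A>G", "200C>T"},
    {"100A>G", "200C>T"},
    {"100A>G"},
    {"100A>G", "300G>A"},
]
group_b = [
    {"100A>G", "400T>C"},
    {"100A>G", "400T>C"},
    {"400T>C", "200C>T"},
]
result = compare_groups(group_a, group_b, min_fraction=0.5)
print(result)
```

A group-specific high-frequency variant that still appears at low frequency in the other group (here `200C>T`) is the kind of signal the text interprets as evidence of circulation between the two host populations.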
As mink infections expand worldwide, the association and circulation of SARS-CoV-2 between humans and minks in the Netherlands highlight the importance of taking precautions against bidirectional transmission in other regions.

Retrospective analysis of the world-wide pandemic

To verify the stability and uniformity of the host inference among SARS-CoV-2 samples, a retrospective analysis of more isolates from the ongoing pandemic was required. [...] However, when the SARS-CoV-2 isolates were divided chronologically using 15 April 2020 as the split date, which divided the 53,759 isolates into two parts more evenly than other dates, we found that the two subsets had divergent distributions in each of the two dimensions of PCA (two-sided two-sample Kolmogorov-Smirnov test, p-value = 0, n_isolates = 26,167 before 15 April 2020 and 27,592 after 15 April 2020) (Figure 4B). The approximately normal distribution of SARS-CoV-2 genomes and their time-dependent feature indicate overall consistency and a certain extent of divergence in the host likelihood score profiles of SARS-CoV-2 isolates.

To explain the divergence among host likelihood score profiles, we identified all variants in the 53,759 genomes (Supplemental Table S5). The 13 high-frequency variants were located on the S gene, N gene, ORF1ab, ORF8 and ORF3a, some of which are related to the virus-host fusion process [22, 33]. Furthermore, we annotated our PCA result with [...] predicting the host species for any novel virus that remained unsolved using state-of-the-art tools, we further analyzed the host likelihood score profile to infer the specific hosts of SARS-CoV-2. The hosts determined by DeepHoF can be either reservoirs or susceptible intermediate hosts, which are not distinguished in this study. We found that minks, bats, dogs and cats could be potential hosts of SARS-CoV-2, while minks might be one of the most noteworthy animal hosts.
Due to mutations, the host likelihood score profiles of the isolates over the long period of the later pandemic varied slightly, but followed a normal distribution with those of the 17 early isolates located at the center. As a consequence, the host range inferred from the profiles of the isolates during the pandemic was consistent with the inference from the early samples. [...] In the future, the detection of novel viruses will rely more heavily on high-throughput sequencing technologies such as metagenomics and metaviromics. Thus, more robust tools designed for metagenomes and metaviromes are required.

We downloaded 63,049 whole viral genomes from GenBank by 9 July 2019 and tagged them with five host labels (plant, germ, invertebrate, non-human vertebrate and human), which were integrated from the host metadata provided by GenBank (Supplemental Table S6). The five host types cover all living host organisms. For viruses infecting multiple host types, multiple labels were given. Following the data collection procedure, short fragments were generated randomly from the tagged whole genomes because of the computational cost of processing long sequences. The training set was constructed with short fragments from 55,283 genomes released before 1 January 2018, and the test set was constructed with the rest (the accession list and the host information of the genomes used for training and testing are in Supplemental Table S7). There is no overlap of virus species between the training and test sets.
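The data construction described above (multi-label host tagging, random short fragments, and a split by release date) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout, the function names, the fragment length, and the number of fragments per genome are hypothetical, and the sketch does not enforce the species-level separation between training and test sets mentioned in the text.

```python
import random
from datetime import date

HOST_TYPES = ("plant", "germ", "invertebrate", "non-human vertebrate", "human")
SPLIT_DATE = date(2018, 1, 1)  # genomes released before this date go to training


def encode_hosts(hosts):
    """Multi-hot label: one position per host type, 1 if the virus infects it."""
    return [1 if h in hosts else 0 for h in HOST_TYPES]


def sample_fragments(sequence, fragment_length, n_fragments, rng):
    """Draw random fixed-length fragments from one genome sequence."""
    if len(sequence) <= fragment_length:
        return [sequence]  # genome too short to cut, keep it whole
    last_start = len(sequence) - fragment_length
    starts = [rng.randrange(last_start + 1) for _ in range(n_fragments)]
    return [sequence[s:s + fragment_length] for s in starts]


def build_sets(genomes, fragment_length, n_fragments, seed=0):
    """Split genomes chronologically, then fragment each one."""
    rng = random.Random(seed)
    train, test = [], []
    for genome in genomes:
        target = train if genome["released"] < SPLIT_DATE else test
        label = encode_hosts(genome["hosts"])
        fragments = sample_fragments(
            genome["sequence"], fragment_length, n_fragments, rng
        )
        for fragment in fragments:
            target.append((fragment, label))
    return train, test


# Toy records (hypothetical sequences, dates, and hosts).
genomes = [
    {"sequence": "ACGT" * 10, "released": date(2017, 5, 1),
     "hosts": {"human", "non-human vertebrate"}},
    {"sequence": "TTGCA" * 8, "released": date(2018, 3, 1),
     "hosts": {"plant"}},
]
train, test = build_sets(genomes, fragment_length=12, n_fragments=3)
print(len(train), len(test))
```

Splitting by genome before fragmenting keeps all fragments of one genome on the same side of the split, so no fragment of a test genome can leak into training.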
Tables

Table 1 Performance [...]

The global virome project
Virus taxonomy: ninth report of the International Committee on Taxonomy of Viruses
Virus entry: molecular mechanisms and biomedical applications
PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning
Linking virus genomes with host taxonomy
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models
Snippy: rapid bacterial SNP calling and core genome alignments
Interactive Tree Of Life (iTOL) v4: recent updates and new developments
Betacoronavirus HKU24 strain HKU24-R05005I
Lucheng Rn rat coronavirus isolate Lucheng-19
Bat coronavirus isolate PREDICT/PDF-2180