title: potential utilization of terrestrially derived dissolved organic matter by aquatic microbial communities in saline lakes
authors: yang, jian; jiang, hongchen; liu, wen; huang, liuqin; huang, jianrong; wang, beichen; dong, hailiang; chu, rosalie k.; tolic, nikola
journal: isme j

lakes receive large amounts of terrestrially derived dissolved organic matter (tdom). however, little is known about how aquatic microbial communities interact with tdom in lakes. here, by performing microcosm experiments, we investigated how microbial communities responded to tdom influx in six tibetan lakes of different salinities (ranging from to g/l). in response to tdom addition, microbial biomass increased while dissolved organic carbon (doc) decreased. the amount of doc decrease did not show any significant correlation with salinity. however, salinity influenced tdom transformation, i.e., microbial communities from higher salinity lakes exhibited a stronger ability to utilize tdom of high carbon numbers than those from lower salinity lakes. abundant taxa and copiotrophs were actively involved in tdom transformation, suggesting their vital roles in the lacustrine carbon cycle. network analysis indicated that operational taxonomic units (otus, affiliated with alphaproteobacteria, actinobacteria, bacteroidia, bacilli, gammaproteobacteria, halobacteria, planctomycetacia, rhodothermia, and verrucomicrobiae) were associated with degradation of cho compounds, while four bacterial otus (affiliated with actinobacteria, alphaproteobacteria, bacteroidia and gammaproteobacteria) were highly associated with the degradation of chos compounds. network analysis further revealed that tdom transformation may be a synergistic process, involving cooperation among multiple species.
in summary, our study provides new insights into the microbial role in transforming tdom in saline lakes and has important implications for understanding the carbon cycle in aquatic environments.

saline lakes are globally widespread and occupy almost half of the total inland water surface area [ ] . generally, saline lakes are located in catchment basins, which receive large amounts of terrigenous materials, and thus contain high concentrations of dissolved organic carbon (doc), largely due to their evaporative condensation effect [ , ] . therefore, saline lakes contribute significantly to the global carbon budget [ ] . within saline lakes, a considerable amount of doc originates from terrestrially derived dissolved organic matter (tdom) from surrounding soils [ ] , and may undergo extensive transformations by lacustrine microbial communities [ ] . however, to date, little is known about the biogeochemical fate of tdom in saline lakes, which is of great importance to the understanding of the global carbon cycle [ ] . traditionally, tdom is considered to be refractory to biological utilization, because a large portion of tdom molecules contain highly complex and aromatic chemical structures [ , ] . nevertheless, an increasing number of recent studies have indicated that tdom can be utilized by aquatic heterotrophic microorganisms [ ] [ ] [ ] [ ] . for example, tdom can be utilized by aquatic bacterial communities in freshwater environments to produce new microbial biomass [ ] or even mineralized to carbon dioxide (co2), which is an important source of atmospheric co2 [ , ] . tdom can also be transformed to recalcitrant dom by microorganisms in oceans [ ] . therefore, unveiling the linkage between aquatic microbial communities and tdom is essential to comprehending carbon cycling in aquatic environments such as lakes. the qinghai-tibetan plateau (qtp) hosts thousands of lakes (including many saline and hypersaline lakes) with a salinity range from . to . g/l [ ] . 
previous studies have shown that tdom occurs widely in the qtp lakes and is a very important organic carbon source for microbial communities in these lakes [ , ] . in addition, previous studies have indicated that the qtp lakes are inhabited by a broad range of microbial species, and their taxonomic compositions become more divergent with increasing salinity difference [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . however, little is known about the role of microbial taxa in the potential transformation of tdom in these high-elevation lakes. for example, it is unclear which key microbial taxa are responsible for tdom degradation in saline lakes, which tdom compounds are transformed by those microbial taxa, and how such microbially catalyzed tdom transformation responds to salinity change. to address these knowledge gaps, the linkage between aquatic microbial communities and the degradation potentials of tdom in the qtp lakes of different salinity was investigated in this study by performing microcosm experiments and using a suite of analytical techniques.

sampling

six qinghai-tibetan lakes with different salinities were selected for this study: erhai lake (ehl) is a freshwater lake; qinghai lake (qhl) and tuosu lake (tsl) are saline lakes; gahai lake (ghl), xiaochaidan lake (xcdl), and chaka lake (ckl) are hypersaline lakes [ ] . a sampling cruise was carried out in may . at inshore sites (~ - m away from shoreline) of each lake, salinity, ph, and temperature of surface water (~ - cm) were measured with portable meters (sanxin, shanghai, china). water samples (~ ml each) for measurements of total dissolved nitrogen (tdn) and total dissolved phosphorus (tdp) were collected after filtration through . -μm nuclepore filters (whatman, uk). subsequently, about -l surface water was collected from each lake into acid-washed and sterilized -l polycarbonate bottles (nalgene, usa). to obtain tdom, onshore soils around each lake were taken from the topsoil layer ( - cm) without plant cover. 
all samples were kept cold and in the dark during transportation to the laboratory. samples were stored at °c in the laboratory until further processing. to avoid exogenous contamination, all glassware used in the experiments was combusted at °c for h, and whatman nuclepore filters or plastic instruments were sterilized by either autoclaving or ultraviolet irradiation before use.

preparation of tdom-containing media: to minimize any difference in tdom source, six onshore soils around the lakes were equally mixed to make one composite soil sample. the composite soil sample was mixed (water:soil = : , v-v) with lake water from each lake to obtain six water-soil mixtures. the mixtures were shaken on a rotary table for h, and then placed on a bench for h in the dark. subsequently, the resulting tdom-containing supernatant (~ l) in each mixture was filtered through a . -μm whatman nuclepore membrane filter, and collected into a pre-combusted glass bottle. the resulting tdom-containing filtrate was diluted using the corresponding microbe-free lake water to maintain a similar level of tdom, i.e., the doc content of the tdom-containing lake waters minus the doc content of their corresponding lake water was ~ mg/l. microbe-free lake water was obtained by filtering lake water through a whatman . -μm nuclepore membrane filter. the resulting tdom-containing lake water was finally used as a basic culture medium for the subsequent microcosm experiments.

preparation of inocula: for each lake, about -l water was filtered through . -μm whatman nuclepore membrane filters. the resulting biomass-containing filters were immersed into the corresponding microbe-free lake water (~ ml), and manually stirred for min. after the filters were removed, the resulting concentrated microbial inocula were used for subsequent microcosm experiments. 
microcosm experiments and sample collection: prior to the microcosm experiments, lake waters were pre-incubated at °c in the dark for days to remove any labile lake-derived dom. six microcosms (i.e., four experimental treatments and two controls) were prepared for each lake. microcosm experiments were conducted in -l glass bottles containing ml tdom-containing basic medium (prepared as above). ml of microbial inocula (prepared as above) was added to each of four microcosms as experimental treatments, while two control bottles received ml microbe-free lake water. at time zero (i.e., the beginning of the experiment), water samples were taken from two experimental treatments of each lake for microbial counts (n = ), dna extraction (n = ), doc (n = ), and ft-icr-ms (n = ) analyses. the other two experimental treatments and the controls for each lake were sealed with air-permeable, microbe-proof films and then incubated for days at °c in the dark. at the end of incubation (i.e., day ), duplicate water samples from both treatments and abiotic controls of each lake were collected for microbial count (n = ), doc (n = ), and ft-icr-ms (n = ) analyses. detailed sampling procedures for microbial count, doc, dna extraction, and ft-icr-ms analyses are described in the supplementary materials.

doc concentrations were determined on a multi n/c s analyzer (analytik jena, germany). tdn and tdp were analyzed using published colorimetric methods [ , ] . acridine orange direct count was used to determine the total number of microbial cells in water samples according to the procedure described previously [ ] . dom in the water samples was isolated using solid phase extraction (spe) as described previously [ ] . all the spe-extracted dom samples were analyzed on a tesla bruker solarix ft-icr-ms spectrometer equipped with negative-mode electrospray ionization at pacific northwest national laboratory, richland, washington, usa. 
details concerning dom isolation and ft-icr-ms analyses are provided in the supplementary materials.

total dna was extracted from biomass-containing filters using the fast dna spin kit for soil (mp biomedical, usa). the extracted dna was amplified with a universal s rrna gene primer set: f ( ′-gtgycagcmgccgcggtaa- ′) and r ( ′-ggactacnvgggtwtctaat- ′) according to the pcr conditions described previously [ ] . amplicon sequencing was performed on an illumina-miseq platform (paired-end sequencing of × bp) [ ] . detailed procedures for pcr preparation and sequence processing are provided in the supplementary materials. all the raw sequences obtained from this study have been deposited at the ncbi sequence read archive under the project prjna with biosample accession number of samn -samn .

unless otherwise indicated, all statistical analyses were carried out in the r program (http://cran.r-project.org/) implemented with various packages. nonmetric multidimensional scaling (nmds) was employed to evaluate the microbial community composition difference between the beginning and end of incubation, based on the bray-curtis dissimilarity using the "vegan" package. the total number of molecular formulae was calculated according to specific categories (e.g., cho, chon, chonp, chons, chonsp, chop, chos, and chosp). relative peak intensities (normalized to the sum of all peak intensities of identified molecular formulae per sample) were used to semi-quantitatively assess the changes in dom molecular composition as a result of incubation. principal coordinate analyses (pcoa) were conducted to illuminate the difference in dom molecular compositions among different experimental samples using the "ape" package. the van krevelen diagrams [ ] , in which the ratio of hydrogen to carbon (h/c) was plotted against the ratio of oxygen to carbon (o/c), were used to compare the dom compositional difference between different samples and as a result of incubation. 
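The bookkeeping described above (assigning formulae to categories such as CHO or CHONS, normalizing peak intensities to the per-sample sum, and computing the H/C and O/C coordinates of a van Krevelen diagram) can be sketched in a few lines. This is a minimal illustration in Python rather than the R workflow used in the study; the function names and the input format (formula strings such as `C10H12O5` mapped to raw peak intensities) are assumptions made for the example, not part of the original analysis.

```python
import re


def parse_formula(formula):
    """Return element counts for a formula string such as 'C10H12O5N1'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def formula_category(formula):
    """Label a formula as CHO, CHON, CHONS, ... by which heteroatoms it contains.

    Assumes every assigned formula contains C, H and O (hypothetical input rule).
    """
    counts = parse_formula(formula)
    return "CHO" + "".join(e for e in ("N", "S", "P") if counts.get(e, 0) > 0)


def van_krevelen_point(formula):
    """Return the (H/C, O/C) atomic ratios used as van Krevelen coordinates."""
    counts = parse_formula(formula)
    carbon = counts.get("C", 0)
    if carbon == 0:
        raise ValueError("formula contains no carbon: " + formula)
    return counts.get("H", 0) / carbon, counts.get("O", 0) / carbon


def relative_intensities(peaks):
    """Normalize raw peak intensities to the sum over all formulae in a sample."""
    total = sum(peaks.values())
    return {formula: value / total for formula, value in peaks.items()}


if __name__ == "__main__":
    # Made-up sample: two formulae with arbitrary raw intensities.
    sample = {"C10H12O5": 3.0, "C8H10O4N1": 1.0}
    for formula, share in relative_intensities(sample).items():
        print(formula, formula_category(formula), van_krevelen_point(formula), share)
```

Because the intensities are normalized per sample, the resulting values are only semi-quantitative, as noted in the text.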
to test the significance of difference in microbial community and dom molecular composition among samples, permanova analysis was performed based on bray-curtis dissimilarity with permutations.

seven dom molecular groups were defined according to the presence-absence of dom formulae among the starting samples, treatments, and controls:
(1) dom transformed by biotic or abiotic processes: formulae present only in the starting samples;
(2) dom transformed by abiotic processes: formulae present in both the starting samples and biotic treatments but absent in abiotic controls;
(3) dom transformed by biotic activity: formulae present in both the starting samples and abiotic controls but absent in biotic treatments;
(4) relatively stable dom: formulae shared among the starting samples, biotic treatments, and abiotic controls;
(5) dom newly produced by biotic activity: formulae present only in biotic treatments;
(6) dom newly produced by abiotic processes: formulae present only in abiotic controls; and
(7) dom newly produced by biotic or abiotic processes: formulae present in both biotic treatments and abiotic controls but absent in the starting samples.
these groups are hereafter called g1-g7 for brevity.

in order to explore the association between microbial otus and dom molecular formulae, network analysis was performed. briefly, spearman's rank correlations were calculated between the relative abundance of microbial otus and the relative peak intensity of dom molecules for all experimental treatments at the end of the microcosm experiments (a total of samples) using the r package "hmisc". rare otus (relative abundance < . %) in each sample were removed before correlation analysis. 
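The seven presence-absence groups defined earlier in this section amount to a lookup on three yes/no flags per formula: present in the starting sample, present in the biotic treatment, present in the abiotic control. The sketch below shows that lookup in Python. The labels G1 to G7 follow the order in which the groups are listed in the text, which is an assumption about the original numbering, and the function name and the set-based input format are hypothetical.

```python
# Keys are (in starting sample, in biotic treatment, in abiotic control).
# Labels G1..G7 follow the order in which the groups are listed in the text
# (an assumption about the original numbering).
GROUPS = {
    (True, False, False): "G1",  # transformed by biotic or abiotic processes
    (True, True, False): "G2",   # transformed by abiotic processes
    (True, False, True): "G3",   # transformed by biotic activity
    (True, True, True): "G4",    # relatively stable
    (False, True, False): "G5",  # newly produced by biotic activity
    (False, False, True): "G6",  # newly produced by abiotic processes
    (False, True, True): "G7",   # newly produced by biotic or abiotic processes
}


def classify_formulae(start, biotic, abiotic):
    """Map every detected formula to its presence-absence group."""
    start, biotic, abiotic = set(start), set(biotic), set(abiotic)
    groups = {}
    for formula in start | biotic | abiotic:
        key = (formula in start, formula in biotic, formula in abiotic)
        groups[formula] = GROUPS[key]
    return groups


if __name__ == "__main__":
    # Placeholder formula names, one per group.
    result = classify_formulae(
        start={"a", "b", "c", "d"},
        biotic={"b", "d", "e", "g"},
        abiotic={"c", "d", "f", "g"},
    )
    for formula in sorted(result):
        print(formula, result[formula])
```

Every formula detected in at least one of the three sample types falls into exactly one group, so the seven groups partition the full set of detected formulae for a lake.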
to ensure statistical robustness, only microbial otus and dom molecular formulae that co-occurred in at least six samples were included in the analysis. to control the false discovery rate, all p values generated by spearman correlation analyses were adjusted using the benjamini-hochberg method [ ] . spearman's correlation coefficient |r| > . with p < . was considered as robust [ ] . all the robust correlations were used to construct a network where nodes represented microbial otus or dom molecular formulae, and edges indicated strong and significant correlations between nodes. network visualization was conducted using the interactive platform gephi (http://gephi.github.io/). the network modularity and module division were calculated through a method of fast greedy optimization [ ] using the r package igraph. subsequently, the roles of the network nodes were assigned according to their within-module connectivity (zi) and among-module connectivity (pi) [ ] . four categories can be defined for each node: (1) peripheral nodes (zi < . , pi < . ), (2) connectors (zi < . , pi > . ), (3) module hubs (zi > . , pi < . ), and (4) network hubs (zi > . , pi > . ) [ , ] . the network and module hubs, and connectors are generally proposed as keystone module members [ ] .

salinity of lake water was - g/l, ph was . - . , and water temperature was . - . °c (table ); tdn and tdp were . - . and . - . mg/l, respectively. microbial cell abundance increased considerably from the beginning of the incubation (i.e., day ) to the end (i.e., day ) ( table ). the average amount of increase was . × , . × , . × , . × , . × , and . × cells ml − for ehl, qhl, tsl, ghl, xcdl, and ckl, respectively. the hypersaline ckl sample had the highest amount of increase. microbial cells in abiotic control samples showed no observable increase. a total of , qualified sequences were obtained with an average of , per sample (supplementary table s ). 
the observed otus ranged from to , and the shannon index ranged from . to . (supplementary table s ). permanova analysis showed that microbial community composition was significantly distinct (r = . , p < . ) among different lakes (fig. ) . for example, actinobacteria, alphaproteobacteria, bacteroidia, and gammaproteobacteria dominated (relative abundance > %) in the experimental samples of ehl, qhl, tsl, ghl, and xcdl, while hypersaline lake ckl was dominated by halobacteria, whose relative abundance was up to % (fig. ). such a difference in microbial community composition among the lakes was further supported by nmds analysis (supplementary fig. s ), which showed that all samples from the same lake were clustered together but distinct from those of other lakes, regardless of incubation time. microbial community composition exhibited a discernible shift as a result of incubation ( supplementary fig. s ). however, permanova analysis did not show a significant (p > . ) difference in microbial community composition between the beginning and the end of incubation. we found that there were a total of otus, whose average relative abundances increased by > % from the table s ). according to the definition established previously [ ] , these four otus could be considered as copiotroph-like otus (i.e., otus with relative abundance < . % at day but > % at day ). the concentrations of doc in the experimental treatments decreased from the beginning (i.e., day ) to the end (i.e., day ) of incubation, with the mean amount of decrease of . , . , . , . , . , and . mg l − for the ehl, qhl, tsl, ghl, xcdl, and ckl samples, respectively ( table ). the highest amount of doc decrease was observed in the hypersaline ckl samples. the doc concentrations in abiotic controls also experienced slight declines after incubation (table ) . therefore, the mean amounts of net doc decrease derived from microbial activity were . , . , . , . , . , and . 
mg l − for the ehl, qhl, tsl, ghl, xcdl, and ckl samples, respectively (table ) , and such doc decreases did not show any significant (p > . ) correlation with lake salinity. the total number of dom molecular formulae determined by ft-icr-ms ranged from to per sample, with cho, chon, and chos compounds being dominant in all samples (supplementary table s ). pcoa showed that the samples from different lakes contained distinct dom molecular compositions, which was further supported by permanova analysis (r = . , p < . ; supplementary fig. s ) . notably, although all dom was initially eluted from the same composite soil sample, its composition was different even at time zero ( supplementary fig. s ). this difference could be caused by distinct dom molecular composition in the lake water that was used to elute tdom from soil, or by a salinity effect on tdom elution from soil [ ] . furthermore, pcoa also indicated that there appeared to be recognizable differences among the starting samples, biotic treatments and abiotic controls for each lake ( supplementary fig. s ), and this was also supported by the van krevelen diagrams (supplementary fig. s ) . however, such differences were not significant, as indicated by the permanova analysis (data not shown). to discern the difference in dom molecular composition among the starting samples, biotic treatments and abiotic controls of each lake, the numbers of dom molecular formulae belonging to the seven defined groups (see the "method" section) were calculated (fig. a) . each group contained various numbers of dom molecular formulae in different lakes, and g (relatively stable dom) contained the majority of dom molecular formulae ( - ) in all lakes (fig. a) . permanova analysis indicated that dom molecular compositions in lake microcosms were distinct (r = . , p < . ) among different groups (fig. b) . furthermore, the relative abundances of molecular categories (e.g., cho, chon, and chos) in each group varied among lakes ( supplementary fig. s ). 
the average carbon numbers of dom molecular formulae were all > in most groups and in all lakes, except for g whose average carbon numbers were all < in six lakes ( supplementary fig. s ). some microbially transformed dom molecules (i.e., g ) were observed in the studied microcosms (fig. a, b) . different types of dom molecules were transformed by microbial activity in different lakes (fig. b) . for example, the chon molecules in g were more abundant in the ehl and qhl samples of lower salinity than in the samples of higher salinity (i.e., tsl, ghl, xcdl, and ckl samples), whereas chop molecules showed an opposite trend ( supplementary fig. s ) . furthermore, the carbon numbers of the dom molecules of g were significantly (kruskal-wallis rank sum test: chi-squared = . , p < . ) lower in the microcosms with low salinity (i.e., ehl, qhl, and tsl) than in the microcosms with high salinity (i.e., ghl, xcdl, and ckl) (supplementary fig. s ). spearman correlation analysis showed that there were a total of otus, whose relative abundances showed strong (|r| > . , p < . ) correlations with the relative intensities of some dom molecular formulae. the phylogenetic affiliations of these otus are shown in supplementary table s . the dom molecular formulae that were correlated with the abovementioned otus possessed various h/c and o/c ratios and carbon numbers ( fig. and supplementary fig. s ). most of these microbial otus were widely correlated with cho molecules except for otu (gammaproteobacteria), otu (bacteroidia), otu (actinobacteria), and otu (alphaproteobacteria), which were correlated with both cho and chos molecules ( supplementary fig. s ). the carbon number of dom formulae associated with microbial otus ranged from to , and their distribution patterns differed among otus (supplementary fig. s ). 
network analysis indicated that the abovementioned correlations between microbial otus and dom molecular formulae formed a complex network, which consisted of nodes and edges (fig. a) . the network modularity was . (fig. b) , which suggests a good (modularity > . ) modular structure [ ] . the network was composed of eight modules, and each module contained different types and numbers of dom molecular formulae and microbial otus (fig. b) . the number of dom molecular formulae in module was the highest, followed fig. s b ). according to connectivity within and among modules, we identified four network hubs, module hubs, connectors, and peripheral nodes ( fig. and supplementary table s ). it was noteworthy that all module and network hubs were microbial otus, and most connectors were dom formulae except for otu (bacteroidia) and otu (gammaproteobacteria) (fig. and supplementary table s ). the molecular characteristics (i.e., ratios of h/c and o/c) of these identified dom connectors are shown in the van krevelen diagrams, and they possessed a broad range of h/c and o/c ratios ( supplementary fig. s a ). the dom connectors mainly consisted of cho, chon, chons, and chos formulae, and they were dominated by cho formulae, which accounted for % of the total dom connectors (supplementary fig. s b ).

our data revealed that microbial activities decreased the concentration of tdom in the studied treatments. the decreased amount of tdom (by as much as . - . mg/l, in table ) may be transformed into microbial cellular carbon [ , ] and/or mineralized to co2 through microbial respiration [ , ] . assuming one microbial cell contains fg ( fg = × − mg) carbon [ ] , the increased microbial cell numbers (table ) only accounted for < % of the decreased tdom (< . mg l − vs. . - . mg l − ); thus a large fraction of the decreased tdom (> %) may have been mineralized to co2 . this result suggests that the input of tdom to lakes may increase co2 emission. 
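The carbon budget argument above is a unit conversion followed by a ratio: the cell increase (cells per mL) is converted to biomass carbon (mg per L) using a per-cell carbon content in femtograms, and the result is compared with the net DOC decrease. A minimal Python sketch of that arithmetic follows. The function name is hypothetical and all numbers in the usage example are placeholders chosen for illustration, not values from the study.

```python
FG_TO_MG = 1e-12   # 1 fg = 1e-15 g = 1e-12 mg
ML_PER_L = 1000.0  # cells per mL -> cells per L


def biomass_carbon_fraction(cell_increase_per_ml, carbon_per_cell_fg,
                            doc_decrease_mg_per_l):
    """Fraction of the net DOC decrease that new biomass carbon could explain."""
    if doc_decrease_mg_per_l <= 0:
        raise ValueError("DOC decrease must be positive")
    biomass_mg_per_l = (cell_increase_per_ml * ML_PER_L
                        * carbon_per_cell_fg * FG_TO_MG)
    return biomass_mg_per_l / doc_decrease_mg_per_l


if __name__ == "__main__":
    # Placeholder inputs (not taken from the paper): 1e6 cells/mL increase,
    # 20 fg C per cell, 2 mg/L net DOC decrease.
    fraction = biomass_carbon_fraction(1e6, 20.0, 2.0)
    print("biomass share of DOC decrease:", fraction)
    print("share left for mineralization or other sinks:", 1.0 - fraction)
```

The remainder, one minus this fraction, is the share the text attributes to mineralization, under the assumption that biomass and CO2 are the only sinks for the consumed carbon.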
furthermore, the lack of correlation between salinity and the decreased amount of tdom in the microcosms suggests that other environmental factors may have contributed to the observed tdom degradation. in addition, microbial activities transformed some tdom to novel dom molecules, as evidenced by unique dom molecules observed in biotic treatments but not in abiotic controls and starting samples (e.g., g in fig. a, b) . this observation is consistent with previous studies showing that microbial activity can transform dom into new products [ , [ ] [ ] [ ] [ ] . our combined molecular and physicochemical studies allowed us to identify the dominant microorganisms that may be responsible for such tdom transformation. the relative abundance of abundant otus (i.e., relative abundance > % both at the beginning and at the end of the experiment) increased in response to the addition of tdom (supplementary table s ), suggesting their vital roles in the transformation and mineralization of tdom. this observation may be ascribed to the fact that abundant taxa can usually utilize a broad spectrum of dom compounds as substrates [ ] . furthermore, it is also notable that these tdom-transforming taxa were phylogenetically diverse (i.e., actinobacteria, alphaproteobacteria, bacteroidia, gammaproteobacteria, halobacteria, verrucomicrobiae; as shown in supplementary table s ). meanwhile, distinct microbial taxa appeared to be responsible for the transformation of tdom in different lakes, as evidenced by abundance increases of lake-specific otus after microcosm incubations (supplementary table s ). this observation could be due to the fact that the original microbial communities differed among the studied lakes with different salinity [ , , , , , ] . it is interesting to observe that copiotrophs (e.g., otus belonging to the genera vibrio and photobacterium in the gammaproteobacteria) also contributed to the transformation of the added tdom. 
copiotrophs are frequently observed in nutrient-rich environments and are opportunistic r-strategists for growth, and thus they prefer labile substrates to support their fast growth [ , ] . accordingly, copiotrophs are frequently observed in microcosms amended with labile dom (e.g., glucose) [ , [ ] [ ] [ ] . in this study, the relative abundances of certain copiotrophic otus showed a dramatic increase (> times) in response to the addition of tdom during microcosm incubations of samples from two saline lakes (tsl and xcdl) (supplementary table s ). such a response of copiotrophs can be ascribed to the possible presence of certain labile dom (e.g., protein and sugars) eluted from the sampled terrestrial soils [ , ] . therefore, it was possible to observe that tdom stimulated the growth of copiotrophs in saline and hypersaline lakes. to further illuminate the relation between dom transformation and the specific microbial species responsible for such transformation, correlation analyses between microbial taxa and dom molecules were performed. note that a significant correlation between microbial species and dom molecules does not necessarily suggest that such microbial species actually transformed or were capable of transforming the dom molecules [ ] . however, such correlation analyses at least provide some direction for future investigation on microbial degradation of specific dom compounds. correlation analyses in this study revealed that microorganisms may contribute to the transformation of dom compounds with a broad range of h/c and o/c ratios, and carbon numbers. different microbial otus were associated with the transformation of distinct dom compounds ( supplementary fig. s ). this can be ascribed to the fact that different microbial species possess different functional enzymes, which degrade different dom compounds [ , ] . 
in other words, some microbial species may produce specific functional enzymes, which can degrade specific types of dom compounds [ , ] . for example, otu (gammaproteobacteria), otu (bacteroidia), otu (actinobacteria), and otu (alphaproteobacteria) are highly associated with the degradation of chos compounds (supplementary fig. s ). in addition to dom types, some microbial taxa may be able to degrade dom compounds with distinct ranges of carbon number [ ] . for example, otu (rhodothermia) and otu (gammaproteobacteria) were positively correlated with compounds with low carbon number (~< ), and negatively correlated with compounds with high carbon number (~> ); whereas otu (verrucomicrobia), otu (actinobacteria), and otu (gammaproteobacteria) were positively correlated with dom formulae of high carbon number, and negatively correlated with dom formulae of low carbon number (supplementary fig. s ) . remarkably, the abovementioned correlations between certain microbial taxa and dom formulae formed a strong modular network with eight modules (fig. b) . previous studies have indicated that a module within one network can be considered as a functional unit, which may perform an identical ecological task, and the involved microbial taxa within that module are highly connected with each other [ , ] . therefore, we speculated that microbial taxa within a given module may cooperate to degrade some specific types of dom (e.g., refractory tdom), and specific dom compounds may serve as substrates. indeed, one previous study suggested that a single species cannot completely degrade most large molecular weight dom compounds [ ] , implying that it is necessary for multiple species to cooperate for tdom degradation. furthermore, different modules contained distinct dom molecular compositions ( supplementary fig. s a ), suggesting that different modules may contribute to different dom transformations [ ] . 
it is also important to mention the identification of keystone module members (i.e., module and network hubs, connectors) due to their high connectivity within or among modules and key roles in the network [ , ] . in the present study, correlations between microbial otus and dom formulae were used to construct the network, and many microbial otus were correlated with a large number of dom formulae (fig. a) . accordingly, it is not surprising that those microbial otus were prone to be identified as module or network hubs. however, the identification of keystone module members may highlight the key roles of those microbial taxa in tdom transformation. strikingly, a large number of dom connectors were identified in the studied network ( supplementary fig. a, b) . from an ecological perspective, connectors in the network are proposed as the bridges linking different modules [ , ] . accordingly, it can be speculated that those dom connectors may be key intermediate substrates or products from microbial degradation of tdom. however, ft-icr-ms cannot predict the chemical structure of each formula, so it is difficult to know the actual characteristics of those dom compounds. furthermore, the above speculation needs validation by studying microbial functions and dom structure.

in our incubation experiments, the dom in the original lake waters was difficult to remove, thus any observed dom transformation should reflect a combination of both tdom and original lake dom. however, little bias should have been introduced when comparing the change of dom composition before and after incubation because (1) any labile dom in the original lake waters should have been consumed before incubations, and (2) the tdom used in the lake microcosm experiments was adjusted to an equal level before incubations. 
our results indicated that both biotic and abiotic processes contributed to tdom transformation, because some newly produced dom was detected in both biotic treatments (i.e., g ) and abiotic controls (i.e., g ) (fig. a, b) . this result corroborated previous findings that certain dom compounds in the environment could be degraded either abiotically (such as photo- or thermal degradation) or biotically (e.g., microbial activity) [ ] . nonetheless, a large portion of dom compounds appeared to be stable during the experimental incubations. for example, more than % of dom compounds were shared among the initial samples, biotic treatments, and abiotic controls (g in fig. a) . a likely reason is that most tdom may be resistant to biological or nonbiological degradation and can exist for thousands of years [ ] , but the tdom in our microcosm experiments was incubated for only days. despite the fact that tdom is largely refractory to biological degradation, there are some dom compounds that can be transformed by microbial activity, for example dom compounds in g (fig. a) . it is notable that among those dom compounds that are transformed by microbes, microbial communities in the microcosms with high salinity seemed to exhibit a stronger ability to transform dom compounds of higher carbon numbers than those in the microcosms with low salinity (i.e., dom formulae of g in supplementary fig. s ). we speculated that halophilic/halotolerant microbes in lakes of high salinity have to exploit a broader range of carbon sources due to high energy costs to deal with salinity stress relative to those in lakes of low salinity [ ] , and thus they may have developed an ability to degrade organic matter with a higher carbon number and more complex structures. indeed, many halotolerant/halophilic prokaryotes have been indicated to possess specific enzymes that have a high efficiency to degrade recalcitrant organic matter (e.g., lignin, cellulose, and chitin) [ , ] . 
microbial communities in microcosms with high salinity may selectively consume nitrogen (n)- and phosphorus (p)-containing organic compounds. for example, the microbially transformed dom molecules in the microcosms with high salinity showed a higher relative abundance of n-containing and a lower relative abundance of p-containing formulae than those in the microcosms with low salinity (g in supplementary fig. s ). this difference may be due to different requirements for c/n/p in lakes of different salinity. microbial biomass in different lakes tends to have distinct c:n:p stoichiometry [ ], so microbes in different lakes selectively assimilate c/n/p from organic/inorganic substrates in the environment to keep their stoichiometry at equilibrium [ ] [ ] [ ]. however, the underlying reasons still await further investigation. in addition, it is surprising that the ckl samples with the highest salinity ( g l − ) showed the highest increase in cell abundance and the largest doc decrease during the incubations (tables and ). this observation is inconsistent with the general principle that salinity decreases microbial growth and metabolism [ ]. this inconsistency may be explained if the microbial population in the studied ckl microcosms adopts a high-efficiency strategy for resisting salinity stress and has adapted to grow on tdom. indeed, four microbial otus whose relative abundances increased by > % from the beginning to the end of the incubation were all affiliated with halobacteriales in the ckl samples (supplementary table s ). the halobacteriales have been suggested to adopt the "salt-in" strategy, which is energetically cheap and thus highly efficient for balancing salinity pressure in the environment [ ]. moreover, halobacteriales are capable of growth on cellulose and chitin [ ], which are important components of terrigenous organic carbon [ ].
therefore, it is reasonable that the highest microbial cell increase and doc concentration decrease were observed in the studied ckl microcosms during incubations with tdom addition. in summary, our findings demonstrated that tdom can be transformed by microbial populations from different saline lakes, and that both abundant microbes and copiotrophs contributed to the transformation of tdom. microbial communities in the microcosms with different salinities exhibited different preferences and capabilities in tdom transformation: microbial populations in the microcosms with high salinity showed a stronger capability to degrade high carbon number dom compounds than their counterparts in the microcosms with low salinity. multiple microbial taxa may cooperate with each other to degrade certain kinds of tdom compounds in the studied microcosms. taken together, this study expands our understanding of microbial roles in tdom degradation in saline lake ecosystems.
limnology: lake and river ecosystems quantification of dissolved organic carbon (doc) storage in lakes and reservoirs of mainland china differences in the distribution and optical properties of dom between fresh and saline lakes in a semi-arid area of northern china co emissions from saline lakes: a global estimate of a surprisingly large flux terrestrial carbon and intraspecific size-variation shape lake ecosystems lakes and reservoirs as regulators of carbon cycling and climate plumbing the global carbon cycle: integrating inland waters into the terrestrial carbon budget chemodiversity of dissolved organic matter in lakes driven by climate and hydrology molecular characterization of dissolved organic matter (dom): a critical review increases in terrestrially derived carbon stimulate organic carbon processing and co emissions in boreal aquatic ecosystems experimental insights into the importance of aquatic bacterial community composition to the degradation of dissolved organic matter evidence for the respiration of
ancient terrestrial organic c in northern temperate lakes and streams degradation of terrestrially derived macromolecules in the amazon river temperature-controlled organic carbon mineralization in lake sediments microbial production of recalcitrant dissolved organic matter: long-term carbon storage in the global ocean an introduction to saline lakes on the qinghai-tibet plateau characterization of cdom in saline and freshwater lakes across china using spectroscopic analysis source and biolability of ancient dissolved organic matter in glacier and lake ecosystems on the tibetan plateau microbial response to salinity change in lake chaka, a hypersaline lake on tibetan plateau actinobacterial diversity in microbial mats of five hot springs in central and central-eastern tibet salinity impact on bacterial community composition in five high-altitude lakes from the tibetan plateau, western china do patterns of bacterial diversity along salinity gradients differ from those observed for macroorganisms? 
bacterioplankton community composition along a salinity gradient of sixteen high-mountain lakes located on the tibetan plateau low taxon richness of bacterioplankton in high-altitude lakes of the eastern tibetan plateau, with a predominance of bacteroidetes and synechococcus spp salinity shapes microbial diversity and community structure in surface sediments of the qinghai-tibetan lakes phylum-level archaeal distributions in the sediments of chinese lakes with a large range of salinity prokaryotic community structure driven by salinity and ionic concentrations in plateau lakes of the tibetan plateau amoaencoding archaea and thaumarchaeol in the lakes on the northeastern qinghai-tibetan plateau phosphate measurement in natural waters: two examples of analytical problems associated with silica interference using phosphomolybdic acid methodologies improved method for manual, colorimetric determination of total kjeldahl nitrogen using salicylate microbial diversity in water and sediment of lake chaka, an athalassohaline lake in northwestern china a simple and efficient method for the solid-phase extraction of dissolved organic matter (spe-dom) from seawater improved bacterial s rrna gene (v and v - ) and fungal internal transcribed spacer marker gene primers for microbial community surveys ultra-high-throughput microbial community analysis on the illumina hiseq and miseq platforms graphical-statistical method for the study of structure and reaction processes of coal controlling the false discovery rate: a practical and powerful approach to multiple testing correlation networks finding community structure in very large networks functional cartography of complex metabolic networks molecular ecological network analyses phylogenetic molecular ecological network of soil microbial communities in response to elevated co tracking differential incorporation of dissolved organic carbon types among diverse lineages of sargasso sea bacterioplankton terrestrial dissolved organic 
matter distribution in the north sea modularity and community structure in networks the role of dissolved organic matter bioavailability in promoting phytoplankton blooms in florida bay bacterial consumption of doc during transport through a temperate estuary respiration in the open ocean distinct dissolved organic matter sources induce rapid transcriptional responses in coexisting populations of prochlorococcus, pelagibacter and the om clade the biomass and biodiversity of the continental subsurface pathways for degradation of lignin in bacteria and fungi the emerging role for bacteria in lignin degradation and bio-product formation microbially-mediated transformations of estuarine dissolved organic matter lignin-degrading enzymes the niche of an invasive marine microbe in a subtropical freshwater impoundment bacterial diversity and activity along a salinity gradient in soda lakes of the kulunda steppe biogeography of bacterial communities exposed to progressive long-term environmental change resource partitioning and sympatric differentiation among closely related bacterioplankton microbial community transcriptomes reveal microbes and metabolic pathways associated with dissolved organic matter turnover in the sea nitrogen and phosphorus co-limitation of bacterial productivity and growth in the oligotrophic subtropical north atlantic interactions among dissolved organic carbon, microbial processes, and community structure in the mesopelagic zone of the northwestern sargasso sea uncoupling of bacterial and terrigenous dissolved organic matter dynamics in decomposition experiments microdiversity of extracellular enzyme genes among sequenced prokaryotic genomes natural assemblages of marine proteobacteria and members of the cytophaga-flavobacter cluster consuming low-and high-molecular-weight dissolved organic matter structuring of bacterioplankton communities by specific dissolved organic carbon compounds application of random matrix theory to microarray data for 
discovering functional gene modules cooperative dissolved organic carbon assimilation by a linuron-degrading bacterial consortium functional molecular ecological networks universal molecular structures in natural dissolved organic matter inefficient microbial production of refractory dissolved organic matter in the ocean thermodynamic limits to microbial life at high salt concentrations halotolerant microbial consortia able to degrade highly recalcitrant plant biomass substrate halo (natrono)archaea isolated from hypersaline lakes utilize cellulose and chitin as growth substrates aquatic heterotrophic bacteria have highly flexible phosphorus content and biomass stoichiometry biological stoichiometry from genes to ecosystems stoichiometric controls on carbon, nitrogen, and phosphorus dynamics in decomposing litter element cycling as driven by stoichiometric homeostasis of soil microorganisms
conflict of interest: the authors declare that they have no conflict of interest.
key: cord- -l aa cl authors: wongsrikeao, pimprapar; saenz, dyana; rinkoski, tommy; otoi, takeshige; poeschla, eric title: antiviral restriction factor transgenesis in the domestic cat date: - - journal: nat methods doi: . /nmeth. sha: doc_id: cord_uid: l aa cl studies of the domestic cat have contributed to many scientific advances, including the present understanding of the mammalian cerebral cortex. a practical capability for cat transgenesis is needed to realize the distinctive potential of research on this neurobehaviorally complex, accessible species for advancing human and feline health. for example, humans and cats are afflicted with pandemic aids lentiviruses that are susceptible to species-specific restriction factors. here we introduced genes encoding such a factor, rhesus macaque trimcyp, and egfp, into the cat germline.
the method establishes gamete-targeted transgenesis for the first time in a carnivore. we observed uniformly transgenic outcomes, widespread expression, no mosaicism and no f silencing. trimcyp transgenic cat lymphocytes resisted feline immunodeficiency virus replication. this capability to experimentally manipulate the genome of an aids-susceptible species can be used to test the potential of restriction factors for hiv gene therapy and to build models of other infectious and noninfectious diseases. supplementary information: the online version of this article (doi: . /nmeth. ) contains supplementary material, which is available to authorized users. felis catus has been domesticated for over , years and presently numbers . - . billion worldwide. medical surveillance of this most common companion animal is extensive, and over hereditary pathologies common to both cats and humans are known . the f. catus genome was recently sequenced at light ( . ×) coverage and a × assembly is imminent . over % of identified cat genes have a human homolog, and compared with the mouse there are fewer genomic rearrangements. intermediate size, prolific breeding capacity, similarity of systems to humans, abundance, modest costs and the neurobehavioral complexity of a carnivoran make the cat of value in experimental settings ranging from neurobiology to diverse genetic, ophthalmologic and infectious diseases. these include conditions in which mice or rats are not useful on the basis of disease susceptibility, organ size or other factors . cat transgenesis is thus of interest for both human and cat health research and potentially for developing ways to confer protection from epidemic pathogens to free-ranging feline species, all of which now face the threat of extinction . the world has two aids pandemics, one in domestic cats and the other in humans. 
the causative lentiviruses, feline immunodeficiency virus (fiv) and hiv- , are highly similar in genome structure, disease manifestations and host cell dependency factor use , . the differences between these lentiviruses are also informative and potentially exploitable. for example, species-specific lentiviral restriction factors such as trim and apobec proteins restrict fiv and hiv- with distinctive patterns [ ] [ ] [ ] [ ] . these genes have not been studied in a controlled manner at the systemic and species levels by introduction into the genome of an aids virus-susceptible species (old world primates or felids). given the challenges inherent to macaque transgenesis, the aids virus-susceptible cat would be singularly positioned for such studies if it can be accessed by genetic approaches used in mice. in contrast to primates, feline species lack antiviral trim α genes but have potently restrictive apobec proteins , , which sets up intriguing possibilities for testing such genes at the whole-animal level, for conferring gene-based immunity with them or engineered variants , , and potentially for hiv- disease model development . to realize the potential of the species for virology and nonvirology models, a means for practical cat genome modification is needed. somatic cell nuclear transfer (scnt) was recently used to generate cats that express fluorescent proteins , . however, the efficiency of animal cloning is extremely low , and scnt results in faulty epigenetic reprogramming in most embryos . cloned mammals with apparently normal gross anatomy can have many abnormalities resulting from failure to erase and reprogram epigenetic memory completely . the two key approaches for generating transgenic mice are dna injection into fertilized embryo pronuclei and injection of genetically modified embryonic stem cell (esc) lines into blastocysts. 
however, in nonrodent mammals, pronuclear injection is very inefficient, and the second method is blocked by the lack of germline-competent escs. transgenesis with germline transmission has been achieved in some mammals by microinjecting lentiviral vectors into oocytes or single-cell zygotes. this has not been achieved in any carnivore species. here we performed oocyte-targeted lentiviral transgenesis in the domestic cat.
results
multi-transgenic, nonmosaic cat embryo generation
we optimized reagents, gamete collection, microinjection parameters, embryo culture and recipient queen preparation to establish an optimal cat transgenesis protocol (fig. a). we obtained gametes from both sexes without additional animal procedures by microdissecting gonads discarded after spaying or neutering. in experiments summarized in supplementary table , we subjected in vitro-matured grade i and ii domestic cat oocytes to perivitelline space microinjection (pvsmi) with lentiviral vector tsing ; we performed injection - h before or - h after in vitro fertilization (ivf) (supplementary fig. ). then we cultured these embryos until blastocyst stage (day ). comparisons of embryo development rates (supplementary tables and ) and enhanced gfp (referred to as gfp throughout) expression (fig. b) showed that transgenesis rates were high (> %) and the process was well tolerated, as cleavage and blastocyst formation rates did not differ substantially between pvsmi and control embryos (supplementary table ). there were no differences in morphology or total cell number and no preference for vector injection timing before or after ivf (supplementary table ). however, mosaicism scored by nonuniform fluorescent protein expression in the blastocyst was negligible when we injected vectors before ivf but was substantial with injection after ivf (supplementary table ).
to investigate whether more than one transgene could be expressed in cat embryos in a single step by pvsmi, we microinjected oocytes with single- or dual-transgene lentiviral vectors. transgene assemblages were genes encoding gfp, gfp plus rfp, or gfp plus rhesus macaque trimcyp (supplementary fig. ). the latter combination was expressed from either a dual promoter or as a single a peptide-linked preprotein. after microinjection we performed ivf with cat sperm h later. we consistently observed embryo-pervasive, abundant expression of both proteins encoded by dual gene vectors in cat blastocysts when we injected lentiviral vector before ivf (fig. b and supplementary table ). we observed no detrimental effects of dual expression on embryo development or gfp expression irrespective of transgene combination (supplementary table ). in addition, the a peptide and the dual promoter were each effective for simultaneous expression. the process from oocyte collection to fallopian tube transfer took - d (fig. a). we randomly selected embryos for implantation from cleaved oocytes that had been subjected to ivf and transferred them into surgically exposed fallopian tubes at - h after lentiviral vector transduction. we carried out no preselection for transgene expression after microinjection (embryos were in any case not reliably fluorescent by the time of transfer). we performed transfers into hormonally synchronized queens prepared by a - h light-dark environment. we administered pregnant mare serum gonadotropin to the queens on day - and human chorionic gonadotropin on day - with respect to lentiviral vector transduction, and mated them ad lib from the day of human chorionic gonadotropin injection until the day before embryo transfer with a vasectomized, azoospermia-verified tomcat to induce ovulation and corpus luteum formation. during surgery we punctured follicles with a needle if they had not ovulated naturally.
twenty-two embryo-transfer procedures resulted in five pregnancies (labeled a-e), five births and three live kittens ( table ) . we achieved a high rate of transgenesis, with of testable live-born or fetal offspring found to be transgenic (a twelfth, spontaneously miscarried d preterm, was consumed by the surrogate mother and could not be tested). three male and two female transgenic cats, named tgcat - , were born by spontaneous vaginal deliveries at term and all five were transgenic (fig. , table and supplementary fig. ). tgcat (male), tgcat (male) and tgcat (female) survived, whereas the fourth and fifth cats died perinatally from obstetrical complications ( table ) . tgcats - were vigorous from birth, fed, played, developed and socialized normally and were healthy, with the exception that tgcat is unilaterally cryptorchid. he also has intermittent pruritic dermatitis, which may be due to a food allergy. in the first year he developed a ventral abdominal hernia and a lower eyelid irritation (entropion), both of which we cured surgically. although we cannot exclude vector-insertion genotoxicity in tgcat , the conditions do not constitute a recognizable syndrome. southern blotting on restriction enzyme-digested genomic dna from the three living transgenic kittens, from tgcat and from four miscarried fetuses showed that all eight were transgenic, with - insertions per cat (fig. b) . pcr assays on genomic dna confirmed the high level of genomic transduction (fig. c) . southern blot hybridization bands were specific, as all were (i) absent from control cat dna, (ii) different from cat to cat and (iii) of greater than the predicted minimum size determined by the distance from restriction site to end of the vector provirus ( fig. b) . sequencing of proviral genomic dna junctions (n = ) from two cats was performed and each was a bona fide retroviral integration junction, with the genomic sequences mapping to the cat genome (supplementary table ). 
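as a rough illustration of the uncertainty around a rate based on small counts, such as the five pregnancies from twenty-two embryo-transfer procedures reported above, the following python sketch computes a wilson score interval for a binomial proportion. the choice of the wilson interval, the 95% level (z = 1.96), and the function name `wilson_interval` are assumptions made here for illustration; this is not an analysis performed in the study.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval (assumed here).
    Returns (lower, upper) as fractions between 0 and 1.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / trials
                          + z2 / (4.0 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Counts taken from the text: 5 pregnancies from 22 transfer procedures.
    low, high = wilson_interval(5, 22)
    print(f"pregnancy rate 5/22 = {5 / 22:.2f}, interval {low:.2f} to {high:.2f}")
```

the wilson form is used in this sketch because it behaves better than the simple normal approximation when counts are small, which is the situation with a few dozen transfer procedures.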
tgcat , in which transgene expression was driven by the standard ( . -kilobase) human cytomegalovirus (hcmv) promoter of vector tsint ag, was brightly and stably green fluorescent in integument and oropharyngeal mucosa surfaces (fig. a), but surface tissue expression was less bright for tgcat and tgcat (vector tbdmgpt). for the live kittens, we collected cells for protein analyses by oral mucosa scrapings (which showed gfp-expressing squamous epithelial cells), and by blood and semen collection. both transgenes were expressed in activated peripheral blood mononuclear cells (pbmcs) but with notable variation (fig. a,b). percentages of gfp-positive cells as determined by facs were - % in tgcat , tgcat and tgcat and increased gradually as the kittens aged (fig. b,c). tgcat had the most gfp-positive cells in the pbmc compartment, being about % gfp-positive early in life and then over - % later (fig. a-c and supplementary fig. ). several specific aspects here are interesting for developing models that will depend on lymphocyte or monocyte lineage expression. first, irrespective of promoter used, facs and immunoblot detection of gfp and rhtrimcyp in pbmcs in living cats required activation by phytohemagglutinin-e (pha-e) and interleukin (il- ), and gfp expression increased steadily with time in culture (fig. c). fluorescence intensity was variable (fig. b and supplementary fig. a). second, driving gfp expression from a minimal cmv (mcmv) promoter element adjacent to the pgk promoter was effective in tgcat , but we observed only low gfp expression with the same vector in tgcat , although even in this cat gfp expression increased steadily from rare positives to . % by months (fig. b,c). third, all three cats expressed hemagglutinin epitope (ha)-tagged rhtrimcyp in the bulk pbmc population as detected by immunoblotting (fig. a). tgcat consistently expressed more rhtrimcyp than the other two living cats by quantitative western blot analysis.
however, this protein was more difficult to detect than gfp, and was clearly visualized by immunofluorescence, using an antibody to the ha tag, in only a fraction of the cells (supplementary fig. ) . even so, rhtrimcyp transgenic cat pbmcs displayed resistance to fiv replication, with the greatest resistance to replication seen in cells from the cat that expressed the most rhtrimcyp (tgcat ; fig. d) . the resistance to fiv replication was partial, as predicted for cell populations that express such a restriction factor partially . washed swim-up purified sperm from the two males had normal motility and strongly expressed the transgene as determined by pcr (fig. ) . consistent with this result and with the lack of embryo mosaicism when ivf was done after vector microinjection (supplementary table ) germline transmission was readily achieved by direct mating, with all progeny being transgenic. therefore, the transgenesis procedure preserves fertility, and the germline is transduced. transgene expression persisted in the f offspring of transgenic f parents, indicating that silencing did not occur (fig. ) . matings of tgcat with three nontransgenic queens produced five additional kittens from three pregnancies. similar to the sire, they were less surface green-fluorescent but were strongly 'pcr-and southern blot-positive' (data not shown); of these one died perinatally owing to dystocia associated with a hypocontractile uterus. thus, all f cats were transgenic; of were alive and healthy. tgcat was born after an uncomplicated singleton pregnancy at a normal gestation time ( d). it was morphologically normal but died during or shortly after parturition from an apparent obstetrical accident involving aspiration, although a precise cause could not be determined at autopsy. this cat provided the opportunity to study all tissues (fig. a) . detailed organ examination and histology did not identify abnormalities. 
tgcat is the product of an oocyte transduced by the tsing vector, in which gfp was driven by the hcmv promoter, and had ~ vector insertions (fig. ) . as was tgcat , the kitten was brightly green fluorescent in fur and skin, and immunoblotting revealed abundant gfp expression in all tissues tested: brain, spinal cord, heart, spleen, skin, muscle, liver, kidney, small intestine and stomach ( fig. a and supplementary fig. ) . solid viscera were visibly green fluorescent at the gross level, as were adipose tissues (for example, all omental and pericardial fat) and antibody labeling of fixed tissue showed uniform expression in all cells (fig. c) . when fresh tissue was sectioned and imaged directly for gfp by epifluorescence microscopy, pervasive expression was similarly evident (fig. d) . a fourth pregnancy (c; table ), for which we identified five well-formed, appropriately sized fetal skeletons by x-ray analysis at day of gestation ( supplementary fig. d) , ended in serial miscarriages between days and (~ d before term). we recovered four of these preterm cats (named tgpre - ; table ) for gross and molecular autopsy. dissection did not identify birth defects. as for tgcat - , southern blotting showed that tgpre - were each amply transgenic, with - genomic tsing vector insertions (fig. b) and gfp expression was similarly found in all tissues tested (fig. a,c) . we also probed rhtrimcyp expression (supplementary fig. ) using organs from a cat that was stillborn after a placental abruption (table ; tgcat ), and observed that rhtrimcyp expression was similarly widespread, including in the main lymphoid organs (lymph node, thymus and spleen). consistent with the immunoblotting data, tissues of individual organs were green fluorescent at the gross level. facs of fetal pbmcs from tgpre - showed that - % were gfp-positive (fig. e) . southern blots of genomic dna from the products of non-singleton pregnancies (figs. 
b and b) , showed also that each was the genetically unique product of a different transduced oocyte, and none were a product of twinning after transduction. our results indicate that transgenic cats may be used as experimental animals for biomedical research. the approach enables transgenesis by germ cell genetic modification for the first time in this species and in any carnivore. notably, we achieved uniformly transgenic outcomes, which reduce screening cost and time. a second implication of the high efficiency and the copy numbers achieved is that it should be possible to titrate vector dose down or to microinject a mix of vectors into one oocyte to produce complex multi-transgenics. the approach is accessible: feline oocytes competent for efficient transgenesis are readily obtained noninvasively and without added animal procedures from ovaries discarded during routine spaying (laparoscopic or ultrasound-guided percutaneous oocyte retrieval is also feasible). in vitro blastocyst development rates were higher than had been seen previously with scnt-developed transgenic embryos ( - % versus %) . we prevented mosaicism by microinjection before ivf and observed germline transmission. the persistence of transgene expression in f cats is encouraging for establishing useful transgenic lines. the lack of multiple inbred strains of cat, a current limitation, could be addressed in a focused breeding project. introducing a lentiviral restriction factor(s) into the genome of the cat has specific potential because this species is naturally susceptible to lentiviral infection (and aids) whereas mice, unmodified or transgenic, are not. several questions can therefore now be addressed. first, it is unknown whether introducing a single active restriction factor into the genome of an aids virus-susceptible species can protect it, and if so, at which of three broadly considered levels: transmission, establishment of sustained viremia and disease development. 
when antiviral genes are interrogated at the whole animal level by transgenesis in a natural host, results can be surprising and informative. for example, a recent transgenic intervention against influenza in chickens prevented secondary virus transmission to transgenic and nontransgenic contacts, but it had no effect on mortality after primary virus challenge. because species-specific lentiviral restriction factors have not been tested by controlled experimental introduction into an animal, the most fundamental question directly answerable with the approach is whether restriction factor transgenesis can mimic natural experiments that normally take place over large expanses of evolutionary time, with selection by viral culling, and render a species genetically immune to its own lentivirus. it is not possible to make clear predictions. for example, there are natural macaque and sooty mangabey trim alleles that do not block simian immunodeficiency virus transmission to animals that carry them but appear to constrain the extent of replication in vivo and to exert selection pressure on the capsid. when breeding expansion is completed with our present restriction factor transgenic cats, fiv challenges can be done. whether or not more than one restriction factor will be needed to achieve antiviral protection, the concept of using them for this purpose in gene therapy has stimulated efforts to devise non-immunogenic human and feline versions , . both of these recently bioengineered trimcyps restrict fiv and can be tested in our system. indeed, fiv is unique among lentiviruses in being restricted by both old and new world monkey trimcyps. we speculate that feline transgenesis with host defense molecules could also confer protection from viral pathogens to wild feline species, all of which face accelerating extinction threats and which are among the most charismatic, ecosystem-iconic taxa in the carnivora. cat transgenesis could have additional impact.
as we recently proposed, the domestic cat may have potential for modeling hiv- disease itself because, except for entry receptors, the cat genome can supply the dependency factors needed for hiv- replication . this is a fundamental difference compared to the mouse . gene knockdowns and targeting are foreseeable by combining our approach with current technologies. furthermore, transgenesis in this accessible, abundant species with intermediate size and complex neurobehavioral repertoire will permit other human-relevant models in areas such as neurobiology, where the cat is already a paramount model. studies in the cat have revealed much of the present knowledge on organization of the mammalian brain, in particular the visual cortex [ ] [ ] [ ] [ ] [ ] ; work in this area has been critical to unraveling the neural mechanisms of vision. although transgenesis in this species will not be as common as in rodents, the creation of a small number of lines with genetic tools could build on the large knowledge base in the species to dramatically alter capability for understanding the cerebral cortex. transgenic mice have many advantages, but fundamental differences with human physiology limit their utility in many ways. many diseases cannot be modeled in mice or rats, with size alone being sometimes intrinsically limiting. transgenesis has been performed in marmosets , and, so far without demonstrated germline transmission, in macaques . these two primate models have clear promise, but limitations arise from scarcity, expense, longer gestation times and, for macaques, prolonged time to sexual maturity ( - years) and the requirement to shield handlers from casually transmitted cercopithecine herpesvirus . for the purpose of aids-relevant work, new world monkeys such as marmosets are not susceptible to any lentivirus. even with a generic viral promoter we observed transgene expression in of cat organs tested. 
we observed rhtrimcyp expression in the main aids-relevant lymphoid tissues (lymph node, spleen and thymus). mature circulating hematopoietic lineages have notoriously specialized transcriptional environments, but - % of pbmcs in the living cats were gfp-positive in culture. variation may reflect genome positional effects. whereas tissue-specific or alternative promoter or enhancer elements can be used, cats with partial pbmc expression profiles also provide a good experimental opportunity because they allow the question of virus-mediated cell lineage selection in vivo to be addressed, modeling a realistic cell-based therapy situation, for example, gene therapy for hiv- disease. one important issue is whether fiv infection will result in long-term selection of a virus-refractory lymphocyte population as has been observed in nonobese diabetic severe combined immune deficiency (nod-scid) il rγ null mice transplanted with ccr −/− human cd cells . conversely, if systemic viral replication occurs, we can determine whether escape mutations arise. methods and any associated references are available in the online version of the paper at http://www.nature.com/naturemethods/. funding from us national institutes of health grants ai and ey assisted prior key technology developments. we thank the helen c. levitt foundation for initial pilot funding and a. keller for coordinating it, members of our laboratory for helpful discussions and assistance, h. fadel for assisting with site-directed mutagenesis, members of our transgenic mouse core for sharing microinjection equipment, g. towers (university college london) for a rhtrimcyp cdna, and mayo clinic veterinary staff for advice and surgical assistance. 
the domestic cat, felis catus, as a model of hereditary and infectious disease
state of cat genomics
feline leukemia virus and other pathogens as important threats to the survival of the critically endangered iberian lynx (lynx pardinus)
shared usage of the chemokine receptor cxcr by the feline and human immunodeficiency viruses
an essential role for ledgf/p in hiv integration
restriction of retroviral replication by apobec g/f and trim alpha
restriction of feline immunodeficiency virus by ref , lv and primate trim a proteins
independent evolution of an antiviral trimcyp in rhesus macaques
functions, structure, and read-through alternative splicing of feline apobec genes
productive replication of vif-chimeric hiv- in feline cells
truncation of trim in feliformia explains the absence of retroviral restriction in cells of the domestic cat
potent lentiviral restriction by a synthetic feline trim cyclophilin a fusion
potent inhibition of hiv- by trim -cyclophilin fusion proteins engineered from human components
generation of cloned transgenic cats expressing red fluorescence protein
generation of domestic transgenic cloned kittens using lentivirus vectors
mammalian nuclear transfer
nuclear reprogramming to a pluripotent state by three approaches
lentiviral transgenesis-a versatile tool for basic research and gene therapy
mode of transmission affects the sensitivity of human immunodeficiency virus type to restriction by rhesus trim alpha
suppression of avian influenza transmission in genetically modified chickens
trim alpha modulates immunodeficiency virus control in rhesus monkeys
multiple blocks to human immunodeficiency virus type replication in rodent cells
receptive fields, binocular interaction and functional architecture in the cat's visual cortex
the period of susceptibility to the physiological effects of unilateral eye closure in kittens
innate and environmental factors in the development of the kitten's visual cortex
development of cat visual cortex following rotation of one eye
neuronal circuits of the neocortex
generation of transgenic non-human primates with germline transmission
toward a transgenic model of huntington's disease in a non-human primate
human hematopoietic stem/progenitor cells modified by zinc-finger nucleases targeted to ccr control hiv- in vivo
the entire rhesus trimcyp transgene ( . kb) was amplified using nm each sense primer ′-atgtacccatacgatgttcc- ′ and antisense primer ′-gccgcttattcgagttgcc- ′. the program included an initial denaturation step at °c for s. pcr amplification was performed as follows: °c for s, °c for s, °c for s. a final extension step at °c for min concluded the program.
reproductive patterns of cats. in mcdonald's veterinary endocrinology and reproduction
effect of light manipulation on ovarian activity and melatonin and prolactin secretion in the domestic cat
effect of the quality of the cumulus-oocyte complex in the domestic cat on the ability of oocytes to mature, fertilize and develop into blastocysts in vitro
lentiviral vectors
production and use of feline immunodeficiency virus (fiv)-based lentiviral vectors
intensive rnai with lentiviral vectors in mammalian cells
development of a peptide-based strategies in the design of multicistronic vectors
coordinate dual-gene transgenesis by lentiviral vectors carrying synthetic bidirectional promoters
nucleotide sequence and genomic organization of feline immunodeficiency virus
efficient transduction of nondividing cells by feline immunodeficiency virus lentiviral vectors
all authors designed experiments, analyzed data and critiqued the manuscript. e.p. conceived the project and recruited p.w. and t.o. e.p. and t.o. oversaw the project. p.w. and d.s. produced vector and retrieved gametes; p.w. microinjected vector and did embryo cultures. p.w. transferred embryos with assistance from t.r. and e.p. with surgery. p.w., d.s. and t.r. monitored cats, did cell and tissue assays and virology. p.w., d.s. and e.p. wrote the manuscript.
the authors declare no competing financial interests.
published online at http://www.nature.com/naturemethods/. reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
gametes used for embryo formation were obtained from gonads discarded after routine elective sterilization. oocyte-cumulus complexes (cocs) were recovered within h by repeated fine slicing of ovarian tissue in modified phosphate-buffered saline (mpbs) supplemented with mg ml − bovine serum albumin (bsa) and µg ml − l gentamicin. only grade i and ii oocytes were used. selected cocs were washed and matured in modified tcm- (gibco) containing iu ml − human chorionic gonadotropin (hcg), . iu ml − pregnant mare serum gonadotropin (pmsg), µg ml − epidermal growth factor (egf), and mg ml − bsa in a humidified atmosphere of % co in air at °c for h.
in vitro fertilization and in vitro culture. twenty-eight hours after ivm, cooled spermatozoa were washed twice in brackett-oliphant medium supplemented with µg ml − sodium pyruvate, mg ml − bsa, and µg ml − gentamicin by centrifugation at rpm for min. the supernatant was removed and the sperm pellet was diluted in µl fertilization medium (g-ivf plus, vitrolife) and placed in the incubator to allow sperm swim-up for min. the spermatozoa concentration was adjusted to × ml − . ten oocytes were transferred into each of µl sperm microdroplets under mineral oil and co-cultured for h, after which presumptive zygotes were removed from sperm with a small-bore pipette, washed, and cultured in a modified earle's balanced salt solution (mk- ) supplemented with mg ml − bsa and µg ml − gentamicin for d. three days after sperm exposure, cleaved embryos were selected for transfer or subsequently cultured in mk- medium supplemented with % (v/v) fbs (hyclone laboratories) and µg ml − gentamicin for a further days to evaluate developmental capacities to morula and blastocyst stages.
transgenic embryo production.
before lentiviral vector microinjection, cumulus cells were mechanically removed from the oocytes - h after incubation in maturation medium (pre-ivf injection group) or from presumptive zygotes at - h post-incubation in fertilization medium (post-ivf injection group). a volume of ~ pl vector was injected directly into the oocyte perivitelline space h before or h after ivf using a finely pulled glass capillary (femtotips, eppendorf) connected to a microinjector (eppendorf femtojet) adjusted for injection and compensation pressure with an injection time of s. after microinjection, the oocytes were washed and returned to culture in ivm medium until hour of maturation, when they were used for ivf. for post-ivf injection, zygotes were washed and subsequently cultured in mk- medium. with the conditions developed, oocytes were modified at high rates without apparent toxicity to the zygotes' early development, and timing microinjection before fertilization reliably created non-mosaic embryos.
embryo transfer, pregnancy detection, parturition and photography. healthy - -year-old spf queens were the recipients for embryo transfer. they were induced with iu pmsg injected intramuscularly at - h before ivf, followed by injection of iu of hcg h after the pmsg. in addition, ad lib mating with a vasectomized male was done from the day of hcg injection until the day before embryo transfer.
the females were anesthetized on the day of transfer with ketamine ( mg kg − ), medetomidine ( . mg/kg) and buprenorphine ( . mg kg − ) administered intramuscularly and maintained with - % isoflurane gas. prior to abdominal incision the medetomidine was reversed with an intramuscular injection of atipamezole to minimize any effects the alpha- agonist may have on transfer success. an approximately cm ventral midline incision was made and the ovaries and fallopian tubes were exteriorized. each ovary was examined for evidence of ovulation.
if no corpus hemorrhagicum or corpus luteum was visualized, follicles were punctured with a needle to artificially induce ovulation. then, a transmural puncture of the fallopian tube was performed with a gauge needle and this was replaced with a fine hand-pulled glass transfer pipette, through which fifteen to twenty-five pre-loaded embryos (transduced, cleaved, > cell stage) in - µl mk- medium were transferred per fallopian tube under microscopic visualization using gentle positive mouth-controlled pressure. the pipette was withdrawn and the incision was closed in three layers.
pregnancy status was determined with a canine relaxin kit (synbiotics) on day after transfer and by film radiography on day . pregnant recipients were monitored daily until delivery of term kittens, which occurred by unassisted spontaneous vaginal birth. all control and transgenic animal photographs were taken with a nikon camera at the same time using identical lighting, filter, and camera settings, with gfp imaged under blue light illumination with a long pass filter. supplementary figure contains additional images.
immunofluorescence microscopy and immunohistochemistry. blastocysts (fig. c) were attached to a slide with bd-cell tak, cell and tissue adhesive, fixed and permeabilized for min at room temperature in pbs supplemented with % (w/v) paraformaldehyde and % (v/v) triton x- and blocked with % bsa in pbs for min. transduced and control blastocysts and activated pbmcs were imaged by confocal microscopy with gfp fluorescence imaged directly and ha-tagged rhtrimcyp detected using primary anti-ha (high affinity anti-ha rat monoclonal, roche, used at : dilution), with incubation for h at rt, washed, followed by incubation with cy -conjugated goat anti-rat igg secondary ( : dilution, chemicon international) for h. controls with each protein alone verified no signal cross-reception between channels, and blastocysts derived from untransduced ova were negative as shown.
following three min washing steps in pbs and mounting with addition of prolong gold anti-fade reagent with dapi (invitrogen) for nuclear dna staining, the embryos were analyzed by laser confocal microscopy (axiovert m; carl zeiss microimaging).
animal tissues were fixed with % paraformaldehyde and paraffin-embedded. serial µm sections were made. immunohistochemistry was performed using a dako envision plus kit. sections were dewaxed in xylene and rehydrated in alcohol. endogenous peroxidase activity was blocked with . % hydrogen peroxide. sections were incubated with a : diluted primary mouse monoclonal antibody (clontech, jl , : ) for h. dako envision anti-mouse secondary antibody ( : ) was then applied for min. the sections were mounted using prolong gold anti-fade reagent and observed by light microscopy.
vectors and fiv infections. all vectors and vector sequences are available from the authors upon request. lentiviral vectors were hiv- -based to permit pcr-based tracking of infectious fiv in future experiments. gfp is the enhanced version (egfp). tsin series lentiviral vectors were previously described and were prepared using t transfection in nunc cell factories and concentrated by ultracentrifugation using established methods [ ] [ ] [ ] . the transfer vectors have cppt-cts and wpre elements and are u -deleted. dual gene vectors with rhesus (macaca mulatta) trimcyp and egfp utilize either a porcine teschovirus a peptide expressing a single pro-protein (human cytomegalovirus immediate early gene (hcmv)-promoted rhtrimcyp-p a-gfp) or a bi-directional promoter kindly provided by amendola et al. with tandemly arranged phosphoglycerate kinase (pgk) and minimal cmv (mcmv, . kb) promoter elements driving rhtrimcyp and gfp respectively on opposite strands. vsv-g-pseudotyped vectors were produced in two-chamber cell factories (cf ) and concentrated by ultracentrifugation over a sucrose cushion as described.
vectors were titrated on feline kidney cell line (crfk) cells using flow cytometry for gfp expression. reverse transcriptase activities were used to normalize preparations. pbmcs were cultured in rpmi with % fcs, rhil- and antibiotics and were activated with µg ml − pha-e. for fiv infection of pbmcs, , feline pbmcs were infected with × rt activity units ( µl) of fiv tf generated by t cell transfection of pct orfarep , a version of pct in which we repaired the premature orf-a stop codon by overlap extension pcr to enable pbmc replication. supernatants were collected approximately every d thereafter and assayed for reverse transcriptase activity as described above.
immunoblotting. transfected cell lysate or minced tissue samples were homogenized in ripa ( mm nacl, . % deoxycholate, . % sodium dodecyl sulfate, % np- , mm tris-hcl, ph . ) supplemented with protease inhibitors (complete-mini, boehringer). fractions and lysates were boiled in laemmli supplemented with β-mercaptoethanol for min, separated by gel electrophoresis, transferred onto pvdf membranes (immobilon-p, millipore), and blocked in mpbs containing mg ml − bsa and % tween for h at room temperature ( - °c). blots were treated with primary antibodies against gfp (jl , : , clontech), α-tubulin (mouse monoclonal antibody : , , sigma) and ha (high affinity anti-ha antibody, rat monoclonal, : , , roche, cat # ) for h at room temperature. after washing, secondary antibodies were applied: alkaline phosphatase-conjugated goat anti-mouse igg (calbiochem) diluted : , , and alkaline phosphatase-conjugated goat anti-rat igg (santa cruz biotechnology, inc., santa cruz, ca) diluted : . membranes were then incubated with ecl reagent (thermo scientific) and exposed to film.
sperm collection and storage. epididymides were separated by dissection within h and repeatedly finely sliced in mpbs supplemented with mg ml − bsa and µg ml − gentamicin to release spermatozoa.
the medium was filtered with a µm cell strainer (bd falcon) and centrifuged at , rpm for min. sperm pellets were resuspended in µl test yolk buffer (refrigeration medium, irving scientific) in a . -ml microcentrifuge tube at room temperature and gradually cooled to °c. the samples were kept at °c until use, or cryopreserved in liquid nitrogen. sperm of transgenic males was obtained by electroejaculation.
southern blotting. genomic dna of newborn and spontaneously aborted kittens was analyzed by southern blot hybridization and pcr. total dna was isolated from blood, tail tips and heart using the dneasy blood and tissue kit (qiagen). five micrograms of dna was digested with afliii, bamhi or ndei as indicated. dna fragments were separated by electrophoresis on a . % agarose gel and transferred by capillary action to a nytran supercharge membrane (schleicher & schuell bioscience). dna was crosslinked to the membrane using a uv crosslinker (uvc ; hoefer). blots were then hybridized overnight at °c in ultrahyb (ambion) containing an p-labeled egfp probe. after washing at °c with . % sds, × ssc followed by . % sds, . × ssc, the blots were exposed to kodak biomax ms x-ray film (sigma-aldrich) with an intensifying screen at - °c and developed. bands in figure b and the right blot of figure b are more widely spaced than bands in the left blot of figure b because ndei and bamhi cleave, on average, every , bp apart, while afliii cuts on average every nt bp.
quantitative rtpcr and semiquantitative pcr. transgenic and control genomic dna samples (pbmc, tail tip and organs) were analyzed by real-time quantitative pcr using the roche faststart dna master sybr green kit i. samples were quantified against a serially diluted plasmid standard for total gfp using the roche lightcycler and roche lcda software. initial denaturation was at °c for min, and a melting step followed amplification ( - °c, temperature transition rate = . °c s − ).
gfp was amplified using nm each sense primer ′-agaacggcatcaaggtgaac- ′ and antisense primer ′-tgctcaggtagtggttgtcg- ′. pcr amplification and analysis were performed as follows: °c for s, °c for s, °c for s, × cycles, temperature transition rate = °c s. as a loading control, feline gapdh was quantified using nm each sense primer ′-accacagtccatgccatcac- ′ and antisense primer ′-tccaccacccggttgctgta- ′. pcr amplification and analysis were performed using a roche lightcycler as follows: °c for s, °c for s, °c for s, × cycles, temperature transition rate = °c s. semiquantitative analysis for rhesus trimcyp was performed using phusion hot start high-fidelity
key: cord- -v queyh authors: storch, david-maximilian; timme, marc; schroder, malte title: incentive-driven discontinuous transition to high ride-sharing adoption date: - - journal: nan doi: nan sha: doc_id: cord_uid: v queyh
ride-sharing - the combination of multiple trips into one - may substantially contribute towards sustainable urban mobility. it is most efficient at high demand locations with many similar trip requests. however, here we reveal that people's willingness to share rides does not follow this trend. modeling the fundamental incentives underlying individual ride-sharing decisions, we find two opposing adoption regimes, one with constant and one with decreasing adoption as demand increases. in the high demand limit, the transition between these regimes becomes discontinuous, switching abruptly from low to high ride-sharing adoption. analyzing over million ride requests in new york city and chicago illustrates that both regimes coexist across the cities, consistent with our model predictions. these results suggest that current incentives for ride-sharing may be near the boundary to the high-sharing regime such that even a moderate increase in the financial incentives may significantly increase ride-sharing adoption.
sustainable mobility [ ] [ ] [ ] [ ] [ ] [ ] is essential for ensuring socially, economically and environmentally viable urban life [ , ] . ride-sharing constitutes a promising alternative to individual motorized transport currently dominating urban mobility [ ] . recent analyses suggest that large-scale ride-sharing is specifically suited for densely populated urban areas [ ] [ ] [ ] [ ] [ ] . by combining two or more individual trips into a shared ride served by a single vehicle, ride-sharing increases the average utilization per vehicle, reduces the total number of vehicles required [ ] and thereby mitigates congestion and environmental impacts of urban mobility [ ] . hence, embedding ride-sharing for trips that would otherwise be conducted in a single-occupancy motorized vehicle, is preferable from a systemic perspective. previous research focused on algorithms to implement large-scale ride-sharing [ ] as well as the potential efficiency gains derived from aggregating rides [ , , ] . generally, matching individual rides into shared ones without large detours becomes easier with more users, increasing both the economic and environmental efficiency as well as the service quality of the ride-sharing service [ , , ]. yet, if and under which conditions people are actually willing to adopt ride-sharing remains elusive [ ] [ ] [ ] [ ] [ ] [ ] . in particular, it is unclear how to encourage an ever growing number of ride-hailing users to choose shared rides over their current individual mobility options [ ] [ ] [ ] . in this article, we disentangle the complex incentive structure that governs ride-hailing users' decisions to share their rides -or not. 
in a game theoretic model of a one-to-many demand constellation we illustrate how the interactions between individual ride-hailing users give rise to two qualitatively different regimes of ride-sharing adoption: one low-sharing regime where the adoption decreases with increasing demand and one high-sharing regime where the population shares their rides independent of demand. analyzing ride-sharing decisions from approximately million ride-requests in new york city and an additional million in chicago suggests that both adoption regimes coexist in these cities, consistent with our theoretical predictions. our findings indicate that current financial incentives are nearly, but not fully, sufficient to stimulate a transition towards the high-sharing regime.
fig. : contrasting ride-sharing adoption despite high request rate in new york city. fraction of shared ride requests from different origins (red) served by the four major for-hire vehicle transportation service providers in new york city by destination zone (january -december ) [ ] . gray areas were excluded from the analysis due to insufficient data (see methods). the fraction of shared ride requests differs significantly by origin and destination, even though the average overall request rate is similar for all four origin locations. (a,b) some areas, such as east village and crown heights north, show a high adoption of ride-sharing services. (c,d) despite a similarly high request rate, other locations, such as jfk and laguardia airports, show a significantly lower adoption of ride-sharing services with a complex spatial pattern across destinations.
ride-sharing adoption (see supplementary information for details). these findings hint at a complex interplay of urban environment, demand structure and socio-economic factors that govern the adoption of ride-sharing.
to disentangle these complex interactions, we introduce and analyze a game theoretic model capturing essential features of ride-sharing incentives and disincentives as well as the topological demand structure.
fig. : trade-offs between incentives determine the decision to share a ride, or not. (a) shared rides offer advantages and disadvantages compared to single rides. on the one hand, they offer financial discounts typically proportional to the distance of a direct single ride (blue, dotted). on the other hand, rides shared with strangers may be inconvenient due to other passengers in the car (e.g. loss of privacy or less space, green) and may include detours compared to a direct trip to pick up or deliver these other passengers (orange, solid compared to dotted). (b) the decision to book a shared ride depends on the balance of all three factors. if the expected utility difference $E[\Delta u] = E[u_{\mathrm{share}}] - E[u_{\mathrm{single}}]$ between a shared and a single ride is positive, the financial discounts overcompensate detour and inconvenience effects; users share. if $E[\Delta u]$ is negative (as illustrated), users prefer to book single rides.
the decision of ride-hailing users to request a single or a shared ride reflects the balance of three fundamental incentives (fig. ) [ , ] :
discounts. ride-sharing is incentivized by financial discounts granted on the single ride trip fare, partially passing on savings of the service cost to the user. often, these discounts are offered as percentage discounts on the total fare such that the financial incentives $u^{\mathrm{share}}_{\mathrm{fin}} > 0$ are proportional to the distance or duration $d_{\mathrm{single}}$ of the requested ride, $u^{\mathrm{share}}_{\mathrm{fin}} = \epsilon\, d_{\mathrm{single}}$, where ε denotes the per-distance financial incentives. in many cases, these discounts are also granted if the user cannot actually be matched with another customer into a shared ride [ , ] .
detours. potential detours $d_{\mathrm{det}}$ to pick up or to deliver other users on the same shared ride discourage sharing. the magnitude of this disincentive $u^{\mathrm{share}}_{\mathrm{det}} < 0$ increases with the detour $d_{\mathrm{det}}$.
inconvenience. sharing a ride with another user may be inconvenient due to spending time in a crowded vehicle or due to loss of privacy [ , , ] . this disincentive $u^{\mathrm{share}}_{\mathrm{inc}} < 0$ scales with the distance or duration $d_{\mathrm{inc}}$ users ride together.
in the following we take $u^{\mathrm{share}}_{\mathrm{det}} \sim d_{\mathrm{det}}$ and $u^{\mathrm{share}}_{\mathrm{inc}} \sim d_{\mathrm{inc}}$, describing the first order approximation of these disincentives and matching the linear scaling of the financial incentives with the relevant distance or time. these incentives for a shared ride describe the difference $\Delta u$ in utility compared to a single ride or another mode of transport. the overall utility of a shared ride is then given by
$u_{\mathrm{share}} = u_{\mathrm{single}} + \epsilon\, d_{\mathrm{single}} - \xi\, d_{\mathrm{det}} - \zeta\, d_{\mathrm{inc}}$,
where the utility $u_{\mathrm{single}}$ for a single ride describes the benefit of being transported, as well as the cost and time spent on the ride. the factors ε, ξ and ζ denote the user's preferences. by rescaling the utilities (measuring in monetary units), ε directly denotes the relative price difference between single and shared rides whereas ζ and ξ quantify the importance of inconvenience and detours relative to the financial incentives (see supplementary information for details). for a given origin-destination pair with fixed single ride distance $d_{\mathrm{single}}$, financial incentives are constant for a given discount factor ε. in contrast, detour and inconvenience contributions depend on the destinations and sharing decisions of other users. their magnitude depends on where these users are going and on the route the vehicle is taking for a shared ride (see methods). the decision to share a ride is determined by the expected utility difference (see fig. )
$E[\Delta u] = E[u_{\mathrm{share}}] - E[u_{\mathrm{single}}] = \epsilon\, d_{\mathrm{single}} - \xi\, E[d_{\mathrm{det}}] - \zeta\, E[d_{\mathrm{inc}}]$,
where $E[\cdot]$ signifies the expectation value over realizations of other users' destinations and sharing decisions conditional on one's own sharing decision.
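as a minimal sketch of how the three incentive terms combine, the expected utility difference can be evaluated numerically once the expected detour and the expected shared distance are known. the function and parameter names below (`eps` for the per-distance discount, `xi` and `zeta` for the detour and inconvenience preferences) are illustrative assumptions, not code from the study, and the linear form relies on the first-order scaling of the disincentives described above.

```python
def expected_utility_difference(eps, xi, zeta, d_single, exp_detour, exp_inconvenience):
    """Expected utility gain of requesting a shared ride instead of a single ride.

    eps               -- per-distance financial discount (assumed name)
    xi, zeta          -- detour and inconvenience preferences (assumed names)
    d_single          -- distance (or duration) of the direct single ride
    exp_detour        -- expected detour distance E[d_det]
    exp_inconvenience -- expected distance ridden together E[d_inc]
    """
    financial_gain = eps * d_single
    detour_cost = xi * exp_detour
    inconvenience_cost = zeta * exp_inconvenience
    return financial_gain - detour_cost - inconvenience_cost


def prefers_sharing(eps, xi, zeta, d_single, exp_detour, exp_inconvenience):
    """A user requests a shared ride when the expected utility difference is positive."""
    delta_u = expected_utility_difference(
        eps, xi, zeta, d_single, exp_detour, exp_inconvenience
    )
    return delta_u > 0


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    print(expected_utility_difference(0.2, 0.3, 0.3, 10.0, 1.0, 2.0))
    print(prefers_sharing(0.2, 0.3, 0.3, 10.0, 1.0, 2.0))
```

the expected detour and inconvenience are inputs here; in the model they depend on the destinations and sharing decisions of the other users, which is what couples individual decisions into a game.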
to understand how these incentives determine the adoption of ride-sharing, we study sharing decisions in a stylized city network [ ] with a common origin o (e.g., a central downtown location) in the center and multiple destinations d (illustrated in fig. ). two rings define urban peripheries equidistant from the city center. branches represent cardinal directions of destinations. requests for shared rides will only be matched along adjacent branches, if the shared ride reduces the total distance driven to deliver the users and to return to the origin compared to single rides (see methods). pairing at most two users who request a shared ride, the problem of matching shared ride requests reduces to a minimum-weight matching with an efficient solution, eliminating the influence of heuristic matching algorithms [ , ] (see methods for details). in this one-to-many setting, users requesting a shared ride would only share a ride if they make their requests within some small time window τ. therefore, we consider a game with $S = s\tau$ users travelling to a uniformly chosen destination location, where $s$ denotes the average request rate. these users have the option to book a single ride or a shared ride at discounted trip fare. their decision to share depends on their expected utility difference $E[\Delta u(d)]$ [eq. ( )], now depending on their respective destination d. from the utility differences $E[\Delta u(d)]$, we compute the equilibrium sharing probabilities $\pi^*(d)$ with which users with destination d adopt ride-sharing to maximize their expected utility (see methods for details). at fixed discount ε and preferences ζ and ξ, ride-hailing users may decrease their overall adoption of ride-sharing $\langle\pi^*\rangle$ as the total number $S$ of users increases (see fig. a, blue), even though ride-sharing becomes more efficient with higher user numbers. here $\langle\cdot\rangle$ denotes the average over all destinations d. while for small request rates everybody is requesting shared rides (fig.
b), a distinctive sharing/non-sharing pattern emerges along the branches of the city network upon higher demand (fig. c,d), before the adoption of ride-sharing eventually fades out for high request rates (fig. e). this observation offers a novel perspective on the prevalent conclusion that increased demand improves the shareability of rides [ , ] . while more rides are potentially shareable, fewer people may be willing to share them. the underlying incentives explain this phenomenon: ideally, a user wants to book a shared ride (financial incentive) but without actually sharing the ride (inconvenience and detour). the expected detour and inconvenience mediate an interaction between ride-hailing users, turning ride-sharing decisions into a complex anti-coordination game. for small request rates, i.e. small numbers of concurrent users $S$, the probability $p_{\mathrm{match}}(d)$ for a user with destination d to be matched with other users is low (see fig. a, gray). consequently, the expected detour is also small (analogously for the inconvenience). as illustrated in fig. b, bottom, financial incentives outweigh the expected disadvantages of ride-sharing such that everybody is requesting shared rides, $\pi^*(d) = 1$ for all destinations d, but is only rarely matched with another user. as the number of users $S$ increases, the provider can pair ride requests more efficiently given constant sharing decisions, $\partial p_{\mathrm{match}}(d)/\partial S > 0$, resulting in more requests that are actually matched with another user (see fig. a). consequently, the expected detour and inconvenience also increase. however, instead of reducing the average adoption of ride-sharing homogeneously across all destinations, neighboring destinations adopt opposing sharing strategies (see fig. b).
in this sharing pattern, only destinations in identical cardinal direction can and will be matched into a shared ride, minimizing the detours for shared requests and simultaneously disincentivizing other users to start sharing due to high expected detours (fig. c-e, bottom). as the number of users $S$ increases further, the probability $p_{\mathrm{match}}(d)$ would also increase at given sharing adoption $\pi(d)$. this leads to an adoption of mixed sharing strategies where the financial discounts and the expected inconvenience are exactly in balance (fig. d and e).
(figure caption, continued) in this configuration, users requesting a shared ride never suffer any detour while users that do not share are disincentivized from doing so due to their high expected detour (compare bottom part of panel c). for high numbers of users (s = and , panels d and e), the probability to be matched with another user when requesting a shared ride increases and the financial incentives cannot fully compensate the expected inconvenience. the adoption of ride-sharing decreases until the financial incentives exactly balance the expected inconvenience (panels d and e, bottom). illustrated here for financial discount ε = . and inconvenience and detour preferences ζ = . and ξ = . .
further numerical simulations demonstrate that this transition robustly exists also for heterogeneous demand distribution across the destinations and different origin locations within the network (see supplementary information). naturally, if the discount is sufficiently large such that the financial incentives completely compensate the expected inconvenience, ε > ζ, all users share also in the high request rate limit, $S \to \infty$. in this limit, $d_{\mathrm{single}} = d_{\mathrm{inc}}$ as detours disappear, $E[d_{\mathrm{det}}] \to 0$, due to an abundance of similar requests.
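the balance condition can be illustrated with a deliberately simplified toy model, which is an assumption made here for illustration and not the network model analyzed in the study: all destinations are equivalent, detours are ignored, a matched ride is shared over its full length, and a sharing user is matched whenever at least one other sharing user has the same destination. under these assumptions the equilibrium sharing probability is the value at which the discount equals the inconvenience preference times the matching probability. all function and parameter names are hypothetical.

```python
def match_probability(pi, n_users, n_destinations):
    """Toy matching probability: chance that at least one of the other users
    both requests a shared ride (probability pi) and has the same destination
    (probability 1 / n_destinations)."""
    if n_users < 2:
        return 0.0
    return 1.0 - (1.0 - pi / n_destinations) ** (n_users - 1)


def equilibrium_sharing_probability(eps, zeta, n_users, n_destinations):
    """Symmetric equilibrium of the toy model.

    Everyone shares (pi = 1) while the discount eps still covers the expected
    inconvenience zeta * p_match at full adoption. Otherwise users mix with the
    probability at which the two exactly balance: p_match(pi) = eps / zeta.
    """
    if eps >= zeta * match_probability(1.0, n_users, n_destinations):
        return 1.0
    target = eps / zeta  # matching probability at which incentives balance
    return n_destinations * (1.0 - (1.0 - target) ** (1.0 / (n_users - 1)))


if __name__ == "__main__":
    # Hypothetical parameters: discount below the inconvenience preference.
    for n_users in (2, 10, 100, 1000):
        pi_star = equilibrium_sharing_probability(0.2, 0.3, n_users, 8)
        print(n_users, round(pi_star, 4), round(pi_star * n_users, 2))
```

in this toy setting the expected number of sharing users, the sharing probability times the number of users, levels off as demand grows whenever the discount is smaller than the inconvenience preference, and full adoption persists at any demand otherwise, which mirrors the two regimes described in the text without reproducing the spatial patterns of the full model.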
figure a -b summarizes these results in a phase diagram for the ride-sharing decisions as a function of financial discount and number of users s, illustrating under which conditions the users adopt ride-sharing (high-sharing regime) and under which conditions the users only share partially or not at all (low-sharing regime). for fixed values of financial discounts , different behavior emerges for different inconvenience preferences ζ. if ζ is sufficiently small (fig. a) , the system is in the high-sharing phase and the number of users requesting a shared ride is s share = s. otherwise, the system switches from the high-to the low-sharing state (fig. b, compare fig ) . figure c illustrates the scaling of s share in both states as s increases. in the partial sharing state, s share becomes constant for large s, such that s share /s → as s → ∞ (compare fig. a) , implying a discontinuous phase transition between low-sharing and high-sharing regimes for large s when the financial incentives exactly balance the inconvenience, c /ζ c = (see supplementary information for details). two qualitatively different regimes of ride-sharing adoption. a,b phase diagram of fraction of shared rides s share /s for different inconvenience preferences ζ. ride-sharing is adopted dominantly if the financial discount fully compensates the expected inconvenience (high-sharing, dark blue). otherwise, the total number of shared ride requests saturates and the overall adoption of ride-sharing decreases with increasing number of users s (low-sharing, compare fig. a ). in the limit of an infinite number of requests s → ∞ the transition becomes discontinuous (see supplementary information) . c with identical financial discounts = . , different sharing behavior emerges for different inconvenience preferences ζ. when ζ < all users request shared rides (s share = s, dark blue triangles, red line in panel a). 
when ζ > the system is in a low-sharing regime where users request shared rides at low numbers of users s but the number of shared ride requests saturates and becomes constant at high s (s share < s, light green triangles, red line in panel b). in the low-sharing regime, spatially heterogeneous patterns of ride-sharing adoption emerge (compare fig. b -e). in a real city with heterogeneous preferences across different locations and constant financial discounts , the sharing decisions may, on an aggregate level, appear to be in a hybrid state between the high-and low-sharing phases predicted by our model. indeed, the ride-sharing adoption across different origin locations in new york city and chicago, illustrated in fig. , matches the qualitative sharing behaviors at different preferences in our model (compare fig. c ). at locations with a low request rate s, the fraction of shared ride requests increases linearly with more requests, s share ∼ s. at high request rates, sharing decisions differ by origin zone (compare fig. ): for crown heights north and east village the linear scaling prevails, indicating is sufficiently large to compensate the expected inconvenience and detour effects completely. in fact, the spatial pattern of fraction of rides shared appears to be largely homogeneous across destinations as expected in this state (fig. a) . other origins with a similarly high request rate, such as jfk and laguardia airports, accumulate on a horizontal line with a constant number of shared ride requests per time. for these zones s share has saturated for the given financial incentives and will not increase with higher request rate. the sharing decisions in these locations are spatially heterogeneous across the city (fig. c) , consistent with the low-sharing state observed in our model (compare fig. ). together with fig. 
a and b, these observations suggest that financial incentives in new york city are at the phase boundary between the high-and low-sharing regime and slightly higher discounts may significantly increase sharing in some areas. ride-sharing adoption in new york city and chicago is consistent with the predicted high-and lowsharing regimes. a,b sharing decisions for new york city and chicago (blue dots) accumulate on two branches corresponding to the predicted high-and low-sharing regime as a function of request rate (compare fig. ) . at low request rates, the number of requests for shared rides increases linearly with the total number of requests (compare red diagonal). at high request rates, the sharing decisions differ between locations (compare fig. and , see also supplementary information). as inconvenience preferences ζ are naturally heterogeneous in the cities, adoption is in a hybrid low/high-sharing state. c,d for origins in the high-sharing state a spatially homogeneous pattern of ride-sharing adoption emerges across destinations. e,f for origins in the low-sharing state a spatially heterogeneous pattern forms. the agglomeration of most data points on the high-sharing branch for new york city suggests that the financial discounts are close to the boundary of the high-sharing phase. however, the slope of the high-sharing branch indicates that only about % of ride-hailing users consider ride-sharing as an option. while about % of requests are shared in the high-sharing regime in chicago, this potential is largely not realized. most data points at locations with high request rates accumulate on the horizontal line representing the low-sharing regime. seven large and busy zones in chicago with up to requests per minute (not shown) fall in between the high-and low-sharing state (see supplementary information for details). 
an analysis for the ride-sharing adoption across more than million trips in chicago (see methods and supplemental material for details) shows similar results (fig. d) , highlighting the existence of the low-sharing regime (horizontal branch s share = const.). even in the high-sharing regime, s share ∼ s, the ride-sharing adoption in new york city and chicago (corresponding to the slope of the diagonal branch in fig. a,b) is below %. in terms of our ride-sharing game, the remaining fraction of requests for single rides corresponds to a user group that does not consider ride-sharing as a potential option at all and, hence, is not captured by our model. the adoption of ride-sharing is governed by the complex interplay between demand patterns, matching algorithms, available transportation services, urban environments and the relevant incentive structure. incentives may include financial savings potentials, detour or delay preferences, various types of inconveniences, as well as sustainability, security and uncertainty [ , [ ] [ ] [ ] . we have introduced a model capturing essential incentives for and against ride-sharing, predicting two qualitatively different regimes of ride-sharing adoption consistent with an analysis of million ride-sharing decisions from new york city and chicago. a basic model includes three core incentive types: financial benefits, potential detours (and thus effectively slower service) and other inconveniences such as reduced privacy resulting from sharing a vehicle. this setting may already reflect many additional factors influencing ride-sharing adoption on an aggregate level. for example, sustainability or uncertainty preferences to first approximation scale with the additional distance driven and may thus be incorporated into the detour preferences. similarly, alternative public transport options may be captured by modifying the effective financial discount and relative inconvenience preferences for individual destinations. 
as such, we expect the qualitative dynamics to be robust even in more detailed settings taking into account additional conditions (compare supplementary information for different demand distributions). specific, district-level policy recommendations naturally require a more detailed description of the traffic conditions and alternative transport options, capturing all the above-mentioned dependencies. in particular, we predict the existence of two distinct regimes of ride-sharing adoption. for sufficiently strong financial incentives, the number of shared ride requests increases linearly with increasing demand. however, if the financial incentives are weak compared to inconvenience disincentives, the number of requests for shared rides saturates with increasing demand (regime of low ride-sharing adoption). this observation is independent of the choice of origin locations or the specific demand distribution (see supplementary information) and stands in stark contrast to the increasing shareability of rides with high demand [ , , ] . in the limit of large demand, the transition between the two regimes becomes discontinuous, switching abruptly from the low adoption to the high adoption regime with a small change of the incentives. ride-sharing adoption observed across new york city and chicago is consistent with these predictions and demonstrates that both regimes exist across the cities. the data suggest that even a moderate increase of financial incentives may strongly improve ride-sharing adoption in some areas currently in the low-sharing regime. still, the overall low fraction of shared ride requests, even in the high-sharing regime, suggests that an additional societal change towards acceptance of shared mobility is required [ ] to make the full theoretical potential of ride-sharing accessible [ , ] . 
a carefully designed incentive structure for ride-sharing users adapted to local user preferences is essential to drive this change and to avoid curbing user adoption or stimulating unintended collective states [ , ]. this is particularly relevant in the light of increasing demand as urbanization progresses [ ]. overall, the approach introduced above can serve as a framework to work towards sustainable urban mobility by regulating and adapting incentives to promote ride-sharing in place of motorized individual transport.

new york city ride-sharing data. we analyzed trip data of more than million transportation service requests delivered through high-volume for-hire vehicle (hvfhv) service providers in new york city in . the data is provided by new york city's taxi & limousine commission (tlc) [ ] and consists of origin and destination zone per request, pickup and dropoff times, as well as a shared request tag, denoting a request for a single or shared ride. we compute the average request rate across all data throughout taking hours of demand per day as an approximate average. for fixed origin-destination pairs we determine the sharing fraction as the ratio of the total number of shared ride requests to the total number of requests. departure and destination zones represent the geospatial taxi zones defined by tlc [ ]. however, we exclude zones that have neither a geographic decoding nor a name tag defined by tlc. for each individual analysis, we fix the departure zone and compute the fraction of shared rides to destination zones. for a given departure zone, if the total number of requests is less than trips in the considered time interval and destination zone, we exclude that destination zone from the analysis to avoid excessive stochastic fluctuations (see supplementary information for details).

chicago ride-sharing data. we additionally analyzed more than million trips delivered by three service providers in chicago in .
the data is provided through the city of chicago's open data portal and contains, amongst others, information of trip origin, destination, pickup and dropoff times as well as information whether a shared ride has been authorized [ ] . we restrict ourselves to geospatial decoding of the city's community areas, as well as trips leaving or entering the official city borders. in analogy to new york city, we compute the average request rate across all data for taking hours of demand per day as an approximate average reference time and repeat the analysis explained for new york city. city topology. for our ride-sharing model we construct a stylized city topology that combines star and ring topology [ ] . starting from a central origin node, rides can be requested to destinations distributed equally across two rings of radius (inner ring) and (outer ring), as depicted in figure . the distances between neighboring nodes on the same branch are set to unity. correspondingly, the distances between neighboring nodes are π/ on the inner, and π/ on the outer ring. ride-sharing adoption. we compute the equilibrium state of ride-sharing adoption by evolving the adoption probabilities π(d, t) following discrete-time replicator dynamics [ , ] π(d, t + ) = r(d, t) π(d, t), where the reproduction rate r(d, t) at destination d and time t is and e[x] represents the expectation value of random variable x. we prepare the system in an initial state π(d, ) = . of ride-sharing adoption for all destinations d and set a constant utility of a single ride u single (d) = to ensure positivity of eqn. ( ) . to evolve eqn. ( ), we numerically compute e[u share (d, t)] = e[u(d, t)|share] at each replicator time step t: we generate n = samples of ride requests of size s of which at least one goes to destination d and requests a shared ride. the other s − requests are drawn from a uniform destination distribution. 
each of them realizes a sharing decision in line with the current probability distribution π(d , t) at their respective destination d at time t. shared ride requests are matched pairwise (see below). from these n = game realizations, we compute the conditional expected utility of sharing. we repeat this procedure for all destinations d and then update all probabilities π(d, t) according to eqn. ( ). before performing measurements on the system's equilibrium observables, we discard a transient of replicator time steps, corresponding to game realizations per destination. we then measure the average adoption for replicator time steps, representing a proxy for the stationary solution π * (d) of eqn. ( ) and plotted as the sharing fraction in figs. and (see supplementary information for details).

matching. each request set of size s decomposes into single and shared ride requests. we realize the optimal pairwise matching of requests as follows: for shared requests we construct a graph whose nodes correspond to requests and whose edges encode the distance savings potential of matching the two requests. to determine the distance savings potential we assume that, independent of single or shared ride, the provider has to return to the origin of the trip. after constructing the shared request graph we employ the 'blossom v' implementation of edmonds' blossom algorithm to determine the maximum weight matching, i.e. the matching with the highest total distance savings potential [ ]. the matching determines the routing and the realization of inconvenience and detour (see supplementary information for more details).

in the main manuscript of this article we disentangle the incentive structure of urban ride-sharing and demonstrate how it leads to the emergence of two qualitatively different regimes of sharing adoption.
a game theoretic model reproduces key features of the ride-sharing activity in new york city and chicago, including spatially heterogeneous patterns of ride-sharing adoption, saturation in the number of shared rides upon increasing demand and provides insights on the underlying mechanisms. this supplementary information provides additional details on the model, methods and results presented, and is structured as follows: in , four high-volume for-hire vehicle (hvfhv) companies (uber, lyft, via, juno) served more than million transportation service requests in new york city, corresponding to approximately trips per day conducted by a population of . million people [ , ] . in this supplementary note, we unveil the spatiotemporal demand patterns underlying this macroscopic number of transportation requests. the flux matrix w (∆t) formalizes the spatiotemporal demand for transportation services between different locations of an urban environment. its entries w o,d denote the number of transportation requests originating at location o and going to location d within a specific time window ∆t. w (∆t) decomposes into where w single (∆t) and w shared (∆t) are the flux matrices describing trip requests tagged as single or shared rides, respectively. we define the fraction of rides shared as the relative ratio of shared rides to absolute number of rides. note that eqn. ( ) is only defined if the total flux between origin and destination exceeds a threshold w min to reduce bias from fluctuations in statistical analyses of p o,d (∆t). we determine w single , w shared and p for the taxi zones in new york city from the new york city taxi fig. b ) and evening ( pm - am) encompassing leisure activity hours ( supplementary fig. c ). independent of daytime, all four origins exhibit complex spatial patterns of ride-sharing adoption across destinations. 
for jfk and laguardia airport these patterns are robust for all time windows, indicating stable fraction of rides shared to all destinations throughout the day. for crown heights north and east village only few rides are undertaken to far distance destinations in the morning and midday time window (gray areas representing w o,d < w min ). in the evening, more rides are requested overall, also to far distance destinations. overall, the qualitative patterns of ridesharing adoption do not vary significantly with the time of day (compare fig. across the full set of origin zones in new york city, supplementary figure suggests an overall trend to higher absolute demand for transportation services in the evening. the fraction of shared rides, however, is not affected by this trend. it is approximately constant throughout the day as illustrated in supplementary figure . the average standard deviation of fraction of rides shared across all taxi zones is less than . % between the three time windows, suggesting an equilibrated system. an aggregate analysis will naturally be dominated by the high overall demand in the evening and night time. still, the data suggests that the average ride-sharing adoption in new york city is stable across the day. hence, an aggregate analysis is representative. a linear scaling between s share and s indicates sufficient financial incentives to compensate the expected negative effects of ride-sharing with increasing demand. a decrease in slope and eventual saturation corresponds to a situation where financial incentives, expected detour and inconvenience are in balance. s share will not increase upon higher demand for given incentives (compare fig. in main manuscript as well as large s regime in supplementary fig. ). consider for example times square/theatre district and alphabet city (top left and bottom right): while for the first only approximately one in nine ride requests is shared, it is one in three for the latter. 
the spatial pattern of fraction of rides shared follows this trend. it is similar for regions with saturated shared ride request rate and starts to deviate the more the origin zone resembles a high-sharing regime (compare fig. fig. in the main manuscript). ride-sharing is only fully adopted if the financial discount compensates the expected inconvenience fully (full-sharing, dark blue). otherwise, the adoption of ridesharing decreases with increasing number of users s (partial sharing). c sharing decisions in chicago exhibit a hybrid state between low-and partial-sharing states (panel c, blue dots). at low request rates, the number of requests for shared rides increases linearly with the total number of requests. at higher request rates, the sharing decisions saturate (horizontal orange curve), indicating a partial sharing regime. few communities cross the horizontal branch, hinting at hardly any zones in a full-sharing regime, but generally low adoption of ride-sharing for the given financial incentives. the inset in panel c includes communities north east side, loop, near west side, lake view, west town, lincoln park, and trips originating outside of the boundaries of the city of chicago, whose request rates significantly exceed those of the other communities by up to one order of magnitude (not shown in the main panel, green border). in , three transportation service providers (uber, lyft, via) served more than million transportation service requests in the city of chicago, corresponding to approximately trips per day [ ] . in this supplementary note we demonstrate that the ride-sharing adoption in the city reproduces the hybrid sharing states observed for new york city and exhibits spatially heterogeneous patterns in ride-sharing adoption. chicago consists of community areas [ ] . supplementary fig. c illustrates request rate for shared rides as a function of the total request rate for rides. 
as illustrated for new york city in the main manuscript, chicago's different communities exhibit spatially heterogeneous ride-sharing adoption. while there exists a subset of communities for which the number of shared ride requests scales linearly in the total number of requests, other origin communities (e.g. lower west side, hyde park, uptown, near south side, o'hare) form a branch where the number of shared ride requests has saturated and does not increase with the overall number of ride requests. similarly to new york city, we observe partial and full-sharing regimes that give rise to spatially heterogeneous patterns of ride-sharing adoption (compare supplementary fig. c right) . other than in new york city, there are hardly any locations in the full-sharing regime indicated by the upper branch of ride-sharing adoption (compare supplementary fig. inset) . this means the different communities are in a partial-or non-sharing regime. in other words, financial discounts seem to be insufficient to compensate the inconvenience preferences in chicago, explaining that the majority of communities is not in a full-sharing state. the low number of locations on the upper branch suggests that a larger increase of the financial incentives is required to trigger the transition to the high-sharing regime to overcome the inconvenience preference ζ. in this supplementary note we formally define the ride-sharing anti-coordination game introduced in the main manuscript. we introduce a replicator dynamics governing the evolution of the population's willingness to share their rides. the resulting network dynamics unveils spatially heterogeneous sharing patterns, emerging from dynamic symmetry breaking. denote by g = (v, e) a mathematical graph of an urban street network composed of a node set v and an edge set e. nodes can be identified with individual intersection, census tracts or qualitatively similar zones embedded in space. 
edges correspond to streets connecting the different zones and are weighted by the geographical distance between them. the distance matrix d bundles the pairwise (shortest path) distances. in the following we consider a one-to-many setting where s people request transportation from a single origin o ∈ v to a destination d ∈ v \{o} on g. per destination node d ∈ v \{o} the probability π(d, t) ∈ [ , ] defines the local population's ride-sharing adoption when embarking from origin o at time t. π(d, t) is an aggregate measure for people's ride-sharing willingness, describing the average ride-sharing behavior of people with the same origin-destination combination. the ride-sharing adoption evolves under discrete-time replicator dynamics with reproduction factor where d single (d) = d o,d is the shortest path distance between origin o and destination d, d det (d) is the detour from sharing for destination d at time t and d inc (d, t) is the distance spent together on a shared ride. while the first distance is deterministic, the latter two are stochastic and depend on the overall demand for shared rides on the network. hence, they mediate a coupling between destinations on the network. a rescaling of eqn. ( ) shows that the dimensionless parameters ξ/ and ζ/ as well as . pairing rides is a maximum weight matching problem. the provider's matching algorithm solves the maximum weight matching problem. a a shared ride request graph defines potentially shareable rides σ share with edge weights defining the saved distance of a combined ride. b the provider pairs requests to maximize his saved distance. c per matching the provider defines the shared route to minimize the distance driven and customer inconvenience. 
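The replicator update described in this note can be illustrated with a minimal sketch. It assumes the standard discrete-time replicator form, in which the reproduction factor is the expected utility of sharing divided by the population-average utility; that ratio is an assumption here, because the expression itself is not legible in the text. The function names and the toy utility used in the usage note are hypothetical.

```python
def replicator_step(pi, u_share, u_single):
    """One discrete-time replicator update for a single destination.

    pi       : current adoption probability of ride-sharing, in [0, 1]
    u_share  : expected utility of requesting a shared ride (must be > 0)
    u_single : utility of a single ride (must be > 0)

    Assumed form: r = E[u_share] / E[u], with
    E[u] = pi * E[u_share] + (1 - pi) * u_single.
    """
    mean_u = pi * u_share + (1.0 - pi) * u_single  # population-average utility
    return pi * u_share / mean_u                   # r(d, t) * pi(d, t)


def iterate(pi0, u_share_of_pi, u_single, steps):
    """Iterate the update; u_share_of_pi maps adoption to expected sharing utility."""
    pi = pi0
    for _ in range(steps):
        pi = replicator_step(pi, u_share_of_pi(pi), u_single)
    return pi
```

As a toy usage, `iterate(0.9, lambda p: 11.0 - 2.0 * p, 10.0, 2000)` models a sharing utility that drops as more users share; the adoption then settles where the sharing and single-ride utilities are equal. In the model itself the expected sharing utility is estimated from sampled request sets and couples different destinations, which this single-destination sketch does not capture.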
the expected detour and inconvenience of shared rides originating from origin o ∈ v , going to destination d ∈ v , depend on (i) the configuration of destinations in the request set s at time t, (ii) the realization of sharing choices across all users, and (iii) the service provider's matching and routing algorithm. (i) origin-destination distribution. denote by σ ∈ v s the destination request configuration of the s simultaneous transportation requests from o. σ is a random variable governed by the origin-destination distribution w o . it impacts where users travel and which users may potentially be matched when sharing a ride. (ii) adoption of ride-sharing. depending on the user's individual decisions to share their rides, σ decomposes into σ share and σ single . the realization of destinations in σ share determines the potentially shareable rides. (iii) matching and routing algorithm. providers match ride requests based on distance savings potentials, which is equivalent to a maximum weight matching problem on a mathematical graph: shared ride requests define the nodes of this graph. if two rides offer a distance savings potential to the provider compared to two single rides, the ride requests are connected by an edge (see supplementary fig. a ). the distance savings potential defines the edge weight. here, we assume that both for single and shared ride requests the provider needs to return to the trip origin, consistent with the one-to-many setting. the provider's matching algorithm determines the matching of shared ride requests that maximizes the saved distance (see supplementary fig. b ). per matched request pair, the provider defines the trip route to minimize the distance driven. if he is indifferent whom to drop first, he will deliver the passenger with the shorter distance first to minimize customer inconveniences (see supplementary fig. c ). if, again, he is indifferent he tosses a fair coin to determine the order of the shared ride. 
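The matching step (iii) can be sketched for small request sets as follows. This is an illustration under stated assumptions, not the provider's algorithm: an exhaustive search stands in for the blossom algorithm, the star-and-ring city uses six branches and ring radii 1 and 2 as illustrative values, and all function names are hypothetical. The saving of a pair follows from the return-to-origin assumption: two single round trips cost 2 d(o,a) + 2 d(o,b), one shared round trip costs d(o,a) + d(a,b) + d(b,o).

```python
import math
from functools import lru_cache


def build_distances(n_branches=6, n_rings=2):
    """Shortest-path distances on a star-plus-ring graph; node 0 is the origin.

    Node (b, k) lies on branch b at ring k (radius k). Steps along a branch
    have length 1, steps along ring k have length 2*pi*k/n_branches.
    """
    nodes = [0] + [(b, k) for b in range(n_branches) for k in range(1, n_rings + 1)]
    dist = {(u, v): (0.0 if u == v else math.inf) for u in nodes for v in nodes}

    def link(u, v, w):
        dist[u, v] = dist[v, u] = min(dist[u, v], w)

    for b in range(n_branches):
        link(0, (b, 1), 1.0)
        for k in range(1, n_rings + 1):
            if k < n_rings:
                link((b, k), (b, k + 1), 1.0)
            link((b, k), ((b + 1) % n_branches, k), 2 * math.pi * k / n_branches)
    for m in nodes:  # Floyd-Warshall all-pairs shortest paths
        for u in nodes:
            for v in nodes:
                if dist[u, m] + dist[m, v] < dist[u, v]:
                    dist[u, v] = dist[u, m] + dist[m, v]
    return dist


def saving(dist, a, b, origin=0):
    """Distance saved by one shared round trip to a and b versus two single ones."""
    return dist[origin, a] + dist[origin, b] - dist[a, b]


def best_matching(dist, requests):
    """Exhaustive maximum-weight pairing of shared requests (small sets only).

    Returns (total saving, tuple of matched request-index pairs).
    """
    @lru_cache(maxsize=None)
    def solve(remaining):
        if len(remaining) < 2:
            return 0.0, ()
        first, rest = remaining[0], remaining[1:]
        best = solve(rest)  # option: leave `first` unmatched
        for j, other in enumerate(rest):
            w = saving(dist, requests[first], requests[other])
            if w > 0:  # an edge exists only if the pair saves distance
                sub_w, sub_pairs = solve(rest[:j] + rest[j + 1:])
                if w + sub_w > best[0]:
                    best = (w + sub_w, ((first, other),) + sub_pairs)
        return best

    return solve(tuple(range(len(requests))))
```

For example, `best_matching(build_distances(), [(0, 1), (0, 2), (3, 1)])` pairs the two requests on the same branch and leaves the request on the opposite branch unmatched, since pairing it with either of the others saves no distance. The exhaustive search grows quickly with the number of requests, so larger sets would call for a dedicated matching routine.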
while the adoption of ride-sharing in general depends on the underlying street network and the destination distribution, the problem simplifies in the limit of many concurrent users, s → ∞. in particular, a necessary and sufficient condition for full adoption of ride-sharing to be a stable equilibrium in this limit is that the financial incentives compensate the inconvenience (see supplementary fig. ). in this limit and with full sharing, detours disappear as users will always be matched with other users with the same destination. formally: theorem (full sharing in high-demand limit). if lim s→∞ π(d) * = the ratio of inconvenience to financial incentive must be ζ/ < . a dominant equilibrium strategy in sharing, π(d) * = , implies positive expected utility difference e[∆u(d) * ] > . the limit of infinite request number yields where we used that π(d) * = for which s → ∞ corresponds to zero-detour matching to destination d. consequently, which implies ζ < .

at ζc/ c = , a spatially heterogeneous pattern of ride-sharing adoption forms for finite s and detour preferences ξ > where adjacent branches alternate between the low-and high-sharing regimes. for ζ/ > ride-sharing adoption fades out along the branches that are still in the high-sharing regime. in the limit s → ∞ (or for ξ = ) the transition becomes discontinuous at ζc/ c = . simulation parameters: ξ = . , = . .

the case s =
the case s = produces equilibrium adoption of ride-sharing qualitatively different from adjacent values of s, as discussed in figure a in the main manuscript. for sufficiently high the population has a dominant sharing strategy in this configuration which is induced by the fact that the service provider can at most pair two of the three ride requests into a shared one. the left-over request will enjoy the benefits of a single ride at discounted trip fare, inducing an incentive to become this request which fuels both the ride-sharing adoption and the matching probability.
as s increases beyond s = the incentive of gambling on being a left-over request reduces drastically as far less corresponding constellations exist. thus, s = produces a behavior in figure a of the main manuscript that looks qualitatively different than for other values of s. robustness of ride-sharing adoption for radially asymmetric origin-destination demand. radial asymmetry of the destination demand distribution does not qualitatively affect the equilibrium ride-sharing adoption. a dense core setting, inner ring destinations (gray shading) are visited twice as often as outer ring destinations. increased request rate reduces the destination's ride-sharing adoption from inside to outside. b urban sprawl setting, outer ring destinations (gray shading) are visited twice as often as inner ring destinations. as s increases the ride-sharing adoption ceases from outside to inside. in both cases branches of high ride-sharing adoption emerge in a random direction (compare fig. in all settings, the results are qualitatively the same as for homogeneous origin-destination demand (compare fig. in the main manuscript). naturally, sufficiently high financial incentives overcome this partial-sharing phase and result in full sharing, reproducing the two phases of ride-sharing adoption (see supplementary fig. robustness of ride-sharing adoption for azimuthally asymmetric origin-destination demand. an azimuthally asymmetric destination demand distribution predetermines the emergence of sharing/non-sharing branches. a sparse settlement setting, neighboring branches of destinations in radial direction alternate between being visited twice as likely (gray shading) as the other branches. the high-demand branches are sharing while the low demand branches quit sharing due to their high expected detour. increased request rate reduces the destination's ride-sharing adoption from inside to outside. 
b heterogeneous settlement setting, inner and outer ring destination nodes on the same branch alternate between being requested twice as often (gray shading) as the other one. also in this setting branches of high ride-sharing adoption emerge, driven by the outermost destinations. as s increases the ride-sharing adoption ceases from outside to inside (compare supplementary fig. b) . parameters: = . , ζ = . , ξ = . . in the one-to-many ride-sharing game, the relative position of the origin defines the scale of average distances to different destinations and the possible combinations in which requests for shared rides are matched. hence, it impacts expected detours and inconvenience. here, we consider the stylized city topology introduced in the main manuscript with a decentral origin at the periphery. supplementary figure illustrates a one-to-many situation of homogeneous transportation demand from the northernmost node in the stylized city topology. again, a distinct spatial pattern of ride-sharing adoption emerges in the partial sharing regime (supplementary fig. a ). when the financial incentives are sufficiently large, we recover the full sharing phase (supplementary fig. b ). in this setting, the sharing pattern is symmetric about the north-tosouth axis, where nodes in the direction of the city center share dominantly. they have no expected detours since in all constellations where they are matched, they will be dropped first. this is not the case anymore for destinations on the opposite side of the city center. hence, these destinations do not share. for the remaining destinations the decision to share, or not, results in a zero-sum game very soon as s increases (compare supplementary fig. a, center panel) and eventually reproduces a ride-sharing adoption pattern where neighboring branches alternate between sharing and not sharing (compare fig. in main manuscript). 
again, ride-sharing adoption behaves qualitatively similarly to the constellation for central origins.

goal : sustainable cities and communities - target . : "...safe, affordable, accessible and sustainable transport systems traffic and related self-driven many-particle systems modelling the scaling properties of human mobility the hidden universality of movement in cities understanding individual human mobility patterns new developments in urban transportation planning intergovernmental panel on climate change. climate change mitigation of climate change european commission quantifying the benefits of vehicle pooling with shareability networks optimization for dynamic ride-sharing: a review addressing the minimum fleet problem in on-demand urban mobility scaling law of urban ride sharing topological universality of on-demand ride-sharing efficiency transportation sustainability follows from more people in fewer vehicles, not necessarily automation on-demand high-capacity ride-sharing via dynamic trip-vehicle assignment mean field theory of demand responsive ride pooling systems to share or not to share incentives and disincentives for ridesharing: a behavioral study. department of transportation, federal highway administration sharing the ride: a paired-trip analysis of uberpool and chicago transit authority services in chicago what do riders tweet about the people that they meet? analyzing online commentary about uberpool and lyft shared/lyft line the perfect uberpool: a case study on trade-offs ride-sharing efficiency and level of service under alternative demand, behavioral and pricing settings do transportation network companies decrease or increase congestion? towards demand-side solutions for mitigating climate change new york city taxi & limousine commission. congestion surcharge - tlc how mobility will shift in the age of us rideshare programs see supplementary information for details on data collection and treatment central loops in random planar graphs culture and low-carbon energy transitions anomalous supply shortages from dynamic pricing in on-demand mobility freezing by heating in a driven mesoscopic system chicago data portal - transportation network providers - trips. see supplementary information for details on data collection and treatment the replicator equation and other game dynamics fictitious play, shapley polygons, and the replicator equation blossom v: a new implementation of a minimum cost perfect matching algorithm world health organization. who announces covid- outbreak a pandemic weighted matchings in general graphs taxi & limousine commission chicago data portal - transportation network providers - trips

we thank the network science group from the university of cologne and nora molkenthin for helpful discussions and christian dethlefs for help with simulations. d.s. acknowledges support from the studienstiftung des deutschen volkes. m.t. acknowledges support from the german research foundation (deutsche forschungsgemeinschaft, dfg) through the center for advancing electronics dresden (cfaed).

in the main manuscript we demonstrated that the ride-sharing anti-coordination game reproduces opposing regimes of ride-sharing adoption in a simple setting. in this section we demonstrate the robustness of these results under different conditions, including non-homogeneous demand constellations and different origin locations in the network, illustrating that the underlying mechanisms balancing incentives remain identical.

ride-sharing adoption for non-homogeneous origin-destination demand

using the stylized city topology introduced in the main manuscript, we investigate the impact of radially and azimuthally asymmetric destination demand on the ride-sharing adoption from a joint origin.
we distinguish between four scenarios representative of different types of urban settlements:

1. dense core: starting from a joint origin in the city center, a gradient of decreasing destination demand in radial direction mimics urban environments with a densely populated city core. more distant destinations (e.g. suburbs) are requested less often, e.g. because of sparser population density.
2. urban sprawl: in situations where destinations distant from the city center make up the majority of ride requests, the radial destination demand gradient is reversed. these scenarios represent constellations of urban sprawl, or situations where the city core is only sparsely populated, e.g. because of high real-estate prices.
3. sparse settlement: urban environments may exhibit azimuthal gradients in destination demand starting from an origin in the city center, e.g. stretched-out residential settlements that have formed next to existing roads, river banks, etc. in that case destination demands in radial direction might be similar, but differ significantly by cardinal direction.
4. heterogeneous settlement: urban constellations where both radial and azimuthal destination demand gradients exist might describe heterogeneously grown environments, e.g. because of natural obstacles or staged development.

figs. and correspond to the four scenarios. for a given financial discount, an increase in request rate s gives rise to a spatially heterogeneous sharing/non-sharing pattern and decreasing overall adoption of ride-sharing in all scenarios, independent of the destination demand distributions (compare fig. in the main manuscript). as second-order effects, the origin-destination distribution determines (i) whether the cardinal direction of the sharing pattern is random (supplementary fig. for radially asymmetric destination demand), or aligned with the highest destination demand (supplementary fig.
for azimuthally asymmetric destination demand), and (ii) whether close-by or distant destinations reduce their willingness to share first upon increased request rate.

1. dense core: for dense core settings (see supplementary fig. a) the cardinal direction of the sharing/non-sharing pattern is solely driven by random fluctuations breaking the azimuthal symmetry. the destination demand gradient leads to a reduction of willingness to share from inside to outside as s increases.
2. urban sprawl: phenomenologically, urban sprawl (see supplementary fig. b) corresponds to dense core, but this time increasing the request rate reduces the willingness to share from the outside (i.e. high destination demand).

in the presence of azimuthal destination demand gradients the sharing pattern forms along the branches of high demand (see supplementary fig. a). the dominance of those destinations in the replicator dynamics guides the symmetry breaking into letting low-demand destinations reduce their willingness to share, which reduces the expected detour for sharing branches. as s increases the willingness to share reduces from inside to outside as in the uniform case analyzed in the main manuscript.

in this section of the supplementary information we provide detailed insight into the data used, cleansing procedures applied and simulation methods implemented.

data sources. the new york city taxi & limousine commission (tlc) publishes trip records for high-volume for-hire vehicles (hvfhv) on a monthly basis. the data includes trip information on pickup time, origin zone, drop-off time, destination zone as well as a shared ride request label for providers completing more than trips per day [ ]. our analysis is based on the aggregate hvfhv activity between january and december , independent of service provider, including more than million transportation services.
we exclude older data due to regulatory changes effective in [ ], potentially impacting ride-hailing behavior, and data from due to changed transportation service activity in the course of the covid- pandemic [ ].

tlc partitions new york city into taxi zones and provides geospatial information about zone boundaries, names and jurisdictions [ ]. we adopt the definition of these zones in all of our analyses.

additionally, the city of chicago publishes ride-hailing trip records on its open data portal [ ]. the data contains, amongst others, information about trip origin, destination, pickup and dropoff times as well as information whether a shared ride has been authorized by the requester. our analysis encompasses the time-span between january and december , as chosen for new york city, and includes more than million trip requests served by three transportation service providers (uber, lyft, via).

in our geospatial analysis we restrict ourselves to chicago's community areas, as well as trips leaving or entering the official city borders.

data preparation. we use tlc's data as-is. our data cleansing procedure removes trip records for which trip information is marked as not available. furthermore, we omit trip records for zones and in our analysis. while the dataset contains trip requests labeled by these zones, there is no geographic decoding specified by tlc, nor do the zones have names.

similarly, we use the chicago trip records as-is. for our analyses, we determine the total flux matrix specified in eqn. ( ) per city. when showing daily averages we normalize the total annual flux between origin and destination zones o and d by hour days to obtain a per-minute request rate, assuming hardly any request activity for hours per day. in case of specifically defined time windows (see supplementary note and ), we normalize the total flux by the window size.

we compute the fraction of shared ride requests as specified in eqn. ( ).

equilibration.
in this article, we focus on the equilibrium properties of the replicator dynamics underlying the ride-sharing game on networks. to equilibrate the system we evolve eqn. ( ). we discard a transient of replicator time steps before starting measurements of equilibrium values of observables.

per replicator time step and per destination node we repeat the ride-sharing game for times for the current configuration of π(d, t) to generate a reliable numerical estimate for the expected utility increment of sharing e[∆u(d, t)] that is being used to update π(d, t + ).

matching. after generating the shared ride request graph (see supplementary note ) we implement edmonds' blossom algorithm to determine a maximum weight matching [ ]. since the algorithm used implements a minimum cost perfect matching, we reduce our non-perfect matching problem to a perfect one as described in [ , ch. . . ].

key: cord- -xcsal vk authors: rafie, k.; lenman, a.; fuchs, j.; rajan, a.; arnberg, n.; carlson, l.-a. title: the structure of enteric human adenovirus - a leading cause of diarrhea in children date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: xcsal vk

human adenovirus (hadv) types f and f are a prominent cause of diarrhea and diarrhea-associated mortality in young children worldwide. these enteric hadvs differ strikingly in tissue tropism and pathogenicity from respiratory and ocular adenoviruses, but the structural basis for this divergence has been unknown. here we present the first structure of an enteric hadv - hadv-f - determined by cryo-em to a resolution of . Å. the structure reveals extensive alterations to the virion exterior as compared to non-enteric hadvs, including a unique arrangement of capsid protein ix. the structure also provides new insights into conserved aspects of hadv architecture such as a proposed location of protein v, which links the viral dna to the capsid, and assembly-induced conformational changes in the penton base protein.
our findings provide the structural basis for adaptation to a fundamentally different tissue tropism of enteric hadvs.

the structure of hadv-f reveals a ph-resistant capsid with an altered surface charge distribution

to elucidate the structural basis of enteric adenovirus infection, we determined the structure of hadv-f using cryo-em. in parallel, the genome of the purified virus (strain "tak") was sequenced, revealing one protein-coding mutation (val ala in protein viii) compared to the deposited sequence for the same strain (genbank: dq . ). further, its proteome was determined using the high-recovery filter-aided sample preparation (fasp) mass spectrometry ( ), revealing a total of viral proteins present in the purified virus (supplementary table ). at an average resolution of . Å, the three-dimensional reconstruction of hadv-f had continuous electron density with well-defined secondary structure elements and side-chain density (supplementary figure , supplementary movie ). local resolution estimates revealed a resolution of better than Å for large parts of the icosahedral capsid (supplementary figure ). this allowed us to build and refine an atomic model of the asymmetric unit (asu), which describes the icosahedral part of the virus (fig. b, supplementary movie ). the final asu model contained four hexon homotrimers, single chains of the penton base protein and piiia, ten chains of pvi, two chains of pviii, three chains of the triskelion protein ix, and five chains of unknown identity, revealing known and unknown protein-protein interaction surfaces (supplementary table ). electron density for the fibers was present at the interface with the icosahedral capsid but was of insufficient quality for extensive model building due to increasing flexibility in more distal parts.
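because an icosahedral shell is generated by applying 60 symmetry operations to its asymmetric unit, the chain counts listed for the asu can be scaled to the whole capsid with simple arithmetic. the sketch below does only that. the dictionary keys and the function name are illustrative, the five unassigned chains are left out, and a modelled chain may be a fragment rather than a full-length protein copy, so the results are counts of modelled chains, not measured stoichiometries.

```python
# Icosahedral point-group symmetry has order 60, so the shell is built
# from 60 copies of the asymmetric unit (ASU).
ICOSAHEDRAL_COPIES = 60

# Chains per ASU as listed in the text; names are illustrative labels.
# The five chains of unknown identity are deliberately omitted.
ASU_CHAINS = {
    "hexon": 4 * 3,      # four homotrimers, three chains each
    "penton base": 1,
    "pIIIa": 1,
    "pVI": 10,
    "pVIII": 2,
    "pIX": 3,
}


def capsid_chain_counts(asu_chains, copies=ICOSAHEDRAL_COPIES):
    """Scale per-ASU chain counts to the complete icosahedral shell."""
    return {name: n * copies for name, n in asu_chains.items()}


counts = capsid_chain_counts(ASU_CHAINS)
for name, n in counts.items():
    print(f"{name}: {n} modelled chains per capsid")
```

for example, the twelve hexon chains per asu correspond to 720 hexon chains, i.e. 240 hexon trimers, in the full shell.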
compared to the two other reported structures of human adenoviruses, hadv-c (pdb: b t ( ) ) and hadv-d (pdb: tx ( )), the sequence identity of the capsid proteins ranges from - % (supplementary table ). this generally correlated with the average structural similarity between the three structures, with more divergent proteins showing higher structural difference in terms of cα root mean square deviation (rmsd) (supplementary table ). in their adaptation to gastrointestinal tropism, a major obstacle for enteric adenoviruses was likely the passage through the low ph of the stomach to their intestinal site of infection. to investigate the adaptation to the hostile environment of the stomach, we solved the structure of hadv-f at ph= . , which resembles the diurnal average gastric ph of young children ( ) (fig. a) . at a resolution of . Å the overall structure of the capsid at ph= . was largely unchanged (overall cα rmsd . Å for proteins listed in fig. b ) and no drastic local movements were observed (fig. b) , showing the resistance of the enteric adenovirus capsid to the gastric ph. we reasoned that the gastrointestinal adaptation might also have altered the distribution of acidic and basic residues exposed on the outer surface of the capsid. to investigate this, the surface charge distribution of hadv-f along with the other two existing human adenovirus structures (hadv-c and hadv-d ) were calculated for ph= . . a visual comparison of the charge distribution revealed significant differences between the three hadvs (fig. c ). the exposed surface of hadv-d is almost entirely covered by negative charge at ph= . , and hadv-c has mostly negatively charged surfaces on the top of the hexons (in a calculation underestimating the amount of negative charge due to exclusion of the flexible, and highly negatively charged, hypervariable region , which was not built in any of the hadv-c structures ( , , ( ) ( ) ( ) ( ) ). 
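to make the ph dependence of such a charge comparison concrete, the following is a minimal sketch that estimates the net side-chain charge of a peptide with the henderson-hasselbalch relation. it is not the calculation used for the figures: the pKa values are approximate textbook-style model values, the function name and example sequence are hypothetical, and termini, solvent exposure and environment-dependent pKa shifts are ignored.

```python
# Approximate model pKa values for ionizable side chains (assumed,
# textbook-style figures; real values depend on the local environment).
ACIDIC_PKA = {"D": 3.9, "E": 4.3}
BASIC_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}


def net_side_chain_charge(sequence, ph):
    """Estimate the net side-chain charge of a one-letter sequence at a given pH."""
    charge = 0.0
    for aa in sequence.upper():
        if aa in ACIDIC_PKA:
            # Fraction of the acid that is deprotonated (charge -1).
            charge -= 1.0 / (1.0 + 10 ** (ACIDIC_PKA[aa] - ph))
        elif aa in BASIC_PKA:
            # Fraction of the base that is protonated (charge +1).
            charge += 1.0 / (1.0 + 10 ** (ph - BASIC_PKA[aa]))
    return charge


# Hypothetical acidic loop, compared at a near-neutral and an acidic pH.
loop = "GDEEDSEDG"
for ph in (7.4, 4.0):
    print(f"pH {ph}: net side-chain charge {net_side_chain_charge(loop, ph):+.2f}")
```

in this simplified picture a stretch rich in aspartate and glutamate carries less negative charge as the ph approaches the side-chain pKa values, which is why the ph at which a surface charge map is computed matters for the comparison.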
by comparison, the capsid of hadv-f is predominantly uncharged at ph= . (fig. c) . the surface charge distribution of hadv-f at ph= . revealed two distinct regions as still being relatively uncharged at this extreme ph: the n-terminal part of pix, largely occluded between hexons, and the solvent-exposed loops at the top of the hexons (fig. c ). whereas the overall structural fold of hexon chains is conserved among adenoviruses, they differ in the seven hypervariable regions (hvrs). comparing hadv-f (at ph= . ) to hadv-c and hadv-d , substantial differences were found in all seven hvrs (supplementary table ). in particular, hvr stands out in the comparison since it is much shorter in enteric hadvs (fig. d) . the absence of the highly negatively charged hvr loop in all hadv-c structures ( , ) indicates flexibility. on the other hand, the near-eliminated hvr in hadv-f is a short loop with a rigid conformation that allowed tracing the entire length of the polypeptide (fig. e ). taken together, these findings show that the capsid of enteric hadv-f is structurally unperturbed by exposure to stomach-like ph and has evolved to expose fewer charged residues on its exterior as compared to non-enteric hadvs, most prominently exemplified by a near-complete deletion of the hypervariable region . among the so-called minor capsid proteins, pix forms the most extended and complex arrangements. in previously reported structures of hadv-c and hadv-d , this amounts to a tight mesh of ordered protein density that stretches through the canyons between hexons across the virion surface ( , , ) (fig. a) . in hadv-f , only the n-terminal residues - (henceforth 'pix-n') were sufficiently ordered to trace the protein chain, thus ruling out the same sort of ordered, virus-spanning pix cage seen in other hadvs (fig. a ). analogously to other hadvs, pix-n trimerizes to form a triskelion (fig. b ) located between three non-peripentonal hexon units (fig. b) . 
each facet of the virion harbors four copies of pix-n triskelion in two distinct structural surroundings: three copies at the local -fold (l ) symmetry axes of each asymmetric unit, and one at the icosahedral -fold (i ) symmetry axis at the center of the facet (supplementary figure ) . the conformations of pix-n in these two surroundings are virtually identical (fig. b ). the three pix chains come together at the center of this triskelion, where interactions between residues phe , phe and tyr from each chain form a hydrophobic core (fig. c ). this hydrophobic core is differently arranged compared to the non-enteric hadv-c and hadv-d (supplementary figure ) and has more large hydrophobic residues at its center. sequence comparisons of pix revealed that pix-n has a higher sequence homology to hadv-c and hadv-d than its c-terminal part (residues - , 'pix-c') (supplementary figure ) . however, the sequences of pix-c are near-identical in hadv-f and the related hadv-f , indicating conservation between enteric adenoviruses. mass spectrometry analysis detected the entire pix sequence in the purified hadv-f , confirming the presence of pix-c in the purified virions (supplementary table ). we reasoned that the substantial protein mass corresponding to pix-c, emanating in the constricted space between the hexons, should be visible in the electron density map at a lower threshold even if it is flexible. indeed, the interhexonal space above pix-n harbors electron density corresponding to a flexible protein at the local -fold axes (fig. d, supplementary figure ) . this density appears to pass to the outside of the capsid between the hvr loops of the three surrounding hexons. in a localized asymmetric reconstruction ( ) , the three hvr could be resolved in their entirety (supplementary figure ) , showing that they are present in a single conformation and form a constriction of defined size (supplementary figure ) from which the c-termini of pix protrude. 
in contrast, there is clearly no electron density above pix-n at the icosahedral -fold position (fig. e), showing that the spatial organization of pix-c differs between these two positions (fig. f). taken together, these data reveal a very different arrangement of pix in enteric adenoviruses as compared to respiratory (c ) and ocular (d ) hadvs. in hadv-f , the c-terminal half of pix is flexible and exposes its c-terminus to the capsid exterior at three of the four pix positions in each virion facet.

the hadv-f penton base undergoes assembly-induced conformational changes

located on the five-fold symmetry axes of adenovirus capsids, the penton base (pb) protein forms a homopentamer that contains integrin-binding motifs and serves as an assembly hub connecting the icosahedral capsid to the fibers (fig. a). in our reconstruction of the entire hadv-f , the electron density for the pb was less well resolved than other parts of the capsid. to improve the map of the pb, we performed a localized asymmetric reconstruction of the pb monomer (supplementary figure ). the improved map allowed the building of an atomic model for the pb, which was placed into a composite atomic model of the entire asymmetric unit. overall, the hadv-f pb is very similar to the pb in hadv-c ( , , ) and hadv-d ( ), with a β-sheet rich fold that can roughly be divided into four domains: crown, head, body and tail, with the body and head as the main domains, separated by loop regions (fig. a). during assembly of the virus particle, the pb homopentamer forms a plethora of interactions with peripentonal hexons, the fiber, and the minor capsid protein piiia (supplementary table ). to investigate conformational changes induced during the pb assembly process, we solved the structure of a recombinantly expressed hadv-f pb homopentamer in solution (free pb - fpb) by cryo-em. the map had an average resolution of . Å (supplementary figure ), allowing for the placement of an atomic model (fig. b).
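two models of the same protein, such as a free and a capsid-bound penton base, are commonly compared through the root mean square deviation (rmsd) of matched cα positions. the sketch below computes both the global rmsd and the per-residue deviation. it assumes the two coordinate lists are already superposed and residue-matched (no alignment or fitting is performed), and the function names and toy coordinates are purely illustrative.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """Global RMSD between two matched, pre-superposed lists of (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = [
        sum((p - q) ** 2 for p, q in zip(a, b))
        for a, b in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


def per_residue_deviation(coords_a, coords_b):
    """Distance between each pair of matched positions, in input units."""
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]


# Toy example: two residues coincide, one has moved by 5 units.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 4.0)]

print(per_residue_deviation(model_a, model_b))
print(round(ca_rmsd(model_a, model_b), 3))
```

the toy example shows why a per-residue view is useful: a single displaced residue is diluted in the global average, so a small overall rmsd does not exclude localized rearrangements.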
comparing the atomic models of the fpb and the virion-bound pb (vpb), the overall cα-shift (rmsd) was very small (~ . Å). however, color-coding vpb by its structural deviations from fpb revealed regions with higher rmsd, indicating localized assembly-induced conformational changes ( fig. c , supplementary movie ). moreover, four sequence segments that were built in the vpb model could not be built in the fpb model ( fig. a ), indicating that these regions are disordered in solution and only become stabilized in a defined conformation upon assembly into the virion. one such region is the tail, a residue (thr -gly ) random coil region (fig. a) , which is disordered in the fpb (fig. b ). it is stabilized through interactions with two loop regions from piiia (supplementary table ). the sequence of the tail domain is largely conserved between hadvs (supplementary figure ) , suggesting a conserved role as an assembly motif. the second motif (tyr -leu ) becoming ordered upon assembly is an α-helix consisting of residues gln -thr , located close to the five-fold axis of the penton base (fig. d) , and close to where the fiber binds. although the hadv-f capsid map shows only weak density for the proximal fiber in connection with the pb, the vpb structure did allow tracing of a fragment of the conserved fiber tail (supplementary figure , supplementary table ) . thus, the folding of gln -thr may be dependent upon interactions within the capsid and/or binding of the fiber to the pb. additional assembly-dependent interactions take place in two loop regions located between residues val -asn in the body domain (fig. d ). these loops are disordered in the fpb but well resolved in the vpb where their conformation is stabilized by interactions with the peripentonal hexon. the first loop (ser -ser ) -located at the top of the body -is stabilized as an extended coil structure upon interaction with hexon chain ser -tyr loop. 
the second region (thr -gln ) - located at the bottom of the body - is stabilized as a short α-helix upon binding to an uncharged pocket formed by peripentonal hexon residues ala -ile . peculiarly for enteric hadvs, the otherwise conserved integrin-binding rgd motif has been replaced by igdd in hadv-f (rgad for hadv-f ). in the hadv-f structure the igdd-containing loop is the only surface-exposed part of the pb for which we find no continuous electron density (fig. e), despite being among the shortest loops of all hadvs ( ) (supplementary figure ). this parallels the observed flexibility of the rgd motifs in the two previously reported structures of hadvs ( ) ( ) ( ), indicating that the function of the igdd sequence may be dependent on it being flexible until interacting with a target molecule. in summary, a comparison of the pb pentamer in solution and in the virus capsid revealed several distinct motifs that become folded only upon assembly of the pb into the capsid, and further revealed that the non-canonical integrin-binding motif igdd is disordered also in the context of the assembled virus.

the dna-binding protein v is located at a conserved position at the inner face of the capsid

after initial model building of the virion at ph= . , the asymmetric unit contained five peptide chains that still had not been assigned an identity. four of the five chains were deemed too short to assign an identity to. three of those four chains interlace with different copies of pvi at positions where pvi interacts with pxiii or piiia at the inner face of the capsid (supplementary figure , supplementary movie ). the fifth unidentified peptide chain is markedly longer and is located at the inner face of the capsid, in a pocket formed by three non-peripentonal hexons (fig. a-b, supplementary table ). its electron density is resolved well enough to identify large side-chain residues and we thus reasoned that a structural bioinformatics workflow might be devised to reveal its identity.
with the initial constraint being only that residues number and in the identified mer peptide must have large side chains, we utilized a combination of the proteomics data, exclusion of proteins with known locations, real-space refinement scores and other considerations to elucidate its identity (supplementary methods , supplementary figure ). after sequential exclusion of candidates based on these criteria, a single most likely candidate remained, a sequence from the center of protein v (pv): gln -asp . pv has been reported to bind directly to dna and to pvi, thereby bridging the core and the surrounding capsid, but it has not been localized in any adenovirus structure. the built sequence of pv fits the density without any clashes or unlikely interactions with surrounding proteins, and furthermore shows a high degree of sequence conservation with pv in hadv-c and hadv-d (fig. c, d, supplementary movie ) . in both hitherto published hadv structures, hadv-c ( ) and hadv-d ( ), there is similarly shaped electron density at the corresponding position of the virus capsid (fig. e ). in hadv-c no atomic model was built into it ( ) . similarly, at the corresponding position in the hadv-d structure, two shorter peptide chains of unknown identity were placed ( ) . the location of the mer chain of pv is such that both the n-and c-terminus of pv, which are not resolved in our structure, may protrude towards the interior of the virion in agreement with the proposed role of pv to link the viral genome to the capsid. in summary, we used a systematic structural bioinformatics approach to propose a likely conserved position of pv in human adenoviruses. here, we present the structure of a major cause of diarrhea and diarrhea-associated mortality in children: the enteric human adenovirus hadv-f . 
as the first structure of an adenovirus with pronounced gastrointestinal tropism, it reveals a capsid stable to stomach ph, with substantial changes to the virion surface as compared to respiratory and ocular hadvs. overall, hadv-f has fewer charged, i.e. ph-dependent, residues exposed on the surface of its capsid. this is especially prominent at the top of the hexons where hvr is long, and rich in negatively charged residues in hadv-c . this allows interaction of hadv-c with lactoferricin through a charge-dependent mechanism ( ), which contributes to an extended tissue tropism ( , ). evolution of hadv-f has resulted in a largely truncated and less charged hvr (fig. d), seemingly to adapt to the specific conditions in the gastrointestinal tract. another major change to the capsid exterior of hadv-f is the starkly different conformation of protein ix (pix). instead of forming a virus-covering, rigid mesh, the c-terminal half of pix (pix-c) is flexible and protrudes to the outside of the capsid, kept in place by hexon hvr -containing loops. strikingly, pix-c density is observed above all pix-n trimers except at the icosahedral -fold axis (fig. d-e, supplementary figure ). in principle, each pix-c strand could either emerge in cis (i.e. above the same pix-n trimer to which it belongs), or stretch across the virion surface to emerge above another pix-n trimer (in a trans arrangement). the length of pix-c in hadv-f is compatible with both of these arrangements. a cis arrangement of all strands would be reminiscent of pix in some non-human adenoviruses, in which the conformation of pix-c is also more defined ( , , ). whereas our data do not allow tracing of individual strands of pix-c, the lack of any pix-c density above the pix-n trimer at the icosahedral -fold rules out such a pure cis arrangement of pix-c.
one possible, parsimonious interpretation would be that the central pix trimer adopts a trans arrangement, donating one pix-c strand to each of the three pix trimers at local -fold positions, which in turn have their pix-cs in cis (fig. f). whether this model is correct or not, it is clear from our data that pix arranges in a unique manner in enteric adenoviruses compared to other hadvs studied to date. all these modifications to the virion surface of hadv-f are likely related to the different set of interactors, and different ph range, that this virus encounters throughout the gastrointestinal tract. other gastrointestinal viruses interact with components such as bile (calicivirus) ( ) and lipopolysaccharides (poliovirus ( ), mouse mammary tumor virus ( )), which are crucial for infection by these viruses. besides low-ph resistant interactions of hadv-f with gastrointestinal phospholipids ( ), little is known about the hadv-f :gastrointestinal interactome. finding such interaction partners, e.g. of the disordered and exposed pix-c region, will yield further insights into the infection cycle and tropism of enteric adenoviruses. our study further unveiled how several motifs in the hadv-f pb are disordered in solution and only adopt a defined conformation upon assembly into the virus capsid, laying out another piece of the still largely unfinished puzzle of adenovirus assembly ( ). the observation that the modified integrin-interacting motif of the hadv-f pb is still disordered in the assembled virus particle highlights the need for structural studies of the interactions with its proposed binding partner, laminin-binding integrins ( ). biochemical data have defined protein v (pv) as a key protein linking the adenovirus genome to the capsid ( ), but in spite of its conserved function in adenoviruses it had not been located in the adenovirus capsid.
here we propose the point of interaction of pv to the interior of the capsid, and provide data suggesting that this position, at the junction between three non-peripentonal hexons, is conserved between hadvs. previous biochemical data have not suggested pv to interact with the hexons, but have instead suggested interactions between pv and the minor protein pvi ( , ). these data are not mutually exclusive with our identification of the pv anchoring point to the hexons, since most of the pv sequence is still unaccounted for in the structure, and several copies of pvi are found in the vicinity of the anchoring point where they may form additional interactions. taken together, the structure of the enteric adenovirus hadv-f revealed key conserved aspects of adenovirus architecture as well as highly divergent features of enteric adenoviruses, thus laying the foundation for structure-based approaches to preventing this prominent cause of diarrhea-associated mortality in young children, and for the further development of these structurally divergent adv types as vaccine vehicles.

human a cells (kind gift from dr. alistair kidd) were maintained in dulbecco´s modified eagle medium (dmem; sigma-aldrich) supplemented with % fetal bovine serum (fbs; hyclone, ge healthcare), mm hepes (sigma-aldrich) and u/ml penicillin + µg/ml streptomycin (gibco). for hadv-f (strain tak) propagation, bottles of a cells ( cm , at a % confluency) were infected with hadv-f inoculation material (produced in a cells) in ml of growth media ( % fbs) for min on a rocking table at ˚c. thereafter an additional ml of growth media ( % fbs) was added to each flask and the cells were further incubated at °c. infected cells were harvested after approximately one week, or when cells displayed clear signs of cytopathic effect. cells were collected by centrifugation, resuspended in dmem and disrupted to release virions by freeze-thawing and by addition of an equal volume of vertrel xf (sigma-aldrich).
after vigorous resuspension, the cell extract was centrifuged at rpm for min. the upper phase was transferred onto a discontinuous cscl gradient (densities: . g/ml, . g/ml, and . g/ml, in mm tris-hcl, ph= . ; sigma-aldrich) and centrifuged at rpm in a beckman sw rotor for . h at °c. the virion band was collected and desalted on a nap column (ge healthcare) into sterile pbs. the samples were split for tryptic and chymotryptic digestion and processed using a modified protocol of filter-aided sample preparation (fasp) ( ) . in brief, triethylammonium bicarbonate (teab) was added to a final concentration of mm teab prior to reduction using mm dithiothreitol at °c for min. the reduced samples were loaded onto kda mwco pall nanosep centrifugal filters (sigma-aldrich), washed with m urea, % sodium deoxycholate (sdc) and alkylated with mm methyl methane thiosulfonate. two-step digestion was performed on filters using trypsin and chymotrypsin as digestive enzymes in mm teab, . % sdc buffer. the first step was performed overnight and the second step, with an additional portion of proteases, for four hours the next day. tryptic digestion was performed at °c using pierce ms-grade trypsin protease (thermo fisher scientific). chymotryptic digestion was performed at room temperature using pierce ms-grade chymotrypsin protease (thermo fisher scientific). the peptides were collected by centrifugation and sdc was precipitated by acidifying the sample with tfa (final concentration %). the digested sample was desalted using pierce peptide desalting spin columns (thermo scientific) according to the manufacturer's protocol. the digested and desalted samples were analysed using a qexactive hf mass spectrometer interfaced with an easy-nlc liquid chromatography system (both thermo fisher scientific). 
peptides were trapped on an acclaim pepmap c trap column ( μm x cm, particle size μm, thermo fisher scientific) and separated on an in-house packed analytical column ( μm x mm, particle size μm, reprosil-pur c , dr. maisch). the stepped gradient ran from % to % solvent b in min, followed by an increase to % in min and to % solvent b in min, at a flow rate of nl/min. solvent a was . % formic acid and solvent b was % acetonitrile in . % formic acid. the mass spectrometer was operated in data-dependent mode (dda) where the ms scans were acquired at a resolution of and a scan range from to m/z. the most intense ions with a charge state of to were isolated with an isolation window of . m/z and fragmented using a normalized collision energy of . the ms scans were acquired at a resolution of and the dynamic exclusion time was set to s. data analysis was performed using proteome discoverer (version . , thermo fisher scientific). the data were searched against an in-house database containing the amino acid sequences of hadv-f . mascot (version . . , matrix science) was used as the search engine with a precursor mass tolerance of ppm for ms and mmu for ms spectra. tryptic peptides were accepted with a maximum of one missed cleavage, chymotryptic peptides with a maximum of three missed cleavages. methionine oxidation was selected as a variable modification and methylthio of cysteines as a fixed modification. the mascot significance threshold for peptides was set to . . purified hadv-f was used at . mg/ml (ph= . ) and . mg/ml (ph= . ). the recombinant hadv penton base (pb ) was purified as described before ( ) and used at mg/ml in pbs buffer, supplemented with % glycerol. a hadv sample at ph= . was prepared by adding µl of a . m citric acid / m na hpo ph= . solution to µl of hadvf- ph= . followed by incubating on ice for minutes. samples were vitrified on quantifoil cu r / (electron microscopy sciences, cat#: q cr ) and quantifoil cu r . / . (electron microscopy sciences, cat#: q cr . 
) grids for the virus particles and the recombinant protein, respectively. prior to sample application the grids were glow discharged using a pelco easiglow device (ted pella inc.) at mamp for s. sample was applied by transferring µl sample onto the glow-discharged side of the grid, blotted and plunge frozen in liquid ethane, using a vitrobot plunge freezer (thermo fisher scientific), with the following settings: °c, % humidity, blotforce = - and a blotting time of s. for hadvf- ph= . sample was applied twice with a blotting step, using the same settings as above, between applications( ). all data were collected on an fei titan krios transmission electron microscope (thermo fisher scientific) operated at kev and equipped with a gatan bioquantum energy filter and a k direct electron detector. a condenser aperture of µm (hadv-f ph= . & . ) and an objective aperture of µm were chosen for data collection. a c -aperture of µm was selected for the pb data collection. coma free alignment was performed with autoctf/sherpa. data were acquired in parallel illumination mode using epu (thermo fisher scientific) software at a nominal magnification of kx ( . Å pixel size). both datasets for the hadv-f structure at ph= . were collected in super-resolution mode. due to a preferred orientation of pb , a second data set was collected at a ° tilted stage. data collection parameters are listed in the supplementary table . data processing and structure determination hadv-f ph= . two datasets were collected on hadv-f ph= . and initially processed independently. data were initially processed using relion . ( ) and continued in relion . (beta) ( ) . beam-induced motion was corrected using relion's motioncor ( ) implementation, at which step the superresolution movies were binned once, and the per-micrograph ctf estimated using gctf ( ) for all data sets. 
particles were manually picked and subjected to reference-free d classification, and well-resolved classes were combined and subjected to d classification, applying icosahedral symmetry (i according to crowther ( ) ) and a mask of the capsid structures. a low-pass filtered ( Å) volume of hadvc- (emd- ( ) ) was used as a reference volume. particles were classified into two classes, resulting in % of particles allocated to one well-resolved class, which was used for downstream processing. d refinement was performed using the output of the d classification as a reference model, low-pass filtered to Å, with no additional fourier-padding. following refinement, data were post-processed, and the particles subjected to per-particle ctf refinement, bayesian polishing and another round of per-particle ctf refinement. the particles were subjected to an additional round of d refinement before combining both datasets and performing a final d refinement, with no additional fourier-padding. the resolution was estimated at . Å using the gold-standard fsc (threshold . ) after postprocessing. finally, the data were corrected for ewald sphere curvature using relion, which led to a local improvement of the electron density map and a new average resolution of . Å. local resolution estimates were calculated using resmap ( ) . a homology model was generated using the swiss-model server ( ) from the hadv-f capsid protein sequences for which homologues have been structurally determined. the resulting homology model was based on the reported hadv-d structure (pdb: tx ( )). the model was manually docked into the hadv-f electron density in chimerax ( ) and the map corresponding to the asymmetric unit (asu) was extracted. the asu map was locally sharpened using phenix's ( ) autosharpen tool. subsequently, the hadv-f homology model was docked and subjected to an initial round of real-space-refinement using phenix. 
the structure was fully refined using iterative cycles of phenix's real-space-refinement and model building in coot ( ) . to improve the map quality surrounding the penton base monomer and the hvr -loop containing region, the map was improved using the localized asymmetric reconstruction workflow reported by ilca et al ( ) and implemented in scipion v . ( ) . coordinates for the sub-particles were determined in chimerax and subsequently located by applying icosahedral symmetry and extracted in scipion v . . sub-particles were subsequently filtered to exclude particles not present within a [- °, °] range from the image plane. the resulting sub-particles were then subjected to an asymmetric d classification. to increase the probability of convergence during classification, changes in the origins and orientations were not allowed. a subsequent d refinement yielded a d reconstruction of the penton base monomer and the hvr -loop containing region to a resolution of . Å and . Å, respectively. average resolutions were calculated according to the gold standard fsc calculations (threshold . ). data processing statistics for the asymmetric localized reconstruction are given in supplementary table . dfsc curves were calculated using the remote dfsc processing server ( ) . image processing and model building for hadv-f at ph= . data were processed as described for the hadvf- ph= . structure up until the first d refinement. the volume hadvf- at ph= . was low-passed filtered to Å and used as a reference. the resolution was estimated to . Å using the gold standard fsc (threshold . ) after postprocessing. local resolution estimates were calculated using resmap. the hadv-f ph= . model was fitted into the reconstructed hadv-f ph= . density using chimerax and an asu extracted and the resulting map locally sharpened using phenix. the model was then further fitted and energy minimized using namdinator ( ) . 
the hadv-f penton base (pb ) data (untilted and tilted at °) were processed using relion . (beta), with beam-induced motion correction and ctf estimation performed as for the hadv-f structure. particles were picked using the automated particle picker cryolo( ) using the available phosaurus generalized model. reference-free d classification of particles was performed in relion and revealed a significant proportion of particles with the same view, suggesting a preferred orientation of the specimen. from the ° data, an initial model was generated in cryosparc ( ) . well-resolved d classes were combined and subjected to d refinement using the same reference model as during d classification, low pass filtered to Å. following refinement, data were post-processed, and the particles subjected to per-particle ctf refinement, bayesian polishing and another round of per-particle ctf refinement before performing a final round of d refinement. inspection of the final volume revealed poor resolution along one of the axes (supplementary figure ) . we therefore collected data on a tilted specimen stage. as data collection at a tilted stage leads to a defocus gradient along the image path, per-particle ctf refinement was performed after particle extraction and before reference-free d classification, using gctf. subsequent processing steps were performed as described for the data collected on an untilted specimen stage. the average resolutions were estimated to . Å (untilted) and . Å ( ° tilt), using the gold standard fsc (threshold . ) after postprocessing. local resolution estimates were calculated using resmap. dfsc curves were calculated using the remote dfsc processing server ( ) . the pb volume generated from the data collected on the tilted stage was used for down-stream model building and model refinement. the penton base monomer chain from the hadv-f ph= . model was initially fitted into pb volume using namdinator ( ) and outlying residues pruned in coot. 
subsequently the model was fully built and refined using iterative cycles of realspace-refinement in phenix and model building in coot. surface charges for hadv-c (pdb: cgv ( ) ), hadv-d (pdb: tx ( )) and hadv-f was calculated using the pdb pqr-apbs software package( ) at ph= . and ph= . . for each direction of the unknown chain, a mer poly-alanine chain was manually placed into the respective density and initially real-space-refined using coot. a list of sequences was screened by using a job pipeline including a mutation step in coot and real-space-refinement in phenix. custom bash scripts written for this purpose are available upon request. an extended description is given in supplementary methods . figures of protein structures and electron densities were generated using chimerax. prospects of replication-deficient adenovirus based vaccine development against sars-cov- . vaccines (basel) systemic and mucosal immunity in mice elicited by a single immunization with human adenovirus type or vector-based vaccines carrying the spike protein of middle east respiratory syndrome coronavirus estimates of the global, regional, and national morbidity, mortality, and aetiologies of diarrhoea in countries: a systematic analysis for the global burden of disease study use of quantitative molecular diagnostic methods to identify causes of diarrhoea in children: a reanalysis of the gems case-control study quantifying risks and interventions that have affected the burden of diarrhoea among children younger than years: an analysis of the global burden of disease study human adenoviruses: from villains to vectors structure of human adenovirus latest insights on adenovirus structure and assembly. viruses atomic structure of human adenovirus by cryo-em reveals interactions among protein networks crystal structure of human adenovirus at . 
a resolution image reconstruction reveals the complex molecular organization of adenovirus the structure of the human adenovirus penton a triple beta-spiral in the adenovirus fibre shaft reveals a new structural motif for a fibrous protein crystal structure of the receptor-binding domain of adenovirus type fiber protein at . a resolution three-dimensional structure of the adenovirus major coat protein hexon adenovirus composition, proteolysis, and disassembly studied by indepth qualitative and quantitative proteomics revised crystal structure of human adenovirus reveals the limits on protein ix quasi-equivalence and on analyzing large macromolecular complexes atomic structures of minor proteins vi and vii in human adenovirus cryo-em structure of human adenovirus d reveals the conservation of structural organization among human adenoviruses crystal structure of enteric adenovirus serotype short fiber head adenovirus type virions contain two distinct fibers human adenovirus type contains two fibers integrins alpha v beta and alpha v beta promote adenovirus internalization but not virus attachment phylogenetic analysis and structural predictions of human adenovirus penton proteins as a basis for tissue-specific adenovirus vector design enteric species f human adenoviruses use laminin-binding integrins as co-receptors for infection of ht- cells adenovirus type lacks an rgd alpha(v)-integrin binding motif on the penton base and undergoes delayed uptake in a cells block in entry of enteric adenovirus type in hek cells diurnal variation in intragastric ph in children with and without peptic ulcers universal sample preparation method for proteome analysis structural and phylogenetic analysis of adenovirus hexons by use of high-resolution x-ray crystallographic, molecular modeling, and sequencebased methods a quasi-atomic model of human adenovirus type capsid adenoviral vector with shield and adapter increases tumor specificity and escapes liver and immune control the role 
of hexon protein as a molecular mold in patterning the protein ix organization in human adenoviruses localized reconstruction of subunits from electron cryomicroscopy images of macromolecular complexes lactoferrin-hexon interactions mediate car-independent adenovirus infection of human respiratory cells latent species c adenoviruses in human tonsil tissues adenoviruses use lactoferrin as a bridge for car-independent binding to and infection of epithelial cells cryo-em structures of two bovine adenovirus type intermediates. virology - three-dimensional structure of canine adenovirus serotype capsid structural basis for human norovirus capsid binding to bile acids intestinal microbiota promote enteric virus replication and systemic pathogenesis successful transmission of a retrovirus depends on the commensal microbiota unique physicochemical properties of human enteric ad responsible for its survival and replication in the gastrointestinal tract isolation and characterization of the dna and protein binding activities of adenovirus core protein v interactions among the three adenovirus core proteins vitrification after multiple rounds of sample application and blotting improves particle density on cryo-electron microscopy grids new tools for automated high-resolution cryo-em structure determination in relion- estimation of high-order aberrations and anisotropic magnification from cryo-em data sets in relion- . 
electron counting and beam-induced motion correction enable near-atomicresolution single-particle cryo-em gctf: real-time ctf determination and correction procedures for three-dimensional reconstruction of spherical viruses by fourier synthesis from electron micrographs quantifying the local resolution of cryo-em density maps swiss-model: homology modelling of protein structures and complexes meeting modern challenges in visualization and analysis macromolecular structure determination using x-rays, neutrons and electrons: recent developments in phenix features and development of coot scipion: a software framework toward integration, reproducibility and validation in d electron microscopy addressing preferred specimen orientation in single-particle cryo-em through tilting namdinator -automatic molecular dynamics flexible fitting of structural models into cryo-em and crystallography experimental maps sphire-cryolo is a fast and accurate fully automated particle picker for cryo-em cryosparc: algorithms for rapid unsupervised cryo-em structure determination improvements to the apbs biomolecular solvation software suite inference of macromolecular assemblies from crystalline state the sequence manipulation suite: javascript programs for analyzing and formatting protein and dna sequences conformational change of the adenovirus dna-binding protein induced by soaking crystals with k uo f solutions competing interests: there are no competing interests. data and materials availability: the scripts used for the bioinformatics analysis are available upon request. coordinates reported in this study have been deposited with the protein data bank with accession codes xxxx (hadv-f asymmetric unit) and xxxx (hadvf- (free) penton base). electron microscopy maps and half-maps have been deposited in the electron microscopy data bank with the accession codes emd-yyyyy (hadv-f ph= . ), emd-yyyyy (hadv-f ph= . 
), emd-yyyyy (hadv-f (free) penton base), emd-yyyyy (localized asymmetric reconstruction of the hadv-f penton base), emd-yyyyy (localized asymmetric reconstruction of the hadv-f hvr -containing loop). ( )) and hadv-d (emd- ( )) electron densities located at the same position in their respective asymmetric units. key: cord- -yqe vdj authors: kumar, nilesh; mishra, bharat; mehmood, adeel; athar, mohammad; mukhtar, m. shahid title: integrative network biology framework elucidates molecular mechanisms of sars-cov- pathogenesis date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: yqe vdj covid- (coronavirus disease ) is a respiratory illness caused by severe acute respiratory syndrome coronavirus (sars-cov- ). while the pathophysiology of this deadly virus is complex and largely unknown, we employ a network biology-fueled approach and integrated multiomics data pertaining to lung epithelial cells-specific coexpression network and human interactome to generate calu- -specific human-sars-cov- interactome (csi). topological clustering and pathway enrichment analysis show that sars-cov- target central nodes of host-viral network that participate in core functional pathways. network centrality analyses discover high-value sars-cov- targets, which are possibly involved in viral entry, proliferation and survival to establish infection and facilitate disease progression. our probabilistic modeling framework elucidates critical regulatory circuitry and molecular events pertinent to covid- , particularly the host modifying responses and cytokine storm. overall, our network centric analyses reveal novel molecular components, uncover structural and functional modules, and provide molecular insights into sars-cov- pathogenicity. from the epicenter of the covid- (coronavirus disease ) outbreak in china, the disease has spread globally in countries/territories with over . 
million confirmed cases and almost , fatalities as of april , , and the world health organization (who) warned that the pandemic is accelerating worldwide , . apart from the human tragedy, covid- has a growing detrimental impact on the global economy and will likely cause trillions in financial losses worldwide in alone. covid- is an infectious respiratory illness caused by a highly contagious and pathogenic sars-cov- (severe acute respiratory syndrome coronavirus ). this single-stranded rna virus belongs to the family coronaviridae and is closely related to another human coronavirus sars-cov with . % nucleotide similarity , . sars-cov and another human coronavirus mers-cov (middle east respiratory syndrome-cov) caused two previous global epidemics in and , respectively, both characterized by high fatality rates , . these coronaviruses mainly spread from a contagious individual to a healthy person through respiratory droplets derived from an infected person's cough or sneeze, and from direct contact with contaminated surfaces or objects, where the virus can maintain its viability for period ranging from hours to days , . unlike other coronaviruses, sars-cov- transmits more efficiently and sustainably in the community according to center for disease control (cdc) . while majority of the patients infected with sars-cov- develop a mild to moderate self-resolving respiratory illness, infants and older adults (≥ years) as well as patients with preexisting medical conditions such as cardiovascular disease, diabetes, chronic respiratory disease, renal dysfunction, obesity and cancer are more vulnerable , . the pathophysiology of sars-cov- is complex and largely unknown but is associated with an extensive immune reaction referred to as 'cytokine storm' triggered by the excessive production of interleukin beta (il- β), interleukin (il- ) and others. the cytokine release syndrome leads to extensive tissue damage and multiple organ failure . 
while no vaccine or antiviral drugs are currently available to prevent or treat covid- , identifying molecular targets of the virus could help uncover effective treatment. integrated interactome-transcriptome analysis to generate calu- -specific human- it is likely that the outcome of sars-cov- infection can largely be determined by the interaction patterns of host proteins and viral factors. to build the human-sars-cov- interactome, we first assembled a comprehensive human interactome encompassing experimentally validated ppis from the string database . since the string database is not fully updated, we manually curated ppis from four additional proteome-scale interactome studies, i.e. human interactome i and ii, bioplex, qubic, and cofrac (reviewed in ). this yielded an experimentally validated, high-quality interactome containing , nodes and , edges (fig. a ). subsequently, we compiled an exhaustive list of host proteins interacting with the novel human coronavirus, referred to as sars-cov- interacting proteins (sips) (supplementary data ). this list comprises human proteins associated with the peptides of sars-cov- , whereas the remaining host proteins interact with the viral factors of other human coronaviruses including sars-cov and mers-cov , which could also be of significance in understanding the molecular pathogenesis of sars-cov- . by querying these sips in the human interactome, we generated a subnetwork of , nodes and , edges that covers the first and second neighbors of sips (fig. a) . given that the sips-derived ppi subnetwork may not operate in all spatial or temporal conditions, coronavirus-specific co-expression data were used to filter the interactions in the context of covid- . it is important to note that no exceptionally high-resolution sars-cov- transcriptome was available at the time of analysis (details below). therefore, we took advantage of extensive temporal expression data available for sars-cov and mers-cov (fig. b) . 
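the first- and second-neighbor query described above can be sketched as a multi-source breadth-first search over an undirected edge list. this is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `neighborhood_subnetwork`, the toy protein labels, and the choice to keep every edge whose two endpoints both lie within two hops of a seed are all hypothetical.

```python
from collections import deque


def neighborhood_subnetwork(edges, seeds, radius=2):
    """Return (nodes, edges) of the subnetwork within `radius` hops of any seed."""
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Multi-source BFS: seeds start at distance 0; stop expanding at `radius`.
    distance = {s: 0 for s in seeds if s in adjacency}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    kept = set(distance)
    # Assumption: keep every edge whose endpoints are both retained nodes.
    sub_edges = {tuple(sorted((u, v))) for u, v in edges if u in kept and v in kept}
    return kept, sub_edges


# Hypothetical toy network: one seed (SIP1) and a chain of neighbors.
toy_edges = [("SIP1", "A"), ("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]
nodes, sub_edges = neighborhood_subnetwork(toy_edges, seeds={"SIP1"})
```

in the toy example, only `SIP1`, its first neighbor `A` and its second neighbor `B` are retained; `C`, `D` and the unconnected pair `X`, `Y` are dropped.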
towards this, we performed a weighted gene co-expression network analysis (wgcna) in human airway epithelial cells (calu- ) treated with sars-cov and mers-cov over time in vitro. this analysis yielded a comprehensive co-expression network with , nodes and , , edges ( fig. b) . by integrating this calu- co-expression network with the sips-derived ppi subnetwork, we generated the calu- -specific human-sars-cov- interactome (csi), in which sips and their first and second neighbors form a network of , nodes and , edges (fig. c, supplementary data ) . we showed that csi follows a power law degree distribution with a few nodes harboring increased connectivity, and thus exhibits properties of a scale-free network (r = . ; fig. d , supplementary data ), similar to other previously generated human-viral interactomes , , , , , , , , , , , , , . taken together, we constructed a robust, high-quality csi that was further utilized for network-aided architectural and functional pathway analyses. from a network biology standpoint, a viral infection as well as other pathogen attacks can be viewed as a set of strategic perturbations, at least in part, within the core components of the host interactome , , . since such central nodes correspond to proteins that exhibit increased connectivity and/or central positions within a network, we asked whether sars-cov- also attacks such important nodes within csi. towards this, we calculated the average degree (number of connections), betweenness (the fraction of all shortest paths that include a node within a network), load centrality (the fraction of all shortest paths that pass through a node), information centrality (the harmonic mean of all the information measures for a node in a connected network) and pagerank index (counting incoming and outgoing connections considering the weight of the edge) for sips, and compared them with their first and second neighbors. 
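the scale-free property mentioned above is commonly checked by fitting a straight line to the degree distribution on log-log axes. the sketch below shows one such check using ordinary least squares; how the reported correlation was actually computed is not stated in the text, so the function `power_law_fit` and the toy degree sequence are assumptions for illustration only, and a log-log regression is a rough diagnostic rather than a rigorous power-law test.

```python
import math
from collections import Counter


def power_law_fit(degrees):
    """Fit log10 P(k) = slope * log10 k + c by least squares; return (slope, r_squared)."""
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / total) for c in counts.values()]

    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Hypothetical degree sequence in which the count of degree-k nodes falls off as k**-2.
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8]
slope, r_squared = power_law_fit(toy_degrees)
```

a strongly negative slope with an r-squared close to one is what a heavy-tailed, hub-dominated network would be expected to produce under this diagnostic.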
we demonstrated that these topological features of sips were significantly higher than those of the other nodes within csi (fig. a , b, c and supplementary fig. a and b, supplementary data ; t-test p < . ). we also showed that sips were significantly enriched in csi compared to the human interactome (fig. d, supplementary data ; hypergeometric p< . e- ). these results indicate that sars-cov- targets core structural components of the human-viral interactome, and prompted another question as to whether csi also activates common biological processes in response to viral infection. since nodes within csi not only form protein complexes with each other but are also transcriptionally co-expressed, we reasoned that densely connected nodes within this network may participate in similar biological functions. towards this, we investigated the underlying modular structures (protein clusters ≥ nodes) in csi followed by ingenuity pathway analysis (ipa). this approach allowed us to identify modules ranging from to nodes for the smallest and largest modules, respectively. subsequently, we examined the biological processes, cellular pathways and signaling cascades that are modulated in the top modules. we also performed a human phenotype ontology analysis, which identifies phenotypic abnormalities encountered in human diseases. significantly enriched terms included mitochondrial inheritance, hepatic necrosis, respiratory failure and abnormality of the common coagulation pathway ( supplementary fig. e ). collectively, we showed that sars-cov- proteins interact with central nodes of csi, and these proteins are implicated in core molecular and cellular pathways to establish infection and sustain disease progression. human-viral interactome landscapes of several viruses have previously shown that viral proteins interact with nodes corresponding to high degree (hubs) and high betweenness (bottlenecks), and such structural features have been previously used to predict viral targets , , , , , , , , , , , , , . 
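the hypergeometric enrichment of sips reported above can be illustrated with an upper-tail probability computed from binomial coefficients. the mapping of arguments used here (population = nodes in the human interactome, successes = sips in that interactome, draws = nodes in csi, observed = sips in csi) is an assumption about how such a test would be set up, not a description of the authors' exact procedure, and `hypergeom_enrichment_p` is a hypothetical helper name.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """Upper-tail probability P(X >= observed) for a hypergeometric variable X."""
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the probability mass for every outcome at least as extreme as observed.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 proteins, 5 of them targets, 5 drawn, all 5 are targets.
toy_p = hypergeom_enrichment_p(population=10, successes=5, draws=5, observed=5)
```

with the toy numbers there is exactly one way out of 252 to draw all five targets, so the tail probability is 1/252; real interactome sizes would simply be substituted for the four arguments.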
in addition to hubs and bottlenecks, the pagerank algorithm has also been used effectively to identify viral targets . moreover, these physical characteristics can also be used to prioritize the most influential genes in csi for biological relevance and drug target discovery. here, we used nine different centrality indices to identify the most influential nodes, referred to as csi significant proteins (csps). these include the above-described degree, betweenness, information centrality, pagerank index and load centrality, as well as additional features such as eigenvector centrality (a measure of the influence of a node in a network), closeness centrality (the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the network), harmonic centrality (which reverses the sum and reciprocal operations of closeness centrality) and weighted k-shell decomposition (an edge weighting method based on adding the degree of two nodes in network partition). while weighted k-shell decomposition analysis was recently performed to increase the predictability of host targets of bacterial pathogens , we showed that the top % of nodes reside in the inner layers of csi (fig. a, supplementary data ) . for other centrality measures, we also maintained a stringent threshold of top % for a node to be considered highly influential, i.e. a csp. evidently, we can expect overlapping topological features for the same set of nodes. notably, we observed a strong positive correlation between information centrality and degree ( fig. b ; r = . ), betweenness and degree ( fig. c ; r = . ) and pagerank and degree ( fig. d ; r = . , supplementary data ). collectively, we identified csps that exhibit more than one high centrality measure (fig. e, supplementary data ) . for instance, eef a, which has previously been implicated in sars , was enriched in all the centrality measures tested in our study (fig. f, supplementary data ) . 
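two of the indices defined above, closeness and harmonic centrality, and the top-fraction cutoff used to nominate influential nodes can be sketched with a plain breadth-first search on an unweighted graph. this is an illustrative sketch only: the helper names, the normalization of closeness by the number of reachable nodes, the tie-breaking rule and the toy graph are assumptions, and the cutoff fraction is a placeholder because the threshold value is not legible in the text.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path hop counts from `source` to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def closeness_and_harmonic(adjacency):
    """Closeness = reachable nodes / sum of distances; harmonic = sum of 1 / distance."""
    closeness, harmonic = {}, {}
    for node in adjacency:
        others = [d for target, d in bfs_distances(adjacency, node).items() if target != node]
        total = sum(others)
        closeness[node] = len(others) / total if total else 0.0
        harmonic[node] = sum(1 / d for d in others)
    return closeness, harmonic


def top_fraction(scores, fraction):
    """Names of the highest-scoring nodes; ties are broken alphabetically (assumption)."""
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return ranked[:keep]


# Hypothetical three-node path a - b - c; b is the most central node.
toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
closeness, harmonic = closeness_and_harmonic(toy_graph)
top_nodes = top_fraction(harmonic, fraction=0.34)
```

applying `top_fraction` to each index separately and intersecting the resulting lists is one way to find nodes that rank highly on more than one measure.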
in addition, ube i, ppia, and phb were also associated with sars and were enriched in more than five centrality measures (fig. f, supplementary data ). we categorized these csps into three major groups based on their potential roles in covid- . while we expect some, if not all, of these proteins to have more than one function, the group- csps might be largely relevant to modifying host response following sars-cov- infection (fig. e) . moreover, the proteins in the other two groups might be involved in viral entry, proliferation, survival and pathogenesis as well as cytokine storm ( fig. e ; see details in discussion). furthermore, we found that these csps are targets of some of the well-known sars-cov- viral proteins. sars-cov- nsp targets most of the csps (i.e. seven in total), sars-cov nsp targets five csps, and sars-cov- m has four csps targets, while other sars-cov- nsps' ( , , ) and sars-cov- orfs' ( b, , , c) possess relatively fewer targets. intriguingly, three of our csps (ppia, rps , and ndufa ) are targets of more than one sars-cov protein (fig g, supplementary fig. ), while phb is the target of several viral proteins tested as bait at low threshold. it is also important to note that phb is also targeted by viral proteins of sars-cov . these data support previous findings that an individual viral factor can target multiple host nodes and several viral proteins can interact with the same host protein , , , , , , , , , , , , , . collectively, these data strengthen our notion that centrality measures can be an effective method to predict highly influential nodes, leading us to discover such csps. to further understand the biological characteristics, regulatory relationships and molecular events associated with the nodes in csi, we incorporated transcriptome data of covid- patients derived from bronchoalveolar lavage fluid (balf) and peripheral blood mononuclear cells (pbmc) with our csi data . 
overall, sars-cov- infection exhibited largely different transcriptional signatures for balf and pbmc . we identified a set of and differentially expressed genes (degs) in balf and pbmc, respectively (p≤ . , fc ≥ . , fig. a , b, supplementary data ). thus, csi constitutes over % of transcriptomes pertaining to both balf and pbmc. intriguingly, in balf, we observed that the upregulated cluster a is enriched with eif signaling/translation pathway, while the two down-regulated clusters (b and c) are enriched in retinoic acid-mediated apoptosis signaling pathway (fig. a ). conversely, one major cluster that is significantly upregulated in pbmc is enriched in t cell receptor regulation of apoptosis and protein ubiquitination pathway (fig. b) . these data further support the notion that significantly enriched protein modules in csi are involved in sars-cov- pathogenesis. to reveal the regulatory circuitry and molecular events pertinent to sars-cov- infection, we performed probabilistic modeling using idrem (interactive dynamic regulatory events miner) framework that incorporates protein-dna interaction data with transcriptomics . given that idrem requires time-course transcriptional profiling data, and in vivo or in vitro temporal sars-cov- transcriptome data is currently lacking, we made use of a high-resolution temporal sars-cov dataset ( time points) . however, we only focused on those upstream transcriptional factors (tfs) and downstream target genes that were also present in balf and pbmc, which allowed us to mimic sars-cov- -mediated dynamic regulatory networks. this dynamic regulatory modeling identified several bifurcation points, where a set of tfs regulates their potential co-expressed and downstream target genes ( data ). among them, we observed the first major wave of differential regulation and activation of tfs at -hour post infection. 
at this bifurcation transcriptional event, we found a set of tfs (yy , stat , stat , and srebf ), which were also expressed in the balf transcriptome. the next major bifurcation occurred at -hour post infection, comprising and tfs expressed in balf and pbmc, respectively (supplementary data ). while we found similar sets of target genes regulated by diverse sets of tfs at different stages of infection, we also discovered multiple combinations of tfs regulating similar sets of downstream genes (fig. c). this reflects the intricate nature of dynamic regulatory relationships between tfs and their targets. next, we primarily focused on four major pathways/signaling events, i.e. cytokine storm, eif signaling/translation, protein ubiquitination pathway and t cell receptor regulation of apoptosis. in the first example of cytokine storm, we identified a total of tfs. predominantly, we found that two tfs (stat and stat ) and one master regulator (jun) are early transcriptional players activated at - and -hour post infection. in particular, we found csi genes cxcl and tnfaip co-regulated with cxcl and cxcl , and il- a and il- , respectively, indicating that members of csi participate in cytokine storm. the majority of these tfs are related to inflammatory/immune regulatory processes. similarly, during eif signaling/translation, we identified a total of tfs (mxi ,
in the last two decades, intra- and inter-species interactomes have been generated in a number of prokaryotes and eukaryotes including human, mouse, worm and plant models. investigating such interactomes has indicated that diverse cellular networks are governed by universal laws, and has led to the discovery of shared and distinct molecular components and signaling pathways implicated in viral pathogenicity. in the present study, we constructed a calu- -specific human-sars-cov- interactome (csi) by integrating the lung epithelial cell-specific co-expression network with the human interactome.
we determined that csi displayed features of scale-freeness and was enriched in different centrality measures. identification of structural modules revealed their relationships with a set of functional pathways in csi. in-depth network analyses revealed the most influential nodes. additional noteworthy findings pertain to sars-cov- transcriptional signatures, regulatory relationships among diverse pathways in csi and overall sars-cov- pathogenesis including the cytokine storm. we constructed a comprehensive and robust csi, a human-viral interactome that displayed scale-free properties (r = . ; fig. d ). we also showed that the sars-cov- interacting proteins (sips) exhibit increased average centrality indices compared to the remaining proteins in the network (fig. d , supplementary fig. a , b). numerous human-viral interactomes have previously been generated to uncover global principles of viral entry, infection and disease progression. these include human t-cell lymphotropic viruses, epstein-barr virus, hepatitis c virus, influenza virus, human papillomavirus, dengue virus, ebola virus, hiv- , and sars-cov, and all of these interactomes exhibited a power law distribution. another significant tenet of interactomes is the existence of modular structures or modules, defined as sets of densely connected clusters within a network that exhibit heightened connectivity among nodes within a module. such nodes within a module have previously been deemed to possess similar biological function or belong to the same functional pathways. since nodes in csi not only form protein complexes but are also co-expressed specifically in response to coronavirus infection, we extracted several functional modules from our network (fig. e-k). the most highly connected module pertains to eif signaling, and is comprised of protein translation-related proteins such as rps and rpls.
indeed, these ribosomal proteins have been shown to interact with viral rna for viral protein biosynthesis, and are subsequently required for viral replication in the host cells. notably, two ribosomal proteins, rpl and rps , were found to interact with several sars-cov- viral factors. moreover, both of these proteins are also csi significant proteins (csps) that harbor increased centrality measures (fig. e). intriguingly, rps has been demonstrated to operate as an immune factor that activates the tlr -mediated antiviral response. it remains to be addressed whether rps is a "double whammy" target of sars-cov- for ( ) hijacking this important factor for viral translation and replication, and ( ) suppressing a critical immune signaling pathway. regardless, ribosomal proteins are critical targets of numerous viruses and play equally essential roles in developing antiviral therapeutics. the ubiquitin proteasome system (ups) constitutes the major protein degradation system of eukaryotic cells, participates in a wide range of cellular processes, and is another critical target of diverse viruses. ups plays an indispensable role in fine-tuning the regulation of inflammatory responses. for instance, proteasome-mediated activation of nf-κb regulates the expression of proinflammatory cytokines including tnf-α, il- β and il- . similarly, ups is indispensable in the regulation of leukocyte proliferation. the ups is generally considered a double-edged sword in viral pathogenesis. for example, ups is a powerhouse that eliminates viral proteins to control viral infection, but at the same time viruses hijack ups machinery for their propagation. in the case of herpes simplex virus type , varicella-zoster virus and simian varicella virus, induction of nf-κb-mediated host innate immunity is suppressed by the manipulation of ups components. moreover, it was revealed that ups plays crucial roles at multiple stages of coronavirus infection.
in our study, the ubiquitin proteasome module was composed of several members of s proteasome atpase or non-atpase regulatory subunits, which includes two csps, psmd and psma (fig. e ). it still needs to be determined whether these two csps play important roles in the expression of proinflammatory cytokines and are potentially involved in the cytokine storm. while the mechanistic evaluation of sars-cov- interaction with these two highvalue targets needs to be explored, both the mrna and protein expression corresponding to psmd was recently shown to be decreased up to % in aged keratinocytes . since reduced proteasome activity results in aggregation of aberrant proteins that perturb cellular functions, we hypothesized that sars-cov- targets these csps to interfere with er-mediated cellular responses. another noteworthy module is the t cell receptor regulation of apoptosis. indeed, it was recently reported that sars-cov- infection may cause lymphocyte apoptosis demonstrated by overall cell count and transcriptional signatures in pbmc of covid- patients , . another significant csp in this pathway is mtch (fig. e) , a proapoptotic protein that triggers apoptosis independent of bax and bak . we hypothesized that cytokines-mediated induction of cytokine storm is partially dependent on the sars-cov- interaction with mtch . taken together, our module-based functional analyses identified several novel molecular components, structural and functional modules, and overall provided insights into the pathogenesis of sars-cov- . our network topology analyses discovered csi significant proteins (csps) that have been implicated in several above described modules and pathways (fig. e ). to provide a system-wide perspective of the importance of these csps in covid- , we categorized these csps into three groups based on their possible functionality. group- includes csps that are potentially relevant to modifying host response following sars-cov- infection. 
these include eef a , etfa, mrps , mrps , mtch , ndufa , rab a, rab a, rab c, rab a and rhoa (fig. e) . we hypothesized that such csps are important in creating protective environment in host tissue following the viral infection. for example, rab and rho group of ras proteins may be involved in augmenting inflammatory signaling pathways. while antioxidants regulating mitochondrial and cytoplasmic proteins are possibly important in regulating and maintaining redox homeostasis , another csp, sccpdh, is involved in the metabolic production of lysine (lys) and α -ketoglutarate (α-kg) . intriguingly, l-lysine supplementation appears to be ineffective for prophylaxis or treatment of herpes simplex lesions . we hypothesized that sars-cov- may target sccpdh to hijack the biosynthesis of this essential amino acid for its benefit. group- csps that we identified are likely to be hijacked by sars-cov- for its entry, proliferation and survival in the host tissue. in this category, one of the most important csps is prohibitin (phb; fig. e ). phb is an important protein shown to be a receptor for dengue and chikungunya viruses , . although it has been shown that ace serves as the main receptor for sars-cov- entry into the cells , it is quite interesting that pathogenesis of the viral infection is not significantly different between the populations of hypertensive patients who receive or don't receive ace inhibitors , , , . therefore, it is plausible that under certain physiological conditions when sars-cov- does not engage with the ace receptor for its entry into the cell, phb serves as an alternative receptor. another csp integrin β encoded by itgb was recently shown to be required for the entry of rabies virus . whether itgb could also promote the entry of sars-cov- is another question that needs to be addressed. mepce is another important enzyme involved in rna stabilization by capping the ´ end of rna with methyl phosphate . 
it is also likely that mepce is utilized by the covid- virus for stabilization of its rna in the host tissue. similarly, ppp ca was shown to regulate hiv- transcription by modulating cdk phosphorylation , and thus is potentially involved in the gene regulation of sars-cov- . as discussed above, psma and psmd are the two proteasomal csps , , . while infecting the lung epithelium, sars-cov- may utilize these ups proteins for the fusion with the host cell membrane (fig. e) . similarly, nup can also be utilized for viral entry into the nucleus. additional three csps in this category, rpl and rps , as well as srp , could be employed for viral transcription and protein synthesis (fig. e) . finally, group- csps are proteins, which sars-cov- may utilize both to facilitate its proliferation as well as to induce a conducive environment in the host tissue for its sustenance and pathogenesis (fig. e) . these csps include ap m , csnk b, eef a , etfa, larp , rtn and ube i. among these csps, eef a , ppia, psma , psmd , rab a, rab a, rab c, rab a, rhoa and ube i are identified as the ones that are potentially associated with the pathogenesis of the cytokine storm as observed in some severely affected patient populations. intriguingly, eef a , a target of several viruses, is known to be activated upon inflammation . this csp is independently identified as one of the major regulators in human-sars-cov- predicted interactome . the csps, which regulate protein folding and translation, for example eiaf , could be utilized by sars-cov- to halt host protein translation, folding and protein quality control. in addition, we also identified e f , tbx and smarcb as first neighbors of some of these csps. these csps complexes play key roles in promoting cell death, causing inflammation and acting enzymatically as viral integrases. 
collectively, these csps and their first neighbors could directly and indirectly perform intricate pathophysiological functions, but those mentioned here could be the key effects of covid- on host tissue dysregulation. this classification is also crucial for the design of effective therapeutic interventions against covid- . finally, we presented transcriptional modeling of csi genes including csps that participate in cytokine storm, eif signaling/translation, protein ubiquitination pathway and t cell receptor regulation of apoptosis. thus, these signaling pathways and tfs discovered through our analyses could provide important clues about effective drug targets and their combinations that can be administered at different stages of covid- . in conclusion, we generated a human-sars-cov- interactome, integrated the virus-related transcriptome with the interactome, discovered covid- pertinent structural and functional modules, identified high-value viral targets, and performed dynamic transcriptional modeling. thus, our integrative network biology-based framework led us to uncover the underlying molecular mechanisms and pathways of sars-cov- pathogenesis. to build the human interactome, we assembled a comprehensive protein-protein
we obtained microarray data for gse , gse , gse from the geo database and used geo r, an interactive web tool, to generate differential gene expression between infection and mock treatments at their respective time points. briefly, geo r utilizes limma, an r package for the analysis of gene expression microarray data. specifically, it uses linear models for analyzing designed experiments and assessing differential expression. a threshold of log fold change and fdr ≤ . was set for differential expression analysis of all microarray experiments.
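the thresholding step described above can be sketched as follows. this is a minimal python illustration, not the limma/geo r workflow itself: it assumes that the fdr is obtained by benjamini-hochberg adjustment of raw p-values (the default adjustment in limma's toptable) and that the fold-change threshold applies to the absolute log fold change. the gene names, p-values and cutoff values are hypothetical, because the cutoffs used in the study are not legible in this text.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, log_fc_cutoff, fdr_cutoff):
    """genes: list of (name, log fold change, raw p-value) tuples."""
    fdr = benjamini_hochberg([p for _, _, p in genes])
    return [
        name
        for (name, log_fc, _), q in zip(genes, fdr)
        if abs(log_fc) >= log_fc_cutoff and q <= fdr_cutoff
    ]


# hypothetical input values for illustration only
example_genes = [
    ("gene_a", 2.0, 0.01),
    ("gene_b", -1.5, 0.04),
    ("gene_c", 0.2, 0.03),
    ("gene_d", 3.0, 0.5),
]
adjusted = benjamini_hochberg([p for _, _, p in example_genes])
degs = select_degs(example_genes, log_fc_cutoff=1.0, fdr_cutoff=0.1)
```

here gene_c is dropped because its fold change is too small and gene_d because its adjusted p-value is too large, so only gene_a and gene_b pass both filters.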
for comparative study of sars-cov- expression pattern, we downloaded expression data set of rnas isolated from the bronchoalveolar lavage fluid (balf) and peripheral blood mononuclear cells (pbmc) of covid- patients . the criteria for filtering out significant genes were kept as adjusted p-value < . and foldchange > we mined calu- cells-specific datasets from geo database , and downloaded gse (wild type), gse (icsarscov) and gse (locov). we performed individual weighted gene co-expression network analysis (wgcna) package (r version . . ), and constructed three co-expression networks. moreover, we also generated topological overlap measure (tom) plots to compute a numerical entity that reflects interconnectedness among genes within a co-expression network. a cut-off of . was used to export the networks. subsequently, we merged these networks to generate a comprehensive calu- cells-specific co-expression to study the network connectivity pattern of interactome. to extract the calu- -specific human-sars-cov- interactome (csi) we integrated the cytoscape (version . . ) was used to visualize all the networks. the functional enrichment analysis was done by kyoto encyclopedia of genes and genomes (kegg), ingenuity pathway analysis (ipa), wikipathways, go biological process, cluego, and enricher for human phenotype ontology and rare diseases term with their statistically significant parameters . interactive visualization of dynamic regulatory networks (idrem) is a method which incorporates static and time series expression data to reconstruct condition-specific reaction network in an unsupervised manner . additionally, the regulatory model identifies specific stimulated pathways and genes, which uses statistical analysis to recognize tfs that vary in activity among models. 
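for readers unfamiliar with the topological overlap measure (tom) used in the co-expression step above, the following python sketch computes the unsigned tom from a small adjacency matrix and exports the gene pairs above a cutoff. it is an illustration under stated assumptions and not the wgcna implementation: the adjacency values, gene names, function names and the cutoff are hypothetical, and the cutoff used in the study is not legible in this text. the formula assumed is tom(i, j) = (sum over u of a(i, u) a(u, j) + a(i, j)) / (min(k(i), k(j)) + 1 - a(i, j)), where k is the summed connectivity of a gene.

```python
def topological_overlap(adj):
    """Unsigned topological overlap for a symmetric adjacency matrix in [0, 1]."""
    n = len(adj)
    # connectivity of each gene: summed adjacency to all other genes
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                tom[i][j] = 1.0
                continue
            # adjacency shared through every third gene u
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom


def edges_above(tom, names, cutoff):
    """Gene pairs (i < j) whose overlap is at least `cutoff`."""
    n = len(names)
    return [
        (names[i], names[j], tom[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if tom[i][j] >= cutoff
    ]


# hypothetical three-gene adjacency matrix
names = ["gene_a", "gene_b", "gene_c"]
adj = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.5],
    [0.6, 0.5, 0.0],
]
tom = topological_overlap(adj)
exported = edges_above(tom, names, cutoff=0.65)
```

two genes receive a high overlap when they are strongly connected to each other and also share strong connections to the same third genes, which is why a tom cutoff tends to retain tightly interconnected gene pairs.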
we implemented idrem on , cumulative differentially expressed genes across hours of sars-cov infection with log normalization for dynamic regulatory event mining with all human , tfs/targets collections from the encode database. the dynamic activated pathways regulated by tfs were generated by the ebi human gene ontology function. hypergeometric test, linear regression (r ), and student t-test were performed using r version . . as well as the online stat trek tool. all datasets used for this study are accessible through supplementary data files. the authors declare no competing interests. the authors also declare no financial interests. the regulators only expressed in balf and pbmc transcriptomes are highlighted. significant regulators (tfs) control the regulation dynamics (p< . ).
major bifurcation of pathways occurs at -hour with a total of tfs involved in dynamic modulation key: cord- -l dm pp authors: santhakumar, diwakar; rohaim, mohammed abdel mohsen shahaat; hussein, hussein a.; hawes, pippa; ferreira, helena lage; behboudi, shahriar; iqbal, munir; nair, venugopal; arns, clarice w.; munir, muhammad title: chicken interferon-induced protein with tetratricopeptide repeats antagonizes replication of rna viruses date: - - journal: sci rep doi: . /s - - -y sha: doc_id: cord_uid: l dm pp the intracellular actions of interferon (ifn)-regulated proteins, including ifn-induced proteins with tetratricopeptide repeats (ifits), attribute a major component of the protective antiviral host defense. here we applied genomics approaches to annotate the chicken ifit locus and currently identified a single ifit (chifit ) gene. the profound transcriptional level of this effector of innate immunity was mapped within its unique cis-acting elements. this highly virus- and ifn-responsive chifit protein interacted with negative sense viral rna structures that carried a triphosphate group on its ′ terminus (ppp-rna). this interaction reduced the replication of rna viruses in lentivirus-mediated ifit -stable chicken fibroblasts whereas crispr/cas -edited chifit gene knockout fibroblasts supported the replication of rna viruses. finally, we generated mosaic transgenic chicken embryos stably expressing chifit protein or knocked-down for endogenous chifit gene. replication kinetics of rna viruses in these transgenic chicken embryos demonstrated the antiviral potential of chifit in ovo. taken together, these findings propose that ifit specifically antagonize rna viruses by sequestering viral nucleic acids in chickens, which are unique in innate immune sensing and responses to viruses of both poultry and human health significance. 
innate immune responses, primarily triggered by interferons (ifns) and their antiviral effectors, can establish an extremely potent antiviral state to efficiently restrict virus replication and virus-induced pathologies in the susceptible host. to initiate these innate responses, viruses are detected by the host pathogen recognition receptors (prrs) through recognition of viral molecular signatures known as pathogen associated molecular patterns (pamps). pamps associated with most viruses include viral double-stranded rna (dsrna), which is produced as a replicative intermediate, and single-stranded adenosine/uridine (au)-rich regions. these pamps effectively activate cascades of signalling events that culminate in the production of ifns in virus-infected cells. the released ifns transcriptionally activate hundreds of ifn-stimulated genes (isgs) in uninfected neighbouring cells that instigate direct or indirect antiviral activities. well-studied examples of these antiviral effectors are dsrna-activated protein kinase r and ′- ′ oligoadenylate synthetase, which bind to dsrna, and interferon induced proteins with tetratricopeptide repeats (ifit)- and - , which bind to ′-triphosphate containing rna. ifit genes are evolutionarily conserved and possibly originated by gene duplication. all members of the ifit family consist of multiple tetratricopeptide repeats (tpr) throughout the length of the protein that are mainly responsible for protein-protein interactions and assembly of larger protein complexes. most mammals encode several ifit genes including ifit /isg , ifit /isg , ifit /isg and ifit /isg ; however, mice and rats lack ifit and horses lack ifit . amongst all known members of the family, ifit and ifit are highly responsive both at the transcription and translation levels to diverse cellular stresses including those induced by dsrna, virus infections and lipopolysaccharides.
the ifit protein family is responsible for a diverse array of cellular activities including nucleic acid sensing and direct antiviral effects. specifically, ifit is proposed to be a negative-feedback regulator of virus-triggered induction of type i ifns and cellular antiviral responses whereas ifit potentiates antiviral signalling. additionally, mammalian ifit can sense and bind to numerous short cellular rnas such as initiator trna, and these interactions are mapped across the protein surface. the most prominent features of ifit proteins are attributed to their involvement in the inhibition of virus replication through nucleic acid sensing, possibly leading to inhibition of translation. ifit protein, in addition to ifit , discriminates between cellular and viral mrna for initiating downstream antiviral activities by recognition of discrete features at the ′ termini. most eukaryotic cellular ribosomal and transfer rnas (rrnas and trnas) carry a monophosphate at the ′-termini whereas messenger rnas (mrnas) bear an n -methylguanosine cap (cap ) attached to the first base through a ′- ′ triphosphate bridge that recruits cellular factors involved in rna processing and translation initiation. additionally, in most higher eukaryotes methylation occurs at the ′-o position of the first or second base yielding the cap (m gpppnmn) or cap (m gpppnmnm) structures, respectively. although cap and cap are not crucial for mrna translation, human and murine ifit protein can inhibit translation of cap -lacking mrna. these features of cellular rnas are also mimicked by several viruses as countermeasure strategies. however, viral genomic and subgenomic rna of negative sense single-stranded viruses such as influenza a viruses (iav) and newcastle disease viruses (ndv) bear a triphosphate at the ′-termini. these pamps are sensed by cellular prrs and initiate innate immune responses, which ultimately restrict virus growth.
significant genetic, functional and structural features have been recently attributed to the mammalian ifit genes and proteins [ ] [ ] [ ] [ ] . there is limited information available on the repertoires of cellular proteins that recognize different populations of viral nucleic acid in avian species, especially in chickens, which differ significantly in mounting innate immune responses and are infected by pathogens that continuously pose zoonotic threats to public health. here, we have genetically characterized chicken ifit and revealed that chickens, in contrast to other vertebrates, encode only one ifit gene (chifit ). the chifit gene was transcriptionally highly responsive to both type i ifns and rna viruses (iav and ndv) and, interestingly, this responsiveness was mapped to the isre motif in the cis-acting elements. we also demonstrated that chifit specifically interacts with ssrna carrying a ′-ppp moiety. through this interaction, chifit exerts strong antiviral activities in lentivirus-transduced stable cell lines whereas crispr/cas -mediated chifit knockout promoted virus replication. finally, employing the rcas-based retroviral gene transfer vector system, we generated transgenic chicken embryos expressing chifit and demonstrated its antiviral potential in ovo. these findings were further evaluated by rcas-mediated gene silencing in developing transgenic chicken embryos by assessing the replication kinetics of rna viruses. these analyses provide evidence of the presence of a functional homologue of ifit and expand our understanding of the breadth and dynamics of nucleic acid sensing in chicken.
genomic annotation of ifit locus revealed that chicken genome encodes ifit gene.
the genome of all major mammalian species (including human, mouse, dog and horse) encodes ifit genes; however, these genes are only genetically and functionally characterized in a limited number of species.
the human genome encodes four ifit genes (ifit , ifit , ifit and ifit ) whereas rats and mice lack ifit and horses lack ifit (fig. a ). in addition, several pseudogenes have been identified in different animal species including ifit b (in human, mice and rabbits), ifit c (in mice), ifit -like (in dogs and mice) and ifit -like gene (in dogs and rabbits). to identify corresponding ifit homologues in chicken, initially several ifit genes from human and mouse were used in the blast algorithm in the ensembl database. based on high genetic similarity with huifit , only a single gene (ensgalt ) was identified in the chicken genome (fig. a) . the ifit locus, encoding all ifit genes, is primarily mapped in between the lysosomal acid lipase/cholesteryl ester hydrolase (lipa) and pantothenate kinase (pank ) genes. since only a single ifit gene (ifit ) was identified in the chicken genome, we next examined an approximately . kb genomic sequence spanning aadn . contigs in the chicken chromosome (chromosome : , , - , , ). immediately adjacent to the lipa gene, a sequence gap was identified whose estimated length was bps in the ensembl chicken genome build (ensembl release -july ) ( supplementary fig. a ). long primers (supplementary table ) flanking each end of the gap were used to amplify genomic fragments and subsequent sequencing of the products was used to cover the genomic gap. we used the sequenced fragments as input in webaugustus and predicted any possible genes in the entire ifit locus. only one gene (ifit ) was predicted in any strand of the input dna using chicken as species parameters ( supplementary fig. a ). the genomic gap has now been filled with the latest ensembl chicken genome (ensembl release -dec ) and no gene other than ifit has been identified (until april ). to confirm the sequence of the cdna and to identify the genomic structure of the identified ifit gene, the transcript was amplified from rna extracted from ndv-infected cefs. 
complete sequence analysis of the gene revealed an open reading frame (orf) of bps ( amino acids excluding stop codon) with high sequence identity ( % and %) with the human and duck ifit proteins, respectively (supplementary fig. b). the identified gene showed a characteristic exon/intron organization where two exons were separated by an intron a few kilobases (kb) long (fig. b). the first exon encodes only a kozak sequence (catg) and the ′ untranslated region (utr). the second exon codes the rest of the orf ( bps) and the ′ untranslated region (utr) of the chicken ifit gene. based on the transcriptomic data from marek's disease virus (mdv, strain rb b)-infected cefs (fig. b) and cdna sequencing data, a complete ifit gene was annotated. phylogenetic analysis of the characterized chicken (ch) ifit gene with all known human, mouse and duck ifit genes clustered the chicken gene with duck and human ifit (fig. c). one of the structural hallmarks of all ifit proteins is the presence of multiple tpr motifs dispersed throughout the length of the protein. the consensus chicken, duck and human ifit sequences were used to predict tprs using ncbi's conserved domain database. both duck and human ifit proteins structurally carried ten tpr motifs and two multi-domains whereas chicken ifit encoded eight predicted and ten structure-based tprs (fig. d). taken together, the gene synteny, genetic similarity, genomic architecture and annotation indicate that chickens encode only one ifit gene compared to at least four in mammals. based on genetic clustering, it is highly similar to ifit genes of other avian (chickens and ducks) and mammalian species. moreover, no ortholog for ifit , ifit , ifit and no pseudogenes were identified in the current ensembl chicken genome build (ensembl release -dec ).
figure . genomic architecture along with relative loci around ifit genes in human, mouse, dog, horse and chicken, and gene annotation in chicken. (a) the ifit locus in compared species is flanked upstream with lipa gene and downstream with pank gene. direct syntenic analysis identified a single ifit gene in chicken compared to four in other species along with other pseudogenes. (b) transcriptomics profiling of chicken primary fibroblasts which were mock-infected or infected with rb b strain of mdv. blue bar represents the transcript in the current chicken genome assembly whereas the red bar represents the mappability of the transcript to the chicken genome. a final transcript from mdv-infected data and gene characterization is shown at the bottom. (c) phylogenetic analysis of ifit genes in different species. based on the clustering patterns and sequence homologies, the single identified gene clustered closer to ifit of duck and human. (d) putative tetratricopeptide repeats (tpr) showing characteristic features of ifit proteins. (e) expression and subcellular distribution of chifit in chicken embryo fibroblasts. chicken cells were transfected with ng of mammalian expression vectors encoding v -tagged chifit for hours and were left untreated (ndv-) or were treated with moi of ndv-gfp (ndv+) for another hours before fixation, staining for nucleus (blue), chifit (red) and gfp marker (ndv).
mouse ifit protein is predominantly expressed in the cytoplasm and upon stimulation with the ifn, it accumulates in the cytoplasm (> × molecules per cell), and this abundance is identified to be crucial for its antiviral function. besides the distribution pattern of mouse ifit , no other ifit proteins have been investigated for their subcellular distributions. to delineate the expression dynamics and sub-cellular distribution, df- cells were transfected with v -tagged chifit and transient expression of chifit was compared in ndv-stimulated or mock-treated cells. confocal microscopy using anti-v antibodies showed that chifit was exclusively cytoplasmic and was expressed throughout the cytoplasm; however, this expression was concentrated at the cell surface (fig. e, upper panel). we were unable to demonstrate the distribution pattern of chifit under ndv infection (fig. e , lower panel), probably due to the profound antiviral state induced by chifit so that infection of ndv in the chifit -expressing cells could not be achieved. it has previously been demonstrated that human ifit enhances the innate immune responses by interacting with rig-i (not identified in chicken) and mavs and this interaction occurs at the mitochondria. since mavs localizes on both mitochondria and endoplasmic reticulum (er)-derived membranes (mem), we labelled mitochondria and er in the presence of chifit to assess the distribution and localization of chifit with mavs. in df- cells, chifit localized in close proximity to the mitochondria and no co-localization was observed between er and chifit (data not shown). owing to lack of cross-reactivity of human ifit antibodies and specificity of anti-sera which we have raised against chifit , these experiments were performed using tagged-chifit . therefore, future studies are required to assess the expression patterns of the endogenous chifit under virus or non-virus stimuli. transcriptional profiling of chifit in vitro and ex vivo. we next assessed the nature of ligands that can transcriptionally induce the expression of chifit . different ligands that can either directly induce the transcription of ifit such as ifn-β or stimuli that result in the expression of ifns, which in turn induce the ifit transcription, were assessed (fig. a). tlr (lipopolysaccharide, lps) and tlr (poly i:c, synthetic dsrna) ligands significantly induced the expression of chifit gene by and folds, respectively (fig. b).
the induction likely occurs through the activation of ifns, since chifn-β stimulation profoundly increased the transcription of chifit by fold compared to untreated or mock-treated cells. there is evidence that negative sense single stranded rna viruses (e.g. influenza and ndv) produce dsrna as an intermediate by-product during the virus replication cycle , - . therefore, it is plausible that induction of chifit expression in chicken cells infected with ndv is mediated by virus-generated dsrna (fig. b). to further assess the temporal effect of ndv infection on the induction of chifit , a time course profile of ifn and ifn-regulated genes was evaluated. the virus-induced expression of chifit was clearly detectable as early as hours post-infection (fig. c) and the expression was maintained for hours. a slight reduction in ifit expression observed at hours post-infection (hpi) was restored at hpi (fig. c). this biphasic expression of ifit after virus infection was repeatedly observed in chicken cells. in contrast, the expression of the myxovirus-resistance protein (mx) gene, which is another well-characterized isg , showed a steady up-regulation, with peak expression observed at the latest time post-infection (fig. e). the level of chifn-β gene induction was proportional to ndv replication (fig. d); therefore, it can be inferred that the virus-induced expression of chicken ifit is ifn-dependent and that chifit is an early isg with the capacity to modulate the initial steps of the virus life cycle. to investigate transcriptional activation of ifit following virus infection in chicken, we analysed a panel of rna transcripts derived from selected tissues (liver, kidney, spleen, beak, trachea, lungs and duodenum) of chickens ( -weeks-old rhode island red) which were mock-infected or infected with h n avian influenza virus strain a/chicken/pakistan/udl / (udl /h n ).
all tissues from infected birds contained relatively higher levels of chifit compared to the corresponding tissue samples derived from mock-infected chickens (fig. g). these results indicate that virus infection positively regulates the transcription of chifit in chickens.

promoter structures contributing to the ifn, dsrna or virus-mediated transcriptional activation of chifit gene. most studied ifit genes respond to ifn or ifn stimuli through one to four ifn-stimulated response elements (isre), which are mainly present within base pair (bp) of the transcriptional start site, either in the orientation of the encoded gene or in the reverse complement order . on the other hand, several ifit genes lack isre motifs in their promoters, including ifit from chimpanzee, ifit b from human, chimpanzee and dog, and ifit from horse, chimpanzee and dog . since our data and previously published transcriptomics studies support the profound expression of the ifit gene in response to a wide range of stimuli, we analysed the ′ flanking region of the chifit gene for motifs that regulate the expression of chifit . in addition to the gene-encoding sequences, an approximately . kb sequence including the putative promoter (supplementary fig. a) was isolated and sequenced. inspection of the promoter region revealed the presence of two consecutive isre motifs within bp of the transcriptional start site (fig. a and supplementary fig. b). these elements were preceded by a tata element and the binding motif for the specificity protein (sp ) transcription factor. a putative and weak ifn-gamma-activated site (gas) was identified at the distal end of the promoter and six consecutive gaaann elements were predicted between the gas and sp motifs (fig. a). gas is essential for ifn-gamma-mediated transcription of genes, whereas gaaann elements have been demonstrated to regulate virus-induced and type i ifn-regulated genes via the binding of interferon regulatory factors , .
to determine the role of these cis-acting regulatory elements in the responsiveness to different ifn ligands and stimuli, the entire promoter region ( bp) was cloned upstream of the luciferase gene in a promoter-less pgl . -basic vector (fig. a). additionally, five reporter constructs were generated, each containing either all of the gaaann elements, sp , tata box, double or single isre promoter sequences (fig. a,b). luciferase reporter assays were performed to assess whether the upstream region of the chifit gene can mediate responses to stimuli from chifn-β, dsrna or virus infection. df- cells were transiently transfected with these reporter plasmids and subsequent luciferase activity was monitored after stimulation with either chifn-β (fig. c), dsrna (fig. d) or ndv (fig. e). similar to ifn-induced promoter activation of the chmx gene , chifn-β induced -fold higher luciferase activity in the full-length promoter (fig. c). however, further deletion of gas, gaaann, sp and tata box resulted in an approximately -fold reduction in basal transcription activity. additional deletion of the distal isre reduced the ifn-responsiveness of the promoter. however, repeated luciferase assays demonstrated that the isre motif just proximal to the transcriptional start site was exceedingly responsive to ifn treatment, with a maximum of -fold induction in luciferase activity as compared to the control vector without a promoter sequence (fig. c). in the dsrna-dependent promoter induction (fig. e ), results demonstrated that the above elements and motifs (gaaann, sp and tata box) are least important in controlling the transcriptional activation of the luciferase gene, and no significant difference was observed in the absence or presence of these motifs.
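The fold-induction values quoted above compare a reporter construct against the promoter-less control vector. The following is a minimal sketch of how such a value can be computed from paired firefly and Renilla readings. The normalisation to Renilla is an assumption (this section lists a Renilla plasmid among the reagents but does not describe the calculation), and the function names and readings are hypothetical.

```python
from statistics import mean


def relative_luciferase(firefly, renilla):
    """Normalise each firefly reading to its paired Renilla reading.

    Assumption: Renilla serves as the transfection control; the text does
    not spell out the normalisation that was actually used.
    """
    if len(firefly) != len(renilla):
        raise ValueError("firefly and renilla readings must be paired")
    return [f / r for f, r in zip(firefly, renilla)]


def fold_induction(sample_ratios, control_ratios):
    """Mean normalised signal of a construct divided by that of the control."""
    return mean(sample_ratios) / mean(control_ratios)


# Hypothetical raw readings (arbitrary light units), three replicate wells each.
control = relative_luciferase([100.0, 120.0, 110.0], [1000.0, 1200.0, 1100.0])
construct = relative_luciferase([2000.0, 2400.0, 2200.0], [1000.0, 1200.0, 1100.0])

fold = fold_induction(construct, control)
print(f"fold induction over control vector: {fold:.1f}")
```

Dividing by the Renilla signal before averaging removes well-to-well differences in transfection efficiency, so the ratio reflects promoter activity and not plasmid uptake.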
however, the promoter sequence containing dual isre motifs severely compromised the responsiveness of the ifit promoter to dsrna, and the dsrna-dependent promoter induction was restored when the distal isre was deleted, further suggesting the importance of the proximal isre motif (fig. e ). similar to dsrna induction, the construct containing only one isre motif was highly inducible by ndv compared to the construct carrying both isre motifs (fig. e). as was observed in the ndv-induced chmx promoter activation, all other constructs with variable lengths and motifs were also responsive to virus infection. comparison of these three stimuli indicated lower luciferase activities in ndv and dsrna-treated chicken cells compared to chifn-β induced promoter activation, presumably because ndv and dsrna activate the ifit promoter indirectly by inducing endogenous ifns. following the demonstration of robust reporter function of the chifit promoter containing the proximal isre motif in response to chifn-β, dsrna and ndv, we compared the sequence of cis-acting regulatory elements with the previously characterized chmx promoter . in silico analysis of the highly responsive bp sequence indicated that the isre motif ( ′-gctttcactttct- ′ at position − to − in the pchifit − /+ construct) matched the consensus isre (g/a/t)(g/c)tttcn - tttc(a/t/c) found in most ifn-inducible genes, including the chmx promoter sequence (fig. a). to delineate the importance of this motif further, we mutated both triplet thymidines (ttt), the core residues for the ifn-inducible activation of genes (fig. b and supplementary fig. c). responsiveness of the wild type (wt) and mutated isre motifs was assessed in chicken df- cells with or without stimulation by chifn-β, dsrna or ndv using luciferase assays. while both pchifit - /+ and pchifit - /+ -mut promoters were not auto-stimulatory, stimulation with any of the stimuli positively regulated the luciferase gene in pchifit - /+ .
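The consensus comparison and the thymidine mutation described above can be illustrated with a short pattern-matching sketch. The motif sequence and the ttt-to-tat substitution are taken from the text. The spacer length of one to two nucleotides is an assumption (the range is not legible here, although the chifit motif itself carries a two-nucleotide spacer), and the function names are hypothetical.

```python
import re

# Consensus from the text: (G/A/T)(G/C)TTTC N TTTC(A/T/C).
# Assumption: the spacer N is 1-2 nucleotides long.
ISRE = re.compile(r"[GAT][GC]TTTC[ACGT]{1,2}TTTC[ATC]")


def find_isre(seq):
    """Return (start, motif) for every consensus ISRE match in seq."""
    return [(m.start(), m.group()) for m in ISRE.finditer(seq.upper())]


def mutate_ttt_pairs(motif):
    """Apply the TTTNNNTTT -> TATNNNTAT substitution once."""
    return re.sub(r"TTT([ACGT]{3})TTT", r"TAT\1TAT", motif.upper(), count=1)


wild_type = "GCTTTCACTTTCT"          # ISRE motif reported for the chifit promoter
mutant = mutate_ttt_pairs(wild_type)

print(find_isre(wild_type))  # the wild-type motif matches the consensus
print(mutant)                # both thymidine triplets now read TAT
print(find_isre(mutant))     # the mutated motif no longer matches
```

This sketch scans one strand only. Because isre motifs can also occur in the reverse complement order, a fuller scan would search the reverse-complemented sequence as well.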
interestingly, stimulation of mutated promoter with chifn-β ( . .

chicken and human ifit restrict the replication of rna virus in stably expressing chicken fibroblasts. to assess the antiviral potential of chifit protein, we generated a lentivirus construct which bicistronically expresses chifit and a red fluorescent protein, tagrfp, under the control of the emcv (encephalomyocarditis virus) internal ribosome entry sequence (ires) . cef cells were transduced with the appropriate vsv-g pseudotyped lentiviral particles and infected with gfp-expressing ndv. after hours, the virus infectivity (gfp+) was quantified in the lentivirus-transduced (rfp+) cell population using fluorescence-activated cell sorting (facs) (fig. a). lentivirus expressing firefly luciferase (ffluc), which is not expected to affect virus replication, was used as a negative control. human irf (huirf ), a potent virus restriction factor , was used as a positive control (fig. b and supplementary fig. a). compared to the ffluc control, both huirf and chifit significantly inhibited ndv replication (fig. b,c), and the antiviral effect of chifit was greater than that of huirf in cefs. since two populations of gfp+ cells were observed, further gating of low gfp+ and high gfp+ cells indicated a profound and cumulative antiviral effect compared to the ffluc control (supplementary fig. b). these results showed that the chifit -mediated antiviral effect is not profound at the level of individual cells but that it attenuates replication of the virus after initial infection. to compare the antiviral effects of chifit (the only ifit gene identified in chickens so far) with its orthologous and homologous human proteins, huifit , huifit , huifit and huifit were expressed in chicken cells and the antiviral effect was monitored (fig. a). similar to chifit , both huirf and huifit showed significant inhibition of ndv replication, whereas no antiviral activity was observed for the other huifit proteins in chicken cells (fig. d).
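The facs readout described above reports infected (gfp+) cells within the transduced (rfp+) population. A minimal sketch of that calculation follows, assuming simple threshold gates on two fluorescence channels. The gate values, the event list and the function name are hypothetical and are not taken from the study.

```python
def percent_infected(events, rfp_gate, gfp_gate):
    """Percentage of GFP+ events among RFP+ (transduced) events.

    events: iterable of (rfp_intensity, gfp_intensity) tuples, one per cell.
    rfp_gate, gfp_gate: hypothetical positivity thresholds for each channel.
    """
    transduced = [gfp for rfp, gfp in events if rfp >= rfp_gate]
    if not transduced:
        raise ValueError("no events fall inside the RFP+ gate")
    infected = sum(1 for gfp in transduced if gfp >= gfp_gate)
    return 100.0 * infected / len(transduced)


# Hypothetical events: (RFP, GFP) intensities in arbitrary units.
events = [(50, 900), (800, 20), (900, 700), (850, 10), (700, 30), (20, 15)]

pct = percent_infected(events, rfp_gate=500, gfp_gate=500)
print(f"{pct:.1f}% of transduced cells are infected")
```

Restricting the denominator to rfp+ events means that untransduced cells, which do not express the candidate restriction factor, do not dilute the measured antiviral effect.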
to further demonstrate the effect of ifit -induced inhibition of ndv replication, we observed an exclusive replication of either ndv (gfp+) or transduction of lentivirus expressing ifit (rfp+) in % cells compared to ffluc expressing cells (n = ) (fig. e, upper panel). it is possible that transduction of lentivirus particles may interfere with virus entry and/or ndv replication. to examine the lentivirus-mediated restriction of ndv replication, we used ffluc-transduced cells in the control group. the analysis of several microscopic fields showed simultaneous expression of lentivirus-transduced ffluc protein and ndv infection (fig. e, lower panel and fig. b), demonstrating that the transduction of lentivirus particles does not interfere with ndv replication. collectively, these results demonstrate that ifit proteins of both human and chicken are potent cellular restriction factors against rna virus infection in chicken primary cells. to further evaluate the antiviral effect of chifit on virus replication, we applied a loss-of-function approach using crispr/cas genome editing technology. using synthetic grna and a cas expression plasmid (fig. a), chifit was targeted for editing. df- cells were co-transfected with hybridized crrna and tracr-rna as well as a vector expressing hspcas and the puromycin resistance marker (puro r ) gene. for fast and efficient enrichment of genetic modification, a population of stably-integrated cells was selected with puromycin and was further enriched using facs (fig. a). the relative frequency of gene editing in the puromycin-resistant and facs-enriched cells was estimated from a dna mismatch detection assay using t endonuclease i (t e ) (fig. b). the t e assay showed a mutation frequency of % within the chifit gene; however, t e assays are likely to underestimate the fold enrichment . we subsequently sequenced the mutated sites and confirmed the in-frame or out-of-frame gene editing (fig. c).
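A mismatch-cleavage assay of this kind is usually converted into an editing frequency from gel band intensities. The sketch below uses the commonly applied estimate, indel % = 100 × (1 − √(1 − f_cut)), where f_cut is the cleaved fraction of the pcr product. The text does not state which calculation was used for the frequency reported here, so this formula is an assumption, and the band intensities and function name are hypothetical.

```python
from math import sqrt


def indel_percent(uncut, cleaved_1, cleaved_2):
    """Estimate editing frequency from mismatch-cleavage band intensities.

    uncut: intensity of the full-length PCR band.
    cleaved_1, cleaved_2: intensities of the two cleavage products.
    Uses indel % = 100 * (1 - sqrt(1 - f_cut)), where f_cut is the cleaved
    fraction; the square root reflects that a heteroduplex forms whenever
    at least one of the two reannealed strands carries a mutation.
    """
    total = uncut + cleaved_1 + cleaved_2
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    fraction_cleaved = (cleaved_1 + cleaved_2) / total
    return 100.0 * (1.0 - sqrt(1.0 - fraction_cleaved))


# Hypothetical densitometry values (arbitrary units).
estimate = indel_percent(uncut=64.0, cleaved_1=20.0, cleaved_2=16.0)
print(f"estimated indel frequency: {estimate:.1f}%")
```

Because identical mutant strands can reanneal with each other and escape cleavage, the estimate tends to run low at high editing rates, which is consistent with the caution above that the assay underestimates enrichment.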
results of the t e assay and sequencing showed that the sgrna edited the gene efficiently and that puromycin selection greatly improved the enrichment of edited cells. additionally, most of the mutations in the chifit gene appeared to be deletions, which introduce a premature stop codon at the beginning of exon of the chifit gene. the chifit deletion-confirmed df- cells (chifit ko df- ) were isolated, expanded and assessed for virus replication. both wt df- and chifit ko df- cells were infected with ndv and vsv, and replication of the viruses was monitored for hours. both ndv (fig. d) and vsv (fig. e) replicated efficiently in wt df- cells over the time course of infection ( to hours). however, deletion of chifit by crispr/cas further supported the replication of ndv (fig. d) and vsv (fig. e), confirming that chifit is a crucial antiviral effector and that elimination of such factors weakens the antiviral barriers of the host. these results also highlight the possible exploitation of innate immune genes in promoting virus replication for vaccine production. additionally, we applied facs to quantify the percentage infectivity of the wt df- and chifit ko df- cells for gfp-expressing recombinant ndv. the data demonstrated that the infectivity of wt df- cells by ndv ( . %) was further enhanced ( . %) in the absence of chifit (supplementary fig. ). cumulative quantitative measurement showed that ndv infectivity was significantly enhanced in chifit ko df- cells compared to wt df- cells (fig. f). next, in order to further confirm the antiviral potential of chifit , the level of vsv replication was assessed in a conventional plaque assay in df- cells either overexpressing chifit or knocked out for chifit (fig. g). vsv replicated effectively in wt df- cells, and overexpression of chifit suppressed vsv, whereas crispr/cas knockout df- cells substantially supported virus replication (fig. g).
together, these results confirm the potential of chifit as an important host restriction factor, at least against the evaluated rna viruses.

chicken ifit interacts with ′ppp-containing rna. different human and mouse ifit proteins (ifit , ifit , ifit and ifit ) interact with rna carrying multiple genetic modifications at their ′ ends , , - . since ifit was the only identified ifit protein in chickens, we aimed to explore whether chifit interacts with rna through a mechanism similar to that reported for huifit , or whether the chifit -rna interaction is redundant with that of other members of the ifit family. in order to explore the molecular mechanisms of chifit recognition of rna, we used bp rna without any known similarity to viral rna sequences and structure, and generated (i) rna bearing a terminal ′ hydroxyl group ( ′ohrna), and (ii) rna bearing a ′ triphosphate ( ′ppprna) (fig. a). these populations of rna, mimicking viral rna ends, were biotinylated and coupled to agarose beads. the beads were then incubated with lysates of chicken df- cells expressing v -tagged chifit , the ribonucleoproteins were purified (fig. b) and the interaction of chifit was determined by staining for chifit . while neither huifit nor chifit interacted with the ′ohrna, both proteins recognized rna carrying ppp at the ′ end (fig. c).

overexpression of chicken chifit in transgenic chicken embryos restricts the replication of rna virus. we next asked whether the antiviral activities of chifit demonstrated in vitro are translatable in ovo in developing chicken embryos. for this purpose, we constructed rcasbp(a)-chifit recombinant viruses to generate mosaic-transgenic chicken embryos that constitutively express chifit , and monitored its impact on the replication of ndv (fig. a). in addition, a reporter virus, rcasbp(a)-egfp, was generated to monitor the rescue of the virus and to investigate the effect of retroviruses on host responses.
rcasbp(a)-chifn-β, expressing chifn-β which is a known antiviral cytokine , was generated as a positive control (fig. a). expression of this transgene did not compromise the replication of retroviruses, and the induction of innate immune responses was not significant (data not shown), confirming previous reports that rcasbp is a safe in ovo gene transfer system , . stable cell lines expressing retrovirus-mediated chifit and chifn-β were used to monitor virus replication. while mock-infected or rcasbp(a)-wt infected df- cells did not interfere with the replication of ndv (fig. b) and vsv (fig. c), stable expression of chifit and chifn-β established a profound antiviral state against both viruses over the time course of infection. these functionally validated infectious df- cells were used to generate transgenic chicken embryos. three-day-old embryonated chicken eggs were inoculated with wt retroviruses or viruses that were expressing chifit or chifn-β to generate mosaic transgenic chicken embryos (fig. d). additionally, embryos were either mock-inoculated with pbs or inoculated with non-infectious df- cells to exclude the possibility of antiviral effects caused by the retrovirus-infected df- cells. these five groups of embryonated eggs were either mock-challenged with pbs or virus-challenged with ndv, and were incubated for another five days (fig. d). compared to controls, retroviral expression of ifit negatively affected the development of healthy embryos until day post-embryonation before challenge with ndv (fig. e). quantitative analysis of ndv in allantoic fluids showed that the wt retroviruses or non-infectious df- cells alone were unable to interfere with the replication of ndv; however, transgenic embryos stably expressing chifit or chifn-β significantly restricted virus replication compared to the mock-treated control (fig. f), indicating that chifit can restrict virus replication in developing transgenic chicken embryos.
collectively, these results further confirm the antiviral activity of chifit in ovo.

ablation of chicken chifit in transgenic chicken embryos enhances the replication of rna viruses. next, we asked whether silencing of endogenous chifit in developing embryos would facilitate the replication of ndv. for this purpose, we streamlined two individual gene delivery protocols for specific and optimal gene silencing in chicken cells and embryos. first, a total of three short hairpin rnas (shrna) targeting chifit and a scrambled shrna, all with high-confidence prediction scores, were cloned in the prf-prnaic as described before (fig. a). transfection of chicken df- cells with these vectors expressing shrna under a chicken u promoter showed a reliable level of expression of the rfp marker gene, indicating the functional integrity of the system (fig. b). although all shrnas were effective in silencing, quantitative analysis of ndv-induced chifit mrna showed that shrna# was the most effective (> %) (fig. d). next, the validated shrna cassette, which included the chicken u promoter, shrna, leader and trailer sequences, was transferred to the compatible rcasbp(a) vectors for silencing of chifit in ovo (fig. c). additionally, we used shrna silencing-resistant rcasbp(a)-chifit retroviruses (supplementary fig. ) to stably express chifit in developing embryos in the presence of shrna against the endogenous chifit gene. the replication competency of shrna-expressing retroviruses was compared to that of the wt retroviruses (fig. e) before using them to generate transgenic chicken embryos. using the experimental strategy depicted in fig. f , we monitored the replication of ndv in transgenic embryos that were either stably expressing chifit or were expressing chifit in the background of a silenced endogenous ifit gene. the retroviral overexpression of chifit or silencing of chifit was normalized to non-infectious df- cells or wt rcasbp(a) that did not express the transgene.
quantitative analysis of ndv showed an expected reduction in replication of the virus in ifit -overexpressing embryos, whereas silencing of endogenous chifit favoured the replication of ndv compared to the control df- cells (fig. g). a moderate reduction in virus replication was observed in transgenic embryos that were simultaneously silenced for endogenous chifit and were stably expressing silencing-resistant chifit (fig. g). taken together, these results not only confirm the antiviral activities of chifit in different experimental settings but also highlight the potential use of rcasbp(a) in enhancing the replication of avian viruses in embryonated eggs.

wild type and chifit ko -confirmed df- cells were infected with ndv or were left uninfected for hours before processing them for facs. cumulative mfi of gfp positivity in wild type and chifit ko df- cells based on at least independent experiments. (g) df- cells were either mock transfected (wt df- cells) or were transfected with µg of chifit expression plasmid (chifit oe df- cells). these cells and crispr/cas knockout df- cells (chifit ko ) were either left uninfected or were infected with vsv for hours before staining with crystal violet. plaque development was imaged using a hand-held camera.

innate immune responses are key to the success of host resistance to virus infection, and isgs play a front-line role in these defences by acting at multiple stages of the virus replication cycle [ ] [ ] [ ] [ ] . among the myriad of isgs, members of the ifit family have attracted recent attention for both immunological and virological reasons due to their abundant and profound transcription and translation responses to diverse stimuli such as ifns and viruses , . significant advances have been made in studies of the mechanisms of both human and mouse ifit proteins .
however, knowledge of the breadth and plasticity of the antiviral properties of ifit proteins is currently inadequate for non-mammalian hosts. characterizing the repertoire of ifit proteins and investigating their functions in chicken are of special interest because of the unique features of the chicken immune system, including the absence of essential components of innate immune induction and signalling such as rig-i, irf , and irf . intriguingly, while chickens lack these components, they still mount profound innate responses against virus infections. understanding the alternative means of innate immune regulation and antiviral defences in chicken could establish the foundation to control chicken-mediated emergence of zoonotic infections such as influenza viruses . in order to explore the ifit genes in chicken, we began with genetics and functional genomic approaches and revealed that chicken encodes only a single ifit gene compared to genes in mammals (human and mice). based on its sequence and structural similarities, and phylogenetic associations, this single chicken ifit gene is classified as ifit (chifit ). currently the chicken genome is about % annotated and all chromosomes are correctly characterized ; however, the possibility of orthologous genes and pseudogenes in the non-annotated part ( %) of the chicken genome cannot be excluded at this stage. with regard to the complex ifn pathways and a number of major missing genes in chicken, the innate immune genes, as general sensors of self and non-self, are under continuous evolutionary selection pressure compared to the pathogen-specific adaptive immune responses. therefore, it is plausible that ifit proteins are generated by paralogue expansions and/or gene deletions in chicken . ifit is the only gene that is missing in the mouse and rat genomes, whereas it is the only gene identified in chicken, as this study shows.
it remains to be determined whether chifit also performs the antiviral functions of the redundant ifit and ifit . we observed a profound transcription of the chifit gene in response to diverse stimuli that acted upon the ifn induction or ifn signalling pathways, which was further confirmed by the transcriptomics profiling of the mdv-infected chicken fibroblasts. interestingly, structural and functional characterization mapped the chifit promoter required for effective transcriptional regulation of chifit to within bp of the transcriptional start site, a region that carried a single isre motif. this minimum essential requirement of the promoter is consistent with the several-fold induction of the chifit gene as early as hours post ndv infection. although transcription of ifit was generally proportional to virus replication and ifn induction, an ifn-independent regulation of chifit is also possible, especially since chifit gene up-regulation was observed at earlier time points of virus infection when the ifn gene was barely detectable. it is also plausible that transcriptional activation of chifit is directly regulated by irf or related transcription factors, which warrants future investigation. nevertheless, these transcriptional patterns support essential roles of the chifit protein during the early stages of virus replication, such as interaction with viral genetic material (viral rna/dna) and viral protein translation. previous rna-protein analysis revealed that all ifit proteins assemble into multimeric complexes (except ifit ) and that these interactions are crucial for co-functionalities of these proteins . while all other ifit proteins can form multimers, ifit exists as a poorly characterized monomer . probably due to these structural flexibilities, reports on ifit functionalities are inconsistent , - .
however, ifit protein can arguably sense a broad range of rna structures including single stranded rna with mono-(p) and tri-phosphates (ppp) at the ′ ends, double stranded dna and rna with cap modifications , , . in order to understand the binding potential of chifit to modified rna that interacts specifically either with human ifit or with human ifit , we coupled modified rna-coated beads with quantitative binding assays for chifit . our results indicated that chifit specifically interacted with rna that carried ′-ppp modifications and failed to interact with rna in which ′-ppp was replaced with oh. these rna-protein interaction studies highlighted the principal role of chifit in directly recognizing foreign ppp-rna and subsequently exerting downstream antiviral activities. since ppp-rna is found within the genome of most viruses carrying negative sense single stranded rna genomes, such as influenza, ndv and vsv , and ppp-rna is produced as an intermediate product during the replication of viruses with positive sense rna genomes such as coronaviruses , it is plausible that chifit senses foreign rna (bearing ppp) while ignoring self rna (bearing a cap in the case of mrna and a monophosphate in the case of rrna and trna). compared to the four essential ssrna cellular sensors in mammals, including rig-i, tlr , tlr and ifit , , rig-i is missing and tlr is disrupted by the insertion of retroviruses in the tlr locus in chickens , , . because of these fundamental differences in innate immune genes between avian species (e.g. chicken) and mammals, future studies are needed to investigate whether the interaction of ifit with ′-ppp-ssrna can induce downstream ifn signalling and, if so, whether this interaction compensates for the deficiency of rig-i and tlr in chickens. this specific interaction with ppp-rna could lead to attenuation of virus replication by sequestering viral rna required for transcription and translation.
to investigate this possibility, we applied both gain-of-function and loss-of-function methodologies and evaluated the antiviral potential of ifit against negative sense single stranded rna viruses, including a poultry-specific virus (ndv) and a model rna virus (vsv). lentivirus-delivered stable expression of chifit or huifit compromised the replication of viruses, whereas crispr/cas mediated knockout of the chifit gene supported virus replication in chicken fibroblasts. intriguingly, overexpression of human ifit , ifit and ifit failed to establish an antiviral state in chicken fibroblasts, suggesting that chickens have opted exclusively for the antiviral activities of ifit . our attempts to monitor virus replication in mosaic transgenic chickens overexpressing chifit , or silenced for endogenous chifit , yielded strong evidence that this protein possesses antiviral activities in developing embryos. thus rcas-mediated gene delivery and silencing approaches can be exploited to study gene functionalities in ovo at the early embryonic developmental stage and may establish the basis for evaluation of genetic resistance against pathogens. taken together, we characterized the ifit locus in chicken and systematically analysed the functional rationale for antiviral activities of chifit against rna viruses using both functional genomics and molecular biological approaches. the foundations built in this study warrant future investigations to assess the potential of chifit in sensing the nucleic acid of many diverse viruses and bacteria (which also generate ppprna), and the impact of these interactions on host innate immunity.

data mining and sequence analysis. chicken genome (ensembl) and expressed sequence tags (est) databases were screened for the homologues of ifit family gene members using the basic local alignment search tool (blast) algorithm. a single transcript showing high sequence similarity to human ifit was identified in the putative ifit locus.
using sequences from public databases and transcriptomics data from marek's disease virus (mdv)-infected chicken embryo fibroblasts (cef), an open reading frame (orf) was revealed and extracted. the chicken ifit (chifit ) coding region was amplified from ndv-infected primary cefs, whereas the sequence covering the gap in the ifit locus was amplified from genomic dna using primers mentioned in supplementary table . the genomic nucleotide sequence of the chifit promoter region was amplified using primers provided in supplementary table . the orf and homology searches for chifit were carried out in the orf finder programme (http://www.ncbi.nlm.nih.gov/projects/gorf) and blast tool (http://www.ncbi.nlm.nih.gov/blast/) integrated in the ncbi database. possible gene transcription start sites were identified using the promoter predictor programme (http://www.fruitfly.org/seq_tools/promoter.html) whereas potential transcription factor binding sites were identified using the matinspector server (http://www.genomatix.de). gene synteny and tetratricopeptide repeats (tpr) were predicted using the ensembl and conserved domain databases, respectively. the ifit sequences from aves and non-aves were acquired from ncbi and aligned using the clustalw programme. the phylogenetic analysis was performed using the neighbour-joining method with a bootstrap value of n = , in the mega programme version .

cell culture, media and antibodies. cefs were prepared from -day-old embryonated eggs at the pirbright institute as described previously . immortalized chicken fibroblasts (df- ), human embryonic kidney cells t (hek- t) and madin-darby canine kidney (mdck) cells were maintained in dulbecco's modified eagle medium (dmem) supplemented with % foetal bovine serum (fbs), % penicillin and streptomycin (p/s) at °c in a % co incubator. amv- c -s (gag) antibodies were purchased from hybridoma bank of iowa, university of iowa.
α-v and j antibodies for the detection of v tag and dsrna were from genetex and scicons, respectively. alexa-fluor secondary antibodies were purchased from invitrogen, carlsbad, ca, usa and irdye cw α-mouse secondary antibodies were acquired from li-cor, nebraska, usa. poly i:c (a synthetic analogue of dsrna), dimethyl sulfoxide (dmso) and lipopolysaccharide (lps) were purchased from invivogen and sigma whereas chicken ifn-β was produced in hek- t cells . (ndv-gfp) was generated using a reverse genetics system as described before and rescued virus particles were propagated in embryonated chicken eggs . the ndv-gfp strain was quantified using immunostaining and expressed as focus-forming units (ffu). vesicular stomatitis virus (vsv) expressing gfp (vsv-gfp) was kindly provided by dennis rubbenstroth (institute for virology, medical centre -university of freiburg, germany). vsv-gfp was propagated and quantified in df- cells and was represented in ffu or images showing plaques. allantoic fluids and infectious viruses from cell culture supernatants were titrated by plaque assays on mdck cells. briefly, mdck cells were inoculated with -fold serially diluted samples and overlaid with . % agarose (oxoid, hampshire, uk) in overlay dmem ( × mem, . % bsa v, mm l-glutamate, . % sodium bicarbonate, mm hepes, × penicillin/streptomycin (gibco, carlsbad, ca, usa) and . % dextran deae), with µg ml − tpck trypsin (sigma-aldrich, dorset, uk). plates were incubated at °c for h and plaques were developed using crystal violet stain containing methanol.

plasmid construction and mutagenesis. the full-length chicken and human ifit were pcr-amplified using primers that were tailed with a ′ bamhi site, a ′ ecori/spei site and a consensus kozak translation sequence (ccaccatg) (supplementary table ). the bamhi and ecori/spei digested amplicons were sub-cloned in the mammalian expression vector, pefplink-v (kindly provided by steve goodbourn, st. 
george's university of london), which contains an n-terminal v tag. for identification of cis-acting elements in the chifit promoter, a . kb genomic sequence was amplified and cloned into the kpni and xhoi sites in the promoter-less vector, pgl . basic (promega) and named as pchifit - /+ . subsequently, five truncated versions of the promoter were amplified from full-length pchifit - /+ using primers listed in supplementary table and cloned between the kpni and xhoi sites in the pgl . basic vector. for production of the chifn-β, the orf for chifn-β (accession number, nm_ ) was cloned in pcdna . + and final constructs were labelled as pcdna . -chifn-β . the reporter plasmid pgl -p-chmx-luc was kindly provided by nicolas ruggli, switzerland and renilla luciferase plasmid (phrl-sv ) was purchased from promega, madison, wi, usa. mitochondrial (dsred -mito- , # ) and er (mcherry-er- , # ) markers were obtained from addgene. the triple thymidine duplex (tttnnnttt) in the pchifit - /+ -wt construct was mutated into tatnnntat using the quikchange lightning site-directed mutagenesis kit (agilent technologies) and was named as pchifit - /+ -mut. all mutagenesis oligonucleotides were designed in the quikchange primer design tool and these primers are provided in supplementary table . all clones were sequenced from both ends for correct frame and orientation or were digested with unique cleavage sites to confirm the gene inserts.

confocal microscopy. chicken cells were transfected with individual or combined plasmids for indicated time points using lipofectamine (invitrogen) at a ratio of : or were infected with lentiviruses, retroviruses or ndv-gfp for indicated time points. these cells were then fixed for h in % paraformaldehyde and permeabilized using . % triton-x before incubation with primary antibodies raised against v tag, dsrna (j ), or gag protein of retroviruses. additionally, depending upon the experimental needs, different fluorescent markers (rfp, gfp) were used. 
afterwards, cells were incubated with corresponding secondary antibodies at °c for h. after brief staining with ′, -diamidino- -phenylindole (dap ) (nuclear), slides were visualized using a leica sp confocal laser scanning microscope.

western blotting. all the transfections for subsequent western blot analysis were performed following the same protocol as described for immunofluorescence unless otherwise indicated. cells were lysed in a hypotonic buffer and protease inhibitor cocktail (sigma). proteins were separated by sds-page under reducing conditions and analysed by western blotting using anti-v (genetex) and irdye-labelled secondary antibodies (li-cor biosciences). the signals were acquired and quantified using the odyssey infrared imaging system (li-cor biosciences).

ifn bioassay. ifn-induced protection against vsv-gfp was used to identify ifn-producing stable clones and to quantify ifn preparations, as described before . briefly, df- cells were seeded in -well plates until they were % confluent and treated with serial dilutions of supernatants containing interferons for hours. these interferon stimulated cells were inoculated with vsv-gfp (moi of ). at hours post-infection, vsv-gfp replication was correlated with the change in gfp fluorescence signal intensities using a luminometer (promega, madison, wi, usa). the percentage antiviral activity of ifns was determined by comparing the percentage reduction of corrected gfp signal intensity (gfp signal intensity of ifn treated and virus infected wells minus background fluorescence signal intensity of uninfected control) with the mock treated and vsv-gfp-infected control wells. one unit (u) of ifn in the tested ifn preparations was defined as the volume containing % inhibitory activity against vsv-gfp. a total of u of ifns were used for stimulation of cef or df- cells. 
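the background correction and percentage reduction described for the ifn bioassay can be sketched as follows. this is a minimal illustration, not the authors' analysis script: the function name and all readings are hypothetical, and the inhibition level that defines one unit is left out because it is not recoverable from the text.

```python
def percent_antiviral_activity(treated_raw, mock_raw, background):
    """Percentage reduction of corrected GFP signal relative to the mock control.

    treated_raw: GFP signal of an IFN-treated, VSV-GFP-infected well
    mock_raw:    GFP signal of the mock-treated, VSV-GFP-infected control well
    background:  fluorescence of the uninfected control well
    """
    treated = treated_raw - background  # corrected signal, treated well
    mock = mock_raw - background        # corrected signal, mock control
    if mock <= 0:
        raise ValueError("mock control must exceed the background signal")
    return 100.0 * (1.0 - treated / mock)


# Hypothetical readings for a dilution series of one IFN preparation.
background = 100.0
mock_raw = 1100.0
dilution_series = {"1:10": 150.0, "1:100": 350.0, "1:1000": 900.0}

for dilution, raw in dilution_series.items():
    activity = percent_antiviral_activity(raw, mock_raw, background)
    print(f"{dilution}: {activity:.1f}% inhibition")
```

subtracting the uninfected background from both wells before taking the ratio keeps autofluorescence from inflating or masking the apparent inhibition.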
tronic expression and gateway-compatible destination vector (ptrip.cmv.ivsb.gene.ires.tagrfp) for lentiviruses was kindly provided by charles rice, the rockefeller university, usa. to generate the chifit entry clone, the gene encoding chifit was pcr amplified with oligonucleotides (supplementary table ) containing attb sites flanking gene-specific sequences. pcr products were purified over qiagen columns (qiagen) and cloned into pdonr (invitrogen) with bp clonase. bp clonase reactions were transformed into escherichia coli (invitrogen), and colonies were screened by restriction digestion and sequencing. the gene sequences from pentr clones were moved into ptrip.cmv.ivsb.gene.ires.tagrfp using lr clonase ii (invitrogen) according to the manufacturer's instructions. after transformation of the lr reaction products, one or two colonies for each construct were grown in ml luria-bertani (lb) broth with ampicillin, and transfection-quality plasmid dna was purified over anion-exchange columns (qiagen). lentivirus constructs expressing human ifit , ifit , ifit , ifit , irf (positive control) and ffluc (negative control) were kindly provided by charles rice, the rockefeller university, usa. all constructs were sequenced using primers provided in supplementary table to confirm the gene insertion before rescuing lentiviruses. poly-lysine pre-coated plates with seeding density of × per well of -wells plates and were co-transfected with gene expressing proviral dna (huifit , huifit , huifit , huifit , chifit , huirf , or ffluc), hiv-i gag-pol and vsv-g in a ratio of : . : . using lipofectamine (invitrogen). supernatants collected at h and h post-transfection were cleared by centrifugation ( rpm for min) and were pooled and supplemented with mm hepes and μg/ul polybrene (sigma). for titration of lentiviral pseudoparticles, cef cells ( × ) were transduced with serial dilutions of individually rescued pseudoparticles for hours. 
trypsinized cells were fixed with % paraformaldehyde and processed for facs for quantification of percentage rfp+ cells. the volume of the lentiviral pseudoparticles that infected % of cef cells was used to transduce cefs. titrated lentiviral pseudoparticles were stored at − °c before use and the same stock of the virus was used for all experimentation. duced with moi of lentivirus-expressing specific gene in dmem media containing % fbs, mm hepes and μg/ml polybrene. transduction was facilitated by centrifugation ( g for h at °c) and cells were incubated at °c. a day later, cells were infected with gfp-tagged virus (ndv-gfp) at . moi and the infection was stopped after hr and replaced with fresh dmem containing % fbs. after hours of infections, cells were trypsinised and the cell suspensions were incubated with live dead marker (near ir cat no: l ) according to the manufacturer's protocol. the cells were then fixed with % paraformaldehyde (pfa) for min before analysing the cells by flow cytometry. as described before , live and singlet cells were gated based on forward and side scatter, and four-quadrant plots were generated using the untransduced and uninfected (rfp negative and gfp negative), uninfected (rfp positive and gfp negative), and untransduced (rfp negative and gfp positive) cells. analysis was carried out using flowjo software applying the same gating and analysis strategies for all samples. construction of ifit knockout cell line using crispr/cas . a synthetic gene-targeting approach was applied to specifically knockout the chicken ifit from the chicken embryo fibroblasts. for this purpose, two components (crrna and tracrrna) of single guide rna (sgrna) which are crucial for targeting specificity and scaffolding/binding ability for crispr associated protein (cas ) nuclease were synthesized by dharmacon. 
targeting the beginning of the second exon of chicken ifit , two individual crispr rna (crrna) were designed with the highest-predicted score and lowest off-target effects; sgrna acaggagaagtctcgttacc and sgrna gcttggatctactaccacat. using dharmafect duo transfection reagents, df- cells were co-transfected with individual crrna and a common trans-activating crrna (tracrrna) as well as a plasmid expressing cas nuclease and puromycin resistance gene (puro r ) separated by a self-cleavage t a sequence. cells were split hours after transfection and selected with puromycin ( μg/ml) for one week or until complete eradication of non-transfected control cells. for fast and efficient enrichment of genetic modification, a population of cells with stable integration was enriched using facs. for this purpose, puromycin-selected cells were transfected with gfp-expression plasmid and individual cells were sorted by facs before being seeded in -well plates. at least clones were expanded and the relative frequency of gene editing in the puromycin-resistant and facs-enriched cells was estimated from a dna mismatch detection assay using t endonuclease i (t e ) (neb). the dna fragments flanking the target editing sites were amplified from genomic dna extracted by dneasy kits (qiagen) using primers listed in supplementary table . a total of ng of the pcr products were denatured at °c and allowed to anneal gradually at room temperature to form heteroduplex dna. the re-hybridized dna was digested with t ei and resolved in a . % agarose gel to determine the gene editing efficiency. additionally, the pcr products were sequenced using pcr-amplification primers and aligned with the corresponding wild-type genomic sequence to identify mutations, deletions and insertions. t ei and sequence verified clones were used to monitor virus replication. 
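the text does not state how editing efficiency was derived from the digested gel. one commonly used estimate for mismatch-cleavage assays converts the cleaved fraction of band intensity into an indel percentage; the sketch below assumes that estimate and uses hypothetical band intensities, so it should be read as an illustration rather than the authors' calculation.

```python
import math


def indel_percent_from_bands(uncut, cut_a, cut_b):
    """Estimate indel frequency (%) from densitometry of a mismatch-cleavage gel.

    uncut:        intensity of the undigested parental band
    cut_a, cut_b: intensities of the two cleavage products
    Assumed estimate: indel % = 100 * (1 - sqrt(1 - fraction_cleaved)).
    """
    total = uncut + cut_a + cut_b
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    fraction_cleaved = (cut_a + cut_b) / total
    return 100.0 * (1.0 - math.sqrt(1.0 - fraction_cleaved))


# Hypothetical band intensities for one clone (arbitrary densitometry units).
print(f"{indel_percent_from_bands(64.0, 18.0, 18.0):.1f}% estimated indels")
```

the square root accounts for random re-annealing: a heteroduplex forms whenever at least one of the two strands carries an edit, so the cleaved fraction overstates the fraction of edited alleles.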
rna was extracted from ifn-β ( u), lps ( μg/ml), dsrna ( μg/ml), or ndv-stimulated (moi of ) df or cefs using trizol reagents (invitrogen, carlsbad, ca, usa). additionally, organs were collected from specific pathogen free (spf) chickens, which were infected or mock-treated (intranasally) for days with pfu of a/ chicken/pakistan/udl- / (h n ). a total of ng of rna was used in pcr reactions using superscript ® iii platinum ® sybr ® green one-step qrt-pcr kit (invitrogen, carlsbad, ca, usa). the abundance of specific mrna was compared to the s rrna (supplementary table ) in the applied biosystems prism system. the reaction was carried out in abi light cycler using the following thermo profile; °c for minutes hold, °c for minutes hold, followed by cycles of °c for seconds and °c for seconds. melting curve was determined at °c for seconds, °c for minute, °c for seconds and °c for seconds. primers for isgs including chifit are provided in supplementary table . primers specific for a conserved region of the influenza a and ndv matrix genes were used as described previously , . short hairpin design and expression systems. to silence endogenous chifit gene in developing embryos, a total of three -nucleotides-long rna interference (rnai) short hairpin rnas (shrna) were designed using the genscript rnai target finder (https://www.genscript.com/ssl-bin/app/rnai). double stranded dna products for each of three chifit specific and a control scrambled target were generated by pcr using random and gene-specific oligonucleotides together with hp-l and hp-r (supplementary table ) as described before . the amplified pcr products were cloned between nhei and mlui into microrna (mirna) cloning sites of prfprnaic (kind gift of stuart wilson, the university of sheffield, uk). all shrna coding plasmids were sequenced to confirm the inserts and orientations. 
to evaluate the silencing potential of individual shrna, df- cells were transfected with ng plasmid using lipofectamine according to the manufacturer's protocol and the knockdown effects on the chicken ifit was monitored and compared with the scrambled rna transfected control. next, the validated shrna cassette that included chicken u promoter, individual shrna, leader and trailer sequences was transferred to the rcasbp(a) retrovirus vector between noti and clai sites. rescue of rcasbp(a)-sh ifit retroviruses and generation of mosaic transgenic chickens are detailed below. to determine responsiveness of chicken ifit promoter to chicken ifn-β, dsrna, and virus-stimulation, chicken fibroblasts were grown in -well plate format at × to × cells in addition pgl -p-chmx-luc was used as a positive control whereas pgl . basic vector was used as a negative control. correspondingly, df- cells were co-transfected with phrl-sv and pchifit - /+ -wt or pchifit - /+ -mut constructs. all transfections were performed using lipofectamine sk-as) was linearized with unique bamhi restriction digestion and the purified dna was used for in vitro transcription in the presence of bioin- -utp using ribomax ™ large scale rna production system-sp (promega, cat# p ) as recommended by the manufacturer and reported previously with a few modifications. briefly, a reaction of μl was established containing μl × sp buffer, μl ntp-bioutp mixtures, μg linearized plasmid, and μl enzyme mix. the reaction mixture was first incubated at °c for hours followed by digestion of the dna remnant with u rnase-free dnase (thermo scientific) for another minutes at °c. the biotinylated uracil triphosphate (bioin- -utp) was incorporated during in vitro transcription for purification of ribonucleoproteins and due to nascent nature of polymerase a ′-ppprna was over-hanged as a signature for ifit protein interaction. 
after the completion of in vitro transcription, the quality of in vitro transcription rna was assessed by agarose gel electrophoresis and rna was purified with rneasy minelute cleanup kit (qiagen) according to the manufacturer's recommendations. a total of μg of purified in vitro transcribed and biotinylated ppp-rna was either mock treated or rna samples were purified with rneasy minelute cleanup kit and eluted with μl nuclease-free water for use in the rna-protein interactions to prepare chifit protein, chicken df- cells ( × ) were transfected with μg v -tagged chifit plasmid for hours and lysed with tap buffer in the presence of protease and rnase inhibitors. the rna-coated beads were incubated with mg chifit protein lysate on a rotary wheel for hours at °c, and washed three times to remove unbound proteins the gfp/gag expression-confirmed cell cultures were split into cm flasks and were passaged again into cm flasks after days. finally cells were expanded into cm flasks until the desired number ( cells/egg) was achieved. mosaic-transgenic chicken embryos were generated by inoculation of one million rcasbp(a)-chifn-β, rcasbp(a)-chifit , rcasbp(a)-shrna and rcasbp(a)-wt infected df- cells at day post-embryonation or were left un-infected. at day post-embryonation ( days post-infection), each egg was either left unchallenged or was challenged with pfu ndv-gfp. embryo mortality was monitored daily and allantoic fluids were harvested at days of embryonation and were subjected to the plaque assay for virus quantification. statistical analysis. 
pairwise comparisons of treated and control groups were performed using student's t-test.

construction and rescue of ifit expressing rcas system, and generation of transgenic embryos. the orf of chicken ifit and coding regions (signal and mature peptide sequence) of the chifn-β were amplified from rna extracted from the ndv-infected primary cefs. the amplified products were sub-cloned into an improved version of rcasbp(a)-Δf (kindly provided by stephen h. hughes, national cancer institute, md, usa) via the clai and muli restriction sites which replace the src gene while maintaining the splice acceptor signals. the resultant constructs were named as rcasbp(a)-chifit and rcasbp(a)-chifn-β. similarly, a gfp encoding rcasbp(a), referred to as rcasbp(a)-egfp, was generated by introducing the coding sequence of the gfp in between the clai and muli sites. additionally, a codon optimized and shrna-silencing resistant chifit gene was chemically synthesized (geneart, invitrogen) and cloned in the corresponding sites and labelled as rcasbp(a)-shrna . the inserted gene orientation and sequence validity were confirmed by dna sequencing. to rescue recombinant rcasbp(a) viruses, a total of . 
× df- cells were seeded in cm flasks and maintained at °c, % (vol/vol) co for hours (~ % confluent). cells were washed with pbs and transfected

ethics statement. all animal studies and procedures were carried out in strict accordance with the guidance and regulations of european and united kingdom home office regulations under animal risk assessment numbers ar and ar . as part of this process the work has undergone scrutiny and approval by the ethics committee at the pirbright institute. supplementary information accompanies this paper at https://doi.org/ . /s - - -y.

key: cord- -sygnmiun authors: lam, sd; bordin, n; waman, vp; scholes, hm; ashford, p; sen, n; van dorp, l; rauer, c; dawson, nl; pang, csm; abbasian, m; sillitoe, i; edwards, sjl; fraternali, f; lees, jg; santini, jm; orengo, ca title: sars-cov- spike protein predicted to form complexes with host receptor protein orthologues from a broad range of mammals date: - - journal: biorxiv doi: . / . . . 
sha: doc_id: cord_uid: sygnmiun sars-cov- has a zoonotic origin and was transmitted to humans via an undetermined intermediate host, leading to infections in humans and other mammals. to enter host cells, the viral spike protein (s-protein) binds to its receptor, ace , and is then processed by tmprss . whilst receptor binding contributes to the viral host range, s-protein:ace complexes from other animals have not been investigated widely. to predict infection risks, we modelled s-protein:ace complexes from vertebrate species, calculated changes in the energy of the complex caused by mutations in each species, relative to human ace , and correlated these changes with covid- infection data. we also analysed structural interactions to better understand the key residues contributing to affinity. we predict that mutations are more detrimental in ace than tmprss . finally, we demonstrate phylogenetically that human sars-cov- strains have been isolated in animals. our results suggest that sars-cov- can infect a broad range of mammals, but few fish, birds or reptiles. susceptible animals could serve as reservoirs of the virus, necessitating careful ongoing animal management and surveillance. severe acute respiratory syndrome coronavirus (sars-cov- ) is a novel coronavirus that emerged towards the end of and is responsible for the coronavirus disease (covid- ) global pandemic. available data suggests that sars-cov- has a zoonotic source [ ] , with the closest sequence currently available deriving from the horseshoe bat [ ] . as yet, the transmission route to humans, including the intermediate host, is unknown. so far, little work has been done to assess the animal reservoirs of sars-cov- , or the potential for the virus to spread to other species living with, or in close proximity to, humans in domestic, rural, agricultural or zoological settings. 
coronaviruses, including sars-cov- , are major multi-host pathogens and can infect a wide range of non-human animals [ ] [ ] [ ] . sars-cov- is a member of the betacoronavirus genus, which includes viruses that infect economically important livestock, including cows [ ] and pigs [ ] , together with mice [ ] , rats [ ] , rabbits [ ] , and wildlife, such as antelope and giraffe [ ] . severe acute respiratory syndrome coronavirus (sars-cov), the betacoronavirus that caused the - sars outbreak [ ] , likely jumped to humans from its original bat host via civets. viruses genetically similar to human sars-cov have been isolated from animals as diverse as raccoon dogs, ferret-badgers [ ] and pigs [ ] , suggesting the existence of a large host reservoir. it is therefore probable that sars-cov- can also infect a wide range of species. real-world sars-cov- infections have been reported in cats [ ] , lions and tigers [ ] , dogs [ , ] and minks [ , ] . animal infection studies have also identified cats [ ] and dogs [ ] as hosts, as well as ferrets [ ] , macaques [ ] and marmosets [ ] . recent in vitro studies have also suggested an even broader set of animals may be infected [ ] [ ] [ ] . to understand the potential host range of sars-cov- , the plausible extent of zoonotic and anthroponotic transmission, and to guide surveillance efforts, it is vital to know which species are susceptible to sars-cov- infection. the receptor binding domain (rbd) of the sars-cov- spike protein (s-protein) binds to the extracellular peptidase domain of angiotensin i converting enzyme (ace ) mediating cell entry [ ] . the sequence of ace is highly conserved across vertebrates, suggesting that sars-cov- could use orthologues of ace for cell entry. the structure of the sars-cov- s-protein rbd has been solved in complex with human ace [ ] . identification of critical binding residues in this structure has provided valuable insights into viral recognition of the host receptor [ ] [ ] [ ] [ ] [ ] [ ] . 
deep mutagenesis studies have also revealed residues important for stability [ , ] . compared with sars-cov, the sars-cov- s-protein has a - -fold higher affinity for human ace [ , , ] , due to more contacts in the interface that cover a larger surface area [ ] , and three mutational hotspots in the s-protein that lead to a more specific and compact conformation [ , ] . similarly, variations in human ace have also been found to increase affinity for s-protein receptor binding [ ] . these factors may contribute to the host range and infectivity of sars-cov- . both sars-cov- and sars-cov additionally require the transmembrane serine protease (tmprss ) to mediate cell entry. together, ace and tmprss confer specificity of host cell types that the virus can enter [ , ] . upon binding to ace , the s-protein is cleaved by tmprss at two cleavage sites on separate loops, which primes the s-protein for cell entry [ ] . tmprss has been docked against the sars-cov- s-protein, which revealed its binding site to be adjacent to these two cleavage sites [ ] . an approved tmprss protease inhibitor drug is able to block sars-cov- cell entry [ ] , which demonstrates the key role of tmprss alongside ace [ ] . as such, both ace and tmprss represent attractive therapeutic targets against sars-cov- [ ] . recent work has predicted possible hosts for sars-cov- using the structural interplay between the s-protein and ace . these studies proposed a broad range of hosts, covering hundreds of mammalian species, including tens of bat [ ] and primate [ ] species, and more comprehensive studies analysing all classes of vertebrates [ ] [ ] [ ] , including agricultural species of cow, sheep, goat, bison and water buffalo. in addition, sites in ace have been identified as under positive selection in bats, particularly in regions involved in binding the s-protein [ , ] . 
the impacts of mutations in ace orthologues have also been tested, for example structural modelling of ace from primate species [ ] demonstrated that apes and african and asian monkeys may also be susceptible to sars-cov- . however, whilst cell entry is necessary for viral infection, it may not be sufficient alone to cause disease. for example, variations in other proteins may prevent downstream events that are required for viral replication in a new host. hence, examples of real-world infections [ ] [ ] [ ] [ ] and experimental data from animal infection studies [ ] [ ] [ ] [ ] [ ] are required to validate hosts that are predicted to be susceptible. here, we analysed the effect of known mutations in orthologues of ace and tmprss from a broad range of vertebrate species, including primates, rodents and other placental mammals; birds; reptiles; and fish. for each species, we generated a -dimensional model of the ace protein structure from its protein sequence and calculated the impacts of known mutations in ace on the stability of the s-protein:ace complex. we correlated changes in the energy of the complex with changes in the structure of ace , chemical properties of residues in the binding interface, and experimental covid- infection phenotypes from in vivo and in vitro animal studies. to further test our predictions and rationalise the key sites contributing to energy changes of the complex, we performed detailed manual structural analyses, presented as a variety of case studies for different species. unlike other studies that analyse interactions that the s-protein makes with the host, we also analyse the impact of mutations in vertebrate orthologues of tmprss . our results suggest that sars-cov- could infect a broad range of vertebrates, which could serve as reservoirs of the virus, supporting future anthroponotic and zoonotic transmission. we aligned protein sequences of vertebrate orthologues of ace . 
most orthologues have more than % sequence identity with human ace (supplementary fig. a) . for each orthologue, we generated a -dimensional model of the protein structure from its protein sequence using funmod [ , ] . we were able to build high-quality models for vertebrate orthologues, with ndope scores < - (supplementary table ). low-quality models were removed from the analysis. ace residues directly contacting the s-protein (dc residues) were identified in a structure of the complex (pdb id m j; fig. a , supplementary results , supplementary fig. ) . we also identified a more extended set of both dc residues and residues within Å of dc residues likely to be influencing binding (dcex residues). after analysing the orthologue interfaces, we removed models that were missing > dcex residues. models were removed from the analysis, leaving models to take forward for further analysis. we observed high sequence (> % identity) and structure similarity (score > out of ) between ace proteins for all species (supplementary results ). we used multiple methods to assess the relative change in binding energy (ΔΔg) of the sars-cov- s-protein:ace complex following mutations in dc residues and dcex residues that are likely to influence binding. we found that protocol employing mcsm-ppi (henceforth referred to as p( )-ppi ), calculated over the dcex residues, correlated best with the phenotype data (supplementary results , supplementary fig. , table ), justifying the use of animal models to calculate ΔΔg values in this context. since this protocol considers mutations from animal to human, lower ΔΔg values correspond to stabilisation of the animal complex relative to the human complex, and therefore higher risk of infection. we show the residues that p( )-ppi reports as stabilising or destabilising for the sars-cov- s-protein:ace animal complex for dc ( supplementary fig. ) and dcex (supplementary fig. ) residues. 
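the two filtering steps described above (a model-quality cutoff on the ndope score, then a limit on missing dcex residues) can be sketched as below. the cutoff values, species entries and the `OrthologueModel` type are hypothetical placeholders, because the actual numbers are not recoverable from the text; only the direction of each comparison follows the description.

```python
from dataclasses import dataclass


@dataclass
class OrthologueModel:
    species: str
    ndope: float        # normalised DOPE score; lower means a better model
    missing_dcex: int   # number of DCEX interface residues absent from the model


def keep_model(model, ndope_cutoff, max_missing_dcex):
    """Keep a model only if it passes both the quality and interface filters."""
    good_quality = model.ndope < ndope_cutoff
    complete_interface = model.missing_dcex <= max_missing_dcex
    return good_quality and complete_interface


# Hypothetical models and cutoffs, for illustration only.
models = [
    OrthologueModel("species_a", ndope=-1.2, missing_dcex=0),
    OrthologueModel("species_b", ndope=-0.4, missing_dcex=0),   # fails quality
    OrthologueModel("species_c", ndope=-1.1, missing_dcex=12),  # fails interface
]
NDOPE_CUTOFF = -1.0       # assumed placeholder
MAX_MISSING_DCEX = 10     # assumed placeholder

retained = [m.species for m in models
            if keep_model(m, NDOPE_CUTOFF, MAX_MISSING_DCEX)]
print(retained)
```

applying the quality filter before the interface filter mirrors the order in the text, although the combined predicate gives the same retained set either way.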
to consider ΔΔg values in an evolutionary context, we annotated phylogenetic trees for all vertebrate species analysed (supplementary fig. ) and for a subset of animals that humans come into close contact with in domestic, agricultural or zoological settings (fig. ). in general, we see a high infection risk for most mammals, with a notable exception for all nonplacental mammals. ΔΔg values measured by p( )-ppi correlate well with the infection phenotypes (table ). ΔΔg values are significantly lower for animals that can be infected by sars-cov- than for animals for which there is no evidence of infection (fig. ; mann-whitney one-sided p = . x - ). two animals are outliers in the infected boxplot, corresponding to horseshoe bat (ΔΔg = . ) and marmoset (ΔΔg = . ). to be cautious, since in vivo experiments have shown that marmosets can be infected, and in vitro experiments have shown that horseshoe bats can be infected [ ] [ ] [ ] [ ] (table ), we consider animals that have ΔΔg values less than or equal to the value for horseshoe bat (ΔΔg = . ) to be at risk. additionally, there is a clear sampling bias in the set of animals that have so far been experimentally characterised: all but chicken and duck are mammals. as more nonmammals are tested, the median ΔΔg value for non-infection is likely to increase. in further support of these predictions, we analysed the animals having experimental evidence using an orthogonal method, haddock [ ], and found ~ % agreement between the two independent approaches for animals predicted to be at risk (see supplementary results and supplementary figure ). (fig. ). as shown in previous studies, and supported by experimental data, many primates are predicted to be at high risk [ , , ]. in agricultural settings, camels, cows, sheep, goats and horses also have relatively low ΔΔg values, suggesting comparable binding affinities to humans, in agreement with experimental data [ , ].
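the one-sided mann-whitney comparison of ΔΔg values between infected and non-infected animals, reported above, can be illustrated with a small exact test. this sketch uses only the python standard library and enumerates every relabelling of the pooled values, which is practical only for small samples; the ΔΔg values are hypothetical and are not taken from the study.

```python
from itertools import combinations


def u_statistic(x, y):
    """Number of (x_i, y_j) pairs with x_i < y_j, counting ties as 0.5.
    A large value means that x tends to be lower than y."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_one_sided_p(x, y):
    """Exact p-value for the alternative 'x tends to be lower than y',
    obtained by enumerating every split of the pooled values."""
    pooled = list(x) + list(y)
    indices = range(len(pooled))
    observed = u_statistic(x, y)
    total = 0
    extreme = 0
    for chosen in combinations(indices, len(x)):
        chosen_set = set(chosen)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if u_statistic(group_x, group_y) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical ddG values, for illustration only.
infected = [-1.0, -0.5, 0.0, 0.2]
not_infected = [1.5, 2.0, 2.5]
u = u_statistic(infected, not_infected)
p = exact_one_sided_p(infected, not_infected)
print(u, p)
```

for larger samples a library routine with a normal approximation would be the more practical choice.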
in domestic settings, dogs [ ], cats [ ], hamsters [ ], and rabbits [ ] [ ] [ ] also have ΔΔg values suggesting risk, again in agreement with experimental data (table ). zoological animals that come into contact with humans, such as pandas, leopards and bears, are also at risk of infection, as shown experimentally [ ]. in predicting susceptibility, we have chosen thresholds supported by in vivo or in vitro experimental data. previous work contrasted the binding energy of the s-protein of sars-cov and sars-cov- with human ace protein [ , , ]. sars-cov is able to infect humans despite a ~ -fold lower binding affinity [ , , ], suggesting that even where mutations in different animal species make the interfaces less compatible for sars-cov , a considerably decreased binding energy may still be sufficient to enable infection. by applying this threshold, we correctly predict all animals in our dataset that have experimental evidence of infection to be at risk (table ). however, for a few animals that we predict to be at risk using this threshold, in vitro experimental studies to date have not shown infection. for example, donkeys are predicted to be at risk of infection (ΔΔg = . ), but no infections were observed in vitro for these animals [ ]. however, infection has been observed in vitro for horse [ ], and horse and donkey have identical dcex residues and the same ΔΔg. amongst new world monkeys, marmosets have been experimentally infected [ ]. we predict that the closely related capuchin and squirrel monkey are also at risk, although they have not been shown to be infected using functional assays [ ]. we performed detailed structural analyses to characterise the key residues contributing to binding energy changes and to consider these discrepancies further.
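the cautious thresholding described above, in which the cutoff is set by the highest ΔΔg among animals with experimental evidence of infection and any animal at or below it is flagged, can be sketched as below. the species names and ΔΔg values are hypothetical, and the label "not flagged" is deliberately weaker than "no risk", since the method is not intended to establish absence of risk.

```python
def risk_threshold(ddg, infected_species):
    """Highest ddG among species with experimental evidence of infection."""
    return max(ddg[s] for s in infected_species)


def flag_at_risk(ddg, threshold):
    """Label each species 'at risk' if its ddG is at or below the threshold."""
    return {
        species: ("at risk" if value <= threshold else "not flagged")
        for species, value in ddg.items()
    }


# Hypothetical ddG values, for illustration only.
ddg = {"species_a": -0.4, "species_b": 1.1, "species_c": 2.9, "species_d": 1.1}
threshold = risk_threshold(ddg, ["species_a", "species_b"])
labels = flag_at_risk(ddg, threshold)
print(threshold, labels)
```

choosing the maximum over infected species, rather than a median, reflects the preference for false positives over false negatives.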
our analyses reveal that the interfaces in both capuchin and squirrel monkey are similar to that of marmoset, suggesting that these two new world monkeys are also likely to be at risk even though there are currently no experimental data supporting this [ ] (supplementary results ). furthermore, all these monkeys have high global sequence similarity to human. for capuchin and squirrel monkey this is >~ % and their dcex residues are identical to those of human, further supporting risk. in marmoset, which has experimental evidence of infection, the global sequence identity is % and % over the dcex residues. additionally, we compared changes in energy of the s-protein:ace complex in sars-cov- and sars-cov and found similar changes, suggesting that the range of animals susceptible to the virus is likely to be similar for sars-cov- and sars-cov (supplementary results ). ace and tmprss are key factors in the sars-cov- infection process. both are highly coexpressed in susceptible cell types, such as type ii pneumocytes in the lungs, ileal absorptive enterocytes in the gut, and nasal goblet secretory cells [ ]. since both proteins are required for infection of host cells, and since our analyses clearly support suggestions of conserved binding of s-protein:ace across animal species, we decided to analyse whether tmprss was similarly conserved. there is no known structure of tmprss , so we built a high-quality model (ndope = - . ) from a template structure (pdb id i ). since tmprss is a serine protease, and the key catalytic residues are known, we used funfams [ ] to identify highly conserved residues in the active site and the cleavage site that are likely to be involved in substrate binding. this resulted in two sets of residues that we analysed: the active site and cleavage site residues (ascs), and the active site and cleavage site residues plus residues within Å of catalytic residues that are highly conserved in the funfam (ascsex).
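the comparison of global sequence identity with identity restricted to a residue subset (such as the dcex residues) can be sketched as below. the function assumes two pre-aligned sequences of equal length and 0-based alignment columns; the sequences and the chosen columns are hypothetical, and treating a gap in the query as a mismatch is a simplifying assumption.

```python
def percent_identity(reference, query, positions=None):
    """Percent identity between two aligned sequences.

    `positions` restricts the comparison to the given 0-based alignment
    columns; when omitted, every column where the reference has a residue
    is compared. A gap ('-') in the query counts as a mismatch.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    if positions is None:
        positions = [i for i, r in enumerate(reference) if r != "-"]
    if not positions:
        raise ValueError("no positions to compare")
    matches = sum(1 for i in positions if reference[i] == query[i])
    return 100.0 * matches / len(positions)


# Hypothetical aligned fragments, for illustration only.
human = "MSSSSWLLLS"
animal = "MSGSSWLFLS"
global_identity = percent_identity(human, animal)
subset_identity = percent_identity(human, animal, positions=[0, 1, 3, 4])
print(global_identity, subset_identity)
```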
the sum of grantham scores for mutations in the active site and cleavage site for tmprss is zero or consistently lower than ace in all organisms under consideration, for both ascs and ascsex residues (fig. ) . this means that the mutations in tmprss involve more conservative changes. mutations in dcex residues seem to have a more disruptive effect in ace than in tmprss . whilst we expect orthologues from organisms that are close to humans to be conserved and have lower grantham scores, we observed some residue substitutions that have high grantham scores for primates, such as capuchin, marmoset and mouse lemur. in addition, primates, such as the coquerel sifaka, greater bamboo lemur and bolivian squirrel monkey, have mutations in dcex residues with high grantham scores. mutations in tmprss may render these animals less susceptible to infection by sars-cov- . a small-scale phylogenetic analysis was performed on a subset of sars-cov- assemblies in conjunction with a broader range of sars-like betacoronaviruses (supplementary table ), including sars-cov isolated from both humans and civets. consistent with previous phylogenetic work [ ] , sars-like viruses isolated from horseshoe bats (ratg , epi_isl_ ; rmyn , epi_isl_ ) are the closest relatives of sars-cov- strains currently available in genomic repositories (fig. ) , though still remain many decades divergent from sars-cov- [ ] . aided by a large community sequencing effort, tens of thousands of human-associated sars-cov- genome assemblies are now accessible on gisaid [ , ] . at the time of writing, these also include one complete assembly generated from a virus infecting a domestic dog (epi_isl_ ), one genome obtained from a zoo tiger (epi_isl_ ), one directly isolated from a domestic cat (epi_isl_ ) and high coverage complete genomes obtained from farmed mink (including epi_isl_ ). 
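the summed grantham score used above to compare ace and tmprss can be sketched as follows. only a handful of amino acid pairs are included in the lookup table; the values are quoted from the commonly tabulated grantham matrix and should be checked against the original table before any real use. the sequences and positions are hypothetical.

```python
# Partial Grantham distance table (assumed values from the standard matrix;
# verify against the original publication before use).
GRANTHAM = {
    frozenset("LI"): 5,
    frozenset("DE"): 45,
    frozenset("DN"): 23,
    frozenset("KR"): 26,
    frozenset("FY"): 22,
    frozenset("CW"): 215,
}


def grantham_sum(human, animal, positions, table=GRANTHAM):
    """Sum Grantham distances over the given 0-based aligned positions.
    Identical residues contribute zero; an unknown pair raises KeyError."""
    total = 0
    for i in positions:
        a, b = human[i], animal[i]
        if a != b:
            total += table[frozenset((a, b))]
    return total


# Hypothetical aligned interface residues, for illustration only.
human_site = "DKLF"
animal_site = "EKIF"
score = grantham_sum(human_site, animal_site, positions=range(4))
print(score)
```

a score of zero indicates that every selected residue is identical to human, while larger sums indicate less conservative substitutions.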
sars-cov- strains from animal infections fall among the phylogenetic diversity observed in a representative set of human strains (fig. a), as also seen in larger phylogenetic analyses available on nextstrain (https://nextstrain.org/ncov/global). irrespective of host, the sars-cov- spike receptor binding domain is conserved (fig. b) across tested human and animal associated sars-cov- , suggesting mutations in the rbd are not required for infections observed in non-human species to date. of note, whilst genome-wide data indicate a closer phylogenetic relationship between sars-cov- strains and species in circulation in horseshoe bats, the receptor binding domain alignment instead supports a closer relationship with a sars-like virus isolated from pangolins [ ] (epi_isl_ ; fig. b), in line with previous reports [ ]. the ongoing covid- global pandemic has a zoonotic origin, necessitating investigations into how sars-cov- infects animals, and how the virus can be transmitted across species. given the contribution that the stability of the complex formed between the s-protein and its receptors could make to the viral host range, zoonosis and anthroponosis, there is a clear need to study these interactions. however, to our knowledge there have been few studies of relative changes in the energies of the s-protein:ace complex [ ]. a number of recent studies [ , , ] have suggested that, due to high conservation of ace , some animals are vulnerable to infection by sars-cov- . concerningly, these animals could, in theory, serve as reservoirs of the virus, increasing the risk of future zoonotic events, though transmission rates across species are currently not known. therefore, it is important to try to predict which other animals could potentially be infected by sars-cov- , so that the plausible extent of transmission can be estimated, and surveillance efforts can be guided appropriately.
animal susceptibility to infection by sars-cov- has been studied in vivo [ , , , , ] and in vitro [ ] [ ] [ ] during the course of the pandemic. parallel in silico work has made use of the protein structure of the s-protein:ace complex to computationally predict the breadth of possible viral hosts. most studies simply considered the number of residues mutated relative to human ace [ , , ], although some also analysed the effect that these mutations have on interface stability [ , , ]. the most comprehensive of these studies analysed the number and locations of mutated residues in ace orthologues from species [ ], but did not perform detailed energy calculations as we have done. few studies have explored changes in the energy of the s-protein:ace complex on a large scale. shortly after we reported our work on biorxiv, rodrigues et al. [ ] submitted a paper to biorxiv also reporting changes in binding energy of the complex for different animal species, measured using a different approach (haddock [ ]). the results are in good agreement with ours (nearly % of the risk assessments agree). furthermore, when we applied haddock to the animals for which experimental data exist, we also observed significant agreement in risk assessments with those predicted using mcsm-ppi (see supplementary results , supplementary figure and supplementary methods ). our haddock analysis showed slightly better correlation with experiment than the rodrigues et al. study, possibly due to use of a different template structure when building the animal models ( m j, a better resolved structure than m used by rodrigues et al.). our work is the only study that has so far explored changes in the energy of the s-protein:ace complex on a very large scale ( animals) in order to assess risk of infection across a broad range of animal species. furthermore, it is the only study to assess whether changes in tmprss could also be influencing risk.
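the agreement between risk assessments from two independent methods, as discussed above, can be expressed as the percentage of shared species that receive the same call. the sketch below is a minimal illustration; the method labels, species names and calls are hypothetical and do not reproduce results from either study.

```python
def percent_agreement(calls_a, calls_b):
    """Percentage of species present in both assessments with identical calls."""
    shared = sorted(set(calls_a) & set(calls_b))
    if not shared:
        raise ValueError("the two assessments share no species")
    same = sum(1 for species in shared if calls_a[species] == calls_b[species])
    return 100.0 * same / len(shared)


# Hypothetical at-risk calls from two methods, for illustration only.
method_one = {"species_a": True, "species_b": True, "species_c": False, "species_d": True}
method_two = {"species_a": True, "species_b": False, "species_c": False, "species_d": True}
agreement = percent_agreement(method_one, method_two)
print(agreement)
```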
in this study, we performed a comprehensive analysis of the major proteins that sars-cov- uses for cell entry. we predicted structures of ace and tmprss orthologues from vertebrate species and modelled s-protein:ace complexes. we calculated relative changes in energy (ΔΔg) of s-protein:ace complexes, in silico, following mutations from animal residues to those in human. our predictions suggest that, whilst many mammals are susceptible to infection by sars-cov- , most birds, fish and reptiles are not likely to be. however, there are some exceptions. we manually analysed residues in the s-protein:ace interface, including dc residues that directly contact the other protein, and dcex residues, which additionally include residues within Å of the binding residues that may affect binding. we clearly showed the advantage of performing more sophisticated studies of the changes in energy of the complex over the simpler measures used in other studies (such as the number or chemical nature of mutated residues). furthermore, the wider set of dcex residues that we identified near the binding interface had a higher correlation to the phenotype data than the dc residues. in addition to ace , we also analysed how mutations in tmprss impact binding to the s-protein. we found that mutations in tmprss are less disruptive than mutations in ace , indicating that binding interactions in the s-protein:tmprss complex in different species will not be affected. to increase our confidence in assessing changes in the energy of the complex, we developed multiple protocols using different, established methods. we correlated these stability measures with experimental infection phenotypes in the literature, from in vivo [ , , , ] and in vitro [ ] [ ] [ ] studies of animals. protocol using mcsm-ppi (p( )-ppi ) correlated best with the number of mutations, chemical changes induced by mutations and infection phenotypes, so we focused our analysis on this protocol.
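the selection of the protocol whose ΔΔg values correlate best with infection phenotypes can be sketched as below. the correlation measure is not specified in this section, so pearson correlation against a binary infected/not-infected label is an assumption made purely for illustration; the protocol names and values are hypothetical. the absolute value is used because the sign of the correlation depends on the direction of mutation (human to animal, or animal to human).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def best_protocol(ddg_by_protocol, infected):
    """Name of the protocol with the largest absolute correlation
    between its ddG values and the binary infection phenotype."""
    return max(
        ddg_by_protocol,
        key=lambda name: abs(pearson(ddg_by_protocol[name], infected)),
    )


# Hypothetical data: 1 = infection observed, 0 = no infection observed.
infected = [1, 1, 0, 0]
ddg_by_protocol = {
    "protocol_a": [-1.0, -0.5, 1.0, 2.0],
    "protocol_b": [0.2, 1.0, 0.1, 0.9],
}
chosen = best_protocol(ddg_by_protocol, infected)
print(chosen)
```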
our method cannot determine relative changes in energy that are associated with no risk. instead, we used experimental in vivo and in vitro infection data as the gold standard to identify animals at risk. of note, horseshoe bats, heavily advocated as a putative reservoir host, are considered susceptible on the basis of in vitro experiments, despite the considerable disruption in the interface that our detailed structural analysis shows. we found that our predicted ΔΔg values for animals that can be infected by sars-cov- are significantly lower than for animals that showed no infection when tested experimentally (fig. , table ). ΔΔg values for horseshoe bat and marmoset were outliers for infected animals. these ΔΔg values are higher than the median ΔΔg value for animals that are not infected and are approximately the same value as the median ΔΔg = . for all animals included in this study. however, this may be a result of the biased sampling of animals that have been tested experimentally, where most have been mammals to date. going forward, if more distantly related animals are experimentally characterised, it is plausible that non-placental animals, of which many have ΔΔg > . (the value obtained for horseshoe bat), would be found not to be infected. the difference between ΔΔg values for animals that can, and cannot, be infected by sars-cov- would therefore increase. overall, our measurements of the change in energy of the complex for the sars-cov s-protein were highly correlated with sars-cov- , so our findings are also applicable to sars-cov. humans are likely to come into contact with of these species in domestic, agricultural or zoological settings (fig. ). of particular concern are sheep, which show no change in energy of the s-protein:ace complex, as these animals are farmed and come into close contact with humans. indeed, sars-cov- is already responsible for infections in various animal species.
sars-cov- genomes [ , ] have been isolated from natural infections in zoo lions and tigers [ ] , companion animals including cats and dogs [ , ] and following widespread outbreaks in multiple mink farms in the netherlands resulting in mass culling [ ] (fig. ) . in most cases natural infections have been linked to human infections supporting cross-species transmission and high levels of exposure [ ] . to date, minks provide the only well supported example of sustained intraspecies transmission with secondary zoonotic transmission back to humans [ ] . consistently, we predict american mink to be at risk of infection by sars-cov- , with ΔΔg = . . to gain a better understanding of the nature of the s-protein:ace interface, we performed more detailed structural analyses for a subset of species. in a few cases, we had found discrepancies between our energy calculations and experimental phenotypes, namely predicting risk for some animals where in vitro experiments showed no infection (table ) . to test our predictions, we manually analysed how the shape or chemistry of residues may impact complex stability for all dc residues and a selection of dcex residues. previous studies have identified a number of important locations in human ace for binding the s-protein [ , ] and we found agreement with these in structural studies using our animal models. these locations, namely the hydrophobic cluster near the n-terminus and two hotspot locations near residues and , stabilise the binding interface. sars-cov- exploits the hydrophobic pocket by mutations that alter the conformation and flexibility of the rbd loop, together with point mutation l f that provides a compact interface which is more dynamically and energetically favourable compared with sars-cov [ , ] . our structural analysis showed how sars-cov- can utilise this pocket for binding at the interface in all the species we examined, including those for which current experimental test data suggest no risk. 
hotspot shows more structural variability. in agreement with our calculations of large changes in energy of the s-protein:ace complex in horseshoe bat, our structural studies show that the variant d n causes the loss of a salt bridge and h-bonding interactions between ace and s-protein at hotspot . these detailed structural analyses are supported by the high grantham score and calculated total ΔΔg for the change in energy of the complex. both dog and cat have a physicochemically similar variant at this hotspot (d e), which although disrupting the salt bridge still permits alternative h-bonding interactions between the spike rbd and ace . for marmosets and other new world monkeys, capuchin and squirrel monkey, our structural analyses revealed similarity to human at hotspot in the ace interface. in fact, capuchin resembled the human ace interface even more closely than marmoset, which can be infected [ ] , even though in vitro experiments have not reported infection in capuchin [ ] . our ΔΔg value for capuchin (ΔΔg = . ) suggests risk of infection, and also for squirrel monkey (ΔΔg = . ), despite the fact that squirrel monkey also failed to show risk in in vitro experimental studies. of note is the fact that marmoset showed no infection in in vitro studies [ ] , whilst recent in vivo experiments [ ] have shown risk, perhaps suggesting that it can be difficult to detect infection in vitro for these monkeys. alternatively, the lack of infection may suggest additional factors influencing infection and indicate that these animals, which are primates closely related to human, may be useful models for studying immune, or other factors, related to resistance. finally, our structural analyses showed that some dcex residues were likely to be allosteric sites, which may represent promising drug targets [ ] . 
the value of our study is not in determining an absolute ΔΔg threshold for risk, but rather in providing information about relative changes in binding energy that will allow the host range of the virus to be more accurately gauged once more experimental work has been conducted. we believe that false positive predictions are more acceptable than false negatives. so, within the context of possible transmission events between species, and particularly to human, we consider that an animal can be infected if there is any experimental evidence of infection. we applied protocols that enabled a comprehensive study of host range, within a reasonable time, for identifying species at risk of infection by sars-cov- , or of becoming reservoirs of the virus. although we felt that these faster methods were justified by the need for timely answers to these questions, there are clearly caveats to our work that should be taken into account. whilst we use a state of the art modelling tool [ ] and an endorsed method for calculating changes in energy of the complex [ ] , molecular dynamics may give a more accurate picture of energy changes by sampling rotamer space more comprehensively [ ] . however, such an approach would have been prohibitively expensive at a time when it is clearly important to identify animals at risk as quickly as possible. each animal could take orders of magnitude longer to analyse using molecular dynamics. further caveats include the fact that although the animals we highlight at risk from our changes in binding energy calculations correlate well with the experimental data, there is only a small amount of such data currently available, and many of the experimental papers reporting these data are yet to be peer reviewed. finally, we restricted our analyses to one strain of sars-cov- , but other strains may have evolved with mutations that give more complementary interfaces. 
for example, recent work suggests sars-cov- can readily adapt to infect mice following serial passages [ ] . in summary, our work is not aiming to provide an absolute measure of risk of infection. rather, it should be considered an efficient method to screen a large number of animals and suggest possible susceptibility, and thereby guide further studies. any predictions of possible risk should be confirmed by experimental studies and computationally expensive, but more robust methods, like molecular dynamics. the ability of sars-cov- to infect host cells and cause covid- , sometimes resulting in severe disease, ultimately depends on a multitude of other host-virus protein interactions [ ] . while we do not investigate them all in this study, our results suggest that sars-cov- could indeed infect a broad range of mammals. as there is a possibility of creating new reservoirs of the virus, we should now consider how to identify such transmission early and to mitigate against such risks. in particular, farm animals and other animals living in close contact with humans could be monitored, protected where possible and managed accordingly [ ] . ace protein sequences for vertebrates, including humans, were obtained from ensembl [ ] version and eight sequences from uniprot release _ (supplementary table ). tmprss protein sequences for vertebrate sequences, including the human sequence, were obtained from ensembl (supplementary table ) . a phylogenetic tree of species, to indicate the evolutionary relationships between animals, was downloaded from ensembl [ ] . the structure [ ] of the sars-cov- s-protein bound to human ace at . Å was used throughout (pdb id m j). we used standard methods to analyse the sequence similarity between human ace and other vertebrate species (supplementary methods ). we also mapped ace and tmprss sequences to our cath functional families to detect residues highly conserved across species (supplementary methods ). 
in addition to residues in ace that contact the s-protein directly, various other studies have also considered residues that are in the second shell, or are buried, and could influence binding [ ] . therefore, in our analyses we built on these approaches and extended them to compile the following sets for our study: . direct contact (dc) residues. this includes a total of residues that are involved in direct contact with the s-protein [ ] identified by pdbe [ ] and pdbsum [ ] . . direct contact extended (dcex) residues. this dataset includes residues within Å of dc residues, that are likely to be important for binding. these were selected by detailed manual inspection of the complex, and also considering the following criteria: (i) reported evidence from deep mutagenesis [ ] , (ii) in silico alanine scanning (using mcsm-ppi [ ] ), (iii) residues with high evolutionary conservation patterns identified by the funfam-based protocol described above, i.e. residues identified with dops ≥ and scorecons score ≥ . , (iv) allosteric site prediction (supplementary methods ), and (v) sites under positive selection (supplementary methods ). selected residues are shown in supplementary fig. and residues very close to dc residues (i.e. within Å) are annotated. we also included residues identified by other related structural analyses, reported in the literature (supplementary methods ) . using the ace protein sequence from each species, structural models were generated for the sprotein:ace complex for animals using the funmod modelling pipeline [ , ] (supplementary methods ). funmod searches for structural templates by mapping sequences to a cath funfam and selecting the structure of the closest relative of known structure, to use as a template for homology modelling [ ] . sequences are mapped by scanning them against the cath funfam hmm library using hmmer [ ] . 
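the distance criterion underlying the dcex set (residues lying within a fixed distance of the dc residues) can be sketched as a simple neighbour search. this covers only the geometric first step and not the manual inspection or the additional criteria listed above; the coordinates, residue identifiers and the cutoff value are hypothetical placeholders.

```python
from math import dist


def residues_within(coords, seeds, cutoff):
    """Return ids of non-seed residues with any atom within `cutoff`
    angstroms of any atom of a seed residue.

    `coords` maps a residue id to a list of (x, y, z) atom coordinates;
    `seeds` is the set of residue ids used as the starting (DC-like) set.
    """
    seed_atoms = [atom for res in seeds for atom in coords[res]]
    nearby = set()
    for res, atoms in coords.items():
        if res in seeds:
            continue
        if any(dist(a, s) <= cutoff for a in atoms for s in seed_atoms):
            nearby.add(res)
    return nearby


# Hypothetical single-atom residues placed along one axis.
coords = {
    1: [(0.0, 0.0, 0.0)],
    2: [(3.0, 0.0, 0.0)],
    3: [(10.0, 0.0, 0.0)],
}
shell = residues_within(coords, seeds={1}, cutoff=5.0)
print(shell)
```

for whole protein complexes a spatial index such as a k-d tree would avoid the all-against-all distance loop.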
the structural template selected was pdb id m j, a high-resolution crystal structure of sars-cov s-protein:human ace complex. we generated query-template alignments using hh-suite [ ] and predicted d models using modeller v. . [ ] . the 'very_slow' schedule was used for model refinement to optimise the geometry of the complex and interface. for each species, we generated models and selected the model with the lowest ndope [ ] score. only high-quality models were used in this analysis, with ndope score < - and with < dcex residues missing. this gave a final dataset of animals for further analysis. the modelled structures of ace were compared against the human structure (pdb id m j) and pairwise, against each other, using ssap [ ] . ssap measures the similarity between d protein structures by calculating similarity in vector views between aligned residues. a vector view for a given residue is the set of vectors from the cβ atom of that residue to the cβ atom of all other residues in the protein structure. ssap returns a score in the range - , with identical structures scoring [ ] . we also built models for tmprss proteins in all available species and identified the residues likely to be involved in the protein function (see supplementary methods ). we calculated the changes in binding energy of the sars-cov- s-protein:ace complex and the sars-cov s-protein:ace complex of different species, compared to human, following two different protocols: . protocol : using the human complex and mutating the residues for the ace interface to those found in the given animal sequence and then calculating the ΔΔg of the complex using both mcsm-ppi [ ] and mcsm-ppi [ ] (supplementary methods ) . this gave a measure of the destabilisation of the complex in the given animal relative to the human complex. ΔΔg values < are associated with destabilising mutations, whilst values ≥ are associated with stabilising mutations. . 
protocol : we repeated the analysis with both mcsm-ppi and mcsm-ppi as in protocol , but using the animal -dimensional models, instead of the human ace structure, and calculating the ΔΔg of the complex by mutating the animal ace interface residue to the appropriate residue in the human ace structure. this gave a measure of the destabilisation of the complex in the human complex relative to the given animal. values ≤ are associated with destabilisation of the human complex (i.e. animal complexes more stable), whilst values > are associated with stabilisation of the human complex (i.e. animal complexes less stable). we subsequently correlated ΔΔg values with available in vivo and in vitro experimental data on covid- infection data for mammals. protocol , mcsm-ppi , correlated best with these data. to measure the degree of chemical change associated with mutations occurring in dc and dcex residues, we computed the grantham score [ ] for each vertebrate compared to the human sequence (supplementary methods ). we performed phylogenetic analyses for a subset of sars-cov (n = ), sars-like (n = ) and sars-cov- (n = ) viruses from publicly available data in ncbi [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] and gisaid [ , ] (supplementary methods ). the funmod structural models for the sars-cov- spike-rbd:ace complex and tmprss are available on zenodo at https://zenodo.org/record/ [ ] . world health organisation. covid- situation report - available from a pneumonia outbreak associated with a new coronavirus of probable bat origin isolation and characterization of viruses related to the sars coronavirus from animals in southern china bats, civets and the emergence of sars the phylogenetic range of bacterial and viral pathogens of vertebrates bovine respiratory coronavirus porcine coronaviruses. 
emerg transbound anim viruses discovery, diversity and evolution of novel coronaviruses sampled from rodents in china discovery of a novel coronavirus, china rattus coronavirus hku , from norway rats supports the murine origin of betacoronavirus and has implications for the ancestor of isolation and characterization of a novel betacoronavirus subgroup a coronavirus, rabbit coronavirus hku , from domestic rabbits biologic, antigenic, and full-length genomic characterization of a bovine-like coronavirus isolated from a giraffe severe acute respiratory syndrome) [internet]. who. world health organization sars-associated coronavirus transmitted from human to pig. emerg infect dis sars-cov- neutralizing serum antibodies in cats: a serological investigation from people to panthera : natural sars-cov- infection in tigers and lions at the bronx zoo disease and diplomacy: gisaid's innovative contribution to global health: data, disease and diplomacy. glob chall global initiative on sharing all influenza data -from vision to reality susceptibility of ferrets, cats, dogs, and different domestic animals to sars-coronavirus- [internet]. microbiology comparison of sars-cov- infections among species of non-human primates functional and genetic analysis of viral receptor ace orthologs reveals broad potential host range of sars-cov- . biorxiv potential host range of multiple sars-like coronaviruses and an improved ace -fc variant that is potent against both sars-cov- and sars-cov- [internet]. microbiology broad and differential animal ace receptor usage by sars-cov- . 
supplementary structural models (sars-cov- spike-rbd:ace complex and tmprss ) -sars-cov- spike protein predicted to form complexes with host receptor protein orthologues from a broad range of mammals we thank gal horesh, caitlin lee carpenter and mohd firdaus raih for insightful discussions; alan hunns for help in making figures; and laurel woodridge, sean le cornu and declan torin cook for comments on the manuscript. we would also like to thank francois balloux, whose team member, lucy van dorp, contributed the phylogenetic analysis of sars-like viruses. key: cord- -z cl authors: wicaksono, irmandy; tucker, carson i.; sun, tao; guerrero, cesar a.; liu, clare; woo, wesley m.; pence, eric j.; dagdeviren, canan title: a tailored, electronic textile conformable suit for large-scale spatiotemporal physiological sensing in vivo date: - - journal: npj flex electron doi: . /s - - -y sha: doc_id: cord_uid: z cl the rapid advancement of electronic devices and fabrication technologies has further promoted the field of wearables and smart textiles. however, most of the current efforts in textile electronics focus on a single modality and cover a small area. 
here, we have developed a tailored, electronic textile conformable suit (e-tecs) to perform large-scale, multimodal physiological (temperature, heart rate, and respiration) sensing in vivo. this platform can be customized for various forms, sizes and functions using standard, accessible and high-throughput textile manufacturing and garment patterning techniques. similar to a compression shirt, the soft and stretchable nature of the tailored e-tecs allows intimate contact between electronics and the skin with a pressure value of around ~ mmhg, allowing for physical comfort and improved precision of sensor readings on skin. the e-tecs can detect skin temperature with an accuracy of . °c and a precision of . °c, as well as heart rate and respiration with a precision of . m/s( ) through mechano-acoustic inertial sensing. the knit textile electronics can be stretched up to % under cycles of stretching without significant degradation in mechanical and electrical performance. experimental and theoretical investigations are conducted for each sensor modality along with performing the robustness of sensor-interconnects, washability, and breathability of the suit. collective results suggest that our e-tecs can simultaneously and wirelessly monitor skin temperature nodes across the human body over an area of cm( ), during seismocardiac events and respiration, as well as physical activity through inertial dynamics. in recent years, we have witnessed a vast advancement towards flexible and stretchable devices , . the current form-factor of medical devices that are rigid and boxy starts to become soft and conformable , . this brings out health monitoring that is nonobtrusive, imperceptible, and closer to our body, even when we are away from the hospital . there are two major classes of wearable electronics for healthcare: on-skin, and textile electronics. 
thin, soft and skin-like electronics in the form of a patch, with wireless capabilities, have been developed to precisely detect various physiological signals from the human body, such as electrophysiology, temperature, pulse oximetry, blood pressure, hydration, and others. they are made either by designing a particular structure that can withstand strain on a deformable polymeric substrate, or by using intrinsically stretchable materials. on the other hand, textiles and clothing are ubiquitous in our daily life. we wear and wash them regularly, and they give us comfort and protection from the outside environment. being the closest layer to our body, they provide an ideal platform for the integration of electronics to monitor physiological processes through the skin. electronic devices integrated into textiles can, therefore, offer several advantages, such as enhanced mobility and comfort for the user. textiles also serve as an excellent substrate for sensing throughout dynamic activities and environments, where robustness and washability are critical because the substrate undergoes repeated stretching and friction and is frequently exposed to dirt and humidity. several efforts have been made to integrate electronics into textiles, for instance, by coating yarns with metal or printing conductive inks on fabrics to serve as electrodes for electrophysiology, sewing and attaching functional threads and fabrics, weaving electronics fabricated on polyimide strips for humidity, temperature, pulse oximetry, and gas sensing, as well as developing electronic fibers for seamless woven electronic textiles. some of these intelligent textiles, however, are not scalable for large-area sensing and do not provide the stretchability needed for skin-contact sensing in electronic suits. it is also worth noting that current on-skin and wearable devices mostly measure a single parameter at a particular location of the body.
distributed sensor networks that can spatiotemporally map multiple physiological processes and physical movements in different regions of the body (supplementary table ) are a valuable tool for clinicians, as they can provide a rich dataset to assess a health condition, predict disease, or advance sports science and analytics. a specific example is soft, battery-free epidermal sensors that can be adhered to various regions of the body to perform full-body skin temperature and physical pressure mapping. these sticker-like sensors are used in sleep studies to help with the treatment of sleep disorders, jet lag, and pressure ulcers on a clinical bed setup. distributed skin temperature mapping has also been demonstrated to study thermoregulation efficiency in athletic performance, as well as dermatome abnormality through regional nerve root damage. however, even though they are wireless, these epidermal sensors require a near field communication (nfc) reader in the vicinity to power the electronics and collect the data. they would also be challenging to use during dynamic activities, which limits their application outside the bed. their soft, fragile nature and adhesive tape application on the skin may prevent long-term operation. other wireless on-skin devices are integrated with batteries. however, having multiple devices, each with an independent power source, tends to be cumbersome when one needs to replace and charge every single device, rather than wearing a centralized, system-on-textile garment that could perform all of the functions. several on-skin and textile electronic devices also require specific materials and microfabrication techniques to be developed, resulting in relatively high cost and effort for mass manufacturing and large-scale deployment.
recent work also focuses on the design of customizable, modular, and reconfigurable soft electronics, but few studies apply these design principles to textile-based applications. the wide variations in human body size and shape pose a challenge for the design and development of smart clothing. accordingly, a universal platform of sensor networks on textiles, as well as their hardware-software integration, must be established. with this standardization, industries can, therefore, work on specific parts, such as sensor modules, and not be concerned about designing a full wearable system. further processes can then focus on personalization of smart clothing based on the user's requirements and needs. through this work, we have developed a technique of combining thin, customizable conformable electronic devices, including interconnect lines and off-the-shelf integrated circuits, with plastic substrates that can be woven into knitted textile using an accessible and high-throughput manufacturing approach. similar to a compression garment, the nature of this stretchable knitted textile will allow intimate contact between electronics and the skin. our technique creates a platform to integrate a large assortment of conformable electronic components in a suit for large-scale physiological and physical activity sensing on the body. we demonstrate the capability of our electronic textile conformable suit (e-tecs) for distributed, wireless physiological sensing, such as temperature, respiration and heart-rate detection, and physical activity monitoring around the human body during a physical exercise. repeated mechanical cycling tests also prove the durability of the knitted textile for daily wear. figure a illustrates the concept of an e-tecs that monitors the human skin surface temperature distribution, heart rate, and respiration. 
the suit is tailored from a customized fabric that can be integrated with an assortment of sensor integrated circuits (ics) and interconnects in the form of flexible-stretchable electronic strips. the textile platform consists of channels or pockets for the weaving of these electronic strips (fig. b) . the sensor ics and interconnects are developed using two-layer industrial flexible printed circuit board (pcb) processes (fig. c, . a and see methods), with additional steps for chip and passive component assembly and encapsulation with thermoplastic polyurethane (tpu) (te- c, dupont) and washable encapsulant (pe , dupont). the tailored approach through body fitting results in an e-tecs that fully conforms to the curvature of the body (fig. d) . the textile channels for embedding the electronics further enhance the comfort of the suit. figure e presents a photograph of a temperature device (max , maxim integrated) and the device woven into the customized textile. figure f shows the sensor islands for temperature (right) and accelerometer (left), respectively, with an outline size of . cm × cm. the two layers, which consist of serpentine interconnects (fig. g ) of µm thick and µm track width copper (cu) sandwiched between µm thick and µm track width polyimide (pi), serve as the bridge for a total of four bus lines in an inter-integrated circuit (i c) network architecture. from the cross-sectional microscope image of the device woven into the customized textile (fig. h) , it can be seen that there are four main layers: the textile, encapsulation, electronic chip, and polyimide (pi). as shown in supplementary fig. a , we designed and fabricated seven different modules: four temperature sensing modules, one inertial sensing module, and two interconnection modules. in an area of cm × . 
cm flexible board (fpcb, kingcredie), we can fit a total of temperature sensors and interconnection strips, demonstrating the large-amount, rapid manufacturability of this approach. the temperature sensor (max , maxim integrated) has an accuracy of . °c between - °c, and a . °c resolution, which we rounded up programmatically to . °c. this sensor can have up to unique addresses, which can be set by connecting ground (gnd), power supply (vdd), data line (sda) or clock line (scl) signal to the a , a , and a pins on the chip. given that there are eight combinations possible in these three pins for each signal, we designed four different hard-wired a , a , and a pins ( supplementary fig. b) to voltage supply (vdd) or ground (gnd) and data-line (sda) or clock-line (scl), represented by m, m , m , and m . each temperature module can be manually joined by soldering the jumpers ( supplementary fig. c ), in order to access all of the possible addresses. the capacitor complement of the max is used as a decoupling capacitor to stabilize the local vdd supply from high-frequency noise and voltage ripples. the mechano-acoustic sensor or inertial measurement unit (imu) (mpu , invensense) is capable of measuring axis gyroscope and -axis accelerometer, with a programmable accelerometer range of ± g to g, the highest precision of . g or . m/s , and a maximum of two addresses in one i c bus. we designed four pads at each side of all sensor modules for connection to power and signal lines (vdd, scl, sda, gnd). the sensing modules or islands can be joined to the interconnect modules and each other by soldering the four pads together ( supplementary fig. d ). the interconnect strips have multiple islands of pads with an area of mm × mm in between serpentine interconnects ( supplementary fig. e ). the pad design enables the interconnect strips to be reconfigurable. it can be cut and joined to any length needed for connection to the sensor islands. 
the female headers or holes at the end of these interconnect strips can be used for textile-hardware connections by looping conductive threads or thin wires. all of the sensor modules can be connected to the main module for powering, processing, and wireless communication through the i c bus interface with four signal wires (vdd, scl, sda, gnd). for a single i c bus, the maximum number of sensor nodes it can access is = addresses. this means that the system can handle up to temperature sensors ( × to × f in -bit address) and inertial measurement units (imu) ( × and × ) with minimal wiring. the corresponding address of the temperature sensors is given in supplementary table . since every sensor module in our system has its own reading and processing performed locally, adding several sensor nodes of different nature will not introduce crosstalk, as long as each has a unique sensor address. supplementary fig. a illustrates the concept of a modular sensor network architecture embedded in a piece of fabric. the sensors can be connected to each other with the interconnects in a horizontal manner, where the signal gets collected by the external layer, which consists of a bluetooth low-energy (ble) module, a microprocessor, and a power source. we developed a prototype ( figure s b ) to demonstrate the scalability of the sensor-integrated fabric, as shown in supplementary fig. b -d. as more fabrics and sensors are joined, i c address scanning from a microcontroller showed an increase in the number of sensor addresses detected ( supplementary fig. e-g) . this demonstration reflects the possibility of roll-to-roll manufacturing of sensor-embedded fabrics that can be cut in any size, joined, and tailored for various needs and applications. 
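The address-scanning check described above can be illustrated with a short sketch. It is not the authors' firmware: the `probe` callable, the `make_fake_bus` helper, and the example addresses are hypothetical stand-ins for real hardware, and the only protocol fact assumed is that 7-bit I2C scans conventionally skip the reserved ranges 0x00-0x07 and 0x78-0x7F.

```python
from typing import Callable, Iterable, List, Set

# General-purpose 7-bit I2C scans cover 0x08-0x77; the addresses below and
# above that window are reserved by the I2C specification.
FIRST_ADDR = 0x08
LAST_ADDR = 0x77


def scan_bus(probe: Callable[[int], bool]) -> List[int]:
    """Return every 7-bit address for which `probe` reports an acknowledgement."""
    return [addr for addr in range(FIRST_ADDR, LAST_ADDR + 1) if probe(addr)]


def make_fake_bus(present: Set[int]) -> Callable[[int], bool]:
    """Hypothetical stand-in for hardware: acknowledges only the given addresses."""
    return lambda addr: addr in present


def find_conflicts(addresses: Iterable[int]) -> List[int]:
    """Addresses used more than once in a planned layout (these would collide)."""
    seen: Set[int] = set()
    duplicated: Set[int] = set()
    for addr in addresses:
        if addr in seen:
            duplicated.add(addr)
        seen.add(addr)
    return sorted(duplicated)


if __name__ == "__main__":
    # Placeholder addresses, not the ones listed in the supplementary table.
    one_patch = make_fake_bus({0x10, 0x11})
    two_patches = make_fake_bus({0x10, 0x11, 0x20, 0x21})
    print(len(scan_bus(one_patch)), len(scan_bus(two_patches)))
```

Joining a second fabric patch simply makes more addresses respond to the same scan, which is the behaviour the prototype demonstration relies on; `find_conflicts` is an optional pre-check that every module in a planned layout has a unique address.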
temperature and inertial sensor characterization we performed infrared (ir) thermography cross-validation with a total of four trials (n = ), of an encapsulated device without any integration to a fabric and an encapsulated device embedded in a fabric channel (fig. a) . see methods for further description of the experimental setup. the results in fig. b show that there is an increasing offset as the temperature rises in both cases, with the fabric device exhibiting better performance or higher temperature; this is due to the insulating behavior of the fabric layer that keeps the temperature from distributing to the environment. these values in supplementary fig. a were consistent with those determined by a two-dimensional ( -d) finite element model (fem) simulation in supplementary fig. b -g. based on the fem simulation and experimental results, the sensor required a calibration factor, defined by an offset and a multiplier from linear fitting that converts the sensor reading close to the temperature obtained by the ir camera ( supplementary fig. a ). -d fem model was created in accordance with the structure of the sensor embedded into the textile, to study the temperature distribution across the cross-section of the sensor. the heat is transferred from the heat source (digital hotplate, torrey pines scientific) to the bottom surface of the packaged sensor due to the thermal contact, and ultimately transmitted from the top textile layer to the external environment, primarily in the form of convection and radiation. we simulated no airflow, even though the air is considered in the external environment. the ambient temperature was specified as . °c, similar to the ambient temperature at the time of experimental characterization. the steady-state temperature distribution was then theoretically simulated. the simulated results matched the experimental results with a tolerance of . ± . °c. 
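The calibration factor described above (an offset and a multiplier obtained by linear fitting against the ir camera) can be sketched as an ordinary least-squares fit. The function names and the sample values are illustrative assumptions; the actual fitted coefficients are those reported in the supplementary figure.

```python
from typing import Sequence, Tuple


def fit_calibration(sensor: Sequence[float],
                    reference: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of reference ~= multiplier * sensor + offset."""
    n = len(sensor)
    if n != len(reference) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(sensor) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in sensor)
    if sxx == 0:
        raise ValueError("sensor readings must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sensor, reference))
    multiplier = sxy / sxx
    offset = mean_y - multiplier * mean_x
    return multiplier, offset


def calibrate(reading: float, multiplier: float, offset: float) -> float:
    """Convert a raw sensor reading to the reference (ir camera) scale."""
    return multiplier * reading + offset


if __name__ == "__main__":
    # Made-up paired readings in degrees Celsius, for illustration only.
    raw = [28.0, 31.5, 35.0, 38.5]
    ir = [29.0, 33.0, 37.0, 41.0]
    m, b = fit_calibration(raw, ir)
    print(m, b, calibrate(33.0, m, b))
```

Fitting the encapsulated and fabric-embedded devices separately would give each configuration its own multiplier and offset, which matches the observation that the two cases show different offsets as temperature rises.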
figure c shows a sample of the simulated thermal distribution across the -d model when the hot-plate temperature is °c, while supplementary fig. b -g shows the distribution when the hot-plate temperature is ranged from to °c. seismocardiography (scg) records the subtle motions around the body due to the atrial muscle contractions and blood ejection as the heart pumps. the frequency characteristic waveform of scg thus reflects cardiac mechanical events. it can be unobtrusively monitored by attaching imus to the body or integrating them to objects that will physically touch the body , . depending on the location of the imus, they can also capture body motions caused by the contraction and dilation of the lungs, which relate to the breathing mechanism. we placed an imu right below the sternum as it has been shown to be the most sensitive location to detect both heart and breathing activities , . we assembled, encapsulated, and integrated an accelerometer module with a customized fabric patch. figure j shows our mechano-acoustic element embedded in a fabric and placed right below the sternum, with a commercial electrocardiography (ecg) and respiration (zephyr biopatch, medtronic) strap as the cross-validation device for simultaneous measurement of scg and ecg. a single cardiac cycle represents the contraction (systole) and relaxation (diastole) of heart muscle motions of the atrium and ventricular chamber. these motions induce electrical activities, which are followed by mechanical movements as the heart chambers contract and the valves close. these electromechanical coupling features are imperative in ecg and heart auscultation. figure d , f show ecg and scg signals measured simultaneously from a healthy male subject (age ). the scg data are given by the accelerometer (mpu- , invensense) z-axis value, with a sensitivity setting of g, precision of . m/s , and a sampling frequency of hz. 
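Both the cardiac and the respiratory traces in this section come from the same accelerometer z-axis stream and are separated by fir filtering. The sketch below shows one common way to build such a filter, a windowed-sinc low-pass design with a Hamming window; it is an assumption for illustration, not the filter specified in the methods, and the cutoff, sampling rate, and tap count are placeholders.

```python
import math
from typing import List, Sequence


def lowpass_taps(cutoff_hz: float, fs_hz: float, num_taps: int = 101) -> List[float]:
    """Windowed-sinc (Hamming) low-pass FIR taps, normalised to unity DC gain."""
    if num_taps < 3 or num_taps % 2 == 0:
        raise ValueError("num_taps must be odd and at least 3")
    if not 0 < cutoff_hz < fs_hz / 2:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    fc = cutoff_hz / fs_hz            # cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR convolution with zero initial state; same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j < 0:
                break
            acc += t * signal[i - j]
        out.append(acc)
    return out
```

With a cutoff well below the heart-beat band, the filter output approximates the slow breathing component; subtracting that output from the raw signal, delayed by `(num_taps - 1) / 2` samples to account for the filter's group delay, leaves the faster cardiac content.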
a finite impulse response (fir) low-pass filter is used (see methods) to process the raw data ( supplementary fig. ) eliminating respiratory waveforms. magnified views of a single cardiac cycle (fig. e , g) highlight all the critical features of these two waveforms, such as the mitral valve closure (mc), aortic valve opening (ao), and rapid ventricular ejection (re) occurring right after the r-peak or ventricle depolarization, and the aortic valve closure (ac), mitral valve opening (mo), and rapid ventricular filling (rf) after the t-peak and ventricle relaxation. from the raw data in supplementary fig. , not only could we collect scg data that provide information on the heart activity, but we could also gain insights into the breathing activity due to the lung and diaphragm mechanical movements. for the respiratory waveform, an fir low-pass filter is also used to eliminate high-frequency signals due to heart-beat events and obtain the direct current (dc) component of the signals. the result shows a breathing waveform (fig. h ) that exhibits a similar response in comparison to a commercial device (zephyr biopatch, medtronic) as shown in fig. i . digital knitting is a programmable, automatic machine process (fig. a) of stitching interlocked loops from multiple strands of yarn. it uses several needles or hooks to arrange the interlocking mechanism of loops into fabrics. the process of knitting starts with multiple cones of yarn that get pulled into the machine by yarn carriers until a certain pre-programmed tension is achieved. the carriers then slide back-and-forth horizontally while the needles catch the yarns to form the loops. each carrier can be sequentially controlled to slide and combine different yarns to form structural or color patterns. the programming interface consists of two grid sections (fig. b, c) . 
the left grid is used to develop the shape and pattern of the knit fabrics through x-y color block programming, where each color and logo represent specific knit operation. using a flat two-bed digital knitting machine (super-j , matsuya), we patterned textile channels using a combination of two-layer jersey (left) and interlocked knitting (right), as illustrated in fig. d . figure e shows a region of the resultant fabric, with four textile channels and three interlocked stripes. the single-color stripes in this piece consist of two layers of separated fabric, while the dotted stripes represent the interlocked patterns, which combined two fabric layers into one. we digitally knitted three fabrics with a size of × cm: one for front-side, one for backside, and one for a pair of long-sleeve. the channel design of our digitally knit fabric was done based on the size of our sensing and interconnect modules. as shown in supplementary fig. a , the width of our sensing and interconnect modules are . cm. therefore, we decided to design cm channels on our digitally knit fabric to provide enough room for these modules (fig. e) . based on our design, the minimum distance between each sensor is cm vertically based on the channel width, and cm horizontally based on the interconnect module length. after the whole fabric was drafted, it was then cut for different body parts ( supplementary fig. ) using a personalized garment fitting measurement (supplementary table ). electronic-textile integration was then performed, by threading the electronic strips into the textile channel ( supplementary fig. a) , which is further explained in methods and supplementary fig. . finally, the sensor-integrated fabrics were then sewn into a bodysuit to form e-tecs (fig. f) , as illustrated in detail in supplementary fig. b , with the inside of the e-tecs shown in supplementary fig. c , d. supplementary fig. 
shows the diagram and photographs of electrical connections between the main module and e-tecs for processing, communication and powering. all of the sensors going horizontally through the stripes are collected with four thin copper wires vertically ( supplementary fig. ) through the seams and connected to the main hub (metawearr, mbientlab) through i c protocol. the main hub consists of a microprocessor, ble module, and rechargeable lithium polymer battery in a compact form. the lithium-polymer battery ( , hyp) as shown in supplementary fig. f is rated at . v, mah and has a h charging time. the total current consumption, while the main module and all of the sensor nodes are active, is approximately . ma. with the battery rated at mah, the working lifetime of our system is around h and min. we can improve this lifetime by using lithium-polymer battery with a higher capacity. as illustrated in fig. g , we sewed conductive snaps that function as a textile-hardware connector to link the i c pins on the microprocessor to the i c wires on the textile. the pluggable mechanism ( supplementary fig. b -e) allows the wireless communication and main processing hardware to be removed during charging of the battery. the i c pins of this micro-controller are wired to the conductive snaps for the textile-hardware interface. through wireless ble communication, a computer can access all of the sensor addresses and log their data accordingly. these data can then be stored or visualized in real-time with python matplotlib and pygame library. e-tecs must be personalized to ensure there is sufficient pressure for sensor contact between the textile and skin . using a disk sensor laminated on the skin, mahanty and roemer stated that a pressure of mmhg is sufficient to accurately measure skin temperature, while a larger pressure of up to mmhg will result in an increase of temperature due to the pressure exerted to the local tissue . 
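The working-lifetime figure quoted above for the main hub follows from dividing battery capacity by average current draw. A minimal sketch of that arithmetic is given here; the function name and the example numbers are hypothetical, and the estimate ignores converter losses, battery ageing, and duty cycling.

```python
from typing import Tuple


def battery_life(capacity_mah: float, current_ma: float) -> Tuple[int, int]:
    """Estimated runtime as (hours, minutes) for a constant average current."""
    if capacity_mah <= 0 or current_ma <= 0:
        raise ValueError("capacity and current must be positive")
    total_minutes = round(capacity_mah / current_ma * 60)
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    # Placeholder values, not the ratings of the battery used in the suit.
    hours, minutes = battery_life(capacity_mah=100.0, current_ma=40.0)
    print(f"{hours} h {minutes} min")
```

Because the estimate is linear in capacity, doubling the battery capacity doubles the runtime, which is the improvement route the text suggests.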
for wearable comfort, the compression pressure should not be more than . mmhg, which is close to the average capillary blood pressure of . mmhg near the skin . as shown in supplementary table , a set of key tailoring measurements was used as a reference for the design of the e-tecs. for pressure measurements, ten circumference points of a subject's arm and a compression sleeve were measured to calculate the size of the reduction, as calculated in supplementary table and illustrated in fig. i . by performing mechanical characterization on the base fabric, we can evaluate the fabric rigidity and model the pressure of elastic fabric around the upper limb region of the human body. these modeled values were also cross-validated with a high-accuracy compression fabric sub-bandage pressure monitor (kikuhime, tt meditrade) , as illustrated in supplementary fig. . figure h presents both the experimental and modeled pressure variations across the sleeve. the pressure values for both cases show a similar trend, with a maximum difference of mmhg when the pressure variations are below . mmhg. these values, therefore, reflect the compression property of our e-tecs for on-body sensing with pressure variations of - mmhg and ensure a comfortable and reliable contact between the sensors and the skin. to assess the reliability and electromechanical performance of the serpentine interconnects - , we performed two types of tests. the first test was a one-time uniaxial stretching of the serpentine interconnect until substrate breakage and conductor rupture. supplementary fig. a demonstrates the setup for this mechanical test. as shown in supplementary fig. b -d, the extension of three stretchable interconnects does not influence their resistances ( . - . Ω) until rupture events at strain values around - %. two drops can be seen in the load behavior around the rupture points, which occurred due to the sequential breakage of two serpentine lines.
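The sleeve-pressure modeling mentioned earlier in this passage (circumference reduction combined with fabric rigidity) can be sketched with Laplace's law for a cylinder, P = T / r. This is an assumption: the text does not state which model was used, and the linear tension model, function names, and numbers below are illustrative only.

```python
import math

PA_PER_MMHG = 133.322  # pascals per millimetre of mercury


def sleeve_strain(body_circ_cm: float, sleeve_circ_cm: float) -> float:
    """Circumferential strain of the fabric once stretched over the limb."""
    if body_circ_cm <= 0 or sleeve_circ_cm <= 0:
        raise ValueError("circumferences must be positive")
    return (body_circ_cm - sleeve_circ_cm) / sleeve_circ_cm


def interface_pressure_mmhg(body_circ_cm: float,
                            sleeve_circ_cm: float,
                            stiffness_n_per_m: float) -> float:
    """Laplace-law estimate of skin pressure under a stretched sleeve.

    stiffness_n_per_m is an assumed linear fabric stiffness (tension per unit
    width per unit strain), as would be obtained from a tensile test.
    """
    strain = max(0.0, sleeve_strain(body_circ_cm, sleeve_circ_cm))
    tension = stiffness_n_per_m * strain            # N/m
    radius_m = (body_circ_cm / 100.0) / (2 * math.pi)
    return tension / radius_m / PA_PER_MMHG


if __name__ == "__main__":
    # Placeholder limb and sleeve sizes with a made-up fabric stiffness.
    for body, sleeve in [(30.0, 25.0), (26.0, 22.0), (20.0, 17.0)]:
        print(body, sleeve, round(interface_pressure_mmhg(body, sleeve, 200.0), 2))
```

Evaluating the function at each of the measured circumference points would give a pressure profile along the sleeve that can be compared against the pressure-monitor readings, as done in the cross-validation described in the text.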
a similar response can also be observed in the case of a sensor module connected between two interconnects, with a dimension of mm × mm (supplementary fig. e-g) . all three of the samples' rupture points localized at around % strain, with a stable interconnect resistance of approximately two times that of the case with only one interconnect line ( . - . Ω). the interconnections do not show any degradation in the electrical property when tested, especially when joined by soldering each connection pad to the sensor module. this test also verifies the robustness of the soldered connections between the interconnects and the sensor modules. thus, it can be concluded that both types of interconnections are electrically functional and stay highly conductive for a strain value of up to %, as shown in fig. d . the second test we performed was a fatigue test until conductor rupture of both (i) a single interconnect, as well as (ii) a sensor module integrated between two interconnects, which can be used to evaluate the reliability and lifetime of the serpentine interconnects. most garment distortions happen due to the active movements of the upper body, such as shoulder movements, arm extension, and elbow diameter change , . according to hatch, the typical stretchability range of textiles for a tailored garment is - %, for sportswear is - %, and for a form-fit compression garment is between - % , . based on these ranges, we expect our e-tecs to withstand a strain of up to %. it was observed that both cases of stretchable interconnects could withstand stretching cycles at % elongation ( supplementary fig. a ). both interconnects show stable, flat low resistance behavior as a conductor throughout the test (fig. b , c and supplementary fig. e, f) . load versus strain graphs in supplementary fig. c , d illustrate the viscoelastic-plastic behavior of the tpu , . as shown in supplementary fig. 
b , at the first few cycles, there is a large gap and hysteresis shift of load due to the viscoplastic behavior of tpu, before the mechanical integrity of the tpu weakens and becomes more elastic for the rest of the stretching cycles. after the fatigue test, both samples showed an elongation of around %. based on this result, there will always be a viscoplastic-elastic adaptation in the first few cycles before the stretchable interconnects achieve a consistent mechanical response. optimizations in serpentine design, materials choice, and substrate thickness can be performed to improve the durability of this type of stretchable interconnects , . the mechanical performance of the serpentine structure was also simulated using commercial fem package comsol multiphysics . . one end of the serpentine model was applied with a fixed constraint, while a boundary load was applied to the other end of the serpentine model. the top polyimide surface was set to be symmetric . the tpu material is assumed to be hyper-elastic and exhibit viscoelasticity. stress distribution was simulated for the tensile test with a deformation of %. figure e shows the simulated stress distribution across the serpentine sample. the zoom-in views of the deformed samples (fig. f) reveal that the maximum normalized stress occurred at an arc angle of ± °. the simulation result of . mpa agrees with the experimental tensile strength measurement ( . mpa) in the large deformation region of % based on fig. a , with a simulation error of less than %. similar to how we regularly treat our garments, we also designed our electronic textile to be washable for long-term use. toward this end, we first embedded light-emitting diode (led, rohm semiconductor) strips into the textile channels for a washability test. led brightness with a supply voltage of v and interconnect resistance were unchanged after the first wash until up to ten wash cycles ( supplementary fig. a, b) . the range of resistance values ( . - . 
Ω) was as expected, as it was noted on the previous mechanical tests that (supplementary fig. d ) each serpentine has a resistance of . - . Ω and in the fabric sample, and a total of eight serpentine interconnects are connected in series. we observed no flakes or discoloration on the washable encapsulation (pe , dupont) after ten washing cycles and liquid chemical treatment (ultra stain release, tide). we also conducted a continuous and real-time washing study, where we wove a strip of three temperature sensors and an accelerometer module (fig. g) into a textile patch and put them into an industrial washing machine (mhn pdcww , maytag washer), as demonstrated in supplementary video . figure i , j captures the multimodal sensor data of the entire washing cycle (fig. h ) that lasted for min. since the 'delicate and knit' option was chosen, cold water was mostly used during the wash. throughout the washing test, the textile patch underwent an initial warm wash, three cycles of rinsing, two cycles of draining, and a dry spin at the end. the temperature recordings reflect these events, while the accelerometer readings show four cycles of sequential slow spin, three cycles of continuous medium spin, and a cycle of fast spin for drying mode in the end. it can be observed that towards the end, the accelerometer values are saturated by the medium and fast spin. these tests thus prove the robustness of the encapsulation and interconnections of the system not only mechanically, but also electrically during delicate washing. the breathability, which is the ability of a fabric to permeate moisture vapor, such as due to sweat or perspiration is one of the most vital comfort factors in garment design . measurements of daily water vapor transmission in this work follow the standards as described in astm e . 
three fabric samples from % cotton fabric, % polyester and % spandex sports fabric, and our own % high-flex polyester fabric were cut and sealed to each dish opening using rubber bands (supplementary fig. a ). accumulated weight loss of each dish was measured daily. from the fitting results, it can be observed that even though our own customized, double-layer knit fabric is thicker ( . mm) compared to the cotton ( . mm) and sports fabrics ( . mm), the breathability of our fabric is still . % higher than the sports fabric, yet . % lower than to the open-air case ( supplementary fig. b and supplementary table ). temperature distribution across e-tecs will enable us to study heat transfer between our skin and environment. intense physical activity activates the muscle, produces heat in the core element, and initiates vasoconstriction that transfers blood from internal to superficial regions of the body . we can, therefore, monitor temperature change around the body during various dynamic physical activities such as daily activity and exercise, to see how heat dissipation and perspiration influence thermal comfort or athletic performance. we performed an activity test on a subject wearing the e-tecs (fig. a) . a male volunteer with no prior medical history of disease was recruited for participation in this test, and informed, signed consent was obtained from the individual after passing the pre-screening procedure. figure b shows the timeline of the activity tasks throughout the min test. supplementary fig. shows all of the raw temperature sensor data throughout the body during the min running test, while fig. e and supplementary fig. provide the calibrated temperature readings according to the linear fit equation found in supplementary fig. a . sensor data in these figures are separated in terms of their respective location: on the posterior side, anterior side, both arms, and the neck. in addition, for visualization purposes, fig. 
c illustrates a body heat-map from the temperature sensor data corresponding to each location. supplementary video also demonstrates simultaneous temperature and accelerometer readings while the subject is running. it can be observed that at the start of the activity test, the body heat-map shows a higher temperature profile on the neck, chest, upperabdomen and upper-back regions, and becomes lower towards the lower-abdomen and lower-back which agrees to a previous study . in some cases, as illustrated in fig. d , e, we can observe a short increase in temperatures across various body regions before they decrease in trend once the subject started to run at a graded load. a sudden change in exercise intensity increases cutaneous blood flow and releases heat, resulting in an increase of core and skin body temperature. this phenomenon occurs until perspiration starts and sweat evaporates from the eccrine glands of the skin, providing a cooling effect and decreasing the skin temperature throughout . as the sweat permeates through the fabric, the temperature tends to stabilize towards the end of the resting period. we can see that temperature around the posterior, especially at the arms, does not show a significant trend, which may be due to the local heat flux and blood flow that mostly originate from primary organs around the central region . to confirm our exercise results, we conducted a second running task and performed ir thermography (duo r, flir) on the same subject without the e-tecs. supplementary fig. shows the body heat-map at the anterior, posterior, and lateral view from the thermal camera throughout the running test at a graded load. the color change indicates a reduction in temperature across the whole body caused by the sweat, with an incremental increase while resting from minute to . 
even though the thermal camera has a higher resolution ( pixels x pixels) compared to the e-tecs ( points), it has a relatively low thermal accuracy (± °c) compared to our body temperature sensor with accuracy and precision of approximately . °c and . °c, respectively . thermal images from the commercial ir thermography camera show a body temperature spread of . to . °c, while wearing the e-tecs results in a temperature spread of . - . °c. the latter range is closer to typical body temperature range during normal activity and intense physical exercise . accelerometer data and mechano-acoustic waveforms from the activity test are also presented in fig. f -i. figure f shows all axis accelerometer data for the entire min of the task. we can observe the intensity of the task, shown as periodic -axis waveforms that can be counted to steps per minute, representing running at mph (fig. g) . the increasing acceleration when the subject started running at a graded load (supplementary fig. a ) and the transitioning acceleration as the subject slowed down to walking at three mph, corresponding to steps per minute ( supplementary fig. b) , are also clearly visible. by zooming into the z-axis acceleration at rest, we can observe the mechano-acoustic waveforms, triggered by the subtle contraction and relaxation of the heart, lung, and diaphragm (fig. h, i) . both figures represent raw data before any further processing, such as filtering. before the exercise, we can see a clear breathing waveform in fig. h , corresponding to breaths per s ( breaths per minute) with small peaks of the beating heart of spikes per s ( heartbeats per minute). after the subject performed a graded load exercise, a large amplitude of mechano-acoustic vibrations from the heart was visible due to the increase in cardiac output. as physical exercise intensity increases, the heart needs to pump more blood and oxygen supply to meet the demand of the body's muscles. 
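The per-minute rates quoted above follow from counting waveform peaks in a fixed time window. A minimal sketch of that arithmetic is given below, under stated assumptions: the 50 Hz sampling rate, the 10 s window, the 1.2 Hz synthetic sine wave, and the threshold are illustrative values only, and `count_peaks` and `rate_per_minute` are hypothetical helper names. It is not the authors' processing pipeline.

```python
import math


def count_peaks(samples, threshold):
    """Count local maxima whose value exceeds `threshold`."""
    peaks = 0
    for i in range(1, len(samples) - 1):
        is_local_max = samples[i - 1] < samples[i] >= samples[i + 1]
        if is_local_max and samples[i] > threshold:
            peaks += 1
    return peaks


def rate_per_minute(n_events, window_s):
    """Convert an event count in a window of `window_s` seconds to events per minute."""
    return 60.0 * n_events / window_s


# Synthetic stand-in for a heartbeat-like waveform (assumed values, not measured data).
fs = 50.0          # sampling rate in Hz (assumption)
window_s = 10.0    # observation window in seconds (assumption)
signal = [math.sin(2 * math.pi * 1.2 * (n / fs)) for n in range(int(fs * window_s))]

beats = count_peaks(signal, threshold=0.5)
bpm = rate_per_minute(beats, window_s)
print(beats, bpm)
```

The same two helpers apply to step cadence or breathing rate; only the window length and the threshold would change. Real accelerometer traces are noisier than a sine wave, so a practical version would likely need smoothing or a minimum peak spacing.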
the lung and respiratory system also respond to the intensity, with an increased breathing rate to compensate for the oxygen requirement of the body to release energy , . in correlation to the activity of these organs, after the exercise, both heart rate and breathing rate increased to bpm and bpm, respectively (fig. i) . in summary, we have merged flexible-stretchable electronics with customized knit fabrics to develop an e-tecs for distributed on-body sensing in vivo. large-scale manufacturing of flexible printed circuit boards and knit fabrics and modular sensor networks enable a high-throughput, scalable system, resulting in: ( ) large-area sensor coverage, and ( ) a versatile platform for multimodal sensor integration. not only did we produce our own fabric structures and patterns, but with garment design and patterning techniques, we also tailored the fabric into a suit with a tight yet comfortable fit for conformal attachment to the curvature of the body. the engineered compression pressure across the body ensures each sensor's contact with the skin and minimizes dislocation from the sensing points. as our final prototype, we integrated sensor islands into the tailored e-tecs, including temperature sensors spread across the upper-body region, and one accelerometer placed right below the sternum. intense physical exercise was conducted to demonstrate the ability of e-tecs to perform continuous spatiotemporal temperature sensing, as well as simultaneous mechano-acoustic sensing for the estimation of heart rate and breathing rate. compared to ir thermography used in this work, our approach enables high-accuracy skin temperature sensing without being spatially limited by the camera's view or the need to be naked, expanding its applications in wearable sensing "on-the-go". the accelerometer could also detect subtle heart rate, respiration, and body movements for physical activity and physiological monitoring. future studies may focus on incorporating additional sensing modalities such as humidity, pressure, optical, ultrasonic, gas, magnetic field sensors, and so on, demonstrating the e-tecs capability during various activities outside the lab, and performing further optimization for electromechanical and washability study. the collective design and integration approach of e-tecs, as well as the underlying experimental and implementation studies, would be of interest in the development of flexible-stretchable and textile electronic systems. the multi-modal, multi-functional framework of e-tecs will enable a new strategy of personalized telemedicine for rapid prototyping and deployment, especially during extreme conditions such as a pandemic or natural disaster relief efforts. it could advance mobile, comfortable, and continuous physiological and physical activity monitoring, with potential implications in healthcare, rehabilitation, and sports science not only in the hospital and laboratory, but also in home-care settings and eventually in outer-space applications.
fig. physical exercise, spatiotemporal physiological mapping, and movement analysis. a photograph of a subject performing the physical exercise task wearing an e-tecs. b timeline of four separate sections of the physical exercise task. c sensor mapping and body heatmap of the subject throughout the exercise. d full-body and each section of the body skin temperature, and (e) anterior skin temperature sensor data during the exercise. all -axis accelerometer data (f) throughout the entire task. (g) in the middle of a graded load test at mph. raw z-axis sensor reading (h) before and (i) after the exercise.
the structure of a sensor module in fig. e consists of a two-layer flexible pcb (fpcb, kingcredie) with μm thick cu traces, μm thick base polyimide (pi) substrate, and μm thick pi outer shell.
the max (maxim integrated) sensor ic, μm in thickness is soldered into the pads with μm thick pi stiffener as a support structure and encapsulated with μm thick washable encapsulant (pe , dupont). the entire module is then encapsulated in a tpu shell (te- c, dupont) with μm thickness for each top and bottom layer. for cross-sectional imaging, an electronic device woven into a fabric channel is submerged and cured in a polydimethylsiloxane mix (sylgard , sigma-aldrich) with base and curing agent ratio of : bath. we then cut the molded device with a circular cold saw (cs- , kalamazoo) at the middle of the chip. finally, we polished the device using the side of a rotating circular blade (wilton corporation). the knit fabrics were developed by a digital flat two-bed knitting machine (super-j , matsuya). two yarn carriers were used in order to make two layers of weft-knit fabric (fig. e) . weft knitting is a method of forming a fabric in which the loops are made in a horizontal way from a single yarn. with a two-bed knitting machine, single layer fabric can be realized by interlocking. interlocking uses two sets of needles that knit back-to-back in an alternate sequence to create two sides of the fabric that are exactly in line with each other, forming one layer. each yarn carrier holds -ply ( denier each ply) of high-flex polyester yarns. textile channels for electronic integration were knitted by allowing both the front and back needle beds to knit simultaneously and by making a spacer fabric with a hollow channel. the number of wale lines, which is in this spacer fabric defines the width of the opening of around cm while the course line number defines the width of the entire knit fabric (fig. c) . the rest of the fabric was formed through interlocking. solder-tip melting (wp , weller) was performed to open the channels for the exposed part of the sensor modules with a distance of . cm. 
after the pattern was drafted, the knitted fabric was laser-cut (helix w, epilog) with the open channels positioned in a horizontal orientation ( supplementary fig. ). the horizontal measurements (e.g. neck circumference, waist circumference, thigh circumference) were reduced by around % depending on the dimension to ensure a tight fit. the optimal amount of strain can be determined after further testing the yield strain of the stretchable interconnects, the compression pressure, and on the fit of the suit. a seam allowance of . cm was used on the pattern pieces. the shirt (fig. f) consists of a front, a back, two sleeves, and polo neckpieces. the raw edges of the seams were joined together using a zig-zag stitch with a sewing machine (cg , singer) as an overlocking stitch ( supplementary fig. b ). as illustrated in supplementary fig. , after the sensor-interconnects modules bonding by hot-melt soldering (pb-free # - g, mg chemicals), the sensor electronics were encapsulated (pe , dupont) by using medical and semiconductor grade epoxy resin that is machinewashable for both mechanical and electrical protection. the electronic strips were then further encapsulated in a stretchable outer shell, in which two films of thermoplastic polyurethane (tpu te -c, dupont) are laminated and each side of the tpu is bonded with heat ( °c). after that, the stretchable electronic strips can be integrated into one of the textile channels through manual weaving (supplementary fig. a) . every sensor is exposed through the opening and glued to the textile with washable fabric glue (ok to wash-it, aleene). four power and signal wires from the main hub were threaded to every end of these strips to connect the microprocessor to all available sensors (supplementary fig. ). four digitally knit fabric patches were cut in cm × cm and used as samples for tensile strength test using a commercial mechanical tester machine (instron ). the samples were extended with a speed of mm/min using a . 
kn load cell. load and extension data were recorded until the samples ruptured. we consider a typical stretch range for compression garments, which is the first portion of a load-extension curve ( - %) to calculate the rigidity of our fabric . compression pressure modeling to model the pressure in a compressive garment, we first define the rigidity of the elastic fabric material as where t is the fabric tension per unit length in gf/cm and st is the fabric extension. assuming that we have a tubular fabric covering a cylindrical tube, the fabric extension and the size of the reduction (re) are given by where r is the radius of a cylindrical tube and r is the radius of tubular elastic fabric (r > r). by applying laplace's law, the pressure (p) in gf/cm can be defined as expressing c as the circumference of the cylindrical tube gives us substituting parameters in eqs. ( ) and ( ) with eq. ( ) results in re since the human body model is not a perfectly cylindrical tube, we define a compression factor to define relationship between the circumference of the human body and cylindrical tube rearranging eq. ( ) into eq. ( ) gives us the final pressure value of elastic fabric for compressive garment purposes in order to find the compression pressure throughout the body, we initially need to study the tensile properties and calculate the rigidity of our fabric material. we consider a typical stretch range for compression garments, which is the first portion of a load-extension curve ( - %), to calculate the rigidity of our fabric. by using eq. ( ), the rigidity of each fabric is calculated to be . , . , . , and . gf, respectively ( supplementary fig. a) . a study on human subjects revealed that compression factor (cf) for the upper limb of a human body is . . using the aforementioned values in eq. ( ), we can then estimate the pressure of elastic fabric around the upper limb region of the human body. 
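The equations referred to in the compression pressure modeling paragraph did not survive in this text. The block below is a hedged reconstruction of the cylindrical-tube part of the derivation, built only from the definitions given in the prose. It assumes that R denotes the radius of the cylindrical tube and r the smaller relaxed radius of the tubular fabric (the capitalization is an assumption, since the text distinguishes a larger and a smaller radius), that EI is used as the symbol for rigidity, and that extension and reduction are measured relative to the fabric and tube radii respectively. The compression factor step is omitted because its defining relation is not recoverable from the text. The block requires the amsmath package.

```latex
\begin{align*}
EI  &= \frac{T}{S_t}
  && \text{rigidity: tension per unit length over extension} \\
S_t &= \frac{R - r}{r}, \qquad R_e = \frac{R - r}{R}
  && \text{fabric extension and size of reduction} \\
S_t &= \frac{R_e}{1 - R_e}
  && \text{since } 1 - R_e = \frac{r}{R} \\
P   &= \frac{T}{R} = \frac{2\pi T}{C}
  && \text{Laplace's law with } C = 2\pi R \\
P   &= \frac{2\pi \, EI \, R_e}{C \,(1 - R_e)}
  && \text{substituting } T = EI \, S_t
\end{align*}
```

With T in gf/cm and C in cm, P comes out in gf per square centimetre. Under these assumptions, pressure rises with fabric rigidity and with the pattern reduction, and falls as the circumference of the covered body part grows.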
to assess the reliability and electromechanical performance of the serpentine interconnects, we performed two types of tests. the first test is one-time uniaxial stretching until substrate breakage and conductor rupture. supplementary fig. a demonstrates the setup for this test. a commercial mechanical tester (instron ) with a . kn load cell was used. load and extension data were recorded using a crosshead speed of mm/s until % extension of the original length of the samples. the prepared samples were interconnect modules with two serpentine lines and dimensions of mm × mm. resistance was measured with an lcr meter (e a, national instrument) connected to the integrated sensor leads with probes. via a common i/o interface (bnc- , national instruments), the load, extension, and resistance data were synchronously obtained and logged. the sensors running horizontally through the strips are connected by four thin copper wires, which are aligned and routed vertically through the seams and connected to the main hub (metawearr, mbientlab) via the i c protocol. we sewed conductive snaps that function as a textile-hardware connector to link the i c pins on the micro-controller to the i c wires on the textile. the pluggable mechanism allows the hardware to be removed while the battery is charging. through wireless bluetooth communication, a computer can access all of the sensor addresses and log their data accordingly. these data can then be stored or visualized in real time with the python matplotlib and pygame libraries. a temperature sensor was embedded in a piece of fabric and encapsulated by a thermally conductive epoxy (pe- , dupont) and thermoplastic polyurethane (te- c, dupont). after being embedded into the fabric, the electrically packaged sensor was placed in direct contact with the surface of a hot plate. the sensor was heated from °c to °c on the hot plate at a ramp rate of °c/h.
while the temperature on the anodized aluminum plate ramped up and was being recorded ( hz) by a highaccuracy ir camera (pi i, optris) with thermal sensitivity of mk and an accuracy of ± %, a micro-controller (arduino uno) simultaneously gathered data from all three flexible temperature sensors ( hz) and logged these sensor data to a computer. the ir temperature data from a point near the flexible temperature sensors on an anodized al plate were then compared to the sensor temperature data (n = ) at every °c elevation of temperature. we assembled, encapsulated, and integrated an accelerometer module within a customized fabric (fig. j) . the simultaneous seismocardiography and electrocardiography test was performed while the subject was laying on a bed in a relaxed state. we sampled the accelerometer-embedded fabric's z-axis data ( hz), which was wired to an arduino uno through i c communication, alongside a commercial ecg ( hz) and respiration ( hz) strap (zephyr biopatch, medtronic) as a cross-validation device. the electronic textile patch was connected to a ble module (metawearr, mbientlab) that is sealed inside a tube with clear silicone glue (rtv silicone, dynatex). this setup then went through a full washing cycle. as shown in supplementary video , the "delicate and knit" option was chosen, and logging and real-time streaming of sensor data from inside the industrial washing machine (mhn pdcww , maytag washer) during the complete cycle was performed. throughout the washing with g of standard detergent (ultra stain release, tide), the textile patch underwent an initial warm wash, three cycles of rinsing, two cycles of draining, and a dry spin at the end. after that, the electronic patch was dried for an hour by exposing it to warm airflow generated by w ceramic portable heater (cd , lasko) in the high setting. measurements of daily water vapor transmission in this work follow the standards as described in astm e . 
four mm diameter by mm height glass petri dishes were prepared, each filled with g of water. three fabric samples from % cotton fabric, % polyester and % spandex sports fabric, and our own % high-flex polyester fabric were cut and sealed to each dish opening with rubber bands ( supplementary fig. a) . accumulated weight loss of each dish was measured daily for eight days at room temperature ( °c) and % humidity, using a precision analytical scale (me te, mettler toledo). this weight loss (Δw) is the amount of water vapor that has transmitted through the fabrics and evaporated. the water vapor transmission rates (wvtr) can be calculated as follows: where Δw is the slope of the weight change in grams (g) every after h and a is the transmission surface area in m . we performed an activity test on a subject wearing the tailored e-tecs (fig. a) . all experiments were conducted in compliance with the guidelines of irb and were reviewed and approved by the massachusetts institute of technology committee on the use of humans as experimental subject (couhes protocol ). a male volunteer with no prior medical history of chronic cardiovascular, skin, mental health disease, or physical disability was recruited for participation in this test, and informed, signed consent including consent of photography during the test was obtained from the individual after passing the pre-screening procedure. the subject was asked to stand still on a treadmill for min before commencing the physical exercise test (supplementary video ) . the subject then started to run at a graded load of mph for min, before slowing down to mph for min. finally, the subject stopped the treadmill and rested by standing for min until the test ended. during the entire test, the e-tecs accessed, captured, and sent multi-nodal body temperature ( hz) and imu ( hz, accelerometer x, y, and z-axis) data to a computer through ble communication for logging. 
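The water vapor transmission rate described in the breathability method above can be sketched as a slope divided by an area. The example below is a minimal illustration under stated assumptions: the weight-loss slope is taken per day (the averaging interval is not legible in this text), the dish area and the weight-loss series are made-up values, and `least_squares_slope` and `wvtr` are hypothetical helper names rather than the authors' code.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line through the points (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def wvtr(days, cumulative_loss_g, area_m2):
    """Water vapor transmission rate in g per square metre per day.

    `cumulative_loss_g` is the accumulated weight loss of one dish,
    and `area_m2` is the open transmission area of the dish.
    """
    return least_squares_slope(days, cumulative_loss_g) / area_m2


# Illustrative inputs only (not measured data).
days = [0, 1, 2, 3, 4]
loss_g = [0.0, 2.0, 4.0, 6.0, 8.0]   # a steady 2 g lost per day
area_m2 = 0.005                      # assumed dish opening area

rate = wvtr(days, loss_g, area_m2)
print(rate)
```

Fitting a slope over all days, rather than differencing two weighings, makes the estimate less sensitive to a single noisy measurement. Relative breathability between two fabrics is then the ratio of their rates.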
the subject then repeated the same test unclothed, without the e-tecs, for validation with an ir camera (duo r, flir). a finite impulse response (fir) high-pass filter with f s of hz, f pass of hz, f stop frequency of hz, d pass of . , and d stop of . , where d is the deviation (ripple) vector, is used to process the raw data by eliminating low-frequency respiratory waveforms. for the respiratory waveform, an fir low-pass filter with f s of hz, f pass of hz, f stop frequency of hz, d pass of . , and d stop of . is used instead to eliminate high-frequency signals due to heartbeat events and to obtain the dc component of the signals. the data that support the findings of this study are available from the authors on reasonable request. the authors declare that the data supporting the findings of this study are available within the article and the corresponding supplementary information file. the custom code and mathematical algorithm that support the findings of this study are available from the authors on reasonable request.
recent advances in flexible and stretchable bio-electronic devices integrated with nanomaterials recent progress in flexible and stretchable piezoelectric devices for mechanical energy harvesting, sensing and actuation soft electronics for the human body towards personalized medicine: the evolution of imperceptible health-care technologies rugged and breathable forms of stretchable electronics with adherent composite substrates for transcutaneous monitoring stretchable silicon nanoribbon electronics for skin prosthesis ultraflexible organic photonic skin conformable amplified lead zirconate titanate sensors with enhanced piezoelectric response for cutaneous pressure monitoring soft, skin-interfaced wearable systems for sports science and analytics monitoring of vital signs with flexible and wearable medical devices materials and mechanics for stretchable electronics weaving integrated circuits into textiles a smart textile based facial emg and eog computer interface fully integrated ekg shirt based on embroidered electrical interconnections with conductive yarn and miniaturized flexible electronics fabrickeyboard: multimodal textile sensate media as an expressive and deformable musical interface woven temperature and humidity sensors on flexible plastic substrates for e-textile applications textile integrated sensors and actuators for near-infrared spectroscopy an electronic nose on flexible substrates integrated into a smart textile diode fibres for fabric-based optical communications wearables: fundamentals, advancements, and a roadmap for the future wearable sensors for remote health monitoring a wireless body area sensor network based on stretchable passive tags battery-free, wireless sensors for full-body pressure and temperature mapping the use of infrared thermography to detect the skin temperature response to physical activity the clinical significance of infrared thermography for the prediction of postherpetic neuralgia in acute herpes zoster patients 
highly flexible, wearable, and disposable cardiac biosensors for remote and ambulatory monitoring. npj digit a wireless modular multi-modal multinode patch platform for robust biosignal monitoring towards woven logic from organic electronic fibres woven electronic fibers with sensing and display functions for smart textiles sensortape: modular and programmable d-aware dense sensor network on a tape d soft modular electronic blocks (smebs): a strategy for tailored wearable health-monitoring systems modular and reconfigurable stretchable electronic systems adaptive and responsive textile structures (arts) smart clothes-the unfulfilled pledge? bluetooth low energy-based washable wearable activity motion and electrocardiogram textronic monitoring and communicating system biowatch: estimation of heart and breathing rates from wrist motions theory and developments in an unobtrusive cardiovascular system representation: ballistocardiography epidermal mechano-acoustic sensing electronics for cardiovascular diagnostics and human-machine interfaces annual international conference of the ieee engineering in medicine and biology-proceedings sensorknit: architecting textile sensors with machine knitting. d print compression garments for medical therapy and sports. 
polymers (basel) the effect of pressure on skin temperature measurements for a disk sensor designing with stretch fabrics preliminary development of a wearable device for dynamic pressure measurement in garments stretchable inorganic-semiconductor electronic systems development of a thin-film stretchable electrical interconnection technology for biocompatible applications printed circuit board technology inspired stretchable circuits in improving comfort in clothing obtaining repeatability of natural extended upper body positions: its use in comparisons of the functional comfort of garments mechanical responses of filled thermoplastic elastomers stress-strain behavior of thermoplastic polyurethanes stretchability and compliance of freestanding serpentine-shaped ribbons the influence of double layer knit fabric structures on air and water vapor permeability standard test methods for water vapor transmission of materials thermal imaging of exercise-associated skin temperature changes in trained and untrained female subjects regional skin temperature response to moderate aerobic exercise measured by infrared thermography fully integrated wearable sensor arrays for multiplexed in situ perspiration analysis high resolution d thermal imaging using flir duo r sensor exercise-induced interleukin- and metabolic responses application of thermodynamics to biological and materials science pressure model of elastic fabric for producing pressure garments the authors acknowledge angela chen, jordi montaner, and andy su for technical support in digital knitting, and zoro zheng for technical support in pcb fabrication. the authors declare no competing interests. supplementary information is available for this paper at https://doi.org/ . 
/ s - - -y. correspondence and requests for materials should be addressed to c.d. open access this article is licensed under a creative commons attribution . international license.
key: cord- -pm i mb authors: du preez, andrea; law, thomas; onorato, diletta; lim, yau m.; eiben, paola; musaelyan, ksenia; egeland, martin; hye, abdul; zunszain, patricia a.; thuret, sandrine; pariante, carmine m.; fernandes, cathy title: the type of stress matters: repeated injection and permanent social isolation stress in male mice have a differential effect on anxiety- and depressive-like behaviours, and associated biological alterations date: - - journal: transl psychiatry doi: . /s - - - sha: doc_id: cord_uid: pm i mb chronic stress can alter the immune system, adult hippocampal neurogenesis and induce anxiety- and depressive-like behaviour in rodents. however, previous studies have not discriminated between the effect(s) of different types of stress on these behavioural and biological outcomes.
we investigated the effect(s) of repeated injection vs. permanent social isolation on behaviour, stress responsivity, immune system functioning and hippocampal neurogenesis, in young adult male mice, and found that the type of stress exposure does indeed matter. exposure to weeks of repeated injection resulted in an anxiety-like phenotype, decreased systemic inflammation (i.e., reduced plasma levels of tnfα and il ), increased corticosterone reactivity, increased microglial activation and decreased neuronal differentiation in the dentate gyrus (dg). in contrast, exposure to weeks of permanent social isolation resulted in a depressive-like phenotype, increased plasma levels of tnfα, decreased plasma levels of il and vegf, decreased corticosterone reactivity, decreased microglial cell density and increased cell density for radial glia, s β-positive cells and mature neuroblasts—all in the dg. interestingly, combining the two distinct stress paradigms did not have an additive effect on behavioural and biological outcomes, but resulted in yet a different phenotype, characterized by increased anxiety-like behaviour, decreased plasma levels of il β, il and vegf, and decreased hippocampal neuronal differentiation, without altered neuroinflammation or corticosterone reactivity. these findings demonstrate that different forms of chronic stress can differentially alter both behavioural and biological outcomes in young adult male mice, and that combining multiple stressors may not necessarily cause more severe pathological outcomes. animal research has been paramount in investigating the association between stress and major depressive disorder (mdd) . exposure to individual or multiple stressors in rodents can alter the immune system , and the hypothalamic-pituitary-adrenal (hpa) axis , , decrease hippocampal neurogenesis , and induce anxiety-and depressive-like behaviours , . 
these areas of research have predominately utilized either (i) unpredictable chronic mild stress (ucms) models, which incorporate a variety of both physical and psychosocial stressors , or (ii) a social-stress-based model (e.g., social isolation or social defeat stress), which employ predominately psychosocial stressors , . however, one main limitation to this area of research is that there is no discrimination between the types of stress used, as most studies predominately incorporate either one or a combination of both these stress types. thus, the distinction between the effect of different types of stress on animal behaviour and physiology has been largely unexplored. this idea is not entirely novel in a clinical setting, with research showing that psychological, sexual and physical maltreatment can differentially affect mental health [ ] [ ] [ ] [ ] . for example, hodgdon et al. recently demonstrated that psychological, sexual and physical abuse lead to distinct behavioural profiles and this research warrants being investigated in a preclinical setting, which, to our knowledge, is yet to be done. thus, is one type of stress exposure more potent than another and, if so, does this matter? this has already been partially addressed by a few preclinical studies showing how only ucms and/or predator stress, but not restraint stress, can induce depressive-like behaviour , . it also has been shown that exposure to psychosocial stress leads to a much wider range of depressive-like behaviours and more pronounced increases in pro-inflammatory cytokine profiles than exposure to predominately physical stress and/or ucms , . although these lines of research have shed some light on the impact of different types of stress, only one of these studies directly compared different stress paradigms in the same study and, moreover, was limited in the biological domains assessed, having focused specifically on the neuroendocrine system . 
therefore, a more comprehensive evaluation is needed, one that focuses on multiple behaviours and biological outcomes resembling the complexity of mdd. therefore, with this in mind, we aimed to investigate the effect of two well-established stressors, social isolation , and repeated injections [ ] [ ] [ ] , both for weeks in young adult male animals. although depression can develop at any age between early childhood and older adulthood, both national and cross-national epidemiological studies report that the first onset of depression most frequently occurs in the s to early s [ ] [ ] [ ] , and thus the animal equivalent of a young adult was purposely selected for study. we measured anxiety-and/or depressive-like behaviour and corticosterone responsivity, systemic inflammation, neuroinflammation and hippocampal neurogenesis-biological outcomes all associated with mdd . we wanted to establish whether we can discriminate between the effects of repeated injection and permanent social isolation with respect to the behavioural and biological outcomes, and whether combining these stressors alters the severity of the associated behavioural and/or biological outcomes. experiments were conducted with male balb/canncrl mice (n = ), aged - weeks, weighing - g, obtained from charles river laboratories (margate, kent, uk). animals were housed under standard conditions ( - °c, humidity %, : h light : dark cycle with lights on at : , food and water ad libitum) and had a -week period, prior to stress exposure, to acclimatize to the biological services unit and experimenter. during habituation, all animals were housed in sibling pairs. all housing and experimental procedures were carried out in compliance with the local ethical review panel of king's college london and the uk home office animals scientific procedures act . for the rationale behind the chosen strain, sex and age of the mice used, see supplementary material. 
mice were exposed to one of three stress treatments for a -week duration. each treatment comprised one or two distinct stressors that were either in the form of repeated injection, which has been previously shown to differentially alter stress responses in balb/c mice and affective outcomes in outbred rats with high and low emotional reactivity, or permanent social isolation, which has consistently been associated with depressive-like phenotypes. in brief, three groups of mice were exposed either to repeated injection only (group ), permanent social isolation only (group ) or a combination of both stressors (hereafter referred to as combined stress; group ), whereas one group remained stress free (group ). group sample sizes were based on numbers typically used in animal research using chronic stress models. animals, either singly or as a sibling pair, were randomly assigned by a chance procedure to their respective groups after the initial habituation period. for all forms of assessment, the experimenter was blinded to group status and all animals were handled regularly. figure defines in more detail the experimental groups and depicts the experimental timeline and housing conditions. intraperitoneal injections were carried out according to recommended guidelines and as previously described. briefly, animals were scruffed and injected with saline volumes of ml/kg body weight administered between : and : daily. injection-naive animals were handled but not injected. for details on the injection procedure, see supplementary material. body weight and food intake were measured weekly during the first weeks of stress exposure (fig. ). for details on these measures, see supplementary material. anxiety-like behaviour was assessed using the open-field test (oft) and novelty suppressed feeding test (nsft), whereas depressive-like behaviour was measured using the sucrose preference test (spt) and porsolt swim test (pst), all as previously described.
behavioural testing was carried out after weeks of stress exposure (fig. ) . for details on the general conditions on all behavioural testing and the methods for all behavioural assays, see supplementary material. blood sampling, processing and storage whole blood, from a tail cut, was collected between : and : , both h before and min after acute stress exposure, i.e., the pst, at the end of behavioural testing. samples were centrifuged at r.p.m. for min at °c and plasma was removed and stored at − °c. for details on blood sampling, see supplementary material. corticosterone levels were measured in duplicates from plasma samples collected at both blood collection points using commercially available enzyme-linked immunosorbent assay (elisa) kits (enzo life sciences, switzerland), according to the manufacturer's instructions. absorbance values were converted into concentrations using cubic spline four parameter logistics. plasma cytokine levels were determined using the multiplex screening assay based on magnetic luminex® xmap® technology as previously described . using a custom-made mouse premixed multi-analyte magnetic luminex screening assay (r&d systems, minneapolis, usa), levels of interleukin (il)- β, il , il , il , il , il , c-reactive protein, vascular endothelial growth factor (vegf), insulin-like growth factor and tumour necrosis factor (tnf)-α, in blood plasma collected at both blood collection points, were measured according to the manufacturer's instructions. median fig. schematic representation of the experimental procedure. 
mice were exposed to one of four experimental conditions after an initial week habituation period: (i) stress-free control group (socially housed in pairs and injection naive; group ); (ii) repeated injection stress (socially housed in pairs and repeatedly injected; group ); (iii) social isolation stress (permanent social isolation and injection naive; group ); and (iv) combined stress (permanent isolation and repeatedly injected; group ). behavioural testing began after weeks of treatment and was carried out under stringent environmental control, using standardized test procedures, between : and : , unless otherwise stated. blood was collected h before and min after acute stress exposure, i.e., the porsolt swim test, in the final week of behavioural testing. animals were culled h later and brain tissue extracted. a housing for sibling pair-housed animals (groups and ): large tecniplast cages ( × × cm) were set up with sawdust, a red house, white house, nesting material and a plastic enrichment tube. larger cages with more nesting material, shelter and enrichment were deemed necessary to reduce aggression between sibling pair-housed animals. b housing for singly-housed animals (groups and ): small tecniplast cages ( × × cm) were set up with sawdust, a white house and nesting material. note: a larger sample size was used for control animals (n = ) given the risk of group housing such an aggressive strain (deacon, ; charles river laboratories international, inc., ; www.criver.com).

for details on the elisa and luminex protocols, inter- and intra-assay variability, and kit sensitivity, see supplementary material. brain tissue collection and sectioning animals were transcardially perfused and brains quickly extracted as previously described. all brain tissue was coronally sectioned at a thickness of μm using a leitz freezing microtome (microm hm , carl zeiss ltd, cambridge, uk) as previously described. for details on brain tissue collection, sectioning and storage, see supplementary material.
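the elisa step described earlier in the methods converts absorbance values into concentrations through a four-parameter logistic (4pl) standard curve. the sketch below illustrates only the inversion of an already fitted 4pl curve; the function names and the parameter values are hypothetical and are not taken from the kit manual or from this study.

```python
def four_pl(conc, a, b, c, d):
    """4PL response for a given concentration.

    a: response at zero concentration
    d: response at infinite concentration
    c: inflection point (concentration giving the mid-point response)
    b: slope factor
    """
    return d + (a - d) / (1.0 + (conc / c) ** b)


def inverse_four_pl(response, a, b, c, d):
    """Concentration that produces `response` on the fitted 4PL curve."""
    lower, upper = min(a, d), max(a, d)
    if not (lower < response < upper):
        # Readings at or beyond the asymptotes cannot be interpolated.
        raise ValueError("response lies outside the curve asymptotes")
    return c * ((a - d) / (response - d) - 1.0) ** (1.0 / b)


# Hypothetical fitted parameters (assumption, for illustration only).
# a > d mimics a competitive assay, where signal falls as concentration rises.
STANDARD_CURVE = {"a": 2.0, "b": 1.2, "c": 500.0, "d": 0.1}

absorbance = four_pl(250.0, **STANDARD_CURVE)
recovered = inverse_four_pl(absorbance, **STANDARD_CURVE)
print(round(recovered, 6))
```

in practice the four parameters would be estimated from the kit standards run on each plate, and duplicate wells would be averaged before or after conversion according to the manufacturer's instructions.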
proliferative cells, immature neurons, microglia, astrocytes and mature astrocytes were visualized using ki , doublecortin (dcx), ionized calcium-binding adapter molecule- (iba ), glial fibrillary acidic protein (gfap) and s β, respectively, using free-floating immunohistochemistry as described previously . for protocol details, antibodies used and representative images, see supplementary material and supplementary fig. . the cell density of immunopositive ki , dcx, iba , gfap and s β cells in the prefrontal cortex (pfc) and/ or the dentate gyrus (dg) of the hippocampus were estimated by stereological analysis with stereoinvestigator software (mbf bioscience, williston, vt) using the optical fractionator module as previously described . for details on the stereological methods, see supplementary material. whole-slide digital images were provided by the ucl iqpath slide scanning service, using a leica scn f scanner. representative images of the hippocampus and the pfc were captured with aperio imagescope software v . . (leica biosystems, uk) at × magnification. to quantify and compare relative levels of immunoreactivity for iba -and gfap-positive cell staining in the dg and pfc, thresholding analyses using imagej software v . were performed as described previously , . see supplementary material for details of the thresholding methodology. using imagej software v . sholl and skeleton analyses were performed as previously described [ ] [ ] [ ] . for all analyses, iba cells were randomly selected across three hippocampal sections per mouse. three mice per experimental group were used for morphological quantification, equating to a total of iba cells per group. for details of the morphometric methodology, see supplementary material. 
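sholl analysis, as used above for iba cell morphology, counts how many cell processes cross concentric circles drawn at increasing distances from the soma. the sketch below is a simplified illustration and not the imagej procedure used in the study: it assumes each process has already been traced as short straight segments in two dimensions, and the function name and example coordinates are hypothetical.

```python
import math


def sholl_intersections(segments, soma, radii):
    """Count process crossings of concentric circles centred on the soma.

    segments: list of ((x1, y1), (x2, y2)) straight pieces of traced processes
    soma: (x, y) centre of the cell body
    radii: iterable of circle radii
    A segment is counted for a radius when its two end points lie on opposite
    sides of that circle. Assumption: segments are short, so a segment that
    enters and leaves a circle with both ends outside it is not counted.
    """
    counts = {}
    for radius in radii:
        crossings = 0
        for start, end in segments:
            start_inside = math.dist(start, soma) < radius
            end_inside = math.dist(end, soma) < radius
            if start_inside != end_inside:
                crossings += 1
        counts[radius] = crossings
    return counts


# Hypothetical traced cell: one primary process that branches in two,
# plus one short unbranched process.
example_segments = [
    ((0, 0), (10, 0)),
    ((10, 0), (20, 5)),
    ((10, 0), (18, -6)),
    ((0, 0), (0, 8)),
]
profile = sholl_intersections(example_segments, soma=(0, 0), radii=[5, 15, 25])
print(profile)
```

a more ramified cell gives higher counts across more radii, whereas a cell with few, short processes gives low counts that fall to zero close to the soma.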
given that gfap is also a marker for neural stem cells , to determine whether changes observed in gfap immunoreactivity were related specifically to astrocytes, sections were fluorescently double-labelled to determine the extent of co-localization of gfap with sox -a neural stem cell specific marker . immunofluorescent double labelling was carried out as previously described . images of double-labelled sections were obtained using a leica sp confocal microscope. specifically, an objective of × (oil immersion, na . ) was used for each image captured and wavelength laser lines of nm (diode laser), nm (argon laser) and nm (hene laser) were used. all images were acquired as confocal stacks of ten images separated by a . μm z-axis step size. each image was taken at a resolution of × pixels, with the picture dwell time set to . , giving a rate of . zplanes per second, and each frame was averaged four times to reduce signal noise. additional settings of gain, offset and pinhole size were optimized prior to imaging and were held constant for all images at v, v and . airy unit, respectively. for each confocal stack, the percentage of doublelabelled cells for gfap and sox in the dg of the hippocampus was determined using methods as outlined previously . briefly, gfap-positive cells per mouse were counted and the percentage of radial glial cells (gfap+/sox +) and astrocytes (gfap+/sox −) were calculated. for each mouse, cells were counted from a total of acquired z-stacks across hippocampal sections. dcx-positive cell morphology in the dg was visually classified as previously described . in brief, dcx-positive cells were classified according to four neuroblast subtypes based on their level of maturation, and the cell density for each neuroblast cell type was determined using stereological analyses as aforementioned. for details on the classification process, see supplementary material and supplementary fig. . statistical comparisons were conducted in ibm spss statistics v. 
(ibm ltd, portsmouth, uk) and consisted of independent samples t-tests, one-way or two-way analyses of variance (anova), repeated-measures two-way anova, three-way factorial anova, repeated-measures generalized linear mixed modelling, mann-whitney, or kruskal-wallis, followed by bonferroni or dunn's post hoc analyses where appropriate. all data were assessed for normality using probability-probability plots and the kolmogorov-smirnov test, and for homogeneity of variance using levene's test. for data that did not conform to normality and/or homoscedasticity, non-parametric statistical tests were applied. all tests carried out were two-sided and the alpha criterion used was p < . . data are represented as the mean and sem or the median and interquartile range. all animals were included in all analyses. for a more detailed description of the analytical approach, see supplementary material. we first assessed the behavioural effects associated with chronic stress exposure and found that all stress-exposed mice, irrespective of the type of stress, exhibited significantly increased anxiety-like behaviour in the oft (fig. a) relative to control animals, as shown by significantly less time spent in the centre of the oft arena (− % in repeatedly injected animals, − % in the socially isolated mice and − % in animals exposed to combined stress; fig. a). moreover, all groups of stress-exposed mice displayed an increased latency to feed in the nsft relative to control animals (fig. b). however, because these mice also showed significantly increased anxiety in the oft, and because the nsft is run in a novel arena, the nsft was heavily confounded by anxiety. therefore, it was not possible to use this test to specifically assess anhedonia. interestingly, only mice exposed to permanent social isolation exhibited signs of significant depressive-like behaviour when compared with controls (fig.
c, d), as shown by increased anhedonic-like behaviour (exposed animals consuming % less sucrose in the spt; fig. c), and behavioural despair (exposed animals spending % more time immobile in the pst; fig. d). notably, behaviour in the oft and nsft was not confounded by locomotor activity (supplementary fig. a, b), and the observed differences in the spt were not attributable to either total liquid consumption (supplementary fig. d) or food consumption (supplementary fig. e). moreover, the observed differences in the behavioural assays were not confounded by olfactory ability (supplementary fig. c) or by social dominance (supplementary table ). for further detail on all other behavioural readouts, see supplementary fig. and for a full summary of all the behavioural changes associated with the three chronic stress exposures, see supplementary table .

chronic stress does not alter baseline corticosterone levels after stress exposure, but the type of stress differentially alters corticosterone responsivity

we next sought to uncover some of the biological alterations associated with these behavioural perturbations in stress-exposed mice, focusing first on peripheral changes related to hpa axis functioning. we found no significant differences in baseline corticosterone levels across experimental groups after chronic stress exposure (i.e., h before acute stress), and all mice showed a % average increase in plasma corticosterone in response to the pst (fig. a). interestingly, corticosterone levels min after the acute stress did significantly differ between experimental groups (fig. a). on average, levels increased by threefold after the pst; however, mice exposed to repeated injection had a % larger increase in corticosterone levels compared with controls, whereas socially isolated mice had a % smaller increase in corticosterone levels compared with controls, and mice exposed to combined stress had a stress response similar to that of controls (fig. a).
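the responsivity comparisons above rest on two pieces of arithmetic: the change in corticosterone from baseline to post-stress within a group, and how the size of that increase compares with the increase seen in controls. the sketch below makes both explicit; the function names are hypothetical and the concentrations are invented for illustration, they are not values from this study.

```python
def percent_change(baseline, post):
    """Percentage change from the baseline to the post-stress concentration."""
    return (post - baseline) / baseline * 100.0


def relative_to_control(group_increase, control_increase):
    """How much larger (+) or smaller (-) a group's increase is, in percent,
    compared with the increase observed in control animals."""
    return (group_increase - control_increase) / control_increase * 100.0


# Invented example values (arbitrary units), NOT data from the study.
control_baseline, control_post = 40.0, 160.0
stressed_baseline, stressed_post = 40.0, 200.0

control_increase = control_post - control_baseline
stressed_increase = stressed_post - stressed_baseline

print(percent_change(control_baseline, control_post))
print(relative_to_control(stressed_increase, control_increase))
```

expressing each group's increase relative to the control increase separates a change in reactivity from a change in baseline, which matters here because baseline levels did not differ between groups.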
for a detailed summary of all the peripheral corticosterone data for each of the three chronic stress exposures, see supplementary table .

repeated injection decreases tnf-α and il ; social isolation increases tnf-α, but decreases il β, il , il , il and vegf; and injection and social isolation combined decreases il β, il and vegf

we next examined the impact of chronic stress exposure on peripheral inflammatory profiles, both at baseline (i.e., h before the pst) and after the acute stress. interestingly, we found that there was very little pre vs. post effect of the acute stress on any of the cytokines (fig. b-g). specifically, only socially isolated mice exhibited significant differences in cytokine plasma levels post-pst relative to baseline, with a decrease in il and an increase in vegf levels in response to acute stress. in terms of between-group comparisons vs. control mice, mice exposed to repeated injection had significantly lower baseline tnf-α and post-pst il levels (fig. b, e). conversely, socially isolated mice had significantly higher baseline and post-pst tnf-α levels, lower baseline and post-pst il levels, together with lower levels of baseline il β and vegf, and post-pst il and il (fig. b-g). mice exposed to combined stress had consistently lower il and vegf at baseline and post-pst, and lower il β levels post-pst (fig. c, e, g). intriguingly, no significant differences in il or il were found between (or within) experimental groups, pre- and/or post-pst (supplementary fig. a, b). for a detailed summary of all the inflammatory data for each of the three chronic stress exposures, see supplementary table . it is also noteworthy that none of the reported inflammatory changes were artifacts of social hierarchy (supplementary table ). given that glial cells play an important role in immune system functioning, we next examined both microglial and astrocyte biology in both the hippocampus and pfc, two particularly stress-sensitive regions.
repeated injection increases iba -positive cell density and promotes a dystrophic cell morphology in the ventral dg, whereas social isolation decreases iba -positive cell density and induces a ramified cell morphology in the dorsal dg looking at microglial biology first, we found that exposure to repeated injections and social isolation reduced iba immunoreactivity in the dg of the hippocampus, with a similar pattern, but not reaching statistical significance, for the combined stress (fig. a, b) . these effects in the dg had specific regional selectivity, with decreases that were significant in the ventral dg for the repeatedly injected mice (− %) and in the dorsal dg for the socially isolated mice (− %) (fig. a, b) . interestingly, none of the conditions affected iba -positive cell immunoreactivity in the pfc (supplementary fig. a, b) . to further understand the relevance of these changes in iba immunoreactivity, we next assessed cell density and morphology in these specific regions of the dg. interestingly, despite both repeated injections and social isolation showing a decrease in iba immunoreactivity, repeatedly injected mice had a % increase in iba cell density in the ventral dg (fig. c) , while socially isolated mice had a % decrease in iba cell density in the dorsal dg (fig. d) . moreover, although injected mice had an increased iba cell density in the ventral dg, the morphological characteristics of these cells were altered, such that they assumed a more dystrophic morphology, potentially explaining the decreased iba immunoreactivity data for this group (fig. e , f and supplementary table ). in direct contrast, iba cells of socially isolated animals assumed a more ramified morphology, with no difference in the amount of space each cell occupies observed ( fig. g and supplementary table ). it is also noteworthy that cell soma size and average process length was not different between experimental groups (supplementary table ). 
social isolation increases gfap-positive cell immunoreactivity and s β-positive cell density in the dg

in addition to investigating the impact of chronic stress on microglial biology, we also examined astrocyte biology in both the dg and pfc. interestingly, only social isolation stress increased gfap immunoreactivity in the dg relative to all other groups ( %), a change that was not dg region specific (fig. a, b). similar to the iba data, there were no differences in gfap immunoreactivity in the pfc between experimental groups (supplementary fig. c, d). as gfap is also expressed in neural progenitor cells, immunofluorescent double labelling was subsequently used to further characterize these cells in socially isolated and control animals. interestingly, our data revealed that all mice irrespective of exposure had a radial glia to astrocyte cell ratio of : (supplementary fig. ). given that only ~ % of gfap-positive cells in the dg were astrocytes (i.e., controls: %; isolated: %), stereological cell density estimates were next obtained to determine whether the observed differences in gfap immunoreactivity were related to one or both gfap-positive cell types. specifically, we found that social isolation significantly increased the number of radial glial cells only, in both the dorsal ( %) and ventral dg ( %) (fig. c, d). as only a small proportion of gfap-positive cells were a astrocytes in the context of our work, s β-positive cell density, typically a marker of a astrocytes, was next examined. similar to gfap, we found that social isolation significantly increased s β-positive cell density in both the dorsal ( %) and ventral dg ( %) (fig. e, f). finally, given the associations between hippocampal neurogenesis, depression, stress and inflammation, we investigated the impact of the three chronic stress paradigms on both proliferation and differentiation in the dg. we found that chronic stress exposure did not alter hippocampal volume (supplementary fig.
a) or ki cell density in the dg (supplementary fig. b, c) for any of the exposure groups. however, mice exposed to repeated injection, or the combined stress, did show a decrease in overall dcx-positive cell density (− %) in the dorsal dg relative to controls (fig. a, b). surprisingly, socially isolated mice had an overall dcx-positive cell density similar to controls. to gain a better understanding of whether these reductions in dcx cell density were related to all dcx cell types, or whether they were specific to a particular stage of dcx maturation, we classified all dcx-positive cells into one of four morphology types. compared with control mice, we found a decrease in all neuroblast cell types (− %) and in early post mitotic cells (− %) in the dorsal dg for repeatedly injected mice and mice exposed to combined stress, respectively. interestingly, we also found an increase in post mitotic stage neuroblasts ( %), in the ventral dg, in socially isolated animals. no differences in intermediate stage neuroblast cell density were found for any of the exposure groups (fig. c). none of the reported brain-related changes were artifacts of social hierarchy (supplementary tables and ), and for a full summary of all these changes, see supplementary table .

fig. effect(s) of repeated injection, social isolation and combined stress on corticosterone responsivity and systemic inflammation. a mean (±sem) plasma corticosterone levels before and after acute stress exposure. b mean (±sem) plasma tnf-α levels before and after acute stress exposure. c mean (±sem) plasma il β levels before and after acute stress exposure. d mean (±sem) plasma il levels before and after acute stress exposure. e mean (±sem) plasma il levels before and after acute stress exposure. f mean (±sem) plasma il levels before and after acute stress exposure. g mean (±sem) plasma vegf levels before and after acute stress exposure. *p < . ; **p < . ; ***p < . (adjusted p-values). analyses: repeated-measures two-way anova with bonferroni post hoc comparison.

b representative photomicrographs of the ventral (i) and dorsal (ii) dentate gyrus stained for iba for repeatedly injected and socially isolated mice, respectively, all relative to controls. images captured at × and × magnification, scale bar = μm. c mean (±sem) iba -positive cell density in the ventral dentate gyrus for repeatedly injected mice (t [ ] = . , p = . ). d mean (±sem) iba -positive cell density in the dorsal dentate gyrus for socially isolated mice (t [ ] = . , p = . ). e representative photomicrographs and associated sholl masks of iba cells in the ventral dentate gyrus of repeatedly injected and control animals. f representative photomicrographs and associated skeletons of iba cells in the ventral dentate gyrus of repeatedly injected and control animals. g representative photomicrographs and associated sholl masks of iba cells in the dorsal dentate gyrus of socially isolated and control animals. images e-g were taken at × magnification, scale bar = μm. in sholl masks, red denotes a greater degree of branching complexity, whereas blue denotes the lowest degree of branching complexity. *p < . ; **p < . ; ***p < . (adjusted p-values). analyses: two-way anova with bonferroni post hoc comparison or independent samples t-test.

for the first time, we show how exposure to different types of chronic stressors elicits distinct behavioural and biological phenotypes in both the periphery and the brain. overall, our results clearly demonstrate that the type of chronic stress exposure can indeed matter. our most interesting finding is that distinct types of chronic stress differentially alter hippocampal neuroinflammatory and neurogenic profiles, which may be the basis by which the different behavioural phenotypes ultimately manifest. based on our data, we believe that the neurogenic profiles observed are a functional consequence of the neuroinflammatory changes associated with each stress exposure, given that microglia and astrocytes play an important role in maintaining synaptic integrity. moreover, given that glucocorticoids and peripheral inflammation are the two most well-known pathways through which stress can affect microglia/astrocyte structure and functioning, we believe that altered hpa axis activity and cytokine levels, also found in our models, are the key systems involved in altering microglia/astrocyte/neurogenesis and ultimately behaviour. our data suggest that repeated injection promotes hpa axis hyperactivity (possibly related to the anxious phenotype), as shown by increased corticosterone reactivity to stress, together with lower tnfα and il levels, indicating a hypercortisolemia-associated inhibition of the peripheral immune system. moreover, hpa axis hyperactivity can lead to increased microglial activation, which is also seen in repeatedly injected mice, as indicated by the increased iba cell density, reflecting microglia rapidly proliferating once activated, and iba cell morphology resembling reactive microglia, given that these cells occupy less overall space, possess a shorter maximum process length, and have fewer processes and reduced branching ramification. interestingly, previous research demonstrates that activated microglia assume a phagocytotic role and contribute to the apoptosis of newborn neurons, and consistent with this we observe a concomitant decrease in neuronal differentiation in these same animals. given that there is no change in ki , which is predominately expressed during the earlier critical neurogenic period, and there is a decrease in more mature neuroblasts (but not in the intermediate stage), we can conclude that the majority of apoptosis is likely occurring during the later - week window, a critical period for newborn cell survival.
however, it is notable that we also see a significant reduction in proliferative stage neuroblasts, which supports that some cell death also occurs in the earlier critical period, although as apoptosis was not quantified in our study, this requires validation. contrary to repeated injection, permanent social isolation promotes hpa axis hypoactivity as shown by decreased corticosterone reactivity to stress, together with higher tnfα (i.e., an exact mirror image of the repeated injections), together with decreased il , il β, vegf and il . importantly, il is an important mediator for suppressing the immune response and directly inhibits the proliferation and production of tnfα . furthermore, increased tnfα and decreased il are both strongly associated with depressive-like behaviour and clinical depression - -a behavioural phenotype found only in our socially isolated animals. as with repeated injection, we believe that these different changes in hpa axis/immune system activity promote the microglial/astrocyte-associated abnormalities observed in socially isolated mice . for example, the reduced levels of corticosterone and the increased levels of pro-inflammatory cytokines promote microglial overactivation , , which can lead to increased microglial apoptosis and hyper-ramification , which we observe in our socially isolated animals, as indicated by a decrease in iba cell density and an increase in the internal complexity of these cells. moreover, we find an activation of neuroprotective a astrocytes, as shown by the increased gfap-and s β-positive cell density, the latter of which is specifically associated with a astrocytes , findings that have already been described in association with stress [ ] [ ] [ ] , including in our own in vitro work mimicking 'stress in a dish' with low concentration of glucocorticoids . 
as microglia and astrocytes are vital for synaptogenesis and/or synaptic integrity/maintenance [ ] [ ] [ ], and as their structure is intrinsically linked to their function, we speculate that the changes in neurogenesis following social isolation are functionally linked to the observed neuroinflammatory changes. indeed, in the socially stressed animals, we see a specific increase in mature neuroblasts, suggesting an overall impairment in synaptic pruning and/or apoptosis inhibition. given that we observe a reduced number of iba cells with an altered morphology, and an increase in a astrocyte activation, we believe that these glial cells are not able to adequately perform their function of synaptic pruning and/or overcompensate their neuroprotective-associated functions. further evidence to support these functional abnormalities comes from our finding that vegf is reduced in these animals, since vegf regulates synaptic pruning and synapse formation, as well as neurogenesis and microglia, and since astrocytes can directly secrete vegf. interestingly, we not only find a wider regional specificity of chronic stress exposure for both neuroinflammatory and neurogenic profiles between the pfc and the hippocampus, but also regional specificity within the dg. although the finding that regional differences exist is not novel, finding no impact of chronic stress on the pfc is contrary to previous research. however, the type and duration of stress exposure can have important implications on microglia and/or astrocyte biology, and we cannot rule out that changes in the pfc may have occurred earlier than measured in this study. although the basis for the regional specificity within the dg is unclear, one potential explanation could relate to the microglial response, which might occur at a different rate, or endure for different periods of time, in response to each of the stress exposures.
moreover, given that our differentially stressed animals respond so distinctively to acute stress, a differentially altered microglial response represents a parsimonious explanation and especially when glucocorticoids are well-known modulators of microglial function . it is also notable that the ventral hippocampus (relative to the dorsal) responds entirely differently to glucocorticoids, such that it has a reduced firing frequency accommodation and more depolarization-associated spikes . this differential response may allow for a longer window of acquisition when activated and given the prominent role of the ventral hippocampus in inhibiting the hpa axis, raises the possibility that the differential actions of glucocorticoids on synaptic function may be relevant to stress regulation. in our work, this could potentially contribute to the regional specificity observed for both neuroinflammatory and neurogenic profiles in stress-exposed animals. regarding a broader interpretation of the functional significance of the neurogenesis-related findings, it is wellknown that the dorsal dg plays a role in memory and cognition, while the ventral dg controls stress responsivity and emotional processing . therefore, it is not entirely surprising that dcx in the ventral region of the dg is specifically altered in the context of permanent social isolation, which promotes a robust depressive-like phenotype, and a decrease in stress responsivity. moreover, these preferential effects on the ventral dg have previously been reported [ ] [ ] [ ] [ ] [ ] . although our data on social isolation aligns with the functional relevance of the ventral dg, it is unknown whether this applies to repeated injection and its associated changes in the dorsal dg when cognitive behaviour was not measured. 
however, it is noteworthy that anxiety-like behaviour and cognitive impairment are closely associated in the context of chronic stress [ ] [ ] [ ] and future research should therefore measure cognitive functioning to extend our findings. unlike repeated injection and/or social isolation, we find that combined exposure to these stressors surprisingly does not induce depressive-like behaviour, alter microglial/astrocyte biology, or alter hpa axis activity, contrary to previous research , , , . perhaps the most powerful conclusion of our study is that there appears to be no synergistic, potentiating effect in combining such different stressors, and in fact the two specific phenotypes of hypercortisolemia/reduced inflammation (repeated injection) and hypocortisolemia/increased inflammation (social isolation) seem to cancel each other out, leading to no differences in these biological systems. however, we do observe a specific decrease in overall neuronal differentiation in these stressed animals, a change that altered cell death and/or cell proliferation must still ultimately account for. moreover, these animals also exhibit specific decreases in il , il β and vegf, and given that vegf is an important pro-neurogenic growth factor , , and that cytokines can independently modulate neurogenesis , , these changes in immune system functioning could account for the observed decrease in neurogenesis. pertinently, a lack of il specifically has previously been shown to promote anxiety-like, not depressive-like, behaviour , a behavioural outcome also observed in our combined stress-exposed mice. therefore, the observed decrease in il , together with reduced neurogenesis, could ultimately contribute to the aberrant behavioural profile of these animals. 
although we clearly demonstrate that different types of stress can indeed promote unique phenotypes, we believe that nociception could potentially explain these differential outcomes, especially as repeated injection initiates an acute pain response, whereas social isolation does not. the immune system, hpa axis and nociception are all intrinsically linked , , and therefore this could account for some of the observed biological and behavioural phenotypes associated with the two types of stress. for example, we specifically see reduced tnfα in repeatedly injected animals, and tnfα is an important inflammatory mediator of pain , the inhibition of which has been shown to alter pain perception . thus, the observed biological changes associated with our repeated injection paradigm could be a reflection of the pain response and/or a change in nociception sensitivity. pertinently, chronic pain has been consistently associated with anxiety , a behavioural outcome observed in our repeatedly injected animals. furthermore, preclinical studies using electric foot shock paradigms, which also elicit pain, likewise report an increase in anxiety-like behaviour , , and clinical studies using repeated pain stimulation show that, despite pain habituation, exposed individuals report increased anxiety . interestingly, previous research shows how repeated injection differentially alters affective outcomes in rats with high and low emotionality, with low responders showing no change in depressive-like states . thus, repeated injection in the context of our work could differentially alter the emotional reactivity of our inbred mice, such that our repeatedly injected animals become low emotional responders. this could also account for why injection stress seems to override the effects of social isolation within the combined paradigm and why no apparent additive effect was observed for these animals. 
regarding the independent impact of permanent social isolation, it is unsurprising that this particular stressor, which is more ethological in nature, promotes aberrant behavioural and biological changes across multiple domains, especially in a gregarious species. mice are social due to the various advantages that an organized social structure provides in terms of mating selection, resource allocation, and social status . therefore, by socially isolating animals that are biologically and evolutionarily suited to social living, it is understandable how the removal of social structure could impair physiology and/or physiological responses, which then ultimately manifest as aberrant behaviours. indeed, preclinical research consistently shows how chronically isolated rodents have increased anxiety- and depressive-like behaviour , , , together with altered hpa axis activity , , , and specific increases in the pro- vs. anti-inflammatory balance, i.e., increased tnfα/il and decreased il , , outcomes all observed in our socially isolated animals. moreover, in humans, who are also a social species, the lack of social support is an important risk factor for affective disorders and social isolation/loneliness has been consistently associated with depression [ ] [ ] [ ] [ ] . interestingly, a recent study even demonstrates how social integration can be protective against inflammation in young black women . although our work was designed to ensure maximum validity, it is unknown how generalisable our findings are relative to wider animal research, given that the species/strain, age and sex of the animal can impact both the behavioural and neurobiological phenotypes associated with stress exposure [ ] [ ] [ ] [ ] . 
due to financial and time constraints for such a huge programme of research, we only used young adult male animals for our research, which represents a limitation particularly when the prevalence of depression is significantly higher in women, in adolescence and old age [ ] [ ] [ ] , and when research using female rodents is underrepresented , . however, given that males are less susceptible to clinical depression, one could speculate that in the context of our work, female balb/c mice could be equally, if not more, sensitive to the impact of the different types of stress. future work should prioritise investigating the impact of different stressors in both male and female animals and in different strains/species, and it is our ambition to extend our work to female animals in the near future. moreover, it remains unknown whether the observed biological changes are direct consequences of chronic stress exposure, indirect changes from other biological system alterations, or simply adaptive compensation. by characterizing biological phenotypes at the end of treatment, based on a single snapshot, we cannot determine the temporal sequence of these changes, and more importantly, may have missed earlier biological alterations. future work should aim to collect behavioural readouts, blood samples, and tissue from parallel groups to determine the temporal sequence of these reported outcomes and more fully explore the mechanistic links. despite these limitations, this is the first time that a study has assessed all of these biological processes together and done an extensive evaluation of both microglial biology and neurogenesis. moreover, we have also taken into account the importance of the hippocampal dorsalventral axis . 
in summary, the outcomes of our study provide some novel insight into the changes that occur in behaviour, in peripheral stress and inflammatory systems, and within the brain, following exposure to different types of chronic stress, and how the type of stress can make a difference in terms of these changes. we demonstrate that the type of stress exposure could differentially alter depressive symptomology and/or its biological basis, thus informing the molecular underpinning of the clinical studies showing that different types of abuse and maltreatment can differentially affect mental health. moreover, our work gives clinically relevant insights into the effects of social isolation, a condition that can increase morbidity and mortality , and can specifically lead to mdd [ ] [ ] [ ] [ ] , consistent with our behavioural data, while the relationship between repeated injections and anxiety may be relevant to the impact of recurrent medical treatments on mental health. finally, given the global impact of the coronavirus outbreak, as a consequence of which social isolation has increased significantly, understanding the psychological and biological effects of social isolation has become fundamentally important for understanding, and subsequently alleviating, the impact that social isolation may have worldwide. 
the recent progress in animal models of depression social isolation rearing induces mitochondrial, immunological, neurochemical and behavioural deficits in rats, and is reversed by clozapine or n-acetyl cysteine menthone confers antidepressant-like effects in an unpredictable chronic mild stress mouse model via nlrp inflammasome-mediated inflammatory cytokines and central neurotransmitters effects of group housing on stress induced emotional and neuroendocrine alterations cytokines and glucocorticoid receptors are associated with the antidepressant-like effect of alarin beneficial behavioural and neurogenic effects of agomelatine in a model of depression/anxiety neurogenesis-independent antidepressant-like effects on behavior and stress axis response of a dual orexin receptor antagonist in a rodent model of depression different susceptibility of prefrontal cortex and hippocampus to oxidative stress following chronic social isolation stress dysregulation of the hypothalamus-pituitary-adrenal axis predicts some aspects of the behavioral response to chronic fluoxetine: association with hippocampal cell proliferation the chronic mild stress (cms) model of depression: history, evaluation and usage short-term social isolation induces depressive-like behaviour and reinstates the retrieval of an aversive task: mood-congruent memory in male mice? 
social defeat protocol and relevant biomarkers, implications for stress response physiology, drug abuse, mood disorders and individual stress vulnerability: a systematic review of the last decade childhood emotional maltreatment and mental disorders: results from a nationally representative adult sample from the united states associations between depression and specific childhood experiences of abuse and neglect: a meta-analysis maltreatment type, exposure characteristics, and mental health outcomes among clinic referred trauma-exposed youth type and timing of childhood maltreatment and reduced visual cortex volume in children and adolescents with reactive attachment disorder psychological stress in adolescent and adult mice increases neuroinflammation and attenuates the response to lps challenge unpredictable chronic mild stress not chronic restraint stress induces depressive behaviours in mice in-depth behavioral characterization of the corticosterone mouse model and the critical involvement of housing conditions side effects of control treatment can conceal experimental data when studying stress responses to injection and psychological stress in mice revealing a latent variable: individual differences in affective response to repeated injections lifetime prevalence and age-of-onset distributions of dsm-iv disorders in the national comorbidity survey replication predictors of first lifetime onset of major depressive disorder in young adulthood gender differences in subtypes of depression by first incidence and age of onset: a follow-up of the lundby population major depressive disorder housing, husbandry and handling of rodents for behavioral experiments the laboratory mouse - mood and anxiety related phenotypes in mice mood and anxiety related phenotypes in mice a possible link between food and mood: dietary impact on gut microbiota and behavior in balb/c mice sensitivity to the effects of pharmacologically selective antidepressants in different strains of mice 
plasma proteins predict conversion to dementia from prodromal disease whole animal perfusion fixation for rodents regional and cellular neuropathology in the palmitoyl protein thioesterase- null mutant mouse model of infantile neuronal ceroid lipofuscinosis successive neuron loss in the thalamus and cortex in a mouse model of infantile neuronal ceroid lipofuscinosis dynamic microglial alterations underlie stress-induced depressive-like behavior and suppressed neurogenesis hippocampusdependent learning is associated with adult neurogenesis in mrl/mpj mice nih image to imagej: years of image analysis a quantitative spatiotemporal analysis of microglia morphology during ischemic stroke and reperfusion neuronal morphometry directly from bitmap images regulation of adult neurogenesis by stress, sleep disruption, exercise and inflammation: implications for depression and antidepressant action embryonic expression of the chicken sox , sox and sox genes suggests an interactive role in neuronal development chronic stress-induced disruption of the astrocyte network is driven by structural atrophy and not loss of astrocytes increased hippocampal neurogenesis in the progressive stage of alzheimer's disease phenotype in an app/ps double transgenic mouse model variability of doublecortin-associated dendrite maturation in adult hippocampal neurogenesis is independent of the regulation of precursor cell proliferation glial cells and their function in the adult brain: a journey through the history of their ablation chronic stress alters the density and morphology of microglia in a subset of stress-responsive brain regions molecular mechanisms in the regulation of adult neurogenesis during stress the role of microglia in adult hippocampal neurogenesis acute and chronic stress-induced disturbances of microglial plasticity, phenotype and function microglia shape adult hippocampal neurogenesis through apoptosis-coupled phagocytosis the hpa -immune axis and the immunomodulatory actions 
of glucocorticoids in the brain exploring the molecular mechanisms of glucocorticoid receptor action from sensitivity to resistance confocal imaging of microglial cell dynamics in hippocampal slice cultures microglial activation -tuning and pruning adult neurogenesis il- : the master regulator of immunity to infection il- modulates depressive-like behavior antidepressant-like effect of α-tocopherol in a mouse model of depressive-like behavior induced by tnf-α. prog. neuropsychopharmacol peripheral cytokine and chemokine alterations in depression: a meta-analysis of studies missing and pssible link between neuroendocrine factors, neuropsychiatric disorders, and microglia evidence that microglia mediate the neurobiological effects of chronic psychological stress on the medial prefrontal cortex molecular consequences of activated microglia in the brain: overactivation induces apoptosis connexin deficiency attenuates a astrocyte responses and induces severe neurodegeneration in a -methyl- -phenyl- , , , -tetrahydropyridine hydrochloride parkinson's disease animal model s β induces apoptotic cell death in cultured astrocytes via a nitric oxide-dependent pathway an update on reactive astrocytes in chronic pain changes in s b cerebrospinal fluid levels of rats subjected to predator stress glucocorticoid-related molecular signaling pathways regulating hippocampal neurogenesis structural and quantitative analysis of astrocytes in the mouse hippocampus astrocyte pathology in major depressive disorder: insights from human postmortem brain tissue astrocytes: orchestrating synaptic plasticity? 
microglia: a sensor for pathological events in the cns emerging roles for semaphorins and vegfs in synaptogenesis and synaptic plasticity unique role for dentate gyrus microglia in neuroblast survival and in vegf-induced activation cell type specific expression of vascular endothelial growth factor and angiopoietin- and - suggests an important role of astrocytes in cerebellar vascularization glial cell line-derived neurotrophic factor family members sensitize nociceptors in vitro and produce thermal hyperalgesia in vivo region specific decrease in glial fibrillary acidic protein immunoreactivity in the brain of a rat model of depression a comparative examination of the anti-inflammatory effects of ssri and snri antidepressants on lps stimulated microglia chronic restraint stress decreases glial fibrillary acidic protein and glutamate transporter in the periaqueductal gray matter stress-induced elevation of glucocorticoids increases microglia proliferation through nmda receptor activation differential corticosteroid modulation of inhibitory synaptic currents in the dorsal and ventral hippocampus unraveling the time domains of corticosteroid hormone influences on brain activity: rapid, slow, and chronic modes hippocampal cytogenesis correlates to escitalopram-mediated recovery in a chronic mild stress rat model of depression chronic high corticosterone reduces neurogenesis in the dentate gyrus of adult male and female rats severe early life stress hampers spatial learning and neurogenesis, but improves hippocampal synaptic plasticity and emotional learning under high-stress conditions in adulthood hippocampal neurogenesis: a biomarker for depression or antidepressant effects? 
methodological considerations and perspectives for future research hippocampal neurogenesis confers stress resilience by inhibiting the ventral dentate gyrus chronic unpredictable stress induces a cognitive deficit and anxiety-like behavior in rats that is prevented by chronic antidepressant drug treatment the effect of chronic phenytoin administration on single prolonged stress induced extinction retention deficits and glucocorticoid upregulation in the rat medial prefrontal cortex impact of anxiety on prefrontal cortex encoding of cognitive flexibility astroglial plasticity in the hippocampus is acronic psychosocial stress and concomitant fluoxetine treatment chronic psychosocial stress and citalopram modulate the expression of the glial proteins gfap and ndrg in the hippocampus vascular endothelial growth factor (vegf) stimulates neurogenesis in vitro and in vivo vegf links hippocampal activity with neurogenesis, learning and memory the role of inflammatory cytokines as key modulators of neurogenesis the role of pro-inflammatory cytokines in neuroinflammation, neurogenesis and the neuroendocrine system in major depression il- knock out mice display anxiety-like behavior interactions between the immune and nervous systems in pain chronic pain and chronic stress: two sides of the same coin? 
chronic stress tnf-α and neuropathic pain -a review blockade of tnf-α rapidly inhibits pain responses in the central nervous system longitudinal associations between depression, anxiety, pain, and pain-related disability in chronic pain patients coexistence of two forms of ltp in acc provides a synaptic mechanism for the interactions between anxiety and chronic pain neural mechanisms underlying anxiety-chronic pain interactions anxiolytic effects of herbal ethanol extract from gynostemma pentaphyllum in mice after exposure to chronic stress ameliorating effects of gypenosides on chronic stress-induced anxiety disorders in mice the influence of repeated pain stimulation on the emotional aspect of pain: a preliminary study in healthy volunteers effects of chronic social isolation on wistar rat behavior and brain plasticity markers individual housing induces altered immuno-endocrine responses to psychological stress in male mice associations between loneliness and perceived social support and outcomes of mental health problems: a systematic review social isolation, depression, and psychological distress among older adults loneliness and health in older adults: a mini-review and synthesis anxiety, depression, loneliness and social network in the elderly: longitudinal associations from the irish longitudinal study on ageing (tilda) social integration and quality of social relationships as protective factors for inflammation in a mationally representative sample of black women strain differences in sucrose preference and in the consequences of unpredictable chronic mild stress sex differences and phase of light cycle modify chronic stress effects on anxiety and depressivelike behavior age-related changes in behavior in c bl/ j mice from young adulthood to middle age stability of inbred mouse strain differences in behavior and brain size between laboratories and across decades why is depression more prevalent in women? 
depression and c-reactive protein in us adults: data from the third national health and nutrition examination survey age differences in major depression: results from the national comorbidity survey replication (ncs-r) female mice liberated for inclusion in neuroscience and biomedical research sex bias in neuroscience and biomedical research differential control of learning and anxiety along the dorsoventral axis of the dentate gyrus loneliness and social isolation as risk factors for mortality: a meta-analytic review. perspect disentangling loneliness: differential effects of subjective loneliness, network quality, network size, and living alone on physical, mental, and cognitive health
this study was funded by janssen pharmaceutica as part of a large programme of research on depression and inflammation awarded to c.m.p., s.t., c.f. and p. a.z. c.m.p. has received additional research funding from the medical research council (uk) and the wellcome trust for research on depression and inflammation as part of two large consortia that also include johnson & johnson, gsk and the lundbeck foundation, and from the company eleusis benefit corporation, which is interested in research on depression and inflammation. the authors declare that they have no conflict of interest.
publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
supplementary information accompanies this paper at (https://doi.org/ . /s - - - ).
received: july revised: august accepted: september
key: cord- -n kpvsvg authors: nguyen, long t.; smith, brianna m.; jain, piyush k. title: enhancement of trans-cleavage activity of cas a with engineered crrna enables amplified nucleic acid detection date: - - journal: biorxiv doi: . / . . . 
sha: doc_id: cord_uid: n kpvsvg the crispr/cas a rna-guided complexes have a tremendous potential for nucleic acid detection due to its ability to indiscriminately cleave ssdna once bound to a target dna. however, the current crispr/cas a systems are limited to detecting dna in a picomolar detection limit without an amplification step. here, we developed a platform with engineered crrnas and optimized conditions that enabled us to detect dna, dna/rna heteroduplex and methylated dna with higher sensitivity, achieving a limit of detection of in femtomolar range without any target pre-amplification step. by extending the ’- or ’-ends of the crrna with different lengths of ssdna, ssrna, and phosphorothioate ssdna, we discovered a new self-catalytic behavior and an augmented rate of lbcas a-mediated collateral cleavage activity as high as . -fold compared to the wild-type crrna. we applied this sensitive system to detect as low as fm dsdna from the pca gene, an overexpressed biomarker in prostate cancer patients, in simulated urine over hours. the same platform was used to detect as low as ~ fm cdna from hiv, fm rna from hcv, and fm cdna from sars-cov- , all within minutes without a need for target amplification. with isothermal amplification of sars-cov- rna using rt-lamp, the modified crrnas were incorporated in a paper-based lateral flow assay that could detect the target with up to -fold higher sensitivity within - minutes. based on the crystal structure of lbcas a/crrna/dsdna (pdb id: xus) , we reasoned that crrna extensions can influence the trans-cleavage activity by either activating or inhibiting the catalytic efficiency of cas a, allowing us to better understand crrna design with tunable transcleavage activity. we speculate that chemical modifications of the crrna can potentially change its nature of binding and subsequently alter this collateral cleavage due to conformational changes of the cas a dynamic endonuclease domain. 
we placed ssdna, ssrna, and phosphorothioate ssdna extensions of various lengths ranging from to nucleotides on either the '- or '-ends of the crrna targeting gfp (green fluorescent protein), referred to here as crgfp (fig. b-h) . in order to measure the collateral or trans-cleavage activity of cas a, we employed a fret-based reporter used in detectr, composed of a fluorophore (fam) and a quencher ( iabkfq) connected by a -nucleotide sequence (ttatt), which displays increased fluorescence upon cleavage.
figure caption (fragment): the fold change in fluorescence was normalized by taking the ratio of the background-corrected fluorescence signal of the sample with activator to that of the corresponding sample without activator. error bars represent ± sem, where n = replicates; two-way anova (n= , n= , ns p > . , *p < . , **p < . , ***p < . , ****p < . ). the experiments were repeated at least twice with n = per experiment.
consistent with the previous literature , when using wild-type crrnas, we observed that the lbcas a exhibited higher trans-cleavage activity than the ascas a or the fncas a, and therefore, we designed various modified crrnas compatible with lbcas a. using the same reporters, we discovered that ssdna and ssrna extensions on the '-end of crgfp markedly enhanced the trans-cleavage ability of target-activated lbcas a. comparing the two types, the ssdna extensions demonstrated higher activity than the corresponding ssrna (figs. b-d,f and supplementary figs. [ ] [ ] [ ] [ ] [ ] . on the other hand, the phosphorothioate ssdna extensions at the '- or '-end displayed minimal or no activity, showing decreased fluorescence intensity as modification length increased (figs. e,h and supplementary figs. [ ] [ ] [ ] [ ] [ ] . this observation suggests that further extending the crrna with -mer phosphorothioate ssdna and beyond significantly inhibits lbcas a trans-cleavage activity. this finding corroborates the report by b. li and colleagues that phosphorothioate ssdna may prevent crrna-cas a-dna complex formation.
notably, the '-dna with -mer extensions on the crgfp, referred to as crgfp+ 'dna , yielded the highest fluorescence signal compared to other modifications, measuring approximately . fold higher intensity than the wild-type crgfp (fig. c, supplementary figs. a, a) . by investigating the conformational changes from the crystal structure of the binary lbcas a:crrna complex , , , we observed that the '-end modifications on the crrna are proximal to the ruvc region of the lbcas a. this supports our observation that the '-end extensions lead to higher trans-cleavage activity than the '-end. we speculated that once an r-loop is formed between crrna and dsdna or ssdna activator, the lbcas a executes a partial trans-cleavage of the '-end of the crrna, leaving an overhang. these remaining extensions may further expand the nuclease domain in the lbcas a, resulting in conformational changes and allowing more access for nonspecific ssdna cleavage. to test our hypothesis, we attached different fluorophores, or fluorophore-quencher pairs separated by dna linkers, to either the '- or '-end of the crgfp with -mer dna extensions and analyzed the products by denaturing gel electrophoresis. surprisingly, we discovered that the '-end of the crrna is processed by lbcas a only in the presence of an activator while the '-end is cleaved by lbcas a even in the absence of the activator (fig. a,b and supplementary figs. , ). by placing the fluorophore fam on the '-end and a -mer dna extension on the '-end of the crgfp, we learned that the first uracil on the '-end of the crgfp gets trimmed by lbcas a in the absence of an activator, which corroborates previous studies reported for fncas a . as a result, the '-end modifications are eliminated and the crrna is converted back to the wild-type crrna before complexing with the activator. 
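the fold-change normalization described above (the ratio of the background-corrected signal with activator to that of the matching sample without activator, summarized as mean ± sem over replicates) can be sketched as follows. this is only an illustrative sketch and not the authors' analysis code: the function names, the use of a single background reading, and all numeric readings are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, stdev


def fold_change(with_activator, without_activator, background):
    """Ratio of background-corrected signal with activator to that without."""
    corrected_on = with_activator - background
    corrected_off = without_activator - background
    if corrected_off <= 0:
        raise ValueError("background-corrected no-activator signal must be positive")
    return corrected_on / corrected_off


def mean_and_sem(values):
    """Mean and standard error of the mean across replicates."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical readings in arbitrary fluorescence units (not data from the paper).
background = 100.0
replicates = [(2100.0, 300.0), (2300.0, 300.0), (2500.0, 300.0)]  # (with, without)

folds = [fold_change(on, off, background) for on, off in replicates]
avg, sem = mean_and_sem(folds)
print(f"fold change = {avg:.2f} +/- {sem:.2f} (n = {len(folds)})")
```

dividing by the no-activator sample, rather than only subtracting it, puts crrnas with different baseline leakiness on a common scale, which is what makes the fold values comparable across modifications.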
this finding reinforces our previous observation that the '-extended crrna has similar collateral cleavage activity to the wild-type crrna. intrigued by lbcas a pre-crrna processing, as previously described and as seen in our own observations, we investigated how extensions of the mature crrna would influence the trans-cleavage activity compared to the corresponding extended pre-crrna. we discovered that the modified pre-crrna and modified mature crrna (tru-crrna) exhibited comparable trans-cleavage efficiency (fig. g) . furthermore, when a dsdna or an ssdna activator was present, the '- and '-end dna-
to further understand the enhanced enzymatic activity of lbcas a, we performed a michaelis-menten kinetic study on the wild-type crgfp and the crgfp+ 'dna and observed that the ratio kcat/km was . -fold higher for crgfp+ 'dna than for the unmodified crgfp (figs. c,d) . the time-dependent gel electrophoresis analysis of nonspecific cleavage of ssdna m mp phage (~ kb) reconfirmed the fluorophore-quencher-based reporter assay results (fig. e) . we speculated that the reporter composition itself may affect the lbcas a collateral cleavage activity. therefore, we incorporated and tested various nucleotide compositions (gc-rich and ta-rich) and fluorophores (fam, hex, and cy ) within the reporter. consistent with our hypothesis, we observed that the lbcas a achieved maximal trans-cleavage activity with fam or hex and a ta-rich reporter (fig. f and supplementary figs. [ ] [ ] [ ] [ ] [ ] . furthermore, these results led us to question whether the trans-cleavage activity is dependent on the sequence of ssdna extensions on the '-end of the crrna. to test this, we altered the nucleotide content of the extended regions of the crgfp. it turned out that the crgfp with ta-rich extensions carried out significantly more collateral cleavage than those with gc-rich regions (fig. h and supplementary fig. ). 
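the kcat/km comparison from the michaelis-menten study above can be illustrated with a short sketch. the text does not say how the kinetic parameters were fitted, so the hanes-woolf linearization used here is a swapped-in assumption chosen because it needs only the standard library; the function names, enzyme concentration, and synthetic rate data are likewise hypothetical and do not reproduce the reported values.

```python
def fit_michaelis_menten(substrate, velocity):
    """Estimate (Vmax, Km) with the Hanes-Woolf linearization.

    [S]/v = [S]/Vmax + Km/Vmax, so a straight-line fit of [S]/v against [S]
    has slope 1/Vmax and intercept Km/Vmax.
    """
    x = list(substrate)
    y = [s / v for s, v in zip(substrate, velocity)]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    vmax = 1.0 / slope
    km = intercept * vmax
    return vmax, km


def catalytic_efficiency(vmax, km, enzyme_conc):
    """kcat/Km, with kcat = Vmax / [E]."""
    return (vmax / enzyme_conc) / km


def simulate_rates(vmax, km, substrate):
    """Noise-free Michaelis-Menten initial rates, used only as example input."""
    return [vmax * s / (km + s) for s in substrate]


# Hypothetical reporter concentrations (uM) and enzyme concentration (uM).
substrate = [0.1, 0.2, 0.5, 1.0, 2.0]
enzyme = 0.01

rates_wild_type = simulate_rates(1.0, 1.0, substrate)  # assumed Vmax, Km
rates_extended = simulate_rates(2.0, 0.5, substrate)   # assumed Vmax, Km

eff_wild_type = catalytic_efficiency(*fit_michaelis_menten(substrate, rates_wild_type), enzyme)
eff_extended = catalytic_efficiency(*fit_michaelis_menten(substrate, rates_extended), enzyme)
fold_improvement = eff_extended / eff_wild_type
print(f"kcat/Km fold improvement (synthetic data): {fold_improvement:.2f}")
```

with real, noisy measurements a direct nonlinear least-squares fit of the michaelis-menten equation is usually preferred, because linearizations weight the measurement error unevenly.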
based on our findings that the trans-cleavage activity is drastically improved by -mer ssdna extensions to the '-end of crgfp, we questioned whether the binding of crrna with lbcas a itself is influenced by such modifications. a biolayer interferometry binding kinetic assay revealed that the dissociation constants, kd, of the binary complexes lbcas a:crrna and lbcas a:crrna+ 'dna are comparable and in the low nm range (fig. i and supplementary fig. ). these binding results suggest that the 'dna modification on crrna does not affect the binary complex formation between the lbcas a and the crrna. while -mer ssdna extensions on the '-end of crrna increase trans-cleavage activity with lbcas a, we questioned whether this is consistent across other orthologs of cas a. to investigate further, we carried out an in vitro cis-cleavage and trans-cleavage assay of ascas a and fncas a with an extended crgfp compared to a wild-type crgfp (fig. j and supplementary fig. ). interestingly, the crgfp+ 'dna showed similar results with fncas a but exhibited the opposite effect with ascas a. the cis-cleavage activity, however, was found to be comparable between the crgfp and crgfp+ 'dna for all the orthologs tested. overall, lbcas a showed the highest fluorescence signal, which is consistent with previous studies , . through observation of the fluorophore-quencher-based reporter assay and time-dependent gel electrophoresis, we hypothesized that the various extensions of ssdna on the crrna induce conformational changes on lbcas a that result in enhanced endonuclease activity. structural analysis of lbcas a shows that it contains a single ruvc domain, which processes precursor crrna into mature crrna, cleaves target dsdna or ssdna (referred to here as activators), and executes nonspecific cleavage afterwards. 
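the comparison of dissociation constants from the biolayer interferometry assay above can be sketched under a simple 1:1 binding model, in which kd is the ratio of the dissociation and association rate constants. the binding model, the function names, and the rate constants below are assumptions for illustration only; the text does not report the fitted rate constants or the model used.

```python
def dissociation_constant(k_on, k_off):
    """Kd in M for 1:1 binding, from k_on in 1/(M*s) and k_off in 1/s."""
    if k_on <= 0:
        raise ValueError("k_on must be positive")
    return k_off / k_on


def fraction_bound(ligand_conc, kd):
    """Equilibrium fraction of binding sites occupied at a free ligand concentration."""
    return ligand_conc / (kd + ligand_conc)


# Hypothetical rate constants for the two binary complexes (not from the paper).
kd_wild_type = dissociation_constant(k_on=1.0e5, k_off=2.0e-4)
kd_extended = dissociation_constant(k_on=1.0e5, k_off=3.0e-4)

print(f"Kd wild-type crRNA: {kd_wild_type * 1e9:.1f} nM")
print(f"Kd extended crRNA:  {kd_extended * 1e9:.1f} nM")
print(f"ratio: {kd_extended / kd_wild_type:.2f}")
```

two kd values that differ by well under an order of magnitude, as in this made-up example, would normally be read as comparable binding, which is the sense in which the text describes the two complexes.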
therefore, we were interested in understanding the effects of these modified crrnas on cis-cleavage compared to the wild-type crrna, as well as how cis-cleavage activity correlates with trans-cleavage activity. to this end, we carried out an in vitro cis-cleavage assay for various '-end and '-end modifications. the cis-cleavage activity was either similar or marginally improved with most '-end modifications, while the '-end modifications showed either similar or slightly reduced activity. this suggests that the trans-cleavage activity is commensurate with the cis-cleavage activity.

(figure caption, partial) the kd was determined by the biolayer interferometry binding kinetic assay with r > . . (j) trans-cleavage activity of different variants of cas a. the prefixes lb, as, and fn stand for lachnospiraceae bacterium, acidaminococcus, and francisella novicida, respectively. (k) single-point mutations (m -m ) on the target strand of a dsdna gfp activator. the heat map displays relative fluorescence intensity normalized to the wild-type (wt) activator after hours. error bars represent ± sem, where n = replicates. the experiments were repeated at least twice with n = per experiment.

next, we sought to characterize the specificity of these extended crrnas in discriminating point mutations across dsdna. by mutating a single nucleotide at each position across the target-binding region, we discovered that the crgfp+ 'dna tolerated mutations and produced a stronger fluorescence signal than the wild-type crgfp (supplementary fig. ). however, the fluorescence intensity ratio of mutated to non-mutated dsdna targets for the crgfp+ 'dna was comparable to that for the crgfp (fig. k and supplementary fig. ). this observation suggests that the modified crrnas increased the sensitivity of trans-cleavage while the specificity remained unchanged.
previous studies demonstrated that fncas a is a metal-dependent endonuclease, and that magnesium ions are required for fncas a-mediated self-processing of precursor crrna. based on these findings, we hypothesized that different metal ions may significantly affect the trans-cleavage activity of lbcas a. we therefore tested a range of divalent metal cations and discovered that most ions, including ca + , co + , zn + , cu + , and mn + , significantly inhibited lbcas a activity (supplementary fig. ). by further investigating the zn + -mediated inhibition of lbcas a, we found that the inhibition was dose-dependent (supplementary fig. ). interestingly, ni + ions showed an unusual cis-cleavage activity, possibly due to their interactions with the his tags on lbcas a (supplementary fig. ). among the tested divalent metal ions, mg + showed the highest in vitro cis-cleavage activity, which is consistent with the literature. therefore, we characterized the effect of mg + ions on the trans-cleavage activity of lbcas a. with increasing mg + concentration, a significant increase in fluorescence signal was observed in an in vitro trans-cleavage assay. by varying the amount of mg + in the cas a reaction, we identified that the optimal mg + concentration was around mm (fig. a-b and supplementary figs. ).

we optimized the previously developed crispr-based detection assays and combined them with our engineered crrna+ 'dna to create the crispr-enhance (enhanced analysis of nucleic acids with crrna extensions) technology, referred to here as enhance. to validate the enhance technology, we first selected a clinically relevant nucleic acid biomarker, prostate cancer antigen (pca /dd ), which is one of the most overexpressed genes in prostate cancer tissue and is excreted in patients' urine. consequently, elevated pca levels during prostate cancer progression have become a widely targeted biomarker for detection.
to determine the limit of detection of pca using our enhance technology, we spiked the pca cdna into synthetic urine and investigated how this clinically relevant environment affects the activity of cas a. using enhance for detecting the pca cdna, the limit of detection was determined to be as low as fm in the urine at mm mg + concentration compared to ~ pm at mm mg + concentration after hours (figs. a-c and supplementary figs. - ). the wild-type crrna showed a similar fm limit of detection at mm mg + concentration, while its limit of detection was ~ pm at mm mg + concentration after hours. therefore, by combining the crrna modifications with increased mg + ion concentrations, we achieved an approximately -fold increase in sensitivity, based on limit of detection calculations. this observation also suggests that our modified crrna+ 'dna significantly improves the limit of detection at low mg + but reaches a saturation point comparable with the wild-type crrna at high mg + concentration.

to understand the importance of divalent ions in the cas a trans-cleavage reaction, we carried out a michaelis-menten kinetic study with various mg + concentrations (supplementary fig. ). we observed that the initial reaction rate of cas a in the presence of high mg + concentrations increased markedly compared to that in low mg + . however, the two reaction rates eventually reach a similar saturation point (supplementary figs. ). this suggests that mg + is not only required for the cas a reaction, but also accelerates the enzyme's trans-cleavage activity. regardless, mg + plays an important role in lowering the limit of detection in synthetic urine containing pca .

(figure caption, partial) ssdna. using modified crrna, the limit of detection of hcv target ssdna was found to be fm ( amoles) at min, without target amplification, mean ± se, two-way anova (n= , n= ). (i) fold change in trans-cleavage activity with lbcas a in the presence of pm ( fmols) of target sars-cov- cdna (dsdna), mean ± se, two-way anova (n= , n= , ns p > . , *p < . , **p < . , ***p < . , ****p < . ). (j) lateral flow assay detecting nm ( fmols) of sars-cov- cdna using either crcov- or crcov- + 'dna- within minutes of incubation. (j) schematic diagram showing how a lateral flow assay works. briefly, the dipstick uses gold-labeled fitc-specific antibodies that bind to the fitc-biotin reporter and travel through the membrane. only cleaved reporter will reside at the positive line. (k) lateral flow assay detecting sars-cov- cdna using crcov- and crcov- + 'dna without a pre-amplification step, and (l) band-intensity analysis of (k) using imagej. (m) lateral flow assay detecting sars-cov- rna n gene using crcov- and crcov- + 'dna with rt-lamp, and (n) band-intensity analysis of (m) using imagej.

while as low as fm (equivalent to . attomoles) of pca cdna can be detected with enhance without any target amplification (supplementary fig. ), the clinical concentration of pca mrna in urine can be lower and may therefore require target pre-amplification. we therefore incorporated and tested a recombinase polymerase amplification (rpa) step to isothermally amplify the pca cdna. by including the rpa step as previously reported, the pca cdna in the urine was detectable down to ~ am ( zmol) with a . -fold signal-to-noise ratio (fig. d).

while crrna/lbcas a has traditionally been used to detect unmodified dna, it is not known how a common epigenetic marker, dna methylation, affects its trans-cleavage activity. dna methylation is also one of the bacterial defense systems against outside invaders. it is therefore of interest to understand whether lbcas a collateral cleavage can recognize methylated dna targets.
this curiosity led us to discover that the wild-type crrna had significantly reduced activity in detecting methylated dna, containing -methyl cytosine, compared to unmethylated dna. however, enhance showed . -fold and . -fold higher trans-cleavage activity than the wild-type crrna when targeting the methylated dsdna and ssdna, respectively (fig. e, supplementary fig. a).

although there are no reports on rna-guided rna targeting by lbcas a, we envisioned that an rna could be detected as a dna/rna heteroduplex. to test this hypothesis, we incorporated a reverse transcription step to convert the rna into a cdna/rna heteroduplex before detecting it with a trans-cleavage assay. we discovered that the rna can only be detected if the target strand for the crrna in the heteroduplex is dna rather than rna. notably, the efficiency of the trans-cleavage activity for the dna/rna heteroduplex was significantly lower than for the corresponding ssdna or dsdna (fig. e, supplementary fig. b). however, the dna/rna heteroduplex achieved improved collateral activity with the crrna+ 'dna compared to the wild-type crrna.

we applied enhance to detect low picomolar concentrations of an hiv rna target encoding the tat gene with our dna/rna heteroduplex detection strategy (fig. f). in parallel, ssdna and dsdna targets from hiv were also detected with much higher sensitivity than with the wild-type crrna within to minutes (figs. f,g and supplementary figs. ). we further applied enhance to detect hcv ssdna and hcv dsdna of the gene encoding a polyprotein precursor, both of which showed consistently enhanced collateral activity compared to the wild-type crrnas within minutes (figs. f, h, and supplementary fig. ). the limits of detection for the hiv and hcv targets were calculated to be fm cdna and fm ssdna, respectively.
in the wake of the recent covid- pandemic, there is an urgent need to rapidly detect the sars-cov- coronavirus (referred to here as cov- for simplicity). we optimized enhance to detect cov- dsdna by designing crrnas targeting the n gene, which encodes the nucleocapsid phosphoprotein (figs. f,i). while no clinical samples were tested, the results indicated that the 'dna-modified crrna consistently demonstrated higher sensitivity for detecting cov- dsdna within minutes compared to the wild-type crcov- (supplementary figs. - ). by incorporating a commercially available paper-based lateral flow assay with a fitc-ssdna-biotin reporter, we could visually detect nm of cov- cdna within minutes of incubation using both wild-type and modified crrnas without any target amplification (fig. j-l and supplementary fig. ). the enzyme trans-cleavage activity exhibited a consistent trend with the crrna+ 'dna among five different targets (fig. f).

when a reverse transcription step and a loop-mediated isothermal amplification (rt-lamp) strategy were incorporated into enhance, both the crcov- -wt and the crcov- + 'dna demonstrated a limit of detection down to - copies of rna (fig. m). however, in the case of crcov- -wt, partial cleavage of the reporter resulted in a darker control line on the paper strip. band-intensity analysis showed that enhance exhibited an average -fold higher ratio of positive to control line between nm and pm of target cov- rna, while the crcov- -wt showed an average ratio of only -fold (figs. m,n and supplementary fig. ). additionally, time-lapse pictures of the lateral flow assays showed that the positive line for target-containing samples developed and became visible sooner (within seconds) for the crcov- + 'dna than for the crcov- -wt (supplementary fig. ).

in summary, we extended the '- and '-ends of the crrna and discovered an amplified trans-cleavage activity of lbcas a when the '-end is extended with dna or rna.
we applied this modified crrna/lbcas a system with the optimal conditions to detect pca in simulated urine with high sensitivity. this enhance technology enabled us to detect the dna/rna heteroduplex and methylated dna with unprecedented sensitivity. we further employed this system to test a range of target nucleic acids, including ssdna, dsdna, and rna from hiv, hcv, and sars-cov- without the need for further optimization. these findings are a crucial step towards enhancing detection of nucleic acids and assisting in the diagnosis of various diseases.

multiple dna activators were used in this study. the gfp fragment ( bp) was produced by amplifying the pegfp-c plasmid using polymerase chain reaction in the proflex pcr system (thermofisher scientific). the pcr product was purified using the monarch® nucleic acid purification kit (new england biolabs inc.). additionally, the -nt ds-gfp and ds-pca activators were generated by annealing two single-stranded ts and nts fragments at a : ratio (integrated dna technologies inc.) in x hybridization buffer ( nm tris-cl, ph . , mm kcl, mm mgcl ). the annealing process was executed in the proflex pcr system at °c for minutes followed by gradual cooling to °c at a rate of . °c/s.

the plasmid lbcpf - nls (addgene # , a gift from the jennifer doudna lab) was transformed into nico (de ) competent cells (new england biolabs). colonies were picked and inoculated in terrific broth at °c until od = . . iptg was then added to the cultures, and they were grown at °c overnight. cell pellets were collected from the overnight cultures by centrifugation, resuspended in lysis buffer ( m nacl, mm tris-hcl, mm imidazole, . mm tcep, . mg/ml lysozyme, and mm pmsf, ph = ), and broken by sonication. the sonicated solution then underwent high-speed centrifugation at , rcf for minutes. the collected supernatant was then run through a ni-nta hispur column (thermofisher) pre-equilibrated with wash buffer a ( m nacl, mm tris-hcl, mm imidazole, . mm tcep, ph = ). the column was then eluted with buffer b ( m nacl, mm tris-hcl, mm imidazole, . mm tcep, ph = ). the eluted fractions were then pooled together and underwent tev cleavage overnight at °c (tev protease was purified using the plasmid prk , # from addgene, a gift from the david waugh lab). the resulting fraction was equilibrated with buffer c ( mm nacl, mm hepes, . mm tcep, ph = ) at a : ratio and run through a hitrap heparin hp ml column (ge biosciences). the column was washed with buffer c and gradually eluted at a gradient rate with buffer d ( mm nacl, mm hepes, . mm tcep, ph = ). the eluted fraction was concentrated down to l and passed through the hiload superdex pg column (ge biosciences). the purified lbcas a was then buffer exchanged with storage buffer ( mm nacl, mm na co , . mm tcep, % glycerol, ph = ) and flash frozen at - °c until use.

the bli ni-nta biosensors were purchased from fortebio to perform the binding kinetic study with polyhistidine-tagged lbcas a. in detail, the experiment was carried out in a -well plate and included five steps: baseline, loading, baseline , association, and dissociation. the biosensors were dipped into the baseline containing x kinetic buffer ( x pbs, . % bsa, and . % tween ). they were then transferred into each loading well containing g/ml of lbcas a. after processing through loading and baseline , the protein-tagged biosensor was dipped into the crrna sample wells at different dilutions ( , , . , . , . , . , . , and g/ml) in the association step. the dissociation step occurred when the biosensors were transferred back to baseline at a shake speed of rpm. all the samples were read by the octet qke system (fortebio). kd was determined by software data analysis . (fortebio), and only kd values with r > . were extracted for comparison between the wild-type crrna and the modified crrnas.
in vitro digestion reactions were carried out with three different members of the cas a family (lbcas a, ascas a, and fncas a were purchased from new england biolabs inc., integrated dna technologies inc., and abm®, respectively) and a wide array of modified crrnas (purchased from dna technologies inc.). cas a and crrna were mixed at a : ratio ( nm: nm) in x nebuffer . and pre-incubated at °c for minutes to promote ribonucleoprotein complex formation. dna activator (gfp or pca fragments) (final concentration of nm) was then added to the mixture to produce a total volume of l and incubated at °c for minutes. the sample was then analyzed in either a % agarose gel (for gfp fragments amplified from the pegfp-c plasmid) pre-stained with either sybr gold (invitrogen) or gelred (biotium inc.), or a premade % novex™ tbe-urea gel (invitrogen).

nonspecific cleavage activity of cpf was activated by incubating cpf , crrna, and dna activator at concentrations of nm: nm: nm in x nebuffer . at °c for minutes. m mp was then added to the l reaction mixture and incubated for an additional minutes. a fraction of the reaction was taken out every minutes, quenched in x purple gel loading dye (new england biolabs inc.), and subsequently analyzed in a % agarose gel (fisher scientific).

the fluorophore-quencher reporter assay was carried out following a standard clinical detection protocol. the cas a-crrna ribonucleoprotein complex was assembled by mixing nm cas a and nm crrna in x nebuffer . in the proflex pcr system (thermofisher scientific) at °c for minutes (volume . µl). the activator ( nm final concentration), fq reporter ( nm final concentration), and ultrapure™ dnase/rnase-free distilled water (invitrogen) were pre-added to a -well plate (greiner bio-one) to a volume of . µl. the reaction was initiated by adding the cas a-crrna mixture to the -well plate preloaded with activator and fq reporter (integrated dna technologies inc).
the plate was quickly transferred to a plate reader (clariostar or biotek), and fluorescence intensity was measured every minutes for or hours (detection limit assay) (fam fq: λex: / nm, λem: / nm; hex: λex: / nm, λem: / nm). after or hours (detection limit assay), the sample was scanned for images using the amersham typhoon (ge healthcare).

for the michaelis-menten kinetic study, nm lbcas a: nm crrna: nm activator were mixed in nebuffer . and incubated at °c for minutes. the reaction mixture was then transferred to a -well plate (greiner bio-one) preloaded with different concentrations of fq reporter (hex or fam fq reporter: m, . m, . m, . m, . m, and m) and ultrapure™ dnase/rnase-free distilled water (invitrogen).

to find the limit of detection (lod), the fluorophore-quencher reporter assay was carried out with various concentrations of activator. the lod calculations were based on the following formula:

lod = . × (std of rfu in the absence of activator) / (slope of rfu vs. activator concentration)

the metal ions (mg + , zn + , mn + , cu + , co + , ca + ) were prepared by diluting their chloride salts to different concentrations. for cis-cleavage, the cas a-crrna-metal ion complex was mixed at a nm: nm: varying nm ratio in x annealing buffer ( mm tris-hcl, ph . @ °c, mm nacl, mg/ml bsa) and pre-incubated at °c for minutes. dna activator (gfp or pca fragments) was then added to the mixture to a total volume of l and incubated at °c for minutes.

to minimize the testing time, the following reagents were assembled in a one-pot reaction: for experiments involving an rt-lamp pre-amplification step of target rna, the mixture was prepared in the following order (except for the rna and primer mix samples (idt technologies), everything was purchased from new england biolabs): the rt-lamp reaction was incubated at °c for - minutes prior to the lbcas a reaction.
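the limit-of-detection calculation described in this section can be sketched in a few lines of python. this is a minimal sketch under stated assumptions, not the authors' analysis script: the multiplier `k` is passed in by the caller because the constant is not recoverable from the text here, the standard deviation is taken as the sample standard deviation of blank (no-activator) readings, and the slope is an ordinary least-squares fit of rfu against activator concentration. all function and variable names are hypothetical.

```python
from statistics import mean, stdev


def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def limit_of_detection(blank_rfu, concentrations, rfu, k):
    """LOD = k * std(blank RFU) / slope(RFU vs. activator concentration).

    k is an assumed, caller-supplied multiplier; the result is in the
    same units as `concentrations`.
    """
    return k * stdev(blank_rfu) / linear_slope(concentrations, rfu)


# Illustrative (made-up) numbers: three blank wells and a 4-point dilution series.
blank = [100.0, 102.0, 98.0]
conc = [0.0, 1.0, 2.0, 3.0]
signal = [100.0, 300.0, 500.0, 700.0]
lod = limit_of_detection(blank, conc, signal, k=3.0)
```

the example values are invented purely to exercise the function; with real data, the slope should be fitted only over the concentration range in which the signal responds linearly.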
crispr-cas a target binding unleashes indiscriminate single-stranded dnase activity
the crispr-cas a gene-editing system induces collateral cleavage of rna in glioma cells
crispr-cas a has both cis- and trans-cleavage activities on single-stranded dna
nucleic acid detection with crispr-cas a/c c
multiplexed and portable nucleic acid detection platform with cas , cas a and csm
programmed dna destruction by miniature crispr-cas enzymes
sherlock: nucleic acid detection with crispr nucleases
nucleic acid detection of plant genes using crispr-cas
increasing the specificity of crispr systems with engineered rna secondary structures
extension of the crrna enhances cpf gene editing in vitro and in vivo
chemically modified cpf -crispr rnas mediate efficient genome editing in mammalian cells
structural basis for the canonical and non-canonical pam recognition by crispr-cpf
cas a trans-cleavage can be modulated in vitro and is active on ssdna, dsdna, and rna
synthetic oligonucleotides inhibit crispr-cpf -mediated genome editing
crystal structure of cpf in complex with guide rna and target dna
the crystal structure of cpf in complex with crispr rna
structural basis for guide rna processing and seed-dependent dna targeting by crispr-cas a
exploring the trans-cleavage activity of crispr-cas a (cpf ) for the development of a
cpf is a single rna-guided endonuclease of a class crispr-cas system
the crispr-associated dna-cleaving enzyme cpf also processes precursor crispr rna
prostate cancer specificity of pca gene testing: examples from clinical practice
a first-generation multiplex biomarker analysis of urine for the early detection of prostate cancer
pca and tmprss -erg gene fusions as diagnostic biomarkers for prostate cancer
pca in the detection and management of early prostate cancer
prostate health index (phi) and prostate cancer antigen (pca ) significantly improve prostate cancer detection at initial biopsy in a total psa range of - ng/ml
pca urinary biomarker for prostate cancer
multiplexed and portable nucleic acid detection platform with cas , cas a and csm
a protocol for detection of covid- using crispr diagnostics v. . sherlock biosciences
a protocol for rapid detection of the novel coronavirus sars-cov- using crispr diagnostics: sars-cov- detectr v
crispr-cpf mediates efficient homology-directed repair and temperature-controlled genome editing
cpf is a single rna-guided endonuclease of a class crispr-cas system
detection of unamplified target genes via crispr-cas immobilized on a graphene field-effect transistor

we are grateful to the members of the jain lab for their helpful discussions and to the university of florida (uf) health cancer center for its support. we are particularly thankful to eric beck for editing the manuscript and to ling jin, santosh rananaware, and marco downing for helping with the experiments and/or data analysis. we also thank the monoclonal antibody core facility staff, especially dr. angle sampson and shadi bootorabi, at the uf interdisciplinary center for biotechnology research (icbr) for coordinating the biolayer interferometry experiments. this research was supported by internal funding from the uf and the uf herbert wertheim college of engineering.

the authors declare no competing interests.

pkj initiated the study; ltn and pkj designed research; ltn and bms performed research; lnt, bms, and pkj analyzed the data; ln and pkj wrote the manuscript that was edited and approved by all authors.

key: cord- -r fhpe authors: gussow, ayal b.; park, allyson e.; borges, adair l.; shmakov, sergey a.; makarova, kira s.; wolf, yuri i.; bondy-denomy, joseph; koonin, eugene v. title: machine-learning approach expands the repertoire of anti-crispr protein families date: - - journal: nat commun doi: . /s - - - sha: doc_id: cord_uid: r fhpe

the crispr-cas are adaptive bacterial and archaeal immunity systems that have been harnessed for the development of powerful genome editing and engineering tools.
in the incessant host-parasite arms race, viruses evolved multiple anti-defense mechanisms including diverse anti-crispr proteins (acrs) that specifically inhibit crispr-cas and therefore have enormous potential for application as modulators of genome editing tools. most acrs are small and highly variable proteins, which makes their bioinformatic prediction a formidable task. we present a machine-learning approach for comprehensive acr prediction. the model shows high predictive power when tested against an unseen test set and was employed to predict , candidate acr families. experimental validation of top candidates revealed two unknown acrs (acric , ic ), and three other top candidates were coincidentally identified and found to possess anti-crispr activity. these results substantially expand the repertoire of predicted acrs and provide a resource for experimental acr discovery.

all life forms evolve under constant pressure from numerous viruses and other parasitic genetic elements, and thus have evolved multiple defense systems. the crispr-cas are adaptive immunity systems that are present in nearly all archaea and ~ % of bacteria, and have been harnessed for the development of powerful genome editing and engineering tools. in the incessant host-parasite arms race, viruses evolved multiple anti-defense mechanisms including diverse anti-crispr proteins (acrs) that are currently known to comprise distinct families. the acrs employ different mechanisms to abrogate the activity of crispr-cas systems. most of the acrs that have been studied to date bind to functionally important sites of crispr-cas effector proteins and display high specificity toward a particular crispr-cas variant from a narrow range of bacteria or archaea. some acrs, however, have broader specificity, for example, acting as nucleic acid mimics.
furthermore, enzymatically active acrs, such as acetyltransferases and nucleases, have recently been discovered. clearly, acrs have enormous potential for application as modulators of genome editing tools. despite the major interest in acrs for understanding the biology of host-parasite interactions in prokaryotes and their potential to transform the use of crispr in dna editing, the discovery of acrs remains a formidable task. the amino acid sequences of acrs are extremely variable, which conceivably reflects the high variability and diversity of the crispr-cas systems in bacteria and archaea. the combination of the small size and the high evolutionary variability of the acrs hampers their detection with even the most powerful sequence analysis methods. the currently known acr families were discovered using a variety of customized approaches, the two primary bioinformatic ones being guilt-by-association and self-targeting. guilt-by-association involves searching for homologs of hth-containing proteins that are typically encoded downstream of acrs. such proteins are known as anti-crispr associated (aca) and are notably more conserved among viruses than the acrs themselves, which greatly facilitates their detection. the genomic neighborhoods encoding aca homologs are then searched for potential acrs. prokaryotic genomes containing crispr-cas systems that encompass spacers targeting regions of the same genome are known as self-targeting. in this case, the crispr-cas system should, in theory, target and kill the host cell. therefore, organisms with self-targeting genomes can only survive when they also carry acrs that prevent crispr-cas from functioning (or, perhaps, when they employ an alternative strategy for keeping the crispr-cas system silent) and thus keep the cell viable.
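the self-targeting idea described above (a crispr spacer that matches a region of its own genome outside the crispr array) can be illustrated with a short python sketch. this is a minimal illustration under simplifying assumptions and not the pipeline used in the study: it looks only for exact matches on either strand, ignores pam requirements and mismatches, and treats the crispr array as a single half-open coordinate interval. all names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_occurrences(text, pattern):
    """Yield every (possibly overlapping) start index of pattern in text."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def self_targeting_hits(genome, spacers, array_span):
    """Return (spacer, position, strand) for exact spacer matches that lie
    entirely outside the CRISPR array, given as a half-open (start, end) span."""
    genome = genome.upper()
    array_start, array_end = array_span
    hits = []
    for spacer in spacers:
        spacer = spacer.upper()
        for strand, pattern in (("+", spacer), ("-", reverse_complement(spacer))):
            for pos in find_occurrences(genome, pattern):
                outside = pos + len(pattern) <= array_start or pos >= array_end
                if outside:
                    hits.append((spacer, pos, strand))
    return hits


# Toy genome: the spacer sits in the array at 4..11, and its reverse
# complement appears again downstream at position 15.
toy_genome = "AAAA" + "GATTACA" + "CCCC" + "TGTAATC" + "GGGG"
toy_hits = self_targeting_hits(toy_genome, ["GATTACA"], array_span=(4, 11))
```

a genome with at least one such hit would be flagged as a self-targeting candidate; a realistic search would additionally tolerate mismatches and check for a compatible pam next to the match.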
despite the notable success of these two approaches, buttressed by experimental validation of many predictions, neither provides a comprehensive methodology for detecting acrs. in addition to their extreme sequence variability, acrs share few distinguishing characteristics outside of their common role in thwarting crispr. here, we describe a systematic machine-learning approach we developed to predict acrs, based on the few known acr attributes and a secondary screen using heuristics of known acrs, to further enrich for acr candidates. we show that this method is significantly predictive of acrs, compile a collection of previously undetected predicted acr families, and examine the top candidates in detail, including experimental validation.

characteristic features of the known acrs. the general concept behind our approach is to combine the few characteristics acrs tend to share into a detection model. our first step was therefore to assemble and quantify features that previously discovered acrs appear to have in common. to keep track of the known acrs, we relied on a combination of curated acr databases and our own manual data curation (supplementary table ). at the time of our data curation, acr families were known (supplementary table ). we used this original set to iteratively search for homologs in the nonredundant (nr) database at the ncbi using psi-blast and to construct a multiple protein sequence alignment for each acr family. we then used each of these alignments as the query for a psi-blast search against our local protein sequence dataset, which includes prokaryotic and prokaryotic virus proteins and consists of a total of , , proteins. all hits with an e-value below the threshold of e− were manually curated to eliminate obvious false positives, such as partial hits to very large proteins or hits to proteins with unambiguously assigned functions (unrelated to anti-crispr activity), in an effort to create a high-confidence acr set.
the final positive set consisted of acrs, spanning families (seven of the known acr families were not represented in our database; supplementary table , supplementary data ). the most striking and obvious common feature of the acrs is their small size (weighted mean acr length: aa, table ), and the tendency to form sets of small proteins that are encoded by co-directional and closely spaced genes in (pro)virus genomes (hereafter directons; fig. , table ). we hypothesize that these directons are largely made up of co-transcribed early anti-defense genes. beyond these distinctive features, we considered other protein characteristics that we suspected might be predictive, such as the spacing of protein-coding genes within a directon ("directon spacing", table ) or protein hydrophobicity ("mean hydrophobicity", table ). we also considered whether proteins had significant hits when searched against conserved domains from either the ncbi conserved domain database (cdd) or prokaryotic virus orthologous groups (pvog) using psi-blast (e-value < e− , "protein is annotated", table ), with the expectation that proteins with conserved domains likely perform other functions and therefore are unlikely to be acrs. in total, we constructed a set of features (table , see "methods" section for details) that, together, provided a compendium of quantifiable features that were used to identify acr candidates. training and test sets. to build a predictive model, a training set comprised of two components was required: a positive set, consisting of previously discovered acrs, and a negative set, consisting of proteins confidently inferred not to be acrs (non-acrs). for the positive set, the acrs were weighted by their family and interfamily similarities (supplementary data , see "methods" section for details), to ensure that related and highly similar acrs were not overrepresented in the training dataset. 
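as an illustration of how per-protein features of the kind listed above (protein length, mean hydrophobicity, directon spacing) might be computed, here is a minimal python sketch. the exact feature definitions belong to the methods section and are not reproduced here, so this code makes its own assumptions: hydrophobicity uses the kyte-doolittle scale (the scale used in the study is not stated in this passage), and directon spacing is taken as the mean gap between consecutive co-directional genes given as (start, end) coordinates. all names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not confirmed by the source).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(protein):
    """Mean hydropathy over the standard residues of a protein sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


def directon_spacing(genes):
    """Mean gap (in nucleotides) between consecutive genes of a directon.

    `genes` is a list of (start, end) tuples on the same strand, sorted by start.
    """
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(genes, genes[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def protein_features(protein, directon_genes):
    """Bundle a few illustrative features into a dict for one candidate."""
    return {
        "length": len(protein),
        "mean_hydrophobicity": mean_hydrophobicity(protein),
        "directon_spacing": directon_spacing(directon_genes),
    }


example = protein_features("AI", [(0, 100), (110, 200), (230, 300)])
```

feature vectors of this shape, one per candidate protein, are what a tree-based classifier would consume; the real feature set in the study is larger.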
because there is no well-defined, standard set of known non-acr proteins, we constructed the negative set by randomly selecting viral and prokaryotic proteins, under the assumption that the majority of proteins are non-acrs. the negative training dataset was constructed by randomly selecting proteins from a combination of randomly selected prokaryotic virus genomes and randomly selected crispr-cas-containing prokaryote genomes. similar to the positive set, we sought to avoid oversampling particular protein families. therefore, these proteins were clustered by sequence similarity, and for each cluster, a single representative was selected. we randomly selected proteins from this set to constitute the negative, non-acr set. during our work on the predictive model, an additional set of acrs was published , . we incorporated these into our analysis as an unseen test set, i.e., a set of acrs unavailable during the training stage that we could use to test our model against. thus, our training set consisted of all known acrs published before september (supplementary data ; positive set: n = , families; negative set: n = ), and the test set consisted of the acrs published after that date (supplementary data ; positive set: n = proteins, families; negative set: n = proteins). building and evaluating a predictive model. given our relatively small positive set, we sought to identify a model that would tend toward low variance. thus, we chose a random forest of extremely randomized trees . as an ensemble method with a highly random component, it is less likely than other machine-learning approaches to overfit the training data, while allowing a nonlinear mapping of features to label data and complex feature interactions. the model consisted of a random forest with decision trees. when training the model, each decision tree is built based on a random sampling of the training data. 
each split in the decision tree is determined by randomly selecting multiple values across a random subset of the features, and then setting the values that minimize the likelihood of misclassification as the thresholds for the decision tree split. thus, the final forest consists of decision trees, where each decision tree's leaf nodes correspond to members of the training set. when using the model to assess a candidate protein, the candidate traverses each decision tree. within each tree, it ends up in a leaf node that contains some mixture of acrs and non-acrs from the training set. the tree assigns the candidate a score that is equal to the fraction of acrs in its leaf. the score assigned by the model is the mean of the scores across all trees. using the model and the training set we developed, we assessed the performance of the model by five iterations of threefold cross-validation. in each iteration, the model was trained on two-thirds of the acr families, and its capacity to predict the families that were left out was assessed. for each protein in the test set, we predicted the likelihood of a protein being an acr using our random forest model.

fig. characteristics of known acrs. (a) a cartoon of a sample directon. acr proteins characteristically fall upstream of an hth-domain-containing gene, termed aca. acrs are usually found in suspected mobile genetic elements, such as phages. the acr directon is highlighted in gold, while the surrounding proteins are indicated in blue. characteristically, acrs fall in directons with small, unidentified proteins. (b) a density plot of acr lengths. the x-axis denotes the common logarithm of the protein length, in amino acids. the y-axis denotes the probability density function estimated from the data across the values of x. (c) a density plot of the mean lengths of proteins in acr directons. the x-axis denotes the common logarithm of the mean length of proteins in an acr directon, in amino acids. the y-axis denotes the probability density function estimated from the data across the values of x.

nature communications | https://doi.org/ . /s - - - article

given the imbalance in the weights of samples in the positive and negative sets, we down-weighted the negative set in training the model, so that its combined weight was equal to that of the positive set. this weighting was applied to both model training and assessment. we relied on receiver operating characteristic (roc) area under the curve (auc) to assess the model performance and used a genetic algorithm for feature selection. the roc is plotted based on the true-positive rates (the proportion of acrs that are correctly identified) and the false-positive rates (the proportion of non-acrs that are predicted as acrs). on average, across all cross-validation iterations, we found that our method was significantly predictive of acrs, with an auc of . (permutation p-value: . ). we next used the model to predict acrs in the unseen test set. the model was found to significantly distinguish acrs from non-acrs, with an auc of . (permutation p-value: . ; fig. ). this result indicates that our method is indeed predictive of acrs that are not present in the training set. we converted the scores output by the model into binary predictions by setting a threshold for classification that maximizes the cross-validation balanced accuracy in the training set (supplementary fig. a). the binary model achieves a precision value of % and a recall value of % on the test set (permutation p-value: . , supplementary fig. b, c).
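the evaluation procedure described above can be illustrated with a short python sketch. it is not the authors' code: the function names, the `families` label array, the default of 100 trees, and the choice to split positive and negative groups separately (so that every held-out fold contains both classes) are assumptions made for illustration. negative-class weights are rescaled so that both classes carry the same total weight, in training and in scoring, as the text describes.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score


def balanced_weights(y, weights):
    """Rescale negative-class weights so both classes have the same total weight."""
    y = np.asarray(y)
    w = np.asarray(weights, dtype=float).copy()
    pos, neg = (y == 1), (y == 0)
    w[neg] *= w[pos].sum() / w[neg].sum()
    return w


def family_cv_auc(X, y, weights, families, n_repeats=5, n_splits=3,
                  n_trees=100, seed=0):
    """Mean weighted ROC AUC over repeated, family-wise cross-validation.

    Assumes a family label is never shared between the two classes.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    weights, families = np.asarray(weights, dtype=float), np.asarray(families)
    rng = np.random.default_rng(seed)
    pos_groups = np.unique(families[y == 1])
    neg_groups = np.unique(families[y == 0])
    aucs = []
    for _ in range(n_repeats):
        # Reshuffle whole families into folds so related proteins stay together.
        pos_folds = np.array_split(rng.permutation(pos_groups), n_splits)
        neg_folds = np.array_split(rng.permutation(neg_groups), n_splits)
        for pos_fold, neg_fold in zip(pos_folds, neg_folds):
            test = np.isin(families, np.concatenate([pos_fold, neg_fold]))
            train = ~test
            model = ExtraTreesClassifier(
                n_estimators=n_trees,  # hypothetical value, not from the text
                random_state=int(rng.integers(2**31 - 1)),
            )
            model.fit(X[train], y[train],
                      sample_weight=balanced_weights(y[train], weights[train]))
            scores = model.predict_proba(X[test])[:, 1]
            aucs.append(roc_auc_score(
                y[test], scores,
                sample_weight=balanced_weights(y[test], weights[test])))
    return float(np.mean(aucs))
```

in this sketch each negative protein can simply be given its own family label, since the negative set already consists of one representative per cluster.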
the members of three of the six acr families assessed in the test set were detected most of the time (acrif -if ), whereas members of the remaining three families were detected less than half of the time, with the single member of acrie in the test set not detected by the model (supplementary fig. d, supplementary table ). using the model to predict acrs. having formally demonstrated the predictive power of our model on the test set of recently discovered acrs, we sought to leverage the model to generate a dataset that would be enriched for true acrs. we combined the model predictions with other enrichment approaches based on known acrs, under the expectation that this combination would ultimately enrich for true acrs, with the caveat that explicitly applying additional enrichment approaches skews the prediction performance away from that reported by the model, and might bias the resulting set to overlook acr families that are distant from known acrs. first, we sought to define an appropriate search space of proteins likely enriched for acrs. the initial dataset consisted of , , proteins of that most ( , , ) came from prokaryotes, and the rest were encoded by viruses ( , ). acrs are typically encoded either within prokaryotic virus genomes, or within prokaryotic genomic regions that appear to be integrated viruses (proviruses) or other mobile genetic elements (mges) , . we therefore identified a subset of the prokaryotic database that consisted of genomes containing complete crispr-cas systems , under the premise that these genomes are more likely to encompass prophages with acrs targeting the respective crispr-cas variants , . we further sought to limit the prokaryote protein set to proteins encoded by (putative) proviruses. 
although there are many methods for predicting complete proviruses and their boundaries, these fall short of comprehensive identification of provirus regions in prokaryotic genomes, primarily, because numerous proviruses are inactivated and partially deteriorated , . indeed, many of the known acrs are encoded in the vicinity of virus proteins , but not necessarily within clearly active proviruses encoding hallmark virus genes and bounded by well-defined provirus boundaries. therefore, instead of explicitly predicting proviruses, we enriched for virus-related sequence, by filtering the prokaryote protein set to the proteins encoded in the vicinity of known virus proteins (see "methods" section for details). the resulting combined dataset of prokaryotic viruses and suspected proviruses consisted of , , proteins. as these proteins are largely virus related, we expected this set to be enriched for acrs. this set of proteins was assessed with our random forest model that resulted in an initial set of , , candidate acrs. we further filtered these to retain only those that had no significant hits to cdd or pvog protein clusters (supplementary data ). heuristic filters were applied to each of these clusters, based on known acr characteristics, to further enrich the candidate set for true acrs. the hallmark characteristics of acrs are that they (i) are encoded upstream of hth proteins, and (ii) are found in self-targeting genomes . we therefore required each family to have at least one member that fulfills each of these criteria. after this filtering, , families remained. of these families, included known acrs from the initial positive set (supplementary table ). following this filtering using the hallmark acr characteristics, we developed and applied additional heuristic thresholds based on our initial observations. 
as genes encoding acrs tend to form small directons, we sought to estimate a heuristic maximum threshold for the mean directon size in a candidate family that would enrich our protein set for true acrs. we therefore searched for the threshold that, when applied, retained the largest fraction of the known acrs in our set of , , while filtering out as many of the candidate families as possible. to quantify this trade-off, we used the balanced accuracy metric, which is the mean of the fractions of correct classifications in the two groups. we found that a maximum mean directon size of five genes gave the highest balanced accuracy (see "methods" section for details). consequently, we removed protein families with an average directon size of more than five genes. after this filtering, families remained. of the remaining families, included known acrs from the initial positive set (supplementary table ). to eliminate additional false positives, we performed a psi-blast search of each protein family alignment against our sequence dataset and, under the premise that acrs are highly variable, fast-evolving proteins that are not known to be encoded outside the virus or provirus contexts, removed families with numerous homologs in diverse prokaryotes. we found that the heuristic cutoff value for the number of prokaryote homologs that maximized the balanced accuracy was . we therefore limited our set to clusters with no more than significant hits to the prokaryotic protein set. next, we enriched for virus proteins by limiting the set to families that either include at least one homolog encoded in a virus genome or have a small ratio of prokaryote homologs to provirus homologs. we found that the cutoff value for the prokaryote to provirus ratio that maximized balanced accuracy was . finally, we sought to exclude families that have numerous annotations when assessed with hhblits and thus include well-characterized non-acrs.
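the threshold searches described above can be sketched as follows. this is a hedged illustration rather than the authors' implementation: `balanced_accuracy` and `best_max_threshold` are hypothetical names, families without a known acr are treated as the group to be filtered out, and both groups are assumed to be non-empty.

```python
import numpy as np


def balanced_accuracy(is_known, kept):
    """Mean of two fractions: known-acr families kept, other families removed."""
    is_known = np.asarray(is_known, dtype=bool)
    kept = np.asarray(kept, dtype=bool)
    retained_known = kept[is_known].mean()
    removed_other = (~kept[~is_known]).mean()
    return 0.5 * (retained_known + removed_other)


def best_max_threshold(values, is_known):
    """Find the cutoff t (keep families with value <= t) with the highest
    balanced accuracy; ties are resolved in favor of the smallest cutoff."""
    values = np.asarray(values, dtype=float)
    best_t, best_score = None, -1.0
    for t in np.unique(values):  # candidate cutoffs in ascending order
        score = balanced_accuracy(is_known, values <= t)
        if score > best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score
```

the same routine applies to any feature filtered with a maximum cutoff, such as the mean directon size, the number of prokaryote homologs, or the number of hhblits hits; only the `values` array changes.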
we found the cutoff value that maximized balanced accuracy for the number of hhblits hits was . after this filtering, families remained. of the remaining families, included known acrs from the initial positive set (supplementary table ). although, by applying these heuristics, we likely filter out some true acrs predicted by the model and bias our predictions toward the characteristics of known acrs, we expect that, overall, this approach enriches the resulting protein set for true acrs. after applying the above filters, our enriched set consisted of protein families (fig. , supplementary data ).

fig. heuristic filtering of the acr candidates. (a) flowchart illustrating the heuristic filtering steps. the initial set consisted of , clusters and was first filtered for clusters that included at least one member with an hth-domain-containing protein encoded downstream, and at least one member from a self-targeting genome, two hallmark acr characteristics. four additional filters were applied, for mean directon size, number of hhblits hits, number of homologs, and enrichment for virus homologs. the thresholds were set based on the data presented here. (b) bar plot indicating the percentage of candidates and known acrs from the initial positive set at each filtering step. the red bars denote the percentage of acrs remaining at each step, and the blue bars denote the percentage of all candidates remaining at each step. the raw numbers of remaining known acrs from the initial positive set are displayed above each red bar.

characteristics of predicted acrs. we performed a psi-blast search of all candidate protein family alignments against a dataset of known acrs and acr-related sequences. for of these families, significant hits to the acr set were detected. of these protein families, included known acrs. the remaining four families with significant similarity to known acr-related sequences are homologous to uncharacterized proteins that are encoded within previously described acr directons, namely, in the genomic neighborhoods of acriia - in listeria monocytogenes, and all have been suspected of acr activity, although they did not show such activity when tested. these proteins were previously designated as orfa, orfb, and orfe. after removing these families, we obtained candidate acr families, consisting of , putative acrs. the mean size of a family was seven, the largest family included members, and nearly half of the families ( %) were singletons (fig. ). given the different cluster sizes, each predicted acr was assigned a weight inversely proportional to the size of the respective cluster, in order to ensure that related and highly similar predicted acrs were not overrepresented in summary statistics. specifically, each predicted acr was assigned a weight of /n, where n is the number of predicted acrs in its cluster. the predicted acrs have a weighted average size of aa, with a standard deviation (sd) of . (fig. a). as expected by design, the acr genes tend to form small directons (weighted mean: . ; weighted sd: . ) consisting of short genes (weighted mean of the protein sizes in the predicted acr directons: aa; weighted sd: ; fig. b). the weighted mean isoelectric point of the predicted acrs is . with a weighted sd of . , and the weighted mean hydrophobicity is − . with a weighted sd of . . per tmhmm and signalp predictions, a weighted % of predicted acrs have at least one putative transmembrane helix or signal peptide, which, as expected, is substantially less than the expectation based on the negative set ( %, table ). using jpred, we predicted the secondary structure of the consensus sequences in the predicted acr set. the mean percentage of amino acids contributing to alpha-helices was %, and the mean percentage of amino acids contributing to beta-sheets was %.
in the negative set, % and % of the proteins were predicted to contain at least one alpha helix or beta sheet, respectively, and the mean percentage of amino acids contributing to alpha-helices and beta-sheets was % and %, respectively. although these values do not differ substantially, we tested whether the distributions of the two categories differed significantly. we found a significant difference between the distributions of amino acids contributing to beta-sheets among the candidates and in the negative set (mann-whitney u test pvalue: . e− , supplementary fig. a) , but no such difference for alpha-helices (mann-whitney u test p-value: . , supplementary fig. b) . the candidates are distributed across a diverse set of species (n = , ). escherichia coli accounts for the largest share of candidate acrs at . %. peptoclostridium difficile ( . %) and encoding at least one acr. of the analyzed virus genomes, ( %) encode a single predicted acr, ( %) encode two, and the remaining ones ( %) encode three or more acrs. archaeal viruses are also represented in this set, with of the predicted acrs found in archaeal viruses. the maximum number of predicted acrs in a single virus strain was five, observed in ruegeria phage dss -p , four of which fell in the same hth-containing directon. the viruses that were found to most commonly encode more than one acr were mycobacterium phages, followed by bacillus and synechococcus phages. among the archaeal viruses, the viruses that were found to most commonly encode more than one acr were sulfolobales mexican rudivirus followed by sulfolobus islandicus viruses. we sought to examine the genomic context of the largest predicted acr clusters and gauge how often they tend to appear in similar genomic neighborhoods. we examined the ten largest acr clusters and generated a presence-absence matrix for the members of these clusters in different genomic neighborhoods (fig. 
) , with a genomic neighborhood defined as the ten genes upstream and downstream of each acr. each column is a genomic neighborhood (ordered by similarity) and each row represents an acr family. whereas the larger acr clusters in this subset tend to appear in similar genomic neighborhoods, within these neighborhoods, we also find scattered predicted acr singletons. this pattern is similar to what has been observed in known acrs, where the acrs present in a given directon vary across closely related strains, with some acrs appearing in nearly all instances of the directon and others appearing sporadically . case by case analysis of top acr candidates. we next examined in greater detail the top candidates from our acr candidate set. we constructed a set for in-depth examination, by filtering for clusters with more than four members and selecting the clusters with the highest mean model score. these top families were explored using hhpred , psi-blast against nr and examination of the genomic context for each candidate (supplementary data ). additionally, an overlapping but distinct set of top candidates from proteobacteria possessing associations with type i-c, i-e, or i-f were selected for experimental interrogation against these subtypes in pseudomonas aeruginosa (see "methods" section for details, supplementary table ). it has been previously shown that acrs are typically encoded in short directons consisting of small genes, usually including one gene encoding an hth-domain-containing protein , . this configuration has been observed for multiple acr families and numerous virus and provirus genomes. one well-characterized example of this configuration involves the acriia - families . members of one of our top five candidate acr clusters, candidate (hereafter c ), were found in suspected prophages and phages of l. monocytogenes, adjacent to acriia , with three quarters of the members of this family found in self-targeting genomes. 
at the time of our analysis, c was not found to be homologous to any of the previously discovered acriia genes. however, shortly after the completion of the analysis and while this manuscript was in preparation, preliminary results on testing c for an anti-crispr function have been reported independently . c has been identified as an anti-lmocas protein (acriia ), supporting the utility of our approach to discover acrs. members of the c cluster were identified in one phage (listeria phage b ) and four suspected prophages (one in listeria innocua and three in l. monocytogenes). all the prophage-encoded homologs were found in self-targeting genomes that carry cas-ii-a. three of these genomes also carry cas-i-b. all the prophage-encoded members of this cluster were found in bacterial genomes that also encoded acriia , and two of these also encoded acrs iia and iia . given that all the genomes encoding proteins of this family encompass cas-ii-a, we predict that this is the target of its anti-crispr activity, although targeting of cas-i-b is difficult to rule out. as is characteristic of known acrs, c homologs are typically encoded in short directons consisting of three genes. one of these genes contains an hth domain and is homologous to orfd of l. monocytogenes. orfd has been previously identified as a marker for acr directons and is a distant homolog of acriia although in itself, this protein has not been shown to possess acr activity . all members of this cluster are encoded adjacent to members of another predicted acr family, c . c includes three additional members that are not adjacent to c , but are all found in a directon with acriia and an additional candidate, c , in prophages of listeria strains solely containing cas-i-b. in the genomic neighborhoods of c , one instance includes an expanded version of acriia (encoded in l. monocytogenes l ) that contains an hth, whereas the remaining two instances of acriia lack an hth domain. 
however, an examination of the nucleotide sequences immediately upstream of the truncated acriia indicates that this truncation is likely to be an error in the sequence annotation, and that the n-terminal of these instances of acriia can be extended to match the acriia homolog in l. monocytogenes , including the hth domain. furthermore, the region of acriia that contains the hth domain is similar to the portion of orfd that contains an hth domain ( % identity), so that extended version of acriia appears to be a fusion of orfd and acriia . thus, candidates c , c , and c all contain the hallmark characteristics of known acrs, including their tendency to fall in known acr neighborhoods and next to known acr markers. this corroborating evidence greatly raises our confidence that these are true acrs and further validates the predictive power of the methodology. members of the c cluster, experimentally validated as acric (see below), were identified in one phage (rhodobacter rcapnl) and in three rcapnl prophages integrated in selftargeting genomes of rhodobacter capsulatus. c belongs to a small directon of three genes, with the second gene in the directon containing an hth domain. this hth-containing gene is a distant homolog of aca , a previously discovered gene associated with acrs, further supporting the prediction of anti-crispr functionality of c . the third gene in the directon is uncharacterized. the three self-targeting prophages containing the acr occur in genomes with two crispr systems, type i-c and type vi-a, either of that are potential targets of c . members of the c cluster were found in clostridium. half of the homologs were found in genomes that are selftargeting. as is characteristic of acrs, c genes typically belong to a small directon of - genes, where the second protein encoded in this directon contains an hth domain. the other proteins in the directon are uncharacterized. all the genomes in this set contain crispr i-c, a potential target of c . 
members of the c cluster, experimentally validated as acric (see below), were found in xanthomonas. eight of these genes were identified in xanthomonas translucens and one in xanthomonas sp. shu . eight of the nine homologs were found in self-targeting genomes. c tends to fall in a small directon of two genes, as is characteristic of known acrs, where the second gene in the directon contains an hth domain. all the genomes containing c have type i-c crispr systems, a potential target of c . to test these predictions, acr validation was conducted in p. aeruginosa strains expressing type i-c, i-e, or i-f crispr-cas systems targeting phage (see "methods" section for details). type i-e and i-f systems were expressed at endogenous levels with native spacers targeting phages jbd and dms m, respectively, whereas the type i-c system was expressed heterologously in strain pao , with an engineered spacer. the candidates associated with one of these three subtypes and present in pseudomonas, combined with members of the top acr set (supplementary data ) from proteobacteria, were selected for experimental validation. two candidates identified in this work (c and c , supplementary table ) were found to be homologous to two type i-c crispr inhibitors that have been independently identified via aca association . c is homologous to acric ( % amino acid identity, % coverage), and c is homologous to acric ( . % amino acid identity, % coverage). the remaining genes were synthesized and were successfully cloned into expression vectors (supplementary table ). in addition to acric and acric , anti-crispr activity was demonstrated for two proteins (acric and acric ) that also targeted the type i-c system (fig. , supplementary table ). acric is amino acids in length and highly acidic (pi = . ), much like previously described dna mimic acr proteins , . this protein was found to be highly active, fully inactivating type i-c crispr-cas. 
acric is amino acids and is a comparatively neutral protein (pi = . ) that displayed an ~ -fold weaker activity than acric , under our experimental conditions. acric and ic were the two highest confidence candidates tested, and acric and ic were the fourth and sixth highest confidence candidates tested, respectively. of the remaining candidate acr proteins, four were toxic in the type i-c strain and therefore were not tested against i-c, and only six were tested against type i-f before the laboratory shutdown due to covid- (supplementary table ). together, these results confirm that top-ranking predictions by our method are highly likely to be active acrs.

fig. identification of anti-crispr proteins acric and acric . phage dms m or jbd spot titration from left to right (tenfold serial dilutions) on lawns of p. aeruginosa strains expressing the indicated crispr-cas system (x-axis, type i-c, i-f, or i-e), with a crrna targeting the indicated phage and the indicated acr or empty vector (y-axis). during screening of all candidate acr proteins against all three systems, each combination was screened once. upon detecting inhibition, positive results were visually confirmed in triplicate.

the acrs are of major interest to a wide range of researchers, due both to their role in the evolutionary arms race between viruses and their prokaryotic hosts, and to their potential use as crispr-cas inhibitors in genome engineering applications. here, we demonstrate substantial predictive and discriminative power of a machine-learning approach for the identification of candidate acrs. this result appears unexpected given the paucity of distinctive features of the acrs. nevertheless, these few, rather generic features, including the small size of the acr genes, their arrangement in short directons that additionally contain genes for hth proteins, poor evolutionary conservation, association with viruses and proviruses, and self-targeting, seem to be sufficient for apparently robust acr prediction. the underlying reason seems to be that, in viruses of prokaryotes, a substantial fraction, often the majority, of the genes that are not directly implicated in virus replication and morphogenesis are involved in anti-defense functions. a notable example can be found among archaeal viruses, in some of which up to % of the genes appear to encode acrs. hence a possible caveat of our predictions: some of the genes that we predict as acrs might target other, non-crispr defense systems. conversely, the possibility exists that, using the approach described here, we only detect one, albeit major, class of acrs, whereas others might exhibit distinct properties. the above caveats notwithstanding, the combination of sensitive database searches, machine learning, and heuristic filters applied here yielded previously undetected families of strong acr candidates that comprise an extensive resource, which we make accessible online (http://acrcatalog.pythonanywhere.com/), for structural and functional studies on acr-crispr interactions, with likely subsequent applications. the experimental validation presented here and elsewhere confirmed many of the top predictions. genes that tested negative for crispr-cas inhibition against single representatives of type i-c, i-e, and i-f in p. aeruginosa could lack inhibitory activity in this assay for many reasons. they might be acrs specific for different variants of the tested subtypes or different subtypes altogether, or acrs that act at a different stage of immunity, such as spacer acquisition.
the three model strains used to represent the type i-c, i-e, and i-f systems do not necessarily reflect the potential interactions between the candidate acrs and diverse variants of these systems, or different crispr-cas types and subtypes present in the genomes where the acr candidate was found. future work will be required to test these candidates against relevant systems in the species of interest. lastly, some of these candidate acr proteins might inhibit other, non-crispr-based bacterial immune systems, given that, as recently shown, anti-defense genes show a strong tendency to cluster in mges . the signatures of acrs described in this work might apply broadly to inhibitors of other prokaryotic immune systems. the current database of prokaryotic virus genomes is limited in scope but grows rapidly, thanks, largely, to metagenomic discovery of numerous viruses . furthermore, so far, no targeted search for acrs in mges other than viruses, such as plasmids or transposons, has been performed. characterization of the distribution of acrs throughout the prokaryotic mobilome is a key next step to understanding the arms race that can be expected to lead to the discovery of numerous acrs. thus, the clear extension of this work involves searching the expanding virus genome databases, metagenomes, and other mge. iterative application of this strategy should greatly expand the diversity of acrs and, possibly, inhibitors of other defense systems. iterative search for acr homologs. for each acr family, a single representative sequence was selected, and a psi-blast search was run against the ncbi nr sequence database. iterative psi-blast was run to convergence, the identified homologs were aligned using muscle , and the resulting alignment was searched against our prokaryote dataset and our prokaryotic virus dataset from the ncbi viral genomes resource , using psi-blast. we used a cutoff of e-value ≤ e− for homolog detection and manually reviewed each resulting alignment. 
of the assessed families, seven were not detected in our database (supplementary data ). as the database used in this study was curated in (ref. ) and does not include all known proteins, and because acrs tend to be highly variable, with few homologs, it was not unexpected that these families were missed. all seven families not in our database were originally detected in strains that were not available at the time the database was constructed. weighting the acrs. for the positive set, we sought to weight each acr by its sequence similarity to the other acrs, in order to avoid oversampling closely related data points. initially, each acr family is assigned a weight of one. then, within each acr family, its member proteins were clustered using mmseq , with the parameters c = . and s = . (ref. ). each cluster is defined as a subfamily, and the initial weight of one given to the family is divided evenly amongst the subfamilies. following this, each subfamily's weight is divided evenly among its members. thus, each acr's weight is proportional to its similarity to other acrs in the set. for the negative set, an analogous procedure was followed. after randomly selecting a set of proteins as the negative set pool, these proteins were clustered using the same mmseq parameters as used for the acr families, and from each cluster, a single representative was selected. each representative protein was given a weight of one. in training and in assessing the model, the negative set was reweighted so that each class (acr and non-acr) had the same total weight. protein annotations. proteins in our dataset were annotated by applying a psi-blast search against cdd and pvog , with an e-value cutoff of e− . proteins with hits to pvogs were classified as viral. when enriching for true acrs using heuristics, proteins with hits to either cdd or pvog were eliminated. self-targeting assemblies. 
self-targeting assemblies were detected by blasting the spacers from each assembly in our dataset against the corresponding genome and filtering for exact matches. wherever an exact match was found, the respective assembly was classified as self-targeting (supplementary data ) . defining the features for the model. overall, total features were defined. some features related to the protein itself, while others relate to the protein's directon. a directon was defined as consecutive proteins on the same strand with a maximum of bp between adjacent proteins. the features were defined as follows: protein size: the length, in amino acids, of the candidate protein. directon size: the number of genes in the directon. mean directon protein size: the mean length, in amino acids, of all proteins in the directon. protein hydrophobicity: the protein's hydrophobicity . protein annotation: a binary score of whether the protein is annotated or not. we consider a protein as annotated if it has at least one significant hit to any alignment, outside of alignments annotated as hypothetical protein, putative predicted product, or provisional. fraction of directon that is annotated: the fraction of proteins in the directon that are annotated as defined above. hth-downstream: whether there is an hth-domain-containing protein encoded downstream of and adjacent to (within three genes) the acr candidate within the same directon. this feature was analyzed by running a psi-blast search of proteins against the subset of alignments from the pvog and cdd datasets containing in their name, or description either the term hth or helixturn-helix, with an e-value cutoff of e− . self-targeting: whether the protein is encoded in a self-targeting genome. predicted membrane association: whether the gene is predicted to be transmembrane or contain a signal peptide using tmhmm and signalp, respectively , . 
fraction of membrane-associated proteins in directon: the fraction of the proteins encoded in the directon that are predicted to be transmembrane or contain a signal peptide as defined above. directon spacing: the mean spacing between genes in the directon. whether genome is viral: whether the protein is encoded in a viral genome or in a prokaryotic genome. a genetic algorithm that selects subsets of features and creates different feature combinations while optimizing for the best feature set was applied to the features for ten generations, yielding the following feature set: (1) containing genome is self-targeting; (2) directon annotated protein fraction; (3) directon protein lengths mean; (4) directon size; (5) protein is annotated; (6) protein has hth-downstream; (7) protein length; (8) protein hydrophobicity. building the model. the model was constructed using scikit-learn (https://scikit-learn.org), specifically, the extratreesclassifier with the n_estimators parameter set to , meaning that the random forest consisted of trees. the rest of the parameters were left at default. a random forest was chosen to model the data given that it is an ensemble classifier that is less likely to overfit than other methods, while allowing a nonlinear mapping of features to labels and complex feature interactions. the model was trained on the training dataset described above, while downweighting the negative set so that each class (acr and non-acr) has the same total weight. the thresholds for each split in the random forest trees were selected to minimize gini impurity, which measures how often misclassification would occur when a randomly selected member of the node is randomly classified based on the distribution of labels in the node, calculated as follows: $I_G(n) = 1 - \sum_i p_i^2$, where $I_G(n)$ is the gini impurity for node $n$, and $p_i$ is the fraction of samples for class $i$ (either acr or non-acr) in node $n$.
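the gini impurity described above can be illustrated with a minimal python sketch. this is not the scikit-learn implementation; the function name `gini_impurity` and the optional `weights` argument are assumptions introduced here, with the weights standing in for the per-sample weights used when the two classes are rebalanced.

```python
def gini_impurity(labels, weights=None):
    """Gini impurity of a node: 1 - sum_i p_i^2.

    `labels` holds the class label of every sample in the node.
    `weights` (hypothetical argument) holds one non-negative weight per
    sample; p_i is then the weighted fraction of class i in the node.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    if len(labels) != len(weights):
        raise ValueError("labels and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("node must contain positive total weight")
    # Accumulate the total weight observed for each class.
    class_weight = {}
    for label, weight in zip(labels, weights):
        class_weight[label] = class_weight.get(label, 0.0) + weight
    return 1.0 - sum((w / total) ** 2 for w in class_weight.values())


# A pure node and an evenly mixed node (toy labels, not data from the study).
pure = gini_impurity(["acr"] * 5)
mixed = gini_impurity(["acr", "non-acr"] * 3)
# One heavily weighted acr balanced against three non-acr samples.
reweighted = gini_impurity(["acr", "non-acr", "non-acr", "non-acr"], [3, 1, 1, 1])
```

for two classes the impurity is largest when the (weighted) class fractions are equal, which is why a split that separates acrs from non-acrs lowers it.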
thus, the gini impurity reaches zero when all samples in the node fall into a single category. predictive scores were calculated by using the extratreesclassifier function predict_proba. when calculating binary predictions, the threshold was set to the best value for differentiation in the training set when maximizing accuracy, which was equal to . . defining the acr search space. the alignments of the pvog proteins were compared to the dataset of genomes containing crispr-cas. each directon containing a protein with a viral hit with an e-value < e− was considered a provirus-related sequence, along with the adjacent directons on either side. adjacent blocks of prophage-related directons (within bp of each other) were considered as provirus candidates. if the provirus candidate contained at least two virus hits within kb of each other, it was considered a predicted prophage. the set of virus proteins was assembled from the ncbi viral genomes resource, and subset to prokaryotic viruses based on taxonomy data (https://www.ncbi.nlm.nih.gov/genomes/genomesgroup.cgi?taxid= ). this virus set totaled , proteins encoded in genomes. permutation p-value calculation. to calculate permutation p-values, the model's predictions for the test set were shuffled. we then tested how well the model performed on this shuffled dataset. this procedure was repeated times, creating a null distribution of aucs. with this null distribution, a permutation p-value was calculated as follows. let $n_p$ be the number of aucs in the null distribution that are greater than or equal to the actual observed auc. the permutation p-value, then, is equal to $(1 + n_p)/(1 + N)$, where $N$ is the number of permutations. thus, when the actual auc was greater than any auc in the entire permuted set, the p-value was ~ . . clustering and weighting candidate acrs. candidate acrs were clustered using mmseq , with the parameters c = . and s = . (ref. ). a weight of $1/n_c$ was assigned to each cluster, where $n_c$ is the number of acr candidate clusters.
the weight of each cluster was then divided evenly among all protein members of the cluster, so that the weight of each acr was inversely proportional to the size of the cluster it belonged to. these weights were used when calculating summary statistics for the acr candidate set, to avoid oversampling closely related data points. psi-blast search against known acr and acr-related sequences. we created a sequence database of known acrs and acr-related sequences (supplementary data ). this database included all known acrs, acas, and proteins previously suspected of possessing acr activity, but not showing any when tested. we included the group of previously suspected acr proteins as these are proteins that bear acr characteristics, and therefore may be detected by our method, but have already been tested for acr activity. a psi-blast search was performed against this dataset of known sequences using each candidate acr cluster alignment as the query; clusters that produced hits with an e-value of < e− were discarded as belonging to known acr families or to families that have already been tested for the acr function. heuristic filtering. to choose the thresholds for all the heuristics except for self-targeting and hth-downstream, ten evenly spaced threshold values were tested, between the minimum acr value and the maximum acr value. each of these ten thresholds was applied as a cutoff to the acr families, and for each threshold the balanced accuracy was calculated. the balanced accuracy is equal to the mean of the percentage of known acrs that passed the threshold and the percentage of all proteins that were filtered by the threshold, so that a higher balanced accuracy corresponds to better discrimination between the known acrs and the rest of the candidates. the final threshold was selected so as to maximize the balanced accuracy. the selected threshold was then applied to the dataset. six heuristics were defined to further enrich the acr candidate set.
number of members that have hth-downstream: we required that at least one member of the candidate family have an hth-containing protein encoded downstream within the same directon. number of members in self-targeting or virus genome: we required that at least one member of the candidate family was either encoded in a self-targeting genome or encoded in a virus genome. mean directon length: the mean number of genes in the directon for all members of the family. number of homologs in prokaryotic dataset: a psi-blast search of the multiple protein alignment of each family was performed against the prokaryotic sequence dataset , and filtered for hits with a maximum e-value of e− , % identity and % query coverage. ratio of prokaryotic homologs to predicted provirus homologs: a psi-blast search of the multiple protein alignment of each family was performed against the predicted provirus sequence dataset and the virus sequence dataset, and filtered for hits with a maximum e-value of e− , % identity and % query coverage. if a family produced at least one hit to a virus sequence, it was included. if not, it was required that the ratio between the number of hits to the prokaryotic sequence dataset to the number of hits to the predicted provirus dataset was less than or equal to three. number of hhblits hits: the alignment of each family was compared to pfam and pdb (ref. ) using hhblits . families with > hits were discarded. construction of acr presence-absence matrix. to generate the presence-absence table, for the ten largest acr clusters, ten genes upstream and downstream were extracted where available (a maximum of genes total). if within this set, an additional predicted acr was represented, the set was further extended to include the ten genes upstream and downstream of that additional predicted acr. the resulting gene arrays were considered the acr genomic neighborhood. 
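the threshold selection described under heuristic filtering (ten evenly spaced cutoffs between the minimum and maximum acr values, scored by balanced accuracy) can be sketched as follows. this is an illustration under stated assumptions, not the authors' code: the function names, the `keep_below` flag (whether a value passes when it is at or below the cutoff), the inclusion of both endpoints among the ten cutoffs, and the use of fractions rather than percentages are all choices made here.

```python
def balanced_accuracy(threshold, acr_values, all_values, keep_below=True):
    """Mean of (fraction of known acrs passing) and (fraction of all proteins filtered)."""
    def passes(value):
        return value <= threshold if keep_below else value >= threshold

    acr_pass = sum(passes(v) for v in acr_values) / len(acr_values)
    filtered = sum(not passes(v) for v in all_values) / len(all_values)
    return (acr_pass + filtered) / 2.0


def select_threshold(acr_values, all_values, n_thresholds=10, keep_below=True):
    """Return the evenly spaced cutoff that maximizes balanced accuracy."""
    low, high = min(acr_values), max(acr_values)
    # Evenly spaced candidates from the minimum to the maximum acr value.
    candidates = [
        low + (high - low) * i / (n_thresholds - 1) for i in range(n_thresholds)
    ]
    return max(
        candidates,
        key=lambda t: balanced_accuracy(t, acr_values, all_values, keep_below),
    )


# Toy values for one heuristic (hypothetical, not taken from the study).
known_acr_values = [0.0, 9.0]
all_protein_values = [0.0, 9.0, 20.0, 30.0]
best = select_threshold(known_acr_values, all_protein_values)
best_score = balanced_accuracy(best, known_acr_values, all_protein_values)
```

in this toy case only the largest cutoff lets both known acrs pass while still filtering half of all proteins, so it gives the highest balanced accuracy.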
a binary matrix was constructed where each column is a genomic neighborhood, ordered by content similarity, and each row is a predicted acr family. in addition to the acrs from the top ten largest clusters, those encoded within ten genes upstream or downstream of acrs from the largest clusters were included. each cell represents the presence or absence of a member of the respective acr family in the neighborhood. manual assessment of candidates. the multiple alignment for each of the top candidates in supplementary data was compared against the pdb, pfam, and ncbi cd databases using hhpred . for each candidate, we calculated a consensus sequence, where the consensus letter for an alignment position was defined as the amino acid that has the highest blosum score among the amino acids occupying the position. a psi-blast search of the consensus sequence of each candidate family was performed against nr, and the genomic contexts of homologs were visually assessed using geneious prime. plasmid preparation. all candidate proteins were reverse translated and codon optimized for p. aeruginosa pao using idt codon optimization tool. gene fragments (twist biosciences) were cloned into the saci/psti site in the pherd t vector using gibson assembly. the resultant plasmids were selected with µg/ml gentamicin and propagated in e. coli strain dh ɑ. transformation. all plasmids were transformed via electroporation into p. aeruginosa strains ll (a pao derivative with the type i-c cas genes integrated in the chromosome), smc (native type i-e), and pa (native type i-f) to test for inhibition of the type i-c, type i-e, and type i-f crispr-cas immune systems, respectively. transformation was performed using p. aeruginosa cultures grown overnight in lb medium at °c with shaking. 
to make cells electrocompetent, ml of overnight culture was pelleted, resuspended in ml of % glycerol, and then pelleted and resuspended twice more, with the final resuspension done with only µl of glycerol solution. a total of µl (~ ng) of plasmid was added to the electrocompetent cells, and the mixtures were allowed to sit on ice for min. the cells/dna mixture was transferred to cuvettes and electroporated using the biorad gene pulser xcell electroporation systems preset p. aeruginosa setting. immediately after electroporation, ml of lb was added to each cuvette. the cells were transferred from the cuvettes to . ml eppendorf tubes and recovered for h at °c with shaking. the recovered cells were pelleted, the top - µl of supernatant was removed, and then the cells were resuspended in the remaining supernatant. a total of µl of cells were then spread with glass beads onto lb agar plates with µg/ml gentamicin. the plated cells were allowed to grow overnight at °c. plaque assay for crispr-cas activity. single colonies of the three testing p. aeruginosa strains with the candidate plasmids were grown overnight in . ml of lb medium with µg/ml gentamicin. a total of µl of each overnight culture were then mixed with . ml of molten top agar (supplemented with mm iptg for ll ) in small glass tubes. the resultant agar-bacteria mixture was then poured onto circular lb agar plates with µg/ml gentamicin, . % arabinose, and mm mgso . after being left to dry for min, tenfold serial dilutions of crisprtargeted bacteriophage, ranging from to − were pipetted onto the plates, and the plates were then incubated overnight at °c. phages jbd , jbd , and dms m were used to assay type i-c, type i-e, and type i-f crispr-cas activity, respectively. plaque assays were conducted on standard petri plates, cm in diameter. reporting summary. further information on research design is available in the nature research reporting summary linked to this article. 
supplementary information is available for this paper at https://doi.org/ . /s - - - . correspondence and requests for materials should be addressed to e.v.k. peer review information nature communications thanks alexander hynes, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. peer reviewer reports are available. open access this article is licensed under a creative commons attribution . international license. this is a u.s. government work and not under copyright protection in the u.s.; foreign copyright protection may apply. the authors declare that the data supporting the findings of this study are available within the paper and its supplementary information files. the ncbi nr sequence database is available at https://ftp.ncbi.nlm.nih.gov/blast/db/ under the file names "nr.*.tar.gz".
other data are available from the corresponding author upon reasonable requests. source data are provided with this paper. the model code and sample data are available on github (https://github.com/gussow/acr). received: january ; accepted: july ; the authors thank koonin group members for helpful discussions. this research was supported by the intramural research program of the national library of medicine at the nih. j.b.-d. is a scientific advisory board member of snipr biome and excision biotherapeutics, and a scientific advisory board member and co-founder of acrigen biosciences.

key: cord- -uyswj ow authors: melin, amanda d.; janiak, mareike c.; marrone, frank; arora, paramjit s.; higham, james p. title: comparative ace variation and primate covid- risk date: - - journal: commun biol doi: . /s - - -w sha: doc_id: cord_uid: uyswj ow the emergence of sars-cov- has caused over a million human deaths and massive global disruption. the viral infection may also represent a threat to our closest living relatives, nonhuman primates. the contact surface of the host cell receptor, ace , displays amino acid residues that are critical for virus recognition, and variations at these critical residues modulate infection susceptibility. infection studies have shown that some primate species develop covid- -like symptoms; however, the susceptibility of most primates is unknown. here, we show that all apes and african and asian monkeys (catarrhines) exhibit the same set of twelve key amino acid residues as human ace . monkeys in the americas, and some tarsiers, lemurs and lorisoids, differ at critical contact residues, and protein modeling predicts that these differences should greatly reduce sars-cov- binding affinity. other lemurs are predicted to be closer to catarrhines in their susceptibility. our study suggests that apes and african and asian monkeys, and some lemurs, are likely to be highly susceptible to sars-cov- .
urgent actions have been undertaken to limit the exposure of great apes to humans, and similar efforts may be necessary for many other primate species. in late a novel coronavirus, sars-cov- , emerged in china. in humans, this virus can lead to the respiratory disease covid- , which can be fatal. since then, sars-cov- has spread around the world, causing widespread mortality, and with major impacts on societies and economies. while the virus and its resulting disease represent a major humanitarian disaster, they also represent a potentially existential risk to our closest living relatives, the nonhuman primates. transmission incidences of bacteria and viruses, including another coronavirus (h-cov-oc ), from humans to wild populations of nonhuman primates have previously been linked to outbreaks of ebola, yellow fever, and fatal respiratory diseases, leading in some cases to mass mortality [ ] [ ] [ ] [ ] [ ] [ ] [ ] . such past events raise considerable concerns among the global conservation community with respect to the impact of the current pandemic. infection studies of rhesus monkeys, long-tailed macaques, and vervets as biomedical models have made it clear that at least some nonhuman primate species are permissive to sars-cov- infection and develop symptoms in response to infection that resemble those of humans following the development of covid- , including similar age-related effects [ ] [ ] [ ] [ ] [ ] [ ] . recognizing the potential danger of covid- to nonhuman primates, the international union for the conservation of nature (iucn), together with the great apes section of the primate specialist group, released a joint statement on precautions that should be taken for researchers and caretakers when interacting with great apes. however, the risk for many primate taxa remains unknown. here we begin to assess the potential likelihood that our closest living relatives are susceptible to sars-cov- infection.
while the biology underlying susceptibility to sars-cov- infection remains to be fully elucidated, the viral target is well established. the sars-cov- virus binds to the cellular receptor protein angiotensin-converting enzyme- (ace ), which is expressed on the extracellular surface of endothelial cells of diverse bodily tissues, including the lungs, kidneys, small intestine, and renal tubes . ace is a carboxypeptidase whose activities include regulation of blood pressure and inflammatory response through its role in cleaving the vasoconstrictor angiotensin ii to produce angiotensin - and triggering varied downstream responses [ ] [ ] [ ] [ ] . ace is made up of a signal sequence at the n terminus (residues - ), a transmembrane sequence at the c terminus (residues - ), and an extracellular region, which contains a zinc metallopeptidase domain (residues - ) and a collectrin homolog (residues - ) , . characterizations of the infection dynamics of sars-cov- have demonstrated that the binding affinity for the human ace receptor is high, which is a key factor in determining the susceptibility and transmission dynamics. when compared to sars-cov, which caused a serious global outbreak of the disease in - , , the binding affinity between sars-cov- and ace is estimated to be between fourfold - and -to -fold greater . recent reports describing structural characterization of ace in complex with the sars-cov- spike protein receptorbinding domain (rbd) [ ] [ ] [ ] [ ] allow identification of the key binding residues that enable the host-pathogen protein-protein recognition. following the initial binding of the virus to the ace receptor, humans experience a great deal of variation in response to infection, with some individuals experiencing relatively mild symptoms, while others experience major breathing problems and organ failures, which can lead to death. 
some of this response is known to be linked to variation in how the immune system responds to infection, with some individuals experiencing a hyperinflammatory 'cytokine storm', which in turn aggravates respiratory failures and increases mortality risk , . there may also be some variation among humans in initial susceptibility to infection, such that approaches examining variation in ace tissue expression and gene sequences can offer insight into variation in human susceptibility to covid- [ ] [ ] [ ] [ ] . similarly, we can use such an approach to compare sequence variation across species, and hence try to predict the likely interspecific variation in susceptibility to initial infection. previous analysis of comparative variation at these sites enabled estimates of the affinity of the ace receptor for sars-cov in nonhuman species (bats) . here, we undertake such an analysis for sars-cov- across the primate radiation. our aim is to investigate the likelihood of initial susceptibility to infection for different major radiations and species while recognizing that downstream processes such as immune responses are likely to determine the extent to which species and individuals develop symptoms and pathologies in response to infection. we compiled ace gene sequence data from primate species for which genomes are publicly available, covering primate taxonomic breadth. for comparison, we assessed species of other mammals that have been tested directly for sars-cov- susceptibility in laboratory infection studies . we also included in our analysis the amino acid sequence variation at these sites for horseshoe bats, thought to be the original vector of the virus, and pangolins, a potential intermediate host, where viral recombination may have led to the novel viral form sars-cov- . 
we assessed the variation at amino acid residues identified as critical for ace recognition by the sars-cov- rbd and undertook an analysis of positive selection and protein modeling to gauge the potential for adaptive differences and the likely effects of protein variation. our aim was to develop predictions about the susceptibility of our closest living relatives to sars-cov- as a resource for stakeholders, including researchers, caretakers, practitioners, conservationists, and governmental and non-governmental agencies. variation in ace sequences. the ace gene ( bp) and translated protein ( amino acids) sequences are strongly conserved across primates. the average pairwise identity across primate species is . % for the ace nucleotide sequence and . % for the protein sequence, with a pairwise similarity (blosum ≥ ) of . % (supplementary data - ). out of bp, bp ( . %) are identical, while bp ( . %) are phylogenetically-informative sites for primates, and gene trees we generated ( supplementary fig. s a , b) closely recapitulate the currently accepted phylogeny of primates ( fig. ). in particular, the twelve sites in the ace protein that are critical for binding of the sars-cov- virus are invariant across the catarrhini, which includes great apes, gibbons, and monkeys of africa and asia (fig. ) . furthermore, catarrhines do not vary at any of the sites identified by alanine scanning (supplementary table s and supplementary fig. s ). the other major radiation of monkeys, those found in the americas (platyrrhini), have ace sequences that are less similar to humans across the length of the protein ( . - . % identical to h. sapiens, supplementary data ) but conserved within their clade (average pairwise identity . %, supplementary data ). they share nine of twelve critical amino acid residues with catarrhine primates; the three sites that vary from catarrhines, h , e , and t , are conserved within the platyrrhines. 
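the pairwise identity figures reported above can be illustrated with a short python sketch of how percent identity is computed from aligned sequences and averaged over all species pairs. the gap handling (columns where both sequences carry a gap are ignored, all other columns are compared), the function names, and the toy sequences are assumptions made for illustration; alignment software may define identity differently, and these are not ace sequences from the study.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # assumption: shared gap columns carry no information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def mean_pairwise_identity(alignment):
    """Average percent identity over all unordered pairs in an alignment."""
    pairs = list(combinations(alignment.values(), 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical ten-residue alignment; species names are placeholders.
toy_alignment = {
    "species_1": "MKTAYHQELT",
    "species_2": "MKTAYHQELT",
    "species_3": "MKTAYHQELS",
}
mean_identity = mean_pairwise_identity(toy_alignment)
```

the same function applies to nucleotide alignments, since it only compares characters column by column.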
strepsirrhine primates and tarsiers, were more variable in the binding sites and less similar to the human protein across the length of the sequence ( . - . % pairwise identity, supplementary data ). like platyrrhines, the tarsier (carlito syrichta), mouse lemur (microcebus murinus), and galago (otolemur garnettii) have an h residue, while the sifaka (propithecus coquereli), aye-aye (daubentonia madagascariensis), and the blue-eyed black lemur (eulemur flavifrons) have the same allele as humans and other catarrhines, y . in non-primate mammals, a higher number of amino acid substitutions are evident ( . - . % pairwise identity to h. sapiens, supplementary data ), including at critical binding sites. all species possess a different residue to primates at site . bats are exceptionally variable within the binding sites, with the genus rhinolophus alone encompassing all of the variation seen in the rest of the non-primate mammals. where primates have glutamine (q ), bats have glutamate (e ), lysine (k ), leucine (l ), or arginine (r ) (fig. ). all fasta alignments of ace gene and protein sequences are available in supplementary data - , a full-length protein alignment is also shown in supplementary fig. s , and distance matrices are provided in supplementary data - . analysis of species-specific residues on ace -rbd interactions. the ace receptors of all catarrhines have identical residues to humans at the rbd/ace binding interface across all critical sites, and are predicted to have a similar binding affinity for sars-cov- . platyrrhines diverge from catarrhines at three of the twelve critical amino acid residues. compared to catarrhine ace , the platyrrhines' ace is predicted to bind sars-cov- rbd with a roughly -fold reduced affinity (ΔΔg bind = . kcal/mol) ( table ). 
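a change in binding free energy and a fold change in affinity, as quoted above for platyrrhine ace , are linked by the standard thermodynamic relation fold = exp(ΔΔg / rt). the sketch below shows that conversion; the temperature of 298.15 k, the sign convention (positive ΔΔg means weaker binding), and the function names are assumptions made here, and the source does not state how the authors performed the conversion.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # gas constant in kcal / (mol * K)


def fold_change_in_affinity(ddg_kcal_per_mol, temperature_k=298.15):
    """Fold reduction in binding affinity implied by a ddG_bind (kcal/mol).

    Assumes positive ddG means weaker binding, so the returned value is the
    factor by which the dissociation constant increases.
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return math.exp(ddg_kcal_per_mol / (GAS_CONSTANT_KCAL * temperature_k))


def ddg_from_fold_change(fold, temperature_k=298.15):
    """Inverse relation: ddG_bind (kcal/mol) for a given fold reduction."""
    if fold <= 0:
        raise ValueError("fold change must be positive")
    return GAS_CONSTANT_KCAL * temperature_k * math.log(fold)


# Illustrative value only: the ddG that corresponds to a tenfold reduction.
ddg_tenfold = ddg_from_fold_change(10.0)
recovered_fold = fold_change_in_affinity(ddg_tenfold)
```

because the relation is exponential, contributions from several substitutions that add in ΔΔg multiply in fold change.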
in particular, the change at site from y to h found in monkeys in the americas has the largest impact of any residue change examined (table ), which alone is predicted to lead to a -fold decrease in the binding affinity to sars-cov- (fig. ). this single mutation combined with additional substitutions, especially q e, found in platyrrhines is predicted to substantially reduce the likelihood of successful viral binding (table ). of the other primates modeled, two of the three strepsirrhines, and tarsiers, also have the h residue and furthermore have additional protein sequence differences leading to further decreases in predicted binding affinity.

fig. ace protein sequence alignment and evolutionary relationships of study species. branch lengths represent the evolutionary distance (time, in millions of years) estimated from timetree. we outline amino acid residues at critical binding sites for the sars-cov- spike receptor-binding domain. solid outlines highlight sites predicted to have the most substantial impact on viral binding affinity. notably, protein sequences of catarrhine primates are highly conserved, including uniformity among amino acids at all binding sites. primate species that are able to be successfully infected with covid- are indicated in red. predicted susceptibility to covid- for other primates is additionally coded by terminal branch colors. we use the nomenclature cebus capucinus to be consistent with the species name used in the genome annotation but note the recent adoption of cebus imitator for this species. silhouettes are from phylopic.org and available under the public domain dedication . license, with the exception of cebus (sarah werning; creative commons attribution . unported).

the predicted binding affinity of tarsier ace is the most dissimilar to humans and this primate might be the least susceptible of the species we examine.
in contrast, coquerel's sifaka (propithecus coquereli), the aye-aye (daubentonia madagascariensis), and a blue-eyed black lemur (eulemur flavifrons) share the same residue as humans and other catarrhines at site and have projected affinities that are near to humans (table ). other mammals included in our study -ferrets, cats, dogs, pigs, pangolin, and two of the seven bat species (r. pusillus and r. macrotis) -show the same residue as humans (y) at site , with accompanying strong affinities for sars-cov- . the remaining five sister species of bats possess h and lower binding affinities (table ) . adaptive evolution of ace sequences. we find evidence that the selective pressures acting on ace are not equivalent across the major clades in our analysis. the codeml clade model c provided a better fit than the null model (lrt = . , p < . ; table , supplementary table s ) (table ). in catarrhines, the three positively selected sites identified by beb calculations are not near the binding sites for sars-cov- (residues , , and ; table ). our results strongly suggest that catarrhines -all apes, and all monkeys of africa and asia, are likely to be susceptible to infection by sars-cov- . there is high conservancy in the protein sequence of the target receptor, ace , including uniformity at all identified and tested major binding sites. indeed, even among the residues identified in our full list of potential binding points, catarrhines are invariant (supplementary table residues between platyrrhines and catarrhines, and two of these, h y and e q show strong evidence of being impactful changes. these amino acid changes are modeled to reduce the binding affinity between sars-cov- and ace by ca. -fold. 
recent clinical analysis of viral shedding, viremia, and histopathology in catarrhine (macaque) versus platyrrhine (marmoset, callithrix jacchus) responses to inoculation with sars-cov- , show much more severe presentation of disease symptoms in the former, strongly supporting our results . similar reduced susceptibility is predicted for tarsiers, and two of the five lemurs and lorisoids (strepsirrhines). what is concerning is that three of the analyzed lemurs spanning divergent lineages-the coquerel's sifaka, the aye-aye, and the blue-eyed black lemur-are more similar to catarrhines at important binding sites, including possessing the high-risk residue variant at site , and as such are also predicted to be susceptible. nonetheless, these are only predicted results based on amino acid residues and protein-protein interaction models. we urge extreme caution in using our analyses as the basis for relaxing policies regarding the protection of platyrrhines, tarsiers or any strepsirrhines. experimental assessment of synthetic protein interactions can now occur in the laboratory, e.g. , and confirmation of our model predictions should be sought before any firm conclusions are reached. emerging evidence in experimental mammalian models appears to support our results; dogs, ferrets, pigs, and cats have all shown some susceptibility to sars-cov- but have demonstrated variation in disease severity and presentation, including across studies , . substitutions at binding sites might be at least partially protective against covid- in these mammals. for example, the limited experimental evidence to date suggests that while cats -which have the same residue as humans at site -are not strongly symptomatic, they present lung lesions, while dogs-which have a substitution at this site-do not . the amino acid residue at site differs from primates in all other mammalian species examined. 
however, our models suggest that the variant residues may confer relatively minor reductions in binding affinity. other sources of variation may affect ace protein stability. our results are also consistent with previous reports that ace genetic diversity is greater among bats than that observed among mammals susceptible to sars-cov-type viruses. this variation has been suggested to indicate that bat species may act as a reservoir of sars-cov viruses or their progenitors. intriguingly, all but bat species we examined have the putatively protective variant, h . additionally, results of our codeml branch-site analysis support previous findings of ace in bats being under positive selection, including sites within the binding domain of sars-cov and sars-cov- , which may be evidence of host-virus coevolution. sites showing evidence of positive selection within catarrhine ace sequences were not in or near known cov binding sites (table and fig. ). two (residues , ) fall within the cleavage site (residues - ) utilized by the sheddase adam , known to interact with ace . however, neither of the residues under selection is an amino acid targeted by adam , leaving the functional significance of evolution at these sites uncertain. further clinical and laboratory study is needed to fully understand infection dynamics.
table results of codeml analyses of adaptive evolution across ace gene sequences.
there are a number of important caveats to our study. firstly, all of our predictions are based on interpretations of gene and resultant amino acid sequences, rather than on direct assessment of individual responses to induced infection. nonetheless, the overall pattern of our results is being borne out by infection studies on a few species that are used as biomedical models. so far, all catarrhine species tested by infection studies, including rhesus macaques, long-tailed macaques, and vervet
monkeys , , have exhibited covid- -like symptoms in response to infection, including large lung and other organ lesions and cytokine storms . in contrast, marmosets did not exhibit major symptoms in response to infection . while these results support and validate our findings based on ace sequence interpretation, the number of primate species that can and will be tested directly by infection studies will be restricted to just a handful. our study enhances this picture, by allowing inferences to be made across the primate radiation, backed up by the published infection studies on a few target model species. some of our results, such as the uniform conservation of ace binding sites among catarrhines, backed up by the demonstrated high susceptibility of humans and other catarrhines to sars-cov- , should give a good degree of confidence of high levels of risk. given the identical residues of humans to other apes and monkeys in asia and africa at the target sites, it seems unlikely that the ace receptor and the sars-cov- proteins would not readily bind. our results for other taxa are dependent on modeling, hence should be treated more cautiously. this includes all interpretations of the susceptibility of platyrrhines and strepsirrhines, where the effects of residue differences on binding affinities have been estimated based on protein-protein interaction modeling. another caveat is that we have modeled only interactions at binding sites, and not predictions based on full residue sequence variation. residues that are not in direct contact may still affect binding allosterically. other factors, including proteases necessary for viral entry, and other viral targets, may also impact disease susceptibility and responses . more generally, if adhering to the precautionary principle, then our results highlighting higher risks to some species should be taken with greater gravity than our results that predict potential lower risks to others. 
another limitation of our study is that we have looked at only primate species, albeit with broad taxonomic scope. analysis of additional species is important, especially among strepsirrhine species, where our coverage is relatively scant. in particular, the residue overlap at important binding sites in the sequences of coquerel's sifaka, the aye-aye, and blue-eyed black lemur with those of catarrhines suggests many lemurs may be highly vulnerable, and we underscore the need to assess a wider diversity of lemur species. furthermore, we examine only one individual per species, and intraspecific variation across populations should be considered; however, studies on intraspecific ace variation in humans and vervet monkeys suggest ace variants are low in frequency [ ] [ ] [ ] . finally, it is also important to remember that our study assesses only the potential for the initial binding of the virus to the target site. downstream consequences of infection may differ drastically based on species-specific proteases, genomic variants, metabolism, and immune system responses. in humans, the development of covid- can lead to a pro-inflammatory cytokine storm of hyperinflammation, which may lead to some of the more severe impacts of infection. nonetheless, it is evident from the hundreds of thousands of deaths and global lockdown that humans are highly susceptible to sars-cov- infection, and our results suggest that all apes and monkeys in africa and asia are similarly susceptible. many endangered primate species are now found only in very small populations. for example, there are believed to be only around mountain gorillas left in their entire range. with such small populations, the introduction of a new highly infectious disease is of serious concern. re-opening access to habituated great ape groups for tourism purposes, which may be critical to local economies, may be fraught with issues.
iucn best practices recommend that tourists stay at least meters away from great apes , but in practice, almost all tourists get far closer than this -for example, the average distance that tourists get from mountain gorillas at the bwindi impenetrable national park in uganda is just . m . a concerted effort may be required by all stakeholders to try to avoid the introduction of sars-cov- into wild primate populations . recent measures suggested by the iucn for researchers and caretakers of great ape populations include: ensuring that all individuals wear clean clothing and disinfected footwear; providing hand-washing facilities; requiring that a surgical face mask be worn by anyone coming within m of great apes; ensuring that individuals needing to cough or sneeze ideally leave the area, or at least cough/sneeze into the crux of their elbows; imposing a -day quarantine for all people arriving into great ape areas who will come into frequent close proximity with them . the iucn's 'best practice guidelines for health monitoring and disease control in great ape populations' should also be followed . our results suggest that dozens of nonhuman primate species, including all of our closest relatives, are likely to be highly susceptible to sars-cov- infection, and vulnerable to its effects. major actions may be needed to limit the exposure of many wild primate populations to humans. this is likely to require coordinated input from all stakeholders, including local communities, international and national governmental agencies, nongovernmental conservation and development organizations, and academics and researchers. while the focus of many at this time is rightly on mitigating the humanitarian devastation of covid- , we also have a duty to ensure that our closest living relatives do not suffer from devastating infections and further population declines in response to yet another human-induced catastrophe. variation in ace sequences. 
we compiled ace gene sequences for catarrhine primates: species from all genera of great ape (gorilla, pan, pongo), genera of gibbons (hylobates, nomascus), and species of african and asian monkeys in genera (cercocebus, chlorocebus, macaca, mandrillus, papio, rhinopithecus, piliocolobus, theropithecus); genera of platyrrhines (monkeys from the americas: alouatta, aotus, callithrix, cebus, saimiri, sapajus); species of tarsier (carlito syrichta); and genera of strepsirrhines (lemurs and lorisoids: eulemur, daubentonia, microcebus, propithecus, otolemur) (supplementary table s ). we also included four species of mammals that have been tested clinically for susceptibility to sars-cov- infection , including the domestic cat (felis catus), dog (canis lupus familiaris), pig (sus scrofa), and ferret (mustela putorius furo). finally, we included the pangolin (manis javanica) and several bat species, including horseshoe bats (rhinolophus spp., hipposideros pratti, myotis daubentonii). sequences were retrieved from ncbi, either from annotations of published genomes or from genbank entries . we manually checked annotations by performing tblastn searches of the human ace protein sequence against each genome. we identified one misannotation for exon in microcebus murinus, which we manually corrected. the ace nucleotide sequence for alouatta palliata was obtained from an unpublished draft genome, via tblastn searches using the cebus ace protein sequence as a query and default search settings. accession numbers for sequences retrieved from ncbi and genbank are provided in supplementary table s and the alouatta palliata sequence is available in supplementary data . coding sequences were translated using geneious version . . and we aligned both nucleotide and amino acid sequences with mafft . amino acids were aligned with the blosum scoring matrix, while the pam scoring matrix was used for nucleotides. a . gap open penalty and an offset value of . were used for both. 
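as a rough illustration of translating a coding sequence and screening it for premature stop codons, a minimal python sketch is given here. it is not the geneious workflow used in the study: the function names (`translate_cds`, `has_premature_stop`) and the toy sequences are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        # A length that is not a multiple of three hints at a frameshifting indel.
        raise ValueError("coding sequence length is not a multiple of three")
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )


def has_premature_stop(protein: str) -> bool:
    """True if a stop ('*') occurs anywhere before the final position."""
    return "*" in protein[:-1]


# Hypothetical toy sequences, not taken from the study's data.
intact = translate_cds("ATGTCTTCATGA")
broken = translate_cds("ATGTAATCATGA")
print(intact, has_premature_stop(intact))
print(broken, has_premature_stop(broken))
```

a check of this kind only flags candidate problems; manual inspection of the alignment is still needed to decide whether a flagged sequence reflects a real feature or an annotation error.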
we manually inspected and corrected any misalignments, and verified the absence of indels and premature stop codons. to visualize patterns of gene conservation across taxa and identify the congruence of the ace gene tree with currently accepted phylogenetic relationships among species, we reconstructed trees using both bayesian (mrbayes . . ) and maximum likelihood (raxml . . ) methods with , mcmc cycles and bootstrap replicates, respectively (code available on github ). gene trees were compared to a current species phylogeny assembled using timetree , which is also used to illustrate the evolutionary relationships between study species in fig. . phylogenetically-informative sites along the ace sequence were identified with the pis function in the r package ips v. . . , . identification of critical binding residues and species-specific ace -rbd interactions. critical ace protein contact sites for the viral spike protein receptor-binding domain (rbd) have been identified using cryo-em and x-ray crystallography structural analysis methods [ ] [ ] [ ] [ ] . the ace -rbd complex is characteristic of protein-protein interactions (ppis) that feature extended interfaces spanning a multitude of binding residues. experimental and computational analyses of ppis have shown that a handful of contact residues can dominate the binding energy landscape . alanine scanning mutagenesis provides an assessment of the contribution of each residue to complex formation [ ] [ ] [ ] . critical binding residues can be computationally identified by assessing the change in binding free energy of complex formation upon mutation of the particular residue to alanine, which is the smallest residue that may be incorporated without significantly impacting the protein backbone conformation . our computational modeling utilizes the human sars rbd/ace high-resolution structures, and we make the implicit assumption that the overall conformation of ace is conserved among different species. 
this assumption, which is rooted in the high sequence similarity between ace sequences, allows us to use the structure of the complex to predict the impact of mutations at the protein-protein interface. we defined critical residues as those that upon mutation to alanine decrease the binding energy by a threshold value ΔΔg bind ≥ . kcal/mol. nine of the residues identified by alanine scanning as involved in the ace -rbd complex met this criterion (supplementary table s ). there was a large congruence in the sites identified with those highlighted by other methods. each of the eight sites implicated by cryo-em , were also detected by alanine modeling; five residues were ≥ . kcal/mol threshold and were below this threshold. to be cautious, in addition to the critical ace sites we identified through alanine scanning, we also examined residue variation at the sites that fell below the ≥ . kcal/mol threshold but that were identified as important by structural analyses - for a total of critical sites. all computational alanine scanning mutagenesis analyses were performed using rosetta software . the alanine mutagenesis approach has been extensively evaluated and used to analyze ppis and design their inhibitors, including by members of the present authorship , . we utilized the ssipe program to predict how ace amino acid differences in each species would affect the relative binding energy of the ace /sars-cov- interaction. using human ace bound to the sars-cov- rbd as a benchmark (pdb m j), the program mutates selected residues and compares the binding energy to that of the original. using this algorithm, we studied interactions of all primates across the full suite of amino acid changes occurring at critical binding sites for each species. to more thoroughly assess the impact of each amino acid substitution, we also examined the predicted effect of individual amino acid changes (in isolation) on protein-binding affinity. adaptive evolution of ace sequences. 
we further investigated ace and how selective pressures in different clades might be shaping variation at the binding sites, using codeml clade model c and branch-site models in paml. we first tested if selection acting on ace is divergent between the major clades in our sample (platyrrhine, catarrhine, and strepsirrhine primates, non-primate mammals) with the codeml clade model c, which was compared to the null model (m a_rel) with a likelihood ratio test. this test shows whether there is divergent selection (dn/ds ratio = ω) across all clades, but not which clades are experiencing positive selection. we therefore followed the clade model with a series of branch-site models, which allow one clade at a time to be designated as a set of "foreground" branches and test whether this clade has experienced episodes of positive selection compared to the remaining sets of "background" branches (ω foreground > ω background ). branch-site models are compared to a null model that fixes ω at with a likelihood ratio test. when the alternative model fits significantly better than the null model, indicating positive selection, potential sites under positive selection are identified with a bayes empirical bayes (beb) approach. we completed branch-site models for each primate clade (platyrrhine, strepsirrhine, and catarrhine), as well as bats, because previous research has identified ace to be under positive selection in this clade, potentially in response to coronaviruses. we had to exclude hipposideros pratti and myotis daubentonii from paml analyses, because only a partial ace sequence was available for these two species. input files and control files for paml codeml analyses are available in the github repository.
statistics and reproducibility. models in paml were compared with likelihood ratio tests and evaluated for significance with a right-tailed chi-squared distribution.
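the model comparison described here reduces to a likelihood ratio test: twice the difference in log-likelihood between the alternative and null models is compared against a right-tailed chi-squared distribution. a minimal python sketch follows, using only the standard library. the function names, the log-likelihood values, and the degrees of freedom in the usage lines are hypothetical and are not taken from the study's codeml output.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Right-tail probability P(X >= x) of a chi-squared variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-y) * sum_{k=0}^{m-1} y^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df = 2m + 1: erfc(sqrt(y)) + exp(-y) * sum_{k=1}^{m} y^(k - 1/2) / Gamma(k + 1/2)
    total = math.erfc(math.sqrt(y))
    term = math.sqrt(y) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * term
        term *= y / (k + 0.5)
    return total


def likelihood_ratio_test(lnl_null: float, lnl_alt: float, df: int):
    """Return (LRT statistic, right-tailed p-value) for nested models."""
    # Numerical optimisation can leave the statistic slightly negative; clamp at zero.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods for a null and an alternative model.
stat, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-995.0, df=1)
print(f"LRT = {stat:.2f}, p = {p_value:.4g}")
```

the degrees of freedom equal the number of additional free parameters in the alternative model; the value of 1 used above is only an example.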
as this was a comparative study of gene sequences across species, we had one representative individual for each species (n = ) and no replicates.
reporting summary. further information on research design is available in the nature research life sciences reporting summary linked to this article.
nucleotide and protein sequences used in this study are available from ncbi and are also available as fasta files (supplementary data and ) and alignments (supplementary data and ) in the supplemental material. accession numbers are provided in supplementary table s . all code used in this project is available via a github repository (https://github.com/ mareikejaniak/ace ). the version of the repository used for this project has been archived in zenodo (doi: . /zenodo. ).
received: august ; accepted: october ;
references
a novel coronavirus from patients with pneumonia in china
emergence of a novel human coronavirus threatening human health
impact of yellow fever outbreaks on two howler monkey species (alouatta guariba clamitans and a. caraya) in misiones, argentina
ebola outbreak killed gorillas
pandemic human viruses cause decline of endangered great apes
descriptive epidemiology of fatal respiratory outbreaks and detection of a human-related metapneumovirus in wild chimpanzees
forest fragmentation as cause of bacterial transmission among nonhuman primates, humans, and livestock
human metapneumovirus infection in wild mountain gorillas
human coronavirus oc outbreak in wild chimpanzees, côte d´ivoire
covid- : protect great apes during human pandemics
comparative pathogenesis of covid- , mers, and sars in a nonhuman primate model
ards and cytokine storm in sars-cov- infected caribbean vervets
age-related rhesus macaque models of covid-
primary exposure to sars-cov- protects against reinfection in rhesus macaques
infection with novel coronavirus (sars-cov- ) causes pneumonia in rhesus macaques
comparison of nonhuman primates identified the suitable model for covid-
section on great apes. great apes, covid- and the sars cov- joint statement of the iucn ssc wildlife health specialist group and the primate specialist group
tissue distribution of ace protein, the functional receptor for sars coronavirus. a first step in understanding sars pathogenesis
hydrolysis of biological peptides by human angiotensin-converting enzyme-related carboxypeptidase
heart block, ventricular tachycardia, and sudden death in ace transgenic mice with downregulated connexins
the anti-inflammatory potential of ace /angiotensin-( - )/mas receptor axis: evidence from basic and clinical research
the pivotal link between ace deficiency and sars-cov- infection
a human homolog of angiotensin-converting enzyme. cloning and functional expression as a captopril-insensitive carboxypeptidase
ace x-ray structures reveal a large hinge-bending motion important for inhibitor binding and catalysis
the international response to the outbreak of sars in
severe acute respiratory syndrome (sars): a review of the history, epidemiology, prevention, and concerns for the future
structural basis for the recognition of sars-cov- by full-length human ace
structural basis of receptor recognition by sars-cov-
structure of the sars-cov- spike receptor-binding domain bound to the ace receptor
structural and functional basis of sars-cov- entry by using human ace
cryo-em structure of the -ncov spike in the prefusion conformation
clinical and immunologic features in severe and moderate coronavirus disease
the covid- cytokine storm; what we know so far
sars-cov- receptor ace and tmprss are primarily expressed in bronchial transient secretory cells
structural variations in human ace may influence its binding with sars-cov- spike protein
ace coding variants: a potential x-linked risk factor for covid- disease
ace gene variants may underlie interindividual variability and susceptibility to covid- in the italian population
angiotensin-converting enzyme (ace ) proteins of different bat species confer variable susceptibility to sars-cov entry
susceptibility of ferrets, cats, dogs, and other domesticated animals to sars-coronavirus
evidence of recombination in coronaviruses implicating pangolin origins of ncov-
identification of critical active-site residues in angiotensin-converting enzyme- (ace ) by site-directed mutagenesis
a pneumonia outbreak associated with a new coronavirus of probable bat origin
evidence for ace -utilizing coronaviruses (covs) related to severe acute respiratory syndrome cov in bats
ace and adam interaction regulates the activity of presympathetic neurons
tmprss and adam cleave ace differentially and only proteolysis by tmprss augments entry driven by the severe acute respiratory syndrome coronavirus spike protein
sars-cov- infection of african green monkeys results in mild respiratory disease discernible by pet/ct imaging and shedding of infectious virus from both respiratory and gastrointestinal tracts
ace and tmprss variation in savanna monkeys (chlorocebus spp.): potential risk for zoonotic/anthroponotic transmission of sars-cov- and a potential model for functional studies
human ace receptor polymorphisms predict sars-cov- susceptibility
comparative genetic analysis of the novel coronavirus ( -ncov/sars-cov- ) receptor ace in different populations
virus-host interactome and proteomic survey reveal potential virulence factors influencing sars-cov- pathogenesis
sars-cov- cell entry depends on ace and tmprss and is blocked by a clinically proven protease inhibitor
sars-cov- : a storm is raging
impending extinction crisis of the world's primates: why primates matter
estimating abundance and growth rates in a wild mountain gorilla population
putting leakage in its place: the significance of retained tourism revenue in the local context in rural uganda
best practice guidelines for great ape tourism
the rules and the reality of mountain gorilla gorilla beringei beringei tracking: how close do tourists get?
best practice guidelines for health monitoring and disease control in great ape populations. occasional papers of the iucn species survival commission no
mafft multiple sequence alignment software version : improvements in performance and usability
mrbayes: bayesian inference of phylogenetic trees
raxml version : a tool for phylogenetic analysis and post-analysis of large phylogenies
mareikejaniak/ace : code for primate ace project
timetree: a resource for timelines, timetrees, and divergence times
r: a language and environment for statistical computing (r foundation for statistical computing
interfaces to phylogenetic software in r
a hot spot of binding energy in a hormone-receptor interface
anatomy of hot spots in protein interfaces
computational alanine scanning to probe protein-protein interactions: a novel approach to evaluate binding free energies
a simple physical model for binding energy hot spots in protein-protein complexes
computational alanine scanning of protein-protein interfaces
systematic analysis of helical protein interfaces reveals targets for synthetic inhibitors
plucking the high hanging fruit: a systematic approach for targeting protein-protein interactions
ssipe: accurately estimating protein-protein binding affinity change upon mutations using evolutionary profiles in combination with an optimized physical energy function
paml : phylogenetic analysis by maximum likelihood
an improved likelihood ratio test for detecting site-specific functional divergence among clades of protein-coding genes
bayes empirical bayes inference of amino acid sites under positive selection
acknowledgements m.c.j. was funded by a natural sciences and engineering council of canada discovery accelerator supplement to a.d.m. and by a postdoctoral fellowship from the alberta children's hospital research institute. p.s.a. thanks the national institutes of health (r gm ) for financial support. we thank four reviewers for constructive comments, which improved the manuscript considerably.
the authors declare no competing interests.
supplementary information is available for this paper at https://doi.org/ . /s - - -w.
correspondence and requests for materials should be addressed to a.d.m. or j.p.h.
open access: this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made.
key: cord- -ent vu z authors: tan, joshua; sack, brandon k; oyen, david; zenklusen, isabelle; piccoli, luca; barbieri, sonia; foglierini, mathilde; fregni, chiara silacci; marcandalli, jessica; jongo, said; abdulla, salim; perez, laurent; corradin, giampietro; varani, luca; sallusto, federica; sim, b kim lee; hoffman, stephen l; kappe, stefan h i; daubenberger, claudia; wilson, ian a; lanzavecchia, antonio title: a public antibody lineage that potently inhibits malaria infection by dual binding to the circumsporozoite protein date: - - journal: nat med doi: . /nm.
sha: doc_id: cord_uid: ent vu z
immunization with attenuated plasmodium falciparum sporozoites (pfspz) has been shown to be protective, but the features of the antibody response induced by this treatment remain unclear. to investigate this response at high resolution, we isolated igm and igg monoclonal antibodies from tanzanian volunteers who were immunized by repeated injection of irradiated pfspz and who were found to be protected from controlled human malaria infection (chmi) with infectious homologous pfspz. all igg monoclonals isolated bound to p. falciparum circumsporozoite protein (pfcsp) and recognized distinct epitopes in the n-terminus, nanp repeat region, and c-terminus. strikingly, the most effective antibodies, as assessed in a humanized mouse model, bound not only to the repeat region, but also to a minimal peptide at the pfcsp n-terminal junction that is not in the rts,s vaccine. these dual-specific antibodies were isolated from different donors and used vh - or vh - alleles carrying tryptophan or arginine at position . using structural and mutational data, we describe the elements required for germline recognition and affinity maturation. our study provides potent neutralizing antibodies and relevant information for lineage-targeted vaccine design and immunization strategies.
malaria is a serious global health threat, causing , deaths and million clinical cases in . much of the effort to develop a vaccine against the disease has focused on plasmodium falciparum sporozoites (pfspz), the asymptomatic parasite stage that is injected by mosquitoes into the host skin to initiate a malaria infection. after entering the skin, pfspz migrate to the liver, multiply in hepatocytes and then emerge in the blood, where the parasites cause malaria symptoms and differentiate into sexual stages for transmission. while natural infection by pfspz elicits little or no protective immunity to this stage of the life cycle , , subunit or whole organism vaccines based on pfspz can induce robust immune responses [ ] [ ] [ ] . the most advanced malaria vaccine candidate, rts,s, incorporates part of the p. falciparum circumsporozoite protein (pfcsp), which coats the pfspz surface and plays a key role in parasite migration out of the skin, entry into the liver parenchyma and invasion of hepatocytes [ ] [ ] [ ] [ ] [ ] [ ] . multi-site clinical trials in sub-saharan africa have shown that rts,s confers significant but modest and short-lived protection against clinical illness , , . an alternative approach that has shown promise is the use of whole attenuated pfspz as immunogens. this line of research is based on key early discoveries that immunization with irradiated p.
berghei sporozoites protected mice against subsequent challenge and that immunization of humans with > irradiated mosquitoes carrying pfspz conferred sterilizing protection against controlled human malaria infection (chmi) [ ] [ ] [ ] . these studies have led to efforts to develop whole attenuated pfspz as a vaccine, and recent trials have shown that immunization with attenuated pfspz was highly protective in malaria-naïve volunteers and gave significant protection in malian adults [ ] [ ] [ ] . while these results are promising, the specific mediators of this protective immune response have yet to be fully elucidated. studies of the antibody response have mainly investigated polyclonal serum responses to pfspz and pfcsp [ ] [ ] [ ] , and high-resolution analysis of the monoclonal antibodies generated by vaccination and their target antigens on the pfspz surface remains to be performed. these experiments could provide useful information for the improvement of whole sporozoite-based vaccines and for the identification of new antigens as subunit vaccine candidates, and could generate tools for prophylaxis of p. falciparum infection.
we characterized the antibody response of tanzanian volunteers living in malaria-endemic regions who were immunized by repeated intravenous injection of irradiated pfspz (pfspz vaccine) and then underwent chmi with live homologous parasites (fig. a). serum igm and igg antibodies, as measured by flow cytometry on live pfspz, increased following immunization, but did not show a clear association with protection from chmi (fig. b,c and supplementary fig. a,b). memory b cells from five protected individuals were immortalized and screened by staining of intact pfspz to isolate human monoclonal antibodies against any surface antigen on pfspz (supplementary fig. c). most of the igg monoclonal antibodies bound to pfspz with high affinity (fig. d,e).
interestingly, in the two donors from whom both pfspz-specific igm and igg monoclonal antibodies were isolated, the igm antibodies were recovered in much higher numbers (fig. f). the igm antibodies had fewer, but still substantial, mutations compared to the igg antibodies, consistent with an origin from igm memory b cells (fig. g). in particular, the finding of an antibody lineage containing both igm and igg members suggests an incomplete switch in this response despite repeated immunization (supplementary fig. d). these results demonstrate that immunizations with pfspz vaccine induce a robust antibody response that retains a significant igm component.
to identify the features of the most effective neutralizing monoclonal antibodies produced by the protected individuals, we tested a panel of igg antibodies in vitro for their capacity to inhibit pfspz traversal and invasion of a human hepatocyte cell line (fig. a). the invasion-inhibitory activity varied among antibodies and was significantly correlated with binding affinity to pfspz (fig. b). a subset of antibodies was further tested in an in vivo mouse humanized liver model for their capacity to protect against natural, mosquito bite-transmitted infection by pfspz. some antibodies, such as mgg , mgg , mgh , mgu and mgu , were very potent, reducing liver burden by up to . %, while others, such as mgg , mgh , mgh and mgu , were less effective (fig. c). these findings suggest that in vivo neutralizing activity may be related to the fine specificity of the antibodies.
next, we set out to identify the target antigens of the monoclonal antibodies. strikingly, although we had used an antigen-agnostic approach to identify antibodies that bound to the pfspz surface regardless of specificity, we found that all of the antibodies bound to recombinant pfcsp (fig. d,e), confirming that this is the most immunogenic protein on the pfspz surface [ ] [ ] [ ] .
to understand the basis for effective neutralization, we mapped the specificity of the monoclonal antibodies using synthetic peptides that cover the n-terminus, the nanp repeat region, the n-terminal junction (connecting region between the n-terminal domain and nanp repeats), and the c-terminus of pfcsp. binding to the classical nanp repeats (nanp peptide), the pfcsp c-terminus, recombinant pfcsp or pfspz did not correlate with efficacy ( supplementary fig. a-d) . interestingly, however, binding to a minimal -mer peptide (npdp ) that covers the junction between the n-terminal domain and the nanp repeats was a shared characteristic of the most potent in vivo neutralizing antibodies (fig. f) . these potent antibodies also recognized the nanp peptide, suggesting that the capacity to bind both to the nanp repeats and to the n-terminal junction of pfcsp is the main feature of efficient neutralization. another distinctive characteristic of the most potent antibodies was the common usage of vh - family alleles carrying tryptophan or arginine at position (vh - f , here defined as including vh - , vh - - , vh - - and vh - alleles sharing > % identity) (fig. e , supplementary fig. , , supplementary table ). strikingly, vh - f was the most common vh gene used by igm antibodies isolated from donors g and u, with almost % of such antibodies carrying w or r ( supplementary fig. e ). collectively, these data indicate that the most potent antibodies have a dual specificity and share common vh gene usage. importantly, such antibodies were isolated from four out of five donors, suggesting that these antibodies belong to a public lineage and therefore have the potential to be readily induced by vaccination. to investigate the influence of somatic mutations on binding of vh - f antibodies to pfcsp, we focused on two clonally related and highly mutated antibodies, mgu and mgu (fig. a) . 
the unmutated common ancestor (uca) of these antibodies, which carries tryptophan at position , was able to bind to pfcsp and pfspz with low affinity, while substitution to serine , which is commonly found in vh - alleles, resulted in loss of binding (fig. b,c) . these findings, in conjunction with the high frequency of w in the igm sequences and the identification of putative vh - f alleles carrying w in the germline sequences obtained from non-b cells of donors g, u and w ( supplementary fig. , ) , identify vh - f alleles carrying w as a preferred feature for the initiation of this lineage-specific antibody response to pfcsp. the branch point of this clone achieved, through several mutations, high affinity binding to pfspz, pfcsp and nanp , while further mutations in mgu , but not mgu , increased breadth by conferring the unique ability to bind to npdp (fig. b-e, supplementary fig. ). despite their similarities in binding to pfspz, pfcsp and nanp , mgu was substantially more potent than mgu in the in vivo assay (fig. c) , suggesting that acquisition of binding to npdp is the key factor for potent neutralization. interestingly, mutagenesis studies of mgu suggest that w remains a critical residue for binding to npdp , but becomes dispensable for high affinity binding to nanp and full-length pfcsp in the fully mutated antibody (fig. f ). in contrast, in a second clonal family consisting of mgu and mgu , full binding to pfcsp, nanp , npdp and pfspz was already achieved by the branch point while the remaining mutations appeared redundant ( supplementary fig. a -e). to investigate the original specificities of germline vh - f antibodies, we analysed binding of the ucas of various vh - f antibodies to pfcsp peptides using a more sensitive bead-based assay. most ucas bound to nanp but not to npdp , suggesting that the antibodies generally started as nanp-specific and gained affinity for npdp through somatic mutations ( supplementary fig. f-m) . 
these data delineate a pathway of antibody development that is dependent on specific vh alleles and leads to antibodies with dual specificity. to identify the minimal residues recognized by antibodies at the pfcsp n-terminal junction, we performed a mutational analysis on npdp , a -mer version of npdp that was used to provide a longer scaffold for binding (fig. g) . the loss of binding to certain peptides identified a specific motif (dpnanp) that was recognized by most antibodies regardless of vh gene usage. these findings, combined with data from peptide array experiments ( supplementary fig. ) , identify the n-terminal junction binding site of the most potent neutralizing antibodies as including the first unit of the nanp repeat region and flanking non-repeat sequences, providing a molecular basis for the dual specificity of these antibodies. to gain structural insights into the recognition of the n-terminal junction, we attempted to crystallize several vh - f antibodies and successfully crystallized mgg in complex with the ac- kqpadgnpdpnanp -nh peptide ( fig. a and supplementary table ) . only the c-terminal half (npdpnan, residues - ) of the peptide was visible, with most of the contacts being made by heavy-chain residues as shown by the fab buried surface area (fig. b) . specifically, the heavy-chain cdr loops form a groove in which the peptide resides ( fig. a and supplementary fig. a ). in addition, three interfacial waters are involved in an extensive hydrogen-bonding network, connecting the side chain of n to the base of cdr h and cdr h ( supplementary fig. b ). the ch/π interaction between p and w ( fig. a and supplementary fig. b ) confirms that the latter residue is critical for binding as indicated by the mutagenesis experiments. while most peptide residues, except for n , display a relatively large buried surface area (fig. c) , weak electron density for the residues visible at the peptide's termini (n , p , a and n ) ( fig. e and supplementary fig. 
c ) indicates structural flexibility and highlights the sequence dpn as the principal binding motif. interestingly, the dpn sequence displays a pseudo turn, which is stabilized by hydrogen bonding of the aspartate side chain to the asparagine backbone amide. such a conformation is very similar to turns observed for unbound nanp peptides in solution and in crystal structures of free and antibody-bound peptides (fig. d) [ ] [ ] [ ] [ ] [ ] . an isolated dpn motif is also present in the - c-terminal peptide, which may explain its binding to mgg (fig. e) , notwithstanding the significant differences between the dpn flanking residues in the - c-terminal and the n-terminal junctional peptides (ndpnr versus pdpna, respectively). binding to both npdp and nanp repeats suggests that the aspartate residue in the dpn motif is interchangeable with an asparagine. binding studies of mgg to npdp peptide mutants validate the specificity of mgg toward dpn and npn motifs (fig. g) . only when the central dpn and npn motifs within the npdp peptide are mutated to dpa and aan is mgg binding completely abrogated. in contrast, a third dpn motif present at the c-terminus of the npdp peptide does not contribute to binding, possibly indicating the importance of flanking residues for optimal binding. overall, these data imply that potential peptide-binding promiscuity allows mgg to bind to diverse epitopes on pfcsp. since other vh - f antibodies cannot bind to the - peptide, we expect that they will be less promiscuous and bind slightly larger or more specific sequences. the finding that the most potent monoclonal antibodies recognize a defined n-terminal junctional peptide suggests that this region could be a component of an effective subunit vaccine. in an initial attempt to investigate whether the npdp peptide might be sufficient to induce a protective response, we immunized balb/c mice with npdp conjugated to klh. 
all mice produced igg antibodies that were specific for the npdp peptide, but were at best weakly reactive for the nanp peptide ( supplementary fig. a,b) . strikingly, in spite of their ability to bind to pfspz, the mouse sera were unable to inhibit pfspz invasion of a hepatocyte cell line in vitro, suggesting that dual specificity for nanp and the n-terminal junction may be required for potent neutralizing function ( supplementary fig. c,d) . the reliance of the dual-specific antibodies that we isolated on specific human vh - f alleles suggests that mice, which do not have the counterparts of human vh - f genes, may not be the most suitable model organism to test a novel csp-based vaccine. rather, an organism such as the aotus monkey, which has a more similar vh gene repertoire to humans and contains vh - f -like genes carrying the equivalent of w , may be a more suitable choice. this study shows that the antibodies produced by vaccinated and protected african individuals contain a highly mutated igg component, as well as an important igm component with fewer but substantial mutations, as also seen in the response to blood-stage plasmodium antigens . the large igm component would be consistent with stimulation of marginal zone b cells in the spleen following intravenous immunization with a particulate antigen . strikingly, all of the igg antibodies that we identified recognized pfcsp, consistent with previous studies describing the immunodominance of this protein and its abundance on the pfspz surface [ ] [ ] [ ] , . our findings are in agreement with previous work that separately describes the importance of the pfcsp nanp repeat region and of the n-terminus , - , but, importantly, highlight the fact that antibodies that target epitopes in both regions simultaneously are more potent than antibodies that exclusively recognize each individual site. 
interestingly, the structural analysis and peptide mutational data indicate that these antibodies, which are originally specific for nanp motifs, do not acquire a completely unrelated specificity, but rather gain promiscuity for sequences centred on a dpn motif. this dual specificity is to a large extent encoded by vh - f alleles, but also requires extensive somatic mutations. the importance of dual specificity is highlighted by the fact that immunization with a single npdp peptide is not sufficient to confer protection despite eliciting pfspz-binding antibodies. the increased potency of the dual-specific antibodies could be due to the proximity of the n-terminal junction to region i (klkqp) of pfcsp, which is involved in the cleavage of the n-terminus to allow pfspz invasion of hepatocytes , . this n-terminal junction region is not included in the most advanced malaria vaccine candidate rts,s, which may explain its limited efficacy in malaria-endemic regions . these findings support continued work to develop whole pfspz vaccines, which contain the entire pfcsp, and provide a rationale for further attempts to develop refined vaccination approaches to elicit dual-specific antibodies using prime-boost strategies with improved carriers , . the finding that the most potent antibodies share common vh gene usage in multiple donors is consistent with a public antibody response that can be readily induced by vaccination. these results are reminiscent of previous work on the use of particular vh genes and their allelic forms in the response to the stem of influenza haemagglutinin , and justify further efforts to investigate the role of vh gene polymorphisms in protective antibody responses. nevertheless, whether these antibodies are sufficient to protect humans and why some individuals were not protected remain to be established. 
possible reasons for the latter include the lack of a vh - f allele, the incomplete maturation of the vh - f antibodies, or insufficient production of the potent antibodies. the antibodies described could be used to obtain proof of concept that antibodies alone can be protective in vivo in humans, as previously shown in mice and non-human primates [ ] [ ] [ ] , and pave the way for the use of antibodies in the prophylaxis of p. falciparum infection and for the development of improved antibody-based subunit malaria vaccines. aseptic, purified cryopreserved pfspz of the nf strain provided by sanaria ® were used in serum and monoclonal antibody binding experiments. pfspz produced by the center for infectious disease research, seattle was used in all in vitro and in vivo functional assays. following informed consent, blood samples used in this study were collected from malaria pre-exposed volunteers during a clinical phase clinical trial of the safety, immunogenicity and protective efficacy of the sanaria ® pfspz vaccine in bagamoyo, tanzania. for serum preparation, whole blood was collected in vacutainer tubes (bd) containing clot activators and kept at room temperature until a clot was formed. the tube was centrifuged at , × g for min at °c and the serum fraction was stored at − °c. peripheral blood mononuclear cells (pbmcs) were isolated from whole blood by ficoll density gradient centrifugation and resuspended in freezing medium for long-term storage in liquid nitrogen. cryopreserved pfspz (sanaria ® ) were thawed and stained with different concentrations of tanzanian sera or monoclonal antibodies in . × sybr green i (thermofisher scientific) for min at °c. the pfspz were washed twice by centrifugation at × g for min. human serum antibody binding was detected using . μg ml − alexa fluor -conjugated goat anti-human igg (jackson immunoresearch, - - ) or alexa fluor conjugated goat anti-human igm (jackson immunoresearch, - - ). 
mouse serum antibody binding was detected using μg ml − pe-cy -conjugated goat anti-mouse igg (biolegend, ) or pe-cy -conjugated rat anti-mouse igm (bd biosciences, ). facs diva (version . ) was used for acquisition of samples and flow-jo (version . ) was used for facs analysis. the pfspz were gated based on high fluorescence in the fitc channel. median fluorescence intensity (mfi) of the pfspz in the alexa fluor or pe-cy channel was calculated to quantify igg or igm binding. the concentration of antibody needed to achieve mfi , (conc ) was calculated by interpolation of binding curves fitted to a sigmoidal curve model (graphpad prism ) as a measure of affinity. the gating strategy can be found in supplementary figure . igm or igg memory b cells were isolated from frozen peripheral blood mononuclear cells (pbmcs) by magnetic cell sorting with . μg ml − anti-cd -pecy antibodies (bd, ) and mouse anti-pe microbeads (miltenyi biotec, - - ), followed by facs sorting using . μg ml − alexa fluor -conjugated goat anti-human igg (jackson immunoresearch, - - ), μg ml − alexa fluor -conjugated goat anti-human igm (invitrogen, a ) and / pe-labeled anti-human igd (bd, ). as previously described , sorted b cells were immortalized with epstein-barr virus (ebv) and plated in single cell cultures in the presence of cpg-dna ( . μg ml − ) and irradiated pbmc-feeder cells. two weeks post-immortalization, the culture supernatants were tested (at a / dilution) for binding to pfspz by flow cytometry using a no-wash protocol. briefly, cryopreserved pfspz were thawed, stained with the supernatants in . × sybr green i for min at room temperature, and incubated with . μg ml − alexa fluor -conjugated goat anti-human igg or anti-human igm for hour at °c. only supernatants that did not bind to control beads were selected to exclude polyreactive antibodies. the ability of monoclonal antibodies and serum to prevent pfspz invasion and traversal in vitro was tested as previously described , . 
briefly, monoclonal antibodies at μg ml − were mixed with freshly dissected pfgfp_luc sporozoites in dmem media containing fitc-dextran, % fbs, pen/strep, fungizone and l-glutamine and incubated at °c for min. these pfspz were then added to hc hepatoma cells plated one day prior at , cells/well in a -well plate for a final moi of . ( , pfspz: , hc ). plates were then spun at × g for min and the pfspz were left to infect for min at °c. cells were then fixed, stained with μg ml − of the monoclonal antibody a conjugated to alexafluor- and analyzed by flow cytometry for invaded cells ( a /alexafluor- positive) or traversed cells (fitc-dextran positive). frg huhep mice were purchased from yecuris, inc. and infected by bite of pfgfp_luc mosquitos - hours following intraperitoneal injection of μg/mouse of each monoclonal antibody or human igg control as described previously . parasite liver burden was determined by bioluminescent imaging using an ivis imager at day at the peak of liver burden. reductions in liver burden were calculated by normalization to the mean of control mice injected with an equivalent dose of human igg within each bite experiment. all animal procedures were conducted in accordance with and approved by the center for infectious disease research institutional animal care and use committee (iacuc) under protocol sk- . the seattle biomed iacuc adheres to the nih office of laboratory animal welfare standards (olaw welfare assurance # a - ). cdna was synthesized from selected b-cell cultures and both heavy chain and light chain variable regions (vh and vl) were sequenced as previously described . the usage of vh and vl genes and the number of somatic mutations were determined by analyzing the homology of vh and vl sequences of monoclonal antibodies to known human v, d and j genes in the imgt database (version . . ) . antibody-encoding sequences were amplified and sequenced with primers specific for the v and j regions of the given antibody. 
sequences were aligned with clustal omega (version . . ) . unmutated common ancestor (uca) sequences of the vh and vl were inferred with antigen receptor probabilistic parser (arpp) ua inference software, as previously described , or constructed using imgt/v-quest . phylogenetic trees were generated with the dna maximum likelihood program (dnaml) of the phylip package, version . , . antibody heavy and light chains were cloned into human igg , igκ and igλ expression vectors and expressed by transient transfection of expi f cells (thermofisher scientific) using polyethylenimine. cell lines were routinely tested for mycoplasma contamination. the antibodies were affinity purified by protein a chromatography (ge healthcare). the kqpadgnpdpnanp peptide was ordered from innopep inc. with a purity of > % and containing chlorine counter ions. the peptides have n-terminal acetylation and c-terminal amidation to eliminate charges at the peptide termini. the mgg -peptide complex was crystallized from a solution containing mgg at . mg ml − in tbs buffer ( mm tris-hcl, mm nacl, . mm kcl, ph . ) with a : molar ratio of ac-kqpadgnpdpnanp-nh peptide to fab. crystals were grown using sitting drop vapor diffusion with a well solution containing . m kh po , % glycerol, % peg at k and typically appeared within days. crystals were cryo-cooled without additional cryoprotection. x-ray diffraction data were collected at the advanced light source (als) . . . data collection and processing statistics are outlined in supplementary table . data sets were indexed, integrated, and scaled using the hkl- package (version ) . the structures were solved by molecular replacement using phaser (version . . ) with a homology model (swiss-model - and pigspro ) for mgg as a search model. after refinement of the fab using phenix.refine (version . - ) combined with additional manual building cycles in coot (version . . ) , positive fo-fc density was observed in the fab combining site for the peptide. 
the peptide was manually built into the difference density fo-fc map, followed by additional rounds of refinement of the complex in phenix.refine and manual building cycles in coot . buried surface areas (bsa) were calculated with the program ms (version . ) , using a . -Å probe radius and standard van der waals radii . total iggs were quantified using half-area, high-binding -well plates (corning) with μg ml − goat anti-human igg (southernbiotech, - ) using certified reference material (erms-da , sigma-aldrich) as a standard. to test specific antibody binding, elisa plates were either directly coated with μg ml − of recombinant pfcsp (sanaria ® , sequence previously shown ), μg ml − of peptide - or μg ml − of peptide - , or first with μg ml − of avidin (sigma-aldrich), followed by μg ml − of nanp (nanpnanpnanpnanpna), npdp (kqpadgnpdpnanpnvdpn), npdp (kqpadgnpdpnanpn) or various npdp peptide mutants. non-specific binding to plates coated with an irrelevant control peptide was tested to exclude polyreactivity of the antibodies. all peptides and mutants were synthesized with biotin attached to the c-terminus (a&a labs). plates were blocked with % bovine serum albumin (bsa) and incubated with titrated antibodies, followed by / ap-conjugated goat anti-human igg (southern biotech, - ). plates were then washed, substrate (p-npp, sigma) was added and plates were read at nm. streptavidin beads with different levels of fitc labelling (svfb- - k, spherotech) were coated with μg ml − of biotinylated nanp , npdp , npdp or a negative control peptide for min at room temperature. the beads were washed and incubated with titrations of monoclonal antibodies for min at room temperature. antibody binding was detected with . μg ml − alexa fluor -conjugated goat anti-human igg or anti-human igm. the ucas were compared for binding to nanp and npdp at a concentration of μg ml − (supplementary fig. f ). 
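the peptide sequences listed in the elisa description above can be scanned for the dpn and npn motifs discussed in the structural analysis. the following is a minimal python sketch, not part of the study's methods: the dictionary keys (`nanp_repeat`, `npdp_long`, `npdp_short`) are hypothetical labels chosen because the numeric suffixes of the peptide names are missing from this text, and `find_motif` and `motif_table` are illustrative helper names.

```python
# Peptide sequences as given in the ELISA description (upper-cased).
# The keys are hypothetical labels; the original numeric peptide names
# are not recoverable from this text.
PEPTIDES = {
    "nanp_repeat": "NANPNANPNANPNANPNA",
    "npdp_long": "KQPADGNPDPNANPNVDPN",
    "npdp_short": "KQPADGNPDPNANPN",
}


def find_motif(sequence, motif):
    """Return the 0-based start index of every (possibly overlapping) match."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one residue further on so overlapping matches are found too.
        start = sequence.find(motif, start + 1)
    return positions


def motif_table(peptides, motifs=("DPN", "NPN")):
    """Map each peptide label to the positions of each motif within it."""
    return {
        name: {motif: find_motif(seq, motif) for motif in motifs}
        for name, seq in peptides.items()
    }


if __name__ == "__main__":
    for name, hits in motif_table(PEPTIDES).items():
        print(name, hits)
```

such a scan only locates candidate motifs in the sequence; whether a given occurrence contributes to binding depends on its flanking residues, as the mutational data indicate.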
biotinylated npdp and nanp peptides were diluted ( nm) in hepes buffered saline (hbs) ( mm hepes, ph . , mm nacl, mm edta, . % surfactant ). hbs was also used as running buffer. an irrelevant biotinylated -mer peptide was used as a control for non-specific interactions. a neutravidin-immobilized nlc proteon sensor chip (biorad) was pre-conditioned with an nacl solution ( m) and the biotinylated peptides were injected onto the chip. the monoclonal antibodies were diluted and titrated in hbs ( - . - . - . - . nm) and injected onto the chip; one channel of the chip was injected with hbs and used as reference for the analysis. all injections were made at a flow rate of μl/min. injection time and dissociation time were s and s, respectively. each binding interaction of the monoclonal antibodies with the biotinylated peptides was assessed using a proteon xpr instrument (biorad) and data were processed with proteon manager software (version . . . ). k a , k d and k d were calculated by applying the langmuir fit model. peptides of -amino acid lengths spanning the entire pfcsp (with a shift of a single amino acid between peptides) were synthesized and coated onto a microarray chip (peppermap® linear epitope mapping, pepperprint gmbh). the peptides were incubated with μg ml − of monoclonal antibodies for h at °c, followed by incubation with dylight conjugated goat anti-human igg to detect antibody binding. female balb/c mice ( - weeks of age) were obtained from envigo laboratories. all procedures were performed in accordance with guidelines by the swiss federal veterinary office and after obtaining ethical approval from the ufficio veterinario cantonale, bellinzona, switzerland (approval number: ). keyhole limpet haemocyanin (klh)-conjugated npdp (genscript) was reconstituted in water and formulated with % mf (addavax, invivogen) according to the manufacturer's instructions. mice were immunized subcutaneously with μg of peptide on day and . mice were bled on day . 
recovered sera were used for staining of pfspz by flow cytometry and for binding to pfcsp and pfcsp peptides by elisa. the number of mutations in the heavy chains of igg (n= antibodies) and igm (n= antibodies) isolated from the tanzanian volunteers were compared by a two-sided t-test. results are shown as mean ± s.d. in this test, n refers to the number of antibodies, p = . , t = . , df = . a two-tailed spearman's correlation was performed to correlate invasion with binding affinity to pfspz (from n= representative experiment out of ), p = . . in the in vivo test of the monoclonal antibodies, error bars show s.d. and were calculated from n = or mice for each antibody. a one-sided anova with kruskal-wallis post test was used to compare the percentages of liver burden with that of control mice injected with irrelevant human igg; for the anova, p = . , f = . , df = , df = . for the kruskal-wallis post test, the results for each individual antibody are presented as *p≤ . , **p≤ . , ****p< . . a two-tailed spearman's correlation was performed to correlate affinity for npdp , pfcsp, nanp , - or pfspz with in vivo antibody efficacy (from n= representative experiment out of ). p = . , . , . , . and . , respectively. the confidence intervals were not determined by prism as n< for each correlation. in all other cases, n refers to the number of independent experiments. sequence data of the monoclonal antibodies isolated in this study will be deposited in genbank (https://www.ncbi.nlm.nih.gov/genbank/). the x-ray structure factors and coordinates have been deposited in the protein data bank (pdb id bqb). refer to web version on pubmed central for supplementary material. a, binding interface of mgg in complex with the kqpadgnpdpnanp peptide; only residues to of the peptide have interpretable electron density (indicated in bold). 
the peptide is shown in the cartoon representation with sidechains as sticks, while the heavy and light chains of mgg are shown as dark and light grey surfaces respectively. the cdr loops are in the cartoon representation: cdrh (green), cdrh (blue), cdrh (magenta), cdrl (light green), cdrl (light blue) and cdrl (pink). the w sidechain is shown as blue sticks and the interfacial waters are highlighted as red spheres. b, buried surface area (bsa) for the heavy chain (hc) and light chain (lc) with the peptide. c, bsa for individual peptide residues with the fab. d, pseudo turn for the dpn motif of the bound peptide (yellow carbons) and type i β-turn for the previously published crystal structure of the unbound anpna peptide (green carbons) . stabilizing hydrogen bonds between the sidechain of d /n and the amide backbone of n /n in the two structures are highlighted by the dashed line. e, fo-fc electron density map for the n-terminal peptide contoured at . σ (dark blue) and . σ (light blue). 
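the spearman correlations reported in the statistical analysis (for example, between peptide-binding affinity and in vivo efficacy) were computed by the authors in graphpad prism. as a rough illustration of what that statistic measures, the sketch below computes spearman's rank correlation coefficient in plain python, using average ranks for ties. the function names `average_ranks` and `spearman_rho` are illustrative, the sketch returns only the coefficient (no p-value or confidence interval), and the numbers in the usage section are made-up toy values, not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values only: a hypothetical affinity readout and a hypothetical
    # percentage reduction in liver burden for five antibodies.
    toy_affinity = [1, 2, 3, 4, 5]
    toy_reduction = [5, 6, 7, 8, 7]
    print(round(spearman_rho(toy_affinity, toy_reduction), 3))
```

because the coefficient depends only on ranks, it is insensitive to the scale of the affinity readout, which suits comparisons between binding measurements and efficacy percentages.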
we would like to thank m. nussenzweig (rockefeller university) and h. wardemann (german cancer research center) for providing reagents for antibody cloning and expression. this work was supported in part by the swiss the authors would like to thank first and foremost the study volunteers for their participation in the study. we also thank the entire study team at the bagamoyo branch of the ifakara health institute and the manufacturing, quality control, regulatory and clinical teams at sanaria, inc. for their contributions to the conduct of the trial. we would like to thank prof. marcel tanner (former director of the swiss tropical and public health institute, basel) for his vision and support of the development of the clinical trial platform enabling whole sporozoite-based malaria vaccine trials in bagamoyo, tanzania. key: cord- -soov q q authors: wang, claire y t; ware, robert s; lambert, stephen b; mhango, lebogang p; tozer, sarah; day, rebecca; grimwood, keith; bialasiewicz, seweryn title: parechovirus a infections in healthy australian children during the first years of life: a community-based longitudinal birth cohort study date: - - journal: clin infect dis doi: . 
/cid/ciz sha: doc_id: cord_uid: soov q q background: hospital-based studies identify parechovirus (pev), primarily pev-a , as an important cause of severe infections in young children. however, few community-based studies have been published and the true pev infection burden is unknown. we investigated pev epidemiology in healthy children participating in a community-based, longitudinal birth cohort study. methods: australian children (n = ) enrolled in the observational research in childhood infectious diseases (orchid) study were followed from birth until their second birthday. weekly stool and nasal swabs and daily symptom diaries were collected. swabs were tested for pev by reverse-transcription polymerase chain reaction and genotypes determined by subgenomic sequencing. incidence rate, infection characteristics, clinical associations, and virus codetections were investigated. results: pev was detected in of ( . %) and of ( . %) stool and nasal swabs, respectively. major genotypes among the infection episodes identified were pev-a ( . %), pev-a ( . %), and pev-a ( . %). the incidence rate was episodes ( % confidence interval, – ) per child-years. first infections appeared at a median age of (interquartile range, . – . ) months. annual seasonal peaks changing from pev-a to pev-a were observed. infection was positively associated with age ≥ months, summer season, nonexclusive breastfeeding at age < months, and formal childcare attendance before age months. sole pev infections were either asymptomatic ( . %) or mild ( . %), while codetection with other viruses in stool swabs was common ( . %). conclusions: in contrast with hospital-based studies, this study showed that diverse and dynamically changing pev genotypes circulate in the community causing mild or subclinical infections in children. parechovirus can cause severe illnesses in children. however, studies focus mainly on hospitalized populations. true disease burden in the community remains largely unknown. 
from our community-based cohort, we found diverse parechovirus genotypes in the community, causing mild or subclinical infections in children. . these pev-a outbreaks occur biennially, arising in odd and even years in the southern and northern hemispheres, respectively. furthermore, a recent australian study reported that almost % of infants hospitalized with a pev-a infection had impaired neurodevelopment months later [ ] . much of what is known of pev epidemiology, including pev-a outbreaks, relies upon hospital-based studies, and uncertainty exists over the actual disease burden within the community [ , ] . seroprevalence studies from europe and japan suggest that infection is common and occurs early in life with pev-a antibodies present in %- % of infants by age year, increasing to %- % by age - years, with high antibody levels maintained in older children and throughout adulthood [ , ] . recently, pev-a seroepidemiology in australia, the netherlands, and the united states was shown to be similar [ ] . the overall prevalence of neutralizing antibodies to pev-a increased from nearly % in children aged - years, to % in those aged - years, peaking at % in adults aged - years, and then declining to % in older age groups [ ] . the lower pev-a seroprevalence in older adults may indicate waning childhood immunity or the relatively recent emergence and circulation of pev-a globally. beyond seroprevalence, few studies describe the epidemiology and clinical characteristics of pev infections in the community [ ] [ ] [ ] [ ] [ ] [ ] . these have included secondary analyses of monthly stool samples collected from months until - years of age in participants of case-control studies from norway and finland. these studies were designed originally to determine the incidence of type diabetes in genetically at-risk children [ ] [ ] [ ] . by age months, % of norwegian children recruited between and had at least pev detection, which increased to % by years of age [ ] . 
a similar pattern of pev infection, but with lower incidence rates, was observed in finnish children enrolled between and where by age months % had pev detected in their stools on at least occasion, which increased to % approaching their second birthday [ ] . in these longitudinal cohort studies, pev-a was the most commonly observed genotype ( %- %) with pev-a and pev-a detected only occasionally [ ] [ ] [ ] . limited symptom and epidemiological risk data were reported by the norwegian studies, both of which found no association between pev detection and respiratory or gastrointestinal symptoms [ , ] . other studies assessed upper airway samples for various viruses, including pev, in healthy older children and adults, but either lacked sufficient numbers or clinical or sociodemographic data to determine risk factors for pev infections or only sampled infrequently [ ] [ ] [ ] . longitudinal, community-based studies employing sensitive molecular diagnostic assays with regular and frequent sampling, irrespective of illness, are best suited to explore the true nature and disease burden of pev infections in young children. we therefore aimed to describe the epidemiology of pev in the first years of life by means of an unselected community-based birth cohort whose recruitment coincided with the first reported outbreak of severe pev-a disease in australia [ , ] , and to investigate the risk factors and symptoms associated with acquiring pev in these young children. the observational research in childhood infectious diseases (orchid) project (clinicaltrials.gov identifier nct ) is a community-based, longitudinal, birth cohort study of acute respiratory infections (aris) and acute gastroenteritis (age) in unselected, healthy australian children in their first years of life [ ] [ ] [ ] [ ] . recruitment was progressive over years; participants needed to be healthy, born at term ( - weeks), and without congenital or underlying chronic disorders. 
the children's health queensland, royal brisbane and women's hospital, and the university of queensland human research ethics committees approved the study. at enrollment, baseline sociodemographic and health data were collected by parental interview. parents maintained daily symptom diaries related to aris and age and a separate illness impact diary recording healthcare visits for these episodes. diaries were returned monthly by mail. telephone interviews were conducted to collect data on feeding and childcare attendance every months. parents collected from their child weekly anterior nasal swabs and diaper stool swabs from birth until age years. swabs were mailed to the laboratory where they were processed and stored at - °c (supplementary methods). table lists definitions for pev episodes and ari and age symptoms [ , ] . nucleic acid was extracted from the swabs using previously described protocols implemented with quality control [ , ] (supplementary methods). pev in extracts was detected using a previously published reverse-transcription polymerase chain reaction (rt-pcr) assay (supplementary table ) [ ] . stool swabs were screened for additional enteric viruses (supplementary table ) and nasal swabs for respiratory viruses [ ] using published assays [ ] . a selection of specimens positive for pev within the same episode ( figure and supplementary methods) were genotyped using published methods targeting the vp / and vp regions (supplementary methods) [ ] [ ] [ ] . one representative sequence per pev episode was selected for the phylogenetic analyses.
ari symptoms: presence of nasal congestion/discharge, dry or wet-sounding cough, wheezing, shortness of breath, or doctor-diagnosed otitis media or pneumonia [ ]
age symptoms: ≥ loose stools recorded during a -h period [ ]
symptomatic episode: ari, age, or fever only symptoms present within d before or after the first positive pev detection [ , ]
abbreviations: age, acute gastroenteritis; ari, acute respiratory infection; pev, parechovirus a.
incidence rates with % confidence intervals (cis) were assessed using poisson regression, including the natural logarithm of the number of swabs returned as an offset. each swab returned was defined as representing week of study time. the time to first detection was calculated using life tables. infants were censored at either the date of the last swab submitted if the next swab was not returned for > days, or at days, whichever came first. the association between potential risk factors and pev detection was examined using mixed-effects logistic regression with the child entered as a random effect to account for repeated measurements. risk factors considered were age (categorized as -< months, -< months, -< months, - months), sex, exclusive breastfeeding, older siblings in the household, childcare attendance (none/informal/formal), season of pev detection, and season of birth. given the strong association between breastfeeding and childcare with age, both breastfeeding-by-age and childcare-by-age interaction terms were included in the models. all variables (except sex) were analyzed as time-varying variables. univariable and multivariable analyses were conducted. in multivariable analyses, all variables were included in the regression models. a linear regression model was used to investigate the association between symptoms and pev cycle threshold (ct) values, with the latter being inversely proportional to the amplified target nucleic acid in the sample, representing a semiquantitative estimate of viral load. ct values from the first pev detection in each episode were analyzed. data were analyzed using stata version . 
software (statacorp, college station, texas). of infants enrolled, were excluded: due to preterm birth (born < weeks), and due to failure to provide any swabs. the remaining children ( male) provided stool and nasal swabs, of which and , respectively, met the inclusion criteria ( figure ). symptom diaries were submitted for children, constituting child-days of observation. cohort characteristics are summarized in supplementary table . pevs were detected in ( . %) stool and ( . %) nasal weekly swabs ( figure ). based on stool swab results, distinct pev episodes were identified during the study, including ( . %) episodes that coincided with the positive nasal swab detections. of the of ( . %) episodes meeting genotyping inclusion criteria (figure ), ( . %) were typed successfully (supplementary table figure ) [ , ] . the overall pev incidence rate was ( % ci, - ) episodes per child-years. incidence rates for the major types (pev-a types , , and ) were ( % ci, - ), ( % ci, - ), and ( % ci, - ) episodes per child-years, respectively. after the first months of life, the incidence of new pev infections rose steadily until age months before plateauing and then gradually declining in the second year (supplementary figure ) . the maximum episode number detected was in children (supplementary results and supplementary figure ). the mean individual episode duration was . (standard deviation, . ) weeks with a maximum shedding duration of weeks. in the occurrences with different pev genotype detections from sequential swabs, the maximum duration of combined genotype shedding was weeks (supplementary figure ) . the first pev infection appeared at a median of . months of age (interquartile range [iqr], . - . ; supplementary table ) with the earliest detection at day of life in a stool swab. by months, . % had experienced at least pev infection, increasing to . % by their second birthday. 
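the incidence rates above can be reproduced in outline as follows. this is a minimal sketch, not the authors' stata code: it assumes an intercept-only poisson model with the log of follow-up time as offset, for which the estimated rate is simply episodes divided by time at risk and the standard error of the log rate is 1/sqrt(episodes). the function name, the one-week-per-swab conversion, the wald-type interval, and the example counts are all illustrative assumptions.

```python
import math


def incidence_rate(episodes, swabs_returned, per_child_years=1.0, z=1.96):
    """Rate and Wald-type CI from an intercept-only Poisson model (sketch).

    Assumption: each returned swab contributes one week of follow-up.
    Returns (rate, lower, upper) expressed per `per_child_years` child-years.
    """
    if episodes <= 0 or swabs_returned <= 0:
        raise ValueError("episodes and swabs_returned must be positive")
    # Convert weekly swabs into child-years at risk.
    child_years = swabs_returned * 7 / 365.25
    rate = episodes / child_years * per_child_years
    # For a Poisson count, the standard error of log(rate) is 1 / sqrt(count).
    se_log_rate = 1 / math.sqrt(episodes)
    lower = rate * math.exp(-z * se_log_rate)
    upper = rate * math.exp(z * se_log_rate)
    return rate, lower, upper


# Hypothetical counts, not taken from the study.
rate, lower, upper = incidence_rate(episodes=100, swabs_returned=5218)
print(f"{rate:.2f} ({lower:.2f}-{upper:.2f}) episodes per child-year")
```

because the interval is built on the log scale, it is asymmetric around the point estimate and never crosses zero, which suits rates based on modest episode counts.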
first infections by pev-a occurred typically much earlier than for pev-a and pev-a ( figure ). pevs were detected throughout each year, but peaked in summer and early autumn (december through march; supplementary figure ). seasonality of the main genotypes varied in magnitude between years, with dominant genotypes alternating or sharing seasonal peaks ( figure ). of particular note was pev-a , whose magnitude rose sharply and became the dominant genotype during the spring/summer months of - ( figure ) . characteristics independently associated with pev infections included increasing age (particularly > months), summer season, not exclusively breastfeeding at a younger age ( - months), and attending formal childcare at a younger age (< months) (supplementary table ). sex, season of birth, and presence of older siblings in household were not associated with pev infections. of discrete pev episodes (figure ), ( . %) were asymptomatic, had ari symptoms ( . %), had age ( . %), had both ( . %), had an undifferentiated febrile illness ( . %), and lacked symptom data ( . %). when pev was the sole detected virus in stools, . % of episodes were asymptomatic (table ) , whereas if respiratory viruses in nasal swabs were also absent, the proportion without symptoms increased to . % (table ) . similar findings were observed when pev-a episodes were considered alone (supplementary table ). of the of symptomatic episodes with recorded burden information, ( . %) received a primary care consultation, but none needed hospitalization. overall, no association was observed between pev viral load and symptomatic episodes (symptomatic vs asymptomatic mean ct difference, ; table ) . overall, in ( . 
%) pev-associated episodes, at least other virus was detected in stool swabs collected within days of the onset of the pev infection (table ) . indeed, shedding of non-pev viruses was also relatively common in the stools of orchid subjects (supplementary table ). in pev-associated ari episodes, ( . %) had at least respiratory virus detected in nasal swabs during the -day window ( figure and table ); these were primarily rhinoviruses, which were present in of ( . %) symptomatic ari episodes. the orchid study provides a high-resolution representation of pev infection in the first years of life in healthy australian children. we observed frequent pev detections in stools, but not in nasal swabs, from these young children. diverse pev genotypes circulated in the community, with distinct seasonal peaks and genotype transitioning during summer and early autumn. prolonged pev shedding resulted from sequential infections with different genotypes rather than single infections, suggesting limited heterotypic protection. a sharp increase in pev-a detections in cohort children coincided with a pev-a -related sepsis outbreak in australian infants where both sets of strains clustered phylogenetically [ ] . however, infections in cohort children were either community-managed or asymptomatic in nature with symptom association confounded by high viral codetection rates. primary pev infection occurred early in orchid subjects with > % affected by their first birthday, compared with % [ ] and % [ ] of children with pev detections by the same age in other community-based studies. however, neither study recruited before months of age, and both sampled less frequently (monthly) and could sequence fewer ( %- %) positive samples, meaning short or sequential episodes may have been missed [ , ] . by age months, . % of orchid children had at least pev infection, agreeing with the norwegian ( %), but not finnish ( %) studies [ , , ] , and aligning with seroprevalence surveys [ , , ] . 
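the cumulative proportions of children with at least one infection by a given age come from the life-table analysis of time to first detection. the sketch below uses the product-limit (kaplan-meier) estimator, a close relative of the life-table method rather than the authors' exact procedure; the function name and the toy records are hypothetical.

```python
def cumulative_incidence(records, horizon):
    """Product-limit estimate of the proportion with a first event by `horizon`.

    records: list of (time, had_event) pairs, one per child, where `time` is
             the age at first detection (had_event=True) or at censoring
             (had_event=False), in any consistent unit.
    """
    event_times = sorted({t for t, had_event in records if had_event and t <= horizon})
    event_free = 1.0
    for t in event_times:
        # Children still under observation just before time t.
        at_risk = sum(1 for time, _ in records if time >= t)
        events = sum(1 for time, had_event in records if had_event and time == t)
        event_free *= 1 - events / at_risk
    return 1 - event_free


# Toy data (ages in months), not taken from the study.
toy = [(2, True), (4, True), (5, False), (8, True), (12, False)]
print(round(cumulative_incidence(toy, horizon=6), 3))   # approximately 0.4
print(round(cumulative_incidence(toy, horizon=12), 3))  # approximately 0.7
```

censored children stay in the risk set up to their censoring time and then drop out, so incomplete follow-up reduces precision without biasing the estimate downward, which is the reason for using a time-to-event method instead of a simple proportion.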
in orchid, primary pev-a detections occurred earlier than pev-a , agreeing with seroprevalence findings [ , ] , but conflicting with rt-pcr-based studies [ , ] . however, the rt-pcr studies relied upon clinical samples, which may bias the pev detection ages as pev-a is associated primarily with disease before age months [ , , , , ] . unlike stools, pev was detected rarely in nasal swabs ( . %) and was always associated with a pev-positive stool of the same genotype, supporting fecal-oral as one transmission route. previous community-based studies reported higher pev detection rates ( %- %) in respiratory samples from asymptomatic subjects [ , ] , possibly resulting from different sampling techniques (gargle and nasopharyngeal swabs). in contrast, infants hospitalized in brisbane with pev-a -associated sepsis had high and comparable positive nasopharyngeal swab ( . %) and stool ( . %) detection rates [ ] . given the asymptomatic or mild nature of cases in our study, this observed discrepancy with hospitalized infant pev detection rates is likely a reflection of more severe systemic disease. across the study, pev-a was detected infrequently until the - spring/summer when it predominated. this coincided with the outbreak of pev-a -associated sepsis in australian infants [ , ] , and aligns with the predicted timing of a genomic recombination event leading to a more virulent phenotype [ ] . nevertheless, in our cohort pev-a was not associated with severe symptoms, even when detected during the outbreak period or within the high-risk first months of life. case-control studies [ ] and the sharp increase in pev infections after age months in our cohort suggest maternal antibody protection against pev-related disease early in life. thus, emergence of a novel recombinant pev-a strain and declining pev-a neutralizing antibodies in women of childbearing age at a population level may contribute to the biennial outbreaks in australian infants [ , ] . 
pev infection risk increased during summer, in agreement with some [ , , , , ] , but not all studies as late autumn peaks are also reported [ , , , ] . presence of older household siblings was not associated with pev or pev-a infections within orchid, a finding contrasting with hospital-based studies focusing mainly on pev-a cases [ , ] . however, it agrees with the norwegian community-based study where a sibling age gap > years increased the likelihood of infection, possibly originating from similar-aged peers attending kindergarten [ ] . almost half the sole pev episodes within orchid were asymptomatic, while virus codetection was common in the remaining symptomatic episodes, primarily from rhinovirus-associated aris, but also other enteric viruses. however, we may have underestimated the etiologic role of pev in aris as rhinoviruses themselves have also been shown to be detected frequently in asymptomatic cases from the same cohort [ ] . nevertheless, these results highlight the overall mild nature of pev infections within the community with none of the participants hospitalized, including infants infected with pev-a . the strength of our study lies in the systematic, high-frequency sample and symptom data collection from an unselected community-based birth cohort, a combination not previously reported in pev epidemiology studies. weekly samples allowed us to observe infections < weeks' duration ( . % of all episodes) that would not be captured by sampling less frequently [ , ] . in the finnish study, ( . %) children had secondary episodes, with only genotyped [ ] . this contrasts with our findings, where . % of subsequent infections were typed successfully, with multiple samples within the same month available for sequencing and potentially allowing identification of sequential infections. however, several limitations should be considered. 
first, the orchid study was conducted in an urban, subtropical setting and most enrolled families were socioeconomically advantaged with higher rates of early age childcare attendance [ ] . while the findings are valid, they may not be entirely generalizable to other populations. second, the study relied upon parents diligently recording symptoms and collecting samples. although completion rates were excellent for such a demanding study, not all diaries and swabs were returned. finally, the genotyping regions used in this study are highly conserved within the genome and may not reflect the true genetic diversity shown by the pev-a recombinant variant associated with sepsis [ ] . the sharp increase in orchid pev-a cases observed during the first pev-a outbreak [ ] suggests this recombinant variant was circulating within the community. in summary, orchid extends previous limited seroprevalence and community-based surveys by confirming that pev infections from multiple genotypes are common during the first years of life, with virus shedding primarily through the fecal route. prolonged pev shedding is from sequential infections with different genotypes rather than a single virus episode. annual outbreaks occurred in the summer/autumn months with the severe disease-associated pev-a succeeding pev-a as the predominant strain in the - season, but it and other pev infections were typically asymptomatic or associated with mild ari or age symptoms. despite participants being predominantly from advantaged families, which may limit the extrapolation of findings to lower socioeconomic settings, these observations illustrate the value of unselected, high-resolution longitudinal community-based cohort studies helping to identify the overall epidemiology, risk factors, and clinical features associated with infection rather than focusing exclusively upon the few cases with severe disease. 
of total episodes and episodes per enrolled child each month of all parechovirus a (pev-a) genotypes as a function of enrolled subjects human parechovirus in respiratory specimens from children in kansas city, missouri pediatric parechovirus infections strategies to improve detection and management of human parechovirus infection in young infants specific association of human parechovirus type with sepsis and fever in young infants, as identified by direct typing of cerebrospinal fluid samples parechovirus genotype outbreak among infants human parechovirus: an increasingly recognized cause of sepsis-like illness in young infants increased detection of human parechovirus infection in infants in england during : epidemiology and clinical characteristics severe parechovirus infections in young infants-kansas and missouri high prevalence of developmental concern amongst infants at months following hospitalised parechovirus infection human parechovirus seroprevalence in finland and the netherlands seropositivity and epidemiology of human parechovirus types , , and in japan seroepidemiology of parechovirus a neutralizing antibodies longitudinal observation of parechovirus in stool samples from norwegian infants longitudinal study of parechovirus infection in infancy and risk of repeated positivity for multiple islet autoantibodies: the midia study human parechoviruses are frequently detected in stool of healthy finnish children detection of respiratory viruses in gargle specimens of healthy children detection and characterization of enteroviruses and parechoviruses in healthy people living in the south of cote d'ivoire respiratory virus detection and clinical diagnosis in children attending day care human parechovirus in infants: expanding our knowledge of adverse outcomes observational research in childhood infectious diseases (orchid): a dynamic birth cohort study the burden of community-managed acute respiratory infections in the first -years of life viruses causing 
lower respiratory symptoms in young children: findings from the orchid birth cohort multivalent rotavirus vaccine and wild-type rotavirus strain shedding in australian infants: a birth cohort study rapid detection of human parechoviruses in clinical samples by real-time pcr epidemiology and clinical associations of human parechovirus respiratory infections diversity of human parechoviruses isolated from stool samples collected from thai children with acute gastroenteritis high prevalence of human parechovirus (hpev) genotypes in the amsterdam region and identification of specific hpev variants by direct genotyping of stool samples evolutionary and network analysis of virus sequences from infants infected with an australian recombinant strain of human parechovirus type hpev- predominated among parechovirus a positive infants during an outbreak in - in queensland, australia seroepidemiology of human parechovirus types , , and in yamagata severe human parechovirus infections in infants and the role of older siblings human parechovirus , and neutralizing antibodies in dutch mothers and infants and their role in protection against disease asymptomatic children might transmit human parechovirus type to neonates and young infants intrafamilial transmission of parechovirus a and enteroviruses in neonates and young infants supplementary materials are available at clinical infectious diseases online. consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author. 
key: cord- -xfyavk m authors: gil-cruz, cristina; perez-shibayama, christian; onder, lucas; chai, qian; cupovic, jovana; cheng, hung-wei; novkovic, mario; lang, philipp a; geuking, markus b; mccoy, kathy d; abe, shinya; cui, guangwei; ikuta, koichi; scandella, elke; ludewig, burkhard title: fibroblastic reticular cells regulate intestinal inflammation via il- -mediated control of group ilcs date: - - journal: nat immunol doi: . /ni. sha: doc_id: cord_uid: xfyavk m fibroblastic reticular cells (frcs) of secondary lymphoid organs form distinct niches for interaction with hematopoietic cells. we found here that production of the cytokine il- by frcs was essential for the maintenance of group innate lymphoid cells (ilcs) in peyer's patches and mesenteric lymph nodes. moreover, frc-specific ablation of the innate immunological sensing adaptor myd unleashed il- production by frcs during infection with an enteropathogenic virus, which led to hyperactivation of group ilcs and substantially altered the differentiation of helper t cells. accelerated clearance of virus by group ilcs precipitated severe intestinal inflammatory disease with commensal dysbiosis, loss of intestinal barrier function and diminished resistance to colonization. in sum, frcs act as an 'on-demand' immunological 'rheostat' by restraining activation of group ilcs and thereby preventing immunopathological damage in the intestine. supplementary information: the online version of this article (doi: . /ni. ) contains supplementary material, which is available to authorized users. the intestine harbors a vast number of commensal microorganisms and is frequently confronted with infectious agents, including pathogenic viruses. moreover, food antigens that can induce severe allergic reactions enter the body via the intestinal mucosa. 
hence, the immunoregulation of tolerance to food allergens and commensal microbes must be initiated and maintained by the gut immune system, while forceful protective immunity to pathogens should be induced following their detection in gut-associated lymphoid tissues , . peyer's patches (pps) are central structures of the gut-associated lymphoid tissues and serve as the main initiation site of intestinal immune responses. moreover, antigen and immune cells from pps reach the mesenteric lymph nodes (mlns) that serve as a second major line of defense against pathogens . diverse layers of protective mechanisms, including mucosal immunoglobulin a (iga) and innate lymphoid cells (ilcs) , contribute to the maintenance of gut homeostasis and responses to intestinal pathogens. various ilc subsets (group (ilc ), group (ilc ) and group (ilc )) have distinct functions in mucosal tissues, such as interferon-γ (ifn-γ)-mediated clearance of intracellular pathogens (ilc ) or support of the expulsion of helminths by secretion of t helper type cytokines (ilc ) . a third type of ilcs contribute to the elimination of gram-negative bacteria through the production of il- (ilc ) . in addition, group ilcs direct the development of pps through interaction with fibroblastic stromal cells in the embryo . it is not clear however whether group ilcs or other ilc subsets maintain their cross-talk with fibroblastic stromal cells in adult pps. the activation of immune responses in pps and other secondary lymphoid organs (slos) depends on the establishment of specialized niches for hematopoietic cells that facilitate optimized antigen acquisition, activation of t cells in the t cell zone and induction of antibody responses in the b cell zone . the particular confines of these immune reactions are generated by fibroblastic reticular cells (frcs) that not only provide structural support and guidance to immune cells but also actively participate in shaping immune responsiveness , . 
for example, frcs of the t cell zone in lymph nodes (lns) regulate the migration and survival of t cells by producing homeostatic chemokines such as ccl and ccl , both crucial for the attraction and retention of t cells [ ] [ ] [ ] . podoplanin (pdpn)-expressing frcs located at the border between the t cell zone and b cell zone and in the b cell follicles provide b cell-stimulating factors to foster antibody responses and generate the b cell-attracting chemokine cxcl during infection . frcs situated along the ln subcapsular sinus, commonly known as 'marginal reticular cells' , are characterized by expression of the adhesion molecule madcam- (ref. ) . although the phenotype and function of frcs in peripheral lns have been studied in detail , , the precise role of frcs in pps in immunological function and pp structure is not yet known, nor is it fully clear whether frc subsets found in lns are present in pps. we found here that genetic ablation of innate immunological sensing dependent on the adaptor myd in ccl -expressing frcs precipitated severe intestinal inflammatory disease in the aftermath of infection with an enteropathogenic virus. delineation of the underlying mechanism revealed that unleashed trans-presentation of il- on myd -deficient frcs during viral infection fostered the activation of cytotoxic and non-cytotoxic group ilcs and substantially accelerated clearance of the virus. however, enhanced vigilance against the viral infection came at the cost of dysregulated t cell responses with a shift from foxp + regulatory t cells toward ifn-γ-producing type helper t cells. we conclude that frcs in pps and mlns actively regulate intestinal homeostasis by controlling local ilc activation and subsequent helper t cell differentiation. 
myd -independent frc subset specification and homeostasis
frcs generate the cellular scaffold required for the activation of immune cells in lns , , express multiple toll-like receptors (tlrs) and respond to stimulation by the innate immune system . we therefore reasoned that frcs in pps and mlns might sense mucosal pathogens and affect immunity by directing those immune cells that respond first to the intrusion. to address this issue, we selectively abolished myd -dependent innate immunological sensing in frcs through the use of mice in which loxp-flanked alleles encoding myd undergo frc-specific deletion by a transgene expressing cre recombinase under the control of the ccl promoter (ccl -cre), called 'myd -conditional-knockout' (myd -cko) mice here. frcs in pps can be highlighted in vivo through expression of enhanced yellow fluorescent protein (eyfp) from the ubiquitous rosa locus (r r) in ccl -cre r r-eyfp mice (called 'ccl eyfp mice' here) (fig. a) . ablation of myd in frcs did not affect pp formation (supplementary fig. a) , nor did it alter the size or structural organization of pps (fig. b) or their immune-cell content (supplementary fig. b-e) . likewise, the structural organization of mlns (data not shown) and their immune-cell composition (supplementary fig. b-e) were not affected by this frc-specific ablation of myd . high-resolution in situ analysis by confocal laser-scanning microscopy revealed that the pp frcs of both ccl eyfp mice and ccl eyfp myd -cko mice formed the typical network structure and expressed the frc marker pdpn (fig. a,b) , ccl , smooth muscle actin-α and the intercellular adhesion molecule icam (supplementary fig. f) . moreover, we found that the lack of innate immunological sensing in myd -cko mice did not affect expression of the ccl -cre transgene in cd − pdpn + stromal cells of pps or mlns (fig. c,d) . the appearance of frc subsets such as cd + follicular dendritic cells (fig. 
e) or cd + madcam- + marginal reticular cells was independent of myd in pps and mlns (fig. f and supplementary fig. g). likewise, the expression of other canonical frc markers was not altered by the absence of myd in eyfp + cells of pps or mlns (fig. g).

myd signaling in frcs controls antiviral ilc responses

to assess whether an invasive enteric pathogen would substantially alter the activity of frcs, we infected myd -sufficient mice and mice with frc-specific myd deficiency with mouse hepatitis virus (mhv). this cytopathic coronavirus is recognized via the tlr -myd pathway, 'preferentially' targets macrophages in slos and causes severe inflammatory disease in the intestine following uptake via the oral route. here, we used a dose of × infectious particles, which led to substantial viral replication on days and after infection in the pps and mlns of myd -sufficient mice (fig. a,b) but spared other regions of their intestine (data not shown). viral titers were significantly lower in myd -cko mice than in myd -sufficient mice on day after infection, and infectious particles were almost completely eliminated from the myd -cko mice on day (fig. a,b). this finding was unexpected, because global deficiency in myd or tlr resulted in uncontrolled viral spread, with viral titers more than , -fold higher in mice with such global deficiency than in myd -cko mice (fig. a,b). moreover, the virus was purged from the spleen and liver of myd -cko mice at day (supplementary fig. a), which indicated that potent immunological effector mechanisms had prevented systemic spread of the pathogen. the accelerated viral control in myd -cko mice as early as day after infection suggested that innate antiviral immune cells had been activated by the frc-specific myd deficiency. indeed, we found significantly more cells expressing the activating natural killer cell (nk cell) receptors nk . 
and nkp in the pps and mlns of myd -cko mice than in those of myd -sufficient early after infection (fig. c,d) . moreover, production of the antiviral effector cytokine ifn-γ was much greater in the nk . + cells of myd -cko mice than in their counterparts from myd -sufficient mice (fig. e) . histological analysis of pps from infected mice at day after infection with mhv revealed that nk . + cells were located mainly in the interfollicular regions (fig. f) . moreover, these cells were in close contact with frcs expressing the ccl -cre transgene, both in the presence of myd (fig. g) and in its absence (fig. h) in frcs. the frequency and number of both non-cytotoxic group ilcs expressing the cytokine receptor il- rα and il- rα-negative nk cells were increased when immunological sensing was altered in the frcs of pps ( fig. i,j and supplementary fig. b ,c) and mlns ( supplementary fig. b,c) . control staining revealed that nk . + il- rα − group ilcs, but not il- rα + group ilcs, expressed the signature transcription factor eomes (supplementary fig. d) . frc-specific myd deficiency had a positive effect on the population expansion and activation of group ilcs, whereas the size of the ilc and ilc compartments in pps (fig. i,j) and mlns (supplementary fig. b,c) was slightly reduced relative to their size in myd -sufficient mice. to assess whether this enhanced antiviral immunity was exclusively dependent on group ilcs, we ablated nk . -expressing cells using a well-established depletion protocol (supplementary fig. e,f) . antibody-mediated depletion of group ilcs completely restored viral replication in all organs of myd -cko mice but had little effect on viral replication in myd -sufficient mice (fig. k) . these data indicated that the frcs in pps and mlns were able to function as potent regulators of antiviral immunity and that this function was exerted via control of group ilcs. 
the differentiation and activation of cytotoxic and non-cytotoxic ilcs are stringently regulated by cytokines. in addition, nk cell activity is controlled by inhibitory receptors that bind to major histocompatibility complex class i molecules. since myd -deficient frcs in pps and mlns did not show altered expression of major histocompatibility complex class i during intestinal infection with mhv (data not shown), we established an in vitro cell-culture system (fig. a) to probe frc cytokine responses following exposure to r , a synthetic agonist of tlr and tlr . we found that myd -sufficient frcs responded to stimulation of tlr with considerable production of the inflammatory mediators il- and ccl , whereas frcs lacking myd failed to respond to stimulation with r (fig. b). exposure to this tlr ligand led to a substantial reduction in the production of il- by myd -sufficient frcs, whereas myd -deficient frcs continued to produce large amounts of this nk cell- and ilc -activating cytokine (fig. b). since il- acts mainly in a cell-contact-dependent manner via trans-presentation by il- receptor α-chain (il- rα), we used expression of il- rα as an additional marker of differential frc activation. expression of il- rα in myd -sufficient frcs was significantly reduced after stimulation with the tlr ligands r or single-stranded rna or with il- β (fig. c and supplementary fig. ). myd -dependent regulation of the production of il- in vivo was confirmed by rt-pcr analysis showing significantly higher expression of il mrna in cd − pp stromal cells from myd -cko mice than in those from myd -sufficient mice (fig. d) and in eyfp + cells sorted from the pps of ccl eyfp myd -cko mice than in those sorted from the pps of ccl eyfp mice (fig. e), on day after infection. moreover, flow cytometry revealed that il- was detectable on the surface of eyfp + cells expressing the cell-adhesion molecule cd (fig. f). 
the lack of myd substantially increased not only the amount of surface-bound il- but also expression of the il- -trans-presenting molecule il- rα (fig. f) . back-gating showed that the enhanced il- production under conditions of myd deficiency could be attributed to pdpn dim cd hi frcs (fig. g) , which suggested that a distinct frc subpopulation activated group ilcs via trans-presentation of il- . to assess whether il- is the dominant frc-derived factor that mediates the population expansion of group ilcs and controls viral replication, we applied a neutralizing antibody to il- (anti-il- ) during the course of the infection starting d before infection with mhv. we found that treatment with anti-il- efficiently blunted the exaggerated population expansion of group ilcs in myd -cko pps (fig. a,b) . notably, neutralization of il- led to a reduced frequency (fig. a) and absolute number (fig. b) of group ilcs in myd -sufficient mice and diminished the ilc population both in myd -sufficient pps and in myd -cko pps (fig. c) . these data confirmed the importance of il- for ilc homeostasis in pps and indicated that innate immunological signaling in frcs functioned as an important regulatory switch for this il- -driven ilc proliferation. assessment of the expression of cd , a component of the il- receptor, on eomes + and eomes − ilc subsets revealed no major change in expression of the il- receptor β-chain (cd ) under conditions of frc-specific myd deficiency (fig. d) , indicative of an ilc extrinsic regulation circuit. notably, in vivo neutralization of il- disinhibited viral replication in myd -cko mice, while myd -sufficient mice showed only minor changes in viral load (fig. e) . together these data revealed that the ilc activity in pps and mlns was controlled almost exclusively via a single cytokine derived from a particular frc subset. 
moreover, it appeared that innate immunological activation of frcs via direct recognition of viral rna or through il- β-mediated stimulation was critical for the adjustment of ilc reactivity.

frcs regulate homeostatic ilc and nk cell maintenance

to assess whether frc-derived il- not only controls the proliferation and activity of group ilcs and/or nk cells during an infection but also contributes to their homeostasis, we selectively abolished il- expression in frcs by generating ccl -creil fl/fl mice (called 'il -cko' mice here). we found that frc-specific ablation of il- affected neither pp formation (supplementary fig. a) nor the composition of t lymphocytes and b lymphocytes in this slo (supplementary fig. b). nk . + nkp + cells were almost completely absent under conditions of frc-specific loss of il- , whereas neither the ilc compartment nor the ilc compartment was significantly affected (fig. a,b and supplementary fig. c,d). since the ccl -cre transgene was active in less than . % of nonendothelial bone marrow stromal cells (supplementary fig. e), we concluded that the il- -expressing frcs in gut-associated slos generated an essential niche for the maintenance of group ilcs and nk cells under homeostatic conditions. next we assessed to what extent such specific ablation of il in frcs affected the activation of ilc - and/or nk cell-mediated antiviral immunity. even under inflammatory conditions, we found an almost complete absence of nk . + nkp + cells in the pps (fig. c) and mlns (fig. d) of il -cko mice. moreover, other cd + lin − cells (i.e., group and group ilcs) did not produce ifn-γ during the early phase of the viral infection in il -cko mice (fig. c,d), which indicated that other sources of il- , such as dendritic cells or macrophages, failed to compensate for the lack of this growth factor in the frc niche. consistent with the results of the anti-nk . ablation experiment (fig. 
k), we found only a moderate effect of the selective loss of group ilcs and/or nk cells on the control of viral replication (fig. e). thus, we concluded that the frcs built and maintained an exclusive il- -dependent niche for the maintenance of group ilcs and/or nk cells in the pps and mlns and controlled antiviral immunity in gut-associated slos through cessation of il- production. early ifn-γ production by nk cells has been shown to be important for the differentiation of helper t cell subsets. accordingly, the greater abundance of ifn-γ-secreting group ilcs in myd -cko mice was associated with an elevated frequency of antiviral cd + t cells that secreted ifn-γ in the pps and mlns on day after infection with mhv (fig. a,b and supplementary fig. a,b) but only moderate proliferation of il- -producing helper t cells (supplementary fig. c). notably, myd deficiency in frcs precipitated a significantly lower abundance of regulatory t cells expressing the transcription factor foxp in the pps of myd -cko mice than that in the pps of myd -sufficient mice (fig. c), which suggested that immunological regulation in the small intestine might have been compromised. indeed, the lack of innate immunological sensing in frcs led to substantial weight loss in mhv-infected myd -cko mice (fig. d), despite their accelerated viral clearance (fig. a). as expected, global deficiency in myd , which permitted almost unrestricted replication of the cytopathic virus (fig. b and supplementary fig. d), was associated with a worsened clinical appearance (fig. d). gross pathological analysis revealed that the intestinal wall of myd -cko mice was considerably inflamed on day after mhv infection, with a significantly shorter colon length than that of myd -sufficient mice (fig. e). histopathological examination confirmed that the enhanced immunopathology in the small intestine of myd -cko mice (fig. f) included an edematous muscular layer and blunting of villi (fig. g). 
those pathological changes were associated with a greater antibody response to intestinal microbiota such as escherichia coli than that of myd -sufficient mice (fig. h), which indicated that the integrity of the epithelial barrier was compromised in infected myd -cko mice. such weakening of the epithelial barriers in mhv-infected myd -cko mice was accompanied by pronounced changes in the composition of the microbiome (fig. i and supplementary fig. e), whereas the composition of the commensal flora in naive mice was not affected by frc-specific myd deficiency (supplementary fig. e). the finding that myd -cko mice showed significantly lower resistance to colonization by other intestinal pathogens such as citrobacter rodentium on day after infection with mhv, relative to that of myd -sufficient mice (fig. j), further emphasized the importance of immunoregulatory functions of frcs and their ability to adjust global immune responsiveness in the intestine. genetic association studies have provided evidence that multiple tlrs are involved in balancing the sensing of microbes in chronic inflammatory disease of the intestine. moreover, experimental studies have revealed that innate signal integration via the shared adaptor myd promotes epithelial integrity. it has been suggested that ilcs secreting cytokines involved in chronic intestinal inflammation, such as il- , il- or ifn-γ, serve as critical sentinels by translating innate immunological signals for the activation of adaptive immunity. our study has identified a myd -dependent pathway of ilc regulation that is called into action once the epithelial barrier has been breached by a pathogen. our results showed that antiviral ilc and nk cell responses in pps and mlns were efficiently regulated through limiting of the provision of il- by frcs. both direct recognition of viral rna via tlr and indirect activation of frcs via il- β led to activation of this potent control mechanism. 
it appears that this direct and simple regulatory pathway acts proximal to more elaborate immunoregulatory circuits that involve the balancing of helper t cell differentiation and the generation of regulatory t cells. ilcs are recognized as the main drivers of the diverse immune responses needed to maintain mucosal surface integrity and protection . however, these cells are almost completely unresponsive to microbial ligands , which suggests that ilcs utilize indirect means to regulate their activation and functionality after microbial invasion. the data presented here suggest that frcs in gut-associated slos function as an intermediate cell to regulate ilc function specifically via il- . the importance of il- production by frcs was emphasized by the almost complete absence of group ilcs and/or nk cells in mice that lacked il expression in frcs. thus, it appears that frcs in pps and mlns form a highly specialized niche for group ilcs and/or nk cells. that interpretation is in line with the finding that il- production by myeloid cells is not an exclusive regulator of the homeostasis of group ilcs and/or nk cells in slos . thus, it appears that il- production by frcs secures the maintenance of group ilcs and/or nk cells in slos and prevents exaggerated innate immune reactions through myd -dependent restriction of il- production under conditions in which myeloid cells start to provide il- via its trans-presentation by il- rα to other activated immune cells. the conceptual framework of intestinal immunoregulation outlined by published studies , predicts that innate immune signaling in intestinal tissues needs to be equilibrated at levels that preserve both tissue repair and host defense. indeed, the absence of myd signaling in all cells rendered the host highly susceptible to infection with cytopathic murine coronavirus. this virus exhibits strong tropism for myeloid cells, and the provision of type interferons is needed to prevent cell death , . 
moreover, type interferons are important stimulators of nk cells and potentiate early antiviral immune responses, for example, through the stimulation of il- production by dendritic cells , . interestingly, il- -mediated counter-balancing of this protective mechanism was induced in frcs immediately after the recognition of viral rna via tlr , following exposure to il- β and the transmission of these innate immunological signals via myd . our finding that the proliferation and activity of group ilcs and nk cells in pps and mlns was controlled almost exclusively via il- is in line with the finding that intraepithelial group ilcs are highly responsive to . it remains to be determined whether the ilc subset, and other ilc subsets, outside of pps and mlns are subject to regulation by stromal cells. it is possible that mesenchymal stromal cells of the lamina propria, such as myofibroblasts or mural cells, exert immunoregulatory functions comparable to those of frcs in pps. clearly, lamina propria mesenchymal stromal cells not only form the scaffold of the tissue but also provide a plethora of positive and negative regulatory factors that impinge on innate and adaptive immune processes . in sum, our study has extended the current paradigm of immunoregulation in the intestine and has revealed innate immunological sensing in frcs as a key mechanism for the maintenance of intestinal homeostasis. it will be important to further delineate the frc-ilc axis of intestinal immunoregulation to facilitate the targeting of processes that efficiently restrain intestinal immunopathology. methods and any associated references are available in the online version of the paper. 
type innate lymphoid cells control eosinophil homeostasis
nfil is crucial for development of innate lymphoid cells and host protection against intestinal pathogens
development of secondary lymphoid organs
form follows function: lymphoid tissue microarchitecture in antimicrobial immune defence
lymph node fibroblastic reticular cells in health and disease
fibroblastic reticular cells: organization and regulation of the t lymphocyte life cycle
fibroblastic reticular cells in lymph nodes regulate the homeostasis of naive t cells
restoration of lymphoid organ integrity through the interaction of lymphoid tissue-inducer cells with stroma of the t cell zone
maturation of lymph node fibroblastic reticular cells from myofibroblastic precursors is critical for antiviral immunity
b cell homeostasis and follicle confines are governed by fibroblastic reticular cells
identification of a new stromal cell type involved in the regulation of inflamed b cell follicles
marginal reticular cells: a stromal subset directly descended from the lymphoid tissue organizer
transcriptional profiling of stroma from inflamed and resting lymph nodes defines immunological hallmarks
control of coronavirus infection through plasmacytoid dendritic-cell-derived type i interferon
type i ifn-mediated protection of macrophages and dendritic cells secures control of murine coronavirus infection
the organ tropism of mouse hepatitis virus a in mice is dependent on dose and route of inoculation
natural killer cell activation enhances immune pathology and promotes chronic infection by limiting cd + t-cell immunity
the biology of innate lymphoid cells
dissecting natural killer cell activation pathways through analysis of genetic mutations in human and mouse
the biology of interleukin- and interleukin- : implications for cancer therapy and vaccine design
macrophage- and dendritic-cell-derived interleukin- receptor alpha supports homeostasis of distinct cd + t cell subsets
dendritic cells support the in vivo development and maintenance of nk cells via il- trans-presentation
induced recruitment of nk cells to lymph nodes provides ifn-γ for t h priming
the role of the toll receptor pathway in susceptibility to inflammatory bowel diseases
crohn's disease is associated with a toll-like receptor- polymorphism
recognition of commensal microflora by toll-like receptors is required for intestinal homeostasis
innate immune signalling at intestinal mucosal surfaces: a fine line between host protection and destruction
the unusual suspects--innate lymphoid cells as novel therapeutic targets in ibd
transcription factors controlling innate lymphoid cell fate decisions
il- rα chaperones il- to stable dendritic cell membrane complexes that activate nk cells via trans presentation
an innately dangerous balancing act: intestinal homeostasis, inflammation, and colitis-associated cancer
initial and innate responses to viral infections--pattern setting in immunity or disease
dendritic cells prime natural killer cells by trans-presenting interleukin
differential responses of immune cells to type i interferon contribute to host resistance to viral infection
intraepithelial type innate lymphoid cells are a unique subset of il- - and il- -responsive ifn-γ-producing cells
mesenchymal cells of the intestinal lamina propria

anova with tukey's post-test or two-way anova with bonferroni's post-test. statistical significance was defined as p < 

generalized lacz expression with the rosa cre reporter strain
identification of coronavirus non-structural protein as a major pathogenicity factor-implications for the rational design of live attenuated coronavirus vaccines
intimin-specific immune responses prevent bacterial colonization by the attaching-effacing pathogen citrobacter rodentium

the authors declare no competing financial interests. reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

mice.
c bl/ /n (b ) mice were purchased from charles river laboratories (germany). myd −/− and tlr −/− mice were obtained from the institute for laboratory animal sciences at the university of zürich. bac-transgenic c bl/ n-tg(ccl -cre) biat (ccl -cre) mice have been previously described and were crossed with r r-eyfp mice . to specifically ablate myd or il- in frcs, ccl -cre and ccl eyfp mice were crossed with mice with loxp-flanked myd alleles (jackson laboratory) or with mice with loxp-flanked il alleles. constitutive ablation of il was achieved by flanking exon with two loxp sequences by standard germline recombination in embryonic stem cells. all mice were on the c bl/ n genetic background and were maintained in individually ventilated cages and were used between and weeks of age. as control animals, co-housed and cre-negative littermate mice have been used in all experiments. experiments were performed in accordance with federal and cantonal guidelines (tierschutzgesetz) under permission numbers sg / , sg / and sg / following review and approval by the cantonal veterinary office (st. gallen, switzerland). mice were infected with mhv a , orally by gavage with × plaque-forming units as previously described . mice were sacrificed and organs were stored at − °c until further analysis. mhv titers were determined by standard plaque assay using l cells table ). cells were acquired with a facscanto (bd biosciences) and analyzed using flowjo software (treestar inc.). analysis of eomes, gata- , tbet, foxp and rorγt expression was performed using the foxp /transcription factor staining buffer set from ebioscience, according the manufacturer's instructions. analysis of nk . + and cd + t cells responses was performed using cytokine production after stimulation with phorbolmyristateacetate (pma, ng/ml) and ionomycin ( ng/ml; both purchased from sigma) in the presence of brefeldin a ( µg/ml) for h at °c. 
for peptide-specific cytokine production, cells were restimulated with m peptide (tvyvrpiiedyhtlt; genscript) in the presence of brefeldin a ( µg/ml) for h at °c. for intracellular staining, restimulated cells were surface-stained and fixed with cytofixcytoperm (bd biosciences) for min. fixed cells were incubated at °c for min with permeabilization buffer ( % fcs, . % saponin in pbs) containing antibodies to ifn-γ and to il- a (supplementary table ). samples were analyzed by flow cytometry using a facscanto (becton dickinson), data were analyzed using flowjo software (tree star, inc.).histology. pps and mlns were fixed overnight in freshly prepared % paraformaldehyde (merck) at °c under agitation. fixed organs were embedded in % low melting agarose (invitrogen) in pbs and sectioned with a vibratome (leica vt- ). -to -µm-thick sections were blocked in pbs containing % fcs, anti-fcγ receptor (supplementary table ) and . % triton x- (sigma). sections were incubated over night at °c with the following antibodies: anti-pdpn, anti-b , anti-cd , anti-cd , anti-smooth muscle actin-α, anti-eyfp and anti-ccl (supplementary table ) . unconjugated antibodies were detected with the following secondary antibodies: dylight conjugated anti-rat-igg, alexa -conjugated anti-rabbit-igg, dylight conjugated anti-syrian hamster-igg and dylight -conjugated streptavidin (supplementary table ) . microscopy was performed using a confocal microscope (zeiss lsm- ) and images were processed with zen software (carl zeiss, inc.) and imaris (bitplane). intestinal tissues were removed and cleaned and the distal ileum was immediately fixed in neutral buffered % formalin solution. two to four -µm-thick sections of the ileum from each mouse were stained with hematoxylin and eosin. 
each sample was graded for the following four criteria on a scale of - ( , none; , mild; , intermediate; severe): leukocyte infiltration in the lamina propria; edema; villus distortion; and presence of markers of severe inflammation such as crypt abscesses, submucosal inflammation, and ulcers. scores for each criterion were added to generate an overall inflammation score for each sample. sections were scored in a blinded fashion by two independent observers. table ). samples were analyzed by flow cytometry using a facscanto (bd biosciences). data were analyzed using flowjo software (tree star, inc.).

quantitative real-time pcr. total cellular rna was extracted from homogenized tissues and sorted cells using trizol reagent (invitrogen) following the manufacturer's protocol. cdna was prepared using the cdna archive kit (applied biosystems), and quantitative rt-pcr was performed using the light cycler-faststart dna master sybr green i kit (roche diagnostics) on a lightcycler machine (roche diagnostics). mrna expression was measured using the quantitect sybr green pcr primers (qiagen). for microbial composition analysis, the ileal content of naive or mhv-infected myd -cko mice and their cre-negative littermates was used. microbial composition was assessed by a s high-throughput amplicon analysis. the s rrna gene segments spanning the variable v and v regions were amplified from dna from ileal content samples, using a multiplex approach with the barcoded forward fusion primer ′-ccatctcatccctgcgtgtctccgactcag barcode attagatacccyggtagtcc- ′ in combination with the reverse fusion primer ′-cctctctatgggcagtcggtgatacgagctgacgacarccatg- ′. the pcr-amplified s v -v amplicons were purified and prepared for sequencing on the ion torrent pgm system according to the manufacturer's instructions (life technologies). samples with over reads were accepted for analysis. data analysis was performed using the qiime pipeline version . . . 
operational taxonomic units were picked using uclust with a % sequence identity threshold, followed by taxonomy assignment using the latest greengenes database.

statistical analysis. statistical analyses were performed with graphpad prism . using an unpaired two-tailed student's t-test or mann-whitney test. longitudinal comparison between different groups was performed by

key: cord- -v gkubd authors: mäkinen, janne j.; shin, yeonoh; vieras, eeva; virta, pasi; metsä-ketelä, mikko; murakami, katsuhiko s.; belogurov, georgiy a. title: the mechanism of the nucleo-sugar selection by multi-subunit rna polymerases date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: v gkubd

rna polymerases (rnaps) synthesize rna from ntps, whereas dna polymerases synthesize dna from 'dntps. dna polymerases select against ntps by using steric gates to exclude the 'oh, but rnaps have to employ alternative selection strategies. in single-subunit rnaps, a conserved tyr residue discriminates against 'dntps, whereas selectivity mechanisms of multi-subunit rnaps remain hitherto unknown. here we show that a conserved arg residue uses a two-pronged strategy to select against 'dntps in multi-subunit rnaps. the conserved arg interacts with the 'oh group to promote ntp binding, but selectively inhibits incorporation of 'dntps by interacting with their 'oh group to favor the catalytically-inert '-endo conformation of the deoxyribose moiety. this deformative action is an elegant example of an active selection against a substrate that is a substructure of the correct substrate. our findings provide important insights into the evolutionary origins of biopolymers and the design of selective inhibitors of viral rnaps.

all cellular lifeforms use two types of nucleic acids, rna and dna, to store, propagate and utilize their genetic information. 
rna polymerases (rnaps) synthesize rna from ribonucleoside triphosphates (ntps), whereas dna polymerases (dnaps) use '-deoxyribonucleoside triphosphates ( 'dntps) to synthesize dna. the rna building blocks precede the dna building blocks biosynthetically and possibly also evolutionarily , . messenger rna molecules function as information carriers in a single-stranded form, whereas ribosomal, transfer and regulatory rnas adopt complex three-dimensional structures composed of double-stranded segments. the double stranded rnas favor a-form geometry where the ribose moiety of each nucleotide adopts the '-endo conformation (fig. a) . in contrast, dna functions as a b-form double helix, where the deoxyribose of each nucleotide adopts the '-endo conformation (fig. a, b) . hybrid duplexes between the rna and dna transiently form during transcription and adopt an a-form geometry because 'oh groups in the rna clash with the phosphate linkages in the b-form configuration. the sugar moieties of ntps and 'dntps equilibrate freely between the ' and '-endo conformations in solution with the overall bias typically shifted towards the '-endo conformers . however, both ntps and 'dntps typically adopt the '-endo conformation in the active sites of the nucleic acid polymerases . rnaps and dnaps need to discriminate efficiently against the substrates with the non-cognate sugar. the intracellular levels of ntps are in the range of hundreds of micromoles to several millimoles per liter and exceed those of the corresponding 'dntps more than tenfold [ ] [ ] [ ] . when selecting the 'dntps, most dnaps use bulky side-chain residues in their active sites to exclude the 'oh of the ntps (reviewed in ref. ). these steric gate residues, typically gln/glu in a-family dnaps and tyr/phe in y-and b-family dnaps, create a stacking interaction with the deoxyribose moiety of an incoming 'dntp and form a hydrogen bond between the backbone amide group and the ′-oh group of the deoxyribose moiety (fig. 
c) . selection against the 'dntps by rnaps is a daunting challenge because 'dntps are substructures of the corresponding ntps. single-subunit rnaps (e.g., mitochondrial and bacteriophage t and n enzymes) are homologous and structurally similar to dnaps. however, single-subunit rnaps lack a steric gate and use a conserved tyr residue to discriminate against 'dntps , . tyr selectively facilitates the binding of ntps by forming a hydrogen bond with the 'oh group of the ntp ribose ( fig. c) , . intriguingly, the same tyr also inhibits the incorporation of 'dntps by an unknown mechanism , . noteworthy, a homologous tyr hydrogen bonds with the steric gate gln/glu residue in a-family dnaps ( fig. c ) , . the mechanism of discrimination against 'dntps by the multi-subunit rnaps (bacterial, archaeal and eukaryotic nuclear rnaps) is poorly understood. the combined structural evidence (reviewed in ref. ) suggests that the 'oh group can make polar contacts with three universally conserved amino acid side chains: β'arg , β'asn and β'gln (numbering of the escherichia coli rnap). the β'arg and the β'asn are contributed by the active site cavity and can interact with the 'oh of ntps in the open and closed active site (see below), whereas the β'gln is contributed by a mobile domain called the trigger loop (tl) and can only transiently interact with the '-and '-oh of ntps in the closed active site [ ] [ ] [ ] (fig. c) . closure of the active site by the tl is an essential step during nucleotide incorporation by the multi-subunit rnaps because the α-phosphate of the ntp is located . - Å away from the rna ' end in the open active site , . complete closure of the active site by the folding of two alpha-helical turns of the tl positions the triphosphate moiety of the substrate ntp inline for an attack by the 'oh group of the rna and accelerates catalysis ˜ fold , , , . in contrast, folding of one helical turn of the tl is insufficient to promote catalysis ( 'oh  αp distance . 
Å ) but likely significantly reduces the rate of ntp dissociation from the active site by establishing contacts between the β'gln and the ribose moiety and stacking of the β'met with the nucleobase (reviewed in ref. ). the relative contribution of the tl (β'gln and β'met ) and the active site cavity (β'arg and β'asn ) to the discrimination against 'dntps remains hitherto uncertain. the closure of the active site makes only a -to -fold contribution to an overall -to -fold selectivity against the ´dntp in rnaps from e. coli and saccharomyces cerevisiae . consistently, the open active site of the e. coli rnap retained a ~ -fold overall selectivity against 'dntps . however, the open active site of the thermus aquaticus rnap has been reported to be largely unselective , and individual substitutions of the β'asn with ser in e. coli and s. cerevisiae resulted in only a < -fold decrease in selectivity , . most importantly, although the universally conserved β′arg closely approaches 'oh of the ntp in several x-ray crystal structures [ ] [ ] [ ] , (supplementary table ) and has been highlighted as the sole residue mediating the selectivity against 'dntp in a computational study by roßbach and ochsenfeld , the role of this residue has not been experimentally assessed. in this study, we systematically investigated the effects of individual substitutions of the active site residues on the discrimination against 'dntps in single nucleotide addition (sna) assays and during processive transcript elongation by the e. coli rnap. this analysis demonstrated that β'arg is the major determinant of the selectivity against 'dntps in multi-subunit rnaps. we further analyzed the binding of '-deoxy substrates by in silico docking and x-ray crystallography of thermus thermophilus rnap. our data suggest that the conserved arg actively selects against 'dntps by favoring their templated binding in the '-endo conformation that is poorly suitable for incorporation into rna. 
to investigate the mechanism of the discrimination against the '-deoxy substrates, we performed time-resolved studies of the single nucleotide incorporation by the wild-type (wt) and variant e. coli rnaps. among several single substitutions of the key residues that contact ntp ribose (fig. c) , we selected four variant rnaps that retained at least half of the wild-type activity at saturating concentrations of ntps. this approach minimized the possibility that the amino acid substitutions induced global rearrangements of the active site, thereby complicating the interpretations of their effects on the sugar selectivity. transcript elongation complexes (tecs) were assembled on synthetic nucleic acid scaffolds and contained the fully complementary transcription bubble flanked by -nucleotide dna duplexes upstream and downstream (supplementary fig. a) . the annealing region of a nucleotide rna primer was initially nucleotides, permitting the tec extended by one nucleotide to adopt the post- and pre-translocated states, but disfavoring backtracking . the rna primer was ' labeled with the infrared fluorophore atto to monitor the rna extension by denaturing page. the template dna strand contained the fluorescent base analogue -methyl-isoxanthopterin ( -mi) eight nucleotides upstream from the rna ' end to monitor rnap translocation along the dna following nucleotide incorporation . we first measured gtp and 'dgtp concentration series of the wt and altered rnaps using a time-resolved fluorescence assay performed in a stopped-flow instrument (supplementary fig. b, c) . we used the translocation assay because it allowed rapid acquisition of concentration series, whereas measurements of concentration series by monitoring rna extension in the rapid chemical quench-flow setup would be considerably more laborious. the concentration series data allowed the estimation of kcat and the km (michaelis constant) for gtp and 'dgtp.
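As a rough illustration of how kcat and Km can be estimated from a concentration series, the sketch below fits observed rates to the Michaelis hyperbola using the Hanes-Woolf linearization. This is a simplified stand-in and not the global numerical fitting used in the study; the function name, the concentrations and the rate values are all hypothetical and serve only to show the arithmetic.

```python
def fit_michaelis_hanes_woolf(concs, rates):
    """Estimate (kcat, Km) from observed rates at several substrate concentrations.

    Uses the Hanes-Woolf form of the Michaelis equation:
        s / k_obs = s / kcat + Km / kcat
    which is a straight line in s with slope 1/kcat and intercept Km/kcat.
    """
    xs = list(concs)
    ys = [s / k for s, k in zip(concs, rates)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals 1 / kcat
    intercept = mean_y - slope * mean_x    # equals Km / kcat
    kcat = 1.0 / slope
    km = intercept * kcat
    return kcat, km


# Hypothetical, noise-free concentration series (arbitrary units).
TRUE_KCAT, TRUE_KM = 50.0, 20.0
concs = [5.0, 10.0, 20.0, 50.0, 100.0, 500.0]
rates = [TRUE_KCAT * s / (TRUE_KM + s) for s in concs]

kcat_est, km_est = fit_michaelis_hanes_woolf(concs, rates)
print(f"kcat = {kcat_est:.3f}, Km = {km_est:.3f}")
```

With real, noisy data a direct nonlinear fit is generally preferable, because the linearization distorts the error structure of the measurements.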
we then supplemented the concentration series with timecourses of gmp and 'dgmp incorporation obtained using a rapid chemical quench-flow technique with edta as a quencher. edta inactivates the free gtp and 'dgtp by chelating mg + but allows a fraction of the bound substrate to complete incorporation into rna after the addition of edta , . as a result, the edta quench experiment is equivalent to a pulse-chase setup and provides information about the rate of substrate dissociation from the active site of rnap. a global analysis of the concentration series and edta quench experiments (i) allowed the estimation of the kd for gtp dissociation from the active site and (ii) suggested that the kd for the dissociation of the 'dgtp from the active site approximately equals the km for 'dgmp incorporation (see supplementary note). we further used inferred values of kcat and kd to compare the capabilities of the variant rnaps to discriminate against 'dgtp ( fig. ) . wt rnap displayed ~ -fold higher affinity for gtp than for 'dgtp ( fig. a, table ). the β'r k and β'q m substitutions decreased the selectivity at the binding step -and fold, respectively, largely by decreasing the affinity for gtp. in contrast, the β'n s decreased the selectivity only -fold, whereas the β'm a increased the selectivity . -fold. at saturating substrate concentrations, the wt rnap incorporated gmp ~ -fold faster than the 'dgmp ( fig. b, table ). the β'r k substitution decreased the selectivity -fold, primarily by accelerating the incorporation of the 'dgmp. in comparison, the effects of other substitutions on the selectivity against the 'dgtp at the incorporation step were relatively small (fig. b, supplementary tables , ) . the β'm a decreased the selectivity -fold, whereas the β'n s and β'q m increased the selectivity . -and -fold, respectively. noteworthy, the β'q m decreased the rate of 'dgmp incorporation -fold. 
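The fold-selectivities compared above can be expressed as simple ratios of the inferred constants. The sketch below assumes that selectivity at the binding step is the ratio of the Kd values, that selectivity at the incorporation step is the ratio of the kcat values, and that the overall selectivity is the ratio of kcat/Kd for the two substrates (the product of the two steps). These definitions are our reading of the text rather than a formula quoted from it, and all numerical values are hypothetical placeholders, not measurements from this study.

```python
def selectivity(kd_ntp, kcat_ntp, kd_dntp, kcat_dntp):
    """Return fold-selectivity for the NTP over the deoxy substrate.

    binding:        Kd(dNTP) / Kd(NTP)      (higher = NTP binds tighter)
    incorporation:  kcat(NTP) / kcat(dNTP)  (higher = NTP incorporated faster)
    overall:        (kcat/Kd)_NTP / (kcat/Kd)_dNTP = binding * incorporation
    """
    binding = kd_dntp / kd_ntp
    incorporation = kcat_ntp / kcat_dntp
    return {
        "binding": binding,
        "incorporation": incorporation,
        "overall": binding * incorporation,
    }


def fold_decrease(wt_value, variant_value):
    """How many fold a variant lowers a selectivity relative to the wild type."""
    return wt_value / variant_value


# Hypothetical constants (Kd in uM, kcat in 1/s) for a wild-type and a variant enzyme.
wt = selectivity(kd_ntp=10.0, kcat_ntp=100.0, kd_dntp=200.0, kcat_dntp=0.5)
variant = selectivity(kd_ntp=100.0, kcat_ntp=80.0, kd_dntp=200.0, kcat_dntp=8.0)

for step in ("binding", "incorporation", "overall"):
    print(step, wt[step], variant[step], fold_decrease(wt[step], variant[step]))
```

Separating the two steps in this way makes it explicit whether a substitution lowers selectivity by weakening binding of the cognate substrate or by accelerating incorporation of the non-cognate one.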
overall, these experiments suggested that the β'arg plays a central role in the discrimination against '-deoxy substrates ( table ) : the β'arg selectively facilitated binding of gtp and selectively inhibited the incorporation of 'dgmp. in contrast, the role of the β'gln was complex: while the β'gln selectively facilitated the binding of gtp, it also selectively facilitated the incorporation of 'dgmp. the time-resolved sna assays described above are superior to any other currently available techniques for the quantitative assessment of the binding and incorporation of different substrates and the effects of active site residues therein. however, these assays have several limitations: the nucleotide incorporation was measured for static complexes stabilized in the post-translocated state by the artificially limited rna:dna complementarity and the effects are assessed only at a single, easy to transcribe, sequence position. to test if the conclusions drawn from the sna assay remain valid during processive transcript elongation we developed a semiquantitative assay as follows. tecs were assembled on a nucleic acid scaffold with a bp-long downstream dna and chased with ntp mixtures containing µm atp, ctp, utp and gtp or 'dgtp for min at °c. transcription with the 'dgtp by the wt rnap resulted in characteristic pauses at each sequence position preceding the incorporation of the 'dgmp ( fig. , pre-g sites). we used the amplitude of these accumulations as a semi-quantitative measure of the ability of the rnap to utilize 'dgtp. noteworthy, the interpretation of the processive transcription by some variant rnaps was complicated by enhanced pausing after the incorporation of cytosine (fig. b , at-c sites) and 'dgmp ( fig. b , at-g sites) in certain sequence contexts. however, these additional pauses were unrelated to the utilization of the 'dgtp as a substrate and could be disregarded when comparing pre-g pauses that occurred upstream of all at-c and at-g pauses. 
in contrast to the wt rnap, the β'r k did not pause prior to the incorporation of the 'dgmp ( fig. ) , consistent with the significantly higher 'dgmp incorporation rate observed in sna assays (fig. b) . moreover, the β'r l rnap also did not accumulate at the pre-g sites despite being strongly defective during processive transcription (fig. a, supplementary fig. ) . these data suggest that the loss of selectivity in the β'r k is attributable to the absence of the β'r rather than the gain-of-function effect of the lys residue at the corresponding position. the β'm a paused noticeably less whereas the β'q m paused noticeably more than the wt rnap at the pre-g sites (fig. a, supplementary fig. ) , consistent with the -fold higher (β'm a) and -fold lower (β'q m) kcat for the 'dgmp incorporation in the sna experiments ( fig. b, table ). in contrast, the β'n s was largely indistinguishable from the wt rnap in its ability to utilize the 'dgtp in the processive transcription assay (fig. , supplementary fig. ) , presumably because this assay is not sensitive enough to resolve the ~ . -fold difference in kcat for the 'dgmp incorporation (fig. b, supplementary tables , ) . overall, the analysis of the utilization of the 'dgtp during the processive transcription of diverse sequences by the wt and variant rnaps recapitulated the major effects observed in the sna experiments. next, we tested the effects of the β'r k, β'm a, β'q m and β'n s substitutions on utilization of 'datp, 'dctp and 'dutp during processive transcription (supplementary figs. - ). for each 'dntp, we custom designed a template where the 'dntp is incorporated several times early in transcription, thereby allowing unambiguous interpretation of the accumulation of rnaps at sites preceding the 'dnmp incorporation. an analysis of the utilization of 'datp, 'dctp and 'dutp largely recapitulated the effects observed for 'dgtp, except that the β'n s was markedly inferior to the wt rnap in utilizing 'datp and 'dutp.
overall, these data demonstrated that the enhanced or diminished capabilities of the variant rnaps to utilize 'dgtp in the sna assays reflected, in qualitative terms, their capabilities to utilize all four 'dntps. the role of the β'arg in selectively promoting the binding of ntps was easy to explain because the β'arg interacts with the 'oh of the ntp analogues in several rnap structures (supplementary table , fig. c, a, b) . in contrast, the observation that the β'arg selectively inhibited the incorporation of 'dntps could not be readily explained: our results show that the β'arg substitutions promote the incorporation of the substrate that lacks the 'oh group, which the β'arg would interact with. we hypothesized that, in the absence of the 'oh, the β'arg interacted with something else and that the interaction slowed down the incorporation of 'dnmps into the nascent rna. we further reasoned that the 'oh group of the 'dntp was the most likely interacting partner of the β'arg , an inference supported by md simulations of s. cerevisiae rnapii . however, the 'oh group is positioned too far from the β'arg when the sugar moiety is in the '-endo conformation (supplementary table ) . we further hypothesized and demonstrated by in silico docking experiments that the 'oh could move to within the hydrogen bond distance of the β'arg if the deoxyribose moiety adopted a '-endo conformation (supplementary fig. to test this hypothesis in crystallo, we solved the x-ray crystal structure of the initially transcribing complexes containing t. thermophilus rnap, dna and -nt rna primer with incoming 'dctp bound at the active site at . Å resolution. the structure displayed a well-resolved electron density of the 'dctp and the β'arg closely approaching the deoxyribose moiety ( fig. c, d, table , supplementary fig. a, supplementary data ) . the 'dctp was observed in the pre-insertion conformation that was unsuitable for catalysis because the α-phosphate was located . 
Å from the 'oh of the rna primer. the electron density was consistent with the interaction between the β'arg and the 'oh group of the deoxyribose in the '-endo conformation, in agreement with the results of in silico docking. interestingly, the density for the metal ion complexed by the β- and γ-phosphates of the 'dctp was weak and the coordination distances were longer than typically observed for mg + in the corresponding position. we modeled this metal ion as a na + rather than mg + , similar to what has been proposed for dna polymerase β . the tl was completely unfolded in the structure of the initially transcribing complex with the 'dctp, in contrast to a partially helical conformation typically observed in the structures of the rnap complexes with non-hydrolysable ntp analogues (supplementary table to test if the unavailability of the 'oh group was indeed responsible for the destabilization of the tl folding, we solved the x-ray crystal structure of the initially transcribing complex of the t. thermophilus rnap with a 'dctp at . Å resolution. the structure displayed a well-resolved density of the 'dctp and the β'arg closely approaching the '-deoxyribose moiety ( fig. e, f, supplementary fig. b, supplementary data ) . the 'dctp was in the pre-insertion conformation that was unsuitable for catalysis because the α-phosphate was located . Å away from the 'oh of the rna primer. the overall pose of the 'dctp was similar to that of cytidine- '-[(α,β)-methyleno]triphosphate (cmpcpp): the '-deoxyribose adopted a 'endo conformation and the 'oh group interacted with the β'arg . however, the tl was completely unfolded, supporting our hypothesis that the unavailability of the 'oh group was alone sufficient to significantly destabilize the folding of the first helical turn of the tl.
overall, the comparative analysis of rnap structures with cmpcpp, 'dctp and 'dctp suggested that the β'arg inhibited the incorporation of 'dntps by interacting with their 'oh group and favoring the '-endo conformation of the deoxyribose moiety. at the same time, the structures did not provide a decisive answer as to why the '-endo conformations of 'dntps were less suitable for incorporation into rna than the '-endo conformations. the x-ray structures and in silico modeling experiments suggested that interactions between the 'oh of the deoxyribose moiety and the β'arg or the β'gln were mutually exclusive. accordingly, the β'arg could inhibit the incorporation of the 'dnmp solely by slowing down the initial steps of the tl folding, by sequestering the 'oh group and preventing its interaction with the β'gln of the tl. to test this hypothesis, we determined the incorporation rate of 'dgmp by the wt rnap (supplementary fig. c) . we found that the kcat for 'dgmp incorporation was only -fold slower than the kcat for gmp incorporation and fold higher than the kcat for 'dgmp incorporation ( table ). these data demonstrated that the sequestration of the 'oh group accounted for no more than a -fold inhibition of the 'dgmp incorporation by the β'arg . the remaining -fold inhibition of the overall -fold inhibitory effect was contributed by some other features of the '-endo binding pose, as discussed below. in this study we performed a systematic analysis of the role of the amino acid residues in the active site of the multi-subunit rnap in selecting ntps over 'dntps. we identified a conserved arg residue, β'arg (e. coli rnap numbering) as the major determinant of the sugar selectivity. the β'arg favored binding of gtp over 'dgtp and selectively inhibited the incorporation of 'dnmps into rna (figs. , , table ). 
the enhancement of ntp binding by the β'arg is consistent with the observation that the β'arg is positioned to hydrogen bond with the 'oh of the ntp substrate analogues in several rnap structures (supplementary table ) and with md simulations of the s. cerevisiae rnapii . however, the existing data fail to explain the inhibition of the 'dntp incorporation by the β'arg . in search of an explanation, we performed in silico docking experiments and solved the x-ray crystal structures of the initially transcribing t. thermophilus rnap with the cognate 'dctp and 'dctp. these experiments revealed that the β'arg interacts with the 'oh group of the 'dntp substrate and favors the '-endo conformation of the deoxyribose ( fig. c, d, supplementary fig. e, f) . in contrast, the ribose of the cognate ntp substrate is stabilized in the '-endo conformation by multiple polar contacts and hydrogen bonds with the active site residues: β'arg , β'asn and β'gln (figs. c, a, b, supplementary fig. a , b, we next considered whether the deformation of the 'dntp substrate, repositioning of the β'arg or both were behind the slow incorporation of the 'dnmps by the wt rnap. a hybrid quantum and molecular mechanics (qm/mm) analysis of nucleotide incorporation by the s. cerevisiae rnapii suggested that repositioning of the β'arg by the 'dntp substrate may increase the activation energy barrier for the nucleotide addition reaction . however, a comparison of the rnap structures with bound cmpcpp, 'dctp and 'dctp revealed very small changes in the conformation of the β'arg (fig. ) . similarly, a survey of the published x-ray and cryoem structures revealed that the β'arg occupies approximately the same volume irrespective of the presence or absence of the active site ligands (supplementary table ). 
accordingly, we reasoned that the preferential selection of the catalytically-inert '-endo conformers of 'dntps and the deformation of the catalytically-labile '-endo conformers of 'dntps by β'arg were likely the major factors behind the slow incorporation of the 'dnmps. however, it remained unclear why the '-endo conformers of the substrates were less suitable for the incorporation than the '-endo conformers. we first explored the possibility that the sequestration of the 'oh group by the β'arg makes it unavailable for the interaction with the β'gln of the tl (fig. c, d, supplementary fig. e, f) , thereby destabilizing the tl-mediated closure of the active site. it is well established that the closure of the active site by two helical turns of the tl accelerates the catalysis of nucleotide incorporation by ~ -fold , , , . noteworthy, the tl is partially folded in most structures with ribonucleotide substrate analogues (fig. a, table ) - , , yet was completely unfolded in the structures we obtained with either a 'dctp ( fig. c, d) or a 'dctp ( fig. e, f) . given that the 'dctp was in the conventional '-endo conformation, the latter result suggested that the unavailability of the 'oh group was sufficient to significantly impair the folding of the tl and slow down the catalysis by the t. thermophilus rnap in crystallo. to quantitatively estimate the contribution of the 'oh interactions to the catalysis, we determined the rate of the 'dgmp incorporation by e. coli rnap. we found that the rate of the 'dgmp incorporation was -fold slower than the rate of the gmp incorporation, but -fold faster than that of the 'dgmp ( table , supplementary fig. c) . these results suggested that the sequestration of the 'oh group by the β'arg could account for no more than a -fold out of its -fold overall inhibitory effect. notably, the t. thermophilus rnap also incorporates 'dnmps faster than 'dnmps but discriminates against both types of substrates ~ -fold stronger than the e. 
coli rnap . similarly, the effects of the β'q m substitution were inconsistent with the idea that the 'oh capture by β'arg could alone account for the slow rate of the 'dnmp incorporation. if that were true, the β'q m variant should be relatively insensitive to the absence of the 'oh group. however, the opposite was true: β'q m was only twofold slower in incorporating gmp than the wt rnap, but tenfold slower in incorporating 'dgmp. we propose that the β'gln competes with the β'arg for the 'oh group of the 'dntp substrate: the β'arg favors the catalytically-inert '-endo conformer (fig. c, d, supplementary fig. e, f) , whereas the β'gln favors the catalytically-labile '-endo conformer (supplementary fig. c, d) . as a result, the β'gln is more important during the incorporation of 'dnmps than nmps. since the tl folding can account only for a fraction of the inhibitory effect, what other factors make the '-endo conformers of 'dntps catalytically inert? it is noteworthy that the sugars of the attacking and substrate nucleotides adopt the '-endo conformation in all rnaps and dnaps during the nucleotide incorporation . in other words, even the ' ends of dna primers adopt the '-endo conformation to catalyze the incorporation of the 'dnmps into the dna. apparently, the a-form geometry is much better suited for the catalysis of the nucleotide condensation than the b-form geometry , . the better accessibility of the nucleophilic 'oh group of the attacking nucleotide is likely the primary reason. the substrate then adopts the 'endo conformation to match the overall geometry of the a-form duplex and to avoid clashes with the attacking nucleotide . in general terms, the inertness of the '-endo conformation of the 'dntps can be partially attributed to the differences in the conformations of the triphosphate moieties that in turn originate from the differences in the bond angles at c' of the sugar between the '-and '-endo conformers (fig. b) . 
we term this inhibitory component as c' -geometry-dependent effects. however, in our view, it is impossible to further refine this hypothesis at present: (i) the resolutions of the structures are not very high (≥ . Å, supplementary table noteworthy, the conserved arg is one of only five catalytic residues that are conserved in the superfamily of the so called "two-β-barrel" rnaps , that includes the multi-subunit rnaps and very distantly related cellular rna-dependent rnaps (rdrps) involved in the rna interference ( supplementary fig. ) . accordingly, the common ancestor of the two-β-barrel rnaps could conceivably discriminate against 'dntps and therefore likely evolved in the presence of both ntps and 'dntps. this inference lends credence to the hypothesis that proteins evolved in primordial lifeforms that already possessed both rna and dna , . viral rdrps (members of the so-called "right-hand" superfamily of nucleic acid polymerases) are not homologous to multi-subunit rnaps but share some elements of their sugar selection strategies. it appears that the 'oh of the substrate ntp facilitates the active site closure in both classes of enzymes. in multi-subunit rnaps, 'oh facilitates the tl folding via the interaction with β'gln , whereas in viral rdrps, 'oh initiates the closure by sterically clashing with asp (poliovirus rdrp numbering) . in both classes of enzymes, 'dntps adopt a '-endo pose wherein the 'oh is misplaced and cannot readily facilitate the closure of the active site, explaining low reactivities of 'dntps. however, 'dntps are better substrates than 'dntps also for viral rdrps suggesting that the low reactivity of the '-endo 'dntps additionally relies on c' -geometry-dependent effects (see above), which lead to a suboptimal conformation of the triphosphate moiety, a suboptimal geometry of the transition state, or both. multi-subunit rnaps and viral rdrps converged on using the '-endo binding pose to discriminate against 'dntps. 
in doing so these enzymes accentuate the intrinsic preferences of 'dntps to retain the inert '-endo conformation upon binding to the a-form template in the non-enzymatic system in summary, our data show that a universally conserved arg residue plays a central role in selecting ntps over 'dntps by the multi-subunit rnaps. when ntp binds in the rnap active site, its ribose adopts the '-endo conformation that positions the 'oh group to interact with the universally conserved gln residue of the tl domain and promotes the closure of the active site, whereas the triphosphate moiety can undergo rapid isomerization into the insertion conformation leading to efficient catalysis. the interaction of the conserved arg residue with the 'oh of the ntps selectively enhances their binding more than -fold and renders rnap saturated with ntps in the physiological concentration range. in contrast, the interaction of the conserved arg with the 'oh of the 'dntp substrates shapes their deoxyribose moiety into the catalytically inert '-endo conformation where the 'oh cannot promote closure of the active site and substrate incorporation is additionally inhibited by the unfavorable geometry of the triphosphate moiety. the deformative action of the conserved arg on the 'dntp substrates is an elegant example of active selection against a substrate that is a substructure of the correct substrate. dna and rna oligonucleotides were purchased from eurofins genomics gmbh (ebersberg, germany) and iba biotech (göttingen, germany). dna oligonucleotides and rna primers are listed in supplementary table our initial docking trials revealed that docking of nucleoside monophosphates produced the most robust and quantitatively interpretable results. thus, the docking algorithm failed to recover templated poses for nucleosides without phosphate groups. 
the docking algorithm also failed to position the triphosphate moiety to coordinate metal ion number two and instead attempted to maximize its contacts with the protein. as a result, the recovered conformations of the triphosphate moieties differed from those observed in crystal structures. considering the high impact of the triphosphate moiety on the ligand binding score and our assessment that the triphosphate moiety was docked incorrectly, we opted to limit the systematic investigation of the interaction between rnap and the sugar moieties of nucleosides to docking nucleoside monophosphates. we first docked '-endo cmp, '-endo 'dcmp and '-endo 'dcmp to the rnap fragment . the docking algorithm recovered high-scoring poses (- . ± . kcal/mol) for cmp in out of runs, lower-scoring poses (- . ± . kcal/mol) for '-endo 'dcmp in out of runs and 'endo 'dcmp in out of runs. the β'arg side chain was kept flexible in the latter case because our manual assessment suggested that a sub-angstrom repositioning of β'arg would be needed to accommodate the '-endo deoxyribose. we then fixed the β'arg table ). these in silico experiments suggested that the semi-closed active site can bind the '-endo and '-endo 'dcmp with similar affinities. the 'oh of the '-endo 'dcmp was positioned to interact with β'gln and β'asn , whereas the 'oh of the 'endo 'dcmp was positioned to interact with β'arg and β'asn ( supplementary fig. ) . we further inferred that the open active site should have a preference for the '-endo 'dcmp because β'gln is not positioned to interact with the 'oh of the substrate in the open active site. well in line with our prediction, the x-ray diffraction data for the crystals of the rnap- 'dctp complex were consistent with the '-endo conformation of the 'dctp bound in the open active site (fig. b, c, supplementary fig. a) .
we further verified the binding preferences of the open active site in silico by removing the 'dctp from the model and docking alternative conformers of the 'dcmp. the docking algorithm recovered higher-scoring poses (- . ± . kcal/mol) for the '-endo 'dcmp in out of runs and lower-scoring poses (- . ± . kcal/mol) for '-endo 'dcmp in only out of runs (supplementary table ). the non-template dna strand ( '-tataatgggagctgtcacggatgcagg- ') was annealed to the template dna strand ( '-cctgcatccgtgagtgcagcca- ') in µl of mm tris-hcl (ph . ), mm nacl, and mm edta to the final concentration of mm. the solution was heated at °c for min and then gradually cooled to °c. the crystals of the rnap and promoter dna complex were prepared as described previously the x-ray datasets were collected at the macromolecular diffraction at the cornell high energy synchrotron source (macchess) f beamline (cornell university, ithaca, ny) and structures were determined as previously described , using the following crystallographic software: the reaction products were modelled as sums of independent contributions by the fast and slow fractions of rnap using numerical integration capabilities of the kintek explorer software. contributions of each fraction were modeled as scheme . upper and lower bounds of the parameters were calculated at a % increase in chi . table and supplementary tables - . error bars are ranges of duplicate measurements or sds of the best-fit parameters, whichever values were larger. a tecs were assembled using the scaffold shown above the gel panels and chased with µm atp, ctp, utp and gtp or 'dgtp for min at °c. the positions of gmps in the resolved stretches of the transcribed sequence are marked along the right edge of the gel panels. -bit grayscale scans were normalized using max pixel counts within each gel panel and pseudocolored using rgb palette on the right. 
b lane profiles of transcription in all-ntps and 'dgtp chases by the wild-type and β'r k rnaps quantified from gels in (a). traces were manually aligned along the x-axis and scaled along the y-axis using several sequence positions as references. magenta numbers are interatomic distances in Å. panels (a) and (b) were prepared using pdb id coli rnaps. a tecs were assembled using the scaffold shown above the gel panels and chased with µm atp, ctp, utp and gtp or 'dgtp for min at °c. the positions of gmps in the resolved stretches of the transcribed sequence are marked along the right edge of the gel panels. -bit grayscale scans were normalized using max pixel counts within each gel panel and pseudocolored using rgb palette on the right. b lane profiles of transcription in all-ntps and 'dgtp chases by the wild-type and β'r k rnaps quantified from gels in (a). traces were manually aligned along the x-axis and scaled along the y-axis using several sequence positions as references. supplementary fig. : utilization of 'dutp and 'datp during the processive transcript elongation by the wt and variant rnaps. a tecs were assembled using the scaffold shown above the gel panels and chased with µm ctp, gtp, utp, atp (all-ntps-chase), or ctp, gtp, atp, 'dutp ( 'dutp-chase), or ctp, gtp, utp, 'datp ( 'datp-chase) for min at °c. the positions of umps or amps in resolved stretches of the transcribed sequence are marked along the right edge of the gel panels. -bit grayscale scans were normalized using max pixel counts within each gel panel and pseudocolored using rgb palette on the right. b lane profiles of transcription by the wt (cyan) and β'r k (magenta) rnaps quantified from gels in (a). traces were manually aligned along the x-axis and scaled along the y-axis using several sequence positions as references. supplementary fig. : lane profiles of transcription in all-ntps and 'dgtp chases quantified from gels in main text figure . fig. 
: lane profiles of transcription in all-ntps, 'dutp and 'datp chases quantified from gels shown in supplementary figure . fig. . utilization of 'dctp during the processive transcript elongation by the wt and variant rnaps. a tecs were assembled using the scaffold shown above gel panels and chased with µm gtp, utp, atp and ctp (all-ntps chase) or 'dctp ( 'dctp chase) for min at °c. b lane profiles of transcription by the wt (cyan) and β'r k (magenta) rnaps quantified from gels supplementary table . dna oligonucleotides and rna primers used in this study. we used time-resolved single nucleotide addition experiments to estimate the equilibrium constant for gtp, 'dgtp and 'dgtp binding and dissociation in the active site of rnap and to determine the first order rate constant (also known as the turnover number) for the incorporation of gmp, 'dgmp and 'dgmp into the nascent rna. the tecs were assembled on synthetic nucleic acid scaffolds and contained the fully complementary transcription bubble flanked by -nucleotide dna duplexes upstream and downstream (supplementary fig. a) . the annealing region of a -nucleotide rna primer was initially nucleotides, permitting the tec extended by one nucleotide to adopt the post-and pre-translocated states, but disfavoring backtracking. the rna primer was ' labeled with the infrared fluorophore atto to monitor the rna extension by denaturing page. to facilitate the rapid acquisition of kinetic data (see below), the template dna strand contained a fluorescent base analogue -methyl-isoxanthopterin ( -mi) eight nucleotides upstream from the rna ' end. -mi allowed the monitoring of rnap translocation along the dna following nucleotide incorporation (supplementary fig. a) . we first measured concentration series of gmp and 'dgmp incorporation by the wild-type and altered rnaps using a time-resolved fluorescence assay performed in a stopped flow instrument (supplementary figs. - ) . 
we used the translocation assay because it allowed the rapid acquisition of concentration series, whereas measurements of concentration series by monitoring rna extension in a rapid chemical quench-flow setup would be considerably more laborious. we then performed a preliminary data analysis by fitting each fluorescence timetrace to a single exponential function followed by fitting the resulting individual rates to a michaelis equation. the inferred kcat and km generally supported all major conclusions reported in this study. however, we proceeded to expand the datasets by including additional data and developed more elaborate analysis routines. the first reason to invoke a more elaborate analysis was the observation that most fluorescence time traces in our datasets fitted poorly to the single exponential function. in fact, the underlying physics of a single turnover enzymatic reaction suggests that individual timetraces in the concentration series should, in a general case, be poorly described by a single exponential function (see below). the second reason to invoke a more elaborate analysis was the concern that the michaelis constant is a lumped constant that contains a sum of the catalytic and substrate dissociation rates in the numerator and the substrate binding rate in the denominator, whereas the equilibrium binding constants are the ratios of the substrate dissociation and binding rates. accordingly, we were concerned that comparing the michaelis constants of reactions could potentially lead to erroneous conclusions in the cases where the km was markedly different from the kd. for the sake of understanding our analysis workflow, it is important to acknowledge that each reaction timetrace in the concentration series describes a single turnover process: we designed the transcribed sequence so that only a single gmp (or 'dgmp or 'dgmp) became incorporated upon the addition of gtp (or 'dgtp or 'dgtp). 
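The distinction between the Michaelis constant and the equilibrium binding constant drawn above can be illustrated numerically. The sketch below assumes the simplest bind-then-react scheme (reversible substrate binding followed by irreversible chemistry); the function names and all rate constants are hypothetical illustration values, not parameters measured in this study.

```python
def michaelis_constant(k_on, k_off, k_cat):
    """Km for E + S <-> ES -> EP: (k_off + k_cat) / k_on."""
    return (k_off + k_cat) / k_on


def dissociation_constant(k_on, k_off):
    """Equilibrium dissociation constant Kd = k_off / k_on."""
    return k_off / k_on


def km_to_kd_ratio(k_on, k_off, k_cat):
    """Ratio Km / Kd, which reduces algebraically to 1 + k_cat / k_off."""
    return michaelis_constant(k_on, k_off, k_cat) / dissociation_constant(k_on, k_off)


# Hypothetical rate constants (k_on in uM^-1 s^-1, k_off and k_cat in s^-1).
slow_chemistry = km_to_kd_ratio(k_on=1.0, k_off=100.0, k_cat=1.0)  # k_off >> k_cat
fast_chemistry = km_to_kd_ratio(k_on=1.0, k_off=1.0, k_cat=100.0)  # k_cat >> k_off

if __name__ == "__main__":
    print(f"Km/Kd when k_off >> k_cat: {slow_chemistry:.2f}")
    print(f"Km/Kd when k_cat >> k_off: {fast_chemistry:.2f}")
```

Under these assumed values, Km is a fair proxy for Kd only when substrate dissociation is much faster than catalysis; when catalysis dominates, Km overestimates Kd by roughly the ratio k_cat / k_off.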
the ease of obtaining single turnover timetraces is a significant analytical advantage natively associated with template-dependent nucleic acid polymerases. it is often possible to infer more parameters from concentration series of single-turnover reactions than from concentration series of classic multi-turnover enzymatic reactions. next, most timetraces in the concentration series are not expected to fit a single exponential function even in the case of the simple signal, a -nt extended nascent rna (rna in this study). the enzymatic reaction is minimally a two-step sequential reaction that consists of the that was employed by prajapati et al. next, kinetic heterogeneity in the tec preparations introduced an additional level of complexity to the fitting of the data. we reported previously that a vast majority of tecs contain - % of a slow fraction that manifests itself as a slow phase in reaction timetraces of both the fluorescence signal (stopped-flow assay) and the extended rna (quench-flow assay) , . in the case of fast reactions measured in this study (gtp, 'dgtp data), the rates of the fast and slow phases differed approximately tenfold and therefore the phases could be precisely resolved (see a dedicated section below). importantly, the fast phase of the reaction constituted - % of the signal amplitude ( table , supplementary table ) . accordingly, we considered the activity of the fast fraction as a representative measure of the rnap activity in each experiment and disregarded the minor slow fraction when comparing the wild-type and variant rnaps (fig. ) . in the case of slow reactions ( 'dgtp data), the fast and the slow phases were not well separated ( -fold difference in rates). when fitting data to equation , each timetrace was described by a stretched exponential function (an empirical function that is often used to describe heterogeneous systems ).
at the same time, the exponent followed the hyperbolic dependence on the 'dgtp concentration ( supplementary figs. c, ) . such fits described the data well and gave three parameters: a reaction rate constant (k), a stretching parameter (β) and the michaelis constant (km). when a stretched exponential function is applied to a process where the reactivity changes over time (or distance), the rate constant parameter (k) corresponds to the initial reaction rate constant. in our case, the stretched exponential fit potentially absorbed both temporal and structural heterogeneity as well as the deviations from the single exponential behavior caused by the sequential nature of the enzymatic reaction (see above). for this reason, the rate parameter (k) did not have an easily interpretable meaning. to circumvent this problem, we calculated the median reaction time as (median reaction time) = (ln(2))^(1/β) / k, then calculated the median reaction rate as (median reaction rate) = ln(2) / (median reaction time), and used the median reaction rate as a measure when comparing the wild-type and variant rnaps (fig. , supplementary table ) . next, fitting the data to equation gives the km rather than the kd. however, it is rather certain that koff >> kcat for all 'dgmp incorporation reactions ( table , also see scenario below). if so, km approximately equals kd for each 'dgmp incorporation reaction. accordingly, we used km in place of kd for 'dgmp addition reactions when comparing substrates and rnaps (fig. ) . finally, we emphasize that the 'dgmp incorporation data by the wild-type and the β'r k rnaps were fit to both scheme and equation leading to affinities for 'dgtp that were indistinguishable within the margin of the experimental uncertainty (compare 'dgtp data in table and supplementary table ).
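The median-based summary of a stretched exponential fit described above can be sketched as follows. This is a minimal illustration that assumes the unreacted fraction decays as exp(-(k*t)^β); the exact form of the fitted equation is not reproduced in this text, and the function names and parameter values below are hypothetical.

```python
import math


def unreacted_fraction(t, k, beta):
    """Assumed stretched exponential decay: exp(-(k * t) ** beta)."""
    return math.exp(-((k * t) ** beta))


def median_reaction_time(k, beta):
    """Time at which half of the complexes have reacted: (ln 2)^(1/beta) / k."""
    return math.log(2) ** (1.0 / beta) / k


def median_reaction_rate(k, beta):
    """Rate of a single exponential that would have the same median time."""
    return math.log(2) / median_reaction_time(k, beta)


if __name__ == "__main__":
    k, beta = 0.5, 0.7  # hypothetical fit parameters (k in s^-1)
    t_half = median_reaction_time(k, beta)
    print(f"median reaction time: {t_half:.3f} s")
    print(f"median reaction rate: {median_reaction_rate(k, beta):.3f} s^-1")
    print(f"unreacted fraction at the median time: {unreacted_fraction(t_half, k, beta):.3f}")
```

For β = 1 the stretched exponential reduces to a single exponential and the median reaction rate equals k, which is why the median rate is a convenient common scale for comparing fits with different β.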
the catalytic activity of the wild-type rnap towards 'dgtp inferred by fitting the data to equation was, as expected, in between the catalytic activities of the fast and slow fraction inferred by fitting the data to scheme . accordingly, we argue that the employment of different analysis routines for gtp and 'dgtp is of little concern for the main inferences drawn in this study. we have previously shown that the nucleotide addition and the subsequent translocation along the dna by the wild-type e. coli rnap occur with similar rates at saturating concentrations of cognate ntps . as a result, (i) the translocation timetraces are delayed by a few milliseconds relative to the nucleotide addition timecurves and (ii) the translocation timetraces at saturating concentrations of cognate ntp substrates are not well described by a single exponential function because both nucleotide addition and translocation are partially rate limiting. in this study, translocation rates were tangential to the main line of investigation, but they were necessary parameters during the global fitting of the fluorescence timetraces and gmp incorporation timecurves to scheme . at the same time, the translocation rates are much faster than the 'dgmp incorporation rates and could be completely disregarded during the analysis of the 'dgtp concentration series by fitting the data to scheme or equation . supplementary table should not be equated with the forward translocation rates. thus, we modeled translocation as an irreversible transition in scheme . as a result, the inferred translocation rates are the rates of the system approaching the translocation equilibrium after the nucleotide incorporation rather than the forward translocation rates. somewhat counterintuitively, but in accordance with the rules of formal kinetics, the inferred equilibration rate equals the sum of the forward and the backward translocation rates.
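The statement that the equilibration rate is the sum of the forward and backward rates follows from standard relaxation kinetics for a reversible two-state transition. The derivation below is a generic sketch; the symbols k_f, k_b, [pre] and [post] are notation introduced here for illustration and are not taken from the schemes of this study.

```latex
% Reversible transition between pre- and post-translocated states,
% with [pre] + [post] = 1 and [post](0) = 0:
\frac{d[\mathrm{post}]}{dt} = k_f\,[\mathrm{pre}] - k_b\,[\mathrm{post}]
                            = k_f - (k_f + k_b)\,[\mathrm{post}]

% Solution: a single exponential approach to equilibrium
[\mathrm{post}](t) = \frac{k_f}{k_f + k_b}\left(1 - e^{-(k_f + k_b)\,t}\right)

% Observed (equilibration) rate and completeness of translocation:
k_{\mathrm{obs}} = k_f + k_b, \qquad
[\mathrm{post}]_{\infty} = \frac{k_f}{k_f + k_b}
```

The observed rate alone therefore cannot separate k_f from k_b; the completeness of translocation supplies the second relation needed to split them.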
it was possible to further split the equilibration rate into the forward and backward translocation rates by assessing the completeness of the translocation, as we did in our previous studies . however, we refrained from doing so in this study because the translocation process was tangential to the main line of the investigation. fig. a) . as a result, both tec and tec ntp are detected as tec in the edta quenched samples because nearly % of the tec ntp is converted into tec after the addition of edta, and practically no ntp dissociates back into the solution (kcat >> koff). the above situation corresponds to 'dgmp addition by the wild-type and variant rnaps. fitting the 'dgtp concentration series to a semi-empirical equation allowed the estimation of kcat and km 'dgtp ≈ kd 'dgtp for the wild-type, β'r k, β'm a, β'q m and β'n s rnaps table ). for the β'r k and the wild-type rnap we additionally measured the edta quench curve, fitted the data globally to scheme and inferred the lower bounds of kon and koff. in addition to kd (table , supplementary figs. c, a) . . as always, kcat and km can be inferred from the ntp concentration series, but neither kcat/km ≈ kon (as is in scenario ) nor km ≈ kd (as is in scenario ). in contrast, the global fit of the ntp concentration series and the edta quench data has the best resolving power in scenario : kcat, kon, koff, and ktra (in some cases) can be inferred from the data though the precision of the individual estimates varies greatly. the above situation corresponds to the gmp addition by the wild-type and variant rnaps (supplementary table , supplementary figs. b, ) and the 'dgmp addition by the wildtype rnap (table , supplementary fig. c) . only the wild-type rnap data allowed for precise estimates of all parameters of scheme . in the case of the β'r k and β'q m for the comparison of the rnap's capabilities to bind and utilize various substrates (fig. ) . handling of the slow fraction during fitting to scheme . 
the timecourses of the nmp incorporation by the wild-type e. coli tec typically display a distinctive slow phase that represents - % of the overall signal amplitude and features the rate of . - s - . in contrast, the major, fast phase of the reaction is approximately tenfold faster at saturating [ntp] ( - s - for gtp). the slow phase possibly represents an inactive tec in equilibrium with the active tec, a fraction of the tec that slowly reacts with the ntp substrate or a combination of both. during the fitting of the data using the kintek explorer software, the slow phase can be modeled in two ways (supplementary note fig. b) . the first option is to invoke a reversible equilibrium between the active and inactive tec and to introduce a virtual equilibration step prior to mixing of the tec with the ntps. we term this approach as the reversible inactivation model. the second option is to explicitly model the tec preparation as two fractions that do not interconvert but incorporate nmp with different rates. the fractions of the slow and fast tec are then allowed to vary as parameters during the fit. we term this approach as the nonequilibrium heterogeneity model. the two models are largely indistinguishable if measurements are carried out at a single [ntp] and both models require two parameters to describe the slow phase: inactivation and recovery rates in the first case, and the slow fraction and its reaction rate in the second case ( supplementary note fig. b) . however, the response of the slow phase to the decrease in the [ntp] differs between these two models. the reversible inactivation model predicts that the rate of the slow phase is independent of [ntp] and the slow phase is largely abolished as the [ntp] decreases. in contrast, the non-equilibrium heterogeneity model predicts that the rate of the slow phase decreases in unison with the rate of the fast phase as [ntp] decreases (both follow a hyperbolic dependence on [ntp]). 
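The prediction of the non-equilibrium heterogeneity model, that the fast and slow phases slow down in unison as [NTP] decreases, can be sketched as follows. This is a deliberately simplified illustration: each fraction is treated as a single exponential with a hyperbolic rate dependence, and both fractions are assumed to share one half-saturation constant. The actual fits in this study were done by numerical integration of the full scheme, and every name and number below is hypothetical.

```python
import math


def hyperbolic_rate(k_max, k_half, ntp):
    """Observed rate at a given substrate concentration: k_max * [S] / (K + [S])."""
    return k_max * ntp / (k_half + ntp)


def two_fraction_timecourse(t, ntp, slow_fraction, k_fast_max, k_slow_max, k_half):
    """Fraction of extended RNA for two non-interconverting TEC populations."""
    k_fast = hyperbolic_rate(k_fast_max, k_half, ntp)
    k_slow = hyperbolic_rate(k_slow_max, k_half, ntp)
    fast_part = (1.0 - slow_fraction) * (1.0 - math.exp(-k_fast * t))
    slow_part = slow_fraction * (1.0 - math.exp(-k_slow * t))
    return fast_part + slow_part


if __name__ == "__main__":
    # Hypothetical parameters: rates in s^-1, concentrations in uM.
    params = dict(slow_fraction=0.15, k_fast_max=50.0, k_slow_max=5.0, k_half=10.0)
    for ntp in (1.0, 10.0, 1000.0):
        k_fast = hyperbolic_rate(params["k_fast_max"], params["k_half"], ntp)
        k_slow = hyperbolic_rate(params["k_slow_max"], params["k_half"], ntp)
        extended = two_fraction_timecourse(0.1, ntp, **params)
        print(f"[NTP]={ntp:7.1f}  k_fast={k_fast:6.2f}  k_slow={k_slow:5.2f}  "
              f"extended at 0.1 s={extended:.3f}")
```

In this sketch the ratio of the two observed rates stays constant across concentrations, whereas the reversible inactivation model described above would instead keep the slow-phase rate fixed and shrink its amplitude at low [NTP].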
in this study we analyzed all gmp and 'dgmp incorporation datasets using the non-equilibrium heterogeneity approach to model the slow phase, because some datasets (e.g. β'q m, supplementary fig. ) could not be adequately fit by the previously employed reversible inactivation model , , . fig. : kinetic analyses of the data. a simulation and graphic interpretation of the edta and hcl quench curves at saturating substrate concentrations and different values of koff. b simulation of concentration series of a biphasic reaction using the reversible inactivation (left) and non-equilibrium catalytic heterogeneity (right) models. interactive d versions of the structural figures (webgl in browser): supplementary data : interactive fig. a, b supplementary data : interactive fig. c, d supplementary data : interactive fig.
e, f supplementary data : interactive supplementary fig. c, d supplementary data : interactive supplementary fig. e, f supplementary data : interactive supplementary fig we thank irina artsimovitch for critically reading the manuscript, the staff at the macchess for support of crystallographic data collection, anssi m. malinen for constructing plasmids, matti turtola for his contribution to the development of the edta quench method. the reaction products were modelled as sums of independent contributions by the fast and slow fractions of rnap using numerical integration capabilities of the kintek explorer software. contributions of each fraction were modeled as scheme . upper and lower bounds of the parameters were calculated at a % increase in chi . key: cord- -lqzgz p authors: gallo, juan e.; ochoa, juan e.; warren, helen r.; misas, elizabeth; correa, monica m.; gallo-villegas, jaime a.; bedoya, gabriel; aristizábal, dagnóvar; mcewen, juan g.; caulfield, mark j.; parati, gianfranco; clay, oliver k. title: hypertension and the roles of the p . risk locus: classic findings and new association data date: - - journal: nan doi: . /j.ijchy. . sha: doc_id: cord_uid: lqzgz p background the band p . contains an established genomic risk zone for cardiovascular disease (cvd). since the initial wellcome trust case control consortium study (wtccc), the increased cvd risk associated with p . has been confirmed by multiple studies in different continents. however, many years later there was still no confirmed report of a corresponding association of p . with hypertension, a major cv risk factor, nor with blood pressure (bp). theory in this contribution, we review the bipartite haplotype structure of the p . risk locus: one block is devoid of protein-coding genes but contains the lead cvd risk snps, while the other block contains the first exon and regulatory dna of the gene for the cell cycle inhibitor p . 
we consider how findings from molecular biology offer possibilities of an involvement of p in hypertension etiology, with expression of the p gene modulated by genetic variation from within the p . risk locus. results we present original results from a colombian study revealing moderate but persistent association signals for bp and hypertension within the classic p . cvd risk locus. these snps are mostly confined to a ‘hypertension island’ that spans less than kb and coincides with the p haplotype block. we find confirmation in data originating from much larger, recent european bp studies, albeit with opposite effect directions. conclusion although more work will be needed to elucidate possible mechanisms, previous findings and new data prompt reconsidering the question of how variation in p . might influence hypertension components of cardiovascular risk. legend to graphical abstract: schematic depiction of the main observation presented and discussed in this article. two adjacent haplotype blocks characterize the p . cardiovascular risk locus: left, the block or island containing the first part of the p gene and its well-characterized promoter, in which we observed clearly elevated associations (red) with blood pressure (dbp, sbp) and/or hypertension in a colombian and a european study sample, and right, the block containing the lead cvd risk snps. [graphical abstract labels: hypertension and bp association; 'hypertension island' * (haplotype block < kb); lead cvd risk snps (haplotype block < kb); significance of association; strongest cvd association; current threshold for genome-wide reporting ( p = x - ); fair cvd association; contains start and promoter of p gene; contains no protein-coding genes] it had long been a widely held belief that common genetic variation in the established cardiovascular risk locus of cytogenetic band p .
, as discovered and delimited via corresponding genetic associations in [ , , ] , is not reproducibly associated with high blood pressure or hypertension, a prime risk factor for cardiovascular disease [ , ] . the pleiotropic nature of the ≈ -kb classic risk locus of p . , the strongest known single contributor in the human genome to genetic risk of cardiovascular disease [ , , ] , viewed together with the complexity of the genetic and molecular regulation of hypertension, prompted us to reopen the question if this region might, after all, be associated with a contribution of hypertension to the increased cardiovascular risk that characterizes the locus. in the theory section, we briefly depict the classic two-block structure of the region in the light of current knowledge, and review some findings and hypotheses that could admit a step of a hypertension etiology being modulated at this locus. in the results we first present results from an original, small colombian association study focusing on p . variation. we then present largely corroborating results from much larger, recently published european studies, which we identified in a second step. taken together, the evidence accumulating so far suggests that common genetic variation in a well-described block (< kb) of the p . risk region -and, more specifically, in the regulatory dna of the p gene it harbors -may play a role in promoting hypertension, for example via vascular modifications in resistance arteries (arterioles). demographic and clinical information, including selected cardiovascular risk factors, was collected for all participants in a study conducted in medellín, colombia (see supplementary data). systolic (s) and diastolic (d) blood pressure (bp) levels were defined by the average of two conventional auscultatory bp measurements through a mercury manometer according to an approach in line with current european guidelines [ ] . 
presence of hypertension was identified based on physician's diagnosis, prescription of antihypertensive drugs, average office sbp ≥ and/or dbp ≥ mmhg, or any combination of these possibilities. a total of participants were genotyped. in view of this small sample size and in line with traditional case-control methodology [ ] , [ , chapter ] , the highest and lowest dbp tertiles of the population were slightly overrepresented in the genotyped individuals. dna was extracted from white blood cells following standard salting-out procedures and genotyped by lgc group, uk with kasp tm technology (https://www. biosearchtech.com). all snps genotyped were biallelic in our study sample. snp rs is triallelic in some other kg populations (not clm), its rare third allele frequency being . in amr, . in eur and . in afr. statistical analyses were performed considering bp either as a continuous or a categorical variable (above or below the current [ ] threshold for hypertension diagnosis). effect sizes were defined as differences of mean values for continuous variables (e.g., sbp, dbp), and as odds ratios for dichotomous variables (e.g., presence or absence of hypertension). variables considered as possible covariates in the statistical analyses included gender, age, smoking and body mass index (bmi). to calculate association p-values and effect sizes between snp states and phenotypic variables we used the r package snpassoc [ , ] . models fitted included codominant, dominant, recessive, overdominant and log-additive models. participants signed an informed consent formulated for this genotype-phenotype association study. the study was approved by the ethical committee of the corporación para investigaciones biológicas, medellín, colombia. all procedures were performed in agreement with the declaration of helsinki. additional material and methods are given in supplementary material s . 
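The five inheritance models named above differ only in how the three genotypes of a biallelic SNP are coded before regression. The sketch below shows the conventional coding, taking the major-allele homozygote as the reference; it is an illustration of the general technique, not a reproduction of the internals of the SNPassoc package, and the function name is hypothetical.

```python
def encode_genotype(minor_allele_count, model):
    """Code a genotype (0, 1 or 2 copies of the minor allele) under a genetic model.

    Returns a single number for one-parameter models, or a pair of indicator
    variables (heterozygote, minor homozygote) for the codominant model.
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    one_parameter_codes = {
        "dominant": (0, 1, 1),      # any copy of the minor allele vs none
        "recessive": (0, 0, 1),     # two copies vs fewer
        "overdominant": (0, 1, 0),  # heterozygote vs both homozygotes
        "log-additive": (0, 1, 2),  # per-allele (trend) effect
    }
    if model == "codominant":
        return (int(minor_allele_count == 1), int(minor_allele_count == 2))
    if model not in one_parameter_codes:
        raise ValueError(f"unknown model: {model}")
    return one_parameter_codes[model][minor_allele_count]


if __name__ == "__main__":
    models = ("codominant", "dominant", "recessive", "overdominant", "log-additive")
    for model in models:
        codes = [encode_genotype(count, model) for count in (0, 1, 2)]
        print(f"{model:13s} {codes}")
```

Under the log-additive coding, the fitted coefficient corresponds to a per-allele effect: a mean difference for continuous traits such as sbp or dbp, and a log odds ratio for dichotomous traits such as hypertension status.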
methods of european studies were as described in the publications presenting those studies [ , , ] . the originally described risk locus of p . has a bipartite or two-segment structure, in which each of the two mutually adjacent segments is dominated by a distinct haplotype block. although the haplotypes (i.e., states) of the two blocks are correlated, via a linkage disequilibrium originally estimated as around . (for average |d |) [ ] , their frequencies are different. the minor allele frequency (maf) plots shown in figure depict individual snp- and haplotype-level variation in the locus. two of the superpopulations that were sequenced and genotyped by the genomes project ( kg; [ ]) are represented: amr (populations from the americas having a native american ancestry component) and eur (populations of european ancestry). the 'left-hand' block of the risk region, colored green in figure , is composed of essentially three haplotype classes, giving one low and two higher, similar minor allele frequencies (separated by a green horizontal bar). the 'right-hand' block is dominated by two haplotype classes and has fairly constant allele frequencies close to %; the lead cvd risk snps of the region are at its rightmost fringe, denoted by ochre triangles. the left-hand block includes the upstream regulatory dna and start of the protein-coding gene encoding p (transcribed leftward). the figure shows a marked difference between amr and eur in this block. the figure also shows a matrix, in / haplotype (schema) coding, for snps that are representative of variation within this block, as obtained by genomes' sampling of individuals (i.e., chromosomes) from clm (medellín, colombia, an amr population). where three or more major haplotype classes dominate a region, individual biallelic snps can provide only aggregated resolution of what is happening at the haplotype level [ ] .
here, snp classes s (blue identifiers), s (red) or s provide information on, respectively, the contrast of haplotype class c , c or c versus the remaining two haplotype classes of the haplotype block. the cvd risk locus in p . is a confirmed risk locus not only for cardiovascular disease but also for other diseases, including cancers. the pleiotropies appear to be partly antagonistic, in that the risk allele for one disease can be the protective allele for another disease. for example, the nhgri-ebi gwas catalog (www.ebi.ac.uk/gwas) reports snps of the left block where the protective allele for an age-related condition (e.g., cvd, glaucoma) would be the risk allele for a cancer-related condition (e.g., breast cancer, glioma, pediatric brain tumor, endometriosis; see supplementary material s . ). antagonistic pleiotropies have been noted or proposed before for this locus ( [ ] ; see also [ ] ), with data suggesting a corresponding positive selection in some populations having native american ancestry [ ] . recent literature on the risk locus of p . and its associations has emphasized the enigmatic long noncoding rna (lncrna) gene anril, which has regulatory roles that are still only partly understood, but that can affect expression of protein-coding genes in the risk locus [ ] . earlier molecular biology studies had focused on two protein-coding genes for cyclin-dependent kinase inhibitors, which when activated can arrest or inhibit the cell cycle at the g stage: the gene cdkn a encodes both p arf (p ) and p ink a (p ), and cdkn b encodes p ink b (p ). a third protein-coding gene, mtap, which is located outside the risk locus, is also relevant for understanding the region. mtap encodes a key enzyme in the biosynthetic pathway of polyamines (which plays an important role in the stabilization of atherosclerotic plaques); there is also evidence that mtap may be partly regulated from within the p . cvd risk locus despite the intervening distance [ , , ] . . .
the p gene some common genetic variation in the p haplotype block is in regulatory dna, where it could in principle modulate transcription of the p gene. thus, the common snp rs , in an established promoter of the p gene, borders a tetranucleotide that was critical for c/ebpβ binding and transcription of the p gene in experiments in vitro ( figure ) [ , , ] ; its common variation might also modulate a predicted stem-loop (see supplementary material s . ). the functionally strategic position of snp rs was also noted before, but in the apparently antagonistic context of glioma risk [ ] . the well-studied promoter of the p gene [ ] is a 'battlefield' in which the proto-oncogene c-myc and tgfβ/smad struggle to control the p gene. indeed, control of this gene can be important: where expression of the cell cycle inhibitor and tumor suppressor p is suppressed, cancer risk can rise [ ] . tgfβ is a main regulator of blood vessel development and maintenance, but it acts through several alternative 'arms', or pathways, e.g., via crosstalk. we ask when and where the arm of tgfβ's activities that utilizes p , as part of the tgfβ/smad pathway, might be employed under physiological conditions in a context that would be relevant for the genesis of hypertension. figure shows a rough sketch of elements of three pathways in which tgfβ can play a role and/or modulate hypertension etiologies: the tgfβ/smad pathway with target gene p , in the central lane, and two other pathways that are trans-activatable by 'crosstalk' with tgfβ, respectively mediated by the type angiotensin-ii receptor (at r / classic ras pathway, left) and the epidermal growth factor receptor (egfr, right). the central and rightmost of these three pathways and their crosstalk, for which roles in hypertension etiologies remain to be elucidated in more detail, are examined in this context in supplementary material s . ; refs. [ , , , ] describe experimental results that may be relevant.
in addition to such cues from experiments, an independent line of evidence supporting a likely role for the central pathway comes from large-scale whole-genome association studies and metastudies that can now identify whole networks showing evidence of collective association with blood pressure [ , ] . the results indicate that the tgfβ/smad pathway plays an important role in bp regulation, as it is enriched for bp-associated genes [ ] . it then seems a plausible hypothesis that p , as an important tgfβ/smad-responsive target, could be involved in the modulation of blood pressure. although the complexity and possible sensitivity of such a network can make it difficult to predict how a genetic change at a given locus will affect bp or the risk of developing hypertension, figure already hints that the p context puts us in a locus and a scenario where key players, such as cell cycle inhibitors, are only a step away from etiologically familiar routes to pathogenesis of hypertension or its prevention. we conducted a local, small study in the city of medellín, colombia (n= individuals without overt cvd) to explore possible associations between hypertension and p . ( common snps spanning > kb). one of the motivations for the study was that we knew of no p . studies of a population living in latin america. it was immediately clear from the sample size that even for minor allele frequencies (maf) above % we would find at best nominally significant evidence for associations from this study alone ( . > p ≥ × − ; [ ] ). still, we wished to see what could be achieved by collecting and analyzing local data of modest size and integrating them into the global knowledge base [ , p. ] , [ ] . the screening of p . for associations in the colombian study sample revealed an 'island' spanning consecutive snps ( .
kb), in which sbp and dbp levels as well as the presence of hypertension gave consistent association signals, and the island coincided almost exactly with the 'left' haplotype block of the p . risk locus (figures and ; see theory). effect sizes for the snps in the island were approximately allele-additive. snps in the immediately flanking regions gave only sporadic or no association signals (supplementary material s . ). for instance, despite the small sample size, snp rs in the island achieved p values of × − for hypertension, and × − / × − for sbp/dbp increase, with a per-allele odds ratio of . for hypertension and a mean difference (effect size) of . / . mmhg for sbp/dbp levels, after correcting for sex and age. comparable results were obtained when including smoking and/or bmi as covariates (supplementary material s . and supplementary data). we compared the inferred risk and protector alleles for hypertension or blood pressure at our genotyped snps with the risk and protector alleles of disease phenotypes that had been previously reported at those same snps. risk alleles were defined as those for which mean difference or beta was positive (continuous variables), for which effect size was greater than one (dichotomous variables), or that were explicitly labeled as 'risk allele'. previously published associations listed in the nhgri-ebi gwas database suggested two mutually 'antagonistic' groups (see section . , supplementary material s . , or [ , supplementary information]), the risk alleles for the cvd- or aging-related conditions (e.g., coronary heart disease, or intracranial aneurysm) being the protector alleles for cancer-related conditions (e.g., glioma). in our colombian study and in the p .
island where we observed the best association signals, the inferred risk alleles for hypertension or bp (e.g., a, the major allele, for rs ) were also the previously reported risk alleles for cvd- or aging-related conditions at all snps that we genotyped. although our study focused on blood pressure variables and hypertension, we did consider, and subjected to the same basic phenotype-genotype analysis, several dozen (not all independent) alternative candidates for the phenotypic/outcome variable of primary interest, selected on the basis of their hemodynamic or other relevance to the development of hypertension (listed in supplementary data). however, the bp and hypertension variables consistently gave by far the best support for an association at the snps we genotyped (see supplementary data for the results obtained for rs and rs , representing the two equivalence classes of snps s and s , which are characterized by high linkage disequilibrium). a review of the data from recently published large european studies [ , , ] with detailed look-ups of results in this region of interest, again showed higher dbp association signals in the same island that we had delimited in the colombian study than in its flanking dna, with p-values that almost reached genome-wide significance. thus, in the studies by warren et al. [ ] and by evangelou et al. [ ] (n > , ), the two lowest p-values for dbp, . × − and . × − , were found, respectively, at the 'left' fringe of the island's haplotype block (rs ) and at a snp in the interior of the island that we had not genotyped in the colombian study (rs ). furthermore, in the european blood pressure studies [ , ] genome-wide significance of dbp associations was attained, outside of the classic p . cvd risk locus and its flanking regions, in the next gene mtap (see figure and theory), with a lowest p-value of .
× − for the sentinel snp rs (red asterisk and red horizontal bar at left in figure ; see also the locuszoom plot in supplementary material s . ) . the bp association with the mtap gene had not been noted in previous studies, and indeed the region had not been sufficiently covered in earlier imputation panels. in the earlier study by ehret et al ( [ ] , n > , ), the best p-values observed were within the island, and again for dbp, although they did not reach values below . (see supplementary material, subsection s . ). we were even able to detect concordant results as far back as , in a study screening close to . million snps for bp associations [ ] . the snps in the island again had much lower p-values than those in the lead cvd risk zone and regions flanking the island. thus, the snp that had the lowest p-value for hypertension in our colombian study sample, rs , had similarly the lowest p-value for dbp in the island in ref. [ ] , namely . as its genome-wide meta-analysis p-value corrected for genomic control. a notable difference with respect to the colombian results was the effect direction. indeed, directions of the bp effects in the european studies [ , ] were consistently opposite to those observed in the colombian study at the same snps. in other words, the bp effects were also opposite to the established cvd effects for those snps in the literature. the now established cvd effect directions within the island are the same in european populations, deduced via direct genotyping of island snps, as they are in populations with a latin american or native american ancestry component (e.g., [ , ] ), where they can be inferred from genotyped snps in linkage disequilibrium located close to the island. thus, in the european study samples, but not in the colombian study sample, the inferred protector alleles for hypertension (i.e., earlier in disease etiology) are the known, constant risk alleles for cardiovascular disease or events (i.e., later in the etiology). 
three other european reports we found, in ref. [ ] and in the phenoscanner and roslin geneatlas association databases, had varying significance and independence, but the reported effect directions appear compatible with our findings (supplementary material s . ).
the results from the trans-ethnic investigations we describe here are compatible with the persisting presence, in a south american and in a european population, of associations between bp levels/hypertension and genetic variation specifically in a previously identified haplotype block or 'island' nested within the cardiovascular risk locus of band p . . in the european population, dense coverage of the association screening also allowed the recognition of an association peak for bp in a neighboring gene, mtap, which is located at about kb from the risk locus. according to the data collected so far, the hypertension or bp effect within the cvd risk locus p . would have the opposite direction in the south american compared to the european population, whereas the cvd effect appears to be the same in both populations, according to published reports [ , ] . to our knowledge, however, our study is the first screen of the p . 'island' for hypertension or bp associations in a south american population, and we also know of no dedicated cvd association study with dense coverage in this part of the p . risk zone. it is therefore too early to generalize or extrapolate from this one small study, and we must await confirmation from independent genotype-phenotype screening in future, if possible also in south american populations having other ancestry proportions, such as those of peru or bolivia (see, for example, [ ] ). 
the larger networks in which the p gene is embedded (e.g., the pathways of figure ) are likely to be complex and resilient, and not just genetic background but also environment (e.g., diet) and local epidemiology (e.g., disease prevalences) could influence routing or rebalancing of the networks and thus affect associations. according to theoretical arguments, genotype-phenotype associations that change direction of effect between geographic, population or other environments (e.g., in the form of crossing reaction norms [ , ] ) could actually be favored, because a variant that is inferior in all environments would be rendered unstable and would eventually be eliminated from the gene pool [ , ] , [ , section ] . in line with this view, trans-ethnically significant bp associations can exhibit sizable directional inconsistency rates. for example, among snps genome-wide that were associated with bp traits (sbp, dbp, or pulse pressure) at a high significance of p < − in a european sample (n = , ) in ref. [ , supplementary table ] , of the snps having p < . also in a south asian population, and of the snps having p < . in an african population, had effect directions that were opposite to those in the european population. as a final note, a wider view of physical or regulatory gene interactions may help understand potential roles of the cvd risk locus in modulating blood pressure or hypertension. one line of investigation that could be relevant is represented by hi-c data that capture interactions between different chromosomal regions. thus, hi-c data reported in ref. [ , supplementary table ] show, at least in mesenchymal stem cells, an interaction cluster that includes the classic risk region and other regions from almost all of cytogenetic band p . ( . / . mb; for details see supplementary material s . ). the clustering regions include (or immediately flank) the p . 
interferon gene group, relevant in inflammation, and two genes (gadd g and focad) that have been considered of interest in cardiovascular as well as cancer contexts. of particular note, the cluster also includes the mtap and dmrta genes, which have been recently reported to be associated with dbp [ ] and sbp [ ] , respectively, and are located at a distance on either side of the classic risk region. such findings suggest that a full understanding of the potential role of p and the classic cvd risk locus in modulating hypertension and cvd might require a wider view that takes into account the regulatory interactions among genes distributed across, and possibly beyond, the . -mb cytogenetic band p . .
additional (extended) materials & methods, theory, and results are included in supplementary material sections s , s , and s , respectively. de-identified genotype data and detailed association results of the medellín study are given in a supplementary data file (excel workbook) and described in supplementary material section s .
pomethylation that is disrupted in some cancers [ ] ) illustrate the functional importance of the region around snp rs . bottom: functional p promoter binding sites (boxes), critical subsequences (beige) and attempted experimental replacements [ , ] that compromised normal transcription of p (bottom subsequences).
legend to figure : schematic diagram sketching three well-characterized pathways ( vertical lanes) in which tgfβ plays a role via signaling and/or transactivation/crosstalk and which may act to promote or prevent hypertension. at the top of each lane, preparation steps needed for a master product that is essential for the pathway's activation are summarized. asterisks and daggers indicate gene products in which common snps in or in the gene's vicinity have been reported to be associated with blood pressure or hypertension in ref. 
[ , ] (*; (*) for pathway) or have been reviewed as being associated with bp or hypertension in ref. [ ] ( †). dashed horizontal arrows indicate experimentally observed or hypothesized crosstalk/transactivation between pathways (not just receptors). not shown, for simplicity, is another potentially relevant system, descending alongside the classic ras pathway at left, namely the 'nonclassical' ras, composed primarily of the angii/ang iii-at r pathway and the ace -ang-( - )-at r axis [ ] , which generally counteracts the effects of a stimulated classic ang ii-at r axis as shown in the figure.
[figure residue: snp identifiers (rs), frequencies in kg-clm (%) and kg-gbr (%), snp class (s)]
genome-wide association study of , cases of seven common diseases and , shared controls genomewide association analysis of coronary artery disease a common allele on chromosome associated with coronary heart disease chromosome p . locus for coronary artery disease: how little we know alu elements in anril non-coding rna at chromosome p modulate atherogenic cell functions through trans-regulation of gene networks recent studies of the human chromosome p locus, which is associated with atherosclerosis in human populations. 
arterioscler from genotype to phenotype in human atherosclerosis -recent findings genomic risk prediction of coronary artery disease in , adults: implications for primary prevention esc/esh guidelines for the management of arterial hypertension haplotypes of the beta- adrenergic receptor associate with high diastolic blood pressure in the caerphilly prospective study principles of statistical inference snpassoc: an r package to perform whole genome association studies basic statistical analysis in genetic case-control studies the genetics of blood pressure regulation and its target organs from association studies in , individuals genomewide association analysis identifies novel blood pressure loci and offers biological insights into cardiovascular risk genetic analysis of over million people identifies new loci associated with blood pressure traits genomes project consortium. a global reference for human genetic variation toward multiple snp motif analyses of loci associated with phenotypic traits antagonistic pleiotropy and mutation accumulation influence human senescence and disease chromosome p snps associated with multiple disease phenotypes correlate with anril expression genetic loci associated with coronary artery disease harbor evidence of selection and antagonistic pleiotropy long noncoding rna anril: lnc-ing genetic variation at the chromosome p locus to molecular mechanisms of atherosclerosis expression of chr p genes cdkn b (p ink b ), cdkn a (p ink a , p arf ) and mt ap in human atherosclerotic plaque cytoplasmic mtap expression loss detected by immunohistochemistry correlates with p homozygous deletion detected by fish in pleural effusion cytology of mesothelioma tgfβ influences myc, miz- and smad to control the cdk inhibitor p ink b c/ebpβ at the core of the tgfβ cytostatic response and its evasion in metastatic breast cancer cells tgf-β family signaling in tumor suppression and cancer progression variants in the cdkn b and rtel regions are 
associated with high-grade glioma susceptibility battle at the p promoter tgf-beta regulation by emilin : new links in the etiology of hypertension loss of emilin- enhances arteriolar myogenic tone through tgf-β (transforming growth factor-β)-dependent transactivation of egfr (epidermal growth factor receptor) and is relevant for hypertension in mice and humans dissecting the role of epidermal growth factor receptor catalytic activity during liver regeneration and hepatocarcinogenesis p < × − has emerged as a standard of statistical significance for genome-wide association studies promoting cardiovascular health in the developing world: a critical challenge to achieve global health the missing diversity in human genetic studies genetic variants in novel pathways influence blood pressure and cardiovascular disease risk susceptibility locus for clinical and subclinical coronary artery disease at chromosome p in the multi-ethnic advance study the effect of chromosome p variants on cardiovascular disease may be modified by dietary intake: evidence from a case/control and a prospective study association of a chromosome locus p . cdkn b-as variant rs with hypertension: the tam-risk study natural selection on genes related to cardiovascular health in high-altitude adapted andeans the evolution of life histories introduction to genetic analysis trade-offs in life-history evolution the causes of evolution selective variegated methylation of the p cpg island in acute myeloid leukemia elastin microfibril interface-located protein , transforming growth factor beta, and implications on cardiovascular complications nonclassical renin-angiotensin system and renal function abstract: schematic depiction of the main observation from association data presented or discussed in this article. two adjacent haplotype blocks characterize the p . 
cardiovascular risk locus: left, the block or island containing the first part of the p gene and its well-studied promoter, characterized by elevated associations (red) with blood pressure (dbp, sbp) and/or hypertension in colombian and european studies, and right, the block containing lead cardiovascular risk kg-eur and kg-amr, representing mainly european origin and mainly or partly native american origin, respectively. color-shaded matrix shows -snp haplotype motifs (rows of red 's/major allele and 's/minor allele: master motifs, yellow shading: one-snp mutants) of the kg-amr individuals ( haplotypes) from medellín, kg-clm, which correspond in ancestry and admixture to the individuals m -clm studied here from the same city (snp allele frequencies are shown below matrix). the major haplotype classes are shaded light green (c ), light blue (c ) and sand (c ); by inclusion/exclusion they define the snp classes s (blue rs identifiers), s (red rs identifiers) and s . relative haplotype frequencies are shown at right for kg-clm, corresponding to our association results from medellín, and for the british population this figure illustrates a nested-plot overview of genotyped p . snps, the association signals for hypertension in the medellín study, and elements of a causal hypothesis involving a snp in the p promoter. colored bars and asterisks (top track) show extent and sentinel/lead snp of regions of strongest bp (red) or cardiovascular risk (blue) association in the studies by warren we thank professors i. king jordan and greg gibson (georgia institute of technology) for suggestions and references early in this study, drs. edwin garcía and juan ramón gonzález for advice on association analysis, and richard jacobs and jane mcdougall (lgc genomics uk) for genotyping assistance. we also thank two anonymous referees of a earlier version of this paper for suggestions. 
key: cord- -j i wlak authors: zarai, yoram; zafrir, zohar; siridechadilok, bunpote; suphatrakul, amporn; roopin, modi; julander, justin; tuller, tamir title: evolutionary selection against short nucleotide sequences in viruses and their related hosts date: - - journal: dna res doi: . /dnares/dsaa sha: doc_id: cord_uid: j i wlak viruses are under constant evolutionary pressure to effectively interact with the host intracellular factors, while evading its immune system. understanding how viruses co-evolve with their hosts is a fundamental topic in molecular evolution and may also aid in developing novel viral-based applications such as vaccines, oncologic therapies, and anti-bacterial treatments. here, based on a novel statistical framework and a large-scale genomic analysis of , viruses from all classes infecting host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. these sequences cannot be explained by the coding regions’ amino acid content, codon, and dinucleotide frequencies. we specifically show that short homooligonucleotide and palindromic sequences tend to be under-represented in many viruses probably due to their effect on gene expression regulation and the interaction with the host immune system. in addition, we show that more sequences tend to be under-represented in dsdna viruses than in other viral groups. finally, we demonstrate, based on in vitro and in vivo experiments, how under-represented sequences can be used to attenuate zika virus strains. viruses, the most abundant type of biological entity, are small infectious agents that can only replicate inside the living cells of other organisms (hosts). 
the viral genetic material is composed of either rna or dna, single- or double-stranded. viral genomes typically encode three types of protein: proteins for replicating the genome, proteins for packing the genome, and proteins for modifying the function of the host's cell to enhance the replication of the virus's material. viruses are believed to play a central role in evolution (e.g. via horizontal gene transfer), to be responsible for various human diseases (e.g. aids and respiratory diseases), and to have important applications in biotechnology and nanotechnology. for instance, the recent zika virus (zikv) epidemic in the americas led the world health organization to declare a 'public health emergency of international concern', and just recently the novel coronavirus ( -ncov) outbreak in china was declared a pandemic by the same organization. due to their complete reliance on the host gene expression machinery, viruses are under constant evolutionary pressure to effectively interact with the host intracellular factors, and at the same time effectively evade its immune system. thus, understanding how viruses co-evolve with their hosts to ensure their fitness may help in developing novel viral-based applications such as vaccines, oncologic therapies, and anti-bacterial treatments. it is natural to expect that virus-host co-evolution patterns are also encrypted in the viral genome. for example, it was shown that a high correlation of gc content exists between bacteriophages and their related hosts, that a pattern of cpg dinucleotides is suppressed in vertebrate hosts and in their related rna viruses, that the frequency of tpa dinucleotides is suppressed in invertebrate hosts and in their related rna viruses, and that many long sequences are shared between hosts and their related viruses. 
short dna sequences that are under-represented (also referred to as suppressed or avoided) in genomes of different species have been identified and analysed in the past. for example, markov chain models were used to analyse short sequences in the dna of two hosts: escherichia coli and bacillus subtilis. markovian models were also used to predict the frequencies of short sequences and were applied to many prokaryotic species, and an efficient algorithm was introduced to identify sequences that are avoided. in this paper, we analyse under-represented nucleotide sequences in the coding regions of all types of viruses and in the coding regions of their corresponding hosts using a novel statistical framework. these sequences are analysed separately in each of the three reading frames. we provide a large database of these sequences, identify unique and interesting patterns within these sequences, and demonstrate how these sequences can be utilized to attenuate the zikv via in vitro and in vivo experiments.
in this section, we briefly describe the main steps of our methodology. a detailed description appears in the supplementary document. the general flow of our analysis is depicted in fig. a . the dataset of virus-host associations was retrieved from previously published data. these data include , unique viruses and corresponding hosts, where all the corresponding coding sequences were downloaded and processed. randomization models were used to generate many random variants of the host and virus coding sequences. two different randomization models were used, each controlling for different biases. a dinucleotide randomization model preserves both amino-acid order and content and the distribution of all possible pairs of nucleotides, whereas a synonymous codon randomization model preserves both amino-acid order and content, and the codon usage bias. 
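As an illustration of the second model, the following is a minimal sketch of a synonymous codon randomization, assuming the standard genetic code and an upper-case DNA coding sequence whose length is a multiple of three. It is not the authors' implementation; the function names `translate` and `synonymous_codon_shuffle` are hypothetical. The idea is to permute codons only among positions that encode the same amino acid, so the protein and the codon counts are both unchanged. The dinucleotide-preserving model is more involved and is not sketched here.

```python
import random
from collections import defaultdict

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
# Assumption: the standard code applies to the sequence being shuffled.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an upper-case DNA coding sequence codon by codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def synonymous_codon_shuffle(cds, rng):
    """Return a random variant with the same protein and the same codon counts.

    Codons are permuted only among positions that encode the same amino acid.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    positions = defaultdict(list)
    for idx, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(idx)
    shuffled = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return "".join(shuffled)


if __name__ == "__main__":
    original = "ATGGCTGCAGCGGCCAAAAAGTAA"
    variant = synonymous_codon_shuffle(original, random.Random(0))
    print(original, variant, translate(variant))
```

Passing an explicit `random.Random` instance keeps each variant reproducible, which is convenient when many variants per genome are generated.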
these were then used to statistically infer short nucleotide sequences that are under-represented within both the original host and virus genome coding regions, in each reading frame, and those that are common to all three reading frames. these under-represented sequences were analysed and compared among different viral groups and viral proteins, revealing some interesting evolutionary patterns that will be discussed later on. based on this analysis, an attenuated variant of the zikv was engineered and its attenuation was demonstrated in cell lines and in mice. the virus and host coding sequences and association information was retrieved from a published database. in brief, the association between viruses and hosts was derived from the genomenet virus-host database. the database contains , unique viruses and corresponding unique hosts from all kingdoms of life (see supplementary table s ). figure b protists), where we specify for each host domain the portion of the corresponding viruses belonging to each virus type. the virus types in the database are reverse-transcribing (retro), double-stranded dna (dsdna), double-stranded rna (dsrna), single-stranded dna (ssdna), single-stranded rna (ssrna, positive and negative sense), and other (unclassified). the question that we must first address is: what constitutes an under-represented sequence in a coding region? to detect sequences that are statistically under-represented in the coding regions, our statistical background model must capture well-understood coding region features, which are known to be under selection. for example, selection for codon usage bias may cause few short sequences to be in low abundance in the coding regions (as opposed, for example, to regions that are not translated). this, however, does not imply that these short sequences were directly selected against by evolutionary forces. 
our definition of under-represented short nucleotide sequences in the coding region must then be formulated with respect to all known coding region features (i.e. amino-acids content and order, codon usage bias, and dinucleotide distribution), to suggest possibly new evolutionary forces acting on the viral coding regions. to that end, two randomization models were used to evaluate our hypothesis for short, under-represented nucleotide sequences in the coding regions of the viruses and in the coding regions of their corresponding hosts. the first, called dinucleotide randomization, preserves both amino acid order and content (and thus the resulting protein), and the frequencies of the possible pairs of adjacent nucleotides (dinucleotides). the second, called synonymous codon randomization preserves both amino-acids order and content (and thus the resulting protein) and the codon usage bias. figure c depicts a schematic description of both randomization methods. a selection against short nucleotide sequences that cannot be explained by the canonical genomic features that are preserved by both randomization models implies that these sequences will appear more frequently in the random variants (generated by the above randomization models) than in the original genome. empirical p-values were derived from the empirical null model defined by the above two randomization models. the p-value estimates the probability of obtaining a random value (i.e. the number of occurrences of a sequence in the coding regions) that is the same or larger than the observed value in the original genome. this was performed separately in each of the three reading frames. a sequence was declared underrepresented if its p-values corresponding to the two randomization models were both . . note that in the case of synonymous codon randomization, no under-represented sequence of size three nucleotides can be identified in the first reading frame. 
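The frame-specific counting and the empirical p-value can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, reading frames are indexed 0, 1, 2, and the p-value is taken here as the fraction of random variants whose count is at most the observed count, which is the direction in which small p-values indicate under-representation. The significance threshold is not recoverable from the text, so `alpha` is left as a required argument.

```python
def count_in_frame(cds, kmer, frame):
    """Count occurrences of kmer that start at positions i with i % 3 == frame."""
    m = len(kmer)
    return sum(
        1 for i in range(frame, len(cds) - m + 1, 3) if cds[i:i + m] == kmer
    )


def empirical_p_value(observed, random_counts):
    """Fraction of random variants with a count <= the observed count.

    Assumption of this sketch: a small value means the sequence is rarer in
    the original genome than in most random variants.
    """
    if not random_counts:
        raise ValueError("at least one random variant is required")
    hits = sum(1 for c in random_counts if c <= observed)
    return hits / len(random_counts)


def is_under_represented(p_dinucleotide, p_codon, alpha):
    """Require significance under both randomization models.

    alpha is a caller-supplied threshold; its value is an assumption.
    """
    return p_dinucleotide <= alpha and p_codon <= alpha


if __name__ == "__main__":
    observed = count_in_frame("AAAGAAAAA", "AAA", 0)
    print(observed, empirical_p_value(observed, [5, 6, 2, 7]))
```

Under the synonymous codon model every variant has exactly the same codons as the original, so a three-nucleotide sequence counted in the first frame has the same count in every variant, which is why no such sequence can be flagged there.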
specifically, when analysing under-represented sequences in the viruses, we compared the original genome to , corresponding randomization variants generated by each of the randomization models described above. under-represented sequences were then identified separately in each reading frame. in addition, common under-represented nucleotide sequences were identified (i.e. sequences that are under-represented in all three reading frames; see supplementary document, section . . ). this may indicate selection against sequences that may 'interfere' with the process of mrna translation. see supplementary document, sections . and . for an additional method of identifying under-represented sequences in the viruses based on the corresponding hosts (i.e. host-based as opposed to random-based analysis). due to the large size of the host genome, the analysis of under-represented sequences in the hosts was performed differently from that in the viruses: the hosts were analysed relative to their corresponding viruses. recall that a host can be infected by several viruses. specifically, for each pair of a host and a corresponding virus (i.e. a virus that infects that host), we randomly sampled the host coding sequences with a sample size equal to the total size of the virus coding sequences. twenty host samples were used for each host-virus pair. each sample was compared with , corresponding randomization variants generated by each of the random models. thus, twenty sets of under-represented sequences were identified in the host, for each reading frame, given a corresponding virus. a sequence that is under-represented in at least ten of the twenty samples, per reading frame, is then considered as under-represented in the host, given the corresponding virus. this is referred to as the sampled majority under-represented set of the host given a corresponding virus (see supplementary document, section . . ). 
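The majority rule over host samples can be sketched as a simple vote. This is a minimal sketch with a hypothetical function name, assuming each host sample has already been reduced to a set of under-represented sequences for one reading frame; the text uses twenty samples and a minimum support of ten.

```python
from collections import Counter


def sampled_majority(sample_sets, min_support):
    """Keep sequences flagged as under-represented in at least min_support samples."""
    # set(s) guards against a sample listing the same sequence twice.
    votes = Counter(seq for s in sample_sets for seq in set(s))
    return {seq for seq, n in votes.items() if n >= min_support}


if __name__ == "__main__":
    # Toy example with three samples and a support of two.
    samples = [{"AAA", "CCC"}, {"AAA"}, {"GGG"}]
    print(sampled_majority(samples, 2))
```

For the setting described above, the call would be `sampled_majority(twenty_sets, 10)`, where `twenty_sets` stands for the per-sample results of one host-virus pair and one reading frame.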
the final set of sequences that are under-represented in the host was defined by the intersection over all the corresponding viruses. see more details in supplementary document, section . . the genome of a thai-strain zikv from an infectious-clone plasmid was evaluated to uncover under-represented sequences (see supplementary document, section . for more details on the zikv strain). first, the two randomized models (dinucleotides and synonymous codons) were used on the zikv coding sequence to identify short sequences that are under-represented. next, oligos of five nucleotides ( -mers) that were identified by both models and showed significant p-values were selected and ranked according to their significance level (see the list of oligos detected in supplementary document, section . ). following, the sequence of the thai strain zikv ns protein was systematically scanned at the nucleotide level (according to the significance in the relevant frame) to identify locations that can be modified with each -mer, but without affecting the amino acid sequence of the protein (fig. ). specifically, we were able to identify and introduce synonymous codon changes in the first reading frame, and synonymous codon changes in the second reading frame. figure . a general scheme of engineering a synthetic sequence. specifically, in the case of the synthetic zikv ur sequence, we introduced different under-represented -mer oligo in the first two reading frames (identified using both randomization models), replacing the original nucleotide sequence while verifying that the protein aa sequence remains unchanged. the modified ns sequence (hereafter named ur ) was later synthesized as plasmid dna, amplified by pcr, and used to build zikv-ur strain by gibson assembly. the first-passage stock virus was produced using vero cells. synthetic strain preparation: the infectious-clone plasmid of the thai-strain zikv was constructed from pcr products of viral cdna. 
the transfection of the plasmid into mammalian cells generated infectious virus with replication kinetics similar to those of the original virus. the sequence of the infectious-clone plasmid was indeed verified. the viral sequence from this infectious-clone plasmid was evaluated to uncover under-represented sequences as discussed above.
cell lines: bhk with rtta was used to generate virus from assembled dna. the supernatant from the transfected bhk was then used to infect vero cells to prepare the virus stock for subsequent experiments. replication kinetics of the wild-type (wt) virus and the ur virus were characterized in vero cells with moi = . . the infectious titre was quantitated with vero cells using immunostaining against e protein by g monoclonal antibodies.
animals: male and female ag mice produced by an in-house colony were used. groups of animals of both genders were randomly assigned to experimental groups and individually marked with ear tags. animals were challenged with malaysian zikv, zikv wt synthetic, ur , or vehicle. serum was collected from all mice dpi for assessment of neutralizing antibodies (neutabs) via prnt assay. mice were monitored for mortality and disease signs daily. individual weights were recorded daily throughout the course of the study.
virus: wt zikv (malaysian strain, p - ) was prepared by two passages in vero cells. a challenge dose of ccid was administered via s.c. injection in a volume of . ml. the virus was generated from the same infectious-clone plasmid as the designed variants.
quantification of neutab: neutab was quantified using a % plaque reduction neutralization titre (prnt ) assay. serum samples were heat-inactivated at c for min in a water bath. one half serial dilution, starting at a / dilution of test sera, was made. dilutions were then mixed : with an appropriate titre of zikv in mem containing % fetal bovine serum (fbs) and incubated at c overnight. 
the virus-serum mixture was then added to individual wells of a -well tissue culture plate with vero cells ( e cells/well). viral adsorption proceeded for h at c and % co , followed by addition of . % ( , cps) methylcellulose overlay medium containing % fbs to each well. plates were incubated for days, and then stained with crystal violet [with % (wt/vol) crystal violet in % (vol/vol) ethanol] for min. the reciprocal of the dilution of test serum that resulted in > % reduction in average plaques from virus control was recorded as the prnt value.
to identify short under-represented nucleotide sequences, we compared the number of appearances of each sequence of , , and nucleotides in each reading frame of the original genome with many corresponding randomization variants. our randomization models preserve the basic canonical features of the coding sequences, i.e. amino-acid composition, codon usage bias, and dinucleotide distribution (see section . ). thus, an under-represented sequence cannot be explained by these canonical features and may be selected against by other evolutionary forces. to estimate the false discovery rate, we performed two separate evaluations. first, we generated , randomizations (instead of , ) for a few randomly selected viruses and verified that under-represented sequences that were detected using , randomizations were also detected using , randomizations. in the second evaluation, we performed identifications of under-represented sequences in random variants of the viruses (rather than in the original genome). specifically, a random variant of each virus was randomly selected, and the p-value was evaluated relative to this (random) variant (see supplementary document, section . . . ). comparing the number of under-represented sequences identified in the original viruses and the randomized variants of the viruses yields an estimation of a false discovery rate of . % (for m = ), . % (for m = ), and . % (for m = ). 
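The second evaluation amounts to a ratio of counts. A minimal sketch, assuming the false discovery rate is estimated as the number of sequences flagged in random variants divided by the number flagged in the original genomes, summed over the same set of viruses; the function name and the input layout (one count per virus) are hypothetical.

```python
def estimated_fdr(original_counts, random_counts):
    """Ratio of detections in random variants to detections in original genomes.

    Both arguments hold one count per virus, in the same order.
    """
    total_original = sum(original_counts)
    if total_original == 0:
        raise ValueError("no under-represented sequences in the original genomes")
    return sum(random_counts) / total_original


if __name__ == "__main__":
    # Toy numbers only, not data from the study.
    print(estimated_fdr([100, 50], [3, 0]))
```

The estimate would be computed separately for each sequence size m, since the number of detections depends on m.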
the under-represented sequences identified were further processed by analysing different virus and host groups. specifically, we analysed under-represented sequences for each virus group, for each host domain, for all viruses that correspond to the same host, and for different combinations of host domains and virus groups (see supplementary document, section . ). a complete list of the most abundant under-represented sequences among the different virus groups is available in supplementary table s . in addition, we refined our analysis of under-represented sequences in the viruses by analysing different protein groups. we classified all viral genes into five mutually exclusive functional groups [surface, structural, enzymatic, unknown (unclassified genes), and other (hypothetical genes)] and showed that the selection against short nucleotide sequences depends on the viral protein function. finally, we performed a test study using zikv, where we engineered under-represented sequences into the genome of an asian zikv and studied their effect both in vitro and in vivo. figure a and b depict the average number of under-represented sequences of size m = , , and nucleotides, identified in a few subsets of viruses in both the original and random variants of the virus. see supplementary document, section . for details about the different subsets, and supplementary document, section . . . for generating random variants of viruses. as shown in the figures, the average number does indeed increase with the sequence size. also, many under-represented sequences are found in dsdna viruses that infect bacteria and vertebrate hosts. the average number of under-represented sequences found in the random variants of the viruses is between and % of the average number found in the original genome, suggesting a false discovery rate < %. 
since the genome of dsdna viruses tend to be on average larger than the genome of rna viruses, we aimed at evaluating if the larger number of under-represented sequences identified can be simply attributed to a better statistical signal due to the larger nucleotide size of these viruses. a sampling analysis that we performed (see supplementary document, section . ) suggests that the number of under-represented sequences identified in dsdna viruses matches their genomic size, when compared with rna viruses. a complete list of under-represented sequences of sizes m ¼ , , and nucleotides in all viruses in the database is available in supplementary table s (random-based) and in supplementary table s (host-based). our analysis suggests that among the most abundant common underrepresented nucleotide sequences (i.e. sequences that are underrepresented in all three reading frames) are homooligonucleotide repeats, specifically in viruses. these are sequences of the form xx.x, where all x contain the same nucleotide. figure a note that among these, specifically in viruses, are sequences containing the same nucleotide repeated m ¼ , , or times (i.e. sequences that correspond to the same colour repeating m times in the figure) . a finer resolution of these common under-represented sequences is provided in fig. b , where we depict these sequences separately for different subsets of hosts (left figure) and subsets of viruses (right figure) . see supplementary document, section . for more details of the different subsets. table lists the six most abundant common under-represented nucleotide sequences of size m ¼ , , and in dsdna viruses. all homooligonucleotide sequences (shown in red coloured text) are among these most abundant sequences. one possible reason for this general selection against homooligonucleotide (in all three reading frames) in both viruses and hosts is to reduce erroneous frame shifts as ribosomes traverse the mrna while decoding it codon by codon. 
a sequence containing a repetition of the same nucleotide in the coding sequence may cause the ribosome to miss the codon boundary, resulting in a frame shift and thus a non-functional and most likely deleterious protein, which must then be recognized and degraded by energy-consuming intracellular proteolytic mechanisms. since translation is the most energetically consuming process in the cell, it is believed that transcripts undergo selection to minimize this energy cost. [ ] [ ] [ ] [ ] [ ] selection against sequences of repetitive nucleotides reduces faulty translation, thus minimizing the overall translation cost. it is possible that this selection against homooligonucleotide repeats is more pronounced in viruses than in hosts because viruses have a larger effective population size and are therefore under much stronger evolutionary selection, so that these types of mutations have a stronger effect on their fitness. another possible reason may be related to the different host immune evasion mechanisms used by viruses (see section . ). we also evaluated the sequence overlap between common under-represented sequences in viruses and transcription factor binding sites and again found a general selection against homooligonucleotide repeats. these results are reported in supplementary document, section . . a nucleotide sequence is called palindromic if it is identical to its reverse complement. palindromic sequences necessarily have even length, because the middle base of an odd-length sequence would have to be its own complement. our analysis reveals that . % of all common under-represented sequences of size m = nucleotides in viruses are palindromes. excluding homooligonucleotide repeats this becomes %. note that only . % of all possible sequences of size m = nucleotides are palindromes. we also evaluated the number of palindromes in random variants of the viruses. these random variants preserve basic transcript features such as amino-acid order and content, codon usage bias and dinucleotide distributions. only .
% of all common under-represented sequences of size m = in the random variants of the viruses were found to be palindromes. these findings suggest that the coding regions of viruses are indeed selected against short palindromic sequences. figure a and b depict the percentage of palindromic sequences of size m = nucleotides that are common under-represented sequences in subsets of hosts and viruses. palindromic sequences were found to be selected against in only one subset of hosts: bacterial hosts that are infected by dsdna viruses. in addition, palindromic sequences were found to be selected against in dsdna viruses that infect either bacteria (i.e. bacteriophages) or vertebrate hosts, as depicted in fig. a. figure c and d depict the total number of occurrences of each palindrome as an under-represented sequence in dsdna viruses that infect bacteria and vertebrate hosts, respectively. in these sub-figures we analysed under-represented sequences regardless of reading frame. two cases are shown: the case where the real virus genome is used (shown in blue), and the case where a randomized variant of the virus genome is used (shown in red). note the scale difference in the y-axis between the real and the randomized results. the results in the figures imply that dsdna viruses undergo selection against short palindromic sequences. it has been proposed that the principal reason for the apparent avoidance of short palindromes in dsdna viruses is that they are targets for many restriction-modification systems and possibly for general recombination systems as well. restriction-modification systems protect bacteria and archaea from attacks by bacteriophages and archaeal viruses. a restriction-modification system specifically recognizes short sites in foreign dna and cleaves it, while such sites in the host dna are protected by methylation.
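the palindrome definition used above (a sequence identical to its reverse complement) and the background fraction of palindromes among all sequences of a given length can be checked with a short sketch. the function names are illustrative assumptions, and length 4 in the usage note is chosen only as an example.

```python
from itertools import product

# Watson-Crick complement for upper-case DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq):
    """A sequence is palindromic if it equals its own reverse complement."""
    return seq == reverse_complement(seq)


def palindrome_fraction(m):
    """Fraction of all 4**m sequences of length m that are palindromic."""
    kmers = ["".join(p) for p in product("ACGT", repeat=m)]
    return sum(is_palindrome(k) for k in kmers) / len(kmers)
```

for example, `palindrome_fraction(4)` enumerates all 256 four-nucleotide sequences, of which 16 are palindromic (the first two bases determine the last two), and any odd length gives a fraction of zero.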
to evaluate the hypothesis of palindromes avoidance in viruses due to restriction-modification systems, we downloaded all restriction enzyme patterns from the rebase database (we used version , which contains information for different restriction enzymes) and evaluated the overlap between the common under-represented nucleotide sequences we identified and the restriction sites from rebase. figure e depicts the number of exact matches between the most abundant common under-represented palindrome sequences of size m ¼ nucleotides in dsdna viruses and restriction sites. the figure also depicts the corresponding enzyme name and the p-value for each common under-represented sequence. the p-value was computed by evaluating the match between common under-represented sequences of random variants of the viruses and the restriction sites. figure f depicts the number of restriction sites that are supersets of the most abundant common under-represented palindrome sequences. p-values were computed as in the case of an exact match. to show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses. this is reported in supplementary document, section . . a complete list of all common under-represented palindromes of size m ¼ is provided in supplementary table s . common under-represented sequences were only identified in two subsets of hosts. on the other hand, common under-represented sequences were identified in all eight subsets of viruses. our analysis reveals that dsdna viruses infecting bacteria and vertebrate hosts have the largest number of common underrepresented sequences among the different virus subsets. 
this, as suggested above, seems to be due to the size of dsdna viruses when compared with ssdna and rna viruses. on the other hand, bacteria that are infected by dsdna viruses have the largest number of common under-represented sequences among the different host subsets. thus, the stronger selection for under-represented sequences in bacteria may induce stronger selection for under-represented sequences in viruses that utilize this host. in addition, we evaluated the number of under-represented sequences identified in the real genome of the viruses when compared with the randomized genome of the viruses. this is reported in supplementary document, section . . indeed, many more sequences are identified as under-represented in the real genome of the virus. on average over all viruses and the three sequence sizes, there are stds more under-represented sequences in the real genome in comparison to the random genomes, implying that these cannot be explained by basic coding region features, and suggesting possibly new evolutionary forces acting on the viral coding regions. note that since we analyse each pair of a host and a corresponding virus separately, the set of under-represented sequences in a host above is the sampled majority under-represented set. for obvious reasons, sequences that are not under-represented in both host and virus coding regions constitute the majority of the sequences and are thus not reported here. a complete list of all under-represented sequences within the three classes above for all hosts and viruses in our database is available in supplementary table s . in general, an under-represented sequence of m nucleotides may contain sub-sequences that are themselves under-represented. thus, it may be interesting to identify unique under-represented sequences, i.e. sequences that do not contain any sub-sequences that are underrepresented. 
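the notion of a unique under-represented sequence introduced above can be sketched as a simple filter over a set of under-represented sequences. this is an illustrative sketch: the function names are assumptions, and the minimum sub-sequence length of 3 reflects the smallest sequence size analysed in this study.

```python
def proper_substrings(seq, min_len):
    """All contiguous sub-sequences of `seq` with length in [min_len, len(seq) - 1]."""
    return {
        seq[i:i + n]
        for n in range(min_len, len(seq))
        for i in range(len(seq) - n + 1)
    }


def unique_under_represented(under_represented, min_len=3):
    """Keep only sequences that contain no shorter under-represented sub-sequence."""
    ur = set(under_represented)
    return {s for s in ur if not (proper_substrings(s, min_len) & ur)}
```

with a hypothetical input such as `{"AAA", "AAAA", "GATC", "CGATC"}`, the longer sequences `AAAA` and `CGATC` are dropped because each contains a shorter member of the set.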
for each pair of a host and a corresponding virus, a sequence belonging to one of the three classes above is referred to as a unique under-represented sequence if it does not contain any sub-sequence that is under-represented in that class. specifically, a unique common under-represented sequence of size m = (m = ) nucleotides does not contain any sub-sequence of size m = (of size m = and of size m = ) nucleotides that is itself a common under-represented sequence. a complete list of all unique common under-represented sequences within the three classes above for all hosts and viruses in the database is available in supplementary table s . the correspondence of the most abundant under-represented sequences between viruses and their related hosts is depicted in fig. for different host and virus subsets. each panel depicts both the most abundant common under-represented sequences (left) and the most abundant unique common under-represented sequences (right), where the panel names correspond to the class names. our first observation is that many under-represented sequences are indeed unique. for example, comparing the cases of m = and m = of class a (left sub-figure middle and bottom rows, respectively) with the corresponding unique set (right sub-figure top and bottom rows, respectively) reveals that the majority of the most abundant sequences are unique. second, homooligonucleotide repeats are among the most abundant sequences in all three classes. in addition, more sequences were identified in class b over the different subsets than in the other two classes. for example, table lists the most abundant unique sequence of classes b and c in all the different subsets of hosts and viruses. as shown in the table, unique sequences were identified in all subsets in class b, as opposed to class c. the viral genome encodes different types of proteins that are necessary for the life cycle of viruses in their respective hosts.
these, in general, include surface proteins that interact with the host receptors and enable attachment and entry to the host cell, structural proteins that serve as the building blocks of the virus, and replicating enzymes, such as rna and dna polymerase, that are required for the replication of the virus. in addition, many other proteins, some of which are uncharacterized, are diversely involved in different regulatory and accessory functions. here, our aim is to refine the analysis of under-represented sequences in viruses by analysing, separately, different protein groups. to that end, and similarly to, we classified all viral genes into five mutually exclusive functional groups (functional sets): surface, structural, enzymatic, unknown (unclassified genes), and other (hypothetical genes). specifically, for each virus in the database, we divided its genome into the five gene sets defined above. each gene set contains all the virus genes of the same functional group. for example, the surface gene set of a virus contains all the genes that encode surface proteins in the virus's genome. a set might be empty for a particular virus if no genes of the corresponding functional group exist in that virus. see supplementary document, table s for a list of the total number of sets and genes of each functional group in the database. the analysis of under-represented sequences was then performed separately in each of the five gene sets for each of the viruses in the database (see more details in supplementary document, section . ). a complete list of all under-represented sequences in each viral functional group over all viruses in the database is available in supplementary table s . we first analysed the average number of under-represented sequences identified in each gene set. 
to control for the difference in the average gene size and the number of genes in each set, we randomly selected , , , , , , , , and , genes from each of the surface, structural, enzymatic, unknown, and hypothetical functional groups, respectively. this means that the number of identified under-represented sequences is analysed over similar region sizes, and the differences between the sets cannot be explained by the nucleotide size of the genes in each set. figure a depicts the average number of under-represented sequences (over all three reading frames) identified in each of the gene sets over the (randomly selected) subset of genes. a relatively small number of under-represented sequences was identified in surface genes (which participate in the recognition of the host receptors), when compared with the other gene sets. at least twice as many were identified in many of the enzymatic genes. these proteins interact closely with the host cell machinery, are essential for the viral replication cycle, and thus must use mechanisms that guarantee their function. figure b depicts the most abundant common under-represented sequences within each viral functional group. these differ between the functional groups; however, homooligonucleotide sequences appear among the most abundant common under-represented sequences in all groups. we designed an attenuated zikv variant based on the analysis of under-represented sequences that we performed. such variants may be useful in the future for generating a live-attenuated vaccine. specifically, we introduced synonymous mutations that engineer under-represented sequences into the ns nucleotide sequence, and named the new variant ur (see details in section ). infection studies in vero cells demonstrated fractional variant attenuation of the ur virus, which correlated with our model predictions (see foci size in fig. a, right bottom).
in addition, infectious virus collected and evaluated from the ur variant showed substantial attenuation relative to wt zikv (fig. a) . there is evidence that ag mice lacking ifn-a/b and ifn-c (types i and ii interferon) receptors can be valuable for evaluating the efficacy of new vaccines and anti-viral treatments for zikv. , therefore, as these mice are immune compromised, various strains of zikv cause lethal infection and disease, and will typically cause morbidity and mortality. depending on the strain, severe disease is observed between and weeks after virus challenge. , thus, to further test the synthetic vaccine attenuation level in vivo, ag mice were challenged with attenuated zikv preparations as well as synthetic wt zikv. these inoculations were done in parallel with the original virus grown in cell culture. infection with the synthetically attenuated zikv strains was lethal in all inoculated ag mice. however, the mortality curve of mice infected with ur was delayed, when compared with that of wt malaysia and wt synthetic zikv (average of . days in ur vs. and . in wt malaysia/synthetic zikv, respectively; see fig. b ). no mortality was observed in unvaccinated controls, and mice vaccinated with vehicle (fig. b) . weight loss was also observed in all the infected mice ( - %; see fig. c ). normal control mice experienced general weight gain throughout the experimental period (fig. c) . weight loss corresponded well with mortality, and mice typically lost substantial weight, requiring humane euthanasia. neutab is the primary mediator of protection in vaccine studies in this model. , therefore, serum samples were taken to determine the presence of neutab in infected mice. the neutab titre was evaluated in vaccinated mice weeks after vaccination. mice vaccinated with synthetic wt or ur had significantly (p < . ) elevated neutab titres when compared with vehicle controls (see fig. d ). 
as expected, no neutab was detected in mice vaccinated with vehicle or in normal control groups (see fig. d ). the virulence of ur was somewhat lower than that of the malaysian and synthetic wt strains, demonstrating that under-represented sequences can potentially be used in the design of live attenuated zikv strains. accordingly, additional attenuation of this variant (e.g. by introducing similar changes to other zikv proteins) may further decrease the lethality in mice infected with it. since ag mice are very susceptible to zikv infection, this mouse model might be too stringent for testing these live attenuated vaccine candidates; because human infection is generally sub-clinical after natural zikv infection, the attenuated strain might be effective in an immunocompetent model. we compared the average number of under-represented sequences identified in each pair of a virus and its corresponding host. see supplementary document, section . for more details. we found that in % of the cases the average number was larger in the hosts. we believe that this is because the viral genome is usually populated with many overlapping codes and genes, when compared with cellular organisms. [ ] [ ] [ ] this introduces many constraints along the viral genome, which can decrease the number of under-represented sequences in the virus. for example, a sub-optimal codon within the host coding region may be synonymously replaced by evolution without affecting the host fitness. however, due to overlapping codes, replacing a sub-optimal codon within the viral coding region may affect multiple proteins and genes, and thus be deleterious to the virus. in this study, we analyse sequences three, four, and five nucleotides long that are under-represented in the coding regions of viruses of all types and in their corresponding host coding regions.
this study is based on a novel statistical evaluation that controls for classical coding region features and is performed separately in each of the three reading frames. we provide various novel discoveries that may shed light on the evolution of viral dna sequences and on the co-evolution of viruses with their respective hosts. it is important to emphasize that the observed patterns may be related to various variables and their complex interactions, including gene expression optimization, various mechanisms for escaping the host immune system, and co-evolution with the corresponding hosts. for example, it was reported that suppression of cg dinucleotides in hiv- is due to co-evolution with its vertebrate host to avoid the host defence mechanisms. in general, our analysis reveals that under-represented viral sequences are related to different mechanisms such as restriction-modification systems and possibly to alternative or unknown immune escape mechanisms, as these sequences cannot be explained by canonical mechanisms that may suggest, for example, classical viral recognition using antibodies. we show that homooligonucleotide repeats are the most abundant under-represented sequences in both viruses and hosts. a possible explanation for this avoidance is that it reduces erroneous ribosomal frame shifts and thus reduces faulty translation and, consequently, the overall translation cost. however, although this motif is shared between hosts and viruses, our analysis also indicates that a stronger selection pressure against these sequences exists in viruses. this again can be attributed to escape mechanisms from the host immune system, as the virus nucleotide composition evolves to be similar to that of the host, and it is possible that an excess avoidance of homooligonucleotide repeats reduces viral recognition by classical host immune mechanisms. there may be other relevant explanations such as interaction with small rna genes (e.g. mirnas).
it is possible, for example, that these sequences may increase the efficiency of mirna and mrna interactions and thus decrease expression levels. this should be studied further. in addition to homooligonucleotide repeats, we show that palindromes are among the most abundant under-represented sequences in viruses. specifically, excluding homooligonucleotide repeats, our analysis reveals that % of all under-represented sequences of four nucleotides long in viruses are palindromes (where only . % of all possible sequences of that size are palindromes). indeed, analysis of palindromes avoidance in viruses was performed previously. it was shown that palindromes are the most under-represented short sequences in a prokaryotic genome. [ ] [ ] [ ] for example, it was reported that short palindromic sequences are avoided at a statistically significant level in the genomes of several bacteria. four and six nucleotides palindromic sequences that are avoided were reported for few viruses and hosts in, and avoidance of palindromes in several dozen phage genomes was reported in these analyses are based on statistical counts of certain sequences in the given dna and thus do not control for canonical coding region features (codon usage bias, amino acid order and content and dinucleotide distribution) as was done in this study. in addition, our analysis is performed over a larger set of viruses of all types and their corresponding hosts, and at a reading frame resolution. thus, we believe that the results reported here may be more accurate, and should provide a better understanding of this phenomenon. one plausible explanation for avoidance of palindromes in viruses is because they are targets for many restriction-modification systems and possibly for general recombination systems as well. we statistically show a high overlap between under-represented palindromes in viruses and restriction enzyme patterns. this overlap cannot be explained by classical coding region features. 
restriction of recognition sites has been observed in genomes of prokaryotic organisms. [ ] [ ] [ ] the authors in analysed the avoidance of restriction sites in a few bacteriophages, and concentrated on sites containing six nucleotides. rusinov et al. studied most known recognition sites (both palindromic and asymmetric) in thousands of prokaryotic genomes and found factors that influence their avoidance. it was also shown that recognition site avoidance correlates with the lifespan of restriction-modification systems. recently, the authors in analysed avoidance of recognition sites of restriction-modification systems in the genomes of prokaryotic viruses and found it to be a widespread but not universal anti-restriction strategy of these viruses. (table note: the numbers in parentheses indicate the frequency of occurrences in percentage; x indicates that no corresponding sequence was identified.) the method used by the authors is based on a compositional bias calculation, which is the ratio of the observed to the expected frequency of a sequence, where the expected frequency is estimated based on the observed frequencies of all sub-sites of a given sequence. the compositional bias measure was originally used in for analysing over- and under-represented sequences in dna viruses. since the compositional bias measure does not account for a statistical background that preserves known evolutionary forces, we believe that the procedure used here identifies under-represented sequences more accurately and comprehensively. in addition, we analyse the distribution of these under-represented sequences among various viral and host groups. we show, for example, that dsdna viruses infecting bacteria or vertebrate hosts contain a larger set of under-represented sequences than other viral types and that this may be related to their larger genome size.
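to make the contrast with the compositional bias measure discussed above concrete, the sketch below computes an observed-to-expected ratio for a single word. it is a deliberately simplified stand-in, not the formula used by the cited authors: instead of combining the frequencies of all sub-sites, it uses the maximal-order markov estimate, in which the expected count of a word is derived only from its prefix, its suffix, and their shared core. the function names are assumptions.

```python
from collections import Counter


def count_words(seq, n):
    """Count all overlapping words of length n in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def observed_expected_ratio(seq, word):
    """Observed / expected count of `word`, with expected = N(prefix) * N(suffix) / N(core).

    Simplified estimator (maximal-order Markov model); returns None when the
    expected count cannot be computed or is zero.
    """
    k = len(word)
    if k < 3:
        raise ValueError("word must have at least 3 nucleotides")
    observed = count_words(seq, k)[word]
    prefix = count_words(seq, k - 1)[word[:-1]]
    suffix = count_words(seq, k - 1)[word[1:]]
    core = count_words(seq, k - 2)[word[1:-1]]
    if prefix == 0 or suffix == 0 or core == 0:
        return None
    expected = prefix * suffix / core
    return observed / expected
```

a ratio well below 1 would flag a word as under-represented relative to its own sub-words, but, as noted above, a ratio of this kind has no randomized background that preserves amino-acid content, codon usage, or dinucleotide distribution.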
furthermore, we show that on average the set of sequences that are under-represented in viruses but not in their related hosts is the largest among the different classes of host-virus under-representation. we also show that the selection against under-represented sequences in viruses depends upon the protein function. for example, a larger number of sequences is shown to be under-represented in enzyme genes than in surface genes. moreover, an even larger number of sequences is found to be under-represented in genes with (currently) unknown functionality, prompting further investigation into the nature of these genes. the differences between these groups may also be related to the expression levels of the different proteins. if, for example, surface genes tend to have low expression levels, then they may be under weaker selection for features such as under-represented sequences. vaccines are a topic of singular importance in present-day biomedical science. however, the discovery of vaccines has so far been primarily empirical in nature, requiring considerable investments of time, effort, and resources. to overcome the numerous pitfalls attributed to the classical vaccine design strategies, more efficient and robust rational approaches are highly desirable. one direction in designing in silico vaccine candidates may be based on exploiting the synonymous information, encoded in the viral genomes and related to gene expression, for attenuating the viral replication cycle while retaining its genotype and structure. the analysis and results reported here may have important implications in vaccine synthesis. specifically, the outcomes of this study may provide clues and guidance for the practical design of efficient and safe viral vaccines via attenuated viral material.
furthermore, it may also prove to be beneficial for other biotechnological objectives related to viral based products such as developing oncolytic viruses and engineering phages to fight bacteria. [ ] [ ] [ ] [ ] [ ] [ ] indeed, we demonstrate, both in vitro and in vivo, how under-represented sequences can be utilized to obtain an attenuated zikv. the aim of these experiments is an initial proof of concept. of course, additional experiments with more variants and controls are needed to better understand the effect of these under-represented sequences on the viral growth rate and fitness. for example, it will be helpful to study additional mutants that do not possess underrepresented sequences but include other types of mutation. however, it is important to emphasize an interesting and a non-obvious aspect of these experiments. the introduced mutations are silent and thus did not alter the encoded protein. based on our experience, in many cases silent mutations may not affect the viral fitness, and furthermore, there are cases where they may even improve its growth rate. also, it is important to emphasize that in these experiments both the wild-type and the mutant variants were generated by the same process and from the same infectious-clone plasmid. finally, the randomization models used in this study may not completely preserve the viral rna secondary structure, and thus the selection for under-represented sequences may be partially due to alterations in secondary structures. 
fields virology tinkering with translation: protein synthesis in virus-infected cells the role played by viruses in the evolution of their hosts: a view based on informational protein phylogenies horizontal gene transfer in prokaryotes: quantification and classification evolution of complexity in the viral world: the dawn of a new vision giant viruses, giant chimeras: the multiple evolutionary histories of mimivirus genes rates of hospitalizations for respiratory syncytial virus, human metapneumovirus, and influenza virus in older adults viral infectious disease and natural products with antiviral activity negative-strand rna viruses: applications to biotechnology viruses and their uses in nanotechnology zika virus outbreak rapid spread of emerging zika virus in the pacific area the evolutionary genetics of viral emergence viral adaptation to host: a proteome-based analysis of codon usage and amino acid preferences patterns of evolution and host gene mimicry in influenza and other rna viruses cg dinucleotide suppression enables antiviral defence targeting non-self rna virus-host coevolution: common patterns of nucleotide motif usage in flaviviridae and their hosts evidence of a direct evolutionary selection for strong folding and mutational robustness within hiv coding regions universal evolutionary selection for high dimensional silent patterns of information hidden in the redundancy of viral genetic code exceptional motifs in different markov chain models for a statistical analysis of dna sequences comparison of methods of detection of exceptional sequences in prokaryotic genomes on avoided words, absent words, and their application to biological sequence analysis, algor avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes evolutionary role of restriction/modification systems as revealed by comparative genome analysis computational dna sequence analysis restriction-modification systems interplay causes 
avoidance of gatc site in prokaryotic genomes molecular evolution of bacteriophages: evidence of selection against the recognition sites of host restriction enzymes the significance of distance and orientation of restriction endonuclease recognition sites in viral dna genomes avoidance of recognition sites of restriction-modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses over-and under-representation of short oligonucleotides in dna sequences forbidden penta-peptides linking virus genomes with host taxonomy infectious clone plasmid of a thai-strain zika virus and its fluorescent reporter system for high-throughput assay and vaccine development a simplified positive-sense-rna virus construction approach that enhances analysis throughput multi-color fluorescent reporter dengue viruses with improved stability for analysis of a multi-virus infection ribosomal frameshifting and transcriptional slippage: from genetic steganography and cryptography to adventitious use translational accuracy and the fitness of bacteria an integrated approach reveals regulatory controls on bacterial translation elongation gradients in nucleotide and codon usage along escherichia coli genes selection for reduced translation costs at the intronic ' end in fungi multiple roles of the coding sequence ' end in gene expression regulation non-ltr retrotransposons encoding a restriction enzyme-like endonuclease in vertebrates predicting rad-seq marker numbers across the eukaryotic tree of life lifespan of restriction-modification systems critically affects avoidance of their recognition sites in host genomes one recognition sequence, seven restriction enzymes, five reaction mechanisms rebase-a database for dna restriction and modification: enzymes, genes and genomes protection from secondary dengue virus infection in a mouse model reveals the role of serotype cross-reactive b and t cells characterization of lethal zika virus infection in ag mice 
we are grateful to the anonymous referees for comments that greatly helped in improving this paper. the work of y.z. was supported by the israeli ministry of science, technology and space and by the edmond j. safra center for bioinformatics at tel-aviv university. the animal research ethics committee at utah state university approved this research. supplementary data are available at dnares online. key: cord- -n ylgqfu authors: giri, rajanish; bhardwaj, taniya; shegane, meenakshi; gehi, bhuvaneshwari r.; kumar, prateek; gadhave, kundlik; oldfield, christopher j.; uversky, vladimir n. title: when darkness becomes a ray of light in the dark times: understanding the covid- via the comparative analysis of the dark proteomes of sars-cov- , human sars and bat sars-like coronaviruses date: - - journal: biorxiv doi: . / . . . 
sha: doc_id: cord_uid: n ylgqfu the recently emerged coronavirus designated as sars-cov- (also known as novel coronavirus ( -ncov) or wuhan coronavirus) is the causative agent of coronavirus disease (covid- ), which is now spreading rapidly throughout the world. more than , , cases of sars-cov- infection and more than , covid- -associated mortalities had been reported worldwide at the time of writing, and these numbers are increasing every passing hour. the world health organization (who) has declared the sars-cov- spread a global public health emergency and has recognized covid- as a pandemic. the multiple sequence alignment data correlated with the already published reports on the sars-cov- evolution and indicated that this virus is closely related to the bat severe acute respiratory syndrome-like coronavirus (bat sars-like cov) and the well-studied human sars coronavirus (sars cov). the disordered regions in viral proteins are associated with viral infectivity and pathogenicity. therefore, in this study, we have exploited a set of complementary computational approaches to examine the dark proteomes of sars-cov- , bat sars-like, and human sars covs by analysing the prevalence of intrinsic disorder in their proteins. according to our findings, the sars-cov- proteome contains very significant levels of structural order. in fact, except for nucleocapsid, nsp , and orf , the vast majority of sars-cov- proteins are mostly ordered proteins containing fewer intrinsically disordered protein regions (idprs). however, the idprs found in sars-cov- proteins are functionally important. for example, cleavage sites in its replicase ab polyprotein are found to be highly disordered, and almost all sars-cov- proteins were shown to contain molecular recognition features (morfs), which are intrinsic disorder-based protein-protein interaction sites that are commonly utilized by proteins for interaction with specific partners. 
the results of our extensive investigation of the dark side of the sars-cov- proteome will have important implications for the structural and non-structural biology of sars or sars-like coronaviruses. significance the infection caused by a novel coronavirus (sars-cov- ) that causes severe respiratory disease with pneumonia-like symptoms in humans is responsible for the current covid- pandemic. no in-depth information on structures and functions of sars-cov- proteins is currently available in the public domain, and no effective anti-viral drugs and/or vaccines have been designed for the treatment of this infection. our study provides the first comparative analysis of the order- and disorder-based features of the sars-cov- proteome relative to human sars and bat cov that may be useful for structure-based drug discovery. intrinsically disordered proteins (idps) and intrinsically disordered protein regions (idprs)), in order to better understand an interplay between the ordered and disordered components of the proteome. in the classical structure-function paradigm, it is believed that a unique, stable, and well-defined -dimensional structure is a prerequisite for a protein to accomplish its unique biological function. although this notion dominated scientific minds for over a hundred years, eventually the idea of functional intrinsic disorder in proteins came to the attention of structural biologists. according to this "heretic" viewpoint, a noticeable amount of biologically active proteins (or protein regions) fail to fold into well-defined structures and instead remain disordered, existing as highly dynamic ensembles of rapidly interconverting conformations under physiological conditions. these proteins and protein regions are now known as intrinsically disordered proteins (idps) and intrinsically disordered protein regions (idprs), respectively. 
the propensity of a protein to be functional yet intrinsically disordered (like the propensity of an ordered protein to form a unique biologically active structure) is determined by its amino acid sequence [ ] [ ] [ ] . idps exhibit their biological functions in numerous biological processes commonly associated with cellular signalling, gene regulation, and control by interacting with their physiological partners [ ] [ ] [ ] [ ] [ ] . these functions of idps and idprs are regulated by their protein-protein, protein-rna, and protein-dna interactions [ , ] . molecular recognition features (morfs) are the regions in idps implicated in the regulation of idp function by protein-protein interactions, and they serve as the primary stage in molecular recognition. zhang and colleagues have reported the genomic sequence of sars-cov- with genbank accession number nc_ having , nucleotides. the virus was isolated from the bronchoalveolar lavage fluid of a patient and went through a cycle of renaming, from novel wuhan seafood market pneumonia virus to deadly wuhan coronavirus, to the novel coronavirus ( -ncov) or the wuhan- novel coronavirus (wuhan- -ncov), and was eventually named sars-cov- by the who [ ] . it is known that idps/idprs are present in all three kingdoms of life, and viral proteins often contain unstructured regions that have been strongly correlated with their virulence [ ] [ ] [ ] [ ] . in this report, we investigated the disordered side of the sars-cov- proteome using a complementary set of computational approaches to check the prevalence of idprs in various sars-cov- proteins and to shed some light on their disorder-related functions. we have also comprehensively analyzed idprs among the closely related viruses, such as human sars cov and bat sars-like cov. furthermore, we have identified protein functions related to protein-protein interactions, rna binding, and dna binding from all three viruses. 
since these three viruses are closely related, our study provides important means for a better understanding of the sequence and structural peculiarities of their evolution. we believe that this study will help structural and non-structural biologists to design and perform experiments for a more in-depth understanding of this virus and its pathogenicity. this also will have long-term implications for developing new drugs or vaccines against this currently unpreventable infection. sequence retrieval and multiple sequence alignment. the protein sequences of bat cov (sars-like) and human sars cov were retrieved from uniprot (uniprot ids for individual proteins are listed in table ). the translated sequences of sars-cov- proteins were obtained from the genbank database [ ] (accession id: nc_ . ). we used these sequences for performing multiple sequence alignment (msa) and predicting the idprs. we used clustal omega [ ] for protein sequence alignment and esprit . [ ] for constructing the aligned images. for the prediction of the intrinsic disorder predisposition of cov proteomes, we used multiple predictors, such as members of the pondr ® (predictor of natural disordered regions) family including pondr ® vsl [ ] , pondr ® vl [ ] , pondr ® fit [ ] , and pondr ® vlxt [ ] , as well as the iupred platform for predicting long (≥ residues) and short idprs (< residues) [ ] . these computational tools predict residues/regions that do not have the tendency to form an ordered structure. residues with disorder scores exceeding the threshold value of . are considered intrinsically disordered residues, whereas residues with predicted disorder scores between . and . are considered flexible. the overall predicted percent of intrinsic disorder (ppid) in a query protein was calculated for every protein of all three viruses from the outputs of six predictors. the detailed methodology has been given in our previous reports [ , ] . and disopred [ ] . 
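the ppid calculation described above can be sketched as follows. this is only a minimal illustration, not the authors' pipeline: it assumes that each predictor returns one score per residue, that a residue counts as disordered when its score strictly exceeds the threshold, and that the mean ppid is the plain average of the per-predictor ppid values. the function names and the example threshold of 0.5 are hypothetical; the threshold is a required argument because its value is not legible in this text.

```python
from statistics import mean


def ppid(scores, threshold):
    """Percent of residues whose disorder score exceeds `threshold`.

    `scores` is a list of per-residue disorder scores from one predictor.
    """
    if not scores:
        raise ValueError("scores must contain at least one residue")
    disordered = sum(1 for s in scores if s > threshold)
    return 100.0 * disordered / len(scores)


def mean_ppid(scores_by_predictor, threshold):
    """Average the per-predictor PPID values for one protein.

    `scores_by_predictor` maps a predictor name to its per-residue scores.
    """
    if not scores_by_predictor:
        raise ValueError("at least one predictor is required")
    return mean(ppid(s, threshold) for s in scores_by_predictor.values())


# Toy scores for a four-residue protein; the predictor names are placeholders.
example = {
    "predictor_a": [0.1, 0.6, 0.7, 0.4],
    "predictor_b": [0.9, 0.9, 0.9, 0.1],
}
example_mean_ppid = mean_ppid(example, threshold=0.5)  # illustrative threshold
```

keeping the threshold explicit makes it easy to rerun the same calculation under the cutoff actually used by each predictor.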
the protein residues with anchor, morfpred, and disopred scores above the threshold value of . and a morfchibi_web score above the threshold value of . are considered morf regions. idprs facilitate interactions with rnas and dnas and regulate many cellular functions [ ] . thus, for predicting the dna binding residues in cov proteins, we have used two online servers: drnapred [ ] and disordpbind [ ] . for rna binding residues, we used pprint (prediction of protein rna-interaction) [ ] and disordpbind [ ] . the mean values of the predicted percentage of intrinsic disorder scores (mean ppids), which were obtained by averaging the predicted disorder scores from six disorder predictors (supplementary table - ) for each protein of sars-cov- as well as human sars and bat cov, are represented in table . * these sequences are based on genome annotations conducted by wu et al. [ ] . proteins and their ppids are coloured to reflect their disorder status (ordered: blue, moderately disordered: pink, highly disordered: red). figures a, b , and c are d-disorder plots generated for sars-cov- , human sars and bat cov proteins, respectively, and represent the ppid pondr-fit vs. ppid mean plots. based on their predicted levels of intrinsic disorder, proteins can be classified as highly ordered (ppid < %), moderately disordered ( % ≤ ppid < %) and highly disordered (ppid ≥ %) [ ] . from the data in table , figures a, b , and c, as well as the ppid-based classification, we conclude that the nucleocapsid protein from all three strains of coronavirus possesses the highest percentage of disorder and is classified as a highly disordered protein. orf b protein in bat cov, orf protein in sars-cov- , human sars, and bat cov, and orf b protein in human sars and sars cov belong to the class of moderately disordered proteins. 
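the three-way ppid classification stated above can be written as a small helper. this is a sketch under stated assumptions: the two cutoffs are passed in as arguments because their values are not legible in this text, the lower class boundary is inclusive as in the inequalities above, and the values 10 and 30 used in the example are placeholders only.

```python
def classify_by_ppid(ppid_value, low_cutoff, high_cutoff):
    """Assign a protein to a disorder class from its PPID (in percent).

    highly ordered:        ppid <  low_cutoff
    moderately disordered: low_cutoff <= ppid < high_cutoff
    highly disordered:     ppid >= high_cutoff
    """
    if not 0.0 <= ppid_value <= 100.0:
        raise ValueError("ppid_value must lie between 0 and 100")
    if low_cutoff >= high_cutoff:
        raise ValueError("low_cutoff must be smaller than high_cutoff")
    if ppid_value < low_cutoff:
        return "highly ordered"
    if ppid_value < high_cutoff:
        return "moderately disordered"
    return "highly disordered"


# Placeholder cutoffs, chosen only to exercise the three branches.
labels = [classify_by_ppid(v, low_cutoff=10, high_cutoff=30) for v in (5, 10, 45)]
```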
in contrast, the structural proteins spike glycoprotein (s), envelope protein (e), and membrane protein (m), as well as the accessory proteins orf a, orf a, orf (orf a and orf b in the case of human sars), of all three strains of coronavirus are ordered proteins. orf and orf proteins also belong to the class of ordered proteins. in the ch-cdf plot of the proteins of (d) sars-cov- , (e) human sars, and (f) bat cov, the y coordinate of each protein spot signifies the distance of the corresponding protein from the boundary in the ch plot, and the x coordinate corresponds to the average distance of the cdf curve for the respective protein from the cdf boundary. in order to further investigate the nature of the disorder in proteins of sars-cov- , human sars, and bat cov, we utilized the combined ch-cdf tool that uses the outputs of two binary classifiers of disorder, the charge hydropathy (ch) plot and the cumulative distribution function (cdf) plot. this provided a more detailed characterization of the global disorder predisposition of the query proteins and their classification according to the disorder "flavours". the ch plot is a linear classifier that differentiates between proteins that are predisposed to possess extended disordered conformations, which include random coils and pre-molten globules, and proteins that have compact conformations (ordered proteins and molten globule-like proteins). the other binary predictor, cdf, is a nonlinear classifier that uses the pondr ® vlxt scores to discriminate ordered globular proteins from all disordered conformations, which include native molten globules, pre-molten globules, and random coils. 
the ch-cdf plot can be divided into four quadrants: q (bottom right quadrant) is an area of ch-cdf phase space that is expected to include ordered proteins; q (bottom left quadrant) includes proteins predicted to be disordered by cdf and compact by ch (i.e., native molten globules and hybrid proteins containing high levels of both ordered and disordered regions); q (top left quadrant) contains proteins that are predicted to be disordered by both ch and cdf analysis (i.e., highly disordered proteins with extended disorder); and q (top right quadrant) contains proteins that are disordered according to ch but ordered according to cdf analysis [ ] . figures d, e and f represent the ch-cdf analysis of proteins of sars-cov- , human sars, and bat cov and show that all the proteins are located within the two quadrants q and q . the ch-cdf analysis leads to the conclusion that all proteins of sars-cov- , human sars, and bat cov are ordered except the nucleocapsid protein, which is predicted to be disordered by cdf but ordered by ch and hence lies in q . molecular recognition features (morfs) are short interaction-prone disordered regions found within idps/idprs that undergo a disorder-to-order transition upon binding to their partners [ , ] . these regions are important for protein-protein interactions and may initiate an early step in molecular recognition [ ] . in this study, we have analyzed and compared morfs (protein-binding regions) in sars-cov- with those in human sars and bat cov. the results of this analysis are summarized in table , which shows that most of the sars-cov- proteins contain at least one morf, indicating that disorder does play an important role in the functionality of these viral proteins. in addition to protein-protein interactions/protein-binding functions, idps and idprs also mediate functions by facilitating interactions with nucleotides such as dna and rna [ , ] . 
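the quadrant assignment of the ch-cdf plot described above can be illustrated with a short function. this is a sketch under an assumed sign convention that matches the quadrant positions given in the text: a non-negative x value (cdf distance) means ordered by cdf, and a non-negative y value (ch distance) means extended disorder by ch. the treatment of points lying exactly on a boundary and the function name are assumptions, and quadrants are labelled by position because the quadrant numbers are not legible here.

```python
def ch_cdf_quadrant(cdf_distance, ch_distance):
    """Return the CH-CDF quadrant for one protein.

    cdf_distance: x coordinate, average distance of the CDF curve from the
                  CDF boundary (assumed >= 0 for proteins ordered by CDF).
    ch_distance:  y coordinate, distance from the CH boundary
                  (assumed >= 0 for proteins with extended disorder by CH).
    """
    ordered_by_cdf = cdf_distance >= 0
    extended_by_ch = ch_distance >= 0
    if ordered_by_cdf and not extended_by_ch:
        return "bottom right: ordered"
    if not ordered_by_cdf and not extended_by_ch:
        return "bottom left: disordered by cdf, compact by ch"
    if not ordered_by_cdf and extended_by_ch:
        return "top left: disordered by both ch and cdf"
    return "top right: disordered by ch, ordered by cdf"
```

under this convention, a protein that is disordered by cdf but compact by ch, as reported above for the nucleocapsid protein, falls in the bottom left quadrant.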
therefore, we have used a combination of two different online servers for locating protein residues that show a propensity to bind dna as well as rna. nucleotide-binding residues in proteins of three studied coronaviruses are listed in supplementary coronaviruses encode four structural proteins, namely, spike (s), envelope (e) glycoprotein, membrane (m), and nucleocapsid (n) proteins, which are translated from the last ~ kb nucleotides and form the outer cover of the covs, encapsulating their single-stranded genomic rna. s protein is a large multifunctional protein forming the exterior of the cov particles [ , ] . it forms surface homotrimers and contains two distinct ectodomain regions known as s and s . in some covs, the s protein is actually cleaved into these subunits, which are joined non-covalently, whereas an additional proteolytic cleavage within the n-terminal part of the s subunit that takes place upon virus endocytosis generates the spike protein s '. subunit s initiates viral infection by binding to the host cell receptors, s acts as a class i viral fusion protein that mediates fusion of the virion and cellular membranes and thereby promotes viral entry into the host cells, whereas s ' serves as a viral fusion peptide [ , ] . spike binds to the virion m protein through its c-terminal transmembrane region [ ] . as a class i viral fusion protein, s protein binds to the specific surface receptor angiotensin-converting enzyme (ace ) on the host cell plasma membrane through its n-terminal receptor-binding domain (rbd) and mediates viral entry into host cells [ ] . the s protein consists of an n-terminal signal peptide, a long extracellular domain, a single-pass transmembrane domain, and a short intracellular domain [ ] . a . 
Å resolution structure (pdb id: acc) of s protein from human sars complexed with its host binding partner ace has been obtained by cryo-electron microscopy (cryo-em the biophysical analysis reported in a previous study has also revealed that the s protein from sars-cov- has a higher binding affinity to ace than the s protein from human sars [ ] . which has been calculated by averaging the disorder scores from all six predictors is represented by a short-dot line (sky-blue line) in the graph. the light sky-blue shadow region signifies the mean error distribution. the residues missing in the pdb structure or the residues for which a pdb structure is unavailable are represented by the grey-coloured area in the corresponding graphs. (e) aligned disorder profiles generated for spike glycoprotein from sars-cov- (black line), human sars (red line), and bat cov (green line) based on the outputs of the pondr ® vsl . msa analysis among all three coronaviruses demonstrates that the s protein of sars-cov- has a . % sequence identity with bat cov and . % identity with human sars (supplementary figure s a) . all three s proteins are found to have a conserved c-terminal region. however, the n-terminal regions of s proteins display noticeable differences. given that there is significant sequence variation in the rbd located at the n-terminal region of s protein, this might be the reason behind the variation in virulence and in receptor-mediated binding and entry into the host cell. according to our intrinsic disorder propensity analysis, the s proteins from all three covs analysed in this study are highly structured, as their predicted disorder propensity lies below % ( table ). in fact, the mean ppid scores of sars-cov- , human sars cov, and bat cov are calculated to be . %, . %, and . %, respectively. figures b, c , and d represent the intrinsic disorder profiles of s proteins from sars-cov- , human sars and bat cov obtained from six disorder predictors. 
finally, figure e shows aligned disorder profiles of s proteins from these covs and illustrates remarkable similarity in their disorder propensity, especially in the c-terminal region. it is of interest to map known functional regions of s proteins to their corresponding disorder profiles. the maturation of s protein requires a specific posttranslational modification (ptm), proteolytic cleavage, which happens in two stages. first, host cell furin or another cellular protease nicks the s precursor to generate s and s proteins, whereas the second cleavage, which takes place after viral attachment to host cell receptors, leads to the release of a fusion peptide, generating the s ' subunit. in human sars cov, the first and second cleavage sites are located at residues r and r , respectively, whereas in bat cov, the corresponding cleavage sites are residues r and r . as follows from figure , these cleavage sites are located within the idprs. in human sars cov s protein, the fusion peptide (residues - ) is located within a flexible region and is characterized by a mean disorder score of . ± . . similarly, in bat cov s protein, the fusion peptide (residues - ) has a mean disorder score of . ± . . s protein contains two heptad repeat regions that form a coiled-coil structure during viral and target cell membrane fusion, assuming a trimer-of-hairpins structure needed for the functional positioning of the fusion peptide. in human sars cov s protein, the heptad repeat regions are formed by residues - and - , which have mean disorder scores of . ± . and . ± . , respectively. an analogous situation is observed for the s protein from bat cov, where these heptad repeat regions are positioned at residues - ( . ± . ) and - ( . ± . ). another functional region found in s proteins is the receptor-binding domain (residues - and - in human sars cov and bat cov, respectively) containing a receptor-binding motif responsible for interaction with human ace . 
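the region-level values quoted above (a mean disorder score with a ± term for a stretch of residues) can be reproduced from a per-residue profile along the following lines. this is a minimal sketch: it assumes residue numbering is 1-based and inclusive, and it assumes the ± term is the sample standard deviation over the residues of the region, which the text does not state (it could equally be a standard error or the spread between predictors). the function name is hypothetical.

```python
from statistics import mean, stdev


def region_disorder(scores, start, end):
    """Mean and sample standard deviation of disorder scores for a region.

    `scores` holds one disorder score per residue; `start` and `end` are
    1-based, inclusive residue positions.
    """
    if not 1 <= start <= end <= len(scores):
        raise ValueError("region must lie within the protein")
    window = scores[start - 1:end]
    spread = stdev(window) if len(window) > 1 else 0.0
    return mean(window), spread


# Toy profile: residues 2-3 form the region of interest.
region_mean, region_sd = region_disorder([0.2, 0.4, 0.6, 0.8], 2, 3)
```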
in the s protein of human sars cov, the receptor-binding motif (residues - ) is not only characterized by structural flexibility, possessing a mean disorder score of . ± . , but also contains a disordered region (residues - ). since s protein is known as spike glycoprotein, it contains numerous glycosylation sites. due to the rather close similarity of the disorder profiles of the s proteins analysed here, we can assume that all the aforementioned indications of the functional importance of disordered and flexible regions in s proteins from sars cov and bat cov are also applicable to the sars-cov- s protein. finally, table shows that the s protein from sars-cov- contains one morf region at its c-terminus (residues - ) by morfchibi_web, two morf regions (residues - and residues - ) by morfpred, and one morf region at its n-terminus (residues - ) by disopred . these results indicate that intrinsic disorder is important for its interaction with binding partners. interestingly, the n-terminal region of s protein (residues - ) from all three viruses is observed to be a disorder-based protein binding region by two predictors (morfpred and disopred ). the n-terminal morf may play a role in viral interaction with the host receptor, and the c-terminal morf in m protein interaction and viral assembly. moreover, the morf regions mainly lie in the n- and c-terminal regions, suggesting a possible role during cleavage as well. in addition to protein-binding regions, s protein also shows many nucleotide-binding residues. tables , , and show that numerous rna binding residues were predicted by pprint in all three viruses and a single rna binding residue was predicted by disordpbind in human sars. further, drnapred and disordpbind predicted the presence of many dna binding residues in the s protein of all three viruses. 
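residue ranges such as the morf regions listed above can be derived from a per-residue score profile by collecting contiguous runs of residues that score above a predictor's threshold. the sketch below illustrates that general idea only; it is not the internal algorithm of morfchibi_web, morfpred, anchor, or disopred. the threshold is an argument because its value is not legible in this text, and the `min_length` filter is a hypothetical convenience.

```python
def regions_above(scores, threshold, min_length=1):
    """Return (start, end) residue ranges whose scores exceed `threshold`.

    Positions are 1-based and inclusive. Runs shorter than `min_length`
    residues are discarded.
    """
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = position  # a new run begins here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


# Toy profile with two runs above an illustrative threshold of 0.5.
toy_profile = [0.1, 0.8, 0.9, 0.2, 0.7, 0.7, 0.7]
toy_regions = regions_above(toy_profile, threshold=0.5)
```

running the same function on the output of each predictor and comparing the resulting ranges is one simple way to see which regions are supported by more than one tool.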
these results point to roles of the s protein in molecular recognition (protein-protein interaction, rna binding, and dna binding), such as interaction with the host cell membrane and subsequent viral infection. therefore, the identified idps/idprs and the residues/regions of s protein that are crucial for molecular recognition can be targeted for disorder-based drug discovery. envelope (e) protein is a small, multifunctional inner membrane protein that plays an important role in the assembly and morphogenesis of virions in the cell [ ] [ ] [ ] . e protein consists of two ectodomains associated with the n- and c-terminal regions, and a transmembrane domain. it homo-oligomerizes to form pentameric membrane-destabilizing transmembrane (tm) hairpins that form a pore necessary for its ion channel activity [ ] . figure a shows the nmr structure (pdb id: mm ) of residues - of the human sars envelope glycoprotein [ ] . msa results illustrate ( figure b ) that this protein is highly conserved, with only three amino acid substitutions in the e protein of sars-cov- , giving it % sequence similarity with human sars and bat cov. bat cov shares % sequence identity with human sars. the mean ppids calculated for sars-cov- , human sars, and bat cov e proteins are . %, . %, and . %, respectively ( table ) . the e protein is found to have a reasonably well-predicted structure. our predictions suggest that the residues of the n- and c-termini display a higher tendency for disorder. the last hydrophilic residues (residues - ) have been reported to adopt a random-coil conformation with and without the addition of lipid membranes [ ] . the literature suggests that the last four amino acids of the c-terminal region of e protein, which contain a pdz-binding motif, are involved in protein-protein interactions with the tight junction protein pals . 
our results support the literature, as we identified a long n-terminal region of approximately residues as a disorder-based protein binding region in all three viruses (see table , supplementary table and ). pals is involved in maintaining the polarity of epithelial cells in mammals [ ] . the respective graphs in figures c, d , and e show the predicted intrinsic disorder profiles for e proteins of sars-cov- , human sars, and bat cov. we speculate that the disordered region content may be facilitating interactions with other proteins as well. in agreement with this hypothesis, table shows that in the e protein from sars-cov- , the c-terminal domain serves as a protein-binding region. we found that residues - form a long morf in the e proteins of all three viruses, as predicted by morfchibi_web ( table , supplementary table and ). as mentioned above, these randomly coiled binding residues at the c-terminus may gain structure while assisting the protein-protein interactions mediated by e protein. one more morf region (residues [ ] [ ] [ ] [ ] [ ] in the transmembrane domain was observed by disopred in the e protein of all three viruses. since these residues are part of the ion channel, we speculate that they make specific interactions and may guide the specific functions of the ion channel activity. a few rna binding residues (by pprint and disordpbind) and several dna binding residues (by drnapred) are predicted for e protein in all three viruses. assembly by interacting with the nucleocapsid (n) and e proteins [ ] [ ] [ ] . protein m interacts specifically with a short viral packaging signal containing coronavirus rna in the absence of n protein, thereby highlighting an important nucleocapsid-independent viral rna packaging mechanism inside the host cells [ ] . it gains high-mannose n-glycans in the er, which are subsequently modified into complex n-glycans in the golgi complex. glycosylation of m protein is observed to be not essential for virion fusion in cell culture [ , ] . 
cryo-em and tomography data indicate that m forms two distinct conformations, a compact m protein having high flexibility and low spike density, and an elongated m protein having a rigid structure and a narrow range of membrane curvature [ ] . some regions of m glycoproteins might serve as important dominant immunogens. although no structural information is available for the full-length m protein as yet, a short peptide of the membrane glycoprotein (residues - ) from human sars cov was co-crystallized with a complex between the a- alpha chain of the hla class i histocompatibility antigen and β microglobulin (pdb id: i g) [ ] . figure a shows that within this complex, the co-crystallized m protein region exists in an extended conformation. the m protein of sars-cov- has a sequence similarity of . % with bat cov and . % with human sars m proteins ( figure b ). our analysis revealed that the intrinsic disorder levels in m proteins of sars-cov- , human sars cov, and bat cov are relatively low, since these proteins show ppid values of . %, . %, and . %, respectively. this is in line with the previous publication by goh et al. on human sars hku , where they found a mean ppid of % using additional predictors such as topidp and foldindex along with the predictors used in our study [ ] . figures c, d , and e represent per-residue disorder profiles generated for m proteins of sars-cov- , human sars cov, and bat cov and show that, with the exception of their n- and c-terminal regions, these proteins are mostly ordered. the last residues of mers-cov m protein are important for intracellular trafficking and contain a determinant that localizes it to the golgi network [ ] . our results in table illustrate that the disordered c-tail of the m protein is predicted to have a disorder-based protein-binding region and therefore can serve as a binding site for the specific partner required for its localization inside the host cell. 
a long morf region (residues - ) at the c-terminus of m protein in all three viruses was observed by morfchibi_web. two morf regions (one at the n-terminus (residues - ) and one at the c-terminus (residues - )) were observed by disopred in human sars and bat cov. however, a single morf (residues - ) was observed in sars-cov- by disopred . morfpred also predicts a short morf at the c-terminus of sars-cov- (residues - ), human sars (residues - ), and bat cov (residues - ) ( table , supplementary tables and ) . furthermore, the m protein from all three viruses displays a strong tendency to bind rna (as predicted by pprint and disordpbind) and dna (as predicted by drnapred and disordpbind) (see supplementary tables , , and ). these features of the m protein of covs (idprs and morfs at the c-terminus, and molecular recognition) are consistent with its crucial role in interaction with the n and e proteins for viral assembly. nucleocapsid (n) protein: nucleocapsid (n) protein is one of the major viral proteins, playing several significant roles in transcription and virion assembly of coronaviruses [ ] . it binds to viral genomic rna, forming a ribonucleoprotein core required for rna encapsidation during viral particle assembly [ ] . sars-cov virus-like particle (vlp) formation has been reported to depend upon either m and e proteins or m and n proteins. for the effective production and release of vlps, co-expression of e or n proteins with m protein is necessary [ ] . the n protein of human sars consists of two structural domains, the n-terminal rna-binding domain (ntd: - residues) and the c-terminal dimerization domain (ctd: - residues), with a disordered patch in between these domains. n protein has been demonstrated to bind viral rna using both ntd and ctd [ ] . figure a displays the nmr solution structure of the ntd of human sars cov nucleocapsid protein ( - residues) (pdb id: ssk) [ ] . 
figure a shows an x-ray crystal structure of the ctd of human sars cov nucleocapsid protein ( - residues) (pdb id: gib) [ ] . a model of the domain organization of the n protein from sars-cov- is shown in figure b . the amino acid-long n protein of sars-cov- shows a percentage identity of . % with the n protein of bat cov and . % with the human sars n protein (supplementary figure s b) . our analysis revealed that the n proteins of coronaviruses contain the highest levels of intrinsic disorder (see figure and table ). in fact, n proteins from sars-cov- , human sars cov, and bat cov are characterized by mean ppids of . %, . %, and . %, respectively. in accordance with the previously evaluated intrinsic disorder predisposition [ ] , n protein is highly disordered in all three sars viruses analysed in this study ( table ) . graphs in figures c, d , and e depict the disorder profiles of sars-cov- , human sars cov, and bat cov nucleocapsid proteins and show that their n- and c-terminal regions are completely disordered, and that all three proteins also contain a central unstructured segment. as expected, the intrinsic disorder predisposition of the n protein of sars-cov- is remarkably similar to that of the n protein of human sars cov as reported in a previous study [ ] . this is further supported by figure f , where pondr ® vsl -generated disorder profiles of these three proteins are overlaid to show almost complete coincidence of their major disorder-related features. it is clear that in n proteins, the n- and c-termini and a long central segment are completely disordered. figure c shows that in the n protein from sars-cov- , residues - , - , - , - , and - are found to be disordered. many of these residues lie within the ntd and ctd regions and, due to their structural plasticity, were not crystallized in the human sars cov n protein. 
sars-cov- has a disordered segment from - residues, while human sars is predicted to have an unstructured segment from - residues. overall, all three n proteins are found to be highly disordered. the n protein from human sars cov has one phosphorylation site (residue s ) and several regions with compositional biases, such as ser-rich (residues - ), poly-leu, poly-gln, and poly-lys (residues - , - , and - ) regions, all predicted to be disordered. similarly, in the n protein from bat cov, s is phosphorylated, and this protein has ser-rich, poly-leu, and poly-lys regions (residues - , - , and - , respectively), all of which are disordered. the n protein has been reported to use its central disordered region for interaction with the m protein and hnrnp a , and for self n-n interaction [ ] [ ] [ ] . the middle flexible region is also responsible for its rna-binding activity [ ] . deletion of - residues, - residues, and - residues of n abolishes its multimerization, rna-binding capacity, and hnrnp a interactions, respectively. supplementary tables and , and table , show that the n protein is heavily decorated with numerous morfs, suggesting that this protein is a promiscuous binder. long disorder-based protein-binding regions at the n- and c-termini of the n protein of all three viruses were observed by all four predictors (morfchibi_web, anchor, morfpred, and disopred ). indeed, this is the single protein in which we found many morfs as compared with the other structural, non-structural, and accessory proteins of covs. the morfs present in these regions may mediate the abovementioned interactions of the n proteins. figure a represents another important disorder-related functional feature of the n protein. in fact, the ctd homodimer shown there is characterized by a highly intertwined morphology, which is typically a result of binding-induced folding [ ] [ ] [ ] , indicating that a very significant part of the ctd gains structure during dimerization. we identified numerous rna-binding residues in all three viruses using the pprint server.
this finding supports the function of the n protein, as it interacts with genomic rna to form a ribonucleoprotein core, which is a crucial step for rna encapsidation during viral particle assembly. in addition, drnapred and disordpbind predict multiple dna-binding residues for the n protein in sars-cov- , human sars, and bat cov. the long flexible regions (idprs) at the n- and c-termini of the sars-cov- n protein contain long protein-binding as well as nucleotide-binding regions that may have an important role in its interaction with viral rna. these flexible regions can be targeted to inhibit the interaction of the n protein with viral genomic rna. literature suggests that some viral proteins are translated from genes interspersed between the genes of structural proteins. these proteins are known as accessory proteins, and many of them are proposed to be involved in viral pathogenesis [ ] . proteins orf a and orf b. orf a is a multifunctional protein with a molecular weight of ~ kda that has been found to localize in different organelles inside the host cells. also referred to as u , x , and orf , the gene for this protein is present between the s and e genes of the sars-cov genome [ ] [ ] [ ] . the homo-tetrameric complex of orf a has been demonstrated to form a potassium-ion channel on the host cell plasma membrane [ ] . it performs a major function during virion assembly by co-localizing with the e, m, and s viral proteins [ , ] . the orf b protein can be found in the cytoplasm, nucleolus, and outer membrane of mitochondria of the host cells [ , ] . in huh cells, its over-expression has been linked with the activation of ap- via the erk and jnk pathways [ ] . transfection of orf b-egfp leads to cell growth arrest at the g /g phase of vero, , and cos- cells [ ] . orf a induces apoptosis via caspase / directed mitochondrial-mediated pathways, while orf b is reported to affect only the caspase -related pathways [ , ] .
on performing msa, the results of which are shown in figure d , we found that the orf a protein from sars-cov- is slightly evolutionarily closer to the orf a of bat cov ( . %) than to the orf a of human sars cov ( . %). graphs in figures a, b , and c depict the propensity for disorder in the orf a proteins of novel sars-cov- , human sars cov, and bat cov (sars-like), respectively. mean ppids in these orf a proteins are . % (sars-cov- ), . % (human sars), and . % (bat cov (sars-like)). orf a of sars-cov- shows protein-binding regions at its n-terminus (by morfchibi_web (residues - ), morfpred (residues - ), and disopred (residues - )) and at its c-terminus (by morfchibi_web (residues - ) and morfpred (residues - )) (table ). similarly, orf a of human sars and bat cov also shows morfs at the n- and c-termini, as predicted by morfchibi_web and morfpred (supplementary tables and ). these protein-binding regions in orf a may have a role in its co-localization with the e, m, and s viral proteins. apart from morfs, it also displays several nucleotide-binding residues in all three viruses (see supplementary tables , , and ). in fact, this is the maximum number of rna- and dna-binding residues among all the accessory proteins. these results indicate that the idps/idprs of this protein could be utilized in molecular recognition (protein-protein, protein-rna, and protein-dna interactions). according to the intrinsic disorder predisposition analysis of the orf b proteins, their mean ppid values in sars-cov- , human sars cov, and bat cov are %, . %, and . %, respectively, as represented in figures a, b , and c. msa results (figure d ) demonstrate that orf b of sars-cov- is not close to the orf b proteins of human sars and bat cov, having a sequence similarity of only . % and . %, respectively. as can be seen in table , not a single morf was found in orf b of sars-cov- .
however, for human sars we identified three morfs (residues - , - , and - ) and for bat cov one morf at the n-terminus (residues - ) using the morfchibi_web server. protein orf . orf is a short coronavirus protein with just residues. also known as p , this membrane-associated protein serves as an interferon (ifn) antagonist [ ] . it downregulates the ifn pathway by blocking a nuclear import protein, karyopherin α . using its c-terminal residues, orf disrupts the karyopherin import complex in the cytosol and, therefore, hampers the movement of transcription factors like stat into the nucleus [ , ] . it contains a ysel motif near its c-terminal region, which functions in protein internalization from the plasma membrane into the endosomal vesicles [ ] . another study has also demonstrated the presence of orf in endosomal/lysosomal compartments [ , ] . msa results (figure d) demonstrate that sars-cov- orf is closer to the orf protein of human sars cov, with a sequence similarity of . %, than to the orf of bat cov (sars-like) ( . %). novel sars-cov- orf is predicted to be the second most disordered structural protein, with a ppid of . %, and with an especially disordered c-terminal region. our analysis of the intrinsic disorder predisposition using six predictors revealed the mean ppid in the orf proteins of sars-cov- , human sars, and bat cov to be . %, . %, and . %, respectively (table ). graphs in figures a, b and c illustrate that the orf proteins from all three studied coronaviruses are expected to be moderately disordered proteins with high disorder content in their c-terminal regions. these disordered regions are important for the biological activities of orf . as mentioned above, this hydrophilic region contains a lysosomal targeting motif (ysel) and a diacidic motif (ddee) responsible for binding and recognition during translocation [ ] . however, the n-terminal region does not contain noticeable disorder.
the - residues of the n-terminal region of human sars cov orf were shown to be α-helical and embedded in the membrane, although orf is not a transmembrane protein [ ] . a long morf region (residues - in sars-cov- , residues - in human sars, and residues - in bat cov) is also present at the c-terminus of the orf proteins, as tabulated in table and supplementary tables and . no predictor other than morfchibi_web located morfs in this protein. supplementary tables , , and show nucleotide-binding residues in orf of all three viruses. very few rna-binding residues were predicted by pprint and few dna-binding residues by drnapred. orf a and orf b proteins. alternatively called u , orf a is a type i transmembrane protein [ , ] . it has been shown to localize in the er, golgi, and peri-nuclear space. the presence of a krkte motif near the c-terminal region is needed for importing this protein from the er to the golgi apparatus [ , ] . orf a contributes to viral pathogenesis by activating the release of pro-inflammatory cytokines and chemokines, such as il- and rantes [ , ] . in another study, overexpression of bcl-xl in t cells blocked the orf a-mediated apoptosis [ ] . on the other hand, orf b is an integral membrane protein that has been shown to localize in the golgi complex [ , ] . the same reports also confirm the role of orf b as an accessory as well as a structural protein in the sars-cov virion [ , ] . figure d represents the . Å x-ray crystal structure of the - fragment of orf a from human sars cov (pdb id: xak) and demonstrates the compact seven-stranded topology of this protein, which is similar to that of the ig-superfamily members [ ] . importantly, in this crystal structure, residues - constituted the region with missing electron density, indicating high structural flexibility of this segment.
in line with this hypothesis, the nmr solution structure of the - fragment of orf a from human sars cov (pdb id: yo ) showed that residues - are highly disordered [ ] . at the domain level, the structure of the orf a protein includes a signal peptide, a luminal domain, a transmembrane domain, and a short cytoplasmic tail at the c-terminus [ , ] . we found that the -residue-long orf a protein of sars-cov- shares . % and . % sequence identity with the orf a proteins of bat cov and human sars cov, respectively (figure e). on the other hand, orf b of sars-cov- is found to be closer to orf b of human sars than to orf b of bat cov, showing sequence identities of . % and . %, respectively (see figure d ). as can be observed from table , our disorder predisposition analyses resulted in overall ppids for the orf a proteins of . % for sars-cov- , . % for bat cov, and . % for human sars cov. mean ppids estimated for the orf b proteins are . % for sars-cov- , . % for bat cov, and . % for human sars cov. figures a, b , and c represent the residues predisposed for disorder in the orf a proteins of sars-cov- , human sars cov, and bat cov, respectively. table shows that the orf a protein is expected to have several morfs, indicating the potential involvement of this protein in disorder-dependent protein-protein interactions. at the n-terminus, we observed one morf region (residues - ) with the help of disopred in all three viruses. in addition to protein-binding regions, orf a also contains several rna- and dna-binding residues. our analysis also shows that the orf b proteins from all three viruses have low disorder content; likewise, they are not predicted to contain any morf by any of the predictors used in this study (table , supplementary table and ).
although orf b does not contain protein-binding regions, it was found to contain nucleotide (rna and dna) binding regions. figures a, b , and c depict the residues predisposed for disorder in the orf b proteins of sars-cov- , human sars cov, and bat cov, respectively. according to our analysis, both proteins in all three studied coronaviruses have a mostly ordered structure. proteins orf a and orf b. in animals and isolates from early human infections, the orf gene codes for a single orf protein. however, in later infections, more specifically at the middle and late stages, a nucleotide deletion in the orf gene led to the formation of two distinct proteins, orf a and orf b, containing and residues, respectively [ , ] . both proteins have conformations different from that of the longer orf protein. it has been reported that overexpression of orf b resulted in the downregulation of the e protein, while the proteins orf a and orf /orf ab have no effect on the expression of protein e. also, orf /orf ab was found to interact very strongly with proteins s, orf a, and orf a. orf a interacts with the s and e proteins, whereas the orf b protein interacts with the e, m, orf a, and orf a proteins [ ] . the disorder-based protein-binding regions of this protein identified in this study may have an important role in its interaction with other proteins. the orf protein found in early sars-cov- isolates has residues and, according to our analysis, shares a . % sequence identity with the orf protein of bat cov (figure c). furthermore, figures a and b show that there is no intrinsic disorder in the orf proteins from either sars-cov- or bat cov. therefore, these two proteins are predicted to be completely structured, having a mean ppid of . %. in the orf a and orf b proteins of human sars, the predicted disorder is estimated to be . % and . %, respectively (table ). graphs in figures a and b illustrate the presence of some disorder near the n- and c-termini of the orf a and orf b proteins.
table shows the identified morf regions in orf of sars-cov- : three morf regions (residues - , - , and - ) predicted by morfchibi_web and one morf region (residues - ) predicted by disopred . in human sars, the n-termini of both orf a (residues - ) and orf b (residues - ) were found to be morfs by the morfchibi_web server (supplementary table ). further, four protein-binding regions (residues - , - , - , and - ) were identified by the morfchibi_web server in bat cov (supplementary table ). apart from protein-binding regions, orf of sars-cov- , orf a and orf b of human sars, and orf of bat cov also comprise several nucleotide-binding residues (see supplementary tables , , and ). this protein is expressed from an alternative orf within the n gene through a leaky ribosome binding process [ ] . inside the host cells, orf b enters the nucleus by a cell cycle-independent, passive process. this protein was shown to interact with the nuclear export receptor exportin (crm ), using which it translocates out of the nucleus [ ] . our morf analysis shows the presence of disorder-based protein-binding regions in the orf b protein, which may have a role in its interaction with crm and subsequent translocation out of the nucleus. a . Å resolution crystal structure of the orf b protein from human sars cov (pdb id: cme) shows the presence of a dimeric tent-like β-structure along with the central hydrophobic amino acids (figure d). the published structure has a highly polarized distribution of charges, with positively charged residues on one side of the tent and negatively charged residues on the other [ ] . based on the sequence available under accession id nc_ . , the translated protein sequence of orf b has not yet been reported for sars-cov- . however, based on the report by wu and colleagues [ ] , the sequences of sars-cov- are already annotated. therefore, we took the corresponding amino acid sequences from that study and conducted the intrinsic disorder analysis.
according to the msa results shown in figure e , the orf b protein from sars-cov- shares . % identity with human sars and . % identity with bat cov. our idp analysis (table ) shows that orf b from human sars is a moderately unstructured protein with a mean ppid estimated at . %. as depicted in figure a , b, and c, disorder mainly lies near the n-terminal end ( - residues) and near the central region ( - residues), with a well-ordered inner core in the human sars orf b protein. the x-ray crystal structure of orf b has missing electron density for the first residues and for - residues near the central region. this indicates that the corresponding regions are disordered and are difficult to crystallize due to their highly dynamic structural organization. the sars-cov- orf b protein, with a mean ppid of . %, has a predicted disordered segment at the n-terminus ( - residues). orf b of bat cov is shown to have an intrinsic disorder content of . %, comparatively lower than that of the human sars orf b protein. morfs lie in the n-terminal region of the orf b proteins (table , supplementary table and ). in the absence of other viral proteins, its first residues have been demonstrated to induce membranous structures similar to dmvs [ ] . the missing electron density in the n-terminal region of the available crystal structure also suggests that these flexible amino acids are likely to interact with host lipids. the first - residues of sars-cov are identified as a disorder-based protein-binding region that may have a role in its interaction with host lipids and the formation of dmvs. supplementary tables , , and present nucleotide-binding residues for orf b of sars-cov- , human sars, and bat cov. the newly emerged sars-cov- has an orf protein of amino acids. orf of sars-cov- has a % sequence similarity with orf of bat cov strain bat-sl-covzc [ ] .
however, we did not conduct the disorder analysis for orf from the bat-sl-covzc strain, since all analyses reported here relate to a different strain of bat cov (reviewed strain hku - ). therefore, we report here only the results of disorder analysis for the orf protein from sars-cov- , according to which this protein has a mean ppid of . % (see also figure for the disorder profile of orf ). this protein contains a morf from - residues at its n-terminus, as predicted by morfchibi_web. further, we examined its tendency to bind nucleotides and found a few rna-binding sites; however, it does not contain dna-binding residues. protein orf . this is a -amino-acid-long uncharacterized protein of unknown function, which is present in human sars and bat cov. in sars-cov- , orf is a -amino-acid-long protein. according to the msa, orf of sars-cov- has . % identity with human sars and . % identity with bat cov, as represented in figure d . we performed the intrinsic disorder analysis to examine the distribution of disorder predisposition in this protein. figures a, b, and c show the resulting disorder profiles of orf of sars-cov- , human sars cov, and bat cov. although these proteins have calculated mean ppid values of . %, . %, and . %, respectively, figure shows that they have flexible n- and c-terminal regions. this protein can use intrinsic disorder or structural flexibility for protein-protein interactions, since it possesses morfs. it mainly contains morfs in the n- and c-terminal regions, as tabulated in table and supplementary tables and . it was also found to contain several rna- and dna-binding residues (supplementary tables , , and ). these results indicate its vital role in functions related to molecular recognition, such as protein-protein, protein-rna, and protein-dna interactions.
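the pairwise identity values quoted from the msa throughout this section can be illustrated with a short python sketch that counts identical residues in two pre-aligned sequences. the function name, the gap character, and the choice to divide by the number of gap-free aligned columns are assumptions made for illustration; alignment programs differ in the denominator they use, so values computed this way may not match those reported here. the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free aligned columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue               # skip columns containing a gap (assumed convention)
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# hypothetical aligned fragments, not taken from any protein in this study
print(round(percent_identity("MKT-AYL", "MRTQAYL"), 2))
```

dividing instead by the full alignment length, or by the length of the shorter sequence, gives a lower value whenever gaps are present, which is why the convention should be stated alongside any reported identity.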
in coronaviruses, due to ribosomal leakage during translation, two-thirds of the rna genome is processed into two polyproteins: (i) replicase polyprotein a and (ii) replicase polyprotein ab. both contain non-structural proteins (nsp - ) in addition to different proteins required for viral replication and pathogenesis. replicase polyprotein a contains an additional nsp protein of amino acids, the function of which has not yet been investigated. the longer replicase polyprotein ab of amino acids accommodates five other non-structural proteins (nsp - ) [ ] . these proteins assist in er membrane-induced vesicle formation; these vesicles act as sites for replication and transcription. in addition to this, non-structural proteins work as proteases, helicases, and mrna capping and methylation enzymes, crucial for virus survival and replication inside host cells [ , ] . global analysis of intrinsic disorder in the replicase polyprotein ab. table represents the mean ppid scores of non-structural proteins (nsps) derived from the replicase polyprotein ab in sars-cov- , human sars cov, and bat cov. these values were obtained by combining the results from six disorder predictors (see supplementary table s -s ). figures a, b , and c represent the d-disorder plots of the nsps coded by orf ab in sars-cov- , human sars cov, and bat cov, respectively. based on the mean ppid scores in table and figures a, b, c , and taking into account the ppid-based classification [ ] , we conclude that none of the nsps in sars-cov- , human sars cov, and bat cov are highly disordered. the highest disorder was observed for nsp proteins in all three coronaviruses. both nsp and nsp are moderately disordered proteins ( % ≤ ppid ≤ %). we also observed that nsp , nsp , nsp , nsp , nsp , nsp , nsp , nsp , and nsp have less than % disordered residues and hence belong to the category of mostly ordered proteins.
other non-structural proteins, namely, nsp , nsp , nsp , and nsp , have negligible levels of disorder (ppid < %), which indicates that these are highly structured proteins. the results of the ch-cdf analysis of the nsps from sars-cov- , human sars, and bat cov are represented in figures d, e , and f, respectively. it was observed that all the nsps of the three coronaviruses are located within quadrant q of the ch-cdf phase space, indicating that all the nsps are predicted to be mostly ordered. replicase polyprotein ab. the longer replicase polyprotein ab is a , amino acid-long polypeptide, which contains non-structural proteins listed in table . nsp , nsp , and nsp are cleaved by a viral papain-like proteinase (nsp /pl-pro), while the rest of the nsps are cleaved by another viral c-like proteinase, nsp / cl-pro. we mapped the cleavage sites of the replicase ab polyprotein from human sars cov to the disorder profile of this polyprotein. figure represents the results of this analysis by showing zoomed-in regions surrounding all the cleavage sites, with a few flanking residues on both sides. interestingly, we observed that all the cleavage sites are largely disordered, suggesting that intrinsic disorder may have a crucial role in the maturation of individual non-structural proteins. as the nsps of human sars cov are evolutionarily close to the nsps of sars-cov- , we hypothesize that the cleavage sites in the sars-cov- replicase ab polyprotein are also intrinsically disordered or flexible. to shed more light on other implications of idprs, the structural and functional properties of the nsps and their predicted idprs are described below. this protein acts as a host translation inhibitor, as it binds to the s subunit of the ribosome and blocks the translation of cap-dependent mrnas as well as mrnas that use the internal ribosome entry site (ires) [ ] .
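the mapping of polyprotein cleavage sites onto a disorder profile, described earlier in this section, can be sketched in python as follows: for each cleavage position, a window of flanking residues is taken from the per-residue disorder profile and its mean score is reported. the function name, the flank width, and the convention that a site is given as the 1-based index of the residue immediately preceding the cleaved bond are assumptions for this sketch, not the procedure used in this study; the profile in the usage lines is synthetic.

```python
from typing import Dict, List, Tuple


def cleavage_site_windows(
    scores: List[float], cleavage_after: List[int], flank: int = 10
) -> Dict[int, Tuple[int, int, float]]:
    """Map each cleavage site to (window_start, window_end, mean disorder score).

    A site at position p means the bond between residues p and p + 1 (1-based).
    The window covers up to `flank` residues on each side of that bond.
    """
    n = len(scores)
    results = {}
    for pos in cleavage_after:
        if not 1 <= pos < n:
            raise ValueError(f"cleavage position {pos} is outside the sequence")
        lo = max(1, pos - flank + 1)   # first residue of the window (1-based)
        hi = min(n, pos + flank)       # last residue of the window (1-based)
        window = scores[lo - 1:hi]
        results[pos] = (lo, hi, sum(window) / len(window))
    return results


# synthetic 30-residue profile with a flexible stretch in the middle
profile = [0.1] * 10 + [0.9] * 10 + [0.1] * 10
for site, (start, end, mean_score) in cleavage_site_windows(profile, [15], flank=5).items():
    print(site, start, end, round(mean_score, 2))
```

comparing the window means with the mean score of the whole polyprotein is one simple way to ask whether cleavage sites sit in regions that are more flexible than average.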
figure d shows the nmr solution structure (pdb id: gdt) of the human sars nsp protein ( - residues); residues - were not included in this structural analysis [ ] . sars-cov- nsp shares . % and . % sequence identity with nsp s of human sars cov and bat cov, respectively. its n-terminal region is found to be more conserved than the rest of the protein sequence (figure e ). mean ppids of nsp s from sars-cov- , human sars cov, and bat cov are . %, . %, and . %, respectively. figure a , b, and c represent the graphs of predicted per-residue intrinsic disorder propensity of these nsp s. according to the analysis, the following regions are predicted to be disordered: sars-cov- (residues - and - ), human sars cov (residues - and - ), and bat cov (residues - and - ). the nmr solution structure of nsp from human sars revealed the presence of two unstructured segments near the n-terminal ( - residues) and c-terminal ( - residues) regions [ ] . the disordered region ( - residues) at the c-terminus is important for nsp expression [ ] . based on sequence homology with human sars cov nsp , the predicted disordered c-terminal region of sars-cov- nsp may play a critical role in its expression. alanine substitutions at k and h in the c-terminal region of the nsp protein are reported to abolish its binding to the s subunit of the host ribosome [ ] . in line with these data, several morfs are present in the unstructured segments of the nsp proteins. these regions are tabulated in table and supplementary tables and . this protein functions by disrupting the host survival pathway via interaction with the host proteins prohibitin- and prohibitin- [ ] . reverse genetic deletion in the coding sequence of nsp of the sars virus attenuated viral growth and replication only slightly and allowed the recovery of mutant virulent viruses. this indicates the dispensable nature of the nsp protein for sars viruses [ ] .
the sequence identity of the nsp protein from sars-cov- with nsp s of human sars cov and bat cov amounts to . % and . %, respectively (supplementary figure s a). we have estimated the mean ppids of nsp s of sars-cov- , human sars cov, and bat cov to be . %, . %, and . %, respectively (see table ). the per-residue predisposition for intrinsic disorder of nsp s from sars-cov- , human sars cov, and bat cov is depicted in figures a, b , and c. according to this analysis, the following regions in nsp proteins are predicted to be disordered: residues - (sars-cov- ), residues - (human sars), and residues - (bat cov). as listed in table and supplementary tables and , human sars cov does not contain a morf, while sars-cov- and bat cov have an n-terminally located morf region predicted by morfchibi_web. nsp is an almost , -residue-long viral papain-like protease (plp) that affects the phosphorylation and activation of irf and therefore antagonizes the ifn pathway [ ] . it was also demonstrated that nsp works by stabilizing the nf-κb inhibitor, further blocking the nf-κb pathway [ ] . figure d represents the . Å resolution x-ray crystal structure of the catalytic core of the nsp protein from human sars cov (pdb id: fe ), which was obtained by andrew and colleagues [ ] . this structure consists of residues - of nsp . the structure revealed a fold similar to that of a deubiquitinating enzyme, and its in vitro deubiquitinating activity was found to be high [ ] . the nsp protein of sars-cov- contains several substituted residues throughout the protein. it is equally close to the nsp proteins of human sars and bat cov, sharing . % and . % identity, respectively (supplementary figure s b). according to our results, the mean ppids of the nsp proteins of sars-cov- , human sars, and bat cov are . %, . %, and . %, respectively (table ). graphs in figures a, b , and c portray the tendency of the nsp proteins of sars-cov- , human sars, and bat cov for intrinsic disorder.
nsp proteins of all three studied sars viruses were found to be highly structured and characterized by rather similar disorder profiles. this is further supported by figure e , where pondr ® vsl -generated disorder profiles of these three proteins are overlaid to show almost complete coincidence of their major disorder-related features. according to the mean disorder analysis (see figures a, b, and c), nsp proteins are predicted to have the following idprs: sars-cov- ( - , - , - ), human sars ( - , - , - ), and bat cov ( - , - , - ). the first residues in nsp represent a ubiquitin-like globular fold, while - residues form the flexible acidic domain rich in glutamic acid. it is thought to bind and ubiquitinate the viral e protein using the n-terminal acidic domain [ , ] . this unstructured segment has many morfs predicted by the anchor and morfpred servers, which may facilitate protein-protein interactions (table ). interestingly, nsp of all three viruses was found to have the highest number of rna-binding residues (supplementary tables , , and ). nsp has been reported to induce the formation of double-membrane vesicles (dmvs) with the co-expression of the full-length nsp and nsp proteins for optimal replication inside host cells [ ] [ ] [ ] . it localizes to the er membrane when expressed alone but is present in replication units in infected cells. it was observed that the nsp protein contains a tetraspanning transmembrane region with its n- and c-termini in the cytosol [ ] . no crystal or nmr solution structure has been reported for this protein as yet. the nsp protein of sars-cov- has multiple substitutions near the n-terminal region and a quite conserved c-terminus (supplementary figure s c). it is found to be closer to nsp of bat cov ( . % identity) than to human sars nsp ( %). mean ppids of nsp s from sars-cov- , human sars, and bat cov are estimated to be . %, . %, and . %, respectively.
the low level of intrinsic disorder is further illustrated by figures a, b , and c. with ppids around zero, nsp proteins were classified as highly structured proteins, which, however, contain some flexible regions. likewise, table shows the presence of only n- and c-terminal morfs, which possibly assist in the cleavage of the nsp protein from the long polyproteins a and ab. also referred to as cl-pro, nsp works as a protease that cleaves the replicase polyproteins ( a and ab) at major sites [ , ] . an x-ray crystal structure with . Å resolution (pdb id: c o) obtained for human sars cov nsp is shown in figure d . here, the cl-protease is bound to a phenyl-beta-alanyl (s, r)-n-declin type inhibitor. another crystal structure resolved to . Å revealed a chymotrypsin-like fold and a conserved substrate-binding site connected to a novel α-helical fold [ ] . recently, an x-ray crystal structure (resolution . Å) was solved for sars-cov- nsp in complex with an inhibitor n (pdb id: lu ) (figure e ). the nsp protein is found to be highly conserved in all three studied covs. sars-cov- nsp shares a . % sequence identity with human sars nsp and . % with nsp of bat cov (supplementary figure s d). therefore, it is not surprising that our analysis demonstrated identical mean ppid values of . % for nsp s from sars-cov- , human sars, and bat cov (table ). the predicted per-residue intrinsic disorder propensities of sars-cov- , human sars, and bat cov nsp s are presented in figures a, b , and c, respectively. as the graphs depict, nsp s have several flexible regions and an n-terminal idpr of six residues. due to the low flexibility of this protein, only a single morf, predicted by morfchibi_web, is present in the n-terminal region (residues - ) of nsp s of all three viruses (table , supplementary tables and ). further, the identified nucleotide-binding residues in nsp of all three viruses are tabulated in supplementary tables , , and . non-structural protein (nsp ).
nsp protein is involved in blocking er-induced autophagosome/autolysosome vesicle formation that functions in restricting viral production inside host cells. it induces autophagy by activating the omegasome pathway, which is normally utilized by cells in response to starvation. sars nsp leads to the generation of small autophagosome vesicles thereby limiting their expansion [ ] . nsp of sars-cov- is equally close to nsp s from both human sars and bat cov, having a sequence identity of . % ( figure d ). according to our analysis, mean ppids for nsp s are calculated to be . %, . %, and . % for sars-cov- , human sars cov, and bat cov, respectively. figures a, b, and c show the corresponding graphs of intrinsic disorder tendency of nsp s from sars-cov- , human sars cov, and bat cov and demonstrate that these proteins are highly ordered and show low flexibility. as it is a membrane protein, nsp proteins are predicted to have only a single morf near the nterminal region (residues - in sars-cov- , residues - in human sars, and residues - in bat cov) by the disopred server (table , supplementary tables and ) . the role of these protein-binding regions for the induction of autophagy is need to be elucidated. nsp and ) . the ~ kda nsp helps in primaseindependent de novo initiation of viral rna replication by forming a hexadecameric ring-like structure with nsp protein [ , ] . both non-structural proteins and contribute molecules to the ring-structured multimeric viral rna polymerase. site-directed mutagenesis in nsp revealed a d/exd/e motif essential for the in vitro catalysis [ ] . figure d depicts the . Å resolution electron microscopy-based structure (pdb id: nur) of the rdrp-nsp -nsp complex bound to the nsp . the structure identified conserved neutral nsp and nsp binding sites overlapping with finger and thumb domains on nsp of the virus [ ] . we found that nsp of sars-cov- share % sequence identity with nsp of bat cov and . 
% with nsp from human sars (figure e), while sars-cov- nsp is closer to nsp of human sars ( . %) than to nsp of bat cov ( . %) (figure d). due to the high levels of sequence identity, mean ppids of all nsp s were found to be identical and equal to . %. both sars-cov- and human sars nsp proteins were calculated to have a mean ppid of . %, and for nsp of bat cov the mean disorder is predicted to be . %. figures a, b, and c display the intrinsic disorder profiles for nsp s, whereas figures a, b, and c represent the predicted intrinsic disorder propensity of nsp s. as our analysis suggests, nsp s are predicted to be well structured, while nsp s are moderately disordered. nsp s are predicted to have a long idpr (residues - ) in both sars-cov- and human sars, and a slightly shorter idpr in bat cov (residues - ). furthermore, sars-cov nsp uses its n-terminal residues (v , c , v , and v ) to form a hydrophobic core with nsp residues (m , m , l , m , and l ). additionally, h-bonding takes place between nsp q and nsp t residues [ ]. these amino acids are part of the morfs predicted in nsp and nsp proteins. the results are tabulated in supplementary tables , , and .
nsp protein is a single-stranded rna-binding protein [ ]. it might protect rna from nucleases by binding and stabilizing viral nucleic acids during replication or transcription [ ]. our results on the nucleotide-binding tendency of nsp show the presence of several rna-binding and a few dna-binding residues in nsp of sars-cov- , human sars, and bat cov (supplementary tables , , and ). presumed to have evolved from a protease, nsp forms a dimer using its gxxxg motif [ , ]. figure d shows a . Å crystal structure of the homodimer of human sars nsp (pdb id: qz ) that identified a unique oligosaccharide/oligonucleotide fold-like fold not previously reported for other proteins [ ]. here, each monomer contains a cone-shaped β-barrel and a c-terminal α-helix arranged into a compact domain [ ].
nsp of sars-cov- is equally similar to nsp s from both human sars and bat cov, having a percentage identity of . %. differences at three amino acid positions ( , , and ) account for these similarity scores (figure e). as calculated, the mean ppids of nsp s of sars-cov- , human sars cov, and bat cov are . %, . %, and . %, respectively. figures a, b, and c depict the predicted intrinsic disorder propensity in the nsp protein from sars-cov- , human sars, and bat cov. according to our analysis, all three nsp s are rather structured but contain flexible regions. nsp contains conserved residues (r , k , y , r , r , f , k , y , f , k , r , and r ) with positively charged side chains suitable for binding the negatively charged phosphate backbone of rna and with aromatic side chains providing stacking interactions [ ]. these residues are a part of multiple disorder-based binding sites predicted by the morfchibi_web server (table , supplementary tables and ).
nsp performs several functions for sars-cov. it forms a complex with nsp for dsrna hydrolysis in the ′ to ′ direction and activates its exonuclease activity [ ]. it also stimulates the methyltransferase (mtase) activity of nsp required during rna-cap formation after replication [ ]. figure d represents the x-ray crystal structure of the nsp /nsp complex (pdb id: c t) [ ]. in agreement with the results of previous biochemical experimental studies, the structure identified important interactions with the exon (exonuclease domain) of nsp without affecting its n -mtase activity [ , ]. the sars-cov- nsp protein is quite conserved, having a . % sequence identity with nsp of human sars and . % with nsp of bat cov (figure e). mean ppids of all three studied nsp proteins are found to be . %. figures a, b, and c represent disorder profiles of nsp s and indicate the lack of long idprs but the presence of flexible regions in these proteins. furthermore, a morf region (residues - ) was predicted by the morfpred server.
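the sequence identities quoted in this section are derived from the clustal omega alignments. as a minimal sketch of how a pairwise percent identity can be computed from two rows of an alignment, the python function below counts matching columns. the function name, the gap convention (columns in which both rows are gaps are skipped, and a gap opposite a residue counts as a mismatch), and the toy sequences are assumptions made for illustration; they are not taken from this study, and the identity matrix reported by clustal omega may use a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two rows of one alignment.

    Assumed convention (hypothetical, for illustration only):
    columns where both rows hold a gap ('-') are ignored, and a gap
    facing a residue is counted as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")

    compared = 0  # columns that enter the denominator
    matches = 0   # columns with identical residues
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # shared gap column carries no information
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * matches / compared


# Toy rows (not real coronavirus sequences): one substitution in five columns.
toy_identity = percent_identity("MKTAY", "MKSAY")
```

with this convention, a protein that differs from its homolog at only a few aligned positions yields an identity close to, but below, 100 %.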
interestingly, the sars-cov nsp residues f , f , and v form van der waals interactions with many of the nsp amino acids [ ], and one residue (f ) is located in the morf region that we have identified. furthermore, many nucleotide-binding residues are found in all three viruses (supplementary tables , , and ), but the above-mentioned residues are not predicted to interact with dna/rna.
in coronaviruses, nsp is an rna-dependent rna polymerase (rdrp). it carries out both primer-independent and primer-dependent synthesis of viral rna with mn + as its metallic co-factor and viral nsp and as protein co-factors [ ]. as aforementioned, a . Å resolution structure of human sars nsp in association with nsp and nsp proteins (pdb id: nur) has been reported using electron microscopy (figure d). nsp has a polymerase domain resembling a "right hand", comprising a finger domain ( - , - residues), a palm domain ( - , - residues), and a thumb domain ( - ) [ ]. the sars-cov- nsp protein has a highly conserved c-terminal region (supplementary figure s e). it is found to share a . % sequence identity with human sars nsp and . % with bat cov nsp . mean ppid values for all three nsp s are estimated to be . % (table ). figures a, b, and c show that although these proteins are mostly ordered, they have multiple flexible regions. as the rdrp protein is observed to be mostly structured, significant morfs in disordered regions are not found (table , supplementary tables and ).
nsp functions as a viral helicase and unwinds dsdna/dsrna in the ' to ' direction [ ]. recombinant viral helicase expressed in the e. coli rosetta strain was reported to unwind ~ bp per second [ ]. figure d represents a . Å x-ray crystal structure of human sars nsp (pdb id: jyt) [ ]. this helicase contains a - loop on a domain, which is primarily responsible for its unwinding activity. furthermore, the study revealed an important interaction of nsp with nsp that further enhances its helicase activity [ ].
the -amino-acid-long nsp of sars-cov- is almost completely conserved, as it shares . % sequence identity with nsp of human sars and . % with nsp of bat cov (supplementary figure s f). in accordance with our results, the mean ppids of all three nsp proteins are estimated to be . %. figures a, b, and c show that nsp s contain multiple flexible regions but do not possess significant disorder. as expected for a low-disorder protein, nsp does not contain any morf region, and no binding region was located by any of the servers used in any of the three viruses (table , supplementary tables and ). it has many nucleotide-binding residues (rna and dna), which are tabulated in supplementary tables , , and .
nsp is a multifunctional viral protein that acts as an exoribonuclease (exon) and methyltransferase (n -mtase) in sars coronaviruses. its ' to ' exonuclease activity lies in the conserved dedd residues related to the exonuclease superfamily [ ]. its guanine-n methyltransferase activity depends upon s-adenosyl-l-methionine (adomet) as a cofactor [ ]. as aforementioned, nsp requires nsp for activating its exon and n -mtase activity inside the host cells. figure d depicts the . Å crystal structure of the human sars nsp /nsp complex (pdb id: c t), where amino acids - form the exon domain and residues - form the n -mtase domain of nsp . a loop (residues - ) is essential for its n -mtase activity [ ]. figure s g). mean ppid values for nsp s from sars-cov- and human sars are calculated to be . %, while the nsp from bat cov has a mean ppid of . %. the predicted per-residue intrinsic disorder propensity of nsp s from sars-cov- , human sars, and bat cov is represented in figures a, b, and c, respectively. as can be observed from these plots and the corresponding ppid values, all nsp s are found to be highly structured. likewise, table shows that nsp contains two protein-binding regions (residues - and - ) predicted by the morfpred server in all three viruses.
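the ppid values reported throughout this section are percentages of residues predicted to be disordered. the python sketch below shows one way such a value can be obtained from per-residue scores, assuming that a residue is counted as disordered when its score is at least 0.5 (a commonly used cutoff, assumed here rather than taken from the text) and that the mean ppid is the plain average over several predictors. the function names, predictor labels, and scores are hypothetical.

```python
from statistics import mean
from typing import Mapping, Sequence


def ppid(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Percentage of residues whose disorder score reaches the threshold.

    The 0.5 default is an assumed cutoff, not a value stated in the text.
    """
    if not scores:
        raise ValueError("at least one per-residue score is required")
    disordered = sum(1 for s in scores if s >= threshold)
    return 100.0 * disordered / len(scores)


def mean_ppid(per_predictor: Mapping[str, Sequence[float]],
              threshold: float = 0.5) -> float:
    """Average the PPID over several predictors run on the same protein."""
    if not per_predictor:
        raise ValueError("at least one predictor is required")
    return mean(ppid(scores, threshold) for scores in per_predictor.values())


# Hypothetical scores for a four-residue toy protein from two predictors.
toy_scores = {
    "predictor_a": [0.1, 0.6, 0.7, 0.2],  # 2 of 4 residues at or above 0.5
    "predictor_b": [0.1, 0.2, 0.3, 0.9],  # 1 of 4 residues at or above 0.5
}
toy_mean = mean_ppid(toy_scores)
```

averaging over predictors smooths out the differences between individual tools, which is why two proteins with nearly identical sequences can be expected to give nearly identical mean ppid values.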
as shown in supplementary tables , , and , the orf contains multiple nucleotide-binding residues.
nsp is a uridylate-specific rna endonuclease (nendou) that creates ′- ′ cyclic phosphates after cleavage. its endonuclease activity depends upon mn + ions as co-factors. conserved in nidoviruses, it acts as an important genetic marker due to its absence in other rna viruses [ ]. figure d represents a . Å crystal structure of uridylate-specific nsp (pdb id: h ) that was deduced by bruno and colleagues using x-ray diffraction [ ]. the monomeric nsp has three domains: an n-terminal domain ( - residues) formed by three anti-parallel β-strands and two α-helices packed together; a middle domain ( - residues) that contains an α-helix connected via a -amino-acid-long coil to an ordered region containing two α-helices and five β-strands; and a c-terminal domain ( - residues) consisting of two anti-parallel three-β-strand sheets on each side of a central α-helical core [ ]. nsp is found to be quite conserved across human sars and bat covs. sars-cov- nsp shares a . % sequence identity with nsp of human sars and . % with nsp of bat cov (supplementary figure s h). calculated mean ppids of nsp s from sars-cov- , human sars, and bat cov are . %, . %, and . %, respectively. similar to many other non-structural proteins of coronaviruses, nsp s from sars-cov- , human sars, and bat cov are predicted to possess multiple flexible regions but contain virtually no idprs (see figures a, b, and c). similarly, no significant disorder-based binding regions are predicted in nsp proteins (table ). sars-cov- nsp contains one morf (residues - ) predicted by the morfpred server. human sars nsp does not have a single morf, while bat cov nsp possesses two very short binding regions (supplementary tables and ). supplementary tables , , and show the presence of many rna-binding residues and a few dna-binding residues in nsp of all three viruses.
nsp protein is another mtase domain-containing protein.
as methylation of coronavirus mrnas occurs in steps, three proteins, nsp , nsp , and nsp , act one after another. the first event requires the initiation trigger from nsp protein, after which nsp methylates capped mrnas, forming cap- ( me) gpppa-rnas. nsp protein, along with its co-activator protein nsp , acts on cap- ( me) gpppa-rnas to give rise to the final cap- ( me)gpppa( 'ome)-rnas [ , ]. a Å x-ray crystal structure of the human sars nsp -nsp complex is depicted in figure d (pdb id: r ) [ ]. the structure consists of a characteristic fold present in the class i mtase family, comprising α-helices and loops surrounding a seven-stranded β-sheet [ ]. nsp protein of sars-cov- is found to be equally similar to nsp s from human sars and bat cov ( . %) (supplementary figure s i). mean ppids for nsp s from sars-cov- , human sars, and bat cov are . %, . %, and . %, respectively. in line with these ppid values, figures a, b, and c show that nsp s are mostly ordered proteins containing several flexible regions. correspondingly, no significant morfs are present in this protein (table , supplementary tables and ). a single morf (residues - ) was found with the help of morfpred in all three viruses. further, several rna-binding and a few dna-binding residues are also identified (supplementary tables , , and ).
replicase polyprotein a. since replicase polyprotein a contains non-structural proteins - identical to those found in replicase polyprotein ab, we did not perform their disorder analysis separately. however, replicase polyprotein a has one additional non-structural protein designated as nsp . nsp is a small uncharacterized protein cleaved from the replicase polyprotein a. experimental studies are required to characterize this small protein of unknown function. the intrinsic disorder prediction software used in this study requires amino acid sequences that are at least residues long.
therefore, because of their short sequences (just residues), nsp s from all three studied coronaviruses were not analyzed for intrinsic disorder, disorder-based protein-binding regions, or nucleotide-binding residues. based on the msa outputs, nsp from sars-cov- was found to have a sequence identity of . % with nsp s from human sars and bat cov (figure ).
the emergence of new viruses and associated deaths around the globe represent one of the major concerns of modern times. despite the pandemic nature of sars-cov- , there is very little information available in the public domain regarding the structures and functions of its proteins. based on its similarity with human sars cov and bat cov, published reports have suggested functions for sars-cov- proteins. in this study, we utilized information available on the sars-cov- genome and translated proteome from genbank, and carried out a comprehensive computational analysis of the prevalence of intrinsic disorder in sars-cov- proteins. additionally, a comparison was made with proteins from close relatives of sars-cov- from the same group of beta coronaviruses, human sars cov and bat cov. our analysis revealed that in these three covs, the n proteins are highly disordered, possessing ppid values of more than %. these viruses also have several moderately disordered proteins, such as nsp , orf , and orf b. although other proteins have shown lower disorder content, almost all of them contain at least some idprs, and all cov proteins analysed in this study have multiple flexible regions. importantly, our study provides novel information on the presence of intrinsic disorder at the cleavage sites of the replicase ab polyprotein of covs. this observation confirms the crucial role of idprs in the maturation of individual proteins. we also established that many of these proteins contain disorder-based binding motifs.
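the idprs referred to in this discussion are contiguous stretches of residues predicted to be disordered. as a minimal sketch of how such regions can be located in a per-residue disorder profile, the python function below reports every run of consecutive residues whose score reaches a cutoff. the function name, the 0.5 cutoff, the minimum run length, and the toy profile are assumptions made for illustration and are not values taken from this study.

```python
from typing import List, Sequence, Tuple


def find_idprs(scores: Sequence[float],
               threshold: float = 0.5,
               min_length: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) of disordered runs, 1-based and inclusive.

    A run is a maximal stretch of consecutive residues with
    score >= threshold; runs shorter than min_length are dropped.
    Both defaults are assumed values for this sketch.
    """
    regions: List[Tuple[int, int]] = []
    start = None  # 1-based position where the current run began
    for pos, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = pos
        else:
            if start is not None and pos - start >= min_length:
                regions.append((start, pos - 1))
            start = None
    # Close a run that extends to the C-terminal end of the profile.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


# Hypothetical seven-residue profile with disordered runs at both termini.
toy_profile = [0.9, 0.8, 0.1, 0.2, 0.7, 0.6, 0.9]
toy_regions = find_idprs(toy_profile)
```

raising `min_length` restricts the output to long idprs, which mirrors the distinction drawn in the text between short flexible regions and long disordered regions.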
since idps/idprs might undergo structural transition upon association with their physiological partners, our study generates important grounds for better understanding of the functionality of these proteins, their interactions with other viral proteins, as well as their interactions with host proteins in different physiological conditions. this will also guide structural biologists in carrying out a structure-based analysis of the sars-cov- proteome to explore the path for the development of new drugs and vaccines. periodic outbreaks of pathogens worldwide are a reminder of the lack of suitable drugs or vaccines for proper cure or treatment. in , nearly deaths were reported due to the sars outbreak in more than countries. but this time, the outbreak of wuhan's novel coronavirus (sars-cov- ) has quickly surpassed this number, indicating more casualties soon. the lack of accurate information and ignorance of primary symptoms are major reasons for many infection cases. although efficient transmission from human to human is confirmed, the actual reasons for the fast spread of sars-cov- are still unknown, but some assumptions were made by researchers and chinese authorities. the fast spread of sars-cov- , the covid- pandemic, and the associated introduction of quarantine have also had major impacts on the economy and education worldwide due to several restrictions, such as limited transportation, restrained or frozen traveling, halted attendance of mass events, the introduction of distance teaching and learning, etc. due to advancements in sequencing techniques, the full genome sequence of sars-cov- was made available within a few days of the first infection report from wuhan, china. however, massive subsequent research needs to be done to identify the actual cause of sars-cov- infectivity and to design suitable treatments in the future. certain possibilities can be explored with the available information.
a mutational pressure study of this virus would be very interesting, to see whether it evolved from bat sars to human sars to sars-cov- . more in-depth experimental studies using molecular and cell biology techniques to establish structure-function relationships are required for a better understanding of the functioning of sars-cov- proteins. additionally, based on the sequence homology and information on protein-protein interactions, the associated viral and host proteins should be explored to find means suitable for limiting the replication, maturation, and ultimately the pathogenesis of this virus. although structural biology techniques (so-called rational drug design) can be used in drug development utilizing high-throughput screening of compounds virtually or experimentally, the applicability of these techniques is limited by the presence of intrinsic disorder in target proteins. therefore, the thorough disorder analysis of three coronaviruses conducted in this study will help structural biologists to rationally design experiments keeping this information in mind.
author contributions: rg: conception and design, interpretation of data, writing, and review of the manuscript, and study supervision. vnu: conception and design, acquisition and interpretation of data, writing, and review of the manuscript. ms, tb, pk, brg, kg: acquisition and interpretation of data, writing of the manuscript.
table . evaluation of intrinsic disorder in non-structural proteins of bat cov.
table : predicted morf residues in human sars proteins.
supplementary table : predicted morf residues in bat cov proteins.
supplementary table : predicted nucleotide-binding residues in sars-cov- proteins.
supplementary table : predicted nucleotide-binding residues in human sars proteins.
supplementary table : predicted nucleotide-binding residues in bat cov proteins.
supplementary figure s .
multiple sequence alignments of structural proteins of all three studied coronaviruses were generated using clustal omega. the aligned images were created using espript . .
figure s a. msa of sars-cov- , human sars, and bat cov spike glycoproteins.
figure s b. msa of sars-cov- , human sars, and bat cov nucleoproteins.
supplementary figure s . multiple sequence alignments of non-structural proteins of all three studied coronaviruses were generated using clustal omega. the aligned images were created using espript . .
figure s a. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s b. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s c. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s d. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s e. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s f. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s g. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s h. msa of sars-cov- , human sars, and bat cov nsp proteins.
figure s i. msa of sars-cov- , human sars, and bat cov nsp proteins.
clinical course and outcomes of critically ill patients with sars-cov- pneumonia in wuhan, china: a singlecentered, retrospective, observational study nidovirales: evolving the largest rna virus genome discovery of seven novel mammalian and avian coronaviruses in the genus deltacoronavirus supports bat coronaviruses as the gene source of alphacoronavirus and betacoronavirus and avian coronaviruses as the gene source of gammacoronavirus and deltacoronavi full-genome deep sequencing and phylogenetic analysis of novel human betacoronavirus the molecular biology of coronaviruses identification of novel subgenomic rnas and noncanonical transcription initiation signals of severe acute respiratory syndrome coronavirus ultrastructure and origin of membrane vesicles associated with the severe acute respiratory syndrome coronavirus replication complex a contemporary view of coronavirus transcription classification of intrinsically disordered regions and proteins intrinsically disordered proteins and intrinsically disordered protein regions intrinsically unstructured proteins: re-assessing the protein structure-function paradigm flexible nets. the roles of intrinsic disorder in protein interaction networks identification and functions of usefully disordered proteins function and structure of inherently disordered proteins. current opinion in structural biology intrinsic disorder in transcription factors showing your id: intrinsic disorder as an id for recognition, regulation and cell signaling drnapred, fast sequence-based method that accurately predicts and discriminates dna-and rna-binding residues high-throughput prediction of rna, dna and protein binding regions mediated by intrinsic disorder a new coronavirus associated with human respiratory disease in china intrinsically disordered side of the zika virus proteome. frontiers in cellular and infection microbiology , , .teome viral disorder or disordered viruses: do viral proteins possess unique features? 
deciphering the dark proteome of chikungunya virus prediction and functional analysis of native disorder in proteins from the three kingdoms of life fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega deciphering key features in protein structures with the new endscript server length-dependent prediction of protein intrinsic disorder optimizing long intrinsic disorder predictors with protein evolutionary information pondr-fit: a metapredictor of intrinsically disordered amino acids sequence complexity of disordered protein iupred a: context-dependent prediction of protein disorder as a function of redox state and protein binding the dark side of alzheimer's disease: unstructured biology of proteins from the amyloid cascade signaling pathway the dark proteome of cancer: intrinsic disorderedness and functionality of hif- α along with its interacting proteins why are "natively unfolded" proteins unstructured under physiologic conditions? subclassifying disordered proteins by the ch-cdf plot method computational identification of morfs in protein sequences using hierarchical application of bayes rule prediction of protein binding regions in disordered proteins anchor: web server for predicting protein binding regions in disordered proteins morfpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins disopred : precise disordered region predictions with annotated protein-binding activity prediction of disordered rna, dna, and protein binding regions using disordpbind prediction of rna binding sites in a protein using svm and pssm profile genome composition and divergence of the novel coronavirus ( -ncov) originating in china a majority of the cancer/testis antigens are intrinsically disordered proteins molecular recognition features in zika virus proteome mpmorfsdb: a database of molecular recognition features in membrane proteins 
disordered rna-binding region prediction with disordpbind coronavirus ibv: removal of spike glycopolypeptide s by urea abolishes infectivity and haemagglutination but not attachment to cells recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission mechanisms of coronavirus cell entry mediated by the viral spike protein cooperative involvement of the s and s subunits of the murine coronavirus spike protein in receptor binding and extended host range the cytoplasmic tail of the severe acute respiratory syndrome coronavirus spike protein contains a novel endoplasmic reticulum retrieval signal that binds copi and promotes interaction with membrane protein structure of sars coronavirus spike receptorbinding domain complexed with receptor important role for the transmembrane domain of severe acute respiratory syndrome coronavirus spike protein during entry cryo-em structure of the sars coronavirus spike glycoprotein in complex with its host cell receptor ace cryo-em structure of the -ncov spike in the prefusion conformation the coronavirus e protein: assembly and beyond incorporation of spike and membrane glycoproteins into coronavirus virions a severe acute respiratory syndrome coronavirus that lacks the e gene is attenuated in vitro and in vivo the transmembrane oligomers of coronavirus protein e structure of a conserved golgi complex-targeting signal in coronavirus envelope proteins structural and functional aspects of viroporins in human respiratory viruses: respiratory syncytial virus and coronaviruses the sars coronavirus e protein interacts with pals and alters tight junction formation and epithelial morphogenesis identifying sars-cov membrane protein amino acid residues linked to virus-like particle assembly self-assembly of severe acute respiratory syndrome coronavirus membrane protein the cytoplasmic tails of infectious bronchitis virus e and m proteins mediate their interaction nucleocapsid-independent specific 
viral rna packaging via viral envelope protein and viral rna signal differential maturation and subcellular localization of severe acute respiratory syndrome coronavirus surface proteins s, m and e studies on membrane topology, n-glycosylation and functionality of sars-cov membrane protein a structural analysis of m protein in coronavirus assembly and morphology the membrane protein of severe acute respiratory syndrome coronavirus acts as a dominant immunogen revealed by a clustering region of novel functionally and structurally defined cytotoxic tlymphocyte epitopes prediction of intrinsic disorder in mers-cov/hcov-emc supports a high oral-fecal transmission the cterminal domain of the mers coronavirus m protein contains a trans-golgi network localization signal the coronavirus nucleocapsid is a multifunctional protein ribonucleocapsid formation of severe acute respiratory syndrome coronavirus through molecular action of the n-terminal domain of n protein structural proteins of the severe acute respiratory syndrome coronavirus are required for efficient assembly, trafficking, and release of virus-like particles multiple nucleic acid binding sites and intrinsic disorder of severe acute respiratory syndrome coronavirus nucleocapsid protein: implications for ribonucleocapsid protein packaging structure of the nterminal rna-binding domain of the sars cov nucleocapsid protein crystal structure of the severe acute respiratory syndrome (sars) coronavirus nucleocapsid protein dimerization domain reveals evolutionary linkage between corona-and arteriviridae characterization of protein-protein interactions between the nucleocapsid protein and membrane protein of the sars coronavirus the nucleocapsid protein of sars coronavirus has a high binding affinity to the human cellular heterogeneous nuclear ribonucleoprotein a analysis of multimerization of the sars coronavirus nucleocapsid protein localization of the rna-binding domain of mouse hepatitis virus nucleocapsid protein 
analysis of ordered and disordered protein complexes reveals structural features discriminating between stable and unstable monomers flexible nets: disorder and induced fit in the associations of p and - - with their partners in various protein complexes, disordered protomers have large per-residue surface areas and area of protein-, dnaand rna-binding interfaces sars coronavirus accessory proteins identification of a novel protein a from severe acute respiratory syndrome coronavirus subcellular localization and membrane association of sars-cov a protein a novel severe acute respiratory syndrome coronavirus protein, u , is transported to the cell surface and undergoes endocytosis severe acute respiratory syndrome-associated coronavirus a protein forms an ion channel and modulates virus release the severe acute respiratory syndrome (sars)-coronavirus a protein may function as a modulator of the trafficking properties of the spike protein the role of severe acute respiratory syndrome (sars)-coronavirus accessory proteins in virus pathogenesis mitochondrial location of severe acute respiratory syndrome coronavirus b protein nucleolar localization of nonstructural protein b, a protein specifically encoded by the severe acute respiratory syndrome coronavirus sars-cov accessory protein b induces ap- transcriptional activity through activation of jnk and erk pathways g /g arrest and apoptosis induced by sars-cov b protein in transfected cells over-expression of severe acute respiratory syndrome coronavirus b protein induces both apoptosis and necrosis in vero e cells the a protein of severe acute respiratory syndrome-associated coronavirus induces apoptosis in vero e cells severe acute respiratory syndrome coronavirus open reading frame (orf) b, orf , and nucleocapsid proteins function as interferon antagonists severe acute respiratory syndrome coronavirus orf antagonizes stat function by sequestering nuclear import factors on the rough endoplasmic reticulum/golgi 
key: cord- -ko l qv authors: scarpin, m regina; leiboff, samuel; brunkard, jacob o title: parallel global profiling of plant tor dynamics reveals a conserved role for larp in translation date: - - journal: elife doi: . /elife. sha: doc_id: cord_uid: ko l qv target of rapamycin (tor) is a protein kinase that coordinates eukaryotic metabolism. in mammals, tor specifically promotes translation of ribosomal protein (rp) mrnas when amino acids are available to support protein synthesis. the mechanisms controlling translation downstream from tor remain contested, however, and are largely unexplored in plants. to define these mechanisms in plants, we globally profiled the plant tor-regulated transcriptome, translatome, proteome, and phosphoproteome. we found that tor regulates ribosome biogenesis in plants at multiple levels, but through mechanisms that do not directly depend on ′ oligopyrimidine tract motifs ( ′tops) found in mammalian rp mrnas. we then show that the tor-larp - ′top signaling axis is conserved in plants and regulates expression of a core set of eukaryotic ′top mrnas, as well as new, plant-specific ′top mrnas. our study illuminates ancestral roles of the tor-larp - ′top metabolic regulatory network and provides evolutionary context for ongoing debates about the molecular function of larp . 
target of rapamycin (tor) is a conserved eukaryotic serine/threonine protein kinase that regulates metabolism by promoting anabolic processes when nutrients are available (liu and sabatini, ) . in the most well-studied pathway, mammalian tor is stimulated by amino acids to promote translation, acting as a rheostat to couple free amino acid availability with rates of amino acid incorporation into proteins (valvezan and manning, ) . tor is under intense biomedical investigation because dysregulation of the tor network causes or contributes to a wide range of human diseases, prominently including cancer (saxton and sabatini, ) . therefore, many details of the tor signaling pathway have been elucidated in mammals and yeast; less is known, however, about the tor network in other eukaryotic lineages. in the plant model for genetics and molecular biology, arabidopsis thaliana, tor is essential for the earliest stages of embryogenesis (menand et al., ) , and inhibiting tor strongly represses growth and development (deprost et al., ; xiong and sheen, ) . plant tor activity is controlled by a number of upstream signals, such as glucose , light (chen et al., a; li et al., b) , nucleotides (busche et al., ) , and phytohormones including auxin (li et al., b; schepetilnikov et al., ) , brassinosteroids (zhang et al., ) , and abscisic acid . when plant tor is active, it promotes the transcription of genes involved in cell-cycle progression, ribosome biogenesis, and various other metabolic processes, depending on developmental context . as in other eukaryotes, plant tor associates with at least two additional proteins, raptor (regulatory-associated protein of tor) and lst (lethal with sec ), to form an active complex called torc (tor complex ) (deprost et al., ; mahfouz et al., ; moreau et al., ) ; it is not known whether tor acts in any raptor-or lst -independent complexes in plants . 
very little is understood about the signal transduction networks downstream from tor in plants; elucidating these signaling pathways is a major goal to understand how tor signaling evolved in eukaryotes and how tor signaling networks could be manipulated to promote agricultural yields while reducing reliance on expensive and environmentally-harmful fertilizer inputs (busche et al., ) . rapamycin, an inhibitor of torc first isolated from streptomyces hygroscopicus cultures (vézina et al., ) , represses growth largely by inhibiting mrna translation (thomas and hall, ) , which has led to extensive studies of the mechanisms underlying the regulation of mrna translation by tor. as a simplified model, in mammals, tor specifically promotes the translation of mrnas that encode ribosomal proteins (rps) and a handful of other components of the translation apparatus; in turn, these additional ribosomes increase global rates of mrna translation. vertebrate rp mrnas have evolved a regulatory motif, called a terminal oligopyrimidine tract ( top), which begins with a cytosine at the cap and is followed by several uracils and/or cytosines, but no (or very few) adenines or guanines . approximately mammalian transcripts are classically considered top mrnas, including all~ cytosolic rp mrnas (philippe et al., ) . the top motif is crucial for the tor-mediated regulation of rp mrna translation initiation in vertebrates: when tor is inactive, top motifs are sufficient to strongly repress translation initiation (meyuhas and kahan, ) . multiple tor substrates have been proposed to mediate tor- top mrna translation regulation (berman et al., ; meyuhas and kahan, ; philippe et al., ; thoreen et al., ) , including es kinases (s ks), eif e-binding proteins ( e-bps), eif g initiation factors, and la-related protein (larp ), among others. the precise mechanisms remain contested, however, and only s k, eif g, and larp are conserved across eukaryotes. 
larp 's role in translation was first discovered in a screen for proteins that differentially associate with the cap in response to tor activity (tcherkezian et al., ) . subsequent mechanistic studies have alternatively proposed that larp promotes translation of top mrnas (tcherkezian et al., ) , represses translation of top mrnas (fonseca et al., ; philippe et al., ; philippe et al., ) , has dual and opposing roles in controlling translation of top mrnas depending on its phosphorylation status (hong et al., ) , or has no role in regulating translation, but instead stabilizes top mrnas (gentilella et al., ) . at a molecular level, larp can bind directly to the -methylguanosine (m g) and top (cassidy et al., ; lahr et al., ) , but may also associate with pyrimidine-enriched sequences elsewhere in a transcript (hong et al., ) , and is often found in complex with the polya tail via protein-protein interactions with polya binding proteins (pabps) and/or direct interaction with the polya rna (al-ashtal et al., ; aoki et al., ) . these apparently conflicting models (berman et al., ) may reflect differences in the physiological status or genetics of the cell types used, multiple proteinand rna-binding sites on the larp protein with distinct affinities, or confounding effects of different techniques to identify larp binding partners with distinct inherent biases. larp is under increasing scrutiny because of its clinical significance: in addition to contributions to the progression of some cancers (mura et al., ) and a possible link to zika virus pathogenesis (scaturro et al., ) , larp was recently found to physically associate with the rna-binding nucleocapsid of the recently-emerged zoonotic coronavirus sars-cov- (gordon et al., ) . 
therefore, in light of evidence that inhibiting tor limits replication of a closely related coronavirus (mers-cov) (kindrachuk et al., ) and the availability of fda-approved tor inhibitors (especially rapamycin and related, rapamycin-like compounds) and several additional tor inhibitors in clinical trials, ablating tor-larp signaling has been proposed as a possible pharmacological target to treat severe coronavirus infections (gordon et al., ; zhou et al., ) . most strikingly, although larp is deeply conserved in eukaryotes, the ubiquitous top motif in the leader of cytosolic rp mrnas only recently evolved in animals, suggesting that larp did not co-evolve with its proposed current primary function in humans, the direct regulation of cytosolic rp mrna translation. in plants, larp has been reported to associate with the cytosolic exoribonuclease, xrn , and recruit xrn to specific transcripts during heat shock to promote their degradation (merret et al., ) , but it is not known whether larp regulates mrna stability and/or translation under standard physiological conditions. pioneering efforts to identify the mechanisms regulating translation of top mrnas, however, did show that wheat germ extracts contain a repressor that specifically limits translation of top mrnas in cell-free translation assays (biberman and meyuhas, ; shama and meyuhas, ) , suggesting that plants also discriminately regulate translation of top mrnas. here, we show that the tor-larp - top signaling axis regulates translation in arabidopsis, impacting expression of a set of deeply conserved top genes, including translation elongation factors, polya-binding proteins, karyopherins/importins, and the translationally-controlled tumor protein. moreover, we identify new, plant-specific genes regulated by tor-larp - top signaling, including several genes involved in auxin signaling, developmental patterning, and chromatin modifications. 
significantly, many of the top mrnas do encode proteins that contribute to ribosome biogenesis, although only a handful of cytosolic rp mrnas themselves have top motifs. we propose that tor-larp - top mrna signaling arose early during eukaryotic evolution to coordinate translation and cell division with cellular metabolic status and nutrient availability, and we argue that tor-larp - top signaling has since evolved new, plantspecific targets that regulate plant physiology, growth, and development. to identify novel downstream components of the plant tor signaling network, we took orthogonal global approaches to quantify how deactivating tor impacts the plant transcriptome (rna-seq), translatome (ribo-seq), proteome, and phosphoproteome ( figure a ). seedlings were grown to quiescence in half-strength ms media for days under photosynthesis-limiting conditions with a hr light/ hr dark diurnal cycle. media were then replaced with halfstrength ms media plus mm glucose to activate tor. hr later, media were replaced again with half-strength ms media plus mm glucose (to maintain tor activity) or half-strength ms media plus mm glucose and . mm torin (to attenuate tor activity) (li et al., b; montané and menand, ) . seedlings were collected hr after these treatments. at least seedlings were pooled for each treatment and considered one sample, and the entire experiment was replicated three times. after collection, portions of each sample were divided for different analyses. total rna was extracted from one set of samples in polysome buffer with cycloheximide; an aliquot of this rna was used to build rna-seq libraries after protease treatment and depletion of rrna, and the rest of the rna was used for ribosome footprint profiling (hsu et al., ; ingolia et al., ) . 
total protein was extracted from a parallel set of samples and digested with trypsin; an aliquot of this digested protein was tmt-labeled and analyzed by liquid chromatography-tandem mass spectrometry (lc-ms/ms), and the rest was enriched for phosphopeptides and then tmt-labeled and analyzed by lc/ms-ms. in summary, we quantified transcripts from , genes, ribosome footprints from , genes, peptides from proteins, and phosphopeptides from phosphoproteins (supplementary file ). for our assays, we attenuated tor activity using torin , a potent atp-competitive tor inhibitor that is effective at reducing tor activity in arabidopsis thaliana (li et al., b; montané and menand, ) . tor is a member of the atypical phosphatidylinositol- -kinase (pi k)-like protein kinases (pikks) that evolved from lipid kinases (pi ks) and were present in the last eukaryotic common ancestor (brunkard, ; keith and schreiber, ) . pikks are a small family of only five kinases involved in metabolic regulation (tor), nonsense-mediated decay (smg ), and the dna damage response (atm, atr, and dna-pkcs). although all of these were present in the last eukaryotic common ancestor, they are not conserved in all extant eukaryotic lineages; the a. thaliana genome, for example, only encodes tor, atm, and atr. atp-competitive tor inhibitors, which have been developed for pharmacological treatment of tor-associated human diseases, show a range of selectivity for tor, with some exhibiting low selectivity (e.g. pp ) (liu et al., ) or extremely high selectivity (e.g. azd , which is~ , -fold selective for tor) (chresta et al., ) . in between these extremes, torin is at least -fold more selective for tor than other pikks in in vitro assays (liu et al., ) . in cell types that have strongly induced the dna damage response, however, such as some breast cancers, torin can have synergistic cellular effects by inhibiting both tor and another pikk (chopra et al., ) . 
for example, a recent report argues that torin causes cytotoxicity in some triple-negative breast cancer cell lines with highly elevated atr activity by inhibiting tor and attenuating atr (chopra et al., ). azd can also cause cytotoxicity, but only at higher concentrations (chopra et al., ). since atr and atm are not induced under our growing conditions, we expect that torin specifically inhibits tor in our assays.
figure caption (fragment): … for three days and then supplemented with mm glucose to stimulate tor and growth for hr. seedlings were next treated with either mm glucose + . mm torin to attenuate tor activity or only mm glucose to promote tor activity for hr. seedlings were then snap-frozen in liquid nitrogen. rna or protein was extracted as described in the methods for global, unbiased profiling of the seedling torin -sensitive transcriptome, translatome, proteome, and phosphoproteome. (b) inhibiting the tor pathway affected the accumulation of transcripts in multiple categories, as determined by mapman analysis. one of the most broadly affected categories was protein biosynthesis, including significantly lower levels of mrnas that encode cytosolic and mitochondrial ribosomes after torin treatment. here and below, the observed fold-change in mrna levels for each category is shown in box-whisker plots drawn with tukey's method; the middle line represents the median and the dot represents the mean of the fold-changes. p values for each category were determined by the mann-whitney u test using mapman gene ontologies and corrected for false positives with the stringent benjamini-yekutieli method. (c) in addition to rp genes, categories that were affected by torin treatment included repression of genes involved in protein translocation and rna processing and induction of genes involved in protein degradation and stress responses.
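the benjamini-yekutieli correction named in the caption can be illustrated with a short sketch. this is not the authors' pipeline: the function name and the example p values are hypothetical, and the code only shows how per-category p values (for instance, from mann-whitney u tests) would be adjusted under the standard benjamini-yekutieli procedure, which scales each p value by the number of tests and by the harmonic sum c(m) = 1 + 1/2 + ... + 1/m.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p values in the input order.

    Hypothetical helper for illustration; not taken from the paper.
    """
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic-sum penalty c(m); this factor is what makes BY stricter than BH.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p values are capped at 1
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m * c_m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up p values for four functional categories (illustrative only).
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_yekutieli(raw)):
        print(f"raw p = {p:.3f}  BY-adjusted p = {q:.4f}")
```

because c(m) grows with the number of categories tested, the adjusted values are always at least as large as those from the more common benjamini-hochberg procedure, which is consistent with the caption describing the method as stringent.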
we also chose a near-minimal concentration of torin to attenuate, rather than completely abolish, tor activity, which also reduces the likelihood of non-specific activity. nonetheless, it is possible that torin could have minor off-target effects, which should be considered throughout our analyses below. the rna-seq experiment revealed significant, global repression of mrnas that encode rps in response to tor inhibition. plant cytosolic ribosomes are composed of different proteins that are encoded by annotated genes in arabidopsis (hummel et al., ) , including at least two paralogs encoding each subunit (supplementary file ). of those genes were transcribed at detectable levels in our rna-seq data, of which ( %) accumulated to significantly lower levels hr after treatment with torin than in mock-treated seedlings ( figure b) . strikingly, mrnas encoding every subunit of the cytosolic ribosome significantly decreased in the torin -treated plants. in plants, both mitochondria and chloroplasts assemble their own ribosomes to translate genes encoded by their respective genomes. most of the mitochondrial and chloroplast rps are encoded by the nucleus. many of the mitochondrial rp mrnas are also significantly downregulated in torin -treated seedlings ( / transcripts, or %), and none are upregulated ( figure b , supplementary file ). in stark contrast, only one chloroplast rp mrna is differentially expressed in torin -treated seedlings (bl c is . -fold repressed; this is / transcripts, or . %) ( figure b , supplementary file ). to summarize, within hr of treatment with torin , tor inactivation widely suppresses the expression of cytosolic and mitochondrial rp mrnas ( figure b ). the degs identified in torin -treated seedling transcriptome impact diverse additional processes ( figure c , supplementary file ), mostly in line with previously reported results using different experimental systems (dong et al., ; . 
in addition to rp mrnas, many other mrnas involved in protein synthesis accumulated to significantly lower levels in torin -treated seedlings ( figure c , supplementary file ), including mrnas that encode translation initiation factors, aminoacyl trna synthetases, and small nucleolar ribonucleoprotein (snornp) subunits. mrnas that promote protein catabolism were coordinately induced ( figure c , supplementary file ), including many mrnas that participate in proteasomal degradation, such as ubiquitin e ligases, and several mrnas that encode components of the autophagosome. mitochondrial biogenesis is broadly suppressed, including not only mitochondrial rp mrnas ( figure b , supplementary file ), but also transcripts encoding oxphos electron chain subunits and proteins that participate in translocation of proteins into mitochondria ( figure c , supplementary file ). inhibiting tor decreased levels of mrnas that encode cell wall proteins that contribute to growth, including arabinogalactan proteins, extensins, and expansins ( figure c , supplementary file ). transcriptional programs associated with abiotic and biotic stress were widely induced after tor inhibition, including accumulation of mrnas that encode nod-like receptors, receptor-like kinases, wrky transcription factors, nac transcription factors, and enzymes that contribute to secondary metabolic stress responses ( figure c , supplementary file ). lastly, tor inhibition significantly decreased levels of mrnas that encode components of the nuclear pore complex and the family of importin/karyopherin b nuclear transport receptors ( figure c , supplementary file ). next, we used ribosome footprinting to identify transcripts that are differentially translated in response to torin . putative ribosome footprint sequences were aligned to the tair genome and scanned for periodicity using ribotaper . 
ribo-seq fpkm were divided by rna-seq fpkm to calculate the relative translation efficiency (te) of every transcript identified in both datasets. we observed a twofold or greater change in te for transcripts (supplementary file ). by far, the most significantly affected category after torin treatment was rp mrnas, which showed broad translational repression ( figure d, supplementary file ). this result is in strong agreement with the conserved role of tor in promoting rp mrna translation in other eukaryotic model systems. we also observed significant translational repression of many photosynthesis-associated genes ( figure d , supplementary file ). in particular, the relative translational efficiency of mrnas encoded by the chloroplast significantly decreased after treatment with torin , indicating that tor promotes translation within the chloroplast.
the seedling tor-regulated proteome and phosphoproteome
we conducted quantitative global proteomics to define changes in protein abundance hr after inactivating tor with torin . of proteins quantified in our analysis, showed significant differences in abundance between torin - and mock-treated seedlings (supplementary file ). we observed only small differences in protein abundance between treatments (~ . -fold changes, on average, for the significantly impacted proteins), but this is likely because protein half-lives (~hours to days) are typically an order of magnitude longer than mrna half-lives (~minutes to hours) in eukaryotes (li et al., a; toyama and hetzer, ). strikingly, we observed a statistically significant decrease in the abundance of of the rps detected in the seedling proteome (down . -fold on average) ( figure e, supplementary file ). the only other biological process detected as significantly impacted in the torin -treated quantitative proteome was an induction of various solute transporters.
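the translation-efficiency comparison described above (ribo-seq fpkm divided by rna-seq fpkm, followed by a twofold cutoff) can be sketched as follows. the function names, gene labels, and fpkm values are hypothetical, and skipping genes with zero signal in either dataset is an assumption of this sketch rather than a documented step of the original analysis.

```python
import math


def translation_efficiency(ribo_fpkm, rna_fpkm):
    """TE = Ribo-seq FPKM / RNA-seq FPKM for genes found in both datasets.

    Genes with no positive signal in either dataset are skipped
    (an assumption of this sketch).
    """
    te = {}
    for gene, rna in rna_fpkm.items():
        ribo = ribo_fpkm.get(gene)
        if ribo is None or rna <= 0 or ribo <= 0:
            continue
        te[gene] = ribo / rna
    return te


def te_changes(te_mock, te_treated, min_fold=2.0):
    """Return log2(TE treated / TE mock) for genes changing by at least min_fold."""
    threshold = math.log2(min_fold)
    changed = {}
    # Only genes with a TE estimate in both conditions can be compared.
    for gene in te_mock.keys() & te_treated.keys():
        log2_fc = math.log2(te_treated[gene] / te_mock[gene])
        if abs(log2_fc) >= threshold:
            changed[gene] = log2_fc
    return changed


if __name__ == "__main__":
    # Made-up FPKM values for three placeholder genes.
    rna_mock = {"geneA": 10.0, "geneB": 20.0, "geneC": 0.0}
    ribo_mock = {"geneA": 20.0, "geneB": 20.0, "geneC": 5.0}
    rna_treated = {"geneA": 10.0, "geneB": 20.0}
    ribo_treated = {"geneA": 5.0, "geneB": 18.0}
    te_mock = translation_efficiency(ribo_mock, rna_mock)
    te_treated = translation_efficiency(ribo_treated, rna_treated)
    print(te_changes(te_mock, te_treated))
```

dividing footprint density by transcript abundance separates translational control from changes in mrna level: in the made-up example, geneA keeps the same mrna level but loses footprints, so only its te drops.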
the torin -sensitive phosphoproteome revealed phosphoproteins that accumulated to significantly different levels in torin -treated seedlings compared to mock-treated controls (figure a , supplementary file ). about half of these were abundant enough that they were also quantified in the global proteome ( proteins), of which eight showed significant differences in total protein levels after torin treatment (supplementary file ). mrnas encoding all of these significantlyimpacted phosphoproteins were detected by rna-seq (supplementary file ), except for df , a seed-specific transcription factor that we speculate may persist as a stable protein for some time after germination, but is no longer transcriptionally expressed. levels of of these mrnas showed significant differences after torin treatment (supplementary file ). although the transcripts and/ or total protein levels of several of the phosphoproteins changed, these differences could not readily explain the magnitude of difference in protein phosphorylation we detected. moreover, phosphorylation often impacts protein stability, and this global experiment cannot distinguish whether changes in protein level are caused by differential phosphorylation or if changes in phosphoprotein levels reflect unrelated changes in protein stability. thus, while we include these parallel data, we proceeded to analyze all putative tor-sensitive phosphoproteins identified in these experiments as presumptive substrates or downstream targets of tor. we first focused on proteins whose phosphorylation decreases upon torin treatment, since these could be direct substrates of tor ( figure ) . the strongest effect was on es b, a canonical phosphorylation target of the tor-s k signaling axis; phosphorylated es b was readily detected in mock-treated controls but was undetectable in torin -treated samples ( figure c ). 
ser , which is near the c-terminus of es b, became dephosphorylated upon torin treatment, which is in agreement with previous reports that es b-ser phosphorylation is dependent on tor-s k activity ( figure b ). this clear result served as robust internal validation for our torin -treated phosphoproteome. we next compared the torin -sensitive seedling phosphoproteome to the tor-regulated cell suspension culture phosphoproteome. eight targets overlapped between our datasets: in both experiments, inhibiting tor decreased phosphorylation of es b, larp , eukaryotic initiation factor eif b , the plant-specific conserved binding of eif e one protein (cbe ), and a plant ubiquitin regulatory x domain-containing protein (pux ), whereas inhibiting tor increased phosphorylation of eukaryotic initiation factor eif g, the universal stress protein phos , and a plant-specific duf protein of unknown function (at g ). of the remaining phosphoproteins that we found were sensitive to torin , were detected in the cell suspension culture phosphoproteomes, and in many cases their phosphorylation was similarly impacted by tor inhibition, but not at statistically-significant levels in those experiments. these corroborating results further confirmed that our experimental approach identified bona fide phosphorylation targets of the tor signaling network. to confirm that torin did not inhibit atm or atr activity under our experimental conditions, we compared the torin -sensitive seedling phosphoproteome to the arabidopsis atm and atr-dependent phosphoproteome, a set of phosphoproteins that were differentially phosphorylated in atm;atr double mutants compared to wildtype in response to irradiation, that is atm/atr-activating conditions (roitinger et al., ). only of those phosphoproteins were detected with the same phosphosites in any of our phosphoproteome experiments, and none of these phosphosites were significantly inhibited by torin . these results support our hypothesis that torin selectively acts only to inhibit tor in arabidopsis seedlings grown under our physiological conditions.
figure . tor regulates the phosphorylation of critical proteins involved in translation, cellular dynamics, and signal transduction in arabidopsis seedlings. (a) phosphoprotein levels decreased for the majority ( %) of the proteins that were significantly differentially phosphorylated after torin treatment. the scatterplot displays the difference in phosphoprotein abundance in torin -treated seedlings for every detected phosphoprotein. statistically-significant differences are indicated by colored dots (red for decreased abundance, blue for increased abundance; p< . , mann-whitney …
the torin -sensitive seedling phosphoproteome is strongly enriched for proteins involved in ribosome biogenesis, including rp subunits (panther go cellular component cytosolic rps were . -fold overrepresented, p< . , fisher's exact test with bonferroni correction) ( figure f ). enrichment analysis was performed with panther gene ontologies, as described in detail in the methods section. in addition to es b, phosphorylation of the large ribosomal subunit ul a decreased in torin -treated seedlings, as did phosphorylation of the acidic stalk proteins ul b, p b, and p a, and the ul /acidic stalk-like rpd /mrt protein involved in nuclear ribosome assembly. oppositely, ul a phosphorylation somewhat increased, despite a significant decrease in ul a transcripts. ul a is a component of the s ribonucleoprotein particle, along with ul and the s rrna, that is assembled as a subcomplex prior to incorporation into the large subunit. recessive alleles of ul a, including piggyback (pinon et al., ) and oligocellula (yao et al., ), impact arabidopsis leaf development (fujikura et al., ) and enhance phenotypes of asymmetric leaves (pinon et al., ; yao et al., ) and angustifolia three mutants that drastically disrupt leaf patterning. 
the differentially phosphorylated residue in ul a, ser , is not conserved in humans but is deeply conserved in the plant lineage. beyond rps per se, several proteins involved in ribosome biogenesis, such as the nucleolar proteins gar -like and nucleolin, were significantly less phosphorylated in torin treated seedlings (supplementary file ). multiple rna-binding proteins are differentially phosphorylated in the torin -sensitive seedling phosphoproteome (panther go molecular function rna-binding proteins were . -fold overrepresented, p< À , fisher's exact test with bonferroni correction) ( figure f ). several of these have been assigned functions in mrna splicing or maturation (panther go-slim biological process rna splicing were . -fold overrepresented, p< . , fisher's exact test with bonferroni correction), including rna-binding protein (rbm ), serrate (se), arginine/serine-rich splicing factor and (rs and rs ), binding to tomv rna l (btr l, orthologous to human nova), defectively organised tributaries (dot , orthologous to human sart- ), and fip , which are all significantly dephosphorylated upon tor inactivation by torin , and splicing factor sc (sc ), which is slightly more phosphorylated in torin -treated seedlings. proteins that are known or predicted to regulate translation initiation are also enriched in the torin -sensitive seedling phosphoproteome, including eif b , eif b , cbe , and larp a, which are dephosphorylated in torin -treated seedlings, and eif g, which is hyperphosphorylated in torin -treated seedlings (figure a and e). these results suggest that tor plays a major role in the regulation of transcript processing and translation, and that tor influences mrna expression through multiple, distinct signaling axes. beyond these general trends, we noticed a torin -sensitive change in the phosphorylation status of two critical proteins involved in signal transduction: raptor and topless ( figure ). 
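the over-representation statistics quoted in this section (fold enrichment, fisher's exact test, bonferroni correction) can be illustrated with a minimal sketch. this is a one-sided hypergeometric formulation written from first principles; it is not the panther implementation, which may differ in its choice of background and sidedness, and all function names and counts below are hypothetical.

```python
from math import comb


def fisher_overrepresentation(hits_in_set, set_size, hits_in_background, background_size):
    """One-sided Fisher's exact p value for over-representation.

    Probability of drawing at least `hits_in_set` annotated proteins when
    `set_size` proteins are sampled without replacement from a background of
    `background_size` proteins, `hits_in_background` of which are annotated.
    """
    max_hits = min(set_size, hits_in_background)
    total = comb(background_size, set_size)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, max_hits + 1)
    )
    return tail / total


def fold_enrichment(hits_in_set, set_size, hits_in_background, background_size):
    """Observed hits divided by the number expected from the background rate."""
    expected = set_size * hits_in_background / background_size
    return hits_in_set / expected


def bonferroni(pvalues):
    """Multiply each p value by the number of tests, capping at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


if __name__ == "__main__":
    # Made-up counts: 3 of 4 sensitive phosphoproteins carry an annotation
    # that is found on 5 of 20 background proteins.
    p = fisher_overrepresentation(3, 4, 5, 20)
    print(f"fold enrichment = {fold_enrichment(3, 4, 5, 20):.1f}, p = {p:.4f}")
    print(bonferroni([p, 0.5, 0.04]))
```

bonferroni correction multiplies each p value by the number of categories tested, so an enrichment must be strong to remain significant when many gene ontology terms are examined at once.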
raptor is an essential protein that complexes with tor and lst to form torc . in arabidopsis and humans, raptor is post-translationally modified to modulate torc activity in response to various cues (carrière et al., ; dunlop et al., ; foster et al., ; gwinn et al., ; wang et al., ; wang et al., ) . we found that phosphorylation of raptor thr and ser , residues between the conserved raptor heat repeats and wd repeats, significantly decreased in response to torin treatment. these residues are conserved in most land plant rap-tor sequences, including physcomitrella patens and marchantia polymorpha raptor orthologues, the abundance of phosphopeptides shown in (b) in response to torin are represented by dots in this scatterplot (mock-treated, dark blue; torin -treated, light blue). black lines represent the mean and standard error. in each case, the difference was statistically-significant (p< . , t test). (d) alignment of raptor and topless protein sequences surrounding the phosphosites that were sensitive to torin treatment. a. thaliana ( ) shows raptor and topless sequences; a. thaliana ( ) shows the sequences of their closely-related paralogs, raptor and topless-related . sequences from representative land plant species are shown in phylogenetic order: a. thaliana (thale cress, a rosid) solanum lycopersicum (tomato, an asterid), oryza sativa (rice, a monocot), amborella trichocarpa (representing a basal lineage of angiosperms), selaginella moellendorffii (spikemoss, a lycophyte), and marchantia polymorph (common liverwort, representing a basal lineage of land plants). phosphosites are highlighted in blue. (e) categorical analysis of the molecular functions of torin -sensitive phosphoproteins is shown. (f) categorical analysis of the subcellular localization of torin -sensitive phosphoproteins is shown. but are not found in other eukaryotic lineages ( figure d ). 
torin -sensitive phosphorylation of raptor residues suggests that torc could regulate its own activity, although further studies will be needed to determine whether raptor is a direct substrate of torc and how phosphorylation of these residues affects torc activity. topless is a transcriptional regulator that mediates multiple phytohormone pathways (long et al., ; oh et al., ; pauwels et al., ; szemenyei et al., ); several studies have proposed that tor signaling participates in phytohormone signaling networks (li et al., b; schepetilnikov et al., ; wang et al., ; zhang et al., ), although most have focused on whether tor activity is sensitive to phytohormones as upstream cues, rather than whether tor modulates downstream phytohormone responses. we found that topless thr and ser phosphorylation decreased after torin treatment in wild-type seedlings ( figure b and c). topless is a member of the groucho/tup family of co-repressors (cavallo et al., ; keleher et al., ), which are recruited to promoters by diverse transcription factors where they repress transcription. topless is most famous in plants for its interactions with phytohormone-regulated transcription factors, including direct interaction with auxin-responsive aux/iaa-arf complexes (szemenyei et al., ), jasmonate-responsive ninja-jaz complexes (pauwels et al., ), and brassinosteroid-responsive bzr complexes (oh et al., ). topless-ser is deeply conserved in land plants ( figure d ), and topless-ser is broadly conserved in angiosperms, including amborella trichocarpa, monocots, and dicots, but is not found in non-flowering plants ( figure d ). future studies of the functional impact of these phosphorylation events could illuminate a role for tor-topless regulation of the complex crosstalk among metabolic, developmental, and stress-response signaling networks.
the seedling torin -sensitive phosphoproteome also revealed that the phosphorylation of two actin-binding villin proteins, villin (vln ) and vln , is regulated by tor ( figure ). in mammals and yeast, the raptor-independent torc complex regulates actin filamentation through multiple pathways (jacinto et al., ; loewith et al., ; rispal et al., ; sarbassov and kim, ; schmidt et al., ) . recently, a forward genetic screen in arabidopsis found that a recessive allele of isopropylmalate synthase (ipms ) promotes actin filamentation in a tor-dependent pathway (cao et al., ) . ipms is the first committed step of leucine biosynthesis; accordingly, ipms mutants exhibit broad defects in free amino acid accumulation (cao et al., ) . while the exact cause remains undefined, the disruption of amino acid metabolism in ipms mutants apparently increases cellular tor activity (cao et al., ; schaufelberger et al., ) . reducing tor activity by treating ipms mutants with atp-competitive tor inhibitors rescues the ipms actin organization phenotype, which suggests that tor regulates actin filamentation in plants. we found that torin decreases phosphorylation of vln and vln , closely related villins, at the same sites near the c-terminus in both proteins. vln and vln are required for normal plant development and actin bundling in arabidopsis (bao et al., ; qu et al., ; van der honing et al., ; wu et al., ) , but their post-translational regulation remains underexplored in plants. functional studies of tor-promoted vln /vln phosphorylation could reveal new links between tor and the cytoskeleton in plants. we chose to focus further studies on larp for several reasons. larp is deeply conserved in eukaryotes and is a consistent target of tor in mammalian and plant phosphoproteomic screens, but the functional significance of tor-larp signaling remains unresolved (berman et al., ) , even in biomedical model systems. 
we began by investigating the role of larp in normal plant development and physiology. in some eukaryotes, such as drosophila melanogaster, larp is essential for early stages of development (burrows et al., ). in contrast, larp is not essential in c. elegans, although larp mutants exhibit delayed growth and very low-frequency arrest of embryogenesis (nykamp et al., ). larp contributes to heat stress recovery in arabidopsis (merret et al., ), but its role in development under standard growing conditions has not been thoroughly characterized. to address this, we grew larp and wild-type plants for days on half-strength ms media plus . % agar, and then measured root growth ( figure a). we observed % shorter roots in larp than in wild-type plants, along with a reduction in the number of secondary roots (although this is possibly a consequence of the defect in primary root growth) ( figure b ). the first true leaves of larp plants were also significantly smaller than in wild-type plants grown under these conditions ( figure c ). we next assayed shoot growth in soil-grown plants four weeks post-germination. we did not observe any obvious defects in leaf morphology ( figure d ), but again, the most recently emerged leaf (l ) was significantly smaller in larp plants than in wild-type ( figure e ).

total chlorophyll levels were reduced in wt seedlings (dark gray) when treated with . mm torin and were significantly lower in the larp background (light gray) regardless of treatment (n > ). the bar graph shows mean total chlorophyll levels and standard error (***p< . , **p< . , *p< . , n.s. = not significant).

given the growth defect in larp and that larp is proposed to regulate translation in human cells, we next tested whether free amino acid levels are different between larp and wild-type shoots, which could indicate a global defect in protein biosynthesis ( figure f ).
we found virtually no differences between larp and wild-type plants, however, except that alanine accumulated to % higher levels than in larp shoots ( figure f ). we noticed that larp mutants appear relatively chlorotic as seedlings under the conditions described above for global profiling experiments, so we next measured chlorophyll accumulation in larp and wild-type seedlings after supplying them with glucose or glucose and torin to attenuate tor activity ( figure g ). total chlorophyll levels were significantly lower in larp mutants than in wild-type, whether or not the seedlings were treated with torin ( figure g ). in wild-type seedlings, treatment with torin for hr caused a slight but significant decrease in chlorophyll levels ( figure g ). there was no significant difference in chlorophyll levels between treatments in larp seedlings, however ( figure g ). to determine how larp promotes growth and test whether larp genetically interacts with the glucose-tor signaling pathway, we conducted all of the global profiling experiments described above ( figure a) using larp mutants and otherwise identical treatments. rna-seq revealed that several processes are disrupted in larp mutants compared to wild-type in glucose-supplied seedlings ( figure a ). most prominently, in larp , we observed significant repression of genes involved in photosynthesis, especially genes encoding components of the photosynthetic electron transport chain (petc) complexes ( figure a , supplementary file ). repression of photosynthesis-associated nuclear gene expression is consistent with our observation that chlorophyll levels were significantly lower in these larp seedlings than in wild-type ( figure g ). global proteomic analysis confirmed that photosynthesis-associated proteins are less abundant in larp mutants ( figure b , supplementary file ). 
in addition to photosynthesis-associated transcripts, e ubiquitin ligase mrnas, biotic stress-response transcripts, and cell wall glycoprotein mrnas (including those that encode arabinogalactan proteins) were also repressed in larp ( figure a ). several biological processes are transcriptionally induced in larp compared to wild-type, including genes involved in glucosinolate biosynthesis, hsp /hsp chaperones, microtubule cytoskeleton, lipid degradation, and lipid body-associated genes (oleosins and caleosins), flavonoid biosynthesis, and cell-cycle genes (especially those involved in cell division) ( figure a ). in response to torin , larp mutants showed largely similar responses as wild-type seedlings, including extensive repression of cytosolic ribosome biogenesis, nucleocytoplasmic trafficking, cell wall biosynthesis, and cell-cycle progression, as well as broad induction of genes involved in protein catabolism (including autophagy and e ubiquitin ligases), biotic stress-response genes, protein chaperoning, and receptor-like kinases ( figure d , supplementary file ). even in glucose-supplied seedlings, larp mutants showed several differences in te compared to wild-type ( figure c , supplementary file ), demonstrating that larp contributes to plant physiology even when tor is active. we observed at least a two-fold change in te for transcripts in larp compared to wild-type when seedlings were supplied glucose ( figure c ). the te of several categories of mrnas was relatively repressed in larp mutants, including mrnas that encode cytoskeletal kinesins, proteins involved in chromatin modifications and organization, and proteins involved in translation ( figure c ). inhibiting tor with torin impacted the te of transcripts in the larp background compared to glucose-supplied larp seedlings ( figure e , supplementary file ). as in wild-type, mapman analysis showed that translation of cytosolic rp mrnas was broadly repressed by torin in larp . 
overall, these results demonstrate that larp is not required for all effects of torin on mrna translation in plants, suggesting that other proteins may exert translational control in response to tor deactivation.

figure . larp promotes tor signaling and activity in growing arabidopsis seedlings. (a) multiple categories of transcripts accumulated to significantly different levels in larp compared to wild-type. most significantly, photosynthesis-associated nuclear gene expression was strongly repressed in larp mutants. here and below, the observed fold-change in mrna levels for each category is shown in box-whisker plots drawn with tukey's method; the middle line represents the median and the dot represents the mean of the fold-changes. p values for each category were

knocking down larp expression can reduce tor activity in some human cell lines (mura et al., ), which led us to investigate whether tor activity is altered in arabidopsis larp mutants by comparing the phosphoproteomes of larp and wild-type plants. under mock conditions, phosphoproteins accumulated to significantly different levels in larp than wild-type seedlings. of these overlapped with the torin -sensitive phosphoproteome, more than twice the number expected (p< À ; at most overlapping phosphoproteins would be expected, p> . ). remarkably, all but one of these phosphoproteins was affected in the same pattern in mock-treated larp compared to mock-treated wild-type as in the torin -treated wild-type compared to mock-treated wild-type ( figure f, supplementary file ). to validate these results, we assayed phosphorylation of the well-established tor substrate, s k-t , in wild-type and larp plants using phosphospecific polyclonal antibodies against s k-pt and monoclonal antibodies against total s k ( figure g ). as predicted by the global phosphoproteomic results, s k-pt /s k ratios are noticeably lower in larp ( figure g).
thus, larp is required to maintain high levels of tor activity in actively growing arabidopsis seedlings, which could contribute to the growth defects we observed.

topscore analysis reveals conserved tor-larp - top signaling axis

in mammals, tor specifically controls the translation of a canonical set of ~ mrnas via the top motif, which is present in all cytosolic rp mrnas and a few other transcripts involved in translation initiation and elongation (berman et al., ; jefferies et al., ). in plants, however, it has been reported that rp mrnas do not have top or pyrimidine-enriched motifs, although, to our knowledge, there has been no comprehensive effort to define plant top mrnas. we sought to annotate arabidopsis transcripts to identify likely top mrnas using a modified version of the recently-described topscore approach (philippe et al., ; figure a ). topscores are calculated using quantitative transcription start site-sequencing (tss-seq), which provides single-nucleotide resolution of mrna ends genome-wide ( figure a ). every tss-seq read in the leader of an mrna is scored by whether it is part of an oligopyrimidine tract: all ends starting with purines are scored as , and ends starting with a pyrimidine are scored as one plus the distance to the next purine in the leader (thus, for example, a tss-seq read that starts with a pyrimidine followed by four consecutive pyrimidines and then a purine is scored as ). the sum of these scored reads is then divided by the total number of tss-seq reads that mapped to the leader. we took advantage of published paired-end analysis of tss (peat) tss-seq data (morton et al., ) to quantify topscores in arabidopsis ( figure ). using stringent parameters to ensure high-quality results, we calculated topscores for transcripts encoded by genes (supplementary file ). the median topscore in this dataset is . , with % of topscores between . and . ( figure e ).
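the scoring rule described above can be sketched in python. this is a minimal illustration, not the authors' pipeline: the function names (`read_score`, `topscore`), the input format (a leader sequence plus a mapping from read start offset to read count), and the handling of a pyrimidine run that reaches the end of the leader are all assumptions made here. it also assumes that reads starting with a purine score zero and that a pyrimidine-initial read scores the length of its uninterrupted pyrimidine run, so that the worked example in the text (one pyrimidine followed by four more, then a purine) scores one plus four.

```python
from typing import Mapping

# Pyrimidines in DNA or RNA notation (assumption: input uses A/C/G/T/U letters).
PYRIMIDINES = frozenset("CTU")


def read_score(leader: str, start: int) -> int:
    """Score one TSS-seq read that begins at 0-based offset `start`.

    Returns 0 if the first base is a purine; otherwise returns the length
    of the uninterrupted pyrimidine run beginning at `start`
    (one plus the number of pyrimidines before the next purine).
    """
    run = 0
    for base in leader[start:].upper():
        if base not in PYRIMIDINES:
            break
        run += 1
    return run


def topscore(leader: str, tss_counts: Mapping[int, int]) -> float:
    """Read-weighted mean of per-read scores across one leader.

    `tss_counts` maps each read start offset to the number of reads
    observed at that offset (hypothetical input format).
    """
    total_reads = sum(tss_counts.values())
    if total_reads == 0:
        raise ValueError("no TSS-seq reads mapped to this leader")
    weighted = sum(read_score(leader, pos) * n for pos, n in tss_counts.items())
    return weighted / total_reads


if __name__ == "__main__":
    leader = "CTTTCAGG"                 # toy leader, not a real transcript
    print(read_score(leader, 0))        # pyrimidine + four pyrimidines, then purine -> 5
    print(read_score(leader, 5))        # read starts on a purine -> 0
    print(topscore(leader, {0: 3, 5: 1}))  # (3*5 + 1*0) / 4 -> 3.75
```

because the score is a read-weighted average, a transcript whose dominant start site sits in a long pyrimidine tract scores high even if minor start sites begin with purines.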
we next used mapman to conduct wilcoxon rank-sum analyses to determine if gene topscores broadly correlate with any biological functions ( figure e ). we found that genes involved in rna biosynthesis, especially transcription factors in the zn finger c c superfamily, have higher topscores. genes encoding proteins involved in nucleocytoplasmic transport, including importin/karyopherins, nucleoporins, and ran gtpases, also have significantly higher topscores. genes involved in cell wall remodeling and biochemistry, however, have significantly lower topscores. glycosyltransferase genes, including udp-glucosyl transferase (ugt) and xyloglucan endotransglucosylase/transferase (xth) genes, have remarkably low topscores, as do genes involved in lignin biosynthesis ( figure e ). next, we tested whether topscores correlate with changes in te in response to tor inhibition. the median topscore for all genes expressed across ribo-seq experiments (with rpkm ≥ ) is . ( figure f ), whereas the median topscore of genes that are specifically translationally repressed upon torin treatment (at least two-fold decrease in relative translational efficiency) is . ( figure f ), significantly higher (mann-whitney u test, p< . ). this result demonstrates that inhibiting tor represses translation of mrnas with high topscores. strikingly, this effect is dependent on larp : in larp mutants, the median topscore of genes that are specifically translationally repressed upon torin treatment is only . ( figure f ), significantly lower than the median topscore for all genes (mann-whitney u test, p< . ). thus, in aggregate, larp is required to repress top mrna translation in response to tor inhibition in plants. we therefore propose that the tor-larp - top signaling axis is conserved between plants and mammals.

(figure g), global quantitative proteomics revealed that most photosynthesis-related protein levels were lower in larp mutants. the color-coded dots represent proteins that were detected at significantly higher (blue) or lower (red) levels in larp compared to wt, with representative proteins labeled. (c) many transcripts show differences in te in larp compared to wild-type, as depicted in box-whisker plots (drawn as in a) showing significantly-affected mapman categories with p values colored as in panel e. (d) torin treatment in the larp background affects the mrna levels of multiple functional categories, mostly similar to the effects of torin on mrna levels in wild-type plants ( figure c), including repression of protein biosynthesis, cell-cycle progression, and protein sorting, alongside induction of stress responses and protein degradation pathways. (e) the te of cytosolic ribosomal mrnas is significantly repressed by torin treatment in larp mutants. (f) tor activity is reduced in larp mutants. observed differences in phosphoprotein abundance in mock-treated larp compared to wild-type (y axis) are mapped against torin -treated wt versus mock-treated wt (x axis); phosphoproteins that accumulated to significantly different levels (p< . ) in both comparisons are highlighted in red and blue. the strong positive correlation shown suggests that the tor signaling network is less active in larp mutants than in wt. (g) phosphorylation of the direct tor substrate, s k-pt , is drastically reduced in larp mutants compared to wt, confirming the results shown in panel f. the densitometry ratio between the levels of s k-pt and total s k is shown above. this experiment was repeated three times with consistent results.

box-whisker plots show the distribution of topscores for each category. topscores in each category were compared using mann-whitney u tests; *p< . , **p< . . (f) in wild-type seedlings, topscores for transcripts that were translationally repressed by torin treatment are significantly higher than the distribution of topscores transcriptome-wide and significantly higher than the topscores for transcripts that were translationally repressed by torin treatment in larp mutants (mann-whitney u, *p< . , **p< . ). there was no statistically significant difference in topscore distributions between the whole transcriptome and the set of transcripts that were translationally repressed in larp . mrnas with significantly lower steady-state levels in larp mutants compared to wt also had slightly but statistically significantly higher topscores than the distribution of all topscores and the topscores of mrnas with significantly higher steady-state levels in larp mutants compared to wt. (g) high-confidence top mrnas in arabidopsis participate in diverse biological functions, including rna metabolism, protein metabolism, cell-cycle regulation, and subcellular trafficking. the online version of this article includes the following source data for figure :

given that larp 's primary assigned role in mammals is to regulate translation of rp mrnas, but the universal rp mrna top motifs only recently evolved in animals, we next sought to elucidate the possible ancestral functions of tor-larp - top signaling by identifying top mrnas that are shared in both arabidopsis and humans. we used three criteria to define high-confidence top mrnas: (i) topscores in the top th percentile, (ii) decreased te in response to torin treatment, (iii) little or no effect of torin on te in larp mutants. using these criteria, we identified several arabidopsis top mrnas that are homologous to top mrnas in mammals ( figure -source data ). these include mrnas that encode proteins involved in translation, such as polya binding proteins (pabps) and eukaryotic elongation factors eef b/d and v subunits.
additionally, mrnas that encode importins/karyopherins, the translationally-controlled tumor protein (tctp ), and heterogeneous nuclear ribonucleoproteins (hnrnps) are all top mrnas in both humans and arabidopsis. finally, some mrnas that encode cytosolic rps, such as rack , us , es , us , es , and ul , are also top mrnas in both humans and arabidopsis. in the core set of human top mrnas recently defined by philippe et al., , there are cytosolic rp mrnas and other top mrnas; excluding the rp mrnas, over one-third of these ( / ) are conserved top mrnas in arabidopsis. strikingly, the importance of regulating the expression of these non-ribosomal genes by tor remains largely unstudied. for example, we found that importin/karyopherin mrnas are translationally regulated by the tor-larp - top motif signaling axis in plants and humans. importins/karyopherins selectively traffic proteins with importin recognition motifs (nuclear localization signals) from the cytosol to the nucleus; in humans, ipo and ipo are core top mrnas, and in arabidopsis, imb is a high-confidence top mrna (topscore: . ; Δte (wt + t / wt) = − . -fold, Δte (larp + t / larp ) = + . -fold) and the median topscore for all importins/karyopherins is . ( figure e ), suggesting that others may also be subject to translational regulation by tor-larp - top signaling. in humans, both ipo and ipo are responsible for importing rps to the nucleus; ribosome subunits are translated in the cytosol, but then must be transported to the nucleolus for assembly before ribosome large and small subunits are exported to the cytosol. in arabidopsis, the importin imb (also known as karyopherin enabling the transport of the cytoplasmic hyl or ketch ) similarly promotes ribosome biogenesis by carrying rp subunits to the nucleus. the imb transcript has a topscore of . and was slightly translationally repressed by torin treatment in wild-type (Δte = − . -fold) but not in larp .
an attractive hypothesis, therefore, is that tor-larp - top signaling coordinates translation of importins to support ribosome biogenesis by driving cytosolic rp translocation to the nucleus for assembly.

we then considered whether comparative analysis of top mrnas could identify previously uncharacterized top mrnas in humans. for example, we found that several mrnas that encode mitochondrial rps (bs m, ul m, and us m) are regulated by tor-larp - top signaling, but none were identified as top mrnas in humans. upon reanalysis, however, we discovered that some human mitochondrial rp genes are also likely regulated by tor-larp - top signaling. hsmrps has a high topscore ( nd percentile), is translationally repressed in cells treated with torin ( . -fold, p adj = . ), and is not translationally repressed in larp cells treated with torin ( . fold higher te in response to torin in larp double knockouts, p adj = . ) (philippe et al., ). similarly, hsmrps has a high topscore ( th percentile), is translationally repressed in cells treated with torin ( . -fold, p adj = . ), and is not translationally repressed in larp cells treated with torin ( . -fold higher te). neither hsmrps nor hsmrps were initially identified as top mrnas because they did not meet the extremely stringent statistical criteria for top mrna designation. through comparative, evolutionary analysis of tor-larp - top mrna signaling, however, we submit that hsmrps and hsmrps may also be top mrnas.

source data . tor-larp - top signaling in arabidopsis seedlings regulates translation of mrnas that encode deeply conserved eukaryotic proteins, plant lineage-specific proteins, and diverse proteins involved in ribosome biogenesis.

we next focused on newly-identified top mrnas that have not been previously described in the mammalian literature on tor-larp - top signaling (supplementary file ).
many of these genes are involved in plant-specific pathways ( figure -source data ) , such as cell wall biosynthesis (uuat , gatl ), phytohormone signaling (gid a, pin , iaa , bg ), and chloroplast physiology (nfu ). others are genes only found in plant lineages, such as blister, which encodes a polycomb group-associated protein; peapod , which encodes a tify transcription factor that regulates leaf development; and octopus-like , a plasma membrane protein likely involved in patterning. the majority of newly-identified top mrnas in arabidopsis encode broadly conserved genes, however, that participate in a number of biological processes, including chromatin structure and remodeling, ribosome biogenesis, rna interference, and vesicle trafficking, among others (supplementary file ) . several of the plant-specific top mrnas are involved in auxin signaling. pin , which encodes a putative auxin efflux carrier, iaa , which encodes an auxin-responsive transcriptional regulator, and big grain , a protein of unknown function linked to auxin signaling, were all translationally repressed by torin treatment in wild-type but not in larp , and all have high topscores ( . , . , and . , respectively). using less stringent parameters, we considered whether other genes involved in auxin signaling could be regulated by tor-larp - top signaling. five pin genes were expressed in our experiments (pin , pin , pin , pin , and pin ), and with the exception of pin , all have topscores over . (the top th percentile). these may also be regulated by the tor-larp signaling axis; for example, pin transcripts have a topscore of . , had . -fold reduced te in wild-type seedlings treated with torin compared to controls, and . -fold higher te in larp seedlings treated with torin compared to controls. although only a handful of arabidopsis cytosolic rp gene mrnas have top motifs, many genes involved in diverse steps of ribosome biogenesis are regulated by tor-larp - top signaling. 
among the high-confidence top mrnas, rrn , efg l, eh , swa , rrp l, nap , and wdsof l participate in crucial steps of rrna synthesis and maturation; ebp contributes to s ribosomal subunit assembly; and es b and es a are part of the s ribosomal subunit. this finding leads us to speculate that tor-larp - top signaling regulated ribosome biogenesis in the last eukaryotic ancestor of animals and plants, and that the direct control of rp translation by top motifs to coordinate ribosome biogenesis evolved later in an ancestor of vertebrates.

alternative tsss may modulate tor-larp - top regulation

tsss can vary with developmental or physiological context (benyajati et al., ; kurihara et al., ; rojas-duran and gilbert, ; young et al., ). as a result, in humans, some mrnas only have top motifs in specific cell types, allowing for tunable regulation of gene expression by the tor-larp signaling axis (philippe et al., ). currently, there are not sufficient publicly available tss-seq data in plants to determine whether alternative tsss fine-tune tor-larp - top regulation genome-wide, but upon close analysis of the core top mrnas defined here, we did find that several genes that encode top mrnas have two, apparently distinct tsss. pin , for example, has two distinct tss peaks, although both the longer and shorter predicted leaders have high topscores ( . and . , respectively). pabp has two strong tss peaks with approximately equal coverage in the arabidopsis root tss-seq dataset ( figure b and d): the longer predicted leader has a high topscore ( . ), but the shorter predicted leader has a near-median topscore ( . ). therefore, it is possible that, like humans, plants modulate which genes encode top mrnas in response to developmental or physiological cues. future investigations of plant tsss and larp -dependent translation will be needed to thoroughly test this hypothesis genome-wide.
conservation, adaptation, and exaptation of eukaryotic tor-larp - top signaling

tor and several of its interactors are conserved across eukaryotes, but many distinct upstream regulators and downstream effectors of tor have evolved in different lineages (chantranupong et al., ; shi et al., ). using parallel global profiling methods for unbiased screening, we sought to uncover both new and conserved components of the tor signaling network in arabidopsis thaliana seedlings. from these screens, we identified thousands of genes that are regulated transcriptionally, translationally, or post-translationally by tor. by comparing these results to distantly-related eukaryotes (e.g. yeast [huber et al., ] and humans [hsu et al., ; yu et al., ]) and to orthogonal experimental approaches in arabidopsis (e.g. cell suspension culture proteomics ), we defined a core set of bona fide tor-sensitive phosphoproteins in arabidopsis, including larp . larp is an rna-binding protein that is highly conserved in animals, plants, and most fungi (deragon, ). to elucidate how larp acts downstream of tor, we repeated our global profiling experiments in larp mutants. analysis of these results revealed that tor-larp signaling controls the translation of a specific set of mrnas that begin with a top motif, demonstrating that tor-larp - top signaling is an ancestral function of the tor signaling network in eukaryotes. the precise mechanisms of tor-larp signaling have been the subject of considerable recent controversy (berman et al., ), which we sought to partially address by providing an evolutionary perspective. the typical biomedical model for evolutionary comparison, s. cerevisiae, is unusual among eukaryotes because its genome does not include a complete larp orthologue (deragon, ), which has limited progress on understanding the tor-larp pathway.
within mammalian model systems, investigations of larp have focused primarily on the role of larp in regulating cytosolic rp mrna translation and/or stability, because virtually all vertebrate and many invertebrate cytosolic rp mrnas begin with a top motif parry et al., ) . plant and fungal cytosolic rp mrnas do not all start with a top motif, but larp is broadly conserved in these lineages, suggesting that the ancestral function of larp might be entirely different from its current role in human cells. our study revealed, however, that larp does control the translation of top mrnas in plants, and we discovered that humans and plants share a core set of deeply conserved eukaryotic top mrnas. moreover, while only a handful of rp mrnas begin with top motifs in arabidopsis, several other steps of ribosome biogenesis, including rdna transcription, rrna processing, and ribosome complex assembly, are controlled by the tor-larp - top signaling axis. therefore, we speculate that the animal lineage adapted tor-larp - top signaling to directly coordinate expression of rp mrnas, a fine-tuning of this preexisting ribosome biogenesis regulatory mechanism that helps to ensure that rp synthesis is exquisitely synchronized (brunkard, ; meyuhas and kahan, ) . the evidence we present here further suggests that the broader role of tor-larp - top signaling in coordinating nutrient availability with ribosome biogenesis was already present in the ancestors of plants and animals. the core eukaryotic top mrnas identified by our analysis encode eukaryotic elongation factor b subunits, poly(a)-binding proteins, cytosolic rps, cyclins, importins, and a transmembrane protein of poorly-defined molecular function called ergic . our discovery that these transcripts have likely been consistent effectors of the tor-larp - top signaling axis suggests that their coordinated expression with metabolic status sensed by tor is adaptively important for eukaryotic cell biology. 
the significance of tor in controlling mrna translation, and thus a putative function for tor in regulating expression of eef b subunits, pabps, and rps, has been understood for some time (hara et al., ): tor is stimulated by amino acid levels and in turn promotes amino acid consumption in protein biosynthesis. similarly, across eukaryotes, tor acts as a gatekeeper for cell-cycle progression, only permitting the g /s phase transition when sufficient nutrients are available to support cell division (brown et al., ; kunz et al., ), thus providing a good hypothesis for the regulation of cyclin expression by tor. it is less immediately clear why tor-larp - top signaling coordinates translation of importins and ergic . several studies in plants and animals have shown that reducing importin levels can restrict ribosome biogenesis by limiting the transport of newly-translated rps from the cytosol to the nucleolus for ribosome assembly (chou et al., ; golomb et al., ; jäkel and görlich, ; xiong et al., ). the importins encoded by top mrnas, ipo and ipo , are both specifically responsible for nuclear import of several rp subunits and are implicated as key contributors to cancer cell proliferation, tumorigenicity, and regulation of the p oncogenic pathway (çağatay and chook, ; golomb et al., ; zhang et al., ). therefore, we hypothesize that importins are translationally regulated by tor-larp - top signaling as an additional regulatory step to promote post-translational ribosome assembly specifically when cells can metabolically sustain translation. like importins, ergic is also strongly implicated in several cancers (hong et al., ; lin et al., ; wu et al., ), but its molecular function has not been thoroughly investigated. in mammalian cells, ergic cycles between the er and golgi membranes, where it contributes to anterograde and/or retrograde secretory transport (breuza et al., ; orci et al., ; yoo et al., ).
ergic is orthologous to yeast erv , which forms a retrograde receptor complex that is required for efficient localization of er resident proteins that do not have the canonical c-terminal hdel sequence (shibuya et al., ). to our knowledge, ergic has not been directly studied in plants to date. deeper understanding of the molecular function of ergic may reveal why it is subject to metabolic regulation by tor-larp - top signaling.

recently, we proposed that the tor metabolic signaling network evolves through exaptation by coopting existing pathways to serve new functions relevant to specific lineages (brunkard, ). for example, tor gained new functions when eukaryotic lineages evolved multicellularity, such as coordinating plasmodesmatal (intercellular) transport in plants or regulating cellular differentiation and cell type-specific metabolism in humans (brunkard, ; kosillo et al., ). similarly, here, we provide evidence that plants exapted the tor-larp - top signaling axis to regulate translation of proteins involved in developmental patterning and auxin signaling (figure g), pathways that did not exist in the unicellular ancestor of plants and animals that first evolved tor-larp - top signaling. in contrast, we argue that the universal top motif found in all vertebrate cytosolic rp mrnas is an example of adaptation. tor-larp - top signaling evolved before the divergence of plants and animals to coordinate multiple processes, including ribosome biogenesis (figure ). larp was subsequently lost in some lineages, such as s. cerevisiae (figure c). in the vertebrate lineage, tor-larp - top signaling adapted to directly control the expression of all rp mrnas (figure c), rather than only indirectly influencing translation of ribosome biogenesis-related proteins. many, but not all, invertebrate rp mrnas begin with top motifs, perhaps reflecting an intermediate 'evolutionary transition'.
our parallel global profiling approach revealed that inactivating tor rapidly represses translation in the chloroplast, and we observed a corresponding significant decrease in chlorophyll levels ( figure g ), suggesting that tor activity is required to maintain chloroplast physiology. in agreement with the glucose-tor activation transcriptome, however, our experiments also show that briefly inactivating tor does not impact the expression of photosynthesis-associated nuclear genes. separate studies have demonstrated that prolonged inhibition of tor activity can strongly repress expression of photosynthesis-associated nuclear genes (phangs). the coordinated expression of the nuclear and chloroplast genomes has been a major research focus for several decades, with the strong consensus that defective translation of the chloroplast transcriptome triggers retrograde signals that suppress phang expression (brunkard and burch-smith, ; koussevitzky et al., ; susek et al., ; woodson and chory, ) . we speculate, therefore, that inactivating tor first represses translation in the chloroplast, and that this secondarily leads to repression of phang expression via retrograde signaling, a hypothesis that we are currently pursuing. a previous study of ribosome protein abundance in plants with prolonged, mildly attenuated tor expression in stable tor rnai lines found reduced levels of chloroplast ribosome subunits, and argued that this could be due to translational control via pyrimidine-rich elements in the leader of some cytosolic mrnas that encode chloroplast rps (dobrenel et al., ) . we did not observe any clear effects of torin treatment on the te of nuclear-encoded chloroplast rp mrnas, however, indicating that tor can control chloroplast genome expression through other mechanisms. early studies of torc in yeast and mammalian cells showed that torc can control actin filamentation and cytoskeletal dynamics. 
several mechanisms have been proposed to link torc to actin filamentation, but this remains a relatively poorly studied area in the field. recent studies in plants revealed that genetically disrupting amino acid metabolism can increase tor activity and increase actin bundling, leading to diverse morphological defects. here, we found that tor controls the phosphorylation of critical actin-associated proteins, vln and vln . in humans, elevated tor activity promotes transcription of gelsolin genes (nie et al., ), which are orthologous to arabidopsis villin genes. gelsolin hyperaccumulation and es hyperphosphorylation are, in fact, specific clinical markers of tuberous sclerosis tumors (onda et al., ), which are the result of tor hyperactivation. functional studies to determine how phosphorylation of vln and vln downstream from tor impacts actin bundling could illuminate a new connection between tor and the cytoskeleton in eukaryotes.

figure caption (partially recovered): comparative global profiling in arabidopsis and humans presented here revealed a set of 'core' top mrnas that are regulated by the tor-larp - top signaling axis. (c) we propose that tor-larp - top signaling evolved an early role in regulating ribosome biogenesis in the last common ancestor of plants and animals by controlling translation of ribosome biogenesis (ribi)-associated mrnas. subsequently, plants exapted this signaling axis to regulate the expression of other mrnas that encode, for example, proteins involved in hormone signaling and developmental patterning. animal ancestors adapted the tor-larp - top pathway to directly control expression of ribosomal protein mrnas themselves, rather than diverse upstream ribi mrnas. other lineages, including a recent ancestor of saccharomyces cerevisiae, lost the larp pathway, and presumably coordinate ribosome biogenesis downstream of tor through other mechanisms.
in addition to vln /vln phosphorylation, other pathways may contribute to tor-mediated control of cytoskeletal dynamics in plant cells. for example, cct b, a subunit of the chaperonin containing tcp (cct) complex (also known as tric), is a top mrna, and several other cct mrnas only slightly missed our stringent criteria for defining top mrnas (e.g., cct has a topscore of . , was relatively translationally repressed . -fold by torin in wild-type, and was not relatively translationally repressed by torin in larp ). the cct complex promotes assembly of many proteins, including raptor and lst (cuéllar et al., ), but is most famously associated with assembly of actin and tubulin subunits (balchin et al., ; dekker et al., ; yam et al., ). in mammals, tor regulates cct complex function by promoting phosphorylation of the cct complex (abe et al., ), indicating that metabolic regulation of cct complex activity by tor may occur through multiple signal transduction pathways in different eukaryotic lineages. actin depolymerizing factor (adf ), part of the cofilin family of actin destabilizing proteins, is also encoded by a top mrna, providing another possible connection between tor and actin filamentation (bernstein and bamburg, ).

in a ground-breaking study of plant tor dynamics, xiong et al. found that the e fa transcription factor, which promotes the g /s cell-cycle transition, is likely a direct substrate of tor in arabidopsis and is required for full activation of the root meristem in response to glucose-tor signaling. in our experimental system, torin also transcriptionally repressed expression of many of the e fa targets, including origin recognition complex second longest subunit (orc ), minichromosome (mcm ), prolifera (mcm ), histone . (htr ), and proliferating cellular nuclear antigen (pcna ), among others. e f transcription factors are regulated by the cyclin (cyc)-cyclin-dependent kinase (cdk)-retinoblastoma-related (rbr) pathway during cell-cycle progression.
in this canonical pathway, d-type cyclins bind to and activate cyclin-dependent kinases that phosphorylate and thus inactivate retinoblastoma-related rbr , derepressing e f transcription factors that drive the transcriptional program of the g /s phase cell-cycle transition (ach et al., ; choi and anders, ; ebel et al., ; huntley et al., ; serrano et al., ; xie et al., ). in addition to the glucose-tor-e fa pathway, a recent investigation identified multiple recessive alleles of yak in a screen for mutants resistant to root growth arrest after treatment with tor inhibitors (forzani et al., ). subsequent analyses indicated that yak is required for at least two responses to tor inhibitors: repressing the expression of cyclins and inducing the expression of the siamese-related (smr) family of cdk inhibitors (forzani et al., ). we found that two critical d-type cyclins, cycd ;1 and cycd ; , are translationally regulated by tor-larp - top signaling. in humans, several cyclins are translationally regulated by tor, but through at least two distinct molecular pathways. human ccng , a member of the atypical g-type cyclin family that evolved in animals, is encoded by a core top mrna and is clearly regulated by tor-larp - top signaling (philippe et al., ). unlike typical cyclins, ccng is understood to function primarily in coordinating the vertebrate-specific pp a-mdm -p pathway that controls stress-responsive cell-cycle arrest (bennin et al., ; gordon et al., ; okamoto et al., ; russell et al., ). human ccnd , which encodes a member of the d-type cyclin family that promotes the g to s phase cell-cycle transition, is also translationally promoted when tor is active, but apparently through larp -independent mechanisms, including the tor- ebp-eif e signaling axis (averous et al., ; musgrove, ).
in arabidopsis, in addition to cycd ;1 and cycd ; , which are two clear examples of top mrnas, we found that cyclin mrnas have significantly higher topscores than other transcripts (median = . , mean = . , n = , mann-whitney u test, p = . ), suggesting that other cyclin mrnas may also be regulated by the tor-larp - top signaling axis in some contexts. therefore, we propose that tor translationally controls expression of cyclins to promote cell-cycle progression, in addition to regulation of yak upstream (forzani et al., ) and e fa downstream of cyclin-cdk-rbr signaling. adding to this complex network, we found that regulators of cell division and cyclin expression that are involved in developmental patterning, including the transcription factor peapod (ppd ) (baekelandt et al., ; white, ) and several proteins involved in auxin signaling, which has previously been reported to act upstream of tor (beltrán-peña et al., ; chen et al., a; li et al., b; schepetilnikov et al., ; turck et al., ), are encoded by tor-larp - top-regulated mrnas. ongoing investigations of the role of tor in cell-cycle regulation could elucidate the relative contributions of the transcriptional, translational, and post-translational regulatory steps in this multilayered tor signaling network (ahmad et al., ; lokdarshi et al., ).

in this report, we showed that tor, the master regulator of eukaryotic metabolism, coordinates mrna translation in plants through diverse mechanisms at transcriptional, translational, and post-translational levels. focusing on one of these mechanisms, we demonstrated that tor specifically controls the translation of a distinct subset of mrnas that begin with a top motif that is recognized by the putative tor substrate, larp , identified in our tor-sensitive phosphoproteomic screen.
rigorous phenotypic analysis and global profiling experiments in larp mutants revealed that, although larp is not absolutely essential for plant development under standard physiological conditions, larp is required to maintain tor homeostasis in plants and to support wild-type growth rates (figure ) . our studies elucidate conserved transcripts that are translationally controlled by tor-larp - top signaling in both humans and arabidopsis. unexpectedly, although top motifs are most famously associated with cytosolic rp mrnas in the vertebrate lineage, we found that the conserved eukaryotic top mrnas instead encode other genes involved in the regulation of translation, ribosome biogenesis, and subcellular translocation. these evolutionary insights may prove useful for ongoing investigations of the role of larp in cancers, genetic disorders, and infection by viruses. for sterile culture experiments, arabidopsis thaliana wild-type (col- ) and larp , previously called larp - (merret et al., ) , a homozygous t-dna insertion line (salk_ ) in the larp gene (at g ), seeds were grown in a plant growth chamber maintained at ˚c, % humidity, and mmol photons m À s À photosynthetically-active radiation with a hr light/ hr dark diurnal cycle. for the seedling treatments, surface-sterilized seeds of wild-type or larp were plated in one well of a six-well plate containing ml of half-strength ms liquid media. after three days, the media were replaced with half-strength ms liquid media plus mm glucose and incubated for hr, followed by replacement with half-strength ms media plus mm glucose or half-strength ms media plus mm glucose and . mm torin . after hr of incubation with the different treatments, the tissues were collected and frozen in liquid nitrogen. for the root length and leaf size measurements, wt and larp seeds were plated in square petri dishes containing half-strength ms-agar media. 
after days, the plants were dissected and photographed to image root length and leaf sizes; measurements were made from images using imagej. for the leaf size analysis, wt and larp plants were grown in soil for weeks. the plants were dissected, pictures were taken, and leaf size was measured using imagej. at least quiescent seedlings were collected and flash frozen in liquid nitrogen. the tissues (~ . g) were pulverized using a mortar and pestle and resuspended in ml of ice-cold polysome extraction buffer as described in hsu et al., . the polysome extraction buffer contained % (vol/vol) polyoxyethylene ( ) tridecyl ether, % deoxycholic acid, . mm dtt, mg/ml cycloheximide, unit/ml dnase i (epicenter), mm trisÁhcl (ph ), mm kcl, and mm mgcl . the lysate was homogenized by vortexing and incubated on ice with shaking for min. this was then centrifuged at , x g at ˚c for min and the supernatant was transferred to a new microtube and centrifuged again at , x g for min at ˚c. the concentration of rna extracted was determined with the qubit rna hs assay kit (invitrogen). the total rna obtained was split into two samples for rna-seq and ribo-seq experiments: an aliquot of ml sample was saved for rna-seq (at least mg total rna) and an aliquot of ml sample was saved for ribo-seq (at least mg total rna). leftover rna was saved for rt-qpcr analysis. for the library preparation and sequencing we followed established methods (hsu et al., ) . for ribo-seq, a ml aliquot of rna was treated with units of nuclease provided by the artseq/ truseq ribo profile kit (illumina) for an hour at ˚c on a nutator. nuclease digestion was stopped by adding ml superase-in (thermo fisher scientific). size exclusion columns (illustra microspin s- hr columns) were equilibrated with ml of polysome buffer by gravity flow and spun at x g for min. then, ml digested lysate was applied to equilibrated columns and spun at x g for min. 
next, ml % (w/v) sds was added to the elution, and rna greater than nt was isolated following manufacturer's instructions with the zymo rna clean and concentrator kit (zymo research; r ). after checking digestion quality, rna less than nt was isolated following manufacturer's instructions with the zymo rna clean and concentrator kit (zymo research; r ). next, the rrna was depleted using the ribo-zero plant leaf kit (illumina, mrzpl ) according to the artseq/truseq ribo profile kit manual. after rrna depletion, purified rna was separated by % (wt/vol) tbe-urea page, and gel slices from and nt were excised. ribosome footprints were recovered from the excised gel slices following the overnight elution method specified in the kit manual. ribo-seq libraries were constructed according to the artseq/truseq ribo profile kit manual and amplified by cycles of pcr with a barcode incorporated in the primer. the pcr products were gel purified overnight (ingolia et al., ). the libraries were pooled at equal molarity for single-end bp sequencing in an illumina hiseq system.

putative ribosomal footprint sequences were processed as previously described with minor modifications. briefly, we removed adapter sequences and low quality reads using fastp (chen et al., b) and aligned the remaining reads to a subset of arabidopsis genome tair sequences annotated as rrnas, snornas, and trnas with bowtie (langmead and salzberg, ) to remove untranslated rna sequences. the remaining sequences were aligned to a tair genome index generated using the araport gene model annotations using star (dobin et al., ) with the --outFilterMultimapNmax and --outFilterMismatchNmax 3 options.

for rna-seq, the ml aliquot plus ml of % (w/v) sds was purified using the zymo rna clean and concentrator kit (zymo research; r ). the total rna was rrna-depleted with the ribo-zero plant leaf kit (illumina; mrzpl ) following the manufacturer's instructions.
the artseq/truseq ribo profile kit (illumina) was used to construct sequencing libraries; circularized cdna was amplified by cycles of pcr and gel purified overnight (ingolia et al., ). libraries were barcoded, pooled and sequenced in an illumina hiseq system (paired-end bp). rna-seq results were trimmed to remove adapters and filtered for low quality base-calls using fastp (chen et al., b), then aligned to a tair genome index generated using the araport gene model annotations using star (dobin et al., ) with the --outFilterMultimapNmax , --outFilterMismatchNmax , and --alignIntronMax (determined from cheng et al., ) options. for determining differential gene expression, aligned reads were counted using a union-exon approach with featurecounts (liao et al., ), then normalized and compared with deseq (love et al., ).

aligned ribo-seq and rna-seq files were processed by ribotaper v . . (calviello et al., ) using a minimal conda environment (anaconda, ) to solve back-compatibility issues with other software. ribo-seq data were used to create metaprofiles of different sequence lengths, which were visually inspected for bp periodicity. periodic footprint lengths were then used to calculate translational occupancy for all supported reading frames, allowing sequence reads to be counted towards multiple reading frame models. to compare with rna-seq results, we chose the most highly occupied reading frame to represent the locus and divided ribo-seq fpkm by rna-seq fpkm to calculate relative translation efficiency.

at least quiescent seedlings per sample were collected and frozen in liquid nitrogen. the samples were homogenized in a tissuelyser, then lysed in a lysis buffer ( mm hepes, mm nacl and protease and phosphatase inhibitor from cell signal technology) and sonicated on ice. protein concentrations were determined by bca protein assays (thermo). protein lysates ( mg) were reduced with mm tcep and alkylated with mm iodoacetamide prior to trypsin digestion at c overnight.
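As an illustration of the relative translation efficiency calculation described in the sequencing analysis above (ribo-seq fpkm divided by rna-seq fpkm for each locus), the following is a minimal python sketch and not the authors' pipeline. The function names, the `min_rna_fpkm` filter, the gene identifiers, and the log2 comparison between treatments are all assumptions introduced here for illustration.

```python
import math


def translation_efficiency(ribo_fpkm, rna_fpkm, min_rna_fpkm=1.0):
    """Relative translation efficiency (TE) per gene: Ribo-seq FPKM / RNA-seq FPKM.

    Genes missing from the Ribo-seq table, or with RNA-seq FPKM below
    min_rna_fpkm (a hypothetical filter, not taken from the paper), are skipped.
    """
    te = {}
    for gene, rna in rna_fpkm.items():
        ribo = ribo_fpkm.get(gene)
        if ribo is None or rna < min_rna_fpkm:
            continue
        te[gene] = ribo / rna
    return te


def log2_te_change(te_treated, te_control):
    """log2 ratio of TE between two conditions (an assumed way to compare them)."""
    change = {}
    for gene, control in te_control.items():
        treated = te_treated.get(gene)
        if treated is None or treated <= 0 or control <= 0:
            continue
        change[gene] = math.log2(treated / control)
    return change


# Hypothetical FPKM values for two made-up loci.
control_te = translation_efficiency(
    {"geneA": 40.0, "geneB": 10.0}, {"geneA": 20.0, "geneB": 0.5}
)
treated_te = translation_efficiency({"geneA": 10.0}, {"geneA": 20.0})
te_shift = log2_te_change(treated_te, control_te)
```

Dividing by rna-seq fpkm separates changes in ribosome occupancy from changes in transcript abundance, which is why low-abundance transcripts are filtered out in this sketch: small denominators make the ratio unstable.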
digests were acidified with formic acid and subjected to sep-pak c solid phase extraction (waters) and resuspended in ml % acn/ % water. a ml aliquot ( %) of the sample was reserved for global analysis and the remaining ml of sample was subjected to phosphopeptide enrichment.

global proteomics. ml of mm hepes/acn ( %/ %, ph . ) buffer was added to each digested peptide sample. a reference pooled sample, composed of equal amounts of material from all samples, was also generated to link both tmt plexes. all individual and pooled samples were labeled according to the tmt plex reagent kit instructions with the labeling scheme at the end of the report. briefly, tmt reagents were brought to room temperature and dissolved in anhydrous acetonitrile. peptides were labeled by the addition of each label to its respective digested sample. labeling reactions were incubated with shaking for hr at room temperature. reactions were terminated with the addition of hydroxylamine. subsequent labeled digests were combined into a new ml microfuge tube, acidified with formic acid, subjected to sep-pak c solid phase extraction and dried down. the dried peptide mixture was dissolved in ml of mobile phase a ( mm ammonium formate, ph . ). ml of the sample was injected onto a . mm xselect csh c column (waters) equilibrated with % mobile phase b ( mm ammonium formate, % acn). peptides were separated using a gradient similar to that of batth et al., with the following gradient parameters at a flow rate of . ml/min. peptide fractions were collected corresponding to min each. pooled samples were generated by concatenation (yang et al., ) in which every th fraction (i.e., , , , , , ; six fractions total) was combined. the pooled samples were acidified, dried down and resuspended with ml % acn, . % fa. phosphopeptide enrichment was performed using the magresyn titanium dioxide (tio ) functional magnetic microparticles (resyn biosciences) following the vendor's protocol.
briefly, dried peptides were reconstituted in ml of loading buffer ( m glycolic acid in % acn and % tfa) and applied to the tio2 beads that were previously equilibrated and washed with loading buffer. after reapplying the sample once, the beads were washed with (1) ml of loading buffer, (2) ml of wash buffer ( % acn in % tfa), and (3) ml of lc-ms grade water. bound peptides were eluted three times with ml of elution buffer ( % nh oh). eluates containing the enriched phosphopeptides were acidified with % fa, then cleaned up with a c tip before tmt labeling. for the global samples, each aliquot was reconstituted with ml of % acn/ . % fa. the phosphopeptide samples were each dissolved in ml of % acn/ . % fa. samples were transferred to autosampler vials for lc-ms analysis. ml was analyzed by lc-ms (hcd for ms/ms) with a dionex rslcnano hplc coupled to a q-exactive (thermo scientific) mass spectrometer using a hr gradient. peptides were resolved using a mm x cm pepmap c column (thermo scientific).

all ms/ms samples were analyzed using proteome discoverer . (thermo scientific). the sequest ht search engine in proteome discoverer was set to search the human database (uniprot.org). the digestion enzyme was set as trypsin. the hcd ms/ms spectra were searched with a fragment ion mass tolerance of . da and a parent ion tolerance of ppm. oxidation of methionine and acetylation of the protein n-terminus were specified as variable modifications (phosphorylation of serine, threonine and tyrosine was added when analyzing phosphoproteome data), while carbamidomethylation of cysteine and tmt labeling at lysine residues or peptide n-termini were specified in proteome discoverer as static modifications. ms/ms-based peptide and protein identifications and quantification results were initially generated in proteome discoverer . and later uploaded to scaffold (version scaffold_ . . proteome software inc, portland, or) for final tmt quantification and data visualization.
normalized and scaled protein/peptide abundance ratios were calculated against the abundance value of the 'reference' (which is the pooled sample).

at least seedlings were pooled per sample and a total of four samples per genotype were analyzed. approximately mg to mg fresh weight of each sample was weighed into an eppendorf tube and frozen in liquid nitrogen. ml of . mm c- and n-labeled amino acid internal standard was then added to each tube, followed by ml extraction solution ( : : water:chloroform:methanol) and two steel balls. these samples were then put onto a tissuelyser to lyse cells and extract free amino acids. samples were centrifuged to pellet tissue and the supernatant was transferred to a fresh eppendorf tube. this extraction procedure was then repeated and combined with the first extraction to ensure thorough extraction of all amino acids. after the second extraction, ml chloroform and ml water were added to each tube, which were then mixed vigorously. debris was separated by centrifugation, the supernatant was collected and dried with a speedvac, and then the pellet was resuspended in . ml % methanol. the methanol-dissolved samples were transferred to a vial for analysis by lc/ms/ms with a velos ion trap, and separation was accomplished using a hilic column combined with a waters uplc instrument.

at least quiescent seedlings per sample were collected and frozen in liquid nitrogen. after grinding the tissue in a bead beater, ml of acetone was added, and the sample was homogenized by vortexing for s at full speed and centrifuged for min at , x g. the supernatants were transferred to new tubes and the extraction was repeated one more time. after combining the supernatants, chlorophyll levels were determined spectrophotometrically. ml of each sample was loaded in ml of cold % acetone, homogenized by vortexing at full speed, and centrifuged for min at , x g.
the supernatants were transferred into a cuvette right before measuring and the od at nm (chlorophyll b), nm (chlorophyll a), and nm (protein/baseline) were determined. the total chlorophyll was calculated by the porra method (porra et al., ).

at least quiescent seedlings per sample were collected and frozen in liquid nitrogen. protein was then extracted from the plant tissue in mm mops (ph . ), mm nacl, % sds, . % β-mercaptoethanol, % glycerin, mm pmsf, and x phosstop phosphatase inhibitor (sigma-aldrich). s k-pt was detected by western blot using a phosphospecific antibody (ab , abcam) and an hrp-conjugated goat anti-rabbit igg secondary antibody (jackson immuno research, no. - - ). s k levels were detected by western blot using a custom monoclonal antibody described in busche et al., . total protein was visualized after transfer using ponceau s red staining. western blot images were scanned, converted to grayscale, and adjusted for contrast and brightness using imagej.

to calculate topscores, peat reads from morton et al., were mapped to utr sequences annotated in araport using bedtools at galaxy. any site with fewer than reads was excluded from further analysis. reads were then scored using the topscore method (philippe et al., ), and scores were divided by the total number of reads that mapped to the annotated utrs (figure a).

col- seedlings were grown as described above and flash-frozen in liquid nitrogen. rna was extracted with the spectrum plant total rna kit (sigma) following manufacturer's instructions. . mg total rna was treated first with calf intestinal phosphatase to dephosphorylate truncated and non-mrna ends, then with tobacco acid pyrophosphatase to decap mrnas, ligated with a generacer rna oligo with t rna ligase, and finally reverse transcribed with superscript iii and oligo dt primer, all using the generacer kit following manufacturer's instructions (invitrogen).
to amplify ends of specific mrnas ( figure b ), gene-specific oligos were used along with the generacer primer (invitrogen) for pcr. the amplified pcr products were then purified and cloned using the zero blunt topo cloning kit (invitrogen). plasmids from to separate clones were extracted and sequenced using sanger sequencing for each gene shown in figure b . the consensus leader sequence is shown. mapman (thimm et al., ) was used to analyze global profiling experiments to identify significantly-affected biological processes. mapman uses mann-whitney u tests to identify overrepresentation of gene ontologies; the stringent benjamini-yekutieli method was applied throughout to correct for false positives. . supplementary file . transcriptome level changes in wt+torin treated seedlings compared to wt. table presents the fold-change of mrna levels of the genes from the different categories, as determined by mapman analysis. these data are represented as box-whisker plots in figure b and figure c . . supplementary file . translational efficiency level changes in wt+torin treated seedlings compared to wt. table presents the fold-change in translation efficiency levels of the genes from the different categories, as determined by mapman analysis. these data are represented as a box-whisker plot in figure d . . supplementary file . proteomic analysis of cytosolic rp abundance changes and phosphoprotein abundance changes in wt+torin treated seedlings compared to wt. table presents the foldchange of cytosolic rp abundance and the fold-change of phosphoprotein abundance of the genes with statistically-significant changes. these data are represented as scatterplots in figure e and figure a . . supplementary file . transcriptome level changes in larp seedlings compared to wt. table presents the fold-change of mrna levels of the genes from the different categories, as determined by mapman analysis. these data are represented as a box-whisker plot in figure a . . 
supplementary file . proteomic analysis of photosynthesis protein abundance changes in larp seedlings compared to wt and correlation analysis of phosphoprotein abundance changes in wt +torin treated seedlings compared to wt and larp seedlings compared to wt. table presents the fold-change of photosynthesis protein abundance of the genes with statistically-significant changes, the fold-change of phosphoprotein abundance with statistically-significant changes in wt +t compared to wt, and the fold-change of phosphoprotein abundance with statistically-significant changes in larp compared to wt. these data are represented as scatterplots in figure b and figure f . . supplementary file . translational efficiency level changes in larp seedlings compared to wt. table presents the fold-change in translation efficiency levels of the genes from the different categories, as determined by mapman analysis. these data are represented as a box-whisker plot in figure c . . supplementary file . transcriptome level changes in larp +torin treated seedlings compared to larp . table presents the fold-change of mrna levels of the genes from the different categories, as determined by mapman analysis. these data are represented as a box-whisker plot in figure d . . supplementary file . translational efficiency level changes in larp +torin treated seedlings compared to larp . table presents the fold-change in translation efficiency levels of the genes from the different categories, as determined by mapman analysis. these data are represented as a box-whisker plot in figure e . . supplementary file . arabidopsis top mrnas. 
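The methods state that the mapman enrichment p-values (from mann-whitney u tests) were corrected with the benjamini-yekutieli method. The following python sketch shows how that correction works on a list of p-values; it is an illustration under stated assumptions, not mapman's own implementation. The function name and the example p-values are hypothetical.

```python
def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order.

    For m tests, the p-value with ascending rank i is scaled by
    m * c(m) / i, where c(m) = 1 + 1/2 + ... + 1/m, and the results are
    made monotone by taking a running minimum from the largest rank down.
    """
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / k for k in range(1, m + 1))
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m * c_m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four functional categories.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_yekutieli(raw_p)
```

The extra factor c(m) is what makes this procedure more conservative than benjamini-hochberg: it keeps the false discovery rate controlled even when the tests are dependent, as overlapping gene categories can be.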
mrs and job were supported by nih grant dp -od to j.o.b.
sl was supported by an nsf postdoctoral research fellowship (ios- ). this work used the vincent j coates genomics sequencing laboratory at uc berkeley for illumina sequencing (supported by nih grant s -od ). we thank bradley evans and shin-cheng tzeng at the proteomics and mass spectrometry facility at the donald danforth plant science center for proteomics and metabolomics support. we thank snigdha chatterjee and hannah riedy for experimental assistance. the funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. key: cord- -oj re k authors: zhou, haixia; chen, yingzhu; zhang, shuyuan; niu, peihua; qin, kun; jia, wenxu; huang, baoying; zhang, senyan; lan, jun; zhang, linqi; tan, wenjie; wang, xinquan title: structural definition of a neutralization epitope on the n-terminal domain of mers-cov spike glycoprotein date: - - journal: nat commun doi: . /s - - - sha: doc_id: cord_uid: oj re k most neutralizing antibodies against middle east respiratory syndrome coronavirus (mers-cov) target the receptor-binding domain (rbd) of the spike glycoprotein and block its binding to the cellular receptor dipeptidyl peptidase (dpp ). the epitopes and mechanisms of mabs targeting non-rbd regions have not been well characterized yet. here we report the monoclonal antibody d that binds to the n-terminal domain (ntd) of the spike glycoprotein and inhibits the cell entry of mers-cov with high potency. structure determination and mutagenesis experiments reveal the epitope and critical residues on the ntd for d binding and neutralization. further experiments indicate that the neutralization by d is not solely dependent on the inhibition of dpp binding, but also acts after viral cell attachment, inhibiting the pre-fusion to post-fusion conformational change of the spike. these properties give d a wide neutralization breadth and help explain its synergistic effects with several rbd-targeting antibodies. 
middle east respiratory syndrome coronavirus (mers-cov), a novel lethal human virus in the family of coronaviridae, was first identified in saudi arabia in june . infection by this pathogen causes an acute respiratory disease designated as mers, with symptoms that are very similar to those of sars . globally, mers-cov infections have been confirmed in countries causing deaths (http://www.who.int/emergencies/mers-cov/en/). interspecies transmission from dromedary camels to humans is considered to be one major route of transmission in the middle east region , . however, many infected patients without camel exposure and a recent mers outbreak in korea demonstrated that large-scale human-to-human transmissions can occur through close contacts . due to its potential for mutating toward efficient human-to-human transmission and causing a pandemic, mers-cov was listed as a category c priority pathogen by the us national institute of allergy and infectious diseases. monoclonal antibodies (mabs) with potent neutralizing activity have become promising candidates for both prophylactic and therapeutic interventions against viral infections . on coronaviruses, the component primarily targeted by mabs is the homotrimeric spike (s) glycoprotein of the virion. as a typical class i fusion glycoprotein, the s trimer of highly pathogenic coronaviruses such as mers-cov and sars-cov, which mediates receptor recognition and membrane fusion during viral entry [ ] [ ] [ ] [ ] [ ] [ ] , undergoes protease cleavage into the s and s subunits, positional change of the receptor-binding domain (rbd) in the s subunit for receptor binding, dissociation of the s -receptor complex, and finally formation of a six-helix bundle by the s subunits. a series of rbd-targeting antibodies against mers-cov, which block the binding of the s trimer to the cellular receptor dpp , have been reported and characterized [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] .
these antibodies exhibited high potency in inhibiting the infectivity of pseudotyped and live mers-cov in cells and animal models. the neutralizing epitopes and mechanisms of antibodies including c , d , m , mers- , jc - , cdc-c , mers- , and mers-gd were further elucidated at the atomic level by structural and functional studies [ ] [ ] [ ] [ ] [ ] [ ] [ ] . sequence comparisons of different mers-cov strains have shown that most naturally occurring mutations of the s glycoprotein are located on the rbd of the s subunit and the s subunit. considering the rapid evolution and high genome variation of rna viruses, more mutations on the rbd may enable the new strains to escape neutralization by currently known rbd-targeting antibodies. therefore, new mabs targeting other functional regions of the mers-cov s glycoprotein and/or neutralizing by different mechanisms are important for developing effective prophylactic and therapeutic interventions against mers-cov infection. although several mabs targeting non-rbd regions have recently been reported, their neutralizing epitopes and mechanisms remain unclear , , . in this study, we isolated and characterized the mouse mab d by combining structural, biochemical, and functional studies. the d antibody recognizes the ntd of mers-cov s glycoprotein and neutralizes the infectivity of pseudotyped and live virus with a potency comparable to those of the most active rbd-targeting antibodies. we also found that the epitope and mechanism of d , which are different from those of rbd-targeting antibodies, enable it to have a better neutralizing breadth and to work synergistically with other antibodies against different mers-cov strains. all these results indicate that d is a very promising candidate for the future combined use of different antibodies in our battle against mers-cov. characterization of neutralizing mab d targeting the ntd.
to generate mers-cov neutralizing mabs with epitopes outside the rbd, mice were immunized with recombinant mers-cov s protein (residues - ). subsequently, the splenocytes were harvested and fused with sp / myeloma cells, and the hybridoma cell lines were screened for positive clones by elisa with the s protein . the positive clones were further tested for their reactivity to different s fragments, including the s subunit ntd (residues - ), rbd (residues - ), and the s subunit (residues - ). one ntd-specific mab, named d , was finally isolated with an ec of approximately . μg ml − in elisa (fig. a) . it exhibited no cross-reactivity with the rbd at a concentration of μg ml − (fig. b) . we further assessed the potential of d , in the form of crude extracts from mouse ascites, for inhibiting mers-cov entry into susceptible huh cells and vero e cells with either pseudotyped or infectious viruses. as expected, d was able to neutralize the infectivity of pseudotyped and live mers-cov (fig. c, d) . the neutralizing activity of d was dose-dependent, with an ic of approximately . μg ml − against pseudotyped virus and practically the same ic of approximately . μg ml − against live virus (emc strain) (fig. c, d) . images illustrating the reduced pfu formation, corresponding to the rate of neutralization of live mers-cov, are shown in fig. e . antibody isotyping showed that d belongs to the igg subtype. sequencing further determined that the heavy chain germline v and j segments are ighv - * and ighj * , while those of the light chain are igkv - * , igkj * , and igkj * , respectively (supplementary table ) . we also generated a chimeric version of d ( d -h) by combining the v segments of d with the human igg backbone, which was efficiently expressed and purified in freestyle -f cells ( supplementary fig. a ). the bio-layer interferometry (bli) experiment showed that the affinity constant of the binding between d -h and ntd was approximately nm (table and supplementary fig. b) .
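the ic values quoted for the neutralization assays are read from dose-response curves. as a minimal sketch of that step, assuming percent neutralization has been measured over an antibody dilution series, the half-maximal concentration can be estimated by interpolating on a log10 concentration axis. the function name and the data points below are hypothetical and are not taken from this study, and the curve-fitting procedure actually used by the authors is not described in this section.

```python
import math


def ic50_by_log_interpolation(concentrations, percent_neutralization, target=50.0):
    """Estimate the concentration that gives `target` percent neutralization.

    Interpolates linearly in log10(concentration) between the two adjacent
    points that bracket the target. Concentrations must be positive and
    sorted in ascending order. Returns None if the curve never reaches
    the target.
    """
    points = list(zip(concentrations, percent_neutralization))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo == target:
            return c_lo
        # A sign change means the target lies strictly between the two points.
        if (y_lo - target) * (y_hi - target) < 0:
            frac = (target - y_lo) / (y_hi - y_lo)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    if points and points[-1][1] == target:
        return points[-1][0]
    return None


# Hypothetical dilution series (ug/ml) and percent neutralization values.
concs = [0.01, 0.1, 1.0, 10.0]
neut = [10.0, 30.0, 70.0, 95.0]
print(ic50_by_log_interpolation(concs, neut))
```

interpolating in log space matches how serial dilutions are spaced; a four-parameter logistic fit would be the more common choice when enough points are available.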
the ic of the purified d -h against cell entry by pseudotyped mers-cov was approximately . μg ml − (supplementary fig. c ). we also investigated the protective efficacy of d -h against infection by pseudotyped mers-cov using the r -hdpp mouse model, in which a human dpp is inserted into the rosa locus by crispr/cas and which can also be productively infected by high-titer mers-cov pseudovirus, with effects comparable to the authentic infection . bioluminescence of the fluc reporter showed that the pseudovirus infection in the mice was clearly prevented by d -h and the rbd-specific mab mers- when both antibodies were administered by intraperitoneal injection at a dose of μg per mouse ( supplementary fig. d ). the recombinant chimeric d -h, which retained the activities of the mouse d and protected r -hdpp mice against challenge with pseudotyped mers-cov, was utilized in subsequent binding and neutralization experiments. overall structure of the d scfv bound to the ntd. to structurally characterize d and its binding to the spike protein, we determined the crystal structure of the antibody scfv ( d -scfv) in complex with the ntd at a resolution of . Å with a final r work of . and r free of . . statistics of diffraction data collection, processing, and structure refinement are listed in table . there were three complexes of d -scfv bound to ntd per asymmetric unit. the refined model contains residues tyr to ser of mers-cov ntd, glu to ser of the v h and asp to lys of the v l . n-linked glycans attached to asn , asn , asn , asn , asn , asn , asn , and asn of the ntd are also included in the model. it has been previously shown that the mers-cov ntd folds into a galectin-like structure, which can be separated into top, core and bottom subdomains (fig. a) . upon binding, the d -scfv contacts the top subdomain of the ntd and the asn -linked glycans with its heavy and light chains ( fig. a and supplementary fig. ).
all three cdrs of the heavy chain and the cdr and cdr of the light chain participate in the binding (fig. a) . the buried surface between the d -scfv and the ntd encompasses approximately Å for the heavy chain and Å for the light chain. structural features of the interface between d and ntd. the binding interface between d -scfv and ntd consists of residues and asn -linked glycans from the ntd, as well as residues from all cdrs except for lcdr (fig. b, c) . the interacting residues from the ntd are tyr , asp , pro , asp , val , ser , glu , ser , asn , leu , arg , and asn . together with the asn -linked nag , nag and man , they form the conformational epitope recognized by d (fig. b) . the residues recognizing d are ser , tyr , asn from the hcdr , tyr , asn , and ser from the hcdr , arg , tyr , asn , tyr , and tyr from the hcdr , tyr , and tyr from the lcdr , and arg and asp from the lcdr (fig. c) . specifically, d hcdr residues ser , tyr , and asn interact with pro and asp from the ntd, and a hydrogen bond is formed from d asn to ntd asp ( fig. d and supplementary

fig. crystal structure of d -scfv bound to ntd and the binding interface. a an overall structure of the ntd/ d -scfv complex in which the ntd, n -linked glycans on the ntd, d v l , and d v h are colored in blue, gray, magenta, and cyan, respectively. b epitope on the ntd recognized by d . the ntd is represented as blue surface, on which the protein region bound by d is displayed in orange and the n -linked glycans are displayed as gray sticks. c d residues that are involved in the binding. the v l and v h are colored in magenta and cyan, respectively, and the residues interacting with d are displayed in orange. d interactions between the d v h residues and the corresponding residues of ntd. e interactions between the d v l residues and the corresponding residues of ntd. f zoom-in view of interactions between n -linked glycans and d

interacting with tyr , asp , pro , asp , and arg of the ntd (fig. d ).
tyr and asn of d form two hydrogen-bonding interactions with asp of the ntd (supplementary table ). for the light chain, the lcdr and lcdr residues tyr , tyr , arg , and asp interact with glu , ser , arg , and asn of the ntd, and a salt bridge is formed between arg of lcdr and glu of the ntd (fig. e and supplementary table ) . a prominent feature at the interface is the extensive recognition of asn -linked glycans by all three heavy chain cdrs ( fig. f and supplementary table ). specific hydrogen-bonding interactions occur between tyr and arg of d and the nag and man glycans, respectively ( fig. f and supplementary table ). confirmation of the neutralizing epitope. to confirm the epitope and its critical residues, we performed a mutagenesis study by introducing single mutations to all recognized ntd residues including trp , asp , pro , asp , val , ser , glu , ser , asn , asn , leu , arg , and asn . we first examined the effects of these ntd mutations on the binding by d -h. the d -h bound the wild-type ntd with an affinity of approximately nm (table and supplementary fig. ). by contrast, the d a and r a mutations dramatically reduced the binding, to a level that was undetectable by bli (table and supplementary fig. ). the e a and n q mutations reduced the binding affinity by -fold to . μm and -fold to . μm, respectively (table and supplementary fig. ). the other nine mutations had varying effects on binding, reducing the affinity by - to -fold (table and supplementary fig. ). the effects of these mutations on the neutralizing activity of d -h were consistent with the changes in binding affinity. pseudotyped mers-cov bearing the d a, e a, or r a mutation in the spike glycoprotein escaped neutralization by d -h (table and supplementary fig. ). the ic values of d against pseudotyped mers-cov bearing the d a, v a, or n q mutation were increased approximately by -, -, and -fold (table and supplementary fig. ).
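the fold changes reported for the mutants are ratios of a mutant measurement to the wild-type measurement, with undetectable binding treated as a separate case. the following is a minimal sketch of that bookkeeping, assuming each mutant has an affinity constant or an ic value in the same units as the wild type. the mutant labels and numbers are hypothetical placeholders and are not values from the tables of this study.

```python
def fold_change(mutant_value, wild_type_value):
    """Return mutant / wild-type, or None when the mutant value is undetectable.

    A larger affinity constant (or ic value) for the mutant gives a fold
    change above 1, meaning weaker binding (or weaker neutralization).
    """
    if wild_type_value <= 0:
        raise ValueError("wild-type value must be positive")
    if mutant_value is None:  # e.g. binding below the detection limit
        return None
    return mutant_value / wild_type_value


# Hypothetical affinity constants in nM; None marks undetectable binding.
wild_type_kd = 10.0
mutant_kd = {"mutant_a": None, "mutant_b": 250.0, "mutant_c": 30.0}

for name, kd in mutant_kd.items():
    fc = fold_change(kd, wild_type_kd)
    label = "undetectable" if fc is None else f"{fc:.1f}-fold"
    print(name, label)
```

keeping undetectable binding as `None` rather than as a very large number avoids mixing measured ratios with values that lie beyond the assay's detection limit.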
the binding and neutralization assays collectively revealed that asp , val , glu , arg , and asn -linked glycans are critical for recognition and neutralization of mers-cov by d . sequencing of multiple clinical isolates has revealed that the mers-cov s glycoprotein is evolving at an average rate of . × − substitutions per site per year . alignments of the deposited sequences in the ncbi identified naturally changing residues from the prototype emc sequence including v f, v i, v a, d y, l f, t i, a y, l f, d g, v l, v a, e k, d e, v i, q r, q h, r h, r q, a s, t i, g s, and v a, which are located in the ntd (residues - ) and rbd (residues - ) of the s subunit, and the s subunit (residues - ). several residue changes on the rbd, such as those occurring on d , d , and e , indeed enabled the mers-cov to escape the neutralization of antibodies targeting the rbd , . considering that most of the mutations are outside the ntd, we speculated that d -h would have a better tolerance for these naturally occurring mutations. we generated pseudotyped mers-cov bearing the emc strain s glycoprotein and its mutants harboring all the listed residue changes. the neutralization assays showed that d -h retained effective neutralizing activity against almost all pseudotyped mers-cov variants. only the two mutations v f and v a on the ntd increased the ic value of d -h by more than -fold and significantly reduced its neutralization activity (fig. a, b) , which confirmed the results of the structural and biochemical studies of the binding interface. all other naturally occurring mutations, most of them on the rbd and the s subunit, did not affect the neutralization capability of d -h (fig. a, b) , indicating that d would have a wide neutralization breadth against different variants of mers-cov. combination of d with other rbd-targeting antibodies.
the currently available mers-cov antibody epitopes with solved structures are all on the rbd, which can be grouped into three categories: ( ) epitope of mers- ; ( ) epitopes of mers- , d , c , and jc - ; and ( ) epitopes of m , mca , cdc-c , and the newly reported mers-gd ( supplementary fig. ) . in our study of the rbd-specific mab mers- , we also found synergism with the ntd-targeting mab f . thus, the elucidation of the epitope targeted by d , which added a category outside the rbd ( supplementary fig. ), prompted us to study the combined effect of d together with the three representative antibodies mers- , mers- , and mers-gd in the neutralization of pseudotyped mers-cov by titrating the neutralizing potency of an equimolar mixture of the two antibodies and comparing the dose response with that observed in neutralization assays performed with each antibody alone. as shown in fig. , the combination index (ci) values of mers-gd combined with d at fa values of effective dose %, %, %, and % (ed , ed , ed , and ed , respectively) were . , . , . , and . , respectively. as a ci value of indicates an additive effect, < indicates synergism, and > indicates antagonism, the combination of d and mers-gd worked in a clearly synergistic manner. meanwhile, the combination index (ci) values of combined mers- with d at fa values of effective dose %, %, %, and % (ed , ed , ed , and ed ) were . , . , . , and . , respectively. thus, the combination of mers- and d also demonstrated synergism, in particular at relatively lower concentrations. however, the percent neutralization obtained using combined mers- and d showed no obvious difference in half-maximal inhibitory concentration (ic ) compared with that of d alone. the combination index (ci) values of combined mers- and d at fa values of effective dose %, %, %, and % (ed , ed , ed , and ed ) were . , . , . , and . , respectively.
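the combination index values above compare the dose of an equimolar mixture needed to reach a given fraction affected (fa) with the doses each antibody needs alone. the sketch below assumes the chou-talalay median-effect formulation, which this section does not name explicitly; it is inferred from the use of fa, ed and ci values. the parameters `dm` (median-effect dose) and `m` (slope), the tolerance used to call an effect additive, and all numbers are hypothetical.

```python
def dose_for_effect(fa, dm, m):
    """Median-effect equation solved for dose: D = Dm * (fa / (1 - fa)) ** (1 / m)."""
    if not 0.0 < fa < 1.0:
        raise ValueError("fa must be strictly between 0 and 1")
    return dm * (fa / (1.0 - fa)) ** (1.0 / m)


def combination_index(fa, single_a, single_b, mixture, fraction_a=0.5):
    """CI = d_a / D_a + d_b / D_b at a given fraction affected.

    single_a, single_b and mixture are (dm, m) pairs from median-effect fits.
    fraction_a is the share of antibody A in the mixture; 0.5 corresponds to
    an equimolar mixture when doses are expressed in molar units.
    """
    total = dose_for_effect(fa, *mixture)      # total mixture dose reaching fa
    d_a = fraction_a * total                   # dose of A inside the mixture
    d_b = (1.0 - fraction_a) * total           # dose of B inside the mixture
    return d_a / dose_for_effect(fa, *single_a) + d_b / dose_for_effect(fa, *single_b)


def classify(ci, tol=0.1):
    """Label a CI value; tol is an arbitrary band around 1 treated as additive."""
    if ci < 1.0 - tol:
        return "synergism"
    if ci > 1.0 + tol:
        return "antagonism"
    return "additive"


# Hypothetical (dm, m) fits for two antibodies and their equimolar mixture.
antibody_a = (1.0, 1.0)
antibody_b = (1.0, 1.0)
mix = (0.5, 1.0)

for fa in (0.5, 0.75, 0.9, 0.95):
    ci = combination_index(fa, antibody_a, antibody_b, mix)
    print(fa, round(ci, 3), classify(ci))
```

evaluating ci at several fa levels, as done for ed values in the text, shows whether synergy holds across the dose range or only at lower concentrations.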
these ci values indicated that the combination of d with mers- exhibited neither synergy nor antagonism. mechanism of d neutralization. a major reported mers-cov neutralization mechanism relies on inhibiting the binding of the s trimer with the cellular receptor dpp . the epitopes of these reported antibodies all reside in the rbd responsible for receptor binding. the fact that the d epitope is outside the rbd indicated that it may have a different neutralizing mechanism. we first examined if d is still able to inhibit the receptor binding by the s trimer. the facs analysis of cell-surface staining showed that the scfv and fab fragments of d -h did not inhibit the staining of huh cells by the s trimer, while the d -h slightly reduced the staining (fig. a , supplementary table and supplementary fig. ). by contrast, the rbd-targeting mab mers- was much more potent than d -h in inhibiting the binding of the s trimer to huh cells. moreover, the fab and scfv fragments of mers- retained nearly the same potency in the inhibition ( fig. a and supplementary table ). surface plasmon resonance (spr) analysis confirmed these conclusions by showing that d -h, and not its fab or scfv fragments, could interfere with the binding of the s trimer to chip-coupled dpp in a dose-dependent manner ( supplementary fig. ) , while the igg, fab, and scfv of mers- all inhibited the binding ( supplementary fig. ) . to investigate why the igg, fab, and scfv of d inhibit receptor binding differently, we constructed models of their binding to the s trimer. the mers-cov s trimer structure was determined by cryo-em with the rbd in standing or lying positions, and only the standing rbd could bind to the dpp receptor. after superimposing the ntd/ d -scfv crystal structure onto the s trimer, we observed no steric clashes between three ntd-bound scfv fragments and one or two rbd-bound dpp receptors ( fig. b and supplementary fig. ).
the s trimer with three rbd-bound receptors was not considered because the cryo-em study of the mers-cov s trimer only revealed conformations with one or two standing rbds. when the scfv was replaced with the fab, there were also no steric clashes between the fab and dpp receptor (fig. c ). it is more complicated to model the binding of d -h to the s trimer, considering that the igg form has two binding sites and the intrinsic flexibility. we found that binding of the d -h igg to the ntd in certain orientations could inhibit the binding of dpp due to steric clashes, while there were still no steric clashes with the d -h bound in some other orientations (fig. d, e) . these results provided a structural explanation for the inability of d -h scfv and fab to inhibit the binding of the s trimer to the dpp receptor. they may also explain why the d -h igg form is not as potent as the mers- igg, fab, and scfv which all directly bind to the rbd. in parallel with biochemical studies, we also examined the neutralizing activities of d -h igg, fab, and scfv. the d -h fab and scfv did not interfere with the binding of the s trimer to the dpp receptor. however, they were still able to inhibit the cell entry of pseudotyped mers-cov with ic value of . μg ml − and . μg ml − , respectively (fig. a) . although the d -h fab and scfv are less active than the igg in infection inhibition, they were still comparable to the fab or scfv fragments of several reported rbd-targeting antibodies such as mers- fab (ic : . μg ml − ) and mers- scfv (ic : . μg ml − ) ( supplementary fig. a ). these results collectively indicated that neutralization by d -h involves other mechanism besides interfering with the initial receptor binding. we tested and compared the neutralizing activity of d in pre-attachment and post-attachment settings. after the cell attachment, d was still able to inhibit infection by pseudotyped mers-cov with an ic of . μg ml − (fig. b) . 
in comparison, mers- , which is more potent than d in inhibiting receptor binding, exhibited very weak neutralization after receptor binding (supplementary fig. b). the above results, especially the retained activity of d after viral attachment, indicated that d could also interfere with the prefusion to postfusion conformational transition of the s glycoprotein required for membrane fusion. we showed that the mers-cov s glycoprotein in the prefusion state is sensitive to digestion by proteinase k (fig. c). previous studies have demonstrated that cleavage at the s /s site by trypsin and binding to the cellular receptor greatly enhance the prefusion to postfusion transition of the spike glycoprotein . consistently, the amount of a kda proteinase-k-resistant band of the s glycoprotein, representing the postfusion six-helix bundle, was at its maximum level in the presence of trypsin and dpp (fig. c), and the addition of d -h fab clearly reduced the intensity of the band (fig. c). meanwhile, we analyzed the full-length mers-cov s trimer embedded in the membrane of pseudotyped virus, using incubation with huh cells that endogenously express the dpp receptor as the trigger to induce the conformational transition. after incubating the pseudotyped virus with huh cells for h at °c, a proteinase-k-resistant band appeared on the sds-page gel, and the addition of d -h, d -h fab, or d scfv all clearly decreased the intensity of this band (supplementary fig. ). thus, these biochemical results strongly suggest that d could also exert its neutralizing activity in the postattachment stage after receptor binding by inhibiting the conformational transition of the s glycoprotein required for membrane fusion (fig. d). since dpp , which is a critical step for viral cell attachment. in this study, we first isolated the neutralizing mouse antibody d targeting the ntd of the s glycoprotein.
neutralization assays showed that d is highly potent and its activity is comparable to that of the most potent rbd-targeting antibodies. structural determination of d scfv bound to the ntd and mutagenesis studies revealed the epitope and key residues on the ntd for binding and neutralization at the atomic level. comparisons of d scfv, fab, and igg forms in dpp -binding competition and neutralization assays indicated that its activity is not solely dependent on the inhibition of dpp binding. further experiments indicated that the neutralizing activity of d after cell attachment occurs through inhibition of the prefusion to postfusion conformational transition of the s glycoprotein trimer, which mediates the fusion of viral and cell membranes. we also showed that d has a wide neutralization breadth against mers-cov variants bearing naturally occurring mutations and exhibited synergistic effects with several rbd-targeting antibodies. these results collectively revealed an antibody epitope and neutralization mechanism on the s glycoprotein, which would contribute to the global efforts to control mers-cov infection and transmission by providing alternatives for mers-cov immunotherapy. similar to the ntds of the s proteins of other betacoronaviruses such as mhv, bcov and hku , that of mers-cov also folds into a galectin-like structure. although the galectin domain is a typical carbohydrate-recognition domain, the betacoronavirus ntds can include structural variations that enable more diverse functions in viral infection. examples include the ntd of bcov, which retains the glycan-binding activity recognizing -n-acetyl- -o-acetylneuraminic acid (neu , ac ), and the ntd of mhv, which evolved specific protein-protein interactions with its cellular receptor ceacam ; both interactions are important for viral cell attachment , . however, there is still no report on the glycan or protein-binding activities of the mers-cov ntd.
in fact, crystallographic structure determination showed that the glycan-binding site on the mers-cov ntd is occupied by a short helix (residues - ) and the asn -linked glycan, indicating that it is not able to bind glycans in the same way as the ntd of bcov . notably, the asn -linked glycan is involved in the recognition by d , whereby nag and man undergo specific hydrogen-bonding interactions with tyr and arg of d , respectively. the ntd n q mutation also dramatically reduced the binding and neutralization by d , but did not dramatically affect the cell infection of pseudotyped mers-cov (supplementary fig. ). therefore, the asn -linked glycan serves as an important anchor point for the binding of d to the mers-cov ntd.
figure caption: a d -h igg was tested for neutralizing activity against pseudotyped mers-cov before or after receptor binding. vrc mab was used as an unrelated control. c the effect of d -h fab on the conformational change of the mers-cov s trimer was probed by western blotting using an anti-mers-cov s polyclonal antibody. refolding to the postfusion conformation was detected by the appearance of a proteinase-k-resistant band. trypsin was used at μg ml − and proteinase k at μg ml − . digestion experiments and western blots were performed in triplicate, and a representative result is shown for each of them. d a cartoon representation designed by us showing the neutralization mechanism by which d blocks mers-cov entry. on the one hand, some virus particles cannot bind to dpp due to steric hindrance caused by d binding. on the other hand, d still recognizes the particles when the up receptor-binding domain (rbd) binds to dpp , and may inhibit the prefusion to postfusion transition of the s subunit and the initiation of membrane fusion. source data are provided as a source data file.
as the largest class i viral fusion protein, the coronavirus s glycoprotein is expected to undergo a prefusion to postfusion conformational transition to mediate the interaction between viral and cellular membrane proteins, although structural studies have only recently begun to shed light on this. the s glycoproteins of betacoronaviruses mhv and hku , whose structures have been determined by cryo-em, all adopt a similar prefusion homotrimeric architecture , . interestingly, in the prefusion architecture of the s trimer of highly pathogenic mers-cov and sars-cov, two major conformational states were observed. the major difference between them is the change of the rbd in the s subunit from a down to an up position, which was proposed to be a prerequisite for the binding of the s trimer to their respective cellular receptors dpp and ace . this proposal was recently confirmed by our cryo-em study of the sars-cov s trimer in complex with ace , and we also showed that ace -binding could induce the dissociation of the s subunit, which results in the falling apart of the prefusion s trimer and the transition to the postfusion state of the s subunit . a major neutralization mechanism of antibodies against mers-cov is to directly or indirectly compete with the cellular receptor dpp for binding to the rbd. in theory, antibodies that interfere with the coronavirus membrane fusion process other than receptor binding would also have a neutralizing activity, and the d mab targeting the ntd we studied is one such example. here, we showed that d neutralization is not solely dependent on dpp -binding competition, and its inhibition of the s trimer conformational transition after cell attachment also plays a significant role in the neutralization. we suggest that the binding of d may stabilize the prefusion architecture of the s trimer, even after the binding of the dpp receptor. the stabilization of a viral fusion protein in one conformational state for neutralization has also been observed and studied in other viruses such as hiv.
a recent study revealed that the hiv env trimer is intrinsically dynamic with three major and distinct prefusion conformations . among them, the closed, ground-state conformation is dominant and could be remodeled to another two conformations by cd receptor binding, which is essential for the subsequent prefusion to postfusion transition . the binding of neutralizing antibodies, whether inhibiting the binding of the cd receptor (such as vrc ) or not (such as g and pgt ) all resulted in the stabilization of the ground-state conformation of the env, which finally disfavors its prefusion to postfusion state transition required for viral entry , . to the best of our knowledge, our study offers the first structural definition of the neutralizing epitope of an antibody targeting the s ntd of mers-cov. as we summarized in supplementary table , a total of six anti-ntd mabs have been reported , , , . all of them neutralize the infection of pseudotyped mers-cov emc strain with high potency except for mab . f . the mab f and our d showed the same neutralizing activity against live mers-cov in plaque reduction neutralization testing. notably, the mouse mab g can greatly relieve the symptom of dpp -transgenic mice infected following mers-cov infection and our d -h can inhibit the infection of pseudotyped mers-cov in r -hdpp mice. however, the specific neutralizing epitopes and mechanisms of f , g , jc - , and fib-h are largely unknown. in addition, the combination of different antibodies is supposed to be an effective strategy to combat mers-cov infection as it continues to spread among multiple animal species and to probe and adapt to the human population [ ] [ ] [ ] . an effective combination would require the candidate antibodies to bind to disparate epitopes or with distinct mechanisms and hence display additive or synergistic effects, as the mabs mers- and f we mentioned before . 
although the exact mechanism that leads to the synergy or additivity is uncertain, our d -h with mers-gd or mers- antibodies demonstrated synergy in inhibiting the infectivity of pseudotyped mers-cov, while d -h and mers- antibodies together had an additive effect. consequently, d is currently the most comprehensively studied ntd-targeting mab with a different epitope and working mechanism, which makes it an excellent candidate, in combination with other rbd-targeting neutralizing antibodies or alone, in our battle against mers-cov infection. three weeks after the initial immunization, these mice were boosted twice at -week intervals. cells collected from the spleens of sacrificed animals were fused with cultured sp / cells at a : ratio in the presence of peg (sigma). hat selection medium was used for the fused hybridoma cultures. after -weeks of incubation, the positive hybridomas were selected via s-coated elisa, and the positive clones were subjected to limiting dilution and downstream validation. for large-scale mab production, ascites fluid from mice inoculated with the hybridomas was collected and purified by the caprylic acid-ammonium sulfate precipitation method. protein expression and purification. the coding sequence of the mers-cov spike glycoprotein ectodomain (emc strain, spike residues - ) was ligated into the pfastbac-dual vector (invitrogen) with a c-terminal t fibritin trimerization domain and a hexa-his-strep tap tag to facilitate further purification processes. briefly, the protein was prepared using the bac-to-bac baculovirus expression system and purified by sequential chromatography on strep-tactin and superose columns (ge healthcare) in hbs buffer ( mm hepes, ph . , mm nacl). fractions containing mers-cov s glycoprotein were pooled and concentrated for subsequent biochemical analyses. the sequence encoding mers-cov s ntd (residues - ) with a c-terminal hexa-his tag was inserted into the eukaryotic expression vector pvax.
freestyle -f cells were transfected with the plasmid using polyethylenimine (pei) (sigma). after h, the supernatant was collected and the ntd was purified using nta sepharose (ge healthcare) and a superdex high performance column (ge healthcare) with hbs buffer ( mm hepes, ph . , mm nacl). the sequences encoding the d v l and v h were separately cloned into the backbone of antibody expression vectors containing the constant regions of human igg . the chimeric antibody d -h was expressed in freestyle -f cells by transient transfection and purified by affinity chromatography using protein a sepharose and gel-filtration chromatography. the purified d -h was exchanged into phosphate-buffered saline (pbs) and digested with papain protease (sigma) overnight at °c. the digested antibody was then passed back over protein a sepharose to remove the fc fragment, and the unbound fab in the flow-through was additionally purified using a superdex high performance column (ge healthcare). the gene encoding the d v l followed by v h with a connecting triple gggs linker and a c-terminal hexa-his tag was synthesized and cloned into the eukaryotic expression vector pvrc . freestyle -f cells were transfected with the plasmid in the presence of pei (sigma). the cell-culture supernatant was collected h after the transfection, and the d scfv was captured on nta sepharose (ge healthcare). the bound d scfv was eluted with hbs buffer containing mm imidazole and was then further purified by gel-filtration chromatography using a superdex high performance column (ge healthcare). complex preparation and crystallization. the mers-cov ntd and the scfv fragment of d were mixed at a molar ratio of : . , incubated for h at °c and further purified by gel-filtration chromatography. the purified complex was concentrated to approximately mg ml − in hbs buffer ( mm hepes, ph . , data collection and structure determination.
to collect the diffraction data, all crystals were flash-cooled in liquid nitrogen after being incubated in reservoir solution containing % (v/v) glycerol. the diffraction images were collected on the bl u beamline at the shanghai synchrotron radiation facility (ssrf) at a wavelength of . Å. all images were processed with hkl . the structure was solved by molecular replacement using phaser from the ccp suite . the search models were the mers-cov ntd structure (pdb id: vyh) and the structures of the variable domain of the heavy and light chains available in the pdb with the highest sequence identities. subsequent model building and refinement were performed using coot and phenix, respectively , . in the final refined model, % of residues fall in the most favored, . % in the allowed and . % in the disallowed regions of the ramachandran plot. all structural figures were generated using pymol . neutralizing assay of pseudotyped mers-cov. t cells cultured in mm dish were co-transfected with μg of pcdna . -mers-spike or its mutants and μg of pnl - .luc.re. the supernatants containing sufficient pseudotyped mers-cov were harvested - h post-transfection. subsequently, the % tissue culture infectious dose (tcid ) was determined by infection of huh cells. for the neutralization assay, tcid per well of pseudotyped virus were incubated with or serial : dilutions of purified antibodies, fabs or scfvs for h at °c, after which huh cells (about . × per well) were added. after incubation for h at °c, the neutralizing activities of antibodies were determined by the luciferase activity and presented as ic , calculated using the dose-response inhibition function in graphpad prism (graphpad software inc.). cell entry of pseudotyped virus. the concentration of the harvested pseudotyped virions was normalized by p elisa kit (beijing quantobio biotechnology co., ltd., china) before infecting the target huh cells.
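the ic calculation described above for the pseudovirus neutralization assay can be illustrated with a minimal python sketch. it is not the graphpad prism dose-response inhibition fit used in the study: as a simpler stand-in, it converts luciferase readings to percent neutralization and then interpolates the 50% crossing on a log10 concentration axis. the function names, the background-subtraction convention and the example readings in the comments are hypothetical.

```python
import math


def percent_neutralization(rlu, rlu_virus_only, rlu_cells_only=0.0):
    """Percent reduction in luciferase signal relative to a virus-only well.

    rlu_cells_only is an optional background reading (assumed convention).
    """
    signal = rlu - rlu_cells_only
    full_signal = rlu_virus_only - rlu_cells_only
    return 100.0 * (1.0 - signal / full_signal)


def ic50_by_interpolation(concentrations, neutralization_percent):
    """Estimate the 50% inhibitory concentration by log-linear interpolation.

    Points are sorted by concentration; the first adjacent pair that
    brackets 50% neutralization is interpolated on log10(concentration).
    Returns None if the curve never crosses 50%.
    """
    pairs = sorted(zip(concentrations, neutralization_percent))
    for (c_lo, n_lo), (c_hi, n_hi) in zip(pairs, pairs[1:]):
        brackets = (n_lo - 50.0) * (n_hi - 50.0) <= 0.0
        if brackets and n_lo != n_hi:
            t = (50.0 - n_lo) / (n_hi - n_lo)
            log_ic50 = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_ic50
    return None


# Hypothetical serial dilution (arbitrary concentration units) and readings.
example_concs = [10.0, 1.0, 0.1, 0.01]
example_neut = [90.0, 70.0, 30.0, 10.0]
example_ic50 = ic50_by_interpolation(example_concs, example_neut)
```

interpolation only uses the two points around the 50% crossing, so it is more sensitive to noise than a full sigmoidal fit; it is shown here because it needs no fitting library and makes the definition of ic explicit.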
the infected huh cells were lysed at h after infection and viral entry efficiency was quantified by comparing the luciferase activity between pseudotyped viruses bearing the mutant and wild-type mers-cov spike glycoproteins. postattachment neutralization assay. for the postattachment pseudotyped virus neutralization assay, huh cells, upon reaching a density of . × per well in a -well plate, were incubated with tcid per well of pseudotyped virus at °c for h. after removing the supernatant, μl of pbs was added twice to each well to wash away the unbound pseudotyped viruses. a total of serial : dilutions of purified antibodies in dmem ( % fbs) were then added to the huh cells with attached pseudotyped viruses, as well as dmem ( % fbs) alone as control. neutralization activities were determined based on the luciferase activity after incubation for h at °c and also presented as ic , calculated using the dose-response inhibition function in graphpad prism (graphpad software inc.). cooperativity of mabs for neutralization. synergistic, additive, and antagonistic interactions between d and mers-gd , d and mers- , as well as d and mers- for virus neutralization were evaluated by the median effect analysis method using compusyn software as previously reported , . the measured neutralization values were input to the program as fractional effects (fa) ranging between . and . for each of the two antibodies and for both in combination. ci values were calculated in relation to fa values. a logarithmic ci value of indicates an additive effect, < indicates synergism, and > indicates antagonism. live mers-cov neutralization assay. the neutralizing activity of the mabs against live mers-cov was also determined in dpp -expressing vero e cells. upon reaching a density of × per well in a -well plate, cell monolayers were infected with - plaque-forming units (pfu) of live virus in the presence or absence of the mab.
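the median-effect analysis described under cooperativity of mabs for neutralization can be sketched in a few lines of python. this is not the compusyn implementation. it assumes the standard chou-talalay median-effect equation, fa/(1 − fa) = (d/dm)^m, the mutually exclusive form of the combination index, a fixed-ratio mixture, and doses expressed in the same molar units for both antibodies. the function names and the synthetic dose-response values are hypothetical and only illustrate the arithmetic.

```python
import math


def fit_median_effect(doses, fractions_affected):
    """Fit log10(fa / (1 - fa)) = m * log10(D) - m * log10(Dm) by least squares.

    Returns (m, dm): the slope and the median-effect dose (dose at fa = 0.5).
    Every fa must lie strictly between 0 and 1.
    """
    xs = [math.log10(d) for d in doses]
    ys = [math.log10(fa / (1.0 - fa)) for fa in fractions_affected]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    intercept = mean_y - m * mean_x
    dm = 10.0 ** (-intercept / m)
    return m, dm


def dose_for_effect(fa, m, dm):
    """Dose needed to reach fractional effect fa under the fitted curve."""
    return dm * (fa / (1.0 - fa)) ** (1.0 / m)


def combination_index(fa, fit_a, fit_b, fit_mix, fraction_a=0.5):
    """Combination index at effect level fa for a fixed-ratio mixture.

    fit_mix describes the mixture as a function of total dose; fraction_a is
    the share of antibody A in that total (0.5 for an equimolar mixture).
    On this linear scale CI = 1 is additive, CI < 1 synergistic and
    CI > 1 antagonistic; log10(CI) moves the additive point to 0.
    """
    total = dose_for_effect(fa, *fit_mix)
    dose_a = fraction_a * total
    dose_b = (1.0 - fraction_a) * total
    return dose_a / dose_for_effect(fa, *fit_a) + dose_b / dose_for_effect(fa, *fit_b)


def synthetic_curve(doses, dm, m):
    """Noise-free fa values generated from the median-effect equation."""
    return [1.0 / (1.0 + (dm / d) ** m) for d in doses]


# Hypothetical data: two antibodies with identical curves, and a mixture
# that reaches each effect level at half the total dose.
doses = [0.1, 0.3, 1.0, 3.0, 10.0]
fit_a = fit_median_effect(doses, synthetic_curve(doses, dm=1.0, m=1.0))
fit_b = fit_median_effect(doses, synthetic_curve(doses, dm=1.0, m=1.0))
fit_mix = fit_median_effect(doses, synthetic_curve(doses, dm=0.5, m=1.0))
ci_at_half = combination_index(0.5, fit_a, fit_b, fit_mix)
```

evaluating `combination_index` at several fa values mirrors the way ci is reported at different effective doses in the results; with real data the fits are only as good as the linearity of the median-effect plot, so points with fa very close to 0 or 1 are usually excluded.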
after three days of incubation at °c, the inhibitory capacity of the mabs was assessed by determining the numbers of plaques compared with the potent mers-cov anti-rbd and anti-n mabs. murine model of mers-cov pseudovirus infection. the mers-cov susceptible animal model hdpp -knockin mouse, which was established by inserting human dipeptidyl peptidase (hdpp ) into the rosa locus using crispr/cas , resulting in global expression of the transgene in a genetically stable mouse line , was used in this experiment. mice (n = ) were challenged by intraperitoneal injection (i.p.) with doses of . × . tcid of pseudotyped mers-cov. d -h and mers- were administered i.p. to r -hdpp mice at a dose of μg per mouse prior to challenge with pseudovirus. mice (n = for the pbs group and n = for the c group) were also administered pbs or control mab c (mab of anti-na of h n , at a dose of μg per mouse) and challenged using the same i.p. dose of pseudovirus. the ivis-lumina ii imaging system (xenogen, baltimore, md, usa) was used to detect bioluminescence. prior to measuring luminescence, the mice were anesthetized using an i.p. injection of sodium pentobarbital ( mg kg − ). the exposure time was s, and fluorescence intensity in regions of interest was analyzed using living image software (caliper life sciences, baltimore, md, usa). different wavelengths were used for detecting pseudovirus and tdtomato fluorescence. the substrate, d-luciferin ( mg kg − , xenogen-caliper corp., alameda, ca, usa), was injected i.p. and imaging was conducted min later. the relative intensities of emitted light were represented as colors ranging from red (intense) to blue (weak) and quantitatively presented as photon flux in photons s − cm − sr − . binding studies using bli. binding kinetics of mers-cov ntd and its mutants with d were studied using a fortébio octet htx instrument. assays with agitation set to rpm in hbs buffer ( mm hepes, ph . , mm nacl) supplemented with . 
% (v/v) tween were performed at °c in solid black tilted-bottom -well plates (greiner bio-one). d ( μg ml − ) was used to load anti-human igg fc capture probes for s to capture levels of . - nm. biosensor tips were then equilibrated for s in hbs buffer supplemented with . % (v/v) tween prior to binding assessment with different concentrations of wild-type or mutant mers-cov ntd for s, followed by dissociation for s. data analysis and curve fitting were performed using octet software, version . . binding competition assays by spr. real-time binding and analysis by spr were conducted on a biacore t instrument with cm chips (ge healthcare) at room temperature. for all the analyses, hbs buffer consisting of mm hepes, ph . , mm nacl and . % (v/v) tween was used, and all proteins were exchanged to the same buffer. the blank channel of the chip was used as the negative control. dpp ( μg ml − ) was immobilized on the chip at about response units. soluble mers-cov spike trimer (s) at the same gradient in the presence or absence of a concentration gradient of iggs, fabs, or scfvs was flowed over the chip surface. after each cycle, the sensor surface was regenerated with . mm naoh. data were analyzed using the biacore t evaluation software by fitting to a : langmuir binding model. facs analysis of cell-surface staining. the binding between recombinant soluble mers-cov spike trimer (s) and human dpp expressed on the surface of huh cells was measured using fluorescence-activated cell sorting (facs). all cell-surface staining experiments were performed at room temperature. soluble mers-cov spike trimer (s) with strep-tag ( μg) was incubated with monoclonal antibodies (mabs) in advance at molar ratios of : , : , : , and : for h. huh cells were trypsinized and then incubated with s or s-mab mixtures for h. after washing away the unbound s with pbs times, the huh cells were then stained with streptavidin apc (bd ebioscience) for another min.
cells were subsequently washed with pbs times and analyzed by flow cytometry on a facs aria iii machine (bd ebiosciences). western blots. in total, μl of pseudotyped mers-cov was thawed and mixed with μg of antibodies (igg, fab or scfv) for h. the virus alone or the mixture was incubated with μl of huh cell suspension for another h at °c. an equal volume of buffer and proteinase-k (final concentration of μg ml − ; thermo fisher) was then added and incubated for h at °c. for the soluble s, μg of the s trimer was incubated with μg of the dpp ectodomain or μg of d fab for h on ice. trypsin (final concentration of μg ml − ; thermo fisher) was then added to these samples and incubated for min at °c. subsequently, the samples were supplemented with μg ml − proteinase-k and incubated for min at °c. × sds-page loading buffer was then added to all samples prior to boiling at °c. samples were run on a - % gradient tris-mops-gel (genscript) and transferred to polyvinylidene fluoride membranes. an anti-s mers-cov s polyclonal antibody ( : dilution; thermo fisher; cat#pa - ) and an hrp-conjugated goat anti-rabbit secondary antibody ( : dilution; huaxingbio; cat#hx ) were used for western blotting. ai was used to develop images. reporting summary. further information on research design is available in the nature research reporting summary linked to this article. the source data underlying figs. a-d, , , and a-c and supplementary figs. a-c, , , , - are provided as a source data file. crystal structures presented in this work have been deposited in the protein data bank (pdb) and are available with accession code j .
we would like to thank dr. changfa fan (division of animal model research, institute for laboratory animal resources, national institutes for food and drug control) for help in providing the r -hdpp mouse model and experimental method. we thank dr. jianhua he and the staff scientists at the ssrf bl u beam line, as well as dr.
shilong fan at the x-ray crystallography platform of the tsinghua university technology center for assistance in diffraction data collection. this work was supported by the national key plan for scientific research and development of china (grants yfd and yfd ), the national natural science foundation of china (grants and u ), and the national major project for control and prevention of infectious disease in china ( zx - ). h.z., w.t., l.z. and x.w. designed the experiments. y.c., k.q. and w.t. isolated the antibody d and sequenced the corresponding v l and v h genes. b.h. carried out the neutralizing assay with live mers-cov. h.z. and s.z. expressed, purified, and crystallized the protein, and h.z. carried out the bli and spr analysis. h.z. conducted all the neutralizing assays based on pseudotyped mers-cov with the help of w.j. h.z. conducted dpp -competition assays and the western blots analysis. p.n. performed the protection assay in mice. h.z. and j.l. collected the diffraction data. h.z. and x.w. processed the diffraction data, determined, and analyzed the structure. h.z. and x.w. wrote the paper with contributions from l.z. and w.t. supplementary information accompanies this paper at https://doi.org/ . /s - - - . competing interests: the authors declare no competing interests. peer review information: nature communications thanks the anonymous reviewers for their contribution to the peer review of this work. peer reviewer reports are available. open access this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made.
key: cord- - w x g authors: wu, joseph t.; leung, kathy; bushman, mary; kishore, nishant; niehus, rene; de salazar, pablo m.; cowling, benjamin j.; lipsitch, marc; leung, gabriel m. title: estimating clinical severity of covid- from the transmission dynamics in wuhan, china date: - - journal: nat med doi: . /s - - - sha: doc_id: cord_uid: w x g as of february there were , confirmed cases and , deaths from covid- in mainland china. of these, , cases and , deaths occurred in the epicenter, wuhan. a key public health priority during the emergence of a novel pathogen is estimating clinical severity, which requires properly adjusting for the case ascertainment rate and the delay between symptoms onset and death. using public and published information, we estimate that the overall symptomatic case fatality risk (the probability of dying after developing symptoms) of covid- in wuhan was . % ( . – . %), which is substantially lower than both the corresponding crude or naïve confirmed case fatality risk ( , / , = . %) and the approximator( ) of deaths/deaths + recoveries ( , / , + , = %) as of february . compared to those aged – years, those aged below and above years were . ( . – . ) and . ( . – . 
) times more likely to die after developing symptoms. the risk of symptomatic infection increased with age (for example, at ~ % per year among adults aged – years). of case fatality risk warrants these interventions, which seriously disrupt social and economic stability. for a completely novel pathogen, especially one with a high (say, > ) basic reproductive number (the expected number of secondary cases generated by a primary case in a completely susceptible population) relative to other recently emergent and seasonal directly transmissible respiratory pathogens, assuming homogeneous mixing and mass action dynamics, the majority of the population will be infected eventually unless drastic public health interventions are applied over prolonged periods and/or vaccines become available sufficiently quickly. even under more realistic assumptions about mixing informed by observed clustering of infections within households and the increasingly apparent role of superspreading events (for example, the diamond princess cruise ship, chinese prisons and the church in daegu, south korea), at least one-quarter to one-half of the population will very likely become infected, absent drastic control measures or a vaccine. therefore, the number of severe outcomes or deaths in the population is most strongly dependent on how ill an infected person is likely to become, and this question should be the focus of attention. we therefore extended our previously published transmission dynamics model, updated with real-time input data and enriched with additional new data sources, to infer a preliminary set of clinical severity estimates that could guide clinical and public health decision-making as the epidemic continues to spread globally. estimation of true case numbers, which is necessary to determine the severity per case, is challenging in the setting of an overwhelmed healthcare system that cannot ascertain cases effectively.
therefore, as in our prior work, our approach has been to use a range of publicly available and recently published data sources (numbered to below) to build a picture of the full number of cases and deaths by age group. briefly, because the healthcare structure has been overwhelmed in wuhan and milder cases were unlikely to have been tested, we used the prevalence of infection in travelers (both on commercial flights before january and on charter flights from january to february) to estimate the true prevalence of infection in wuhan; we also used the wuhan case numbers from only the first cases to estimate the growth rate of the epidemic (assuming that the ascertainment proportion was constant between december and january ) (fig. ). specifically, we inferred the epidemiologic parameters listed in extended data fig. by fitting an age-structured transmission model to the following data:
the epidemic curve of confirmed cases of covid- in wuhan with no epidemiologic links to huanan seafood wholesale market (which was postulated to be the index zoonotic source of the covid- epidemic) between december and january (fig. table ).
the time between onset and death or the time between admission and death for death cases of covid- in wuhan [ ] [ ] [ ] (supplementary table ).
the time between the onset dates (that is, serial intervals) of infector-infectee pairs (supplementary table ).
the clinical severity of infectious diseases is typically measured in terms of infection fatality risk (ifr), symptomatic case fatality risk (scfr) and hospitalization fatality risk (hfr). the case definitions underlying these severity measures are as follows:
ifr defines a case as a person who would, if tested, be counted as infected and rendered (at least temporarily) immune, as usually demonstrated by seroconversion or other immune response. such cases may or may not be symptomatic.
scfr defines a case as someone who is infected and shows certain symptoms.
hfr defines a case as someone who is infected and hospitalized. it is typically assumed in such estimates that the hospitalization is for treatment rather than isolation purposes.
figure summarizes our estimates of age-specific scfrs and susceptibility to symptomatic infection. both parameters increase substantially with age. if the probability of developing symptoms after infection, p sym , is . , the scfr values are . % ( . - . %), . % ( . - . %) and . % ( . - . %) for those aged < years, - years and > years, respectively. the overall scfr is . % ( . - . %). compared to those aged - years, those aged < years and > years are . ( . - . ) and . ( . - . ) times more susceptible to symptomatic infection. our estimates of scfrs would be lower if p sym were higher than the baseline value of . ; for example, the overall scfr is . % ( . - . %) and . % ( . - . %) if p sym is . and . , respectively. our estimates of age-specific susceptibility are not sensitive to p sym .
fig. | a, the daily number of confirmed cases in wuhan with no epidemiologic links to huanan seafood wholesale market (i.e., cases due to human-to-human (h h) transmission) between december and january (blue), the daily number of cases exported from wuhan to cities outside mainland china via air travel between december and january (orange) and the proportion of expatriates on charter flights between january and february who were laboratory-confirmed to be infected (green). the numbers of passengers and confirmed cases who returned to their countries from wuhan on chartered flights are provided in supplementary table . bars indicate the % confidence intervals (cis) of the proportion. b, the daily number of deaths in wuhan reported between december and february .
figure summarizes our estimates of the key epidemiologic parameters of covid- in wuhan. in the baseline scenario (p sym = . ), the basic reproductive number is . ( . - . ). the mean serial interval is . ( . - . ) days, with a standard deviation of . ( . - . ) days.
the mean time from onset to death is ( - ) days, with a standard deviation of ( - ) days. the epidemic doubling time (the time it takes for daily incidence to double) was . ( . - . ) days before wuhan was quarantined, and public health interventions implemented within wuhan reduced transmissibility by % ( - %). we estimate that only . % ( . - . %) of symptomatic cases that occurred between december and january were ascertained. figure suggests that our estimates of the basic reproductive number, mean generation time and intervention effectiveness would be slightly lower if p sym were higher than the baseline value of . , whereas our estimates of the other parameters are largely insensitive to p sym .
there is a clear and considerable age dependency in symptomatic infection (susceptibility) and outcome (fatality) risks, by several-fold in each case. given that we have parameterized the model using death rates inferred from projected case numbers (from traveler data) and observed death numbers in wuhan, the precise fatality risk estimates may not be generalizable to those outside the original epicenter, especially during subsequent phases of the epidemic. the experience gained from managing those initial patients and the increasing availability of newer, and potentially better, treatment modalities to more patients would presumably lead to fewer deaths, all else being equal. public health control measures widely imposed in china since the wuhan alert have also kept case numbers down elsewhere, so that their health systems are not nearly as overwhelmed beyond surge capacity, thus again perhaps leading to better outcomes. indeed, so far, the death-to-case ratio in wuhan has been consistently much higher than that among all the other mainland chinese cities (extended data fig. ).
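the relationship between the epidemic doubling time and the reproductive number reported in the results can be illustrated with a standard renewal-equation identity: for a gamma-distributed generation time with mean μ and standard deviation σ, a growth rate r = ln 2 / (doubling time) implies r = (1 + rσ²/μ)^(μ²/σ²). the sketch below is a generic illustration of that identity, not the age-structured model fitted in this study; the function names and all numerical inputs are hypothetical.

```python
import math


def growth_rate(doubling_time_days):
    """Exponential growth rate r (per day) implied by a doubling time."""
    return math.log(2.0) / doubling_time_days


def reproductive_number(r, mean_gt, sd_gt):
    """Reproductive number implied by growth rate r for a gamma generation time.

    Uses R = 1 / M(-r), where M is the moment-generating function of the
    generation-time distribution; for a gamma distribution this equals
    (1 + r * scale) ** shape.
    """
    shape = (mean_gt / sd_gt) ** 2
    scale = sd_gt ** 2 / mean_gt
    return (1.0 + r * scale) ** shape


# hypothetical inputs, chosen only for illustration
r = growth_rate(5.0)
print(round(reproductive_number(r, mean_gt=7.0, sd_gt=3.5), 2))
```

a shorter doubling time or a longer mean generation time both raise the implied reproductive number, which is why the two quantities have to be estimated jointly.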
given the intensive efforts of case finding and the sharp drop in community transmission of covid- in chinese cities outside hubei over the past few weeks, the ascertainment rates in these cities were probably very high. as such, we postulate that confirmed case fatality risk in these cities should be in some ways comparable to our scfr estimates for wuhan, which attempt to account for under-ascertainment of cases in wuhan. nonetheless, crude case fatality risks estimated from cities outside wuhan should be, and are, lower than our scfr estimates for wuhan, because the former do not account for the delay between onset and death (thus being artefactually lower) and because healthcare outside hubei is less overwhelmed (thus allowing a truly lower cfr). indeed, as of february , the crude case fatality risk in areas outside hubei was . %, which is ~ - % lower than our scfr estimates of . - . % for wuhan . considering the risk estimates in context, extended data fig. compares infection, case and hospitalization fatality risks for pandemic influenza in and , sars and mers. sars causes moderate to severe disease requiring hospitalization, so the infection fatality risk and case fatality risk are essentially the same as the hospitalization fatality risk. the hospitalization fatality risk for mers is well documented, although the shape and depth of the clinical iceberg remains less well defined. in contrast, because ( ) the majority of covid- infections do not cause severe disease and ( ) hospitals in wuhan have been overwhelmed, presumably having led to prioritized admission of more serious cases, the scfr will be substantially lower than the hfr. however, despite a lower scfr, covid- is likely to infect many more (given emerging evidence of presymptomatic transmission , and growing evidence of extensive community spread in numerous countries ), thus ultimately causing many more deaths than sars and mers. 
compared with the and influenza pandemics, our estimates are intermediate but substantially higher than , which was generally regarded as a low-severity pandemic. we find that scfr is highest in the oldest age group. unlike any previously reported pandemic or seasonal influenza, we find that risk of symptomatic infection also increases with age, although this may be in part due to preferential ascertainment of older and thus more severe cases. one largely unknown factor at present is the number of asymptomatic, undiagnosed infections. these do not enter our estimates of scfr, but if such asymptomatic or clinically very mild cases existed and were not detected, the infection fatality risk would be lower than scfr. further clarifying this requires new data sources that are not yet available, specifically including age-stratified serologic studies. our inferences were based on a variety of sources, and have a number of caveats that are highlighted below, but considering the totality of the findings they nevertheless indicate that covid- transmission is difficult to control. with a basic reproductive number of around two, we might expect at least half of the population to be infected, even with aggressive use of community mitigation measures. perhaps the most important target of mitigation measures would be to 'flatten out' the epidemic curve, reducing the peak demand on healthcare services and buying time for better treatment pathways to be developed. in due course, but almost certainly after the first global wave of infections, vaccines may also be available to protect against infection or severe disease. although our estimates of scfr are concerning, these could be reduced if effective antivirals were identified and widely adopted for the treatment of severe cases. timely data from clinical trials of remdesivir, lopinavir/ritonavir and other potential chemotherapies, as well as supportive care modalities, would be extremely informative.
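the statement that a basic reproductive number of around two implies infection of at least half of the population can be checked against two textbook quantities for a homogeneously mixing population: the threshold 1 − 1/r0 at which incidence starts to decline, and the final attack rate z that solves z = 1 − exp(−r0·z). the sketch below assumes homogeneous mixing and no interventions, which is a simplification and not the age-structured model used in this study; the function names are hypothetical.

```python
import math


def herd_immunity_threshold(r0):
    """Fraction immune at which each case generates fewer than one new case."""
    return 1.0 - 1.0 / r0


def final_attack_rate(r0, tol=1e-12, max_iter=10_000):
    """Solve z = 1 - exp(-r0 * z) by fixed-point iteration."""
    z = 0.5
    for _ in range(max_iter):
        z_next = 1.0 - math.exp(-r0 * z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


r0 = 2.0
print(herd_immunity_threshold(r0), round(final_attack_rate(r0), 3))
```

the final attack rate exceeds the threshold because transmission continues, at a declining rate, after the threshold is crossed.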
several important caveats are worth mentioning, as follows. first, and most importantly, our modeled estimates have necessarily relied on numerous strong assumptions, given the paucity of definitive data elements such as serosurveys, serial viral shedding studies, robust ascertainment of sufficient transmission chains and incomplete testing of travelers and returnees from wuhan, all of which need to be underpinned by systematic unbiased sampling of the underlying population and by important age and other sub-groups. our estimates of scfr are inevitably affected by under-ascertainment of cases and deaths of covid- . on the one hand, overstretched and overwhelmed healthcare surge capacity in wuhan could result in scfrs that are higher than they would be in a less stressed healthcare setting, as presumably the sicker patients would have been prioritized for admission while leaving the milder cases untested and thus unconfirmed. our prevalence estimates relying on travelers are based on those well enough to travel, so may slightly underestimate prevalence in wuhan by not including those who are already in a serious condition and perhaps hospitalized. we have accounted for the possibility that travelers may underestimate the prevalence of infection in wuhan by using our best estimate, from a separate analysis, of the probability of detection for international travelers ( % ( - %)) . on the other hand, the numerator of the number of deaths could also have been undercounted, although much less likely compared to enumerating the denominator, for the same surge capacity reason or due to imperfect test sensitivity, especially during the first month of the outbreak . if deaths in wuhan were under-ascertained, this would bias our severity estimates downward. another caveat concerns one of our key inputs-the infection prevalence among returnees airlifted out of wuhan on charter flights. 
their point prevalence might well be lower than that among local residents, because of a generally more advantaged socioeconomic background, and the sensitivity for detecting infected individuals among them might not be %, as assumed. as such, this would be a lower bound of the cross-sectional disease prevalence. if this were the case, then we would have overestimated the reduction in transmissibility conferred by public health interventions in wuhan and overestimated the severity. based on only publicly available data, there is necessarily substantial uncertainty in our estimates of the effectiveness of intra-wuhan public health interventions in reducing transmissibility. calculating the instantaneous reproductive number from a set of line lists that are updated daily would be the most reliable method for detecting changes in transmissibility associated with interventions. there has been refinement of case definitions at both national and provincial levels, such as excluding rt-pcr-test-positive asymptomatics (perhaps, in fact, very mildly symptomatics) from being labeled an officially 'confirmed' case or including test-naïve clinically diagnosed cases with clear epidemiologic links as 'confirmed' . although these should not affect our estimation given our data sources from the earlier phase of the epidemic, such changes in the reporting criteria may influence the interpretation of future data. finally, given that wuhan is no longer the only (albeit the first) location with sustained local spread, it would be important to assess and take into account the experience from elsewhere, both domestically in mainland china and overseas. these secondary epicenters, having learned from the early phase of the wuhan epidemic, might have had a systematically different epidemiology and response that could impact the parameters estimated here - . 
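the caveats above note that calculating the instantaneous reproductive number from daily line lists would be the most reliable way to detect changes in transmissibility. a minimal sketch of the usual renewal-equation point estimate, r_t = i_t / Σ_s i_{t−s} w_s, is given below. it assumes that daily incidence by onset date and a discretized generation-time distribution w are available; the function name and the toy inputs are hypothetical, and no uncertainty quantification is attempted.

```python
def instantaneous_r(incidence, gen_weights):
    """Point estimate of R_t for each day from the renewal equation.

    incidence   : daily counts of new cases, ordered by day
    gen_weights : gen_weights[s - 1] is the probability that the generation
                  time equals s days (s = 1, 2, ...)
    Returns None for days on which no earlier infectious pressure exists.
    """
    estimates = []
    for t in range(len(incidence)):
        pressure = sum(
            incidence[t - s] * w
            for s, w in enumerate(gen_weights, start=1)
            if t - s >= 0
        )
        estimates.append(incidence[t] / pressure if pressure > 0 else None)
    return estimates


# toy series: incidence doubles daily, then plateaus after an intervention
print(instantaneous_r([1, 2, 4, 8, 8, 8], [1.0]))
```

estimates for the first few days are unreliable because the incidence history is truncated, so in practice they would be discarded.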
fig. | estimates of basic reproductive number, mean serial interval, initial doubling time, intervention effectiveness, ascertainment rate and the mean time from onset to death, assuming $p_{\text{sym}}$ is . (red), . (green) and . (blue). the markers show the posterior means and the bars show % cris.
we made the following assumptions in the model:
the population of wuhan is stratified into $m$ = age groups: - , - , - , - , - , - , - , - and > . the relative susceptibility to infection of age group $i$ is $\alpha_i$ with respect to those aged - years (that is, α = ). the scfr of age group $i$ is $\text{scfr}_i$.
the probability density function (pdf) of the incubation period, $f_{\text{incubation}}$, is gamma, with a mean of . days and standard deviation of . days.
the pdf of the time between onset and death, $f_{\text{onset-to-death}}$, is gamma. we inferred the values of the mean and standard deviation of $f_{\text{onset-to-death}}$ (extended data fig. ).
the pdf of the generation time, $f_{\text{gt}}$, is gamma and the same as that of the serial interval. we inferred the values of the mean and standard deviation of $f_{\text{gt}}$ (extended data fig. ).
the infection-symptomatic probability ($p_{\text{sym}}$; the proportion of infections that progress to develop symptoms) is the same for all age groups. we assume $p_{\text{sym}}$ = . in the baseline scenario and . and . in alternate scenarios.
the sensitivity of detecting symptomatic cases exported from mainland china is $p_{\text{det}}$ = % ( %- %) for cities that reported case importation between december and january (supplementary table ).
as such, we assume that the epidemic in wuhan was seeded by a single zoonotic event that generated $z$ infections on november . we inferred the value of $z$ (extended data fig. ).
public health interventions in wuhan reduced local transmissibility by $\phi$. we inferred the value of $\phi$ (extended data fig. ).
given that the epidemic curve in wuhan was weeks ahead of that in other mainland chinese cities, we ignored the effect of case importation at wuhan.
these assumptions were reflected in the following susceptible-infected-recovered (sir) model for simulating the covid- epidemic in wuhan, where $S_i(t)$ and $R_i(t)$ are the number of susceptible and recovered individuals in age group $i$ at time $t$, and $I_i(t, \tau)$ is the number of infected individuals in age group $i$ at time $t$ who were infected at time $t - \tau$: [equations not recovered]
the next-generation matrix for this sir model is [equation not recovered], where $T_g$ is the mean generation time. the basic reproductive number $R_0$ is the largest eigenvalue of this matrix, which is $\frac{\beta T_g}{N} \sum_{i=1}^{m} \alpha_i N_i$.
the incidence rates of infection, onset and death for age group $i$ at time $t$ are calculated as follows: [equations not recovered]
the number of new cases (onset) and the cumulative number of cases in age group $i$ on day $d$ are [equations not recovered]. let [symbols not recovered] be the summation of the number of new cases, the cumulative number of cases and the cumulative number of deaths across all age groups up to time $t$, respectively. similarly, [symbol not recovered] is the total number of infected individuals at time $t$.
we inferred the parameters listed in extended data fig. assuming that the remaining parameters are fixed at the values shown in extended data fig. . we use $\theta$ to denote the set of parameters that are subject to inference (extended data fig. ). the likelihood function is a product of several components associated with the data in supplementary tables - : [equation not recovered]
the formulation of each component was as follows:
the number of observed international case exportations on each day is assumed to be an imperfect poisson observation of the number of infected travelers leaving wuhan on that day who had or would develop symptoms.
let $x_d$ be the observed number of such international case exportations on day $d$ between december (d s , ) and january (d e , ) based on the data in supplementary table . we assume that travel behavior is not affected by disease and hence such case exportation occurs according to a non-homogeneous process with rate $\lambda(t) = p_{\text{sym}} \frac{L_{W,I}(t)}{N(t)} I(t)$. let $p_{\text{det}}$ be the probability that an infected traveler who has or will develop symptoms is detected in the destination country. the expected number of detected case exportations on day $d$ is [equation not recovered] and hence $x_d \sim \text{poisson}(\lambda_d)$. as such, the likelihood function associated with the data in supplementary table is [equation not recovered], where $g$ is the posterior distribution of $p_{\text{det}}$ from a separate study that had a mean of % and a % credible interval of - %.
let $y_d$ be the observed number of confirmed cases of covid- in wuhan with no epidemiologic links to huanan seafood wholesale market (which is presumed to be the index zoonotic source of the covid- epidemic) on day $d$ between december (d s , ) and january (d e, ) based on the data in supplementary table . these cases are assumed to be a poisson observation of the true number of newly symptomatic cases on that day, with ascertainment rate $\varepsilon$, which remained fixed over this time period. as such, assuming $y_d \sim \text{poisson}(\varepsilon \omega_d)$, the likelihood function for the data in supplementary table is [equation not recovered]
table . the prevalence of infection and symptoms among travelers is assumed to reflect a representative binomial sample of the same quantities in the wuhan population on their day of departure. the likelihood function associated with the data in supplementary table is [equation not recovered], where [symbol not recovered] is the proportion of individuals who were infected on day $d$.
we assume that all deaths from covid- infection in wuhan were confirmed. let $G$ be the cumulative number of death cases in wuhan as of february (time $T$). we assume $G \sim \text{poisson}(D(T))$ and hence the likelihood function associated with this data is $L(\theta) = \frac{e^{-D(T)} D(T)^{G}}{G!}$.
we assume that the age distribution of confirmed cases is a multinomial sampling process from the age distribution of true cases. let $c_i$ be the observed number of confirmed cases in age group $i$ in wuhan based on the data in supplementary table . the likelihood function for the data in supplementary table is [equation not recovered]
we assume that the age distribution of confirmed deaths is a multinomial sampling process from the age distribution of true deaths. given that most covid- deaths were wuhan-related, we assume that the age distribution of confirmed deaths for wuhan is the same as that for mainland china. let $b_i$ be the observed number of death cases in age group $i$ in wuhan based on the data in supplementary table . the likelihood function for the data in supplementary table is [equation not recovered]
with regard to the data in supplementary table , let $A$ be the set of death cases whose onset dates are known, and $B$ the set comprising the remaining cases. let $v_j$ be the observed time delay between onset and death for the $j$th case in $A$ and let $v_j^{l}$ be the observed time between hospital admission and death (which serves as a lower bound for the delay between onset and death) for the $j$th case in $B$. the likelihood function for the data in supplementary table is [equation not recovered], where $f_{\text{onset-death}}$ and $F_{\text{onset-death}}$ are the pdf and cumulative distribution function (cdf) of the time between onset and death (assumed to be gamma-distributed with mean $\mu_d$ and standard deviation $\sigma_d$).
with regard to the data in supplementary table , let $A$ be the set of infector-infectee pairs for whom the serial interval (time elapsed between their onset dates) is known and $B$ the set comprising the remaining pairs for whom only the ranges of their serial intervals are known. let $s_j$ be the observed value of the serial interval for the $j$th pair in $A$, and $[s_j^{l}, s_j^{u}]$ be the observed range of the serial interval for the $j$th pair in $B$.
for some infector-infectee pairs, the travel history and onset dates of the infector impose a lower bound on the serial interval (supplementary table ). let $s_j^{*}$ be such a lower bound for the $j$th pair. the likelihood function for the data in supplementary table is [equation not recovered], where $f_{\text{si}}$ and $F_{\text{si}}$ are the pdf and cdf of the serial interval. we assume that the serial interval and the generation time have the same pdf.
we estimated the model parameters $\theta$ using markov chain monte carlo methods with gibbs sampling and non-informative flat priors. point estimates and statistical uncertainty are presented using posterior means and % credible intervals (cris), respectively.
reporting summary. further information on research design is available in the nature research reporting summary linked to this article.
we collated epidemiological data from publicly available data sources (news articles, press releases and published reports from public health agencies). all the epidemiological information that we used is documented in the main text, the extended data and supplementary tables. the code is available upon request to the corresponding author.
all raw data have been provided in supplementary tables.
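the estimation approach described in the methods (poisson observation models combined in a likelihood and explored by markov chain monte carlo with flat priors) can be sketched for a single component. the example below takes the ascertainment model, in which the observed count on day d is poisson with mean ε·ω_d, and samples the ascertainment rate ε. it is an illustration only: it substitutes a plain random-walk metropolis update for the gibbs sampling used in the study, it treats the true onset counts ω_d as known rather than as outputs of the transmission model, and the data values and function names are hypothetical.

```python
import math
import random


def log_likelihood(eps, observed, true_onsets):
    """Log-likelihood for y_d ~ Poisson(eps * omega_d), summed over days."""
    total = 0.0
    for y, omega in zip(observed, true_onsets):
        mu = eps * omega
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total


def sample_ascertainment(observed, true_onsets, n_iter=20000, step=0.01, seed=0):
    """Random-walk Metropolis sampler for eps with a flat prior on (0, 1)."""
    rng = random.Random(seed)
    eps = 0.05  # arbitrary starting value inside (0, 1)
    current = log_likelihood(eps, observed, true_onsets)
    samples = []
    for _ in range(n_iter):
        proposal = eps + rng.gauss(0.0, step)
        if 0.0 < proposal < 1.0:  # flat prior: zero density outside (0, 1)
            candidate = log_likelihood(proposal, observed, true_onsets)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                eps, current = proposal, candidate
        samples.append(eps)
    return samples[n_iter // 4:]  # discard the first quarter as burn-in


def summarize(samples):
    """Posterior mean and central 95% credible interval."""
    ordered = sorted(samples)
    n = len(ordered)
    return sum(ordered) / n, ordered[int(0.025 * n)], ordered[int(0.975 * n)]


# hypothetical data: modelled symptomatic onsets and confirmed cases per day
true_onsets = [50, 80, 120, 200, 300]
observed = [1, 2, 3, 5, 8]
mean, lower, upper = summarize(sample_ascertainment(observed, true_onsets))
print(f"posterior mean {mean:.3f}, 95% cri {lower:.3f}-{upper:.3f}")
```

in the full model the log-likelihoods of all data components would be summed before the accept/reject step, and each parameter in θ would be updated in turn.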
behavioural & social sciences ecological, evolutionary & environmental sciences methods for estimating the case fatality ratio for a novel, emerging infectious disease case fatality risk of influenza a (h n pdm ): a systematic review human infection with avian influenza a h n virus: an assessment of clinical severity nowcasting and forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study secondary attack rate and superspreading events for sars-cov- report of the who-china joint mission on coronavirus disease (covid- ) early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia the novel coronavirus pneumonia emergency response epidemiology team. the epidemiological characteristics of an outbreak of novel coronavirus diseases (covid- )-china chinese center for disease control and prevention. dashboard of reported -ncov cases data platform of shanghai observer. line list of -ncov confirmed fatal cases (from publicly available information wuhan municipal health commission. wuhan municipal health commission's briefing on the current pneumonia epidemic in the city hubei municipal health commission's briefing on the current pneumonia epidemic in the province infection fatality risk of the pandemic a(h n ) virus in hong kong sars-cov- viral load in upper respiratory specimens of infected patients viral load of sars-cov- in clinical samples time to use the p-word? coronavirus enters dangerous new phase quantifying bias of covid- prevalence and severity estimates in wuhan, china that depend on reported cases in international travelers the state council of the people's republic of china national health commission of people's republic of china. 
notice of the general office of the national health commission on the distribution of the plan of prevention and control of the pneumonia caused by the novel coronavirus situation of the epidemic of pneumonia caused by the novel coronavirus in hubei case fatality of sars in mainland china and associated risk factors epidemiological determinants of spread of causal agent of severe acute respiratory syndrome in hong kong the epidemiology of severe acute respiratory syndrome in the hong kong epidemic: an analysis of all patients a comparative epidemiologic analysis of sars in hong kong, beijing and taiwan influenza: the mother of all pandemics age and sex incidence of influenza in the epidemic of - , with comparative data for preceding outbreaks: based on surveys in baltimore and other communities in the eastern states epidemiologic characterization of the influenza pandemic summer wave in copenhagen: implications for pandemic control strategies mortality from pandemic a/h n influenza in england: public health surveillance study middle east respiratory syndrome: what we learned from the outbreak in the republic of korea hospitalization fatality risk of influenza a (h n )pdm : a systematic review and meta-analysis middle east respiratory syndrome coronavirus (mers-cov) neutralising antibodies in a high-risk human population incubation period of novel coronavirus ( -ncov) infections among travellers from wuhan, china extended data fig. | a summary of severity estimates among pandemic influenza strains and coronaviruses with pandemic potential in the past the study is a mathematical modeling study. we have provided the full information of data in figures data exclusions not applicable. the study is a mathematical modeling study. we have provided the full information of data in figures replication all the data have been provided in figures, extended data and supplementary tables randomization not applicable. the study is a mathematical modeling study. 
we have provided the full information of data in figures. blinding: not applicable. the study is a mathematical modeling study. the authors declare no competing interests. extended data is available for this paper at https://doi.org/ . /s - - - . supplementary information is available for this paper at https://doi.org/ . / s - - - . correspondence and requests for materials should be addressed to j.t.w. peer review information: joao monteiro was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
key: cord- -abs rvjk authors: liu, ming; kong, jian-qiang title: the enzymatic biosynthesis of acylated steroidal glycosides and their cytotoxic activity date: - - journal: acta pharm sin b doi: . /j.apsb. . . sha: doc_id: cord_uid: abs rvjk herein we describe the discovery and functional characterization of a steroidal glycosyltransferase (sgt) from ornithogalum saundersiae and a steroidal glycoside acyltransferase (sga) from escherichia coli and their application in the biosynthesis of acylated steroidal glycosides (asgs). initially, an sgt gene, designated as ossgt , was isolated from o. saundersiae. ossgt -containing cell-free extract was then used as the biocatalyst to react with structurally diverse drug-like compounds. the recombinant ossgt was shown to be active against both β- and β-hydroxyl steroids. unexpectedly, in an effort to identify ossgt , we found that the bacterial laca gene in the lac operon actually encodes an sga, specifically catalyzing the acetylation of the sugar moieties of steroid β-glucosides.
finally, a novel enzymatic two-step synthesis of two asgs, acetylated testosterone- -o-β-glucosides (at- β-gs) and acetylated estradiol- -o-β-glucosides (ae- β-gs), from the abundantly available free steroids using ossgt and ecsga as the biocatalysts was developed. the two-step process is characterized by ecsga -catalyzed regioselective acylation of the hydroxyl groups on the sugar unit of unprotected steroidal glycosides (sgs) in the late stage, which yields four monoacylates and significantly streamlines the synthetic route towards asgs. ′-acetylated testosterone -o-β-glucoside showed improved cytotoxic activity towards seven human tumor cell lines. steroidal glycosides (sgs) are characterized by a steroidal skeleton glycosidically linked to sugar moieties, which can be further acylated with aliphatic and aromatic acids, thus forming complex acylated steroidal glycosides (asgs). the resulting steroidal glycoside esters (sges) exhibit a wide variety of biological activities, such as cholesterol-lowering effects, anti-diabetic properties, anti-complementary activity, immunoregulatory functions, and anti-cancer actions [ ] [ ] [ ], which make asgs promising compounds with pharmaceutical potential. numerous methods, including direct extraction, chemical synthesis, and biosynthesis, have been developed to obtain these acylated steroidal glycosides. direct extraction from varied organisms is one of the main methods to obtain asgs [ ] [ ] [ ]. however, the content of asgs is usually low in natural sources [ ] [ ] [ ]. moreover, the extraction routes are highly time-consuming and require laborious purification procedures [ ] [ ] [ ], resulting in poor yields and/or low purity of the final products. asgs have also been produced by chemical synthesis [ ] [ ] [ ] [ ] [ ].
however, these efforts often encounter a fundamental challenge, namely, regioselective acylation of a single hydroxyl group of unprotected sgs in the late stage of the chemical synthesis of asgs. sgs generally possess multiple hydroxyl groups with similar reactivity. regioselective acylation of a particular one of these hydroxyl groups generally requires multi-step protection/deprotection procedures, which make the synthetic pathway of these sges costly, wasteful, long and time-consuming, and result in low yields. the biosynthesis of asgs from free steroids based on enzymatic catalysis is expected to reduce the number of protection/deprotection steps due to the high selectivity of enzymes. theoretically, the biosynthesis of asgs includes two steps. in the first reaction, the sugar moiety from nucleotide-activated glycosyl donors is attached to steroids at different positions, most commonly at the c- hydroxyl group (oh), under the action of nucleotide-dependent sgts. the glycosylation of a hydroxyl group at the c- position of steroids is well characterized and a few steroidal β-glucosyltransferases have been isolated from diverse species. however, reports of sgts specific for positions other than c- of steroids are limited. the sugar moieties of the resultant sgs can be further acylated by sgas to form asgs in the next step. compared to sgts, surprisingly little is known about sgas. to date, no sga gene has been cloned, which in turn limits the enzyme-mediated biosynthesis of asgs. hence, the successful gene isolation and functional characterization of sgas have become a prominent challenge for the bioproduction of asgs. herein, the functional characterization of a plant-derived sgt with activity against both β- and β-hydroxyl steroids, and of a bacterial sga, as well as their application in the biosynthesis of asgs, are reported. initially, a steroidal glycosyltransferase ossgt was isolated from the medicinal plant o.
saundersiae and showed activities for β- and β-hydroxyl steroids. unexpectedly, in an effort to identify the function of ossgt , we characterized the laca protein (designated as ecsga ) from e. coli as an sga, catalyzing the conversion of testosterone- -o-β-glucoside (t- β-g) and estradiol- -o-β-glucoside (e- β-g) to the corresponding acylates. further, under the synergistic actions of ossgt and ecsga , the biosynthetic preparation of two acylated steroidal glycosides (asgs), namely acetylated t- β-gs (at- β-gs) and e- β-gs (ae- β-gs), was achieved for the first time, thereby yielding four monoacetylated steroidal glucosides, namely -o, -o, -o and -o-acylates (scheme ). the cytotoxic activities of these monoacylates were evaluated against seven human tumor cell lines (hct , bel , mgc , capan , nci-h , nci-h and a ), and -acetylated testosterone -o-β-glucoside was observed to display improved cytotoxic activity against these seven cell lines (scheme ). the species o. saundersiae is a monocotyledonous plant rich in steroidal glycosides, suggesting that it may contain sgts responsible for the glycosylation of steroidal aglycons [ ] [ ] [ ]. o. saundersiae was thus selected as the candidate plant for sgt isolation, and its transcriptome was sequenced with the aim of isolating genes encoding sgts. a total of , , raw reads were generated after the transcriptome sequencing of o. saundersiae. after removal of dirty reads with adapters, unknown or low-quality bases, a total of , , clean reads were retained. these clean reads were assembled with the software trinity to form longer unigenes. finally, an rna-seq database containing , unigenes with a mean length of bp was obtained. next, these unigenes were aligned to publicly available protein databases for functional annotation, retrieving unigenes displaying the highest sequence similarity with sgts. unigene , bp in length, was thus retrieved from the unigene database for its high similarity with sgts (supplementary information fig. s ).
moreover, the orf finder result showed that this unigene contained a complete open reading frame (orf) of bp, starting at nucleotide with an atg start codon and ending at position with a tga stop codon. the unigene contained bp of -utr (untranslated region) and bp of -utr. therefore, unigene was selected for further investigation. to verify the identity of unigene , a nested pcr assay was carried out to amplify the cdna corresponding to the orf of unigene using gene-specific primers (supplementary information table s ). an expected band of approximately . kb was obtained, as observed in agarose gel electrophoresis (supplementary information fig. s a). the amplicon was then inserted into the peasy tm -blunt plasmid (supplementary information table s ) to form a recombinant vector for sequencing. results indicated that the amplified product was % identical to unigene , confirming that unigene was a bona fide gene in the o. saundersiae genome. the bp orf encoded a polypeptide of amino acids (aa) with a predicted molecular mass of . kda and a pi of . . blast analysis of the deduced protein revealed its predominant homology with sterol β-glucosyltransferases from elaeis guineensis (xp_ . , %), musa acuminata subsp. malaccensis (xp_ . , %) and anthurium amnicola (jat . , %). the cdna was therefore designated as ossgt and submitted to genbank with an accession number of mf . the sequence of ossgt was first analyzed with the aim of directing its expression and functional verification. no putative trans-membrane domain was observed in ossgt based on the prediction results of tmhmm (http://www.cbs.dtu.dk/services/tmhmm/), suggesting that ossgt is a cytoplasmic sgt and may be expressed heterologously in e. coli in a soluble form. multiple alignment of ossgt and other plant sgts indicated that the middle and c-terminal parts of these sgts were more conserved than the n-terminal region (supplementary information fig. s ), consistent with previous reports.
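the orf search described above (a reading frame that opens at an atg start codon and closes at an in-frame stop codon such as tga, with the flanking sequence forming the untranslated regions) can be illustrated with a minimal python sketch. this is not the orf finder tool used in the study; the function names and the toy sequence are hypothetical, only the three forward frames are scanned, and the real unigene sequence is not reproduced here.

```python
# Minimal forward-strand ORF scan (illustrative sketch, not NCBI ORF finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq):
    """Return (start, end, orf) tuples with 1-based inclusive coordinates.

    Each ORF runs from an ATG to the first in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # include the stop codon
                orfs.append((start + 1, end, seq[start:end]))
                start = None                   # look for the next ATG
    return orfs

def longest_orf(seq):
    """Return the longest ORF found, or None if there is none."""
    return max(find_orfs(seq), key=lambda o: o[1] - o[0], default=None)

# Hypothetical toy transcript: 2 nt upstream, a 15 nt ORF, 3 nt downstream.
toy = "CC" + "ATGAAATTTGGGTGA" + "CCC"
start, end, orf = longest_orf(toy)
upstream_len = start - 1          # untranslated bases before the ATG
downstream_len = len(toy) - end   # untranslated bases after the stop codon
```

in this toy case the sequence before the atg and after the stop codon plays the role of the two untranslated regions reported for the unigene; a production analysis would also scan the reverse complement and apply a minimum length cutoff.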
moreover, two conserved motifs, namely a putative steroid-binding domain (psbd) and a plant secondary product glycosyltransferase box (pspg), were observed in ossgt (supplementary information fig. s ). the psbd region is located in the middle part of ossgt and is thought to be involved in the binding of steroidal substrates. the pspg box is about aa in length and close to the carboxy-terminus. this box is a characteristic "signature sequence" of udp glycosyltransferases and is deduced to be responsible for binding the udp moiety of the nucleotide sugar. the presence of the psbd and pspg boxes suggests that ossgt may be involved in secondary metabolism, catalyzing the transfer of sugars from udp-sugars to steroidal substrates, thereby forming steroidal glycosides. the phylogenetic tree based on the deduced amino acid sequences of ossgt and other sgts was generated by mega . . as shown in supplementary information fig. s , all selected sgts were clustered into four clades: mon, di, ba and fun. the four clades included sgts from monocots, dicots, bacteria and fungi, respectively. ossgt belonged to the mon clade, suggesting that ossgt was most similar to sgts from monocots.
scheme . an enzymatic two-step synthesis of at- β-g ( b- e) and ae- β-g ( b- e) from the free steroids testosterone ( ) and estradiol ( ). firstly, two sgs, t- β-g ( a) and e- β-g ( a), were prepared from their corresponding steroidal substrates testosterone ( ) and estradiol ( ) in the presence of a steroidal glycosyltransferase ossgt from o. saundersiae. the resulting t- β-g ( a) was further regioselectively acetylated under the action of an acyltransferase ecsga from e. coli, thereby yielding four monoacetylated steroidal glucosides ( b- e) with the yield ratio of : : : . likewise, e- β-g ( a) was acetylated by ecsga to form monoacetylated products b- e in a ratio of : : : .
ossgt was then inserted into pet- a(+) to yield the recombinant vector pet a-ossgt (supplementary information table s ), which was transformed into transetta(de ) (transgen, beijing, china) for heterologous expression. sodium dodecyl sulfate polyacrylamide gel electrophoresis (sds-page) indicated that most of the expressed ossgt protein was present in the form of insoluble inclusion bodies, which are regarded as devoid of bioactivity. chaperone proteins are well known to assist protein folding and thus increase the production of active protein. therefore, the chaperone plasmid pgro (takara biotechnology co., ltd., dalian, china) was co-expressed with pet a-ossgt in bl (de ) (transgen, beijing, china) to facilitate the soluble expression of ossgt . as shown in supplementary information table s , the plasmid pgro contains two genes encoding the chaperone proteins groes and groel. under the synergistic action of groes and groel, an intense band with an apparent molecular mass of kda was present in the crude extract of bl (de )[pet a-ossgt +pgro ], but not in the crude proteins of the control strain bl (de )[pet- a(+)+pgro ] (supplementary information fig. s b). the immunoblot analysis with an anti-polyhistidine tag antibody showed a bound band, whereas the control extract did not cross-react with the antibody (supplementary information fig. s c). these data collectively indicated that ossgt was successfully expressed in e. coli in a soluble form (supplementary information figs. s b and c), in accord with the prediction of soluble expression of ossgt . to identify the activity of ossgt , the ossgt -containing crude protein was used as the biocatalyst for glycosylation reactions. each member of the acceptor library ( - , fig. b and supplementary information fig. s ) was first assessed as a sugar acceptor (supplementary information table s ). of the substrates, only steroids were observed to be glucosylated by ossgt , forming the corresponding monoglucosides (figs.
- and supplementary information figs. s - ). the ten steroids included seven β-hydroxysteroids ( - , supplementary information figs. s - ), two β-hydroxysteroids ( - , figs. and ) and one β, β-dihydroxysteroid ( , supplementary information fig. s ). the activities of ossgt towards the ten substrates indicated that ossgt was an sgt active towards both β- and β-hydroxysteroids, consistent with the predictions from bioinformatics analyses (supplementary information figs. s and s ). in fact, reports of sgts with glycosylation activity against steroids at positions other than c- are limited, and only three sgts from yeast have been verified to exhibit selectivity towards both β- and β-hydroxysteroids. ossgt was therefore viewed as the first plant sgt with selectivity towards both β- and β-hydroxysteroids (figs. and and supplementary information figs. s - ). however, the glycosylation activity of ossgt towards the β-hydroxyl group was lost, with no glycosylated products formed, if an additional hydroxyl group at the β- ( β-oh-testosterone, ), β- ( β-oh-testosterone, ), β- ( β-oh-testosterone, ) or α-position ( α-oh-testosterone, ), or even a methyl group at the c -position (methyltestosterone and its derivatives, - ), was attached to testosterone ( ). moreover, ossgt had no activity towards other compounds, including steroids without β- and β-hydroxyl groups ( - and - ), flavonoids ( - ), alkaloids ( ) ( ) ( ) ( ) ( ) ( ) ( ), triterpenoids ( - ), phenolic acids ( - ) and coumarins ( - ), as shown in supplementary information fig. s . among the reactive β- and β-hydroxysteroids, dehydroepiandrosterone ( ) had the maximum conversion, approaching %, followed by diosgenin ( ) with % conversion, while the other compounds had conversions below % (fig. a). to produce sufficient glucosylated products for structural characterization, ossgt -mediated reactions were scaled up to preparative scale ( ml). the resultant glucosides ( ) were prepared by hplc and subjected to nmr analysis for structural elucidation.
fig. . hplc chromatogram of the reaction product of estradiol ( ) incubated with ossgt protein (a) or without ossgt (b). uv spectra of and enzymatic product a are shown in the upper panels. the hplc conditions are available in supplementary information table s .
to determine the glycosylation sites of a- a, a and a, the c nmr analyses of the corresponding aglycons - , and were also performed (supplementary information figs. s - and table s ). the c nmr glycosylation shifts (Δδ, δ glucoside - δ aglycon ) of these glycosides were then examined to ascertain the glycosylation position (table , and supplementary information tables s - ). the steroidal glycosides were observed to have significant glycosylation shifts (Δδ) at the c- (glycosides - ) or c- position (compounds - ), identifying them as - or -glycosides. for a and a, the location of the glucose group was determined to be at c- based on the hmbc correlations between h- and c- (supplementary information figs. s and ). the β-anomeric configuration of the d-glucose unit in these ten glucosides ( a- a) was determined from the large anomeric proton-coupling constants of h- (j = . hz) (table and supplementary information tables s - ). the structures of these glucosides were thus assigned as β-glucosides ( a- a and a) or β-glucosides ( a and a) of steroids based on h nmr ( a- a) and c nmr ( a- a) signals, and hsqc ( a- a, a- a, a- a), hmbc ( a- a, a) and dept ( a, a) spectra (table , supplementary information figs. s - and tables s - ). these data collectively showed that ossgt was an inverting-type glycosyltransferase. in the preparation of t- β-g ( a), when the concentrated reaction mixture was separated by reversed-phase high-performance liquid chromatography (rp-hplc), we unexpectedly discovered that, in addition to the major peak representing t- β-g ( a), a minor peak ( b) was also present in the hplc profile (supplementary information fig. s a).
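the glycosylation-shift comparison used earlier in this section (Δδ = δ glucoside - δ aglycon, with the largest shift marking the glycosylated carbon) can be sketched in a few lines of python. the chemical-shift values below are hypothetical placeholders chosen only to illustrate the calculation; they are not the measured values from the tables of this study, and the function names are likewise illustrative.

```python
# Illustrative glycosylation-shift calculation; all shift values are made up.

def glycosylation_shifts(glucoside, aglycon):
    """Return delta-delta = delta(glucoside) - delta(aglycon) per shared carbon (ppm)."""
    return {c: round(glucoside[c] - aglycon[c], 2)
            for c in aglycon if c in glucoside}

def likely_site(shifts):
    """Return the carbon with the largest absolute glycosylation shift."""
    return max(shifts, key=lambda c: abs(shifts[c]))

# Hypothetical 13C shifts (ppm) for a few carbons of an aglycon and its glucoside.
aglycon = {"C-3": 199.5, "C-16": 30.4, "C-17": 81.6, "C-18": 11.0}
glucoside = {"C-3": 199.6, "C-16": 29.0, "C-17": 88.9, "C-18": 11.6}

shifts = glycosylation_shifts(glucoside, aglycon)
site = likely_site(shifts)
```

with these placeholder numbers the carbon bearing the new glycosidic bond stands out by a downfield shift of several ppm, while neighbouring carbons move only slightly; in practice the assignment is confirmed with hmbc correlations, as described in the text.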
the minor product with a t r of . min was also subjected to lc-ms analysis. surprisingly, the [m + h]+ value of the minor product was assigned to . , more than that of monoglycosylated testosterone (supplementary information fig. s b). this finding hints that the minor product may be an acetylated testosterone glucoside. to characterize the exact structure of b, the minor product was prepared in bulk for nmr experiments (supplementary information figs. s - ). details of the h and c nmr spectra are tabulated in table . the minor product was thus identified as -acetylated testosterone -o-β-glucoside ( -at- β-g, b). to test whether the acetylated product b was derived from glucoside a, the purified glucoside a was incubated as the substrate with the crude extract of e. coli expressing pet- a(+) or pet a-ossgt , respectively. under both conditions, we observed the presence of b (supplementary information fig. s ). in contrast, no acetylated product b was detected in the e. coli lysate without the addition of substrate a (supplementary information fig. s ). we therefore inferred that testosterone ( ) was first glycosylated at the β-hydroxyl group by ossgt to form t- β-g ( a), which was then selectively acetylated at c- of the sugar moiety to yield -at- β-g ( b) by a soluble bacterial acetyltransferase (supplementary information fig. s ). likewise, two metabolites, e- β-g ( a) and -acetylated estradiol -o-β-glucoside ( -ae- β-g, b), were detected in the concentrated ossgt -catalyzed reaction mixture of estradiol ( ), as shown in supplementary information figs. s - and tables s and s . these data collectively revealed that e. coli cells contain at least one sga specific for the acetylation of steroidal β-glycosides. moreover, the sugar donor promiscuity of ossgt -catalyzed glycosylation reactions was also investigated.
β-sitosterol ( ) and testosterone ( ) were chosen as the sugar acceptors to react with the various sugar donors listed in the supplementary experimental section. results demonstrated that, under the action of ossgt , neither β-sitosterol ( ) nor testosterone ( ) reacted with any udp-activated sugar other than udp-glc, indicating that ossgt was specific for udp-glc. to characterize the genes encoding sgas, the first task was to analyze the genome sequence of bl (de ), which is publicly available in the ncbi database (accession no. cp . ). this bacterial strain contains at least putative acetyltransferase genes, of which laca, maa and wech were predicted to encode o-acetyltransferases (supplementary information table s ). as shown in supplementary information fig. s , further sequence analyses revealed that the wech protein is a membrane-bound protein with a total of membrane-spanning helices, inconsistent with the above results, in which the candidate acetyltransferase was determined to be a soluble protein in the bacterial lysate. the laca and maa proteins were predicted to have no transmembrane helices, suggesting that they are soluble in bacteria. thus, the remaining two genes, laca and maa, were further investigated. first, the entire orfs of the two genes were isolated from the bacterial genome using gene-specific primers (supplementary information fig. s a and table s ). the orfs of the laca and maa genes were and bp, encoding polypeptides of and aa, respectively. the predicted molecular weights of the two proteins were . and . kda. the two genes were then inserted into pet- a(+) to generate two recombinant vectors, which were introduced into bl (de ) for heterologous expression. after isopropyl-β-d-thiogalactoside (iptg) induction, the accumulation of a protein of approximately or kda was observed in the lysate of the bacterial strain harboring pet a-laca or pet a-maa (supplementary information fig. s b).
moreover, the presence of the bacterially expressed his-laca or his-maa fusion protein in the bacterial lysate was verified by western blot with an anti-his antibody (supplementary information fig. s c). the expressed laca or maa protein was then purified to near homogeneity by affinity chromatography (supplementary information fig. s b). the purified his-laca or his-maa fusion protein was used as the biocatalyst to react with t- β-g ( a) and acetyl-coa. the reactions were monitored by hplc-uv/ms analysis using method e (supplementary information table s ). as shown in fig. , -at- β-g ( b) was detected in the laca-catalyzed bioconversion of t- β-g ( a), confirming that laca encodes an sga (fig. , upper panel). in contrast, no new products were formed in the maa-mediated reaction. laca was thus designated as ecsga (e. coli steroidal glucoside acetyltransferase) for convenience hereinafter and submitted to genbank with an accession number of mf . it is generally accepted that hydrolases and acyltransferases are the two classes of enzymes responsible for acylation reactions of sgs. the enzymatic acylations reported so far are largely performed by hydrolases such as lipases. on the other hand, no genes encoding sgas have been isolated to date. to our knowledge, ecsga is therefore the first steroidal glycoside acyltransferase shown to catalyze the attachment of acyl groups to the hydroxyl groups of steroidal β-glycosides. ecsga was also observed to convert another steroidal β-glycoside, e- β-g ( a), to the corresponding acylate ( b, fig. , upper panel). on the other hand, the other glucosides listed in supplementary information fig. s could not be acetylated by ecsga , indicating that ecsga was specific to steroidal β-glycosides. moreover, the acyl donor promiscuity of ecsga was investigated.
t- β-g ( a) or e- β-g ( a) was used as the acyl acceptor to react with different acyl donors (acetyl-coa, succinyl-coa, arachidonoyl-coa, palmitoyl-coa and acetoacetyl-coa) under the action of the purified ecsga . the results showed that neither t- β-g ( a) nor e- β-g ( a) reacted with any of these acyl donors except acetyl-coa, indicating that ecsga has strict donor selectivity. on careful inspection of the hplc profile of the ecsga -catalyzed reaction mixture, we found several other minor peaks adjacent to the major product b (fig. , upper panel). these minor peaks were too close to be distinguished. therefore, a more efficient hplc method, namely method i (supplementary information table s ), was developed to separate these peaks. as shown in fig. (lower panel), besides the major product b (t r = . min), we observed three other minor peaks at t r = . , . and . min, respectively.
fig. . lower panel: hplc profile of the acetylated products of a separated by chromatographic method i (supplementary information table s ).
the lc-ms measurement of these minor peaks showed that all of them have an [m + na]+ value of . , suggesting that they are monoacetylated testosterone glucosides (supplementary information fig. s ). likewise, e- β-g ( a, t r = . min) was observed to form four acetylated glucosides using purified ecsga as the biocatalyst (fig. ). besides the well-characterized -ae- β-g ( b, t r = . min), the other three products were determined to be monoacetylated estradiol glucosides based on their ms data (supplementary information fig. s ). it was therefore assumed that ecsga could introduce an acyl group into different hydroxyl groups of steroidal β-glycosides, generating various monoacetylated products (figs. and ). to obtain sufficient amounts of monoacetylated testosterone glucosides for structural characterization and further cytotoxicity assays, an enzymatic two-step process for at- β-gs ( b- e) was developed (scheme ).
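the ms reasoning used above (a product whose ion is heavier than the glucoside by one acetyl unit is a monoacetylated glucoside) can be checked with a short python sketch. the molecular formulas are assumptions derived here from testosterone (c19h28o2) plus one hexose unit and one acetyl unit; they are not quoted from the source, and the computed masses are theoretical monoisotopic values rather than the measured values reported in the supplementary data.

```python
import re

# Monoisotopic atomic masses (u) for the elements needed here.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "Na": 22.98976928}
ELECTRON = 0.00054858  # mass of the electron removed on forming a cation

def mono_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C25H38O7'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total

def adduct_mz(formula, adduct):
    """m/z of the singly charged [M + H]+ or [M + Na]+ ion (adduct is 'H' or 'Na')."""
    return mono_mass(formula) + MONOISOTOPIC[adduct] - ELECTRON

# Assumed formulas: testosterone glucoside and its monoacetate.
GLUCOSIDE = "C25H38O7"       # C19H28O2 + C6H10O5 (one hexose unit)
MONOACETATE = "C27H40O8"     # glucoside + C2H2O (one acetyl unit)

acetyl_shift = mono_mass(MONOACETATE) - mono_mass(GLUCOSIDE)
mz_h = adduct_mz(MONOACETATE, "H")
mz_na = adduct_mz(MONOACETATE, "Na")
```

each acetylation adds c2h2o, about 42.01 u, so a diacetate would appear a further 42.01 u higher; a single shared m/z for all four peaks is therefore consistent with four positional isomers of one monoacetate.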
firstly, whole-cell biotransformation for the formation of at- β-gs ( b- e) was explored because of its simple catalyst preparation. when testosterone ( ) was incubated with the engineered strain bl (de )[pet a-ossgt +pgro ], no new products were detected. on the other hand, when t- β-g ( a) was added into the same whole-cell system, -at- β-g ( b) was present in the reaction mixture (supplementary information fig. s ). these data indicated that testosterone ( ) could not be transported into the cell, whereas glycosylation of testosterone ( ) significantly improved its transport into the cell. thus, the formation of at- β-gs ( b- e) from testosterone ( ) using a single whole-cell biocatalyst is infeasible. a two-step process was therefore established to address this limitation. specifically, the ossgt -catalyzed reaction was performed in the membrane-free crude cell extract of bl (de )[pet a-ossgt +pgro ], while the ecsga -mediated acetylation was conducted in the whole-cell system of bl (de )[pet a-ecsga ]. the optimal ph and temperature of the ossgt -catalyzed reaction using the cell-free extract of bl (de )[pet a-ossgt +pgro ] as the biocatalyst were first determined to be an alkaline ph of and c, respectively (supplementary information fig. s ). next, the μl screening scale of the ossgt -catalyzed glycosylation reaction was scaled up to ml, in which mg of testosterone ( ) was glycosylated to form mg of t- β-g ( a) under optimized conditions (scheme ). the resultant t- β-g ( a) was subsequently used as the substrate in the scaled-up whole-cell system of bl (de )[pet a-ecsga ] ( ml) under the optimized ph . and c. the resulting reaction mixture was subjected to high-performance liquid chromatography-solid phase extraction-nmr spectroscopy (hplc-spe-nmr) measurement. comparison of the h and c nmr spectra of c- e with those of b suggested that compounds c- e had the same framework as b and that the structural difference might be the position of the acetyl group.
the location of the acetyl group was determined to be at c- based on the hmbc correlations between h- (δ . ) and c- " (δ . ), as shown in supplementary information fig. s . thus, compound c was assigned as -at- β-g. the isolated glucose proton at δ . (h- ) of compound d exhibited long-range correlations with the carbonyl carbon at δ . (supplementary information fig. s ). moreover, h- (δ . ) of compound e showed long-range correlations with c- " (δ . ), as revealed by the hmbc spectrum (supplementary information fig. s ). these data supported the assignment of d and e as -at- β-g and -at- β-g, respectively. hence, the three trace products at t r = . , . and . min were assigned as -( d), -( e) and -at- β-g ( c) based on their respective nmr data (tables - and supplementary information figs. s - ).
fig. . lower panel: hplc profile of the acetylated products of a separated by chromatographic method (supplementary information table s ).
these data indicate that ecsga can effectively introduce the acetyl group into the primary hydroxyl group and each secondary hydroxyl group of t- β-g ( a), yielding four monoacylates without the formation of diacylates (fig. and scheme ). because the primary c( )-oh was the most reactive of the four hydroxyl groups in t- β-g ( a), acetylation of t- β-g ( a) took place preferentially at the c( )-oh, giving the -o-acylate predominantly in % yield (fig. and scheme ). ecsga can also regioselectively acetylate each secondary hydroxyl of t- β-g ( a) in the presence of the primary hydroxyl group, giving -( c), -( d) and -at- β-g ( e) in %, % and % yield, respectively. these data revealed that the reactivity trend of the hydroxyls is -oh > -oh > -oh > -oh. likewise, the formation of four monoacylates was also observed in the ecsga -catalyzed acylation of e- β-g ( a, t r = . min, fig. ). in addition to the well-characterized major product b, there are three trace products c- e.
because of their trace amounts, we did not further enrich these monoacetylated estradiol glucosides for nmr analysis. however, based on the catalytic behavior of ecsga towards t- β-g ( a), these products were inferred to be most likely -( c, t r = . min), -( d, t r = . min) and -ae- β-g ( e, t r = . min, fig. ). the order of reactivity of the hydroxyls was determined as -oh > -oh > -oh > -oh with a yield ratio of : : : (fig. ). regioselective acylation of one of the multiple hydroxyl groups in sgs is the major obstacle to the synthesis of sges, and direct methods for site-selective acylation of unprotected sgs have rarely been documented. in this contribution, we successfully achieved the regioselective acylation of fully unprotected sgs using ecsga as the biocatalyst, thereby leading to an extremely short synthesis of asgs. the acetylated steroidal glucosides b, c, d and e, together with b, were tested for their in vitro cytotoxicity against seven human cancer cell lines: hct , bel , mgc , capan , nci-h , nci-h and a . the results indicated that -at- β-g ( d) exhibited a wide spectrum of cytotoxic activities against the tested cell lines (table ). -at- β-g ( b) displayed much less cytotoxicity than -at- β-g ( d) but showed mild cytotoxicity against the human non-small cell lung carcinoma cell line nci-h , with an ic value of . μmol/l (table ). in contrast, the control t- β-g ( a) did not display significant cytotoxicity towards the tested cell lines (ic . μmol/l). this evidence revealed that the acyl groups of sges are important for their cytotoxicity, and direct regioselective acylation of sgs is thus believed to be a powerful tool for the discovery of drug candidates. acylated steroidal glycosides have attracted our attention primarily due to their biological and pharmacological significance. there are two classes of enzymes, namely sgts and sgas, responsible for the biosynthesis of asgs.
to synthesize asgs, the first prerequisite is to obtain glycosyltransferases capable of catalyzing the formation of sgs from the abundantly available free steroids. o. saundersiae is a monocotyledonous plant rich in acylated steroidal glycosides, suggesting that it may contain sgts and sgas responsible for the biosynthesis of asgs, and was therefore selected as the candidate plant for sgt isolation. thus, the transcriptome of o. saundersiae was sequenced to facilitate gene discovery. ossgt was then isolated from o. saundersiae based on the rna-seq data. subsequently, the ossgt -containing cell-free extract was used as the biocatalyst for glycosylation of structurally diverse drug-like scaffolds. the use of cell-free extract offers a number of advantages. unlike laborious purification procedures, the preparation of cell-free extract is simple and time-saving. moreover, compared with purified enzymes, the recombinant proteins used in the crude extract-based system are more stable. steroidal glycosides are one of the main sources of innovative drugs. sgt-catalyzed glycodiversification of steroids could expand molecular diversity, thereby facilitating the discovery of pharmacological leads. thus, the search for sgts with catalytic promiscuity may provide potent biocatalysts for glycodiversification. therefore, a library containing structurally diverse drug-like molecules was reacted with the recombinant ossgt to explore the substrate flexibility of ossgt . in vitro enzymatic analyses revealed that ossgt was active against various steroids, including phytosterols ( - , supplementary information figs. s - ), steroid hormones ( , - , figs. and , and supplementary information figs. s , s and s ), a steroidal sapogenin ( , supplementary information fig. s ) and a cardiac aglycon ( , supplementary information fig. s ), exhibiting a wider substrate range than that of previously identified plant sgts.
to investigate the regioselectivity of ossgt , diversified steroids were selected as the sugar acceptors for ossgt -catalyzed glycosylations. as illustrated in figs. - and supplementary information s - , ossgt specifically attacked the hydroxyl groups at c- and c- positions, but showed no activity towards hydroxyl groups at c- ( , ), c- ( ), c- ( ), c- ( ) , c- ( ), c- ( ) , c- ( , and ) and c- ( and ) . when steroids having two potentially reactive hydroxyl groups, like α-hydroxypregnenolone ( ) or androstenediol ( ), were used as the substrate for ossgt -assisted glycosylation, only glycosides with a glycosyl substituent in c- position were detected in the reaction mixture, suggesting that ossgt exhibited prominent regioselectivity towards the -oh of both substrates ( supplementary information figs. s and s ) . the stereoselectivity of ossgt was also assessed in this study. estradiol ( ) and α-estradiol ( ) differ in the configuration of the hydroxyl group at c- position. when each of the two compounds was used to react with ossgt , only β-configurated glycosides were generated (fig. ) . likewise, ossgt showed β-selective glycosylation towards the hydroxyl group at c- position. cumulatively, this evidence revealed that ossgt -catalyzed glycosylations proceeded in a regio- and stereoselective fashion. one of the most striking findings of this study is the characterization of bacterial laca protein as a steroidal glycoside acyltransferase. it is well known that laca is one of three structural genes (lacz, lacy and laca) in the lac operon , . the functions of lacz and lacy are well characterized , . lacz encodes a β-galactosidase, catalyzing the cleavage of lactose into glucose and galactose. lacy encodes a lactose permease responsible for lactose uptake , . the third structural protein, encoded by the laca gene in the lac operon, was initially inferred to be an acetyltransferase. the exact action of this protein, however, has remained in doubt until now.
in this investigation, in an effort to identify the function of ossgt , we unexpectedly characterized laca protein from e. coli as a sga. in vitro enzymatic analyses revealed that laca protein could specifically catalyze the attachment of acyl groups to the hydroxyl groups of sugar moieties of steroidal β- glycosides (figs. and ) . although we have no evidence for the role of laca protein in vivo, the findings of the present work may provide a starting point for identifying the exact activity of laca protein in lactose metabolism. the bottleneck in enzymatic synthesis of asgs is the lack of well-characterized sgas. its successful characterization makes laca the first sga, and the laca protein was thus designated as ecsga . a novel enzymatic two-step synthesis of at- β-gs and ae- β-gs from the abundantly available free steroids under the sequential actions of a steroidal glycosyltransferase ossgt from o. saundersiae and ecsga was achieved. the two-step process is characterized by acyltransferase-catalyzed regioselective acylations of all hydroxyl groups of unprotected sgs in the late stage, thereby significantly streamlining the synthetic route towards asgs and thus forming four monoacylates. regioselective acylation could expand molecular diversity, thereby facilitating the discovery of pharmaceutical leads. in this investigation, ecsga -catalyzed acetylation of two steroid β-glucosides (t- β-g and e- β-g) led to the production of eight new monoacylates. furthermore, the cytotoxic activities of these monoacylates were tested and -at- β-g was observed to display improved activities towards seven human tumor cell lines, suggesting that this compound has promising pharmacological potential. this study therefore reports for the first time a novel synthetic process for the green preparation of acylated steroidal glycosides with medicinal interest. a steroidal glycosyltransferase ossgt from o.
saundersiae was identified to be the first plant sgt with selectivity towards both β- and β-hydroxysteroids. one of the most striking findings of this study is the characterization of bacterial laca protein as a steroidal glycoside acyltransferase, catalyzing the attachment of acyl groups to the hydroxyl groups of steroidal β-glycosides. a novel enzymatic two-step synthesis of at- β-gs and ae- β-gs from the abundantly available free steroids under the sequential actions of ossgt and ecsga was achieved. the two-step process is characterized by acyltransferase-catalyzed regioselective acylations of all hydroxyl groups of unprotected sgs in the late stage, thereby significantly streamlining the synthetic route towards asgs and thus forming four monoacylates. moreover, the cytotoxic activities of these monoacylates were tested and -at- β-g was observed to display improved activities towards seven human tumor cell lines. this study therefore reports for the first time a novel synthetic process for the green preparation of acylated steroidal glycosides with medicinal interest. in this contribution, four compound libraries, namely sugar acceptor, sugar donor, acyl acceptor and acyl donor libraries, were provided for enzyme-mediated reactions. the compounds listed in fig. and supplementary information fig. s , which include diverse structures such as steroids , flavonoids ( ) ( ) ( ) ( ) ( ) ( ) ( ) , alkaloids ( ) ( ) ( ) ( ) ( ) ( ) ( ) , triterpenoids ( - ), phenolic acids ( - ) and coumarins ( - ), were used as the sugar acceptors for ossgt -catalyzed glycosylation reactions ( fig. and supplementary information fig. s ). the sugar donors consist of seven udp-activated nucleotides, among which udp-d-glucose (udp-glc), udp-d-galactose (udp-gal), udp-d-glucuronic acid (udp-glca) and udp-n-acetylglucosamine (udp-glcnac) were obtained from sigma-aldrich co., llc. (st. louis, mo, usa).
udp-d-xylose (udp-xyl), udp-l-arabinose (udp-ara) and udp-d-galacturonic acid (udp-gala) were synthesized by enzyme-mediated reactions in our laboratory [ ] [ ] [ ] . the acyl acceptor library is made up of steroidal glucosides ( a- a) and other glucosides ( - ) listed in supplementary information fig. s . the acyl donor library includes acetyl-coa, succinyl-coa, arachidonoyl-coa, palmitoyl-coa and acetoacetyl-coa, all of which were purchased from sigma-aldrich co., llc. the other chemicals were of reagent or analytical grade when available. the raw reads from the sequence library of o. saundersiae were first filtered to obtain clean reads, discarding dirty reads with adaptors or with unknown or low-quality bases. these clean reads were subsequently assembled into longer unigenes using the assembly program trinity. the unigenes obtained by de novo assembly could not be extended on either end. next, these unigenes were aligned by the blastx algorithm to protein databases, such as nr, swiss-prot, kegg and cog (e-value < . ), for functional annotation. the unigenes displaying similarity to sgts were retrieved for further orf analysis. in short, unigenes with a complete orf and high similarity to sgts were selected as candidates for further investigation. to verify the authenticity of the candidate unigene, cdna isolation was performed using gene-specific primers by a nested pcr assay as previously described (supplementary information table s ) [ ] [ ] [ ] [ ] . the obtained amplicon was inserted into peasy tm -blunt plasmid (transgen co., ltd., beijing, china) and then transformed into e. coli trans -t competent cells for recombinant plasmid selection (supplementary information table s ). the resultant recombinant plasmid was isolated and subjected to nucleotide sequencing. the obtained cdna was thus designated as ossgt for convenience.
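the orf screening step in the unigene selection described above can be illustrated with a minimal sketch. this is not the authors' pipeline (the text does not name the tool used for orf analysis); the function names and the `min_codons` threshold are hypothetical, and a "complete orf" is assumed here to mean an in-frame atg followed by an in-frame stop codon on either strand.

```python
# Hypothetical sketch of a "complete ORF" check on an assembled unigene.
# Assumption: a complete ORF runs from an in-frame ATG to the first
# in-frame stop codon; both strands and all three frames are scanned.

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_complete_orfs(seq: str, min_codons: int = 100):
    """List complete ORFs as (strand, start, end, n_codons).

    start/end are 0-based positions on the scanned strand; end is
    exclusive and includes the stop codon. n_codons excludes the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs
```

in practice, candidates passing such a filter would still need the similarity search against sgts described in the text; the length threshold is only a placeholder to be tuned.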
the bioinformatics analyses of ossgt , such as prediction of physicochemical properties, multiple sequence alignment and phylogenetic analysis, were performed as detailed in our previous reports [ ] [ ] [ ] [ ] . ossgt was amplified using gene-specific primers (supplementary information table s ) and the resulting pcr product was ligated into the ecori and hind iii sites within the pet- a(+) vector (novagen, madison, usa) using a seamless assembly cloning kit (clonesmarter technologies inc., houston, tx, usa) as described previously . the generated construct pet a-ossgt was transformed into e. coli strain transetta (de ) for expression as described previously . also, to improve heterologous expression of ossgt , pet a-ossgt was cotransformed into e. coli bl (de ) strain with a chaperone plasmid pgro (takara biotechnology co., ltd., dalian, china) as introduced by yin et al. the expression of ossgt was induced by iptg at a final concentration of . mmol/l. the expressed ossgt was checked by sds-page and western-blot analyses as described by guo et al. next, the bl (de ) [pet a-ossgt +pgro ] suspension cells were disrupted in a high-pressure homogeniser (apv- , albertslund, denmark) operated at bar. disrupted cells were centrifuged at , rpm for min to discard the pellet. the resultant supernatant, namely the membrane-free crude extract, was used as the biocatalyst for steroidal glycosylation. after verification of heterologous expression of ossgt , the crude extract containing the recombinant ossgt was applied as the biocatalyst to react with various sugar acceptors and donors ( fig. and supplementary information fig. s ). the μl reaction mixture contained mmol/l phosphate buffer (ph . ), a sugar acceptor ( mmol/l), a sugar donor ( mmol/l) and μl crude ossgt proteins. the reaction mixture was incubated at c for h. the formation of glycosylated products was unambiguously determined by a combination of hplc-uv, hplc-ms and nmr as described previously , . the determination conditions for hplc-uv are summarized in supplementary information table s .
table the cytotoxic activities of monoacylates against human tumor cell lines.
the genomic dna was extracted from e. coli strain bl (de ) using bacteriagen dna kit (cwbio co., ltd., beijing, china) according to the supplier's recommendations. the resulting genomic dna was then used as the template for pcr amplification to isolate the candidate sga genes using gene-specific primers (supplementary information table s ). the amplified pcr products were inserted into peasy tm -blunt plasmid to generate recombinant vectors for sequencing verification. next, these o-acetyltransferase-encoding genes were heterologously expressed in bl (de ) as described above. sds-page and western-blot analyses of these recombinant proteins were conducted as for ossgt (see above). the recombinant o-acetyltransferase proteins were purified on ni-nta agarose columns according to the manufacturer's protocol. purified protein concentrations were determined using the bradford protein assay (bio-rad, hercules, ca, usa). the enzymatic activities of ecsgas were determined in μl citrate buffer solution (ph . ) containing an acyl acceptor ( mmol/l) listed in supplementary information fig. s , an acyl donor ( mmol/l) summarized in the chemicals section and the purified protein ( . μg). the reactions were incubated at c for h. then μl methanol was added to terminate the reaction. the reaction mixture was monitored by hplc-uv (supplementary information table s ) and the structure of the generated product was determined by a combination of hplc-ms and nmr as reported by liu et al. .
biotransformation of testosterone and testosterone- -o-β-glucoside using the whole cell system
after iptg induction, the engineered bl (de )[pet a-ossgt +pgro ] cells were harvested by centrifugation at , rpm for min and then resuspended in m medium to a cell density (od value) of . .
the substrate testosterone ( ) or testosterone- -o-β-glucoside (t- -o-β-g, a) was added to the m medium at a final concentration of . mmol/l, and incubation was continued at c overnight. the formation of products was monitored by hplc analysis as mentioned above. the effects of ph and temperature on ossgt - and ecsga -catalyzed reactions were investigated. the crude extract of bl (de )[pet a-ossgt +pgro ] and the purified ecsga protein were applied as the biocatalysts in their respective reactions. the effects of ph on both reactions were determined in various buffers, including citric acid/sodium citrate buffer ( . mol/l, ph . - . ), na hpo /nah po buffer ( . mol/l, ph . - . ) and na hpo / naoh buffer ( . mol/l, ph . - ). the influence of temperature was explored in a range of to c with intervals of c (ossgt -mediated glycosylation) or c (ecsga -catalyzed acylation) in the standard reaction mixture as described above. scale-up of ossgt - and ecsga -catalyzed reactions was performed to obtain sufficient at- β-gs for structural characterization and further cytotoxicity assay. initially, the μl ossgt -catalyzed reaction was directly scaled to ml, in which mg testosterone ( ) was added and then incubated with crude cell extract at the optimal ph and temperature for h. the resultant reaction mixture was applied to preparative hplc to isolate pure t- β-g, which was then used as the substrate in a ml ecsga -catalyzed reaction for at- β-gs production. structure characterization of at- β-gs was performed using the hplc-spe-nmr technique as described by liu et al. , with some modifications to the chromatographic conditions. hplc separation was carried out on a ymc-pack ph column ( μm, nm, mm × . mm) with an isocratic elution of % water-trifluoroacetic acid (a, . %: . %, v/v) and % methanol (b) at a flow rate of ml/min.
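the ph and temperature screens described in this section are commonly summarized as relative activity, with the best condition set to 100 %, before the optimum is carried into scale-up. the sketch below shows that normalization under stated assumptions: the function names are hypothetical, the input is any activity readout (for example a product peak area from hplc-uv), and the source does not state how the authors normalized their data.

```python
from typing import Dict, Mapping


def relative_activity(activity_by_condition: Mapping[float, float]) -> Dict[float, float]:
    """Express each measured activity as a percentage of the maximum.

    Keys are the tested conditions (pH values or temperatures), values are
    raw activity readouts in any consistent unit.
    """
    peak = max(activity_by_condition.values())
    if peak <= 0:
        raise ValueError("no positive activity was measured")
    return {cond: 100.0 * act / peak for cond, act in activity_by_condition.items()}


def optimal_condition(activity_by_condition: Mapping[float, float]) -> float:
    """Return the tested condition with the highest measured activity."""
    return max(activity_by_condition, key=activity_by_condition.get)
```

for a made-up ph screen such as `{5.0: 20.0, 6.0: 40.0, 7.0: 80.0, 8.0: 60.0}`, the helper reports ph 7.0 as the optimum and the other conditions as 25 %, 50 % and 75 % of it; these numbers are illustrative only and are not taken from the study.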
seven human cancer cell lines, hct- (human colon cancer cell line), bel (human hepatocellular carcinoma cell line), mgc (human gastric carcinoma cell line), capan (human pancreatic cancer cell line), nci-h , nci-h and a (human lung cancer cell lines) were used in the cytotoxicity assay. the viability of the cells after treated with various chemicals was evaluated using the mtt ( -( , -dimethylthiazol- -yl)- , -diphenyl tetrazolium bromide) assay performed as previously reported , . the inhibitory effects of these tested compounds on the proliferation of cancer cells were reflected by their respective ic ( % inhibitory concentration).
supplementary data associated with this article can be found in the online version at doi: . /j.apsb. . . .
key: cord- -u ts ur authors: furuyama, wakako; reynolds, pierce; haddock, elaine; meade-white, kimberly; quynh le, mai; kawaoka, yoshihiro; feldmann, heinz; marzi, andrea title: a single dose of a vesicular stomatitis virus-based influenza vaccine confers rapid protection against h viruses from different clades date: - - journal: npj vaccines doi: . /s - - -z sha: doc_id: cord_uid: u ts ur the avian influenza virus outbreak in highlighted the potential of the highly pathogenic h n virus to cause severe disease in humans. therefore, effective vaccines against h n viruses are needed to counter the potential threat of a global pandemic. we have previously developed a fast-acting and efficacious vaccine against ebola virus (ebov) using the vesicular stomatitis virus (vsv) platform. in this study, we generated recombinant vsv-based h n influenza virus vectors to demonstrate the feasibility of this platform for a fast-acting pan-h influenza virus vaccine. we chose multiple approaches regarding antigen design and genome location to define a more optimized vaccine approach.
after the vsv-based h n influenza virus constructs were recovered and characterized in vitro, mice were vaccinated by a single dose or prime/boost regimen followed by challenge with a lethal dose of the homologous h clade virus. we found that a single dose of vsv vectors expressing full-length hemagglutinin (hafl) were sufficient to provide % protection. the vaccine vectors were fast-acting as demonstrated by uniform protection when administered days prior to lethal challenge. moreover, single vaccination induced cross-protective h -specific antibodies and protected mice against lethal challenge with various h clade viruses, highlighting the potential of the vsv-based hafl as a pan-h influenza virus emergency vaccine. influenza a viruses, which belong to the family orthomyxoviridae, have a single-stranded negative-sense rna genome consisting of eight segments. they are important zoonotic pathogens, with high morbidity in pigs, horses, poultry, and humans. influenza a viruses have two envelope glycoproteins (gps), hemagglutinin (ha) and neuraminidase (na), and are divided into subtypes based on antigenicity. subtypes h - ha and n - na have been isolated from water birds, the natural reservoir of influenza a viruses. , until , avian influenza a viruses were considered unlikely to be transmitted directly to humans because they do not bind the human sialic acid-α , -galactose (saα , gal) receptor with high affinity. however, highly pathogenic avian influenza (hpai) viruses can be transmitted from wild birds upon close contact causing sporadic outbreaks in domestic poultry. this happened for the first time in in hong kong when human cases of respiratory illness, including six fatalities, were caused by hpai subtype h n viruses. [ ] [ ] [ ] since then, human cases, with deaths (~ % case fatality rate), have been reported by the world health organization. furthermore, some reassortant h viruses with different na subtypes (e.g. 
h n , h n , and h n ) originated from the same ancestral h n virus, and have recently emerged in china and spread to other countries in eurasia and north america. [ ] [ ] [ ] [ ] [ ] [ ] since some hpai viruses are resistant to the currently available treatment options for influenza a virus infections namely oseltamivir, amantadine, and interferon (ifn), , the development of vaccines is an ongoing effort of high priority for public health to be prepared for a potential epidemic or pandemic of hpai. several different vaccination strategies have been developed against influenza a viruses including inactivated whole virus, liveattenuated influenza virus, viral vectors, and dna vaccines. currently, the fda-approved and licensed whole virus and liveattenuated vaccines against human influenza a viruses are mainly produced in embryonated chicken eggs and the manufacturing process can take up to months. [ ] [ ] [ ] unfortunately, the high pathogenicity of hpai viruses for the chicken embryo reduces virus growth complicating efforts to obtain quality allantoic fluid with high virus titers. therefore, hpai viruses are not suitable as seed viruses for inactivated virus-based vaccine production. vesicular stomatitis virus (vsv) is a single-stranded negativesense rna virus in the family rhabdoviridae. although vsv can cause disease in livestock and other animals, it is highly restricted by the human ifn response and generally does not cause any or only very mild disease. the vsv platform used here is based on the attenuated replication-competent vaccine that produces a rapid and robust immune response to foreign antigens after a single immunization and has been shown to protect against numerous pathogens. [ ] [ ] [ ] [ ] [ ] especially, the vsv-based ebola virus (ebov) vaccine, vsv-ebov (also known as rvsv-zebov or ervebo), which expresses the ebov gp instead of the vsv gp, is considered safe and highly immunogenic based on data from multiple clinical trials. 
, noteworthy, vsv-ebov has shown promising efficacy against ebov in a phase iii clinical trial and is currently being used in the democratic republic of the congo during the ongoing ebov outbreak. the promising safety profile of this liveattenuated vaccine and the favorable immune cell targeting mediated by the ebov gp makes vsv-ebov an interesting platform for vaccine development. the feasibility of this concept has previously been demonstrated in preclinical studies with vaccines for influenza (hpai virus), flavi-(zika virus), and bunyaviruses (andes virus). , [ ] [ ] [ ] in this study, we designed and tested different vsv-ebov-based vaccine vectors expressing different versions of the h n ha (a/vietnam/ / (vn/ )) to demonstrate the feasibility of the platform for a fast-acting pan-h vaccine. mice were vaccinated with a single dose or prime/boost regimen of the different vaccine candidates and challenged with a lethal dose of homologous h n virus. we found that a single vaccination with vsv-vectors expressing the full-length ha (hafl) induced crossreactive h -specific antibodies and conferred complete protection against lethal challenge with various h clade viruses. furthermore, a single dose of these vaccine vectors provided uniform protection in mice against lethal h n challenge within days after vaccination. we generated vsv-based h vaccine vectors by inserting either the full-length open reading frame (orf) of the h n hafl (vn/ ) or a soluble version of this gene lacking the transmembrane and cytoplasmic domains but carrying a mutated single-basic cleavage site to prevent cleavage in the cells and a gcn leucine zipper domain (shazip) for stabilization of the trimeric structure into the vsv-ebov vector , (fig. a) . this shazip antigen has previously been shown to be protective in chickens as a subunit vaccine. we also generated a vsv vector expressing the h n hafl alone without the ebov gp (vsv-hafl; fig. 
a ) in order to control for the contribution of the ebov gp to vaccine efficacy. expression of the different h antigens from the vsv vectors was confirmed by subjecting the supernatant of infected cells to sds-page and immunoblotting (fig. b, supplementary fig. ). first, we showed the presence of vsv particles by detecting the vsv matrix (m) protein in the supernatant of infected cells (fig. b, supplementary fig. ). the incorporation of ebov gp into vsv particles differed among the vectors and was, as expected, highest for vsv-ebov for which it is the only surface protein and antigen encoded by this vector (fig. b, supplementary fig. ). expression of shazip was verified by detecting the non-cleaved sha precursor likely secreted from infected cells. hafl expression was demonstrated by detecting the furin-cleaved fragment ha in mature spikes on vsv particles. as expected, the incorporation of hafl into recombinant vsv particles was much stronger for vsv-hafl compared to vsv-ebov-hafl, likely because it is the only surface gp and encoded antigen in the vsv vector. next, we performed a series of studies measuring the rate and extent of vaccine virus growth over time. vero e cells were infected in triplicate with each vsv-based vector (multiplicity of infection (moi) . ) and samples were collected from the supernatant at , and h for titration. wild-type vsv (vsvwt) grew more rapidly and to significantly higher titers early post infection compared to any of the other recombinant vsv-based vectors (fig. c) . we did not observe any significant difference in the growth kinetics of vsv-based vectors expressing either one (vsv-ebov, vsv-hafl) or two foreign antigens (vsv-ebov-shazip, vsv-ebov-hafl) with most vectors reaching peak titers between and tcid /ml at h suggesting that the expression of a second antigen did not significantly further attenuate vsv-ebov (fig. c) . ( tcid ) of hpai h n . 
as expected, control and vsv-ebov vaccinated mice succumbed to the lethal h n challenge within days (fig. ) . single-dose vaccination of the vsv-ebov-shazip showed only . % protection against h n infection with severe weight loss (fig. , left panels). prime/boostvaccination improved the outcome of the vsv-ebov-shazip resulting in mild disease as evidenced by temporary weight loss and moderate disease with % survival for vsv-ebov-shazip (fig. , right panels). in contrast, single and prime/boost vaccination with vsv-ebov-hafl or vsv-hafl protected % of the mice from lethal challenge with no signs of clinical disease (fig. ) . in order to improve the protective efficacy of the vsv-shazip vector, we wanted to increase the antigen expression levels. therefore, the shazip or sha (without trimerization domain) antigens were inserted further upstream into the vsv-ebov backbone resulting in two additional vaccines, vsv-shazip-ebov and vsv-sha-ebov (supplementary fig. a ). these vaccine viruses were recovered and antigen expression was confirmed in the cell supernatant similarly to the previously generated vaccines ( supplementary fig. ). in vitro growth kinetics demonstrated no difference in comparison to the other vsv constructs (fig. b, supplementary fig. b ). next, we analyzed the protective efficacy of these improved vsv-sha-ebov vaccine candidates in mice. single-dose and prime/boost-vaccinations with the vsv-shazip-ebov revealed similar protective efficacies compared to vsv-ebov-shazip (fig. , supplementary fig. c ). interestingly, the h n challenge of vsv-sha-ebov-vaccinated mice demonstrated higher survival rates compared to shazip expressing vsv vectors ( fig. , supplementary fig. d ). taken together, the challenge experiments demonstrated that hafl is the superior antigen to any of the sha versions as a single dose results in uniform protection using the vsv platform. interestingly, sha performed better than shazip. 
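survival outcomes such as those reported for the challenge groups above are often summarized as a kaplan-meier curve. the following is a minimal sketch of that estimator, offered as one common way to compute group survival fractions; the source does not state which method the authors used, and the function name and the example data in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], died: Sequence[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier survival estimate for one group of animals.

    times: day of death, or last day of observation for survivors.
    died:  True if the animal died on that day, False if it was still
           alive when observation ended (censored).
    Returns (day, surviving fraction) at each day with at least one death.
    """
    if len(times) != len(died):
        raise ValueError("times and died must have the same length")
    data = sorted(zip(times, died))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        day = data[i][0]
        n_on_day = sum(1 for t, _ in data if t == day)
        deaths = sum(1 for t, d in data if t == day and d)
        if deaths:
            # multiply by the fraction of at-risk animals surviving this day
            survival *= (at_risk - deaths) / at_risk
            curve.append((day, survival))
        at_risk -= n_on_day
        i += n_on_day
    return curve
```

as a usage illustration with made-up data, a group of six mice with deaths on days 7, 8 and 8 and three survivors followed to day 21 gives a final surviving fraction of one half; comparing groups statistically (for example with a log-rank test) would be a separate step.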
analysis of vsv-based vaccine-mediated antibody responses
total anti-ha (h ) immunoglobulin g (igg) and neutralizing antibody responses from all vsv-vaccinated mice were analyzed in serum samples collected directly prior to challenge (day ) and on day and day post challenge. enzyme-linked immunosorbent assay (elisa) was performed to determine total anti-ha igg levels in the serum of the mice over time (fig. , supplementary fig. ). in control and vsv-ebov-vaccinated mice, we observed no antibody responses on day , but ha-specific igg was detected on day after h n challenge with all animals succumbing to infection by day (fig. , supplementary fig. ). all hafl-vaccinated mice responded with ha-specific antibody responses to a single dose on day (fig. , top left panel) that were lower compared to those of the corresponding prime/boost vaccinated mice (fig. , top right panel). the same was observed for sha/shazip-vaccinated mice ( supplementary fig. ). vsv-hafl prime/boost vaccination elicited the highest ha-specific igg responses that were significantly higher compared to control mice and the mice vaccinated with vsv-ebov, vsv-shazip-ebov, and vsv-ebov-shazip. in all vaccinated and surviving mice, the h n challenge served as a boost as documented by the increase in ha-specific igg titers measured on day post challenge ( fig. , supplementary fig. ).
time to immunity of the vsv-based hafl vaccines
finally, we determined the minimum time to immunity for the two most promising vaccine candidates, vsv-ebov-hafl and vsv-hafl. groups of eight female balb/c mice were vaccinated intramuscularly (im) with × pfu of vsv-ebov-hafl or vsv-hafl on days , , or prior to lethal homologous h n challenge. we found that both vaccines resulted in % protection with no or little weight loss when mice were vaccinated at least days prior to challenge, whereas vsv-ebov-vaccinated mice succumbed to infection within days (fig. ) . furthermore, the day − vaccinations resulted in partial survival with .
% for vsv-hafl and % for vsv-ebov-hafl (fig. ) . overall, the data demonstrate that both vaccine candidates are equally potent inducers of rapid protection with a slight but not statistically significant benefit of vsv-ebov-hafl over vsv-hafl.
cross-protection with a single dose of the vsv-based hafl vaccines
due to frequently occurring antigenic changes with influenza viruses, it is important to determine if vaccine candidates elicit antibodies against viruses from different antigenic clades within the same subtype. therefore, we performed hemagglutination inhibition (hi) tests to examine the ability of the vsv-based h n vaccines to generate cross-neutralizing antibody responses against heterologous h influenza viruses. for this, we used the day mouse serum samples and a panel of nine attenuated candidate vaccine influenza viruses encoding has belonging to different h clades that were isolated from geographically distinct locations (supplementary table ). we found that prime/boost vaccination with vsv-ebov-hafl or vsv-hafl elicited cross-neutralizing antibodies against all tested clades (table ) . cross-neutralizing antibodies were also detected in the single-dose vaccination group of these two vaccines; however, similar to the total ha-specific igg, levels were lower and cross-neutralization was not detected for all clades (table ). in contrast, all other vaccines expressing the sha/shazip antigen revealed no cross-neutralizing activities after administration of a single dose and limited cross-neutralizing activity after the prime/boost. these results demonstrated that the vsv-based vaccines expressing hafl induce more potent cross-neutralizing antibodies than the sha/shazip antigens. in order to support the cross-protective potential of the vsv-based hafl vaccines, we vaccinated groups of mice with a single dose of × in contrast, all mice vaccinated with vsv-ebov-hafl or vsv-hafl were completely protected against lethal challenge (fig.
), but mice challenged with a/indonesia/ / showed minor body weight loss on days - after challenge (fig. c). taken together, the vsv vectors expressing hafl confer h cross-protection in the mouse model.

the outbreak of human h n -caused disease in hong kong was controlled with the depopulation of poultry. however, while this outbreak was contained, hpai h n viruses have been circulating in poultry for almost two decades now and have spread to more than countries. the broad geographic distribution of hpai h n viruses and the risk of transmission to humans causing severe pneumonia with high case fatality rates have been a major concern for animal and human health for years. treatment is an option for individual human cases, but if these hpai viruses gain transmissibility among humans, vaccines are likely the only public health measure to fight an epidemic or potential pandemic. in this study, we used the well-characterized vsv-ebov vaccine as our starting platform as it has advantages over other vaccine approaches such as ease of genetic modification, efficient and cost-effective manufacturing, a proven human safety and immunogenicity profile, and potentially favorable immune cell targeting. to define a more optimized vaccine approach, we generated several different vsv-ebov-based vaccine vectors and compared the protective efficacy against hpai h n virus challenge in the mouse model to a vsv-hafl vector without the ebov gp. despite promising results from previous studies in chickens showing that adjuvanted subunit vaccines consisting of the trimeric h sha (shazip) induced high levels of cross-neutralizing antibodies (clade and . . ), we could not demonstrate convincing protection with the shazip-expressing vsv vectors in this study (fig. , supplementary fig. c). in fact, vsv-sha-ebov without the trimerization sequence performed better than the shazip-expressing vectors with complete protection following prime/boost vaccination (supplementary fig. c).
in contrast to all the sha-based vaccines, single doses of the vsv-ebov-hafl or vsv-hafl vectors were sufficient to provide complete protection from lethal homologous h n challenge in mice (fig. ). thus, in our study, the vsv vectors expressing hafl are superior to those expressing sha or shazip. it should be noted that vaccine doses in this study were about to times lower than those used in previous vsv-based hpaiv h n vaccine studies. this is an important observation, as lower-dose vaccination would likely reduce potential adverse effects of vaccination, which have been reported occasionally from human clinical trials using vsv-ebov vaccination. recently, it has been shown that low-dose vaccination with vsv-ebov does not compromise protective efficacy in nonhuman primates. lower-dose vaccination would also have a beneficial effect on vaccine manufacturing. currently, h hpai viruses have been classified into several clades based on phylogenetic analysis of their ha genes. notably, mainly clade viruses have evolved rapidly and extensively in recent years, and the continued evolution of this particular virus has heightened the concern for a pandemic. thus, here we selected eight viruses from clade and one virus from clade (supplementary table ) to investigate the cross-neutralizing nature of the vaccine-induced antibody response by hi test. we found that a prime/boost vaccination with the vsvs expressing hafl elicited cross-neutralizing antibodies against all tested h viruses (table ), suggesting that these vaccine vectors will likely cross-protect. the presence of hi antibodies with titers of : is considered protective as demonstrated in previous animal studies using poxvirus-based vaccination. cross-protection could indeed be demonstrated in mouse challenge experiments utilizing three different h clade isolates (fig. ), highlighting the cross-protective potential of the vaccines.
while prime/boost vaccination with vsv-shazip-ebov or vsv-sha-ebov induced cross-neutralizing antibodies, the responses were generally lower in titer and detected in fewer animals. previous studies have shown that the influenza virus ha stem has the potential to induce broad protective immunity and that the removal of the transmembrane domain may affect the native conformation of the ha stem potentially destroying those conformational antibody epitopes. thus, the finding that our vsv vectors expressing sha/shazip did perform worse than those expressing hafl is likely due to the specific design of expressing a soluble antigen that lacks both the transmembrane and cytoplasmic domains. our studies demonstrate that vsv-based vaccines expressing hafl are superior over those expressing modified sha. however, this study did not provide any data supporting an advantage of including vsv-ebov as part of the vector design over just expressing vsv-hafl as both vectors performed similarly well with no statistically significant difference in efficacy following single-dose or prime/boost administration (fig. ) nor in antibody responses (fig. , table ). the lack of differences in ha-specific antibody responses is not necessarily in line with higher expression of hafl following vsv-hafl infection in tissue culture (fig. b) , but replication may be different in vivo. on the other hand, the postulated favorable immune cell targeting through vsv-ebov , - may balance the advantage of higher antigen expression by vsv-hafl. vsv-ebov has been shown to induce rapid protective immune responses in preclinical and clinical studies. , thus, this platform has the potential to be utilized as an emergency vaccine. while we could not demonstrate a significant difference between the vsv-hafl and vsv-hafl-ebov vaccine in regard to fast-acting properties, protection after immunization on day − is marginally better with vsv-ebov-hafl than vsv-hafl (fig. ) . 
this difference could be due to the favorable immune cell targeting of the ebov gp, , - but further studies with bigger animal group sizes are needed to prove this hypothesis. previous studies demonstrated that vsv-based vaccines provide rapid protection via involvement of the innate immune system combined with an early adaptive response, suggesting that the vsv-ebov-hafl and vsv-hafl vaccines may induce innate immune responses that are able to control the challenge virus, allow for the adaptive immune system to catch up and lead to protection of the mice. nevertheless, the fast-acting feature makes this vaccine extremely valuable for the public health response during an epidemic or pandemic as the vaccine could be strategically administered to more vulnerable populations such as elderly and hospitalized people keeping in mind the replicative nature of the vaccine vectors. vsv-based h influenza virus vaccine candidates have advantages compared to the currently used influenza virus vaccines including the ease of generation of the vectors as well as the vaccine production in cell lines which are already approved for manufacturing of human vaccines. a switch to cell line production would eliminate concerns regarding allergies to egg proteins. the downside of an attenuated, replication-competent vaccine approach such as vsv is adverse reactions to vaccination. however, previous preclinical vaccine work using the vsv platform, including immunization of several immunecompromised animal species, as well as clinical trials with vsv-ebov demonstrated low levels of vaccine-related adverse effects resulting in the general conclusion that the vsv vaccine platform is safe. , in addition, vsv-based replicating vaccines are efficacious at lower doses compared to non-replicating approaches and do not require adjuvants. 
in conclusion, we have developed two vsv-based vaccine candidates, vsv-ebov-hafl and vsv-hafl, that provide proof-of-concept for rapid protection against hpai virus infection and mediate cross-neutralizing responses. if clinical development confirms the promise of being fast-acting and strongly protective, vsv-based vectors might be a promising approach for the development of a pan-h influenza virus emergency vaccine. all infectious work was performed at the required containment level at the integrated research facility, rocky mountain laboratories (rml), division of intramural research (dir), national institute of allergy and infectious disease (niaid), national institutes of health (nih). vaccinations were carried out in bsl settings; h n s were handled exclusively under maximum containment. the animal work was approved by the institutional animal care and use committee (iacuc) and performed according to the guidelines of the association for assessment and accreditation of laboratory animal care, international and the office of laboratory animal welfare. all procedures on animals were carried out by trained and certified personnel following standard operating procedures (sops) approved by the institutional biosafety committee (ibc). humane endpoint criteria in compliance with iacuc-approved scoring parameters were used to determine when animals should be humanely euthanized. african green monkey kidney (vero e ) cells were grown in dulbecco's modified eagle's medium (dmem) (sigma-aldrich) containing % or % fetal bovine serum (fbs), mm l-glutamine, u/ml penicillin, and μg/ml streptomycin (all from thermo fisher scientific). baby hamster kidney (bhk)-t cells were grown in minimum essential medium (mem) (thermo fisher scientific) containing % tryptose phosphate broth (thermo fisher scientific), % fbs, l-glutamine, penicillin, and streptomycin.
madin-darby canine kidney (mdck) cells were grown in eagle's minimum essential medium (emem) containing % fbs, l-glutamine, penicillin, streptomycin, mem non-essential amino acid (thermo fisher scientific), and bicarbonate (thermo fisher scientific). the h n challenge viruses a/ the h hafl orf was constructed using the entire ha cdna sequence of h n a/vietnam/ / . the soluble ha (sha) orf was generated from the hafl orf by deleting the transmembrane domain and replacing the sequences encoding the polybasic cleavage site between ha and ha (pqrerrrkkrg) by one preventing cleavage (pqietrg). the sha with leucine zipper (shazip) was constructed of the sha orf by adding a gcn pll sequence for trimerization. all orfs were cloned into the patx-vsv-ebov plasmid encoding the ebov-mayinga gp. replicationcompetent recombinant vsvs (vsv-ebov-shazip, vsv-shazip-ebov, vsv-sha-ebov, vsv-ebov-hafl, and vsv-hafl) were generated as described previously. the complete sequence of the vsv vaccines was confirmed by sanger sequencing. detailed sequence information can be obtained from the authors upon request. the titer of each virus stock was quantified using standard plaque and tcid assays on vero e cells. the same vaccine virus stock was used for all in vitro and in vivo work. vero e cells were grown to confluency in a -well plate and infected in triplicate with vsvwt, vsv-ebov, vsv-ebov-shazip, vsv-shazip-ebov, vsv-sha-ebov, vsv-ebov-hafl, and vsv-hafl (moi of . ). the inoculum was removed, cells were washed three times with dmem, and covered with dmem containing % fbs, . μg/ml tpck trypsin (thermo fisher scientific), and u/mg na from vibrio cholerae (sigma-aldrich). tpck trypsin and na are required for the propagation of the vsv-hafl vaccine. supernatant samples were collected at , , , and h post-infection and stored at − °c. the titer of the supernatant samples was determined performing tcid assay on vero e cells. 
samples were generated in parallel from each of the vaccine virus stocks produced in vero e cells, mixed : with sodium dodecyl sulfate-polyacrylamide gel electrophoresis (sds-page) sample buffer containing % β-mercaptoethanol and heated to °c for min. sds-page with all samples was performed in parallel on tgx criterion pre-cast gels (bio-rad laboratories) (supplementary fig. ). subsequently, proteins were transferred to a trans-blot polyvinylidene difluoride membrane (bio-rad laboratories). the membrane was blocked for h at room temperature in pbs with % powdered milk and . % tween (thermo fisher scientific). protein detection was performed using the following rabbit or mouse primary antibodies: anti-ha : (cat. # -t - , sino biological inc.), anti-ebov gp (zgp / . , μg/ml; kindly provided by ayato takada, hokkaido university, sapporo, japan), and anti-vsv m ( h , : ; kerafast inc.). after horseradish peroxidase (hrp)-labeled secondary antibody staining using either anti-mouse igg ( : , ) or anti-rabbit igg ( : ) (mouse cat. # - - ; rabbit cat. # - - ; both jackson immunoresearch), the blots were imaged using the supersignal west pico chemiluminescent substrate (thermo fisher scientific) and a fluorchem e system (proteinsimple). groups of female balb/c mice (n = ) were vaccinated im with × pfu of the vsv-based vectors in . ml (two sites, . ml each) on day − and − (prime/boost vaccination) or − only (single-dose vaccination). on the day of challenge (day ), four animals in each group were euthanized for serum collection. the remaining animals in each group were challenged intranasally (in) with ld ( tcid ) of hpai h n virus a/vietnam/ / . on day post challenge, four animals in each group were euthanized and samples were collected for serology. the remaining eight mice were monitored until days post challenge when a terminal blood sample was collected prior to euthanasia.
for the time to immunity study, groups (n = ) of female balb/c mice were im vaccinated on day − , − , or − with × pfu of the vsv-hafl or vsv-ebov-hafl vaccine in . ml (two sites, . ml each). vsv-ebov was used as a control. all the groups were challenged in with ld ( tcid ) of hpai h n virus a/vietnam/ / . surviving mice were monitored until day post infection. for the h cross-protection study, groups (n = ) of female balb/c mice were im vaccinated on day − with × pfu of the vsv-hafl or vsv-ebov-hafl vaccine in . ml (two sites, . ml each). vsv-ebov was used as a control. all the groups were challenged in with tcid serum samples from h n -infected mice were inactivated by gamma-irradiation and used in bsl according to ibc-approved sops. elisa plates were coated with µg/ml ( µl/well) of recombinant influenza ha (h ) (a/vietnam/ / ) antigen (ibt bioservices). after three washes with pbs/tween, plates were blocked with % bsa in pbs for h at room temperature, followed by three additional washes with pbs/tween. the plates were incubated with fourfold serial dilutions of the mouse serum samples for h at °c, and washed three times with pbs/tween. bound antibodies were visualized with horseradish peroxidase-conjugated goat anti-mouse igg (h+l) (jackson immunoresearch) at a : dilution and tmb substrate (kpl). the reaction was measured using the synergy™ htx multi-mode microplate reader (biotek). titers were calculated by a parameter curve fitting model using microsoft excel software. the cutoff value was set as the mean optical density plus three standard deviations of the control samples.

hi assay

hi assays were performed using eight hemagglutination units/ μl of the different h viruses incubated with μl of the fourfold serial dilutions of each mouse serum sample (day ) in round-bottom -well plates for h at room temperature. then μl of a . % turkey red blood cell solution (innovative research) was added to each well. plates were covered and incubated for min on ice.
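The ELISA cutoff rule stated in the methods above (mean optical density of the control samples plus three standard deviations) can be illustrated with a minimal sketch. This is not the authors' Excel workflow. The function names, the use of the sample standard deviation, the example readings and the starting dilution are all assumptions made for illustration, and the simple endpoint readout shown here is a stand-in for the curve-fitting model the text mentions, not a reproduction of it.

```python
from statistics import mean, stdev


def elisa_cutoff(control_ods, n_sd=3.0):
    """Cutoff = mean OD of the control samples + n_sd standard deviations.

    Assumption: the sample standard deviation is used; the text does not say
    whether the sample or population form was applied.
    """
    if len(control_ods) < 2:
        raise ValueError("need at least two control readings")
    return mean(control_ods) + n_sd * stdev(control_ods)


def endpoint_titer(reciprocal_dilutions, ods, cutoff):
    """Simplified endpoint readout (NOT the curve fit used in the study).

    Returns the reciprocal of the highest dilution whose OD is still above
    the cutoff, scanning from the most to the least concentrated well.
    Returns None if even the first well is at or below the cutoff.
    """
    titer = None
    for dilution, od in zip(reciprocal_dilutions, ods):
        if od > cutoff:
            titer = dilution
        else:
            break
    return titer


# Hypothetical readings: three control wells and one fourfold dilution series.
cutoff = elisa_cutoff([0.10, 0.12, 0.08])
titer = endpoint_titer([100, 400, 1600, 6400], [1.2, 0.6, 0.2, 0.1], cutoff)
```

A sample is scored positive at a given dilution only when its optical density exceeds the cutoff, so the cutoff rises with noisier control wells.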
hemagglutination titers were determined by the reciprocal of the last dilution containing agglutinated turkey red blood cells. hi titers represent the highest serum dilution that completely inhibited hemagglutination. statistical analysis was performed in prism (graphpad). data presented in figs c, (upper panels), (upper panels), (left panels), supplementary fig. c , and supplementary fig. d (upper panels) were examined using two-way anova with tukey's multiple comparison to evaluate statistical significance at all timepoints between all groups. significant differences in the survival curves shown in figs. (lower panels), (lower panels), (right panels), and supplementary fig. d (lower panels) were determined performing log-rank analysis. data presented in fig. and supplementary fig. were analyzed for statistical significance using one-way anova with multiple comparison. statistical significance is indicated as follows: p < . (****), p < . (***), p < . (**) and p < . (*).

we thank the animal care staff of the rocky mountain veterinary branch (niaid, nih) for their support of the animal experiments. we also thank david wentworth, vivien dugan, todd davis, and bin zhou of the virology surveillance and diagnosis branch, influenza division, centers for disease control and prevention for providing the candidate vaccine viruses and the h n viruses utilized in this study. this work was funded by the division of intramural research, niaid, nih. further information on research design is available in the nature research reporting summary linked to this article. the data supporting the findings of this study are available from the corresponding author upon reasonable request. h.f. claims intellectual property regarding the vesicular stomatitis virus-based vaccines for viral hemorrhagic fevers. no other competing interests are to be disclosed. supplementary information is available for this paper at https://doi.org/ . / s - - -z. correspondence and requests for materials should be addressed to a.m. publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. open access this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material.
if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons. org/licenses/by/ . /. this is a u.s. government work and not under copyright protection in the u.s.; foreign copyright protection may apply key: cord- -ph eji authors: mostajo, nelly f; lataretu, marie; krautwurst, sebastian; mock, florian; desirò, daniel; lamkiewicz, kevin; collatz, maximilian; schoen, andreas; weber, friedemann; marz, manja; hölzer, martin title: a comprehensive annotation and differential expression analysis of short and long non-coding rnas in bat genomes date: - - journal: nar genom bioinform doi: . /nargab/lqz sha: doc_id: cord_uid: ph eji although bats are increasingly becoming the focus of scientific studies due to their unique properties, these exceptional animals are still among the least studied mammals. assembly quality and completeness of bat genomes vary a lot and especially non-coding rna (ncrna) annotations are incomplete or simply missing. accordingly, standard bioinformatics pipelines for gene expression analysis often ignore ncrnas such as micrornas or long antisense rnas. the main cause of this problem is the use of incomplete genome annotations. we present a complete screening for ncrnas within bat genomes. ncrnas affect a remarkable variety of vital biological functions, including gene expression regulation, rna processing, rna interference and, as recently described, regulatory processes in viral infections. within all investigated bat assemblies, we annotated ncrna families including snornas and mirnas as well as rrnas, trnas, several snrnas and lncrnas, and other structural ncrna elements. 
we validated our ncrna candidates by six rna-seq data sets and show significant expression patterns that have never been described before in a bat species on such a large scale. our annotations will be usable as a resource (rna.uni-jena.de/supplements/bats) for deeper studying of bat evolution, ncrnas repertoire, gene expression and regulation, ecology and important host–virus interactions. bats (chiroptera) are the most abundant, ecologically diverse and globally distributed animals within all vertebrates ( ), but representative genome arrangements and corresponding coding and non-coding gene annotations are still incomplete ( ) . except for the extreme polar regions, bats can be found throughout the globe, feeding on diverse sources such as insects, blood, nectar, fruits and pollen ( ) . their origin has been dated in the cretaceous period, with a diversification explosion process dating back to the eocene ( ). the bat families known to date are classified into the suborders yinpterochiroptera and yangochiroptera ( , ) ( figure ). although they account for > % of the total living mammalian diversity ( ) , the genomes of only bat species of the estimated > species ( ) have been sequenced with adequate coverage to date and are publicly available ( figure ) . bats have developed a variety of unique biological features that are the rarest among all mammalian, including laryngeal echolocation ( , ) , vocal learning ( ) and the ability to fly ( ) . they occupy a broad range of different ecological niches ( ) , have an exceptional longevity ( - ) and a natural and unique resilience against various pathogenic viruses ( , ) . for example, bats are the suspected reservoirs for some of the deadliest viral diseases such as ebola and sars ( ) ( ) ( ) , but appear to be asymptomatic and survive the infection. possibly, the solution to better understand and fight these pathogens lies in the uniquely developed immune system of bats ( , ) . 
figure . we used available genomes of bat species from eight out of families for non-coding rna annotation in this study. the tree shows their phylogenetic relationship and is based on a molecular consensus on family relationships of bats ( ), further adapted and extended from ( ). bat families and species with published genomes currently available in the ncbi are shown (for details see table ). bat families still lacking a published genome assembly are written in gray color. rna-seq data sets were selected from species marked with an asterisk and additionally obtained from a myotis daubentonii cell line (see table ). bat silhouettes were adapted from artworks created by fiona reid.

studying bats and their genomes is likely to have high impacts on various scientific fields, including healthy ageing, immune and ecosystem functioning, the evolution of sensory perception and human communication, and mammalian genome architecture (see the recent bat k review for further details ( ) ). despite the unique biological characteristics of these flying mammals and their important role as natural reservoirs for viruses, bats are one of the least studied taxa of all mammals ( ). accordingly, there is little knowledge about the non-protein-coding transcriptome of bats, which plays a crucial role in an extensive number of cellular and regular functions and comprises a very diverse family of untranslated rna molecules ( , ). in addition, it is believed that due to the early evolutionary radiation of bats (compared to other mammals) their innate and acquired immune responses have a different set of molecules ( ). genome assemblies and annotations are essential starting points for many molecular-driven and comparative studies ( ). especially, studies of non-model organisms play important roles in many investigations ( ).
in most cases, however, these organisms lack well-annotated genomes ( ), which severely limits our ability to gain a deeper understanding of these species and may further impede biomedical research ( ). in this study, we comprehensively annotated non-coding rnas in available bat genome assemblies (table ). for each bat species, we provide final annotations that are compatible with current ensembl and ncbi (national center for biotechnology information) standards (gtf format) and that can be directly used in other studies, for example for differential gene expression analysis. we compare our new annotations with the currently available annotations for bats and show that a large number of non-coding genes are simply not annotated and are therefore overlooked by other studies. we used six rna-seq data sets comprising different conditions (table ) to validate our annotations and to determine the expression levels of our newly annotated ncrnas. as an example, we show that our novel annotations can be used to identify ncrnas that are significantly differentially expressed during viral infections and were missed by previous studies. we downloaded the most recent genome versions for bat species in september from the ncbi genome database (table ). within the suborder yinpterochiroptera, nine genomic sequences were obtained covering the bat families pteropodidae, hipposideridae, rhinolophidae and megadermatidae, whereas for the suborder yangochiroptera another seven genome assemblies were available for the bat families phyllostomidae, mormoopidae, vespertilionidae and miniopteridae (figure ). we introduced a unique three-letter abbreviation code (table ) to easily distinguish between the bat species in the manuscript and intermediate annotation files provided in the electronic supplement. we used quast (v . . ) ( ) to calculate several assembly statistics for all genomes, shown in supplementary table s .
at the end of , two new bat genomes were presented by the bat k project (http://bat k.ucd.ie) ( ), comprising a newer version of the greater horseshoe bat genome (rhinolophus ferrumequinum; rhinolophidae) and the genome of the pale spear-nose bat (phyllostomus discolor; phyllostomidae). however, these two bat genomes were not included in our present study due to the data use policy of the bat k consortium and to support a fair and productive use of these data. to validate our novel ncrna predictions, we selected six rna-seq data sets ( ) comprising altogether samples gathered from four different bat species. we have labeled each published rna-seq data set based on the first author's last name and the year of data set publication (table and supplementary table s ). all samples were quality trimmed using trimmomatic ( ) (v . ) with a nt-step sliding-window approach (q ) and a minimum remaining read length of nt. for the field- data set ( ), we additionally removed the three leading ' nucleotides from the reads of each sample because of a generally low quality observed by fastqc (www.bioinformatics.babraham.ac.uk/projects/fastqc/) (v . . ).

table . we have annotated ncrnas within bat genomes of different assembly quality. we introduced three-letter abbreviations for each bat species used throughout the manuscript and in supplemental files and annotations. genome sizes were estimated (est.) by using c-values (dna content per pg) from the animal genome size database (http://genomesize.com) and by applying the following formula: genome size = ( . · ) · c. if multiple entries for one species were available, an average over all c-values was calculated and used to estimate the genome size. if one species could not be found, an average c-value for the corresponding genus was used. supplementary table s provides additional assembly statistics calculated by quast (v . . ) ( ). ncbi acc.: genbank assembly accession without the prefix gca.
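The genome size estimate described in the table caption (a constant multiplied by the C-value, averaging over multiple entries and falling back to the genus when a species is missing) can be sketched as follows. The digits of the constant are not legible in the text above. The sketch assumes the commonly used conversion of roughly 0.978 × 10^9 base pairs per picogram of DNA, which should be checked against the original publication. All names and example C-values are hypothetical.

```python
# Assumed conversion factor (bp per pg of DNA); not taken from the text above.
BP_PER_PG = 0.978e9


def pick_c_values(species, by_species, by_genus):
    """Return the C-values to average for a species.

    Uses the species' own entries when present, otherwise all entries
    recorded for its genus (first word of the binomial name).
    """
    own = by_species.get(species)
    if own:
        return own
    genus = species.split()[0]
    return by_genus.get(genus, [])


def estimate_genome_size(c_values, bp_per_pg=BP_PER_PG):
    """Estimated genome size in bp = bp_per_pg * mean C-value (in pg)."""
    if not c_values:
        raise ValueError("no C-values available for this species or genus")
    mean_c = sum(c_values) / len(c_values)
    return bp_per_pg * mean_c


# Hypothetical database extracts (made-up numbers, for illustration only).
by_species = {"myotis lucifugus": [2.3]}
by_genus = {"myotis": [2.2, 2.4]}

size_bp = estimate_genome_size(
    pick_c_values("myotis daubentonii", by_species, by_genus)
)
```

Averaging before multiplying gives the same result as averaging per-entry size estimates, since the conversion is linear in the C-value.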
the remaining reads of all processed samples were individually mapped to all bat assemblies using hisat ( ) (v . . ) and transcript abundances were subsequently calculated from all resulting mappings by featurecounts ( ) (v . . ). where applicable, the appropriate strand-specific counting mode was applied for each data set (see table for information about the strand specificity). for each bat genome assembly, the merged annotation of already known (ncbi) and newly identified (this study) ncrnas was used (supplementary files s ). due to the size of the annotations and the large number of overlaps, long ncrnas were counted and analyzed for differential expression separately. to enable a better investigation of small rnas (srnas), we included a data set of the targeted sequencing of srnas (especially mirnas) from m. daubentonii cells (weber- ). to obtain these srna sequencing data, total rna of samples, which was obtained using the same procedure as described for the rrna-depleted m. daubentonii data ( ) , was preprocessed using the illumina truseq smallrna protocol, sequenced on one hiseq lane, and finally uploaded in the course of this study under geo accession gse . the reads of these srna samples were additionally preprocessed by removing potential adapter sequences with cutadapt ( ) (v . . ), followed by quality (q ) trimming with prinseq ( ) (v . . ), again using a window size of and a minimum length of nt. the processed srna samples were either individually mapped to the bat genomes for differential expression analysis or combined and mapped to each bat genome to predict known and novel mirnas with mirdeep ( ). only uniquely mapped reads were counted and used for the differential gene expression analyses with deseq ( ) (v . . ). annotated rrna genes were removed prior to the deseq and tpm (transcripts per million) analyses.
all raw read counts from samples of one data set were combined and normalized together using the built-in functionality of deseq , followed by pairwise comparisons to detect significantly (adjusted p-value < . ; absolute log fold change > ) differentially expressed ncrnas. besides the deseq normalization, we calculated tpm values for each ncrna in each sample as previously described ( ) :

$$\mathrm{TPM}_i = \frac{c_i / l_i}{\sum_{j=1}^{n} c_j / l_j} \cdot 10^{6},$$

where $c_i$ is the raw read count of ncrna $i$, $l_i$ is the length of ncrna $i$ (the cumulative exon length in the case of lncrnas) and $n$ is the number of all ncrnas in the given annotation. in this way, we obtained for each rna-seq sample, each bat annotation, and each ncrna one tpm value representing the normalized expression level of this ncrna. if available, we calculated all tpm values in relation to the expression of all already known coding and non-coding genes and not only based on our novel ncrna annotation. although we performed mapping, read counting, and normalization for all samples, bat genome assemblies and all six data sets ( table ; overall mappings), we selected only one comparison per data set to show examples of novel and significantly differentially expressed ncrnas (supplementary files s . -s . ; divided by data set and input annotation). for each data set, we chose the bat species that was also used in the corresponding study. for the hölzer- and weber- data sets, we used the closely related m. lucifugus genome assembly and annotation as a reference.

table . six rna-seq data sets comprising altogether samples derived from four different bat species were used to evaluate our novel ncrna annotations. all samples were quality trimmed and individually mapped to all bat assemblies using hisat ( ) and transcript abundances were subsequently calculated from all mappings by featurecounts ( ) . we labeled each rna-seq data set based on the first author's last name and the year of data set publication.
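the tpm normalization described in this section (raw read count divided by feature length, rescaled so that all values of one sample sum to one million) can be illustrated with a minimal python sketch. the function name `tpm`, the dictionary-based input and the toy feature names are assumptions made for illustration only; they are not taken from the scripts used in this study.

```python
from typing import Dict


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Return one TPM value per feature for a single sample.

    counts  -- raw read count c_i per feature (hypothetical input layout)
    lengths -- feature length l_i in nt (cumulative exon length for lncRNAs)
    """
    # Length-normalised read rate c_i / l_i for every feature.
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        # No reads at all: every feature gets an expression of zero.
        return {name: 0.0 for name in counts}
    # Rescale so that the values of one sample sum to one million.
    return {name: rate / total * 1e6 for name, rate in rates.items()}


# Toy example with two invented features.
example = tpm({"ncRNA_a": 100, "ncRNA_b": 300}, {"ncRNA_a": 100, "ncRNA_b": 150})
```

because the denominator runs over all features of the annotation that is passed in, the resulting values depend on whether only the novel ncrnas or all known coding and non-coding genes are included.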
raw read data of the enriched sequencing of small rnas (especially mirnas) of a m. daubentonii cell line (accompanying gse ( ) ) have been uploaded in the course of this publication under geo accession gse (weber- ). polya+: library preparation with mrna selection; rrna-: library preparation with rrna depletion and size selection (> nt); srna: library preparation with size selection (< nt); se/pe: single-/paired-end sequencing; ss/not-ss: strand-specific/unstranded sequencing; ss s /ss a : strand-specific in sense orientation/in antisense orientation. figure . please note that for m. daubentonii currently no genome assembly is available, so the genome assembly of m. lucifugus was used as a close relative.

all of our annotations follow the general transfer format (gtf) as described and defined in the ensembl ( ) database (https://ensembl.org/info/website/upload/gff.html). therefore, each row of each annotation file is defined as either a gene, transcript, or exon (by the feature column) and strictly follows a hierarchical structure, even if only one exon (as for most ncrnas) is reported. by adhering to this annotation format, our novel annotations can be easily merged with existing ones as derived from ensembl or ncbi and are directly usable as input for computational tools such as hisat for mapping or featurecounts for transcript abundance estimation. we defined gene, transcript, and exon ids following the ensembl pattern: < -digit-number>. in general, we used specialized computational tools for the annotation of specific ncrna classes (supplementary tables s -s ). if not otherwise stated, the main ncrna discovery is based on homology searches against the rfam database ( - ) (v . ). we used the gorap pipeline (https://github.com/koriege/gorap), a specially developed software suite for genome-wide ncrna screening.
gorap screens genomic sequences for all ncrnas present in the rfam database using a generalized strategy by applying multiple filters and specialized software tools. to this end, gorap relies heavily on infernal ( , ) (v . . ) to annotate ncrnas based on input alignment files conserved in sequence and secondary structure (so-called stockholm alignment files; stk). all resulting alignment files were automatically pre-filtered by gorap based on different ncrna class-specific parameters including taxonomy, secondary structure and primary sequence comparisons. due to repeats, pseudogenes, functional and non-functional genes that cannot be told apart, and overlapping results from the different assembly methods, we defined a ncrna set per species for annotation that includes filtered sequences, but allows for variants and multiple copies. this final annotation set is defined by hand-curating the resulting stk alignments of gorap with the help of emacs ralee mode ( ) . because sequences were removed from the stockholm alignments, the remaining sequences were extracted and aligned again into stockholm format using cmalign --noprob from infernal. the rfam-derived ncrna alignments were further split into snornas (supplementary table s ), mirnas (supplementary table s ) and other ncrnas including snrnas, lncrnas and other structural rnas (supplementary table s ). in general, our annotation results give an overview of the number of different ncrnas in bat species and may intentionally include false positive hits and duplicates. all ncrna hits are placed as stk (if available), gtf and fasta files in the electronic supplement and osf (doi.org/ . /osf.io/ cmdn). rrnas. we used the prediction tool rnammer ( ) (v . ) to identify . s, s and s rrna genes using hidden markov models. the tool's output was transformed into regular gtf file format. all output files can be found in supplementary table s . trnas. for the annotation of trnas, we applied trnascan-se ( ) (v . . 
) to the bat contigs using default parameters. the results were filtered by removing any trnas of type 'undet' or 'pseudo' and the tabular output was transformed into the gtf file format. additional information about the anticodon and the coverage score was added to the description column. we provide the raw trnascan-se files in supplementary table s . snornas. we annotated snornas based on available alignments from the rfam database using gorap and additionally marked and classified them into box c/d and box h/aca when intersecting with the set of snornas from http://www.bioinf.uni-leipzig.de/publications/supplements/ - ( ) (supplementary table s ). mirnas. in addition to the rfam screening (supplementary table s ), mirnas were predicted by the mirdeep pipeline ( ) (v . . . ) using default parameters and the combined smallrna-seq data set (weber- ; samples) mapped to each individual bat assembly as an input for mirdeep (supplementary table s ). to validate the accuracy of our approach, we compared our mirdeep annotations of m. lucifugus/p. alecto (based on the transcriptomic data derived from m. daubentonii; weber- ) with annotations of mirnas for transcriptomic data of myotis myotis ( ) and p. alecto ( ) . for reference mapping, huang et al. also used the m. lucifugus genome, so we were theoretically able to directly compare our annotations with the annotations of both studies. unfortunately, no positional information (annotation file) for the mirnas identified from the transcriptomic data of m. myotis was given in the manuscript or supplement ( ) . therefore, we blasted the precursor mirna sequences identified with the help of the m. myotis transcriptome against the m. lucifugus genome and retained only hits with a sequence identity of %. the positional information obtained in this way was then used to calculate the overlap with our predicted mirnas in m. lucifugus. we used the same approach for the p. alecto comparison.
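the comparison of two sets of predicted mirna loci can be sketched as follows. this is a minimal illustration under stated assumptions, not the procedure used in this study: loci are given as hypothetical tuples (contig, strand, start, end) with 1-based inclusive coordinates, the overlap is measured relative to the shorter of the two loci, both loci must lie on the same strand, and `min_fraction` is a placeholder for the overlap threshold.

```python
from typing import List, Tuple

# Hypothetical locus layout: (contig, strand, start, end), 1-based, inclusive.
Locus = Tuple[str, str, int, int]


def overlap_fraction(a: Locus, b: Locus) -> float:
    """Fraction of the shorter locus covered by the intersection of a and b."""
    if a[0] != b[0] or a[1] != b[1]:
        # Different contig or strand (assumption: strand must match).
        return 0.0
    intersection = min(a[3], b[3]) - max(a[2], b[2]) + 1
    if intersection <= 0:
        return 0.0
    shorter = min(a[3] - a[2] + 1, b[3] - b[2] + 1)
    return intersection / shorter


def common_predictions(ours: List[Locus], reference: List[Locus],
                       min_fraction: float) -> List[Locus]:
    """Return our loci that overlap at least one reference locus sufficiently."""
    return [
        locus for locus in ours
        if any(overlap_fraction(locus, ref) >= min_fraction for ref in reference)
    ]
```

the quadratic scan over all pairs is sufficient for a few hundred mirna loci per genome; for larger sets, an interval index per contig would be the more suitable design.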
if the annotated location of an mirna and one of our identified mirna locations in m. lucifugus/p. alecto were overlapping by at least %, we counted this location as a common prediction. lncrnas. long ncrnas were annotated using a high confidence data set h from the lncipedia ( ) (v . ) database comprising transcripts of potential human lncrnas. the transcripts were used as input for a blastn ( . . +, e − ) search against each of the bat assemblies (compiled as blast databases). the blastn result for each bat assembly was further processed to group single hits into potential transcripts as follows: first, for each query sequence q ∈ h, hits of q found on the same contig c and strand s were selected (hits c, s, q ) and the longest one, q , was chosen as a starting point so that trscp c, s, q = (q ). second, all hits q i ∈ hits c, s, q with q i ∉ trscp c, s, q , which overlap neither in the query q nor in the target sequence and do not exceed a maximum range of nt from the most upstream to the most downstream target sequence position of all q j ∈ trscp c, s, q ∪ q i , were added iteratively to trscp c, s, q . in this way, we introduced a simple model of exon-intron structures, which naturally occur when using transcript sequences as queries against a target genome assembly. we defined the nt search range based on an estimation of lncrna gene sizes derived from the human ensembl ( ) annotation. if the sum of the lengths of all q i ∈ trscp c, s, q covers at least % of the query transcript length length q , trscp c, s, q was considered as a transcript and its elements q i as exons; otherwise all q i ∈ trscp c, s, q were withdrawn. this procedure is repeated until all hits ∈ hits c, s, q were used or withdrawn. therefore, each so-defined group of non-overlapping hits derived from the same query sequence and found on the same contig and strand should represent a lncrna transcript with its (rough) exon structure.
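the grouping of blastn hits into potential transcripts can be sketched in python as follows. this is an illustrative reading of the procedure above and not the script from the linked repository: the `Hit` class, the function name `group_hits` and the parameters `max_range` and `min_coverage` are hypothetical, and the order in which candidate hits are tried (longest first) is an assumption, since the text only states that hits are added iteratively. the input is expected to contain only the hits of one query on one contig and strand.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """One BLAST hit; all coordinates are 1-based, inclusive, start <= end."""
    q_start: int  # position on the query transcript
    q_end: int
    t_start: int  # position on the target contig
    t_end: int

    @property
    def q_len(self) -> int:
        return self.q_end - self.q_start + 1


def _overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    return a_start <= b_end and b_start <= a_end


def group_hits(hits: List[Hit], query_length: int,
               max_range: int, min_coverage: float) -> List[List[Hit]]:
    """Group hits of one query on one contig and strand into transcripts."""
    remaining = sorted(hits, key=lambda h: h.q_len, reverse=True)
    transcripts: List[List[Hit]] = []
    while remaining:
        group = [remaining[0]]  # the longest unused hit seeds a new group
        for hit in remaining[1:]:
            clash = any(
                _overlaps(hit.q_start, hit.q_end, g.q_start, g.q_end)
                or _overlaps(hit.t_start, hit.t_end, g.t_start, g.t_end)
                for g in group
            )
            # Target span from the most upstream to the most downstream position.
            lo = min(hit.t_start, *(g.t_start for g in group))
            hi = max(hit.t_end, *(g.t_end for g in group))
            if not clash and hi - lo + 1 <= max_range:
                group.append(hit)
        # Accept the group only if it covers enough of the query transcript.
        if sum(h.q_len for h in group) >= min_coverage * query_length:
            transcripts.append(sorted(group, key=lambda h: h.t_start))
        # Accepted or withdrawn, the hits of this group are not reused.
        used = {id(h) for h in group}
        remaining = [h for h in remaining if id(h) not in used]
    return transcripts
```

each returned list corresponds to one candidate transcript, and its elements, sorted by target position, play the role of exons.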
the defined transcripts were saved as blast-like output and transformed into gtf file format. to follow the gtf annotation format and to harmonize our lncrna annotations with the other ncrna annotations, each lncrna transcript was also saved as a gene annotation and consists of at least one exon. as we observed many different sequences from lncipedia aligning to the same positions in the genomes, we decided to condense exons at the same sequence positions, considering transcripts with one or multiple exons separately. for each contig and strand, starting from the ' end, exons with a minimum overlap of nt were grouped together. in the case of multiple exons, groups of exons were merged if they shared any transcript origin. if all exons in the group originated from the same lncipedia gene, the group was considered as one gene with several transcripts and its associated exon(s). otherwise, we defined a lncrna hot spot on gene level with several transcripts and their associated exon(s). the lncipedia names of the gathered transcripts of a lncrna hot spot, as well as start and end positions of all exons, are listed in the gtf gene attribute field (supplementary table s ). the scripts used for the identification of lncrnas can be found at https://github.com/rnajena/bats ncrna. table ) . for the other six species we used blastn ( . . +, e − ) with the mlu and pva mitochondrial genomes as queries against the remaining bat genomes. for ehe, we found a possible full-length mtdna contig ( nt; awhc ) in the genome assembly. because mtdna is circular, we rearranged the sequence of this contig to start with the gene coding for the phenylalanine trna and to match the gene order of the other mitochondrial genomes. only for esp, rsi, mly, efu and mna were we unable to detect any possible mtdna contigs (table ). all mitochondrial genomes were annotated with mitos ( ). the ncrna results were filtered by e-value (threshold . 
), thus one of two small rrnas in mbr and rfe and one of two large rrnas in ehe were discarded as false positive hits (supplementary table s ). for the five bat assemblies directly including mtdnas (table ) , the mitos annotations were added to the final merged ncrna annotation. all other mtdnas and annotations can be found in supplementary table s . as we annotated all bat assemblies by using different tools, we needed to merge the resulting gtf files to resolve overlapping annotations and to obtain a final annotation of ncrnas for each bat species. furthermore, we extended the available ncbi annotations (including protein-coding and non-coding genes) by integrating our novel ncrna annotations. the scripts used to merge the different annotation files and to calculate overlaps between annotations can be found at https://github.com/rnajena/bats ncrna. due to their size, we have not included the lncrna annotations based on lncipedia. these can be downloaded and used separately (supplementary table s ). merge of novel non-coding annotations. for each bat species, we merged the ncrna annotations (except for lncrnas) using a custom script (merge gtf global ids.py). after reading in all features and asserting correct file structure, overlaps were resolved in the following manner: (i) exons are considered overlapping if > % of the shorter one is covered by

table . mitochondrial bat genomes (mtdna) publicly available and used for annotation with mitos ( ). for out of the bat species investigated in this study, mtdna could be found in the ncbi. for four bat species, the mtdna is also part of the genome assembly as determined using blastn. for e. helvum, no mtdna could be found in the ncbi, but we were able to identify a single contig that is part of the genome assembly as mtdna using blastn and the mitochondrial genomes of the other bats as query. the contig was rearranged to match the gene order of the other mtdnas.
r: found via blastn and rearranged.

merge of ncbi and novel annotations. we first converted and filtered the ncbi annotations to a compatible format with a custom script (format ncbi.py) and then combined the results with our merged novel ncrna annotations using the same strategy to resolve overlaps as above, but imposing less strict format rules (merge gtf ncbi.py).

at best, a genome assembly represents the full genetic content of a species at chromosome level. whereas the first complex eukaryotic genomes were generated using sanger chemistry, today's technologies such as illumina short-read sequencing and pacbio or oxford nanopore long-read approaches are increasingly used ( ) . the currently available bat genomes vary widely regarding their assembly quality and completeness (table and figure ; supplementary table s ) and were predominantly assembled by using short illumina-derived reads and low (∼ x) ( ) up to moderate/higher coverage ( - x) approaches ( ) ( ) ( ) ( ) ( ) ) . a new assembly of the cave nectar bat (eonycteris spelaea) ( ) was exclusively generated from long-read data derived from the pacbio platform, and the genome of the egyptian fruit bat (rousettus aegyptiacus) ( ) was assembled by using a hybrid approach of illumina and pacbio data. these two genomes from the pteropodidae family are of a generally higher quality (figure and supplementary table s ). regardless of their assembly quality, these genomes need to be annotated to identify regions of interest, for example, those encoding protein-coding and non-coding genes or other regulatory elements. current genome annotations, mostly generated by automatic annotation pipelines provided by databases such as the ncbi ( ) or ensembl ( ), predominantly focus on protein-coding genes and well-studied ncrnas such as trnas and rrnas.
accordingly, the available bat genome annotations vary considerably in quality, ranging from more comprehensive annotations for longstanding bat genomes such as m. lucifugus or p. vampyrus to annotations on region level, completely missing any coding or non-coding gene annotations at the current ncbi version ( figure and table ). furthermore, by using strand-specific rna-seq data, we could show that some genes (e.g. ifna /ifnw in the ensembl annotation of m. lucifugus ( ) ) are annotated on the wrong strand and are therefore entirely missed by differential expression studies that rely on strand-specific read quantification. for all publicly available bat genomes, ncrnas are generally sparsely annotated and highly incomplete, mostly comprising only some trnas, rrnas, snrnas, snornas and lncrnas ( figure and table ). therefore, many ncrnas, especially mirnas, are simply overlooked by current molecular studies, for example by rna-seq studies that aim to call differentially expressed genes based on such incomplete genome annotation files. studies that have made additional efforts to annotate ncrnas in bats ( , , , ( ) ( ) do not report their results in a form that can be directly used for further computational assessment (e.g. as a direct input for rna-seq abundance estimation). currently, in the ncbi database, five bat assemblies are entirely lacking any coding/non-coding annotations and mirnas are not annotated at all ( table ). the rfam database ( ) contains ncrna families mainly for m. lucifugus and p. vampyrus. other ncrnas are currently unknown from bat genomes or not well documented. the genome assembly status of different bat species varies widely, ranging for example from contigs and an n of nt (d. rotundus) to contigs and an n of only nt (m. lyra); see table and supplementary table s . accordingly, the annotation status also varies considerably (table ) . within this work, we give an overview of potential ncrna annotations in bats.
however, the precise number of ncrnas remains unclear, because some ncrnas are present several times in the assemblies while others still remain undiscovered. to give a better estimation of transcribed and potentially functional ncrnas, we used six illumina short-read rna-seq data sets derived from four bat species (table ) to estimate the expression levels of our novel annotations. note that we refer throughout this paper to an rna-seq data set by the first author's name and the year of the respective data set publication. the only included data set derived from a species of the yinpterochiroptera suborder (r. aegyptiacus) was obtained from a study dealing with the differential transcriptional responses to ebola and marburg virus infections in human and bat cells ( ) (data set: hölzer- ) . in this study, total rna of nine samples of r e-j cells, either challenged by the ebola or marburg virus or left uninfected, was harvested at , or h post infection (poi) and sequenced. unfortunately, no biological replicates could be generated for this study. therefore, we did not use this data set for the differential expression analysis, but still calculated normalized expression values (tpm; transcripts per million) as done for all rna-seq data sets. the other five data sets comprise yangochiroptera species of the vespertilionidae (m. lucifugus, m. daubentonii) and miniopteridae (m. natalensis) families (table ). field et al. conducted two transcriptomic studies ( , ) (field- , field- ) using wing tissue of the hibernating little brown myotis bat. they were especially interested in transcriptional changes between uninfected wing tissue and adjacent tissue infected with pseudogymnoascus destructans, the fungal pathogen that causes white-nose syndrome. two other data sets were obtained from virus- (rvfv clone ) and interferon (ifn) alpha-induced transcriptomes of a myotis daubentonii kidney cell line (mydauni/ c).
rna of mock, ifn and clone samples was gathered at two time points, and h poi ( ) . from the same samples, rrna-depleted (hölzer- ) and smallrna-concentrated (weber- ; see methods section for details) libraries were generated and sequenced. finally, we used data of the long-fingered bat m. natalensis initially obtained to characterize the developing bat wing ( ) . here, total rna was extracted from paired forelimbs and hindlimbs from three individuals at three developmental stages. we have mapped each sample to each bat genome, regardless of the origin of the rna. as expected, the mapping rate decreases in bat species that are evolutionarily more distant from the species from which the rna was sequenced. however, we were interested to find out which ncrnas are consistently transcribed in all investigated bat species or only in certain bat families and sub-groups. over all bat assemblies, we annotated ncrna families for in total trnas, rrnas, mirnas ( predicted by mirdeep ), snornas, mitochondrial (mt-)trnas and mt-rrnas as well as other ncrnas additionally derived from the rfam database ( ) (selected ncrnas are shown in table ). with a broad approach, we have identified potential lncrnas and defined between (m. lyra) and (d. rotundus) lncrna hot spots. all annotations, separated for each ncrna class and summarized for each bat species, can be found in the open science framework (doi.org/ . /osf.io/ cmdn) and in our electronic supplement (rna.uni-jena.de/supplements/bats) and are compatible with the genome assembly versions listed in table . thus, our annotations together with the bat genome assemblies obtained from the ncbi can be directly used for subsequent analysis such as differential gene expression detection. we detected s and s rrna for the majority of investigated bat species (supplementary table s ). the varying number of rrnas is in line with all currently available metazoan genomes, which lack the correct composition of rrnas due to misassemblies.
however, the number of . s rrna varies considerably, between for p. vampyrus and for m. natalensis (table and supplementary table s ). interestingly, of pteropodidae show a higher number of . s rrnas compared to the other species (∼ - fold). only the p. alecto assembly is in line with the other bat assemblies with regard to the number of . s rrna copies. for all bat species, we observed varying numbers of trnas (table and supplementary table s ). we could detect full sets of trnas for e. fuscus and m. davidii, whereas between one and four trnas are missing from the assemblies of h. armiger, m. brandtii, m. lucifugus, m. lyra, m. natalensis, p. parnellii, r. ferrumequinum, r. sinicus and d. rotundus, and between nine and twelve are missing for e. helvum, p. alecto, p. vampyrus, r. aegyptiacus and e. spelaea (supplementary table s ). interestingly, we identified a large number of trnas ( ) for m. lyra in comparison to all other bat species (supplementary table s ), likely a result of the low assembly quality (figure and supplementary table s ). the lowest number of trnas was annotated for p. parnellii and d. rotundus with only and copies, respectively. in all other bat genomes, we found between and trna genes (supplementary table s ). the trna encoding for valine (val) with the anticodon structure tac had high copy numbers (over ) in all genome assemblies of the pteropodidae family (table ) . similar copy numbers were found in the r. ferrumequinum and r. sinicus assemblies for the trna encoding for isoleucine (ile) with the anticodon aat (supplementary table s ). for trna(ile) with the anticodon gat, we also observed high copy numbers in h. armiger. interestingly, all species with high trna(val) and trna(ile) copy numbers had rather low counts (between and ) of trna(sec) with the anticodon tca, while this trna was found with higher copy numbers in p. parnellii ( ) and in d.
rotundus ( ) and with even higher counts (between and ) in all other bat species (supplementary table s ). however, high copy numbers might also occur due to assembly quality and false positive predictions of trnascan-se. in supplementary table s we list all detected snornas, divided into box c/d and box h/aca types. overall, we found snorna families within the investigated bat species, comprising box c/d, box h/aca and unclassified snornas. many snornas were found with exactly one copy present in each bat genome assembly (e.g. scarna ), whereas others were found in multiple copies for each bat species (e.g. snord ) or were completely absent from certain bat families (e.g. aca ); see table . exactly one copy of the small nucleolar rna aca was found within the genomes of the pteropodidae family and multiple copies for d. rotundus, p. parnellii, m. natalensis and members of the vespertilionidae family; however, this snorna seems to be completely absent from bat species of the megadermatidae and rhinolophidae families ( table ). the h/aca box snorna aca is predicted to guide the pseudouridylation of s rrna u ( ) . as another example, snora was found in higher copy numbers in the genomes of the vespertilionidae family. among others, this h/aca box snorna was described to be commonly altered in human disease ( ) . over all bat assemblies, we detected mirna families based on rfam alignments (supplementary table s ) and predicted between (e. helvum) and (m. davidii) mirnas based on the combined small rna-seq data sets (weber- ) using mirdeep ( ) (supplementary table s ). the higher number of mirnas predicted for myotis species can be explained by the fact that the small rna-seq data set is derived from a myotis daubentonii cell line. similar to other ncrna classes, we observe various differences in mirna copy numbers between the bat families. for example, mir- and mir- are absent in all vespertilionidae and m.
natalensis (table ) , whereas mir- is present in all yangochiroptera (except d. rotundus) but absent from all pteropodidae (supplementary table s ). the mirna is absent from all yangochiroptera except d. rotundus. there are many other examples of mirnas that are absent or present in certain bat species/families, such as mir- (absent from pteropodidae), mir- (absent from yinpterochiroptera) and mir- (absent from rhinolophidae and megaderma lyra); see supplementary table s . with mirdeep , we detected hundreds of potential mirnas for all investigated bat species (supplementary table s ). generally, all bat species can be divided into two groups. for the majority of the bat assemblies ( out of ) about mirnas were predicted. for the other species of the vespertilionidae family about times as many mirnas could be found. this is in concordance with the small rna-seq data used for the prediction, which was obtained from myotis daubentonii kidney cells (see methods). nevertheless, ∼ conserved mirnas can be predicted in the evolutionarily more distant bat species.

table . general genome information for each of the investigated bat assemblies and selected ncrna examples annotated in this study. we selected ncrnas with interesting copy number distributions among the investigated bat species. for snora , pvt and hottip, we additionally found notable differential expression patterns in at least one of the used rna-seq data sets (absolute log fold-change greater than , tpm > ). full tables and detailed information for each ncrna class (fasta, stk, gtf files) can be found in the electronic supplement online (supplementary tables s -s ).

lncrnas. for the annotation of lncrnas, we have deliberately chosen a broad blast-based approach, using , transcripts of potential human lncrnas obtained from the lncipedia database ( ) . we have chosen this approach because lncrnas have diverse genomic contexts, have various functions and act in different biological mechanisms ( ) ( ) ( ) .
from lncipedia transcripts, we annotated genes and lncrna hot spots. we defined a region in a genome as a lncrna hot spot if different lncipedia transcripts derived from different genes map to the same region (see methods section for a detailed description). overall, we found between and potential lncrna genes in m. lucifugus and r. sinicus, respectively, and between and lncrna hot spots in m. lucifugus and d. rotundus, respectively. we annotated the previously described lncrnas tbx -as and hottip ( ) in all bat genomes, except tbx -as in m. lucifugus, presumably due to the lower genome assembly quality (table ) . besides evolutionarily explainable differences in the presence of lncrnas, we have observed that the number of lncrnas and lncrna hot spots increases with increasing assembly quality (e.g. with a higher n ; see supplementary figure s ). we have not observed such a clear correlation between assembly quality and annotation results for short ncrnas. based on the rfam alignments, we were able to detect other ncrna families in addition to the rrnas, trnas, snornas and mirnas described before (supplementary table s ). overlaps with annotated lncrnas (supplementary table s ) are intentional, because rfam includes only highly structured parts of long ncrnas. the highest number of ncrna copies ( ) was detected for the u spliceosomal rna in m. brandtii, already known to have many pseudogenes ( ) . for ncrnas such as caesar (rf ), g-csf slide (rf ), nron (rf ), tusc (rf ), xist exon (rf ), linc (rf ), and hammerhead hh and hh (rf , rf ), we found exactly one copy in each investigated bat genome assembly (supplementary table s ). again, we also observed ncrna families that are lost in some species or entire families, for example the ribozyme cotc (rf ), which seems to be absent from all vespertilionidae members and m. natalensis. for each investigated bat species except e. spelaea, r. sinicus, m. lyra, e. fuscus and m.
natalensis (where no mitochondrial contigs could be identified; table ), mitochondrial protein-coding genes and ncrnas were annotated (see methods). in total, mitochondrial genes comprising trnas, two rrnas ( s and s) and protein-coding genes were detected for each bat species, as known for other metazoans ( ) (supplementary table s ). the mitochondrial genome lengths range from nt in e. helvum to nt in m. davidii. for the five bat species where the mitochondrial genome could be identified as a part of the ncbi genome assembly, we appended the mtdna annotation to the final annotation of ncrnas. as an example, we investigated known and novel differentially expressed (de) ncrnas found in the genome of m. lucifugus in more detail. to this end, we used the rna-seq data sets field- , field- , hölzer- and weber- (table ) as a basis to identify de ncrnas that were newly discovered in this study and were not part of the current ncbi or ensembl genome annotations for this bat species. more detailed de results can be found in supplementary file s . we filtered for novel m. lucifugus de ncrnas by (i) an absolute log fold change (fc) > , (ii) an adjusted p-value < . , and (iii) a tpm > . we further manually investigated the expression patterns with the igv ( ) and discarded de ncrnas overlapping with the current ncbi (myoluc . annotation release ) or ensembl (myoluc . . ) annotations. based on the small rna-seq comparison of mock and virus-infected (clone ) samples h post infection (weber- ), we found several mirnas (rfam- and mirdeep -based) and snornas to be differentially expressed ( figure and details in supplementary files s . ). in general, replicates of virus-infected and ifn-treated samples h post infection tend to cluster together based only on the expression profiles of small ncrnas (mainly mirnas) ( figure a and b) . most differences can be observed between the h virus-infected and all other samples, which seem to show no clearly distinguishable expression pattern.
interestingly, at h post infection, replicates cluster together regardless of their treatment (mock, ifn, clone ). after only h, few mirnas are differentially expressed, so the samples of each replicate (mock, ifn, clone ) cluster together: samples within a replicate share the same passaging history, whereas the passaging history differs between replicates ( ). after h, more mirnas are significantly differentially expressed and the samples can be better distinguished by treatment (figure a and b). we observed that, in general, mirnas tend to be down-regulated (figure b; upper half), while snornas tend to be up-regulated (lower half) after h of clone infection compared to mock. for example, we found a novel mirna (mlugd in our annotation; predicted by mirdeep ; supplementary table s ), located in an intron of the protein-coding gene sema g, that was significantly down-regulated (log fc = - . ) during clone infection (figure c and d). based on rfam alignments, we further found a histone '-utr stemloop (rf ), an rna element involved in nucleocytoplasmic transport of the histone mrnas, to be significantly down-regulated during infection. for the same comparison of mock and virus-infected samples at h, the rrna-depleted data set (hölzer- ) revealed several de lncrnas. for example, we found a lncrna potentially transcribed in an intron of mx (mlugl ) to be up-regulated (log fc = . ) during infection. another lncrna (mlugl ), potentially transcribed as part of two exons of the plat protein-coding gene, was down-regulated (log fc = − . ) during viral infection (see details in supplementary files s . ). interestingly, based on the field- rna-seq data, we found two internal ribosomal entry sites (ires) in the genes vegfa and odc , with rfam ids rf and rf , respectively, to be -fold up-regulated during p. destructans infection. in this study, we comprehensively annotated ncrnas in readily available bat genomes obtained from the ncbi database (table ).
we provide novel annotations in the common gtf format, following a hierarchical structure of gene, transcript and exon features to allow direct integration of our annotations into already available ones (supplementary tables s -s ). finally, we provide for each bat genome assembly an extended annotation file merged with the protein- and non-coding gene annotations that were already available from the ncbi database (supplementary files s ; leaving out potential lncrnas that can be downloaded separately, see supplementary table s ). we used six rna-seq data sets derived from the transcriptomic sequencing of four bat species (table ) to calculate normalized expression values for our newly annotated ncrnas and, as examples, show significantly differentially expressed ncrnas (figure ), which have never before been described on such a large scale for any bat species. in addition to the evolutionarily explainable differences in the presence and number of annotated ncrnas in bats, we have observed that the assembly quality can also influence the annotation results. while the effects on the annotation of short ncrnas seem to be small (with some exceptions), the number of identified lncrnas and lncrna hot spots increases with increasing assembly quality (e.g. with a higher n ; see supplementary figure s ). however, while this observation holds for our data and analyses, it also depends strongly on the annotation method used. recently, the bat k project (http://bat k.ucd.ie/) was announced as a global effort to sequence, assemble and annotate high-quality genomes of all living bat species ( ). we aim to extend our annotation of ncrnas regularly and whenever new bat genomes become publicly available. in mid january , new low-coverage bat genomes of nine families were submitted by the broad institute to the ncbi genome database. unfortunately, our time-consuming and computationally intensive analyses were already completed at this time point.
we want to further automate our ncrna annotation workflow, to easily include these and any new bat (or other mammalian) genomes that will be sequenced and assembled in the future. our current identification of ncrnas in bat species will be usable as a resource (electronic supplement) for deeper studying of bat evolution, ncrnas repertoire, gene expression and regulation, ecology, and important host-virus interactions. detailed information about the bat genomes used in this study, their assembly quality and all ncrna candidates (in fasta, stk and gtf format) can be found in the electronic supplement (rna.uni-jena.de/supplements/ bats). the final extended annotations for each investigated bat species can be found in supplementary files s and the lncrna annotations in supplementary table s . to allow full reproducibility of our study, all final and intermediate data files (such as used genome files and mapping files in bam format) were uploaded to the open science framework under accession doi.org/ . /osf.io/ cmdn. python scripts used to filter and merge our annotations were deposited at github (github.com/rnajena/ bats ncrna). the virus-infected and ifn-stimulated small rna-seq data of the m. daubentonii kidney cell line was uploaded to geo (gse ). bats: important reservoir hosts of emerging viruses bat biology, genomes, and the bat k project: to generate chromosome-level genomes for all living bat species an eocene big bang for bats phylogeny, genes, and hearing: implications for the evolution of echolocation in bats mammal species of the world. 
a taxonomic and geographic reference a molecular phylogeny for bats illuminates biogeography and the fossil record crowd vocal learning induces vocal dialects in bats: playback of conspecifics shapes fundamental frequency usage by pups blood mirnomes and transcriptomes reveal novel longevity mechanisms in the long-lived bat growing old, yet staying young: the role of telomeres in bats' exceptional longevity longitudinal comparative transcriptomics reveals unique mechanisms underlying extended healthspan in bats bats and their virome: an important source of emerging viruses capable of infecting humans mass extinctions, biodiversity and mitochondrial function: are bats 'special' as reservoirs for emerging viruses? bats as 'special' reservoirs for emerging zoonotic pathogens global patterns in coronavirus diversity bat flight and zoonotic viruses differential transcriptional responses to ebola and marburg virus infection in bat and human cells the immune gene repertoire of an important viral reservoir, the australian black flying fox non-coding rnas: the architects of eukaryotic complexity novel insight into the non-coding repertoire through deep sequencing analysis genome annotation: from sequence to biology applications of next generation sequencing in molecular ecology of non-model organisms next-generation genome annotation: we still struggle to get it right quast: quality assessment tool for genome assemblies transcriptomic and epigenomic characterization of the developing bat wing virus-and interferon alpha-induced transcriptomes of cells from the microbat myotis daubentonii. 
iscience the white-nose syndrome transcriptome: activation of anti-fungal host responses in wing tissue of hibernating little brown myotis effect of torpor on host transcriptomic responses to a fungal pathogen in hibernating bats trimmomatic: a flexible trimmer for illumina sequence data hisat: a fast spliced aligner with low memory requirements featurecounts: an efficient general purpose program for assigning sequence reads to genomic features cutadapt removes adapter sequences from high-throughput sequencing reads quality control and preprocessing of metagenomic datasets ) mirdeep accurately identifies known and hundreds of novel microrna genes in seven animal clades moderated estimation of fold change and dispersion for rna-seq data with deseq rna-seq gene expression estimation with read mapping uncertainty rfam: an rna family database rfam . : shifting to a genome-centric resource for non-coding rna families non-coding rna analysis using the rfam database infernal . : inference of rna alignments infernal . 
: -fold faster rna homology searches ralee--rna alignment editor in emacs rnammer: consistent and rapid annotation of ribosomal rna genes trnascan-se: a program for improved detection of transfer rna genes in genomic sequence matching of soulmates: coevolution of snornas and their targets characterisation of novel micrornas in the black flying fox (pteropus alecto) by deep sequencing lncipedia : towards a reference set of human long non-coding rnas mitos: improved de novo metazoan mitochondrial genome annotation a comprehensive study of de novo genome assemblers: current challenges and future prospective genome-wide signatures of convergent evolution in echolocating mammals a high-resolution map of human evolutionary constraint using mammals comparative analysis of bat genomes provides insight into the evolution of flight and immunity the genomes of two bat species with long constant frequency echolocation calls hologenomic adaptations underlying the evolution of sanguivory in the common vampire bat genome analysis reveals insights into physiology and longevity of the brandt's bat myotis brandtii exploring the genome and transcriptome of the cave nectar bat eonycteris spelaea with pacbio long-read sequencing the egyptian rousette genome reveals unexpected features of bat antiviral immunity reference sequence (refseq) database at [ncbi: current status, taxonomic expansion, and functional annotation sequencing and annotation for the jamaican fruit bat (artibeus jamaicensis) down but not out: the role of micrornas in hibernating bats a computational screen for mammalian pseudouridylation guide h/aca rnas small rnas with big implications: new insights into h/aca snorna function and their role in human disease long noncoding rnas: past, present, and future incredible rna: dual functions of coding and noncoding cncrnas: bi-functional rnas with protein coding and non-coding functions evolution of spliceosomal snrna genes in metazoan animals animal mitochondrial genomes 
integrative genomics viewer (igv): high-performance genomics data visualization and exploration icarus: visualizer for de novo assembly evaluation we thank ivonne görlich key: cord- -grpi gnc authors: allen, cameron; metternicht, graciela; verburg, peter; akhtar-schuster, mariam; inacio da cunha, marcelo; sanchez santivañez, marioldy title: delivering an enabling environment and multiple benefits for land degradation neutrality: stakeholder perceptions and progress date: - - journal: environ sci policy doi: . /j.envsci. . . sha: doc_id: cord_uid: grpi gnc achieving land degradation neutrality (ldn) was adopted by countries in as one of the targets of the global sustainable development goals (sdgs). as ldn is a relatively new concept there is an increasing need for evidence on the potential socio-economic and environmental benefits of ldn as well as how an enabling environment for implementing ldn measures can be developed. this paper summarises the results from a global survey of ldn stakeholders, and a review of national progress in target setting that was commissioned by the united nations convention to combat desertification (unccd) in . the study presents the perceptions of relevant stakeholders on the key components of an enabling environment for achieving and maintaining ldn (institutional, financial, policy/regulatory, and science-policy) as well as expectations of multiple benefits from its implementation. we also highlight key challenges and gaps in progress to date that are emerging from ongoing national target setting programs to implement ldn. the study finds that progress in implementing ldn has been widespread across countries. however there remains a lack of awareness of ldn and its key concepts along with high-level political buy-in. this may be impeding the integration of ldn into national development planning and budgeting processes where progress was assessed as limited. 
national capacities for securing land tenure and governance arrangements and integrated land use planning were perceived as comparatively low, further hampering the implementation of ldn. despite these gaps, most stakeholders (> %) who participated in the global survey expected ldn to deliver a broad range of multiple benefits for human wellbeing, livelihoods and the natural environment. we argue that greater efforts are needed to raise awareness of ldn, educate core stakeholders in its concepts, enablers and benefits, raise its political profile, and provide evidence on national measures that will support implementation of ldn. the sustainable development goals (sdgs) adopted by the un general assembly in september include achieving land degradation neutrality (ldn) as one of the targets (target . ) (united nations general assembly, ) . ldn aims to avoid further land degradation while balancing losses in land-based natural capital and associated ecosystem functions and services with measures that produce gains through sustainable land management (slm) and restoration or rehabilitation measures (cowie et al., ) . the aim is to reverse losses to lands' productivity, to sustain or to improve land-based natural capital and ecosystem services over the long-term for the benefit of human wellbeing and livelihoods. progress towards ldn requires the existence of an enabling environment to help ldn measures to be successfully developed, implemented, executed and monitored. in this context, an enabling environment can be thought of as the combination of contextual elements allowing progress to be made towards a clearly defined goal (akhtar-schuster et al., ) . it includes the collaboration of science and policy as well as other relevant stakeholders, the consideration of multifarious demands and values existing in society, the availability of financial means, stable institutional arrangements and responsible and purposeful land governance (verburg et al., ) . 
while ldn is a relatively new concept, there is emerging international experience in operationalising it at the national level in both developed and developing countries. since , the unccd secretariat and the global mechanism of the unccd have supported a total of countries (as of late ) through the ldn target setting programme which aids nations in the definition of baselines, targets and associated measures to achieve ldn by . an emerging expert literature (chasek et al., ; wunder and bodle, ; okpara et al., ; bodle, ; akhtar-schuster et al., ; cowie et al., ; kust et al., ; von maltitz et al., ; solomun et al., ; speranza et al., ; herrick et al., ) highlights a range of challenges to the implementation of interventions to attain ldn, including a lack of political will and leadership often due to limited insight into the concept of ldn and its cross-sectoral benefits, inadequate targets, rules and guidelines, land tenure insecurity, disregard for integrative approaches required for slm, and a lack of earmarked funds as well as other resources, all related to an enabling environment for ldn. in the context of the sdgs, another key challenge relates to the interlinkages between ldn and other targets and the need to understand and manage these interrelationships. building upon efforts to date, there is an increasing need for evidence on elements of an enabling environment to support policy makers, subnational decision makers and practitioners to implement and to maintain ldn. understanding the perceptions and expectations of practitioners and other stakeholders regarding the enabling environment for ldn, progress and challenges to date, and the potential multiple benefits and trade-offs can help to accelerate implementation. 
the aim of this paper is to summarise the results of a study commissioned by the unccd science-policy interface (spi) in to determine what are considered the main elements of an enabling environment for ldn as well as the potential for ldn to contribute to enhancing well-being, livelihoods and the sustainable use of the natural environment and to provide evidence on what and how national measures will support implementation of ldn. the analysis is based on a global survey of ldn stakeholders and a review of national ldn target setting programme (tsp) reports from a wide selection of countries. first we introduce the methodological approach for the study, followed by a brief presentation and discussion of the results and finally some concluding remarks. for the purposes of this study, the enabling environment was defined as comprising four key dimensions and enablers (table ) based upon a review of the available academic and grey literature relating to ldn (supplementary table ) and consultation amongst experts, particularly in the field and from the unccd spi. a selection was made of those components of the enabling environment judged of importance for ldn, and these enablers were used as a framework of criteria to provide structure for the subsequent analysis. the institutional enabling environment for ldn is complex, involving the interplay between a range of stakeholders (enemark, ) . each stakeholder plays a unique role in achieving ldn and often has different objectives, approaches, values, institutions and rules (akhtar-schuster et al., ; pierce et al., ) . it is therefore consistently indicated that ldn needs a national political commitment at the highest level, and that effective mechanisms are put in place to drive coordination, collaboration and engagement. institutional capabilities are also needed in policy coordination and planning, stakeholder engagement and implementation, enforcement and progress monitoring. 
establishing an effective financial enabling environment includes adequate assessment of financial resource requirements, identification of sources of finance, and securing and allocating finance or setting in place instruments and mechanisms to incentivize the allocation of financial resources towards ldn (akhtar-schuster et al., ; chasek et al., ) . for effective implementation, ldn needs to be integrated into the land administration and planning system in each country as defined by its policy and regulatory enabling environment. this includes governance provisions for securing land tenure and equal access to land, which is a building block not only for ldn, but also for broader economic and social objectives such as the eradication of poverty and hunger (food and agriculture organization of the united nations, ; higgins et al., ; holden and ghebru, ) . ldn implementation also requires that associated policy procedures in day-to-day operations are in place to enforce, monitor, and verify the impacts of national policies (chasek et al., ) . central to this is integrated land use planning, which seeks to balance economic, social and cultural opportunities provided by the land with the need to maintain and enhance ecosystem services provided by land-based natural capital (orr et al., ) . in the context of ldn, a key component of this is a 'neutrality mechanism' which assists land users, land-use planners and decision makers with counterbalancing losses with equivalent (or greater) gains (chasek et al., ; cowie et al., ) . 
finally, an effective science-policy interface includes the establishment of a scientifically sound monitoring system and data infrastructure, technical capacities and tools to support assessment of land degradation as well as progress in ldn implementation, the evaluation of economic, social and environmental benefits and trade-offs associated with achieving ldn, and the effective collation and translation of scientific knowledge to policy-makers, planners and other relevant stakeholders (akhtar-schuster et al., ; chasek et al., ; cowie et al., ; orr et al., ). in the context of this study, the focus is on enabling the uptake of science in policy-making at the national level, where systematic obstacles exist, including a lack of scientific understanding among policy makers, limited dissemination of research, lack of incentives, and lack of institutional channels (jones et al., ). given the centrality of science to achieving ldn, we include this as a separate dimension, which incorporates key features set out in the ldn scientific conceptual framework such as the national data and monitoring systems and preliminary assessments (orr et al., ). the objective of the review of national tsp reports was to provide an assessment of national progress and challenges in implementing an enabling environment for ldn, as well as approaches to addressing multiple benefits. the framework of four dimensions and enablers in table provided the criteria for a systematic evaluation. a total of national tsp reports were reviewed (supplementary table ). the national reports were selected to ensure balance across the five unccd regional implementation annexes, as well as balance within regions in terms of covering diversity in the level of development of each country and sub-regional differences. to ensure inter-regional balance, where available, a minimum of six countries were selected from each region.
to ensure intra-regional balance and a common reference base, the human development index (hdi) (undp, ) was used as a proxy, with country selection including a spectrum of hdi values ranging from the lowest to highest. reports were reviewed in english, french, spanish and russian by a task team of the spi and the unccd secretariat. a rating scale and scoring template were developed to provide a consistent approach for evaluating the reports across each of the criteria. the rating scale adopted a simple scoring approach (supplementary table ). reviewers were asked to use this rating scale to evaluate the evidence of an enabling environment for ldn contained in the tsp reports, documenting their analysis in a standard reporting template. the survey was designed to collect information in two key areas: firstly, regarding what is needed to achieve and maintain ldn in terms of policies, enablers, incentives, and support; and secondly, how ldn initiatives contribute to achieving multiple benefits in terms of environmental objectives as well as improving human well-being and livelihoods. the survey questions were developed with advice from experts in the field including members of the unccd spi through several rounds of consultations. the majority of the questions adopted either likert-scales or rating-scales to collect responses. the survey was implemented as an online survey and circulated to practitioners and experts involved in the ldn tsp and associated activities in mid-november . the survey was delivered via surveymonkey and comprised a maximum of questions (supplementary annex ). question logic was used to determine the final set of questions viewed by respondents, based on their affiliation or function (i.e. national focal point (nfp), national consultant, regional consultant, researcher/scientist, civil society organisation (cso), intergovernmental organisation (igo), business). 
as a result, the number of questions varied between and , depending on the function of the respondent. the final stage of the study triangulated results from the review of tsp reports and the online survey to evaluate overall progress and potential gaps, priority challenges moving forward with ldn implementation and key messages. this was again structured using the framework of enablers of the ldn enabling environment (table ) . to identify higher priority gaps/challenges, each of the enablers was reviewed in terms of the ranking of its perceived importance as well as progress made. priority gaps were considered to be those where an enabler was perceived to be of high importance for the enabling environment for ldn and where progress was limited ( supplementary fig. ). the importance of each enabler was rated as high, moderate, or low based on the results from the stakeholder survey, in particular questions relating to perceptions around important measures or priority challenges for implementing ldn. enablers that were in the top-third of rankings or scores were considered of comparatively high importance, those in the middle third were considered of moderate importance, and those in the bottom third were considered of comparatively lower importance. the progress made on each enabler was rated as good, moderate, or limited based on both the results from the tsp review as well as the stakeholder survey. for the results from the tsp review, a rating of good progress aligned with an average score of equal = > , moderate progress as = < , and limited progress as < . in addition, the perceptions from the survey relating to progress made or existing capacities for specific activities were also factored into the analysis and ratings were adjusted accordingly. 
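the prioritisation rule described above (importance rated by tercile of the survey ranking, progress rated from average scores, and priority gaps defined as high importance combined with limited progress) can be sketched as follows. this is a minimal illustration under stated assumptions, not the study's scoring procedure: the function names, the example enablers and scores, and the progress thresholds `good_min` and `moderate_min` are all hypothetical, because the numeric cut-offs are not recoverable from the text. the survey-based adjustment of progress ratings is not modelled.

```python
def importance_by_tercile(scores):
    """Rate each enabler high/moderate/low by rank tercile (higher score = more important)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    ratings = {}
    for position, name in enumerate(ranked):
        if position < n / 3:
            ratings[name] = "high"        # top third of the ranking
        elif position < 2 * n / 3:
            ratings[name] = "moderate"    # middle third
        else:
            ratings[name] = "low"         # bottom third
    return ratings


def progress_rating(avg_score, good_min=2.0, moderate_min=1.0):
    """Map an average review score to a progress label (thresholds are placeholders)."""
    if avg_score >= good_min:
        return "good"
    if avg_score >= moderate_min:
        return "moderate"
    return "limited"


def priority_gaps(importance_scores, progress_scores):
    """Enablers of high importance on which progress is limited."""
    importance = importance_by_tercile(importance_scores)
    return sorted(
        name for name in importance_scores
        if importance[name] == "high"
        and progress_rating(progress_scores[name]) == "limited"
    )


# Hypothetical scores for six enablers, for illustration only.
example_importance = {
    "political commitment": 9, "finance": 8, "land tenure": 7,
    "coordination": 5, "monitoring": 4, "policy coherence": 2,
}
example_progress = {
    "political commitment": 2.4, "finance": 0.6, "land tenure": 0.8,
    "coordination": 2.1, "monitoring": 1.6, "policy coherence": 1.9,
}
print(priority_gaps(example_importance, example_progress))
```

in this toy example only the finance enabler is flagged: land tenure also shows limited progress but falls in the middle third of the importance ranking.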
based on the average scores, there was greater progress in terms of the institutional enabling environment than other enablers, in particular in establishing the national political commitment and agenda (including target setting) (criterion . ), coordination mechanisms ( . ), and stakeholder consultation ( . ). other enablers that scored relatively high included regulations and rules around ldn ( . ), policy coherence and alignment ( . ), data and monitoring systems ( . ) and consideration of causes and effects or drivers of ldn ( . ). key gaps were evident in terms of establishing financing needs and costings ( . ), consideration of land tenure/rights ( . ), integrated land use planning ( . ) and establishing or embedding a neutrality mechanism ( . ). c. allen, et al. environmental science and policy ( ) - when mode values are substituted for averages, the results highlight that a greater number of countries had individually made good progress (scores or ) on the institutional enabling dimension ( . - . ) as well as on regulations and rules around ldn ( . ) and consideration of policy coherence and alignment ( . ). overall, stakeholder consultation had the highest mode value (mode = / ). enablers lagging furthest behind in terms of individual country progress correspond to the financial dimension ( . and . ), some elements of the policy/regulatory environment ( . land tenure, . integrated land-use planning, and . neutrality mechanism), as well as national technical capacities for ldn assessments and implementation ( . ). averages for each of the four enabling dimensions were also aggregated across all regions as well as for each of the five regions (fig. ) . in most regions (except for northern mediterranean), greater progress was evident in the institutional dimension compared with other dimensions. 
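the difference between summarising an enabler by its average score and by its mode, as discussed above, can be illustrated with a small sketch. the scores and the scoring scale below are hypothetical and are not taken from the tsp review; the helper name is likewise an assumption.

```python
from statistics import mean, multimode


def summarise_scores(scores):
    """Return the average and the most frequent score(s) for one enabler."""
    return {"mean": mean(scores), "modes": multimode(scores)}


# Hypothetical review scores for one enabler across eight countries.
stakeholder_consultation = [3, 3, 3, 0, 3, 1, 3, 0]
summary = summarise_scores(stakeholder_consultation)
print(summary)
```

here a few low-scoring countries pull the average down, while the mode shows that most countries individually reached the top score, which is the kind of contrast the mode-based reading is meant to expose.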
progress on the science-policy interface dimension was higher in the northern mediterranean (nm) and latin america and caribbean (lac) regions, while progress on the financial dimension was higher for nm and central eastern europe (cee). overall progress was most limited in the financial dimension, particularly in africa and lac, while progress on the policy/regulatory environment also lagged behind, particularly in nm, cee and africa. the complete results from the survey across all questions are provided in supplementary annex . a subset of the results is presented here. a total of responses to the survey were received, with good coverage and balance in terms of affiliation, expertise and geographic distribution. with regard to their function, respondents comprised three relatively balanced groups: national focal points (nfps) of the unccd or consultants engaged in supporting national ldn target setting ( %), researchers/scientists ( %), and csos/igos or private sector ( %). the most common areas of expertise were land degradation ( %) and environmental management ( %), with a relatively small proportion having expertise in economics ( %) or social sciences ( %). close to % of respondents indicated that they had been involved in the implementation of an ldn initiative in a specific country or countries across all five of the unccd regional groupings. a total of % of respondents indicated that they had participated in the ldn tsp. respondents were asked to rank the three most important policies, procedures and incentives that can help implementation of measures to avoid, reduce and reverse land degradation. based on a selection of ten potential options, those with the highest rankings were 'a common national long-term vision and commitment to ldn' ( % ranked in first place), a 'national budget for ldn' ( % ranked in first place), and 'secured land tenure and access to land' ( % ranked in first place) (fig. a).
these three measures also ranked the highest based on their relative weighted averages calculated across all three rankings, with 'a common national long-term vision and commitment to ldn' identified as a clear overall priority by respondents (supplementary annex , figure b). the sensitivity of the results to the type of respondent was also analysed to highlight differences in priorities. fig. b shows the differences in the top-ranked (i.e. rank = ) most important measure for implementing ldn across three stakeholder groupings. this highlights a high degree of consistency across the three groups, although some variation can be seen for specific measures. for example, the cso/igo/business grouping gave a stronger preference for secured land tenure and access to land than the other groups. respondents were also asked to rank the five most important challenges to the implementation of ldn moving forward. based on a selection of potential options, those that were ranked the highest were 'insufficient awareness of ldn and understanding of concepts' ( % ranked in first place), 'insufficient finance' ( % ranked in first place), and 'insufficient high-level commitment to ldn' ( % ranked in first place) (fig. a). a high degree of consistency across the three stakeholder groups can again be seen (fig. b). however, some variations are evident: for example, the researchers/scientists grouping gave a lower priority to insufficient finance compared with the other groups, while nfps/consultants gave a higher priority to insufficient ldn implementation guidance. in terms of securing a national political commitment to ldn, % of respondents indicated that ldn was considered of 'high importance' to the national government and politicians, while % indicated that it was of 'some importance', and % that it was of 'low or no importance' (supplementary annex , figure ).
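the weighted averages across rankings referred to above can be computed in several ways; the text does not state which weighting was used. the sketch below assumes one common convention for ranking questions, in which a first-place rank earns the most points and each lower rank one point fewer, and the total is divided by the number of respondents. the function name, the option labels and the responses are hypothetical.

```python
from collections import defaultdict


def weighted_rank_scores(responses, n_ranks=3):
    """Average points per respondent; rank 1 earns n_ranks points, rank 2 one fewer, etc.

    responses: list of ordered lists of options, first element = top-ranked.
    Assumed weighting, not necessarily the one used in the survey analysis.
    """
    if not responses:
        return {}
    totals = defaultdict(float)
    for ranking in responses:
        for position, option in enumerate(ranking[:n_ranks]):
            totals[option] += n_ranks - position
    return {option: total / len(responses) for option, total in totals.items()}


# Hypothetical top-three rankings from four respondents.
responses = [
    ["vision", "budget", "tenure"],
    ["budget", "vision", "tenure"],
    ["vision", "tenure", "budget"],
    ["tenure", "vision", "incentives"],
]
scores = weighted_rank_scores(responses)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

an option that is rarely ranked first can still score well under this scheme if it is consistently ranked second or third, which is why the weighted average complements the share of first-place rankings.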
respondents also confirmed considerable progress in national ldn target-setting, with % of respondents indicating that targets had been adopted or that the process was underway, and a further % indicating that they intended to adopt targets but that the process was yet to commence (supplementary annex , figure a). when asked to rate national capacity to complete ldn-related activities, respondents rated their capacity for ldn target setting and alignment with policy frameworks and plans as fair-to-good (based on a weighted average score of . out of , on a scale from 'very poor = ' through 'fair = ' to 'very good = ') (supplementary annex , figure b). national capacity to undertake stakeholder consultation was rated slightly higher (average of . out of ), while national capacity to solve land conflicts and secure land tenure arrangements was rated the lowest out of nine available options (average of out of , or 'fair'). almost one-third ( %) of respondents rated their national capacity for this activity as poor or very poor. in terms of the sources of finance, 'national government budgets' received the most top rankings ( % ranked in first place), followed by the global environment facility (gef) ( % ranked in first place) and the unccd global mechanism ( % ranked in first place) (supplementary annex , figure a). when asked whether their country had already secured finance for ldn, only % of respondents indicated that this was the case. the implementation of operational or advanced integrated land-use planning systems was quite limited in countries ( % of respondents, figure -si). overall, % of respondents indicated that their country had no integrated land use planning at all, while % indicated that they used land use planning, but that it was not fully integrated.
integrated land use planning was defined as land use planning that seeks to balance the economic, social and cultural opportunities provided by land with the need to maintain and enhance ecosystem services provided by the land-based natural capital. only % of respondents reported that a neutrality mechanism had been fully embedded into their land-use planning system, while % reported that such a mechanism had not been embedded (supplementary annex , figure ). almost half of respondents reported that a mechanism was 'somewhat embedded', while % of respondents didn't know. a total of % of respondents reported that they had national data systems in place to support land-use planning. however, approximately half of these respondents ( %) indicated that their national data systems were rated 'fair' in terms of providing the information necessary to determine land potential and assess land condition, while a further % rated them as 'ineffective' or 'very ineffective' (supplementary annex , figure ). with regard to the three global indicators for monitoring ldn, the vast majority of respondents ( %) reported that their country will make use of land cover change, while % reported that they would use net primary productivity, and % would use soil organic carbon stocks. progress on setting baseline values for each of the three global indicators was also relatively advanced, where % of respondents had set a baseline value for land cover change, % for net primary productivity, and % for soil organic carbon (supplementary annex , figure ). overall, respondents rated their national capacity to set baseline values for ldn indicators and track progress as relatively low compared to other options (average seventh from nine options) (supplementary annex , figure b ). national capacity to undertake target setting and alignment with policy frameworks was rated slightly better (average fifth out of nine options). 
technical activities with low levels of capacity included 'resilience assessment' ( % rated as 'poor' or 'very poor') and 'economic and social assessment' ( % rated as 'poor' or 'very poor') (supplementary annex , figure c ). respondents reported higher completion or commencement rates for land condition and degradation assessments ( % completed, % underway), and the lowest levels for resilience assessments ( % completed, % underway) (supplementary annex , figure a ). respondents ranked the 'full and effective participation from local communities and stakeholders' as the most important factor for ensuring that social co-benefits are maximised in ldn initiatives (supplementary annex , figure a ). approximately % of respondents ranked this measure as the most important element. the vast majority of respondents either strongly agreed ( %) or agreed ( %) that they were expecting positive effects on human wellbeing and livelihoods as a result of slm and ldn ( supplementary annex , figure a) . a majority of respondents also indicated that they strongly agreed ( %) or agreed ( %) that consideration of multiple benefits makes planning for ldn easier. however, less than half of respondents agreed or strongly agreed that it was clear how to manage trade-offs associated with ldn initiatives. respondents reported that a range of multiple benefits were expected from ldn implementation (fig. a) . the multiple benefits expected most often on average were increased biodiversity, increased food security, enhanced local livelihoods, increased yields/productivity, and increased resilience to drought (supplementary annex , figure b ). the sensitivity of these results to the type of respondent were also analysed to highlight differences in expectations regarding multiple benefits from ldn. fig. b shows the differences in perceptions across three different stakeholder groupings regarding multiple benefits that are expected 'often'. 
the most obvious difference can be seen in the researchers/scientists grouping, where values are consistently below the other two groups. this highlights that this group is expecting multiple benefits less often than other groupings. in terms of the monitoring of multiple benefits, the survey results highlight considerable gaps in the availability of quality data (supplementary annex , figure a ). areas with absent or particularly poor data quality included resilience ( % not monitored or data quality is poor), soil organic carbon ( %), and gender equality ( %). table provides a brief synthesis and triangulation of results from the tsp review and the stakeholder survey. a more comprehensive analysis is in supplementary table . the analysis is structured around the enablers that were used to define the enabling environment for ldn and adopted the framework defined in the methods (section . and supplementary fig. ). an enabler was evaluated as a priority gap when it was perceived to be of high importance and where limited progress had been made. this included finance, land tenure and user arrangements, integrated land-use planning, and neutrality mechanisms for counterbalancing gains and losses. the study results represent stakeholder perceptions on the key components of an enabling environment for ldn and highlight key challenges and gaps in progress to date, and are discussed briefly here across the four dimensions of the ldn enabling environment. a common national long-term vision and high-level commitment was perceived by stakeholders as comparatively more important than any other measure (ranked st out of measures). overall, good progress has been made on target-setting at the national level ( % adopted or underway and % intended), however gaps remain in terms of mainstreaming targets into national development plans (only % mainstreamed into national development plans, and % into national action plans). 
while this reflects that a political commitment has been made to ldn in most cases, the tsp reports show that these commitments are sectoral, primarily made by environment or agriculture ministries. this was also supported by the survey results, where ldn was not considered to be a top policy priority for most countries ( % rated it of low or some importance). this may stem from a lack of awareness of ldn and its concepts, which was ranked as the top priority challenge for ldn implementation moving forward. while progress has been made on setting ldn targets through the tsp, our results also suggest that high-level political buy-in for ldn will be a fundamental enabler for national implementation, and it remains lacking at present. advancements made in integrating ldn into national planning may also be undermined by weak monitoring and enforcement capabilities.
the study results show that finance is essential for implementing ldn initiatives; however, limited progress has been made both in assessing needs and in identifying and tapping into potential sources. insufficient finance ranked as a high priority challenge to ldn moving forward (ranked rd out of challenges), while a national budget for ldn was seen both as an important measure for addressing land degradation (ranked nd out of measures) and as the most important source of finance for ldn (ranked st out of nine sources). despite the perceived importance, few countries have secured the necessary finance to date (only % of respondents indicated that finance had been secured). the tsp reports provide evidence that some countries see potential synergies between existing environment or climate finance opportunities and ldn. however, very limited progress has been made in understanding the financial needs and costs associated with ldn interventions, or in allocating resources for implementation.
in terms of regional progress, the nm regional grouping (italy and turkey) had made considerably more progress than other regions on this enabler (fig. ), which could reflect their status as high or upper-middle income countries and their higher hdi scores. the results show that a priority next step would be supporting countries to assess the financial requirements for ldn in the medium- to long-term (operational, monitoring, enforcement) and to develop national financial plans or integrated financing strategies linking to sources of finance including national budgets, investment instruments and potential partners (global funds, private sector, bilateral donors, csos, and philanthropy).
based on the survey results, secured land tenure and access to land was ranked as one of the top three most important elements for supporting ldn implementation ( rd out of measures), while national capacity to secure land tenure was rated the lowest of nine available options. the few tsp reports that addressed land tenure identified it as a weakness or barrier to slm, or as a cause of land degradation. the survey results highlight that the evaluation of environmental, economic, and social trade-offs was seen as important for maximising multiple benefits from ldn (ranked nd of seven options); however, capacity to manage these conflicting interests was ranked lowest of nine options. this may be due to the limited adoption of effective integrated land-use planning systems, with only % of respondents describing their systems as either advanced or operational. only % of respondents to the survey reported that a neutrality mechanism had been effectively embedded into their land-use planning system, highlighting that progress on this measure has been very limited to date.
overall, good progress was evident in terms of setting national baselines on global and national indicators for monitoring progress on ldn, with most countries setting clear baselines for the three global indicators ( % with baseline for land cover, % for land productivity dynamics, and % for soil organic carbon). however, the tsp reports revealed that in most cases this assessment was based solely on global data provided by unccd, or a combination of national and global data. the broad reliance on global data and external technical assistance for setting national baselines may suggest that national monitoring capabilities for tracking progress on the ldn indicators, as well as national data systems, are quite limited. the survey results support this, with respondents rating their national capacity (on average) to set baselines and track progress as 'fair'.
an important finding from the survey was that 'insufficient awareness of ldn and understanding of key concepts' ranked as the top priority challenge moving forward with ldn implementation. this result suggests that, while a robust scientific conceptual framework has been developed to support implementation of ldn, it is yet to be widely understood and applied at the national level. this finding aligns with other studies showing that a lack of understanding of ldn amongst policy makers and planners prevents effective policy responses and allocation of resources (chasek et al., ). previous research also highlights that some of the major challenges to incorporating scientific knowledge into policy include low levels of scientific understanding by policy makers, limited openness of politicians to using this information, limited dissemination of research findings, and lack of incentives and institutional channels (jones et al., , ).
overall, the results highlight that while baseline data to support decision making on ldn has advanced considerably, a priority moving forward will be the effective use of this data to inform and influence policy change and to drive national investment in achieving ldn. while establishing baselines and trends is an important initial step and good progress has been made in most countries, evidence-based assessments are also needed that analyse the costs, benefits and trade-offs of addressing land degradation, along with mechanisms for introducing this evidence into the political, cultural and social debate. although methods for undertaking such assessments are outlined in the ldn scientific conceptual framework, few operational implementations have occurred to date. integrated national multi-disciplinary assessments addressing economic, social and environmental benefits and trade-offs associated with ldn, co-designed with decision makers, would provide a means to understand the interconnections between environment and development issues in order to develop and implement informed, cost-effective and socially acceptable policies or practices. this requires not only strengthening the scientific and research capabilities within countries, but also improving the way that scientific information is constructed, integrated and communicated so that it can contribute more effectively and efficiently to policy formulation. in this context, lessons could be learned from recent experience in improving health policy outcomes associated with the production of evidence by means of participatory research models, which incorporate stakeholders into the design (mapping), evaluation (analysis), communication (visualisation and sharing) and implementation phases of research (horton and brown, ).
lessons could also be drawn from countries with greater progress on the science-policy interface enabler identified through the review of tsp reports. this included countries in the nm and lac regional groupings (fig. ) such as turkey, italy, guyana, colombia and bolivia. overall, synergies and multiple benefits associated with ldn were considered in most of the tsp reports reviewed, however these tended to focus on environmental linkages such as to the rio conventions on biodiversity and climate change. fewer reports mentioned multiple benefits associated with socioeconomic or well-being outcomes. some reports included more detailed analysis of multiple benefits in the form of a ldn leverage plan. common recommendations for leveraging multiple benefits included mainstreaming of ldn targets into national sustainable development plans or relevant sectoral plans (primarily plans for combatting land degradation, biodiversity loss, or land-based mitigation or adaptation to climate change), incorporating ldn outcomes into existing programs funded through gef or with climate finance, as well as greater engagement of central planning and finance ministries, land users and stakeholders. these results align well with responses to the survey. over % of respondents either agreed or strongly agreed that they were expecting positive effects on human well-being and livelihoods as a result of slm and ldn. respondents ranked the three most important elements for ensuring that social co-benefits are maximised in ldn initiatives as the 'full and effective participation from local communities and stakeholders', 'evaluation of environmental, economic and social trade-offs', and the 'identification of livelihood needs and prioritisation of livelihood outcomes in program design'. 
this aligns well with previous studies that concluded that effective project design and engagement of local communities are critical for identifying and addressing trade-offs and maximising multiple benefits (bullock et al., ; lamb et al., ; stanturf et al., ; budiharta et al., ).
this research finds that while the central role of an enabling environment for attaining ldn is acknowledged, knowledge of effective configurations, and of the extent to which they deliver multiple benefits, is scarce. to develop such enabling environments, existing institutional arrangements need to be coherent and conducive to operationalising ldn and there needs to be a firm grounding of the neutrality concept in national policies, targets and budgets. our results show good progress in the institutional dimension of an enabling environment for ldn, yet high-level political buy-in that could accelerate the integration of ldn into national development planning and budgeting processes remains a gap. we conclude that greater efforts need to be made to raise awareness of ldn, educate core stakeholders in its concepts and enablers, and raise its political profile. the recent declaration of the un decade on ecosystem restoration ( - ) provides an additional opportunity to raise the profile of land degradation challenges and ldn in the context of achieving the sdgs. this is also of relevance in the context of the ongoing covid- pandemic, which is derailing efforts to achieve the sdgs, with some experts calling for a great reset (naidoo and fisher, ). the focus of ldn interventions on maintaining and improving natural capital, building resilience and generating co-benefits that strengthen other forms of capital offers win-wins for the broader achievement of the sdgs. thorough assessment of the costs associated with ldn interventions and the sourcing of finance remain important gaps.
further analysis could be undertaken of countries that have secured finance for ldn to date, and of opportunities to replicate or scale up this investment in other countries. the information collected for this research provides evidence that national capacities for securing land tenure and governance arrangements likely represent an important capacity gap for national implementation of ldn, which undermines efforts for its attainment. furthermore, our results show limited progress in adopting integrated land use planning for ldn interventions. given that integrated land use planning and the adoption of a neutrality mechanism are considered fundamental to ldn achievement, it is apparent that this remains a priority implementation gap. overall, countries have made good progress on setting baselines, though evidence points to a lack of national capabilities for key technical activities and assessments that support ldn implementation, which are needed to adequately assess trade-offs and multiple benefits of ldn for wellbeing and sustainable livelihoods, and to design projects and programs that maximise benefits and manage tensions or unintended consequences.
the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
improving the enabling environment to combat land degradation: institutional, financial, legal and science-policy challenges and solutions unpacking the concept of land degradation neutrality addressing its operation through the rio conventions implementing land degradation neutrality at national level: legal instruments in germany. international yearbook of soil law and policy enhancing feasibility: incorporating a socio-ecological systems framework into restoration planning restoration of ecosystem services and biodiversity: conflicts and opportunities operationalizing zero net land degradation: the next stage in international efforts to combat desertification? 
land degradation neutrality: the science-policy interface from the unccd to national implementation land in balance: the scientific conceptual framework for land degradation neutrality assessing resilience to underpin implementation of land degradation neutrality: a case study in the rangelands of western new south wales ts a -land governance, paper no. . fig working week -knowing to manage the territory, protect the environment. evaluate the cultural heritage prioritizing land for investments based on short-and long-term land potential and degradation risk: a strategic approach investigating the impacts of increased rural land tenure security: a systematic review of the evidence land tenure reforms, tenure security and food security in poor agrarian economies: causal linkages and research gaps integrating evidence, politics and society: a methodology for the science-policy interface political science?-strengthening science-policy dialogue in developing countries strengthening science-policy dialogue in developing countries: a priority for climate change adaptation uncertainties and policy challenges in implementing land degradation neutrality in russia restoration of degraded tropical forest landscapes reset sustainable development goals for a pandemic world gender and land degradation neutrality: a cross-country analysis to support more equitable practices scientific conceptual framework for land degradation neutrality. 
united nations convention to combat desertification (unccd) systematic conservation planning products for land-use planning: interpretation for implementation assessing land condition as a first step to achieving land degradation neutrality: a case study of the republic of srpska land degradation neutrality-potentials for its operationalisation at multi-levels in nigeria contemporary forest restoration: a review emphasizing function human development index and its components transforming our world: the agenda for sustainable development, outcome document of the united nations summit for the adoption of the post- agenda creating an enabling environment for land degradation neutrality: and its potential contribution to enhancing well-being, livelihoods and the environment experiences from the south african land degradation neutrality target setting process achieving land degradation neutrality in germany: implementation process and design of a land use change based indicator we would like to acknowledge funding provided by the unccd secretariat to undertake this research, as well as expert input and advice provided by members of the unccd spi and other experts in the field. supplementary material related to this article can be found, in the online version, at doi:https://doi.org/ . /j.envsci. . . . key: cord- -i ukefp authors: gómez-herranz, maria; nekulova, marta; faktor, jakub; hernychova, lenka; kote, sachin; sinclair, elizabeth h.; nenutil, rudolf; vojtesek, borivoj; ball, kathryn l.; hupp, ted r. title: the effects of ifitm and ifitm gene deletion on ifnγ stimulated protein synthesis date: - - journal: cell signal doi: . /j.cellsig. . . sha: doc_id: cord_uid: i ukefp interferon-induced transmembrane proteins ifitm and ifitm (ifitm / ) play a role in both rna viral restriction and in human cancer progression. 
using immunohistochemical staining of ffpe tissue, we identified subgroups of cervical cancer patients where ifitm / protein expression is inversely related to metastasis. guide rna-cas methods were used to develop an isogenic ifitm /ifitm double null cervical cancer model in order to define dominant pathways triggered by presence or absence of ifitm / signalling. a pulse silac methodology identified irf , hla-b, and isg as the most dominating ifnγ inducible proteins whose synthesis was attenuated in the ifitm /ifitm double-null cells. conversely, swath-ip mass spectrometry of ectopically expressed sbp-tagged ifitm identified isg and hla-b as dominant co-associated proteins. isg ylation was attenuated in ifnγ treated ifitm /ifitm double-null cells. proximity ligation assays indicated that hla-b can interact with ifitm / proteins in parental siha cells. cell surface expression of hla-b was attenuated in ifnγ treated ifitm /ifitm double-null cells. swath-ms proteomic screens in cells treated with ifitm -targeted sirna cells resulted in the attenuation of an interferon regulated protein subpopulation including mhc class i molecules as well as ifitm , stat , b m, and isg . these data have implications for the function of ifitm / in mediating ifnγ stimulated protein synthesis including isg ylation and mhc class i production in cancer cells. the data together suggest that pro-metastatic growth associated with ifitm / negative cervical cancers relates to attenuated expression of mhc class i molecules that would support tumor immune escape. interferons (ifns) are pleiotropic cytokines produced by the innate immune system as a defensive response [ ] . many biological functions have been described for the interferon pathway. among them, the best characterized are anti-tumor activity, immunomodulatory effects, antipathogen, and anti-viral activity [ ] . 
ifn effects rely on three different types of receptors: type i (ifnα, ifnβ and ifnω), type ii (ifnγ) and type iii (ifnλ) [ , ] . ifns increase in response to a broad range of factors such as persistent viral infection or dna damaging agents which activate the jak kinase-stat pathway. ultimately, this signalling cascade will regulate the transcriptional synthesis of interferon stimulated genes (isgs) [ , ] . type i ifns facilitate anti-proliferative and pro-apoptotic pathways in a wide range of cell types and it has been extensively used as antitumor therapeutic agent. high doses of ifns are used for cancer therapy and can activate anti-tumor immunity as well as pro-apoptotic and antiproliferative programs. by contrast, it has been demonstrated that sustained low level of ifn causes a steady-state expression of the interferon resistance dna damage signature (irds) which is comprised of a subset of interferon-stimulated genes [ ] . irds proteins promote phenotypes that contribute to the tumor development such as resistance to dna damage, suppression of t cell toxicity, metastasis, and facilitation of epithelial-mesenchymal transition [ ] . ifitm is a pro-oncogenic receptor which is a component of the irds pathway. basal expression of ifitm proteins is observed in some cells and expression can also be induced by type i and type ii interferons. ifitm is up-regulated during development of radiation resistance, escaping from pro-apoptotic and anti-proliferative effects [ , ] . ifitm expression has been extensively reported in many types of cancer; breast, cervix, colon, ovary, brain and oesophagus cancer and its high expression correlates with tumor progression and can lead to a poor outcome [ ] [ ] [ ] [ ] [ ] [ ] [ ] . ifitm was previously known as ifi (interferon induced protein ), ifi - (interferon inducible protein - ), cd (cluster of differentiation) and leu- (leucocyte surface protein). 
it is a membrane protein that plays a key role in restriction of anti-viral immune response and belongs to the interferon induced transmembrane family (ifitm) [ ] . ifitm is coded by the ifitm gene located on chromosome p . and flanked by ifitm and ifitm genes. the ifitm immunity-related protein family are composed of short amino-terminal and carboxy-terminal domains, two transmembrane domains, and a cytoplasmic domain. ifitm is slightly different from ifitm and ifitm , with some studies demonstrating that ifitm is uniquely expressed at the cell surface [ , ] . ifitm family members are capable of attenuating the entrance of many human and animal viruses. they suppress the entry of viruses such as influenza a virus (iav), west nile virus, dengue virus, sars coronavirus, filoviruses, vsv, and hcv among others [ ] . ifitm family proteins inhibit cytosolic entry of viruses by preventing fusion of viral and host membranes. the protein-protein interaction networks by which ifitm proteins inhibit viral propagation are just emerging. one study using yeast two-hybrid methodology has identified an interaction of ifitm , ifitm , and ifitm with vapa that in turn mediates an accumulation of cholesterol in multivesicular structures [ ] . this reduces the fusion of intraluminal virion-containing vesicles with endosomal membranes and thereby inhibits virus release into the cytosol. in addition, different family members exhibit specific viral preferences. in the case of ifitm , it is more active in controlling filoviruses, influenza a, and sars [ , , ] . in this report, we focus on applying methodologies in interactomics, gene editing, swath-immunoprecipitation mass spectrometry, and pulse-silac mass spectrometry to propose an ifnγ responsive biochemical function for the ifitm / proteins. all chemicals and reagents were obtained from sigma unless indicated otherwise. all antibodies from thermofisher unless indicated otherwise. 
siha, ifitm null-siha, and ifitm /ifitm null-siha cell lines were grown in rpmi medium (invitrogen) supplemented with % (v/v) fetal bovine serum (labtech) and % penicillin/streptomycin (invitrogen) and incubated at °c with % co .
cervical cancer samples were obtained at the masaryk memorial cancer institute, where patients gave written consent for tissue use according to local ethical regulations. tissues were fixed in % formaldehyde for approximately h before processing into paraffin wax and sectioning ( μm). after sections were cut, the antigens were retrieved, and samples were probed with primary antibodies (including those to p protein (data not shown), and ifitm / proteins (using the mab-mhk); supplementary fig. ). following this stage, the tissue sections were incubated with secondary antibodies conjugated to streptavidin-hrp. sections were incubated with dab (dako), counterstained with haematoxylin and mounted for visualization, as previously described [ ].
generation of ifitm null and ifitm /ifitm double null cell line using crispr/cas technology
guide rna sequences were designed for ifitm ( ′ tccaaggtcc accgtgatca ′) and for ifitm ( ′ gtcaacagtggccagccccc ′). the guide rna was cloned into the lenticrisprv expression vector [ ]. after transfection, cells were selected with puromycin and sorted in a -well plate using facs. after weeks, individual clones were screened for absence of ifitm and/or ifitm protein expression by western blot using the mab-mhk (supplementary fig. ). chromosomal dna from the positive clones was sequenced for a final validation to define the precise gene edit (as in figs. and ). chromosomal dna was extracted from frozen cell pellets following the instruction manual (gentra puregene cell kit, qiagen).
the edited dna sequence was validated by cloning the genomic ifitm and ifitm pcr products into a holding vector and by amplifying the entire gene and bp surrounding the guide rna targeting site, followed by sanger sequencing at source bioscience service (scotland).
cloning, transfection and affinity purification of sbp-ifitm protein
ifitm cdna was cloned by pcr into the pexpr-iba expression vector containing an sbp tag at the n-terminus of the coding region (sbp vector (iba)). siha cells were grown in rpmi in duplicate. for transfection, cells were grown to approximately % confluency and transfected using attractene (qiagen). cells were transfected with sbp-empty vector (control cells) and sbp-ifitm for h. h after transfection, cells were treated with carrier or with ng/ml ifnγ (invitrogen) for h in order to "stabilize" potential interferon-activated sbp-ifitm interacting proteins. cells were washed twice in ice cold pbs and scraped into buffer containing mm kcl, mm hepes ph . , mm edta, mm egta, . mm na vo , mm naf, % (v/v) glycerol, protease inhibitor mix, and . % triton x- , then incubated for min on ice and centrifuged at , rpm for min at °c. equal amounts of protein were used for performing the affinity capture. streptavidin agarose conjugated beads (millipore) were prewashed in pbs. cell lysate was then added and incubated for h at rt with gentle rotation. bound proteins were eluted with a buffer containing mm hepes ph , mm dtt, and m urea.
small interfering rna directed against the human ifitm (qiagen) and an allstars negative control flexitube sirna (qiagen) were used to transfect siha cells for and h. cells were transfected using hiperfect (qiagen) following the manufacturer's instructions.
for performing "pulse" silac, parental siha, ifitm /ifitm double null cells, and ifitm single null cells were grown as biological triplicates and incubated with silac heavy media for and h with or without ng/ml ifnγ before harvesting [ ] [ ] [ ].
cells were isotopically pulse-labeled using silac rpmi-heavy media (dundee cell products, uk); l-[ c n ] arginine (r ) and l-[ c n ] lysine (k ). cells were harvested in a buffer containing m urea, . m tris ph . . total protein extracts were measured by the bradford assay [ ].
proteins were detected using the following primary antibodies: mouse monoclonal antibodies generated to a peptide that is identical in ifitm and ifitm (mab-mhk) (moravian biotechnology; first described in this study). the antibody we name mab-mhk can therefore detect co-expression of both ifitm and ifitm proteins (supplementary fig. ). the other antibodies were: rabbit monoclonal anti-stat (cell signalling), mouse monoclonal anti-irf (bd transduction laboratories), rabbit polyclonal anti-hla-b (thermo fisher), rabbit polyclonal anti-isg (cell signalling), mouse monoclonal anti-sbp-tag (sigma), mouse monoclonal anti-β-actin (sigma), and mouse monoclonal anti-gapdh (abcam). protein from lysed samples was quantified using protein assay dye reagent (bio-rad). proteins were resolved by sds-page using % gels [ ] and transferred onto nitrocellulose membranes (amersham protran, ge healthcare). immunoblots were processed by enhanced chemiluminescence (ecl).
parental siha and ifitm /ifitm double null cells were grown on mm diameter glass coverslips. stimulated cells were treated with ng/ml ifnγ for h. cells were fixed with % (v/v) paraformaldehyde in pbs for min at room temperature and permeabilized using . % triton x- in pbs for min. then, the cells were blocked with % bsa in pbs for min. cells were incubated with the primary antibody at a : dilution overnight. alexa fluor goat anti-rabbit igg (h + l) (invitrogen) was used as a secondary antibody at : dilution for h. the fluorescent signal was detected using a zeiss axioplan microscope at ×. replicates are described in the fig. legend.
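the image quantification used for these immunofluorescence experiments is the corrected total cell fluorescence (ctcf): the integrated density of a selected cell minus the product of its area and the mean background fluorescence. the python sketch below only illustrates that arithmetic. it is not the authors' analysis script; the class name, function name and numeric values are hypothetical, and it assumes that area, integrated density and background means have already been exported from imagej.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class CellMeasurement:
    """One selected cell as measured in ImageJ (hypothetical container)."""
    area: float                # area of the selected cell
    integrated_density: float  # integrated density of the selected cell


def ctcf(cell: CellMeasurement, background_means: Sequence[float]) -> float:
    """Corrected total cell fluorescence.

    CTCF = integrated density - (area of selected cell x mean background).
    `background_means` holds the mean gray value of one or more
    non-fluorescent regions from the same image.
    """
    background = mean(background_means)
    return cell.integrated_density - cell.area * background


if __name__ == "__main__":
    # Illustrative numbers only, not data from the study.
    cell = CellMeasurement(area=100.0, integrated_density=5000.0)
    print(ctcf(cell, background_means=[10.0, 12.0]))
```

averaging several background readings per image, as in this sketch, is one possible choice; the text specifies only that a region with no fluorescence was selected as background for each image.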
fluorescence was measured using imagej software; cells were selected, and the area, integrated density, and mean gray value were extracted by selecting set measurements in the analyse menu. a region with no fluorescence was selected as background for each image. the following formula was applied to each cell analyzed: ctcf = integrated density - (area of selected cell × mean fluorescence of background readings), where ctcf is the corrected total cell fluorescence. parental siha and ifitm /ifitm double null cells were grown and processed as described in the immunofluorescence method (above). primary antibodies from different species were incubated with the fixed and permeabilized cells: mhk mouse mab with rabbit polyclonal anti-hla-b, at : dilution overnight. duolink® assays (sigma) were carried out following the supplier's instructions. the fluorescent signal was detected using a zeiss axioplan microscope at ×. fluorescence was measured using imagej software, as described above for standard immunofluorescence.

. . mass spectrometric experimental screens

. . . peptide generation using fasp

cell lysates, immunoprecipitates, or gradient fractions were processed using the filter-aided sample preparation protocol (fasp) [ , ] . urea buffer ( m urea in . m tris ph . ) was added to a kda spin filter column (mrcprt , microcon). protein concentration was determined using the rc-dc method (bio-rad). normalized sample was added to the spin filter column and centrifuged at g for min at °c. urea buffer was added again with mm tris(2-carboxyethyl)phosphine hydrochloride (aldrich) and mixed. the column was left on a thermo-block set at °c shaking at rpm and centrifuged at , rpm at °c for min. urea buffer and mm of iodoacetamide (sigma) were mixed using a thermo-mixer at rpm in the dark for min, and the column was then left static for a further min at room temperature in the dark. the sample was centrifuged at , rpm at °c for min and the supernatant was discarded.
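The corrected total cell fluorescence (ctcf) calculation described for the imagej quantification can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function names, the choice to average several background readings, and all numeric values are hypothetical.

```python
def mean_background(readings):
    """Average the mean gray values of the background readings (assumed to be >= 1 reading)."""
    return sum(readings) / len(readings)


def ctcf(integrated_density, cell_area, background_mean):
    """CTCF = integrated density - (area of selected cell x mean background fluorescence)."""
    return integrated_density - cell_area * background_mean


# Hypothetical ImageJ measurements for two selected cells in one image.
cells = [
    {"area": 200.0, "integrated_density": 5000.0},
    {"area": 150.0, "integrated_density": 3000.0},
]
# Hypothetical mean gray values from a non-fluorescent background region.
background_readings = [4.0, 6.0, 5.0]

bg = mean_background(background_readings)
ctcf_values = [ctcf(c["integrated_density"], c["area"], bg) for c in cells]
print(ctcf_values)
```

Subtracting the area-scaled background, rather than a fixed offset, keeps larger cells from appearing brighter simply because they cover more background pixels.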
a solution containing mm ammonium bicarbonate were added to the column and then it was centrifuged at , rpm at °c for min. this step was repeated one more time. the column was placed in a new collecting tube (low binding affinity) and mm ammonium bicarbonate was added along with trypsin diluted in trypsin buffer (promega) at a : ratio. the column was incubated at °c overnight. the following day the column was centrifuged at , rpm at °c for min. determination of the peptide concentration was performed using the quantitative colorimetric peptide assay (pierce, thermo-scientific). peptides were desalted on micro spin columns c- (harvard apparatus, usa). c- columns were conditioned three times with % acetonitrile (acn) and . % formic acid (fa) and centrifuged at rpm at room temperature for min. the column was washed with . % fa and centrifuged at rpm at room temperature for min. the column was hydrated in . % fa for min following centrifugation at rpm at room temperature for min. the sample was loaded into the column and centrifuged at rpm for min. after washing the column three times with . % fa, the peptides were eluted in three consecutive centrifugations at rpm for min using %- % and % acn with . % fa. subsequently, peptide eluates were evaporated and dissolved in % acn with . % fa. there were three distinct mass spectrometric screens used in the manuscript. the rationale for replicates in each distinct approach is as follows; (i) the swath-ip (immunoprecipitation) to identify ifitm enriched associated proteins (fig. ) . the immunoprecipitation and immunoblots of sbp-ifitm vs sbp control is representative of experiments performed at least three times. the representative enrichment of ifitm -associated proteins (in fig. from supplementary table ) was processed by label free (swath) mass-spectrometry in technical triplicates. the measured fold-changes and p-values for quantified proteins are listed in supplementary table and fig. ; (ii) the sirna swath-ms ( fig. 
a and b). targeted sirna to deplete ifitm in siha cells was performed in at least three separate experiments. a representative depletion of ifitm protein (fig. a ) was performed in technical triplicates using two different biological states ( and h). the statistical rationale for performing two biological states (equivalent plating of cell number and differing by time of interferon exposure) rather than two biological replicates at the same time point, was due to the variable induction and suppression of the interferon cascade over this time frame. thus, any proteins that are observed in two biological states as a function of time are thought to have higher significance than an analysis performed in duplicates at one time point. the samples in the two biological states were processed in technical triplicate for label-free (swath) analysis ( fig. ; supplementary table ); (iii) pulse silac to measure protein synthesis as a function of genotype. twelve enzymatically digested samples (four samples of parental siha, four samples of ifitm null, and four samples of ifitm /ifitm double null), each of them as independent biological triplicates, were processed using isotopically labeled amino acids and were separated using lc-ms/ms analysis ( table ). the statistical rationale for using three biological replicates with one injection per replicate, rather three technical replicates from one biological sample, relates to the dynamics and variability in the interferon dynamics. by using biological triplicates, any common overlaps are deemed to be more significant because of the possible variability in the cell plating and interferon cascade. peptides were loaded on a pre-column (μ-precolumn, mm i.d., mm length, c pepmap , μm particle size, Å pore size) and further separated on an acclaim pepmap rslc column ( mm i.d., length mm, c , particle size mm, pore size Å) with a nl/min flow rate using a linear gradient: % b over min, - % b over min, - % b over min, with a = . % aq. 
formic acid and b = % acn in . % aq. formic acid. peptides eluting from the column were introduced into an orbitrap elite (thermo fisher scientific, massachusetts, usa) operating in top data dependent acquisition mode. a survey scan of - m/z was performed in the orbitrap at resolution with an agc target of × and ms injection time followed by twenty data-dependent ms scans performed in the ltq linear ion trap with microscan, ms injection time and , agc. the data from mass spectrometer were processed either using proteome discoverer . or proteome discoverer . that is employed with imbedded statistical tools (both programs were from thermo fisher scientific, massachusetts, usa). proteome discoverer . processed the data using mascot engine with the following search settings: database swiss-prot human (april ); enzyme trypsin; missed cleavage sites; precursor mass tolerance ppm; fragment mass tolerance . da; dynamic modifications: carbamidomethyl [c], oxidation [m], acetyl [protein n-terminus]. the results of the search were further submitted to generate the final report considering % fdr on both psm and peptide group levels. only unique peptides were used for the protein quantification. silac labels of r and k were chosen for heavy and r and k for light. the relative quantification value was represented as heavy/light ratio (supplementary table ). in the processing and consensus workflows subsequent nodes were used: the minora feature detector, the precursor ion quantifier, and the feature mapper. the data were processed using sequest ht engine with the following search settings: database swiss-prot - - , # sequences , , taxonomy: homo sapiens (updated february ); enzyme trypsin; missed cleavage sites; precursor mass tolerance ppm; fragment mass tolerance . da; static modification carbamidomethyl [+ . da, (c)], label c( ) [+ . da (k, r)]; dynamic modifications oxidation [peptide terminus, + . da (m)], met-loss +acetyl [protein terminus, − . da (m)]. 
the results of the search were further submitted to generate the final report with a % fdr using percolator. for the protein quantification and statistical assessment of the biological triplicates, only unique peptides and razor peptides were used. the relative quantification value is represented as the relative peak area of the peptides with the heavy isotope labels with ifnγ treated cells/ifnγ untreated cells ratios (supplementary table b ). label free quantitation was performed using fasp-processed tryptic digests with liquid chromatography coupled to tandem mass spectrometry on an eksigent ekspert nanolc (sciex, california, usa) online connected to a tripletof + (sciex, toronto, canada) mass spectrometer. cells lysates were processed in technical triplicates. prior to the separation the peptides were concentrated and desalted on a cartridge trap column ( μm i.d. × mm) packed with a c pepmap sorbent with a μm particle size (thermo fisher scientific, waltham, ma, usa). after a washing using . % trifluoroacetic acid in % acetonitrile and % water, the peptides were eluted using a gradient of acetonitrile/water ( nl/min) using a capillary emitter column picofrit® nanospray columns μm × mm (new objective, massachusetts, usa) self-packed with prontosil - -c aq sorbent with μm particles (bischoff, leonberg, germany). mobile phase a was composed from . % (v/v) formic acid in water, and mobile phase b was composed of . % (v/v) formic acid in acetonitrile. gradient elution started at % mobile phase b for the first min and then the proportion of mobile phase b increased linearly up to %b for the following min. output from the separation column was directly coupled to an ion source (nano-electrospray). nitrogen was used as a drying and nebulizing gas. temperature and flow of the drying gas was set to °c and psi. voltage at the capillary was . kv. pooled sample for the spectral library was measured in data-dependent positive mode (ida). 
the ms precursor mass range was set from m/z up to m/z and from m/z up to m/z in ms/ ms. cycle time was . s and in each cycle most intensive precursor ions were fragmented. subsequently, their corresponding ms/ms spectra were measured. precursor exclusion time was set to s. precursor ions with intensity below cps were suspended from the ida experiment. the extraction of mass spectra from chromatograms, their annotation and deconvolution were performed using protein pilot . (sciex, toronto, canada). ms and ms/ms data were searched using the uniprot+swissprot database ( . , , entries) restricted to homo sapiens taxonomy. fixed modificationalkylation on cysteine using iodoacetamide and digestion using trypsin was set for all searches. fdr analysis was performed by searching the shotgun data against the decoy search database. the resulting group file was imported into peakview . . . (sciex, toronto, canada), where only proteins with fdr below % were imported into the spectral library ( proteins for sbp-ifitm pull down swath). in swath mode, the mass spectrometer operated in high sensitivity positive mode. precursor range was set from m/z up to m/z . it was divided into precursor ion windows with the width of da and da overlap. accumulation time was ms per swath window and the duty cycle was . s, which enabled acquisition of at least data points across a chromatographic peak. product ions were scanned from m/z up to m/z . data extraction was performed in peakview . . . (sciex, toronto, canada) with the spectral library. the retention time window for extraction was manually set to min for sbp-ifitm pull down swath and for the sirna swath ( fig. ). protein quantification was performed using up to peptides and transitions per protein to define the statistically significant proteins. scope of peptides used for quantification was narrowed to only those with peptide confidence higher than % and without any variable modification. 
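The decoy-based filtering used to build the spectral library can be illustrated with a generic target-decoy calculation. This is a sketch under stated assumptions, not the algorithm implemented in protein pilot or peakview: the function name, the score scale, the data, and the threshold are all hypothetical, and the fdr at each rank is estimated simply as decoys divided by targets.

```python
def filter_by_fdr(hits, max_fdr):
    """Return target identifiers accepted at or below max_fdr.

    hits: list of (identifier, score, is_decoy); higher score means a better match.
    The FDR at each rank is estimated as (decoys so far) / (targets so far).
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of ranked hits retained
    for rank, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            cutoff = rank  # keep the deepest rank that still satisfies the threshold
    return [pid for pid, _, is_decoy in ranked[:cutoff] if not is_decoy]


# Hypothetical search results: P* are target proteins, D* are decoy hits.
hits = [
    ("P1", 9.5, False), ("P2", 9.0, False), ("P3", 8.5, False), ("P4", 8.0, False),
    ("D1", 7.5, True), ("P5", 7.0, False), ("D2", 6.5, True), ("P6", 6.0, False),
]
# Illustrative threshold only; the study's own cut-off is not reproduced here.
accepted = filter_by_fdr(hits, max_fdr=0.25)
print(accepted)
```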
protein summed peak areas were normalized using the total area sums option in markerview . . . (sciex, toronto, canada). samples were compared pairwise using a paired t-test. the mass spectrometric data have been deposited to the proteomexchange consortium via the pride partner repository with the dataset identifier pxd . the parental siha and ifitm /ifitm double null cells were grown in rpmi (supplemented with % fbs, pen-strep and pyruvate) to % confluence in -well plates and treated with ng/ml ifnγ for h. cells were harvested using accutase (sigma-aldrich, a ), centrifuged at rpm for min with rpmi and kept on ice in % bsa in pbs for min. primary antibodies (hla-b: pa - ) were diluted in % bsa/pbs to : . cells were centrifuged as before, and cell pellets were resuspended in μl of diluted primary antibodies or in the same amount of % bsa/pbs (for control samples) and incubated for min at room temperature on a tube rotator. after a triple wash in ice-cold % bsa/pbs, cells were incubated with μl of secondary antibody (abcam, goat anti-rabbit igg h&l dylight ), diluted : , on a tube rotator for min at room temperature. after a triple wash in ice-cold % bsa/pbs, cells were resuspended in μl of % bsa/pbs and kept on ice before measurement. samples were measured on a facsverse (bd biosciences) and data were analyzed using facsuite software (bd biosciences). a negative control without primary antibody was prepared for each sample. hla expression on the cell surface was calculated as the fitc median fluorescence intensity (mfi) divided by the mfi of the negative control. two independent experiments were performed, each with two independently isolated ifitm /ifitm double null cell clones and two different hla antibodies (hla-a (data not shown) and hla-b (fig. )).
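The surface expression readout described for the facs analysis (fitc mfi of the stained sample divided by the mfi of its negative control) can be sketched as below. This is a minimal illustration with hypothetical per-event intensities and a hypothetical function name; it does not reproduce the facsuite workflow.

```python
from statistics import median


def relative_mfi(stained_events, control_events):
    """Median fluorescence intensity of the stained sample divided by that of
    the matching negative control (no primary antibody)."""
    return median(stained_events) / median(control_events)


# Hypothetical per-cell FITC intensities (arbitrary units).
stained = [100.0, 120.0, 140.0, 400.0, 90.0]
negative_control = [10.0, 12.0, 8.0, 11.0, 9.0]

ratio = relative_mfi(stained, negative_control)
print(ratio)
```

Using the median rather than the mean limits the influence of a few very bright events, such as the 400.0 value in the toy data.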
identifying a clinically relevant model to dissect ifitm and ifitm (ifitm / ) signalling ifitm and ifitm (ifitm / ) can function as oncogenic factors in several cancer cells [ , ] . attenuation of ifitm protein expression can inhibit growth, invasion, and/or migration of cancer cells [ ] . patient subgroups with clinically relevant expression data are not welldefined. we developed a panel of monoclonal antibodies to a n-terminal peptide with a high homology between ifitm and ifitm (supplementary fig. a , b). this would allow the development of monoclonal antibodies that detect the co-expression of both ifitm and ifitm proteins. we aimed to use such tools to screen a large panel of human cancers for those that express ifitm / proteins. this would identify clinically relevant models for a focus to dissect ifitm / mediated signalling pathways. the monoclonal antibody chosen (named mab-mhk; supplementary fig. b) can bind to ifitm / antigens in a range of human cancer cells ( supplementary fig. c ). some cancer cells exhibit no expression of ifitm / such as the lymphoma cell line whu-nhl ( supplementary fig. c ). further studies confirmed that mab-mhk can bind to both ifitm and ifitm proteins, as defined using single ifitm single null and ifitm /ifitm double null cells (see below). mab-mhk was used to screen large panels of archival formalin-fixed human cancer tissues to identify potential clinically relevant models. we could detect differential expression of ifitm / in breast cancer, colon cancer, and oesophagus cancer (data not shown). we could also detect differential expression of ifitm / protein in a cervical cancer array ( fig. a-f) . squamous cervical cancer samples expressed either high levels of ifitm / (fig. a) , lower levels of ifitm / (fig. b) , or undetectable levels of the antigens (fig. c ). cervical adenocarcinoma often exhibited high expression (fig. d) . 
interestingly, some normal squamous cervical epithelium exhibited high expression in the basal 'stem' cell or pluripotent layer only (fig. e ), but not the differentiating cell layers. we can conclude that out of cervical cancer specimens are positive for ifitm / proteins using the mab-mhk (fig. f, top panel) . what is also interesting is that there is a statistically significant inverse association between ifitm / protein expression and the number of lymph node metastases in patients (fig. f, bottom panel) . this will be rationalized in the discussion based on data that emerges below. developing an ifitm and ifitm double null cell line using a crispr-guide rna methodology ifitm is implicated in ifnγ mediated growth control in some cancer cells with an active p pathway [ ] . ifitm is also implicated in a growth stimulatory role in cervical squamous cancers [ ] [ ] . the hpv + and ifitm / positive cervical cancer cell line siha [ ] [ ] exhibit ifnγ inducible stat , irf , and ifitm / proteins (fig. g ). as such, we focused on using this cervical cancer cell line (siha) as a model to identify ifitm / dependent signalling pathways. in order to continue to develop a cervical cancer cell model that reflects the clinical data (ifitm / positive or ifitm / negative cancers; fig. f ), we set out to develop a double null ifitm and ifitm cell line model. we first generated an isogenic ifitm null cell panel through gene editing to validate the guide rna. ifitm knockout mice are viable [ ] so it was likely we would be able to generate ifitm null cells. guide rnas targeting exon in the ifitm gene ( figure a) were cloned into plentiv . cells were transfected and selected for resistance to puromycin to allow stable integration of the ifitm targeting guide rna cassette. cell clones were chosen for sequencing across the guide rna targeting site ( fig. a) using pcr (fig. b ). 
both ifitm alleles were gene edited in a representative ifitm -null clone, creating two distinct frameshifts (fig. c and d). we examined four representative ifitm null cells in dna damage response assays and all lines were shown to be either chemosensitive or x-irradiation sensitive (data not shown). since all single knock-out clones behaved similarly, we chose one representative ifitm null cell line for continued study. the ifitm gene was next targeted using guide rna methodologies (fig. a) to create an ifitm /ifitm double null cell (fig. a-c). ifitm null mice, and mice with the entire ifitm chromosomal locus deleted, are also viable [ , ] . the double null cell provides an isogenic cell model that removes any redundancy of ifitm in the ifitm interaction landscape, especially as both are reported to interact with vapa [ ] . immunoblotting confirmed that ifitm and ifitm proteins are not detected in the ifitm /ifitm double null cell (fig. d; lanes and ). we chose a representative double-null cell line for subsequent studies. given that one established effector of ifitm is ifnγ [ ] , we evaluated ifnγ responsive protein synthesis in ifnγ stimulated parental siha, ifitm single null, and ifitm /ifitm double null cells. the parental siha, ifitm single null, and ifitm /ifitm double null cells were treated with heavy isotopic amino acid labeling media (the silac method) for or h, in the absence or presence of ifnγ (fig. ). cell lysates were then processed by fasp [ , ] and analyzed for ifitm / -dependent protein synthesis using mass spectrometry. the silac methodology has been subjected to an analysis of the random error associated with the multiple steps in this approach, including: cell plating in biological replicates, switch to heavy isotopic media, cell recovery from plastic plates, cell lysis, centrifugation, filter-assisted trypsinization, and tryptic peptide recovery and processing.
this error can be reduced by employing multiple replicates (n = ) as highlighted previously [ ] . to highlight the importance of biological replicates and the inherent error in this multiple-step process, we plot the data not as an average of three replicates, but as individual points from all three replicates (as in supplementary fig. and fig. ). the dominant ifnγ responsive protein identified at -h post labelling is irf (supplementary fig. b vs a), with an attenuation of isotopically labeled irf peptides recovered in the biological triplicates from ifitm /ifitm double null cells (supplementary fig. h). this suggests that irf is partially dependent upon ifitm / signalling, and this was confirmed using irf transcriptional reporter assays (data not shown). these data provide some degree of confidence that the methodology is able to identify a known ifnγ responsive target (irf ). other proteins whose synthesis was detected at hours post-isotopic labelling include stat , eif , and b m (supplementary fig. i-k). eif is not known to be linked to interferon signalling, but it is known to regulate the accuracy of aug codon selection by the scanning pre-initiation complexes [ ] . eif might prove to be involved in regulating interferon dependent anti-viral mrna selection. nevertheless, all three proteins are ifitm / -independent (supplementary fig. i-k). stat and b m are both known ifnγ responsive proteins, further validating the methodology used to measure quantitative changes in protein synthesis. that all three proteins (stat , eif , and b m) exhibit equivalent protein synthesis rates in the parental and double-null cell model indicates that the double null cell has retained many key ifn regulatory features of the parental cell, despite the selection process used to create the cell model.
by -h post ifnγ treatment of siha cells, hla family members, b m, and stat proteins were detected (fig. b vs a), again indicating that the methodology can detect known ifnγ inducible proteins. by -h post ifnγ treatment, isotopically labeled irf peptides are attenuated (fig. h), which is consistent with the early and transient induction of irf by ifnγ. the synthesis of mhc class i molecules was ifitm / -dependent (fig. b vs f; quantified in biological triplicates in fig. l). all three major hla alleles exhibited attenuated synthesis in the double null cell, as defined using the tryptic peptide coverage (supplementary fig. ). isotopically labeled isg tryptic peptides are also not observed in the early interferon response (supplementary fig. ), and the isotopically labeled isg peptide recovery after h is attenuated in the ifitm /ifitm double null cells (quantified in biological triplicates in fig. g), suggesting that isg protein synthesis is largely ifitm / dependent. by contrast, stat protein synthesis at h appears largely ifitm / -independent (quantified in biological triplicates in fig. i). as a further internal control, another well-known ifnγ inducible protein, b m, exhibits equivalent synthesis in the ifitm /ifitm double null cell (quantified in biological triplicates in fig. k). this indicates that one key regulatory feature of the double null cell, stat production, has remained intact. these data first demonstrate that, using the pulse-silac methodology, the siha cell model reflects the classic ifnγ responsive induction of stat , irf , b m, isg , and mhc class i molecules (fig. b vs a and supplementary fig. ).

fig. . immunohistochemical analysis of ifitm / protein expression in cervical cancers using the mab-mhk. formalin-fixed, paraffin-embedded cervical carcinoma tissue was processed as stated in the experimental procedures using the mab-mhk that binds to shared epitopes in the n-terminal domains of ifitm and ifitm (supplementary fig. a, b).
also of note is the attenuation of hla-a, hla-b, hla-c, and isg protein synthesis h post-ifnγ treatment in the ifitm /ifitm double null cells compared to parental siha (fig. f vs b). in order to determine whether ifitm deletion alone affected this set of gene products, the parental siha and ifitm -null cells were treated in parallel with silac heavy-labeling media for or h. as with the ifitm /ifitm double null cells, ifnγ dependent induction of irf protein synthesis is attenuated hours post labelling in the ifitm -single null cells (supplementary fig. d vs b). this suggests that irf is dependent upon ifitm . elevation of stat protein synthesis is ifitm -independent, based on the equivalent induction of stat in the ifitm -single null cells (fig. i). hla-b protein synthesis is attenuated hours post-ifnγ treatment in the ifitm single null cells (fig. l). isg synthesis is also attenuated in the ifitm single null cell (fig. g). together, these data suggest that mhc class i family members and isg require at least ifitm for maximal ifnγ stimulated protein synthesis. the software used to identify hla orthologues in the pulse-silac screen (fig. ) can discriminate between hla-a, hla-b, and hla-c based on tryptic peptide sequences (supplementary fig. ). we focus here on hla-b, for which hla-b specific peptides are accurately identified and which is also a member of the irds pathway [ ] . we thus validated hla-b protein expression in orthogonal assays to determine whether the apparent reduction in hla-b protein synthesis in the ifitm /ifitm double null cells was reflected in total steady state protein levels and in subcellular localization on the plasma membrane. first, immunofluorescence of hla-b was defined in parental siha and ifitm /ifitm double null cells. parental siha cells revealed significant induction of hla-b immunoreactivity h after ifnγ treatment (fig. c vs b and quantified in g).
by contrast, basal hla-b protein expression was attenuated in the ifitm /ifitm double null cells after ifnγ treatment (fig. f vs e). quantitation of the total immunofluorescence in the absence and presence of ifnγ, in the parental siha and ifitm /ifitm double null cells, also confirms attenuated hla-b induction in the null cell panel (fig. h vs g). this is consistent with the reduced protein synthesis observed for hla-b in the pulse silac quantitation in the ifitm /ifitm double null cells. hla-b is thought to localize predominantly to the cell surface, where it acts as an antigen presentation carrier. we therefore evaluated whether hla-b expression on the plasma membrane was altered in the ifitm /ifitm double null cell using facs analysis with non-permeabilized cells. two independent ifitm /ifitm double null cell clones were used as a form of biological replicate in comparison to the parental siha cell line. twenty-four hours post treatment, hla-b was elevated on the plasma membrane in the parental siha cell line (data not shown). quantitation revealed reduced levels of hla-b in both independent ifitm /ifitm double null biological replicates in the absence and presence of ifnγ (fig. i). these data indicate that the reduced synthesis of hla-b in the ifitm /ifitm double null cell (fig. ) has an impact on its subcellular localization at the plasma membrane.

fig. . methodological approach to identify signalling pathways altered in ifitm and ifitm double null cells. the indicated cells (parental siha, ifitm single null, or ifitm /ifitm double null) were pre-treated with carrier or ifnγ for h. the media was replaced with r k isotopically labeled media with carrier or ifnγ for or h. cells were harvested, lysed, and tryptic peptides processed for analysis by mass spectrometry (ms) as indicated in the experimental procedures.
we next examined whether there is any direct protein-protein interaction between ifitm and hla-b, since they are co-synthesized, have transmembrane localizations on the cell surface, and are both irds components. in vivo proximity ligation assays are emerging methodologies that have been shown to demonstrate the "association" of two endogenously expressed proteins in fixed cells without the need for harsh lysis [ ] . the method can be considered an in situ mimic of an "immunoprecipitation assay". proximity ligation assays can identify a protein-protein interaction/association at a distance of - nm, which is in the upper range of that observed using fret ( - nm). the methodology detects authentic endogenous proteins in situ and does not rely on transfected or artificially gfp-tagged protein vectors [ ] . we evaluated whether ifitm / and hla-b co-associate with this methodology, using antibodies to hla-b and mab-mhk (which can bind to both ifitm and ifitm proteins; supplementary fig. ). a significant protein-protein interaction was observed in the ifnγ treated cells (representative images; fig. a-d). these foci were absent in the ifitm /ifitm double null cells (fig. e-h). together, these data validate the pulse-silac data that identified hla-b as a downstream effector of ifitm / . isg was not easily visualized using immunofluorescence in situ (data not shown), nor could we identify a protein-protein association between ifitm / proteins and isg using proximity ligations (data not shown).

(figure legend fragment) ... table ). cells were incubated with ifn-γ for h in b, d and f. data were plotted as a function of log fold change of heavy/light peptide intensities. triplicates were represented in the x, y, and z-axis. in samples (g-l), representative peptides used for quantification in biological triplicates are highlighted to demonstrate a protein that is induced independent of ifitm /ifitm (stat , i and eif , j) and proteins that are ifitm / dependent (isg ; g, hla-b, l).
we thus used an independent assay for orthogonal validation of isg induction. we developed an sbp-tagged ifitm expression construct in order to ectopically express the protein in the parental siha cells and design methodologies for capturing ifitm associated proteins. the transfected sbp-ifitm could be detected migrating at a higher mass (due to the sbp tag) than endogenous ifitm / proteins in the parental siha cells (fig. a, lane vs lane ) and was specifically captured after affinity purification following expression in the ifitm /ifitm double null cell (fig. a table ). the data first highlight the detection of peptides belonging to a homologous sequence for ifitm / / proteins (fig. b, fig. c). since ifitm was our bait protein we presume that it was the detected isoform, serving as an internal positive control (fig. a). nevertheless, ifitm may also be interacting with ifitm or ifitm . the previously identified ifitm / / interacting protein vapa was also detected (fig. b and c; supplementary table ). the proteins with the most significant p-values were enriched in ifn-γ treated cells, relative to non-treated cells (fig. b and c; supplementary table ). these included several proteins involved in cell-cell or cell-matrix interactions, including cornifin, galectin- , desmocollin, jup, hornerin, and desmoglein (fig. c; supplementary table ). this is suggestive of a pathway interaction of ifitm with membrane-dependent cell-cell communications. the enrichment of higher-confidence targets in ifn-γ treated cells also suggests that ifn-γ treatment may be required to fully 'activate' the ifitm protein interaction landscape. interestingly, proteins related to ifnγ signalling were also detected: isg and hla-b (fig. b and c). immunoblotting also confirmed that isg ylation was attenuated in ifitm /ifitm double null cells after ifnγ stimulation (fig. d). by contrast, free isg protein remained equivalent in both cells (fig. d), suggesting that the isg synthesis detected using pulse silac is directly linked to ifn stimulated protein conjugation. the ratio of conjugated to free isg is summarized in fig. e. together, the orthogonal validation of hla-b and isg is consistent with the pulse silac data; ifitm / proteins are required for maximal induction of hla-b and isg ylation in ifnγ treated cells. we were unable to complement the ifitm and ifitm genes back into the double-null siha cells to stimulate isg ylation and hla-b protein levels in ifn-γ treated cells (data not shown). as such, we took an independent approach to define dominant ifitm -dependent signalling pathways and to determine whether these overlap with the signalling proteins identified using the pulse-silac experimental methods. although the majority of analyses identified hla-b and isg protein expression changes using ifitm /ifitm double null cells, we did create an ifitm single null cell that revealed similar reductions in isg and hla-b after ifnγ treatment (fig. d, g, and l). as such, we examined ifitm dependence in the steady-state proteome levels using ifitm targeted sirna treatment of parental siha cells (fig. a, lanes and vs and ) to determine whether attenuation of ifitm by this method gave rise to similar proteome changes. however, there is a caveat to this method. similar to plasmid transfection [ ] , double stranded rna can also induce an irf- dependent transcriptional interferon response [ ] .

(figure legend fragment) for quantitation, three independent assays were performed, and each assay had two independent biological replicates. for each assay, fluorescence was measured in at least cells per condition. fluorescence was measured using imagej software. statistical study was performed with -way anova and bonferroni correction (p-value < . ).
accordingly, we have also found that sirna transfection methodologies can induce an irf transcriptional response (data not shown). nevertheless, sirna is a powerful tool to determine gene dependencies in cell models.

table with selected binding proteins detected by mass spectrometry for sbp-ifitm enrichment without or with ifn-γ stimulation (from supplementary table ). (d). cells (parental siha, lane ; or ifitm /ifitm double null, lane ) were transfected with empty vector (as in a) and treated with ifnγ. samples were processed by immunoblotting with antibodies to isg . free, monomeric isg protein is highlighted, as well as conjugated isg . free isg protein is expressed at similar levels in both cells whilst conjugated isg is attenuated in the ifitm /ifitm double null cell (lane vs ). this suggests that isg ylation protein synthesis is linked to covalent conjugation under conditions where attenuated isg protein synthesis was observed using pulse silac (fig. ). antibodies to β-actin and ifitm / proteins were used as loading controls, as indicated. (e). free and conjugated isg were quantified using imagej software. the relative units (in a.u.) define expression as a function of free or conjugated isg in parental and ifitm /ifitm double null cells. the relative change in isg conjugation over free isg was . a.u. in parental cells. the relative change in isg conjugation over free isg was . a.u. in ifitm / double null cells.

when parental siha cells were treated for and h with control sirna or ifitm -targeted sirna, there was a striking and selective reduction in interferon-responsive proteins using label-free quantitative mass spectrometry at both time points (fig. b and supplementary table ). some of these proteins, including ifitm itself, are also components of the interferon-stimulated dna damage resistant signature of proteins [ , ]. interestingly, isg and hla-b are also lowered after the treatment of siha cells with ifitm targeted sirna.
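the comparison of conjugated to free isg described in the figure legend above (panel e) reduces to a ratio of two band intensities per cell line. the sketch below illustrates that arithmetic only; the function name and the densitometry readings are hypothetical placeholders, not values from this study, and the original quantification was done in imagej rather than in code.

```python
def conjugation_ratio(conjugated: float, free: float) -> float:
    """Return the conjugated/free band intensity ratio (both in arbitrary units)."""
    if free <= 0:
        raise ValueError("free band intensity must be positive")
    return conjugated / free


# Hypothetical densitometry readings in a.u. (placeholders, NOT data from the study).
bands = {
    "parental": {"conjugated": 8.0, "free": 4.0},
    "double_null": {"conjugated": 2.0, "free": 4.0},
}

# One ratio per cell line: how much conjugated signal there is per unit of free protein.
ratios = {
    cell: conjugation_ratio(b["conjugated"], b["free"])
    for cell, b in bands.items()
}

# Express the null-cell ratio relative to the parental ratio.
relative_to_parental = ratios["double_null"] / ratios["parental"]
```

normalising conjugated signal to free signal within each lane, rather than comparing raw conjugated intensities, keeps the comparison meaningful when free protein is present at similar levels in both cell lines, as reported in the legend.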
however, there is one notable difference between the pulse-silac (fig. ) and the sirna methodology (fig. b): stat and b m are ifitm / -independent as defined using pulse silac (fig. ), whilst stat and b m are ifitm -dependent using the sirna methodology (fig. b, supplementary table ). this might reflect the fact that the rna ligand (sirna) induces additional rna-activated pathways in the targeted sirna proteome screen compared with ifnγ alone.

fig. . the impact of ifitm targeted sirna on the steady-state proteome. (a). immunoblotting to demonstrate that ifitm -targeted sirna can attenuate ifitm / protein levels. cells were treated with the indicated control or targeted sirna for two time points to capture the overlap in the transient dynamics of the interferon signalling cascade. lysates were immunoblotted with the mab-mhk antibody (the mhk monoclonal antibody cross-reacts with a common n-terminal epitope in ifitm and ifitm , see supplementary fig. ) to quantify ifitm / protein and the loading control (gapdh), as highlighted. (b). evaluation of the total steady-state proteome in response to sirna targeting of ifitm in siha cells using swath-ms (data from supplementary table ). the impact of ifitm -targeted sirna treatment for and h. these time points were a point of focus since the sirna treatment activates the irf transcriptional response over these two time frames (data not shown). as such, the screen is conducted under experimental conditions in which we consider that the irf response is activated by rna treatment. the data from these two biological states are plotted as log fold change in protein levels (using swath-ms) as a function of either the or -hour time point. the key proteins whose steady-state levels were suppressed after ifitm -targeted sirna treatment are highlighted in red, in the lower left quadrant. (c). the ifitm signalling model. pulse labelling using silac methodologies identified stat as a dominant protein synthesized after ifnγ treatment. this forms an internal positive control and is consistent with the classic jak-stat response to ifnγ treatment. (i). stat can produce mrnas that are translated in response to ifnγ treatment, including irf and other interferon effectors such as b m. (ii). by contrast to stat and b m, some of the ifnγ stimulated factors are ifitm dependent, including mhc class i molecules and isg (fig. g and l). the sirna-mediated depletion of ifitm represents an orthogonal assay that identified reductions in isg and mhc class i molecules (fig. a). stat protein reflects a distinct mechanism of control by ifitm / . although pulse silac revealed that stat synthesis is ifitm / -independent (fig. ), the steady state levels of stat protein are reduced after targeted depletion of ifitm in siha cells (fig. a). these data suggest that turnover of stat protein might be dependent on ifitm / , but its synthesis is independent of ifitm / . however, these methodologies are complicated to compare directly, since the sirna methodology uses an intrinsic rna signal (double stranded rna) that stimulates irf but without exogenously added ifnγ, whilst the pulse silac used ifnγ without rna ligands. altogether, these data place ifitm / proteins as a coordinator of the synthesis and/or steady state levels of a subset of key players in the ifnγ response. the notable induction of mhc class i molecules and isg in an ifitm / dependent manner identifies a coordinated signalling pathway with potential clinical relevance (fig. ). the recent observation that lowered hla-a, hla-b, and hla-c alleles correlate with poor prognosis and enhanced metastatic growth in cervical cancers [ ] is further consistent with the existence of an ifitm / :hla signalling pathway regulating cervical cancer outcomes.
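the two-time-point screen described in the figure legend above plots a log fold change for each protein at each time point and highlights the proteins that fall in the lower left quadrant, i.e. those suppressed at both time points. a minimal sketch of that selection follows. the protein names, intensities, the use of base 2 for the logarithm, and the zero threshold are all assumptions made for illustration; none of them come from the swath-ms data.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """log2 ratio of targeted-sirna intensity over control-sirna intensity."""
    return math.log2(treated / control)


def lower_left_quadrant(early_tp, late_tp, threshold=0.0):
    """Return proteins whose log2 fold change is below `threshold` at BOTH time points.

    Each input maps protein name -> (control_intensity, treated_intensity).
    """
    hits = []
    for protein in sorted(early_tp.keys() & late_tp.keys()):
        x = log2_fold_change(early_tp[protein][1], early_tp[protein][0])
        y = log2_fold_change(late_tp[protein][1], late_tp[protein][0])
        if x < threshold and y < threshold:
            hits.append((protein, x, y))
    return hits


# Hypothetical intensities (placeholders, NOT data from the study).
early = {"protein_a": (100.0, 25.0), "protein_b": (100.0, 50.0), "protein_c": (100.0, 200.0)}
late = {"protein_a": (100.0, 50.0), "protein_b": (100.0, 400.0), "protein_c": (100.0, 100.0)}

suppressed = lower_left_quadrant(early, late)
```

requiring suppression at both time points is a conservative filter: a protein that drops at one time point but recovers or rises at the other (like the hypothetical `protein_b`) is excluded.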
in conclusion, the pulse silac methodology in the ifitm single null (fig. ) and the sirna treatment using ifitm targeted sirna in parental siha cells (fig. ) both gave rise to overlapping proteome changes; mainly isg and hla-b. these data suggest that ifitm might alone function as an effector to these two proteins. however, we cannot rule out a role for ifitm in the ifitm single null cell, since ifitm is also reduced by ifitm targeted sirna treatment of siha cells (fig. b) . thus, ifitm and ifitm might cooperate to mediate these effects. the ifitm protein family was identified in whole genome sirna screens as rna virus restriction factors [ ] . the molecular functions of the ifitm family in the anti-viral response are just beginning to be defined. yeast-two hybrid, with ifitm as a bait, was used as a methodology to discover new protein-protein interactions within the ifitm family that mediate viral restriction [ ] ; a protein-protein interaction was identified for ifitm and the vesicle-membrane-protein-associated protein a (vapa). the vapa interaction occurred with ifitm , ifitm , or ifitm and these three were not distinguished from each other in this assay. nevertheless, the study focused on defining how vapa interaction with ifitm results in reduction of vapa binding to oxysterol-binding protein (osbp), disrupting cholesterol homeostasis, and preventing viral maturation. as ifitm and ifitm are also reported to bind to vapa, it is not yet clear whether they play minor and/ or redundant roles in controlling cholesterol-mediated viral maturation [ ] . interestingly, vapa was also identified in our swath-ip using sbp-tagged ifitm (fig. ) , which is consistent with vapa being a dominant interactor for the ifitm family. in addition to vapa, palmitoylation of ifitm / / family members has been reported to be required for some anti-viral functions [ ] . 
whether the anti-viral associations with vapa or palmitoylation by the ifitm family impact on cancer growth control is not yet known. the ifitm family also has reported roles in oncogenic and/or pro-metastatic cancer cell growth. it is not known if the anti-viral and pro-oncogenic functions of the ifitm family overlap. within the ifitm family, ifitm is the most well documented to have pro-metastatic roles and it is the only ifitm family member with a partial residence in the plasma membrane [ ]. ifitm is also the only ifitm family member that is a component of the interferon-regulated dna damage resistance (irds) pathway [ ]. the irds contains a subset of approximately interferon responsive genes that are upregulated late in the viral response, upregulated during development of radiation resistance, and upregulated as a result of chronic exposure to lower levels of type i or type ii interferons. irds pathway expression is mediated by 'unphosphorylated' stat [ ] and can be stimulated by irf [ ]. interestingly, two irds genes (isg and hla-b) are induced after ifnγ exposure in an 'ifitm -dependent' manner (fig. c ). our unpublished data also indicate that ifitm single null cells are x-ray sensitive or cisplatin-sensitive, consistent with the hypothesis that ifitm expression is linked to chemo or irradiation resistance. the mechanisms whereby ifitm itself regulates such chemoresistant or pro-metastatic cell signalling events have not been defined. in this report, we begin to set up biochemical assays that could shed light on dominant cancer-associated functions of ifitm . we first aimed to identify a clinically relevant human cancer model to study the ifitm family of proteins. this required us to develop a pan-specific antibody that binds to both ifitm and ifitm proteins (supplementary fig. ). this resulted in a focus on cervical cancer (fig. ) that exhibited high, medium or no expression of ifitm / proteins (fig. f).
of particular interest was the inverse correlation between ifitm / protein expression and lymph node metastasis (fig. f), suggesting that loss of ifitm / protein expression correlates with evasion of the immune system. if ifitm / are 'pro-oncogenic', why would cervical cancer panels reveal an inverse correlation between ifitm / expression and cancer-positive lymph nodes in patients? it is now becoming apparent that there are two distinct modes of metastatic cell growth. the first is the more classically defined metastasis due to enhanced 'invasion and migration' to secondary tissue niches. the second represents metastasis due to cancer cell escape from immune surveillance [ ]. ifitm / -positive cancers might indeed be pro-invasive, depending on the microenvironment and tissue type, leading to poor clinical prognosis. the methodologies that highlighted such classic pro-metastatic roles for ifitm include over-expression by ectopic transfection of plasmid-encoded genes or attenuation of gene expression using targeted sirna [ ]. by such experimental approaches, ifitm does indeed promote cancer cell growth and/or 'cell invasion' (i.e. a model of metastatic growth) [ ]. consistent with this, immunohistochemistry of clinical material has often shown ifitm protein to be over-produced in cancers, and this often correlates with poor prognosis [ , ]. however, our data from cervical cancer (fig. ) indicate that there can be two distinct states of cervical cancer with respect to ifitm / expression. ifitm / negative cervical cancers might also be more pro-invasive due to immune escape. this is based on the data indicating that the most dominant proteins whose synthesis depends on ifitm / using the pulse silac methodology are in fact hla family members (figs. and ).
hla family members are components of the irds, they play a role in anti-viral immunity through the presentation of viral peptides through the mhc class i system, and hla expression is linked to immune rejection of cancer cells [ ] . this suggests that although ifitm / might be pro-oncogenic under some conditions, ifitm / -hla signalling would presumably function as a 'tumor suppressor' signal via engagement of cd + t-cells. such ifitm / and hla positive cancers might not metastasize because they produce neoantigen presenting mhc class i molecules that keep the primary tumor in a local chronic state of equilibrium with the immune system. ifitm / -negative cancers by contrast might be expected to produce lower amounts of mhc class i molecules following ifnγ stimulation ( fig. ) resulting in lowered neoantigen expression, immune escape, and metastasis. such hypotheses are consistent with two clinical observations in cervical cancers. first, an inverse correlation exists between ifitm / expression and lymph-node positive cancers (fig. ) . second, a recent study has highlighted that lowered mhc class i expression also predicts poor prognosis in cervical cancers [ ] . this data is also consistent with several studies that highlight elevated rates of metastatic cancer cell growth in vivo are inversely correlated to mhc class i expression including deletion of the mhc class i locus [ ] . although expression of ifitm / was detected in cancer cells (fig. ) , it is interesting that ifitm / protein expression was observed and confined to the basal 'stem cell' layer in representative normal tissue controls using immunohistochemistry (fig. e) . expression was not observed even on one cell layer up from the basal layer. in squamous human skin, this basal layer is reflective of p (a squamous stem cell transcription factor) and phosphorylated p positive stem cell populations after uv irradiation [ ] . 
thus, this expression might reflect a role for ifitm in squamous stem cell pluripotency. we do not have very many samples reflecting a 'normal' human cervical epithelium, but the few cases we have exhibit very strong staining in the basal layer (data not shown). during the course of these studies the human protein atlas has also populated their immunohistochemical library with expression patterns of ifitm protein in normal tissues. we can see in this library that ifitm protein is also confined to the "basal stem cell" layer in normal squamous oesophagus, cervix, and oral mucosa ( supplementary fig. ) . together, these data would suggest that ifitm might play a role in squamous stem cell physiology as its expression appears specifically confined to the basal layer and is not expressed in suprabasal normal squamous cells. as the 'normal' squamous cervical epithelium we have used is from patients undergoing screening for cervical cancer or dysplasia, we cannot rule out a role for hpv in this expression in normal cervical cells. however, as the oesophagus squamous epithelium also exhibits ifitm protein expression in the basal cells of normal squamous epithelium (protein atlas, supplementary fig. ) , and the normal oesophagus is not noted for hpv infection, we would suggest that the expression of ifitm protein in basal cells is not related to hpv status. the signalling pathways that might trigger basal ifitm protein expression in squamous stem cell populations are not precisely defined. however, at the mrna level, differential expression patterns of ifitm gene expression were identified in the uteri of mice and there were correlations between the patterns of ifitm gene expression and wnt/β-catenin expression [ ] . these data might suggest a role for wnt signalling as an upstream regulator of ifitm in stem cells. 
our focus on cervical cancer as a model to understand the cancer-associated role of ifitm / is interesting considering that ifitm /ifitm are themselves rna viral restriction factors and hpv infection is a risk factor in cervical cancer progression. there is evidence that ifitm and ifitm might be positive cofactors for the propagation of a dna virus, hpv [ ]. we do not have any data that defines the hpv status of the cervical cancers we have analyzed (fig. ), as the main clinically approved assay for diagnostics is p positive cancer cells [ ]. there is a close relation between persistent viral infection, development of cancer and failure in immune response. as an example, cervical as well as vulvar intraepithelial neoplasia are pre-cancerous conditions characterized by sustained hpv- infection. a clinical study shows a favorable prognosis related to an increase in ifnγ-producing cd + cytotoxic t cells following a robust cell response induced by vaccination [ ]. vaccination delivers a high dose of specific antigen against hpv- oncoproteins e and e and mediates mhc-binding peptide complex presentation [ ]. a similar outcome was also observed after vaccination of a preclinical mouse model of hpv positive cervical cancer [ ]. there is a significant correlation between mhc class i (not found for mhc class ii) expression on malignant cells and t-cell infiltration (til) in human ovarian cancer [ ], which is a positive prognostic factor. thus, ifitm and ifitm might have dual roles in stimulating hpv propagation, but also in suppressing cancer escape from immune surveillance. one of the key approaches we used to evaluate the role of ifitm in ifnγ dependent protein production was the utilization of crispr-cas guide rnas to create isogenic knock-out cells. at the outset, we relied less on the use of sirna to deplete ifitm , since this would give only a transient reduction in a target protein and since sirna itself can induce an interferon response (data not shown; [ ]).
this would have complicated our analysis of the interferon-responsive nature of any ifitm protein interactions we visualize and measure. nevertheless, sirna was used as a final orthogonal approach to define ifitm signalling events (fig. a, b). by contrast, a limitation of using gene knockout tools to reduce expression of a protein is that loss of the protein must not be a lethal event. in the case of ifitm and ifitm , knock-out mice are reported as viable [ ]; thus it was not unexpected that we were able to generate single ifitm null and ifitm /ifitm double null cell panels (figs. and ). however, we were unable to generate single ifitm null cells under experiments carried out in parallel to those reported in this manuscript (data not shown). using these isogenic cell panels, we were able to determine whether there were defects in ifnγ dependent protein synthesis using pulse silac methodologies. we chose to use a pulse-silac approach to identify the isotopically labeled tryptic peptides with the most significant fold change after ifnγ treatment and which are altered in the ifitm /ifitm double null or ifitm single null cells. the most significantly suppressed proteins in the ifitm /ifitm double null or ifitm single null cells after ifnγ treatment were mhc class i orthologues encoded by the hla-a, hla-b, and hla-c genes (fig. ). as rationalized above, the data suggested an inverse correlation between ifitm / protein and hla expression with metastatic growth in cervical cancer. the interferon-responsive protein isg was also attenuated in the ifitm /ifitm double null cells or ifitm single null cells after ifnγ treatment (fig. g). the fact that isg and hla-b are enriched in the sbp-ifitm protein affinity purification after ifnγ treatment (fig. ) suggests that a co-operative activity exists between the two proteins. indeed, as hla-b can interact in situ with ifitm / after ifnγ treatment (fig.
), and as isg is a high-confidence ifitm -associated protein using swath-ip mass spectrometry (fig. ), these data suggest that the two proteins are directly involved in the ifitm / dependent ifnγ response. isg is also a component of an interferon and immune responsive gene cluster that is suppressed by stem cell pluripotent gene product expression [ ], suggesting that, like hla suppression, isg suppression might be co-incident with immune escape. in the case of ifitm / signalling, we do not see defects in 'free' monomeric isg in the ifitm /ifitm double null cells, but reductions in the conjugation of higher molecular mass isg ylated adducts (fig. d). these data suggest that conjugation of proteins to isg during interferon stimulation might play a coordinated role in the ifitm / dependent immune-tumor cell interactions. overproduction of isg has been reported previously to stabilize ifitm [ ], consistent with our data that isg is detected in the sbp-ifitm complex (fig. ). ubiquitination of ifitm might counteract the stimulatory effect of isg on the anti-viral functions of the protein [ ]. how ubiquitination and isg ylation regulate ifitm and/or ifitm in a coordinated fashion is not defined. these data together provide a novel biochemical pathway relevant for cancer-associated functions of ifitm / that correlates with the interferon-responsive nature of ifitm / signalling; they can mediate ifnγ dependent protein production of mhc class i proteins and isg (fig. ), whilst the maintenance of stat protein in response to ifnγ involves a different signalling mechanism that is ifitm / -independent (fig. b ). both antigen presentation and isg ylation signalling events are important for anti-viral signalling as well as immune regulation of cancer cells at the immune-cancer synapse [ ] [ ] [ ] [ ] [ ] [ ].
further research will shed light on how reductions in hla and isg ylation can impact on both oncogenic signalling and/or anti-viral activity in response to ifitm / expression. supplementary data to this article can be found online at https:// doi.org/ . /j.cellsig. . . . interferon signalling network in innate defence mechanisms of type-i-and type-ii-interferon-mediated signalling human interferons alpha, beta and omega interferons at age : past, current and future impact on biomedicine jak-stat pathways and transcriptional activation in response to ifns and other extracellular signaling proteins transcriptional regulation of interferon-stimulated genes interferons and their stimulated genes in the tumor microenvironment interactions among genes, tumor biology and the environment in cancer health disparities: examining the evidence on a national and global scale molecular pathways: interferon/ stat pathway: role in the tumor resistance to genotoxic stress and aggressive growth an interferon-related gene signature for dna damage resistance is a predictive marker for chemotherapy and radiation for breast cancer progression of cancer from indolent to aggressive despite antigen retention and increased expression of interferon-gamma inducible genes expression of ifitm as a prognostic biomarker in resected gastric and esophageal adenocarcinoma a snapshot of microarray-generated gene expression signatures associated with ovarian carcinoma interferoninduced transmembrane protein (ifitm ) is required for the progression of colorectal cancer interferon-induced transmembrane protein (ifitm ) overexpression enhances the aggressive phenotype of sum inflammatory breast cancer cells in a signal transducer and activator of transcription (stat )-dependent manner ifitm promotes the metastasis of human colorectal cancer via cav- ifitm-family proteins: the cell's first line of antiviral defense the c-terminal sequence of ifitm regulates its anti-hiv- activity a membrane topology model 
for human interferon inducible transmembrane protein the ifitm proteins mediate cellular resistance to influenza a h n virus, west nile virus, and dengue virus the antiviral effector ifitm disrupts intracellular cholesterol homeostasis to block viral entry distinct patterns of ifitm-mediated restriction of filoviruses, sars coronavirus, and influenza a virus discriminating functional and non-functional p in human tumours by p and mdm immunohistochemistry improved vectors and genome-wide libraries for crispr screening protein turnover on the scale of the proteome turnover of the human proteome: determination of protein intracellular stability by dynamic silac targeted absolute quantitative proteomics with silac internal standards and unlabeled full-length protein calibrators (taqsi) the bradford method for protein quantitation cleavage of structural proteins during the assembly of the head of bacteriophage t sample preparation and digestion for proteomic analyses using spin filters universal sample preparation method for proteome analysis klf -mediated negative regulation of ifitm expression plays a critical role in colon cancer pathogenesis up-regulation of ng proteoglycan and interferon-induced transmembrane proteins and in mouse astrocytoma: a membrane proteomics approach knockdown of interferon-induced transmembrane protein (ifitm ) inhibits proliferation, migration, and invasion of glioma cells ifitm plays an essential role in the antiproliferative action of interferon-gamma differential gene expression identified in uigur women cervical squamous cell carcinoma by suppression subtractive hybridization down-regulation of ifitm and its growth inhibitory role in cervical squamous cell carcinoma structural and transcriptional analysis of human papillomavirus type sequences in cervical carcinoma cell lines transcriptional trans-activation by the human papillomavirus type e gene product ifitm limits the severity of acute influenza in mice defining the range of 
pathogens susceptible to ifitm restriction using a knockout mouse model evaluation of the variation in sample preparation for comparative proteomics using stable isotope labeling by amino acids in cell culture hinnebusch, eif loop interactions with met-trnai control the accuracy of start codon selection by the scanning preinitiation complex proximity ligation assays: a recent addition to the proteomics toolbox let there be light!, proteomes nucleofection of expression vectors induces a robust interferon response and inhibition of cell proliferation irf up-regulates isg gene expression in dsrna stimulation or csfv infection by targeting nucleotides − to − in the ′ flanking region the interferon-induced transmembrane proteins, ifitm , ifitm , and ifitm inhibit hepatitis c virus entry unphosphorylated stat prolongs the expression of interferon-induced immune regulatory genes irf and unphosphorylated stat cooperate with nf-kappab to drive il expression hallmarks of cancer: the next generation overexpression of ifitm has clinicopathologic effects on gastric cancer and is regulated by an epigenetic mechanism allele-specific hla loss and immune escape in lung cancer evolution classical and non-classical hla class i aberrations in primary cervical squamous-and adenocarcinomas and paired lymph node metastases ck -site phosphorylation of p is induced in deltanp expressing basal stem cells in uvb irradiated human skin characterisation of mouse interferon-induced transmembrane protein- gene expression in the mouse uterus during the oestrous cycle and pregnancy the antiviral restriction factors ifitm , and do not inhibit infection of human papillomavirus, cytomegalovirus and adenovirus a cocktail of p (ink a) and ki- , p (ink a) and minichromosome maintenance protein as triage tests for human papillomavirus primary cervical cancer screening vaccination against hpv- oncoproteins for vulvar intraepithelial neoplasia immunotherapy of established (pre)malignant disease by synthetic 
long peptide vaccines established human papillomavirus type -expressing tumors are effectively eradicated following vaccination with long peptides hla class i expression on human ovarian carcinoma cells correlates with t-cell infiltration in vivo and t-cell expansion in vitro in low concentrations of recombinant interleukin- suppression of dsrna response genes and innate immunity following oct , stella, and nanos overexpression in mouse embryonic fibroblasts e ubiquitin ligase nedd promotes influenza virus infection by decreasing levels of the antiviral protein ifitm s-palmitoylation and ubiquitination differentially regulate interferon-induced transmembrane protein (ifitm )-mediated resistance to influenza virus direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry free isg triggers an antitumor immune response against breast cancer: a new perspective beyond isglylation: functions of free intracellular and extracellular isg innate antiviral response targets hiv- release by the induction of ubiquitin-like protein isg understanding mhc class i presentation of viral antigens by human dendritic cells as a basis for rational design of therapeutic vaccines hla class i molecules consistently present internal influenza epitopes regulation of the e ubiquitin ligase activity of mdm by an n-terminal pseudosubstrate motif key: cord- -sjab zsk authors: mendez, aaron s; vogt, carolin; bohne, jens; glaunsinger, britt a title: site specific target binding controls rna cleavage efficiency by the kaposi's sarcoma-associated herpesvirus endonuclease sox date: - - journal: nucleic acids res doi: . /nar/gky sha: doc_id: cord_uid: sjab zsk a number of viruses remodel the cellular gene expression landscape by globally accelerating messenger rna (mrna) degradation. 
unlike the mammalian basal mrna decay enzymes, which largely target mrna from the ′ and ′ ends, viruses instead use endonucleases that cleave their targets internally. this is hypothesized to more rapidly inactivate mrna while maintaining selective power, potentially through the use of one or more targeting motifs. yet, how mrna endonuclease specificity is achieved in mammalian cells remains largely unresolved. here, we reveal key features underlying the biochemical mechanism of target recognition and cleavage by the sox endonuclease encoded by kaposi's sarcoma-associated herpesvirus (kshv). using purified kshv sox protein, we reconstituted the cleavage reaction in vitro and reveal that sox displays robust, sequence-specific rna binding to residues proximal to the cleavage site, which must be presented in a particular structural context. the strength of sox binding dictates cleavage efficiency, providing an explanation for the breadth of mrna susceptibility observed in cells. importantly, we establish that cleavage site specificity does not require additional cellular cofactors, as had been previously proposed. thus, viral endonucleases may use a combination of rna sequence and structure to capture a broad set of mrna targets while still preserving selectivity.

viral infection dramatically reshapes the gene expression landscape of the host cell. by changing overall messenger rna (mrna) abundance or translation, viruses can redirect host machinery towards viral gene expression while simultaneously dampening immune stimulatory signals ( ) ( ) ( ). suppression of host gene expression, termed host shutoff, can occur via a variety of mechanisms, but one common strategy is to accelerate degradation of mrna ( ) ( ) ( ). this occurs during infection with dna viruses such as alphaherpesviruses, gammaherpesviruses, and vaccinia virus, as well as with rna viruses such as influenza a virus and sars and mers coronaviruses ( , , ).
in the majority of these cases, a viral factor promotes endonucleolytic cleavage of target mrnas. this strategy bypasses the normally rate limiting steps of deadenylation and decapping to effect rapid mrna degradation by host exonucleases ( ) . virally encoded host shutoff endonucleases are usually specific for mrna, yet broad-acting in that they target the majority of the mrna population. this is exemplified by herpesviral nucleases, including the sox endonuclease encoded by kaposi's sarcoma-associated herpesvirus (kshv), an oncogenic human gammaherpesvirus that causes kaposi's sarcoma and b cell lymphoproliferative diseases ( , ) . kshv sox is a member of the pd-(d/e)xk type ii restriction endonuclease superfamily that possesses mechanistically distinct dnase and rnase activities ( ) ( ) ( ) . the rnase activity of the gammaherpesvirus sox protein has been shown to play key roles in various aspects of the viral lifecycle, including immune evasion, cell type specific replication, and controlling the gene expression landscape of infected cells ( ) ( ) ( ) ( ) . however, the mechanism by which sox targets mrnas remains largely unknown. sequencing data indicate that within the mrna pool there appears to be a range of sox targeting efficiencies; some transcripts are efficiently cleaved in cells, while others are partially or fully refractory to cleavage ( ) ( ) ( ) ( ) ( ) . additionally, sox has been shown to cut within specific locations of mrnas in cells, further emphasizing that there must be transcript features that confer selectivity ( , ) . indeed, a transcriptome-wide cleavage analysis indicated that sox targeting is directed by a relatively degenerate motif, often containing an unpaired polyadenosine stretch shortly upstream of the cleavage site, which is located in a loop structure ( ) . 
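the sequence component of the motif described above, a polyadenosine stretch shortly upstream of the cleavage site, can be illustrated with a simple scan. the sketch below is a hypothetical helper, not the analysis used in the cited transcriptome-wide study: the window size, the minimum run length, and the example sequence are assumptions chosen for illustration, and the code ignores rna secondary structure, so it cannot tell whether the stretch is unpaired or whether the cleavage site sits in a loop.

```python
def upstream_poly_a(rna: str, cleavage_index: int, window: int = 10, min_run: int = 3):
    """Find the longest run of 'A' within `window` nucleotides upstream of a cleavage site.

    Returns (start, end) as a half-open interval into `rna`, or None if no run
    of at least `min_run` adenosines is found. `window` and `min_run` are
    illustrative defaults, not values taken from the study.
    """
    rna = rna.upper()
    if not 0 <= cleavage_index <= len(rna):
        raise ValueError("cleavage_index is outside the sequence")
    start = max(0, cleavage_index - window)
    best = None
    run_start = None
    # Iterate one position past the window so a run ending at the site is closed.
    for i in range(start, cleavage_index + 1):
        is_a = i < cleavage_index and rna[i] == "A"
        if is_a and run_start is None:
            run_start = i
        elif not is_a and run_start is not None:
            length = i - run_start
            if length >= min_run and (best is None or length > best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best


# Hypothetical substrate with a cleavage site before index 9.
example_hit = upstream_poly_a("GCUAAAAGUCUGGC", cleavage_index=9)
example_miss = upstream_poly_a("GCUAGUCUGGC", cleavage_index=6)
```

a sequence-only scan like this would be expected to over-call candidate sites, which is consistent with the text's point that the motif is degenerate and that structural context also matters.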
cleavage within an unpaired loop was confirmed in a recent crystal structure of sox with rna, although additional contacts that could confer sequence specificity were not observed ( ) . thus, a major outstanding question is how rna sequence and/or structure contribute to sox target recognition. in this context, it is unclear how sequence features surrounding the rna cleavage site might impact sox targeting, for example by changing its affinity for a given rna or the efficiency with which cleavage occurs. to address these questions, we sought to reconstitute the sox cleavage reaction in vitro using purified components. using an rna substrate that is efficiently cleaved by sox in cells, we revealed that specific rna sequences within and outside of the cleavage site significantly contribute to sox binding efficiency and target processing. in particular, we found that the polyadenosine stretch adjacent to the cleavage site is critical for sox binding, and we experimentally verified the importance of an open loop structure surrounding the cleavage site. finally, we demonstrated that this in vitro system faithfully recapitulates the initial endonucleolytic cleavage event that is an essential component of mrna target specificity in vivo. collectively, our data reveal that specific sequence features potently impact sox binding, and thus provide key insight into the breadth of sox targeting efficiency observed across the transcriptome. more broadly, this information provides a framework for better understanding the target specificity of endonucleases, which play central roles in mammalian quality control processes and viral infection outcomes. kshv sox was codon optimized for sf expression and synthesized from genewiz. sox was then subcloned using restriction sites bamhi and sali (new england biolabs) into pfastbac htd. this vector was modified to carry a gst affinity tag and prescission protease cut site as described ( ) . 
all sox mutants were generated using single primer site-directed mutagenesis ( ) . sequences were validated using standard pgex forward and reverse primers. generation of viral bacmids and transfections were prepared as described in the bac-to-bac ® baculovirus expression system (thermo fisher scientific) manual. after transfection, sf cells (thermo fisher scientific) were grown for h at • c using sf- smf media (gibco) substituted with % fetal bovine serum (fbs) and % antibiotic antimycotic (aa). supernatant was transferred to a six-well tissue culture plate containing ml of × ∧ cells/well. cells were incubated for hr to generate passage (p ). the p supernatant was transferred to a flask containing ml of × cells/ml and incubated for h, a time point sufficient to yield mg of sox per ml of cells. protein expression was confirmed by western blot with an anti-gst antibody (ge health care life sciences). sf cell pellets were suspended in lysis buffer containing mm nacl, % glycerol, . % triton x- , mm dtt, mm hepes ph . with a complete, edta-free protease inhibitor cocktail tablet (roche). cells were sonicated on ice using a macro trip for s bursts with min rests for min at a. cell lysate was cleared using a pre-chilled ( • c) sorvall lynx superspeed centrifuge spun at rpm for min. the cleared lysate was incubated for h at • c with rotation with ml of a gst bead slurry (ge healthcare life sciences) that had been pre-washed × with wash buffer (wb) containing mm nacl, % glycerol, mm dtt, mm hepes ph . . the bead-protein mixture was washed × with ml of wb, then transferred to a ml disposable column (qiagen) and washed with an additional ml of wb followed by ml of low salt buffer (lsb) containing mm nacl, % glycerol, mm dtt, mm hepes ph . with periodic resuspension to prevent compaction. sox was then cleaved on column with prescission protease (ge healthcare life sciences) overnight at • c, and protein eluate was collected for a final volume of ml in lsb. 
cleaved protein was concentrated to ∼ ml using amicon filter concentrator membrane cut off kda (emd millipore), then loaded onto a hiload superdex s pg gel filtration column (ge healthcare life sciences). protein elutions were concentrated using an amicon concentrator described above to mg/ml and l aliquots were snap frozen in liquid n using nuclease-free . ml microfuge tubes (ambion life technologies) and stored at - • c. all rna substrates (sequences in supplementary table s ) unless stated otherwise were synthesized by dharmacon (ge healthcare) with hplc and page purification. rnas were end labeled with ␥ -[ p]-atp- ci/mmol mci/ml (perkin elmer) using t pnk (new england biolabs). rnas were end labeled with -[ p]-pcp ci/mmol mci/ml using t rna ligase (new england biolabs). labeled rna substrates were purified using % urea-page and were isolated from gel slices by incubating overnight at • c in a buffer containing mm tris-hcl, mm edta ph . . eluted rnas were ethanol precipitated and resuspended in rnase-free ddh o. k obs and hill coefficients of sox were determined from the cleavage kinetics of [ p]-labeled rna substrates as previously described ( ) . briefly, l (≤ pm) of [ p]-labeled rna was added to l of premixture containing mm hepes ph . , mm nacl, mm mgcl , mm tcep, % glycerol, and increasing concentrations of purified sox. reactions were performed at room temperature under single turnover conditions, and quenched at the indicated time intervals with l stop solution ( m urea, . % sds, . mm edta, . % xylene cyanol, . % bromophenol blue). samples were resolved by % urea-page, imaged using a typhoon variable mode imager (ge healthcare), and quantified using imagequant and gelquant software packages (molecular devices). the data were plotted and fit to exponential curves using prism software package (graphpad) to determine observed rate constants. 
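the exponential fit described above can be illustrated with a minimal sketch. it assumes a single-exponential decay of the uncleaved fraction under single turnover conditions, f(t) = exp(-k obs · t), and estimates k obs by a log-linear least-squares fit through the origin. this is a simplification chosen for illustration and is not the nonlinear fitting routine used in prism; the function name `fit_k_obs` and the time course values are hypothetical.

```python
import math


def fit_k_obs(times, fraction_uncleaved):
    """Estimate k_obs assuming fraction_uncleaved = exp(-k_obs * t).

    Taking logs gives -ln(f) = k_obs * t, a line through the origin,
    so the least-squares slope is sum(t * -ln f) / sum(t * t).
    """
    numerator = 0.0
    denominator = 0.0
    for t, f in zip(times, fraction_uncleaved):
        if f <= 0:
            continue  # log is undefined once the substrate is fully cleaved
        numerator += t * (-math.log(f))
        denominator += t * t
    if denominator == 0:
        raise ValueError("need at least one usable time point with t > 0")
    return numerator / denominator


# Hypothetical, noise-free time course (arbitrary time units), for illustration only.
times = [0.0, 1.0, 2.0, 4.0, 8.0]
true_rate = 0.25
fractions = [math.exp(-true_rate * t) for t in times]

k_obs = fit_k_obs(times, fractions)
print(f"estimated k_obs = {k_obs:.3f}")
```

with real gel quantification data, a nonlinear fit that also allows for a plateau or background term would usually be preferable, since the log transform amplifies noise at late time points.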
a fret probe with excitation at nm and emission at nm (limd flo) was purchased from dharmacon (supplementary table s ). the rna fret probe was added at a final concentration of nm to l of premixture containing mm hepes ph . , mm nacl, mm mgcl , mm tcep, % glycerol with m of sox ( ) . terminator exonuclease (lucigen) was added to reactions using a : dilution of the enzyme. reactions were quenched at indicated time intervals with equal volumes of stop solution containing % formamide and mm edta, then resolved using urea-page and visualized using a typhoon variable mode imager (ge healthcare). the data were plotted using prism software package (graphpad). all experiments were repeated > times and mean values were computed. for assays designed to detect endonucleolytic cleavage intermediates, l of labeled rna substrate was combined with l of reaction solution ( mm hepes ph . , mm nacl, . mm cacl, . mm mgcl , % glycerol, . mm tcep) in the presence or absence of m sox for min at room temperature. rna was then ethanol precipitated, resuspended in % formamide solution containing mm edta, and resolved on a % urea-page analytical grade sequencing gel together with a ss-rna decade ladder (ambion life technologies) for . h at w before imaging as described above. the sequence surrounding the cut site in limd was inserted into a pbssk (-) backbone using the bamhi and xbai restriction sites. mutations were introduced by the quikchange site directed mutagenesis protocol (agilent). the nt sequence surrounding the gfp cut site was inserted using the bamhi and xhoi restriction sites. in-line probing was performed as described previously ( ) . briefly, pbssk(-) plasmids containing the indicated sequences (see supplementary table s ) were linearized by digestion with xhoi and scai for gfp or blpi and saci (neb) for limd , gel purified, phenol/chloroform extracted, and ethanol precipitated. 
the fragments were then used as templates for in vitro transcription with the hiscribe t high yield rna synthesis kit (neb) and afterwards subjected to turbo dnase (ambion by life technologies) treatment. rna was resolved by % urea page, and full length transcripts were excised from the sybr gold stained gel (thermo fisher scientific), eluted overnight in g buffer ( mm tris hcl ph . , mm naoac, mm edta, . % sds), phenol/chloroform extracted, and ethanol precipitated. the rna (∼ pmol) was dephosphorylated using shrimp alkaline phosphatase (rsap, neb), labeled with l [␥ p] atp ( mci/ml) using usb optikinase (affymetrix), then gel purified as described above and dissolved in l of nuclease free water. for the in-line probing reaction, l rna (≥ cpm) was incubated in × reaction buffer ( mm tris-hcl ph . , mm mgcl , mm kcl) at room temperature for or h. the reaction was quenched with × loading buffer ( m urea, . mm edta ph . ). to generate ladders, l of the purified rna was separately subjected to hydrolysis using the next magnesium rna fragmentation module (-oh) or rnase t digestion (t ) (neb). reactions were resolved by % urea-page, exposed on a phosphorimager screen, and scanned using the storm imaging system (ge healthcare). deduced rna structures were drawn using the rna secondary structure visualization tool forna (vienna rna web services). rna probes used in emsa experiments were radiolabeled using the protocol described for ribonuclease activity assays. reactions were incubated at rt for min in buffer containing mm hepes ph . , mm kcl, mm cacl , . % tween- , . tcep, . mg/ml bsa (sigma-aldrich), g/ml of yeast trna (ambion thermo fisher), and the indicated amount of purified sox protein. calcium chloride was used in these binding assays to prevent substrate processing and stabilize rna-protein interactions. reaction volumes were kept at l and stopped with l × emsa loading dye ( mm hepes ph . , mm kcl, % glycerol). 
reactions were resolved by % native page, and gels were imaged on a typhoon multivariable imager (ge healthcare) and quantified using gelquant software package (molecular dynamics). limd - rna was end labeled with ␥ -[ p]-atp- ci/mmol mci/ml (perkinelmer) using t pnk (new england biolabs). rna was then gel purified as stated previously. emsa gel shifts were first used to determine optimal binding conditions (> % binding, homogeneous complexes of rna-protein). binding buffer contained . % tween (sigma-aldrich), mm cacl , mm kcl, mm nacl, . mm tcep, mm hepes ph . , . mg/ml yeast trna (ambion), . mg/ml nuclease free bovine serum albumin (bsa) (ambion). a dilution series of sox ( - . m) was incubated with l of radiolabeled limd - in the presence of . unit of rnase t (epicentre illumina). reactions were incubated at rt for a total of min before being ethanol precipitated. rna pellets were then resuspended in l of % formamide solution containing mm edta and boiled for min. samples were then loaded onto a % analytical grade urea-page gel and run at w for . h. gels were imaged and analyzed as stated above. in order to produce an rnase t ladder, l of limd - was incubated with . units of rnase t . reactions were incubated at rt for min before being quenched and prepared as stated previously. the limd - hydrolysis ladder was generated as stated in the in-line probing methods. rna probes ( end labeled with biotin) were synthesized from dharmacon (ge healthcare) and hplc and page purified (see supplementary table s ). the octet red e bio-layer interferometry instrument and streptavidin (sa) biosensors were available from fortebio (menlo park, ca, usa). all steps were performed in reaction buffer similar to emsa binding conditions. biosensors were incubated with nm of the biotinylated rna substrate for containing no rna. sox protein was incubated with the rna conjugated biosensors for - s in order to reach saturation. 
indicated protein concentrations for each biosensor are located on corresponding binding curves. complexes were dissociated for a minimum of min. response curves for each biosensor were normalized against biosensors conjugated to rna in the absence of sox (buffer only control). normalized response curves were processed using octet software version by fitting the group of selected biosensors to a nonlinear regression model ( ) . dissociation constants (k d ) were determined from k on and k dis values derived from the fitted curves. a complete table of all values is provided in supplementary table s . in cells, the mrna fragments resulting from the primary sox endonucleolytic cleavage are predominantly cleared by the host - exonuclease xrn , while in vitro, rna fragments are rapidly degraded by - exonucleolytic activity intrinsic to purified sox ( ) . thus, it has been challenging to analyze the initial endonucleolytic cleavage event that is an essential component of mrna target specificity in vivo. here, we sought to develop a biochemical system to address these questions. our prior analysis of sox targets in cells identified the human limd mrna, which codes for a protein essential for p body formation and integrity, as being highly susceptible to cleavage by sox ( ) . the minimum sequence required to directly cut the putative cleavage site in limd in cells was mapped to a -nucleotide segment (limd - ), and we therefore chose this as our model substrate to study sox targeting in vitro ( ) . we first expressed and purified kshv sox to greater than % purity from sf insect cells (supplementary figure s a ). using the limd - substrate, we plotted the observed rate constant (k obs ) as a function of sox concentration, yielding a hill coefficient of n = . ( figure a ). thus, in agreement with previous observations ( , ), sox appears to function predominantly as a monomer. 
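the two derived quantities mentioned above, the dissociation constant and the hill coefficient, can be sketched as follows. the sketch assumes the standard relation k d = k dis / k on and the hill equation k obs = k max · [e]^n / (k half^n + [e]^n), linearized as a hill plot. the function names, the assumption that k max is known from the plateau, and all numerical values are hypothetical and are not taken from the octet or prism analyses used in the study.

```python
import math


def dissociation_constant(k_on, k_dis):
    """Equilibrium dissociation constant from kinetic rate constants."""
    return k_dis / k_on


def hill_plot_fit(concentrations, rates, k_max):
    """Return (hill_coefficient, half_maximal_concentration).

    Uses the linearized Hill equation
        log10(k / (k_max - k)) = n * log10([E]) - n * log10(K_half),
    fitted by ordinary least squares. k_max is assumed to be known.
    """
    xs, ys = [], []
    for conc, rate in zip(concentrations, rates):
        if conc <= 0 or rate <= 0 or rate >= k_max:
            continue  # these points are undefined on a Hill plot
        xs.append(math.log10(conc))
        ys.append(math.log10(rate / (k_max - rate)))
    if len(xs) < 2:
        raise ValueError("need at least two usable data points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    k_half = 10 ** (-intercept / slope)
    return slope, k_half


# Hypothetical, noise-free data for a non-cooperative enzyme (n = 1).
concs = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
rates = [c / (2.0 + c) for c in concs]  # k_max = 1.0, K_half = 2.0
n, k_half = hill_plot_fit(concs, rates, k_max=1.0)

# Hypothetical rate constants: k_on in 1/(M*s), k_dis in 1/s.
k_d = dissociation_constant(k_on=1e5, k_dis=1e-3)
print(f"n = {n:.2f}, K_half = {k_half:.2f}, K_d = {k_d:.1e} M")
```

a hill coefficient close to one in such a fit is what would be expected for an enzyme acting as a monomer without cooperativity, which is the interpretation given in the text.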
under conditions of half maximal activity ( m; figure a ), sox displayed a strong preference for the 'hard' divalent metal mg + and a weaker preference for the 'softer' and larger metals mn + , co + and zn + ( figure b ). this is again consistent with other characterized members of the p/dexk family of enzymes ( , ) . notably, sox activity in the presence of mg + was inhibited in a dose-dependent manner upon competitive addition of ca + ( figure c and supplementary figure s d ). this is likely the result of increased coordination partners engaged by ca + , which decreases the ability of catalytic residues to promote proper base hydrolysis ( ) ( ) ( ) . finally, increasing the nacl concentration above mm led to substantially decreased sox activity ( figure d ), in accordance with the observation that high salt concentrations frequently inhibit nuclease activity by disrupting protein-protein or protein-substrate interactions ( ) . given that recombinant sox displays robust - exonuclease activity ( , ), we sought to confirm that limd - was subject to endonucleolytic sox cleavage, as this is the predominant event that directs mrna turnover in sox expressing cells ( , ) . both the and ends of limd - were blocked by capping the end with a cy fluorophore and the end with an iowa black quencher (limd - flo). we confirmed this rna was resistant to degradation by the -phosphate dependent exonuclease terminator ( figure e, lane ) . however, in the presence of sox, a cleavage product was observed that correlated with an endonucleolytic cut ( figure e, lane ) . to confirm this processing event was not a result of contamination, we purified a sox mutant containing mutations within two key residues of the sox active site (d n/e q). incubation of this mutant with limd - over the course of . h yielded no rna cleavage (supplementary figure s e ). thus, recombinant sox appears to target limd - for endonucleolytic cleavage in vitro, as has been observed for this substrate in cells. 
to analyze rna substrate selectivity using our in vitro assay, we first compared sox degradation of limd - to a -nucleotide sequence of the mrna encoding gfp (gfp- ). we have previously shown that gfp mrna is cleaved by sox in cells, and that gfp- is the minimal sequence required to elicit cleavage ( , ) . the cleavage sites for limd - and gfp- are predicted to occur in an open loop region (figure a, red arrow) . upon direct comparison of these two rnas, we observed a ∼ -fold increase in the catalytic efficiency of sox for the limd - substrate compared to gfp- ( figure b ). this difference was not exclusively due to the fact that the gfp substrate was slightly shorter than limd - , as sox also displayed a fold reduction of catalytic efficiency on a longer, nt gfp substrate (gfp- ; figure b ). electrophoretic mobility shift assays (emsa) further revealed a -fold increase in sox binding to limd - compared to gfp- ( figure c ). given that both substrates contain the requisite unpaired bulge at the predicted cleavage site (see figure a and supplementary figure s ), these observations suggest that additional sequence or structural features impact sox targeting efficiency on individual rnas. two sox point mutants, p s and f a, located in an unstructured region of the protein that bridges domains i and ii have been shown to be selectively required for its endonucleolytic processing of rna substrates (supplementary figure s a and s b) ( , ) . structural data indicate that residue f forms a stacking interaction with an adenine base in the rna, likely stabilizing the protein-rna interaction, while p is hypothesized to contribute to structural rearrangements required for f engagement ( ) . we purified both mutants to evaluate their relative rna processing and rna binding activity against the optimal limd - substrate. both mutants displayed purity and elution profiles similar to wild type (wt) sox (see supplementary figure s a-c) . 
however, the catalytic efficiency of each mutant was > -fold less than wt sox ( figure d ). furthermore, rna binding was severely perturbed; the binding kinetics of wt sox for limd - are in the single digit nanomolar range (k d = nm), while p s and f a display > log defects (k d = nm and nm, respectively) ( figure e and supplementary figure s a c). thus, the large defect in rna binding likely explains the decreased efficiency of rna processing. notably, while there was a dramatic decrease in the relative affinities of the two mutants for limd - , there was not a complete loss of binding or rna processing. this could be a result of secondary nonspecific interactions and/or nonspecific exonucleolytic degradation by sox from the monophosphorylated end of the probe. in silico rna folding predictions of sox targeting motifs, coupled with rna mutagenesis experiments, have indicated that an rna stem loop structure is an important determinant in sox targeting both in vitro and in vivo ( , ) . given the importance of this predicted motif, and in particular the proposed requirement for unpaired sequence at the cut site, we sought to experimentally determine the structure of limd - using chemical based in-line probing ( figure a ). this showed that the limd - structure contains a largely base paired stem region, followed by a loop at positions - that encompasses the predicted sox cleavage site between nt and , and a short hairpin structure at positions - ( figure b) . notably, some differences exist between the predicted and observed structures of limd - , including a larger loop region and the subsequent short stem-loop (compare figure b to figure a ). however, in both cases the predicted cleavage site of sox resides in a loop region. recently, a high-resolution crystal structure was solved of sox bound to a nt fragment of the kshv pre-microrna k - (k - ). 
in this structure, the only observed contacts between sox and k - occurred between the four active site residues of sox (y , r , c , f ) and the ugaag motif surrounding the cleavage site of the rna ( ) . it was therefore hypothesized that no other residues beyond this unpaired ugaag motif were involved in transcript recognition ( ) . however, the binding affinity we observed for limd - was -fold stronger than what was previously reported for k - ( ) , suggesting that a more extended interaction surface might distinguish optimal from sub-optimal rna substrates. we therefore used rna footprinting to map the sox binding sites on limd - . indeed, sox protected a region of limd - that included the three adenosine stretch (positions - ) from rnase t digestion in a dose dependent manner ( figure ) . notably, this mapped binding region is the same region predicted from in vivo pare-seq data to be important for sox targeting, although the reason for its importance remained unknown ( ) . we also observed a modest protection of base (g) located directly adjacent to the predicted cleavage site of sox, which represents the region detected in the crystal structure of k - bound to sox. collectively, these findings suggest that while sox may interact with residues directly adjacent to the cut site, a more extensive interaction interface exists for its preferred in vivo targets. to explore the importance of the residues involved in sox binding and cleavage, we engineered mutants of the limd - substrate ( figure a ). first, we preserved the loop structure but replaced the three adenosines bound by sox (residues - ) with guanosines (limd - xa-g). second, we largely abolished the loop structure by providing complementary base pairing (limd - zipper). third, we mutated the residue located at the predicted sox cut site that was also protected in the footprinting assay (limd - a-g). this mutant has been previously identified to block sox cleavage in vivo ( ) . 
real-time binding kinetics for sox with wt limd - and each of the three mutant substrates were then measured using bio-layer interferometry (bli). all rna probes were biotinylated and immobilized to a streptavidin-coated bli probe, whereupon the binding and dissociation of sox was measured. to prevent degradation of the probe, excess calcium ion was used in place of magnesium (supplementary figure s e) . sox retained similar binding affinity to the cut site mutant table s ). to rule out the possibility that the effect on binding affinity to the limd - zipper mutant was a result of altered residues within the binding site, we also engineered an additional zipper mutant (limd - zipper ) that did not disrupt the polyadenosine sequence. in agreement with the loop structure playing a critical role in target recognition, this limd - zipper mutant also displayed a substantial defect in binding (k d = . m; supplementary figure s a , b, supplementary table s ). finally, we measured sox binding to the kshv pre-mirna sequence used to obtain the sox-rna cocrystal structure (k - ) ( ) . notably, the affinity of sox for k - was within the range of the limd - structural mutants (k d = . m), suggesting that despite having an ugaag motif upstream of a predicted bulge, this is unlikely to be a sox target ( figure b , supplementary figure s e and supplementary table s ) . we next quantitatively measured the catalytic efficiency of sox towards each of the above rna substrates. despite sox having wt binding affinity for the predicted cleavage site mutant limd - a-g, there was a -fold defect in its ability to degrade this mutant ( figure c ). even more marked defects in sox catalytic efficiency were observed for the binding site mutant limd - xa-g, the loop mutants limd - zipper and limd - zipper , and the pre-mirna k - ( figure c and supplementary figure s ). 
collectively, these data indicate that efficient rna cleavage requires both an appropriate sox binding site and a suitable cut site. in cells, sox cleaves its mrna substrates site-specifically. mutagenesis of residues in mapped cleavage sites generally abolishes sox cleavage at that location ( ) . to determine if our in vitro assay faithfully recapitulated the site specificity of sox endonucleolytic targeting observed in cells, we established reaction conditions that enabled trapping of the early cleavage events. by combining ca + and mg + in our reaction buffer, we were able to sufficiently slow sox processing to visualize cleavage products derived from p labeled substrates. indeed, we observed a predominant nt band, which is the size of the product released upon limd - cleavage at the predicted cut site ( figure a , lane ). additional bands also appeared, likely representing subsequent processing events. importantly, when we incubated sox with the cut site mutant limd - a-g, there was a complete loss of this nt product, as well as the additionally processed intermediates ( figure a, lane ) . production of these cleavage intermediates required sox, as no decay was observed in the rna-only controls ( figure a , lanes - ). finally, we sought to verify that the predominant nt cleavage product we observed was a result of an endonucleolytic cleavage and not end processing. to this end, we generated a limd - substrate containing a p pcp label and a free oh to block end processing. again, in the presence of sox, wt limd - but not the a-g mutant produced a cleavage product whose size corresponded to cleavage at the predicted site ( figure b ). taken together, these data confirm that our in vitro assay faithfully recapitulates sox cleavage site specificity on a true substrate. 
endonuclease-directed mrna degradation plays key roles in the lifecycle of gammaherpesviruses, yet the fundamental principles governing target specificity by sox and other viral endonucleases are not well understood. here, through the development of the first biochemical system to faithfully recapitulate the internal cleavage specificity observed for sox in cells, we revealed how both rna sequence and structure contribute to targeting. these findings resolve a central feature of the current model of sox activity ( figure ). previous observations established that sequences flanking the cut site were required to direct cleavage by sox ( , ) . however, it was unresolved whether they played a strictly structural role in presenting an exposed loop for cleavage, served as a platform for sox binding, or created a binding site for one or more cellular factors that then indirectly recruited sox to its targets. through a combination of mutational analyses, rna structure probing, and rna footprinting assays, we showed that efficient sox targeting requires both an exposed loop structure and upstream sequences that serve as a sox binding platform. this combination of sequence and structural features within the targeting motif helps explain why some mrnas are efficiently cleaved by sox, whereas others are weaker substrates. a key open question related to sox function is how it can target the majority of mrnas in cells, yet with significant site specificity. our observations suggest that there must be specific mrna features that influence targeting. indeed, pare-seq analyses of cleavage intermediates in sox expressing cells revealed that cleavage sites were associated with a degenerate sequence motif ( ) . sequences proximal to the cleavage site were predicted to be un-base paired and frequently contained a polyadenosine stretch followed by a purine ( ) . the requirement for these sequence features for sox targeting was validated for the limd transcript in cells ( ) . 
because limd has been established as a particularly robust sox target in cells ( ) , we reasoned that it must contain features optimal for sox processing and therefore would be an ideal substrate to dissect biochemically why these features are important. indeed, sox binding to limd - was -fold better than to the commonly used reporter substrate gfp, and ∼ -fold better than to the k - pre-mirna, which has not been demonstrated to be processed by sox in cells. importantly, these binding differences correlated with the efficiency of sox cleavage in vitro, arguing that the ability to bind the targeting motif is a key step in target recognition. through rna footprinting assays, we were able to show that sox binds to a bulge structure proximal to the cleavage site containing the polyadenosine stretch previously predicted to be important for mrna cleavage by sox in cells ( ) . mutating either just the bulge structure (limd - zipper ) or maintaining the bulge but mutating the polyadenosine stretch (limd - xa-g) resulted in a ∼ -fold reduction in binding affinity, correlating with a dramatic decrease in cleavage efficiency. collectively, these data demonstrate that variability in the efficiency of sox targeting observed in cells is likely due to differences in rna sequences that mediate sox binding. a recent crystal structure of sox bound to the k - pre-mirna captured the importance of the exposed loop region for sox cleavage ( ) . however, the structure did not reveal additional interactions between sox and the rna beyond the three residues surrounding the cut site. our data suggest that this is likely because the k - rna lacks the additional residues necessary for sox binding site found in both limd and gfp. while the k - rna does contain adenosines upstream of the cleavage site, structural predictions indicate these residues are within a stem region ( ) , rather than in an exposed loop as is the case for limd and gfp. 
together, these observations indicate that while upstream adenosines are important for binding, they must be present in an unpaired state to promote sox binding. it is notable that prior studies reported much weaker interactions between sox and rna (k d = m) compared to its dna substrates (k d = m) ( , , ) . however, in these cases binding assays were conducted with scrambled rna sequences. we found that sox binding affinities to rna substrates vary over several orders of magnitude, in a manner that correlates with cleavage efficiency. interestingly, the crystal structure of sox bound to dna showed more dynamic interactions along the length of the protein (∼ Å interaction surface), when compared to the k - rna bound structure (∼ Å interaction surface). it is therefore possible that more interaction along the length of sox protein might occur with optimal substrates such as limd that are more tightly bound. the fact that purified sox endonucleolytically cleaved limd - at the precise site observed in sox-expressing cells demonstrates that cleavage site selection on an mrna is not mediated by a cellular cofactor. instead, targeting at particular rna motifs is strongly influenced by the strength of sox binding. our observation that the p s and f a sox mutants display significant rna binding defects indicates that their failure to cleave mrnas in cells is due to an inability to efficiently bind the targeting motif. figure . model of mrna targeting by sox. sox is able to distinguish mrna from other types of rna in cells by an as yet unknown mechanism. subsequently, it endonucleolytically cleaves its targets at specific sites, whereupon the fragments are degraded by host exonucleases such as xrn and dis l . 
here, we revealed that in addition to the requirement for an unpaired loop at the cleavage site, additional upstream rna sequences increase the affinity of sox for individual targets, thereby controlling cleavage efficiency. the mechanism by which sox initially distinguishes rna polymerase ii transcribed mrnas from other types of rna in cells remains an important open question, as this feature of sox selectivity is not preserved in vitro. we hypothesize that cellular co-factors, perhaps through interactions with sox, enable this distinction. more broadly, endonucleases are instrumental in rna processing and degradation. nuclease processing defects lead to several human pathologies ranging from cancer to neurodegeneration ( ) ( ) ( ) ( ) ( ) , and our study provides a framework for better understanding the mechanistic features governing endonuclease targeting. emerging roles for rna degradation in viral replication and antiviral defense modulation of the translational landscape during herpesvirus infection a common strategy for host rna degradation by divergent viruses a two-pronged strategy to suppress host protein synthesis by sars coronavirus nsp protein influenza a virus protein pa-x contributes to viral growth and suppression of the host antiviral and immune responses increasing incidence of cancers associated with the human immunodeficiency virus epidemic human herpesvirus- : kaposi sarcoma, multicentric castleman disease, and primary effusion lymphoma the exonuclease and host shutoff functions of the sox protein of kaposi's sarcoma-associated herpesvirus are genetically separable crystal structure of the shutoff and exonuclease protein from the oncogenic kaposi's sarcoma-associated herpesvirus crystal structure of a kshv-sox-dna complex: insights into the molecular mechanisms underlying dnase activity and host shutoff global mrna degradation during lytic gammaherpesvirus infection contributes to establishment of viral latency gammaherpesviral gene expression 
and virion composition are broadly controlled by accelerated mrna degradation host shutoff during productive epstein-barr virus infection is mediated by bglf and may contribute to immune evasion aberrant herpesvirus-induced polyadenylation correlates with cellular messenger rna destruction lytic kshv infection inhibits host gene expression by accelerating global mrna turnover coordinated destruction of cellular messages in translation complexes by the gammaherpesvirus host shutoff factor and the mammalian exonuclease xrn deep sequencing reveals direct targets of gammaherpesvirus-induced mrna decay and suggests that multiple mechanisms govern cellular transcript escape an rna element in human interleukin confers escape from degradation by the gammaherpesvirus sox protein nuclease escape elements protect messenger rna against cleavage by multiple viral endonucleases transcriptome-wide cleavage site mapping on cellular mrnas reveals features underlying sequence-specific cleavage by the viral ribonuclease sox kshv sox mediated host shutoff: the molecular mechanism underlying mrna transcript processing t cell costimulatory receptor cd is a primary target for pd- -mediated inhibition the unfolded protein response signals through high-order assembly of ire in-line probing analysis of riboswitches structure of the atp synthase catalytic complex (f( )) from escherichia coli in an autoinhibited conformation identification of new homologs of pd-(d/e)xk nucleases by support vector machines trained on data derived from profile-profile alignments crystal structures of lambda exonuclease in complex with dna suggest an electrostatic ratchet mechanism for processivity why do divalent metal ions either promote or inhibit enzymatic reactions? 
the case of bamhi restriction endonuclease from combined quantum-classical simulations cofactor-mediated conformational control in the bifunctional kinase/rnase ire the rna exosome and rna exosome-linked disease mutations of exosc /rrp p associated with neurological diseases impact ribosomal rna processing functions of the exosome in s. cerevisiae the rnase ii/rnb family of exoribonucleases: putting the 'dis' in disease nonsense-mediated mrna decay and cancer applying nonsense-mediated mrna decay research to the clinic: progress and challenges mfold web server for nucleic acid folding and hybridization prediction we thank members of the glaunsinger lab for their suggestions and critical reading of the manuscript. we would like to thank the university of california, berkeley tissue culture facility for sf cell maintenance and the university of california san francisco quantitative biosciences institute, antibiome center for use of their octet red e for binding kinetics measurements. nucleic acids research, , vol. , no. key: cord- -qffg r authors: wong, alan h. m.; tomlinson, aidan c. a.; zhou, dongxia; satkunarajah, malathy; chen, kevin; sharon, chetna; desforges, marc; talbot, pierre j.; rini, james m. title: receptor-binding loops in alphacoronavirus adaptation and evolution date: - - journal: nat commun doi: . /s - - -x sha: doc_id: cord_uid: qffg r rna viruses are characterized by a high mutation rate, a buffer against environmental change. nevertheless, the means by which random mutation improves viral fitness is not well characterized. here we report the x-ray crystal structure of the receptor-binding domain (rbd) of the human coronavirus, hcov- e, in complex with the ectodomain of its receptor, aminopeptidase n (apn). three extended loops are solely responsible for receptor binding and the evolution of hcov- e and its close relatives is accompanied by changing loop–receptor interactions. 
phylogenetic analysis shows that the natural hcov- e receptor-binding loop variation observed defines six rbd classes whose viruses have successively replaced each other in the human population over the past years. these rbd classes differ in their affinity for apn and their ability to bind an hcov- e neutralizing antibody. together, our results provide a model for alphacoronavirus adaptation and evolution based on the use of extended loops for receptor binding.
coronaviruses are enveloped, positive-stranded rna viruses that cause a number of respiratory, gastrointestinal, and neurological diseases in birds and mammals , . the coronaviruses all share a common ancestor, and four different genera (alpha, beta, gamma, and delta) have evolved that collectively use at least four different glycoproteins and acetylated sialic acids as host receptors or attachment factors [ ] [ ] [ ] . four coronaviruses, hcov- e, hcov-nl , hcov-oc , and hcov-hku , circulate in the human population and collectively they are responsible for a significant percentage of the common cold as well as more severe respiratory disease in vulnerable populations , . hcov- e and hcov-nl are both alphacoronaviruses and although closely related, they have evolved to use two different receptors, aminopeptidase n (apn) and angiotensin converting enzyme (ace ), respectively , . the more distantly related betacoronaviruses, hcov-oc and hcov-hku , are less well characterized and although hcov-oc uses -o-acetylsialic acid as its receptor , the receptor for hcov-hku has not yet been determined [ ] [ ] [ ] . recent zoonotic transmission of betacoronaviruses from bats is responsible for sars and mers, and in these cases infection is associated with much more serious disease and high rates of mortality [ ] [ ] [ ] . like hcov-nl , sars-cov uses ace as its receptor and the observation that mers-cov uses dipeptidyl peptidase highlights the fact that coronaviruses with new receptor specificities continue to arise.
the coronavirus spike protein (s-protein) is a trimeric singlepass membrane protein that mediates receptor binding and fusion of the viral and host cell membranes . it is a type- viral fusion protein possessing two regions, the s region that contains the receptor-binding domain (rbd) and the s region that contains the fusion peptide and heptad repeats involved in membrane fusion [ ] [ ] [ ] [ ] [ ] [ ] . the coronavirus s-protein is also a major target of neutralizing antibodies and one outcome of hostinduced neutralizing antibodies is the selection of viral variants capable of evading them, a process known to drive variation [ ] [ ] [ ] . as shown by both in vivo and in vitro studies, changes in host, host cell type, cross-species transmission, receptor expression levels, serial passage, and tissue culture conditions can also drive viral variation [ ] [ ] [ ] [ ] [ ] . rna viruses are characterized by a high mutation rate, a property serving as a buffer against environmental change . a host-elicited immune response, the introduction of antiviral drugs, and the transmission to a new species provide important examples of environmental change . nevertheless, the means by which random mutations lead to viral variants with increased fitness and enhanced survival in the new environment are not well characterized. given their wide host range, diverse receptor usage and ongoing zoonotic transmission to humans, the coronaviruses provide an important system for studying rna virus adaptation and evolution. the alphacoronavirus, hcov- e, is particularly valuable as it circulates in the human population and a sequence database of natural variants isolated over the past fifty years is available. moreover, changes in sequence and serology have suggested that hcov- e is changing over time in the human population [ ] [ ] [ ] . reported here is the x-ray structure of the hcov- e rbd in complex with human apn (hapn). 
the structure shows that receptor binding is mediated solely by three extended loops, a feature shared by hcov-nl and the closely related porcine respiratory coronavirus, prcov. it also shows that the hcov- e rbd binds at a site on hapn that differs from the site where the prcov rbd binds on porcine apn (papn), evidence of an ability of the rbd to acquire novel receptor interactions. remarkably, we find that the natural hcov- e sequence variation observed over the past fifty years is highly skewed to the receptor-binding loops. moreover, we find that the loop variation defines six rbd classes (classes i-vi) whose viruses have successively replaced each other in the human population. these rbd classes differ in their affinity for hapn and their ability to be bound by a neutralizing antibody elicited by the hcov- e reference strain (class i). taken together, our results provide a model for alphacoronavirus adaptation and evolution stemming from the use of extended loops for receptor binding.
characterization of the hcov- e rbd interaction with hapn. to define the limits of the hcov- e rbd, we expressed a series of soluble s-protein fragments and measured their affinity to a soluble fragment (residues - ) of hapn, the hcov- e receptor. the smallest s-protein fragment made (residues - ) bound hapn with an affinity (k d of . ± . µm) similar to that of the entire s region (residues - ) ( table , supplementary fig. a , b) and this fragment was used in the structure determination. to confirm the importance of the hcov- e rbd-hapn interaction for viral infection, we showed that both the rbd and the hapn ectodomain inhibited viral infection in a cell-based assay (fig. a, b, c) . crystals of the hcov- e rbd-hapn complex were obtained by co-crystallization of the complex after size exclusion chromatography.
table . analysis of the hapn ectodomain (residues - , wt and mutants) interaction with fragments of the hcov- e s-protein (wt and mutants) using surface plasmon resonance.
the crystallographic data collection and refinement statistics are shown in table . the asymmetric unit contains one hapn dimer (and associated rbds) and one hapn monomer (and associated rbd) that is related to its dimeric mate by a crystallographic two-fold rotation axis. both dimers (noncrystallographic and crystallographic) are found in the closed conformation and are essentially identical to that which we previously reported for hapn in its apo form (rmsd over all cα atoms of . Å). each apn monomer is bound to one rbd as shown in fig. a . the hcov- e rbd-hapn interaction buries Å of surface area on the rbd and Å on hapn. the hcov- e rbd is an elongated six-stranded β-structural domain with three extended loops (loop : residues - , loop : residues - , loop : residues - ) at one end that exclusively mediate the interaction with hapn (fig. b ). loop is the longest and it contributes~ % of the rbd surface buried on complex formation (figs. c and g). within loop , residues cys and cys form a disulfide bond that makes a stacking interaction with the side chains of hapn residues tyr and glu (fig. c) . the c s/c s rbd double mutant showed no binding to hapn at concentrations up to μm (table , supplementary fig. d , and supplementary table ), evidence of the importance of the stacking interaction and a likely role for the disulfide bond in defining the conformation of loop . notably, loop contains three tandemly repeated glycine residues (residues - ) whose nh groups donate hydrogen bonds to the side chain of asp and the carbonyl oxygen of phe of hapn (fig. c) ; mutation of hapn residue asp to alanine leads to a~ -fold reduction in affinity ( (fig. c) ; the importance of trp of loop is evidenced by the fact that mutating it also ablates binding (table , supplementary fig. f , and supplementary table ). hcov- e and prcov bind at different sites on apn. as with hcov- e, the porcine respiratory alphacoronavirus, prcov, also uses apn as its receptor . 
as our complex shows, hcov- e binds at a site on hapn (h-site) that differs from the site on papn (p-site) used by prcov (fig. a, b) . glu in hapn, a residue in the hapn-rbd interface, is an n-glycosylated asparagine (asn ) in papn and attempts to dock the hcov- e rbd at the h-site on papn leads to a steric clash with the n-glycan ( supplementary fig. a ). consistent with this observation, the hcov- e rbd cannot bind to a mutant form of hapn (e n/k e/q t) that possesses an n-glycan at position , as we have shown ( . across species, the sequence identity at the h-and p-sites is only~ % ( fig. c and supplementary fig. c ) and the receptor-binding loops of these viruses must be accommodating the remaining apn structural differences on receptors from species that they do not infect. together these results provide evidence that the extended receptor-binding loops of these alphacoronaviruses possess conformational plasticity. the observation that hcov- e and prcov bind to different sites on apn has important consequences. among species, apn is found in open/intermediate and closed conformations and conversion between them is thought to be important for the catalysis of its substrates , . the hcov- e rbd binds to hapn in its closed conformation and structural comparison shows that the h-site does not differ between the open and closed conformations. this is to be contrasted with the p-site of papn that differs in the open and closed conformations. indeed, the prcov rbd has recently been shown to bind to papn in the open conformation as a result of p-site interactions made possible in the open form . these differences in binding and receptor conformation are reflected in the fact that enzyme inhibitors that promote the closed conformation of apn block tgev infection , but not hcov- e infection , and the fact that the prcov s-protein , but not hcov- e , inhibits apn catalytic activity. the receptor-binding loops of hcov- e vary extensively. 
sequence data from viruses isolated over the past years provide a wealth of data on the natural variation shown by hcov- e ( supplementary fig. ). with reference to the hcov- e rbd-hapn complex reported here, we now show that % of the amino acids in the receptor-binding loops and supporting residues vary among the sequences analyzed ( sequences in total), while only % of the rbd surface residues outside of the receptor-binding loops show variation (fig. a, b) . moreover, for the eight variants where full genome sequences were reported, the receptor-binding loops represent the location at which the greatest variation in the entire genome is observed (fig. c) . analysis of the hcov- e rbd-hapn interface further shows that of the rbd surface residues that are fully or partially buried on complex formation, of them vary in at least one of the sequences analyzed and a pairwise comparison of the sequences suggests that many of these positions can vary simultaneously ( supplementary fig. ). finally, we show that the six invariant interface residues on the rbd (gly , gly , cys , cys , asn , and arg ) constitute only % of the viral surface area buried, the very region expected to be the most highly conserved from a receptor-binding standpoint. the remaining % (i.e., Å ) of the viral surface area buried is made up of residues that differ in their variability and the role they play in complex formation (supplementary table ).
fig. naturally occurring hcov- e sequence variation. (a) color-coded amino-acid sequence conservation index (chimera) mapped onto a ribbon representation of the hcov- e rbd. blue represents a high percentage sequence identity and red represents a low percentage sequence identity among the viral isolates analyzed. (b) surface representation in the same orientation as in (a, left), and rotated ° (right). the asn-glcnac moieties of the n-glycans are shown in stick representation. color coding as in a. (c) amino-acid sequence variation shown by the eight viral isolates whose entire genome sequences have been reported. the entire protein coding region of the viral genome was treated as a continuous amino acid string ( residues in total). amino acid differences among the eight sequences were analyzed in residue bins and for each bin the sum was plotted. green-colored bins correspond to residues in the s-protein and purple-colored bins correspond to residues in the rbd. the horizontal dotted line denotes the average number of amino-acid differences per bin across the protein-coding region of the whole viral genome. (d) alignment of the sequences selected for each of the six classes. the "|" symbol demarcates every residues in the alignment. (e) representative images showing hcov- e infection of l- cells in the presence of: pbs, monoclonal antibody . .e at two different concentrations, and monoclonal antibody . h at two different concentrations (anti-hcov-oc antibody). the nucleus is stained blue and green staining indicates viral infection. magnification (× ) and scale bar = µm. (f) statistical quantification of the monoclonal antibody inhibition experiment. error bars correspond to standard deviations obtained from three independent experiments.
loop variation leads to phylogenetic classes. phylogenetic analysis of the hcov- e rbd sequences found in the database showed that they segregate into six classes ( supplementary fig. ). class i contains the atcc- reference strain (originally isolated in and deposited in ) and related lab strains, while classes ii-vi represent clinical isolates that have successively replaced each other in the human population over time since the s. to characterize these classes, a representative sequence from each was selected; for class i, the rbd of the reference strain, also used in our structural analysis, was selected. to simplify characterization, the rbds of the other five classes were synthesized with the class i sequence in all but the loop regions (fig. d) .
as observed for class i, the other rbds do not bind to the hapn mutant that introduces an n-glycan at glu (supplementary fig. d) , an observation suggesting that they all bind at the same site on hapn. the rbds bound hapn with a ~ -fold range in affinity (k d from ~ to ~ nm). these differences in affinity are largely a result of differences in k off with little difference in k on (table and supplementary fig. ) . table shows the identity of the loop residues that have shown variation. of those buried in the rbd-hapn interface, residues , , and are particularly noteworthy as they undergo considerable variation in amino-acid character. residue , for example, accounts for % of the total buried surface area on complex formation and changes from gly to val to pro in the transition from classes i to vi. variation of this sort provides insight into how changes in receptor-binding affinity might be mediated during the process of viral adaptation. each of the six rbd classes was also characterized using a neutralizing mouse monoclonal antibody ( . e ) that we generated against the hcov- e reference strain (class i). as shown in fig. e, f, . e inhibits hcov- e infection of the l cell-line. this antibody binds to the class i rbd with a k d of nm (k on = . × m − s − , k off = . s − ) and as shown by a competition binding experiment, it blocks the rbd-hapn interaction ( supplementary fig. a, b) . in contrast, . e shows no binding to the other five rbd classes at a concentration of μm (supplementary fig. c ), strong evidence that the receptor-binding loops of the class i rbd are important for antibody binding and that loop variation can abrogate antibody binding. consistent with this observation, non-conserved amino-acid changes both within and outside of the rbd-hapn interface are observed across all classes (supplementary table ).
correlating structure and function with natural sequence data is a powerful means of studying viral adaptation and evolution.
to this end, we have delimited the hcov- e rbd and determined its x-ray structure in complex with the ectodomain of its receptor, hapn. we found that three extended loops on the rbd are solely responsible for receptor binding, and that these loops are highly variable among viruses isolated over the past years. a phylogenetic analysis also showed that the rbds of these viruses define six rbd classes whose viruses have successively replaced each other in the human population. the six rbds differ in their receptor-binding affinity and their ability to be bound by a neutralizing antibody ( . e ) and taken together, our findings suggest that the hcov- e sequence variation observed arose through adaptation and selection. antibodies that block receptor binding are a common route to viral neutralization and exposed loops are known to be particularly immunogenic . loop-binding neutralizing antibodies are elicited by the alphacoronavirus tgev , and the receptorbinding loops of hcov- e mediate the binding of the neutralizing antibody, . e . as shown by the sequences of the viral isolates analyzed, the rbds differ almost exclusively in their receptor-binding loops. . e blocks the hapn-rbd interaction and it can only bind to the rbd (class i) found in the virus that elicited it. this observation shows that loop variability can abrogate neutralizing antibody binding. indeed, the successive replacement or ladder-like phylogeny observed, when the sequence of the hcov- e rbd is analyzed, is characteristic of immune escape as shown by the influenza virus , . taken together, our results suggest that immune evasion contributes to if not explains the extensive receptor-binding loop variation shown by hcov- e over the past years. hcov- e infection in humans does not provide protection against different isolates , and viruses that contain a new rbd class that cannot be bound by the existing repertoire of loop-binding neutralizing antibodies provide an explanation for this observation. 
neutralizing antibodies that block receptor binding can also be thwarted by an increase in the affinity/avidity between the virus and its host receptor. increased receptor-binding affinity/avidity allows the virus to more effectively compete with receptor-blocking neutralizing antibodies, a mechanism thought to be important for evading a polyclonal antibody response . in addition, an optimal receptor binding affinity is thought to exist in a given environment. as such, adaptation in a new species, changes in tissue tropism, and differences in receptor expression levels can all lead to changes in receptor binding affinity , , .
values after ± correspond to the residual standard deviation reported by scrubber . two experiments were performed.
recent cryoem analysis has shown that the receptor-binding sites of hcov-nl , sars-cov, mers-cov, and by inference hcov- e, are inaccessible in some conformations of the prefusion s-protein trimer [ ] [ ] [ ] [ ] [ ] . although the ramifications of this structural arrangement are not yet clear, restricting access to the binding site has been proposed to provide a means of limiting b-cell receptor interactions against the receptor-binding site . how this might work in mechanistic terms is also not clear given the need to bind receptor. however, in a simple model, the inaccessible s-protein conformation(s) would be in equilibrium with a less stable (higher energy) but accessible s-protein conformation(s). the energy difference between these conformations is a barrier to binding that decreases equally the intrinsic free energy of binding of both the viral receptor and the b-cell receptor, so relative binding energies may be the key. both soluble hapn and antibody . e can inhibit hcov- e infection in a cell-based assay, an indication that their binding energies (k d of and nm, respectively) are sufficient to efficiently overcome the barrier to binding.
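the simple two-state model described above can be made concrete with a short sketch. it assumes that a ligand (viral receptor or b-cell receptor) binds only the accessible conformation and that the two conformations interconvert at equilibrium; under those assumptions the apparent k d equals the intrinsic k d multiplied by (1 + k conf), where k conf = exp(Δg / rt) is the ratio of inaccessible to accessible s-protein. the function names and every numerical value below are hypothetical and chosen only for illustration; none of them comes from the measurements reported here.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def closed_open_ratio(delta_g_kj_per_mol, temperature_k=298.15):
    """Equilibrium ratio [inaccessible]/[accessible].

    delta_g_kj_per_mol is the (assumed) free-energy gap between the less
    stable accessible conformation and the inaccessible one.
    """
    return math.exp(delta_g_kj_per_mol / (R_KJ_PER_MOL_K * temperature_k))


def apparent_kd(intrinsic_kd, delta_g_kj_per_mol, temperature_k=298.15):
    """Apparent K_d when the ligand binds only the accessible conformation."""
    k_conf = closed_open_ratio(delta_g_kj_per_mol, temperature_k)
    return intrinsic_kd * (1.0 + k_conf)


def fraction_bound(ligand_conc, intrinsic_kd, delta_g_kj_per_mol,
                   temperature_k=298.15):
    """Fraction of S-protein sites occupied at a given free ligand concentration."""
    kd_app = apparent_kd(intrinsic_kd, delta_g_kj_per_mol, temperature_k)
    return ligand_conc / (ligand_conc + kd_app)


if __name__ == "__main__":
    # Hypothetical ligands: a tight binder and a weak, pre-maturation binder.
    gap = 10.0  # kJ/mol, illustrative only
    for name, kd in [("tight binder", 1e-8), ("weak binder", 1e-5)]:
        occupancy = fraction_bound(1e-7, kd, gap)
        print(f"{name}: apparent K_d = {apparent_kd(kd, gap):.3g} M, "
              f"occupancy at 100 nM = {occupancy:.3f}")
```

because the factor (1 + k conf) is the same for every ligand, the barrier weakens all binders by the same free-energy increment and leaves the ratio of their affinities unchanged. in this sketch a high-affinity ligand can still reach appreciable occupancy while a weak binder cannot, which is the qualitative argument made in the text.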
however, b-cell receptors bind their antigens relatively weakly prior to affinity maturation and they would be much less able to do so. the dynamics of the interconversion between accessible and inaccessible conformations may also be a factor in the recognition of inaccessible antibody epitopes , , and further work will be required to establish if and how restricting access to the receptor binding site enhances coronavirus fitness. the cryoem structures also show that the receptor-binding loops make intra-and inter-subunit contacts in the inaccessible prefusion trimer. this suggests the intriguing possibility that the magnitude of the energy barrier, or the dynamics of the interconversion between accessible and inaccessible conformations, might be modulated by loop variation during viral adaption. immune evasion and cross-species transmission involve viral adaptation and we posit that the use of extended loops for receptor binding represents a strategy employed by hcov- e and the alphacoronaviruses to mediate the process. such loops can tolerate insertions, deletions, and amino acid substitutions relatively free of the energetic penalties associated with the mutation of other protein structural elements. indeed, our analysis of the six rbd classes shows that the receptor-binding loops possess a remarkable ability to both accommodate and accumulate mutational change while maintaining receptor binding. among the six classes, % of the loop residues show change and only % of the receptor interface buried on receptor binding has been conserved. as we have shown, variation in the receptorbinding loops can abrogate neutralizing antibody binding and it will also increase the likelihood of acquiring new receptor interactions by chance. in this way, the selection of viral variants capable of immune evasion and/or cross-species transmission will be facilitated , , [ ] [ ] [ ] . 
cross-species transmission involves the acquisition of either a conserved (i.e., a similar interaction with a homologous receptor) or a non-conserved receptor interaction (i.e., an interaction with a non-homologous receptor, or an interaction at a new site on a homologous receptor) in the new host. hcov- e binds to a site on hapn that differs from the site where prcov binds to papn (fig. a, b) , and hcov-nl is known to bind the nonhomologous receptor, ace . clearly, conserved receptor interactions have not accompanied the evolution of these alphacoronaviruses ( fig. d-g) . in mechanistic terms, receptor-binding loop variability and plasticity would facilitate the acquisition of both conserved and non-conserved receptor interactions. however, compared to conserved receptor interactions, the successful acquisition of non-conserved interactions would be expected to be relatively infrequent and more likely to require viral replication and mutation in the new host to optimize receptor-binding affinity. many coronaviruses have originated in bats , and it is tempting to speculate that viral transmission between bats has facilitated the emergence of non-conserved receptor interactions. bats account for~ % of all mammalian species and they possess a unique ecology/biology that facilitates viral spread between them , . moreover, the barriers to viral replication in a new host are lower among closely related species , . it follows that the viral replication required to optimize non-conserved receptor interactions in the new host would be facilitated by transmission between closely related bat species. by a similar reasoning, the use of conserved receptor interactions requiring little optimization would facilitate large species jumps. several bat coronaviruses showing a high degree of sequence similarity with hcov- e have recently been identified , and an analysis of how they interact with bat apn will inform this discussion. 
predicting the emergence of new viral threats is an important aspect of public health planning and our work suggests that rna viruses that use loops to bind their receptors should be viewed as a particular risk. rna viruses are best described as populations , and extended loops-inherently capable of accommodating and accumulating mutational change-will enable populations with loop diversity. such populations will provide routes to escaping receptor loop-binding neutralizing antibodies, optimizing receptor-binding affinity, and acquiring new receptor interactions, interrelated processes that drive viral evolution and the emergence of new viral threats. protein expression and purification. the soluble ectodomain of hapn (residues - ) was expressed and purified from stably transfected hek s gnt -cells (atcc crl- ) as described previously . the various soluble forms of the hcov- e s-protein were expressed and purified from stably transfected hek s gnt -cells for x-ray crystallography, and from hek t (atcc crl- ) and/or hek f (invitrogen - ) cells for cell-based and biochemical characterization, as described previously . point mutations were generated using the infusion hd site-directed mutagenesis protocol (clontech). in all cases, the target proteins were secreted as n-terminal protein-a fusion proteins with a tobacco etch virus (tev) protease cleavage site following the protein-a tag. harvested media was concentrated -fold and purified by igg affinity chromatography (igg sepharose, ge). the bound proteins were liberated by on-column tev protease cleavage and further purified by anion exchange chromatography (hitrap q hp, ge). protein crystallization. the rbd of the s-protein of hcov- e (residues - ) and the soluble ectodomain of hapn (residues - ) were mixed in a ratio of . : (rbd:hapn) and the complex was purified by superdex (ge) gel filtration chromatography in mm hepes, mm nacl, ph . . 
the complex was concentrated in gel filtration buffer to mg/ml for crystallization trials. crystals were obtained by the hanging drop method using a : mixture of stock protein and well solution containing % peg , mm gssg, mm gsh, % glycerol, µg/ml endo-β-n-acetylglucosaminidase a and mm mes, ph . at k. crystals were typically harvested after days and flash-frozen with well solution supplemented with . % glycerol as cryoprotectant.
data collection and structure determination. diffraction data were collected at the canadian light source, saskatoon, saskatchewan (beamline cmcf- id- ) at a wavelength of . Å. data were merged, processed, and scaled using hkl ; % of the data set was used for the calculation of r free . phases were obtained by molecular replacement using the human apn structure as a search model (pdb id: fyq) with phaser in phenix . manual building of the hcov- e rbd was performed using coot . alternate rounds of manual rebuilding and automated refinement using phenix were performed. secondary structural restraints and torsion-angle non-crystallographic symmetry restraints between the three monomers in the asymmetric unit were employed. ramachandran analysis showed that % of the residues are in the most favored region, with % in the additionally allowed region. data collection and refinement statistics are found in table . a stereo image of a portion of the electron density map in the hcov- e-hapn interface is shown in supplementary fig. . figures were generated using the program chimera . buried surface calculations were performed using the pisa server.
surface plasmon resonance binding assays. surface plasmon resonance (biacore) assays were performed on cm- dextran chips (ge) covalently coupled to the ligand via amine coupling. the running and injection buffers were matched and consisted of mm nacl, . % tween- , . mg/ml bsa, and mm hepes at ph . . response unit (ru) values were measured as a function of analyte concentration at k.
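as an illustration of how response values of this kind relate to the kinetic constants quoted in the results, the sketch below implements the simple one-to-one (langmuir) interaction model that is commonly used for surface plasmon resonance data. it is not the fitting procedure of the analysis software, and the choice of a one-to-one model, the function names, and all numbers are assumptions made for this example only.

```python
import math


def dissociation_constant(k_on, k_off):
    """K_d (M) from k_on (1/(M*s)) and k_off (1/s)."""
    return k_off / k_on


def equilibrium_response(conc, kd, r_max):
    """Steady-state response (RU) at analyte concentration conc (M)."""
    return r_max * conc / (conc + kd)


def association_response(t, conc, k_on, k_off, r_max):
    """Response (RU) at time t (s) after analyte injection starts."""
    k_obs = k_on * conc + k_off  # observed rate constant of approach to equilibrium
    r_eq = equilibrium_response(conc, dissociation_constant(k_on, k_off), r_max)
    return r_eq * (1.0 - math.exp(-k_obs * t))


def dissociation_response(t, r_start, k_off):
    """Response (RU) at time t (s) after the analyte is washed out."""
    return r_start * math.exp(-k_off * t)


if __name__ == "__main__":
    # Two hypothetical analytes with the same k_on but a 10-fold different k_off.
    k_on = 1.0e5
    for k_off in (1.0e-3, 1.0e-2):
        kd = dissociation_constant(k_on, k_off)
        r_60s = association_response(60.0, 1.0e-7, k_on, k_off, r_max=100.0)
        print(f"k_off = {k_off:g} 1/s -> K_d = {kd:.1e} M, R(60 s) = {r_60s:.1f} RU")
```

in this model a change in k off alone changes k d proportionally, which is the pattern reported for the rbd classes, where affinity differences arise mainly from k off rather than k on.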
kinetic analysis was performed using the global fitting feature of scrubber (biologic software) assuming a : binding model. for experiments using hapn as a ligand, between and ru were coupled to the cm- dextran chips. for experiments using . e , ru was immobilized. viral inhibition assay. hcov- e was originally obtained from the american type culture collection (atcc vr- ) and was produced in the human l cell line (atcc ccl ) which was grown in minimum essential medium alpha (mem-α) supplemented with % (v/v) fbs (paa). the l ( × ) cells were seeded on coverslips and grown overnight in mem-α supplemented with % (v/v) fbs. for inhibition assays in the presence of soluble hapn, wild-type hcov- e ( . tcid ) was pre-incubated with the fragment (residues - ) diluted in pbs for one hour at °c before being added to cells for h at °c. for inhibition assays in the presence of the soluble sprotein fragments, the different fragments, diluted in pbs, were added to cells and kept at °c on ice for h. medium was then removed and cells were inoculated with wild-type hcov- e ( tcid ) for h at °c. for both inhibition assays, after the -h incubation period, medium was replaced and cells were incubated at °c with fresh mem-α supplemented with % (v/v) fbs for h before being analyzed by an immunofluorescence assay (ifa). cells on the coverslips were directly fixed with % paraformaldehyde (pfa %) in pbs for min at room temperature and then transferred to pbs. cells were permeabilized in cold methanol (− °c) for min and then washed with pbs for viral antigen detection. the s-protein-specific monoclonal antibody, - h. , raised against hcov- e (igg , produced in our laboratory by standard hybridoma technology), was used in conjunction with an alexafluor- -labeled mouse-specific goat antibody (life technologies a- ), for viral antigen detection . after three washes with pbs, cells were incubated for min with dapi (sigma-aldrich) at µg/ml to stain the nuclear dna. 
to determine the percentage of l- cells positive for the viral s-protein, fields containing a total of - cells were counted, at a magnification of × using a nikon eclipse e microscope, for each condition tested in three independent experiments. green fluorescent cells were counted as s-protein positive and expressed as a percentage of the total number of cells. statistical significance was estimated by an analysis of variance (anova) test followed by tukey's post hoc test. monoclonal antibodies (igg , produced in our laboratory by standard hybridoma technology) raised against hcov- e ( . e ) or hcov-oc ( . h , negative control) that were found to be s-protein specific were tested in an infectivity neutralization assay. wild-type hcov- e ( . tcid ) was preincubated with the antibodies ( / of hybridoma supernatant) for h at °c before being added to l- cells for h at °c. cells were washed with pbs and incubated at °c with fresh mem-α supplemented with % fbs (v/v) for h before being analyzed by an immunofluorescence assay (ifa). statistical significance was estimated by an anova test, followed by post hoc dunnett (two-sided) analysis.

comparative sequence analysis of hcov- e viral isolates. the protein sequence of the hcov- e p e isolate rbd (residues - ) was used to perform a search of the non-redundant protein sequence database using blastp. the residue-specific sequence conservation index was mapped onto the surface of the rbd using the "render by conservation" tool in chimera . percentage identity is mapped using a color scale with blue indicating % identity and red indicating % identity. the protein-coding regions of the eight sequences for which the entire genome was reported (genbank identifier numbers: nc_ . , jx . , jx . , kf . , kf . , kf . , af . , and ku . ) were aligned using muscle. the entire protein-coding region of the viral genome was treated as a continuous amino-acid string ( residues in total).
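As an illustration of how differences along such a concatenated alignment can be tallied, the sketch below counts alignment columns that are not identical across all sequences and groups the counts into fixed-width bins. It is a minimal sketch, not the authors' script: the function name, the toy sequences and the bin width are hypothetical, and it assumes that a gap character simply counts as a difference.

```python
def count_differences_in_bins(aligned, bin_size):
    """Count alignment columns that are not identical across all sequences.

    aligned:  list of equal-length strings, one per aligned sequence.
    bin_size: number of alignment columns per bin (hypothetical choice).
    Returns one count per consecutive bin.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    n_bins = (length + bin_size - 1) // bin_size   # last bin may be shorter
    counts = [0] * n_bins
    for pos in range(length):
        residues = {seq[pos] for seq in aligned}
        if len(residues) > 1:                      # any mismatch counts once
            counts[pos // bin_size] += 1
    return counts


# Toy aligned strings (not real viral sequences).
toy = ["MKTAYIAKQR", "MKTAYLAKQR", "MRTAYIAKQK"]
print(count_differences_in_bins(toy, 5))  # prints [1, 2]
```

Counting each variable column once, regardless of how many sequences differ there, is one reasonable reading of "counted as a difference"; other conventions would need a different tally.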
protein residues that were not identical among the eight sequences were counted as a difference and plotted in residue bins. the sequence aak . was chosen as the representative of class i and the loop sequences of abb . , abb . , abb . , abb . , and afr . were combined with the non-loop sequences of aak . to generate the rbds of classes (ii-vi), respectively.

data availability. coordinates and structure factors for the hcov- e rbd in complex with human apn were deposited in the protein data bank with pdb id: atk. the authors declare that all other data supporting the findings of this study are available within the article and its supplementary information files, or are available from the authors upon request.

received: may accepted: october

the work was supported by cihr operating grants to j.m.r. and p.j.t. and a canada research chair to p.j.t. the canadian light source is acknowledged for synchrotron data collection. supplementary information accompanies this paper at doi: .
/s - - -x. competing interests: the authors declare no competing financial interests. open access this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. to view a copy of this license, visit http://creativecommons.org/licenses/by/ . /.

key: cord- -k ev qkn authors: janosevic, danielle; myslinski, jered; mccarthy, thomas; zollman, amy; syed, farooq; xuei, xiaoling; gao, hongyu; liu, yunlong; collins, kimberly s.; cheng, ying-hua; winfree, seth; el-achkar, tarek m.; maier, bernhard; ferreira, ricardo melo; eadon, michael t.; hato, takashi; dagher, pierre c. title: the orchestrated cellular and molecular responses of the kidney to endotoxin define the sepsis timeline date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: k ev qkn

clinical sepsis is a highly dynamic state that progresses at variable rates and has life-threatening consequences. staging patients along the sepsis timeline requires a thorough knowledge of the evolution of cellular and molecular events at the tissue level.
here, we investigated the kidney, an organ central to the pathophysiology of sepsis. single cell rna sequencing revealed the involvement of various cell populations in injury and repair to be temporally organized and highly orchestrated. we identified key changes in gene expression that altered cellular functions and can explain features of clinical sepsis. these changes converged towards a remarkable global cell-cell communication failure and organ shutdown at a well-defined point in the sepsis timeline. importantly, this time point was also a transition towards the emergence of recovery pathways. this rigorous spatial and temporal definition of murine sepsis will uncover precise biomarkers and targets that can help stage and treat human sepsis. acute kidney injury (aki) is a common complication of sepsis that doubles the mortality risk. in addition to failed homeostasis, kidney injury can contribute to multi-organ dysfunction through distant effects. indeed, the injured kidney is a significant mediator of inflammatory chemokines, cytokines, and reactive oxygen species that can have both local as well as remote deleterious effects - . therefore, understanding the complex pathophysiology of kidney injury is crucial for the comprehensive treatment of sepsis and its complications. we have recently shown that renal injury in sepsis progresses through multiple phases. these include an early inflammatory burst followed by a broad antiviral response and culminating in translation shutdown and organ failure . in a non-lethal and reversible model of endotoxemia, organ failure was followed by spontaneous recovery. the exact cellular and molecular contributors to this multifaceted response remain unknown. indeed, the kidney is architecturally a highly complex organ in which epithelial, endothelial, immune and stromal cells are at constant interplay. 
therefore, we now examined the spatial and temporal progression of endotoxin injury to the kidney using single cell rna sequencing (scrnaseq). our data revealed that cell-cell communication failure is a major contributor to organ dysfunction in sepsis. remarkably, this phase of communication failure was also a transition point where recovery pathways were activated. we believe this spatially and temporally anchored approach to sepsis pathophysiology is crucial for identifying potential biomarkers and therapeutic targets. we harvested a cumulative amount of , renal cells obtained at , , , , , and hours after endotoxin (lps) administration. the majority of renal epithelial, immune and endothelial cell types were represented (fig. a) . note the absence of podocyte and mesangial cells, which can be a limitation of single cell rnaseq renal dissociation procedures . cluster identities were assigned and grouped using known classical phenotypic markers (fig. b, supplementary fig. a ) [ ] [ ] [ ] [ ] [ ] . interestingly, the umap-based computational layout of epithelial clusters recapitulated the normal tubular segmental order in the nephron. this indicates that gene expression gradually changes among neighboring tubular segments along the nephron. note that the expression of cluster-defining markers varied significantly during the injury and recovery phases of sepsis ( fig. s b; supplementary table ). therefore, we also identified a set of genes that are conserved across time for a given cell type (fig. s c) . in the integrated umap (fig. a) , we noted the presence of a proliferative cell cluster (cdk and ki expression). by back mapping to time-specific unintegrated umaps, we determined that these proliferating cells could be traced to specific cell types at various points along the sepsis timeline (fig. c) . for example, within the first hour after lps, these proliferative indices were expressed primarily in s cells. 
these cells are the site of lps uptake in the kidney as we have previously shown - . at later time points, proliferative indices are seen in macrophages ( hours) and s cells ( hours) (fig. c) . these proliferative indices reflect cell cycle activity which may be involved in injury, repair or recovery processes . we also noted the presence of a proximal tubular cluster expressing unique gene identifiers: agt, rnf , slc a and slc a (fig. a) . this is likely the proximal tubular s -type (s t ) reported by others . this cluster maintained a separate and distinct identity throughout the sepsis timeline (fig. c) . because the location of s t is currently unknown, we performed in-situ spatial transcriptomics on septic mouse kidneys . we then integrated our scrnaseq with the in-situ rnaseq in order to map our scrnaseq clusters onto the tissue (supplementary fig. a, s b) . we found that the classical s cluster localizes to the cortex while s t is in the outer stripe of the outer medulla (fig. b, supplementary fig. b) . we confirmed the location of s t to the os-om with single molecular fish (supplementary fig. c) . the differential gene expression between s and s t is likely dictated by regional differences in the microenvironments of the cortex and the outer stripe. because angiotensinogen (agt) was strongly expressed in s t , we examined the expression of other components of the renin-angiotensin system (ras). we first noted the absence of ace expression in s tubular cells (fig. c, supplementary fig. ) . in contrast, ace was strongly expressed in s , s and s t cells. there is currently great interest in understanding the biology of ace because of its role in sars-cov- cellular invasion. other essential components of the sars-cov- entry mechanism include tmprss and slc a - . while tmprss was expressed in all proximal tubular segments, slc a was more strongly expressed in s throughout the sepsis timeline. 
this may point to the s tubular segment as one point of entry of sars-cov- into the kidney. the immune cell profile in the septic kidney was time-dependent and showed a five-fold increase in immune cells, primarily macrophages (fig. a, b) . we noted two distinct macrophage clusters denoted as macrophage a and macrophage b (mϕ-a, mϕ-b). both of these clusters expressed classical macrophage markers such as cd b (itgam) (fig. c) . accumulated macrophages were predominantly mϕ-a. we noted the absence of proliferation markers (cdk , ki ) in this cluster, raising the possibility that this may be an infiltrative macrophage type (fig. d) . the mϕ-b cluster, located between mϕ-a and conventional dendritic cells (cdc) expressed also cdc markers such as mhc-ii subunit genes (h -ab ) and cd c (itgax) indicating that it is an intermediary macrophage type (fig. c) . this continuum between macrophages and dendritic cells in the kidney has been reported - . interestingly, mϕ-b cells expressed proliferation markers (cdk , ki ) and thus, may be differentiating towards a mϕ-a or cdc phenotype (fig. c) . pseudotime and velocity field analysis suggested that at earlier time points ( hour) mϕ-b was differentiating toward mϕ-a phenotype. at later time points ( hours) the velocity field suggested that mϕ-b was differentiating towards cdc but pseudotime analysis was inconclusive (fig. e) . similarly, the mϕ-a cluster also showed two subclusters on the rna velocity map (supplementary fig. a) . one of the subclusters showed increased expression of alternatively activated macrophages (m ) markers such as arg (arginase ) and mrc (cd ) at later time points ( hours, supplementary fig. b) . therefore, rna velocity analysis may be a useful tool in distinguishing macrophage subtypes in scrnaseq data. in t-cells, while cd expression was minimal at all time points, the expression of cd was robust and relatively preserved over time (fig. s c) . 
we also noted an increase of a distinct plasmacytoid dendritic cell cluster at one hour (pdc). these pdcs, along with natural killer (nk) cells, are known to signal through the interferon-gamma pathway and stimulate cd expression , . this supports the early antiviral response we have previously reported in this sepsis model defined with pseudotime analysis. we note that at any given time point, directional progression of states along pseudotime correlated well with real time state changes (fig. a) . note that the endothelium exhibited changes in states as early as hour, while s showed changes at later time points ( hours). these sequential state changes may reflect the spatial and temporal propagation of lps signaling in the kidney. as sepsis progressed, many cell types lost function- defining markers while acquiring novel ones. for example, s and s lost classical markers like slc a (sglt ) and aqp and expressed new genes involved in antigen presentation such as h -ab (mhc-ii) and cd (fig. b) . moreover, the highly distinct phenotypes that differentiated s from s /s at baseline merged into one phenotype for all three sub-segments by hours after lps (fig. c) . however, despite the apparent convergent phenotype at hours, additional analytical approaches such as rna velocity revealed significant differences in rna splicing kinetics between s and s segments at this time point. in addition, rna velocity revealed the presence of two subclusters within the s segment at hours (fig. d) . these two velocity subclusters did not correlate with the two states seen in pseudotime analysis. this indicates that multiple analytic approaches are needed to fully characterize cellular changes along the sepsis timeline. we next show gene expression profiles in select cell types along the sepsis timeline. in this analysis, we included endothelial cells, pericyte/stromal cells, macrophages and s tubular cells. 
within hour of lps exposure, most cell types showed decreased expression of select genes involved in ribosomal function, translation and mitochondrial processes such as eef and rpl genes (fig. a, supplementary fig. a ). this reduction peaked at hours and recovered by hours. concomitantly, most cell types exhibited increased expression of several genes involved in inflammatory and antiviral responses such as tnfsf , cxcl , ifit , and irf . however, this increase was not synchronized among all cell populations. indeed, it occurred as early as hour in endothelial cells, macrophages and pericyte/stromal cells, all acting as first responders. in contrast, epithelial cells were late responders, with increases in inflammatory and antiviral responses occurring between and hours. in fact, four hours after lps administration, cluster-specific go terms were indistinguishable among the majority of cell types with enrichment in terms related to defense, immune and bacterium responses (fig. b) . one noted exception was the s t cells (outer stripe s ) which did not enrich as robustly as other cell types in these terms. it mostly maintained an expression profile related to ribosomes, translation and drug transport throughout the sepsis timeline (supplementary fig. ). other players of interest in sepsis pathophysiology such as prostaglandin and coagulation factors are described in supplementary figure b . at the -hour time point, while s cells partially recovered to baseline, the macrophages showed increased expression of genes involved in phagocytosis, cell motility and leukotrienes, broadly representative of activated macrophages (e.g. csf r, lst , capzb, s a , cotl , alox ap, fig. a) . intriguingly, at this late time point, the pericyte/stromal cells are enriched in unique terms related to specific leukocyte and immune cell types such as lymphocyte-mediated immunity, t cell mediated cytotoxicity and antigen processing and presentation. 
this suggests that the pericyte may function as a transducer between epithelia and other immune cells. therefore, we next comprehensively examined cell-cell communication along the sepsis timeline. we show select examples of cell type-specific receptor-ligand pairs. for example, we found that s and endothelial cells communicate through the angpt (angiopoietin ) and tek (tie ) ligand-receptor pair at baseline and throughout the sepsis timeline (fig. a-b, supplementary fig. a). in contrast, c was strongly expressed in pericyte/stromal cells, while its receptor c ar localized to macrophage/dcs. this communication, present at baseline, increased along the sepsis timeline with additional players such as s participating in the cross talk (supplementary fig. ). another strong communication was noted between endothelial cells and macrophage/lymphocytes using the ccl and ccr receptor-ligand pair. the architectural layout of these four cell types, with pericytes and endothelial cells residing between proximal tubule and macrophage/dcs, may explain these complex communication patterns . such communication patterns among these four cell types may also explain macrophage clustering around s tubules at later time points in sepsis, as we previously reported . when examined comprehensively, receptor-ligand signaling progressed from a broad pattern at baseline into a more discrete and specialized one hours after lps (fig. c, supplementary fig. b-c).

the murine sepsis timeline allows staging of human sepsis

finally, we asked whether our mouse sepsis timeline could be used to stratify human sepsis aki. to this end, we selected the differentially expressed genes from all cells combined (pseudo bulk) for each time point across the mouse sepsis timeline (supplementary table ). we then examined the orthologues of these defining genes in human kidney biopsies of patients with sepsis and aki.
the clinical data associated with these human biopsies did not allow further stratification or staging of the sepsis timeline (supplementary table ). as shown in figure d , our approach using the mouse data succeeded in partially stratifying the human biopsies into early, mid and late sepsis-related aki. these findings suggest that underlying injury mechanisms are conserved, and the mouse timeline may be valuable in staging and defining biomarkers and therapeutics in human sepsis. in this work, we provide comprehensive transcriptomic profiling of the kidney in a murine sepsis model. to our knowledge, this is the first description of spatial and temporal transcriptomic changes in the septic kidney that extend from early injury well into the recovery phase. our data cover nearly all renal cell types and are time-anchored, thus providing a detailed and precise view of the evolution of sepsis in the kidney at the cellular and molecular level. using a combination of analytical approaches, we identified marked phenotypic changes in multiple cell populations along the sepsis timeline. the proximal tubular s segment exhibited significant alterations consisting of early loss of traditional function-defining markers (e.g., sglt ). similar losses of function-defining markers along the nephron may explain the profound derangement in solute and fluid homeostasis seen in sepsis. concomitantly, we observed novel epithelial expression of immune-related genes such as those involved in antigen presentation. this indicates a dramatic switch in epithelial function from transport and homeostasis to immunity and defense. these phenotypic changes were reversible, thus underscoring the remarkable resilience and plasticity of the renal epithelium. in addition, our combined analytical tools clearly identified unique subclusters within each epithelial cell population (e.g., cortical s and os s ). 
these subclusters likely represent novel populations that may be in part influenced by the complex microenvironments in the kidney. it is likely that such microenvironments define unique features in epithelial subpopulations such as the expression of complete sars-cov- machinery in s . similarly, we also identified unique features in immune-cell populations. for example, the combined use of rna velocity field and pseudotime analyses uncovered differences in macrophage subtypes relating to rna kinetics and cell differentiation trajectories. of note, these subtypes only partially matched the traditional flow cytometry-based classification of macrophages (e.g., m /m ). therefore, the use of single-cell rna seq is a powerful approach that will add to and complement our current understanding of the immune cell repertoire in the kidney. additional approaches such as receptor-ligand crosstalk and gene regulatory network analyses identified unique cell- and time-dependent players involved in sepsis pathophysiology. our work points to the urgent need for defining a more accurate and precise timeline for human sepsis. such definition will guide the development of biomarkers and therapies that are cell and time specific. we show evidence supporting the relevance of murine models and their usefulness in staging human sepsis. these precisely time- and space-anchored data will provide the community with rich and comprehensive foundations that will propel further investigations into human sepsis.

animal model: male c bl/ j mice were obtained from the jackson laboratory. mice were - biopsies were used in this study, the institutional review board determined that informed consent was not required.

murine kidneys were transported in rpmi (corning), on ice immediately after surgical procurement. kidneys were rinsed with pbs (thermofisher) and minced into eight sections.
each sample was then enzymatically and mechanically digested with reagents from the multi-tissue dissociation kit and a gentlemacs dissociator/tube rotator (miltenyi biotec). the samples were prepared per protocol "dissociation of mouse kidney using the multi tissue dissociation kit " with the following modifications: after termination of the program "multi_e_ ", we added ml rpmi (corning) and % bsa (sigma-aldrich) to the mixture, filtered it, and centrifuged the homogenate ( g for minutes at °c). the cell pellet was resuspended in ml of rbc lysis buffer (sigma), incubated on ice for minutes, and washed three times ( g for minutes at °c). annexin v dead cell removal was performed using magnetic bead separation after the final wash, and the pellet was resuspended in rpmi /bsa . %. viability and counts were assessed using trypan blue (gibco), and the suspension was brought to a final concentration of million cells/ml, exceeding % viability as specified by the x genomics processing platform. the sample was targeted to , cell recovery and applied to a single cell master mix with lysis buffer and reverse transcription reagents, following the chromium single cell ' reagent kits v user guide, cg rev a ( x genomics, inc.). this was followed by cdna synthesis and library preparation. all libraries were sequenced on the illumina novaseq platform in paired-end mode ( bp + bp). fifty thousand reads per cell were generated, and the x genomics cellranger (v. . . ) pipeline was used to demultiplex raw base call files to fastq files and to align reads to the mm murine genome using star . cellranger computational output was then analyzed in r (v. . . ) using the seurat package v. . . . , . seurat objects were created for non-integrated and integrated (inclusive of all time points) data using the following filtering metrics: gene counts were set between - and mitochondrial gene percentages less than to exclude doublets and poor quality cells. gene counts were log transformed and scaled to .
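The filtering and normalization steps described above can be sketched in a few lines. This is an illustration only, written in Python rather than R, and it does not reproduce the authors' scripts: the function names, the toy gene counts, the `mt-` prefix used to flag mitochondrial genes, and every threshold and scale factor shown are hypothetical placeholders, because the actual cutoffs are not recoverable from the text here.

```python
import math


def passes_qc(cell, min_genes, max_genes, max_mito_pct, mito_prefix="mt-"):
    """Return True if a cell passes gene-count and mitochondrial filters.

    cell maps gene name -> raw count. Too many detected genes can indicate
    a doublet; a high mitochondrial fraction can indicate a damaged cell.
    """
    total = sum(cell.values())
    if total == 0:
        return False
    n_genes = sum(1 for count in cell.values() if count > 0)
    mito = sum(count for gene, count in cell.items()
               if gene.lower().startswith(mito_prefix))
    mito_pct = 100.0 * mito / total
    return min_genes <= n_genes <= max_genes and mito_pct < max_mito_pct


def log_normalize(cell, scale_factor):
    """Scale counts to a common library size, then apply log(1 + x)."""
    total = sum(cell.values())
    return {gene: math.log1p(count / total * scale_factor)
            for gene, count in cell.items()}


# Toy cell with placeholder thresholds, for illustration only.
toy_cell = {"Lrp2": 8, "Slc34a1": 1, "mt-Co1": 1}
if passes_qc(toy_cell, min_genes=2, max_genes=5, max_mito_pct=20.0):
    normalized = log_normalize(toy_cell, scale_factor=100.0)
```

Dividing by each cell's total count before the log transform keeps cells with different sequencing depth comparable, which is the usual motivation for this form of normalization.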
the top principal components were used to perform unsupervised clustering analysis, and visualized using umap dimensionality reduction (resolution . ). using the seurat package, annotation and grouping of clusters to cell type was performed manually by inspection of differentially expressed genes (degs) for each cluster, based on canonical marker genes in the literature - , , . in some experiments, we used edger negative binomial regression to model gene counts and performed differential gene expression and pathway enrichment analyses (topkegg, topgo, fig. , supplementary fig. a, supplementary fig. , and david . , fig. b) . , . the immune cell subset was derived from the filtered, integrated seurat object and included the macrophage/dc (cluster ), neutrophil (cluster ) and lymphocyte (cluster ) cells. gene counts were log transformed and scaled, and principal component analysis was performed as for the integrated object above. umap resolution was set to . , which yielded clusters. the clusters were manually assigned based on inspection of degs for each cluster, and cells were grouped if canonical markers were biologically redundant. we confirmed manual labeling with an automated labeling program in r, singler . scenic analysis was performed using the default setting, and the mm - bp-upstream- species.mc nr.feather database was used for data display. we performed pseudotime analysis on the integrated seurat object containing all cell types as well as the immune cell subset. cells from each of the seven time points were included and were split into individual gene expression data files organized by previously defined cell type.

a septic mouse kidney was immediately frozen in optimal cutting temperature media. a µm frozen tissue section was cut and affixed to a visium spatial gene expression library preparation slide ( x genomics). the specimen was fixed in methanol and stained with hematoxylin-eosin reagents.
images of hematoxylin-eosin-labeled tissues were collected as mosaics of x fields using a keyence bz-x fluorescence microscope equipped with a nikon x cfi plan fluor objective. the tissue was then permeabilized for minutes and rna was isolated. the cdna libraries were prepared and then sequenced on an illumina novaseq . using seurat . . , we identified anchors between the integrated single cell object and the spatial transcriptomics datasets and used those to transfer the cluster data from the single cell to the spatial transcriptomics. for each spatial transcriptomics spot, this transfer assigns a score to each single cell cluster. we selected the cluster with the highest score in each spot to represent its single cell associated cluster. using a loupe browser, expression data was visualized overlying the hematoxylin-eosin image. formalin-fixed paraffin-embedded cross sections were prepared with a thickness of µm. the slides were baked for minutes at °c. tissues were incubated with xylene for minutes x , % etoh for minutes x , and dried at room temperature. rna in situ hybridization was fluorescein plus evaluation kit (perkinelmer, inc) was used as secondary probes for the detection of rna signals. all slides were counterstained with dapi and coverslips were mounted using fluorescent mounting media (prolong gold antifade reagent, life technologies). the images were collected with a lsm confocal microscope (carl zeiss). no blinding was used for animal experiments. all data were analyzed using r software packages, with relevant statistics described in results, methods and fig. legends . data will be deposited to ncbi geo. the authors declare that all relevant data supporting the findings of this study are available on request. r scripts for performing the main steps of analysis are available from the lead contact on request. 
correspondence and requests for resources and reagents should be directed to and will be fulfilled by the lead contact takashi hato (thato@iu.edu). supplemental fig. - : refer to "supplemental_fig - .pdf" supplemental table : cell-type specific differentially expressed genes from - hours, related to fig. , supplemental fig. . lung-kidney cross-talk in the critically ill patient distant organ dysfunction in acute kidney injury: a review sepsis: current dogma and new perspectives sepsis associated acute kidney injury bacterial sepsis triggers an antiviral response that causes translation shutdown rna sequencing of adult kidney: rare cell types and novel cell states revealed in fibrosis representation and relative abundance of cell-type selective markers in whole- kidney rna-seq data a single-nucleus rna-sequencing pipeline to decipher the molecular anatomy and pathophysiology of human kidneys deep sequencing in microdissected renal tubules identifies nephron segment-specific transcriptomes single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease single-cell profiling reveals sex, lineage, and regional diversity in the mouse kidney the macrophage mediates the renoprotective effects of endotoxin preconditioning endotoxin preconditioning reprograms s tubules and macrophages to protect the kidney endotoxin uptake by s proximal tubular segment causes oxidative stress in the downstream s segment epithelial cell cycle arrest in g /m mediates kidney fibrosis after injury joint profiling of chromatin accessibility and gene expression in thousands of single cells visualization and analysis of gene expression in tissue sections by spatial transcriptomics structural basis for the recognition of sars-cov- by full-length human ace sars-cov- cell entry depends on ace and tmprss and is blocked by a clinically proven protease inhibitor multiorgan and renal tropism of sars-cov- renal histopathological analysis of postmortem findings of 
patients with covid- in china sars-cov- entry factors are highly expressed in nasal epithelial cells together with innate immune genes identification and functional characterization of dendritic cells in the healthy murine kidney and in experimental glomerulonephritis quantification of dendritic cell subsets in human renal tissue under normal and pathological conditions macrophages in renal injury and repair the debate about dendritic cells and macrophages in the kidney distinct macrophage phenotypes contribute to kidney injury and repair pivotal role of plasmacytoid dendritic cells in inflammation and nk-cell responses after tlr triggering in mice interferon-lambda modulates dendritic cells to facilitate t cell immunity during infection with influenza a virus star: ultrafast universal rna-seq aligner comprehensive integration of single-cell data a transcriptional map of the renal tubule: linking structure to function single-cell transcriptomics of a human kidney allograft biopsy specimen defines a diverse inflammatory response bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage scenic: single-cell regulatory network inference and clustering the dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells characterization of cell fate probabilities in single-cell data with palantir rna velocity of single cells cellphonedb: inferring cell- cell communication from combined expression of multi-subunit ligand-receptor complexes circlize implements and enhances circular visualization in r we thank the kidney precision medicine project for making data available for human kidney reference nephrectomy specimens. we thank daria barwinska for assistance with specimen validation. 
this work was supported by nih k -dk to th, nih r -dk , key: cord- -ewnf cz authors: srivastava, mayank; zhang, ying; chen, jian; sirohi, devika; miller, andrew; zhang, yang; chen, zhilu; lu, haojie; xu, jianqing; kuhn, richard j.; andy tao, w. title: chemical proteomics tracks virus entry and uncovers ncam as zika virus receptor date: - - journal: nat commun doi: . /s - - -y sha: doc_id: cord_uid: ewnf cz the outbreak of zika virus (zikv) in created a worldwide health emergency that demands urgent research efforts on understanding the virus biology and developing therapeutic strategies. here, we present a time-resolved chemical proteomic strategy to track the early-stage entry of zikv into host cells. zikv was labeled on its surface with a chemical probe, which carries a photocrosslinker to covalently link virus-interacting proteins in living cells on uv exposure at different time points, and a biotin tag for subsequent enrichment and mass spectrometric identification of the receptor or other host proteins critical for virus internalization. we identified neural cell adhesion molecule (ncam ) as a potential zikv receptor and further validated it through overexpression, knockout, and inhibition of ncam in vero cells and human glioblastoma cells u- mg. collectively, the strategy can serve as a universal tool to map virus entry pathways and uncover key interacting proteins. zika virus (zikv) has been the focus of immense investigation since the recent epidemic. while studies on zikv structure revealed its structural similarity to other virus relatives within the family flaviviridae, some notable differences such as the asn glycosylation site present on each of the e proteins may affect receptor interactions, antibody response, and downstream biology of the virus , .
for the past several years, efforts have been made to understand the fundamental biology of zikv infection, including the use of animal models to evaluate their immune response to the virus invasion , as well as potential therapeutics against zikv infection [ ] [ ] [ ] . however, unanswered questions remain regarding molecular mechanisms of host restriction and immune evasion . identification of direct interactors of zikv during its entry into host cells could not only suggest the molecular pathways manipulated by the virus, but also provide immense opportunity to develop antivirals by offering new potential drug targets. owing to the transient nature of these interactions and the extreme rapidity of flavivirus entry in general, identification of dynamic interactors of the virus is a formidable task. our understanding of zikv internalization and cellular trafficking would greatly benefit from a systematic, temporal characterization of the major proteins involved in the dynamic virus entry. real-time fluorescence microscopy has been used to study the transport, acidification, and fusion of single viruses , . the movement of a single virus revealed an intriguing and dynamic process. however, the molecular information obtained with this technique, in particular about the protein machinery involved in the process, is typically limited to the labeled molecules. on the other hand, affinity and chemical proteomics studies identified virus-interacting proteins [ ] [ ] [ ] . the molecular mechanisms and dynamic virus-host interactions responsible for the internalization of zikv, however, have remained unresolved. a systematic quantitative measurement of temporal changes in virus-protein interactions may prove extremely valuable for the identification of host molecules as potential therapeutic targets.
we have previously used chemical proteomics strategies and modified a nanoparticle, polyamidoamine generation dendrimer, to understand the endocytic pathways of a nanoparticle followed by a recent study to investigate the entry of salmonella into host cells . here, we expand the concept and hypothesize that chemical modification of zikv would not significantly affect its infectivity and would allow us to track the virus entry into living cells and identify virus-interacting proteins by mass spectrometry (ms), revealing the spatiotemporal distribution of the key proteins involved in the pathways for zikv entry and trafficking. synthesis and characterization of zikv-labeling probe. we devised and synthesized a multifunctional chemical probe ( fig. a; supplementary figs. - ) bearing a labeling group that conjugates the probe to the zikv surface, a photo-reactive group that allows for covalent crosslinking of zikv proteins to interacting host cell proteins upon uv exposure, and an isolation tag of biotin for purifying the interacting proteins for quantitative ms analysis, thus facilitating the investigation of host-pathogen interactomes in a time-resolved manner (fig. b) . we chose the maleimide group to label the virus through its specific conjugation with thiol groups on the virus surface proteins at physiological condition to form a stable thioether linkage. as sulfhydryls thiols are present in most proteins but are not as abundant as primary amines, we expected limited labeling on cysteine residues would have a minimal labeling effect on the zikv activity. according to the structure of mature zikv determined by cryoelectron microscopy by our group , there are cysteines in zikv e ( in the ectodomain, in the transmembrane domain) and no cysteine in m protein (supplementary fig. a ; cysteine residues are highlighted as gray spheres in the structures). 
in addition, zikv, like other flaviviruses and enveloped viruses in general, is quite unstable and prone to undergo structural changes under external influence . considering the virus stability and infectivity, we preferred minimal labeling of the virus through the maleimide-thiol conjugation under mild conditions at neutral ph. moreover, the three functionalities are separated by a polyethylene glycol (peg)-like linker to improve water solubility, while offering the flexibility for efficient crosslinking and enrichment. the labeling was first examined with a standard protein and then with intact zikv. bovine serum albumin (bsa) was incubated with mm of reagent in phosphate buffer ph at °c and the labeled protein was enriched on streptavidin beads and analyzed by sds-page. the labeling efficiency was estimated to be - % ( supplementary fig. b ). after labeling, modified zikv was lysed, and the labeled zikv surface proteins were purified on streptavidin beads and assessed by silver stain and western blotting using the g antibody against the e protein ( supplementary fig. c ). we further analyzed the captured proteins using ms. multiple unique peptides from e protein were identified by ms ( supplementary fig. ). we did not identify any peptide from virus membrane (m) protein, capsid (c), or any of the nonstructural proteins of zikv, further confirming the exclusive tagging of the virus surface with the reagent. this result is primarily owing to the membrane-impermeable attribute of the chemical proteomic probe imparted by the pegylated linkers. finally, we examined the effect of labeling purified zikv with several concentrations and labeling time points by the plaque assay. no loss of infectivity was observed under labeling conditions for mm reagent concentration ( supplementary fig. d ). hence, we concluded that the minimal labeling of the virus achieved by the cysteine-reactive maleimide group does not perturb the infectivity of the virus.
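The infectivity comparison above rests on plaque counts from serially diluted virus. A minimal sketch of the standard titer arithmetic follows, assuming the usual definition of plaque-forming units per milliliter. The function names are hypothetical and the inputs are placeholders, since the source does not report raw plaque counts, dilutions, or inoculum volumes.

```python
def plaque_titer(plaque_count, dilution_factor, inoculum_ml):
    """Return the titer in PFU/mL.

    plaque_count    : plaques counted in one well
    dilution_factor : dilution of the plated sample (e.g. 1e-5)
    inoculum_ml     : volume of diluted virus added to the well, in mL
    """
    if plaque_count < 0 or dilution_factor <= 0 or inoculum_ml <= 0:
        raise ValueError("counts must be >= 0; dilution and volume must be > 0")
    return plaque_count / (dilution_factor * inoculum_ml)


def relative_infectivity(labeled_titer, unlabeled_titer):
    """Ratio of labeled to unlabeled titer; a value near 1 means no loss."""
    if unlabeled_titer <= 0:
        raise ValueError("unlabeled titer must be > 0")
    return labeled_titer / unlabeled_titer
```

Under these assumptions, a ratio close to 1 between probe-labeled and mock-labeled virus would correspond to the statement that labeling does not perturb infectivity.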
chemical proteomics to track the early-stage entry of zikv. we used the labeled zikv to infect vero cells and interacting proteins were crosslinked at fixed time points to identify the virus-host factors and elucidate the virus entry mechanism (fig. c) . flavivirus are quite promiscuous in their selection of receptors for entry to different cells. the complex entry mechanism might involve multiple receptor interactions to help virus internalize. though some previous studies have identified axl, a tam family tyrosine kinase, as a putative receptor for zikv , some conflicting evidence including ours suggests that the virus might employ multiple different classes of receptors for entry , . furthermore, while most of the viruses are believed to enter cells by clathrin-mediated endocytosis, there is no evidence suggesting the absence of any parallel mode of virus entry. the complexity of the flavivirus entry mechanism hints at the presence of varied virus-protein interactions after membrane recruitment of host cellular proteins to initiate viral infection. in order to identify zikv receptors, we allowed the virus to attach to the cells at °c for h, followed by uv photo crosslinking on ice. for the virus entry, we chose and min according to a previous study that indicates the membrane fusion of a similar flavivirus at s post binding, during its entry into vero cells . after crosslinking proteins at designated time points of attachment or entry, cells were harvested and proteins were extracted, followed by the enrichment using avidin beads. we reasoned that the chemical proteomic probe on the virus surface only crosslinks with proteins in direct contact with zikv, which can subsequently withstand vigorous washing conditions to remove nonspecifically bound proteins. the tryptic peptides derived from enriched samples were then analyzed by nanoflow hplc coupled to high-resolution ms. 
proteins were identified by a shotgun proteomic strategy and quantitated using the label-free method to measure their relative abundance across three time points and to distinguish crosslinked proteins from nonspecifically bound proteins. in total, we identified around crosslinked proteins across three time points, out of which more than proteins are previously implicated in virus infection ( fig. a and supplementary data ). the principal component analysis (pca) shows that all biological replicates are tightly clustered together and each time point is well-separated, meaning the samples are clustered by the nature of sample but not the batch (fig. b) . the pca and heatmap analysis also indicates that distinctive proteins were crosslinked at three different time points, suggesting that the strategy was able to reveal the temporal distribution of the interacting proteins crosslinked with zikv during the virus' early entry (fig. b, c) . gene ontology (go) analysis, as expected, indicated proteins annotated as membrane and extracellular region were significantly overrepresented in the crosslinked proteins across all time points (fig. d) . to further investigate whether the strategy was also capable of correlating spatial information with the virus crosslinked proteins, we performed the string analysis to determine whether there is statistical overrepresentation of specific genes or proteins in the sample at specific time points and identify proteins specific at the attachment or cellular entry stages ( supplementary fig. ). notably, we identified crosslinked proteins at min, of which were membrane proteins, such as c qbp, cd , itga , letm , ncam , rack , and slc a . complement c q binding protein is a key host factor for efficient respiratory syncytial virus (rsv) production . the tetraspanin cd facilitates mers-coronavirus entry by scaffolding host cell receptors and proteases . 
e-syt proteins were discovered to impact the formation of virus-induced syncytia during hsv- infection . integrin α directly interacts with hepatitis e virus (hev) and plays a key role in cellular attachment and entry of nehev . receptor of activated protein c kinase (rack ) is important for lymphocystis disease virus entry and infection . f cell-surface antigen heavy chain (slc a ) was reported interacting with zikv ns b protein and was indicated as a candidate host factor for zikv infection . in this study, coimmunoprecipitation followed by western blotting of transduced cells verified that slc a specifically associates with zikv env (supplementary fig. a) . interestingly, neural cell adhesion molecule (ncam ), which was reported as a receptor for rabies virus , was identified at different time points and presents as a candidate receptor for zikv infection. previous analysis of the global proteomic changes that occurred during differentiation of hnpcs into neurons also revealed significant upregulation of ncam .
figure legend: labeled zikv was diluted in dmem and incubated with confluent cells for h at °c. in addition, cells were incubated with the labeled viruses at °c for fixed time points to allow virus entry. unbound viruses were removed and cells were directly exposed to uv light. cells were lysed and biotinylated proteins were captured on the avidin beads. proteins were digested on beads using sequential lys-c and trypsin digestion, and analyzed by lc-ms/ms. label-free quantitation was performed using maxquant to identify and quantify the crosslinked proteins.
moreover, a few other noticeable proteins crosslinked specifically at min include ap m and calml . ap m is the µ subunit of adaptor protein- (ap- ) complex that recognizes the tyrosine-rich sorting signals on the cytoplasmic tail of receptor proteins .
our crosslinking experiment validates the complex's structural assembly on the membrane as ap m was selectively crosslinked at min of infection, while no other ap- subunits were observed. ap m has further been shown to modulate early-stage infectious entry of hcv, in its phosphorylated form . calml was discovered as a potential host factor for zikv infection . we further demonstrated that the strategy allowed us to identify clusters of proteins representing a temporal shift of zikv subcellular localizations and the functions of crosslinked proteins are highly correlate to the temporal information in many cases (fig. , supplementary data ). for example, at min, we observed certain proteins enriched such as stat , a key mediator of type-i interferon signaling. it has been reported that zikv suppresses the host immune response by inhibiting the type-i interferon signaling pathway . previous studies also have suggested the mechanism of blocking immune response by zikv as either the degradation of signal transducer and activator of transcription (stat ) , or antagonism of stat and stat phosphorylation . rack , which mediates the interactions between ifn receptor and stat , was also identified as a zikv-interacting protein at that time point. previously, a different class of virus was shown to interact with rack , thus initiating the dissociation of rack -stat complex and inhibition of interferon signaling by the virus . our study emphasizes the involvement of both stat and rack in the immune response elicited by the zikv infection, suggesting zikv might employ a similar approach to suppress the host immune response. we also identified an isoform of a key component of recycling endosomes, rab a. the role of rab a in transport of viral ribonucleoprotein or core proteins to plasma membrane for the generation of new virus particles of influenza and hepatitis c virus (hcv), respectively, is well documented . 
however, some viruses may also utilize the recycling endosomal pathway to evade lysosomal degradation. our study suggests the use of the recycling pathway by zikv after entry. besides rab a, we also observed rab c at min. previous studies have established the importance of rab c in the entry of zikv . the identification of rab c and rab a in our chemical proteomics experiment at a late time point of entry is consistent with the published data on japanese encephalitis virus (jev) , a neurovirulent pathogen from the flavivirus genus structurally similar to zikv. other proteins identified of infection include hspa , also known as heat shock cognate (hsc ), known for its role in vesicle uncoating in the later stage of endocytosis . overall, more than direct interactors such as ap b and itgb were also identified as zikv interactors in previous work , further supporting confidence in the interactors identified in our study. in addition, we compared the proteins crosslinked to zikv at each time point with protein components involved in major endocytic mechanisms for zikv internalization. prior studies showed that zikv infection could be prevented by lysosomotropic agents, which neutralize the normally acidic ph of endosomal compartments, and was also blocked by chlorpromazine , , indicating the requirement of clathrin-mediated endocytosis and low ph for zikv infection. in our study, itgb , hspa , rab c, rack , and rab a were crosslinked, suggesting that the virus employs the clathrin-mediated pathway to infect vero cells. furthermore, identification of arcn , itga , flna, and flnc also suggests the utilization of a caveolar-mediated pathway by zikv for endocytosis. identification and verification of ncam as zikv receptor. finally, considering that ncam was crosslinked at several time points, was not detected at all in control samples, and is abundantly expressed in the brain, we further examined whether ncam is a potential receptor for zikv infection leading to neurological disorders.
to assess the physiological importance of ncam interaction with zikv, we first examined the surface expression of ncam on u- mg and vero cells. we found that ncam was highly expressed on the surface of u- mg and vero cells (fig. a) . the immunofluorescence image also revealed that ncam is located on the membrane of the transfected hek t cells (fig. b) . to validate the interaction between ncam and zikv env, we performed coimmunoprecipitation experiments in the context of zikv env and ncam -flag overexpression. immunoblot analyses revealed that ncam specifically interacted with the zikv env (fig. c ), but not with egfr, another transmembrane glycoprotein as a negative control (fig. d ). in addition, we found that the other host factor identified in our study, hspa , previously reported as a key host factor for zikv infection and zikv nonstructural protein stabilization , also bound to the zikv env protein directly ( supplementary fig. b ). we knocked out hspa in u- mg cells using two different crispr/cas sgrnas. the knockout efficiency by the sgrnas was confirmed with western blot analysis and the efficiency of sgrna is lower than that of sgrna ( supplementary fig. c ). we found that hspa depletion by sgrna in u- mg cells remarkably attenuated zikv infection ( supplementary fig. c ). to further assess the effect of ncam on zikv binding and entry, we employed ncam extracellular domain (ecd) protein and anti-ncam antibody to compete and block ncam binding, respectively, and examined whether it could lead to the reduction in zikv attachment and internalization. preincubation with ncam ecd protein, but not the control protein, remarkably reduced zikv binding (fig. f) and entry (fig. g) into u- mg cells. similarly, pretreatment of u- mg cells with the anti-ncam antibody also reduced zikv binding (fig. f) and entry (fig. g) . ncam ecd and antibody also inhibited zikv binding to vero cells ( supplementary fig. 
d ), but had no statistical difference on the zikv internalization ( supplementary fig. e ). in this study, we present a chemical proteomic approach in which virus was chemically tagged with a biocompatible probe, to reveal the virus-host interactome in real time. the work demonstrates how chemical proteomics can facilitate our understanding of molecular mechanisms and key players in virus infection. specific identification of virus-interacting proteins was made possible by the integration of several techniques: a multifunctional chemical probe that can achieve the labeling, crosslinking, and isolation steps, photo-reactive crosslinking that allows us to select designated time points and capture interacting proteins covalently to minimize loss during lysis and washing, and label-free ms-based quantitation. in particular, the analyses of proteomic data sets with or without virus infection at different time points enabled the extraction of temporal and spatial information during the virus infection, which provides a useful and universal tool to study any pathogen invasion in theory. compared with the previously reported mass spectrometric methods for the identification of receptors specific to glycoproteins , , our method offers the additional advantage of covalently linking proteins at different time points, thus serving the dual purpose of identification of receptors and other host factors involved at different stages of virus entry. the chemical proteomics strategy was applied to zikv and enabled us to identify multiple zikv-interacting proteins that indicate zikv subcellular localizations and potential entry mechanisms, among which a new zikv receptor was discovered and validated through virus attachment and entry assays. overexpression of ncam in hek t cells increased viral binding and entry. ncam depletion in u- mg cells remarkably inhibited zikv infection. 
inhibition of the ncam receptor by ncam ecd and anti-ncam antibody reduced zikv attachment and entry into u- mg cells. there was also a % reduction in zikv attachment to vero cells, but no statistically significant reduction was observed for zikv entry into vero cells. we reason that vero cells are less dependent on specific receptors for virus infection and therefore are commonly used as a model cell line for virus infection. on the other hand, zikv infection of u- mg cells is highly dependent on specific receptors. once a specific receptor, e.g., ncam , was blocked, a significant reduction in attachment and entry into u- mg cells was observed. to exclude a general effect of ncam on the binding and entry of other viruses, we performed an inhibition assay using influenza a virus (iav) and dengue virus (denv- ). ncam ecd and anti-ncam antibody had minimal effects on iav binding and entry ( supplementary fig. f , left), but ncam ecd had a significant effect on denv binding and anti-ncam antibody had minimal effects on denv- binding and entry ( supplementary fig. f , right), suggesting that ncam may affect the attachment of zikv or denv but not iav. to further confirm the receptor activity of ncam , we overexpressed ncam in hek t cells, validated the expression efficiency with immunofluorescence imaging (fig. b) and western blotting assays ( supplementary fig. ), and then performed binding and entry assays. heterologous expression of human ncam in t cells enhanced both zikv attachment (fig. h ) and internalization (fig. i) . to further demonstrate the requirement of ncam for infection by zikv, we knocked out ncam in u- mg cells using two different crispr/cas sgrnas. the knockout efficiency of the sgrnas was confirmed with flow cytometry staining, and the efficiency of sgrna (green) is lower than that of sgrna (blue) (fig. j) . we found that ncam depletion by sgrna in u- mg cells remarkably attenuated zikv infection (fig. k, l) .
this result supports our observation that ncam could be a potential receptor for zikv infection (supplementary fig. g) . the chemical proteomics strategy highlighted its unique feature that allowed us to track the virus movement in real time, which is challenging due to the highly dynamic nature of the process and the transient virus-host protein interactions. moreover, the crosslinking chemistry permits the identification of potential receptors which present analytical challenges to identify the interaction on cell membrane. lastly, this technology can be applied to relatively unstable enveloped viruses, owing to the minimal labeling by cysteine-reactive maleimide group. briefly, virus particles were precipitated from the media with % polyethylene glycol (peg) overnight at °c, pelleted at , × g for min at °c. resuspended particles were pelleted through a % sucrose cushion, resuspended in . ml nte buffer ( mm tris ph . , mm nacl, mm edta), and purified with a discontinuous gradient in % intervals from % to % k-tartrate, mm tris ph . , mm edta. mature virus was extracted from the gradient, concentrated, and buffer exchanged into nte buffer. plaque assay. the plaque assay was performed as described below. briefly, virus was diluted serially in the order of ten folds and incubated with monolayers of vero cells for h at room temperature. cells were layered with agarose and incubated at °c for days. plaques were counted following cell staining using neutral red. synthesis and purification of a multifunctional chemical probe. the viruslabeling chemical probe was synthesized on the rink-amide-am-resin ( - mesh) % dvb manually, using standard solid phase peptide synthesis approach (supplementary fig. ) . a % piperidine solution in dmf (n,n-dimethylformamide) was used to deprotect the fmoc ( -fluorenylmethoxycarbonyl) groups, while % tfa (trifluoroacetic acid) was used for boc (tert-butoxycarbonyl) group deprotection. 
hctu (o-( h- -chlorobenzotriazole- -yl)− , , , -tetramethyluronium hexafluorophosphate) was utilized as an activating agent for the carboxyl group on the incoming reactant, in presence of the base nmm ( methylmorpholine). the synthesis was performed on the µmol scale and using . -fold excess of the reagents compared with the resin. each step involved the deprotection of amine group, activation of carboxyl group followed by coupling reaction. the excess reagents were removed by thorough washing of beads by dmf. ninhydrin test was performed after each deprotection and coupling reaction. the synthesis was performed using the strategy, as previously described . mg ( µmol) of rink-amide-am-resin was added to the fritted reaction vessel. beads were conditioned with dmf for min. the solution was removed by filtration, and % piperidine in dmf was added to the beads for fmoc deprotection. the mixture was end-to-end rotated for min, and solution was removed followed by washing of beads with dmf. . , . , . , . , . , . , . , . , . , . , . , . , . the pure maleimide-biotin ( mg, . µmol) was dissolved in dmf and reacted with excess nhs-lc-diazirine (succinimidyl- -( , ′-azipentanamido) hexanoate) in phosphate buffer ph , for h at room temperature. the product was purified by directly injecting into the hplc and using similar conditions as described above. the virus-labeling chemical probe was obtained as a white powder ( . mg, . µmol, %) and characterized by maldi-tof, h and c nmr (supplementary fig. ) . (m, h), . - . (m, h), . - . (m, h), . (s, h) . maldi showed a peak at . m/z, corresponding to the m-n + h + . this is due to the loss of n from diazirine under maldi conditions. bsa labeling/virus labeling. fifty micrograms bsa or purified zikv was diluted to µl with pbs ph , and mixed with the labeling chemical probe in the final concentration of mm (supplementary fig. b-d) . the labeling was carried out by gentle end-to-end rotation in °c overnight. 
for the infection experiment, virus labeling was initiated a day before the cells reached < % confluency for infection. the reaction was quenched by adding three times excess of cysteine. virus infection and crosslinking of host proteins. vero cells were first grown in t- flasks in dmem supplemented with % fbs, then passaged to the cm plates and grown to < % confluency. cells were washed with cold pbs twice, and cooled down to °c. the labeled virus was diluted in dmem and added to the cells at moi of . cells were gently rocked for h in °c, to allow for virus attachment. for the receptor crosslinking, the unbound virus was removed, cells were washed once with cold pbs, and directly exposed to the uv light for min on ice. all the above operations were performed on ice and using cold pbs to minimize any virus entry. to understand the virus internalization mechanism, additionally virus was allowed to enter cells by incubation in °c for or min, following preattachment for an hour at °c. subsequent to uv photo crosslinking, cells were collected by scraping in pbs, and stored in − °c until further processing. as a control, cells treated with the labeling chemical probe and exposed to uv were included to account for random crosslinking. sample preparation for lc-ms/ms analysis. frozen cells were lysed in % sds, mm tris hcl ph . supplemented with protease inhibitor on ice, using sonication ( cycles for s each, with an interval of s). cell lysates were cleared by centrifugation at , rpm to pellet down cell debris, and supernatant were used for the biotin-neutravidin affinity purification. bicinchoninic acid assay (thermo fisher scientific) was performed for protein quantitation, and the lysates equivalent to mg protein for each sample were reduced and alkylated by boiling at °c, in mm tcep (tris( -carboxyethyl)phosphine) and mm caa (chloroacetamide) respectively. the lysates were then diluted to . 
% sds and rotated with µl preconditioned neutravidin beads slurry in °c overnight. the beads were washed with . % sds in mm tris (ph . ), m nacl, mm glycerol in mm tris (ph . ), and then transferred to the low protein binding eppendorf tubes, where they were further washed three times with mm ammonium bicarbonate (abc) buffer ph . two hundred microliters abc buffer was added to the beads, and proteins were digested on-bead at °c using µg lys-c for h and ng trypsin for h. the supernatant containing peptides was collected and beads were washed twice with µl abc buffer, further pooled with the supernatant. peptides were acidified and desalted using in-house stagetips with sdb-xc ( m). the peptides were dried in speedvac before subjecting to lc-ms/ ms analysis. lc-ms/ms analysis. the peptides were dissolved in . % formic acid and injected into easy-nlc (thermo fisher scientific). the peptides were separated on a cm in-house column ( µm od × µm id), packed with c resin ( . µm, Å, bischoff chromatography, leonberg, germany) and heated to °c with a column heater (analytical sales and services, flanders, new jersey). the mobile phase was comprised of . % formic acid in ultra-pure water (solvent a) and . % formic acid in % acetonitrile (solvent b), and the gradient used for separation was - % b over a linear min at a flow rate of nl/min. the easy-nlc was connected online to the ltq-orbitrap velos pro mass spectrometer (thermo fisher scientific) by a nanospray source. data acquisition was performed in the data-dependent mode, in which a full scan (range from m/z - with a resolution of , at m/z ) was followed by ms/ms scans of top intense ions (normalized collision energy %, automatic gain control e , maximum injection time ms) with a dynamic exclusion for s and dynamic list of . proteomic data analysis. raw files were processed with maxquant v . . . , and the label-free quantitation (lfq) was performed. 
the raw data were searched against the uniprotkb (rhesus macaque) fasta database (version july , , entries) with the andromeda search engine, using default parameters. the first peptide precursor mass tolerance was set at ppm, and the ms/ms tolerance at . da. carbamidomethylation was set as a fixed modification for cysteines, while oxidation of methionine and acetylation at the n-terminus were selected as variable modifications. enzyme specificity was set to trypsin with a maximum of two missed cleavages. the search was performed with a % false discovery rate at both peptide and protein levels. identifications were transferred from the sequenced peaks to the unidentified peaks of the same m/z within a time window of . min (match between runs) across samples. for lfq, the lfq intensities were first extracted from the maxquant output file. only proteins with valid values in at least two of three replicates in at least one group were considered, and all missing values were imputed from a normal distribution. proteins that differed significantly between a time point and the control (fc ≥ , t test p ≤ . ) were considered crosslinked proteins and were further processed with an in-house matlab program for functional analysis. for heatmap analysis, the expression values (ibaq) of all quantified proteins were clustered based on euclidean distances with average linkage using a modified clustergram function in matlab (version r b), and the heatmap was also visualized in matlab. colors indicate changes between samples: red represents upregulation, green downregulation, black no change, and gray values that are not available. each column represents an experimental condition, and each row corresponds to a gene. each protein row was normalized by the maximum value of that row. the rows and columns are displayed in the order given by the clustering output trees in the two dimensions.
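the valid-value filter and the per-row scaling described above can be sketched in a few lines of python. this is only an illustration under stated assumptions, not the in-house matlab program: the function names, protein names, group labels and intensities are all hypothetical, and missing values are represented as None.

```python
# Sketch only: illustrates the valid-value filter and row-max scaling
# described in the text. Protein names and intensities are hypothetical.

def passes_valid_value_filter(groups, min_valid=2):
    """True if any group has at least `min_valid` non-missing replicates.

    groups maps a group name to its replicate intensities; None = missing.
    """
    return any(
        sum(value is not None for value in replicates) >= min_valid
        for replicates in groups.values()
    )


def normalize_row_by_max(row):
    """Scale one heatmap row by its maximum; missing values stay None."""
    present = [value for value in row if value is not None]
    if not present or max(present) == 0:
        return list(row)
    top = max(present)
    return [None if value is None else value / top for value in row]


proteins = {
    "protein_a": {"control": [None, None, 1.0], "t1": [4.0, 5.0, None]},
    "protein_b": {"control": [None, 2.0, None], "t1": [None, None, 3.0]},
}
kept = [name for name, groups in proteins.items()
        if passes_valid_value_filter(groups)]
scaled = normalize_row_by_max([2.0, None, 8.0, 4.0])
print(kept)    # ['protein_a']
print(scaled)  # [0.25, None, 1.0, 0.5]
```

in this toy input, protein_a is kept because one group has two valid replicates, while protein_b has only one valid value per group and is dropped before imputation.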
go enrichment analysis of the differentially expressed proteins was conducted using information from the go databases. each bar in the figure denotes the enrichment score of one sample, and the enrichment score was defined as −log (p). the p value was calculated using the hypergeometric formula: $$P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$ where N is the number of all identified proteins that can be connected with go annotation, n is the number of differential proteins in N, M is the number of proteins annotated with a given go term, and m is the number of differential proteins annotated with that go term. go terms with a p value below . were regarded as significantly enriched in differential proteins. protein-protein interaction information for the significant proteins was retrieved from the string database and visualized with cytoscape. viral attachment and entry assay. for the antibody inhibition experiment, u- mg cells in twelve-well plates were preincubated with µg/ml anti-ncam antibody (cat. no. bd- , bd) or control isotype igg antibody (cat. no. bd- , bd) in dmem supplemented with % fbs for h at °c. cells were then incubated with purified zikv (moi = ) on ice for h in the presence of antibody. the supernatant was then removed and the cells were washed three times with cold pbs. cellular rna was extracted and purified to test viral attachment. alternatively, prewarmed medium was added to the cells to initiate zikv internalization. the cells were incubated at °c for an additional min, and cellular rna was then extracted and purified. rt-qpcr was then performed to measure viral entry. for the ncam protein inhibition experiment, purified zikv (pfu = ) was preincubated with µg ncam ecd (sino biological, cat. no. -h h) or control protein for h at °c. the viruses (moi = ) were then added to u- mg cells in twelve-well plates and incubated on ice for h. the supernatant was then removed and the cells were washed three times with cold pbs.
cellular rna was extracted and purified to test viral attachment. alternatively, prewarmed medium was added to the cells to initiate zikv internalization. the cells were incubated at °c for an additional min, and cellular rna was then extracted and purified. rt-qpcr was then performed to measure viral entry. for the ncam overexpression experiment, control pcdna . plasmid or pcdna . -ncam plasmid was transfected into t cells separately. cells were then incubated with purified zikv (moi = ) on ice for h after h of transfection. the supernatant was then removed and the cells were washed three times with cold pbs. the increase in viral attachment was measured with rt-qpcr. alternatively, prewarmed medium was added to the cells to initiate zikv internalization. the cells were incubated at °c for an additional min, and cellular rna was then extracted and purified. rt-qpcr was then performed to measure internalized zikv rna. immunoprecipitation assay. for co-immunoprecipitation experiments, t cells ( × cells per cm dish) were transiently transfected with µg of pcdna . -ncam / pcdna . -hspa and pcmv-ns /pcmv-e separately for h using turbofect transfection reagent (thermo fisher scientific). cells were rinsed twice with cold pbs and were then transferred to clean tubes and lysed in cell lysis buffer for immunoprecipitation supplemented with % protease inhibitor cocktail (cat. no. p , sigma). cell lysates were incubated with pierce™ protein a/g agarose (cat. no. , sigma) for h at °c and were then subjected to centrifugation at , × g for min at °c. the supernatant was transferred to a new tube and incubated with µl anti-flag m affinity gel (cat. no. a , sigma) overnight at °c. the sepharose samples were centrifuged, washed five times with cell lysis buffer and eluted using × flag peptide (cat. no. f , sigma). then all samples were boiled with sds loading buffer for min. flow cytometry experiments.
surface expression of ncam was analysed in u- mg and vero cells by staining the cells with rabbit anti-ncam antibodies (pe) ( : , cat. no. -mm -p, sino biological) at room temperature for min. cells were washed three times with pbs supplemented with % fbs. all flow cytometry experiments were carried out using an lsrfortessa cell analyser (bd bioscience). samples were analysed using flowjo software version (treestar). crispr-cas knockout assay. oligos encoding sgrnas for generating knockout cells using crispr-cas were cloned into the lenticrisprv plasmid (addgene plasmid, cat. no. ) as previously described . the oligo sequences of the sgrnas targeting ncam and hspa are listed as follows. ncam -sgrna : aacgccaacatcgacgacgc; ncam -sgrna : acaccactgagatccgctgc; hspa -sgrna : acagatgccaaacgtctgat; hspa -sgrna : ctagactgttaccaatgctg. lenticrisprv clones containing the guide sequences were sequenced, purified, and used for lentiviral production. to generate heterogeneous knockout cell populations, u- mg cells were infected with the lenticrisprv -derived lentivirus for h and were then reseeded into complete dmem containing µg/ml puromycin for days to select for transduced cells. surviving populations derived in this manner were propagated and expanded for use. rna extraction and real time-quantitative polymerase chain reaction (rt-qpcr). rna was isolated from mammalian cells (u- mg, vero, and hek t) using the rneasy mini kit (qiagen, valencia, ca) and normalized based on total rna amount as determined by a nanodrop™ / c spectrophotometer (thermo fisher scientific). rt-qpcr was performed using the superscript iii platinum sybr green one-step qpcr kit w/rox (thermo fisher scientific) and analyzed on an applied biosystems real-time pcr system. the samples were subjected to thermal cycling for min at °c, min at °c, and cycles of s at °c and min at °c, at which point data were collected, and this was followed by dissociation curve analysis.
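absolute quantitation by rt-qpcr of this kind generally relies on a standard curve that relates ct to the logarithm of template copy number. the python sketch below shows one common way to fit such a curve by least squares, derive the amplification efficiency from its slope, and convert a sample ct into copies. the dilution series and ct values are hypothetical and the function names are illustrative; they are not taken from this study.

```python
import math

# Sketch only: linear standard curve Ct = slope * log10(copies) + intercept.
# The dilution series and Ct values below are hypothetical.

def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct against log10(copy number)."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pcr_efficiency(slope):
    """Amplification efficiency; 1.0 means perfect doubling per cycle."""
    return 10 ** (-1.0 / slope) - 1.0


def ct_to_copies(ct, slope, intercept):
    """Invert the standard curve to estimate template copies."""
    return 10 ** ((ct - intercept) / slope)


standard_copies = [1e3, 1e4, 1e5, 1e6, 1e7]   # tenfold dilution series
standard_ct = [30.0, 26.7, 23.4, 20.1, 16.8]  # hypothetical Ct readings
slope, intercept = fit_standard_curve(standard_copies, standard_ct)
efficiency = pcr_efficiency(slope)
sample_copies = ct_to_copies(23.4, slope, intercept)
print(slope, intercept, efficiency, sample_copies)
```

a slope near −3.3 corresponds to an efficiency close to 100%, which is why the slope of the dilution series is usually reported alongside the curve.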
the ct values obtained were converted to the number of zikv rna molecules using a standard curve generated from in vitro-transcribed viral rna. the zikv cdna clone used for in vitro transcription was kindly provided by shi. for the standard curve, the plasmid containing zikv cdna was linearized by cla and viral rna was transcribed using t rna polymerase (new england biolabs). the dna template was digested by rnase-free dnase treatment for min at °c and the viral rna was subsequently purified with the rneasy mini kit (qiagen). rna concentration and quality were determined by nanodrop, and e copies of rna were serially diluted tenfold and subjected to thermal cycling as described above to obtain the standard curve and pcr efficiency. all pcr primers are listed as follows: zikv-f tgggaggtttgaagaggctg; zikv-r tctcaacatggcagcaagatct; gapdh-f ctgggctacactgagcacc; gapdh-r aagtggtcgttgagggcaatg; denv-f aaggactagaggttagaggagac; denv-r ggcgttctgtgcctggaatgat; iav-f cgcacagagacttgaggatg; iav-r tgggtctccattcccattta. immunofluorescence. hek t cells (for transfection with ncam ) were seeded on cover slips in a -well plate. cells were washed with pbs and fixed with . % paraformaldehyde (pfa) for min at room temperature. cells were again washed with pbs three times and blocked with % bsa in pbs for h. anti-ncam antibody in blocking solution was incubated with the cells for h at room temperature. cells were washed three times and incubated with anti-mouse fitc or anti-mouse alexa fluor for h at room temperature. dapi staining was performed for min, followed by three final pbs washes. cover slips were mounted on glass slides and images were captured using an olympus ix fluorescence microscope with a x oil immersion objective. for detecting zikv, cells on slides were fixed with % pfa for min at room temperature, permeabilized with . % triton-x in pbs for min, and blocked with blocking buffer ( % bsa and % donkey serum diluted in pbs) for min.
immunofluorescence analyses of zikv-infected cells were performed using a mouse anti-flavivirus envelope protein antibody ( : , clone d - g - - , millipore), with an alexa fluor donkey anti-mouse igg (h + l) ( : , ab , abcam) as the secondary antibody. all cells were mounted with prolong™ gold antifade with dapi (life technologies, p ) and imaged with a tissuefaxs flow-type tissue cytometer (tissuegnostics gmbh, vienna, austria). all statistical analyses of immunofluorescence staining present the results from at least cells per replicate, and data are shown as the mean ± s.e.m. western blot. following overexpression of ncam in hek t, cells were lysed at h post transfection.
the samples were boiled at °c in gel loading buffer and , -dithiothreitol (dtt) for min. the cell lysates were separated on the precast nupage - % bis-tris polyacrylamide gels (invitrogen) for min at constant voltage of v. a mops solution ( mm mops, mm tris-base, mm edta, . % sds) was used as a running buffer. the proteins were transferred onto polyvinylidene fluoride membranes in bicine-bis-tris transfer buffer containing % methanol, for min at a constant current of ma. the membrane was blocked with % bsa in tbst, and probed with anti-human ncam ( : , , cell signaling technology) for h at room temperature. following washings, anti-mouse igg hrp-conjugated secondary antibody ( : , s, cell signaling technology) was utilized for visualization ( supplementary fig. ). western blot detection of zikv env was performed using a rabbit anti-env antibody ( the matlab code used for functional analysis has been uploaded to pride partner repository with the dataset identifier pxd [http://proteomecentral. proteomexchange.org/cgi/getdataset?id=pxd ]. source data are provided with this paper. this project has been funded by nih grant r gm and nsf grant to w.a.t., r ai to r.j.k., and yfa to h.j.l. and y.z. we also thank the resource for biocomputing, visualization, and informatics at the university of california, san francisco for the creation of the zikv figures in supplementary fig. a using ucsf chimerax with support from nih r -gm and p -gm . m.s. and ying z. performed the initial experiments and analyzed the data. yang z. and h.j.l. analyzed the data. a.m., d.s., and r.k. provided zikv and initial experiments on the zikv characterization. j.c., j.q.x., and z.l.c. performed the validation experiments and analyzed the data. r.k. and w.a.t. designed the experiment. ying z. and w.a.t. wrote the paper. the authors declare no competing interests. supplementary information is available for this paper at https://doi.org/ . 
/s - - -y. correspondence and requests for materials should be addressed to y.z., j.x., r.j.k. or w.a.t. peer review information: nature communications thanks aurelie mousnier, andreas pichlmair and pei-yong shi for their contribution to the peer review of this work. open access: this article is licensed under a creative commons attribution . international license.
key: cord- -j h qa authors: liu, xiaojing; liu, tingting; shang, yafang; dai, pengfei; zhang, wubing; lee, brian j.; huang, min; yang, dingpeng; wu, qiu; liu, liu daisy; zheng, xiaoqi; zhou, bo o.; dong, junchao; yeap, leng-siew; hu, jiazhi; xiao, tengfei; zha, shan; casellas, rafael; liu, x. shirley; meng, fei-long title: ercc l promotes dna orientation-specific recombination in mammalian cells date: - - journal: cell res doi: . /s - - - sha: doc_id: cord_uid: j h qa
programmed dna recombination in mammalian cells occurs predominantly in a directional manner.
while random dna breaks are typically repaired both by deletion and by inversion at approximately equal proportions, v(d)j and class switch recombination (csr) of immunoglobulin heavy chain gene overwhelmingly delete intervening sequences to yield productive rearrangement. what factors channel chromatin breaks to deletional csr in lymphocytes is unknown. integrating crispr knockout and chemical perturbation screening we here identify the snf -family helicase-like ercc l as one such factor. we show that ercc l promotes double-strand break end-joining and facilitates optimal csr in mice. at the cellular levels, ercc l rapidly engages in dna repair through its c-terminal domains. mechanistically, ercc l interacts with other end-joining factors and plays a functionally redundant role with the xlf end-joining factor in v(d)j recombination. strikingly, ercc l controls orientation-specific joining of broken ends during csr, which relies on its helicase activity. thus, ercc l facilitates programmed recombination through directional repair of distant breaks. programmed dna recombination processes, including v(d)j and antibody class switch recombination (csr), diversify lymphocyte antigen receptors for efficient adaptive immunity. rag endonucleases assemble v, d and j gene segments to form variable region exons of b and t cell receptor genes, and activationinduced cytidine deaminase (aid) further introduces dna lesions upstream of antibody constant region genes to switch the antibody class from igm to other classes. general dna repair pathways efficiently process immunoglobulin heavy chain gene (igh) v(d)j and csr lesions in b cells. despite the many mechanistic differences, in both cases intermediate breaks are joined in an orientation-specific manner, i.e. rearrangements occur predominantly by deletion rather than by inversion. 
this feature increases the probability of productive rearrangements and consequently the number of peripheral lymphocytes available to fight infection. conversely, random dna breaks or ends cut by designer endonucleases are typically joined both by deletion and by inversion at about equal proportions. rag directional linear tracking and intrinsic properties of recombination signal sequences (rsss) enforce orientation-specific joining of v(d)j breaks. how csr ends are processed mostly by deletion is less understood. in mammalian cells, double-strand breaks (dsbs) are sensed by the mrn complex, which activates the atm-dependent dsb response (dsbr) pathway. atm-substrate bp and its downstream effectors prevent excessive end resection to promote nonhomologous end-joining (nhej) at the expense of homologous recombination (hr). the mammalian nhej pathway is initiated by dsb end recognition through the ku -ku (ku in mouse) heterodimer, which further recruits nucleases and/or polymerases and finally the ligase -xrcc -xlf complex for repair. besides the evolutionarily conserved core nhej factors found in all eukaryotes, several new nhej factors have evolved in vertebrates and mammals, as exemplified by the recently identified nhej factors paxx and mri/cyren. nhej can function flexibly on a diverse range of dsb substrates in different chromatin contexts with its dna-based factors (e.g. core nhej subunits) and chromatin factors, and specialized proteins could be involved in the joining of a subset of breaks. v(d)j recombination and csr occur within the context of topologically associating domains (tads), where loop extrusion facilitates contacts between recombining elements as well as promoters and enhancers. during csr, chromatin loop extrusion shapes the igh chromatin architecture in a spatiotemporal manner. upon antigen stimulation, the activation of i promoters drives stepwise cohesin loading on the pre-assembled csr center in naive b cells.
the chromatin subdomains position the directional alignment of donor sμ and acceptor s regions, which ensures deletional csr in cis. however, the identity of the trans-acting factors behind orientation-specific end-joining remains unclear. dsbr factors, especially bp , favor deletional csr. the absence of bp is associated with excessive resection of dsb ends and near complete block of csr, indicating that the loss of directional repair might be an indirect effect. similarly, lig deficient cells tend to have more inversional csr joining during csr, reflecting that the escaped broken ends from the joining complex are joined randomly in a diffusional manner at low levels. to search for additional end-joining factors, we combined crispr knockout screening and chemical perturbation screens, and functionally characterized the hit in the context of immune diversification. the strategy identified ercc l (excision repair cross-complementation group like ) as a new nhej factor that channels programmed csr to directional repair. to identify potentially new nhej factors, we combined chemical perturbation screens on compounds with focused crisprknockout screens on genes in the ch b cell line ( fig. a ; supplementary information, table s , see materials and methods for details). genes were selected from known dna repair factors and their homologs based on gene-ontology (go) terms (supplementary information, table s ). dna damage-inducing chemicals, including carcinogens, therapeutic agents, or dna damage response (ddr) inhibitors, were selected based on their known functions and inhibitory concentrations (ic) in ch cells (supplementary information, table s ). pools of knock-out cells were cultured with chemicals at ic ( % maximal inhibitory concentration) for days. hit genes were called with the mageck algorithm and presented as a z-score, where a negative z-score indicates that the knockout of a gene renders the cell more sensitive to the chemical. 
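the sign convention for these scores can be illustrated with a small python sketch. it assumes, purely for illustration, that each gene already has one score per chemical and simply standardizes those scores across genes; this is not the mageck algorithm itself, and the gene names, scores and threshold are hypothetical.

```python
from statistics import mean, pstdev

# Sketch only: plain standardization across genes for one chemical.
# This is NOT the MAGeCK computation; names and scores are hypothetical.

def z_scores(scores):
    """Map each gene to (score - mean) / population standard deviation."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {gene: 0.0 for gene in scores}
    return {gene: (score - mu) / sigma for gene, score in scores.items()}


def sensitizing_hits(z, threshold=-1.0):
    """Genes whose knockout makes cells more sensitive (z <= threshold)."""
    return sorted(gene for gene, value in z.items() if value <= threshold)


chemical_scores = {
    "gene_a": -2.0,
    "gene_b": 0.0,
    "gene_c": 0.5,
    "gene_d": 0.5,
    "gene_e": 1.0,
}
z = z_scores(chemical_scores)
hits = sensitizing_hits(z)
print(hits)  # ['gene_a']
```

a gene with a strongly negative z-score, such as gene_a here, is depleted under treatment, which is the pattern interpreted in the text as increased sensitivity to the chemical.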
clustering of the chemicals by their crispr screen z-scores across all the genes categorize the chemicals into two main groups ( fig. b; supplementary information, fig. s a ). the first group consists of reagents that generate single and double-strand breaks (ssbs and dsbs). the ssb subgroup included poly-adp ribose polymerase (parp) and dna topoisomerase i (top ) inhibitors (fig. b) , consistent with the observation that these reagents produce similar dna lesions. , in the dsb subgroup, dna topoisomerase ii (top ) inhibitors clustered with γ-radiation mimicking reagent zeocin, and g-quadruplexinteracting drug pyridostatin, all of which are known to induce dsbs. [ ] [ ] [ ] although methyl methanesulfonate (mms) does not directly cause breaks, the downstream lesions may be converted into dsbs at the dosage used in our screen. the second major group contains reagents that cause blockage on dna, such as interstrand crosslink (icl) reagents, nucleoside analogs, crosslinkers, and dna intercalators as well as ddr inhibitors against atm and dna-pkcs (fig. b) . this clustering suggests that blockage of dna replication/transcription may cause the cell proliferation defects observed in the group. we conclude that the combined crispr-chemical screens cluster similar chemicals (illustrated by the same color block in fig. b) based on the kind of dna damage they cause, demonstrating the effectiveness of our approach at dissecting the function of dna repair genes and dna damage chemicals. ercc l clusters with other nhej factors next, we clustered all dna repair genes by their z-scores across the chemicals used, which categorized genes into three major groups depending on their impact on cell growth (supplementary information, fig. s a ). consequently, epistatic genes segregated together, such as those involved in fanconi anemia and nhej factors (supplementary information, fig. s a ). nhej factors segregated in turn into two main clusters (fig. 
c) : cluster contained core subunit genes (ku / , lig , dna-pkcs) and potentially new members (baz b, ercc l ). cluster comprised several other known nhej genes: xrcc , paxx, bp , polλ, polθ and the apurinic/apyrimidinic endonuclease gene apex . mutations in ercc l have recently been identified in inherited bone marrow failure (bmf) patients. [ ] [ ] [ ] [ ] [ ] several classic nhej gene mutants have been implicated in bmf, leading us to wonder whether ercc l contributes to nhej pathway. interestingly, ercc l deficient cells were depleted upon zeocin treatment which induces dsbs (fig. d ), but not in the presence of cisplatin or veliparib treatment which creates icls and ssbs, respectively (fig. e ). this is consistent with results obtained from patient-derived lines carrying ercc l mutations. to confirm the screening results, we deleted ercc l in ch b cells with two sets of sgrnas. set deleted the predicted catalytic domain on ercc l , while set created an out of frame mutation (supplementary information, fig. s b , table s ). we found that all resulting clones were hypersensitive to treatments that induce dsbs, such as γ-irradiation (ir), zeocin and etoposide ( fig. f ; supplementary information, fig. s c ). this phenotype is similar to, but less severe than that observed in isogenic cells lacking the major nhej ligase lig (fig. f) . increased sensitivity to dsbs was also evident in ercc l -deleted abelson virus-transformed mouse pro-b cells and human osteosarcoma u os cells (supplementary information, fig. s d , e). altogether, these data demonstrate that ercc l promotes dsb repair. ercc l is required for optimal csr v(d)j recombination and csr have been used to characterize the function of dsbr/nhej factors, and deleterious mutants of nhej genes frequently lead to primary immunodeficiencies (pid) (fig. c) . 
we therefore performed a focused crispr-knockout screen in ch b cells stimulated to undergo igm to iga csr, and compared the enrichment of knocked-out genes between iga + and igm + populations (fig. a) . as controls, the screen included sgrnas targeting known genes required for csr (e.g. aicda, stat ). consistent with previous reports, table s ). among the potentially novel csr genes, ercc l was ranked highest in the mageck analysis. conversely, other ercc family members, ercc /csb (functioning in transcription-coupled nucleotide excision repair) and ercc l/pich (playing a role in the spindle assembly checkpoint), were not required for csr (fig. b) . in cytokine-activated ch cells, csr was decreased by ~ % in the absence of ercc l , which is comparable to results in isogenic lig −/− cells (fig. c ). to study this phenotype under more physiological conditions, we deleted ercc l by crispr-cas ( fig. d; supplementary information, fig. s a ) in mouse embryos, which were either wild type or carried preassembled heavy and light chain (hl) genes. ercc l −/− mice were viable (supplementary information, fig. s b ). splenic naive b cells were purified from ko and control mice and stimulated with lps or lps + il to induce csr to igg or igg , respectively. consistent with the ch results, ercc l deficiency resulted in a ~ % reduction in csr ( table s ). correspondingly, the serum igg/iga levels were significantly decreased in ercc l −/− mice (supplementary information, fig. s l ). in germinal centers of immunized mice, aid-induced somatic hypermutation (shm) was unaffected (supplementary information, fig. s m ). this is reminiscent of the phenotype reported for dsbr/nhej-deficient germinal center b cells. these results demonstrate that ercc l facilitates antibody isotype switching. during csr, aid deaminates cytosines at switch region dna. ber and mmr enzymes convert the uracils into dsbs, which are then processed by nhej factors during the recombination step.
to test whether ercc l functions upstream or downstream of dsb induction, we used cas to create breaks at igh switch regions in the absence of aid (supplementary information, table s ). under these conditions, cas can promote efficient csr to igg or iga (a process dubbed cas-csr, fig. e ; supplementary information, fig. s a) . the end-joining level of cas breaks in the absence of ercc l was reduced to %- % of that in isogenic control cells ( fig. e ; supplementary information, fig. s b ). these results indicate that ercc l promotes general dsb end-joining. it is of note that the reduction of cas-csr in ercc l deficiency is less than that caused by lig deficiency (fig. e) , while ercc l and lig deficiencies had comparable effects on aid-initiated csr (fig. c ). in the cas-csr assay, the expression of cas was well controlled by the co-transfected mcherry + control cells in the same transfection reaction (supplementary information, fig. s ). the levels of cas -generated breaks among different genotypes cannot be quantitatively revealed by the current technology. however, our observations are reminiscent of the fact that ercc l deficient cells showed less sensitivity to ir- or chemical-induced dsbs compared to isogenic lig deficient cells (fig. f) . thus, it is unlikely that ercc l affected cas cutting efficiency. the comparison of aid-csr and cas-csr suggests ercc l might have an additional role in physiological csr besides end-joining.
roles in bone marrow failure (bmf) or primary immunodeficiency (pid) are annotated by colored blocks. d, e sensitivity of gene knockouts to zeocin (d) or cisplatin/veliparib (e) treatment. genes are grouped by gene ontology and depicted with different colors based on the involved biological processes along the x-axis. the beta-score difference (chemical-dmso) indicating positive or negative enrichment in the chemical treatment samples was calculated from two replicates and is plotted for each gene. representative negatively enriched genes are labeled. fa indicates fanconi anemia genes. f sensitivity of ercc l -deficient or lig -deficient b cells to different treatments. cell viability curve was calculated and the area-under-the-curve (auc) was computed. heat map of sensitivity, which is indicated as "log (auc ko /auc wt )", is plotted. ir γ-irradiation, uvc ultraviolet wavelength nm, aph aphidicolin, cpt camptothecin, hu hydroxyurea, actd actinomycin d, drb , -dichlorobenzimidazole -β-d-ribofuranoside.
c-terminal half of ercc l leads its catalytic activity to dna damage sites
ercc l contains three conserved domains: a tudor domain, an atpase/helicase domain and a conserved hebo domain of unknown function (fig. a) . bioinformatic prediction revealed that, except for the hebo domain, the c-terminal half of ercc l is less conserved and contains intrinsically disordered sequence (supplementary information, fig. s a ). we found that tudor domain mutants can fully support csr in ercc l −/− cells ( fig. a; supplementary information, fig. s b ). however, ercc l n-terminal or c-terminal fragments failed to do this ( fig. a; supplementary information, fig. s b ). furthermore, the helicase catalytic-dead (deah > aaah) mutant did not promote csr ( fig. a ; supplementary information, fig. s b ), indicating that ercc l 's predicted catalytic activity is required for dna end-joining. consistent with this idea, ercc l helicase catalytic-dead mutant and various frame-shift mutants (containing n-terminal fragments only) were also identified in bmf patients (supplementary information, fig. s c ). to shed further light on ercc l protein domains, we fused the full-length protein or its mutants to gfp and studied their dynamics in cells exposed to laser microirradiation. we found that gfp-ercc l is recruited to micro-irradiated sites within seconds of dna damage, with similar kinetics to that of ku and xlf ( fig. b ; supplementary information, fig.
s d ). this recruitment was observed in more than % of cells, and was independent of ku , h ax, nbs , xlf or parps ( fig. c- fig. s e ). the ercc l c-terminal half fragment was sufficient to drive nuclear localization and recruitment to damaged sites (fig. g) , while the n-terminal fragment could not be efficiently transported into the nucleus (supplementary information, fig. s f ). remarkably, fusing ercc l n to a nuclear localization signal peptide did not rescue dna damage foci formation (fig. g) . these results indicate that the c-terminal domains recruit ercc l catalytic activity to damaged chromatin.
fig. ercc l is required for optimal csr. a schematic illustration of csr screening procedure. representative flow cytometry plots are shown. b enriched csr genes. genes are grouped and illustrated as fig. d . the beta-score difference (iga + -igm + ) is plotted and representative genes are labeled. c ercc l is required for optimal csr in ch f cells. aid-initiated csr is illustrated at left, and csr to iga in presence of cytokines (cit, α-cd /il /tgfβ) of indicated cells are plotted at right. blue arrows indicate transcription. colored points indicate knockout clones obtained with different sets of sgrnas. d ercc l is required for optimal csr in ex vivo activated splenic b cells. gene knockout strategies with two sets of sgrnas are illustrated on top. representative csr flow cytometry plots are shown at left. data from four pairs of ercc l −/− (sgrna pair ) and wild-type (wt) mice and three pairs of hl-ercc l −/− (sgrna pair ) and corresponding ig heavy and light chain knockin (hl) mice are summarized. e ercc l is required for optimal cas-csr. crispr/cas -initiated csr is schematically illustrated at left, and normalized csr level of indicated cells are plotted at right. data are represented as mean ± sd (standard deviation) in (c, d, e). two-tail unpaired t-test was performed for (c, d, e). ****p < . , ***p < . , **p < . ; *p < . ; ns: p > . .
ercc l interacts with end-joining factors
ercc l has been suggested to be an early dsb response factor. however, ir-induced phosphorylation of atm substrates h ax, chk and kap was unaffected in ercc l −/− cells (fig. a) . instead, we found that ectopically overexpressed ercc l co-immunoprecipitated with several nhej subunits in hek t cells in a dna-independent manner (supplementary information, fig. s a , b). this has been previously observed by immunoprecipitation-mass spectrometry (ip-ms) analysis, although we cannot conclude whether it is a direct or indirect interaction. interestingly, consistent with ip-ms analysis and genome-wide yeast two-hybrid report, , we observed a dna-independent interaction between ercc l and mri/cyren (fig. b) . this interaction was mediated by the ercc l c-terminus and a conserved motif in the middle of the mri/cyren protein ( fig. c; supplementary information, fig. s c, d) . despite extensive testing, we were unable to obtain a workable anti-ercc l antibody or serum, and currently no commercial anti-ercc l antibody is available to detect the endogenous protein. to confirm the protein interaction in vivo, we generated ercc l -ha knock-in mice (fig. d, top) , in which the ha-tag was fused to the last exon of ercc l . we detected high levels of ercc l -ha protein in csr-activated b cells (supplementary information, fig. s e) , where it co-immunoprecipitated with ku ( fig. d) and an ectopically-expressed flag-tagged mri (fig. e) . together, these findings indicate that ercc l is associated with other nhej subunits.
next, we examined how the physical interaction might contribute to ercc l 's function in end-joining. similar to ku , mri is not required for ercc l recruitment to dna damage sites (supplementary information, fig. s a ).
we then checked the recruitment kinetics of nhej factors in ercc l −/− u os cells, and found ku / , paxx and xlf to be unaffected while the recruitment of xrcc was slightly decreased (supplementary information, fig. s b, c) . to further investigate this, we generated ercc l −/− mefs. compared to controls, there were fewer ko cells showing mri or xrcc /lig recruitment to micro-irradiated sites, and those that did, showed a significant decrease in mri or xrcc /lig signals at foci ( bone marrow was similar to that of wt. although the total cell numbers in the ercc l −/− thymus were~ % of that in wt, the distribution of cd − cd − , cd + cd + , cd + cd − and cd − cd + ercc l −/− t cells were comparable to that of wt (supplementary information, fig. s a ). the lymphocyte development in ercc l deficient mice is different from that in core nhej factor deficiency. previous studies have uncovered a functional redundancy between xlf and atm-dependent dsb response factors, or between xlf and other non-essential nhej factors including paxx and mri. , - therefore, we hypothesized that ercc l may play a role in v(d)j recombination that is masked by other nhej factors, possibly xlf. to test this idea, we performed a focused crispr screen in wt and xlf −/− v-abl lines with a modified pmx-inv substrate ( fig. a; supplementary information, fig. s b , c). dna repair genes required for recombination were identified as negatively enriched in the recombined hcd + population (see materials and methods). as expected, core nhej factors and artemis were identified in both genotypes (fig. b) . also, as previously reported, [ ] [ ] [ ] [ ] [ ] [ ] xlf was functionally redundant with paxx, bp , and h ax. the assay however identified new xlf redundant factors, including nbs , mdc , rnf , rnf , ino , and importantly ercc l (fig. b) . of note, mri was not included in our original focused crispr sgrna library (supplementary information, table s ), so mri did not show up here. 
the functional redundancy between ercc l and xlf was confirmed by chromosomal v(d)j recombination assays in ercc l −/− xlf −/− and isogenic v-abl lines ( fig. c; supplementary information, fig. s d ). at the same time, there was no obvious redundancy in ercc l −/− paxx −/− cells. similar to lig −/− , ercc l −/− xlf −/− cells showed a near complete block in v(d)j recombination ( fig. c; supplementary information, fig. s d ). this defect could be rescued by ectopic expression of ercc l but not the ercc l n-terminal fragment (supplementary information, fig. s e ). absence of coding joins and a smear signal of coding ends were also noticed in ercc l −/− xlf −/− cells (fig. d) , reminiscent of extensive resection of unprocessed dna ends. in summary, ercc l and xlf play functionally redundant roles in the repair of dna ends during v(d)j recombination.
gfp-lig and mcherry-xrcc were co-expressed in cells, and the panel is illustrated as in (f). data are represented as mean ± sem in (f) and (g). a t-test was applied as described in the "materials and methods" section. **p < . ; *p < . .
ercc l controls the orientation of dna recombination
our experiments have so far conclusively established that ercc l promotes dsb end-joining in conjunction with xlf, primarily during programmed recombination in lymphocytes. the precise molecular mechanism, however, is unclear. we thus applied high-throughput genome-wide translocation sequencing (htgts), which can simultaneously quantify end resection, microhomology (mh) usage and orientation of dna end joining, to explore whether csr junctions are improperly joined in ercc l −/− cells. as bait, we used aid-mediated breaks at sμ (aid sμ ), which are joined to sγ /sε in b cells undergoing igm to igg /ige switching ( fig. a; supplementary information, fig. s a ). remarkably, ercc l −/− b cells displayed a profound defect in orientation-specific joining (fig. a, b ; supplementary information, fig. s a, b) .
compared to controls, where inter-s recombination occurs predominantly by deletion (> %), in ercc l −/− cells deletions and inversions were present at nearly equal frequencies ( % vs %, fig. a ). in bp −/− cells, which were previously found defective in directional csr repair, the ratio was % vs %, whereas in both atm −/− and xlf −/− cells this ratio was around % vs % (fig. a) . in ercc l deficiency, the resection levels at s regions were slightly higher than in wild type and similar to those in atm deficiency, but significantly lower than those in bp or xlf deficiency (fig. b) . the much milder increase of resection levels observed in ercc l deficiency distinguished ercc l from bp in dsb repair or directional dsb repair. similarly, the mh usage analysis also revealed a mild but significant decrease in direct joining and an increase in joining with longer mh in csr junctions of ercc l deficiency compared to wt (fig. c) . however, the mh usage bias in ercc l deficiency was less pronounced than in atm, bp or xlf deficiencies (fig. d) . thus, although ercc l deficiency shows similar trends of more resection and increased longer mh usage in csr junctions as nhej/dsbr deficiencies, orientation-specific joining did not correlate with the extent of end-resection or mh usage. this suggests that ercc l plays a much more robust and unique role in directional end joining during csr. to bypass the b cell development defect of core-nhej factor deficiencies and quickly assess directional csr, we deleted the non-productive igh allele in ch f cells as previously described (supplementary information, fig. s c , named ch -ncdel cells) and performed htgts assays with endogenous aid sμ baits. similar results were obtained in this system, and moreover, we found mri did not affect the inversion/deletion ratio (supplementary information, fig. s d, e) . these results therefore demonstrate that ercc l has a unique role in orientation-specific class switch recombination of antibody genes.
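The deletion versus inversion comparison reported above comes down to counting junctions by joining orientation. The following is a minimal Python sketch of that tally, not the authors' HTGTS pipeline. It assumes each junction has already been labeled "deletion" or "inversion" upstream; the function name and label strings are hypothetical.

```python
def joining_orientation(labels):
    """Summarize joining orientation for a list of junction labels.

    `labels` is an iterable of strings, each either "deletion" or
    "inversion" (hypothetical labels assigned by an upstream step).
    Any other label is ignored.
    """
    n_del = sum(1 for x in labels if x == "deletion")
    n_inv = sum(1 for x in labels if x == "inversion")
    total = n_del + n_inv
    if total == 0:
        raise ValueError("no deletion or inversion junctions found")
    return {
        "deletion": n_del / total,      # fraction joined by deletion
        "inversion": n_inv / total,     # fraction joined by inversion
        # inversion/deletion ratio; infinite if no deletions were seen
        "inv_del_ratio": n_inv / n_del if n_del else float("inf"),
    }


if __name__ == "__main__":
    # Toy input: three deletional junctions and one inversional junction.
    example = ["deletion", "deletion", "deletion", "inversion"]
    print(joining_orientation(example))
```

A value of `inv_del_ratio` near 1 would correspond to the "nearly equal frequencies" phenotype described for the knockout cells, while a value well below 1 would correspond to deletion-biased joining.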
ercc l activity is required for directional end-joining
the catalytic activity of ercc l is required for optimal csr (fig. a) , suggesting the catalytic activity is required for its functions in dsb end-joining. to further examine ercc l catalytic activity in mediating directional end-joining during csr, we generated an ercc l d n mouse line with a bmf patient-derived mutation at the deah-box helicase catalytic site (supplementary information, fig. s a ). normal numbers of splenic mature b cells were obtained from the knockin mice (supplementary information, fig. s b ). upon cytokine stimulation, ercc l d n b cells switched at a lower frequency compared to wt b cells (fig. a and supplementary information, fig. s c ). the ~ % reduction of csr level was comparable to the csr reduction in ercc l ko b cells (fig. d) . we then performed htgts to assess the csr junctions, and found that catalytic-dead ercc l phenocopied the ercc l knockout in end-resection and directional end-joining (fig. b) . in ercc l d n b cells, deletions and inversions were present at nearly equal frequencies ( % vs %, fig. b ) during csr. we conclude that ercc l 's catalytic activity is required for deletional repair during csr.
in this study, we used a comprehensive crispr and chemical screening approach to define a core dna repair genetic network. in our x screen matrix, functionally-related genes or chemicals clustered together, offering new insights into gene function and how dna damage-causing agents impact cells. even though our focused crispr screening could not cover the whole genome, it has the advantage of better signal-to-noise ratios, which can easily pinpoint regulators involved in subtle mechanisms. our focused approach successfully identified factors that impact recombination by ~ %. focused crispr screens can also be used as a reverse genetic tool to dissect genetic interactions, as we demonstrated with v(d)j recombination.
although our studies focused on the role of ercc l in nhej, it is worth noting that the assay revealed new functions of many other factors, including baz b, a bromodomain-containing protein involved in h ax phosphorylation. moreover, our data provided important hints on combinatorial cancer treatments, since many of the dna damage reagents tested are widely used in the clinic.
a working model of ercc l function
during immune diversification, programmed dna breaks are processed into productive rearrangements within constrained topological microenvironments or tads. we have shown that ercc l regulates how dna ends are joined in a spatially oriented manner during repair (fig. c) . during csr, transcribed s regions are synapsed and recruit aid deamination activity, which in turn engages ber and mmr enzymes that convert uracils into staggered dna breaks with g-rich repetitive sequences. several chromatin features appear to facilitate this activity, including r-loops, nascent g-rich rna, and paused transcription complexes. these features, along with tightly packed nucleosomes, might slow down or interfere with end-joining. in this context, ercc family members are believed to function as nucleosome remodelers or dna translocases. [ ] [ ] [ ] it is therefore tantalizing to speculate that ercc l removes nucleosomes or other protein/nucleotides near dna lesions, probably to facilitate xrcc /lig sliding towards dna ends. this function might ensure rapid in situ ligation of dsbs in the pre-assembled chromatin subdomains. another non-mutually exclusive model is that aid-generated staggered ends within s regions could be misaligned and ercc l activity could detach the misaligned ends, leading to the dissociation of the intervening sequence from the csr centre. thus, ercc l could be directly involved in the spatiotemporal formation of igh chromatin architecture, which is of great interest for further investigation. these possibilities suggest a model (fig. c) on how ercc l facilitates deletional csr of antibody genes. considering that ercc l is the latest evolved member of the ercc family, we speculate that it might have evolved to promote efficient end-joining under more complicated and specialized settings, such as in lymphocytes of higher organisms.
, c, d) . two-tail unpaired t-test was performed for (b) and (d). data from ercc l knockout are compared with those from other genotypes. ****p < . , ***p < . , **p < . ; *p < . , ns: p > . .
dna damage in ercc l -linked pathologies
mutations of ercc l and lig were previously identified in bmf patients, where dna damage may originate from the same source. in fanconi anemia, a subtype of bmf, icls generated by reactive aldehydes appear to be the major source of dna damage in hscs and other progenitor cells. however, the dna lesions underlying ercc l - and nhej-pathologies might be of different origin, as cells isolated from such patients are typically not hyper-sensitive to icl-inducing agents. our dna repair genetic screen provided some hints in this regard, as it showed that nhej and ercc l knockouts are markedly sensitive to top inhibitors. top b-induced lesions are known to accumulate genome-wide at chromosome loop anchors, , and at promoters of key genes during neuronal stimulation. notably, some ercc l -mutated bmf patients also show neurological dysfunctions. , , whether top b-induced breaks are causal to ercc l - and nhej-pathologies represents an interesting future line of research.
ercc l and xlf knockout mouse lines were constructed through zygote injection of crispr/cas constructs. the ercc l -ha knock-in mouse line was constructed by oocyte injection of androgenetic haploid embryonic stem cells harboring an ercc l -ha allele obtained with homologous recombination (hr). the ercc l d n mouse line was constructed by zygote injection of crispr/cas constructs and a single-strand hr template. guide rna sequences are listed in supplementary information, table s .
bp −/− and atm −/− mouse lines have been described previously. all animal experiments were performed under protocols approved by the institutional animal care and use committee of shanghai institute of biochemistry and cell biology.
fig. ercc l activity is required for directional end-joining during csr. a csr levels to igg in csr-activated wt and ercc l d n b cells at day upon lps/il stimulation. b distribution of s μ -s γ junctions at s γ , end-resection and inversion/deletion ratio of csr junctions in wt and ercc l d n b cells. dashed lines indicate the levels in bp −/− b cells, which were assayed at the same time. c a working model to explain roles of ercc l , see "discussion" for details. data are represented as mean ± sd and two-tail unpaired t-test was performed in (a, b). ***p < . , *p < . .
cell lines used in this study are listed in the supplementary information, table s . the parental b-lineage ch f cell line and isogenic lig −/− cell line, and the parental abelson virus-transformed (v-abl) mouse pro-b cell line containing the eμ-bcl transgene, have been described previously. ch f and its derived isogenic cells were cultured with rpmi ( - -cv; corning), β-mercaptoethanol (m - ml; sigma-aldrich), penicillin-streptomycin-glutamine ( ; thermo fisher scientific), plus % fbs (fcs ; excell bio), and v-abl cells were cultured with rpmi ( - -cv, corning), β-mercaptoethanol (m - ml; sigma-aldrich), penicillin-streptomycin-glutamine ( ; thermo fisher scientific), sodium pyruvate ( ; thermo fisher scientific), mem non-essential amino acids solution ( ; thermo fisher scientific), hepes ( ; thermo fisher scientific), plus % fbs ( - -cm; corning). mef, hek t and u os cells were cultured with dmem ( - -cv, corning), penicillin-streptomycin-glutamine ( ; thermo fisher scientific), plus % fbs (fsp , excell bio). all cell lines are negative for mycoplasma contamination.
primers, plasmids and antibodies primers, plasmids and antibodies used in this study were listed in the supplementary information, table s . focused crispr screenings and data analysis dna repair-related crispr library design. dna repair genes were picked based on gene ontology within amigo database (http://amigo.geneontology.org/amigo). we also chose several csr related genes from literature serving as potential csr positive controls, and genes affecting cell viability (essential genes) and non-essential genes serving as crispr screening controls. genes are listed in supplementary information, table s . guide rnas were designed with crispr-focus. oligo pool was synthesized by synbio technologies (suzhou) and cloned into lentiguide-puro by using easygeno assembly kit (vi ; tiangen). pooled sgrna viruses were used to infect ch f or v-abl cell lines stably-expressing cas protein. resulting cell pools were selected with puromycin for days (day − to day ) allowing efficient ko of targeted genes, before subjected to chemical treatment, csr or v(d)j recombination assay. chemical treatment in ch f cells. inhibitory concentration (ic) was determined for each chemical in b-lineaged ch f cells (supplementary information, table s ). gene knockout cells were cultured with dna damage chemicals at ic for days (from day to day ) before harvested for genomic dna purification. dmso at a concentration of . % was included as a control. library preparation and data analysis. the sgrna sequences were pcr-amplified according to previous reported method, and subjected to illumina sequencing. the raw reads were trimmed with fastx_trimmer in fastx-toolkit, and sgrna sequences were counted by mageck. clonal batch difference was controlled by using a batch-removal sub-module in mageck-flute, and enriched genes were retrieved with mageck test subcommand. cluster of chemical treatment result. we performed hierarchical clustering for both drugs and genes by r programming language. 
first, we used dist function to compute euclidean distances for drugs or genes by using z-score computed by mageck-mle submodule. then the distance matrix was passed to hclust function for clustering using ward.d method. exclude cell fitness genes and identify csr/v(d)j recombination factors. we found that cell fitness genes which affect cell viability or growth rate often were retrieved in different crispr screening assays as false positive hits. thus, we identified the cell fitness genes in dna repair gene list by comparing recovered sgrnas at day after puromycin selection and sgrnas in viral vector. the list was named as "genes_affecting_cellular_fitness" in supplementary information, table s . after retrieving the enriched genes from csr or v(d)j recombination screenings, cell fitness genes were first removed from the result. a false discovery rate (fdr) < . was further applied. chemical sensitivity assay cells were plated at a concentration of × e cells/ml with indicated chemicals or treated with indicated doses of x-ray or uv. after h, cell viability was measured with a cell counting kit- assay (k ; apexbio). the survival data were fit to a mixedeffects model using lmer function from the r package 'lme ', with "dosage" and "genotype" as fixed-effects parameters, and "cell viability data from each repeat" as the random effect parameter. the significance of the fixed effects parameters is obtained by the t-test. antibody class switch recombination assays aid-initiated csr assay. ch f cells were stimulated with α-cd ( - - ; ebioscience), il (ck ; novoprotein) plus tgf-β (ca ; novoprotein), and csr to iga were monitored at day and . splenic naive b cells were purified and cultured as previous described. csr to igg or igg was monitored at day and . cas -initiated csr assay. 
for crispr/cas -initiated csr (cas-csr) in ch f cells, test cells were mixed with control ch f -mcherry cells at a ratio of : , and sgrnas targeting up- and downstream s regions were transfected into the mixture via electroporation. the ch f -mcherry controls for electroporation decreased technical variation among transfections. the csr level to other ig in ko cells was first normalized to the csr level of the ch f -mcherry cells transfected in the same cuvette, and the relative csr level was defined as the ratio of the normalized csr levels in ko and parental ch f cells, as below:
ko csr level at transfection # : a = (ko igg+ /ko all ) / (mcherry# igg+ /mcherry# all )
wt csr level at transfection # : b = (wt igg+ /wt all ) / (mcherry# igg+ /mcherry# all )
relative cas-csr level = a/b.
it is of note that in the cas-csr assay, the cas -generated break levels among different genotypes cannot be quantitatively revealed by the current technology. thus, a decreased cas-csr level could also result from low cas cutting efficiency in a specific genotype.
somatic hypermutation assay
peyer's patch gc b cells (b + pna hi ) were sorted from indicated mice. j h and jκ introns were pcr amplified as previously reported , and the pcr products were further tagged with illumina p and p index primers and subjected to illumina hiseq. data were analyzed as previously described.
chromosomal v(d)j recombination assay
the v-abl cells were infected with inv-invert-hcd a cassette and enriched by using anti-hcd microbeads ( - - ; miltenyi). cells which had already undergone v(d)j recombination were removed with hcd microbeads ( - - ; miltenyi) at each experiment. cells were stimulated with sti- for h and h, and then subjected to flow cytometry and/or southern blot. southern blot was performed as previously described. for each sample, μg genomic dna was digested by nde i/nhe i and nco i/nhe i, respectively. a bp hind iii/nhe i fragment of hcd sequence was used as the probe.
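The two-step normalization given for the cas-CSR assay (A and B each normalized to the co-transfected mCherry control, then relative level = A/B) can be written out as a short calculation. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, inputs are raw flow cytometry event counts, and "switched" stands for whichever isotype-positive gate is being scored. It is not the authors' analysis code.

```python
def csr_fraction(switched, total):
    """Fraction of cells in a gate that have switched isotype."""
    if total <= 0:
        raise ValueError("total cell count must be positive")
    return switched / total


def normalized_csr(test_switched, test_total, ctrl_switched, ctrl_total):
    """CSR level of test cells divided by the CSR level of the mCherry
    control cells from the same cuvette (A or B in the text)."""
    ctrl = csr_fraction(ctrl_switched, ctrl_total)
    if ctrl == 0:
        raise ValueError("control cells show no switching; cannot normalize")
    return csr_fraction(test_switched, test_total) / ctrl


def relative_cas_csr(ko, ko_ctrl, wt, wt_ctrl):
    """Relative cas-CSR level = A / B.

    Each argument is a (switched, total) pair of event counts:
    ko / ko_ctrl come from the knockout transfection, and
    wt / wt_ctrl come from the parental (wild-type) transfection.
    """
    a = normalized_csr(*ko, *ko_ctrl)
    b = normalized_csr(*wt, *wt_ctrl)
    if b == 0:
        raise ValueError("wild-type normalized CSR is zero")
    return a / b


if __name__ == "__main__":
    # Toy counts, not data from the study.
    print(relative_cas_csr(ko=(100, 1000), ko_ctrl=(200, 1000),
                           wt=(300, 1000), wt_ctrl=(300, 1000)))
```

Normalizing each genotype to the control cells in its own cuvette before taking the ratio is what lets transfections with different electroporation efficiencies be compared; as the text notes, it does not correct for genotype-specific differences in cutting efficiency.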
lymphocyte development
flow cytometry was applied to assay lymphocyte development. bone marrow and lymphocyte cellularity were calculated by using a hemocytometer. in brief, bone marrow cells were isolated by flushing or by crushing the long bones with a mortar and pestle in ca + and mg + free hbss with % heat-inactivated serum. spleen and thymus cells were obtained by crushing the tissue between two glass slides. the cells were filtered through a μm nylon mesh before staining.
gene deletion and complementation in cell line
for gene deletion in cell lines, pairs of sgrnas were designed with the ssc program and selected based on the published gene knockout strategy or gene function domains. a gfp-expressing plasmid and px -based crispr/cas plasmids were cotransfected into cells. twenty-four hours after transfection, gfp-high cells were sorted with a bd facs aria iii and plated as single clones in -well plates. individual clones were genotyped by pcr (supplementary information, table s ) and positive clones were further confirmed by western blot or rt-qpcr. despite many attempts, we were unable to obtain an anti-ercc l antibody or serum to detect the endogenous ercc l protein. thus, the knockout of ercc l was confirmed by pcr-genotyping and rna-seq. ercc l plasmids were constructed by pcr with total cdna templates from either mouse activated b cells or human hek t cells. the gene was cloned into a lentiviral vector. the resulting lentiviral particles were used to generate ercc l or ercc l -mutant cell lines. empty vector (ev) was used as control. the ercc l mutant genes were expressed at a similar level to the wt gene under the experimental conditions.
laser micro-irradiation
laser microirradiation was performed as described. briefly, u os cells were plated on mm diameter glass-bottom plates (d - - -n, cellvis) at × e cells/ml. on the next day, μg gfp-gene fusion plasmid with . μg mcherry-pcna plasmid were transfected into cells.
for mefs, μg gfp-gene fusion plasmid was transfected into cells by electroporation (vca- , lonza nucleofector b) following the manufacture's instruction. the next day after transfection, cells were incubated with μm -brdu (hy- ; medchemexpress) overnight and exposed to μg/ml hoechst (c ; beyotime technology) for min just before irradiation. laser microirradiation was performed with nikon a confocal microscope and a nm laser with % energy. the fluorescence data were fit to a mixed-effects model using lmer function from the r package 'lme ', with "acquired time" and "genotype" as fixed-effects parameters, and "data from each observed cell" as the random effect parameter. the significance of the fixed effects parameters is obtained by the ttest. co-transfection of gfp tagged nhej subunits and xflag-ercc l or ha-ercc l in hek t cells was performed using lipofectamine according to the manufacturer's instruction. after h, cells were washed, scraped and lysed for min in lysis buffer ( mm tris-hcl ph . , mm nacl, . mm mgcl , % glycerol, . % np- ), supplemented with complete protease inhibitor mix ( , roche), u/ml benzonase (e , sigma-aldrich) and μg/ml ethidium bromide. lysates were centrifuged at , g for min at °c, and supernatants were incubated with μl anti-gfp nanobody beads (ktsm , shenzhen kt life technology), anti-flag conjugated agarose beads (m , abmart) or anti-ha conjugated agarose beads (m , abmart) for h at °c. beads were washed times with wash buffer ( mm tris-hcl ph . , mm nacl, . mm mgcl , % glycerol, . % np- ) and boiled in μl × sds buffer. proteins were analyzed by immunoblotting. for immunoprecipitation from activated splenic b cells, cells were stimulated with lps/il for days, collected, washed and processed as above. htgts was performed as previously reported. in primary b cell csr assay, s region rearrangements were cloned from endogenous aid-initiated sγ breaks with ′-red-iμ primer. 
the htgts cloning primers are listed in supplementary information, table s . s junction and resection ratios were plotted and calculated as previously described. transcription analyses rna-seq and data analysis prepare rna-seq library: total rna was isolated using trizol reagents ( ; life technologies) and was treated with dnase i (m ; promega). rna-seq libraries were prepared using kapa stranded rna-seq kits with riboerase (hmr) (kr ; kapa) following the manufacturer's instructions. then these libraries were sequenced with illumina hiseq at geneseeq company. about~ million of × bp paired reads were retrieved for each library. two biological replicates of mouse naive and csractivated b cells were performed for each genotype. two repeats were performed in parental ch f cell line along with two ercc l −/− clones. one repeat was performed in parental u os cell line along with its isogenic ercc l −/− cell line. genome-wide analysis: the raw reads were first cleaned with cutadapt (v. . ) cut adapter sequences and mapped to mouse or human rrna reference using bowtie (v. . . ) to filter out rrna reads. then the filtered reads were mapped to mouse (mm ) or human (hg ) reference genome with star (v. . . a) using default parameters. htseq (v. . . ) and deseq (v. . . ) were used for differential gene expression analysis. package bamcoverage from deeptools (v. . . ) was used to generate bigwig files from bam files and igv was used for downstream visualization. prepare pro-seq library: pro-seq was performed according to previously published protocol. for primary b cells, wt and ercc l −/− b cells were harvested at h after csr-activation, and three biological replicates of wt b cells and four biological replicates of ercc l −/− b cells were subjected to pro-seq. data analysis: pro-seq data were first cut to remove adapter sequences with cutadapt (v. . ), then mapped to mouse reference genome (mm ) by bowtie (v. . . ). transcribed regions were defined as previously described. 
only the coordinate of the last base at the ′ end of each read was extracted for pausing index calculation. pausing index was defined as previously described. briefly, we calculated the ratio of the mean count in the window from − bp to bp relative to the tss to the mean count over the remaining length of the gene, for ncbi refseq genes longer than kb. quantification and statistical analysis statistical analyses were performed using r (version . . , r foundation for statistical computing, vienna, austria; url http://www.r-project.org) or graphpad prism (version . . ). the number of replicates and the statistical test procedures are indicated in the figure legends. all sequencing data generated in this study, including crispr screening, rna-seq, pro-seq, shm amplicon-seq and htgts data, are deposited in the ncbi sequence read archive (sra accession: prjna ). all other data are available from the authors on request. supplementary information accompanies this paper at https://doi.org/ .
/ s - - - .competing interests: the authors declare no competing interests. key: cord- -nkql h x authors: muus, christoph; luecken, malte d.; eraslan, gokcen; waghray, avinash; heimberg, graham; sikkema, lisa; kobayashi, yoshihiko; vaishnav, eeshit dhaval; subramanian, ayshwarya; smilie, christopher; jagadeesh, karthik; duong, elizabeth thu; fiskin, evgenij; triglia, elena torlai; ansari, meshal; cai, peiwen; lin, brian; buchanan, justin; chen, sijia; shu, jian; haber, adam l; chung, hattie; montoro, daniel t; adams, taylor; aliee, hananeh; samuel, j.; andrusivova, allon zaneta; angelidis, ilias; ashenberg, orr; bassler, kevin; bécavin, christophe; benhar, inbal; bergenstråhle, joseph; bergenstråhle, ludvig; bolt, liam; braun, emelie; bui, linh t; chaffin, mark; chichelnitskiy, evgeny; chiou, joshua; conlon, thomas m; cuoco, michael s; deprez, marie; fischer, david s; gillich, astrid; gould, joshua; guo, minzhe; gutierrez, austin j; habermann, arun c; harvey, tyler; he, peng; hou, xiaomeng; hu, lijuan; jaiswal, alok; jiang, peiyong; kapellos, theodoros; kuo, christin s; larsson, ludvig; kyungtae lim, michael a. leney-greene; litviňuková, monika; lu, ji; maatz, henrike; madissoon, elo; mamanova, lira; manakongtreecheep, kasidet; marquette, charles-hugo; mbano, ian; mcadams, alexi marie; metzger, ross j; nabhan, ahmad n; nyquist, sarah k.; ordovas-montanes, jose; penland, lolita; poirion, olivier b; poli, sergio; qi, cancan; reichart, daniel; rosas, ivan; schupp, jonas; sinha, rahul; sit, rene v; slowikowski, kamil; slyper, michal; smith, neal; sountoulidis, alex; strunz, maximilian; sun, dawei; talavera-lópez, carlos; tan, peng; tantivit, jessica; travaglini, kyle j; tucker, nathan r.; vernon, katherine; wadsworth, marc h.; waldmann, julia; wang, xiuting; yan, wenjun; zhao, william; ziegler, carly g. k. 
title: integrated analyses of single-cell atlases reveal age, gender, and smoking status associations with cell type-specific expression of mediators of sars-cov- viral entry and highlights inflammatory programs in putative target cells date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: nkql h x the covid- pandemic, caused by the novel coronavirus sars-cov- , creates an urgent need for identifying molecular mechanisms that mediate viral entry, propagation, and tissue pathology. cell membrane bound angiotensin-converting enzyme (ace ) and associated proteases, transmembrane protease serine (tmprss ) and cathepsin l (ctsl), were previously identified as mediators of sars-cov cellular entry. here, we assess the cell type-specific rna expression of ace , tmprss , and ctsl through an integrated analysis of single-cell and single-nucleus rna-seq studies, including lung and airways datasets ( unpublished), and datasets from other diverse organs. joint expression of ace and the accessory proteases identifies specific subsets of respiratory epithelial cells as putative targets of viral infection in the nasal passages, airways, and alveoli. cells that co-express ace and proteases are also identified in other organs, some of which have been associated with covid- transmission or pathology, including gut enterocytes, corneal epithelial cells, cardiomyocytes, heart pericytes, olfactory sustentacular cells, and renal epithelial cells. performing the first meta-analyses of scrna-seq studies, we analyzed , , cells from nasal, airway, and lung parenchyma samples from donors spanning fetal, childhood, adult, and elderly age groups, and associated increased levels of ace , tmprss , and ctsl in specific cell types with increasing age, male gender, and smoking, all of which are epidemiologically linked to covid- susceptibility and outcomes. notably, there was particularly low expression of ace in the few young pediatric samples in the analysis.
further analysis reveals a gene expression program shared by ace +tmprss + cells in nasal, lung and gut tissues, including genes that may mediate viral entry, subtend key immune functions, and mediate epithelial-macrophage cross-talk. amongst these are il , its receptor and co-receptor, il r, tnf response pathways, and complement genes. cell type specificity in the lung and airways and smoking effects were conserved in mice. our analyses suggest that differences in the cell type-specific expression of mediators of sars-cov- viral entry may be responsible for aspects of covid- epidemiology and clinical course, and point to putative molecular pathways involved in disease susceptibility and pathogenesis. covid- is a global health threat due to its rapid spread, morbidity, and mortality. despite progress in viral identification, sequencing of the full viral genome, creation of initial diagnostics, and the development of therapeutic hypotheses, many outstanding hurdles remain. these include deciphering the basis of the increased risk associated with certain demographic groups and identifying molecular mechanisms of disease pathogenesis. the clinical presentation and transmission of covid- is complex. common symptoms include fever, cough, shortness of breath, chest pain, malaise, fatigue, headache, myalgias, anosmia, and diarrhea, while laboratory and radiographic findings include lymphopenia and ground-glass opacities on chest imaging, respectively [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . of an initial cohort of , hospitalized patients diagnosed with covid- , many developed diffuse alveolar damage (dad) , pneumonia ( . %), not infrequently complicated by acute respiratory distress syndrome (ards, . %), and shock ( . %), with . % of patients requiring icu admission and . % requiring ventilation . 
as the number of patients has surged, multi-system pathologies have been increasingly described, including kidney injury , liver injury, gastrointestinal symptoms , cardiac injury and dysfunction , , , and multiorgan failure [ ] [ ] [ ] [ ] . in addition to nasal and throat secretions, sars-cov- rna has also been detected in saliva and stool specimens , , suggesting possible alternative routes of transmission beyond respiratory droplets , . sars-cov- may also infect the testis, similarly to sars-cov , . vertical transmission from mother to fetus remains a possibility. at least five neonates born to pregnant women with covid- pneumonia were reported to test positive for sars-cov- infection after birth [ ] [ ] [ ] and other studies report newborns with elevated virus-specific antibodies to sars-cov- born to mothers with covid- , . however, several other studies have thus far failed to find evidence for intrauterine transmission from pregnant women with covid- to their newborns in cohorts as large as patients [ ] [ ] [ ] . additionally, newborns from covid- patients who had cesarean deliveries in their third trimester tested negative for sars-cov- . there is substantial variation in the clinical consequences of infection across individuals, ranging from asymptomatic carrier status to death. it has been suggested that undocumented subclinical infection contributes to the rapid dissemination of the virus . as of april , , covid- has caused , , confirmed infections and , deaths worldwide (https://coronavirus.jhu.edu/map.html). while true case fatality rate (cfr) is difficult to assess early in an epidemic [ ] [ ] [ ] , estimates from modeling studies range from . % - . % , . disease severity and mortality rates show a striking rise with age , with cfr estimates ranging from < . % for patients under years old to > % for those over , with a slightly higher incidence and mortality in men , . 
children are significantly less likely than adults to develop severe disease, and reported pediatric deaths are rare . smoking is most likely associated with more severe disease , . finally, adults with pre-existing cardiovascular disease and acute myocardial injury have higher rates of disease acuity and death , . the coronavirus non-segmented, positive sense rna genome of ~ kb contains coding regions for the expression of structural proteins including spike (s), envelope (e), membrane (m), and nucleocapsid (n) proteins. virion recognition of host cells is initiated by interactions between the s protein and its receptor . ace [ ] [ ] [ ] , an essential regulator of the renin-angiotensin system , is the receptor for both sars-cov and sars-cov- . the receptor-binding domain of the sars-cov- s-protein has a higher binding affinity for human ace than sars-cov , , whereas the interaction with cd (encoded by the gene bsg), another reported receptor for the sars-cov- s-protein, is weak (cd : kd, . μm vs hace : kd, ~ nm) . following receptor binding, the virus gains access to the host cell cytosol through acid-dependent proteolytic cleavage of the s protein. for sars-cov, a number of proteases including tmprss and ctsl cleave at the s and s boundary and s domain (s ') to mediate membrane fusion and virus infectivity. for sars-cov- , both pharmacological inhibition of endogenous tmprss protein and tmprss overexpression support a role for tmprss -mediated cellular entry , . the identification of the specific cell types that can be infected by sars-cov- will inform our understanding of disease transmission and pathogenesis, which are often cell context-specific. studies suggest that key infection routes involve the nasal passages, airways, and alveoli, where epithelial cells play a key barrier role. identifying putative target cells in other organs could inform our understanding of extra-pulmonary covid- associated organ failure or of potential placental transmission. 
early analyses of the human lung cell atlas revealed that some of the cells of the nasal passages, airways, alveoli, and gut co-express ace and tmprss , . here, we perform integrated analysis of single-cell and single-nucleus rna-seq studies, including studies of the lung and airways, and additional studies of other diverse tissues, spanning both published and unpublished datasets. we comprehensively define the expression patterns of the ace viral receptor and accessory proteases genes. we test how their expression is related to age (from prenatal to old age), sex, and smoking status. we identify gene expression programs associated with cells that can be infected by the virus and compare these programs across specific cell types, organs, and species. to further inform future studies, we assess the conservation of these human features in mouse models and explore the expression of other proteases that may play a role in the viral replication cycle. previous analyses of human cell atlas datasets established that ace , the viral receptor, and one of its entry-associated proteases, tmprss , are expressed in nasal, lung, and gut epithelial cells . specifically, nasal goblet cells and multiciliated cells comprised the highest fraction of dual-positive ace + tmprss + cells , consistent with a plausible role for a nasal viral reservoir that supports transmissivity. in the distal lung, co-expression occurred in at cells , , . previous surveys across other tissues also showed a relatively high portion of ace + tmprss + cells within colonic enterocytes, another potential viral reservoir that promotes viral transmission . to perform a comprehensive survey, we enumerated the proportion of dual-positive ace + tmprss + cells and ace + ctsl + cells across human studies (including seven of the lung and airways) with single-cell or single-nucleus rna-seq (sc/snrna-seq) ( fig. , methods, supplementary table and ). 
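as a minimal sketch of the enumeration described above, the dual-positive proportion can be computed from per-cell counts as follows. this assumes one vector of raw counts per gene, in the same cell order, and treats a cell as positive for a gene when it has at least one count; the data layout, the threshold, and the placeholder gene keys are illustrative assumptions, not the pipeline used in this study.

```python
from typing import Mapping, Sequence


def dual_positive_fraction(
    counts: Mapping[str, Sequence[int]],
    gene_a: str,
    gene_b: str,
    min_count: int = 1,
) -> float:
    """Return the fraction of cells with >= min_count counts for both genes.

    `counts` maps a gene name to its per-cell counts (same cell order for
    every gene). The positivity threshold is an assumption of this sketch.
    """
    a, b = counts[gene_a], counts[gene_b]
    if len(a) != len(b):
        raise ValueError("both genes must be measured over the same cells")
    if len(a) == 0:
        return 0.0
    n_dual = sum(1 for x, y in zip(a, b) if x >= min_count and y >= min_count)
    return n_dual / len(a)


# hypothetical toy counts for four cells; the gene names are placeholders
toy_counts = {"receptor": [0, 2, 1, 0], "protease": [1, 3, 0, 0]}
fraction = dual_positive_fraction(toy_counts, "receptor", "protease")
```

in this toy input only the second cell has counts for both genes, so one of four cells is dual-positive. in practice the same function would be applied separately within each annotated cell type or tissue.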
these included a large survey of published datasets from diverse tissues, which we assigned to five broad cell categories, (fig. a,b , extended data fig. , , supplementary table ) . we further analyzed more finely annotated published and unpublished datasets (methods, fig. c ,d, supplementary table ) . consistent with previous reports , dual-positive ace + tmprss + cells in the proximal airways were largely secretory goblet and multiciliated cells, and dual-positive cells in the distal lung were largely at cells (fig. c, extended data fig. a) . ace expression in secretory (especially goblet) and at cells is also supported by scatac-seq from the primary carina and subpleural parenchyma, respectively (fig. , n= samples per location, n= patient), showing accessibility at the ace locus in a portion of at cells ( . %, out of , cells, methods), as well as secretory and multiciliated cells, and to a lesser extent some basal and tuft cells (fig. a-c) . the proportion of at cells with an open ace locus is somewhat higher than of ace + at cells by scrna-seq in the same patient and region (pleura: . %, out of , cells vs. . %, out of , cells). cells with accessible chromatin at both the ace and tmprss loci were also most commonly found in epithelial cells, especially at cells (fig. d, . %, of , at cells from a subpleural sample; comparable to . %, of , in matched scrna-seq), secretory cells ( . %, of secretory cells from small airway of the subpleural region, and %, of secretory cells from primary carina of the large airway), and multiciliated cells ( . %, of multiciliated cells from subpleural sample, and . %, out of multiciliated cells from primary carina). there were dual-positive ace + tmprss + cells in tissues beyond the respiratory system ( fig. 
a-c) , including enterocytes, pancreatic ductal cells, prostate luminal epithelial cells, cholangiocytes , oligodendrocytes in the brain, inhibitory enteric neurons, heart fibroblasts/pericytes , and fibroblasts and pericytes in multiple other tissues (fig. c) . ace + tmprss + epithelial cells were most prevalent (in order) within the ileum, liver, lung, nasal mucosa, bladder, testis, prostate, and kidney (fig. a) . enterocytes had a substantial proportion of dual-positive cells (fig. c) , and are possibly part of a renin-angiotensin multicellular circuit . in line with the kidney's role in the renin-angiotensin-aldosterone system, dual-positive cells are enriched in the proximal tubular cells and in principal cells of the collecting duct (fig. a,c) . interestingly, brain oligodendrocytes, multiciliated and sustentacular cells in olfactory epithelium, at cells in non-smoker lung, and ductal cells in pancreas, all ace + tmprss + , were also all enriched for myrf, a transcription factor necessary for myelination in the brain and sufficient to induce expression of the myelin proteins mog (myelin oligodendrocyte glycoprotein) and mbp (myelin basic protein) . ace + ctsl + cells were enriched in additional subsets associated with covid- pathology, most notably the olfactory epithelium, ventricular cardiomyocytes, heart macrophages, and pericytes in multiple tissues, including the heart, lung, and kidney (fig. d) . the presence of dual-positive cells in the lung, heart, and kidney suggests that cells in these organs may be direct targets of viral infection and pathology , . dual-positive cells in the sustentacular and basal cells of the olfactory epithelium ( fig. c ) may be associated with a loss of the sense of smell . dual-positive cells in the corneal and conjunctival epithelium may contribute to viral transmission , . dual-positive cardiomyocytes may be related to "direct" cardiomyocyte damage (see tucker et al.
companion manuscript ), whereas heart pericytes may indicate a vascular component to the cardiac dysfunction, and could contribute to increased troponin leak in patients without coronary artery disease. notably, the fraction of ace -expressing heart pericytes in another dataset (tucker et al) is even higher ( %) than in any other tissue dataset analyzed here (max. . % in kidney, extended data fig. ) . despite the lymphopenia observed with covid- , , , , we did not typically observe ace mrna expression in scrna-seq profiles in the bone marrow or cord blood (fig. a,b) , although there was ace expression in some tissue macrophages, including alveolar and heart macrophages (extended data fig. ) . further studies of ace rna and protein expression in covid- disease tissue will help elucidate its expression in immune cells . to validate our findings from scrna-seq analysis and to determine the spatial expression patterns of ace , tmprss , and ctsl, and their corresponding proteins, we performed fluorescence in situ hybridization and immunohistochemistry on tissue sections of airway and alveoli from healthy donor lungs that were rejected for lung transplantation. first, we performed triple fluorescence in situ hybridization to identify ace , ctsl and tmprss on alveolar sections. we observed co-expression, albeit at low levels, of all three genes in alveolar cells (fig. e) . we then performed co-staining with cell type-specific markers. we observed ace transcripts in a subset of type (at ) cells identified by canonical at protein markers, htii- and pro-sftpc (fig. f,g) . similarly, we observed tmprss gene expression in htii- + at cells (fig. f) . immunostaining for tmprss protein further confirmed at cell expression (extended data fig. a) . we also observed tmprss protein expression at low levels in some at cells identified by the canonical at protein marker ager (extended data fig. a ). of note, some non-epithelial cells also expressed these three genes.
we further validated the expression of ace by bulk mrna-seq in sorted at cells, including those from long-term cultured alveolar organoids (extended data fig. b) . we then performed immunohistochemistry and deployed three different available putative ace antibodies to establish ace protein expression (supplementary table ). one of these antibodies, the one used previously to functionally block cellular viral entry, specifically labeled adult pro-sftpc-positive at cells (extended data fig. c) . as a cautionary note, the lack of agreement between antibody staining patterns suggests that some of these antibodies may be non-specific. previous studies have revealed that ace is highly enriched in mucous cells of the nasal and lower respiratory tract epithelium . in healthy lungs, both large and small airways contain mucous cells in surface airway epithelium, albeit at low numbers. however, submucosal glands (smgs) that reside deep within the airway tissues are composed of abundant mucous cells. to test whether these mucous cells also express ace , tmprss , and ctsl, we performed an integrated analysis on scrna-seq datasets obtained from microdissected smgs of healthy donors. we observed overlapping expression and relatively high enrichment of ace , tmprss and ctsl in mucous cells of the smgs (extended data fig. e ). in situ transcript analysis for ace further confirmed the presence of transcripts in acinar epithelial cells of the smgs (fig. h) , and cells expressing ace in the large airway epithelium (fig. i) . we next sought to understand how the expression of each of these three key genes (ace , tmprss , and ctsl) in specific cell subsets may relate to three key covariates that have been associated with disease severity: age (older individuals are more severely affected), sex (males are more severely affected), and smoking (smokers are more severely affected) .
we integrated samples across many studies, as no single dataset generated to date is sufficiently large to address this question. we assembled datasets (supplementary table , supplementary data file d ), comprised of , , cells from individuals, spanning healthy nasal, lung, and airway samples profiled by scrna-seq or snrna-seq from either biopsies, resections, entire lungs that could not be used for transplant, or post mortem examinations, allowing us to study a diversity of respiratory regions and cell types (fig. a) . these included published datasets [ ] [ ] [ ] [ ] [ ] [ ] and datasets that are not yet published [ ] [ ] [ ] [ ] [ ] . in the case of unpublished data, we only obtained single-cell expression counts for the three genes, as well as the total umi counts per cell, cell identity annotations, and the relevant anonymous clinical variables (age and sex, as well as smoking status when ascertained). cell identity annotations were manually harmonized using an ontology with three levels of annotation specificity (fig. b, supplementary table ) ; focusing on levels and allowed us to include a large number of datasets, while retaining relatively high cell subset specificity (fig. a,b) . to facilitate rapid data sharing, we analyzed data pre-processed by each data-generating team at the level of gene counts, using total counts as a size factor. we used poisson regression (diffxpy package; methods) to model the association between the expression counts of the three genes and age, sex, and smoking status, and their possible pair-wise interactions (fig. c) , using total counts as an offset, and dataset as a technical covariate to capture sampling and processing differences. it should be noted that modeling interaction terms was crucial as their omission resulted in reversed effects for age and sex for particular cell types (discussion). 
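the role of the offset in such a model can be illustrated with a deliberately reduced sketch: a poisson regression with a single covariate, fitted by newton's method, in which the log of the total counts per cell enters as an offset. this is not the diffxpy implementation and it omits the interaction terms and the dataset covariate described above; the function name and the toy data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_poisson_offset(
    y: Sequence[float],
    x: Sequence[float],
    total: Sequence[float],
    n_iter: int = 50,
    tol: float = 1e-12,
) -> Tuple[float, float]:
    """Fit log E[y_i] = log(total_i) + b0 + b1 * x_i by Newton's method.

    Assumes sum(y) > 0, all totals > 0, and a covariate x that varies.
    """
    b0 = math.log(sum(y) / sum(total))  # start from the pooled rate
    b1 = 0.0
    for _ in range(n_iter):
        mu = [t * math.exp(b0 + b1 * xi) for t, xi in zip(total, x)]
        # gradient of the Poisson log-likelihood
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu))
        # Fisher information (negative Hessian), a 2x2 matrix
        h00 = sum(mu)
        h01 = sum(xi * m for xi, m in zip(x, mu))
        h11 = sum(xi * xi * m for xi, m in zip(x, mu))
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("covariate has no variation; model not identifiable")
        # Newton step: solve the 2x2 system explicitly
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# hypothetical toy data: gene counts, a binary covariate, total counts per cell
y = [1, 3, 4, 8]
x = [0, 0, 1, 1]
total = [100, 300, 100, 200]
b0, b1 = fit_poisson_offset(y, x, total)
```

because the offset fixes the coefficient of log total counts at one, the fitted coefficients describe expression rates per total count rather than raw counts. with a binary covariate, the slope equals the log ratio of the pooled rates in the two groups, which gives a simple check on the fit.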
this model was fitted to non-fetal lung data ( , cells, samples, donors, datasets) within each cell type to assess cell-type specific association of these covariates with the three genes. to further validate sex and age associations, we fit a simplified version of the model without smoking status covariates to the full non-fetal lung data ( , cells, samples, donors, datasets). uncertainty is challenging to model in our single-cell meta-analysis as variability exists on the levels of both donors and cells. for simplicity, we modeled the overall variance with both contributions covered implicitly by treating each cell as an independent observation. as cells from the same donor cannot typically be regarded as independent observations, this can result in inflated significance (overly small p-values), especially when there are few donors for a particular cell type. to counteract this limitation, we employed three approaches: ( ) we used a simple noise model (poisson) to reduce the chance of overfitting donor variability to obtain spurious associations; ( ) we confirmed significant associations from the single-cell model in a pseudo-bulk analysis to ensure effect directions are consistent when modeling only donor variation (methods, fig. d ); and ( ) we investigated whether significant associations change direction when holding out any one dataset to ensure that the effect is not dominated by the inclusion of many cells from only one source (methods, fig. f, supplementary data d ) . we regarded an association that passes all of these validations as a robust trend, while associations that appear dominated by a single dataset (often because this dataset is a major contributor of a given cell type) were denoted as indications. we focused on trends or indications in those cell types where both tmprss and ace are predominantly expressed in the lung: airway epithelial cells (basal, multiciliated, and secretory cells), alveolar at cells, and submucosal gland secretory cells (fig. e) .
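the pseudo-bulk and hold-out checks described above can be sketched as follows. this is a simplified illustration under stated assumptions: cells are summed per donor, the effect is a log2 ratio of mean per-donor rates between two groups, donor identifiers are unique across datasets, and an association is called an indication when holding out any single dataset flips its sign or leaves it undefined. the record layout and function names are hypothetical, and the study's actual tests are described in its methods.

```python
import math
from collections import defaultdict
from typing import List, Optional, Sequence, Tuple

# (dataset, donor, group, gene_count, total_count); group is 0 or 1
Cell = Tuple[str, str, int, int, int]


def pseudobulk_log2_fc(cells: Sequence[Cell]) -> Optional[float]:
    """log2 ratio of mean per-donor expression rates, group 1 over group 0."""
    gene_sum = defaultdict(int)
    total_sum = defaultdict(int)
    donor_group = {}
    for _dataset, donor, group, gene_count, total_count in cells:
        gene_sum[donor] += gene_count      # pool all cells of a donor
        total_sum[donor] += total_count
        donor_group[donor] = group
    rates = {0: [], 1: []}
    for donor, group in donor_group.items():
        if total_sum[donor] > 0:
            rates[group].append(gene_sum[donor] / total_sum[donor])
    if not rates[0] or not rates[1]:
        return None  # one group has no donors left
    mean0 = sum(rates[0]) / len(rates[0])
    mean1 = sum(rates[1]) / len(rates[1])
    if mean0 <= 0 or mean1 <= 0:
        return None  # log ratio undefined
    return math.log2(mean1 / mean0)


def classify_association(cells: Sequence[Cell]) -> Tuple[str, List[str]]:
    """Label an effect by whether its direction survives every hold-out."""
    full = pseudobulk_log2_fc(cells)
    if full is None or full == 0:
        return "none", []
    dependent = []
    for held_out in sorted({c[0] for c in cells}):
        rest = [c for c in cells if c[0] != held_out]
        effect = pseudobulk_log2_fc(rest)
        if effect is None or (effect > 0) != (full > 0):
            dependent.append(held_out)
    label = "robust trend" if not dependent else "indication"
    return label, dependent


# hypothetical toy data: two datasets, one donor per group in each
consistent = [("A", "a1", 0, 1, 100), ("A", "a2", 1, 4, 100),
              ("B", "b1", 0, 2, 100), ("B", "b2", 1, 6, 100)]
dominated = [("A", "a1", 0, 1, 100), ("A", "a2", 1, 10, 100),
             ("B", "b1", 0, 4, 100), ("B", "b2", 1, 2, 100)]
result_consistent = classify_association(consistent)
result_dominated = classify_association(dominated)
```

in the second toy input the overall positive effect comes entirely from dataset "A", so removing it reverses the direction and the association is reported as an indication that depends on that dataset.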
strikingly, we find robust trends of ace expression with age, sex, and smoking status in these cell types (fig. d , extended data fig. ): ace expression increases with age in basal and multiciliated cells. ace expression is elevated in males in airway secretory cells and alveolar at cells. furthermore, we find strongly elevated levels of ace in past or current smokers in multiciliated cells (log fold change (log fc): . , fig. d ). significant associations of ace expression indicate increased expression with age in at (largest age effect: slope of log expression per year of . ) and secretory cells, and increased expression of ace in males in multiciliated cells. further indications associate past or current smoking with decreased ace expression in at cells, and increased ace expression in basal cells. however, these last five indications are not robust trends and depend on the inclusion of a single dataset (fig. f) , often because that dataset contributes a large number of cells of a particular type. specifically, when we held out the largest declined donor transplant dataset (supplementary table , "regev-rajagopal", most cells and most samples), a declined donor tracheal epithelium dataset ("seibold", supplementary table , most donors in the smoking analysis), or a further declined donor lung dataset ("kropski-banovich", supplementary table ) respectively, the effect is no longer present (methods, fig. f , supplementary data d ). the above trends and indications for sex and age were further validated in the simplified model on the full non-fetal lung dataset (extended data fig. , supplementary data d ) . with the exception of the age association in basal and secretory cells, all associations were found to be significant at a false discovery rate (fdr) threshold of %, confirmed by pseudo-bulk analysis, and were supported at least at the level of indication. indeed, all robust trends were also supported as robust trends in the simplified model.
fitting the simplified model on the smoking data subset shows that modeling smoking status is crucial to detect the basal and secretory age association. without modeling the effect of smoking status on basal and secretory cell ace expression, the variance it explains is captured as uncertainty, against which the age effect is evaluated as not significant. taking into account smoking status is of particular importance as the effect sizes associated with smoking tend to be much larger than age effects, and tend to be larger than sex effects. for example, in multiciliated cells the effect sizes assigned to smoking status, sex, and age associations with ace expression are β= . , β= . , and β= . respectively, where β represents the log fc for smoking status and sex, and the slope of log expression per year for age. examining joint trends of ace and the protease genes within the same cell type, there are indications of up-regulation of both ace and tmprss in multiciliated cells (ace indication dependent on "seibold" dataset) in males and with age (both indications dependent on "regev-rajagopal" dataset). in at cells, there is an indication of joint up-regulation of ace and tmprss with age (dependent on "regev-rajagopal" dataset), and an indication of ace and ctsl down-regulation in smokers (dependent on "kropski-banovich" dataset). all above joint trends for age and sex covariates were confirmed on the full non-fetal lung data using the simple model without smoking covariates. in aggregate, elevated cell type-specific levels of ace and associated proteases correlate with increasing age, smoking, and male sex. the age associations highlighted particularly low expression in samples from very young children (newborn to years old). as reports suggest that most cases in infants and young children do not display severe disease , , we inspected the subsets of studies of human development and pediatric samples from our integrated analysis (supplementary table ).
these included cells from first trimester samples ( donors) of fetal lung ( . weeks post conception; wpc), fetal lung samples ( donors) from the second trimester ( - weeks) , lung samples ( donors) spanning from third trimester premature births (n= ), full term newborns (n= ), ~ -month old (n= ), -year-old (n= ), and -year-old (n= ) children. because the number of samples here is small, all observations must be interpreted with caution. the extent of ace expression in lung cells changes during development (fig. g, supplementary table ). there are dual-positive cells present in the very early first trimester lungs by scrna-seq, and some ace expression in epithelial cells in the second trimester samples (extended data fig. a,b) . notably, spatial transcriptomics of . pcw fetal lung did not capture any ace expression (data not shown). in lungs from third trimester pre-term births, ace expression is high, with ace + tmprss + cells observed in alveolar at cell populations (extended data fig. b) . strikingly, ace expression is very low in normal lungs of newborns, the one ~ months' lung sample, and the -year-old lungs (fig. g) . this is further supported by single-cell chromatin accessibility by transposome hypersensitive sites sequencing (scths-seq ) from human pediatric samples (full gestation, no known lung disease) collected at day of life, months, years, and years (n= at each time point) (extended data fig. a) . ace gene activity scores (methods), when present, were in the at /at population, but no signal was present at birth, it was low in the -year-old and -year-old sample, and higher in the month-old (extended data fig. b-d) . notably, immunohistochemistry (ihc) of the ~ month-old infant lung also showed fewer ace -immunoreactive at cells (extended data fig. d ). 
we also assessed whether ace , tmprss , and ctsl are expressed in the human placenta during pregnancy, using data from three published scrna-seq studies: two from the first trimester ( , cells and , cells) and one from full-term placenta ( , cells) [ ] [ ] [ ] . ace was expressed ( . %) in maternal decidual/stromal cells, maternal pericytes, and fetal extravillous trophoblasts, cytotrophoblasts, and syncytiotrophoblasts in both first-trimester and term placenta (fig. d) . note that there was little expression of tmprss ( . %) in the placenta and accordingly few ace + tmprss + dual-positive cells (as we previously reported ). however, ctsl is expressed in most cells ( %) in the maternal-fetal interface, and there are ace + ctsl + dual-positive cells ( . %) among maternal decidual/stromal cells, pericytes, and fetal trophoblasts (hla-a, b negative) in both first-trimester and term placentae. overall, these patterns may be important in understanding why children are more resistant to covid- and in considering risk during pregnancy.
our human lung cell atlas analyses have revealed immune signaling genes that co-vary with ace and tmprss in airway and lung cells , . these analyses identified antiviral response genes that are enriched in ace + tmprss + cells (e.g., ido , irak , nos , tnfsf , oas , mx ), and suggested that ace itself is interferon regulated , . to explore such gene programs in a broader context, we identified signatures for dual-positive ace + tmprss + cells compared to dual-negative ace -tmprss cells in the nasal epithelium, lung, and gut (supplementary tables , ) with two complementary approaches. the first aimed to find features that characterize programs of dual-positive cells that are shared by different cell types in one tissue ("tissue programs"). the second aimed to find features that are associated with dual-positive cells compared to other cells of the same type, and may or may not be shared with other types ("cell programs") (methods).
to infer tissue programs, we trained a random forest classifier to discriminate between dual-positive and dual-negative cells (excluding ace and tmprss ; : class balanced test-train split), generalizing across multiple cell types in one tissue, and ranked genes according to their importance scores in the classifier (methods). to infer cell programs, we performed differential expression analysis between dual-positive and dual-negative cells within each cell subset. we note that ace + tmprss + cells have more unique transcripts detected (extended data fig. b) : this can reflect a technical confounder, biological features, or both. we conservatively controlled for these differences (by sampling dual positive and dual negative cells from matched gene complexity bins; methods; extended data fig. b , extended data fig. ) . importantly, these methods do not assume that ace + tmprss + cells form a distinct subset within each cell type. rather, our goal is to leverage the variation among single cells within a single type to identify gene programs that are co-regulated with ace and tmprss within each expressing cell subset. tissue programs (fig. a, extended data fig. a, supplementary table , ) were enriched in several pathways related to viral infection and immune response (see fig. b , extended data fig. a for visualization of selected genes, and supplementary tables for the full list). these include phagosome structure, antigen processing and presentation, and apoptosis.
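the control for transcript complexity mentioned above (sampling dual-positive and dual-negative cells from matched gene complexity bins) can be sketched as follows. this is a minimal illustration under stated assumptions, not the procedure from the methods: the equal-width binning, the number of bins, the fixed random seed, and the name `matched_sample` are all hypothetical choices.

```python
import random
from collections import defaultdict


def matched_sample(cells, n_bins=5, seed=0):
    """Down-sample so both groups have equal counts in every complexity bin.

    cells: list of (cell_id, n_genes_detected, is_dual_positive) tuples.
    Returns (positive_ids, negative_ids).
    """
    if not cells:
        return [], []
    rng = random.Random(seed)
    values = [g for _, g, _ in cells]
    lo, hi = min(values), max(values)
    # Equal-width bins (an assumption); fall back to width 1 if all values match.
    width = (hi - lo) / n_bins or 1.0
    bins = defaultdict(lambda: ([], []))
    for cell_id, n_genes, is_pos in cells:
        b = min(int((n_genes - lo) / width), n_bins - 1)
        bins[b][0 if is_pos else 1].append(cell_id)
    pos_out, neg_out = [], []
    for b in sorted(bins):
        pos_ids, neg_ids = bins[b]
        k = min(len(pos_ids), len(neg_ids))  # keep the same number from each group
        pos_out += rng.sample(pos_ids, k)
        neg_out += rng.sample(neg_ids, k)
    return pos_out, neg_out
```

after this step the two groups have the same distribution over complexity bins, so a downstream comparison is less likely to be driven only by the number of genes detected per cell. bins that contain cells from one group only contribute nothing.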
among the tissue program genes we highlight: ceacam (lung, nasal, gut programs) and ceacam (lung), surface attachment factors for coronavirus spike protein; slpi (lung, nasal), a secreted protease inhibitor that is associated with virus resistance ; pigr (lung, gut), the polymeric immunoglobulin receptor that may promote antibody-dependent enhancement via iga ; and cxcl (lung, nasal), a mucosal chemokine that attracts dendritic cells and monocytes to the lungs.
cell programs (supplementary table ) were enriched in many of the same genes and pathways as tissue-specific programs (fig. d , supplementary table , , , , ), and highlight a potential role for tnf signaling in ace regulation. we first confirmed that the cell programs were not merely associated with the number of transcripts per cell (extended data fig. c ). while some genes were shared between the tissue and cell programs (e.g., many virus-related genes, such as ceacam , cxcl , slpi, and hla-dra), the cell programs further captured unique biological functions and activities. for example, dual positive lung secretory cells differentially expressed genes involved in tnf signaling including ripk , a key regulator of inflammatory cell death via necroptosis, previously implicated in sars-cov pathogenesis . both lung dual positive secretory and multiciliated cells differentially expressed lysosomal genes (mfsd , ctss, ctns, ctsh), potentially relevant for endolysosomal entry of coronaviruses . dual positive at cell programs included genes involved in immunoproteasome (psmb , psmb , fig. c) , class i and ii antigen presentation (hla-dma, hla-drb , hla-dpb , hla-dra, hla-dpa ), and phagocytosis.
dual-positive nasal goblet cells differentially expressed several cytokines and chemokines, including granulocyte-colony stimulating factor (csf ), which may impact hematopoiesis, the recruitment of neutrophils, and inflammatory pathology; cxcl and cxcl , chemoattractants for neutrophils; interleukin- (il ), which induces the production of il- and tnf ; and ccl , which is upregulated by tnf . the at cell program included the surfactant proteins, sftpa and sftpa ; the il- receptor (il r ), which may promote antiviral immune responses (below); and, multiple components of mhc-ii (e.g., hla-dpa , hla-dpb ), congruent with a role in antigen presentation. cell programs from multiple tissues (fig. c,d) included genes related to tnf signaling (e.g., birc , ccl , cxcl , cxcl , jun, nfkb ), raising the possibility that anti-tnf therapy may impact the expression of ace and/or tmprss . consistent with this hypothesis, ace expression in enterocytes was significantly lower in ulcerative colitis patients treated with anti-tnf compared to untreated patients (mean = . and . log (transcripts per , (tp k)+ ) in treated vs. untreated; adjusted p < e- ). however, we could not control for many important features, including disease severity, which is strongly associated with anti-tnf treatment, raising the need for future work. some of the genes are targets of known drugs . for example, dual-positive lung secretory cells expressed, in addition to ace (targeted by ace inhibitors), other drug targets, including c , hdac , il a, pik ca, ramp , and slc a . other program genes were shown to interact with sars-cov- proteins via affinity purification mass spectrometry . among those was gdf , which was identified as a putative interaction partner for the sars-cov- protein orf , is a central regulator of inflammation , and was a member of the dual-positive cell programs of both lung basal cells and nasal multiciliated cells. 
some program genes may be particularly related to covid pathological features and may indicate putative therapeutic targets. for example, muc is especially highly induced in dual-positive cells (in tissue and specific cell programs), which may be associated with respiratory secretions . importantly, the lung tissue and gut enterocyte programs include the gene encoding the il co-receptor (il st), and the at cell program includes il . il signaling has been implicated in uncontrolled immune responses in the lungs of covid patients, elevated serum il levels are associated with the need for mechanical ventilation , and anti-il r antibodies (tocilizumab) are being tested for clinical efficacy in covid- patients. indeed, il st and il are higher in dual positive vs. dual negative at cells (extended data fig. d ), although il expression is relatively low in these cells from healthy tissue. additional cell types, such as heart pericytes, are enriched for cells with co-expression of ace with il r or il st (extended data fig. ). the immune-like features of ace + epithelial cells are also reflected in the regulatory features of the ace locus by scatac-seq (fig. f) . note that because epithelial cells with an accessible ace locus tend to have a higher number of fragments in peaks than cells with inaccessible ace (extended data fig. f ), consistent also with higher umis in scrna-seq, some of the cells with inaccessible ace could be false negatives, reducing our power. previous studies in the healthy lung predicted that interactions between at cells and myeloid-lineage macrophages may be important for immune regulation and surfactant homeostasis . to explore this possibility, we predicted interactions between at cells (in general, or ace + tmprss + dual-positives) and myeloid cells (methods ), using our large declined donor transplant dataset ("regev/rajagopal"; samples, patients, - locations each).
at cells and myeloid cells were present in lung lobe samples from all patients, whereas samples from patients contained both ace + tmprss + dual-positive at cells and myeloid cells. we identified significant predicted interactions involving oncostatin m (osm), an il -type cytokine expressed by myeloid cells , with the oncostatin m receptor (osmr) and its paralog receptor lif receptor subunit alpha (lifr) expressed in at cells (both for at cells in general and for double positive ones). interactions involving the complement pathway were also predicted (for all at and dual positives) between complement c and c expressed by at cells and their cognate receptor expressed in myeloid cells. three samples had interactions between the il receptor on at cells and il b or il rn in myeloid cells. the il -receptor interactions were identified mostly (in out of samples) involving only dual-positive at cells, suggesting a possible role of ace + tmprss + dual-positive cells in il -mediated processes. finally, we identified interactions between csf , , or expressed in at cells (including double positives) and their receptors expressed in myeloid cells. these predicted interactions further support the roles for cross-talk between at and myeloid cells, such as macrophages, in immune regulation (osm, complement, il ) and surfactant homeostasis (csf), as previously highlighted .
we next asked whether human cell types of interest were present in animal models. while such analyses cannot address molecular compatibility (due to sequence variation in ace across species, as shown for lower compatibility of sars-cov and mouse ace ), they can help determine if dual-positive cells are present in commonly employed models, and if their characteristics, proportions, and programs are similar to those of their human counterparts. in a separate study , our lung network showed strong similarities to the human data in a macaque model.
here, we focused on the more distant, but commonly used, mouse model. ace + tmprss + and ace + ctsl + dual-positive cells were present primarily in club and multiciliated cells in the airway epithelia of healthy mice (ace + tmprss + club . % [ . %, . %] and multiciliated . % [ . %, . %], ace + ctsl + club . % [ . %, . %] and multiciliated . % [ . %, . %]), consistent with the expression patterns found in human airways (fig. a) . furthermore, ace expression increased over a -month time-course of healthy mouse aging in both club (p= . e- ) and goblet (p= . ) cells (fig. a) . the proportion of ace + tmprss + dual-positive cells did not significantly increase with age during this time course (data not shown), but the proportion of ace + ctsl + dual-positive cells significantly increased in club cells during this time-course (fig. b) . interestingly, the mice were aged between - months, a -month period that is reported to reflect the maturation period from early to mature adults . examining bulk rna-seq profiles of sorted populations of alveolar at cells (sftpc + ), airway basal cells (krt + ), alveolar endothelial cells (cd -cd + ), alveolar epithelial cells (epcam + ), whole lung and whole trachea from a krt -creer/lsl-tdtomato/sftpc-egfp transgenic mouse model, and across tissues from encode, showed that ace , tmprss and ctsl are expressed in sorted at cells, whole trachea and whole lung, as well as in stomach, intestine, kidney and bladder. in human smokers, statistical modeling uncovered a robust trend of increased ace expression in airway epithelial cells, while expression in at cells was reduced (fig. d, extended data fig. ). to experimentally confirm these findings, we examined cell profiles from mice exposed daily to cigarette smoke for two months, followed by scrna-seq of whole lungs (fig. c) . epithelial specific expression patterns of mouse ace and the ace + tmprss + and ace + ctsl + dual-positive cells were largely consistent with the human data (fig. d) . 
upon smoke exposure, there was a significant increase in ace + airway secretory cell numbers, while the fraction of ace + at cells was unaltered (fig. e) . moreover, the expression levels of ace were significantly increased in airway secretory cells (fig. f ), but not in at cells (fig. g) . this was in agreement with bulk rna-seq of mouse lungs exposed to different doses of cigarette smoke , in which ace levels increased in a dose-dependent manner by daily cigarette smoke over months (fig. h) . notably, the covid- relevant proteases tmprss and ctsl were also significantly increased by smoke exposure in mice (fig. i,j) . thus, mouse smoking data shows similar trends as observed in humans and experimentally confirms the association of ace levels with smoking. we also compared the patterns between the mouse and human placenta, analyzing ace , tmprss , and ctsl expression across , cells from scrna-seq data during mouse placenta development from embryonic days . to (shu et al., unpublished). we find ace + tmprss + dual-positive cells ( . %) in a large fraction of fetal trophoblasts with strong epithelial signatures. ace + tmprss + dual-positive cells express signatures of at cells and hepatocytes, and many also express ctsl. ace + ctsl + dual-positive cells ( . %) are also present among fibroblasts, stromal cells, and fetal trophoblasts in both mice and humans (fig. k, extended data fig. ) . notably, while ace + ctsl + dual-positive fibroblasts and stromal cells in humans are of maternal origin, ace + ctsl + dual-positive fibroblasts and stromal cells are of fetal origin in mice. tmprss has been demonstrated to mediate sars-cov- infection in vitro , , but sars-cov- also infects cells in the absence of tmprss . thus, additional proteases likely play roles in proteolytic cleavage events of spike and other viral proteins that underlie entry (fusion) and egress. 
to systematically predict proteases potentially involved in sars-cov- pathogenesis, we tested for co-expression of each of annotated human protease genes with ace in the large declined donor transplant dataset ("regev/rajagopal") from patients. the analysis recovered tmprss as one of the proteases significantly co-expressed with ace in multiple lung epithelial cell types (fig. a, supplementary table , ). in addition, multiple members of the proprotein convertase subtilisin kexin (pcsk) family were also significantly co-expressed with ace in both proximal and distal airway epithelial cells (fig. a,b) , including furin, pcsk , pcsk , pcsk and pcsk in at cells. proprotein convertases have known roles in coronavirus s-protein priming , , . we obtained similar results in an independent dataset of , cells from donors (extended data fig. a,b, "aggregated lung") . to further investigate the role of proprotein convertases as candidates for sars-cov- s-protein processing, we analyzed the sars-cov- spike protein sequence. multiple sequence alignment of s-protein sequences of sars-cov- and other beta-coronaviruses revealed a polybasic insert at the s /s junction present only in sars-cov- spike (extended data fig. c) . while polybasic sites are found in multiple members of betacoronavirus lineages a and c (e.g., mers-cov), sars-cov- is the only known member of lineage b harboring a polybasic motif in the s /s region (extended data fig. c) . as previously reported, this polybasic sequence corresponds well to cleavage motifs of multiple pcsk family proteases (extended data fig. d ) [ ] [ ] [ ] , and has a high predicted probability of pcsk-mediated cleavage (at amino acid ) (by prop and prosperous , ), as do additional sites including the s ' position (at amino acid ), cleavage at which would release predicted fusion-mediating peptides (extended data fig. e ) . we next examined pcsks expression and co-expression with ace across lung cell subsets (fig. c, extended data fig. f) .
furin, pcsk and pcsk were broadly expressed across multiple lung cell types, and pcsk and pcsk were largely restricted to neuroendocrine cells, as previously reported , with pcsk further detected in . % of at cells (fig. d, extended data fig. g ). in many cell subsets we observed dual expression with ace at fractions comparable to or higher than those of ace + tmprss + cells (fig. e, extended data fig. h) . these include at cells (ace + tmprss + , ace + furin + and ace + pcsk + at . %, . %, and . %, fig. e) ; multiciliated cells in the proximal airway (ace + tmprss + , ace + furin + , ace + pcsk + , and ace + pcsk + at . %, . %, . %, and . %), and basal cells (ace + tmprss + , ace + furin + , and ace + pcsk + at . %, . %, and . %). coexpression is present across tissues in addition to the lung (extended data fig. i,j) , including the liver, ileum, kidney and nasal airways, with the highest percentages of ace + pcsk + dual positive cells in nasal airways (ace + pcsk + . %, ace + furin + . %), bladder (ace + pcsk + . %) and testis (ace + pcsk + . %). because different host proteases may contribute to different stages of the viral life cycle , , we also examined the prevalence of ace + tmprss + pcsk + triple-positive cells (tps) in the lung dataset. ace + tmprss + pcsk + were the main triple positive cells in multiciliated ( . %) and secretory cells ( . %) of proximal airways, and ace + tmprss + furin + tps were the most common within at cells ( . %) (extended data fig. k) . finally, when we examined all known human proteases for co-expression with ace in major lung epithelial cell types (fig. f) , we recovered cathepsins (ctsb, ctsc, ctsd, ctsl, ctss), proteasome subunits (e.g. psmb , psmb , psmb ), and complement proteases (c r, c , cfi) (fig. f, extended data fig. ), the latter also captured in our programs above (fig. ) . 
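the dual-positive and triple-positive percentages reported above are fractions of cells in which every gene of a set is detected. a minimal sketch of that calculation follows; it assumes that a cell counts as positive for a gene when its count is above zero, which may differ from the threshold used in the methods. the function name `fraction_positive` and the placeholder gene keys (`receptor`, `protease_1`, `protease_2`) are hypothetical.

```python
def fraction_positive(cells, genes):
    """Fraction of cells that are positive for every gene in `genes`.

    cells: list of dicts mapping gene name -> count for one cell.
    A gene missing from a cell's dict is treated as a count of zero.
    Assumption: 'positive' means count > 0.
    """
    if not cells:
        return 0.0
    hits = sum(all(cell.get(g, 0) > 0 for g in genes) for cell in cells)
    return hits / len(cells)


# Toy example with placeholder gene names (not real data).
toy_cells = [
    {"receptor": 1, "protease_1": 2},
    {"receptor": 0, "protease_1": 3},
    {"receptor": 2, "protease_1": 0, "protease_2": 1},
    {"receptor": 1, "protease_1": 1, "protease_2": 4},
]
dual = fraction_positive(toy_cells, ["receptor", "protease_1"])
triple = fraction_positive(toy_cells, ["receptor", "protease_1", "protease_2"])
```

applying the same function per cell type and per tissue gives tables of the kind summarized in the text. because sparse single-cell counts miss many expressed transcripts, such fractions are probably best read as lower bounds.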
we performed integrative analyses of single-cell atlases in the lung and airways and across tissues to identify cell types and tissues that have the key molecular machinery required for sars-cov- infection. we then examined the relationship between specific cell types and three key covariates (age, sex, and smoking status) that have been related to disease severity. we further used the scale of these integrated atlases to identify gene programs in major epithelial cell subsets that may be infected by the virus, and to search for other potential accessory proteases. our hope is that this extensive analysis and resource will help with hypothesis generation (and refutation) towards better understanding of the molecular and cellular basis of covid- infection, and the identification of putative therapeutic avenues. our cross-tissue analysis substantially expands on our , , , and others' - earlier efforts, allowing us to identify cell subsets across diverse tissues that may be implicated in virus transmission, pathogenesis, or both. focusing on pathogenesis, in addition to key subsets in the lung, airways and gut, we identified ace + cells that co-express either tmprss or ctsl in diverse organs, many of which have been associated with severe disease. these include epithelial cells in the liver, kidney, pancreas, and olfactory epithelium; cardiomyocytes, pericytes and fibroblasts in the heart; and oligodendrocytes in the brain. for example, the presence of double positive cardiomyocytes and cardiac pericytes and fibroblasts may provide a pathological basis for the cardiac abnormalities noted in covid- patients, including elevated troponin (a signature of cardiomyocyte injury), myocarditis, and sudden cardiac death . as the co-expression of genes involved in sars-cov infection is highest in cardiac pericytes in the healthy heart, damage to vascular beds may trigger troponin release in otherwise normal hearts.
moreover, as myocardial ace expression is increased in patients with existing cardiovascular diseases (tucker et al. companion manuscript ), sars-cov infection may result in greater damage to cardiomyocytes, and account for greater disease acuity and poorer survival in these patients. one intriguing clinical observation is that some covid- patients display an array of neurologic symptoms , (helms et al. ) , reported as seizures and acute necrotizing encephalopathy, similar to that previously observed following other infections such as influenza , . neuroinflammation could result from direct viral infection of the brain, or a systemic cytokine storm [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . direct viral invasion of sars-cov and mers-cov is observed in multiple brain regions in human patients and mouse models , , consistent with widespread ace expression in numerous brain cell types. furthermore, sars-cov has been shown to infiltrate the brain via the olfactory epithelium-olfactory bulb axis ; olfactory transmission for sars-cov- has been recently proposed . other possible transmission routes could be through the infection of ace + tmprss + enteric neurons synapsing with vagal afferents, or entry through blood-cns interfaces such as the choroid plexus or meninges , [ ] [ ] [ ] . profiling immune cells at these sites after infection is an important future step to better understand how the viral response may lead to encephalitis. one intriguing possibility is that encephalitis might arise as an autoimmune response to myelin antigens expressed by infected cells. antibodies against peptides of myelin proteins have been clinically shown to be associated with autoimmune encephalitis and seizures [ ] [ ] [ ] , and myelin peptides are targets of t cells in demyelinating inflammatory neurological diseases such as acute demyelinating encephalomyelitis, guillain-barre syndrome, and multiple sclerosis [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . 
oligodendrocytes, the myelin-producing cells of the cns, are the main ace + tmprss + cell type in the brain; the myelin transcriptional regulator myrf was enriched in certain ace + tmprss + cell types as noted above, and myelin proteins mog and mbp were co-expressed in numerous ace + clusters across organs. myrf and mbp were significantly differentially expressed in ace + tmprss + subsets of the lung and gut (supplementary table ); myelin targeting th cells trained in the gut are able to infiltrate the cns in a mouse model of experimental autoimmune encephalomyelitis . taken together, the expression of myelin proteins across multiple ace + tmprss + cells could hypothetically contribute to antigen presentation and autoimmune response in the context of viral infection. one test of this hypothesis would be to establish whether demyelination occurs in covid- patients, and whether it can be triggered by anti-myelin specific immunity induced by virus-infected cells, related to observations following other viral infections like zika, influenza, and epstein-barr virus , , .
our meta-analysis of scrna-seq across studies provided the required statistical power to uncover population-level signals at a molecular level and at single-cell resolution. we found that the sars-cov- receptor and associated proteases were up-regulated in airway epithelial and at cells with age and in males, an association that may shed light on the marked increase in mortality with age. furthermore, ace was up-regulated in airway epithelial cells (basal and multiciliated cells) in past or present smokers, but down-regulated in their at cells; we have also confirmed this in an experimental system in a mouse model. these contrasting smoking associations show the importance of single-cell resolution, as the down-regulation in at cells would be masked by the airway epithelial signal in bulk rna-seq, leading to loss of association or misinterpretation of seemingly consistent ace signals .
importantly, ace is particularly lowly expressed in young pediatric samples, also mirrored by lack of chromatin accessibility in the ace locus. ace expression is known to be regulated in complex ways across different tissues and may be affected both by common therapies (acei/arb) (tucker et al., companion manuscript ) and during infection . moreover, both higher ace expression per cell and a higher fraction of ace + cells can in principle have implications for infection, but may have conflicting effects on pathogenesis, as ace knockout mice show more severe ards upon lung injury because of its role in the renin-angiotensin pathway, which seems to protect from consequences of lung injury and inflammation (for potential roles of this pathway on cov infection (pre-covid) see the review in ). as sars-cov binding will lead to internalization and therefore downregulation of ace on the cell surface, the protective function of ace via its proteolytic processing of angiotensin-ii may be lost. thus, the smoking-mediated downregulation of ace in at cells may not protect cells from being infected but rather may increase ards due to more severe loss of ace from the cell surface upon infection. other confounders, including acei use, which is not available to our meta-analysis, may further impact our results. to the best of our knowledge, this study is the first single-cell meta-analysis (in any setting). to perform this meta-analysis, we used a model that included the tested covariates (age, sex, and smoking status), technical covariates (dataset and the number of umis per cell), and several interaction terms. including these interaction terms was crucial, as omission resulted in increased background variation and reversed effect estimates. likewise, modeling the smoking status of a donor was important to reduce background variation and account for the unbalanced distribution of covariates in the dataset.
for example, while we have similar numbers of male ascertained smokers and non-smokers ( and donors), there are three times as many female ascertained non-smokers as female smokers ( and donors), which is reflective of this bias in the population . the addition of these terms increases the complexity of the model. indeed, only one dataset ("seibold") had sufficient numbers of donors of various ages, sex, and smoking status to fit the full model. thus, performing the meta-analysis was only possible due to the aggregation of a large number of healthy single-cell datasets enabled by the hca lung biological network and a community-wide effort. a limitation of our expression model is that each cell is treated as an independent observation. thus, the significance of association with traits such as sex, age, and smoking status may show inflated p-values, especially where the associations are determined from few donors. in this case, the variation between cells from a single donor dominates the variation between donors, background variation is underestimated and effect significance can be overestimated. aggregating many datasets allows us to counteract this effect, yet p-value inflation may occur in cell types that are not as commonly shared across datasets. our main conclusions are drawn on airway epithelial and at cells, which are distributed widely across datasets and are modeled on the basis of many donors. furthermore, we have confirmed significant associations by pseudobulk analysis and by holding out datasets. this confirmation ensures that associations are consistent when only considering donor variation, and we are aware if these associations are dataset dependent (often when one dataset is a particularly major source of a given cell type). models that account for both single-cell count distributions, and population structure in the data have the potential to improve future meta-analyses across single-cell atlases. 
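the pseudobulk confirmation mentioned above addresses the limitation that cells from one donor are not independent observations: counts are first aggregated per donor and cell type, so that each donor contributes a single value to the downstream test. a minimal sketch follows. the normalization (summed gene counts divided by summed total umis) and the name `pseudobulk` are assumptions for illustration, not the authors' pipeline.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Aggregate single-cell counts to one value per (donor, cell_type).

    cells: list of (donor, cell_type, gene_count, total_umis) tuples.
    Returns {(donor, cell_type): summed gene counts / summed total UMIs}.
    Groups with zero total UMIs are skipped.
    """
    gene_sum = defaultdict(int)
    umi_sum = defaultdict(int)
    for donor, cell_type, count, total in cells:
        gene_sum[(donor, cell_type)] += count
        umi_sum[(donor, cell_type)] += total
    return {k: gene_sum[k] / umi_sum[k] for k in gene_sum if umi_sum[k] > 0}


# Toy example (not real data): two donors, two cell types.
toy = [
    ("d1", "basal", 1, 1000),
    ("d1", "basal", 3, 1000),
    ("d2", "basal", 0, 500),
    ("d2", "basal", 1, 500),
    ("d1", "secretory", 2, 400),
]
profiles = pseudobulk(toy)
```

testing associations on these per-donor values uses the number of donors, not the number of cells, as the effective sample size, which is why it guards against overestimated significance.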
having a cell type annotation with consistent resolution across datasets was instrumental for analyzing the association with clinical covariates up to the resolution of basal, secretory, or multiciliated cells. these cell type labels still aggregate over considerable diversity, which is the subject of ongoing scientific research. importantly, the labeled subtypes of these cell clusters differ between datasets. thus, in those cases where individual associations depend on a particular dataset not being held out, it may be the case that these associations become more robust at a higher level of cell type annotation. future high-resolution cell annotation efforts have the potential to further consolidate our single-cell meta-analysis results. in addition to modeling associations between gene expression and clinical co-variates, we also examined whether the proportion of ace + tmprss + cells per sample is associated with age or sex. while we can observe a trend of double positive cell proportions increasing with age (extended data fig. a) , the high compositional diversity across samples and studies (fig. a, extended data fig. ) , the potential confounders (total counts, dataset), and limited sample numbers are prohibitive to modeling these associations. further metadata that describe the sample diversity such as harmonized annotations on anatomical location, sampling methods, and sample processing can help to capture this heterogeneity in ace + tmprss + cell proportion models. the expression of ace and tmprss in lung, nasal and gut epithelial cells is associated with expression programs with many shared features, involving key immunological genes and genes related to viral infection, raising many hypotheses for future studies, especially as more patient tissue samples are analyzed in the coming months. 
in the lung, epithelial cells express il , il r and il st, which raises the hypothesis that infection may trigger cytokine expression from these cells and contribute to uncontrolled immunological responses. the immune-like programs in these cells are further reinforced by the accessibility of stat and irf binding sites in scatac-seq data, consistent with another study from our network showing the role of interferon in regulating ace expression in epithelial cells . notably, scrna-seq analysis of immune cells from bronchoalveolar lavage fluid of covid- patients identified high activity of transcription factors such as stat / and irf / / / / / in macrophage states increased in severe covid- patients . other hypotheses for future studies include lysosomal genes in dual positive lung secretory and multiciliated cells, which may be consistent with putative "viral entry" cells, and ripk expression in the cell programs of airway cells, which opens the hypothesis of necroptosis initiating a pro-inflammatory response. interestingly, we observed relatively high enrichment of ace in secretory cell types (mucous cells and at cells). we speculate that viruses may take advantage of the rich secretory pathway components in these cells for their efficient dispersal. additionally, smgs of the airways are recently shown to serve as reservoirs of reserve stem cells , . therefore, we also speculate that smgs similarly may serve as reservoirs for viruses where they can escape from muco-ciliary transport and mechanical expulsion associated with severe cough in the airway luminal surface. the gene programs of at cells can also contribute to cross talk with alveolar macrophages. our cell-cell interaction analysis suggests that at cells engage with alveolar macrophages through oncostatin m, csf, il and complement pathways, suggesting therapeutic hypotheses. the complement pathway is particularly intriguing in the context of covid- . 
first, viral protein glycosylation is a known trigger for the lectin pathway (lp) of the proteolytic complement cascade, such that in addition to classical complement activation via antibody complexes, other coronavirus glycoproteins can be recognized by lp-inducing host collectin proteins , . moreover, excessive complement activation resulting in acute lung injury and cytokine storms were also implicated in the pathogenesis of sars , and complement inhibition using the anti-c antibody eculizumab is currently evaluated as anti-inflammatory experimental emergency treatment for severe covid- in clinical trials (clinicaltrials.gov identifier nct ). in our analysis, multiple complement pathway proteases (e.g. c r, c , cfi) are co-expressed with ace across different lung cell subsets (fig. f,g, extended data fig. ) and complement inhibitory factor cd and complement protease c were preferentially expressed by ace + tmprss + dps within lung tissue, in multiciliated and secretory cells, respectively (fig. a,c, supplementary table , supplementary tables , ) . moreover, cell-cell interaction analysis predicted cross-talk between at cells expressing complement proteins c and c and macrophages expressing cognate receptors. at cell expression of negative complement regulators cfi and cd might represent a strategy for sars-cov- to at least partially escape complement surveillance. finally, to explore therapeutic hypotheses related to disruption of viral processing via protease inhibition, we explored the expression of other proteases across our integrated atlases. although a multitude of different sars-cov- features likely account for its high pathogenicity and transmissivity, it has been speculated that the prrar loop might contribute to increased covid- severity. introduction of similar polybasic cleavage sites into avian influenza viruses and human coronaviruses was shown to render them more pathogenic, increasing mortality and viral spread , . 
one hypothesis is that acquisition of a pcsk cleavage site would expand the number of cell types that can be directly infected by sars-cov- . a recent report has started to address expression of furin in cells expressing sars-cov- host factors ace or tmprss , and furin activity is inhibited by guanylate-binding proteins (gbps), a group of interferon-stimulated genes, in order to restrict viral envelope processing . however, the highly overlapping recognition sequence of pcsk family members (extended data fig. d) suggests that multiple pcsks in addition to furin could mediate cleavage at the s /s prrar motif (extended data fig. e) . our expression analysis confirms that pcsk family members, in particular furin, pcsk and pcsk , are more broadly expressed than tmprss across lung cell types (fig. d) , as well as across tissues (extended data fig. i) . in the lung, we note the higher proportion of ace + pcsk + basal cells and ace + pcsk / + fibroblasts (fig. e, extended data fig. h) . interestingly, the host interactome of sars-cov- further suggests interaction of viral proteins with pcsk , which also showed significant ace co-expression in at cells (fig. b, extended data fig. b) . moreover, because pcsk localization is detected in different membrane compartments along the secretory and endocytic pathways , it is conceivable that pcsks could process sars-cov- s-proteins at different stages of the viral life cycle. in addition, further analysis is required to assess the extent to which sars-cov- relies on proteolytic activity provided in trans either by neighboring cells or by extracellularly localized proteases . altogether, this could provide sars-cov- with immense flexibility across different entry and egress pathways. taken together, our analyses provide a rich molecular and cellular map as context for the transmission, pathogenesis, clinical associations, and therapeutic hypotheses for covid- . 
as new single-cell atlases are generated from covid- tissues and experimental models, they will help further advance our understanding of this disease. sample collection underwent irb review and approval at the institutions where the samples were originally collected. "adipose_healthy_manton_unpublished" was collected under irb p / (orsp- ). tissue samples from breast, esophagus muscularis, esophagus mucosa, heart, lung, prostate, skeletal muscle and skin referred to as "tissue_healthy_regev_snrna-seq_unpublished" were collected under orsp- . samples (supplementary table , ) publicly available single-cell rna-seq datasets were downloaded from gene expression omnibus (geo) . we searched geo for datasets that met all of the following three criteria: ( ) provided unnormalized count data; ( ) was generated using the x genomics chromium platform; and ( ) profiled human samples. these tissue samples spanned a wide range, including primary tissues, cultured cell lines, and chemically or genetically perturbed samples. applying these filters increases standardization of samples, as the vast majority were prepared using the same x chromium instrument and cell ranger pipelines. datasets comprise one or more samples (individual gene expression matrices), which often correspond to individual experiments or patient samples. in total, this yielded , , cells from samples from distinct datasets (supplementary table ) . to allow comparison across samples and datasets, we mapped through a common dictionary of gene symbols and excluded unrecognized symbols. if a gene from an aggregated master list was not found in a sample, the expression was considered to be zero for every cell in that sample. after all datasets were collected, we quantified the percentage of cells with > umis for both ace and tmprss or ace and ctsl. for further analyses with broad cell classes, we only used datasets with more than double positive cells yielding , cells from samples. 
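the gene-symbol harmonization and the double positive percentage described above can be sketched in a few lines. this is a minimal illustration and not the authors' pipeline: the function names, the dictionary-of-lists layout, and the placeholder gene names are assumptions made for the example, and the umi threshold is left as a required argument because its value is not recoverable from this text.

```python
def align_to_master(sample, master_genes):
    """Map one sample onto a shared gene list.

    sample: dict mapping gene symbol -> list of per-cell UMI counts
            (hypothetical layout; all lists have the same length).
    Genes missing from the sample are filled with zeros for every cell,
    and symbols absent from the master list are dropped.
    """
    n_cells = len(next(iter(sample.values()))) if sample else 0
    return {g: list(sample.get(g, [0] * n_cells)) for g in master_genes}


def percent_double_positive(aligned, gene_a, gene_b, min_umis):
    """Percentage of cells with more than `min_umis` UMIs for both genes."""
    pairs = list(zip(aligned[gene_a], aligned[gene_b]))
    if not pairs:
        return 0.0
    hits = sum(1 for x, y in pairs if x > min_umis and y > min_umis)
    return 100.0 * hits / len(pairs)


if __name__ == "__main__":
    # Toy sample: GENE_B is absent and UNKNOWN is not in the master list.
    sample = {"GENE_A": [1, 0, 2], "UNKNOWN": [5, 5, 5]}
    aligned = align_to_master(sample, ["GENE_A", "GENE_B"])
    print(aligned)
    # min_umis=0 is an illustrative threshold, not a value from the text.
    print(percent_double_positive(aligned, "GENE_A", "GENE_B", min_umis=0))
```

filling missing genes with zeros keeps every sample on the same gene axis, which is what allows double positive percentages to be compared across datasets.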
for integration across datasets, we used two levels of annotations. when possible, every sample was annotated with its tissue of origin based on the available metadata from geo. we excluded any sample for which tissue was not specified. for the smaller subset of , cells we then manually annotated cell clusters with broad cell type classes using marker genes. these clusters were generated using the harmony-pytorch python implementation (version . . (https://github.com/lilab-bcb/harmony-pytorch ) of the harmony scrna-seq integration method for batch correction and leiden clustering from the scanpy package (version . . ) . clusters without clear markers distinguishing types were excluded from further analysis. data was processed using scanpy. individual datasets were normalized log (umis/ , + ) by column sum and the log p function (ln( , * gij + ) where a gene's expression profile, g, is the result of the umi count for each gene, i, for cell j, normalized by the sum of all umi counts for cell j. this data normalization step was only used for generating the clusters and cell type annotations. all other statistical tests for the integrated analysis were performed on the cell's binary classification as a double positive or not. for example, for a cell to be considered ace +, it has > ace transcripts. double positive cells have > transcripts for both genes of interest. we used fisher's exact test to test for statistical dependence between the expression of ace and tmprss or ctsl and corrected for multiple testing via benjamini-hochberg over all tests for each gene pair. we compiled a compendium of published and unpublished datasets consisting of , , cells from tissues and/or organs including adipose, bone marrow, brain, breast, colon, cord blood, enteric nervous system, esophagus mucosa, esophagus muscularis, anterior eye, heart, kidney, liver, lung, nasal, olfactory epithelium, pancreas, placenta, prostate, skeletal muscle and skin. 
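the binarize-then-test procedure described above (fisher's exact test on per-cell positive/negative calls, followed by benjamini-hochberg correction) can be sketched with the standard library alone. this is an illustrative re-implementation under stated assumptions, not the code used in the study: the function names are hypothetical, the two-sided p-value sums all tables no more probable than the observed one, and the transcript threshold is a required argument because its value is not recoverable from this text.

```python
from math import comb


def contingency(gene_a, gene_b, threshold):
    """2x2 table (both+, A only, B only, neither) from per-cell counts."""
    a = b = c = d = 0
    for x, y in zip(gene_a, gene_b):
        pos_a, pos_b = x > threshold, y > threshold
        if pos_a and pos_b:
            a += 1
        elif pos_a:
            b += 1
        elif pos_b:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x cells in the top-left corner.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # threshold=0 is an illustrative choice, not a value from the text.
    table = contingency([1, 0, 2, 0], [1, 1, 0, 0], threshold=0)
    print(table, fisher_exact_two_sided(*table))
    print(benjamini_hochberg([0.01, 0.04, 0.03]))
```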
after the harmonization of cell type annotations, ace -tmprss and ace -ctsl coexpression were assessed using a logistic mixed effect model: where yi was the binarized expression level of either tmprss or ctsl, and covariates were binarized ace expression in cell i and a sample-level random intercept. models were fit separately for each cell type in each dataset. in order to avoid spurious associations in cell types with very few ace + cells and due to very low expression of ace , we subsampled ace cells to the number of ace + cells within each cell type and discarded cell types containing fewer than cells expressing either ace or fewer than cells expressing the other gene being tested after the subsampling procedure. the significance of the association between ace and tmprss /ctsl is controlled for % fdr using the statsmodels python package (version . . ) . data processing was performed using scanpy python package (version . . ) and logistic models were fit using lme r package (version . . ) . library generation and sequencing. libraries were generated using the x chromium controller and the chromium single cell atac library & gel bead kit (# ) according to the manufacturer's instructions (cg -rev c; cg -rev b) with unpublished modifications relating to cell handling and processing. briefly, human lung derived primary cells were processed in . ml dna lobind tubes (eppendorf), washed in pbs via centrifugation at g, min, c, lysed for min on ice before washing via centrifugation at g, min, c. the supernatant was discarded and lysed cells were diluted in x diluted nuclei buffer ( x genomics) before counting using trypan blue and a countess ii fl automated cell counter to validate lysis. if large cell clumps were observed, a µm flowmi cell strainer was used prior to the tagmentation reaction, followed by gel bead-in-emulsions (gems) generation and linear pcr as described in the protocol. 
after breaking the gems, the barcoded tagmented dna was purified and further amplified to enable sample indexing and enrichment of scatac-seq libraries. the final libraries were quantified using a qubit dsdna hs assay kit (invitrogen) and a high sensitivity dna chip run on a bioanalyzer system (agilent). all libraries were sequenced using nextseq high output cartridge kits and a nextseq sequencer (illumina). x scatac-seq libraries were sequenced paired end ( x cycles). initial data processing and qc. fastq files were demultiplexed using x genomics cellranger atac mkfastq (version . . ). we obtained peak-barcode matrices by aligning reads to grch (cr v . . pre-built reference) using cellranger atac count. peak-barcode matrices from six channels were normalized per sequencing depth and pooled using cellranger atac aggr. the aggregated, depth-normalized, filtered dataset was analyzed with signac (v . . , https://github.com/timoast/signac), a seurat extension developed for the analysis of scatac-seq data. all the analyses in signac were run with a random number generator seed set as . cells that appeared as outliers in qc metrics (peak_region_fragments ≤ or peak_region_fragments ≥ , or blacklist_ratio ≥ . or nucleosome_signal ≥ or tss.enrichment ≤ ) were excluded from the analysis. normalization and dimensionality reduction. the aggregated dataset was processed with latent semantic indexing , i.e. datasets were normalized using term frequency-inverse document frequency (tf-idf), then singular value decomposition (svd), run on all binary features, was used to embed cells in low-dimensional space. uniform manifold approximation and projection (umap) was then applied for visualization, using the first dimensions of the svd space. gene activity matrix and differential motif activity analysis. 
a gene activity matrix was calculated as the chromatin accessibility associated with each gene locus (extended to include kb upstream of the transcription start site), as described in the vignette 'analyzing pbmc scatac-seq' (version: march , , https://satijalab.org/signac/articles/pbmc_vignette.html), using as gene annotation the genes.gtf file provided together with cellranger's atac grch - . . reference genome. clusters were annotated using label transfer from matching scrna samples or by literature / expert search of marker "active" (i.e. accessible) genes. differential motif activity analysis was performed using signac's implementation of chromvar , with motif position frequency matrices from jaspar (http://jaspar.genereg.net/) selecting transcription factor motifs from human (species= ), broadly following the vignette 'motif analysis with signac' (https://satijalab.org/signac/articles/motif_vignette.html). cells were identified as positive for ace and/or tmprss (i.e. with the loci accessible) if at least one fragment overlapped the gene locus or the kb upstream of it. differential activity analysis between epithelial cells positive for ace (with the above-mentioned definition of 'positive') and epithelial cells negative for ace was performed with the findmarkers function of seurat (version . . ), using as test 'lr' (i.e. logistic regression) and as latent variable the number of counts in peaks. the following publicly available bulk rna-seq datasets were obtained from the encode database: lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid ; lid , generated by the gingeras lab . these fastq datasets were aligned to the mm annotation build using star and processed using the standard tuxedo suite to yield a normalized fpkm matrix. 
immunohistochemistry analysis was performed on % pfa fixed, oct embedded tissue sections from human explant donors. briefly, cells were permeabilized with % triton-x in pbs. slides were either not treated with antigen retrieval or antigen retrieval was performed with the tris-citrate buffer as needed (supplementary table ). slides were incubated overnight with primary antibodies at indicated concentrations (supplementary table ) in donkey serum with % triton-x in pbs. slides were treated with alexa-fluor secondary antibodies mixed with dapi at : concentration in donkey serum with % triton-x in pbs for hour at room temperature. slides were mounted and imaged on a confocal microscope. proximity ligation in situ hybridization (plish) was performed as described previously . briefly, frozen human trachea and distal lung sections were fixed with . % paraformaldehyde for min, treated with protease ( μg/ml proteinase k for lung or pepsin for trachea for min) at °c, and dehydrated with up-series of ethanol. the sections were incubated with gene-specific oligos (supplementary table ) in hybridization buffer ( m sodium trichloroacetate, mm tris [ph . ], mm edta, . mg/ml heparin) for h at °c. common bridge and circle probes were added to the section and incubated for h followed by t ligase reaction for h. rolling circle amplification was performed by using phi polymerase (# , lucigen) for hours at °c. fluorophore-conjugated detection probe was applied and incubated for min at °c followed by mounting in medium containing dapi. to assess the association of age, sex, and smoking status with the expression of ace , tmprss , and ctsl, we aggregated scrna-seq datasets of healthy human nasal and lung cells, as well as fetal samples. aggregation of these datasets was enabled by harmonizing the cell type labels of individual datasets within scanpy (version . . . ). 
we harmonized annotations together with data contributors using a preliminary ontology generated on the basis of published datasets [ ] [ ] [ ] [ ] with levels of annotations (level -lowest resolution; supplementary table ) . we further harmonized metadata by collapsing the smoking covariate into "has smoked" and "has never smoked" and by taking mean ages where only age ranges were given. this endeavor produced a dataset of , , cells in samples from donors (supplementary data d ) . we divided the data into fetal ( , to get an overview of sample diversity, we clustered the samples using the proportion of cells in level cell types as features. clustering was performed using louvain clustering (resolution . ; louvain package version . . ) on a knn-graph (k= ) computed on euclidean distances over the top principal components of the cell type proportion data within scanpy. this produced five clusters. sample cluster labels were assigned based on metadata for anatomical location that was obtained from the published datasets and via input from the data generators. within adult datasets we modeled the association of age, sex, and smoking status with gene expression for ace , tmprss , and ctsl using a generalized linear model with the log total counts per cell as offset and poisson noise as implemented in statsmodels (version . . ) and using a wald test from diffxpy (www.github.com/theislab/diffxpy; version . . , batchglm version . . ). specifically we used the model: here, " denotes the raw count expression of gene i in cell j and age:sex, sex:smoking, and age:smoking are interaction terms of the three modeled covariates. these terms model whether there is a difference in the smoking effect in men and women, and likewise whether the age effect is different for smokers and non-smokers. while we model these interaction terms, we only tested age, sex, and smoking effects individually to reduce the multiple testing burden. 
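as a sketch, the model described in this section can be written out from the terms listed in the text: poisson noise, a log link with the log total counts per cell as offset, the three covariates, their pairwise interactions, and a dataset term. the coefficient names, the symbol $\mu_{ij}$ for the expected count, and the symbol $n_j$ for the scaled total counts of cell $j$ are notation assumed here for illustration; the coefficients are understood to be fit separately for each gene $i$.

```latex
\begin{aligned}
Y_{ij} &\sim \mathrm{Poisson}(\mu_{ij}), \\
\log \mu_{ij} &= \beta_{0}
  + \beta_{\mathrm{age}}\,\mathrm{age}_{j}
  + \beta_{\mathrm{sex}}\,\mathrm{sex}_{j}
  + \beta_{\mathrm{smoking}}\,\mathrm{smoking}_{j} \\
 &\quad + \beta_{\mathrm{age:sex}}\,\mathrm{age}_{j}\,\mathrm{sex}_{j}
  + \beta_{\mathrm{sex:smoking}}\,\mathrm{sex}_{j}\,\mathrm{smoking}_{j}
  + \beta_{\mathrm{age:smoking}}\,\mathrm{age}_{j}\,\mathrm{smoking}_{j} \\
 &\quad + \gamma_{\mathrm{dataset}(j)} + \log n_{j}
\end{aligned}
```

because the offset $\log n_j$ enters with a fixed coefficient of one, the remaining terms describe expression relative to sequencing depth rather than raw counts.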
we included the dataset term to model the batch effects between the diverse datasets we obtained, and the log total counts per cell was used as an offset. here, the total counts were scaled to have a mean of across all cells before the log was taken. in order to fit this model we pruned the data to contain only datasets that have at least donors and for which smoking status metadata was provided. this resulted in a dataset of , cells and samples from donors for adult lung data. only a single dataset remained for adult nasal data after this filtering on which the model could not be fit. to obtain cell-type specific associations the above model was fit within each cell type for all cell types with at least , cells. we performed wald tests over the age and sex covariates independently and corrected for multiple testing via benjamini-hochberg over all tests within a cell type. as metadata on smoking status was only available for a subset of the data, we also fitted a simpler model on a larger dataset to confirm sex and age associations. the simplified model: was fit on , cells in samples from donors of adult lung data. again, log total counts (scaled) was used as an offset. the above models treat cells as independent observations and thus model cellular and donor variation jointly. as donor variation tends to be larger than single-cell variation, when most cells come from few donors (either there are few donors, or few donors contribute most of the cells), this can lead to an inflation of p-values. to counteract this effect, we verified that significant associations are consistent when modeling only donor variation via pseudo-bulk analysis, and we tested whether effects are dependent on few donors by holding out datasets. pseudo-bulk data was generated by computing the mean for each gene expression value and numi covariate for cells in the same cell type and donor. after filtering as described above, models ( ) and ( ) were fit to the data. 
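the pseudo-bulk step described above, which averages each gene's expression and the numi covariate over all cells sharing a cell type and a donor, can be sketched as follows. this is a minimal standard-library illustration, not the authors' implementation; the per-cell dictionary layout and the key names are assumptions made for the example.

```python
from collections import defaultdict


def pseudobulk_means(cells):
    """Average counts and nUMI per (cell type, donor) group.

    cells: iterable of dicts with the hypothetical keys
           'cell_type', 'donor', 'numi', and 'counts' (gene -> raw count).
    A gene missing from a cell's 'counts' is treated as zero in that cell.
    """
    gene_sums = defaultdict(lambda: defaultdict(float))
    numi_sums = defaultdict(float)
    n_cells = defaultdict(int)

    for cell in cells:
        key = (cell["cell_type"], cell["donor"])
        n_cells[key] += 1
        numi_sums[key] += cell["numi"]
        for gene, count in cell["counts"].items():
            gene_sums[key][gene] += count

    return {
        key: {
            "n_cells": n,
            "mean_numi": numi_sums[key] / n,
            "mean_counts": {g: s / n for g, s in gene_sums[key].items()},
        }
        for key, n in n_cells.items()
    }


if __name__ == "__main__":
    toy = [
        {"cell_type": "typeA", "donor": "d1", "numi": 100, "counts": {"G1": 2}},
        {"cell_type": "typeA", "donor": "d1", "numi": 300, "counts": {"G1": 0, "G2": 4}},
        {"cell_type": "typeA", "donor": "d2", "numi": 50, "counts": {"G1": 1}},
    ]
    for key, summary in pseudobulk_means(toy).items():
        print(key, summary)
```

each group then contributes a single observation per donor, so a model fit to these means reflects donor-to-donor variation only.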
in contrast to the single-cell model, pseudo-bulk analysis underestimates certainty in modeled effects, as uncertainty in the pseudo-bulk means is not taken into account. thus, we used only effect directions from pseudo-bulk analysis to validate single-cell associations. we regarded as significant only those associations for which the fdr-corrected p-value in the single-cell model is below . and the sign of the estimated effect is consistent in both the single-cell and the pseudo-bulk analysis. we further separated significant associations into robust trends and indications depending on the holdout analysis. a significant association was regarded as a robust trend if the effect direction is consistent when holding out any dataset when fitting the model, irrespective of p-value. in the case that holding out one dataset caused the maximum likelihood estimate of the coefficient to be reversed, we denote this as the effect no longer being present, which characterized the association as an indication. for each of the lung, nasal, and gut datasets, we labeled the cells with non-zero counts for both ace and tmprss as dual-positive cells (dps), and the cells with zero counts for both ace and tmprss as dual-negative cells (dns). within each tissue, we identified cell types with greater than dps, and for each of these cell types, we selected the genes with increased expression (log fold change greater than ) in dps vs dns (so that we focus on important "positive" features). we trained a classifier with a : train:test split to classify the dps from dns within each of these cell types using the sklearn (version . . ) randomforestclassifier class with the following parameters: n_estimators set to , the criterion as gini, and the class_weight parameter set to balanced_subsample. 
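the rules given earlier in this section for calling an association significant, and for separating robust trends from indications, amount to a small decision procedure. the sketch below is an illustration under stated assumptions: the function and argument names are hypothetical, a coefficient of exactly zero is treated as having no direction, and the fdr threshold is a required argument because its value is not recoverable from this text.

```python
def same_direction(x, y):
    """True when both coefficients are non-zero and share a sign."""
    return x != 0 and y != 0 and (x > 0) == (y > 0)


def classify_association(fdr_p, sc_coef, pb_coef, holdout_coefs, alpha):
    """Label one covariate-gene association.

    fdr_p:         FDR-corrected p-value from the single-cell model
    sc_coef:       coefficient estimated by the single-cell model
    pb_coef:       coefficient estimated by the pseudo-bulk model
    holdout_coefs: single-cell coefficients refit with each dataset held out
    alpha:         significance threshold on fdr_p
    """
    significant = fdr_p < alpha and same_direction(sc_coef, pb_coef)
    if not significant:
        return "not significant"
    # Hold-out p-values are ignored on purpose: only the direction matters.
    if all(same_direction(sc_coef, c) for c in holdout_coefs):
        return "robust trend"
    return "indication"


if __name__ == "__main__":
    # alpha=0.05 is an illustrative threshold, not a value from the text.
    print(classify_association(0.01, 0.5, 0.2, [0.4, 0.6], alpha=0.05))
    print(classify_association(0.01, 0.5, 0.2, [0.4, -0.1], alpha=0.05))
    print(classify_association(0.01, 0.5, -0.2, [0.4, 0.6], alpha=0.05))
```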
we first trained individual classifiers separately for each of the cell types, and pooled genes with positive feature importance values (using the feature_importances_ attribute of the trained randomforestclassifier object) to train a final dp vs dn classifier across each tissue. we used the top genes, as ranked by their feature importance scores, to define the signature for the gene expression program of dps for the tissue. this procedure was carried out in lung, nasal, and gut datasets, yielding tissue-specific signatures for gene expression programs of dps from each tissue. for visualization purposes only, we generated network diagrams using the networkx (version . ) tool with the forceatlas graph layout algorithm . we scored genes that appeared in signatures for multiple tissues by their aggregated feature importance (using a plotting heuristic that used the sum of importance ranks for genes in individual tissues and assigned a large-valued rank ( ) to a gene that did not appear in a particular tissue) and selected the top genes that were shared by each pair of tissues or shared by all tissues along with additional genes that included the ones unique to each tissue's signature to plot in the network visualization. the go terms enriched in the gene expression programs shared by dps across tissues were found using gprofiler (version . . ) using the scanpy.queries.enrich tool. this analysis was performed in two ways: on the original data, as well as after accounting for differences in distribution of the number of umis (numi) per cell between dps and dns. this was done by binning the numi distribution in the dps for each tissue into bins and then randomly sampling from the numi distribution for the dns in each bin to match the distribution of the dps in that bin. the numi distributions before and after the matching are shown in extended data fig. b . in parallel, we used a regression framework to recover gene modules enriched in dp vs. dn cells (fig. 
c,d, extended data fig. a,b) in the nasal, lung, and gut datasets. we first restricted our analysis to cell subsets derived from at least two donor individuals that each contained a mixture of dn and dp cells (nawijn nasal: multiciliated, goblet; regev/rajagopal lung: at , at , basal, multiciliated, secretory; aggregated lung: at , multiciliated, secretory; regev/xavier colon: best + enterocytes, cycling ta (transit amplifying), enterocytes, immature enterocytes , ta- ). for each of these cell subsets, we then used mast (version . . ) to fit the following regression model to every gene with cells as observations: where yi is the expression level of gene i in cells, measured in units of log (tp k+ ), x is the binary co-expression state of each cell (i.e. dp vs. dn), and s is the donor that each cell was isolated from. to control for donor-specific effects (i.e. batch effects), we used a mixed model with a random intercept that varies for each donor. to fit this model, we subsampled cells from dp and dn groups to ensure that both the donor distribution and the cell complexity (i.e. the number of genes per cell) were evenly matched between the two groups, as follows. first, for each subset, we restricted our analysis to donors containing at least two dn and two dp cells. using these samples, we partitioned the cells into equally-sized bins based on cell complexity and subsampled dn cells from each bin to match the cell complexity distribution of the dp cells. finally, we fit the mixed model (above), controlling for both donor and cell complexity. to build gene modules for dp cells, we prioritized genes by requiring that they be expressed in at least % of dp cells and have a model coefficient greater than with an fdr-adjusted p-value less than . (for the combined coefficient in the hurdle model). after this filtering step, genes were ranked by their model coefficient (i.e. estimated effect size). 
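the complexity-matched subsampling described above can be sketched as follows. this is a minimal illustration and not the authors' code: the function name is hypothetical, the bins are assumed to be equal-width intervals spanning the dp range (the text could also be read as equal-count bins), and within each bin dn cells are drawn without replacement up to the number of dp cells in that bin.

```python
import random


def match_by_bins(dp_values, dn_values, n_bins, seed=0):
    """Indices of DN cells subsampled to follow the DP complexity distribution.

    dp_values / dn_values: per-cell complexity (e.g. genes detected per cell).
    DN cells outside the DP range are never selected.
    """
    rng = random.Random(seed)
    lo, hi = min(dp_values), max(dp_values)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range

    def bin_of(value):
        # The maximum value falls into the last bin rather than a new one.
        return min(int((value - lo) / width), n_bins - 1)

    dp_per_bin = [0] * n_bins
    for value in dp_values:
        dp_per_bin[bin_of(value)] += 1

    dn_by_bin = [[] for _ in range(n_bins)]
    for idx, value in enumerate(dn_values):
        if lo <= value <= hi:
            dn_by_bin[bin_of(value)].append(idx)

    chosen = []
    for b in range(n_bins):
        k = min(dp_per_bin[b], len(dn_by_bin[b]))
        chosen.extend(rng.sample(dn_by_bin[b], k))
    return sorted(chosen)


if __name__ == "__main__":
    dp = [100, 110, 900, 950]
    dn = [105, 120, 130, 500, 910, 2000]
    print(match_by_bins(dp, dn, n_bins=2))
```

when a bin holds fewer dn cells than dp cells, this sketch simply takes all available dn cells; a stricter variant could downsample the dp cells in that bin as well.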
the top genes were selected for network visualization within each cell type (fig. c,d, extended data fig. a,b) . in three cases (gut cycling ta, ta- and best + cells), rp -* antisense genes were flagged and excluded from visualizations. to visualize overlap across each network, we indicated whether each gene was among the top genes from each of the other cell types. putative drug targets were identified by querying the drugbank database . gene set enrichment analysis was performed using the r package enrichr (version . ) , selecting the top genes from each cell type for the pan-tissue analysis ("all" category; fig. e ), and the top genes from each cell type for the tissue-specific analyses ("gut", "nasal", and "lung" categories; fig. e ). we note a few caveats and limitations that may influence our results, including non-uniform sampling across donors, variation in cell compositions across regions (e.g., distal lung vs carina), and additional cellular heterogeneity that the current level of broad subset annotation may not have captured. cellphonedb v. . . was run with default parameters on the human lung samples of the regev/rajagopal dataset, analyzing the cells from each dissected region separately. for each sample (patient/location combination), for each cell type we distinguished double positive cells (ace > and tmprss > ) from all others. only interactions highlighted as significant, i.e. present in the "significant means" output from cellphonedb, were considered. ace -protease co-expression (figure , extended data fig. ) and ace -il /il r/il st coexpression (extended data fig. ) were tested via the logistic mixed-effects model described in "integrated co-expression analysis of high resolution cell annotations across tissues" (equation , above). 
data and an interactive analysis examining the co-expression of genes across datasets can be accessed via the open-source data platform, terra at https://app.terra.bio/#workspaces/kcoincubator/covid- _cross_tissue_analysis. interactive visualization and download of gene expression data can be accessed on the single cell portal at https://singlecell.broadinstitute.org/single_cell?scpbr=hca-covid- -integrated-analysis . n.k. was a consultant to biogen idec, boehringer ingelheim, third rock, pliant, samumed, numedii, indaloo, theravance, lifemax, three lake partners, optikira and received non-financial support from miragen. all of these are outside the work reported. j.l. is a scientific consultant for x genomics inc. a.r. is a co-founder and equity holder of celsius therapeutics, an equity holder in immunitas, and an sab member of thermofisher scientific, syros pharmaceuticals, asimov, and neogene therapeutics. o.r.r. is a co-inventor on patent applications filed by the broad institute for inventions relating to single cell genomics applications, such as in pct/us / and us provisional application no. / , . a.k.s. reports compensation for consulting and sab membership from honeycomb biotechnologies, cellarity, cogen therapeutics, orche bio, and dahlia biosciences. s.a.t. was a consultant at genentech, biogen and roche in the last three years. f.j.t. reports receiving consulting fees from roche diagnostics gmbh, and ownership interest in cellarity inc. l.v. is founder of definigen and bilitech, two biotech companies using hpscs and organoids for disease modelling and cell based therapy. (c) statistical model. model fitted to the data to assess sex, age, and smoking status associations with expression of the three genes. denotes gene counts and numi denotes the total umi counts per cell. (d) age, sex, and smoking status associations with expression of ace (blue), tmprss (orange), and ctsl (green) in epithelial cells. 
effect size (x axis) of the association, in log fold change (sex, smoking status) or slope of log expression with age. colored bars: associations with an fdr-corrected p-value< . , where pseudo-bulk analysis shows a consistent effect direction. error bars: standard errors around coefficient estimates. (e) distribution of ace and tmprss expression across level lung cell types. red shading indicates the main cell types that express ace and tmprss . (f) hold out analysis shows the robustness of associations to holding out a dataset. the values show the number of held-out datasets that result in loss of association between a given covariate (rows) and ace , tmprss , or ctsl expression in a given cell type (columns). robust trends are determined by significant effects that are robust to holding out any dataset ( values). (g) low expression in pediatric samples. mean expression level (log cpm, y axis) of ace (blue), tmprss (orange), and ctsl (green) across age bins (x axis) in at (left) and ciliated (right) cells. pediatric samples: - years. samples from past or current smokers were removed from this plot to avoid smoking confounders. error bars are omitted due to y-axis limitations. they are typically -fold the mean value (supplementary table ). multiciliated and at cells are shown as these cell types are present in fetal data, and show significant age associations with ace expression. (a) gradual increase in ace expression by airway epithelial cell type with age. mean expression (y axis) of ace in different airway epithelial cells (x axis) of mice of three consecutive ages (color legend, upper right). shown are replicate mice (dots), mean (bar), and error bars (standard error of the mean (sem)). (b) increase in proportion of ace + ctsl + goblet and club cells with age. percent of ace + ctsl + cells (x axis) in different airway epithelial cell types (y axis) of mice of three consecutive ages (color legend, upper right). 
shown are replicate mice (dots), mean (bar), and error bars (sem). (c-j) increase in ace expression in secretory cells with smoking. mice were exposed daily to cigarette smoke, or to filtered air as control, for two months, after which cells from whole lung suspensions were analyzed by scrna-seq (drop-seq). (c,d) umap of scrna-seq profiles (dots) colored by experimental group (c) or by ace + cells and indicated double positive cells (d). alveolar epithelial cells (at and at ) and airway epithelial secretory and ciliated cells are marked. (e) the relative frequency of ace + cells is increased by smoking in airway secretory cells but not at cells. relative proportion (y axis) of ace + (red) and ace - (grey) cells in smoking and control mice of different cell types (x axis). (f, g) expression of ace is increased in airway secretory cells, but not in at cells. distribution of ace expression (y axis) in secretory (f) and at (g) cells from control and smoking mice (x axis). (h-j) re-analysis of published bulk mrna-seq of lungs exposed to different daily doses of cigarette smoke shows increased expression of (h) ace , (i) tmprss , and (j) ctsl after five months of chronic exposure.

extended data figure . age, sex, and smoking status associations with expression of ace , tmprss , and ctsl across level cell type annotations. effect size (y axis) of association as log fold changes (sex, smoking status) and slope of log expression with age. bars that are colored in indicate associations with an fdr-corrected p-value of < . where the pseudo-bulk analysis shows a consistent effect direction. error bars represent model uncertainties.

extended data figure . age, sex, and smoking status associations with expression of ace , tmprss , and ctsl across level cell type annotations. effect size (y axis) of the association as log fold changes (sex, smoking status) and slope of log expression with age. bars that are colored in indicate associations with an fdr-corrected p-value of < .
where the pseudo-bulk analysis shows a consistent effect direction. error bars represent model uncertainties.

fig. . cell programs for dual positive cells. (a,b) top genes from each cell program recovered for different lung (a) or gut (b) epithelial cell types (nodes, colors). colored concentric circles: overlap with a gene in the top significant genes in other cell types. ace and tmprss are included even if not among the top . (c) comparison of signature scores of cell programs between dp and dn cells for each cell type stratified by gene complexity bin. cells were partitioned into gene complexity bins for every cell type. (d,e) il and its receptor's expression in specific cell types in lung and heart. (d) significance (dot size) and fold change (dot color) of differential expression between dp and dn cells within different types (rows) for il and its receptors il r and il st (columns) across tissues. (e) top: significance (dot size) and fold change (dot color) of differential expression between dp and dn cells within different cell types in the heart (rows) for il and its receptors il r and il st (columns). bottom: significance (dot size) and effect size (dot color) from a mixed effects model of co-expression of il , il r, or il st (columns) with ace . (f) distribution of number of counts in peaks (y axis) in ace + epithelial cells (having at least fragment in the ace gene locus) and ace cells.

figure . co-expression of ace and il , il r, il st. co-expression of ace and il , il r, il st in select single-cell datasets. p-values and significance (fdr %) derived from the logistic mixed-effects model.

figure . expression of ace , tmprss and ctsl in mouse placenta. umap embedding (as in fig. k ) of scrna-seq profiles of placenta cells collected at e. . (top) or along a time course (bottom), colored by expression level of ace , tmprss , and ctsl.

figure . additional analyses to identify other proteases that may have a role in infection.
(a) multiple proteases are co-expressed with ace in another human lung scrna-seq ("aggregated lung"). scatter plot of significance (y axis, -log (adjusted p value)) and effect size (x axis) of co-expression of each protease gene (dot) with ace within each indicated epithelial cell type (color). dashed line: significance threshold. tmprss and pcsks that significantly coexpressed with ace are marked. (b) ace -protease co-expression with pcsks, tmprss and ctsl across lung cell types ("aggregated lung"). significance (dot size, -log (adjusted p value)) and effect size (color) for co-expression of ace with selected proteases (columns) across cell types (rows). (c-d) predicted cleavage sites in the sars-cov- s-protein s /s region. (c) multiple amino acid sequence alignment of sars-cov- s-protein s /s region with orthologous sequences from other betacoronaviruses (top) and polybasic cleavage sites of other human pathogenic viruses (bottom).
key: cord- -a dlaxn authors: johnson, todd a.; mashimo, yoichi; wu, jer-yuarn; yoon, dankyu; hata, akira; kubo, michiaki; takahashi, atsushi; tsunoda, tatsuhiko; ozaki, kouichi; tanaka, toshihiro; ito, kaoru; suzuki, hiroyuki; hamada, hiromichi; kobayashi, tohru; hara, toshiro; chen, chien-hsiun; lee, yi-ching; liu, yi-min; chang, li-ching; chang, chun-ping; hong, young-mi; jang, gi-young; yun, sin-weon; yu, jeong-jin; lee, kyung-yil; kim, jae-jung; park, taesung; lee, jong-keuk; chen, yuan-tsong; onouchi, yoshihiro title: association of an ighv - gene variant with kawasaki disease date: - - journal: j hum genet doi: . /s - - -z sha: doc_id: cord_uid: a dlaxn in a meta-analysis of three gwas for susceptibility to kawasaki disease (kd) conducted in japan, korea, and taiwan and follow-up studies with a total of , subjects ( cases and controls), a significantly associated snv in the immunoglobulin heavy variable gene (ighv) cluster in q . was identified (rs ; or = . , p = . × (− )). investigation of nonsynonymous snvs of the ighv cluster in japanese subjects identified the c allele of rs , located in ighv - , as the most significant reproducible association (or = . , p = . × (− ) in cases and controls). we observed highly skewed allelic usage of ighv - , wherein the rs a allele was nearly abolished in the transcripts in peripheral blood mononuclear cells of both kd patients and healthy adults. association of the high-expression allele with kd strongly indicates some active roles of b-cells or endogenous immunoglobulins in the disease pathogenesis. considering that significant association of snvs in the ighv region with disease susceptibility was previously known only for rheumatic heart disease (rhd), a complication of acute rheumatic fever (arf), these observations suggest that common b-cell related mechanisms may mediate the symptomology of kd and arf as well as rhd.
kawasaki disease (kd) is an acute systemic vasculitis syndrome characterized by high fever, bilateral conjunctivitis, polymorphous skin rash, reddening of lips and oral cavity, changes in extremities and nonsuppurative cervical lymphadenopathy [ ] . it predominantly affects infants and children younger than years [ ] . in many cases, kd is self-limiting. however, it causes coronary artery complications such as dilatations and aneurysms (coronary artery lesions; cals) in - % of untreated patients [ ] . having replaced acute rheumatic fever (arf), kd is now the leading cause of acquired heart disease in children in developed countries [ ] . most kd patients are treated with high-dose intravenous immunoglobulin (ivig) infusion combined with oral aspirin, a regimen that was established in the s and s and has become a standard treatment that is effective at resolving inflammation and reducing cals [ ] [ ] [ ] . however, the mechanism of ivig action in kd remains unclear, and - % of kd patients do not respond to the treatment and have a higher risk for cal. recently, a series of genome-wide association studies (gwas) revealed several definitive kd susceptibility loci [ ] [ ] [ ] . however, these genetic factors can explain only a part of the etiology. also, the reason for the high incidence among east asian children [ ] , which is up to -fold higher than that in western countries and is a crucial epidemiologic feature of kd, has not yet been explained. this might be attributed in part to a lack of statistical power in the previous gwas analyses, which were carried out using modest sample sizes, so that many common genetic factors with relatively low penetrance may have gone undetected. in this study, to identify novel susceptibility loci for kd, we conducted a meta-analysis of gwas from japan, korea, and taiwan, the three countries with the highest incidence of kd in the world.
in this study, susceptibility loci for kd were screened and verified in a two-stage association analysis (fig. ) . stage was a whole-genome meta-analysis of results from gwas conducted in japan [ ] , taiwan [ ] , and korea [ ] involving individuals. we carried out follow-up association studies using three case-control panels comprising kd patients and controls (japanese), kd patients and controls (korean) and kd patients and controls (taiwanese), all independent from the subjects in the gwas, and performed meta-analyses with the stage results (stage ). loci for the follow-up studies were selected based on their predicted potential to achieve p values less than . × − in the meta-analyses, with the prediction made by iterative simulations of follow-up studies with virtual case and control cohorts (detailed below). to further validate the associations of rs and rs in q . , an additional kd cases and controls collected in japan were used. the numbers of kd cases and controls, as well as the platforms used in the three previous gwas and follow-up studies in japan, korea, and taiwan, are summarized in supplementary table .

genotypes were imputed with the genomes east asian haplotypes (ref. [ ]) as reference, using pre-phasing with shapeit v (ref. [ ] ) and imputation with the impute and minimac software [ , ] . imputed genotype data were analyzed by each study center in a case-control logistic regression analysis, and the output was merged in the r statistics environment (url: https://www.r-project.org/) and filtered for variants that were polymorphic in both cases and controls (maf cases ≥ . and maf controls ≥ . ) and had info ≥ . in all three studies' datasets; there were , , snvs in the final filtered dataset. a fixed-effects meta-analysis of the three data sets' beta-coefficients and standard errors was performed using the r package metafor (https://cran.r-project.org/package=metafor). each chromosome's snvs were filtered for those with a meta-analysis p < .
and then labelled based on linkage disequilibrium to regional "top snvs". briefly, snvs with p < × − were sorted by p value, the top snv was identified, and any snvs within ± mb that had r > . to that top snv were assigned to its region; each chromosome was processed in this way until no snvs with p < × − remained. each top snv region was labelled based on chromosome and the minimum and maximum positions of the linked snvs. as a result of that process, each region label is unique, but regions can overlap with each other. calculation of r was performed using the ld function in the r bioconductor snpstats package with east asian genotype data from the genomes phase version release. for simplicity, we will refer to the regions of nominally linked snvs described above as "loci".

to perform the stage follow-up study efficiently, loci that had a high potential of achieving p values less than . × − in a meta-analysis of stage and results were selected by p value simulation as follows. for each locus, we first selected any nominally associated snvs (p < . ) identified in the stage analysis, and for each snv created a virtual set of , case genotypes and a virtual set of , control genotypes for each of the three stage case-control panels. each virtual set was generated based on the genotype frequencies observed in that panel's stage cases or controls at a particular snv, and each of those virtual sets was then re-sampled in r to randomize its order. each virtual case and control set was then sampled times to create virtual case-control cohorts; the numbers of cases and controls sampled from the virtual genotype sets were the case-control counts that were expected in the three collaborators' follow-up studies (riken in japan: n cases = , n controls = ; academia sinica in taiwan: n cases = , n controls = ; asan medical center, university of ulsan in korea: n cases = , n controls = ).
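The p value simulation outlined above can be illustrated with a minimal Python sketch. It is not the authors' R code: the pool size, the frequencies, the cohort sizes and all function names are hypothetical, cohorts are drawn without replacement (the text does not say which scheme was used), and the per-cohort logistic regression is swapped for a simpler allelic log odds ratio with a Woolf standard error. The fixed-effects step is the standard inverse-variance weighting of beta-coefficients.

```python
import math
import random


def virtual_genotypes(freqs, size, rng):
    """Shuffled pool of genotypes (0, 1 or 2 effect alleles) that follows
    the given genotype frequencies (hypothetical helper)."""
    counts = [round(f * size) for f in freqs]
    # Absorb any rounding drift into the most common genotype.
    counts[counts.index(max(counts))] += size - sum(counts)
    pool = [g for g, c in enumerate(counts) for _ in range(c)]
    rng.shuffle(pool)
    return pool


def allelic_log_or(cases, controls):
    """Log odds ratio and Woolf standard error from allele counts.
    Assumption: 0.5 is added to every cell to avoid division by zero."""
    a = sum(cases) + 0.5                       # effect alleles, cases
    b = 2 * len(cases) - sum(cases) + 0.5      # other alleles, cases
    c = sum(controls) + 0.5                    # effect alleles, controls
    d = 2 * len(controls) - sum(controls) + 0.5
    return math.log(a * d / (b * c)), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def fixed_effects_p(estimates):
    """Two-sided p value of the inverse-variance weighted mean of
    (beta, standard error) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    z = beta * math.sqrt(sum(weights))         # beta / se, se = 1/sqrt(sum w)
    return math.erfc(abs(z) / math.sqrt(2))


def simulation_fraction(gwas, panels, threshold, n_iter,
                        pool_size=10_000, seed=0):
    """Fraction of simulated meta-analyses with p <= threshold."""
    rng = random.Random(seed)
    pools = [(virtual_genotypes(p["case_freqs"], pool_size, rng),
              virtual_genotypes(p["control_freqs"], pool_size, rng),
              p["n_cases"], p["n_controls"]) for p in panels]
    hits = 0
    for _ in range(n_iter):
        follow_up = [allelic_log_or(rng.sample(ca, n_ca), rng.sample(co, n_co))
                     for ca, co, n_ca, n_co in pools]
        if fixed_effects_p(gwas + follow_up) <= threshold:
            hits += 1
    return hits / n_iter
```

Here `gwas` is a list of (beta, standard error) pairs from the discovery studies and each entry of `panels` describes one follow-up panel. Fixing the seed keeps a run reproducible, which is convenient when comparing candidate SNVs within a locus.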
for each iteration, associations of the candidate snvs were evaluated in a meta-analysis of the virtual cohorts and the three gwas data sets. for each candidate snv, the frequency of observing p values of . × − or smaller in iterations of the simulated meta-analysis was scored as the simulation score. loci with at least one snv having a simulation score of . or higher were considered to be promising, and for de novo genotyping, a representative snv was selected for which assays were designable across the different platforms employed in each research center (invader in japan, sequenom massarray or taqman in taiwan, and veracode goldengate genotyping kit or taqman in korea) (supplementary table ).

nonsynonymous snvs in ighv genes were genotyped primarily by the invader assay. primers and probes were carefully designed to ensure specificity of the assay. we refrained from using multiplex pcr to avoid both expected and unexpected nonspecific amplification of dna fragments of high sequence homology, which could allow cross-reaction between amplicons and probes for different loci. sequences of the primers and probes for nonsynonymous snvs in ighv genes and rs , as well as representative genotyping results of rs , are provided in supplementary tables and and supplementary fig. , respectively.

next-generation sequencing (ngs) of ighv repertoires

two milliliters of venous blood was drawn from patients who were admitted to the hospitals for kd at four time points: ( ) acute phase before receiving ivig ( - illness days), ( ) h after the patients became afebrile ( - illness days), ( ) the first follow-up visit to the pediatric clinic after discharge ( - days after the disease onset), and ( ) the second follow-up visit to the pediatric clinic after discharge ( - months after the disease onset).
blood samples were collected into vacutainer cpt cell preparation tubes (bd) and mononuclear cells were separated according to the manufacturer's instructions. total rna from the mononuclear cells was extracted by using the rneasy mini kit (qiagen). . μg of rna was reverse transcribed with primescript (takara) and a mixture of random hexamer and oligo-dt primers. isotype-specific libraries for ngs were prepared as follows. mixed forward primers covering the framework region of subgroups of ighv genes (v -v ) [ ] and reverse primers specific to each ighc gene for igm, igd, igg, and iga (including a partial illumina adapter sequence in the ′ ends of both primers) were designed for the st round pcr. sequences of the primers are provided in supplementary table . a -base barcode sequence and the full illumina adapter sequence were added at the ′ and ′ ends of the immunoglobulin amplicons in the nd round pcr. the barcode sequences were used to distinguish the patients and the sampling time points. the libraries were sequenced with miseq reagent kit v ( -cycle) (illumina). forward and reverse sequence reads were merged by using flash software [ ] . reads that could not be merged because of insufficient overlap length were excluded from subsequent analyses. merged sequence reads were classified into isotypes and subclasses based on primer sequences by using blast software (https://blast.ncbi.nlm.nih.gov/blast.cgi), and then quality filtering and primer sequence removal were performed by using the fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html). immunoglobulin repertoires were determined by migmap- . . software (https://github.com/mikessh/migmap). sequence reads that terminated prematurely or had a complementarity determining region (cdr) that was non-canonical (i.e., lacking consensus amino acids at both ends or not fully mapped) were excluded from subsequent analyses.
clonotypes were defined by combinations of ighv, igh diversity (ighd), and igh joining (ighj) gene alleles. the correlation trend between the proportions of the igh clonotypes using ighv - and the c allele counts at rs was evaluated by the jonckheere-terpstra test using the r package pmcmr (https://cran.r-project.org/package=pmcmr). the change in each clonotype proportion from baseline (the st sampling point) was calculated by subtracting the baseline proportion from those at the nd, rd, and th points. the diversity of cdr clonotypes was evaluated by the inverse simpson's index using the r package vegan (https://cran.r-project.org/package=vegan).
in the stage analysis, a meta-analysis of three gwas data sets, genome-wide level associations were seen in known susceptibility loci such as fam a-blk (rs ; p = . × − ), itpkc (rs ; p = . × − ), casp (rs ; p = . × − ) and cd (rs ; p = . × − ). four top snvs in hla class iii~class ii regions (rs , rs , rs , and rs ) and rs in the q . region also showed significant associations (supplementary table , supplementary fig. ). a standard test for inflation gave a genomic control inflation factor λ gc = . (supplementary fig. ). in a simulation of meta-analyses, a "simulation score" was calculated for each nominally associated snv (p ≤ . ; see methods). forty-nine loci were identified with one or more snvs expected to satisfy a p value threshold of . × − with a simulation score of . or higher (supplementary table ). from every locus, we selected representative snvs for genotyping in replication sample panels and then performed a meta-analysis with data from the three gwas. for one particular snv locus (rs ) that satisfied the criteria, a genotyping assay could not be designed due to the complexity of the surrounding sequence. as a result, out of examined loci showed significant associations (table , supplementary table ).
these included four previously known susceptibility gene loci (casp , fam a-blk, itpkc, and cd ), seven on chromosome p (nc_ . : . - . mb) and one in the immunoglobulin heavy variable gene (ighv) region on chromosome q . . the trend of association for rs on chromosome q . was not replicated in the follow-up sample panels from the three populations (supplementary table ).
[table caption: association signals that newly achieved genome-wide significance after the meta-analyses of three gwas and follow-up association studies]
a significant association of rs located near hla-dob and hla-dqb genes with kd susceptibility was reported in the earlier japanese gwas [ ]. however, the present study revealed that the association statistics for rs were not consistent across the other east asian populations (supplementary table ). instead, out of groups of snvs in the p region examined in the stage analyses showed significant association in the meta-analyses of the data sets in the three gwas as well as in the follow-up studies (table and supplementary fig. a). to verify that those seven signals were associated independently of rs , we calculated ld between these snvs and rs , and we also performed logistic regression analyses conditioning on rs using the japanese stage sample set. consistent with information from genomes, the seven candidates were not in high ld with each other (supplementary fig. b). the association was most significant for rs , and in the conditioned logistic regression analyses, the p value for rs , which was located just kb distal to and in marginal ld with rs (d′ = . , r = . ), became nonsignificant (p > . ). after conditioning, p values for the other snvs including rs itself increased but were still significant (p ≤ . ). (supplementary table ).
these included snvs with previous information on their functional significance or association with diseases such as rs , where the associated allele is linked to the a allele at rs (r = . in the genomes jpt population), which has been associated with reduced expression of lymphotoxin a [ ] , and rs , which was previously associated with asthma in a japanese population [ ] . thus, further investigation including that in each ethnic group is needed to unravel the involvement of the variants in this region in kd susceptibility. in the meta-analysis of the stage and studies, a significant association was also obtained for an snv within the chromosome q . region represented by rs (p = . × − ) ( table ) . five hundred and nineteen snvs, linked with rs (r > . ), and having simulation score of . or higher were distributed across a kb region ( . - . mb) of this locus where many ighv genes as well as their pseudogenes are clustered in tandem. similar to the hla region, most ighv genes harbor common nonsynonymous snvs within them contributing to the high level of diversity observed within the immunoglobulin heavy chain repertoire. therefore, we proceeded to analyze the nonsynonymous snvs in that locus. among the snvs, were nonsynonymous and the others were intergenic, intronic, in the untranslated region, or synonymous snvs. when the japanese sample set used in the stage analysis was genotyped for these nonsynonymous snvs, rs in the ighv - gene showed the most significant association ( table ) . as expected, some other nonsynonymous snvs also showed a similar trend of association. however, those associations, including that of rs , were considered to be explained by ld with rs because they were no longer significant after conditioning the logistic regression analyses on rs ( table ). in contrast, rs remained significant (p < . ) after adjusting for any of the other snvs (data not shown). 
in a stage analysis, rs and rs were further examined in an additional japanese sample set, and rs achieved genome-wide significance in a meta-analysis of the japanese sample panels (table ). meanwhile, the association trend of rs was weaker in korean subjects and was not reproduced in taiwanese subjects (supplementary table ).
rs genotypes and ighv - gene usage in the immunoglobulin heavy chain repertoire
the significant association of an snv within ighv - with kd prompted us to investigate the impact of rs on the function of immunoglobulin molecules or b cell receptors (bcrs) harboring ighv - in their heavy chain variable regions. first, we examined ighv usage in different immunoglobulin isotypes (igm, igd, igg, and iga) by analyzing the immunoglobulin heavy chain repertoire in peripheral b lymphocytes from healthy japanese individuals (n = ) and patients with acute kd (n = ) by ngs. for every ig isotype, we could identify ighv - clonotypes in each individual and found that the ighv - usage tended to be correlated with the number of c alleles each individual had (fig. a and supplementary fig. ). consistent with the correlation trend, we observed significantly skewed usage of the alleles in the analyses of the five patients who were heterozygous for rs , with the protective allele (a) almost completely suppressed in all immunoglobulin classes and time points examined (supplementary fig. ). a similar correlation trend was also observed between rs genotypes and relative levels of igh transcripts for ighv - , the closest neighboring gene to ighv - . however, the correlations were more modest than that seen for ighv - (supplementary fig. ). such results could arise from paralogues or pseudogenes of ighv - that carry the 'a' nucleotide at the position corresponding to rs , or from pcr amplification bias in library preparation caused by additional variants interfering with primer annealing.
however, the distinct separation pattern of the allele discrimination plot in the invader assay and the clear electropherogram data in the direct sequencing of pcr-amplified genomic dna from heterozygous patients do not support that possibility (supplementary fig. ). thus, we consider these observations not to be artifactual. we also ruled out misassignment of ighv - with the a allele at rs (ighv - * ) to other alleles (* , * , or * ) or paralogues such as ighv - by confirming that the migmap software could correctly assign an artificially created sequence with the rs a allele as ighv - * (data not shown). to obtain an insight into the role of immunoglobulins using ighv - in their heavy chain variable region in the pathogenesis of kd, we investigated how ighv - usage in the immunoglobulin heavy chain gene repertoire changed over time during the clinical courses of nine kd patients (supplementary table ). proportions of igh clones harboring ighv - in their variable region appeared stable or showed no common pattern of variation for igm, igd, and iga, while in two patients (patients and ), transient increases were observed for igg at the third evaluation point (fig. b). we then examined the increased clonotypes in the igg heavy chain repertoire more precisely by defining them by the combinations of ighv, ighd, and ighj genes. we found that both patients and had single clonotypes with ighv - that increased . % or more as a proportion of the total from the first to the third evaluation points. however, their gene combinations were not the same (table ), and cdr amino acid sequences corresponding to the v-d-j combinations also differed between the patients (data not shown). five of the remaining seven patients also had one or more igg heavy chain clonotypes with ighv - that increased . % or more at the same follow-up time point.
however, cooccurring v-d-j combinations in the increased clonotypes were only seen between patients and . on the other hand, some of the cdr clonotypes which have recently been reported to dominate in the igm heavy chain repertoire (> . %) in the acute pretreatment phase of taiwanese kd patients [ ] were also prevalent at a similar time point in the japanese kd patients in this study (supplementary table ). a distinct increase in diversity of the cdr clonotypes of igm heavy chain during the convalescent phase of kd, which has also been reported in previous research of taiwanese kd patients [ ] , was observed for four of the nine kd patients ( supplementary fig. ). it has generally been considered that the immune-mediated vasculitis of kd is triggered in response to infection with some type of microorganism. this assumption was made based on factors indicative of a primary infection, such as its clinical symptoms, which included fever and skin rash, epidemiologic findings such as its peak age of onset ( months to year), along with the seasonality of the occurrence of regional outbreaks and nation-wide epidemics [ , ] . despite tremendous efforts, no single microorganism has been conclusively proven as the pathogen of kd and lack of information about the pathogen has been a significant obstacle to causal treatment and disease prevention. histopathological and immunological studies have revealed activation of neutrophils, macrophages, and monocytes in the acute phase of kd [ , ] , and now the leading hypothesis for the pathophysiology of kd inflammation is an attack of the dysregulated or hyperactive innate immune system against vascular walls [ ] . in addition, genetic studies using a genome-wide approach have identified several robust susceptibility loci/genes for kd [ - , , - ] , and an in silico prediction of responsible variants and the types of cells where they have biological significance has highlighted the importance of b cells in the pathogenesis of kd [ ] . 
in previous literature, polyclonal activation and increase of b cells [ , ] and detection of auto-antibodies against components of vascular endothelial cells or neutrophils in peripheral blood of acute kd patients have been documented [ , ]. infiltration of oligoclonal plasma cells producing iga into the vascular wall of kd patients was also reported [ ]. these previous findings have suggested the involvement of b cells in kd inflammation. however, a recent transcriptomic study revealed downregulation of genes related to bcr signaling in acute kd patients [ ]. thus, the evidence is mixed, and there is no consensus on the role of b cells, or of the adaptive immune system more broadly, in kd pathogenesis. in this study, we identified genetic variants within the ighv region significantly associated with kd in the japanese population. the immunoglobulin heavy (igh) locus spans . mb (from . to . mb on nc_ . ) of the chromosome q . region and encompasses serially arranged igh constant (ighc), ighj, ighd, and ighv clusters.
[table footnote: clonotypes that increased more than % between the st and rd evaluation points are in boldface]
because high sequence redundancy obstructs the design of specific genotyping assays, there is a large blank area (~ . mb) that was not covered by the snp arrays, and we therefore lack association information about the snvs in that area (supplementary fig. ). however, because snvs within the gap are not in high ld (r < . ) with rs or rs in the genomes jpt population (supplementary fig. ) and we did not observe association signals in the chromosomal region on the opposite side of the blank area, the association signals represented by rs or rs might be localized within the region that we could investigate in this study. in contrast to the robust association of rs in the japanese subjects, the association was not consistent in the korean and taiwanese subjects.
although neither significant nor consistent with the results in the current study, a marginal trend of association between kd and snvs within the ighv region that are linked to rs (r = . - . in the genomes chb population) had already been reported in the taiwanese population [ ]. findings in the previous and the current studies strongly indicate that variants in the ighv region are commonly involved in kd pathogenesis, at least in the japanese and taiwanese populations. however, the robustness of association might not be uniform among the populations. given that multiple pathogens with regionally or seasonally differing epidemic patterns could be triggering kd, it is possible that various susceptibility genes or alleles corresponding to different antigenicity for the pathogens exist in this locus and account for the mixed robustness of the association. highly skewed expression of igh transcripts with the risk-associated allele supports the view that ighv - is the functional target of the associated variants. the a to c nucleotide change at rs results in an amino acid alteration from cysteine to glycine within the cdr region of ighv - . thus, the modification might affect the affinity of immunoglobulins carrying ighv - for some antigens. however, we consider that the mechanism of increased susceptibility to kd associated with the c allele might not be reduced host defense against particular agents, because that would be inconsistent with the observation that the a allele, which is protective against kd and would be expected to have a higher neutralizing ability in this scenario, seemed to be nearly silenced. ighv - has been recognized as one of the functional ighv genes and is purported to be utilized at frequencies of around % [ ]. the average proportions of ighv - clonotypes in the healthy adults in our study ( . % for igm, . % for igd, . % for igg, and . % for iga) were consistent with that earlier report.
it is currently unknown at what stage (somatic recombination, rna transcription, or processing of the pre-mrna) the a allele is excluded, resulting in the significantly skewed allelic usage toward the c allele of rs . it is noteworthy that usage of ighv - , the nearest functional gene to ighv - , differed among genotypes at rs in a manner similar to that of ighv - (supplementary fig. ). one potential explanation is suggested by the predicted binding of the ctcf transcription repressor in the . kb region encompassing rs (supplementary fig. ). ctcf has been reported to interact with multiple sites in the igh region and to play essential roles in somatic recombination [ ] of the distal area of the ighv gene region. the association data of rs and the increased usage of igh transcripts with the risk-associated allele are indicative of some active role of the immunoglobulin molecules, as antibodies or as components of bcrs, in the development of kd. one possible role of such immunoglobulins might be activation of b cells. established kd susceptibility gene products such as b lymphoid tyrosine kinase (blk) [ ] [ ] [ ] and inositol , , trisphosphate -kinase c (itpkc) [ , ] are involved in bcr signaling. if ivig acts through competing with such immunoglobulins for agents or antigens relevant to kd pathogenesis, the requirement of a high dose ( - g/kg) of ivig to treat kd might be reasonable because only a fraction of the igg would contribute to the therapeutic effect, with ighv - expected to account for only up to several percent of the ivig preparations. b cells can be activated by nonspecific binding of microbial products such as superantigens (sags) to bcrs. intriguingly, b cell sags restricted to ighv segment of immunoglobulins are known [ ].
however, considering that the innate immune system has been thought to play a central role in the kd vasculitis and that kd can develop in patients with x-linked agammaglobulinemia, who lack or have small numbers of b cells [ ], the activity of b cells or immunoglobulins in kd might be relevant to initiation or enhancement of the innate immune activation but may be substitutable. recently, a significant association of an allele of the ighv - gene (ighv - * ) with susceptibility to rheumatic heart disease (rhd), which is a long-term complication of arf, was reported [ ]. ighv - is located only kb downstream of ighv - (supplementary fig. ). arf develops as a sequela of streptococcus pyogenes (s. pyogenes) infection and, similar to kd, has been recognized to affect genetically susceptible individuals [ ]. among reports of gwas for human diseases, a genome-wide significant association of variants in the ighv gene region has been identified only for rhd. although s. pyogenes is not recognized as the cause of kd, kd shares some characteristic symptoms, such as skin rash and strawberry tongue, with s. pyogenes infection, which suggests that the previously discussed role of b cells might be related to some mechanisms underlying the common symptoms. in , multiple series of patients with kd-like symptoms, described as pediatric inflammatory multisystem syndrome temporally associated with sars-cov infection (pims-ts) or multisystem inflammatory syndrome in children (mis-c), as well as an increase of severe kd patients after the sars-cov epidemic, were reported from the us and european countries [ , ]. in the latest studies, overrepresentation of ighv - and ighv - in neutralizing monoclonal antibodies against the receptor binding domain of the sars-cov spike was also reported [ , ]. if immunoglobulins with ighv - play a role in kd, it is possible that the kd-like symptoms seen in such patients are mediated by interaction between sars-cov and b cells expressing ighv - .
we also found some commonality between the cdr clonotypes that were increased in the igm heavy chains of the taiwanese and the japanese kd patients (supplementary table ). ighv - was used only in one commonly increased cdr clonotype (supplementary table ). however, as far as can be understood from the limited number of observations, ighv - seemed not to be important in the igm response in the acute pretreatment phase. future characterization of the endogenous immunoglobulin molecules that are increased in kd patients utilizing information of the light chains that can be obtained simultaneously in single-cell analyses [ ] will facilitate identification of the agent triggering kd as well as understanding the mechanism of action of ivig treatment. there are limitations in this study. first, in addition to the gap above where we could not examine the association of the variants, our strategy to focus only on nonsynonymous snvs on ighv genes left the possibility that rs is just a proxy of the genuinely responsible variant located outside ighv - . second, we did not analyze the timecourse change of the immunoglobulin heavy chain repertoire in other infectious diseases. so it is uncertain whether the upregulation of igg heavy chain transcripts with ighv - is a specific observation for kd or not. third, our results might not directly reflect changes of the immunoglobulins at the protein level because we lack information about the correlation between the proportions of particular igh clones in the transcripts from b cells and in the proteins expressed on the cell surface or circulating in the serum. in conclusion, a significant association of a nonsynonymous snv in the ighv - gene with kd was observed. 
further intensified study of the association in this region and repertoire analyses of immunoglobulins in different ethnicities and subpopulations of the patients with different demographic features would give insights into both the role of b cells in the kd pathogenesis and the causal agent of the disease. author contributions jkl, jyw, ytc, and yo supervised the study. jkl, jyw, and yo conceived the study. jyw, taj, tt, jkl, and yo designed the study. taj, ym, jkl, jyw, and yo wrote the manuscript. ah, hs, hh, th, and japan kawasaki disease genome consortium collected japanese samples. ymh, gyj, swy, jjy, kyl, and korean kawasaki disease genetics consortium collected korean samples. taiwan kawasaki disease genetics consortium and taiwan pediatric id alliance collected taiwanese samples. yml coordinated the multi-center collaboration in taiwan as the project manager and collected samples and clinical information. mk performed gwas assays for the japanese samples. at performed statistical analyses for the japanese gwas data. dy and tp performed statistical analyses for the korean gwas data. jjk conducted a follow-up study (stage ) for the korean samples. chc performed statistical analyses for the taiwanese gwas data and followed-up meta-analyses for the taiwanese data. ycl supervised the gwas and replication genotyping pipeline, performed the data analyses. lcc performed statistical analyses for the taiwanese gwas data and followed-up meta-analyses for the taiwanese data. cpc performed genotyping and direct sequencing of taiwanese samples. taj, dy, and chc conducted the whole-genome imputation. taj performed p value simulation and meta-analyses. ko, tt, and ki performed genotyping and direct sequencing of the japanese samples. ym and yo performed the ngs data analyses for the igh repertoires. conflict of interest the authors declare that they have no conflict of interest. 
ethical approval this study was approved by the institutional review board at all involved institutes.
informed consent written informed consent was obtained from all subjects.
publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
acute febrile mucocutaneous syndrome with lymphoid involvement with specific desquamation of the fingers and toes in children
epidemiological observations of kawasaki disease in japan
coronary aneurysms in infants and young children with acute febrile mucocutaneous lymph node syndrome
nationwide survey of kawasaki disease and acute rheumatic fever
high-dose intravenous gammaglobulin for kawasaki disease
the treatment of kawasaki syndrome with intravenous gamma globulin
a single intravenous infusion of gamma globulin as compared with four infusions in the treatment of acute kawasaki syndrome
genome-wide association study identifies fcgr a as a susceptibility locus for kawasaki disease
a genome-wide association study identifies three new risk loci for kawasaki disease
two new susceptibility loci for kawasaki disease identified through genome-wide association analysis
kawasaki disease: a brief history
a genome-wide association analysis reveals p and p . as susceptibility loci for kawasaki disease
an integrated map of genetic variation from , human genomes
a linear complexity phasing method for thousands of genomes
genotype imputation with thousands of genomes
fast and accurate genotype imputation in genome-wide association studies through pre-phasing
characterizing immunoglobulin repertoire from whole blood by a personal genome sequencer
flash: fast length adjustment of short reads to improve genome assemblies
allele-specific repression of lymphotoxin-alpha by activated b cell factor-
genome-wide association study identifies three new susceptibility loci for adult asthma in the japanese population
immunoglobulin profiling identifies unique signatures in patients with kawasaki disease during intravenous immunoglobulin treatment
neutrophilic involvement in the damage to coronary arteries in acute stage of kawasaki disease
activation of peripheral blood monocytes and macrophages in kawasaki disease: ultrastructural and immunocytochemical investigation
kawasaki disease: a matter of innate immunity
itpkc functional polymorphism associated with kawasaki disease susceptibility and formation of coronary artery aneurysms
common variants in casp confer susceptibility to kawasaki disease
a genome-wide association study identifies novel and functionally related susceptibility loci for kawasaki disease
identification of novel susceptibility loci for kawasaki disease in a han chinese population by a genome-wide association study
genome-wide linkage and association mapping identify susceptibility alleles in abcc for kawasaki disease
genetic variation in the slc a calcium signaling pathway is associated with susceptibility to kawasaki disease and coronary artery abnormalities
whole genome sequencing of an african american family highlights toll like receptor variants in kawasaki disease susceptibility
a genome-wide association analysis identifies nmnat and hcp as susceptibility loci for kawasaki disease
genetic and epigenetic fine mapping of causal autoimmune disease variants
immunoregulatory abnormalities in mucocutaneous lymph node syndrome
mononuclear cell subsets and coronary artery lesions in kawasaki disease
immunoglobulin m antibodies present in the acute phase of kawasaki syndrome lyse cultured vascular endothelial cells stimulated by gamma interferon
antineutrophil cytoplasm antibodies in kawasaki disease
oligoclonal iga response in the vascular wall in acute kawasaki disease
unique activation status of peripheral blood mononuclear cells at acute phase of kawasaki disease
accumulation of vh replacement products in igh genes derived from autoimmune diseases and anti-viral responses in human
ctcf-binding elements mediate control of v(d)j recombination
age-associated changes in binding of human b lymphocytes to a vh -restricted unconventional bacterial antigen
autoimmunity in x-linked agammaglobulinemia: kawasaki disease and review in the literature
pacific islands rheumatic heart disease genetics network. association between a common immunoglobulin heavy chain allele and rheumatic heart disease risk in oceania
cumulative incidence of rheumatic fever in an endemic region: a guide to the susceptibility of the population?
hyperinflammatory shock in children during covid- pandemic
clinical characteristics of children with a pediatric inflammatory multisystem syndrome temporally associated with sars-cov-
potent neutralizing antibodies against sars-cov- identified by high-throughput single-cell sequencing of convalescent patients' b cells
structures of human antibodies bound to sars-cov- spike reveal common epitopes and recurrent features of antibodies
high-throughput sequencing of the paired human immunoglobulin heavy and light chain repertoire
• taesung park • korean kawasaki disease genetics consortium, taiwan kawasaki disease genetics consortium
acknowledgements this study was supported by grants from the millennium project, from the japan kawasaki disease research center ( to ym, and and to yo), and from the japan agency for medical research and development (jp ek to th). this study was also supported by a grant from the ministry of health & welfare of the republic of korea (hi c to jkl). we are grateful to the kd patients and their family members as well as the medical staff taking care of the patients. we also thank ms. yoshie kikuchi for her technical assistance.
key: cord- -rgo sj i authors: bernard, sandra c; simpson, nandi; join-lambert, olivier; federici, christian; laran-chich, marie-pierre; maïssa, nawal; bouzinba-ségard, haniaa; morand, philippe c; chretien, fabrice; taouji, saïd; chevet, eric; janel, sébastien; lafont, frank; coureuil, mathieu; segura, audrey; niedergang, florence; marullo, stefano; couraud, pierre-olivier; nassif, xavier; bourdoulous, sandrine title: pathogenic neisseria meningitidis utilizes cd for vascular colonization date: - - journal: nat med doi: . /nm. sha: doc_id: cord_uid: rgo sj i
neisseria meningitidis is a cause of meningitis epidemics worldwide and of rapidly progressing fatal septic shock.
a crucial step in the pathogenesis of invasive meningococcal infections is the adhesion of bloodborne meningococci to both peripheral and brain endothelia, leading to major vascular dysfunction. initial adhesion of pathogenic strains to endothelial cells relies on meningococcal type iv pili, but the endothelial receptor for bacterial adhesion remains unknown. here, we report that the immunoglobulin superfamily member cd (also called extracellular matrix metalloproteinase inducer (emmprin) or basigin) is a critical host receptor for the meningococcal pilus components pile and pilv. interfering with this interaction potently inhibited the primary attachment of meningococci to human endothelial cells in vitro and prevented colonization of vessels in human brain tissue explants ex vivo and in humanized mice in vivo. these findings establish the molecular events by which meningococci target human endothelia, and they open new perspectives for treatment and prevention of meningococcus-induced vascular dysfunctions. n. meningitidis, also referred to as meningococcus, is an obligate human gram-negative bacterium that normally resides in the nasopharyngeal mucosa without affecting the host, a phenomenon known as carriage. pathology is initiated when meningococci gain access to the bloodstream, multiply in the blood and disseminate into various tissues, causing meningitis or purpura fulminans, the most severe form of meningococcal invasive disease. despite antimicrobial therapy, systemic meningococcal infections remain a major cause of mortality and severe neurological sequelae [ ] [ ] [ ]. the endothelium lining blood and lymphatic vessels is a key barrier separating circulating body fluids from host tissues and is a major target for several pathogenic bacteria.
a crucial step in the pathophysiology of invasive bloodborne meningococci is their adhesion to and proliferation in both peripheral and brain blood microvessels, a process referred to as vascular colonization . this intimate interaction of meningococci with endothelial cells leads to deregulated inflammatory and coagulation processes, endothelial dysfunction and, ultimately, the breach of endothelial barriers and bacterial dissemination into perivascular tissues . deciphering the mechanisms that govern this initial pathophysiological step of invasive meningococcal diseases represents an essential development toward the design of new therapeutic approaches to control bacterial infection and subsequent vascular and tissue damage. the capacity of pathogenic capsulated meningococci to interact tightly with human endothelial cells relies on the expression of type iv pili , . however, the host cell adhesion receptor that mediates the bacterial attachment to human vessels has remained unknown. type iv pili are long filamentous structures that extend from the bacterial cell surface. they mediate adhesion to endothelial cells and signaling events that eventually lead to bacterial translocation through endothelia [ ] [ ] [ ] [ ] . these structures are heteromultimeric assemblies of pilin subunits that form helical fibers . the major pilin subunit, pile, constitutes the essential fiber scaffold and is subject to antigenic variation. other less abundant ('minor') pilins, such as pilv, pilx and comp, structurally resemble pile and participate in various pilus-related functions such as adhesion, bacterial aggregation and dna uptake, respectively [ ] [ ] [ ] . both pile and pilv have been shown to be involved in adhesion to host cells , [ ] [ ] [ ] [ ] . furthermore, these two pilins were recently reported to activate the g protein-coupled β -adrenergic receptor (β -ar), which promotes the endothelial signaling events that are subsequent to bacterial adhesion , . 
however, β -ar-depleted endothelial cells, although unable to promote signaling events, still support pilus-mediated bacterial adhesion, thus indicating that the primary meningococcal attachment requires another, yet unidentified, host receptor for type iv pili. here, we report that cd , a member of the immunoglobulin (ig) superfamily, is a receptor for type iv pilus-mediated adhesion of pathogenic meningococci to human brain or peripheral endothelial cells through interaction with both major and minor pilins, pile and pilv, and we establish the central role of cd for vascular colonization by meningococci in vivo.

cd is a candidate endothelial receptor for meningococci

to identify candidate receptors for the initial attachment of piliated, capsulated meningococci, we used a gain-of-function strategy in bb cells, a dedifferentiated human brain endothelial cell line transformed by papilloma virus that is poorly permissive to meningococcal adhesion under basal conditions. treatment of bb cells with phorbol ester -o-tetradecanoylphorbol- -acetate (tpa), a potent transcriptional regulator, induced enhanced adhesion of the meningococcus strain nm c . (supplementary fig. a) . treatment with the transcription inhibitor actinomycin d blocked the tpa effect, suggesting the regulated expression of a putative receptor by tpa at the transcriptional level (supplementary fig. a) . as expected for type iv pilus-mediated adhesion, a nonpiliated meningococcal pile-null mutant (∆pile) failed to adhere to tpa-treated cells (supplementary fig. b) . in order to identify upregulated genes in permissive cells, we used a differential, quantitative, large-scale analysis of gene expression (online methods) that showed genes upregulated by tpa among , analyzed genes. six of them encoded membrane-associated proteins (supplementary table ). 
among those, four proteins had either limited tissue distribution (sialic acid-binding ig-type lectin (siglec ), siglec and delta-and notch-like epidermal growth factor-related receptor were located mainly in the brain) or limited surface exposure (phosphatidylethanolamine-binding protein). therefore, we instead focused on the type plasma membrane protein cd , a member of the ig superfamily and a well-established marker of brain capillaries , , and cd , a glycosylphosphatidylinositol-anchored glycoprotein known to inhibit membrane attack complex formation in the complement cascade and to induce signaling events . we confirmed that tpa treatment of bb cells upregulated expression of cd and cd ( supplementary fig. c) . cd expression was enriched at sites of meningococcal adhesion, whereas cd was not (supplementary fig. d) . we investigated the relevance of cd and cd as endothelial candidate receptors for meningococcal adhesion using two fully differentiated human endothelial cell lines isolated from either brain (hcmec/d ) or bone marrow (human bone marrow endothelial cell, hbmec) capillaries. as observed on tpa-treated bb cells, cd was not recruited to sites of meningococcal adhesion on hbmecs (supplementary fig. a) . in noninfected fully polarized hcmec/d or hbmec monolayers, cd was enriched at cell-cell junctions, where it is engaged in homophilic interactions , , and was also present at the luminal surface ( fig. a and supplementary fig. a) . upon infection with nm c . , cd accumulated at sites of meningococcal adhesion to hcmec/d cells (fig. a) and hbmecs ( supplementary fig. a) , as already observed for the signaling β -ar receptor , . cd accumulation occurred rapidly upon contact of diplococci with the cell surface ( fig. b) and was independent of the recruitment of the β -ar, as expected for a receptor mediating initial adhesion (fig. c,d and supplementary fig. ). cd thus seemed to be a potential endothelial receptor candidate for n. meningitidis. 
to establish the role of cd in meningococcal adhesion, we transfected endothelial cells with cd -specific sirnas. the reduction of cd surface expression in hcmec/d cells and hbmecs ( % and %, respectively) correlated with decreased meningococcal adhesion to both cerebral or peripheral endothelial cells ( % and %, respectively; fig. a,b) . these data were supported by the rescue of meningococcal adhesion in cd -complemented endothelial cells (fig. b) . conversely, increased expression of cd on hbmecs increased bacterial adherence (fig. c) . this was specific to cd , as depletion of cd did not affect bacterial adhesion to hbmecs ( supplementary fig. b) . furthermore, depletion of cd or intercellular adhesion molecule- (icam- ), two membrane-associated molecules recruited to sites of bacterial adhesion upon type iv pilus-mediated signaling events , did not affect meningococcal adhesion to either hcmec/d cells or hbmecs (supplementary fig. ) . with the aim of approaching the physiological conditions of shear stress that are encountered in human blood microvasculature, we infected endothelial cells with meningococci in controlled laminar flow chambers that enable the quantification of initial bacterial contacts with the endothelial cell surface . we observed that attenuation of cd expression by % on hcmec/d cells reduced the number of initial adhesion events by % (fig. d) , confirming, under dynamic conditions, the selective effect of cd depletion on meningococcal adhesion. pathogenic meningococci belong to different capsular groups and show great capacity for antigenic variation in the pile gene with no alteration in type iv pilus-mediated primary adhesion to host cells , . we therefore performed adhesion experiments using meningococcal strains belonging to different capsular serogroups (z , serogroup a; mc , serogroup b; fam , serogroup c; and rou, serogroup w ) (fig. e) , as well as derivatives of nm c . 
(serogroup c) displaying different variants of the major pilin subunit pile, designated pile sa and pile sb (fig. f) . reduction of cd expression affected the adhesion of these different strains and variants similarly, demonstrating that the capacity of cd to support type iv pilus-dependent adhesion was independent of capsular antigen and of antigenic variation of pile. competition experiments with cd -fc, a soluble recombinant chimeric molecule containing the extracellular domain of cd fused to the fc domain of human igg , showed that cd -fc caused a % reduction in bacterial adhesion to hbmec cells, whereas soluble icam- -fc had no effect (fig. g) . to confirm that the extracellular portion of cd is a binding site for meningococci, we assayed two antibodies specific for cd , mem-m / and mem-m / , which respectively bind to the extracellular n-terminal and the c-terminal ig domains of cd , as interaction inhibitors (supplementary fig. a ). both antibodies labeled noninfected hcmec/d cells similarly (supplementary fig. b) , but only mem-m / efficiently labeled cd molecules recruited to bacterial adhesion sites, suggesting that the mem-m / antibody and meningococci compete for the same binding site in the c-terminal ig domain of cd ( supplementary fig. c) . accordingly, preincubation of hcmec/d cells or hbmecs with the mem-m / antibody before infection inhibited the initial attachment of nm c . under shear stress by % and %, respectively, whereas the mem-m / antibody reduced the initial attachment by only % and % (fig. h) . overall, our data indicate that cd is a critical receptor for the primary attachment of piliated meningococci to both peripheral and brain human endothelial cells. we next investigated whether the interaction of meningococcus with cd was direct by performing meningococcal adhesion experiments using recombinant cd -fc immobilized on glass coverslips. 
meningococci adhered in a type iv pilus-dependent manner to immobilized cd -fc but not to icam- -fc or noncoated control wells (fig. a) . to identify the bacterial pilus components involved in interaction with cd , we used escherichia coli to produce the minor pilins pilv, pilx and comp, as well as two variants of the major pilin pile (pile sa and pile sb ), as fusion proteins with maltose-binding protein (mbp). both pile variants and pilv fusion proteins reduced the adhesion of piliated, capsulated meningococci by - % in competition assays, whereas pilx or comp had no effect (fig. b) . we further assessed the direct interaction of cd with these pilins using several complementary approaches. first, purified pile sb or pilv recombinant pilins immobilized on inactivated staphylococci could selectively pull down soluble cd -fc (fig. c) . second, in a protein-protein interaction assay based on oxygen singlet transfer (alphascreen), we detected a dose-dependent association with cd -fc for both pile sb and pilv, but not for the other minor pilins (fig. d,e) . this interaction was independent of antigenic variation in the pile subunit (supplementary fig. a) and was specific to cd (supplementary fig. b ). as observed above for meningococcal adhesion to human endothelial cells, these interactions were inhibited by the mem-m / antibody but not by the control anti-icam- antibody ( c ) (fig. f and supplementary fig. c) . third, using surface plasmon resonance, we found a specific interaction of pile sb - or pilv-coated staphylococci with immobilized cd (supplementary fig. ) , but we observed no interaction when we used monomeric pilins (data not shown). these results indicate that, despite relatively low affinities of pile sb and pilv monomers for cd , their clustering on staphylococci can provide sufficient avidity to cd , which is consistent with the multimeric organization of pilins within pilus fibers. 
we then quantified the interaction of pile variants or pilv with cd by atomic force microscopy (afm) (fig. g) . we observed a load-dependent correlation for the interaction forces of pile variants or pilv with cd , further confirming the specificity of this interaction (data not shown). measured interaction forces with cd ( ± pn, ± pn and ± pn, for pile sa , pile sb and pilv, respectively) were only about twofold lower than that between cd and its established ligand cyclophilin ( ± pn) . these data, consistent with a previously described role for pile and pilv in neisseria species pilus-mediated adhesion to host cells , - , confirm the direct participation of pile and pilv in the selective interaction of meningococci with cd . to establish the role of pile and pilv in meningococcal adhesion to endothelial cells, we used defective strains of nm c . . neither the nonpiliated ∆pile nor the piliated pilv-null (∆pilv) mutants were able to adhere in vitro to hcmec/d or hbmec cells (fig. a) . as a control, ∆pile and ∆pilv complemented strains recovered bacterial adhesion to both endothelial cell lines with an efficiency correlated to the expression level of the complemented proteins (supplementary fig. a-c) . to further investigate the specific roles of pile and pilv in endothelial colonization by meningococci in vivo and because this human specific pathogen does not colonize mouse vessels, we used a humanized mouse model of severe combined immunodeficiency (scid) mice grafted with human skin, in which functional human blood vessels are maintained within the graft , . although limited to dermal vessels, this model constitutes a unique tool to study meningococcal interaction with human endothelial cells under realistic conditions in a live host. as previously shown , h after bacterial challenge, the nm c . 
wild-type strain massively colonized the human dermal vasculature in a type iv pilus-dependent manner, as we observed no colonization with the nonpiliated ∆pile derivative (fig. b,c) . consistent with our in vitro observation, the meningococcal ∆pilv derivative did not associate with vessels in the grafted human skin (fig. b,c) , whereas both ∆pile and ∆pilv complemented strains colonized human vessels with an efficiency correlating to the expression level of the complemented proteins (supplementary fig. d-f) . these results establish pile and pilv as essential bacterial components for in vivo type iv pilus-mediated vascular colonization. we further analyzed the selective interaction between meningococci and cd using an in situ meningococcal infection model of fresh human frontal brain tissues obtained from deceased normal subjects, as, in this setting, histological and anatomical characteristics of the brain vessels are conserved. meningococci incubated with tissue sections established specific tight association with brain vessels, predominantly in virchow-robin spaces and in cortical regions ( fig. a and supplementary fig. ) , reminiscent of neuropathological findings in patients with meningococcal meningitis , . adhesion relied on the expression of both pile and pilv, as ∆pile or ∆pilv mutants adhered poorly (fig. b-d and supplementary fig. ). upon infection, wild-type meningococci developed microcolonies immediately adjacent to cd -positive endothelial cells (fig. a) . we also found meningococci in the vicinity of cd -positive leptomeningeal cells and of cortical brain vessels ( supplementary fig. a,b) , but they were not associated with glial or neuronal cells that do not express cd (supplementary fig. c) , indicating a close correlation between meningococcal adhesion to fresh human brain tissue and cd expression. 
consistent with in vitro cellular models, pretreatment of brain sections with the mem-m / antibody substantially reduced adhesion of meningococci to human brain vessels, whereas the mem-m / antibody or the control c anti-icam- antibody had no effect (fig. b,c) . altogether, these data confirm that type iv pilus-mediated adhesion of meningococcus to cd is necessary for endothelial cell colonization by this bacterial pathogen. this study demonstrates that the specific interaction between the meningococcal ligands pile and pilv and the cellular host receptor cd is essential for meningococcal adhesion to human endothelial cells and colonization of human blood vessels, a prerequisite to the major vascular alterations that are the hallmark of invasive meningococcal infections. these results, which help unravel the molecular basis of a key pathophysiological step of meningococcal diseases, are consistent with previous reports suggesting that, in addition to its structural role in the pilus scaffold, the pile subunit plays a direct role in bacterial adhesion to host cells . pilv has also previously been shown to be required for the adhesion of the closely related neisseria gonorrhoeae . interaction with cd occurs independently of antigenic variation in the pile subunit and of possible post-translational modifications. pile and pilv have relatively low affinities for cd , whereas multimeric pilins efficiently bind to cd , indicating a system where avidity is crucial, consistent with the natural organization of pilins in multimeric complexes within pilus fibers. because endothelial cells, and more particularly brain microvessels, express high levels of cd , it is likely that cd expression level may also contribute to avidity, compensating for the weak affinity of the cd -meningococcal pilins interaction. how these two pilus components act in concert to create the adhesive phenotype remains to be solved. 
identification of the precise pile and pilv epitopes required for interaction with cd will allow the development of targeted antibodies that could prevent type iv pilus-cd interaction, opening the path to new vaccine strategies against meningococcal infection. cd is expressed on different cell types, such as erythrocytes and epithelial cells. pilus-dependent interactions between meningococci and erythrocytes, causing hemagglutination, have been documented in vitro , , and meningococci have been described crossing polarized epithelia, indicating that meningococcus might interact with cd in a variety of cell types. notably, several viruses, including hiv- , severe acute respiratory syndrome coronavirus and measles virus , as well as the bacterial pathogen listeria monocytogenes , also target cd to adhere to and/or invade epithelial cells, indicating that cd constitutes an evolutionarily conserved efficient target for pathogens to infect tissues and spread within organisms. cd is also known as emmprin, a master regulator of the production of matrix metalloproteinase . it is therefore likely that, in addition to its role in pathogen adhesion and invasion processes, cd also contributes to the disruption of the epithelial and endothelial barrier integrity by promoting local production of matrix metalloproteinases. whereas cd expression is required to support bacterial adhesion to endothelial cells of brain and peripheral origin, β -ar expression is necessary for meningococcus-triggered signaling . β -ar is dispensable for adhesion , , but cd -mediated meningococcal adhesion is a prerequisite for activating the β -ar-β-arrestin signaling pathway, thus suggesting that cd and β -ar might cooperate to strengthen meningococcal adhesion and promote subsequent activation of signaling events. 
notably, a similar functional association of these two receptors has been reported in the context of erythrocyte infection by the human parasite plasmodium falciparum, the agent of malaria, although the mechanism of this association is still unknown , . the implication of these two receptors in different tissues and in the context of different infectious diseases might reflect a preexisting functional connection between cd and β -ar that is hijacked by different pathogens to disseminate and induce multiple organ failure.

taken together, our data highlight a key role for cd in the cascade of pathophysiological events following meningococcal entry into the bloodstream. agents or vaccines preventing type iv pilus interaction with cd might therefore be highly effective for treatment or prevention of meningococcal infection and associated vascular dysfunction.

methods and any associated references are available in the online version of the paper.

anti-hcd mab (clones mem-m / and mem-m / ) were purchased from abd serotec, anti-hcd mab (clone mem / ) from tebu-bio, anti-hcd mab (clone j ) from immunotech, anti-hicam- mab (clone c ) from r&d systems, anti-human collagen iv (clone phm ) from abcam, and anti-gfap (clone g-a- ) and anti-vimentin (clone v ) from sigma-aldrich. polyclonal antiserum raised against ezrin was obtained from p. mangeat. mab raised against pile ( d ) and polyclonal antiserum raised against meningococcal c . strain were described previously . secondary antibodies used for immunofluorescence labeling, chromogenic immunohistochemistry and immunoblotting were from jackson immunoresearch laboratories. soluble chimeric cd -fc-his, icam- -fc and alcam- -fc-his molecules were purchased from r&d systems. dapi, rhodamine-phalloidin, tpa, actinomycin d, isoproterenol, dab peroxidase substrate and nbt/bcip alkaline phosphatase substrate were purchased from sigma-aldrich.

bacterial strains. nm c . 
(formerly clone ) is a piliated capsulated opa-negative, opc-negative variant of the serogroup c meningococcal clinical isolate (ref. ) cultured as described previously . nm c . expressing pilin variants pile sa , pile sb or gfp were previously described , . the pilv mutant was engineered by introduction of an aph ′ resistance cassette into the pilv gene. to complement the pilv mutant, the pilv allele was amplified using primers pilv forward ′-ccttaattaaggagtaattttatgatgagtaataaaatggaaca- ′ and pilv reverse ′-ccttaattaactattttttacgattagagaaagc- ′, containing overhangs with restriction sites for paci. this pcr fragment was restricted with paci and cloned into paci-restricted pgcc vector as described before . this placed pilv under the transcriptional control of an iptg-inducible promoter within a dna fragment corresponding to an intragenic region of the meningococcal chromosome. this pilv construct was then introduced into the chromosome of the ∆pilv mutant by homologous recombination. a similar scheme was used for the construction of a complemented pile isogenic meningococcus strain. the orf of the pile gene was inserted, in e. coli, at the paci restriction site of the pgcc plasmid. the primers used to amplify pile were pile forward ′-cttaattaaggagtaatttatgaacacccttcaaaaaggttttac- ′ and pile reverse ′-cttaattaattagctggcagatgaatcatcgcg- ′. the inducible pile construct was inserted ectopically into the chromosome of the nm c . strain through transformation, using erythromycin for selection. the wild-type pile locus was subsequently inactivated through transformation of chromosomal dna of a previously described apha -insertionally inactivated endogenous ∆pile derivative and selected with kanamycin . virtually no pile could be detected in the absence of induction in the resulting ∆pile/pile strain, whereas induction yielded up to approximately % of the wild-type expression level of pile, as pile is one of the most strongly constitutively expressed meningococcal proteins. 
phenotypically, despite the low level of pile complementation induced upon iptg treatment, the functionality of type iv pili expressed upon pile complementation was ascertained through restoration of competence (competence frequency < × − , × − and × − per µg dna for ∆pile, ∆pile/pile and wild-type strains, respectively).

cell lines. hcmec/d is a fully differentiated human brain endothelial cell line derived from brain capillaries, produced in the laboratory, which recapitulates the major phenotypic features of the blood-brain barrier . hbmec is a human endothelial cell line isolated from bone marrow capillaries, provided by b. weksler, that maintains in vitro most of the characteristics of primary endothelial cells . bb is a human brain endothelial cell line transformed by papilloma virus, provided by j.a. nelson, which has lost most of the phenotypic features of brain endothelial cells .

cell culture, transfection and infection. to induce bacterial adhesion, bb cells were treated with the phorbol ester -o-tetradecanoylphorbol- -acetate (tpa) ng ml − , for min, washed and bacterial adhesion measured h after treatment. when indicated, h after tpa treatment, the transcriptional inhibitor actinomycin d was added ( µg ml − , for h), cells were washed and adhesion was measured h later. the hcmec/d and hbmec cell lines were cultured, transfected and infected as previously described , . plasmid encoding human cd was provided by m. bukrinsky, and plasmid encoding β adrenergic receptor fused to yfp (β -ar-yfp) was described previously . to silence the expression of cd , cd , cd , icam- and β -ar, pools of four sirna duplexes (on-target plus smartpool sirna from dharmacon) were used. the sicontrol sirna (dharmacon) was used as control. h after transfection, efficiency of knockdown was assessed by facs analysis or by reverse transcription-pcr analysis for β -ar depletion as previously described . 
when indicated, hbmec cells were transfected with sirna targeting the ′ untranslated region of cd , ′-gcugucugguugcgccauuuu- ′ or control sirna ′-auguauuggccuguauag- ′ (eurogentech), and h later, cells were again either mock transfected or transfected with cd cdna coding region. h later, surface expression of cd was quantified by facs analysis and cells were infected concurrently with nm c . . serial analysis of gene expression adapted for downsized extracts. µg of rna extracted from × untreated or tpa-treated bb cells h after treatment with tpa were used as a substrate for the serial analysis of gene expression adapted for downsized extracts (sade) screen carried out as described . briefly, total rna was extracted from cells, polyadenylated rna was selected on oligo-dt columns and tagged cdna was synthesized from the poly-a rnas. concatemers of dna tags were sequenced, and the number of sequenced tags differentially represented in the two libraries was determined and analyzed using monte carlo statistical analysis. adhesion assays under static and flow conditions. meningococcal adhesion on hcmec/d , hbmec and bb cells was assayed under both static and shear stress conditions, as previously described , , . when indicated, meningococci were preincubated with µg ml − soluble recombinant human cd -fc and icam- -fc chimera before interaction with hbmec cells with additional µg ml − soluble proteins. meningococcal adhesion was quantified following a -min infection in static conditions. to address the inhibitory effect of antibodies, hbmec or hcmec/d cells were grown on ibidi chambers, pretreated for h with µg ml − antibodies targeting icam- ( c ) or cd (mem-m / and mem-m / ) and submitted to laminar flow ( . dynes/cm ) under an inverted microscope. gfp-expressing bacteria were introduced in the chamber under flow for min, and the number of adherent bacteria was determined, as previously described . 
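The methods above state only that tags differentially represented between the untreated and tpa-treated sade libraries were analyzed using monte carlo statistical analysis; the procedure itself is not described here. As a rough illustration of what such a test can look like, the following Python sketch assumes a simple null model in which the pooled occurrences of one tag are reassigned at random to the two libraries in proportion to library size. The function name, its parameters and the example counts are hypothetical and are not taken from the paper.

```python
import random


def monte_carlo_tag_pvalue(count_a, count_b, total_a, total_b,
                           n_sim=10000, seed=0):
    """Estimate a two-sided p-value for one tag seen in two tag libraries.

    count_a, count_b: occurrences of the tag in library A and library B.
    total_a, total_b: total number of sequenced tags in each library.
    Assumed null model (not from the paper): each pooled occurrence of
    the tag falls into library A with probability total_a / (total_a + total_b).
    """
    rng = random.Random(seed)
    pooled = count_a + count_b
    p_a = total_a / (total_a + total_b)
    # Observed difference in relative tag abundance between the libraries.
    observed = abs(count_a / total_a - count_b / total_b)

    extreme = 0
    for _ in range(n_sim):
        # Randomly reassign every pooled occurrence to library A or B.
        sim_a = sum(1 for _ in range(pooled) if rng.random() < p_a)
        sim_b = pooled - sim_a
        if abs(sim_a / total_a - sim_b / total_b) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_sim + 1)
```

Under these assumptions, a tag counted equally often in two libraries of equal size gives an estimate of 1, whereas a tag that is strongly enriched in one library gives an estimate close to 1 / (n_sim + 1). A real screen would also need a correction for testing many tags at once, which this sketch leaves out.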
for adhesion assays on immobilized proteins, cd -fc and icam- -fc chimera were immobilized on glass slides using a modification of the technique described . briefly, slides were coated with . % poly-l-lysine, washed with pbs, crosslinked with glutaraldehyde ( . %), washed, incubated with anti-fc in assay buffer (pbs with % bsa) for h, washed in assay buffer and incubated overnight at °c with µg ml − chimeric protein. slides were washed before infection with meningococcal suspension of od . . after incubation with bacteria, slides were washed three times and fixed. bacteria were labeled and visualized with a leica dmi microscope using a × oil-immersion objective. the number of adherent bacteria per field was quantified for fields using imagej software.

expression, purification and immobilization of maltose-binding protein-pilin recombinant proteins. purified pilins were produced as fusion proteins with the maltose-binding protein (mbp), as described before . fragments of pile ( sa and sb ), pilv, pilx and comp lacking the region coding for amino acid residues to of the full-length proteins fused to the mbp were produced in e. coli, purified on amylose resin (new england biolabs) and immobilized on staphylococcus aureus (atcc ) expressing specific receptors for the fc domain of igg as described before . coated bacteria were incubated with µg of cd -fc chimera for min on ice. after washes, bacteria were lysed with laemmli buffer, and the quantity of coprecipitated cd -fc protein was assessed by immunoblot analysis. bound cd -fc was quantified using imagej software. when indicated, hcmec/d cells were pretreated with µg ml − of mbp-pilins for min before infection with meningococcal suspension of od . for min. 
after incubation with bacteria, cells were washed three times and the number of adherent bacteria determined as described above.

the authors declare no competing financial interests.

alphascreen. the alphascreen technology was used to assess the interaction between mbp-pilin recombinant fusion proteins and cd -fc-his or alcam- -fc-his as a control. the binding reaction was performed using white -well optiplates (perkinelmer, waltham, ma, usa) in µl (total reaction volume). the alphascreen reagents (anti-mbp-coated acceptor beads and nickel chelate-coated donor beads) were obtained from perkinelmer. cd -fc-his (or alcam- -fc-his) and mbp-pilins were prepared in mm tris, ph . , mm nacl. donor beads ( µg/ml) were incubated with cd -his ( , or nm) for min at room temperature. in parallel, acceptor beads ( µg ml − ) were incubated with mbp-pilins ( , or nm) for h at room temperature in the alphascreen reaction buffer ( mm tris, ph . at °c, mm nacl, . % bsa and . % tween ). next, µl of each of the interacting partners were added to the plate and allowed to incubate for either h or overnight in the dark and at room temperature. in antibody competition assays, µl of mem-m / or c antibodies (at variable concentrations) were added to cd -fc-his-donor beads for min at room temperature before incubation with the mbp-pilin-acceptor beads. light signal was detected with the envision multilabel plate reader (perkinelmer).

surface plasmon resonance. surface plasmon resonance (spr) experiments were carried out on a biacore spr biosensor (biacore international ab). 
cd -fc or control icam- -fc chimera ( µg ml − ) were immobilized onto cm chip surfaces at µl min − using a standard amine coupling protocol with edc ( -ethyl- ( -dimethylaminopropyl) carbodiimide)/nhs (n-hydroxysuccinimide) following the manufacturer's instructions. the density was controlled at an increased response level of , and , response units (ru) for icam- -fc and cd -fc, respectively. soluble monomeric pilins diluted at various concentrations ( . to µg ml − ) in mm hepes ph . , mm edta, . % (v/v) p surfactant or µl of a suspension of pilin-coated staphylococci in pbs ph . were used as analytes. analytes were applied at a flow rate of µl min − for min followed by a dissociation time of min. regeneration of the sensorchip surface was performed with . m nacl in pbs at µl min − for min followed by two washes at µl min − with pbs. output sensorgrams were analyzed using biacore bianalysis software.
atomic force microscopy. force spectroscopy was performed between pilins and a cd -functionalized surface. cd -fc-his or control alcam- -fc-his chimera were deposited on a glass surface using a nitrilotriacetic acid that ensured covalent and oriented binding with the protein . mbp-pilins and control proteins (mbp, cyclophilin) were coated on afm tips using an established protocol known to keep protein functionality intact, as previously published , . all the experiments were performed at room temperature on a commercial afm (multimode , bruker, santa barbara, ca) driven under nanoscope . . mbp-pilins were used at µg ml − . bruker dnp cantilevers with a nominal spring constant of . n/m and a nominal tip radius of nm were used. the true cantilever spring constant was determined using bruker's thermal tune calibration tool. experiments were done in force-volume mode operating in mm nacl, mm hepes buffer (ph . ). pulling speed was set to µm s − on µm areas, nm z ramp, with a relative trigger of pn on the cantilever deflection and a contact time of ms.
, force curves per condition were recorded, using at least different tips. most nonspecific tip-surface adhesion events were automatically discarded from analysis. force-distance curves were analyzed as previously described , and binding-unbinding events on the retraction curve were detected according to their shape and size characteristics by a fuzzy logic algorithm fully described in (ref. ). briefly, we considered unbinding events based on the vertical segment size, the v-shape and the angle with the baseline. statistical analysis was performed using r . where provided, data are the peak position mean values ± s.d.
confocal immunofluorescence microscopy. hcmec/d , bb or hbmec cells were grown to confluence on permanox coverslips (thermo fisher scientific) or on transwell filters (costar). after the indicated treatments and/or infection, cells were fixed and labeled as previously described , . anti-ezrin antibody ( / , ), anti-cd ( / ), anti-cd (mem-m / , / ), anti-cd ( / ) and rabbit polyclonal serum anti-nm c . strain ( / , ) were used as primary antibodies, and dapi ( . mg ml − ) was added to alexa fluor-conjugated secondary antibodies ( / ). image acquisition and analysis were performed with a dmi microscope (leica, ×). series of optical sections were obtained with a confocal spinning disk microscope (leica, ×). three-dimensional ( d) reconstructions were obtained using imaris software and quantification histograms with imagej software (nih). quantitative analysis of protein recruitment under bacterial colonies was determined as the proportion of colonies positive for the protein of interest indicated. at least colonies were observed per coverslip. each experiment was repeated at least times in duplicate or triplicate.
infection of severe combined immunodeficiency mice grafted with human skin. six-week-old cb /icr-prkdc scid /icricocrl scid female mice were purchased from charles river laboratories (saint germain sur l'arbresle, france).
human skin tissues were obtained from adult patients undergoing plastic surgery in the service de chirurchie plastique et reconstructive of the groupe hospitalier saint-joseph (paris, france). in accordance with the french legislation, patients were informed and did not refuse to participate in the study. experimental procedures were performed as previously described , in accordance with the guidelines of the institut national de la santé et de la recherche médicale. the use of human tissue and the experimental protocol was approved by the animal experimentation ethics committee of the université paris descartes (consent form ceea .o.j.l. . ). briefly, mice were prepared for transplantation by shaving the hair of the back and abdominal areas after they received an intraperitoneal injection of ketamine mg kg − and xylazine mg kg − . a skin flap was created, and a full-thickness human skin graft was placed onto the wound bed. the transplants were held in place with - nonabsorbable monofilament suture materials, and the flap was then sutured above the transplant. grafted mice were used for meningococcal infection experiments month after human skin transplantation. meningococcus strains were grown overnight at °c on gcb agar plates prepared without iron and supplemented with µm deferoxamine (desferal, novartis) with appropriate antibiotic. bacterial colonies were harvested and cultured in rpmi with % bovine serum albumin medium and . µm deferoxamine with gentle agitation to reach the exponential phase of growth. bacteria were then resuspended in physiological saline containing mg ml − of human holotransferrin to promote bacterial growth in vivo ( ht, r&d systems). mice were infected intraperitoneally with . ml of this bacterial suspension to minimize the inoculum required to obtain a reproducible high bacteremia in grafted mice. 
to assess bacteremia in infected animals, µl of blood was sampled using a heparinized hematocrit glass tube h after injection by puncture in the lateral tail vein, or h after injection by intracardiac puncture on animals killed by intraperitoneal injection of a lethal dose of ketamine and xylazine. bacterial counts were performed by plating serial dilutions of blood on gcb agar plates. results were expressed in colony-forming units (cfu) per ml of blood. biopsies of human skin grafts were carefully collected using a dermatological punch biopsy system and fixed overnight at °c in paraformaldehyde %, lysine mm and sodium periodate mm in phosphate buffer. after they were washed in phosphate buffer, the specimens were dehydrated in successive sucrose gradient solutions ( %, %, % prepared in phosphate buffer . m) for h at °c each, embedded in oct and then frozen at − °c. -µm thick sections of the dermis were immobilized on superfrost plus microscope slides and analyzed via immunofluorescence. sections were incubated with the following primary antibodies for h in pbs/bsa . %: monoclonal anti-human collagen iv ( / ) and a rabbit polyclonal serum against the nm c . strain ( : , ). dapi ( . mg ml − ) was added to alexa fluor-conjugated secondary antibodies ( / ) for h. after additional washing, coverslips were mounted in glycergel (dako). sections per graft were analyzed using epifluorescence (zeiss axiovert, ×). quantification analysis of the images (n > per graft) was performed using imagej software (nih). results are presented as a vascular colonization index corresponding to the area occupied by the fluorescently labeled bacteria in relation to the human vessel area delineated by the anticollagen iv staining × , from independent grafts per condition.
infection of human brain tissues.
fresh human brain sections were obtained from frontal lobe specimens of macroscopically and histologically normal brain (confirmed by a neuropathologist) of individuals referred to the department of forensic medicine for unexplained out-of-hospital sudden death (consent forms ml , pfs - , clinicaltrials.gov nct approved by the institutional review board of the poincaré hospital, versailles-saint quentin university and the agence de la biomédecine). after freezing of the brain tissue with isopentane cooled in liquid nitrogen, the -µm-thick sections containing leptomeninges, cortical ribbon and the underlying white matter were immobilized on superfrost plus microscope slides and stored at − °c. defrosted sections were rehydrated in pbs for min and incubated for h with medium containing . % bsa before infection with suspensions of bacteria ( × in µl of medium containing . % bsa) for h at °c. sections were then gently washed horizontally times and fixed in paf % for min at room temperature. when indicated, sections were treated for h with µg ml − of antibodies targeting cd or icam- and then washed times before infection by bacteria. adherent meningococci were detected by chromogenic immunohistochemistry and immunofluorescence analysis. to perform chromogenic immunohistochemistry, after tissue rehydration, sections were incubated overnight with the indicated primary antibodies (monoclonal anti-human cd (mem-m / , / ) and a rabbit polyclonal serum anti-nm c . strain ( / , )) and revealed using horseradish peroxidase-coupled secondary donkey anti-mouse antibody and an alkaline phosphatase donkey anti-rabbit ( / ). dab (brown color) and nbt/bcip (blue color) were used as chromogens. for immunofluorescence analysis, brain sections were incubated with the following primary antibodies for h in pbs/bsa . %: monoclonal anti-human cd (mem-m / , / ), anti-gfap ( / ), anti-human vimentin ( / ) and a rabbit polyclonal serum anti-nm c . strain ( / , ). 
alexa fluor-conjugated phalloidin ( / ) and dapi ( . mg ml − ) were added to alexa fluor-conjugated secondary antibodies ( / ) for h. after additional washing, coverslips were mounted in glycergel (dako). entire samples were scanned using nanozoomer . (hamamatsu) and were further analyzed using optical microscopy, epifluorescence (zeiss axiovert, ×) and confocal (spinning disk leica, ×) microscopy. quantification analysis of the fluorescently labeled bacteria that adhered on a mm surface area was performed using imagej software. results are presented as a mean of fluorescence per µm , from two independent experiments. d reconstructions were performed on deconvoluted confocal stacks using imaris software.
statistical analyses. all values are expressed as the mean ± s.e.m. statistical differences were determined to be significant at p < . . the specific tests used are described in the figure legends. all analyses were performed using graphpad prism software. no statistical method was used to predetermine sample size. the investigators were not blinded to allocation during experiments and outcome assessment. the experiments were not randomized.
key: cord- -nu ib h authors: dong, dong; lei, ming; hua, panyu; pan, yi-hsuan; mu, shuo; zheng, guantao; pang, erli; lin, kui; zhang, shuyi title: the genomes of two bat species with long constant frequency echolocation calls date: - - journal: mol biol evol doi: . /molbev/msw sha: doc_id: cord_uid: nu ib h
bats can perceive the world by using a wide range of sensory systems, and some of the systems have become highly specialized, such as auditory sensory perception. among bat species, the old world leaf-nosed bats and horseshoe bats (rhinolophoid bats) possess the most sophisticated echolocation systems. here, we reported the whole-genome sequencing and de novo assemblies of two rhinolophoid bats: the great leaf-nosed bat (hipposideros armiger) and the chinese rufous horseshoe bat (rhinolophus sinicus).
comparative genomic analyses revealed the adaptation of auditory sensory perception in the rhinolophoid bat lineages, probably resulting from the extreme selectivity used in the auditory processing by these bats. pseudogenization of some vision-related genes in rhinolophoid bats was observed, suggesting that these genes have undergone relaxed natural selection. an extensive contraction of olfactory receptor gene repertoires was observed in the lineage leading to the common ancestor of bats. further extensive gene contractions can be observed in the branch leading to the rhinolophoid bats. such concordance suggested that molecular changes at one sensory gene might have direct consequences for genes controlling other sensory modalities. to characterize the population genetic structure and patterns of evolution, we re-sequenced the genomes of great leaf-nosed bats from four different geographical locations in china. the result showed similar sequence diversity values and little differentiation among populations. moreover, evidence of genetic adaptations to high altitudes in the great leaf-nosed bats was observed. taken together, our work provided a useful resource for future research on the evolution of bats.
bats (order chiroptera) are one of the largest monophyletic clades in mammals, and constitute nearly % of living mammalian species. they can perceive their surroundings using a wide range of sensory systems, and have long been regarded as the most unusual and specialized of all mammals. most bats are sophisticated echolocators and rely on their echolocation systems for navigation. however, old world fruit bats have no laryngeal echolocating ability, and navigate largely by vision. based on overwhelming molecular genetic evidence, it has been proposed that echolocating bats are paraphyletic (teeling, et al. ).
old world fruit bats and some laryngeal echolocators (including the rhinolophidae, hipposideridae, craseonycteridae, megadermatidae, and rhinopomatidae families) form a natural group, the suborder yinpterochiroptera, and the remaining laryngeal echolocating bats are grouped into another suborder, yangochiroptera. two distinct navigation approaches can be employed by echolocating bats: low duty cycle (ldc) echolocation and high duty cycle (hdc) echolocation (teeling ). ldc echolocators can separate pulse and echo in time to avoid forward masking, whereas some species of hdc echolocators separate pulse and echo in frequency. it has been documented that rhinolophoid bats might possess the most sophisticated echolocation systems (jones and teeling ). recently, results from some hearing-related genes suggested sequence convergence in laryngeal echolocating bats (li, et al. ; davies, et al. ). we attempted to investigate whether similar patterns are detectable in other hearing-related genes. furthermore, a sensory trade-off between investment in vision and echolocation has been identified (dechmann and safi ). loss of function in the short-wave sensitive opsin (sws gene) occurred in rhinolophoid bats, which use hdc echolocation and can emit long constant frequency calls (zhao, et al. ). although several bat genomes have been sequenced (zhang, et al. ), the evolutionary mechanisms of the rhinolophoid bats remain unclear. comparative genomics provides an opportunity to investigate whether similar patterns are detectable in other sensory genes. the great leaf-nosed bat (hipposideros armiger) and the chinese rufous horseshoe bat (rhinolophus sinicus) are two important species of rhinolophoid bats. first, these are model organisms with remarkable hdc echolocation ability and can emit continuous ultrasonic calls of long constant frequency with remarkable acoustic features (doppler-shift compensation) (schnitzler, et al. ).
we can comprehensively explore how rhinolophoid bats evolved a specialized form of echolocation. second, they are important reservoir hosts of emerging viruses, and the chinese rufous horseshoe bat has been suggested to carry the direct ancestor of severe acute respiratory syndrome (sars) coronavirus (ge, et al. ). in this work, we presented the genomes of the great leaf-nosed bat and the chinese rufous horseshoe bat using the next generation sequencing platform (illumina hiseq ). the result revealed the adaptation of auditory sensory perception in hdc echolocators, and showed an extensive contraction of olfactory receptor gene repertoires as well as pseudogenization of some vision-related genes. furthermore, we performed genome re-sequencing to analyze the population genetic structure of the great leaf-nosed bats. the genomic data provide genetic evidence of adaptive evolution in rhinolophoid bats.
a female great leaf-nosed bat (hipposideros armiger) and a female chinese rufous horseshoe bat (rhinolophus sinicus) were captured from a cave (n ° . ′ genomic dna was extracted from bat muscle using the qiagen dneasy blood and tissue kit. six paired-end libraries with insert sizes of bp, bp, bp, k bp, k bp and k bp were constructed and sequenced for each of the great leaf-nosed bat and the chinese rufous horseshoe bat. the libraries were sequenced using the illumina hiseq platform, which has a read length of bp. low-quality sequencing reads were filtered out and potential sequencing errors were removed. the following filtering criteria were applied: 1) filter reads with > % unidentified nucleotides; 2) filter reads with > nucleotides aligned to the adapter sequence, allowing < mismatches; 3) remove putative duplicates generated by pcr amplification in the library construction process. finally, we generated . gb and . gb of sequences for the great leaf-nosed bat and the chinese rufous horseshoe bat, respectively.
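The three read-filtering criteria listed above can be illustrated with a short Python sketch. It is not the authors' pipeline: the numeric cut-offs were lost from the text, so `MAX_N_FRACTION`, `MIN_ADAPTER_OVERLAP` and `MAX_ADAPTER_MISMATCH` are hypothetical placeholder values, the adapter check is a simplified ungapped comparison against the adapter prefix, and duplicates are detected by exact identity of both mates.

```python
from typing import Iterable, List, Tuple

# Hypothetical thresholds: the actual cut-offs are not recoverable from the text.
MAX_N_FRACTION = 0.10       # assumed maximum fraction of unidentified (N) bases
MIN_ADAPTER_OVERLAP = 10    # assumed number of bases that must align to the adapter
MAX_ADAPTER_MISMATCH = 2    # assumed mismatches tolerated within that overlap


def n_fraction(seq: str) -> float:
    """Fraction of unidentified nucleotides in a read."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def has_adapter(seq: str, adapter: str) -> bool:
    """Simplified ungapped check: does any window of the read match the
    first MIN_ADAPTER_OVERLAP adapter bases within the mismatch budget?"""
    k = MIN_ADAPTER_OVERLAP
    probe = adapter[:k]
    if len(probe) < k:
        return False
    for i in range(len(seq) - k + 1):
        mismatches = sum(1 for a, b in zip(seq[i:i + k], probe) if a != b)
        if mismatches <= MAX_ADAPTER_MISMATCH:
            return True
    return False


def filter_pairs(pairs: Iterable[Tuple[str, str]], adapter: str) -> List[Tuple[str, str]]:
    """Apply the N-content, adapter and duplicate filters to read pairs."""
    seen = set()
    kept = []
    for r1, r2 in pairs:
        if n_fraction(r1) > MAX_N_FRACTION or n_fraction(r2) > MAX_N_FRACTION:
            continue  # criterion 1: too many unidentified nucleotides
        if has_adapter(r1, adapter) or has_adapter(r2, adapter):
            continue  # criterion 2: adapter contamination
        if (r1, r2) in seen:
            continue  # criterion 3: putative pcr duplicate
        seen.add((r1, r2))
        kept.append((r1, r2))
    return kept
```

Production pipelines usually identify duplicates from mapping coordinates rather than exact sequence identity; the exact-match rule here is only the simplest stand-in.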
the genome sequences were assembled using allpaths software (butler, et al. ). briefly, contigs were generated by constructing a de bruijn graph with the sequencing reads from short-insert library data. the graph was simplified to generate the contigs by removing tips, merging bubbles and solving repeats. the sequencing reads were mapped to the assembled contigs, and the scaffolds were constructed by weighting the rates of consistent and conflicting paired-end relationships. finally, we retrieved the read pairs with one end uniquely mapped to the contigs and the other end located in a gap region, and performed a local assembly of these reads to fill the gaps. a more detailed genome assembly method is provided in supplementary methods.
total rnas of the two bats were extracted from brain, cerebellum, heart, liver, stomach, kidney, lung and muscle tissues for the generation of transcriptome data. paired-end libraries for rna sequencing were constructed using the illumina mrna-seq prep kit. the quality and integrity of the rna samples were determined using the agilent bioanalyzer. poly(a) mrnas were isolated using oligo(dt) beads, fragmented, and converted to cdnas followed by end repair, adaptor ligation, and pcr amplification. the libraries thus generated were sequenced using the illumina hiseq platform as described above.
we searched for tandem repeats across the genomes using tandem repeats finder. transposable elements were predicted in the genomes by homology search against the known transposable elements (te) in repbase (jurka, et al. ) (version ) using repeatmasker version . . (tarailo-graovac and chen ). the protein-coding genes of the bat genomes were annotated by combining homology-based, ab initio and rna-seq gene prediction methods. at first, rna-seq data were assembled using the trinity package (trapnell, et al. ). pasa (version r - - ) (haas, et al. ) was then used to map the assembled transcripts.
based on the set of gene models, a training set was constructed for de novo predictors by selecting the genes with complete structures and at least % mapping rate for uniprot vertebrate proteins. for the ab initio prediction, augustus (stanke and waack genes with the training set generated by pasa. for homology-based gene prediction, the protein sequences of human, mouse, dog, cow, little brown bat and large flying fox were downloaded from ensembl release and mapped onto the repeat-masked genome using genblasta (she, et al. ). rna-seq data were mapped to the genome using tophat (trapnell, et al. ), and the transcription-based gene structure were generated by cufflinks (trapnell, et al. ) . the final gene set was generated by merging all genes predicted using glean software (http://sourceforge.net/projects/glean-gene/). to infer gene function, it was based on the best match of the alignment to the swissprot and translated embl nucleotide sequence data library databases using blastp. interproscan (mulder and apweiler ) was used to determine motifs and domains in the final gene set. to evaluate completeness of the genomes and annotations, cegma method (parra, et al. ) was employed. we used the treefam methodology (li, et al. ) to define gene families in mammalian genomes (human, macaque, mouse, rat, dog, cat, horse, rhinoceros, cow, pig, little brown bat, large flying fox, great leaf-nosed bat and chinese rufous horseshoe bat). the protein sequences of other mammalian species were obtained from ensembl database (release ). gene family expansion and contraction analysis was performed by cafÉ software (de bie, et al. ) . a random birth and death model was proposed to study gene gain and loss in the gene families across a user-specified phylogenetic tree. a global parameter λ (lambda), which described both gene birth (λ) and death (μ = -λ) rates across all branches of all gene families was estimated using the maximum likelihood method. 
a conditional p-value was calculated for each gene family, and families with conditional p-values less than . were considered to have a significantly accelerated rate of expansion and contraction. protein sequences of the aforementioned mammals were aligned using muscle software (edgar ) . all orthologous genes were concatenated to one super gene for each species. raxml (stamatakis ) was applied to build phylogenetic trees. we partitioned the data by coding genes, and evaluated the model parameter independently for each partition. in all partitioned analyses, the empirical base frequencies and the evolutionary rates were estimated independently for every partition. bootstrap support was obtained by repeating the original partitioned ml raxml analysis on bootstrap replicates for each dataset using different random number seeds in each repetition. next, we inferred the species tree using coalescent method: maximum pseudo-likelihood estimation of species tree (mp-est) (liu, yu, et al. ) . individual gene tree for each gene was estimated using the maximum-likelihood method and rooted by an outgroup (human). species trees were estimated from the rooted gene trees in the program mp-est with bootstrap replicates. the results supported that bats are member of scrotifera (chiroptera + carnivores + perissodactyla + cetartiodactyla) with bat lineage diverging from fereuungulata (carnivores + perissodactyla + cetartiodactyla). the values of ka, ks, the ka/ks ratio were estimated for each gene using the codeml programs nested in the paml package (yang ) . in order to detect positively selected genes, optimized branch-site likelihood model (zhang, et al. ) was used. we separately explored the positively selected genes in the great leaf-nosed bat and the chinese rufous horseshoe bat. for each analysis, only one bat species was selected as foreground branches, and all other species were regarded as the background branches. 
the revised branch-site model a was employed, which attempts to identify positive selection acting on some sites on the "foreground branches". using a likelihood ratio test (lrt), the alternative hypothesis that positive selection occurs on the foreground branches (ka/ks > ) is compared with the null hypothesis (ka/ks= ). bayesian empirical bayes values were used to identify sites under positive selection. then, the branch two-ratio model was applied to detect genes with accelerated evolution in specific lineages. the one-ratio model assumed an equal ka/ks ratio for all lineages in the phylogeny, and the two-ratio model assumed two ka/ks ratios: one for the background branches and one for the foreground branch leading to the specific species. then, clade model c was employed to test for positive selection along the rhinolophoid bats. the two clades were assumed to share sites under purifying selection and neutral evolution, but to differ at a third site partition under divergent selection. the null model used for clade model c was m a_rel (weadick and chang ), whose lrt has a relatively lower false-positive rate. go annotations were downloaded from ensembl databases and were assigned to these orthologous genes. the binomial test was used to identify go categories with more than gene that had an excess of non-synonymous changes in bat lineages. next, we used the program mapp (multivariate analysis of protein polymorphism) (stone and sidow ) to evaluate the physicochemical impact of these convergent amino acid substitutions on bats. physicochemical variations can be used to predict how these particular convergent amino acid substitutions might affect protein function. in this work, we performed a probabilistic analysis of the sequence convergence in echolocating bats. a maximum likelihood approach, implemented in the software package codeml ancestral, was used.
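The likelihood ratio test used to compare the nested codeml models described above reduces to a simple calculation once the two log-likelihoods are known. The sketch below is a minimal illustration, assuming the test statistic is compared against a chi-square distribution with one degree of freedom (a common choice for the branch-site test; the source does not state which reference distribution was used). The function name and the example log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_df1(lnl_null: float, lnl_alt: float) -> tuple:
    """Likelihood ratio test for two nested models differing by one parameter.

    Returns (statistic, p_value). For one degree of freedom the chi-square
    survival function has the closed form erfc(sqrt(x / 2)).
    """
    # Twice the log-likelihood difference; clamp small negative values that
    # arise when the optimizer fails to improve on the null model.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods as they might be read from two codeml runs.
stat, p = lrt_pvalue_df1(lnl_null=-1003.2, lnl_alt=-1000.0)
print(f"2*delta(lnL) = {stat:.2f}, p = {p:.4f}")
```

When many genes are tested, as in a genome-wide scan, the resulting p-values would normally be corrected for multiple testing before candidate genes are reported.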
we compared the pair-wise branches of two echolocating bats in the phylogeny, and posterior probabilities of all possible amino acid substitutions were calculated. the probabilities of divergent and convergent substitutions were calculated as the sum of joint probabilities of substitutions between the two branches of echolocating bats. convergence and divergence estimates were based on posterior distributions of ancestral states and substitutions. the same state (same amino acid) represents convergent substitutions, and the different state represents divergent substitutions. finally, to further validate that the convergence between two branch pairs of echolocating bats was significant, we performed a simulation analysis to compare the observed probabilities against those of the null hypothesis. simulated sequences were generated using evolver, another program in the paml package (yang ). the branch-wise convergence probabilities were calculated with , replicates.
we used an in silico method similar to that previously reported (dong, et al. ). at first, we used previously published olfactory receptor (or) genes in vertebrates as query sequences (niimura and nei ) and conducted a tblastn search against the genome sequences with a cutoff e value of e- to identify the or gene repertoires. here, we identified in total or gene repertoires from eight mammalian genomes (the site (http://genome.ucsc.edu). then, the non-redundant blast-hits were extended to the ' and ' directions along the genome sequences, and the potential coding regions were extracted from these sequences. the chemosensory receptor genes in mammals have high sequence similarity. here, we re-performed a tblastn against the genome sequences using or coding genes identified from each species, and the non-redundant blast-hits were used to identify the or pseudogenes containing interrupting stop codons or frameshifts.
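The distinction drawn above between intact coding sequences and pseudogenes interrupted by stop codons or frameshifts can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the authors' in-house procedure: a frameshift is approximated by a length that is not a multiple of three, whereas a real pipeline would infer frameshifts from the alignment to a functional receptor. The function name and category labels are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_or_candidate(seq: str) -> str:
    """Classify a candidate coding sequence extracted from a blast hit.

    Returns one of "intact", "pseudogene" or "partial".
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Simplification: treat a length that breaks the reading frame as a frameshift.
    if len(seq) % 3 != 0:
        return "pseudogene"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # A stop codon anywhere before the last codon interrupts the reading frame.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return "pseudogene"
    has_start = codons[0] == "ATG"
    has_stop = codons[-1] in STOP_CODONS
    if has_start and has_stop:
        return "intact"
    # Uninterrupted but missing a start or stop codon: possibly truncated
    # by a gap or a contig end in the assembly.
    return "partial"


for candidate in ("ATGGCTTAA", "ATGTAAGCTTAA", "ATGGCTTA", "GCTGCTTAA"):
    print(candidate, classify_or_candidate(candidate))
```

Sequences labelled "partial" here would still need the positional checks on the start or stop codon that the text describes before being accepted as partial genes.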
to identify partial or genes from these sequences, we extracted the sequences that did not have any nonsense or frameshift mutations. we then constructed a multiple alignment of these sequences together with functional or genes by the program e-ins-i in mafft version . (katoh, et al. ). from those alignments, we extracted partial or sequences that meet the following criteria. when the c-terminal region of an or gene is missing from the genome sequence, the n-terminal region should contain an initiation codon at a proper position and should not contain any nonsense mutations, frameshifts, or long gaps. when the n-terminal region is missing, the c-terminal portion should have a stop codon at a proper position and should not contain any nonsense mutations, frameshifts, or long gaps. we also identified and sequences with nonsense stop codons in the great leaf-nosed bat and the chinese rufous horseshoe bat, which lack both start and stop codons. however, these sequences were removed because they are relatively short (~ bp) and have strong sequence similarity with bitter taste receptor genes. to assign identified or genes into distinct or gene families, a collection of protein sequences from horde database version (safran, et al. ) was used. to detect the extensive gain and loss of or gene repertoires, we employed the reconciled tree method (nam and nei ), in which the topology of a gene tree is reconciled with that of a species tree. an in-house program was applied. briefly, based on the phylogenetic tree of or genes, we compared the condensed gene tree and the species tree under the parsimony principle. the number of ancestral genes can then be estimated, along with the past occurrence of gene expansions and contractions. here, we used a % condensed tree of or genes for analyses.
a list of vision-related genes was obtained from the go category of visual perception (go: ).
we subjected human vision-related proteins to tblastn against the genomes with a cutoff threshold of e-value e- . we identified the best hit for each human protein using the criterion that more than % of the aligned sequence showed an identity above %. the genewise algorithm was employed to identify potential pseudogenes with parameters -genesf -for -quiet. those genes with frameshifts or premature stop codons were considered as candidates. we then filtered them as follows: 1) we aligned all human proteins to their corresponding genomic loci, and those genes with frameshifts or premature stop codons in human-to-human alignments were removed; 2) as for the human-to-human alignments, those genes with obvious splicing errors near their frameshifts or premature stop codons were removed; 3) candidate pseudogenes with a low number of sequencing reads covering their frameshift or premature stop codon sites were regarded as assembly errors. those genes with a number of reads containing genotype variations at these sites were considered as heterozygous and were also removed.
we used a method based on ka/ks to identify go categories that are significantly above average in the great leaf-nosed bat genome and the chinese rufous horseshoe bat genome. at first, the ka and ks rates were calculated by the paml package from all aligned bases with quality score larger than in orthologs, using the f x codon frequency model and the rev substitution matrix. to examine evolution by functional category, we downloaded the go annotation of human genes from the ensembl biomart database (release- ). we estimated the average ka and ks values for genes with go annotations according to the following equations (s , s ), where t is the number of annotated genes within go categories, a_i and A_i are the numbers of non-synonymous substitutions and sites, and s_i and S_i are the numbers of synonymous substitutions and sites in gene i, as estimated by paml, respectively.
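Because the supplementary equations are not reproduced in the text, the following Python sketch shows one plausible reading of the quantities just defined and of the per-category binomial test that the method applies to them. The formulas are assumptions modelled on the standard pooled-rate approach used in earlier comparative genome analyses, not a transcription of the authors' equations: average rates are taken as ratios of summed substitutions to summed sites, and the expected non-synonymous proportion within a category is derived from those averages. All function names are hypothetical.

```python
from math import comb
from typing import List, Tuple

# One gene is represented as (a_i, A_i, s_i, S_i): non-synonymous substitutions
# and sites, then synonymous substitutions and sites. Substitution counts are
# assumed to be integers (rounded paml estimates).
Gene = Tuple[int, float, int, float]


def pooled_rates(genes: List[Gene]) -> Tuple[float, float]:
    """Average Ka and Ks over all annotated genes (assumed form: sum of
    substitutions divided by sum of sites)."""
    k_a = sum(g[0] for g in genes) / sum(g[1] for g in genes)
    k_s = sum(g[2] for g in genes) / sum(g[3] for g in genes)
    return k_a, k_s


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


def category_pvalue(category: List[Gene], k_a: float, k_s: float) -> float:
    """Probability of seeing at least the observed number of non-synonymous
    substitutions in a GO category, given the genome-wide average rates."""
    a_c = sum(g[0] for g in category)
    s_c = sum(g[2] for g in category)
    sites_a = sum(g[1] for g in category)
    sites_s = sum(g[3] for g in category)
    # Assumed expected proportion of non-synonymous substitutions.
    p_a = k_a * sites_a / (k_a * sites_a + k_s * sites_s)
    return binomial_upper_tail(a_c, a_c + s_c, p_a)


all_genes = [(2, 100.0, 4, 50.0), (1, 100.0, 2, 50.0)]
k_a, k_s = pooled_rates(all_genes)
print(k_a, k_s, category_pvalue(all_genes[:1], k_a, k_s))
```

Pooling substitutions and sites before dividing avoids unstable per-gene ratios when individual genes have very few synonymous substitutions.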
the expected proportion of non-synonymous substitutions p_A in a go category was then calculated (s ). for a given go category c, the probability p_c of observing an equal or higher number of non-synonymous substitutions was calculated assuming a binomial distribution (s ), where a_c and s_c are the total numbers of non-synonymous and synonymous substitutions in go category c, respectively. we applied a variant of the binomial test described above to identify go categories that have an excess of non-synonymous changes on one lineage. for lineages x and y, the average proportion of non-synonymous substitutions was calculated by the following formula (s ), where x is the total number of non-synonymous substitutions in the x lineage and y is the total number of non-synonymous substitutions in the y lineage. the difference between the observed and expected proportions of non-synonymous substitutions in the two lineages was assumed to follow a binomial distribution, as given in the following equation (s ). as described for the absolute rate tests, we then computed this statistic for every go category, as well as for every category in , randomly permuted data sets.
we sampled a total of great leaf-nosed bats distributed in four different locations. genomic dna was extracted from wing membranes of each individual. a paired-end sequencing library with an insert size of bp was constructed for each sample, and sequenced on the illumina hiseq platform with × bp mode. low-quality sequencing reads were filtered out according to the following criteria: 1) any reads with > % unidentified nucleotides; 2) reads with > nt aligned to the adapter sequence, allowing < % mismatches; 3) reads with % bases having phred quality < . the filtered reads were mapped to the genome using bwa, and samtools was used to call snps.
then, we filtered snps using vcftools and gatk under the following criteria: ) coverage depth > and < ; ) root mean square mapping quality > ; ) distance between adjacent snps > bp; ) distance to a gap > ; ) read quality value > . to estimate phylogenetic relationships, genetic distances were calculated among all samples to generate a neighbor-joining (nj) tree using phylip. we performed a principal component analysis using the package gcta. the population structure was inferred using frappe (v . ) with a maximum likelihood method (tang, et al. ) . a sliding-window approach ( kb windows sliding in kb steps) was employed to quantify polymorphism levels (θ π , pairwise nucleotide variation as a measure of diversity) and genetic differentiation (fst) between the high altitude region (dq) and low altitude regions (tw, jx and gz). to detect significant signatures of selective sweeps, z-transformed fst values were calculated. next-generation genome sequencing was carried out, generating . gb and . gb of sequences for the great leaf-nosed bat and the chinese rufous horseshoe bat (supplementary table s ), respectively. the genome size was estimated to be . gb and . gb for the great leaf-nosed bat and the chinese rufous horseshoe bat ( supplementary fig. , supplementary fig. ). known transposon-derived repeats account for . % and . % of the genomes in the great leaf-nosed bat and the chinese rufous horseshoe bat, respectively, which are lower than in non-bat mammalian species (supplementary table s ). to facilitate the genome annotation, we generated high-depth transcriptome data from these two rhinolophoid bats. with repeats masked, the genome was annotated by integrating homologous prediction, ab initio prediction and transcription-based prediction methods. as a result, non-redundant reference gene sets of , and , protein-coding genes were generated for the great leaf-nosed bat and the chinese rufous horseshoe bat ( supplementary fig. ) , respectively.
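for the selective-sweep scan described in the methods above, the z-transformation of windowed fst values can be sketched as follows. this is a minimal sketch, assuming each window's fst is standardized against the genome-wide mean and population standard deviation; the function names and the `z_cutoff` parameter are hypothetical, and the cutoff actually used is not legible in this text.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_transform(values: Sequence[float]) -> List[float]:
    """Standardize values as (x - mean) / standard deviation.

    Uses the population standard deviation (an assumption; a sample
    standard deviation would differ only slightly for many windows).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all windows have the same value")
    return [(v - mu) / sigma for v in values]


def outlier_windows(window_fst: Sequence[float], z_cutoff: float) -> List[int]:
    """Return indices of windows whose Z(Fst) is at or above z_cutoff."""
    return [i for i, z in enumerate(z_transform(window_fst)) if z >= z_cutoff]
```

windows returned by `outlier_windows` would be the candidate sweep regions to inspect for overlapping genes.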
we employed cegma method to evaluate the completeness of genome annotation. the result showed that the vast majority of the core genes were present in our predicted gene sets ( . % for the great leaf-nosed bat and . % for the chinese rufous horseshoe bat), indicating the completeness of gene sets identification. next, we aligned the transcriptome sequencing reads to the predicted genes, and the result showed that approximately % of exons are accurately covered ( . % for the great leaf-nosed bat and . % for the chinese rufous horseshoe bat). comparative analysis showed a high gene sequence similarity between them ( %, supplementary fig. ). we next examined the level of homology between our predicted genes and sequences in the uniprot database. the result showed that > % of the genes were functionally annotated ( % for the great leaf-nosed bat and . % for the chinese rufous horseshoe bat). compared with the gene families in other three mammalian species -the little brown bat, large flying fox and human, we identified , homologous gene families shared by five species. a total of gene families were specific to the rhinolophoid bats ( fig. ) . further functional annotation indicated that the rhinolophoid bats specific gene families were significantly over-represented in two major functional categories: atp binding ( genes, f.d.r.= . ) and immunity and host defense ( genes, f.d.r.= . ; supplementary table s ) . until now, the relationship of bats to other members of superorder laurasiatheria has proven difficult to resolve. some studies insisted that bats belong to the clade of pegasoferae which comprises chiroptera, carnivores and odd-toed ungulates (lindblad-toh, et al. ; meredith, et al. ; mccormack, et al. ) , whereas others proposed that bats are a sister group to the clade comprising carnivores and euungulata (pumo, et al. ; murphy, et al. ; murphy, et al. ; song, et al. ; zhang, et al. ) . 
to determine the phylogenetic position of bats within the superorder laurasiatheria, a total of , single-copy : orthologous genes were used ( fig. ). the result based on nucleotide data was in line with previous analyses in which bats are a sister group to odd-toed ungulates, whereas the result based on amino acid data supported bats as a sister group to fereuungulata (carnivores + perissodactyla + cetartiodactyla). to account for tree discordance among loci, a coalescent method was applied. coalescent trees were highly consistent with the result inferred from amino acid data using the partitioned method ( supplementary fig. ) . to dissect the phylogenetic signal, eight previously published phylogenetic hypotheses ( supplementary fig. ) were considered (waddell, et al. ; murphy, et al. ; nishihara, et al. ; prasad, et al. ; lindblad-toh, et al. ; meredith, et al. ; mccormack, et al. table s , supplementary fig. ). the result is consistent after incorporating the data from the eulipotyphla group ( supplementary fig. ) . we subsequently estimated the divergence times among these mammalian species. the bat lineage appears to have diverged from fereuungulata around million years ago, and the rhinolophoid bats appear to have diverged from the old world fruit bats around million years ago. comparative genome analyses were carried out to assess evolution and innovation within the rhinolophoid bats. we next determined the expansion and contraction of orthologous gene clusters during evolution. the result identified significantly expanded and significantly contracted gene families in the great leaf-nosed bat, and significantly expanded and significantly contracted gene families in the chinese rufous horseshoe bat (fig. ) .
functional annotation showed that gene family contraction mainly included many olfactory receptor gene families in both rhinolophoid bat lineages (supplementary table s ), which is consistent with the result that the olfactory system is aberrant in some echolocating bats. many of the expanded gene families in both rhinolophoid bats are significantly enriched in immune-related functional categories (supplementary table s ). moreover, we identified , and positively selected genes in the great leaf-nosed bat, the chinese rufous horseshoe bat and the large flying fox, (supplementary tables s , , ), respectively. olfaction is of great importance in the lives of bats. many bats can use olfaction for mother-pup recognition, find food and avoid danger. in old world fruit bats, olfaction appears to be of particular importance, and fruit bats can find food from scent cues. animals that rely heavily on the sense of smell tend to have large numbers of or genes, while species that always use other senses have fewer functional or genes (niimura and nei ) . it has been suggested that bats displayed a diverse olfaction abilities. in order to describe the diversity of bat or gene repertoires, we identified the entire set of or genes of four bat species (supplementary methods, supplementary table s ). in line with previous work (hayden, et al. ) , we observed that echolocating bats have less fraction of or pseudogenes ( % for the great leaf-nosed bat, % for the chinese rufous horseshoe bat and % for the little brown bat) than non-echolocating bats ( % for the large flying fox). however, further analysis showed that the large flying fox and little brown bat have more than intact or genes while these two rhinolophoid bats only have < intact or genes. this finding is consistent with the result that rhinolophoid bats have a relatively small olfactory epithelium than the frugivorous pteropodidae (neuweiler ) . 
next, we reconstructed a protein neighbor-joining tree of all newly identified intact or genes in bats (fig. a) . it is obvious that or genes can be classified into two distinct classes based on sequence similarity: class i, postulated to bind to water-borne molecules, and class ii, hypothesized to bind to airborne molecules. the exact number of or genes in each class/or family are shown (supplementary table s , table s ). it seems that four bat species contain similar number of or genes in class i, while or gene contraction occurred in two rhinolophoid bats in class ii . previous works have documented that the number of or genes varies extensively among mammalian species, and extensive gains and losses of or genes have been observed (niimura and nei ) . to further understand the evolutionary changes of or gene repertoires, we estimated the gains and losses of the or genes in a diverse range of mammals (supplementary methods). evolutionary changes in the number of or genes in mammals have been shown in fig. b . we can clearly identify an extensive or gene contraction events occurred to the branch leading to the common ancestor of bats. further extensive gene contractions can be observed in the branch leading to the rhinolophoid bats. this finding also suggests massive "birth-and-death" of or genes in the bat species. table s ). since high omega may be due to stochastic effect caused by extremely small sample size, we removed these genes with omega value of . the result is also stable that more positively selected genes were detected in the branches leading to echolocating bats ( genes, great leaf-nosed bat, p = . e- ; genes, chinese rufous horseshoe bat, p = . e- ; genes, little brown bat, p = . ). next, branch model (two-ratio model) was carried out with the attempt to detect genes with accelerated evolution in the bat species. 
the result further indicated that more hearing-related genes have higher ɷ values on the branches leading to echolocating bats than all other lineages (supplementary table s ) . clade model c implemented in paml was employed (weadick and chang ) , and the result also persisted that more positively selected genes were detected in the branches leading to echolocating bats (supplementary table s ). moreover, a significant association between the average number of non-synonymous substitutions for all the hearing-related genes leading to each mammalian species and the estimated frequency of best hearing sensitivity for that species (r = . , p = . , fig. ) was observed. no significant correlation between such hearing frequencies and number of synonymous changes was observed (p = . ). a significant association between the number of non-synonymous changes between sister taxa was observed (r = . , p = . ). it is obvious that echolocating bats have typically undergone many more non-synonymous changes in the hearing-related genes than non-echolocating mammals. these results indicated the evolution of ultrasonic hearing in the rhinolophoid bats has involved in adaptive amino acid replacements in the hearing-related genes, which provided evidence conferring greater auditory sensitivity to ultrasonic frequency. previous works have documented that seven hearing-related genes underwent convergent evolution in echolocators (li, et al. ; liu, cotton, et al. ; davies, et al. ; shen, et al. ). here, genome-wide signatures of convergent evolution were examined in laryngeal echolocating bats. except for the previously reported seven hearing-related genes, we totally identified genes examined in the sound of perception category containing potential sequence convergent loci (site-wise convergence posterior probabilities > . ). to confirm our result, we amplified and sequenced these hearing-related genes from another two echolocating bats (eptesicus fuscus and miniopterus natalensis). 
the result also showed that these genes have higher convergence probabilities occurred in echolocating bats from a wider range of taxa, and the convergence probabilities between branches were significant based on simulations (supplementary table s ). however, maximum likelihood trees recovered the topology that all laryngeal echolocating bats formed a monophyletic clade for only four genes (col a , icam , bsnd and strc, supplementary fig. ). further analyses showed that echolocating bats are paraphyletic based on synonymous substitutions, whereas the non-synonymous trees revealed monophyly of laryngeal echolocators for only one hearing-related genes (strc gene, supplementary fig. ). next, multivariate analyses of protein polymorphism (mapp) was employed to detect the physicochemical impact of convergent substitutions in echolocating bats. mapp scores were estimated for the amino acid variants nested in the strc gene, and the result showed that these replacements had important functional effects (mapp score = . , p = . e- for h q; mapp score = . , p = . e- for a t; mapp score = . , p = . e- for v i). we further measured the number of sites with convergent amino acid substitutions along the branches as a direct measurement of sequence convergence, and found that the number of convergent sites in the branch pairs is proportional to the number of divergent sites ( supplementary fig. ). the number of convergent sites in the laryngeal echolocating bats does not significantly exceed that between the branch pair of the little brown bat and large flying fox, given their numbers of divergent sites (supplementary table s ). no significant differences was observed in the total number of sites that have experienced convergent substitutions from hearing-related genes. 
this result indicated that there is no exceptional genomic signature indicative of adaptive convergence between laryngeal echolocating bats, and adaptive convergent substitutions might be confined to a few specific genes. bats are nocturnal mammals. the eyes of most echolocating bats are relatively small and poorly developed, whereas old world fruit bats often have excellent eyesight . rhinolophoid bats have the most sophisticated echolocation ability, and it has been proposed that some genes involved in visual perception may have undergone relaxed selection (zhao, et al. ). we next examined the molecular basis for the poor visual perception in the echolocating bats. bats have long been regarded as important reservoir hosts of emerging viruses (calisher, et al. ) . to examine population dynamics and understand evolutionary processes, we sampled great leaf-nosed bats from major distributed locations in china, including one group from a high-altitude region (fig. a, table s ) are located at the intergenic regions. to resolve their phylogenetic relationships, we constructed a neighbor-joining (nj) tree based on pairwise genetic distances (fig. b) . this result showed that the great leaf-nosed bats formed separate groups according to their locations. principal component analysis clearly divided these samples into four groups (dq, gz, jx and tw, fig. c) . these results suggested that there was significant population structure among the great leaf-nosed bat populations. furthermore, we performed population structure analysis. when k= , all four populations were clearly separated (fig. d) . next, we measured the genetic diversity values (θ π ) of the four populations, and found similar sequence diversity values (dq: . , gz: . , jx: . and tw: . , supplementary fig. s ). we further calculated the population differentiation statistic (fst) between populations, and the result showed little differentiation among populations (fst ranging from .
between jx and tw to . between tw and dq, supplementary table s ) , which suggests universal inter-region gene flows. since the method of population differentiation has been widely used to detect selective sweeps (akey, et al. ; axelsson, et al. ; gou, et al. table s ). the result showed that genes related to catabolic process are likely to have been targets of recent positive selection. interestingly, we found that five genes (epas , plxnd , gja , sell and chdh) belong to hypoxia response related go categories (pugh and ratcliffe ; storz and moriyama ) , including 'angiogenesis', 'blood coagulation', 'blood vessel morphogenesis' and 'oxidoreductase activity'. epas can respond to the changes in available oxygen in the cellular environment under the high-altitude conditions. our work suggested that epas is involved in a selective sweep during the move of bats from low to high altitude. although hypoxia go categories are not over-represented, these highlighted hypoxia-related genes gave us a clue that genetic adaptations might be associated with high altitude. using deep sequencing and de novo assembly, we generated two genomes of rhinolophoid bats. rhinolophoid bats can perceive the world by using a wide range of sensory mechanisms, some of which have become highly specialized. these genome data provided useful resources to decipher the molecular adaptations of phenotypic traits. rhinolophoid bats arguably possess the most sophisticated echolocation systems, and can emit relatively long calls adapted to detect and classify the wing beats of insects. they are heavily reliant on hearing for a variety of ecologically important roles. previous works have documented that hearing-related genes are predominantly evolutionarily conserved in mammals (kirwan, et al. ) . here, we found evidence that some hearing-related genes have undergone darwinian selection associated with the evolution of specialized constant frequency echolocation. 
positive selection acting on hearing-related genes in rhinolophoid bats might result from the extreme selectivity used in auditory processing by these bats. many previous works have reported sequence convergence of some hearing-related genes that reunites echolocating bats (li, et al. ; liu, et al. ; davies, et al. ). we found no genome-wide sequence convergence for echolocation, indicating that erroneous phylogenetic groupings are still rare. it has been suggested that the enlargement of one area of the brain might be associated with a reduction in size of other brain areas (harvey and krebs ) . the auditory cortex and the inferior colliculus are extremely enlarged in volume in laryngeal echolocating bats (especially in rhinolophoid bats), while visual brain areas are relatively enlarged in old world fruit bats (dechmann and safi ). a trade-off in investment in brain tissues has been proposed because of the extreme energetic demands imposed by neural processing. our result showed that more visual perception genes have become pseudogenes in rhinolophoid bats, and it is reasonable to speculate that some visual perception genes may have undergone relaxed natural selection in echolocating bats. meanwhile, positive selection acting on some hearing-related genes was identified. such concordance suggests that some genes are impacted by natural selection, which raises the possibility that changes in the genes of one sensory modality have direct consequences for genes controlling other sensory modalities, perhaps via trade-offs. this finding supports the longstanding but weakly supported assumption that bats are experiencing a trade-off between vision and audition . olfaction is of great importance in the lives of bat species. previous works have identified the olfactory receptor (or) gene repertoire in the little brown bat and the large flying fox using the profile hidden markov model (hayden, et al. ; hayden, et al. in specific gene family.
a possible explanation is that the little brown bat has no well-developed olfaction ability, but tends to recognize specific odorants after recent or gene duplication. these comparative analyses have provided great insights into adaptation to their specialized sensory mechanisms. in this work, we re-sequenced the genome of great leaf-nosed bats from four distributed locations. the genome re-sequencing analysis has been performed based generally on the following considerations: ) to characterize the genetic diversity and patterns of evolution; ) to understand the genetic bases of adaptation to high altitude in the great leaf-nosed bats. efforts for the conservation measures will benefit from the knowledge of population genetic structure of the great leaf-nosed bats. here, we found very little differentiation among populations, which suggests universal inter-region gene flows or incomplete lineage sorting. a broader geographical scale analysis is needed in the future. furthermore, we provided evidence of genetic adaptation in the great leaf-nosed bat that are associated with high altitude. selective sweep mapping was conducted for populations from different altitudes, and identified several hypoxia-related genes with a high extent of differentiation on the genome scale. epas is transcription factor that respond to the changes in the available oxygen in the cellular environment under high-altitude conditions, and mutations at epas are tightly associated with hematologic phenotypes (van patot and gassmann ). previous works have documented that epas polymorphisms are associated with tibetan people with lower hemoglobin concentrations (beall, et al. ) . a loss-of-function role of epas might exist in high-altitude adaptation. so, our result indicated potential high-altitude hypoxia adaptation mechanisms of the great leaf-nosed bat. our work is based on a limited genome re-sequencing resource, and data from more samples are necessary for future work. 
however, false positives notwithstanding, the results provide valuable starting points for experimental follow-up, and suggest an initial evolutionary scenario of bat adaptation to high-altitude hypoxia. to the best of our knowledge, this is the first report of a de novo assembled genome and genome re-sequencing of bats with long constant frequency echolocation calls. these data are essential for understanding the evolution of bats. tracking footprints of artificial selection in the dog genome the genomic signature of dog domestication reveals adaptation to a starch-rich diet natural selection on epas (hif alpha) associated with low hemoglobin concentration in tibetan highlanders prediction of complete gene structures in human genomic dna allpaths: de novo assembly of whole-genome shotgun microreads bats: important reservoir hosts of emerging viruses parallel signatures of sequence evolution among hearing genes in echolocating mammals: an emerging model of genetic convergence cafe: a computational tool for the study of gene family evolution comparative studies of brain evolution: a critical insight from the chiroptera evolution of olfactory receptor genes in primates dominated by birth-and-death process muscle: multiple sequence alignment with high accuracy and high throughput isolation and characterization of a bat sars-like coronavirus that uses the ace receptor whole-genome sequencing of six dog breeds from continuous altitudes reveals adaptation to high-altitude hypoxia improving the arabidopsis genome annotation using maximal transcript alignment assemblies comparing brains ecological adaptation determines functional mammalian olfactory subgenomes a cluster of olfactory receptor genes linked to frugivory in bats the evolution of echolocation in bats repbase update, a database of eukaryotic repetitive elements mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform a phylomedicine approach to understanding the
evolution of auditory sensory perception and disease in mammals the hearing gene prestin reunites echolocating bats treefam: a curated database of phylogenetic trees of animal gene families the hearing gene prestin unites echolocating bats and whales a high-resolution map of human evolutionary constraint using mammals a maximum pseudo-likelihood approach for estimating species trees under the coalescent model convergent sequence evolution between echolocating bats and dolphins the voltage-gated potassium channel subfamily kqt member (kcnq ) displays parallel evolution in echolocating bats parallel evolution of kcnq in echolocating bats parallel adaptive radiations in two major clades of placental mammals ultraconserved elements are novel phylogenomic markers that resolve placental mammal phylogeny when combined with species-tree analysis impacts of the cretaceous terrestrial revolution and kpg extinction on mammal diversification interpro and interproscan: tools for protein sequence classification and comparison resolution of the early placental mammal radiation using bayesian phylogenetics using genomic data to unravel the root of the placental mammal phylogeny evolutionary change of the numbers of homeobox genes in bilateral animals the biology of bats extensive gains and losses of olfactory receptor genes in mammalian evolution pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions cegma: a pipeline to accurately annotate core genes in eukaryotic genomes confirming the phylogeny of mammals by use of large comparative sequence data sets regulation of angiogenesis by hypoxia: role of the hif system complete mitochondrial genome of a neotropical fruit bat, artibeus jamaicensis, and a new hypothesis of the relationships of bats to other eutherian mammals human gene-centric databases at the weizmann institute of science: genecards, udb, crow and horde from spatial orientation to food acquisition in echolocating bats genblasta: 
enabling blast to identify homologous gene sequences parallel evolution of auditory genes for echolocation in bats and toothed whales parallel and convergent evolution of the dim-light vision gene rh in bats (order: chiroptera) consel: for assessing the confidence of phylogenetic tree selection resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model raxml version : a tool for phylogenetic analysis and post-analysis of large phylogenies gene prediction with a hidden markov model and a new intron submodel physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity mechanisms of hemoglobin adaptation to high altitude hypoxia estimation of individual admixture: analytical and study design considerations using repeatmasker to identify repetitive elements in genomic sequences hear, hear: the convergent evolution of echolocation in bats? a molecular phylogeny for bats illuminates biogeography and the fossil record differential analysis of gene regulation at transcript resolution with rna-seq tophat: discovering splice junctions with rna-seq hypoxia: adapting to high altitude by mutating epas- , the gene encoding hif- alpha towards resolving the interordinal relationships of placental mammals an improved likelihood ratio test for detecting site-specific functional divergence among clades of protein-coding genes paml : phylogenetic analysis by maximum likelihood comparative analysis of bat genomes provides insight into the evolution of flight and immunity evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level the evolution of color vision in nocturnal mammals this project is supported by key construction program of the national ' ' project of east china normal university to dong dong ( ), and the national natural science foundation of china (no. ) to shuyi zhang. 
we thank shanghai majorbio bio-pharm biotechnology co., ltd. for genome sequencing and dr. chao-hung lee for providing valuable advice. dd designed the study, and dd, ml, ph, yp, sm, gz, ep, kl and sz carried out the data analysis. dd wrote the manuscript. all authors read and approved the final manuscript. the authors declare no competing financial interests. key: cord- -obig mu authors: alić, ivan; goh, pollyanna a; murray, aoife; portelius, erik; gkanatsiou, eleni; gough, gillian; mok, kin y; koschut, david; brunmeir, reinhard; yeap, yee jie; o’brien, niamh l; groet, jurgen; shao, xiaowei; havlicek, steven; dunn, n ray; kvartsberg, hlin; brinkmalm, gunnar; hithersay, rosalyn; startin, carla; hamburg, sarah; phillips, margaret; pervushin, konstantin; turmaine, mark; wallon, david; rovelet-lecrux, anne; soininen, hilkka; volpi, emanuela; martin, joanne e; foo, jia nee; becker, david l; rostagno, agueda; ghiso, jorge; krsnik, Željka; Šimić, goran; kostović, ivica; mitrečić, dinko; francis, paul t; blennow, kaj; strydom, andre; hardy, john; zetterberg, henrik; nižetić, dean title: “patient-specific alzheimer-like pathology in trisomy cerebral organoids reveals bace as a gene-dose-sensitive ad-suppressor in human brain” date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: obig mu a population of > million people worldwide at high risk of alzheimer’s disease (ad) are those with down syndrome (ds, caused by trisomy (t )), % of whom develop dementia during lifetime, caused by an extra copy of β-amyloid-(aβ)-precursor-protein gene. we report ad-like pathology in cerebral organoids grown in vitro from non-invasively sampled strands of hair from % of ds donors. the pathology consisted of extracellular diffuse and fibrillar aβ deposits, hyperphosphorylated/pathologically conformed tau, and premature neuronal loss. presence/absence of ad-like pathology was donor-specific (reproducible between individual organoids/ipsc lines/experiments).
pathology could be triggered in pathology-negative t organoids by crispr/cas -mediated elimination of the third copy of chromosome- -gene bace , but prevented by combined chemical β and γ-secretase inhibition. we found that t -organoids secrete increased proportions of aβ-preventing (aβ - ) and aβ-degradation products (aβ - and aβ - ). we show these profiles mirror in cerebrospinal fluid of people with ds. we demonstrate that this protective mechanism is mediated by bace -trisomy and cross-inhibited by clinically trialled bace -inhibitors. combined, our data prove the physiological role of bace as a dose-sensitive ad-suppressor gene, potentially explaining the dementia delay in ∼ % of people with ds. we also show that ds cerebral organoids could be explored as pre-morbid ad-risk population detector and a system for hypothesis-free drug screens as well as identification of natural suppressor genes for neurodegenerative diseases. production - , and degradation of β-amyloid peptides (aβ) are among the central processes in the pathogenesis of alzheimer's disease (ad). the canonical aβ peptide is produced after sequential cleavage of the β-amyloid precursor protein (app) by β-secretase and γ-secretase, generating a peptide that most often begins amino acids (aa) from the c-terminus of app with asp and contains the next - aa of the app sequence, generating a range of peptides (aβ - , , , and ) . the longer of these peptides can be detected in toxic amyloid aggregates in the brain, associated with ad and other neurodegenerative disorders . as app gene is located on human chromosome , people with down syndrome (ds, caused by trisomy (t )) are born with one extra copy of this gene, which increases their risk of developing ad. non-ds (euploid) people inheriting triplication of the app gene alone (dupapp) develop ad symptoms by age with % penetrance. 
paradoxically, only ~ % of people with ds develop clinical dementia by age , suggesting the presence of other unknown chromosome -located genes that modulate the age of dementia onset , . a number of secretases participate in the physiological cleavage of app , , generating various peptides involved in neuronal pathology. bace is the main β-secretase in the brain , while the expression and function of its homologue bace (encoded by a chromosome gene) remain less clear , . at least different activities of bace were recorded with regards to app processing: as an auxiliary β-secretase (proamyloidogenic), as a θ-secretase (degrading the β-ctf and preventing the formation of aβ), and as aβ-degrading protease (aβdp) (degrading synthetic aβ-peptides at extremely acidic ph). it remains unclear which of these activities reflect the role of bace in ad. the potential activity of bace as an anti-amyloidogenic θ-secretase can be predicted from studies on a variety of transfected cell lines that overexpress app, and artificially manipulate the dose of bace [ ] [ ] [ ] [ ] . we compared organoids from isogenic ipsc lines, derived from the same individual with ds, mosaic for t and normal disomy (d ) cells . cerebral organoids were derived following a standard protocol , and shown to contain neurons expressing markers of all layers of the human cortex ( supplementary fig. ) and no significant difference in the proportions of neurons and astrocytes between the d and t organoids ( supplementary fig. ) . the integrity and copy number of the ipsc lines were validated at the point of starting the organoid differentiation, for chromosome ( supplementary fig. ), and the whole genome (available on request). t /d status was further verified by interphase fish on mature organoid slices, ( supplementary fig. a ). 
the c-terminal region of app can be processed by the sequential action of different proteases to produce a range of protein fragments and peptide species, including aβ ( supplementary fig. ). aβ peptide profiles were analysed from organoid-conditioned media (cm), whereby each cm sample was taken from a cm dish culturing a pool of - organoids derived from one ipsc line, in total: n= cm samples for exp ( trisomic isogenic lines, disomic isogenic lines, timepoints each), n= cm samples for exp ( trisomic isogenic lines, disomic isogenic lines, timepoints each) and n= cm samples for exp ( trisomic isogenic line, disomic isogenic line, dupapp line, line each for two different unrelated ds individuals, timepoints each). cm was collected at timepoints between days - of culturing and analysed using immunoprecipitation in combination with mass spectrometry (ip-ms) . please see the "methods" and "supplementary data" sections for more detailed explanations, and for the statistical controls used for individual ipsc line-to-line comparisons (fig. a). relative ratios of areas under the peak were calculated between the peptides of interest within a single mass spectrum (raw data example in supplementary fig. d), and are therefore unaffected by the variability in the total cell mass between wells growing organoids. the proportions of non-amyloidogenic peptides with the signature of bace cleavage products, both as a putative θ-secretase (as reflected by the aβ - product) and as putative aβdp or aβ-clearance products (aβ - & - ), or combined (relative to the sum of aβ amyloidogenic peptides (aβ - & - & - & - )), were approximately doubled in cm from t organoids, compared to isogenic normal controls, and reached levels of > % of the amyloidogenic peptide levels (fig. a). this result was fully reproduced in independent experiments, each starting from undifferentiated ipscs ( vertical columns of graphs in fig. a ).
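The relative-ratio readout described above can be illustrated with a short sketch. This is not the authors' analysis code: the function name, the peptide labels and the peak areas are all hypothetical, chosen only to show why a ratio taken within one spectrum does not depend on how much material a well contributed.

```python
def relative_ratio(peak_areas, numerator, denominator):
    """Ratio of summed peak areas for two peptide groups within one spectrum."""
    num = sum(peak_areas[p] for p in numerator)
    den = sum(peak_areas[p] for p in denominator)
    if den == 0:
        raise ValueError("denominator peptides have zero total peak area")
    return num / den


# Hypothetical peak areas (arbitrary units) read from a single mass spectrum.
# The labels are placeholders, not the peptide species measured in the study.
spectrum = {
    "clearance_a": 30.0,
    "clearance_b": 20.0,
    "amyloidogenic_a": 60.0,
    "amyloidogenic_b": 40.0,
}
clearance = ["clearance_a", "clearance_b"]
amyloidogenic = ["amyloidogenic_a", "amyloidogenic_b"]

ratio = relative_ratio(spectrum, clearance, amyloidogenic)

# A well with three times more cell mass scales every peak by the same factor,
# so the within-spectrum ratio is unchanged.
scaled = {name: 3.0 * area for name, area in spectrum.items()}
ratio_scaled = relative_ratio(scaled, clearance, amyloidogenic)
```

Because numerator and denominator come from the same spectrum, any multiplicative factor shared by all peaks cancels, which is the property the text relies on.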
in experiment , more recently generated ipsc lines from different individuals were introduced; from a euploid patient with feoad caused by dupapp , and from unrelated people with ds (supplementary figs. - ). the - & - /amyloidogenic ratios were not significantly different between d and dupapp lines, suggesting the third copy of the app gene alone did not cause any change in this ratio. ratios of - & - /amyloidogenic peptides and combined bace products/amyloidogenics were significantly increased in t lines (combining all t individuals) compared to d or dupapp lines (fig. a) . the ratio of - /amyloidogenics was significantly higher in t lines from the isogenic model, compared to its disomic isogenic control, and compared to dupapp, but it was unchanged in the other two unrelated ds ipsc lines (see also supplementary information for a more detailed explanation). as the proportions of bace -unrelated α-site cleavage products were not different between t and isogenic d organoids (in any of the experiments) (fig. a) , it can be predicted that the increased presence of - , - and - peptides in t contributes towards an overall increase in soluble peptides that are non-amyloidogenic. the validity of this prediction was tested by an independent biochemical method (elisa), by measuring the aβ-peptide concentrations within the isogenic t :d organoid cm comparison, which showed an increase in absolute concentrations caused by t for each aβ - , - and - , with no difference in the aβ - / - ratio between t and isogenic d lines, mirroring the readout in the absolute levels of ip-ms peaks ( supplementary fig. ). analysis of ip-ms area under peak (used in fig. to calculate relative ratios) showed a near linear correlation when plotted against absolute peptide concentrations measured by elisa, for each aβ - , - and - ( supplementary fig.
), validating our relative ratio calculations by an independent biochemical method. to estimate the contribution of bace towards the anti-amyloidogenic pathway relative to other anti-amyloidogenic cleavages at the α-site, we calculated the peptide ratios of - / - or - (θ secretase/α secretase products) and - / - or - (bace aβdp/ α secretase products). we observed that t organoids produce statistically highly significant increases in all four of these ratios, relative to isogenic d , or non-isogenic dupapp organoids (fig. b) . therefore, we conclude that t causes these effects in our organoid system. the d ratios were not significantly different to dupapp, suggesting that the third copy of genes other than app causes these effects, though this needs to be tested on a larger number of individuals. as the peptide profiling data strongly favour the hypothesis of a genetic-dose-sensitive anti-amyloidogenic action of bace , we sought to zoom in on the bace genetic locus in a systematic snp-array analysis of individuals recruited through the londowns consortium who had undergone detailed assessment for dementia , ; single nucleotide polymorphisms (snps) located within the bace locus +/- kb were genotyped, and dementia age-of-onset determined, as described in methods. we detected two new bace snps (purple, supplementary fig. ) correlating with age of dementia onset in the ds cohort of the londowns consortium, located in close proximity to a previously reported snp (red, supplementary fig. ) . all of these snps cluster in a < kbp segment, which is fully contained within a kbp deletion (blue line, supplementary fig. ) that caused a de novo eoad in a euploid patient ( supplementary fig. ) . these data corroborate the notion that subtle genotypic variation in bace levels may play an important role in affecting the age of dementia onset in both ds and non-ds individuals. in order to assess if the peptide ratio differences from fig.
b have any relevance in vivo, we analysed the aβ-peptide profiles immunoprecipitated from human cerebrospinal fluid (csf). we have previously produced ip-ms data on csf from people with ds and age-matched controls . we repeated the calculations shown for organoids in fig. b , on ip-ms results from csf samples from ds (n= ) and age-matched euploid people (n= ). all four relative ratio calculations showed an increase in peptide ratios in csf from people with ds, compared to age-matched euploid controls, of which three comparisons were statistically highly significant (fig. c ). this suggests that in ds brains, the third copy of bace skews the anti-amyloidogenic processing significantly towards bace -cleavages, relative to other anti-amyloidogenic enzymes cleaving at the α-site. importantly, these csf results validate the in vivo relevance of the peptide ratios obtained using cm from ipsc-derived cerebral organoids (comparison of fig. b and fig. c ). chemical inhibition of bace remains an attractive therapeutic strategy for ad. as bace is a homologous protein, most inhibitors tested in clinical trials also cross-inhibit the (proamyloidogenic) β-secretase activity of bace , which has been proven as the cause of several unwanted side-effects, such as skin pigmentation changes. as our data suggest that the opposite, aβ-degrading, activity of bace plays an important role, we designed a new fret-based in vitro assay, in which efficient aβdp-cutting after aβ aa by bace at ph= . could be measured (fig. ), while zero activity by bace was detectable under the same conditions (supplementary information). we demonstrated that at least two bace inhibitor compounds (one of which was recently used in clinical trials) inhibit the aβdp (aβ-clearance) activity of bace in a dose-dependent manner (fig. ).
this has, to our knowledge, so far not been shown, and could provide an additional explanation for the failure of some bace -inhibitor clinical trials, and should be taken into consideration when testing new inhibitors. as in vitro experiments showed that bace can very efficiently cleave the aβ -site in the fret peptide (fig. ) and synthetic aβ - peptide in solution at an acidic ph , we sought to visualize if the presence of the substrate (aβ - ), enzyme (bace ) and one of the products of this reaction (aβ - ) can be detected in our organoids, in a sub-cellular compartment known to be acidic. firstly, by immunofluorescence (i.f.) using pan-anti-aβ ( g ), anti-bace , or neoepitope-specific antibodies against aβx- and aβx- , we detected significantly higher signals (normalized to pan-neuronal marker) in t organoid neurons, compared to isogenic d ones ( supplementary fig. b-d) . pearson's coefficient showed a high level of colocalisation (> . ) of both the main substrate (aβx- ) and its putative degradation product (aβx- ) with bace in neurons of cerebral organoids, in the lamp + compartment (known to be a subset of lysosomes, and therefore low-ph vesicles) ( fig. & supplementary fig. ). in comparison, the pearson's coefficient for bace with aβx- was only . ( fig. & supplementary fig. ) , and its pattern of sub-cellular localization was different to bace (high colocalization with rab and sortilin, much lower with lamp ). using i.f. on human brain sections, a similar highly significant difference was observed (fig. a, b) : aβx- colocalised with bace ( . (± . sem)) as opposed to bace ( . (± . sem)). the colocalised signal of aβx- and bace was seen in categories of objects ( fig. ) , in all analysed samples: individual ds-ad brains ( fig. a-c) , euploid sporadic ad subjects (example in supplementary fig.
a , for complete list of brain samples see supplementary table ) and (in the fine vesicle compartment only) in non-demented control euploid subjects' neurons (age - ), as well as ds brain from a yr old with no plaques or dementia, (examples in fig. d , for complete list of brain samples see supplementary table ) . lambda scanning and sudan black b stainings were independently used to subtract the autofluorescence of lipofuscin granules ( supplementary fig. f, g) . this has proven that the fine vesicular pattern and large amorphous extra-cellular aggregates are not autofluorescent lipofuscin granules, but real colocalisations of bace and aβx- ( supplementary fig. ). colocalised signals of aβx- and bace were particularly strong in areas surrounding neuritic plaques (fig. a-c) . as aβdp cleavage by bace is efficient only at low ph, we sought to analyse in more detail the bace and aβx- co-localisation in highly acidic cellular compartments. for this reason, we costained lysosome markers lamp or lamp with aβx- . additionally, macroautophagic vacuoles containing aβ were shown to accumulate in ad distended neurites , which is why we also stained with the macro-autophagosome marker lc a. as we further found that aβx- did not colocalise with lamp or lc a, but colocalised strongly with lamp (fig. , supplementary figure and supplementary information), we tested colocalisation with the components of an alternative autophagy pathway: chaperone-mediated autophagy (cma), and found a very high level of colocalisation (fig. ) . using crispr/spcas -hf , we eliminated a single copy of bace in the trisomic ipsc line c (t c ∆ , a ∆ bp in bace exon , knocking out of copies of the gene), while maintaining the trisomy of the rest of chromosome ( fig. a -c, supplementary fig. , supplementary information). total actin-normalised bace signal showed a %- % reduction in Δ compared to t unedited line, and no significant difference compared to d control (fig. c, supplementary fig. ). 
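The actin-normalised comparison reported above can be sketched as follows. This is an illustrative outline only, assuming each gel carries one edited and one unedited sample; the function names and all band intensities are hypothetical and do not come from the study.

```python
from statistics import mean


def actin_normalised(target_signal, actin_signal):
    """Divide the band intensity of the protein of interest by its actin loading control."""
    if actin_signal <= 0:
        raise ValueError("actin signal must be positive")
    return target_signal / actin_signal


def fold_changes(gels):
    """Fold change of edited vs unedited sample for each gel.

    Each entry is (edited_target, edited_actin, unedited_target, unedited_actin),
    with both samples run on the same gel; the unedited value acts as the reference (1.0).
    """
    result = []
    for edited_target, edited_actin, unedited_target, unedited_actin in gels:
        edited = actin_normalised(edited_target, edited_actin)
        unedited = actin_normalised(unedited_target, unedited_actin)
        result.append(edited / unedited)
    return result


# Hypothetical densitometry values (arbitrary units) for two gels.
gels = [
    (40.0, 100.0, 60.0, 100.0),
    (35.0, 70.0, 75.0, 100.0),
]
per_gel = fold_changes(gels)
mean_fold_change = mean(per_gel)
percent_reduction = (1.0 - mean_fold_change) * 100.0
```

Pairing samples from the same gel before averaging keeps gel-to-gel differences in exposure or transfer efficiency out of the fold-change estimate.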
total protein level of app in ∆ remained at trisomic levels, significantly increased compared to the disomic control ( supplementary fig. ). the crispr correction of bace gene dose from to , resulted in a significant decrease in levels of putative bace -aβdp (aβ-clearance) products ( - & - ), as well as total bace -related non-amyloidogenic peptides ( - & - & - ), relative to amyloidogenic peptides (fig. d ). this pinpoints the triplication of bace as a likely cause of specific anti-amyloidogenic t effects we observed in fig. a . furthermore, we used two different dyes to detect any presence of amyloid deposits (the traditional thioflavine s, and a newer, more sensitive dye amyloglo ) in organoid sections. remarkably, elimination of the third bace copy caused the t organoids (that had not shown any overt amyloid deposits at div, see t c in supplementary fig. , top row) to develop extremely early ad-plaque like deposits (amyloglo+ and thioflavine s+) in the cortical part of the organoid by div ( supplementary fig. , middle row), that progressed aggressively and became much stronger and denser by div, accompanied by massive cell death ( supplementary fig. , bottom row, supplementary fig. ). in order to prove that extracellular deposits staining positively with amyloid dyes really are related to hyperproduction of aβ amyloidogenic peptides, we cultured t c ∆ organoids in media containing high concentrations of β and γ secretase inhibitors. early t c and t c ∆ organoids were treated with a combination of β-secretase inhibitor iv and compound e (γ secretase inhibitor xii) (supplementary table ) from div to div (fig. ). amyloid-like deposits were readily detected with amyloglo in the untreated and vehicle only treated t c ∆ organoids (fig. b ), but were completely absent from t c ∆ organoids treated with β and γ secretase inhibitors. 
inhibitor treatment also significantly reduced the number of neurons expressing pathologically conformed tau (tg -positive cells) in the t c ∆ compared to untreated controls (fig. c) . no amyloglo positive aggregates or tg -positive cells were detected in t c organoids under any treatment conditions at div (fig. a , c) and were also absent in the same organoids at div (fig. g, l, supplementary fig. ). also, no obvious deleterious effects of the inhibitors, or vehicle control, could be seen in early unedited t c organoids. further histo-pathological verification showed that elimination of one copy of bace triggered progressive accumulation of extracellular deposits that co-stain with thioflavine s and antibodies against aβ, both g and neo-epitope specific aβx- &aβx- . the antibody signal intensity in colocalisations with thioflavine s drastically increased upon pre-treatment with % formic acid (fig. a-d) , proving that the deposits contain insoluble aβ material. this is further corroborated by the isolation of fibrillary material from the detergent-insoluble fraction of the crispr-edited organoid. when viewed by transmission electron microscopy (tem) the filaments found exhibited a straight morphology of < nm diameter ( supplementary fig. a ), closely resembling fibrils grown in vitro from synthetic aβ - peptide (supplementary fig. c ). furthermore, neuritic plaque-like features were detected by ihc co-staining with gallyas in crispr-edited organoids (fig. m, n) , but not their unedited t control (fig. l) . human brain from an ad patient is shown for comparison stained with gallyas (fig. k) . tau pathology was also observed by ihc using the hyper-phosphorylated tau antibody at (fig. e , f), and by i.f. for conformationally altered tau (tg , fig. g -j). 
the relative increase in the amount of conformationally altered (pathological) tau in crispr-edited organoids t c Δ , compared to unedited t control organoids, was also independently confirmed by immunoblotting using tg antibody. as shown in fig. o , the protein material isolated from t c Δ organoids produced significantly more tg signal than unedited controls, albeit having a weaker signal with the general r-tau antibody (consistent with the observed neuronal loss, supplementary fig. ). our data in figs. , , show that severing the bace dose by a third, using crispr/cas , might tip the balance against the anti-amyloidogenic activity, and provoke ad-like pathology. our data in fig. suggest that anti-amyloidogenic activity of bace is gene-dose dependent, and its level varies between individuals, with snp allelic differences in bace gene correlating with age of dementia onset. we therefore hypothesized that organoids grown from some people with ds may develop ad-like pathology without any crispr-cas intervention. we then tested this hypothesis using ipsc lines from different individuals with ds, and one dupapp patient (table ) . we detected amyloid-like aggregates (both diffuse and compact in appearance) in / unedited ipsc-derived organoids from people with ds, and one with dupapp (fig. ) . the two donors whose ipsc-organoids did not show pathology are (i) the t ipsc from our isogenic model (whose clinical status is unknown) and (ii) qm-ds , a donor who remains free from dementia symptoms at age (table ). organoids from another ds donors, and one dupapp patient, (all diagnosed with clinical dementia) all showed presence of diffuse and compact amyloid-like deposits (fig. ) as well as presence of neuritic plaque-like features (focal hyper-phosphorylated tau (at +), conformationally altered tau (tg +) and filamentous tau (at +)) within neuropil neurites within plaque-like circular foci ( fig. a-n) . this was corroborated by gallyas intra-neuronal positivity ( fig. o-t) . 
as for t c Δ , we were able to isolate fibrillary material from the detergent-insoluble fraction of qm-dupapp organoid ( supplementary fig. b ), that on tem resembled fibrils grown in vitro from synthetic aβ - peptide ( supplementary fig. c ). most importantly: tested individual organoids from one donor (from multiple ipsc lines and multiple independent experiments) either all did (dupapp, qm-ds - ), or all did not (isogenic t , qm-ds ) show ad-like pathology (table ) , proving the pathology is donor dependent. this opens possibilities of developing assays for pre-therapy risk-stratification and individualized drug-response quantitation. several human brain studies show detectable expression and β-secretase activity of bace , though at much lower levels than that of bace [ ] [ ] [ ] [ ] . chemical inhibition of β-secretase activity is an attractive therapeutic approach aimed at reducing the production of aβ [ ] [ ] [ ] . complete knock-out of bace abolished all β-secretase activity in mouse neurons, while leaving some degree of β-secretase activity in astrocytes . this activity was abolished by the complete knockout of both bace and bace , leading to a hypothesis that a bace -driven β-secretase activity in astrocytes may contribute to accelerate the aβ-production and ad-pathology in ds . in human brain, the β-secretase activity of bace correlated positively with the amount of aβ, whereas the β-secretase activity of bace did not . on the other hand, snps at the bace locus (and not bace ) correlate with the age of onset of dementia in people with ds , as well as sporadic load in euploid people in the finnish population , and a recent report showed that a de novo intronic deletion within one allele of bace caused eoad in a year old euploid person . all of the above data (and new data we show in supplementary fig.
) imply that a single allele alteration in the genetic dose of bace is capable of affecting the risk of ad-dementia, but do not resolve the question whether bace per se acts predominantly as an accelerator, or a suppressor of ad pathology. the answer to this question requires clarification, as most chemical inhibitors used in clinical trials have dual activity against bace and bace , . the increased ratios of - & - (bace -aβdp) to the amyloidogenic and α-site products are among our most consistent and robust observations in t organoid cm and ds-csf ( fig. b- c). the - generating cleavage can only occur after the cuts by both β- and γ-secretases have released aβ, because the hidden transmembrane site between aa and aa is inaccessible to any proteolytic enzymes until the soluble aβ ( - to - ) molecules are released from the membrane , . therefore, the aβ - species can only be a product of an aβdp activity (a catabolic degradation or clearance of already made aβ - to - peptides). besides bace , the only enzymes with potential to cleave the peptide bond leu -met are bace , , and extracellular matrix (ecm) metalloproteinases (mmp and mmp ) , since no other aβ degrading enzymes (neither ide, nor nep, nor ece) are known to cleave at this site . bace action is unlikely to cause the increased ratios we observe, as bace can only generate this cut in solution at very high enzyme concentration and after prolonged incubation . to further corroborate this point, we designed a novel fret-assay and established the conditions in which bace can efficiently cleave at aβ site (fig. ). we also demonstrated that two bace inhibitors (β-secretase inhibitor iv -cas - - (calbiochem, originally a merck compound)), and ly (eli lilly compound recently used in clinical trials) both inhibit the aβdp activity of bace in vitro, while the γ-secretase inhibitor (dapt) had no effect.
this suggests that the aβdp activity (cutting the peptide bond leu -met ) has a different enzymatic preference, conditions, and ph, as compared to the classical β-secretase cleavage that both bace and bace are capable of. as fret assays cleaving this classical (before asp ) site are generally used to measure the bace inhibitors' selectivity for bace or bace , our data suggest that the degree of selectivity calculated this way for any given inhibitor does not necessarily reflect whether the same selectivity would apply to its cross-inhibition of the leu -met site cleavage (aβdp) activity. interestingly, the aβx- degradation product, both alone and co-localising with bace ( fig. ), shows elevated levels in cells and extracellular aggregates immediately surrounding neuritic plaques, suggesting bace degradation of not only newly produced aβ, but also of aβ that is released and re-deposited (from and to) existing deposits. a recent report on widespread somatic changes in individual neurons suggests an additional mechanism for the production of toxic aβ species, including products that do not require secretase cleavage , underscoring the importance of efficient aβ degrading mechanisms that protect from ad, such as the one exerted by bace that we describe here. a recent mouse model has shown that introducing a third dose of chromosome to a mouse that over-expresses aβ several hundred fold worsens the amyloid plaque load, and this correlates with an unexpected decrease in the aβ / ratio . this unfavourable ratio effect (the cause of which is unknown) is expected to worsen the plaque load and ad pathology, and a mere . x increase of bace dose in this mouse model has no chance of protecting the mouse against a > x overload of aβ.
in another mouse model, where transgenic bace was artificially overexpressed together with transgenic wtapp, it actually decreased aβ and to the wt mouse control levels, and the presence of bace transgene reversed behavioural pathologies seen in tgapp mouse . this indicates that a balance of doses of app and bace affects levels of soluble aβ and , and their oligomerization and aggregation as a consequence. our results in figs , and further corroborate that a significant disturbance of this balance by a reduction in bace copy number is sufficient to cause an early ad-like pathology in t cerebral organoids. we did not see any amyloid plaque-like structures at > div organoids from three independent t ipsc lines (or normal disomic lines) of our isogenic system (supplementary figs. , , , , , , figs. , , ) . surprisingly, crispr/cas elimination of the third copy of bace in the same t line caused widespread amyloglo+ deposits at div, and widespread neuritic plaque-like structures with profound neuron loss ( supplementary fig. , ) and tau pathology at div (figs. , ). our data in figs. and supplementary fig. suggest that anti-amyloidogenic activity of bace is gene-dose dependent, and its level varies between individuals, with snp allelic differences in bace correlating with age of dementia onset. we therefore hypothesized that organoids grown from some people with ds may develop ad-like pathology without any crispr-cas intervention. diffuse amyloid plaque-like appearance with tau pathology was recently reported in days old cerebral organoids from only a single ds-hipsc line so far. we subsequently analysed ipsc-derived organoids at approximately the same cell culture age from a total of different individuals with ds and one with dupapp. we found flagrant ad-like pathological changes in / ds tested ( %), as well as the one dupapp.
very interestingly, when this assessment was repeated in independent experiments, and when individual organoids from a single experiment were compared, it was a black/white picture: either they all had ad-like pathology, or none did, driven solely by the genotype of the donor (table ). our data, though not conclusive, are illustrative of the stratifying potential of this technology. for example, the cerebral organoids from individual qm-ds showed the worst ad-like pathology with fibrillary amyloid deposits ( fig. f , i, j, table ), and this individual was diagnosed with dementia at age . in contrast, organoids from individual qm-ds showed no pathology (fig. b , table ), and this individual was also dementia-free at age . this opens up possibilities for finding correlations with clinical parameters, for which a much larger number of individuals would have to be tested. to confirm that the amyloglo deposits were in fact aggregated β-amyloid containing material, early organoids were treated with a combination of β-secretase inhibitor iv (βi-iv) and gamma secretase inhibitor xii (compound e) (fig. a, b) . the combination of these inhibitors should prevent any production of aβ, and therefore eliminate amyloglo positivity. after treatment for days, the inhibitor treatment did indeed prevent the formation of plaque-like deposits within t c Δ organoids, confirming that such deposits are comprised of β-amyloid. the same treatment conditions also significantly reduced the number of tg -positive cells in t c Δ organoids (fig. c) , highlighting the ability to modulate both amyloid and tau pathology in the cerebral organoid system. this also demonstrates the feasibility of using this ad-like organoid pathology in future hypothesis-free drug screens for chemical compounds that may prevent/inhibit amyloid production or aggregation.
in view of our results, it becomes inviting to hypothesize that triplication of bace may be the cause of the delayed onset of dementia in % of people with ds compared to dupapp , and (because of the predicted abundance of bace mrna in endothelial cells) also the cause of a significantly lower degree of cerebral amyloid angiopathy (caa) in the brains of people with ds compared to those of dupapp . our organoid system is not informative in this regard, as we could not detect any endothelial cells in our organoids (not shown). this, however, is also an advantage, as it allows uncovering the mechanisms that are specific to neurons in the absence of endothelial or blood cell derived tissue components. in neurons, a recent report also found that an increased app dose may act (through an unknown mechanism) as a transcriptional repressor of several chromosome genes, including bace . this observation needs further verification and mechanistic explanation, but if true, it would imply that the protective effect of the third copy of bace in ds that we observe is actually quenched by the third copy of app, which opens up possibilities of chemically intervening to inhibit this transcriptional repression and potentially unleash a much greater degree of bace protection. an integration of the two observations (the one in and the one in our report) suggests this could be exploited as an additional new protective/therapeutic strategy for ad in general. we found, surprisingly, an equally high or higher level of colocalisation of aβx- with lamp a, as with the general lamp (fig. ) . the high level of colocalisation with lamp a and absence of colocalisation with either lc a or lamp (fig. ) , suggest that aβdp activity of bace that generates aβ is not related to classical lysosomal degradation or macroautophagy, but rather could be related to a cma-like process , . 
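The colocalisation levels discussed here are expressed as Pearson's coefficient between two fluorescence channels. The sketch below shows that calculation on flattened pixel intensities in plain Python; it is not the image-analysis software used in the study, and the function name and intensity values are hypothetical.

```python
from math import sqrt


def pearson_coefficient(channel_a, channel_b):
    """Pearson's correlation between paired pixel intensities of two channels."""
    if len(channel_a) != len(channel_b) or len(channel_a) < 2:
        raise ValueError("channels must have the same length (at least 2 pixels)")
    n = len(channel_a)
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    covariance = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return covariance / sqrt(var_a * var_b)


# Hypothetical intensities for five pixels in a region of interest.
enzyme_channel = [10, 20, 30, 40, 50]
product_channel = [12, 22, 29, 41, 52]    # rises and falls with the enzyme signal
unrelated_channel = [50, 40, 30, 20, 10]  # runs opposite to the enzyme signal

r_colocalised = pearson_coefficient(enzyme_channel, product_channel)
r_anticorrelated = pearson_coefficient(enzyme_channel, unrelated_channel)
```

Values close to 1 indicate that the two signals vary together pixel by pixel, while values near 0 or below indicate little or inverse spatial relationship.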
the only published study that linked cma with app processing found a motif that satisfies the criteria for a cma-recognition kferq motif at the very c-terminus of app (kffeq), and this paper demonstrated that c (β-ctf) can bind hsc . however, paradoxically, when this motif is deleted from the β-ctf, the binding to hsc is not abolished, but rather increased, suggesting the presence of another, alternative cma-recognition motif within the β-ctf peptide . the association of the aβdp x- product with lamp a/cma compartment is a provocative new observation that requires further studies. in conclusion, we found that relative levels of specific non-amyloidogenic and aβdp (aβ-clearance) products are higher in t organoids and ds-csf, and they respond to the dose of bace (and not app). we also demonstrated that bace -aβdp activity generating one of these products can be cross-inhibited in solution by recently clinically tested bace -inhibitors. all components of the aβdp degradation reaction (hitherto only demonstrated in solution in vitro), namely the main substrate (aβx- ), the enzyme (bace ), and its putative degradation product (aβx- ), were found highly colocalised in discrete intracellular vesicles in human brain neurons (and not astrocytes), suggesting that at least some of the aβdp activity generating aβx- takes place intra-neuronally and physiologically during lifetime, before the onset of ad pathology, in both normal and ds brains. furthermore, we directly demonstrated that the third copy of bace protected t -hipsc organoids from early ad-like amyloid plaque pathology, therefore proving the physiological role of bace as an ad-suppressor gene. the θ-secretase anti-amyloidogenic cleavage and the aβdp degradation actions of bace could both be contributing to an overall ad-suppressive effect.
regardless of the contribution of each of these modes of action, our combined data suggest that increasing the action of bace could be exploited as a therapeutic/protective strategy to delay the onset of ad, whereas cross-inhibition of bace -aβdp activity by bace -inhibitors would have the unwanted worsening effects on disease progression. we also show that cerebral organoids from genome-unedited ipscs could be explored as a system for pre-morbid detection of high-risk population for ad, as well as for identification of natural dose-sensitive ad-suppressor genes. human subjects were participants in the "the london down syndrome consortium table . upon specific informed consent, three to six individual strands of hair were non-invasively plucked from the scalp hair of donor subjects, and placed in transport medium [dmem (sigma d ), mm glutamine (sigma g ), x pen/strep (sigma, p ), % foetal calf serum]. upon arrival at the lab, hair follicles were placed in collagen coated t flasks in kgm medium (lonza cc- ) and incubated at °c, % co . primary keratinocyte cultures were split after reaching - % confluency using . % trypsin/ . % edta. primary keratinocyte cultures were expanded to % confluency, electroporated with plasmids encoding reprogramming factors in episomal vectors (non-integrational reprogramming), and (life technologies) supplemented with penicillin/streptomycin. passaging was carried out using relesr and μm rock inhibitor was included in culture media for hours after passaging. cerebral organoids. cerebral organoids were generated following the standard protocol with the following changes . ipsc lines were first transitioned into feeder free conditions using either mtesr or e media with geltrex. to form embryoid bodies (ebs), hipscs were washed once with pbs, then incubated with gentle cell dissociation solution (stemcell technologies) for mins. this solution was then removed and accutase added and incubated for a further mins.
mtesr /e medium at double the volume of accutase was added to the cells and a single cell suspension generated by triturating. cells were centrifuged to remove accutase and then resuspended in hesc medium supplemented with ng/ml fgf and micromolar rock inhibitor. cells were used to form a single eb in each well using either a v shaped ultra low attachment well plate (corning). specifically, ipscs were allowed to form embryoid bodies (ebs) in suspension by culturing for days in hesc medium with low fgf, in non-adherent culture dishes. after - days, ebs were transferred into a well ultra low attachment plate for neural induction. neural induction was achieved by culturing for a further - days in dmem-f supplemented with % of each: n , glutamax and mem-neaa, plus μg/ml heparin. neurally induced ebs showing neuroectodermal "clearing" in brightlight microscopy were embedded in matrigel droplets, and transferred to cm dishes containing organoid differentiation medium-a, (for - days), followed by organoid differentiation medium+a . organoid maturation was carried out with - organoids per cm dish on an orbital shaker at °c, % co . aliquots of conditioned medium (cm) were collected from mature organoids ( - days old from day of eb formation), - days after feeding (to allow time for cells to secrete products into the culture media). three completely independent experiments were carried out, each time starting from the undifferentiated ipsc stage, and cm was collected at - timepoints in each experiment. cm was immediately frozen and stored at - °c. for inhibitor treatment, organoids were treated from div ( days after embedding in matrigel) to div. βi-iv and compound e were added freshly to the media before use at final concentrations of . μm and nm respectively. media was replaced every - days during treatment. dmso of the same volume was used as a vehicle only control. cm from organoids was analysed by ip-ms, using a previously described method .
the team performing the ms was blinded to the genotypes in all experiments. in exp , all three independent trisomic lines (t c , t c , and t c ) were compared to two independent disomic lines (d c and d c ), whereas in exp , two independent trisomic lines (t c and t c ) were compared to two independent disomic lines (d c and d c ). in exp , a t c line was compared to the isogenic d c line, and to hipsc lines from unrelated individuals: a dupapp feoad patient (qm-dupapp), and two unrelated adult people with ds (qm-ds and qm-ds ). in all experiments, ip-ms results for all ipsc lines that were used in a particular experiment are shown. ip-ms results were used to calculate the relative ratios of peptides and these ratios were taken as data points for the statistical comparisons. ip-ms spectra were also obtained from the csf samples of people with ds and age-matched normal controls. peak ratios calculated as described above. the cohorts, methods and spectra behind these data were previously described . . fish on organoid cryosections was performed as described . briefly, slides were rinsed in pbs, rehydrated in mm sodium citrate buffer and incubated in the same buffer at ⁰c for min. slides were cooled down and incubated in x saline sodium citrate (ssc) for min and in % formamide in x ssc for h. after incubation slides were covered with previously prepared hybridization chamber and incubated with μl of fig. ) . western blot. for western blots, whole cell lysates of crispr edited or unedited ipscs (fig. c ) or organoids (fig. o) were separated in a % acrylamide gel by sds-page and transferred to a nitrocellulose membrane according to the manufacturers protocols (bio-rad). following a min incubation in % non-fat milk in tbs-t the membrane was incubated with primary and secondary antibodies (supplementary tables , ) . for the stainings shown in fig. o quantitations were done strictly on the same membranes re-stained using the antibodies shown. 
for the protein of interest (bace or tg ), the signal was adjusted to corresponding βactin loading control for all samples. such adjusted values for unedited c (wt) (n= ) were set to , and used to calculate the fold change for c Δ (n= ) replicates, and the resulting fold-change values for pairs run on the same gel were averaged and analysed by student's t-test. membrane stripping between stainings was carried out using thermo-fisher stripping solution, following manufacturer's instructions. amyloglo and thioflavine s staining. for amyloglo staining, oct embedded slices were rinsed with pbs, and incubated in % ethanol for min at rt, followed by washing with milli q water for min at rt. slices were then incubated with amyloglo solution for min in the dark at rt, followed by washing in . % saline solution for min at rt, and counterstaining with draq for min at rt. thioflavine s staining was performed as described supplementary fig. a ), pre-incubation with bace specific immunogenic peptide ( supplementary fig. b-e) and lambda (λ) scan function on confocal microscope ( supplementary fig. f, g) . three different samples (ds-ad , ds ( yrs) pre-ad and euploid sporadic ad ( yrs) after ihc were stained with . % sudan black b in % ethanol for min at rt and analysed on confocal microscope with aiyrscan. sample ds-ad was stained with antibodies solution, hrs pre-absorbed with bace specific immunogenic peptide, and analysed on confocal microscope and slide scanner. lambda scan records a series of individual images within a defined wavelength range (in our case from nm to end of spectrum) and each image was detected at a specific emission wavelength, at nm intervals. for lambda scan analysis, samples were stained with one primary antibody and labelled with far-red secondary antibody ( ). as negative control, we used secondary antibody ( ) alone and, as additional negative control, one sample was counterstained with dapi only, without secondary antibody. 
as we used a far-red ( ) antibody, we analysed expression from nm to the end of the spectrum at nm intervals. aβx- and bace antibodies showed specific peaks, significantly over and above the autofluorescent signal, in all three specific roi indicated in fig. ). gallyas staining. for gallyas staining, samples were deparaffinised and/or rinsed in pbs, then treated with ammonium-silver nitrate ( . g nh no , . g agno , . ml % naoh) solution for min protected from the light, rinsed with . % acetic acid ( x min) and placed in developer solution for - min. developer solution was made from three stock solutions: ml of solution a ( g na co + ml distilled water), . ml of solution b ( g nh no + g agno + g tungstosilicic acid hydrate + ml distilled water) and . ml of solution c ( g nh no + g agno + g tungstosilicic acid hydrate + . ml % formaldehyde solution + ml distilled water). after the developer solution, samples were rinsed in water and placed in destaining solution ( g k co + g edta-na + g fecl + g na s o + g kbr + ml distilled water). finally, samples were rinsed two times in . % acetic acid. after staining, samples were rinsed in water, dehydrated in a graded series of ethanol, cleared in histo-clear and mounted with histomount mounting medium. samples were scanned by shown for the lack of space, data available on request). the genome integrity of the isogenic ipsc lines was previously published (but was repeated here as described above). no additional rearrangements due to re-programming or passaging were observed. bace locus snps: the cohort of people with ds has been described in recent reports , . in brief, participants donated dna samples and had detailed cognitive and clinical assessments to determine dementia status . age of dementia diagnosis was established and used in snp analysis. bace snp genotyping for the londowns cohort was undertaken as previously supplementary fig. 
) were nominally associated with aoo in the londowns cohort, but were not significant after correction for multiple testing. quantitative paralogous amplification-pyrosequencing was carried out based on the published method . this method takes advantage of the existence of identical sequences on chromosome and one other autosome, allowing amplification of both loci with a single primer pair. paralogous sequence mismatches in amplified products from chromosome (gabpa and itsn) can be quantified relative to their paralogous regions on chromosome and respectively. as such, trisomic cells show a : ratio for the paralogous sequence, while disomic cells produce a : ratio. primers used are listed below, and pyrosequencing was performed on the pyromark q machine (qiagen) following standard procedures. crispr/spcas -hf cas editing of the bace locus. the guide-rna (grna) targeting bace exon was cloned into a vector containing the high fidelity spcas -hf and blasticidin s resistance gene. the complete plasmid was delivered via lipofectamine to a trisomic ipsc line t c (full official name nizedsm it -c ), which was described and characterized in a previous report . untransfected ipscs were removed by treatment with blasticidin ( μg/ml for h). individual colonies were picked and further sub cloned by limiting dilution to achieve clonal cell lines. dna was purified from individual clones, pcr amplified and sequenced by sanger sequencing. sequences were analysed in mutation surveyor (v . . ) and "tracking indels by decomposition (tide)" (tide v . . , desktop genetics). tide analysis of the crispr-targeted clone . . dna sequence gave a score of % of the wt read remaining (not shown). the quality of the grna was assessed using two different prediction software platforms: cctop online software , and the mit online platform (http://crispr.mit.edu/). the same two software platforms were used to predict the off-target sites. neither platform found any off-targets with , or mismatches. 
the top cctop-predicted sites were pcr amplified in both Δ and wt clones, then sequenced by sanger sequencing to rule out off-target events. no differences in the sequence were found. protein isolation from cortical organoids. organoids were collected at specified durations in culture (expressed as days in vitro (div)) and washed twice with ice-cold pbs. the samples were resuspended in ice-cold np- buffer ( mm nacl, % np- , mm tris ph ) containing edta-free protease inhibitors (complete cocktail, roche) and lysed using a ml tissue homogenizer (fisher). each sample was centrifuged at , rpm for minutes at ˚c and the homogenates were stored at - ˚c. protein concentration was determined using the bicinchoninic acid method (bca, pierce). (tem). organoids were lysed following the same procedure as for protein extraction; however, samples were initially spun at , g for minutes at ˚c. following the first centrifugation, supernatants were removed and kept on ice. the remaining cell pellets were resuspended in x weight/volume buffer ( mm tris-hcl ph . , . m nacl and % sucrose) containing protease inhibitors and spun at , g for minutes at ˚c. an equal volume of supernatant was added to the supernatant from the second centrifugation step. % n-lauroylsarcosinate (weight/volume) was added and the samples were rocked at room temperature for one hour. the samples were ultra-centrifuged at , g for one hour at ˚c. the supernatant was decanted and the sarkosyl-insoluble pellet was resuspended in ice-cold pbs prior to imaging. the samples were deposited onto glow-discharged mesh formvar/carbon film-coated copper grids, negatively stained with a % aqueous (w/v) uranyl acetate solution and then immediately analysed at kv using a jeol tem equipped with a gatan orius camera. tem analysis of synthetic aβ - fibrils in vitro. synthetic aβ peptide powder (china peptides) was treated with , , , , , -hexafluoro- -propanol (hfip) and lyophilized. 
the peptide was then dissolved in µl of mm naoh and then diluted with buffer. a µm stock of this monomeric aβ peptide was grown at ˚c shaking at rpm for - hours before recording the tem images. µl of extract was added to a nm thick, lacey carbon on mesh grid (glow-discharged) for minutes followed by negative staining with % uranyl acetate for minute and then air dried. the grids were then viewed under fei t , kv transmission electron microscope equipped with a k ccd camera (fei) at x magnification under low dose conditions. all data that support the findings described in this study are available within the manuscript and the related supplementary information, and from the corresponding authors upon reasonable and after digestion with hpych iv(cut), for the initial clone . , and its colony-purified sub-clone . . (renamed further below as "Δ "). the bp fragment in . . is reduced to % of the wt value (normalized to the bp band), and a de novo bp fragment appears in crspr targeted line (red asterisk). c western blot stained with anti-bace antibody of the lysates of the ipsc line Δ compared to the wt t c ipsc line. quantification of the total actin-normalised bace signal showed a significant reduction in Δ compared to tau. β-actin was used as a loading control. human brain tissue of a year old is shown for comparison. comparison of the average values (n= ) for crispr-edited t c Δ showed a highly significant relative increase in tg compared to unedited (n= ) t c organoids, as indicated in the graph, p= . . scale bar: μm. 
quantification of the total actin-normalised app signal showed no significant difference between Δ and unedited t line, whereas they both had significantly higher app protein levels compared to the disomic control line. error bars: standard error, p-values after standard one way anova and tukey's multiple comparisons test. staining with amyloid specific dye (amyloglo) and nuclear dye (draq ) supplementary fig. . 
cell death and neuronal loss in crispr-edited t c Δ organoids number of dapi+ nuclei are shown in the volume of µm . graph show decreased number of nuclei in crispr-edited t c Δ (div ) organoids compared to parental t c organoids and significantly decreased number of nuclei in div organoids (p< . ) electron micrographs of negatively stained filaments isolated from insoluble fraction of the ad-like pathology containing organoid lysates. a, b representative straight filaments found in the lysates from the organoids t c Δ and qm-dupapp, respectively. c aβ - synthetic peptide fibrils grown in vitro secondary antibody alone controls for organoid immunostaining. dapi staining confirms the presence of cells, but no unspecific signal from secondary antibodies both antibodies show the same pattern of expression and colocalisation after sudan black b staining (white arrows: intraneuronal fine-vesicular pattern and black arrows with white arrowhead: amorphous extra-cellular aggregates) except for a loss of the large intraneuronal spherical granules (white arrowheads, fig. ), which are likely lipofuscin. scale bar: μm. b and c chromogenic, immunohistochemical analysis of the human brain sections of ds-ad , stained using polymer-hrp/ap doublestaining kit. b the primary antibody against bace was labelled with dab (brown) b(i) is a zoomed-in inset of the rectangle in b. c same as b, but both antibodies were pre-absorbed for hours, and incubated overnight, with the excess of immunogenic peptide for the bace antibody; c(i) is a zoomed f and g in order to distinguish the contribution of lipofuscin autofluorescence to the colocalised signals, specificity of primary antibodies (aβx- and bace ) has been validated using lambda (λ) scan function on confocal microscope (see methods). f aβx- shows specific peak in different roi as negative control of staining, dapi and secondary antibody alone were used. g bace also shows specific peak in different roi and uniform pattern in human brain. 
h secondary antibody alone control supplementary fig. . crispr/spcas -hf -mediated reduction of bace copy number from to in the t c hipsc line, reduced bace protein expression to disomic levels, but does not alter the level of app protein. western blot stained with anti-bace antibody or anti-app antibody of the lysates of the ipsc line Δ compared to the wt t c , and d c ipsc lines. quantification of the total actin-normalised bace signal showed a % reduction in Δ compared to t unedited line, and no significant difference compared to supplementary fig. . cerebral organoids express cortical neuronal layer-specific and astrocyte markers. supplementary fig. . comparison of the proportions of neurons and astrocytes to total cells in cerebral organoids. isogenic d and t cerebral organoids generated mostly neurons and a small proportion of astrocytes, with no differences in the proportion of astrocytes or neurons in d compared to t . similar proportions were also detected in organoids from dupapp, qm-ds and qm-ds ipscs. fig. . two new single nucleotide polymorphisms in bace intron correlate with age-of-dementia-onset among individuals with ds, and co-localize with a de novo deletion causing non-ds eoad. supplementary fig. . aβx- colocalises with bace much more than with bace in t cerebral organoids. supplementary fig. . validation of crispr-edited ipscs by snp array and paralogous-loci-amplification-quantitative pyrosequencing. supplementary fig. . crispr/spcas -hf -mediated reduction of bace copy number from to in the t c hipsc line, reduced bace protein expression to disomic levels, but does not alter the level of app protein. supplementary fig. . staining of extracellular β-amyloid deposits in organoids with two different methods. related to fig. : fig. a : variability between individual ipsc lines (representing individual re-programming events) was tested by anova in exp , where all independent trisomic lines of our isogenic model were used in a single experiment. 
no significant differences between individual lines were found in any of the calculations shown in fig. , demonstrating that our peptide-ratio-readout parameter is driven by the genotype, and not re-programming artefacts or culture history of the ipsc lines (data did not fit the allowed space, available on request).as peptide-ratio readouts differed slightly between three independent experiments, we are showing complete data here for each experiment individually. as shown in fig. a , the difference (or the absence of difference) caused by t in an isogenic comparison remained stable in each of experiments. in exp , for the ratio of - /amyloidogenics, the isogenic comparison of t v d showed a p= . ( -tailed t-test), which dropped to p= . after anova comparison with all individual samples. also in exp , we further performed an analysis by genotype groups. for the aβdp/amyloidogenics ratio, the combined t samples (n= ) were significantly higher than d (anova p= . ), and significantly higher than dupapp (anova p= . ), whereas d is not significantly different from dupapp. the same result was obtained for the total bace /amyloidogenics ratio: combined t (n= ) v d , anova p= . ; combined t (n= ) v dupapp, anova p= . , and d v dupapp shows no significant difference. the comparison of α-site cleavages ( - & - )/amyloidogenics never showed any significant difference irrespective of how the samples were grouped. fig. : fig. : the fret assay positive control was performed using recombinant human bace at ⁰c, ph= . for h in the r&d systems assay buffer, as specified in the manufacturer's protocol, using the r&d systems fret control peptide (es ). in three technical replicates the blank-subtracted raw fluorescence readings obtained were , (± sem). bace with the new fret peptide for the aβdp cleavage after aa (in the absence of any inhibitors) gave blank-subtracted readings , (± sem). this was taken as the % value for the graphs shown in fig. . 
for comparison, bace incubated with the same fret peptide, using the manufacturer's assay buffer for bace , gave the readings of (± sem) in the same experiment. fig. and supplementary fig. : we compared the degree of colocalisation between either bace or bace , and aβx- clearance product in organoids, along with other markers of intra-neuronal compartments: flotillin (general marker of lipid rafts), rab (late endosome marker), sortilin (a major apoe receptor linked to aβ catabolism), and lamp (one of the lysosomal membrane proteins often used to visualize lysosomes in studies of aβ-processing). both bace and bace , as well as aβx- highly colocalised with flotillin , suggesting that this type of aβ degradation takes place in lipid raft containing vesicles ( fig. and supplementary fig. ). however, bace and bace differed in vesicular sub-compartment distribution: bace was highly colocalised (> . ) with each sortilin and rab and only weakly with lamp ( . ), whereas bace did not co-localise with sortilin(< . ), but colocalised moderately with rab ( . ) and highly with lamp (> . ) (supplementary fig. ) . interestingly, the localisation of the aβx- fragment closely resembles the pattern of bace , and not of bace : (pearson coefficient of . with each sortilin and rab , and > . with lamp ), further supporting the observation of aβx- (> . ) localisation with bace and less so with bace , in both organoids ( fig. and supplementary fig. ) and human brain (fig. ) . in order to define the compartment with the highest concentration of aβx- within the endo-lysosomal system more precisely, we co-stained the aβx- neoepitope-specific antibody with other markers associated with aβ processing: lc a (macroautophagosome marker), eea (early endosome marker) and lamp (a classical lysosome marker). surprisingly, none of these markers showed any colocalisation, demonstrating that aβx- is not present in either early endosomes, macro-autophagosomes, or classical lysosomes (fig. ) . 
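the colocalisation values quoted above are pearson coefficients between two fluorescence channels. as an illustration only, the following minimal sketch computes such a coefficient from two equally sized lists of pixel intensities. the function name and the intensity values are hypothetical and are not taken from this study; the image-analysis software actually used is not stated in this section, so this should not be read as the authors' pipeline.

```python
import math


def pearson_coefficient(channel_a, channel_b):
    """Pearson correlation between two equally sized lists of pixel intensities."""
    if len(channel_a) != len(channel_b):
        raise ValueError("both channels must contain the same number of pixels")
    n = len(channel_a)
    if n < 2:
        raise ValueError("at least two pixels are required")

    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n

    # Sums of products of deviations from each channel mean.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)

    if var_a == 0 or var_b == 0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Hypothetical intensities for the same five pixels in two channels.
    marker_channel = [10, 40, 80, 120, 200]
    peptide_channel = [12, 35, 90, 110, 210]
    print(round(pearson_coefficient(marker_channel, peptide_channel), 3))
```

a coefficient near 1 indicates that the two signals rise and fall together across pixels, a value near 0 indicates no linear relationship, and real analyses would normally restrict the calculation to a thresholded region of interest.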
as aβx- did not colocalise with lamp or lc a, but colocalised strongly with lamp , we tested a colocalisation with the components of an alternative autophagy pathway that would be compatible with this pattern of colocalisations: chaperonemediated autophagy (cma). unexpectedly, we detected an extremely high level of colocalization of aβx- with both hsc (chaperone in cma) and lamp a, (the isoform of lamp that is the main protein controlling the levels of cma activity) (fig. ) . some intraneuronal lamp a+ vesicles appear to contain both hsc and aβx- (fig. ) . these data suggest that aβdp activity of bace is linked with the cma pathway. fig. : fig. a -d: as immunofluorescence on brain sections is susceptible to bright and false positive autofluorescent signals from lipofuscin granules, we confirmed the colocalisation of aβx- and bace using non-fluorescent, chromogenic dual labelled immunohistochemistry ( supplementary fig. b) , where the specificity of the bace antibody was further verified by pre-absorption control with the immunogenic peptide ( supplementary fig. c ). this method confirmed the intra-neuronal co-localization of aβx- and bace signals. the bp deletion causes a frameshift at aa of bace protein sequence. this introduces a stop codon within the protease cleavage domain at aa . the potential off-target effects of the crispr guide rna used were tested using two prediction software tools: cctop and http://crispr.mit.edu/. no target sequences were found with , or mismatched nucleotides. no targets, that had three or more mismatches were overlapping between the two software predictions. in cctop, only two sites with three mismatches, and more sites with four mismatches were found. top loci from this prediction were amplified with the putative target sequence in the middle, and sequenced in the t wt ipsc compared to the ∆ ipsc line. no off-target effects of the crispr/spcas -hf intervention were detected. 
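the off-target screen above is phrased in terms of the number of mismatched nucleotides between the guide rna and candidate genomic sites. the sketch below shows that counting step in its simplest form, assuming equal-length sequences and ignoring pam requirements, bulges and position weighting. the guide and site sequences are invented for illustration, and the function names are hypothetical; this is not how cctop or the mit platform is implemented.

```python
def count_mismatches(guide, site):
    """Number of positions at which two equal-length sequences differ."""
    if len(guide) != len(site):
        raise ValueError("guide and candidate site must have the same length")
    return sum(1 for g, s in zip(guide.upper(), site.upper()) if g != s)


def sites_within_mismatches(guide, candidates, max_mismatches):
    """Return {site name: mismatch count} for sites at or below the threshold."""
    hits = {}
    for name, sequence in candidates.items():
        mismatches = count_mismatches(guide, sequence)
        if mismatches <= max_mismatches:
            hits[name] = mismatches
    return hits


# Hypothetical 20-nt guide and candidate sites (not from the study).
GUIDE = "GACCTGTTCGATCAGGTACA"
CANDIDATES = {
    "site_a": "TACCAGTTCTATCAGGTACA",  # differs from the guide at three positions
    "site_b": "CTGGTGTTCGATCAGGTACA",  # differs from the guide at four positions
}

if __name__ == "__main__":
    # List candidate sites with at most three mismatches to the guide.
    print(sites_within_mismatches(GUIDE, CANDIDATES, max_mismatches=3))
```

sites passing such a threshold are the ones that would then be pcr amplified and sanger sequenced in edited and unedited clones, as described above.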
key: cord- - uklf u authors: jiang, he-wei; li, yang; zhang, hai-nan; wang, wei; yang, xiao; qi, huan; li, hua; men, dong; zhou, jie; tao, sheng-ce title: sars-cov- proteome microarray for global profiling of covid- specific igg and igm responses date: - - journal: nat commun doi: . /s - - - sha: doc_id: cord_uid: uklf u we still know very little about how the human immune system responds to sars-cov- . here we construct a sars-cov- proteome microarray containing out of the predicted proteins and apply it to the characterization of the igg and igm antibody responses in the sera from convalescent patients. we find that all these patients had igg and igm antibodies that specifically bind sars-cov- proteins, particularly the n protein and s protein. besides these proteins, significant antibody responses to orf b and nsp are also identified. we show that the s specific igg signal positively correlates with age and the level of lactate dehydrogenase (ldh) and negatively correlates with lymphocyte percentage. overall, this study presents a systemic view of the sars-cov- specific igg and igm responses and provides insights to aid the development of effective diagnostic, therapeutic and vaccination strategies. covid- is caused by the coronavirus sars-cov- , . it is presently recognized by the world health organization as a global pandemic, and as of june , , , , diagnosed cases have been reported from countries, areas or territories (http:// ncov.chinacdc.cn/ -ncov/). sequence analysis suggested that sars-cov- is most closely related to the batcov ratg and belongs to the subgenus sarbecovirus of the betacoronaviruses, together with the bat-sars-like coronavirus and the sars coronavirus , . by comparing sars-cov to the other related coronaviruses, it was predicted that there are proteins encoded in the genome of sars-cov- . 
further, such comparisons suggested that sars-cov- might utilize the same mechanism to enter the host cells, namely via high-affinity binding between the receptor-binding domain (rbd) of the spike protein (s protein) and angiotensin converting enzyme (ace ) [ ] [ ] [ ] [ ] [ ] [ ] . although there is presently a tremendous worldwide effort to identify and develop effective therapeutic approaches against this virus, none has proved successful so far. one possible approach that has shown some positive results is to treat infected patients with plasma collected from convalescent covid- patients , . here, it is believed that the humoral antibody response in these convalescent patients played an important role in their recovery, and so might likewise prove effective in other, presently infected patients. indeed, it is known that igg and igm antibodies play many critical roles in combating many viral infections, including sars-cov and mers-cov [ ] [ ] [ ] [ ] . however, because sars-cov- is a newly emerged pathogen, the precise igg and igm responses in covid- patients are very poorly understood. in this regard, there are many important questions that need to be experimentally addressed: ( ) what is the variation among different patients, especially for antibodies against the nucleocapsid protein (n protein) and s protein? ( ) are there any other viral proteins that could trigger significant antibody responses in at least some of the patients? ( ) is it possible to link the magnitude of the overall igg and igm response to the severity of the disease in patients? resolving these questions is fundamental to understanding the global igg and igm responses against sars-cov- and to using this knowledge in the development of effective therapeutic or diagnostic approaches. 
conventional techniques for studying patient igg and igm responses include elisa [ ] [ ] [ ] and the immune-colloidal gold strip assay , , . however, these techniques can usually test only a single target protein or antibody in a single reaction. by contrast, protein microarrays enable proteome-wide characterization of antibody responses in a high-throughput format, providing a more systemic description of these vital antibody responses. indeed, a variety of protein microarrays have already been constructed and successfully applied to serum antibody profiling, such as the mycobacterium tuberculosis proteome microarray , the sars-cov protein microarray , the dengue virus protein microarray and the influenza virus protein microarray . here, we describe the construction of the sars-cov- proteome microarray and its application in the characterization of the global igg and igm responses from covid- convalescent patients. in this way, we provide a systemic view of these responses, revealing both common and unique features of these patients, which may aid future diagnostic and therapeutic efforts against this virus. schematic diagram and workflow. the genome of sars-cov- is ~ . kb and is predicted to encode proteins: structural proteins (treating the s protein as two separate proteins, s and s ), accessory proteins, and non-structural proteins (nsp) (fig. a) . the corresponding nucleotide sequences of all of these proteins and the receptor-binding domain (rbd) of the s protein were synthesized and cloned into appropriate vectors for expression in e. coli, and the expressed proteins were purified by affinity chromatography. to obtain an even broader range of proteins produced in different prokaryotic and eukaryotic systems, we also acquired a number of recombinant sars-cov- proteins from commercial sources (supplementary data ). after quality-control evaluation, the proteins were printed on appropriate substrate slides. 
convalescent sera were collected from patients on the day of their discharge and were applied to the proteome microarray. we detected the sars-cov- -specific igg and igm proteins bound to the array using fluorescent-labeled anti-human antibodies, thereby generating a global assessment of each patient's humoral antibody response. generation of the predicted sars-cov- proteins. to produce the recombinant proteins of sars-cov- for the microarray, we first determined the amino-acid sequences of the predicted proteins based on the reference genome (genbank accession no. mn . ). we split s protein into s and s , as suggested previously , to enable a more precise analysis and also included the rbd alone owing to its critical role during the entry of sars-cov- into the cells. the protein sequences were converted to the corresponding nucleotide sequences, followed by optimization of the sequences, and then insertion of the sequences into expression vectors (pet a or pgex- t- ). the final expression library included clones. further information about these clones is included in supplementary data . after several rounds of optimization, we successfully purified of these proteins (supplementary fig. ). western blotting with an anti- xhis antibody and coomassie staining showed that most of the sars-cov- proteins exhibit clear bands with the expected size (± kda) and good purity. to cover the proteome of sars-cov- as complete as possible, and to include proteins with post-translational modifications (ptm), especially glycosylation, we also acquired recombinant sars-cov- proteins produced using yeast cell-free systems or mammalian cell expression systems from a variety of commercial sources ( supplementary fig. ). among the collected proteins, there are several different full length and fragmented versions of the s and n proteins (supplementary data ). 
in this way, we finally obtained proteins from different sources, covering out of the predicted proteins of sars-cov- , that were of suitable concentration and purity for microarray construction. fabrication of the sars-cov- protein microarray. a total of proteins along with positive and negative controls were printed on the microarray slide (fig. a) . since most of the proteins were tagged with the xhis peptide, we examined the overall quality of the microarray by probing with an anti- xhis antibody, which revealed uniform, spot-limited labeling across the entire microarray, thus attesting to the quality of the array (fig. a) . in addition, we noticed that nsp was contaminated during the microarray manufacturing process. thus, we decided not to include nsp for further analysis. when probed with convalescent sera from covid- patients, we generally observed high, multi-spot antibody responses, which were not observed with the control sera (fig. b) . to prevent or largely decrease nonspecific signals generated from the background of the expression system and minimize any influence from possible protein impurity, e. coli lysates and egfp were added during the incubation with serum samples, which significantly reduced nonspecific signals ( supplementary fig. a) . to test the experimental reproducibility of the serum profiling using the microarray, we randomly selected two covid- convalescent sera and probed them on three separate microarrays. the pearson correlation coefficients from the measured intensities over the entire array between two samples were . and . for igg and igm, respectively. further, the overall fluorescence intensity ranges of the repeated experiments were quite similar, demonstrating a high reproducibility of the microarray-based serum profiling both for igg and igm ( fig. c-e) . sars-cov- -specific serum antibody profiles revealed by proteome microarray. 
to globally profile the antibody response against the sars-cov- proteins from the serum of covid- patients, we screened sera from convalescent patients, along with controls, using the sars-cov- proteome microarray. the patients were hospitalized in foshan fourth hospital in china from - - to - - for various durations. patient information is summarized in table . serum from each patient was collected on the day of hospital discharge when standard criteria were met. all of the samples and the controls were probed on the proteome microarray, and after data filtering and normalization, we constructed the igg and igm profile for each serum and performed clustering analysis to generate heatmaps (figs. [ ] [ ] . the patients and controls formed clearly separate clusters for both igg and igm data. as expected, the n and s proteins elicited high antibody responses in almost all patients but were associated with only weak signals in control groups, confirming the efficacy of these two proteins for diagnosis. interestingly, we also found that in some cases, proteins such as orf b or nsp can generate significantly high signals compared with that in the control groups. to further prove the specificity, we performed an immunoblotting-based serum analysis. as expected, the serum specifically recognized orf b, s proteins and n proteins (supplementary fig. b ). strong antibody responses against s and n proteins. since s and n proteins have been widely used as antigens for diagnosis of covid- , we next characterized the serum antibody responses against these two proteins in more detail. with the present cohort, the signals from both the n and s proteins, except for the s - fragment, exhibited strong discriminatory ability between the covid- patients and controls using either igg or igm response ( fig. a , b, supplementary fig. a, b) . 
notably, two sera from the control group exhibited a significant igg antibody response to the n proteins, with one to n-nter and the other to n-cter (fig. g) , suggesting that the n protein might generate a higher false-positive measurement than the s protein, especially the s protein. to investigate the consistency of signal intensities fig. f ) using data of the convalescent sera. high correlations were observed among different concentrations of the same proteins as well as the same protein from different sources ( fig. d , h, supplementary figs. a-c, g and d, g), although the n protein at high concentrations generated almost saturated igg signals ( supplementary fig. g ). in particular, for the full-length s proteins from different sources, whether from e. coli (s _t) or t (s _b and s _s) expression systems, a high correlation between these proteins were observed ( fig. c- fig. sars-cov- proteome microarray layout and quality control. a there are identical subarrays on a single microarray. a microarray was incubated with an anti- xhis antibody to demonstrate the overall microarray quality (green). one subarray was shown. the proteins were printed in quadruplicate. the triangles indicate dilution titers of the same proteins. b representative subarrays probed with sera of a covid- convalescent and healthy control. the igg and igm responses were shown in green and red, respectively. c, d the correlations of the overall igg and igm signal intensities between two repeats probed with the same serum. proteins (n = ) on the microarray were examined. e statistics of the pearson correlation confidence among repeats probed with the same serum. two serum samples from the convalescent group was examined in three independent experiments. nc negative control, pc positive control; . , . , . , and . indicate the concentration of these proteins for microarray printing. t tao lab, b hangzhou bioeast biotech. 
co.,ltd., k healthcode co., ltd., s sanyou biopharmaceuticals co.,ltd., w vacure l biotechnology co.,ltd., y sino biological co.,ltd. expression system: ( ) e. coli: all proteins from tao lab (t), n protein _s, n protein_w; ( ) cell-free: all proteins from healthcode co., ltd. (k), ( ) mammalian: s _b, s _s, s-rbd_s, s-rbd_y. fig. c, d) , indicating that the s proteins from different sources that we have tested are all similarly effective for detection. however, the background signals in the control group were much lower for proteins purified from mammalian cells (such as t) (fig. a, supplementary fig. a ), suggesting that these samples might possess a higher specificity and could serve as better reagents for developing immune diagnostics. the signals of the full-length s protein were highly correlated with that of the s-rbd (fig. e, supplementary fig. e ) but with much stronger signals. in contrast, the correlation levels of the s - fragment with the full-length s or rbd were lower (fig. c , supplementary fig. d) . also, the s signals were poorly correlated with s proteins, although significant s signals were observed for many of the patients (fig. f, supplementary figs. e and f) . these data might reflect a difference in the immunogenicity of different regions of the s protein, which could be resolved in the future with more refined epitope mapping. similar results were also observed for the n proteins ( supplementary fig. h, i) . interestingly, a moderate but significant linear correlation was observed between the igg responses against the n and s proteins ( fig. i) but not the igm responses ( supplementary fig. h) , while the correlations between the igg and igm signals for the same protein were low (fig. j, k) . this might partially be a consequence of overall lower igm signals than the igg signals ( supplementary fig. a, b) at the convalescent stage. antibody responses against other proteins. 
to statistically analyze the igg responses against sars-cov- proteins, we calculated the p-values followed by multiple testing correction (or q-values), and applied significant analysis of microarray (sam) to identify significant positive proteins (supplementary fig. and data ). besides s and n proteins, orf b and nsp also had significant positive responses. particularly, . % ( / ) and . % ( / ) patients exhibited a "positive" igg antibody signal to orf b and nsp , respectively (fig. a-c) . although e protein and orf b were statistically positive, the fluorescent intensities in both patient and control groups were too low, further verification is needed for these two proteins. to investigate if the igg responses against orf b or nsp depended on the igg responses to the n or s proteins, we calculated the correlations between these measurements. we observed no obvious correlation between the igg signals to orf b or nsp and the igg fig. the overall sars-cov- -specific igg profiles of the convalescent sera against the proteins. each square indicates the igg antibody response against the protein (row) in the serum (column). proteins were shown with names along with concentrations (μg ml − ) and sources. sera were shown with group information and serum number. ncp novel coronavirus patients or covid- patients, lc lung cancer, nc normal control. blank means no serum. three repeats were performed for serum p and p . fi fluorescence intensity. (fig. d, e) , suggesting these two proteins might provide complementary information to that generated from the n or s proteins, either for diagnosis or efforts to understand the specific immune response to this virus. igg responses were correlated with age, ldh, and lymphocyte percentage. it is known that the immune response is closely related to the development of the disease in individual patients. 
to study the relationship between the antibody response and the course of the disease, we examined the correlations between the s igg responses to various proteins and clinical characteristics. not surprisingly, the time after disease onset correlated with the igg response against the s (fig. a) as the igg response usually increases over time and reaches a maximum several weeks after disease onset, as observed in other studies and sars patients . age also correlated with the igg response to the s (fig. b) . in addition, the igg responses against s protein were positively correlated with peak lactate dehydrogenase (ldh) levels and inversely correlated with the percentage of lymphocytes (ly %) (fig. c, d) . the igg responses were also slightly different between male and female patients (fig. e) . we further performed multiple linear regression to investigate the relationship among s igg level, age, gender, days after onset, peak ldh and ly% (fig. f) . consistent with the above correlation analysis, age and peak ldh were statistically significant (both with p-values < . ) and gender showed marginal significance (p = . ). as expected, days after onset, identified as a confounding factor, showed no statistically significant difference (p = . ) and was removed from the regression. ly% was kept in the model because its low significance (p = . ) was probably due to the small sample size. the final equation (adjusted r-squared = . , p-value < . ) is as follows: y = + *x + *x - *x + *x , where y represents s igg level and x , x , x , x represent the normalized values (between and ) of age, gender, ly%, and peak ldh, respectively. to profile the sars-cov- -specific igg and igm responses, we have constructed a sars-cov- proteome microarray with of the predicted proteins. a set of convalescent sera was analyzed on the microarray, and global igg and igm profiles were obtained simultaneously through a dual color strategy.
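The multiple linear regression on min-max normalized predictors described above can be illustrated with a short sketch. The study reports fitting the model in R; the pure-Python least-squares fit below is only a stand-in for illustration. The function names (`min_max`, `solve`, `fit_linear_model`) are hypothetical and are not part of the original analysis, and the sketch assumes the predictors are not collinear.

```python
def min_max(values):
    """Scale values to the range [0, 1]: (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant predictor cannot be min-max normalized")
    return [(v - lo) / (hi - lo) for v in values]


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear_model(predictors, response):
    """Ordinary least squares on normalized predictors via the normal equations.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    columns = [[1.0] * len(response)] + [min_max(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * y for u, y in zip(ci, response)) for ci in columns]
    return solve(xtx, xty)
```

Because every predictor is scaled to the same 0 to 1 range, the fitted coefficients can be compared directly with each other, which is the stated reason for normalizing before fitting.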
our data clearly showed that both n protein and s were suitable for diagnostics, while s purified from the mammalian cell might possess better specificity. when we were preparing this work, a preprint also found better specificity with mammalian versus insect cell expressed proteins . meanwhile, significant antibody responses were identified for orf b and nsp . we further showed that the level of s igg positively correlated with age and the level of ldh while negatively correlated with ly%. it is well known that s and n proteins are the dominant antigens of sars-cov and sars-cov- that elicit both igg and igm antibodies, and the antibody response against n protein is usually stronger. however, for two of the control sera, we observed strong igg binding to n protein; specifically, one control recognized n protein at the n-terminus while the other recognized it at the c-terminus. this may be due to the highly conserved n protein sequences across coronavirus species. this indicates that we should be aware of false positives when applying n protein for diagnosis. in contrast, s protein demonstrated a higher specificity. thus, an ideal choice for developing immunodiagnostics might be a combination of both n protein and s . we also compared the antibody responses against a variety of versions of s , including the full length, the rbd domain, the n-terminal, and the c-terminal. the antibody response to the rbd region was highly correlated with that to the full-length protein but with weaker signals, which is consistent with a recent study . however, the correlations among other s versions were not significant, suggesting that the dominant epitopes that elicit antibodies might differ among individuals. further study with detailed epitope mapping might give us a clear answer. in this study, we also found the significant presence of igg and igm against orf b ( out of cases) and nsp ( out of cases).
orf b is predicted as an accessory protein, exhibiting high overall sequence similarity to sars and sars-like covs orf b (v i) , and is likely to be a lipid-binding protein . previous studies showed that sars orf b suppressed innate immunity by targeting mitochondria . two previous studies have found antibodies against sars orf b present in the sera of patients recovering from sars , . our study also demonstrates the potential of antibodies against orf b for the detection of convalescent covid- patients. covid- nsp is also highly homologous to sars nsp ( % identity, % similarity). its homologous proteins in a variety of coronaviruses have been proven to impair ifn response [ ] [ ] [ ] . our study provides experimental evidence to show the existence of nsp -specific antibodies in convalescents. since nsp is a non-structural protein, theoretically, it should be present only in infected cells but not in virions. hence, antibody against nsp has the potential to be applied to distinguish between covid- patients and healthy people immunized with the inactivated virus. we have analyzed the correlations between the covid- specific igg responses and clinical characteristics as well. it is expected that igg responses improve over time within one or two months after onset , , and we indeed have observed a significant correlation between igg signals and days after onset. we also found peak ldh was highly correlated with igg response, especially for female patients. as many studies reported, ldh tends to have a higher level in severe covid- patients and could be an indicator of severity , . in fact, it has been observed in sars patients that more severe sars is associated with more robust serological response , ; a similar association was confirmed in covid- patients.
fig. igg response to other sars-cov- proteins. a other sars-cov- proteins that were recognized by igg from the convalescent sera, in comparison to that of the controls. b, c anti-orf b igg (b) or anti-nsp igg (c) in the patient and control group. for b, c, each dot indicates one serum sample either from the convalescent group (n = ) or the control group (n = ). data are presented as mean values ± sd. the dashed line indicates cutoff value calculated as mean + x sd of the control group. p-values were calculated by the two-sided t-test and q-values were adjusted p-values using bh method. d, e correlations of the overall igg responses for n or s protein vs. orf b (d) or nsp (e). for d, e, each dot indicates one serum sample from the convalescent group (n = ) and p-values were calculated by the two-sided t-test.
fig. igg responses to s and n proteins. a box plots of igg response for s and s proteins. the proteins labeled with bold and red were overexpressed in mammalian cell lines. b box plots of igg response for n proteins. for a, b, each dot indicates one serum sample either from the convalescent group (green, n = ) or the control group (brown, n = ). data are represented as boxplots where the middle line is the mean value. the upper and lower hinges are mean values ± sd. p values were calculated by the two-sided t-test. q values were adjusted p-values using bh method. ***q < . . the exact p-values were shown in supplementary data . c pearson correlation coefficient matrix of igg responses among different s and s proteins. d-f correlations of overall igg responses among different s proteins (d), s vs. rbd (e) and s vs. s (f). g one part of a sub-microarray showed the igg responses of two controls, i.e., lc and nc against n proteins, n-cter and n-nter indicate the c-terminal and n-terminal of n protein, respectively. h, i correlations of the overall igg responses among different n proteins (h) and n protein vs. s protein (i). j statistics of the pearson correlation coefficients between igg and igm profile against constructs of s (n = ), s-rbd (n = ), s (n = ), and n (n = ). data are presented as mean values ± sd. k correlations between igg and igm profile against s _ . _w. for d-f, h-k, each dot indicates one serum sample from the convalescent group (n = ). for f and i, p-values were calculated by the two-sided t-test.
there are some limitations to the current sars-cov- proteome microarray. firstly, due to the difficulty of protein expression and purification, there are still proteins missing . we will try to obtain these proteins through vigorous optimization or other sources. interesting findings are anticipated in the near future for these missing proteins. secondly, because most of the proteins on the microarray were not expressed in mammalian cells, critical post-translational modifications, such as glycosylation, are absent. it is known that there are n-glycosylation sites on s protein, which is heavily glycosylated, and the glycosylation may play critical roles in antibody-antigen recognition , . only a few proteins on the current protein microarray were prepared using mammalian cell systems. we are trying the rest of the proteins. once the microarray is upgraded with many or all proteins purified from mammalian cells, ptm-specific igg and igm response may be better elicited. thirdly, only samples collected at a single time point were analyzed. though there are some interesting findings, we believe some of the current conclusions could be strengthened by including more samples. furthermore, longitudinal samples , collected at different time points from the same individual after diagnosis or even after recovery may enable us to reveal the dynamics of the sars-cov- specific igg and igm responses. the data may be further linked to the severity of covid- among different patients. the application of the sars-cov- proteome microarray is not limited to serum profiling.
it could also be explored for host-pathogen interaction , drug or small molecule target identification , , and antibody specificity assessment . through the same construction procedure, we could easily expand the microarray to a pan-human coronavirus proteome microarray by including the other two severe coronaviruses, i.e., sars-cov , , and mers-cov , as well as the four known mild human coronaviruses , , i.e., cov e, cov oc , cov hku- , and cov nl . by applying this microarray, we can assess the immune response to coronavirus at a system level, and the possible cross-reactivity could be easily judged. taken together, we have constructed the sars-cov- proteome microarray, this microarray could be applied for a variety of applications, including but not limited to in-depth igg and igm response profiling. through the analysis of convalescent sera on the microarray, we obtained the overall picture of the sars-cov- -specific igg and igm profile. we believe that the findings in this study will shed light on the development of the more precise diagnostic kit, more appropriate treatment and effective vaccine for combating the global crisis that we are facing now. construction of expression vectors. the protein sequences of sars-cov- were downloaded from genbank (accession number: mn . ). according to the optimized genetic algorithm , the amino-acid sequences were converted into e. coli codon-optimized gene sequences. subsequently, the sequences of optimized genes were synthesized by sangon biotech. (shanghai, china). the synthesized genes were cloned into pet a or pgex- t- and transformed into e. coli bl strain to construct the transformants. detailed information (the dna sequence, the protein sequence, the size of the protein, the system for protein expression, and etc.) of the clones constructed in this study was given in supplementary data . protein preparation. the recombinant proteins were expressed in e. coli bl by growing cells in ml lb medium to an a of . at °c. 
protein expression was induced by the addition of . mm isopropyl-β-d-thiogalactoside (iptg) before incubating cells overnight at °c. for the purification of xhis-tagged proteins, cell pellets were re-suspended in lysis buffer containing mm tris-hcl ph . , mm nacl, mm imidazole (ph . ), then lysed by a high-pressure cell cracker (union-biotech, shanghai, china). cell lysates were centrifuged at , × g for min at °c. supernatants were purified with ni + sepharose beads (senhui microsphere technology, suzhou, china), then washed with lysis buffer and eluted with buffer containing mm tris-hcl ph . , mm nacl and mm imidazole ph . . for the purification of gst-tagged proteins, cells were harvested and lysed by a high-pressure cell cracker in lysis buffer containing mm tris-hcl, ph . , mm nacl, mm dtt. after centrifugation, the supernatant was incubated with gst-sepharose beads (senhui microsphere technology, suzhou, china). the target proteins were washed with lysis buffer and eluted with mm tris-hcl, ph . , mm nacl, mm dtt, mm glutathione. the purified proteins were analyzed by sds-page followed by western blotting using an anti-his antibody (merck millipore, usa, cat# - ) and coomassie brilliant blue staining. recombinant sars-cov- proteins were also collected from commercial sources. detailed information on the recombinant proteins prepared in this study was given in supplementary data . protein microarray fabrication. the proteins, along with the negative (bsa) and positive controls (anti-human igg, cat#i and igm antibody, cat#i ), were printed in quadruplicate on path substrate slide (grace bio-labs, oregon, usa) to generate identical arrays in a × subarray format using super marathon printer (arrayjet, uk). protein microarrays were stored at − °c until use. . e s igg responses in male (m, n = ) and female (f, n = ) groups. t-test data are presented as mean values ± sd. for a-e, p-values were calculated by two-sided t-test. 
f multiple linear regression model for s igg. p-values for coefficients were calculated by two-sided t-test and the p-value for the regression model was calculated by one-sided f-test. *p < . . patients and samples. the institutional ethics review committee of foshan fourth hospital, foshan, china approved this study and written informed consent was obtained from each patient. covid- patients were hospitalized and received treatment in foshan fourth hospital during the period from - - to - - with variable stay time (table ) . serum from each patient was collected on the day of hospital discharge when the standard criteria were met according to diagnosis . briefly, the key points of the discharge criteria are: ( ) body temperature is back to normal for more than three days; ( ) respiratory symptoms improve obviously; ( ) pulmonary imaging shows obvious absorption of inflammation; ( ) nucleic acid tests are negative twice consecutively on respiratory tract samples such as sputum and nasopharyngeal swabs (sampling interval being at least h). sera of the control group from lung cancer patients and healthy controls were collected from ruijin hospital, shanghai, china. all sera were stored at − °c until use. microarray-based serum analysis. a -chamber rubber gasket was mounted onto each slide to create individual chambers for the identical subarrays. the microarray was used for serum profiling as described by li, y. et al. with minor modifications. briefly, the arrays stored at − °c were warmed to room temperature and then incubated in blocking buffer ( % bsa in × pbs buffer with . % tween ) for h. serum samples were diluted : in pbs containing . % tween , supplemented with . mg ml − egfp purified in the same manner as the egfp tagged proteins and . mg ml − total e. coli lysate. a total of μl of diluted serum or buffer only was incubated with each subarray overnight at °c.
the arrays were washed with × pbst and bound antibodies were detected by incubating with cy -conjugated goat anti-human igg and alexa fluor -conjugated donkey anti-human igm (jackson immunoresearch, pa, usa, cat# - - and cat# - - respectively), which were diluted : in × pbst and incubated at room temperature for h. the microarrays were then washed with × pbst and dried by centrifugation at room temperature and scanned by luxscan k-a (capitalbio corporation, beijing, china) with the parameters set as % laser power/pmt and % laser power/ pmt for igm and igg, respectively. the fluorescent intensity data was extracted by genepix pro . software (molecular devices, ca, usa). immunoblotting-based serum analysis. the selected proteins were analyzed by sds-page followed by western blotting using a serum overnight at °c. to assure the quality of the proteins, an anti-his antibody (merck millipore, usa, cat# - ) was also blotted. the serum for immunoblotting was diluted : in pbs containing . % tween , with the addition of . mg ml − egfp purified in the same manner as the egfp tagged proteins and . mg ml − total e. coli lysate as mentioned before. statistics. signal intensity was defined as the median of the foreground subtracted by the median of background for each spot and then averaged the quadruplicate spots for each protein. igg and igm data were analyzed separately. before processing, data from some spots, such as nsp _ . _t and nsp _k, were excluded for probably printing contamination. pearson correlation coefficient between two proteins or indicators and the corresponding p-value was calculated by spss software under the default parameters. cluster analysis was performed by pheatmap package in r . p-values for statistical analysis were calculated by two-way t-test and q-values or adjusted p-values were obtained using bh (benjamini and hochberg) method. significant analysis of microarray (sam) was performed by "samr" package of the r language with default parameters . 
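The signal definition and the Benjamini-Hochberg adjustment described in this statistics paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study reports using SPSS and R for these steps, and the function names (`protein_signal`, `bh_adjust`) and input layout are hypothetical, not taken from the original analysis.

```python
from statistics import mean


def protein_signal(foreground_medians, background_medians):
    """Signal intensity of one protein.

    Each spot contributes (median foreground - median background); the
    replicate spots (quadruplicates in the study) are then averaged.
    """
    spots = [fg - bg for fg, bg in zip(foreground_medians, background_medians)]
    return mean(spots)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues
```

For example, `protein_signal` would be called once per protein with the four foreground and four background medians exported by the image-analysis software, and `bh_adjust` once per antibody class on the list of per-protein p-values.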
to calculate the positive rate of antibody response for each protein, mean signal + * standard deviation (sd) of the control sera were used to set the threshold. the multiple linear regression was performed with the function "lm" from the "stats" package of the r language. to make the coefficients in the regression model more comparable with each other, the values of all predictor variables (x) have been normalized as follows: (x − min(x)) / (max(x) − min(x)). a pneumonia outbreak associated with a new coronavirus of probable bat origin a new coronavirus associated with human respiratory disease in china genome composition and divergence of the novel coronavirus ( -ncov) originating in china isolation and characterization of a bat sars-like coronavirus that uses the ace receptor cryo-em structure of the -ncov spike in the prefusion conformation structural basis for the recognition of the sars-cov- by full-length human ace structure of the sars-cov- spike receptor-binding domain bound to the ace receptor angiotensin-converting enzyme is a functional receptor for the sars coronavirus sars-cov- cell entry depends on ace and tmprss and is blocked by a clinically proven protease inhibitor the effectiveness of convalescent plasma and hyperimmune immunoglobulin for the treatment of severe acute respiratory infections of viral etiology: a systematic review and exploratory meta-analysis convalescent plasma as a potential therapy for covid- severe acute respiratory syndrome diagnostics using a coronavirus protein microarray identification of human neutralizing antibodies against mers-cov and their role in virus adaptive evolution potent neutralization of mers-cov by human neutralizing monoclonal antibodies to the viral spike glycoprotein generation of antibodies against covid- virus for development of diagnostic tools evaluation of enzyme-linked immunoassay and colloidal gold-immunochromatographic assay kit for detection of novel coronavirus (sars-cov- ) causing an outbreak of
pneumonia (covid- ) detection of severe acute respiratory ayndrome (sars) coronavirus nucleocapsid protein in sars patients by enzyme-linked immunosorbent assay recombinant protein-based enzyme-linked immunosorbent assay and immunochromatographic tests for detection of immunoglobulin g antibodies to severe acute respiratory syndrome (sars) coronavirus in sars patients development and clinical application of a rapid igm-igg combined antibody test for sars-cov- infection diagnosis mycobacterium tuberculosis proteome microarray for global studies of protein function and immunogenicity rapid production of virus protein microarray using protein microarray fabrication through gene synthesis (pages) influenza virus infection induces a narrow antibody response in children but a broad recall response in adults a preliminary study on serological assay for severe acute respiratory syndrome coronavirus (sars-cov- ) in admitted hospital patients anti-sars-cov igg response in relation to disease severity of severe acute respiratory syndrome a serological assay to detect sars-cov- seroconversion in humans the crystal structure of orf- b, a lipid binding protein from the sars coronavirus sars-coronavirus open reading frame- b suppresses innate immunity by targeting mitochondria and the mavs/traf /traf signalosome sars corona virus peptides recognized by antibodies in the sera of convalescent cases antibody responses to individual proteins of sars coronavirus and their neutralization activities porcine deltacoronavirus nsp inhibits interferon-beta production through the cleavage of nemo porcine deltacoronavirus nsp antagonizes type i interferon signaling by cleaving stat feline infectious peritonitis virus nsp inhibits type i interferon production by cleaving nemo at multiple sites chronological evolution of igm, iga, igg and neutralisation antibodies after infection with sars-associated coronavirus profile of specific antibodies to the sars-associated coronavirus clinical 
characteristics of hospitalized patients with novel coronavirus-infected pneumonia in wuhan, china microbiologic characteristics, serologic responses, and clinical manifestations in severe acute respiratory syndrome structure, function, and antigenicity of the sars-cov- spike glycoprotein longitudinal serum autoantibody repertoire profiling identifies surgery-associated biomarkers in lung adenocarcinoma antibody responses against sars-coronavirus and its nucleocaspid in sars patients systematic identification of mycobacterium tuberculosis effectors reveals that bfrb suppresses innate immunity systematic identification of arsenic-binding proteins reveals that hexokinase- is inhibited by arsenic interplay between the bacterial protein deacetylase cobb and the second messenger c-di-gmp a toolbox of immunoprecipitation-grade monoclonal antibodies to human transcription factors sars and mers: recent insights into emerging coronaviruses sars-beginning to understand a new virus hosts and sources of endemic human coronaviruses specific serology for emerging human coronaviruses by protein microarray metabolic engineering of long chain-polyunsaturated fatty acid biosynthetic pathway in oleaginous fungus for dihomo-gamma linolenic acid production national health commission & national administration of traditional chinese medicine. diagnosis and treatment protocol for novel coronavirus pneumonia (trial version pretty heatmaps significance analysis of microarrays applied to the ionizing radiation response reporting summary. further information on research design is available in the nature research reporting summary linked to this article. the protein sequences of sars-cov- were downloaded from genbank (accession number: mn . ). the sars-cov- proteome microarray data are deposited on protein microarray database under the accession number pmde (http://www.proteinmicroarray. cn/index.php?option=com_experiment&view=detail&experiment_id= ). 
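the cutoff and normalization steps described in the methods above can be sketched in python (the authors worked in r; this is an illustration, not their code). the multiplier `k` is a placeholder because its value is not legible in the text above, the use of the sample standard deviation is an assumption, and all numeric values are invented for the example.

```python
from statistics import mean, stdev


def min_max_normalize(values):
    """Rescale values to [0, 1] via (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def positivity_cutoff(control_signals, k):
    """Cutoff = mean + k * SD of the control sera (k is an assumed multiplier)."""
    return mean(control_signals) + k * stdev(control_signals)


def positive_rate(patient_signals, cutoff):
    """Fraction of patient sera whose signal exceeds the cutoff."""
    return sum(s > cutoff for s in patient_signals) / len(patient_signals)


# Hypothetical predictor values and signal intensities.
scaled = min_max_normalize([20, 35, 50, 80])
cutoff = positivity_cutoff([100.0, 120.0, 110.0, 90.0, 80.0], k=2)
rate = positive_rate([95.0, 150.0, 400.0, 131.0], cutoff)
```

scaling every predictor to the same 0-1 range is what makes the fitted regression coefficients comparable in magnitude, which is the stated purpose of the normalization.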
additional data related to this paper may be requested from the authors.
received: april ; accepted: june ;
we thank dr. daniel m. czajkowsky for english editing and critical comments. we thank dr. min guo of healthcode co., ltd. for providing affinity-purified proteins. we thank dr. guo-jun lang of sanyou biopharmaceuticals co., ltd. for providing proteins and antibodies. we also thank dr. jie wang of vacure l biotechnology co., ltd., dr. yin-lai li of hangzhou bioeast biotech. co., ltd., and sino biological co., ltd. for providing the proteins. this work was partially supported by the national
the authors declare no competing interests.
supplementary information is available for this paper at https://doi.org/ . /s - - - .
correspondence and requests for materials should be addressed to d.m., j.z. or s.-c.t.
peer review information nature communications thanks martijn van hemert and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. peer reviewer reports are available.
reprints and permission information is available at http://www.nature.com/reprints
publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
open access this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material.
if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons.org/licenses/by/ . /.
key: cord- -k rcpav authors: niespodziana, katarzyna; stenberg-hammar, katarina; megremis, spyridon; cabauatan, clarissa r.; napora-wijata, kamila; vacal, phyllis c.; gallerano, daniela; lupinek, christian; ebner, daniel; schlederer, thomas; harwanegg, christian; söderhäll, cilla; van hage, marianne; hedlin, gunilla; papadopoulos, nikolaos g.; valenta, rudolf title: predicta chip-based high resolution diagnosis of rhinovirus-induced wheeze date: - - journal: nat commun doi: . /s - - - sha: doc_id: cord_uid: k rcpav
rhinovirus (rv) infections are major triggers of acute exacerbations of severe respiratory diseases such as pre-school wheeze, asthma and chronic obstructive pulmonary disease (copd). the occurrence of numerous rv types is a major challenge for the identification of the culprit virus types and for the improvement of virus type-specific treatment strategies. here, we develop a chip containing different micro-arrayed rv proteins and peptides and demonstrate in a cohort of pre-school children, most of whom had been hospitalized due to acute wheeze, that it is possible to determine the culprit rv species with a minute blood sample by serology. importantly, we identify rv-a and rv-c species as giving rise to most severe respiratory symptoms. thus, we have generated a chip for the serological identification of rv-induced respiratory illness which should be useful for the rational development of preventive and therapeutic strategies targeting the most important rv types.
respiratory viral infections are among the most common triggers of acute exacerbations of pre-school wheeze, asthma, and chronic obstructive pulmonary disease (copd) [ ] [ ] [ ] .
asthma and copd are severe and disabling diseases of the respiratory tract and hence represent a serious global health problem affecting different age groups. acute pre-school wheeze and community-acquired pneumonia (cap) are other common causes of emergency visits with possible viral etiology. the prevalence of these airway diseases is increasing and treatment costs are rising, so virus-induced respiratory illnesses are a heavy burden for patients and the community. respiratory viral infections, mainly due to rhinovirus (rv), are responsible for approximately % of wheeze and asthma exacerbations in children. moreover, infants with rhinovirus-induced wheeze have a significantly increased risk for subsequent development of recurrent wheeze and childhood asthma. since exposure to rv does not lead to wheezing illness in all children, additional factors such as the host genotype, defects of the respiratory epithelial barrier, and/or atopic predisposition have been suggested to play important roles in asthma [ ] [ ] [ ] . rv is genetically a highly diverse virus group with more than distinct rv types which have been divided into three distinct rv species, rv-a, rv-b, and rv-c. rhinoviruses can also be classified according to which cellular receptor on human respiratory epithelial cells they use for entry. rv-b and most rv-a variants bind to the intercellular adhesion molecule- (icam- ) (i.e., major rv group), while a subset of rv-a species binds to the low-density lipoprotein receptor (i.e., minor rv group). more recently, a cadherin-related family member protein (cdhr ) has been reported as one of the probable receptors for the rv-c species.
the identification of the culprit rhinovirus species responsible for severe exacerbations of respiratory disease is an extremely important topic as certain rv species (e.g., rv-c) are suspected to be associated with more severe wheezing illnesses and acute asthma exacerbations in infants and children compared to others. in fact, there are also several preventive and therapeutic strategies for rv infections under development which require a precise knowledge of the clinically relevant rv species to be targeted. for example, several approaches for developing vaccines based on polyvalent inactivated rv, synthetic rv-derived peptides and recombinant rv proteins have been reported [ ] [ ] [ ] [ ] [ ] [ ] . the formulation of a broadly protective vaccine obviously requires the inclusion of the clinically most relevant and common rv species. furthermore, it has been shown that blocking of the viral receptor on respiratory epithelial cells (e.g., icam- ) can prevent rv infection. again, therapeutic approaches targeting the viral receptors require knowledge of which rv species are the most frequent and relevant. finally, it is important to investigate the role of the different rv species in exacerbations of severe bronchial obstruction in different populations and for different age groups and manifestations of respiratory illness (e.g., pre-school wheeze, asthma, copd, asthma-copd overlap: aco, cap). while rv is well established as an important trigger factor for childhood wheeze and asthma, less is known regarding the role of rv infections in exacerbations of copd and in respiratory disease exacerbations of older subjects. furthermore, the causal relationship of rv with cap is also unknown. currently, the detection of rv in the course of respiratory infections is mainly based on reverse transcription of viral rna and dna amplification by polymerase chain reaction (pcr).
such tests can demonstrate the presence of virus-derived nucleic acid but they do not necessarily indicate that the particular virus has caused an infection and is indeed responsible for clinical symptoms in the patient. in fact, rhinovirus rna has been found in a high proportion of asymptomatic infants and children [ ] [ ] [ ] . furthermore, little is known about levels and epitope specificities of natural antibody responses capable of neutralizing rhinoviruses, and thus protecting individuals against rv infections. such information would be helpful for the development of new immunological strategies for the treatment and prevention of rv-induced exacerbations of respiratory diseases. therefore, there is a huge and so far unmet need for high-resolution serological detection of rhinovirus infections. we have previously identified the capsid protein vp and an n-terminal vp peptide as major targets for the natural antibody response of rv-infected subjects. we then demonstrated that in vivo inoculation of subjects with rv indeed induced increases of vp -specific antibody responses which were best detected with the vp protein of the corresponding species. similar increases of rv-specific antibodies were found in pre-school children after asthma attacks using complete recombinant rv capsid proteins. in a recent study performed in pre-school children with acute asthma attacks, we observed that increases of rv-specific antibody responses reflected the severity of respiratory symptoms. in the present study funded by the european union project "predicta" (https://cordis.europa.eu/project/rcn/ _en.html), we investigated if it is possible to generate a microarray-based serological test which can discriminate rv-a, rv-b, and rv-c as culprit species involved in childhood asthma attacks.
development of a high-resolution predicta microarray.
figure shows the arrangement of rv-derived proteins, peptides as well as control proteins on the predicta chip and the selection procedure for the n-terminal vp peptides. recombinant capsid proteins (vp -vp ) and fragments thereof representing rv-a, b, c species as well as non-structural proteins from rv were included (supplementary tables - ) . based on our previous finding that antibodies from rv-infected patients react preferentially with the n-terminus of vp (ref. ), we included synthetic n-terminal vp peptides from rv strains which were selected in a rational, multistep process to represent distinct rv strains (fig. a , supplementary table ) . starting with vp sequences retrieved from the ncbi database (rv-a: ; rv-b: ; rv-c: ), multiple sequence alignments were performed to identify clusters of peptides with high degrees of sequence identities (fig. a , supplementary table : clusters a -a ; b -b ; c -c ). in a next step, peptides were re-clustered into groups taking the chemical properties of amino acids into consideration (supplementary table : ai-axviii; bi-bix; ci-ciii). from these groups peptides with the most distantly related sequences were selected (fig. b, c; supplementary tables and ). for a further refinement of vp , vp , and vp antibody responses, vp and vp fragments as well as peptides spanning the complete vp sequence from rv were included (fig. d , supplementary tables and ). for control and calibration purposes we added recombinant allergens and control proteins (fig. d, supplementary table ) to the predicta chip. proteins and peptides were spotted in triplicates on a preactivated glass slide containing six microarrays surrounded by a teflon frame so that one chip allows testing of six serum samples (fig. d) . 
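the first selection step above (grouping n-terminal peptides by sequence identity) can be illustrated with a minimal python sketch. it assumes the peptides are already aligned to equal length and uses simple single-linkage grouping; the peptide names, sequences, and identity threshold are hypothetical, and the authors' own analysis relied on clustalw alignments rather than this code.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_by_identity(peptides, min_identity):
    """Single-linkage grouping: a peptide joins (and merges) every cluster
    holding a member with more than min_identity percent identity to it."""
    clusters = []
    for name, seq in peptides.items():
        hits = [c for c in clusters
                if any(percent_identity(seq, peptides[m]) > min_identity for m in c)]
        merged = [name]
        for c in hits:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


# Hypothetical 10-residue peptides; not taken from the chip.
peptides = {
    "p1": "NPVENYIDEV",
    "p2": "NPVENYIDEL",  # one mismatch versus p1 (90 % identity)
    "p3": "NPIENYVDEV",  # two mismatches versus p1 (80 % identity)
    "p4": "GLGDELEEVI",  # unrelated sequence
}
strict = cluster_by_identity(peptides, min_identity=85.0)
loose = cluster_by_identity(peptides, min_identity=75.0)
```

lowering the threshold merges p3 into the p1/p2 cluster, which shows how the chosen identity cutoff controls how many representative peptides end up on the array.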
identities and purities of recombinant proteins and peptides were tested by sodium dodecyl sulfate polyacrylamide gel electrophoresis (sds-page) followed by coomassie brilliant blue staining, western-blotting, and by mass spectrometry, respectively (supplementary tables - , supplementary fig. ). supplementary table shows that the predicta microarray allows reproducible measurement of igg levels to the antigens according to intra- and inter-assay variations.
rv strain recognition is broader in older wheezing children.
the predicta chip was tested with sera from pre-school children who were admitted to the hospital due to an acute wheezing episode. table summarizes demographic and clinical data of the children investigated in this study. to investigate if the spectrum of recognized rv peptides varies by age, children were grouped according to age (group i: < year, n = ; group ii: - years, n = ; group iii: > years, n = ). igg reactivity was most frequent and intense to rv-c peptides followed by rv-a peptides whereas igg reactivity to rv-b peptides was less frequent and intense (fig. a). similar results were obtained for iga responses which in general were lower than the igg responses (fig. b). the analysis of igg reactivity to structural and non-structural proteins and to recombinant fragments and synthetic peptides spanning vp , vp , and vp from rv is shown in supplementary fig. a for all children and in supplementary fig. b for those children (n = ) who had shown increases of rv -specific antibody responses in follow-up serum samples taken after recovery. results obtained thus confirmed our earlier observations showing that the majority of rv-specific antibody responses are directed against the n-terminus of vp as represented by the peptides from the n-terminus of vp proteins from the different rv species. some children showed an igg response against vp and in particular to a fragment representing the middle portion of vp whereas vp and vp showed no relevant igg reactivity (supplementary fig.
a, b). among the non-structural proteins, c, a, and c showed some igg reactivity (supplementary fig. ). the analysis of igg reactivity according to the age of children at the acute visit showed that children < year of age had much lower rv-specific igg levels compared to children who were older than year. there was almost no change of the rv strain-specificity pattern among the three age groups (fig. c). however, we found a positive correlation between the number of recognized vp peptides and the age of the children (fig. d). children between and months of age recognized a significantly lower number of n-terminal vp peptides than older ( - months) children. children older than years reacted with significantly more peptides than both groups of younger children (fig. e). the patterns of antibody response against different rv peptides were also analyzed using an independent bioinformatics approach (supplementary fig. ). a phylogenetic clustering of all peptides was prepared and used to generate peptide groups according to sequence homology which represented to a large extent the rv subgroups a, b, and c. then an unsupervised computer algorithm was used to cluster the patterns of antibody responses. finally, the results of these analyses were superimposed. it turned out that there was a very strong correlation between the two groupings, one based on antibody reactivity (made through the unsupervised analysis) and the other based on sequence identities among peptides. thus antibody response patterns reflected very closely the peptide sequence similarities (supplementary fig. ).
identification of rv species-specific antibody increases.
based on our previous observations that antibody increases specific for the n-terminal portion of vp can be detected in serum samples obtained from subjects after rv infection, the predicta chip was equipped with a vp peptide set which should allow detecting species-specific immune responses at high resolution (fig. ).
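the positive correlation between age and the number of recognized peptides reported above was assessed with spearman's rank correlation according to the figure legend. a minimal python sketch of that statistic follows; the ages and peptide counts are invented for illustration, and the function names are this example's own, not taken from any software the authors used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: age in months and number of IgG-reactive peptides.
age_months = [6, 10, 18, 30, 40, 55]
n_reactive = [1, 3, 2, 8, 12, 20]
rho = spearman_rho(age_months, n_reactive)
```

because the statistic only uses ranks, it captures a monotonic increase of recognized peptides with age without assuming that the relationship is linear.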
supplementary figure shows images of rv-specific antibody responses measured in sera from six representative children obtained at the time of the acute episode of wheeze and in sera at a follow-up visit - months later. in sera from the follow-up visit increased antibody responses to rv-a (sera # , # ), rv-c (sera # , # ), and rv-b (sera # , # ) peptides could be detected ( supplementary fig. ) . we then compared the peptide-specific igg antibody levels in sera obtained from the children at the time of the acute wheeze and at the follow-up - months later (i.e., median weeks later) ( table ). supplementary figure shows a color-coded map of the peptide-specific antibody responses for the acute phase and follow-up of each of the children demonstrating species-specific igg increases. next, we compared the peptide-specific increases in relative numbers for each child (fig. a) . according to these peptide-specific igg increases, children could be identified who responded preferentially to rv-a-derived peptides (n = ), rv-c-derived peptides (n = ), rv-b-derived peptides (n = ), and some with a mixed response pattern (n = ) (fig. a) . for children no increases of peptide-specific igg responses were found (fig. a, bottom) . the same analysis was performed for increases of antibody responses against complete recombinant vp proteins from the three rv-a strains (i.e., rv , , ), the rv-c strain yp, and the rv-b strain ( fig. a) but the results were less clear and negative for several children because only few strains were covered with the recombinant proteins. we have also included in fig. a (right column) results from the pcr testing performed using vp -vp specific primers in of the children , which showed that the nucleic acid-based detection of virus strains was negative for approximately % of children with increases of rv peptide-specific igg levels which may indicate a higher sensitivity of serology vs. pcr in these children. 
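the grouping of children by the species showing the largest relative igg increase, as described above, can be sketched as follows. the decision rule is hypothetical: the minimum increase `min_delta`, the `dominance` share needed to call a single species, and all signal values are assumptions for illustration, since the text does not give the exact criteria.

```python
def classify_increase(acute, follow_up, species_of, min_delta=1.0, dominance=0.5):
    """Label a child by the species carrying most of the IgG increase.

    acute / follow_up map peptide name -> signal; species_of maps peptide
    name -> species. Returns (label, share of total increase per species).
    """
    totals = {}
    for peptide, before in acute.items():
        delta = follow_up[peptide] - before
        if delta >= min_delta:  # ignore small changes (assumed noise level)
            sp = species_of[peptide]
            totals[sp] = totals.get(sp, 0.0) + delta
    if not totals:
        return "none", {}
    grand_total = sum(totals.values())
    shares = {sp: t / grand_total for sp, t in totals.items()}
    top = max(shares, key=shares.get)
    label = top if shares[top] > dominance else "mixed"
    return label, shares


# Hypothetical peptides and signals for one child.
species_of = {"a1": "RV-A", "a2": "RV-A", "b1": "RV-B", "c1": "RV-C"}
acute = {"a1": 2.0, "a2": 1.0, "b1": 0.5, "c1": 3.0}
follow_up = {"a1": 10.0, "a2": 5.0, "b1": 0.6, "c1": 5.0}
label, shares = classify_increase(acute, follow_up, species_of)
```

expressing each species as a share of the child's total increase mirrors the comparison "in relative numbers" mentioned in the text and keeps children with high and low overall antibody levels comparable.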
moreover, pcr results did not correspond well with the specificities of the antibody responses. there were also children without increases of rv peptide-specific antibody responses who had positive pcr results (fig. a, bottom).
tables and ) (x-axes: red: rv-a species; green: rv-b species; blue: rv-c species). antibody levels are color-coded and expressed as isac standardized units, isu-g and isu-a, respectively. c median igg levels (y-axis: isu-g) to vp peptides (x-axis) in children grouped according to age ( - months: squares; - months: triangles; - : circles). d spearman's rank correlation between the number of igg-reactive peptides (n, y-axis; median igg > isu) and age (x-axis: months). correlation coefficient (ρ) and p-value are shown. e comparison of the number of igg-reactive vp peptides (n, y-axis; median igg > isu) in children according to age (x-axis). horizontal lines indicate medians. statistically significant differences between groups are indicated (**p < . , ****p < . ) (mann-whitney u-test).
antibody signatures associated with severity of wheeze.
next, we investigated whether antibody responses to certain rv species were associated with the severity of rv-induced wheeze. for this purpose, we determined the number of days with respiratory symptoms and the number of days when medication was required in the period between the acute and follow-up visit. the number of days with respiratory symptoms but not with medication was significantly higher in subjects with an increase in rv-a > mixed > rv-c > rv-b specific signal (fig. , top and middle panels). we then analyzed the sum of days with symptoms and medication and found that rv-a and rv-c antibody increases were associated with the highest number of days with symptoms and medication (fig. , bottom panel).
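a rank-based two-group comparison such as the mann-whitney u-test named in the figure legend above can be sketched in python as follows. the day counts are invented, the p-value uses a normal approximation with continuity correction and no tie correction, and for samples this small an exact test would normally be preferred; none of this is the authors' code.

```python
import math


def mann_whitney_u(x, y):
    """Return (U_x, U_y): each pair where x > y counts 1, ties count 0.5."""
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return u_x, len(x) * len(y) - u_x


def two_sided_p_normal(x, y):
    """Two-sided p-value from the normal approximation of U
    (continuity correction applied, tie correction omitted)."""
    n1, n2 = len(x), len(y)
    u = min(mann_whitney_u(x, y))
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (abs(u - mu) - 0.5) / sigma
    return math.erfc(max(z, 0.0) / math.sqrt(2))


# Hypothetical days with symptoms and medication per child.
days_rv_a = [14, 20, 9, 25, 17]
days_rv_b = [3, 5, 8, 2]
u_values = mann_whitney_u(days_rv_a, days_rv_b)
p_value = two_sided_p_normal(days_rv_a, days_rv_b)
```

the test compares the ordering of values between groups rather than their means, which suits skewed count data such as days with symptoms.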
children with rv-a- or rv-c-specific antibody increases had significantly more days with symptoms and medication than children with rv-b-specific antibody increases (fig. , bottom panel).
within the european union-funded project predicta (https://cordis.europa.eu/project/rcn/ _en.html) we developed the predicta chip which is based on micro-arrayed rv-derived proteins and peptides selected to represent the three rv species (rv-a, rv-b, and rv-c). using the predicta chip we could demonstrate in a cohort of pre-school children with acute wheeze and a follow-up visit that the rv-specific antibody response (igg > iga) is directed against an n-terminal peptide of the major capsid protein vp , which confirms earlier results obtained by the mapping of rv -specific antibody responses. since the predicta chip was designed to contain a panel of synthetic peptides which represented the most diverse rv strains of the three genetic rv species in terms of sequence identity and physicochemical properties, we were able to perform a high-resolution mapping of rv species-specific antibody responses by serology. in a cohort of clinically well-described swedish pre-school children from whom sera were available from an episode of acute wheeze requiring an emergency room visit and from a follow-up visit after convalescence, peptides from rv-a and rv-c species were most frequently recognized whereas rv-b species were much less commonly recognized. interestingly, we found that older children (i.e., children older than years of age) recognized peptides from more rv strains than younger children.
this result would indicate that children encounter different rv strains during their lives and thus may broaden their igg reactivity profiles later in life, but this needs to be confirmed in longitudinal studies with samples taken from the same children at different ages, as has recently been done in the analysis of the evolution of ige reactivity profiles in allergic children in birth cohorts [ ] [ ] [ ] [ ] . one of the important findings of our study was that igg reactivity to peptides from certain rv strains increased in the children, which may allow identifying the culprit rv species responsible for the acute wheeze by serology. furthermore, it turned out that increases of igg responses to rv-a and rv-c species were significantly associated with more severe illness as compared to igg increases to rv-b. the predicta chip thus seems to be suitable not only for identifying the culprit rv species responsible for an exacerbation of respiratory illness by simple serology but also for determining those rv species giving rise to severe symptoms. since we also had results from pcr-based testing of nasal swab samples from the same children we could compare the detection of strain-specific nucleic acid with antibody results. in fact, we found that all children without species-specific antibody increases were also negative in the pcr test. however, there was poor correlation between the pcr results from the nasal samples and the chip-based serological results. furthermore, with the exception of four children (# , # , # , # ) for whom a positive pcr result was obtained at the follow-up visit, all pcr results were negative at both the acute and follow-up visit in approximately % of the children for whom species-specific antibody increases could be clearly demonstrated. several possibilities for the discrepancies may be considered. for example, it may be possible that the time interval after acute infection chosen for serology was not optimal.
however, we have previously investigated the time interval required for the appearance of vp -specific antibody increases in a controlled infection study, and it is therefore very unlikely that the time interval used in our study was too long and responsible for discrepancies between pcr testing and serology. we also do not think that cross-reactivity among strains or original antigenic sin is responsible for the observed differences: we took care to include on the chip a large panel of peptide sequences from several different rv strains, and the results in fig. a clearly show that the serological results obtained with the chip allow a bona fide discrimination of rv-a, rv-b, and rv-c infections. finally, we found that older children recognized more rv peptides from different strains than younger children (fig. c-e), which indicates that the children develop antibodies against new viruses and thus the concept of original antigenic sin does not seem to apply here. one limitation of our study is, however, that we cannot exclude that infections with additional rv strains have occurred in the time window between the first and second blood sampling and thus are responsible for the discrepancy between pcr results and serology. nevertheless, we think that the discrepancy between pcr results and serology is rather due to the fact that not every virus detected by pcr causes an infection with a consecutive immune response. in fact, studies performed in young children report that up to % of asymptomatic subjects have positive pcr results. one more likely possibility for the poor correlation of pcr results and antibody results could be that we used a pcr strategy based on primers specific for vp - and vp -encoding regions of the viral genome, which may be less specific than pcr strategies based on the amplification and sequencing of the vp -encoding region or of the complete rv genome.
in general, nucleic acid-based strategies for virus detection only demonstrate the presence of virus-specific nucleic acid but provide no evidence that an infection has taken place which gave rise to a specific immune response. we therefore think that the predicta chip and future versions of it containing an even larger repertoire of n-terminal vp peptides from more rv strains will be a complementary tool in addition to pcr testing and eventually turn out to be even superior. the antibody test is actually fast and economical: it takes only a few hours and requires only microliter amounts of serum. furthermore, serological analysis is robust and can be easily performed without the need for pcr cyclers and subsequent sequencing. however, more prospective studies will be needed to investigate the diagnostic sensitivity and specificity of the rv chip. the predicta chip may be extremely useful to determine rv species-specific antibody responses in serum samples from existing cohorts world-wide to define the most common and relevant rv species involved in respiratory illness. chip-based measurements will allow exploring in prospective studies the role of rv infections in a variety of respiratory disease exacerbations. for example, it will be possible to discriminate whether asthma exacerbations have been triggered by an rv infection or by allergen exposure because it has been shown that both factors (i.e., rv infections as well as allergen exposure) induce increases of specific antibody responses when serum samples collected at the acute visit and during a follow-up visit are compared. the identification of the culprit factors triggering asthma attacks becomes increasingly important in respiratory medicine due to the availability of selective treatments of allergic asthma such as anti-ige antibodies and a variety of other biologics such as anti-cytokine antibodies targeting different forms of asthma.
furthermore, it will be possible to use the predicta chip to study by serological analysis the possible contribution of rv infections to exacerbations of other respiratory diseases such as copd and aco. further studies are also necessary to repeatedly monitor the presence of rv strains and other respiratory viruses by pcr and the immune reaction by serology at close intervals and for extended periods after exacerbation. the reliable determination of the most common rv species involved in triggering severe respiratory illness will ultimately provide a rational basis for the development of rv vaccines and rv species-targeting therapeutic approaches [ ] [ ] [ ] [ ] [ ] [ ] [ ] . in conclusion, we developed and evaluated a high-resolution antibody assay based on micro-arrayed peptides and recombinant antigens from the most common rv strains to identify antibody signatures discriminating rv infections at the level of different rv species; the assay pointed towards the culprit species responsible for triggering acute pre-school wheezing. the predicta chip has the potential to be useful for a serological global mapping of rv infections, the identification of rv species involved in triggering different forms of severe respiratory illness, and for paving the road for rv-specific therapeutic and prophylactic treatment strategies, such as vaccines.
selection and production of n-terminal vp peptides.
vp amino acid sequences of rv strains representing the three rv species (rv-a: n = ; rv-b: n = ; rv-c: n = ) were retrieved from the ncbi database (https://www.ncbi.nlm.nih.gov/) (supplementary table ). multiple sequence alignments of the vp n-terminal peptides (aa - ) were performed using clustalw software available at the embl-ebi website (http://www.ebi.ac.uk/tools/clustalw ) to determine amino acid sequence identities among the peptides. peptide sequences showing sequence identities greater than . % (i.e.
differences of ≤ aa) were grouped together into clusters (fig. ). sequences among the clusters (supplementary tables and , a -a , b -b , c -c ) were re-aligned using genedoc software (http://iubio.bio.indiana.edu/soft/molbio/ibmpc/genedoc-readme.html) and each amino acid mismatch was analyzed regarding physicochemical properties of the amino acids. this procedure led to re-clustering of the peptides (supplementary table , ai-axviii; bi-bix; ci-ciii). from each re-clustered group one representative rv strain peptide was selected for printing onto the chip (fig. b, supplementary tables and : rv-a: n = ; rv-b: n = ; rv-c: n = ). an enterovirus-derived peptide was also included (fig. ). for the set of peptides to be printed a multiple sequence alignment was performed by clustalw and a phylogenetic tree was constructed by the neighbor-joining (n-j) method using mega software (www.megasoftware.net) (fig. c). the evolutionary distances between sequences were computed using the kimura -parameter model with bootstrap values calculated from replicates. additional peptides spanning the rv vp protein (fig. , supplementary table ) were selected to detect antibodies towards vp epitopes other than the n-terminal portion. the peptides as well as the non-structural b protein from strain (vpg: gpysgepkpksraperrvvtq) were produced by solid-phase synthesis with the -fluorenylmethoxycarbonyl (fmoc) method (cem-liberty, matthews, nc, usa and applied biosystems, carlsbad, ca, usa) on peg-ps preloaded resins (applied biosystems). after synthesis, peptides were washed with dichloromethane, cleaved from the resins using ml trifluoroacetic acid (tfa), . ml silane, and . ml h o and precipitated into pre-chilled tert-butyl methyl ether. peptides were purified by reversed-phase high-performance liquid chromatography in a - % acetonitrile gradient using a jupiter μm proteo Å, lc column (phenomenex, torrance, ca, usa) and an ultimate pump (dionex, sunnyvale, ca, usa) to a purity > %.
their identities and molecular weights were verified by mass spectrometry (microflex maldi-tof, bruker, billerica, ma, usa). for the analysis of antibody responses to rv peptides and proteins, unsupervised k-means clustering (k = ) was used to define clusters of peptides with similar antibody response measurements. the number of clusters k was pre-determined to match the number of peptide homology groups. the peptides' amino acid sequences were aligned with a gap open cost of . and a gap extension cost of . . based on the alignment, a homology distance cladogram was built using the neighbor-joining algorithm and bootstrap replicates. the peptides were then color-coded according to the antibody response cluster to which they belonged. data were processed using the clc genomics workbench (clc, clcbio, qiagen, hilden, germany). the heat map representing rv-specific antibody responses was generated with qlucore omics explorer (qlucore, lund, sweden). expression and purification of recombinant rv proteins. recombinant his-tagged structural (vp - ) proteins from five representative rv strains (rv-a , -a , -a , -b , and -cyp) and mbp fusion proteins containing fragments thereof (vp - ) as well as non-structural ( a, c, a, c, and d) proteins from rv strain were expressed in escherichia coli as previously described. dna sequences coding for the complete genes or fragments thereof (accession numbers are shown in supplementary table ) were codon-optimized for bacterial expression, synthesized with the addition of the ′ sequence coding for a c-terminal hexa-histidine tag and cloned into the ndei and ecori sites of plasmid pet b (genscript, piscataway, nj, usa). transformed e. coli bl (de ) cells (agilent technologies, santa clara, ca, usa) were induced with mm isopropyl-β-thiogalactopyranoside (iptg) and cells were harvested at time-points of maximal expression.
recombinant proteins were purified by nickel-affinity chromatography under denaturing conditions as previously described (qiagen, hilden, germany). refolding of recombinant proteins was achieved by a stepwise dialysis against mm nah po for structural and non-structural proteins and mm tris-hcl, mm nacl, mm edta for mbp fusion proteins, respectively. the purity of recombinant proteins was verified by sds-page followed by coomassie brilliant blue staining and the identity by immunoblotting using a monoclonal mouse anti-his-tag antibody : diluted (cat: dia- , dianova, hamburg, germany). bound antibodies were detected with : diluted alkaline phosphatase-coupled rat anti-mouse igg antibodies (cat: ; bd biosciences, erembodegem, belgium). protein concentrations were determined using bca protein assay kit (thermo fisher scientific, rockford, il, usa). the secondary structure of the proteins was measured by circular dichroism spectroscopy on a jasko j- spectropolarimeter (japan spectroscopic, tokyo, japan) at a protein concentration of . mg/ml in mm nah po . detection antibodies and printing of microarrays. anti-huigg (cat: - - ; jackson immunoresearch laboratories, west grove, pa, usa) and anti-huiga (cat.: ; becton dickinson, franklin lakes, nj) were labeled with dylight (pierce, thermo fisher scientific, rockford, il, usa). customized printing of rv microarrays was done by phadia-thermofisher using immunocap isac (immuno solid-phase allergen chip) technology , . spotting was performed by slow pin mode printing using the aushon printer (aushon, billerica, ma, usa). stock solutions of peptides ( mg/ml) were diluted : in a phosphate buffer, ph . and then used for spotting. antigens were spotted in triplicates on a glass surface coated with an amino-reactive organic polymer, each spot containing - fg of microarray component, corresponding to - attomol. allergens used for the calibration and other control proteins spotted on the microarray are listed in supplementary table . 
cohort of pre-school children with acute wheeze. serum samples examined in this study were from a cohort of pre-school children who had been admitted to the paediatric emergency ward as a result of acute wheeze, at astrid lindgren children's hospital, stockholm, sweden (table ) . this cohort and the genotyping of rv strains in the nasopharyngeal swab samples of of the children by nested pcr and sequencing have been previously described . a molecular diagnostic platform for the rapid detection of respiratory strains was used in of the children. the following respiratory viruses were found: adenovirus: children; bocavirus: children; coronavirus: children; influenza a/b: child; metapneumovirus: children; parainfluenzavirus: children: rsv: children . for of the children nasopharyngeal swabs had been available for the rv pcr targeting vp / sequences. written informed consent was obtained from the parents or by the legal guardians and the study was approved by the regional ethics committee of karolinska institutet, stockholm, sweden. peripheral blood samples had been obtained within h of presentation in the emergency unit and sera were stored at − °c. in addition to blood samples, nasopharyngeal swab samples were obtained at the acute visit and again at the follow-up visit by the research nurse and stored in the biobank at the department of clinical microbiology, karolinska university hospital . follow-up samples were obtained between and weeks after the initial recruitment (median weeks) at a scheduled visit after recovery. although this study was not planned as a prospective study for the assessment of increases of rv-specific antibodies, the time interval of - weeks was suited for this purpose because we found in an earlier study that increase of rv-specific igg responses emerge days after experimental inoculation . 
at the follow-up visit, the guardians also filled out a standardized questionnaire concerning the number of days the child had suffered from respiratory symptoms at home (i.e., 'cough and/or wheeze'), use of medication (i.e., β -agonists, inhaled corticosteroids, leukotriene receptor antagonist), as well as about any emergency visits between the acute and follow-up visits. the chip analysis of the anonymized sera was performed with the approval of the ethics committee of medical university of vienna (ek / ) , . microarray-based determination of antibody profiles. microarrays were washed in a washing buffer (phadia-thermo fisher) for min by stirring. after drying by centrifugation ( min, g, rt), µl of serum samples were applied on each microarray and the slides were incubated for h at gentle rocking (rt). for the detection of rv-specific igg and iga antibodies, serum samples were diluted : and : in a sample dilution buffer (phadia-thermo fisher), respectively. microarrays were then rinsed with washing buffer and washed for min as described above. after centrifugation, µl of fluorescence-labeled antibodies ( µg/ml) was added and the slides were incubated min at gentle rocking (in dark, rt). after further rinsing, washing and drying, microarrays were scanned using a confocal laser scanner (luxscan- k microarray scanner, capital-bio, beijing, people's republic of china) and the image analysis was evaluated by microarray image analyzer v . . software (phadia-thermo fisher) . for calibration and determination of background signals, a calibrator serum (i.e., a pool of allergic patients sera, diluted : ) and sample diluent were included in each analysis run. calibration and variability of the microarray. the phadia microarray image analysis software was used to process images, to calculate the mean fluorescence intensities (fi) of triplicate analyses and to calibrate the results. 
a calibration curve was generated by relating fluorescence intensities obtained by scanning the predicta microarray to allergen-specific antibody levels measured by immunocap. results were reported in isac standardized units (isu). background reactivities of fluorescence-labeled α-huigg antibodies towards all microarray components were determined by testing six replicates of sample diluent alone. to characterize assay variability, one calibrator serum and three normal sera were profiled at : , : , : , and : in order to find a dilution at which the broadest spectrum of reactivity levels was covered. to assess intra-assay (i.e. within experiment) variability, six replicates of four serum samples were used to measure igg reactivities towards all microarray components on the same day. to assess inter-assay (i.e. between experiments) variability, four serum samples were evaluated in experiments conducted on five consecutive days. evaluating four different samples made it possible to determine whether the inter-assay variability was sample-dependent. for each microarray component and each sample, the mean isu-g, standard deviation (sd), and coefficient of variation (cv% = sd/mean isu-g × 100) were calculated across the six replicates and across the five experiments, respectively. microarray components were classified according to the isu-g level (> , and averages across all components within these groups on the array were also calculated for each of these quantities. data analysis. initial data processing was performed with microsoft excel. frequencies (i.e., the number of reactive sera) and intensities (i.e., isu-g levels) of peptide-specific igg and iga antibody levels were calculated using ibm spss statistics (version ; ibm corp., armonk, ny, usa). median values of peptide-specific igg levels (isu-g) were calculated using graphpad prism (la jolla, ca, usa). the cut-off for a positive igg reactivity was set at isu-g.
the numbers of igg-reactive peptides in fig. d , e were calculated for specific igg levels > isu-g. increases of rv-specific antibody responses were calculated as differences between isu-g values measured at the follow-up (f) and the acute (a) visits (Δisu-g = isu-g f − isu-g a ) followed by subtraction of the double coefficient of variation ( × cv%) calculated from the baseline values (isu-g a ) for each of the antigens. for igg levels of - isu-g and > isu-g, % and . % were determined as cv% for intra-assay variability and subtracted from the data, respectively. rv-specific igm antibodies were not measured because we found in an earlier study that no increases of rv-specific igm antibody levels were detectable on days , , and after experimental rv inoculation . statistical analysis. graphpad prism (la jolla, ca, usa) was used to evaluate all statistical parameters. correlation between the number of reactive peptides and age was evaluated by calculating the spearman's rank correlation coefficient (ρ). differences between groups in number of reactive peptides and in number of days the children spent with respiratory symptoms, medication use, or with both were assessed by mann-whitney u-test (two-tailed). values of p < . were considered statistically significant. 
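the variability and increase calculations described above can be sketched in a few lines of python. this is a minimal illustration and not the authors' code: the function names and any numbers passed in are hypothetical, the use of the sample standard deviation is an assumption (the text does not state which estimator was used), and the cv% value that applies to a given isu-g range is left to the caller.

```python
from statistics import mean, stdev


def cv_percent(replicates):
    """Coefficient of variation in percent: 100 * SD / mean.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    m = mean(replicates)
    if m == 0:
        raise ValueError("mean is zero; the coefficient of variation is undefined")
    return 100.0 * stdev(replicates) / m


def corrected_increase(isu_acute, isu_followup, cv_pct):
    """Follow-up minus acute value, reduced by twice the assay CV
    applied to the baseline (acute) value.

    cv_pct is the intra-assay CV% for the relevant ISU-G range
    (hypothetical input; the actual values are given in the paper).
    """
    delta = isu_followup - isu_acute
    noise_margin = 2.0 * (cv_pct / 100.0) * isu_acute
    return delta - noise_margin
```

with this formulation a positive result means the rise between the acute and follow-up visits exceeds twice the measurement variability expected at the baseline level, which is one way to read the subtraction step described in the text.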
data availability. primary data that support the findings of this study are available from the corresponding author on request. this study was funded by predicta, a fp -funded eu project (no.
), by the fwf-funded projects f and p of the austrian science fund, by research grants from biomay ag and viravaxx, vienna, austria, by the swedish research council, the swedish heart-lung foundation, stockholm county council (alf project), the swedish asthma and allergy association's research foundation, the king gustaf v's -year foundation, the centre for allergy research at karolinska institutet and the swedish cancer and allergy foundation. supplementary information accompanies this paper at https://doi.org/ . /s - - - . novartis, faes farma, biomay ag, vienna, austria, hal, nutricia research, menarini, meda, abbvie, msd, omega pharma, danone, grants from menarini, outside the submitted work. r.v. reports grants from european union, grants and personal fees from biomay ag, vienna, austria, grants and personal fees from viravaxx, vienna, austria, during the conduct of the study; grants from austrian science fund (fwf), grants and personal fees from biomay ag, vienna, austria, grants and personal fees from viravaxx ag, vienna, austria, outside the submitted work. in addition, r.v. and k.n. are co-inventors in a patent application (pct/at / ) regarding the rhinovirus diagnosis reported in this paper. the remaining authors declare no competing interests.
key: cord- -rzy mejb authors: duricki, denise a.; drndarski, svetlana; bernanos, michel; wood, tobias; bosch, karen; chen, qin; shine, h. david; simmons, camilla; williams, steven c.r.; mcmahon, stephen b.; begley, david j.; cash, diana; moon, lawrence d.f. title: corticospinal neuroplasticity and sensorimotor recovery in rats treated by infusion of neurotrophin- into disabled forelimb muscles started h after stroke date: - - journal: biorxiv doi: . / sha: doc_id: cord_uid: rzy mejb stroke often leads to arm disability and reduced responsiveness to stimuli on the other side of the body. neurotrophin- (nt ) is made by skeletal muscle during infancy but levels drop postnatally and into adulthood. it is essential for the survival and wiring-up of sensory afferents from muscle. we have previously shown that gene therapy delivery of human nt into the affected triceps brachii forelimb muscle improves sensorimotor recovery after ischemic stroke in adult and elderly rats. here, to move this therapy one step nearer to the clinic, we set out to test the hypothesis that intramuscular infusion of nt protein could improve sensorimotor recovery after ischemic cortical stroke in adult rats. to simulate a clinically-feasible time-to-treat, twenty-four hours later rats were randomized to receive nt or vehicle by infusion into triceps brachii for four weeks using implanted minipumps.
nt increased the accuracy of forelimb placement during walking on a horizontal ladder and increased use of the affected arm for lateral support during rearing. nt also reversed sensory deficits on the affected forearm. there was no evidence of forepaw sensitivity to cold stimuli after stroke or nt treatment. mri confirmed that treatment did not induce neuroprotection. functional mri during low threshold electrical stimulation of the affected forearm showed an increase in peri-infarct bold signal with time in both stroke groups and indicated that neurotrophin- did not further increase peri-infarct bold signal. rather, nt induced spinal neuroplasticity including sprouting of the spared corticospinal and serotonergic pathways. neurophysiology showed that nt treatment increased functional connectivity between the corticospinal tracts and spinal circuits controlling muscles on the treated side. after intravenous injection, radiolabelled nt crossed from bloodstream into the brain and spinal cord in adult mice with or without strokes. our results show that delayed, peripheral infusion of neurotrophin- can improve sensorimotor function after ischemic stroke. phase i and ii clinical trials of nt (for constipation and neuropathy) have shown that peripheral, high doses are safe and well tolerated, which paves the way for nt as a therapy for stroke. ischemic stroke occurs in the brain when blood flow is restricted, causing brain cells to die rapidly. movements on the opposite side of the body are frequently affected . stroke victims also often exhibit lack of responsiveness to stimuli on their affected side. the w.h.o. estimates that, worldwide, there are million stroke survivors, with another million new strokes annually. the vast majority of stroke victims are not eligible for the few therapies that improve outcome because they arrive in hospital too late for reperfusion to be effective . 
treatment six hours or more after ischemic stroke is usually limited to rehabilitation: therapies that reverse sensory impairments and locomotor disability are urgently needed, and these must work when initiated many hours after stroke. neurotrophin- is a growth factor which plays a key role in the development, and function of locomotor circuits that express nt receptors, including descending serotonergic and corticospinal tract (cst) axons and afferents from muscle and skin that mediate proprioception and tactile sensation [ ] [ ] [ ] . however, peripheral levels of nt drop in the postnatal period . we and others had shown that delivery of nt into the cns promotes recovery in rodent models of spinal cord injury [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] but this involved invasive routes of delivery (e.g., intraspinal injection or intrathecal infusion) or gene therapy. we also recently showed that injection of an adeno-associated viral vector (aav) encoding full-length human nt (prepront , kda) into forelimb muscles hours after stroke in adult or elderly rats improved sensorimotor recovery . we had originally expected that aav would be trafficked from muscle to the spinal cord retrogradely in axons and that this would enhance secretion of nt by motor neurons, leading to sprouting of the spared cst , , and sensorimotor recovery. although nt protein was overexpressed in injected muscles, to our surprise we found little evidence for expression of the human nt transgene in the spinal cord or cervical drgs , using this dose and preparation of aav. this serendipitous result led us to reject our original assumption that sensorimotor recovery required expression of the human nt transgene in the spinal cord and to wonder whether peripheral infusion of nt protein would suffice. accordingly, here, we test the hypothesis that infusion of the mature form of the nt protein ( kda) into disabled forelimb muscles improves sensorimotor recovery. 
this is consistent with work by others including a study showing that a signal from muscle spindles can improve neuroplasticity of descending pathways and can enhance recovery after cns injury . notably, nt protein is synthesised by muscle spindles and can be transported from muscle to sensory ganglia and spinal motor neurons in nerves , , and from the bloodstream to the cns [ ] [ ] [ ] . this route of administration and time frame is clinically feasible so to take this potential therapy one step nearer the clinic, we next set out to determine whether intramuscular infusion of human nt protein (mature form, kda) would improve outcome after stroke (i.e., bypassing the use of gene therapy and spinal surgery). importantly, the mature form of the nt protein has excellent translational potential: phase i and ii clinical trials have shown that repeated, systemic, high doses of nt protein are well-tolerated, safe and effective in more than humans with sensory and motor neuropathy (charcot-marie-tooth type a) or constipation including in people with spinal cord injury [ ] [ ] [ ] [ ] [ ] . in contrast to other neurotrophins, nt does not cause any serious adverse effects such as pain probably because its principal high affinity receptor trkc is not expressed on adult nociceptors , . these studies pave the way for nt as a therapy for stroke in humans. we now show in a blinded, randomized preclinical trial that treatment of disabled upper arm muscles with human nt protein reverses sensory and motor disability in rats when treatment is initiated in a clinically-feasible timeframe ( hours after stroke). rats received unilateral focal cortical stroke or underwent sham surgery , (fig. a,b) . twentyfour hours after stroke, rats were allocated to treatment using nt or vehicle, infused into affected triceps brachii muscles for one month via implanted catheters and subcutaneous osmotic minipumps. 
experiments were performed in accordance with guidelines from the stroke therapy academic industry roundtable (stair) and others and our findings were reported in accordance with the arrive (animals in research: reporting in vivo experiments) guidelines. all surgical procedures, behavioural testing and analysis were performed using a randomised block design. all surgeries, behavioural testing and analysis were performed with investigators blinded to treatment groups. rats were randomised to surgery by drawing a rat identity number from an envelope and then a stroke/sham allocation from an envelope. allocation concealment was performed by having nt and vehicle stocks coded by an independent person prior to loading pumps. behavioural testing was conducted blind and codes were only broken after behavioural analyses were complete. lister hooded (~ months; - g) outbred female rats (charles river, uk) and adult c bl/ mice ( - weeks) were used. all procedures were in accordance with the uk home office guidelines and animals (scientific procedures) act of . rats were maintained (specific pathogen free) in groups of to in plexiglas housing with tunnels and bedding on a : hour light/dark cycle with food and water ad libitum. focal ischemic stroke was induced in the hemisphere representing the dominant forelimb ( supplementary fig. ), as determined by the cylinder behavioural test. stroke lesions (n = ) were performed as previously described . briefly, animals were transferred to a stereotaxic frame (david kopf instruments, usa) where a midline incision was made, the cortex was then exposed via craniotomy using the following co-ordinates [defined as anterioposterior (ap), mediolateral (ml)]: ap mm to − mm, ml mm to mm, relative to bregma. endothelin- (et- , pmol/µl in sterile saline; calbiochem) was applied using a glass micropipette attached to a hamilton syringe. 
µl of et- was applied to the overlying dura to reduce bleeding, and immediately thereafter, the dura mater was incised and reflected. four µl volumes of et- were administered topically and four µl volumes were microinjected intracortically (at a depth of mm from the brain surface) at the following co-ordinates (from bregma and midline, respectively): ap + . mm, ml . mm; ap + . mm, ml . mm; ap + . mm, ml . mm; ap - . mm, ml . mm. temperature was maintained using a rectal probe connected to a homeothermic blanket (harvard apparatus, usa) placed under the animal which maintained rectal temperature at ± °c. prior to suturing, the animal was left undisturbed for minutes. modifying previous work, the skull fragment was then replaced and sealed using bone wax (covidien, uk). % ( / ) of rats survived this stroke surgery. sham-operated (n = ) rats received all procedures up to, but not including, craniotomy or endothelin- injection. animals were given buprenorphine ( . mg/kg, subcutaneously) for postoperative pain relief. our method of inducing stroke with et- is advantageous for evaluating regenerative stroke therapies for four reasons: (i) our model produces ischemic lesions that model small focal human strokes rather than larger "malignant" strokes that tend to be fatal in humans; (ii) our model targets sensorimotor cortex; (iii) our stroke model involves only low mortality rates and has reasonable reproducibility with a proven ability to detect therapies that induce neuroplasticity and functional recovery; (iv) our model causes sustained sensorimotor deficits (e.g., impaired use of limbs) which are common neurological symptoms of human stroke. structural images were obtained prior to stroke, hours after stroke and at one and eight weeks after stroke. mr imaging was conducted on a tesla (t) horizontal bore vmris scanner (varian, palo alto, ca, usa). animals were anaesthetised using . % isoflurane, in . l/min medical air and . l/min medical oxygen in an induction chamber.
once anaesthetised they were secured in a stereotaxic head frame inside the quadrature birdcage mr coil ( mm internal diameter, id) and placed into the scanner. each animal's physiology was supervised throughout the procedure using a respiration monitor (biopac, usa) and a pulse oximetry sensor (nonin, usa) that interfaced with a pc running biopac. additionally, an mri compatible homeothermic blanket (harvard apparatus, usa) placed over the animal responded to any alterations in the body temperature identified by the rectal probe, and maintained temperature at ± °c. the t weighted mr images were acquired using a fast spin-echo sequence: effective echo time (te) ms, repetition time (tr) ms, field of view (fov) mm x mm and an acquisition matrix x , acquiring x mm thick slices in approximately minutes. at the end of the study (to avoid affecting blinding or randomization), lesion volumes at the hour time point were measured using a semi-automatic contour method in jim software under blinded conditions (xinapse systems ltd). functional magnetic resonance imaging (fmri) was performed in a subset of rats that did not receive intracortical injections of bda tracer (n = /group). images were acquired prior to stroke and at one and eight weeks after stroke during non-noxious somatosensory stimulation of the affected or less-affected wrist . this involves delivery of small electrical currents to a wrist whilst the subjects were kept anaesthetised using medium dose alpha-chloralose suitable for recovery and longitudinal (repeated) imaging . alpha chloralose anesthesia was prepared by mixing equal amounts of borax decahydrate (sigma) and alpha chloralose-pestanal (analytical standard, < % beta isoform, , sigma, uk) in physiological saline each at a concentration of mg/ml in a glass beaker at °c prior to filtering using a . µm filter. rats were first anaesthetised using % isoflurane, . l/min medical air and . l/min oxygen. 
a tail cannulation was performed and the animal was transferred to the mri machine. a bolus of mg/kg alpha chloralose-pestanal was injected intravenously and then the isoflurane was switched off after minutes. an infusion line for continuous application of alpha chloralose was then attached to the cannula at mg/kg/h over the experimental time. medical air ( . l/min) and oxygen ( . l/min) was continuously delivered throughout the scanning period . mr images were obtained using a tesla scanner (varian, agilent). initially, a t -weighted structural scan was acquired, using a fast spin echo (fse) sequence with repetition time (tr) = ms, echo time (te) = ms and field of view (fov) of x mm, yielding slices with voxel size of . x . x mm in approximately min. fmri scans were acquired using a multi-gradient-multi-echo sequence (tr = ms, tes = , , ms, voxel size . mm x . mm x mm, resolution x x , scan time s). volumes were acquired with a pseudo-random onoff stimulation of the forepaw at hz ( μs, ma pulse) using a platinum subdermal needle electrode (f-e , grass technologies, usa) and a tens (transcutaneous electrical nerve stimulation) pad. it has previously been shown that the use of a tens pad results in this intensity of stimulation being innocuous rather than noxious: whereas blood pressure is not altered at ma stimulation, it is increased with . ma stimulation (see references in ). the order of paw stimulation was also randomised. animals were closely monitored following the end of the scanning, given ml of saline (subcutaneous, room temperature) and were kept in a warmed incubator ( deg c) until fully conscious: this takes to hours due to slow pharmacokinetics of alpha chloralose. two animals died following alpha chloralose anaesthesia due to breathing difficulties in recovery. scans with obvious imaging artefacts were discarded, leaving final group numbers of n= , , and n= , , at weeks , and for nt and vehicle treated groups respectively. 
the resulting images were analyzed with spm- (statistical parametric mapping, fil, ucl). in order to make sure all lesions were on the same side of the brain, images with right-hand side strokes were rotated about the sagittal mid-plane, so that the lesioned hemisphere always appeared on the left. functional scans were initially realigned to the first image in the time-series in order to correct for movements of the head. the first volume of the functional scan was then spatially registered to the structural image, which was, in turn, linearly warped to a template brain. linear warping was used in this step in order to avoid deforming the lesion region. warping parameters obtained during registration of structural image to template were applied to the realigned functional time-series, resulting in structural and functional images that are all in a standard space. finally, functional images were smoothed using a gaussian kernel with full width at half maximum of . x . x mm (twice the voxel size). because of the relatively long effective tr of the functional images, a pet basic model (one-sample t-test) was used for first-level analyses with covariates consisting of the pseudo-random stimulation pattern (paradigm), and the estimated movement parameters of each individual rat. volume signal intensity was globally scaled and individual masks, generated from the fast spin echo (fse) structural scan for each rat brain at each time-point using a d pulse-coupled neural network (also registered to the template), were used as explicit masks for the first-level statistical analysis. contrast images from the first-level analysis were then carried forward to a second-level (random effects) group analysis. effects of group (i.e. nt or vehicle treated) and stimulated paw (i.e. affected or less-affected) were used to create statistical comparisons. a flexible factorial analysis was used to compare the difference between the nt and vehicle treated groups in the change from baseline to weeks .
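the smoothing step described above sets the kernel width to twice the voxel size. as a minimal sketch of that arithmetic, the snippet below derives a per-axis full width at half maximum (fwhm) from a voxel size and converts it to the gaussian standard deviation using the standard relation fwhm = 2·sqrt(2·ln 2)·sigma. the function names and the example voxel size are assumptions made for illustration only; they are not the acquisition values, and this is not the spm code used in the study.

```python
import math


def smoothing_fwhm(voxel_size_mm, factor=2.0):
    """Return the per-axis FWHM (mm) as a multiple of the voxel size."""
    return tuple(factor * v for v in voxel_size_mm)


def fwhm_to_sigma(fwhm):
    """Convert a Gaussian FWHM to its standard deviation (same units)."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


# Hypothetical voxel size in mm (x, y, z); NOT the values used in the study.
example_voxel = (0.5, 0.5, 1.0)
fwhm = smoothing_fwhm(example_voxel)
sigmas = tuple(fwhm_to_sigma(f) for f in fwhm)
print(fwhm, sigmas)
```

spm itself takes the kernel as an fwhm, so the sigma conversion is only needed when the same smoothing is reproduced with tools that expect a standard deviation.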
statistical parametric maps were generated using an uncorrected threshold of p < . ; images show group mean activations and t values are given. hours after stroke (immediately after mri), rats were allocated to treatment using a randomized block design. allocation concealment was performed by having nt and vehicle stocks coded by a third party. rats were anaesthetised as above and a small incision was made between elbow and axilla, and a small subcutaneous space was formed to the lower back. the osmotic pump with the catheter attached was positioned in this subcutaneous space and then an ultrafine, flexible catheter was implanted approximately millimetres into the proximal end of the long head of the triceps brachii muscle on the disabled side and this was sutured in place (prolene / , ethicon, uk). triceps brachii was selected as the site for infusion because this large muscle is involved in forelimb extension during walking and in postural support during rearing. note that in the triceps brachii, the end plates are located in the belly of the muscle . the catheter was made from stretchable and flexible silicone tubing (id = . inches, wall thickness = . inches, trelleberg, sf medical, uk, sfm - ) attached to the osmotic pump via larger glued-on tubing (id . mm, vwr international). a second section of this stiff tubing ( millimetres long) was inserted to guide the flexible catheter into the triceps; the guide was slid back after the silicone tube was implanted. catheters were connected to subcutaneous osmotic minipumps ( ml , alzet) containing either vehicle ( . % saline containing . % bovine serum albumin; sigma; a ; - - ) or vehicle containing recombinant human nt (kind gift of genentech inc., usa). pump flow moderators were mr-compatible (peek micro medical tubing, durect, # ). the original vials contained . mg/ml recombinant human ("rhu") nt in mm acetate, mm nacl, ph . .
sds page and proteomic analysis indicates that this is the amino acid mature nt protein obtained after proteolytic cleavage of the amino acid proneurotrophin- (uniprot p ) ( supplementary figures and ). the nt dose ( µg/ hours) was selected based on previous experiments . pumps were replaced after two weeks and removed after four weeks. skin was sutured and analgesic administered as above. all rats survived this surgery. sham rats did not undergo this surgery. pilot experiments showed that the pump flow rate ( µl/hour) was sufficient to deliver substances ( . % saline containing % fast green, sigma) to the entire volume of the triceps muscles. rats (n = /group) were terminally anesthetised ( weeks following stroke, before pumps were removed) and triceps brachii and c spinal cord were rapidly dissected and snap frozen in liquid nitrogen prior to storage at - °c. tissue was homogenised in ice cold lysis buffer containing mm nacl, mm tris-hcl (ph . ), % np , % glycerol, mm phenylmethanesulfonylfluoride, μg/ml aprotinin, μg/ml leupeptin and . mm sodium vanadate, using approximately times the volume of buffer to the wet weight of tissue ( µl/mg tissue). protein content was measured and nt elisa was carried out according to manufacturer's instructions (emax, promega). n.b., promega nt elisa kits are no longer available. however, it was recently discovered that when performing elisa using other nt kits (r&d using "reagent diluent" and abcam using diluent a), measurements of nt from skeletal muscle lysates do not provide reliable quantitative data. this is due to so-called "matrix effects" as shown by poor recovery of spiked-in nt (< % or > %) and non-linear relationship between concentration of input material and estimated nt concentration based on a dilution series of muscle homogenate. to overcome the effect of interfering substances, samples should be diluted and appropriate diluents to prepare standards and diluted samples must be used. 
the abcam kit used with diluent b (not a), however, provided reliable quantitative results (dr. aline barroso spejo, unpublished results). we assessed sensory and motor deficits after stroke using the cylinder test (to assess postural support by forelimbs during rearing), adhesive patch test (to assess responsiveness to tactile stimuli), horizontal ladder (to assess forelimb and hindlimb skilled locomotion), a grip strength test and a test used to monitor unusual responses to cold stimuli (cold allodynia) , . all behavioural testing was carried out by an experimenter blinded to surgery and treatment groups. rats were handled and trained for three weeks on the horizontal ladder before the study began. preoperative baseline scores for the horizontal ladder, the vertical cylinder and the grip strength test were collected one week before surgery. the "adhesive patch" test was used to measure ) the time taken to contact stimuli on the wrists, ) the time taken to remove stimuli from the wrists, and ) the magnitude of lack of responsiveness to stimuli on the affected wrist , , , . for each trial, a round adhesive patch ( mm diameter, ryman) was applied to each wrist on the dorsal side and the animal was returned to its home cage. two times were recorded for both forepaws: ( ) contact and ( ) remove; where "contact" represents the time taken for the animal to contact an adhesive patch with its mouth, and "remove" represents the time taken for the animal to remove the first adhesive patch from its wrist. to determine whether the rats preferentially removed a sticker from their less-affected wrist before their more-affected wrist, the order and side of label removal was recorded. this was repeated four times per session until a > % preference had been found; if this was not the case a fifth trial was conducted. the magnitude of asymmetry was established using the seven levels of stimulus pairs on both wrists as previously described (figure a) . 
from trial to trial, the size of the stimulus was progressively increased on the affected wrist and decreased on the less affected wrist by an equal amount ( . mm ), until the rat removed the stimulus on the affected wrist first (reversal of original bias). the higher the score, the greater the degree of somatosensory impairment. walking: to assess impairments in forelimb and hindlimb function after stroke, rats were videotaped weekly as they crossed a horizontal ladder ( m) with irregularly spaced rungs ( to cm spacing, changed weekly), times per session. any slight paw slips, deep paw slips and complete misses were scored as errors. the mean number of errors per step was calculated for each limb for each week (foot faults are routinely normalized "per step" after stroke, although analysis of foot fault data with or without normalization led to the same conclusions being drawn). the cylinder test was used to assess asymmetries in forelimb use for postural support during rearing within a transparent cm diameter and cm high cylinder . an angled mirror was placed behind the cylinder to allow movements to be recorded when the animal turned away from the camera. during exploration, rats rear against the vertical surface of the cylinder. the first forelimb to touch the wall was scored as an independent placement for that forelimb. subsequent placement of the other forelimb against the wall to maintain balance was scored as "both." if both forelimbs were simultaneously placed against the wall during rearing, this was scored as "both." a lateral movement along the wall using both forelimbs alternately was also scored as "both." scores were obtained from a total number of full rears to control for differences in rearing between animals.
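the cylinder scoring rules above yield, for each session, a sequence of placements that are each either independent (one forelimb) or "both". a minimal sketch of tallying such a sequence is given below; the label names and the example sequence are hypothetical choices for this illustration, and the snippet assumes the placements have already been scored from video by the blinded experimenter.

```python
from collections import Counter

# Label names are hypothetical; they stand for an independent placement of
# the affected forelimb, of the less-affected forelimb, or a "both" placement.
VALID_LABELS = {"affected", "less_affected", "both"}


def tally_placements(labels):
    """Count scored wall placements per category for one session."""
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unrecognised labels: {sorted(unknown)}")
    counts = Counter(labels)
    # Report every category, including those never observed.
    return {label: counts.get(label, 0) for label in sorted(VALID_LABELS)}


# Made-up example sequence, not data from the study.
scored = ["affected", "both", "less_affected", "both", "less_affected"]
print(tally_placements(scored))
```

keeping the three categories as explicit counts means the same tallies can be reused for any downstream summary of forelimb use.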
once scores had been acquired, forelimb asymmetry was calculated using the formula: × (ipsilateral forelimb use + / bilateral forelimb use)/total forelimb use observations (hsu and jones ). figure a ; mjs technology, uk). both the affected and less affected forelimb strength were each measured (simultaneously) at baseline, week , week and week following stroke. a pair of force transducers were used in parallel to measure the peak force achieved by a rat's forelimbs as its bilateral grip was broken by the experimenter gently pulling the rat by the base of the tail horizontally away from the transducer. the average of three strength readings was noted down per session and an average taken for both arms. the difference in grip strength was taken by subtracting the affected forepaw grip strength from the less affected grip strength. the presence or absence of cold allodynia was assessed using standard methods . rats were placed in a transparent cylinder ( cm diameter, cm height) atop a mesh wire floor. a drop ( µl) of acetone was placed against the centre of the forepaw. in the following s after acetone application the rat's response was monitored. if the rat did not withdraw, flick or lick its paw within this s period then no response was recorded for that trial ( , see below). however, if within this s period the animal responded to the cooling effect of the acetone, then the animal's response was assessed for an additional s, a total of s from initial application. the reasons for taking a longer period of time to assess the evoked behaviour were to measure only pain-related behaviour evoked by cooling and not startle responses that can occur following the initial application of acetone . moreover, the behaviour evoked by acetone is often an interrupted series of behaviours, thus it is important to give enough time to see all pain-related behaviours. 
responses to acetone were graded to the following -point scale: , no response; , quick withdrawal, flick or stamp of the paw; , prolonged withdrawal or repeated flicking of the paw; , repeated flicking of the paw with licking directed at the ventral side of the paw. acetone was applied alternately three times to each paw and the responses scored categorically. cumulative scores were then generated by adding the scores for each rat together, the minimum score being (no response to any of the six trials) and the maximum possible score being per forepaw. to visualize uninjured cst axons, six weeks after stroke, % biotinylated dextran amine (bda; , mw, invitrogen) in pbs (ph . ) was microinjected unilaterally into the uninjured sensorimotor cortex. animals were placed in a stereotaxic frame and six burr holes were made into the skull at the following coordinates (defined as anterioposterior (ap), mediolateral (ml): ) ap: + mm, ml: . mm; ) ap: + . mm, ml: . mm; ) ap: + . mm, ml: . mm; ) ap: + . mm, ml: . mm; ) ap: + . mm, ml: . mm; ) ap: - . mm, ml: . mm, relative to bregma. at each site, . μl injections of bda ( % in pbs) were delivered using a glass micropipette attached to a hamilton syringe inserted mm from the skull surface and delivered at a rate of . μl/min. animals were subsequently left for weeks before being perfused. tract tracing was not performed in rats that were to undergo functional mri or neurophysiology. as described below, we recorded from the ulnar nerve on the affected side and stimulated the ipsilateral median nerve or, in the pyramids, the corticospinal tract corresponding to the affected or less-affected hemisphere. at the end of the study, rats ( rats per group) were anaesthetised with an intraperitoneal injection of . g/kg urethane (sigma-aldrich). the rat was kept at °c with a homeothermic blanket system and rectal thermometer probe. tracheotomy was performed and a tracheal cannula inserted. 
the pyramids were then exposed ventrally by blunt dissection and removal of a small area of bone. the brachial plexus of the affected forelimb was exposed from a ventral approach by dissecting the pectoralis major. the ulnar and median nerves were dissected free from surrounding connective tissue and cut distally (to prevent twitches of target muscles). skin flaps from the incision formed a pool, which was filled with paraffin oil. the median and ulnar nerves supply flexor muscles in the forearm, wrist and hand. stimulation of afferents in the median nerve can generate responses in the ulnar nerve motor neurons. the proximal segment of each nerve was mounted on a pair of silver wire hook electrodes (with > cm separation). electrical stimuli of increasing amplitude from µa to µa, in μa steps, (single µs square wave pulse at . hz) were delivered from a constant current stimulator (nl a neurolog, digitimer) to the proximal segment of the cut median nerve. ulnar nerve responses to each stimulus were recorded from the pair of silver wire hook electrodes connected to a differential pre-amplifier and amplifier (digitimer) coupled via a powerlab (ad instruments) interface to a personal computer running labchart and scope software (ad instruments). an average of sweeps at µa was calculated online for each nerve and used to find the difference in amplitudes of monosynaptic reflexes evoked by median nerve stimulation. this was achieved using software to calculate the absolute integral of any response between . ms and ms, regardless of whether a response was observed qualitatively. cst stimulation experiment with ulnar recordings: ulnar nerve recordings were obtained during stimulation of each pyramid in turn. the concentric bipolar stimulation electrode (fhc cbbpc ) was located mm lateral to the midline and gently lowered through the pyramid up to a maximum depth of . mm while stimulating at μa ( pulses, pulse width µs; frequency hz). 
at the electrode location providing maximal ulnar nerve response, stimuli of increasing intensity were applied in the range of µa to µa, in µa increments. five sweeps were captured at each stimulus intensity. the number of spikes % greater than the noise, and falling between and ms, was calculated for each sweep. the average number of spikes for sweeps at each amplitude was calculated, and the difference in the number of spikes elicited by stimulation of the pyramids from the lesioned or contralesional hemisphere was then determined. furthermore, the signal was rectified and the area under the curve was measured between and ms for each sweep and averaged for the sweeps at each intensity. each parameter was analysed using two-way repeated measures anova. graphs show mean and standard error of the mean for the area under the curve for stimuli given at µa. eight weeks after stroke surgery and two weeks after injection of bda, rats were terminally anesthetized with sodium pentobarbital ( mg/kg; i.p.) and perfused transcardially with pbs for minutes, followed by ml of % paraformaldehyde in pbs for minutes. the brain, c -c spinal cord, c and c drgs and both arms were carefully dissected and stored in % paraformaldehyde in pbs for hours and then transferred to % sucrose in pbs and stored at - °c. spinal cord segments c and c were embedded in oct and μm transverse slices were cut using a freezing stage microtome (kryomat; leitz, germany). ten series of sections were collected and stored in tbs/ . % azide ( mm tris, mm nacl, . mm nan , ph . ) at °c. cst axons were counted that crossed the midline, at two more lateral planes and at an oblique plane (figure a ) at c and c . for each rat, we estimated the number of cst axons per cord segment by calculating the average number of cst axons per section and then multiplying by a scaling factor (number of sections cut per segment).
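the per-segment axon estimate described above is a mean count per section scaled by the number of sections cut per segment. a minimal sketch of that calculation follows; the function name and the example counts are assumptions made for illustration and are not values from the study.

```python
def estimate_axons_per_segment(counts_per_section, sections_per_segment):
    """Estimate axons per cord segment.

    counts_per_section: axon counts from the sections that were examined.
    sections_per_segment: scaling factor, i.e. the number of sections cut
        from the whole segment.
    """
    if not counts_per_section:
        raise ValueError("at least one counted section is required")
    mean_per_section = sum(counts_per_section) / len(counts_per_section)
    return mean_per_section * sections_per_segment


# Made-up counts from three sections and a made-up scaling factor.
print(estimate_axons_per_segment([4, 6, 5], 100))
```

the estimate is linear in the scaling factor, so any uncertainty in the number of sections cut per segment carries through proportionally.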
the total length of serotonergic processes was measured using a standard method designed specifically to measure serotonergic sprouting after neurotrophin treatment (see refs in ) and which is well suited for quantification of dense terminal arbors (e.g., in the dorsal horn of the spinal cord). processes were identified using the "adjust threshold" function in imagej and fiber lengths were measured in three areas: the dorsal horn, intermediate grey and ventral horn (fig. a ) in sections per rat. we calculated the ratio of the sides ipsilateral and contralateral to nt treatment for the three areas separately. immunofluorescence was visualized under a zeiss imager.z microscope or a confocal zeiss lsm laser scanning microscope. photographs were taken using the axio cam and axiovision le rel. . or the lsm image browser software for image analysis. nt protein was radiolabelled with µci ( . mbq) n-succinimidyl [ , - h]propionate ( h-nsp) and separated from unbound h-nsp using an Äktaprime purification system using a modification of a previous method the same treatment was repeated for mice h after stroke, with an incubation period of mins (n = ). in this set of experiments, . mbq of c sucrose (vascular marker) was injected towards the end of the incubation and the brain tissue samples also taken for capillary depletion analysis to distinguish nt or albumin in vascular endothelial cells from that in brain parenchyma. in brief, brain tissue was homogenized in physiological buffer ( µl per mg of tissue) and % dextran ( µl per mg of tissue) as described previously . the homogenate was subjected to density gradient centrifugation ( , × g for min at °c) to give an endothelial cell-enriched pellet and a supernatant containing the brain parenchyma and interstitial fluid (isf). the homogenate, pellet and supernatant samples were solubilized and counted as described above. 
distribution volume, vd, was calculated for all samples, including the endothelial pellet and brain parenchyma (isf). the values were corrected for c sucrose. data was analysed for capillary fraction, parenchyma and whole brain using one way anova and post hoc (bonferroni) t-tests. ug of protein was subjected to denaturing or non-denaturing sds page and visualised using colloidal coomassie brilliant blue staining. each band was excised separately, digested enzymatically (with trypsin) and subjected to lc/ms/ms analysis (dr. steve lynham, proteomics facility, kcl). in-gel reduction, alkylation and digestion with trypsin were performed prior to subsequent analysis by mass spectrometry. cysteine residues were reduced with dithiothreitol and derivatised by treatment with iodoacetamide to form stable carbamidomethyl derivatives. trypsin digestion was carried out overnight at room temperature after initial incubation at o c for hours. lc/ms/ms: peptides were extracted from the gel pieces by a series of acetonitrile and aqueous washes. the extract was pooled with the initial supernatant and lyophilised. each sample was then resuspended in l of mm ammonium bicarbonate and analysed by lc/ms/ms. chromatographic separations were performed using an ultimate lc system (dionex, uk). peptides were resolved by reversed phase chromatography on a m c pepmap column using a three step linear gradient of acetonitrile in . % formic acid. the gradient was delivered to elute the peptides at a flow rate of nl/min over min. the eluate was ionised by electrospray ionisation using a z-spray source fitted to a qtof-micro (waters corp.) operating under masslynx v . . the instrument was run in automated data-dependent switching mode, selecting precursor ions based on their intensity for sequencing by collision-induced fragmentation. the ms/ms analyses were conducted using collision energy profiles that were chosen based on the mass-to-charge ratio (m/z) and the charge state of the peptide. 
database searching: the mass spectral data was processed into peak lists using proteinlynx global server v . . with the following parameters: (ms survey -no background subtraction, sg smoothing iterations channels, peaks centroided (top %) no de-isotoping; ms/ms -no background subtraction, sg smoothing iterations channels, peak centroiding (top %) no de-isotoping). the peak list was searched against the uniprot database using mascot software v . using the following parameter specifications (precursor ion mass tolerance . da; fragment ion mass tolerance . da; tryptic digest with up to three missed cleavages; variable modifications: acetyl (protein n-term), carbamidomethylation (c), gln->pyro-glu (n-term q) and oxidation (m). lc/ms/ms analysis and interrogation of the data against the uniprot database identified nt from the excised and digested d gel bands. the results of the analysis and database searches are given in supplementary figure . database generated files were uploaded into scaffold (v . ) software (www.proteomesoftware.com) to create the .sfd file (pr lm d gel ). all samples were aligned in this software for easier interpretation and used to validate ms/ms based peptide assignments and protein identifications. peptide assignments were accepted if they contained at least two unique peptide assignments and were established at % identification probability by the protein prophet algorithm . the result table includes probability scores (mowse) for each peptide identified from the protein sequence. the threshold identity score corresponds to a % chance of incorrect assignment. peptides identified below these probabilities were accepted following manual inspection of the raw data to ensure that fragment ions correctly match the assigned sequence. the sequence coverage for each identified protein is represented in supplementary figure in yellow highlights. statistical analyses were conducted using spss (version . ). 
graphs show means ± sems (except where otherwise stated) and 'n' denotes number of rats. asterisks (*,**,***) indicate p≤ . , p≤ . and p≤ . , respectively. threshold for significance was . . histology and molecular biology data were assessed using kruskal-wallis and mann-whitney tests (due to small sample sizes). serotonergic fibre lengths were analysed by region using one way anova and post hoc (bonferroni) t-tests. pkcγ data were analysed using kruskal-wallis and mann-whitney tests. behavioural and mri data were analysed using linear models and restricted maximum likelihood estimation to accommodate data from rats with occasional missing values . akaike's information criterion showed that the model with best fit for the horizontal ladder data had a compound symmetric covariance matrix, whereas for the sensory test and mri data an unstructured covariance matrix was used. the model with best fit for the vertical cylinder had a compound symmetric covariance matrix, according to the - restricted log likelihood information criterion. baseline scores were used as covariates. degrees of freedom are reported to nearest integer. normality was assessed using histograms. t-tests were two-tailed unless otherwise specified. sample size calculations were presented previously . magnetic resonance imaging (mri) confirmed that infarcts included the forelimb and hindlimb areas in sensorimotor cortex (fig c) . there was no difference in the mean infarct volume between stroke groups at h, one or eight weeks after stroke (fig. d) . loss of cst axons was assessed at weeks in the upper cervical spinal cord using protein kinase c gamma (pkcγ) immunofluorescence , (fig. e) . stroke caused a % loss of cst axons in the dorsal columns relative to shams (fig. f) with no difference between vehicle and nt treated rats.
together, the mri and pkcγ histology data indicate that there were no confounding pre-treatment differences in mean infarct volumes and that nt did not act as a neuroprotective agent, as expected, based on our previous results and given that treatment was initiated after the majority of cell death will have occurred. we used the "adhesive patch" test to assess forepaw somatosensory function. a sensory score was obtained by attaching pairs of adhesive patches to each rat's wrist on the dorsal side (fig. a) : a high score (e.g. ) denotes that a rat preferentially removed the smaller stimulus from their less-affected wrist (i.e., did not first remove the larger stimulus on their affected wrist). the two stroke groups exhibited a similar lack of responsiveness to stimuli on their affected wrists after one week (fig. b) . delayed treatment with nt caused recovery compared to vehicle, whereas vehicle-treated stroke rats showed a deficit relative to sham rats that persisted for eight weeks. importantly, there were no confounding differences in the time taken to contact or to remove a patch from either their less-affected or affected paw: after stroke, nt -treated rats and vehicle-treated rats took longer to contact an adhesive patch relative to shams, but there was no difference between nt and vehicle treated rats (supplementary fig. a) . moreover, neither stroke nor nt treatment caused any deficit in the additional time taken after contact to remove the patch (supplementary figure b) . thus, delayed treatment of disabled forelimb muscles with nt improved responsiveness to tactile stimuli after ischemic stroke. walking was assessed using a horizontal ladder with irregularly spaced rungs (fig. c) . accurate paw placement during crossing requires proprioceptive feedback from muscle spindles . after one week, the two stroke groups made a similar number of errors with their affected forelimb when crossing a horizontal ladder (fig. d) .
delayed nt treatment caused a progressive recovery after stroke whereas vehicle treated animals remained persistently impaired until the end of the study. this is consistent with previous work from our lab , . stroke also caused a modest unilateral hindlimb impairment on the ladder; infusion of nt into the forelimb triceps brachii did not improve this (supplementary figure ) . neurotrophin- also restored the use of the affected forelimb for lateral support while rats reared in a vertical cylinder (fig. e) . after stroke and vehicle treatment, rats used their affected forelimb less often than shams. nt -treated rats showed more frequent use of the affected forelimb relative to vehicle-treated rats (fig. f) . we used force transducers to measure grip strength of each forelimb (supplementary fig. ) . stroke caused transient weakness in both groups but infusion of nt into triceps brachii did not modify grip strength. we also found no evidence for pain (cold allodynia) on the affected or treated forelimbs, assessed by application of ice-cold acetone to the centre of the forepaw. cold allodynia was induced neither by stroke nor nt treatment (supplementary fig. ). in summary, infusion of nt protein into the triceps brachii induced recovery on both sensory and motor tasks that require control of muscles by pathways including corticospinal pathways, serotonergic raphespinal pathways and proprioceptive circuits. accordingly, we hypothesised that nt would induce neuroplasticity in multiple pathways. we examined anatomical neuroplasticity in the c cervical spinal cord because we knew from experiments using adult and elderly rats that the less-affected corticospinal tract sprouts at this level (as well as other levels) after injection of aav-nt into muscles including triceps brachii . indeed, anterograde tracing from the less-affected hemisphere (fig. b) revealed that infusion of nt protein increased sprouting of the cst in the c spinal cord (fig. 
a,b) across the midline and into the affected side at two more lateral planes, and also from the ventral cst. we assessed neural output in the ulnar nerve on the affected side, whose motor neurons are also found in c (range: c to c ) that supply muscles in the forearm including the hand , . to do this, we recorded responses during electrical stimulation of either the spared less-affected corticospinal tract (fig. c) or the partially-ablated corticospinal tract (fig. f) in the medullary pyramids. nt treatment led to enhanced responses in the ulnar nerve during stimulation of the less-affected (fig. d, e) and more-affected (fig. g, h) pathways. this result is consistent with the sprouting of traced cst axons (fig. a,b) and indicates that cst axons from both the stroke hemisphere and the contralesional hemisphere formed new synapses and/or strengthened preexisting connections in the cord on the treated side, most likely on pre-motor interneurons that lie between cst axons and motoneurons , . however, we did not find any evidence that nt strengthened the short-latency reflex from afferents in the median nerve to motor neurons in the ulnar nerve (fig. i-l) . we also found that nt treatment caused serotonergic axons to sprout in the ventral c spinal cord (fig. a-d) . anatomical and functional plasticity of corticospinal and raphespinal pathways is consistent with their expression of receptors for nt , , - . we conclude that nt caused neuroplasticity in multiple descending locomotor pathways including the raphespinal and the spared corticospinal tracts. these data are consistent with previous findings from our lab , that peripherally-administered nt can, directly or indirectly, enhance supraspinal plasticity after stroke. accordingly, next, we assessed the biodistribution of nt after peripheral administration. we measured the amount of total (rat and human) nt in the triceps brachii and c spinal cord. 
elisa was performed using a subset of five rats per group withdrawn at random from the study at the four-week time point: this revealed an increase in total nt protein levels in the triceps brachii on the treated side (fig. a ) and, surprisingly, on the untreated side (perhaps due to nt in endothelial cells; see below). we were not able to detect any increase in total nt in the c spinal cords (supplementary fig. ). however, elisa cannot distinguish exogenous human nt from endogenous rat nt because the amino acid sequences for mature human and rat nt are identical , . because elisa did not allow us to detect any small increases in exogenous human nt against the background of endogenous rat nt in the cns, we next used a more sensitive method for measuring trafficking of nt across the blood-cns barrier. recombinant nt protein was radiolabelled and purified. [ h]nt was injected intravenously into adult mice. radiolabelled albumin was used as a control because it does not enter the cns efficiently from the bloodstream. after , , , , , or minutes, brain, spinal cord and serum were taken for scintillation counting. nt progressively accumulated in the intact brain (fig. b) and cervical spinal cord (fig. c) . in plasma, the half-life of nt was short (fig. d) . our data is consistent with that from others who have shown that radiolabelled nt rapidly crosses the barriers between the blood and an intact cns [ ] [ ] [ ] and that a small amount of intact nt accumulates in the brain and cervical spinal cord (although the majority of nt is cleared rapidly from the bloodstream) [ ] [ ] [ ] . for example, after injection of nt into the brachial vein (which provides drainage from the triceps brachii), nt accumulates in the cortex, striatum, brainstem, cerebellum, sciatic nerve (and other regions of the nervous system involved with locomotion) . ischemia. minutes later, tissues were taken for scintillation counting. in contrast to [ h]albumin, [ h]nt accumulated in the brain (fig e) . 
to confirm entry of [ h]nt into brain parenchyma beyond endothelial cells, capillaries were depleted by gradient centrifugation to yield a supernatant containing brain parenchyma and an endothelial cell-enriched pellet . [ h]nt entered parenchyma (depleted of endothelial cells) at a level above that seen for [ h]albumin (fig. f ). transport of nt into the cns is apparently a receptor-mediated process as shown by ) the expression of nt receptors in rodent and human cns capillaries , and ) the ability of non-radiolabelled nt to compete for uptake of radiolabelled nt into the cns , , . in addition, we and others have shown that nt enters the pns after peripheral administration: after intramuscular overexpression of aav encoding nt , nt levels are elevated in the blood stream and nt accumulates in the ipsilateral drg , . we also found some evidence that nt is retrogradely transported from muscle to ipsilateral motor neurons . this is consistent with data showing that ) the blood-nerve barrier in drgs is permeable to proteins like nt , ) that after intravenous injection, radiolabelled nt accumulates in the sciatic nerve and ) that nt is retrogradely transported from muscle to the spinal cord or drg in nerves , , , . we conclude that neuroplasticity occurred in multiple locomotor pathways because peripherally-administered nt bound to receptors in the pns and cns. to explore the mechanism whereby nt improved responsiveness to stimuli attached to the affected wrist (fig. b) , we performed functional brain imaging (bold-fmri) during low threshold (non-noxious) intensity electrical stimulation of the affected wrist ( supplementary fig. ). as expected, prior to stroke, stimulation of the wrist resulted in a higher probability of activation of the opposite somatosensory cortex (supplementary fig. a ). fmri performed one week after stroke confirmed that somatosensory cortex was not active when the affected paw was stimulated in either vehicle or nt treated rats (p> . 
, supplementary fig. b ). this supports our claim above that there were no early differences between groups that could be explained by neuroprotection. fmri performed eight weeks after stroke revealed a trend towards perilesional re-activation of somatosensory cortex in both vehicle and nt treated groups (p< . , supplementary fig. c ). this is in line with human brain imaging studies showing that spontaneous sensory recovery is increased after stroke when more-normal activity patterns are observed on the affected side of the brain . however, these probabilities of re-activation were not large enough to survive correction for testing of multiple voxels (p-values> . ), although they are clearly in a location that might mediate recovery of somatosensation. a longitudinal analysis showed that at weeks (relative to pre-stroke baseline), there was some evidence that rats treated with neurotrophin- showed increased probability of activation of perilesional cortex (supplementary fig. d, p< . ) and decreased probability of activation of somatosensory cortex on the less-affected hemisphere (p< . ) relative to vehicle-treated stroke rats. however, these apparent differences did not survive correction for testing of multiple voxels (p< . ). we conclude that both groups showed partial, spontaneous restoration of more-normal patterns of somatosensory cortex activation , but, conservatively, that nt did not further increase probability of activation of any supraspinal areas. these conclusions are consistent with previous fmri data from our laboratory ; we propose that the additional recovery of somatosensory function after nt treatment (fig. b) is due to changes in the spinal cord rather than in supraspinal areas. the batch of neurotrophin- protein that we used was produced more than a decade ago by genentech. 
we sought to determine whether any degradation had occurred and to confirm its amino acid sequence so that identical preparations of neurotrophin- could be made for future experiments. supplementary figure depicts results from a non-denaturing gel showing a higher molecular weight band ( kda) and a lower molecular weight band ( kda) consistent, respectively, with dimeric mature nt and monomeric mature nt . there was no evidence of degradation or aggregation. each band was excised separately, digested enzymatically (with trypsin) and subjected to lc/ms/ms analysis. proteomic analysis was consistent with both bands being mature nt with no evidence of residual prepro sequences (supplementary figure ) . we conclude that the higher molecular weight band is not prepront (~ kda) but rather corresponds to dimeric mature nt . this facilitated our ongoing experiments to evaluate nt as a therapy for stroke because most commercial preparations of nt consist of mature nt rather than prepront . treatment of disabled arm muscles with nt protein, initiated hours after stroke, caused changes in multiple locomotor circuits, and promoted a progressive recovery of sensory and motor function in rats. the fact that nt can reverse disability when treatment is initiated hours after stroke is exciting because the vast majority of stroke victims are diagnosed within this time frame . in contrast, the gold-standard drug for ischemic stroke, tpa, needs to be given within a few hours and is only administered to a minority. thus, nt could potentially be used to treat an enormous number of victims. nt has good clinical potential. firstly, phase ii clinical trials show that doses up to µg/kg/day are well tolerated and safe in healthy humans and in humans with other conditions , - . 
we used a threefold lower dose ( µg/kg/day) in this study: in future experiments we will optimize the dose and duration of treatment because it is possible that a higher dose of nt would promote additional recovery after stroke. secondly, there is good conservation from rodents to primates including humans in the expression of receptors for nt in the locomotor system , , [ ] [ ] [ ] . thirdly, in none of our rodent experiments has nt treatment caused any detectable pain, spasticity or muscle weakness (in line with the human trials); rather, after bilateral corticospinal tract injury in rats, intramuscular delivery of aav-nt reduced spasticity, slightly improved grip strength and showed a trend towards reducing mechanical hyperalgesia . in this study and in a previous study we used functional mri combined with electrical stimulation of the wrist in an effort to discover what neuroplasticity underlies recovery of somatosensory responsiveness to adhesive patches attached to the wrist. we confirmed work by others that recovery correlated well with more-normal patterns of increased bold signal surrounding the infarct (potentially in spared somatosensory cortex) , , , but we did not find strong evidence that nt further increased peri-lesional (or other) activation (either in this study or in our previous study ). instead, we now propose that nt increased somatosensory recovery by inducing neuroplasticity in spinal circuits involving cutaneous afferents. this is plausible because cutaneous afferents which mediate tactile sensitivity express trkc receptors . moreover, others have shown that dl spinal interneurons can gate cutaneous transmission . 
we have previously shown that nt normalises post-activation depression of output from spinal circuits evoked by stimulation of low threshold afferents from the treated wrist (which might include cutaneous afferents as well as proprioceptive afferents) although in those experiments we measured motor output rather than sensory transmission. in the future one might examine whether nt modulates gating of somatosensory inputs from the wrist to spinal interneurons . however, in the present work, the deficits in somatosensory responses were modest and might be difficult to dissect. with regard to corticospinal neuroplasticity, we have shown twice previously (in adult and elderly rats) that the less-affected corticospinal tract sprouts across the cervical midline after injection of aav-nt into affected forelimb muscles . others have shown that intrathecal infusion of nt induces sprouting of the corticospinal tracts and that injection of vectors encoding nt into muscles or nerve can induce corticospinal tract sprouting. here, our anatomical tracing confirmed that the less-affected corticospinal tract sprouted after infusion of nt protein into triceps and in future we will trace both tracts. this is because, in the present study, neurophysiology revealed that both corticospinal tracts underwent plasticity after unilateral infusion of nt protein. we propose that spared cst axons sprouted after nt entered the cns from the systemic circulation. this is consistent with data from us and others showing that radiolabelled nt entered the brain and spinal cord after intravenous injection [ ] [ ] [ ] . moreover, it has been shown that endogenous muscle spindle-derived cues induce sprouting of descending pathways after spinal cord injury in adult mice ; given that muscle spindles make nt endogenously , it is plausible that infusion of supplementary nt to muscle might enhance corticospinal sprouting after stroke. 
it is also notable that infusion of nt into a proximal forelimb extensor improved the accuracy of use of the affected forelimb when walking on a horizontal ladder but did not improve the accuracy of use of the affected hindlimb; this implies that circulating nt is not sufficient to improve hindlimb movements. moreover, we did not find any evidence that nt strengthened the short-latency reflex between afferents in the median nerve and motor neurons in the ulnar nerve; this may be because we infused nt into the triceps brachii whose afferents do not run in the median or ulnar nerve. this is consistent with previous work of ours showing that a reflex may be strengthened when its afferent comes from a muscle expressing higher levels of nt but not when its afferent comes from a muscle lacking transgenic expression of nt . finally, infusion of nt protein into triceps brachii did not improve forelimb grip strength: however, the grip strength task probably depends more on strength in hand and digit flexor muscles (into which nt was not infused) than on triceps brachii (elbow extensors). indeed, in previous work, injection of aav-nt into proximal and distal flexor muscles did modestly improve grip strength . taken together, these results indicate that it may be important to target nt to multiple muscles. however, it is not straightforward to reconcile all our findings with a single mechanistic explanation. it is possible that, additionally, nt was trafficked from triceps brachii in axons to motor neurons and/or by drg neurons where it induced expression of a molecule that was secreted and induced cst sprouting (e.g., bdnf or igf , ). nt is certainly trafficked to ipsilateral motor neurons and drg after intramuscular delivery , , , , and in this study we also showed, unexpectedly, a small increase in contralateral triceps (perhaps from nt in endothelial cells). 
diffusion of nt within neuropil is inefficient but spinal motor neuron dendritic arbors can be very large; some even extend across the midline and these might provide a widespread source of cues for supraspinal axonal plasticity (e.g., across the midline). to seek drg-secreted factors, we have performed rnaseq of cervical drg after injection of aav-nt into forelimb flexors , . in the future we will also seek motor neuron-derived cues. finally, it is interesting that the recovery continues even after infusion of nt is discontinued at four weeks. this is encouraging, from a translational perspective. we propose that the four-week long nt treatment induces changes in target neurons that persist (e.g., due to sustained modifications in gene expression). indeed, longer treatment with nt induces different intracellular signalling events in sensory axons than does brief treatment, thereby enhancing terminal branching . in the future, we will seek factors that are persistently increased in target neurons after nt treatment is discontinued. additionally, it may be that nt induces sprouting of cst axons that (after cessation of treatment) is followed by selection of synapses (e.g., strengthening or pruning) by a mechanism that is independent of nt . for example, it is known that corticomotoneuronal axon synapses are pruned by repulsive plexina -sema d interactions . to begin to dissect the mechanisms whereby nt promotes neuroplasticity and recovery after peripheral delivery, we are setting up a mouse model of stroke. in summary, treatment of disabled arm muscles with nt (initiated in a clinically-feasible timeframe) induces multilevel spinal and supraspinal neuroplasticity, improves walking and reverses a tactile sensory impairment. and hours later infusion of nt or vehicle into the disabled triceps brachii was initiated for one month. six weeks after stroke, anterograde tracer was injected into the contralesional hemisphere (blue). 
rats underwent weeks of behavioural testing. structural mri was conducted on all rats at hours, week and weeks after stroke and fmri was conducted in a subset of rats at baseline, week and week . electrophysiology was performed in the subset of rats which did not receive bda tracing. all surgeries, treatments and behavioural testing were performed using a randomized block design and the study was completed blinded to treatment allocation. c) t mri scans hours after stroke, immediately prior to treatment, showing infarct in coronal sections rostral (mm) to bregma. d) there were no differences between stroke groups in mean lesion volume at hours (mann whitney p values = . ). e) photomicrographs showing loss of figure : delayed nt treatment improved responsiveness to somatosensory stimulation, improved walking and partially restored use of the affected forelimb for lateral support during rearing. a-c) somatosensory deficits were assessed using pairs of adhesive patches attached to the rat's wrists. b) treatment with nt caused improvement compared to vehicle (linear model; f , = . , p< . ; post hoc p= . ): whereas vehicle-treated stroke rats showed a deficit relative to sham rats which persisted for eight weeks (linear model; f , = . , corticospinal axons were anterogradely traced from the less-affected cortex and were counted at the midline (m), at two more lateral planes (d and d ) and crossing into grey matter from the ipsilateral, ventral tract (ipsi). b) nt treatment caused an increase in the number of axons crossing at the midline (f , = . , p= . ; post hoc p value= . ), at two lateral planes denoted as d (f , = . , p< . , post hoc p value< . ) and d (f , = . ,p< . , post hoc p value< . ) and from the ventral cst on the treated side (f , = . , p= . , post hoc p value= . ). although stroke by itself caused sprouting at the midline at c (planned comparison p= . ), nt did not promote additional sprouting at c . n= /group were used for tract tracing. 
c-e) the cst from the less-affected hemisphere or f-h) lesioned hemisphere was stimulated in the medullary pyramids (before the decussation) and the motor output was recorded from the ulnar nerve on the treated side. d, g) the majority of spikes were detected between ms and ms, latencies consistent with polysynaptic transmission, in both vehicle-treated rats (grey) and nt treated rats (blue) when the less-affected or affected cst was stimulated. e, h) stimulus intensity was increased incrementally from µa to µa and the area under the curve was measured (between and ms) after stimulation of the affected or less-affected hemisphere. nt treatment caused increased output in the ulnar nerve during stimulation of either the affected cst (two-way rm anova intensity × group interaction f , = . , p= . , n= vehicle, nt ) or less-affected cst (f , = . , p= . , n= vehicle, n= nt ). i, j) the heteronymous reflex from median afferents to ulnar motor neurons was recorded in the axilla. k) the monosynaptic component was measured. l) nt did not increase the strength of the monosynaptic component. each rat held on to the pair of force transducers bilaterally (top left) and was pulled away horizontally and perpendicularly (towards the right) until the bilateral grip was broken. the force transducers provide a measure of strength (grams) for each upper limb. an average of three trials was taken per rat per week. grip strength values (grams) are presented as group means ± sems. b) grip strength of the affected limb was subtracted from the unaffected limb strength, as an internal control (e.g., to control for differences in motivation). c) grip strength for the affected limb. d) grip strength for the less-affected limb. stroke caused a weak trend towards a transient decrease in strength on the limb affected by stroke (relative to shams; time f , = . , p= . ; p-values p= . and . 
, respectively) but multiple pairwise comparisons did not show significant differences at any timepoint (all p> . ). there was no difference between nt and vehicle treated rats overall (group f , = . restoring brain function after stroke -bridging the gap between animals and humans a comprehensive review of prehospital and in-hospital delay times in acute stroke care trkc-like immunoreactivity in the primate descending serotonergic system local and remote growth factor effects after primate spinal cord injury influences of neurotrophins on mammalian motoneurons in vivo expression and coexpression of trk receptors in subpopulations of adult primary sensory neurons projecting to identified peripheral targets the neurotrophins bdnf, nt- , and ngf display distinct patterns of retrograde axonal transport in peripheral and central neurons expression of neurotrophins in skeletal muscle: quantitative comparison and significance for motoneuron survival and maintenance of function nt- , but not bdnf, prevents atrophy and death of axotomized spinal cord projection neurons muscle injection of aav-nt promotes anatomical reorganization of cst axons and improves behavioral outcome following sci neurotrophin- expressed in situ induces axonal plasticity in the adult injured spinal cord adeno-associated viral vector-mediated neurotrophin gene transfer in the injured adult rat spinal cord improves hind-limb function differential effects of brain-derived neurotrophic factor and neurotrophin- on hindlimb function in paraplegic rats intramuscular aav delivery of nt- alters synaptic transmission to motoneurons in adult rats either brain-derived neurotrophic factor or neurotrophin- only neurotrophin-producing grafts promote locomotor recovery in untrained spinalized cats neurotrophin- enhances sprouting of corticospinal tract during development and after adult spinal cord lesions intramuscular neurotrophin- normalizes low threshold spinal reflexes, reduces spasms and improves mobility 
after bilateral corticospinal tract injury in rats spinal electromagnetic stimulation combined with transgene delivery of neurotrophin nt- and exercise: novel combination therapy for spinal contusion injury delayed intramuscular human neurotrophin- improves recovery in adult and elderly rats after stroke retrograde viral delivery of igf- prolongs survival in a mouse als model immune activation is required for nt- -induced axonal plasticity in chronic spinal cord injury expression of neurotrophin- promotes axonal plasticity in the acute but not chronic injured spinal cord activity-dependent increase in neurotrophic factors is associated with an enhanced modulation of spinal reflexes after spinal cord injury muscle spindle feedback directs locomotor recovery and circuit reorganization after spinal cord injury selective expression of neurotrophin- messenger rna in muscle spindles of the rat permeability of the blood-brain barrier to neurotrophins permeability at the blood-brain and blood-nerve barriers of the neurotrophic factors: ngf, cntf, nt- , bdnf penetration of neurotrophins and cytokines across the bloodbrain/blood-spinal cord barrier nt- promotes nerve regeneration and sensory improvement in cmt a mouse models and in patients neurotrophin- improves functional constipation tolerability of recombinant-methionyl human neurotrophin- (r-methunt ) in healthy subjects recombinant human neurotrophic factors accelerate colonic transit and relieve constipation in humans nerve growth factor-and neurotrophin- -induced changes in nociceptive threshold and the release of substance p from the rat isolated spinal cord unbiased classification of sensory neuron types by large-scale singlecell rna sequencing sustained sensorimotor impairments after endothelin- induced focal cerebral ischemia (stroke) in aged rats rodent models of focal stroke: size, mechanism, and purpose delayed treatment with chondroitinase abc promotes sensorimotor recovery and plasticity after stroke in aged 
rats on the use of alpha-chloralose for repeated bold fmri measurements in rats robust automatic rodent brain extraction using -d pulse-coupled neural networks (pcnn) contrast weights in flexible factorial design with multiple groups of subjects spatial characterization of the motor neuron columns supplying the rat forelimb ethosuximide reverses paclitaxel-and vincristine-induced painful peripheral neuropathy chondroitinase abc promotes plasticity of spinal reflexes following peripheral nerve injury the labelling of proteins to high specific radioactivities by conjugation to a i-containing acylating agent heat shock protein-based therapy as a potential candidate for treating the sphingolipidoses graphical evaluation of blood-tobrain transfer constants from multiple-time uptake data the distribution of the anti-hiv drug, ' '-dideoxycytidine (ddc), across the blood-brain and blood-cerebrospinal fluid barriers and the influence of organic anion transport inhibitors a statistical model for identifying proteins by tandem mass spectrometry analysis of longitudinal data from animals with missing values using spss chondroitinase abc promotes functional recovery after spinal cord injury cervical motoneuron topography reflects the proximodistal organization of muscles and movements of the rat forelimb: a retrograde carbocyanine dye analysis lack of monosynaptic corticomotoneuronal epsps in rats: disynaptic epsps mediated via reticulospinal neurons and polysynaptic epsps via segmental interneurons electrophysiological actions of the rubrospinal tract in the anaesthetised rat trka, trkb, and trkc messenger rna expression by bulbospinal cells of the rat motoneuron-derived neurotrophin- is a survival factor for pax -expressing spinal interneurons bdnf and nt- , but not ngf, prevent axotomy-induced death of rat corticospinal neurons in vivo human and rat brain-derived neurotrophic factor and neurotrophin- : gene structures, distributions, and chromosomal localizations 
neurotrophin- : a neurotrophic factor related to ngf and bdnf a revised role for p-glycoprotein in the brain distribution of dexamethasone, cortisol, and corticosterone in wild-type and abcb a/bdeficient mice the cell biology of the blood-brain barrier nerve growth factor-induced protection of brain capillary endothelial cells exposed to oxygen-glucose deprivation involves attenuation of erk phosphorylation expression of cannabinoid receptors and neurotrophins in human gliomas vascularization of the dorsal root ganglia and peripheral nerve of the mouse: implications for chemical-induced peripheral sensory neuropathies neurotrophin- administration attenuates deficits of pyridoxineinduced large-fiber sensory neuropathy neurotrophin- is a target-derived neurotrophic factor for penile erection-inducing neurons reemergence of activation with poststroke somatosensory recovery: a serial fmri case study correlation between brain reorganization, ischemic damage, and neurologic status after transient focal cerebral ischemia in rats: a functional magnetic resonance imaging study early prediction of functional recovery after experimental stroke: functional magnetic resonance imaging, electrophysiology, and behavioral testing in rats expression of mrnas for neurotrophic factors (ngf, bdnf, nt- , and gdnf) and their receptors (p ngfr, trka, trkb, and trkc) in the adult human peripheral nervous system and nonneural tissues trka and trkc expression is increased in human diabetic skin neurotrophin- -like immunoreactivity and trk c expression in human spinal motoneurones in amyotrophic lateral sclerosis changes in cortical activation patterns accompanying somatosensory recovery in a stroke patient: a functional magnetic resonance imaging study longitudinal changes in cerebral response to proprioceptive input in individual patients after stroke: an fmri study circuits for grasping: spinal di interneurons mediate cutaneous control of motor behavior intraspinal rewiring of the 
corticospinal tract requires target-derived brain-derived neurotrophic factor and compensates lost function after brain injury igf-i specifically enhances axon outgrowth of corticospinal motor neurons differential distribution of exogenous bdnf, ngf, and nt- in the brain corresponds to the relative abundance and distribution of high-affinity and low-affinity neurotrophin receptors distinct limb and trunk premotor circuits establish laterality in the spinal cord rnaseq dataset describing transcriptional changes in cervical sensory ganglia after bilateral pyramidotomy and forelimb intramuscular gene therapy with aav encoding human neurotrophin- . data in brief sad kinases sculpt axonal arbors of sensory neurons through long-and short-term responses to neurotrophin signals control of species-dependent cortico-motoneuronal connections underlying manual dexterity cst axons in the upper cervical dorsal columns weeks after stroke (right) relative to sham surgery (left), visualised using pkcγ immunofluorescence. f) stroke caused a significant loss of cst axons relative to shams in the dorsal columns there were no differences between nt and vehicle treated rats at one week (t-test p= . ). c) accuracy of paw placement by the affected forelimb during walking was assessed using a horizontal ladder with irregularly spaced runs. d) one week after stroke, nt and vehicle treated rats made a similar number of misplaced steps (t-test p= . ), expressed as a percentage of total steps. importantly, the nt group progressively recovered compared to the vehicle group (group f , = . , p< . ; post hoc p= . ) and differed from the vehicle group from weeks to (group x time f , = . , p< . ; post hoc p values< . ) and whereas the vehicle group remained impaired relative to shams from weeks to (p values< . ), from weeks to the nt group made no more errors than shams (post hoc p-values> . ). e) the vertical cylinder test assessed use of the affected forelimb for lateral support during rearing. 
f) stroke caused a reduction in the use of the affected forelimb during rearing in a vertical cylinder in both nt and vehicle treated rats relative to shams (group f , = . , p= . ; post hoc p values= . and . , respectively) with no differences between stroke groups at one week (p= . ). nt treatment caused a progressive recovery in the use of the affected forelimb grey circles) against time after iv injection (n= - mice/time) in adult mice. t-tests for nt vs albumin all significant for incubation times of , , , and (p values from < . to < . ***). the volume of distribution of [ h] nt or [ h] albumin in brain (vd =am/cp) is calculated as a ratio of counts per minute (cpm) in µg of brain and cpm in µl of serum for each time point and plotted against exposure time given by the term ∫ t cp(τ)dτ/cp. the rate of influx (ki) was calculated from the patlak plot of vd for  - μl/mg/s). c) [ h]nt entered the cervical spinal cord more abundantly than [ h]albumin. d) plasma half-life of nt for the normal adult mouse is ~ min (estimated from /normalised serum values). e) twenty-four hours after cortical ischemia, [ h]nt entered the brain more abundantly than [ h]albumin, measured minutes after iv injection. f) twenty-four hours after cortical ischemia there are no conflicts of interests of the authors. correspondence and requests for materials should be addressed to l.m (lawrence.moon@kcl.ac.uk) figure : focal cortical stroke caused impairment of the affected forelimb but modest or no impairment of the three other limbs. a) after stroke, nt treated rats recovered function of their affected forelimb on the ladder test relative to stroke vehicle controls and sham rats. nt treated rats recovered fully relative to shams (linear model and t-tests, p≤ . ). *** denotes group difference, p < . ; † denotes interaction of group with time, p< . . this subpanel is reproduced from figure to allow comparison with other subpanels). 
b) shows photograph of the horizontal ladder set up and insert shows a rat traversing the ladder. c) there was no difference in the number of foot faults made in any of the groups using the less affected supplementary figure : cold allodynia was caused neither by focal cortical stroke nor by treatment with neurotrophin- . the acetone test was used to see whether stroke and/or nt treatment caused any change in cold allodynia pain responses. the test involves applying a drop of acetone to the a) affected or b) less affected forelimb, and then allocating a score between and : higher numbers denote a heightened pain response. there is no evidence of painful behaviour based on this test in either forelimb. rm ancova with bonferroni post hoc tests. figure : elisa revealed that infusion of nt into triceps brachii did not cause detectable elevation of nt in homogenates of cervical spinal cord hemicords on the infused or non-infused side of the body (mann whitney p-values= . , . , respectively). figure : functional brain imaging during stimulation of the affected wrist revealed no enhanced probability of perilesional activation by neurotrophin- . the same rats were imaged prior to stroke and then one week and eight weeks after stroke and intramuscular treatment with either nt or vehicle. scans with obvious imaging artefacts were discarded, leaving final group numbers of n= , , and n= , , at weeks , and for nt and vehicle treated groups respectively. red voxels denote greater probability of activation during stimulation (versus stimulation off) whereas blue voxels denote lesser probability of activation during stimulation (versus stimulation off). a) prior to stroke, stimulation of the dominant paw led to a strong probability of activation in the opposite somatosensory cortex. b) one week after stroke, this activation was abolished by infarction. c) eight weeks after stroke, there was a slight trend towards a small perilesional area of reactivation in both groups. 
d) there was a slight trend towards greater perilesional reactivation in the nt group versus the vehicle group at weeks (relative to their baselines). however, all these heat maps of groups of rats show t-values obtained by statistical parametric map analysis without correction for multiple testing (p< . ) and there were no differences between the two groups for any voxels when the threshold for significance was corrected for multiple testing (p< . ; this data is not shown as the heat map was black). red voxels denote greater probability of activation during stimulation for the nt group than for the vehicle group whereas blue voxels denote lesser probability of activation for the nt group than for the vehicle group. when stimulating the less-affected wrist, there were no differences between the two groups for any voxels when the threshold for significance was corrected for multiple testing (p< . ; this data is not shown as the heat map was black). figure : different amounts of recombinant nt were run in four lanes of an sds page gel. two sizes of band of interest (lm _ and lm _ ) were detected following staining with colloidal coomassie brilliant blue. these protein bands were excised prior to separate enzymatic digestion and lc/ms/ms analysis. the apparent molecular weight of the upper band (~ kda) is consistent with either the pro-neurotrophin- precursor form or a dimer of the mature nt protein, whilst the lower band (~ kda) is consistent with the mature nt protein. sequencing revealed that both bands represent the mature nt protein. key: cord- -jvx rh g authors: hinch, r.; probert, w. j. m.; nurtay, a.; kendall, m.; wymatt, c.; hall, m.; lythgoe, k.; bulas cruz, a.; zhao, l.; stewart, a.; ferritti, l.; montero, d.; warren, j.; mather, n.; abueg, m.; wu, n.; finkelstein, a.; bonsall, d. g.; abeler-dorner, l.; fraser, c. 
title: openabm-covid - an agent-based model for non-pharmaceutical interventions against covid- including contact tracing date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: jvx rh g sars-cov- has spread across the world, causing high mortality and unprecedented restrictions on social and economic activity. policymakers are assessing how best to navigate through the ongoing epidemic, with models being used to predict the spread of infection and assess the impact of public health measures. here, we present openabm-covid : an agent-based simulation of the epidemic including detailed age-stratification and realistic social networks. by default the model is parameterised to uk demographics and calibrated to the uk epidemic, however, it can easily be re-parameterised for other countries. openabm-covid can evaluate non-pharmaceutical interventions, including both manual and digital contact tracing. it can simulate a population of million people in seconds per day allowing parameter sweeps and formal statistical model-based inference. the code is open-source and has been developed by teams both inside and outside academia, with an emphasis on formal testing, documentation, modularity and transparency. a key feature of openabm-covid is its python interface, which has allowed scientists and policymakers to simulate dynamic packages of interventions and help compare options to suppress the covid- epidemic. a particular focus of our work applying openabm-covid has been exploring different ways in which contact tracing, and in particular digital contact tracing using mobile phone apps that record proximity events, can contribute to epidemic control [ ] . several other groups have approached this problem with similar agent based models [ , ] ; compared to those, our model places more emphasis on simulating larger populations, computational efficiency, and on code generalisability that allows other researchers to use and develop the code. 
we developed the agent-based model (abm) openabm-covid to simulate an outbreak of covid- in an urban environment. the default population is one million inhabitants with demographic structure based upon uk-wide census data, and household size and age-structure matched to data from the uk census survey (for example, older people tend to live together and young children tend to live with younger adults). on a daily basis all individuals in the model move between networks representing households and either workplaces, schools, or regular social environments for older people. individuals also interact through random networks representing public transport, transient social gatherings etc. membership of each type of network is determined by age, giving rise to age-assortative mixing patterns. network parameters are chosen such that the average number of interactions match age-stratified data reported in [ ] [ ] . the number of daily interactions in random networks is drawn from a negative binomial distribution, allowing for rare super-spreading events. infections are seeded in the population and spread through the networks. biological and epidemiological characteristics of covid- disease have been derived from the scientific literature. the model takes into account asymptomatic infections and different stages of severity, and includes the simulation of hospitalisations and icu admissions. since symptoms, disease progression and infectiousness are highly age-dependent, disease pathways in the model are age-stratified. the abm was developed to simulate different non-pharmaceutical interventions including lockdown, physical distancing, self-isolation on symptoms, testing and contact tracing. modelling contact tracing requires the model to keep a record of previous interventions for a set number of days. 
a variety of contact tracing algorithms are included in the abm, including tracing on symptoms and/or after a positive test, notifying first-degree contacts only or second-degree contacts as well, testing of traced contacts, and imperfections in test-trace-isolate programmes such as delays, missed contacts and partial compliance. the model reports both aggregated data, such as incidence, tests required, individuals quarantined for various reasons etc., and individual data such as transmission relationships. openabm-covid is available on github ( https://github.com/bdi-pathogens/openabm-covid ), including model documentation, dictionaries for input parameters and output files, over tests in a consistent testing framework used in model validation, and examples for running the model. the core of the model is implemented in the c language for speed; however, the model is run via python using a swig-interface. this interface allows for dynamic intervention strategies to be modelled, as well as providing full transparency about the state of the model. this manuscript was prepared using v . of the model (commit number d e) and code for reproducing all figures in this manuscript from model output are publicly available online ( https://github.com/bdi-pathogens/openabm-covid -model-paper ). openabm-covid enables simulation of interventions to help policymakers determine the best options to suppress the covid- epidemic in various settings. default demographic parameters were chosen to reflect the uk and fit well to the uk epidemic after calibration; however, all parameters of the model can be changed by the user. within the abm, individuals are categorised into nine age groups by decade, from " - year" to " + years". decades were used because of the strong age-structure of the disease progression. by default, the demographics of the abm are set to uk national data for from the office of national statistics (ons). 
the proportion of individuals in each age group is the same as that specified by the population level statistics in supplementary table . since we only consider simulating the epidemics up to a year, we do not consider changes in the population due to births, deaths due to other causes, and migration. in each of the interaction networks, individuals are represented as a node. constant and dynamic connections occur between the nodes in the networks, representing interactions between individuals. the three networks represent different types of daily interactions: household, occupation, and random ( figure ). the interaction networks have two roles in the abm. first, the infection can be transmitted between two individuals on a day that they interact. second, the interactions for each individual are stored and can be used for contact tracing. the membership of different networks leads to age-group assortativity in the interactions. a previous study of social contacts for infectious disease modelling, based on participants being asked to recall their interactions over the past day, has estimated the mean number of interactions that individuals have by age group [ ] . we estimate mean interactions by age group by aggregating data (supplementary table ). figure every individual is assigned to live in a single household. the household network is formed by all members of every household interacting with each other every day. the distribution of household sizes is the ons estimate for the uk in (supplementary table ). in addition to the household size, the mix of ages in households is important since multi-generational households provide a path by which the infection can be transmitted from young to old. to model this we used a reference panel of , households taken by down-sampling the uk-wide household composition data from the census produced by the ons. 
the overall household structure was generated by sampling from the reference household panel with replacement and using rejection-sampling to match the aggregate statistics for the age demographics and household size. each individual is also a member of a recurring occupation network to model school, workplace or social networks. the occupation networks are modelled as small-world networks [ ] . the network has a fixed set of connections between individuals, and each day a random subset ( %) of these connections are chosen as the interactions between individuals. when constructing the occupation networks, the abm ensures the absence of overlaps between the household interactions and the local interactions on the small-world network. for children, there are separate occupation networks for the - year age group (i.e. nursery/primary school) and the - year age group (i.e secondary school). on each of these networks we introduce a small number of adults ( adult per children) to represent teaching and other school staff. similarly for the - year age group and the + year age group we created separate networks representing daytime social activities among elderly people (again with younger adult per elderly people to represent some mixing between the age groups). all remaining adults (the vast majority) are part of the - network. due to the difference in total number of daily interactions, each age group has a different number of interactions in their occupation network. parameters and values corresponding to the occupation network are shown in supplementary table . in addition to the recurring structured networks of households and occupations, we include random interactions. these are drawn randomly each day, independent of previous connections. the number of random connections an individual makes is the same each day (in the absence of interventions), drawn at the start of the simulation from an over-dispersed negative-binomial distribution. 
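the two sampling steps described above (a fixed per-person number of random contacts drawn from a negative-binomial distribution, and a daily random subset of the fixed occupation-network connections) can be sketched in python. this is a minimal illustration under stated assumptions, not the openabm-covid implementation, whose core is written in c: the function names, the `dispersion` parameter and all numerical values in the usage note are hypothetical.

```python
import numpy as np


def draw_random_interaction_counts(n_people, mean, dispersion, rng):
    """Draw one fixed daily number of random contacts per person.

    numpy parameterises the negative binomial by (n, p). Setting
    n = dispersion and p = dispersion / (dispersion + mean) gives
    mean `mean` and variance mean + mean**2 / dispersion, so a small
    `dispersion` (hypothetical name) yields a heavy right tail.
    """
    p = dispersion / (dispersion + mean)
    return rng.negative_binomial(dispersion, p, size=n_people)


def sample_daily_edges(edges, fraction, rng):
    """Choose a random subset of the fixed network edges for one day."""
    edges = np.asarray(edges)
    n_keep = int(round(fraction * len(edges)))
    idx = rng.choice(len(edges), size=n_keep, replace=False)
    return edges[idx]
```

for example, `draw_random_interaction_counts(1000, 4.0, 2.0, np.random.default_rng(0))` returns one count per person with variance larger than the mean, which is what produces the occasional individual with many more contacts than average. the counts are drawn once and reused every day, whereas `sample_daily_edges` would be called once per simulated day.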
this variation in the number of interactions introduces some "super-spreaders" into the network who have many more interactions than average. the mean numbers of connections were chosen so that the total number of daily interactions matched that from a previous study of social interaction [ ] . the number of random interactions was chosen to be lower in children in comparison to other age groups. interactions in the random network are listed in supplementary table . the infection is spread by interactions between infected (source) and susceptible (recipient) individuals. the rate of transmission is determined by three factors: the infectiousness of the source, the age-dependent susceptibility of the recipient, and the type of interaction, i.e. on which network it occurred. infectiousness varies over the natural course of an infection, i.e. as a function of the amount of time the source has been infected, t. infectiousness starts at zero at the point of infection (t = 0), increases to a peak at an intermediate time, and decreases to zero a long time after infection (large t). following [ ] , we took the functional form of infectiousness to be a scaled gamma distribution. we chose the mean and standard deviation as intermediate values between different studies [ , , ] . once infected, we split individuals into three groups based upon the eventual severity of the disease: asymptomatic, mild symptomatic and moderate-severe symptomatic. the level of infectiousness depends upon the eventual severity of the disease, i.e. pre-symptomatic individuals who go on to develop moderate-severe symptoms are more infectious than those who go on to develop mild symptoms. by default, the overall infectiousness of asymptomatic individuals and individuals with mild symptoms is . and . times that of individuals with moderate-severe symptoms, respectively [ ] .
an example of how transmissions can be stratified by the infection status of the source and the age of both source and recipient is depicted in figure . in this simulation of an uncontrolled epidemic, most transmissions occur from pre-symptomatic individuals with mild disease who are more numerous than individuals who go on to develop severe disease, followed by symptomatic individuals with mild disease. interventions that reduce the rate of growth of transmission will change the relative contributions of different symptomatic stages. the susceptibility of the recipient to infection is modelled with a scale factor dependent on the recipient's age. to calibrate these factors, we identified studies of whether or not transmission occurred from index cases to monitored close contacts [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . lower probability of infection in children was reported in almost all studies, including that of zhang et al [ ] which observed more infections than the rest of the studies combined, with consistent adjustment for other covariates of transmission risk. we used the susceptibility by age of zhang et al., interpolated to match our ten-year age categories. the merged data and fit are shown in supplementary table . finally, we model the type of interaction, i.e. on which network the interaction took place. whilst we do not have data on the length of interactions, interactions which take place within a person's home are likely to be closer than other types of interactions leading to higher rates of transmission. this is modelled using a scale factor, which is by default.
combining all effects, we model the hazard rate per interaction at which the virus is transmitted by

$$\lambda(t, d, a, n) = \frac{R \, S_a \, A_d \, B_n}{\bar{I}} \int_{t-1}^{t} f_\Gamma(u; \mu_i, \sigma_i) \, \mathrm{d}u$$

where $t$ is the time since the source was infected; $d$ indicates the disease severity of the source (asymptomatic, mild, moderate/severe); $a$ is the age of the recipient; $n$ is the type of network where the interaction occurred; $\bar{I}$ is the mean number of daily interactions; $f_\Gamma(u; \mu, \sigma)$ is the probability density function of a gamma distribution; $\mu_i$ and $\sigma_i$ are the mean and width of the infectiousness curve; $R$ scales the overall infection rate; $S_a$ is the relative susceptibility of the recipient based on age; $A_d$ is the relative infectiousness of the source based on disease severity; $B_n$ is the scale factor for the network on which the interaction occurred. supplementary table contains the values of the parameters used in simulations. the transmission hazard rate is converted to a probability of transmission via

$$p = 1 - e^{-\lambda}$$

the epidemic is seeded by randomly infecting individuals on the day before the simulation starts.

all rights reserved. no reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this this version posted september , . . https://doi.org/ . / . . . doi: medrxiv preprint

a fraction Φ asym (age) of individuals are asymptomatic and do not develop symptoms, a fraction Φ mild (age) will eventually develop mild symptoms, and the remainder develop moderate/severe symptoms. each of these proportions depend on the age of the infected individual (supplementary table ). those who are asymptomatic are infectious at a lower level (see infection dynamics section) and will move to a recovered state after a time τ a,rec drawn from a gamma distribution.
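as a worked illustration of the transmission model, the sketch below computes a per-interaction hazard and converts it to a probability. it assumes that the hazard is the product of the scale factors for overall rate, recipient susceptibility, source severity and network type, divided by the mean number of daily interactions, and multiplied by the gamma infectiousness density integrated over the day ending at t. the function and argument names are hypothetical, the integral is approximated with a simple midpoint rule, and the numerical values in the usage note are placeholders rather than the model's calibrated parameters.

```python
import math


def gamma_pdf(u, mean, sd):
    """Gamma density parameterised by its mean and standard deviation."""
    if u <= 0:
        return 0.0
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    log_pdf = ((shape - 1) * math.log(u) - u / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def daily_infectiousness(t, mean, sd, steps=200):
    """Midpoint-rule integral of the gamma density over the day (t-1, t]."""
    h = 1.0 / steps
    return h * sum(gamma_pdf(t - 1 + (j + 0.5) * h, mean, sd)
                   for j in range(steps))


def transmission_probability(t, rate_scale, susceptibility, infectiousness,
                             network_scale, mean_interactions, mean, sd):
    """Hazard for one interaction on day t, converted via p = 1 - exp(-hazard)."""
    hazard = (rate_scale * susceptibility * infectiousness * network_scale
              / mean_interactions) * daily_infectiousness(t, mean, sd)
    return 1.0 - math.exp(-hazard)
```

calling `transmission_probability(5, 2.0, 1.0, 1.0, 1.0, 10.0, 5.5, 2.1)` gives the probability that a single interaction on the fifth day after the source's infection leads to transmission under these placeholder values. because the daily integrals of the density sum to one over the whole course of infection, the overall rate scale controls the expected total number of transmissions per infected individual.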
once an individual is recovered the model allows immunity to wane through time using two parameters: a fixed period for which every individual must wait, τ waning-shift , and then a geometric distribution of waiting times until individuals become susceptible, parameterised by its mean τ waning-mean . by default, the model assumes τ waning-shift to be , days (essentially no waning immunity). during this waiting period, infection is assumed to be completely immunising (recovered individuals cannot be reinfected). individuals who will develop symptoms start by being in a pre-symptomatic state, in which they are infectious but have no symptoms. the pre-symptomatic state is important for modelling interventions because individuals in this state do not realise they are infectious and therefore will not self-isolate based on symptoms to prevent infecting others. individuals who develop mild symptoms do so after time τ sym and then recover after time τ rec (both drawn from gamma distributions). the remaining individuals develop moderate/severe symptoms after a time τ sym drawn from the gamma distribution. whilst most individuals recover without requiring hospitalisation, a fraction Φ hosp (age) of those with moderate/severe symptoms will require hospitalisation. this fraction is age-dependent. those who do not require hospitalisation recover after a time τ rec drawn from a gamma distribution, whilst those who require hospitalisation are admitted to hospital after a time τ hosp , which is drawn from a shifted bernoulli distribution. among all hospitalised individuals, a fraction Φ crit (age) develop critical symptoms and require intensive care treatment, with the remainder recovering after a time τ hosp , rec drawn from a gamma distribution. the time from hospitalisation to developing critical symptoms, τ crit , is drawn from a shifted bernoulli distribution. of those who develop critical symptoms, a fraction Φ icu (age) will receive intensive care treatment. 
for patients receiving intensive care treatment, a fraction Φ death (age) die after a time τ death drawn from a gamma distribution, with the remainder leaving intensive care after a time τ crit,surv . patients who require critical care and do not receive intensive care treatment are assumed to die upon developing critical symptoms. patients who survive critical symptoms remain in hospital for τ hosp,rec before recovering. the age-dependent infection fatality ratio (ifr) is depicted in figure ; other age-dependent outcomes in supplementary figure . supplementary figure shows the corresponding waiting time distributions. figure : age-stratified infection fatality ratio (ifr) as output from a single simulation in a population of million with uk-like demography and with a lockdown when prevalence reached %. grey numbers on each bar show the ifr within each age group. main outputs of the model include the number of infected individuals, hospitalisations, icu admissions and deaths ( figure ). additional outputs are the number of people in quarantine and the number of tests required, which is of particular interest when comparing different interventions. transmissions can be depicted according to their type (pre-symptomatic, symptomatic and asymptomatic). the model provides a good fit to uk data ( figure ). openabm-covid can model a range of non-pharmaceutical interventions (npis).
given the many types of intervention and interest in introducing them at different times, the interventions are controlled in the simulation dynamically through the python interface. this allows for policy interventions to be applied in response to change in the growth of the epidemic (e.g. stricter policies such as lockdown can be applied when prevalence is above a threshold). below we give brief descriptions of the interventions and sample python code is given in the supplementary materials with links to jupyter notebooks. all model parameters involved with npis are given in supplementary tables and . . self-isolation upon symptoms : a proportion of individuals self-isolate upon developing symptoms. self-isolation is modelled by stopping interactions on the individual's occupation network and greatly reducing their number of interactions on the random network. the default time for self-isolation is days with a daily dropout. the abm contains the option to quarantine everybody within the household of the symptomatic individual. the abm also considers individuals without covid- who develop flu-like symptoms. supplementary figure a is a jupyter notebook demonstrating how self-isolation upon symptoms reduces the rate of spread of the infection. . hospitalisation: once admitted to hospitals, a patient immediately stops interacting with the household, occupation and random networks. we do not model interactions within hospitals, but will add this in future work. . lockdown : is modelled by reducing the number of interactions that people have on their occupation and random networks (by default by %). additionally, given that during lockdown people stay at home, the transmission rate for interactions on the household network is increased by a factor of . . supplementary figure b is a jupyter notebook demonstrating the rapid reduction in new infections when a lockdown is imposed. 
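the idea of switching on an intervention in response to the state of the epidemic can be sketched with a toy simulator. the class below is a deterministic discrete-time sir stand-in written only for this illustration; it is not openabm-covid and does not use its python interface. the names `ToyEpidemic`, `contact_multiplier` and `run_with_threshold_lockdown`, and every numerical value in the usage note, are hypothetical.

```python
class ToyEpidemic:
    """Deterministic discrete-time SIR model (hypothetical stand-in)."""

    def __init__(self, population, infected, beta, recovery_rate):
        self.n = population
        self.i = float(infected)
        self.s = population - self.i
        self.beta = beta
        self.recovery_rate = recovery_rate
        self.contact_multiplier = 1.0  # lowered while a lockdown is active

    @property
    def prevalence(self):
        return self.i / self.n

    def step(self):
        """Advance the epidemic by one day."""
        new_inf = self.beta * self.contact_multiplier * self.s * self.i / self.n
        new_inf = min(new_inf, self.s)
        new_rec = self.recovery_rate * self.i
        self.s -= new_inf
        self.i += new_inf - new_rec


def run_with_threshold_lockdown(model, days, threshold, lockdown_multiplier):
    """Step the model daily; impose a lockdown once prevalence exceeds threshold."""
    lockdown_day = None
    history = []
    for day in range(days):
        if lockdown_day is None and model.prevalence > threshold:
            model.contact_multiplier = lockdown_multiplier
            lockdown_day = day
        model.step()
        history.append(model.prevalence)
    return history, lockdown_day
```

for instance, `run_with_threshold_lockdown(ToyEpidemic(1_000_000, 100, 0.4, 0.1), 100, 0.01, 0.2)` returns the daily prevalence and the day on which the lockdown began. the design point is that the control logic lives in the outer loop rather than inside the simulator, so a different policy only requires a different condition in that loop.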
the impact of lockdown on the reproduction number, r, is given in supplementary figure and an animation showing the age-stratified detail breakouts is in supplementary figure . . shielding : contact reductions can be applied to certain age groups only. for example, given that fatality ratio is highly skewed towards the over s, we have the option of applying a reduction in contacts to this demographic group only. supplementary figure c is a jupyter notebook demonstrating how new infections can be kept low in a shielded group. . physical distancing : measures such as physical distancing and mask wearing will reduce the probability of transmission in certain types of interactions (i.e. random interactions but not household interactions). the abm allows for this to be modelled by allowing for the network-specific transmission multipliers to be adjusted during a simulation. supplementary figure d is a jupyter notebook demonstrating how new infections can be kept low after a lockdown with (extreme) social distancing measures. openabm-covid is able to model contact tracing (both manual and digital) and how it operates with or without an integrated testing system. the model contains many of the real-world imperfections which affect test and contact tracing programmes, such as test sensitivity and specificity, delays in testing and contact tracing, incomplete coverage, failure to recall contacts, contact tracer resource limitations and partial adherence to quarantine requests. it also has the ability to model recursive contact tracing with and without testing.
below we give descriptions of the test and contact tracing features, with sample code given in the supplementary materials along with links to jupyter notebooks. : testing can occur in both the community and hospital (where an immediate clinical diagnosis is allowed). tests are assumed to be sensitive from days post-infection to days post-infection with a sensitivity of % and specificity of %. for community testing, delays can be introduced for ordering a test and for receiving the test result. testing of an individual in the community is triggered by reporting symptoms and can also be triggered by being contact traced. supplementary figure e demonstrates the importance in quick testing if self-isolation only occurs after a positive test (as opposed to on symptoms). contact tracing is vital to control epidemics with a high level of pre-symptomatic transmission. a variable fraction of individuals in each group can be assigned to have the app. ownership of smartphones is based on age-stratified ofcom data (supplementary figure and supplementary table ). digital contact tracing can only occur between two app users. digital proximity sensing is likely to miss some interactions, so when contact tracing a number of interactions are randomly dropped. for contact tracing, the model takes into account all interactions the individual has had with other app-users for the past seven days which have not been dropped. the model can simulate different app-based contact tracing algorithms. the app can send out notifications with the request to quarantine based on symptoms, or based on a positive test result of the index case. it can ask the household members of the index case and/or household members of the contacts to quarantine and also send out notifications deeper into the network if desired. it can request tests for contacts of index cases if desired. 
supplementary figure f demonstrates how digital contact tracing following rapid testing can prevent a second wave even when the average uptake is at only % of the total population. . manual contact tracing : manual contact tracing works in a similar way to digital contact tracing with a few key differences. first, since it does not rely on an individual being a smartphone user, it can originate from anybody who tests positive (particularly important in the elderly where smartphone usage is lower). however, since the identification of interactions relies on the index case recalling them, only a fraction of actual interactions are traced. in particular, the fraction of interactions recalled depends on the type of interaction (i.e. occupation based interactions are more likely to be recalled than random interactions). manual contact tracing only occurs after a delay following a positive test, to account for contact tracers contacting both the index and traced individuals. finally, during a peak in the epidemic the amount of contact tracing required increases and risks overwhelming a manual contact tracing service. therefore the model contains constraints on the total number of interviews that contact tracers can perform on a single day. supplementary figure g demonstrates how a well-staffed manual contact tracing following rapid testing can lessen a second wave. quarantine : contact traced individuals can be asked to quarantine (default days) either because they are directly traced or because they are a household member of somebody who has been traced. like self-isolation, quarantine is modelled by stopping interactions on the workplace network and greatly reducing the number of interactions on the random network. the model includes a daily dropout rate to simulate imperfect adherence. quarantine can be ended if the index case later tests negative (after tracing based upon their symptoms), or if the quarantined individual tests negative. 
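two of the imperfections described above, recall that depends on the type of interaction together with a daily cap on interviews, and quarantine with a daily dropout, can be sketched in a few lines. the function names, the recall probabilities and the quarantine length used in the usage note are hypothetical placeholders, not the model's parameters.

```python
import random


def manual_trace(index_interactions, recall_prob, interviews_left, rng):
    """Trace one index case if interview capacity remains for the day.

    index_interactions: list of (contact_id, network_type) pairs.
    recall_prob: dict mapping network_type to the chance it is recalled.
    Returns (traced contacts, remaining interview capacity).
    """
    if interviews_left <= 0:
        return [], interviews_left  # tracing service is saturated today
    traced = [contact for contact, kind in index_interactions
              if rng.random() < recall_prob[kind]]
    return traced, interviews_left - 1


def quarantine_days_served(max_days, daily_dropout, rng):
    """Number of full days spent in quarantine before dropping out."""
    for day in range(max_days):
        if rng.random() < daily_dropout:
            return day
    return max_days
```

for example, `manual_trace([(2, "occupation"), (3, "random")], {"occupation": 0.8, "random": 0.3}, 50, random.Random(1))` traces each contact with a probability that depends on where the interaction happened and uses up one of the day's interviews. once the capacity reaches zero, further index cases on that day yield no traced contacts, which is how a peak in cases can overwhelm the service.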
the core of openabm-covid is coded in c using an object-oriented coding style. the code is written in a modular manner to ease readability and encourage extension of the code base. it is open source and is being actively developed by multiple teams. the model uses the gnu scientific library (gsl) for mathematical functions, statistical distributions, and random number generation [ ] and so any distribution or function available within the gsl can be easily incorporated into the model (for instance in modelling waiting-time distributions). memory is pre-allocated at the start of the simulation for efficiency. an important feature of the implementation is the python interface using swig. running the model via python allows for complex dynamic interventions strategies to be easily modelled (see examples in supplementary figure a -h). all states of the model (e.g. transmission events, interactions, individual characteristics) are exposed in python, which gives full transparency to the results of the model. for example, supplementary figure h is a notebook showing how to calculate the relative personal protective effect of for app users versus non-app users when digital contact tracing is used. python is also a ubiquitous language amongst data scientists, and the interface allows them to fully interact with the model whilst keeping the high speed and memory performance of c. performance. the abm for million individuals takes approximately s per day to run and requires gb of memory (reduced to . gb if contact-tracing is disabled) on a macbook pro. both speed and memory are linear in population size (tested from k to m). the majority of the cpu usage is spent on rebuilding the daily interaction networks and updating the individual's interaction diaries. we present openabm-covid , a covid- -specific agent-based model suitable for simulating the epidemic in different settings and assessing non-pharmaceutical interventions, including contact tracing using a mobile phone app. 
the model is well documented with a simple interface, allowing non-experts to easily evaluate complex dynamic intervention strategies in a few lines of python code. openabm-covid is an open-source project and is easily extensible, with new features already being added by multiple external teams. the model is fully documented and is thoroughly tested in a formal testing framework. the model was designed to be as parsimonious as possible, with complexity only added when it was essential to model important features of covid- or details of non-pharmaceutical interventions, and with parameters being inferred from published studies. due to the substantial pre-symptomatic and asymptomatic transmission of the virus, it is necessary to model each individual's normal daily interactions. further, on developing symptoms or during interventions such as contact tracing, the interaction pattern of individuals changes to only include those in the household. we therefore took the decision to model interactions using three social networks (household/occupational/random) with non-pharmaceutical interventions affecting each network differently. recurring small-world networks were used to model interactions at home and at work, whereas a transient random network was used to model other daily interactions such as on public transport or in shops. the strong association of covid- disease progression with age along with the age assortativity of social networks led us to use a decade age-structure. the model simulated an urban population of million rather than the population of a whole country to allow realistic estimates for hospitalisation and icu admission forecasts on a regional level. large national epidemics will also exhibit meta-population dynamics rather than the spatially unstructured mixing modelled here. one of the key aims of openabm-covid was to model non-pharmaceutical interventions and, in particular, different forms of contact tracing.
the model of digital contact-tracing allows the roles of testing delays, different quarantine requests, compliance rates, recursive testing and app uptake to be investigated. the model of manual contact-tracing allows the effects of resource limitations, partial contact recall and interview delays to be investigated. importantly, due to the simple python interface, it is possible for non-experts to simulate all these features and to investigate the effect of applying multiple intervention policies at different stages of the epidemic. the current version of the model does not include events in hospitals, care-home settings, non-hospital deaths, gender/sex of individuals, comorbidities, or any geographical structure apart from that implicit within the three modelled networks. all of these limitations are currently being addressed by collaborators and will become available on the github repository in the near future. openabm-covid is a versatile tool to model the covid- epidemic in different settings and simulate different non-pharmaceutical interventions including contact tracing. openabm-covid is a modular tool that will help scientists and policymakers weigh decisions during this epidemic. our vision is that, with the help of the world-wide modelling community, it will develop into a family of models for infectious diseases that are at risk of causing pandemics in the future, adding to the international toolkit for epidemic preparedness.
key: cord- -mjaqt wk authors: enard, david; cai, le; gwennap, carina; petrov, dmitri a title: viruses are a dominant driver of protein adaptation in mammals
date: - - journal: elife doi: . /elife. sha: doc_id: cord_uid: mjaqt wk viruses interact with hundreds to thousands of proteins in mammals, yet adaptation against viruses has only been studied in a few proteins specialized in antiviral defense. whether adaptation to viruses typically involves only specialized antiviral proteins or affects a broad array of virus-interacting proteins is unknown. here, we analyze adaptation in ~ virus-interacting proteins manually curated from a set of proteins conserved in all sequenced mammalian genomes. we show that viruses (i) use the more evolutionarily constrained proteins within the cellular functions they interact with and that (ii) despite this high constraint, virus-interacting proteins account for a high proportion of all protein adaptation in humans and other mammals. adaptation is elevated in virus-interacting proteins across all functional categories, including both immune and non-immune functions. we conservatively estimate that viruses have driven close to % of all adaptive amino acid changes in the part of the human proteome conserved within mammals. our results suggest that viruses are one of the most dominant drivers of evolutionary change across mammalian and human proteomes. doi: http://dx.doi.org/ . /elife. . a number of proteins with a specialized role in antiviral defense have been shown to have exceptionally high rates of adaptation (cagliani et al., ; cagliani et al., ; elde et al., ; fumagalli et al., ; kerns et al., ; liu et al., ; sawyer et al., ; sawyer et al., ; sawyer et al., ; sironi et al., ; vasseur et al., ) . one example is protein kinase r (pkr), which recognizes viral double-stranded rna upon infection, halts translation, and as a result blocks viral replication (elde et al., ) . pkr is one of the fastest adaptively evolving proteins in mammals. 
specific amino acid changes in pkr have been shown to be associated with an arms race against viral decoys for the control of translation (elde et al., ) . however, pkr and other fast-evolving antiviral defense proteins may not be representative of the hundreds or even thousands of other proteins that interact physically with viruses (virus-interacting proteins or vips in the rest of this manuscript). most vips are not specialized in antiviral defense and do not have known roles in immunity. many of these vips play instead key functions in basic cellular processes, some of which might be essential for viral replication. in principle some vips without specific antiretroviral functions might nonetheless evolve to limit viral replication or alleviate deleterious effects of viruses despite the need to balance this evolutionary response with the maintenance of the key cellular functions they play. there are reasons to believe that such an evolutionary response to viruses might be limited, however. first, most vips evolve unusually slowly rather than unusually fast both in animals (davis et al., ; jäger et al., ) and in plants (mukhtar et al., ; weßling et al., ) . second, vips tend to interact with proteins that are functionally important hubs in the protein-protein interaction network of the host possibly limiting their ability to adapt (dyer et al., ; halehalli and nagarajaram, ) . finally, very few cases of adaptation to viruses are known outside of fast evolving, specialized antiviral proteins meyerson et al., ; meyerson and sawyer, ; meyerson et al., ; ng et al., ; ortiz et al., ; schaller et al., ) . transferrin receptor or tfrc is the most notable exception, and serves as a striking example of a non-immune, housekeeping protein used by viruses (demogines et al., ; kaelber et al., ) . tfrc is responsible for iron uptake in many different cell types and is used as a cell surface receptor by diverse viruses in rodents and carnivores. 
tfrc has repeatedly evaded binding by viruses through recurrent adaptive amino acid changes. as such, tfrc is the only clear-cut example of a host protein not involved in antiviral response that is known to adapt in response to viruses. here we analyze patterns of evolutionary constraint and adaptation in a high quality set of~ vips that we manually curated from virology literature. these vips come from a set of~ proteins conserved across well-sequenced mammalian genomes (materials and methods). as expected, the vast majority of these vips (~ %) have no known antiviral or any other more broadly defined immune activity. we confirm that vips do tend to evolve slowly and demonstrate that this is because vips experience much stronger evolutionary constraint than other proteins within the same functional categories. however, despite this greater evolutionary constraint, vips display higher rates of adaptation compared to other proteins. this excess of adaptation is visible in vips across biological functions, on multiple time scales, in multiple taxa, and across multiple studied viruses. finally, we showcase the power of our global scan for adaptation in vips by studying the case of aminopeptidase n, a well-known multifunctional enzyme (mina-osorio, ) used by coronaviruses as a receptor (delmas et al., ; yeager et al., ) . using our approach we reach an amino-acid level understanding of parallel adaptive evolution in aminopeptidase n in response to coronaviruses in a wide range of mammals. we curated a set of vips from the low-throughput virology literature (materials and methods and supplementary file a). vips were defined as proteins that interact physically with viral proteins, viral rna, and/or viral dna (supplementary file a). we excluded interactions identified by high-throughput experiments because we were concerned about a high rate of false positives (mellacheruvu et al., ) . 
the vips were annotated from an initial set of proteins with clear orthologs in all analyzed mammalian high quality genomes (figure , supplementary file b and materials and methods) . elife digest when an environmental change occurs, species are able to adapt in response due to mutations in their dna. although these mutations occur randomly, by chance some of them make the organism better suited to their new environment. these are known as adaptive mutations. in the past ten years, evolutionary biologists have discovered a large number of adaptive mutations in a wide variety of locations in the genome -the complete set of dna -of humans and other mammals. the fact that adaptive mutations are so pervasive is puzzling. what kind of environmental pressure could possibly drive so much adaptation in so many parts of the genome? viruses are ideal suspects since they are always present, ever-changing and interact with many different locations of the genome. however, only a few mammalian genes had been studied to see whether they adapt to the presence of viruses. by studying thousands of proteins whose genetic sequence is conserved in all mammalian species, enard et al. now suggest that viruses explain a substantial part of the total adaptation observed in the genomes of humans and other mammals. for instance, as much as one third of the adaptive mutations that affect human proteins seem to have occurred in response to viruses. so far, enard et al. have only studied old adaptations that occurred millions of years ago in humans and other mammals. further studies will investigate how much of the recent adaptation in the human genome can also be explained by the arms race against viruses. most of the vips ( %) correspond to an interaction between a human protein and a virus infecting humans (supplementary file a). 
human immunodeficiency virus type (hiv- ) is the best-represented virus with vips, with nine other viruses (hpv, hcv, ebv, hbv, hsv, influenza virus, adv, htlv and kshv) having at least vips (supplementary file a). this dataset represents the largest, most up-to-date set of vips backed by individual low-throughput publications. nonetheless, given that many vips were discovered only recently, with half of all publications reporting vips published in the past years (figure ) , it is likely that many additional vips remain to be discovered. the identified vips are involved in diverse cellular and supracellular processes with overlapping go cellular and supracellular processes having more than vips (gene ontology (go) classes (october version) (ashburner et al., ; the gene ontology consortium, ) ; supplementary file c). these cellular processes include transcription ( vips), post-translational protein modification ( vips), signal transduction ( vips), apoptosis ( vips), and transport ( vips). the supracellular processes notably include defense response ( vips) and developmental processes ( vips). only vips or % of vips have known antiviral activity (supplementary file d). these antiviral vips are part of a larger group of vips ( % of vips) with known immune functions, defined here as any activity that modulates the immune response or involved in the development of the immune response (materials and methods and supplementary file d). most -more than % -of the vips have no known immune activity. we analyze both purifying selection and positive selection in vips versus non-vips at two distinct evolutionary time scales: (i) in the great apes in general and in the human branch specifically and (ii) across the entire mammalian phylogeny. we use the ratio of nonsynonymous to synonymous polymorphisms (abbreviated as pn/ps) within humans and great apes as a measure of purifying selection. 
we use mcdonald-kreitman (mk) and the branch-site tests of positive selection using the bs-rel (kosakovsky pond et al., ) and busted (murrell et al., ) tests from the hyphy package (pond et al., ) to assess the prevalence of positive selection in vips compared to non-vips in the human lineage and in mammals in general (material and methods). we confirm that vips tend to evolve slowly (jäger et al., ; davis et al., ) . on average, the vips have~ % lower mammal-wide dn/ds ratio compared to non-vips ( . versus . , % ci [ . , . ]; materials and methods). the difference in dn/ds is highly significant (permutation test p= after iterations; supplementary file b). in order to disentangle whether this slower evolution of vips is due to stronger purifying selection or to a lower rate of adaptation, we first assess the strength of purifying selection in the vips using the pn/ps ratio. genome-wide polymorphism data required to measure pn/ps are available in humans (abecasis et al., ) ( genomes project) (supplementary file e), and other great apes: chimpanzee, gorilla, and orangutans (prado-martinez et al., ) (great apes genome project) (supplementary file f). the genomes project and the great apes genome project are complementary for this analysis. on the one hand, the genomes project provides high quality variants with frequencies estimated from a large number of individuals. on the other hand while the great ape genome project includes fewer individuals and provides coarser frequency data, it provides substantially higher pn and ps counts than the genomes data because non-human great apes tend to be more polymorphic overall (prado-martinez et al., ) . in the human african populations from the genomes project (materials and methods), the average pn/ps is % lower in vips compared to non-vips ( . versus . , % ci [ . , . ], simple permutation test p= after iterations). 
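the simple permutation test reported for these comparisons can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' code: it assumes the test compares the observed vip average (for example of pn/ps) with the averages of equally sized random samples of non-vips, and the function name, the seed and the default number of iterations are hypothetical.

```python
import random


def permutation_test(vip_values, non_vip_values, iterations=10000, seed=0):
    """One-sided permutation p-value (hypothetical helper).

    Returns the fraction of random non-VIP samples, each the same size as
    the VIP set, whose mean is at or below the observed VIP mean.
    """
    n = len(vip_values)
    if n == 0 or n > len(non_vip_values):
        raise ValueError("need 0 < len(vip_values) <= len(non_vip_values)")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sum(vip_values) / n
    hits = 0
    for _ in range(iterations):
        sample = rng.sample(non_vip_values, n)  # draw without replacement
        if sum(sample) / n <= observed:
            hits += 1
    return hits / iterations
```

a p-value of 0 from this sketch only means that no random sample reached the observed average within the chosen number of iterations, which is how a result such as "p= after iterations" would be read.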
vips also show an excess of low frequency ( %) deleterious non-synonymous variants compared to non-vips (figure -figure supplement ; simple permutation test p= after iterations). in great apes, the average pn/ps ratio is % lower in vips compared to non-vips ( . versus . , % ci [ . , . ], simple permutation test p= after iterations; figure a ). finally, stronger purifying selection acting on vips is widespread and is not limited to vips interacting with any one particular virus ( figure b ). vips and non-vips have slightly different coding sequence gc content ( . versus . on average, p= x - ), coding sequence lengths ( versus amino acids on average, p= ) and recombination rates (kong et al., ) ( . cm/mb versus . cm/mb on average, p= . ). to ensure that the difference in pn/ps between vips and non-vips is robust to these differences, we compare vips with non-vips with similar values for each potential confounding factor using permutations with a target average (materials and methods). the difference in pn/ps in great apes between vips and non-vips persists when comparing vips and non-vips with similar gc content ( . versus . , p= after iterations), similar coding sequence length ( . versus . , p= ), or similar recombination ( . versus . , p= ). the difference in pn/ps between vips and non-vips is therefore a genuine difference in the strength of purifying selection and not due to confounding factors biasing the pn/ps ratio. vips have been shown before to be broadly expressed genes and to serve as hubs in the human protein-protein interactions network (dyer et al., , halehalli and nagarajaram, ) . these differences in gene expression and the number of protein-protein interactions may explain the stronger purifying selection experienced by vips. we confirm that vips are indeed expressed in more tissues than non-vips both at the rna level (gtex consortium, ) (gtex v rna-seq expression rpkm! in . tissues on average in vips versus . 
tissues in non-vips, simple permutation test p= ) and at the protein level (human proteome map spectral count! in . tissues on average for vips versus . for non-vips, simple permutation test p= ). vips also have many more protein-protein interaction partners than non-vips based on a dataset of human protein-protein interactions curated by (luisi et al., ) from the biogrid database (stark et al., ) ( . on average versus . , simple permutation test p= ). the magnitude of the difference in pn/ps between vips and non-vips expressed in a similar number of tissues at the rna level (gtex) ( . versus . , p= ) or in a similar number of tissues at the protein level (human protein map) ( . versus . , p= ) remains largely unchanged. in contrast, the difference in pn/ps is strongly affected when comparing vips and non-vips with a similar number of protein-protein interactions. indeed, non-vips with the same number of interacting partners as vips have a pn/ps ratio of . versus . for all non-vips, and the difference in the pn/ ps ratios between vips and non-vips is reduced from % to %. these results show that vips do experience stronger purifying selection than non-vips, and that the difference in purifying selection is driven at least partly by the fact that vips tend to be hubs with many interacting partners in the human protein-protein interactions network. the higher level of purifying selection in vips might be due to the fact that vips participate in the more constrained host functions, or, alternatively, because within each specific host function, viruses tend to interact with the more constrained proteins. in order to test these two non-mutually exclusive scenarios we generated control sets of non-vips chosen to be in the same gene ontology processes as vips (go processes with more than vips; supplementary file c and materials and methods). 
in great apes, go-matched non-vips still have a much higher pn/ps ratio compared to vips, suggesting that vips tend to be more conserved than non-vips from the same go category. on average, pn/ps in the go-matched non-vips is . ( % ci [ . , . ]). this is only slightly lower than the average ratio in non-vips in general (pn/ps= . , p= x - ), but much higher than the average ratio in vips ( . , permutation test p= after iterations). moreover, the stronger purifying selection acting on vips is apparent within most functions. figure c shows stronger purifying selection in the high level go categories with the most vips. in all the go categories pn/ps is lower in vips than in non-vips, and the difference is significant for of these categories (supplementary file c). this shows that within a wide range of host functions, viruses tend to interact with the most conserved proteins. interestingly, even immune vips (supplementary file d) have a significantly reduced pn/ps ratio compared to immune non-vips ( figure c ), which suggests that immune proteins in direct physical contact with viruses are more constrained. the reduction in pn/ps in non-immune vips is very similar to the reduction observed in the entire set of vips ( figure c ). the table at supplementary file c further shows stronger purifying selection in of the go categories ( %) with more than vips. we estimate the proportion of adaptive non-synonymous substitutions (noted a) in vips and non-vips in the human lineage by using the classic mcdonald-kreitman test (mk test) (mcdonald and kreitman, ) (materials and methods). we use the genomes project polymorphism data from african populations (materials and methods and supplementary file e) and divergence between humans and chimpanzees. 
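as a worked illustration, the classic mk estimate of the proportion of adaptive non-synonymous substitutions can be written as 1 - (ds * pn) / (dn * ps), where dn and ds are divergence counts and pn and ps are polymorphism counts. the sketch below assumes that standard form; the function names, the variant encoding and the frequency threshold argument are hypothetical and are not taken from the paper.

```python
def mk_alpha(dn, ds, pn, ps):
    """Classic McDonald-Kreitman estimate: alpha = 1 - (Ds * Pn) / (Dn * Ps)."""
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero")
    return 1.0 - (ds * pn) / (dn * ps)


def count_polymorphisms(variants, min_daf):
    """Count non-synonymous ('N') and synonymous ('S') variants.

    variants is a list of (kind, derived_allele_frequency) pairs; variants
    with a derived allele frequency below min_daf are excluded. The
    threshold is a placeholder, not a value from the paper.
    """
    pn = sum(1 for kind, daf in variants if kind == "N" and daf >= min_daf)
    ps = sum(1 for kind, daf in variants if kind == "S" and daf >= min_daf)
    return pn, ps
```

with these definitions, a negative estimate arises whenever pn/ps exceeds dn/ds, which is the expected effect of segregating slightly deleterious non-synonymous variants.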
we first attempt to limit the effect of deleterious variants by excluding all variants with a derived allele frequency lower than % (materials and methods) (keightley and eyre-walker, , charlesworth and eyre-walker, a , eyre-walker and keightley, , messer and petrov, ). we find that a is strongly elevated in vips compared to non-vips (a= . in vips versus − . in non-vips, permutation test p= .x − ). note that the classic mk test is known to underestimate the true a in the presence of slightly deleterious polymorphisms (charlesworth and eyre-walker, b). given that vips tend to have more non-synonymous deleterious low frequency variants than non-vips (figure -figure supplement ) this downward bias should be stronger in the vips, making this comparison conservative and indicating that vips likely have a substantial excess of adaptation compared to non-vips. the difference in a is robust to recombination (a=− . in non-vips with similar recombination to vips versus − . without control). it is also robust to coding sequence gc content (a=− . with versus − . without control) and coding sequence length (a=− . with versus − . without control). the difference is also robust to variation in levels of expression at the rna level measured as the number of tissues with gtex v rna-seq expression rpkm ≥ (a= . with versus . without control) and as the average expression across all gtex v tissues (a=− . with versus . without control), as well as at the protein level measured as the number of human proteome map tissues with spectral count ≥ (a=− . with versus . without control) or the average expression across all the human proteome map tissues (a=− . with versus . without control). the difference in a is also not affected by the number of protein-protein interactions (a= . with versus − . without control). the difference in a is not affected either by purifying selection, as shown by the fact that using great apes pn/ps or human pn/ps as a control has no effect (a= . with versus .
without control in both cases). finally, we match vips and non-vips with similar go categories (materials and methods, paragraph titled 'gene ontology-matching control samples'). the higher rate of adaptation in vips is not explained by higher rates of adaptation in the host go processes where vips are well represented (a= . with versus À . without control). for all the controls, the difference in a between vips and non-vips remains highly significant (permutation test p< - in all cases). together these results show that the excess of adaptation in vips is robust to many different host factors. we further investigate the excess of adaptation for the specific vips of ten human viruses and in the high level go categories with the most vips ( figure a and b). although the small number of proteins interacting with individual viruses precludes precise estimates of a (see the large confidence intervals on figure a ), the vips show nominally higher values of a for eight out of viruses, with hiv- and hepatitis b virus (hbv) displaying statistically significant increases in adaptation. likewise, vips in most go categories show higher rates of adaptation ( out of ) with of showing statistically significant increases ( figure b ). finally and importantly, the % of vips with no known antiviral or broader immune function (supplementary file d) have a strongly increased rate of adaptation according to the classic mk test (a= . in vips versus . in non-vips, permutation test p= x À ; figure b ). intriguingly, unlike for non-immune vips or all vips considered together (top of figure b ), immune vips, including antiviral vips (supplementary file d), do not show any increase of adaptation compared to immune non-vips. the lack of a signal is unlikely to be due to reduced statistical power of the comparison in a smaller set of immune proteins, given that random samples of non-immune vips with the same size as the immune vips sample ( ) always exhibited a significantly (p< . 
) increased rate of adaptation compared to non-immune non-vips. the classic mk test is known to be biased downward by the presence of slightly deleterious non-synonymous variants (charlesworth and eyre-walker, b) and this bias is difficult to eliminate fully even by excluding low frequency variants (messer and petrov, ) . messer and petrov suggested an asymptotic modification of the mk test which provides less biased estimates of a in the presence of slightly deleterious variants (messer and petrov, ) . the messer-petrov approach estimates a for each frequency category separately and then uses the functional dependence of a on allele frequency to extrapolate a at fixation. this approach thus requires well-resolved frequency data necessitating the use of the genomes data and also lacks power to estimate a for small subsets of genes. we thus use this approach only to better quantify the true a in the complete sets of vips and non-vips. to further validate the asymptotic mk test we carry out extensive population simulations using slim (messer, ) and show that this test is robust to demographic events such as bottlenecks or population expansions (materials and methods and supplementary file g). the asymptotic mk test estimate of a in vips is % ( out of amino acid changes) compared to~ % ( out of , amino acid changes) in non-vips ( figure c ). thus, although vips represent only % of the orthologs in our dataset and only % of all amino acid substitutions, we estimate that in human evolution they account for almost % of all adaptive amino-acid changes. note that both vips and non-vips in our dataset are limited to the proteins conserved across all mammals. the increased rate of adaptation in vips in the human lineage strongly suggests that vips in our dataset, % of which interact with modern viruses affecting humans (supplementary file a), were also vips during the last million years of human evolution since the split with chimpanzees. 
it is also plausible that a substantial proportion of the vips we study are also vips in multiple mammalian lineages. indeed, viruses infecting humans (including the ten viruses with the most vips) are known to have close viral relatives in many other mammals, with the exception of hepatitis c virus (hcv) for which only distant relatives are known and primarily in bats (quan et al., ) . there is also growing evidence that distantly related viruses tend to interact with overlapping sets of host proteins (jäger et al., , davis et al., . we thus hypothesize that vips, while identified primarily in humans, may have also experienced frequent adaptation in mammals in general, with the possible exception of the vips interacting with hcv. to test this hypothesis we use the branch-site random effect likelihood test (bs-rel test) (kosakovsky pond et al., ) and the busted test (murrell et al., ) both available in the hyphy package (pond et al., ) in order to detect episodes of adaptive evolution in each of the branches of the mammalian tree used for the analysis (materials and methods). for a specific coding sequence, the bs-rel and busted tests estimate the proportion of codons where the rate of non-synonymous substitutions is higher than the rate of synonymous substitutions (dn/ds> ), which is a hallmark of adaptive evolution. the bs-rel test estimates proportions of selected codons specifically for each branch, whereas busted estimates an overall proportion of selected codons across the entire tree. both tests then compare two competing models of evolution, one with adaptive substitutions and one without adaptive substitutions, and decide whether the model with adaptation is a significantly better fit to the data. the busted p-value is a good measure of whether a specific protein experienced adaptation in the history of mammalian evolution. 
in addition to presence/absence of adaptation, we assess the amount of adaptation experienced by a particular protein by estimating the average proportion of selected codons from the bs-rel test along all mammalian branches. we compare the proportion of selected codons detected by the bs-rel test between vips and non-vips. the statistical power of busted and the bs-rel test has been shown to depend strongly on the amount of constraint in a coding sequence, with higher constraint/purifying selection decreasing the ability to detect adaptation (kosakovsky pond et al., ) . we confirm this in our dataset by observing a strong positive correlation between the pn/ps ratio in great apes and the proportion of selected codons across mammals estimated by the bs-rel test (spearman's rank correlation = . , p< x - , n= ). we therefore use a permutation test with a target average (materials and methods) that matches vips and non-vips with similar pn/ps ratios in order to compare vips and non-vips that experience similar levels of purifying selection and providing us with similar power to detect adaptation (materials and methods and figure -figure supplements and ) . the permutation test shows that adaptation has been much more common in vips than in non-vips across mammals ( figure ) . we estimate that all vips have experienced twice as many adaptive amino acid changes on average compared to non-vips ( figure a , permutation test p= after iterations). we further use an increasingly strict level of evidence for the presence of adaptation, by including only proteins with increasingly low busted p-values; that is, increasingly high probability that adaptation occurred somewhere on the tree ( figure a ). figure a shows that vips with the strongest evidence of adaptation (busted p-values lower than - ) have a six-fold excess of strong signals of adaptation (permutation test p= after permutations). 
in figure -figure supplement we further show that this excess of adaptation in vips is due to i) more vips with signals of adaptation than non-vips, ii) more branches of the tree per vip showing adaptation, and iii) a greater proportion of codons evolving adaptively per branch. in line with the mk test, we find that the excess of adaptation in mammals is robust to the potential confounding factors of expression at the rna and protein levels, and to the number of host protein-protein interactions (supplementary file h). indeed, adaptation in mammals remains at least twice as frequent in vips compared to non-vips expressed at the rna level in many tissues (permutation test p< - for vips and non-vips expressed in at least , , or gtex tissues), vips and non-vips expressed at the protein level in many tissues (permutation test p ).
figure . excess of adaptation across mammals in vips. the excess of adaptation is measured as the extra percentage of adaptation in vips compared to non-vips. for example, if vips have . times or % more adaptation, then the adaptation excess is %. (a) thick black curve: average excess of adaptation in all vips. dotted black curves: % confidence interval for the excess of adaptation in all vips. thick grey curve: excess of adaptation in non-immune vips. dotted grey curves: % confidence interval for the excess of adaptation in non-immune vips. (b) virus-by-virus excess of adaptation in vips. black dot is the average excess and the represented interval is the % confidence interval. excess is shown for busted p . . (c) excess of adaptation within the top high-level go processes with the most vips. excess is shown for busted p . . (d) proportions of selected codons in vips (blue dot) and non-vips (red dot and % confidence interval) in the mammalian clades represented by more than one species in the tree. all: entire tree. primata: primates. glires: rodents and rabbit. cetartiodactyla: sheep, cow, pig. zooamata: carnivores and horse. excess is shown for busted p . .
the asymptotic mk test is robust to the presence of deleterious mutations and to demography. the asymptotic mk test works by estimating a in bins of derived allele frequencies. for example, a can be calculated in the bin of frequencies from . to . by counting only variants with a derived allele frequency between . and . to measure pn and ps. an exponential curve is then fitted to the estimates of a across bins of frequency. the value taken by the fitted curve for a derived allele frequency of % provides the estimate for a (messer and petrov, ). using the asymptotic mk test, messer and petrov (messer and petrov, ) estimated that a is % in drosophila melanogaster, and % in human. for both species, these estimates were obtained based on polymorphism and divergence data for most of the proteome (more than protein coding sequences in both cases). here, we need to estimate a in vips, and in the same number of randomly sampled non-vips. this is an order of magnitude less than the number of coding sequences used by messer and petrov (messer and petrov, ), which makes curve fitting challenging. indeed, the low number of high frequency variants means that estimates of a in the high frequency bins are very noisy. using the stable release genomes project final phase i variants from african populations, we find that vips only have non-synonymous and synonymous variants with a derived allele frequency above . , respectively. the [ . , . ] bin of frequency only has non-synonymous and synonymous variants. in comparison, the [ . , . ] bin has non-synonymous and synonymous variants. the [ . , . ] bin still has twice as many variants ( non-synonymous, synonymous) as the [ . , . ] bin.
the low number of high frequency variants is however not the only issue. a second potential issue when trying to fit a curve to predict a in the asymptotic mcdonald-kreitman test is the mispolarization of alleles as ancestral or derived. mispolarization is a common problem that distorts the unfolded site frequency spectrum (sfs) (hernandez et al., ). the most severe distortion is usually within the high frequency part of the sfs (hernandez et al., ). indeed, abundant low-frequency derived variants are often misidentified as high frequency derived variants. this can result in substantial overestimations of the number of high frequency variants. the number of non-synonymous variants pn might be more severely overestimated than ps, since fewer high frequency and more low frequency non-synonymous variants are expected in the first place. this could hypothetically result in underestimates of a within high frequency bins. here we modify the asymptotic mk test to circumvent the mispolarization and high frequency, high noise issues. we do so by estimating a based only on derived allele frequencies lower than . , where the distortion of the sfs due to mispolarized alleles is negligible. this also makes the asymptotic mcdonald-kreitman test less reliant on bins of high frequencies with very noisy estimates of a, due to small values of pn and ps. we use either a logarithmic fit of the form $y = a + b\,\ln(x + c)$ over the range of frequencies to . (figure c), or an exponential fit of the form $y = a + b\,\exp(-x/c)$. both the logarithmic fit and the exponential fit provide accurate estimates of a for a wide range of evolutionary scenarios, as shown by forward population simulations using slim (messer, ). we use the forward population simulator slim (messer, ) to simulate a typical, codons, six-exon coding sequence. each exon is separated by bp long introns. one in four coding sites is synonymous and only experiences neutral mutations.
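the logarithmic fit introduced above can be illustrated with a small standard-library sketch. it is not the fitting procedure used by the authors (who mention fitting routines in r): it assumes a grid search over c combined with ordinary least squares for a and b, which works because the model is linear in a and b once c is fixed. the proportion of adaptive substitutions is called alpha in the code to keep it apart from the fit coefficient a; the grid, the function names, the optional frequency cutoff and the frequency at which the curve is evaluated are all hypothetical choices.

```python
import math


def linear_fit(u, y):
    """Ordinary least squares for y = a + b*u; returns (a, b, sse)."""
    n = len(u)
    mean_u, mean_y = sum(u) / n, sum(y) / n
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    s_uy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
    b = s_uy / s_uu
    a = mean_y - b * mean_u
    sse = sum((yi - (a + b * ui)) ** 2 for ui, yi in zip(u, y))
    return a, b, sse


def fit_log_curve(freqs, alphas, max_freq=None, c_grid=None):
    """Fit alpha = a + b*ln(x + c) and return (a, b, c).

    max_freq : optional cutoff; bins at or above it are ignored
    c_grid   : candidate values for c (hypothetical default: 0.001 to 1.0)
    """
    points = [(x, y) for x, y in zip(freqs, alphas)
              if max_freq is None or x < max_freq]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if c_grid is None:
        c_grid = [k / 1000 for k in range(1, 1001)]
    best = None
    for c in c_grid:
        # With c fixed, the model is linear in a and b.
        a, b, sse = linear_fit([math.log(x + c) for x in xs], ys)
        if best is None or sse < best[3]:
            best = (a, b, c, sse)
    return best[:3]


def predict(params, x):
    """Evaluate the fitted curve at derived allele frequency x."""
    a, b, c = params
    return a + b * math.log(x + c)
```

the final estimate would be read off by calling predict at the frequency of interest; which frequency to use is left to the caller here.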
non-synonymous sites experience neutral, advantageous, strongly deleterious and slightly deleterious mutations. the coding sequence evolves for , generations in a population of individuals, at a uniform mutation rate of . x À and with a uniform recombination rate of cm/mb. these parameters are equivalent to , generations of evolution of a , individuals population with a mutation rate of . x À and a recombination rate of cm/mb. this results roughly in the amount of divergence observed in the human lineage since divergence with chimpanzee. the rescaling by a factor of ten greatly speeds up the simulations. roughly matching the observed dn, ds, pn and ps in vips requires simulating coding sequences. the true a obtained from simply counting adaptive fixations in the simulations can then be compared with the a estimated from dn, ds, pn and ps. by repeating the simulation of sets of coding sequences many times, we can get the variance of the estimation of a both by the modified asymptotic mk test. by repeating the simulations times, we show that the modified asymptotic mk test gives accurate estimates of a for all evolutionary scenarios tested (supplementary file g). in practice, the logarithmic fit is easier to use than the exponential fit. indeed, fitting algorithms such as the ones implemented in the lm() function or the nlslm() function from the minpack.lm package in r often fail to converge for the exponential fit. we therefore use the logarithmic fit. we use the multiple alignments of the coding sequences from the mammals listed above to quantify adaptation across mammals. there are three different types of tests aimed at detecting and quantifying adaptation in a multi-species coding sequence alignment: branch tests, site tests, and branch-site tests. the so-called branch tests look for branches in a tree where the ratio of non-synonymous to synonymous substitutions dn/ds exceeds one for the entire coding sequence. 
for this to happen, an extreme amount of adaptation is required in a specific branch. branch tests thus detect only the most extreme bursts of adaptation, and have very low statistical power to detect the vast majority of more moderate bursts of adaptation in a phylogeny. this makes them a very poor choice to quantify adaptation within an entire phylogeny. site tests look for specific codons of a coding sequence where dn/ds significantly exceeds one across the entire phylogeny. codons with dn/ds far above one are codons that have accumulated many adaptive non-synonymous substitutions across the tested phylogeny. this means site tests ignore the case where specific codons have evolved adaptively on a specific branch, probably the most common case in coding sequence evolution (murrell et al., ). although site tests are well suited for cases where there is a strong a priori expectation about which sites should evolve adaptively, as is for example the case of tfrc, here we have no a priori knowledge about the sites that are expected to evolve adaptively in vips in response to viruses. instead we use branch-site tests, which are designed to detect adaptation at specific codons in specific branches. there are currently two main implementations of the branch-site test, one available in paml (zhang et al., ; yang, ), and one available in the hyphy package (kosakovsky pond et al., ). the two tests are both likelihood ratio tests that compare a model integrating positive selection with a neutral model without positive selection. the paml branch-site test and the hyphy bs-rel branch-site test differ mainly in the assumptions of their evolutionary models. the paml branch-site test defines two kinds of branches in the phylogenetic tree used, the foreground and background branches. the foreground branch is the branch where the presence of positive selection is tested.
the evolutionary model of the paml branch-site test allows positive selection in the foreground branch, but not in the background branches. unlike the paml branch-site test, the hyphy bs-rel test uses a model that has no limitation regarding the occurrence of adaptation across the tree. this difference in the models used has very profound consequences for the ability of the two tests to detect and quantify recurrent adaptation (kosakovsky pond et al., ). indeed, the hyphy test has good power to detect recurrent adaptation. because it does not allow adaptation in the background branches, the paml test suffers a severe loss of statistical power when recurrent adaptation does occur in the background branches. as an example, the hyphy bs-rel test detects significant (bs-rel test p . ) signals of adaptation in branches of the mammalian tree used in this study for pkr ( figure ). in comparison, the paml test detects only nine branches (paml test p . ). this is a crucial difference between the two tests in our case given that the arms race with viruses is likely to trigger recurrent bursts of adaptation across mammals. for this reason we use hyphy bs-rel to quantify adaptation in mammals. more specifically, we use the proportion of selected codons estimated by the bs-rel test to quantify adaptation. to estimate the strength of the evidence in favor of adaptation across the entire mammalian tree, we use the p-value of the busted test in hyphy that uses the same codon evolution model as the bs-rel test. in this study, we compare vips and non-vips for weak (busted p-values . ) to increasingly strong (busted p-values - ) evidence of adaptation. we start by computing the average proportion of codons under adaptive evolution for vips and the same average proportion for the sets of randomly matched non-vips (see the description of the permutation test).
for each coding sequence, we retrieve the proportion of positively selected codons on each branch, and compute the average of this proportion across branches. more specifically, we only count branches of the tree with conserved synteny (supplementary file b) . that is, if of the branches of the tree have conserved synteny (see above), we compute the average proportion of selected codons only from these branches. in practice we tolerate only up to five branches in the tree with no conserved synteny (at least branches with conserved synteny; supplementary file ). this reduces the dataset of orthologs that can be used in the analysis only slightly, from to total, and among those the number of vips from to . then adaptation is simply quantified as the average proportions of selected codons in valid branches across vips (or the same number of matched non-vips). if the threshold for busted p-value is set to Àx , we only include in the quantification the average proportions of selected codons from coding sequences with busted p-value Àx . for a low x, we compare how much selection occurred in vips and non-vips counting both weak and strong signals of adaptation. for a high x, we compare how much strong, highly significant signals of adaptation occurred in vips compared to non-vips. the pn/ps ratio as a measure of purifying selection we designed a permutation test that makes it possible to compare adaptation in vips with adaptation in non-vips with the same amount of purifying selection. the amount of purifying selection in a protein corresponds to the proportion of amino acids that cannot change, or very infrequently during evolution. on average vips experience much more purifying selection than non-vips. this means that mechanically, a smaller proportion of amino acids can possibly be targeted by adaptive evolution in vips. 
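the quantification of adaptation described earlier in this passage (average proportion of selected codons over branches with conserved synteny, restricted by a busted p-value threshold) can be sketched as follows. this is an illustration under stated assumptions, not the authors' code: the data layout and function names are hypothetical, each gene is assumed to have at least one branch with conserved synteny, and coding sequences that fail the busted threshold are assumed to contribute zero to the set-wide average, a choice the text does not state explicitly. the excess follows the definition given in the figure legend (extra percentage of adaptation in vips compared to non-vips).

```python
def mean_selected_proportion(branch_props, syntenic):
    """Average the per-branch proportions of selected codons,
    keeping only branches flagged as having conserved synteny."""
    kept = [p for p, ok in zip(branch_props, syntenic) if ok]
    return sum(kept) / len(kept)


def adaptation_level(genes, p_threshold):
    """Average adaptation across a set of coding sequences.

    genes : list of dicts with keys 'busted_p', 'props', 'syntenic'
            (hypothetical layout).
    ASSUMPTION: genes with busted_p above the threshold count as zero.
    """
    total = 0.0
    for gene in genes:
        if gene["busted_p"] <= p_threshold:
            total += mean_selected_proportion(gene["props"], gene["syntenic"])
    return total / len(genes)


def adaptation_excess(vip_level, non_vip_level):
    """Extra percentage of adaptation in VIPs relative to non-VIPs."""
    return (vip_level / non_vip_level - 1.0) * 100.0
```

lowering the threshold passed to adaptation_level restricts the comparison to increasingly strong signals, mirroring the low-x versus high-x comparison in the text.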
a naïve comparison of vips and non-vips would therefore tell more about the difference in purifying selection than about the difference in the amount of adaptation. instead, the idea is to compare adaptation in the vips with adaptation in non-vips with the same overall average and variance in levels of purifying selection. the sampling of non-vips with purifying selection similar to vips is however challenging. the first question is which measure of purifying selection to use. the ratio of non-synonymous to synonymous substitution rates dn/ds is often used as a measure of purifying selection. how much smaller dn is compared to ds can indeed tell how evolutionarily constrained a protein is. however the problem with dn/ds is that dn not only reflects purifying selection, but also reflects adaptive amino acid substitutions. this means that comparing vips with non-vips with similar dn/ds ratios would underestimate an excess of adaptation in vips. this is because more adaptation would increase dn/ds in vips more than it does in non-vips. unlike the dn/ds ratio, the ratio of non-synonymous to synonymous polymorphism pn/ps only reflects purifying selection. indeed, pn is decreased by purifying selection, but is not affected by adaptive mutations that segregate for very short times in populations. this makes pn/ps a much better measure of purifying selection than dn/ds, and one that can be used to match vips with similarly constrained non-vips. there are however two problems with the pn/ps ratio. the first is that proteome-wide estimates of pn/ps are not available for all the mammals included in this analysis. good estimates of pn/ps require sequencing the genomes of a sufficient number of non-inbred individuals, ideally more than ten, within a given species.
the pn/ps ratio is publicly available, based on the genome sequences of a sufficient number of individuals, in human (abecasis et al., ) and the non-human primate species represented in the great ape genome project, namely chimpanzee, gorilla and orangutan (prado-martinez et al., ) . the limited number of species of the mammalian tree with pn/ps information can still be used as a control of purifying selection in the permutation test for all mammals. it is true that the pn/ps ratio within a species or a subset of species does not represent the absolute, overall level of purifying selection in the entire mammalian tree. it is known for instance that primates experience weaker purifying selection than rodents. what matters however for the permutation test is not the absolute level of purifying selection, but the relative difference in pn/ps between vips and non-vips. indeed, vips and matched non-vips with similar pn/ps experience similar purifying selection, even if pn/ps is from a subset of species in the mammalian tree. whether pn/ps is overall skewed towards higher or towards lower values in the subset of species used, then the skew is still the same for both vips and non-vips. this means that the relative difference in pn/ps is still a good measure of the general difference in purifying selection across mammals. a different skew in vips and non-vips requires invoking unlikely scenarios where vips would experience a global relaxation or intensification of constraint specifically in the primate species where pn/ps is available. given the high number of vips and their high functional diversity (supplementary file c) , such a global trend towards relaxation or higher constraint in primates is extremely unlikely. the pn/ps ratio from primate species can therefore be used as a control for purifying selection. 
more specifically, we use the pn/ps ratio from populations of chimpanzees (nigeria-cameroon, eastern and central populations), gorillas (western lowland population) and orangutans (sumatran and bornean populations) (supplementary file f) . these populations are the populations included in the great apes genome project with the highest effective population sizes. indeed, the pn/ps ratio is less noisy and available for more proteins in populations with higher population sizes and higher genetic diversity. for each vip and non-vip, the value of pn/ps used in the permutation test is simply the sum of pn across all the primate populations divided by the sum of ps across the same populations (supplementary file f). in each primate population, pn and ps are measured excluding singletons to limit the influence of potential erroneous variant calls. the second potential problem with pn/ps is that it is a noisy measure of purifying selection. at any time in a population of primates, only few positions are polymorphic within a typical (~ codons) coding sequence. as a consequence, a highly constrained coding sequence may by chance have more non-synonymous variants than synonymous variants, and a high pn/ps ratio. conversely, a weakly constrained coding sequence may by chance have less non-synonymous variants, and a low pn/ps ratio. this is problematic if we want to use pn/ps as a control for purifying selection. one can consider the case of vips where pn/ps is substantially lower than in non-vips. matching vips with non-vips with a similarly lower pn/ps, we would end up selecting non-vips with a lower pn/ps not because of purifying selection but merely because of noise. this makes controlling for purifying selection less straightforward than directly matching each individual vip with non-vips with similar pn/ps ratios. instead we use an indirect matching strategy. 
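the pooled pn/ps ratio described above (sum of pn across primate populations divided by the sum of ps across the same populations) can be written as a short sketch. the function name and the population labels in the usage note are hypothetical, and the counts are assumed to have had singletons removed beforehand, as the text specifies.

```python
def pooled_pn_ps(counts_by_population):
    """Pool polymorphism counts across populations before taking the ratio.

    counts_by_population : dict mapping a population label to a (pn, ps)
                           pair, singletons already excluded.
    Returns None when no synonymous variant is available.
    """
    total_pn = sum(pn for pn, _ in counts_by_population.values())
    total_ps = sum(ps for _, ps in counts_by_population.values())
    if total_ps == 0:
        return None
    return total_pn / total_ps
```

for instance, counts of (2, 5), (1, 3) and (0, 2) in three populations pool to 3 non-synonymous and 10 synonymous variants. summing before dividing avoids undefined or extreme per-population ratios when a population has very few variants, which is the noise problem discussed in the text.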
permutations with a target average: the example of purifying selection as described above, the pn/ps ratio is a noisy measure of purifying selection. this means that we cannot use a direct matching strategy between vips and non-vips for the permutation test. an indirect matching strategy can however still be used, that uses the mammals-wide rate of non-synonymous substitutions dn as an intermediate. in particular, we use paml to estimate dn and ds under the m evolution model (yang, ) . the dn/ds ratio for the whole mammalian tree (supplementary file b) integrates hundreds of millions of years of evolution. in the absence of adaptation, it would therefore be a much less noisy measure of purifying selection than pn/ps. the issue is however that dn is influenced by both purifying selection and by adaptation. if vips experience more adaptation than non-vips, then purifying selection being equal, we expect dn/ds to be higher in vips than in non-vips. if vips experience less adaptation than non-vips, then purifying selection being equal, we expect dn/ds to be lower in vips than in non-vips. vips have a % lower pn/ps than non-vips in great apes, but only a % lower dn/ds than non-vips. the smaller difference in dn/ds than pn/ps is an indication that adaptation has increased dn more strongly in vips than in non-vips. purifying selection being equal, dn/ds is therefore higher in vips than in non-vips. this means that non-vips with the same pn/ps (purifying selection) as vips have a lower dn/ds. we can therefore match vips and non-vips with the same pn/ps by selecting non-vips with a dn/ds ratio that is only a fraction of the dn/ds in vips. this fraction can be adjusted through trial and error until finding the one that matches vips with non-vips with the same overall average pn/ps. this indirect matching strategy makes it possible to compare vips and non-vips with the same level of purifying selection while avoiding the pitfall of noise in pn/ps. 
the random sets of non-vips must fulfill two criteria to be comparable to vips. first, non-vips should have the same overall average pn/ps as vips. second, non-vips should have the same variance in pn/ps, i.e. pn/ps values in non-vips are spread as much as they are in vips. this can all be achieved by using a permutation scheme where samples of non-vips must satisfy a pre-fixed, target average ( figure -figure supplement ). note that although here we detail the case of purifying selection, permutations with a target average can be used to get samples of non-vips similar to vips for any possible factor. for the case of purifying selection, the permutations with a target average work as follows. we first measure the average dn in vips, noted dn(vip). then we define the target average dn we wish the chosen non-vips to exhibit at the end of the sampling. specifically, we ask the target dn to be a fraction a of dn(vip), plus or minus %. the target dn for non-vips can therefore take values between dn(inf) and dn(sup), where dn(inf) = . (a × dn(vip)) and dn(sup) = . (a × dn(vip)). the fraction a is set manually through trial and error so that the sampled non-vips have the same average pn/ps as vips. note that we use dn and not dn/ds to avoid giving too much weight to ds, as it tends to saturate and take much greater values than dn (supplementary file b) and thus bears much more heavily on the dn/ds ratio. non-vips are sampled using a simple algorithm described in figure -figure supplement . we first randomly sample a set of five non-vips. this initial sampling of five non-vips is repeated until their average dn falls within the target interval [dn(inf), dn(sup)]. we then add randomly sampled non-vips one at a time until their number matches the number of vips.
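the sampling procedure outlined here, together with the constraint and correction rules spelled out in the text that follows, can be sketched as below. this is a minimal sketch under stated assumptions, not the authors' implementation: non-vips are drawn with replacement from a list of dn values, the function and parameter names are hypothetical, and the pool is assumed to contain values on both sides of the target interval so that the rejection loops terminate.

```python
import random


def mean(values):
    return sum(values) / len(values)


def target_interval(dn_vip, fraction, tolerance):
    """Interval around fraction * dn_vip, widened by +/- tolerance."""
    center = fraction * dn_vip
    return (1.0 - tolerance) * center, (1.0 + tolerance) * center


def sample_with_target_average(pool, n, low, high, x, rng, seed_size=5):
    """Draw n values whose running mean is steered towards [low, high].

    pool : dN values of candidate non-VIPs (ASSUMPTION: with replacement)
    x    : every x-th draw is accepted unconditionally
    rng  : a random.Random instance
    """
    # Step 1: redraw the initial set until its mean lies in the interval.
    while True:
        sample = [rng.choice(pool) for _ in range(seed_size)]
        if low <= mean(sample) <= high:
            break
    # Step 2: grow the sample one value at a time.
    while len(sample) < n:
        current = mean(sample)
        candidate = rng.choice(pool)
        if (len(sample) + 1) % x == 0:
            # Unconstrained draw: may push the mean out of the interval.
            sample.append(candidate)
        elif current > high:
            # Correction: accept only values that lower the mean.
            if candidate < current:
                sample.append(candidate)
        elif current < low:
            # Correction: accept only values that raise the mean.
            if candidate > current:
                sample.append(candidate)
        elif low <= mean(sample + [candidate]) <= high:
            # Regular draw: the mean must stay inside the interval.
            sample.append(candidate)
    return sample
```

a small x lets many unconstrained values in and so widens the spread of the sample, while a large x keeps the sample tightly constrained, which is how x controls the variance in the scheme described in the text.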
the average dn of all the sampled non-vips has to remain within [dn(inf), dn(sup)] (blue dots in figure -figure supplement ), except for every x-th non-vip, which is sampled completely randomly (red dots in figure -figure supplement ). this means that in the latter case the average dn of the sampled non-vips can fall out of [dn(inf), dn(sup)]. when this happens we sample non-vips with dn values that bring the average dn back within [dn(inf), dn(sup)] (grey dots in figure -figure supplement ). that is, if the average dn is above dn(sup), we sample as many non-vips as necessary that each lower the average dn until it falls back within [dn(inf), dn(sup)]. if the average dn is below dn(inf), we sample non-vips that each increase the average dn until it falls back within [dn(inf), dn(sup)]. the parameter x is the parameter of the test that makes it possible to match the variance in pn/ps of the sampled non-vips with the variance observed for vips. a low x gives samples of non-vips with a higher variance. a high x gives samples of non-vips with a lower variance. to define the fraction a and x, we get random samples of non-vips. we then test whether the average and variance of pn/ps in vips are significantly different or not from the distributions of averages and variances of pn/ps given by the random samples of non-vips. we find that a= . and x= give samples of non-vips with slightly significantly higher average pn/ps than vips' pn/ps ( . in non-vips versus . in vips, p= . ) and a very similar variance ( . in non-vips vs . in vips, p= . ). the pn/ps ratio in human african populations is also slightly higher in non-vips compared to vips ( . in non-vips compared to . in vips, p= . ), which shows that our calibration is robust to the species used to measure pn/ps. there is no combination of a and x where both the average and variance of pn/ps are identical in vips and the sampled non-vips. the combination of a= .
and x= gives the closest matching variances and a slightly higher pn/ps (lower purifying selection, see above for numbers) in non-vips than in vips. other combinations give closer averages of pn/ps, but more distant variances. to be conservative, we thus choose to use a= . and x= . the fact that the sampled non-vips experience slightly less purifying selection than vips makes the comparison conservative (the less purifying selection in non-vips compared to vips, the more opportunities there were for adaptation to happen at positions of coding sequences that can change). finally, using a= . and x= , we can compare vips and non-vips with similar dn/ds ratios (vips' and non-vips' average dn/ds= . , p= . ). as expected ( figure -figure supplement ), matching dn/ds instead of pn/ps strongly underestimates, but still reveals, a substantial excess of adaptation in vips compared to non-vips ( % adaptation excess, p= versus % excess, p= after iterations, when matching pn/ps; see figure a ). throughout this analysis we distinguish between the effects due to viruses and the effects due to the functional roles that vips play in the host. this is done by comparing vips with matching control sets of non-vips with similar gene ontology (go) processes. there are go processes with or more vips (supplementary file c). the matching procedure is conducted using only these processes. for each vip, we find all the non-vips that have at least % of go processes in common, and where the total number of processes does not exceed % of the number in the vip to be matched with. we then randomly choose one non-vip among all those that fulfill these requirements. with the parameters used, we find each vip always has more than non-vips to choose from, and many more for most vips. furthermore, these parameters give control sets of non-vips with representations of go processes very similar to their representation in vips.
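the go matching rule described above can be sketched as follows. this is a hedged illustration only: the shared fraction is assumed to be measured relative to the number of go processes of the vip, both thresholds are left as parameters because their values are not recoverable from the text, and the function name and data layout are hypothetical.

```python
import random


def go_matched_control(vip_terms, candidates, min_shared, max_size_ratio, rng):
    """Pick one non-VIP whose GO processes resemble those of a VIP.

    vip_terms      : set of GO process identifiers annotated to the VIP
    candidates     : dict mapping non-VIP identifier -> set of GO processes
    min_shared     : minimum fraction of the VIP's processes that must be shared
    max_size_ratio : maximum size of the candidate's process set,
                     as a multiple of the VIP's process count
    rng            : a random.Random instance
    Returns None when no candidate qualifies.
    """
    eligible = [
        gene
        for gene, terms in sorted(candidates.items())
        if len(vip_terms & terms) >= min_shared * len(vip_terms)
        and len(terms) <= max_size_ratio * len(vip_terms)
    ]
    return rng.choice(eligible) if eligible else None
```

repeating the call for every vip yields one control set; repeating the whole procedure yields the distribution of control sets against which vips are compared.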
on average the representation of each go process is only % lower or higher in the matching controls, versus % lower or higher in non-matching, randomly sampled sets of non-vips. note that perfect matching is impossible to achieve given that different proteins can have very different and specific combinations of associated go processes. number of protein-protein interactions. (i) table with mammalian orthologs and evidence of adaptation. the table provides all the , orthologs with best reciprocal hits (material and methods), the first column is ensembl gene id, the second column is busted p-value, and the other columns are bs-rel estimated proportions of selected codons in all the branches tested. note that the proportions of selected codons are set to zero for those branches where there is no good synteny information (materials and methods). (j) genomes project consortium. . an integrated map of genetic variation from , human genomes human genomics. the genotype-tissue expression (gtex) pilot analysis: multitissue gene regulation in humans gene ontology: tool for the unification of biology. the gene ontology consortium structural requirements of n-glycosylation of proteins. studies with proline peptides as conformational probes the genomic rate of adaptive amino acid substitution in drosophila a trans-specific polymorphism in zc hav is maintained by long-standing balancing selection and may confer susceptibility to multiple sclerosis a positively selected apobec h haplotype is associated with natural resistance to hiv- infection. 
evolution the mcdonald-kreitman test and slightly deleterious mutations the mcdonald-kreitman test and slightly deleterious mutations acoustic function in the peripheral auditory system of cuvier's beaked whale (ziphius cavirostris) global mapping of herpesvirus-host protein complexes reveals a transcription strategy for late genes aminopeptidase n is a major receptor for the entero-pathogenic coronavirus tgev further characterization of aminopeptidase-n as a receptor for coronaviruses dual host-virus arms races shape an essential housekeeping protein evidence for ace -utilizing coronaviruses (covs) related to severe acute respiratory syndrome cov in bats the landscape of human proteins interacting with viruses and other pathogens protein kinase r reveals an evolutionary model for defeating viral mimicry data from: coding sequence alignments of , mammalian orthologs estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change the effect of insertions, deletions, and alignment errors on the branch-site test of positive selection population genetics of ifih : ancient population structure, local selection, and implications for susceptibility to type diabetes genenames.org: the hgnc resources in virhostnet . 
: surfing on the web of virus/host molecular interactions data molecular principles of human virus protein-protein interactions context dependence, ancestral misidentification, and spurious signatures of natural selection global landscape of hiv-human protein complexes the effects of alignment error and alignment filtering on the sitewise detection of positive selection evolutionary reconstructions of the transferrin receptor of caniforms supports canine parvovirus being a reemerged and not a novel pathogen in dogs joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies blat-the blast-like alignment tool positive selection and increased antiviral activity associated with the parp-containing isoform of human zinc-finger antiviral protein a draft map of the human proteome fine-scale recombination rate differences between sexes, populations and individuals a random effects branchsite model for detecting episodic diversifying selection hpidb-a unified resource for host-pathogen interactions adaptive evolution of primate trim alpha, a gene restricting hiv- infection a model of evolution and structure for multiple sequence alignment recent positive selection has acted on genes encoding proteins with more interactions within the whole human interactome an evolutionary screen highlights canonical and noncanonical candidate antiviral genes within the primate trim gene family adaptive protein evolution at the adh locus in drosophila the crapome: a contaminant repository for affinity purification-mass spectrometry data frequent adaptation and the mcdonald-kreitman test slim: simulating evolution with selection and linkage positive selection of primate genes that promote hiv- replication two-stepping through time: mammals and viruses identification of owl monkey cd receptors broadly compatible with early-stage hiv- isolates the moonlighting enzyme cd : old and new functions to target 
independently evolved virulence effectors converge onto hubs in a plant immune system network gene-wide identification of episodic selection detecting individual sites subject to episodic diversifying selection filovirus receptor npc contributes to species-specific patterns of ebolavirus susceptibility in bats a scan for positively selected genes in the genomes of humans and chimpanzees identification of a putative cellular receptor kda polypeptide for porcine epidemic diarrhea virus in porcine enterocytes evolutionary trajectories of primate genes involved in hiv pathogenesis ucsf chimera-a visualization system for exploratory research and analysis hyphy: hypothesis testing using phylogenies great ape genetic diversity and population history bats are a major natural reservoir for hepaciviruses and pegiviruses structural bases of coronavirus attachment to host aminopeptidase n and its inhibition by neutralizing antibodies ancient adaptive evolution of the primate antiviral dna-editing enzyme apobec g discordant evolution of the adjacent antiretroviral genes trim and trim in mammals positive selection of primate trim identifies a critical speciesspecific retroviral restriction domain hiv- capsid-cyclophilin interactions determine nuclear import pathway, integration targeting and replication efficiency a common polymorphism in tlr confers natural resistance to hiv- infection the biogrid interaction database: update gene ontology consortium: going forward mutational analysis of aminopeptidase n, a receptor for several group coronaviruses, identifies key determinants of viral host range the selective footprints of viral pressures at the human rig-i-like receptor family convergent targeting of a common host protein-network by pathogen effectors from three kingdoms of life paml : phylogenetic analysis by maximum likelihood human aminopeptidase n is a receptor for human coronavirus e minke whale genome and aquatic adaptation in cetaceans evaluation of an improved 
branch-site likelihood method for detecting positive selection at the molecular level

we thank kerry samerotte, sandeep venkataram, emily ebel, pleuni pennings, hunter fraser, sara sawyer and sergei kosakovsky pond for comments on the manuscript. this work is funded by nih grants r gm and r gm to dap. the funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

table with vips. interactions are described in the third column. for example, -ebv-dsdna means that the article with pubmed id describes an interaction between a mammalian host protein and an epstein-barr virus ebv protein. the dsdna label is for the fact that ebv is a double-stranded dna virus (we use ssrna for single-stranded rna viruses, ssrnart for single-stranded rna retroviruses, dsdna for double-stranded dna viruses, dsdnart for double-stranded dna retroviruses and ssdna for single-stranded dna viruses). if the interaction in the example was with ebv's rna, we would have -rna-ebv-dsdna instead of -ebv-dsdna. if the interaction was with ebv's dna, we would have -dna-ebv-dsdna.
(b) table with the mammalian orthologous cds information. the table contains the synteny information as well as the mammals-wide rates dn and ds for each of the , orthologs included in the analysis.
(c)

key: cord- -ea sjcs authors: ramazzotti, daniele; angaroni, fabrizio; maspero, davide; gambacorti-passerini, carlo; antoniotti, marco; graudenzi, alex; piazza, rocco title: verso: a comprehensive framework for the inference of robust phylogenies and the quantification of intra-host genomic diversity of viral samples date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: ea sjcs

a global cross-discipline effort is ongoing to characterize the evolution of sars-cov- virus and generate reliable epidemiological models of its diffusion.
to this end, phylogenomic approaches leverage accumulating genomic mutations to track the evolutionary history of the virus and benefit from the surge of sequences deposited in public databases. yet, such methods typically rely on consensus sequences representing the dominant virus lineage, whereas a complex intra-host genomic composition is often observed within single hosts. furthermore, most approaches might produce inaccurate results with noisy data and sampling limitations, as witnessed in most countries affected by the epidemic. we introduce verso (viral evolution reconstruction), a new comprehensive framework for the characterization of viral evolution and transmission from sequencing data of viral genomes. our probabilistic approach first delivers robust phylogenetic models from clonal variant profiles and then exploits variant frequency patterns to characterize and visualize the intra-host genomic diversity of samples, which may reveal undetected infection events. we show via extensive simulations that verso outperforms the state-of-the-art tools for phylogenetic inference, including under noisy observations and sampling limitations. the application of our approach to sars-cov- samples from amplicon sequencing and to samples from rna-sequencing unravels robust phylogenomic models, improving the current knowledge on sars-cov- evolution and spread. importantly, by exploiting co-occurrence patterns of minor variants, verso allows us to reveal undetected infection paths, which are validated with contact tracing data. moreover, the in-depth analysis of the mutational landscape of sars-cov- confirms a statistically significant increase of genomic diversity in time and allows us to identify a number of variants that are transiting from minor to clonal state in the population, as well as several homoplasies, some of which might indicate ongoing positive selection processes.
overall, the results show that the joint application of our framework and data-driven epidemiological models might improve currently available strategies for pathogen surveillance and analysis. verso is released as an open source tool at https://github.com/bimib-disco/verso. the outbreak of coronavirus disease (covid- ) , which started in late in wuhan (china) [ , ] and was declared pandemic by the world health organization, is fueling the publication of an increasing number of studies aimed at exploiting the information provided by the viral genome of sars-cov- virus to identify its proximal origin, characterize the mode and timing of its evolution, as well as to define descriptive and predictive models of geographical spread and evaluate the related clinical impact [ , , ] . as a matter of fact, the mutations that rapidly accumulate in the viral genome [ ] can be used to track the evolution of the virus and, accordingly, unravel the viral infection network [ , ] . at the time of this writing, numerous independent laboratories around the world are isolating and sequencing sars-cov- samples and depositing them on public databases, e.g., gisaid [ ] , whose data are accessible via the nextstrain portal [ ] . such data can be employed to estimate models from genomic epidemiology and may serve, for instance, to estimate the proportion of undetected infected people by uncovering cryptic transmissions, as well as to predict likely trends in the number of infected, hospitalized, dead and recovered people [ , , ] . more in detail, most studies employ phylogenomic approaches that process consensus sequences, which represent the dominant virus lineage within each infected host. 
a growing plethora of methods for phylogenomic reconstruction is available to this end, which rely on different algorithmic frameworks, including distance-matrix, maximum parsimony, maximum likelihood or bayesian inference, with various substitution models and distinct evolutionary assumptions (see, e.g., [ , , , , , , , , , ] ). however, while such methods have repeatedly proven effective in unraveling the main patterns of evolution of viral genomes with respect to many different diseases, including sars-cov- [ , , , ] , at least two issues can be raised. first, most phylogenomics methods might produce unreliable results when dealing with noisy observation, for instance due to sequencing issues, or collected with significant sampling limitations [ , , ] , as witnessed for most countries during the epidemics [ , ] . second, most methods do not consider the key information on intra-host minor variants (also referred to as minority variants or isnvs), which can be retrieved from whole-genome deep sequencing raw data and might be essential to improve the characterization of the infection dynamics and to pinpoint positively selected variants [ , , ] . due to the high replication, mutation and recombination rates of rna viruses, subpopulations of mutant viruses, also known as viral quasispecies [ ] , typically emerge and coexist within single hosts, and are supposed to underlie most of the adaptive potential to the immune system response and to anti-viral therapies [ , , ] . in this regard, many recent studies highlighted the noteworthy amount of intra-host genomic diversity in sars-cov- samples [ , , , , , , , , ] , similarly to what already observed in many distinct infectious diseases [ , , , , , , ] . here, we introduce verso (viral evolution reconstruction), a new comprehensive framework for the inference of high-resolution models of viral evolution from raw sequencing data of viral genomes (see fig. ). verso includes two subsequent algorithmic steps. 
step # : robust phylogenomic inference from clonal variant profiles. verso first employs a probabilistic noise-tolerant framework to process binarized clonal variant profiles (or, alternatively, consensus sequences), to return a robust phylogenetic model even in conditions of sampling limitations and sequencing issues. by adapting algorithmic strategies widely employed in cancer evolution analysis [ , , , ] , verso is able to correct false positive and false negative variants, can manage missing observations due to low coverage, and is designed to group samples with identical (corrected) clonal genotype in polytomies, avoiding ungrounded arbitrary orderings. as a result, the accurate and robust phylogenomic models produced by verso may be used to improve the parameter estimation of epidemiological models, which typically rely on limited and inhomogeneous data [ , ] . notice that this step can be executed independently from step # , for instance in case raw sequencing data are not available. homoplasy detection (clonal variants). the first step of verso allows us to identify clonal mutations that violate the accumulation hypothesis and might be involved in homoplasies, possibly due to positive selection (in a scenario of convergent/parallel evolution [ ] ), founder effects [ ] or mutational hotspots [ ] . such information might be useful to drive the design of appropriate treatments and vaccines, for instance by blacklisting positively selected genomic regions. step # : characterization of intra-host genomic diversity. in the second step, verso exploits the information on variant frequency (vf) profiles obtained from raw-sequencing data (if available), to characterize and visualize the intra-host genomic similarity of hosts with identical (corrected) clonal genotype.
in fact, even though the extent and modes of transmission of quasispecies from a host to another during infections are still elusive [ , ] , patterns of co-occurrence of minor variants detected in hosts with identical clonal genotype may provide an indication on the presence of uncovered infection paths [ , ] . for this reason, the second step of verso is designed to characterize and visualize the genomic similarity of samples by exploiting dimensionality reduction and clustering strategies typically employed in single-cell analyses [ ] . alternative approaches for the analysis of quasispecies, yet with different goals and algorithmic assumptions have been proposed, for instance in [ , , , ] and recently reviewed in [ ] . as specified above, verso step # is executed on groups of samples with identical clonal genotype: the rationale is that the transmission of minor variants implicates the concurrent transfer of clonal variants, excluding the rare cases in which the vf of a clonal variant significantly decreases in a given host, for instance due to mutation losses (e.g., via recombination-associated deletions or via multiple mutations hitting an already mutated genome location [ ] ) or to complex horizontal evolution phenomena (e.g., super-infections [ , ] ). conversely, the transmission of clonal variants does not necessarily implicate the transfer of all minor variants, which are affected by complex recombination and transmission effects, such as bottlenecks [ , ] . as a final result, verso allows to visualize the genomic similarity of samples on a low-dimensional space (e.g., umap [ ] or tsne [ ] ) representing the intra-host genomic diversity, and to characterize high-resolution infection chains, thus overcoming the limitations of methods relying on consensus sequences. homoplasy detection (minor variants). 
importantly, minor variants observed in hosts with distinct clonal genotypes (identified via verso step # ) may indicate homoplasies, due to mutational hotspots, phantom mutations or to positive selection [ ] . verso pinpoints such variants for further investigations and allows to exclude them from the computation of the vf-based genomic similarity prior to verso step # , to reduce the possible confounding effects. to assess the accuracy and robustness of the results produced by verso, we performed an extensive array of simulations, and compared with two state-of-the-art methods for phylogenetic reconstruction, i.e., iq-tree [ ] and beast [ ] . as a major result, verso outperforms competing methods in all settings and also in condition of high noise and sampling limitations. furthermore, we applied verso to two large-scale datasets, generated via amplicon and rna-seq illumina sequencing protocols, including and samples, respectively. the robust phylogenomic models delivered via verso step # allow us to refine the current knowledge on sars-cov- evolution and spread. besides, thanks to the in-depth analysis of the mutational landscape of both clonal and minor variants, we could identify a number of variants undergoing transition to clonality, as well as several homoplasies, including variants likely undergoing positive selection processes. remarkably, the infection chains identified via verso step # , by assessing the intra-host genomic similarity of samples with the same clonal genotype, were validated by employing contact tracing data from [ ] . this important result, which could not be achieved by analyzing consensus sequences, proves the effectiveness of employing raw sequencing data to improve the characterization of the transmission dynamics, in particular during the early phase of the outbreak, in which a relatively low diversity of sars-cov- has been observed at the consensus level. 
verso is released as free open source tool at this link: https://github.com/bimib-disco/verso. in order to assess the performance of verso and compare it with competing approaches, we executed extensive tests on simulated datasets, generated with the coalescent model simulator msprime [ ] . simulations allow one to compute a number of metrics with respect to the ground-truth, which in this case is the phylogeny of samples resulting from a backwards-in-time coalescent simulation [ ] . accordingly, this allows one to evaluate the accuracy and robustness of the results produced by competing methods in a variety of in-silico scenarios. in detail, we selected simulation scenarios with n = samples in which a number of clonal variants (with distinguishable profiles) between and was observed. we then inflated the datasets with false positives with rate α and false negatives with rate β, in order to mimic sequencing and coverage issues. moreover, additional datasets were generated via random subsampling of the original datasets, to model possible sampling limitations and sampling biases. as a result, we investigated simulations settings: (a) low noise, no subsampling, (b) high noise, no subsampling, (c) low noise, subsampling, and (d) high noise, subsampling (see methods and the supplementary material for further details, the complete parameter settings of the simulations are provided in table of the supplementary material). verso step # was compared with two state-of-the-art phylogenetic methods from consensus sequences: iq-tree [ ] , the algorithmic strategy included in the nextstrain-augur pipeline [ ] , and beast [ ] . consensus sequences to be provided as input to such methods were generated from simulation data by employing the reference genome sars-cov- -anc (see below). example, three hosts infected by the same viral lineage are sequenced. 
all hosts share the same clonal mutation (t>c, green), but two of them (# and # ) are characterized by a distinct minor mutation (a>t, red), which randomly emerged in host # and was transferred to host # during the infection. standard sequencing experiments return an identical consensus sequence for all samples, by employing a threshold on variant frequency (vf) and by selecting mutations characterizing the dominant lineage. (b) verso takes as input the variant frequency profiles of samples, generated from raw sequencing data. in step # , verso processes the binarized profiles of clonal variants and solves a boolean matrix factorization problem by maximizing a likelihood function via markov chain monte carlo, in order to correct false positives/negatives and missing data. as output, it returns both the corrected mutational profiles of samples and the phylogenetic tree, in which samples with identical corrected genotypes are grouped in polytomies. corrected genotypes are then employed to identify homoplasies of minor variants, which are further investigated to pinpoint positively selected mutations. the variant frequency profile of minor variants (excluding homoplasies) is processed by step # of verso, which computes a refined genomic distance among hosts (via bray-curtis dissimilarity on the knn graph, after pca) and performs clustering and dimensionality reduction, in order to project and visualize samples on a d space, representing the intra-host genomic diversity and the distance among hosts. this allows one to identify uncovered transmission paths among samples with identical clonal genotype. the performance of methods was assessed by comparing the reconstructed phylogeny with the simulated ground-truth, in terms of: (i) absolute error evolutionary distance, (ii) branch score difference [ ] and (iii) quadratic path difference [ ] (please refer to the supplementary material for a detailed description of all metrics).
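the likelihood maximized in step # is not spelled out in this section. as a minimal sketch, assuming the error model commonly used by cancer-phylogeny tools (independent entries, false positive rate α, false negative rate β, missing entries ignored), the score of a candidate set of corrected genotypes could be computed as follows. the function name and the toy matrices are hypothetical and do not come from the verso code base.

```python
import math

def log_likelihood(observed, genotypes, alpha, beta):
    """Log-likelihood of observed binary calls given hypothesized true genotypes.

    observed[i][j]: 1 (variant called), 0 (not called) or None (missing, e.g. low coverage).
    genotypes[i][j]: hypothesized true state, 0 or 1.
    alpha: false positive rate; beta: false negative rate (assumed error model).
    """
    total = 0.0
    for obs_row, true_row in zip(observed, genotypes):
        for obs, true in zip(obs_row, true_row):
            if obs is None:
                continue  # missing entries carry no information
            if true == 1:
                p = 1.0 - beta if obs == 1 else beta
            else:
                p = alpha if obs == 1 else 1.0 - alpha
            total += math.log(p)
    return total

# Toy data: two samples, two clonal variants, one missing entry.
observed = [[1, 0], [None, 1]]
consistent = [[1, 0], [1, 1]]   # agrees with every observed entry
conflicting = [[0, 0], [1, 1]]  # implies one false positive call

ll_consistent = log_likelihood(observed, consistent, alpha=0.01, beta=0.1)
ll_conflicting = log_likelihood(observed, conflicting, alpha=0.01, beta=0.1)
```

a sampler would propose changes to the tree and to the sample attachments, derive the implied genotypes, and accept or reject each proposal according to this score; that search is omitted here.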
in figure one can find the performance distribution of all methods with respect to all simulations settings. notably, verso step # outperforms competing methods in all scenarios (mann-whitney u test p < . in all cases), with noteworthy percentage improvements, also in conditions of high noise and sampling limitations. this important result shows that the probabilistic framework that underlies verso step # can produce more robust and reliable results when processing noisy data, as typically observed in real-world scenarios. different reference genomes have been employed in the analysis of sars-cov- origin and evolution. two genome sequences from human samples, in particular, were used in early phylogenomic studies, namely sequence epi_isl_ (ref. # in the following), used, e.g., in [ ] and sequence epi_isl_ (ref. # ) used, e.g., in [ ] . excluding the polya tails, the two sequences are identical for on genome positions ( in order to define a likely common ancestor for both sequences, we analyzed the bat-cov-ratg genome (sequence epi_isl_ ) [ ] and the pangolin-cov genome (sequence epi_isl_ ) [ , ] , which were identified as closely related genomes to sars-cov- [ ] . in particular, it was hypothesized that sars-cov- might be a recombinant of an ancestor of pangolin-cov and bat-cov-ratg [ , ] , whereas more recent findings would suggest that sars-cov- lineage is the consequence of a direct or indirect zoonotic jump from bats [ ] . whatever the case, both bat-cov-ratg and pangolin-cov display haplotype tctct at locations , , , and and, therefore, one can hypothesize that such haplotype was present in the unknown common ancestor of ref. # and # . for this reason, we generated an artificial reference genome, named sars-cov- -anc, which is identical to both ref. notice that verso pipeline is flexible and can employ any reference genome. 
we retrieved raw illumina amplicon sequencing data of sars-cov- samples of dataset # and applied verso to the mutational profiles of samples selected after quality check (mutational profiles were generated by executing variant calling via standard practices; see methods for further details). notice that the analysis of this dataset was performed independently from that of dataset # in order to exclude possible sequencing-related artifacts or idiosyncrasies. we first applied verso step # to the mutational profile of the variants detected as clonal (vf > %) in at least % of the samples, in order to reconstruct a robust phylogenomic tree. the verso phylogenetic model is displayed in fig. a and highlights the presence of clonal genotypes, obtained by removing noise from the data, which define polytomies including different numbers of samples (see methods for further details). in more detail, variant g. t>c (n, synonymous) is the earliest evolutionary event from reference genome sars-cov- -anc and is detected in samples of the dataset. the related clonal genotype g , which is characterized by no further mutations, identifies a polytomy including australian, chinese, american and south-african samples. three clades originate from g : a first clade includes clonal genotypes g ( samples) and g ( ), while a second clade includes clonal genotype g ( ). clonal genotypes g -g are characterized by the absence of snvs g. t>c (orf ab, synonymous) and g. c>t (orf , p. s>l) and correspond to previously identified type a [ ] (also type s [ ] ), which was hypothesized to be an early sars-cov- type. the third clade originating from clonal genotype g includes all remaining clonal genotypes (g -g ) and is characterized by the presence of both snvs g. t>c and g. c>t. this specific haplotype corresponds to type b [ ] (also type l [ ] ) and an increase of its prevalence has been progressively recorded in the population, as one can see in fig.
, as opposed to type a (s), which was rarely observed in late samples. in this regard, we note that there are currently insufficient elements to support any epidemiological claim on virulence and pathogenicity of such sars-cov- types, even if recent evidence would suggest the existence of a low correlation [ ] . (figure caption: methods are compared on (i) absolute error evolutionary distance, (ii) branch score difference [ ] and (iii) quadratic path difference [ ] with respect to the ground-truth sample phylogeny provided by msprime; see the supplementary material.) homoplasy detection (clonal variants). clonal variants included in our model show apparent violations of the accumulation hypothesis, namely: g. g>t (orf ab, p. l>f), g. c>t (orf ab, synonymous), g. c>t (orf ab, synonymous), g. c>t (orf , p. s>l), and g. c>t (n,p. p>l), suggesting that they might be involved in homoplasies. some of these variants have been exhaustively studied (e.g., g. g>t in [ ] ), specifically to verify possible scenarios of convergent evolution, which may unveil the fingerprint of adaptation of sars-cov- to human hosts. to this end, particular attention should be devoted to the three nonsynonymous substitutions, i.e., g. g>t (present in samples, ≈ % of the dataset), g. c>t ( samples, ≈ %) and g. c>t ( samples, ≈ %). as a first result, we note that the prevalence dynamics of the haplotypes defined by such variants does not show any apparent growth trend in the population (see supplementary figure ). to further investigate whether such variants fall in a mutation-prone region of the sars-cov- genome, we evaluated the mutational density employing a sliding window approach similar to [ ] (see supplementary material for additional details). as shown in supplementary figure , the mutational density, computed by considering synonymous minor variants, exhibits a median value of = . [syn. mutations][nucleotides] − . interestingly, the three nonsynonymous snvs (g. g>t, g. c>t and g.
c>t) are located within windows with a higher mutational density than the median value: . , . and . [syn. mutations][nucleotides] − , respectively (see table of the supplementary material), and this would suggest that they might have originally emerged due to the presence of natural mutational hotspots or phantom mutations. however, this analysis is not conclusive and further investigations are needed to characterize the functional effect of such mutations and their possible impact on the evolution and diffusion of sars-cov- . stability analysis. the choice of an appropriate vf threshold to identify clonal variants and, accordingly, to generate consensus sequences from raw sequencing data might affect the stability of the results of any downstream phylogenomic analysis. on the one hand, loose thresholds might increase the risk of including non-clonal variants in consensus sequences. on the other hand, too strict thresholds might increase the rate of false negatives, especially with noisy sequencing data. for this reason, we assessed the robustness of verso step # on dataset # by comparing the results obtained when different thresholds in the set δ ∈ { . , . , . , . } are employed to identify clonal variants with those obtained with the default threshold (δ = . ), in terms of tree accuracy (see the supplementary material for further details). as one can see in supplementary figure , the tree accuracy varies between . and . in all settings, proving that the results produced by verso step # are robust with regard to the choice of the vf threshold for clonal variant identification. we then applied verso step # to the complete vf profiles of the samples with the same clonal genotype and projected their intra-host genomic diversity on the umap low-dimensional space.
this was done excluding: (i) the clonal variants employed in the phylogenetic inference via verso step # , (ii) all minor variants (vf ≤ %) observed in more than one clonal genotype (i.e., homoplasies), which likely emerged independently within the hosts, due to mutational hotspots, phantom mutations or positive selection (see methods and the next subsections). even though, as expected, the vf profiles of minor variants are noisy, a complex intra-host genomic architecture is observed in several individuals. moreover, patterns of co-occurrence of minor variants across samples support the hypothesis of transmission from one host to another. in fig. as a first result, all samples belonging to a specific contact group are characterized by the same clonal genotype, determined via verso step # , a result that confirms recent findings [ , ] . more importantly, the analysis of the intra-host genomic diversity via verso step # allows one to substantially refine this analysis. in fig. one can find the umap plot of clonal genotypes g , g and g , which include (on ), (on ) and (on ) samples with contact information. strikingly, the pairwise intra-host genomic distance among samples from the same institution/household (computed on the k-nearest neighbour graph via bray-curtis dissimilarity, after pca; see methods) is significantly lower than the distance among all samples with the same clonal genotype (p-value of the mann-whitney u test < . in all cases). furthermore, all samples belonging to the same contact group are connected in the knn graph, while a noteworthy proportion of samples without contact information in genotypes g and g are placed in disconnected graphs ( . % and . %, respectively). this major result suggests that patterns of co-occurrence of minor variants can indeed provide useful indications on contact tracing dynamics, which would be masked when employing consensus sequencing data.
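the distance and connectivity analysis described above can be illustrated with a simplified sketch. it computes bray-curtis dissimilarity directly on the vf profiles of minor variants, builds a k-nearest neighbour graph, and extracts its connected components. this is only an approximation of the procedure in the text: the pca step is omitted, the value of k is an assumption, and the vf values are invented for illustration.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den > 0 else 0.0

def knn_graph(profiles, k):
    """Map each sample index to the indices of its k nearest neighbours."""
    graph = {}
    for i, p in enumerate(profiles):
        dists = sorted((bray_curtis(p, q), j)
                       for j, q in enumerate(profiles) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph

def connected_components(graph):
    """Connected components of the kNN graph, treating edges as undirected."""
    adj = {i: set(nbrs) for i, nbrs in graph.items()}
    for i, nbrs in graph.items():
        for j in nbrs:
            adj[j].add(i)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        comps.append(sorted(comp))
    return sorted(comps)

# Hypothetical VF profiles (rows: samples with the same clonal genotype,
# columns: minor variants). Samples 0-1 and 2-3 share minor variants.
vf_profiles = [
    [0.10, 0.20, 0.00],
    [0.12, 0.18, 0.00],
    [0.00, 0.00, 0.30],
    [0.00, 0.01, 0.28],
]
components = connected_components(knn_graph(vf_profiles, k=1))
```

in this toy case the samples sharing minor variants end up in the same component, which is the pattern the text interprets as a possible infection chain.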
accordingly, the algorithmic strategy employed by verso step # and, especially, the identification of the knn graph on intra-host genomic similarity, provides an effective tool to dissect the complexity of viral evolution and transmission, which might in turn improve the reliability of currently available contact tracing tools. homoplasy detection (minor variants). several minor variants are found in samples with distinct clonal genotypes and might indicate the presence of homoplasies. in this respect, the heatmap in fig. f returns the distribution of minor snvs with respect to: (i) the number of distinct clonal genotypes in which they are detected and (ii) the mutational density of the region in which they are located (see the supplementary material for details on the mutational density analysis). the intuition is that variants detected in single clonal genotypes (left region of the heatmap) are likely spontaneously emerged private mutations or the result of infection events between hosts with same clonal genotype (see above). conversely, snvs found in multiple clonal genotypes (right region of the heatmap) may have emerged due to positive selection in a parallel/convergent evolution scenario, or to mutational hotspots or phantom mutations. to this end, the mutational density analysis provides useful information to pinpoint mutation-prone regions of the genome. interestingly, a significant number of minor variants are observed in multiple clonal genotypes and fall in scarcely mutated regions of the genome (see supplementary figure ). this would suggest that some of these variants might have been positively selected, due to some possible functional advantage or to transmission-related founder effects. 
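the first axis of the heatmap, i.e., the number of distinct clonal genotypes in which each minor variant is detected, can be sketched as below. the input layout, the variant identifiers and the threshold of two genotypes are assumptions made for illustration; the mutational density axis is not covered.

```python
from collections import defaultdict

def genotypes_per_variant(samples):
    """Count the distinct clonal genotypes in which each minor variant appears.

    samples: iterable of (clonal_genotype_label, set_of_minor_variant_ids).
    """
    seen = defaultdict(set)
    for genotype, minor_variants in samples:
        for variant in minor_variants:
            seen[variant].add(genotype)
    return {variant: len(genotypes) for variant, genotypes in seen.items()}

def candidate_homoplasies(samples, min_genotypes=2):
    """Minor variants detected in at least min_genotypes clonal genotypes."""
    counts = genotypes_per_variant(samples)
    return sorted(v for v, n in counts.items() if n >= min_genotypes)

# Hypothetical samples: labels "GA"/"GB" and variant ids "v1".."v3" are placeholders.
samples = [
    ("GA", {"v1", "v2"}),
    ("GA", {"v1"}),
    ("GB", {"v2", "v3"}),
]
counts = genotypes_per_variant(samples)
candidates = candidate_homoplasies(samples)
```

variants confined to one genotype (here v1 and v3) correspond to the left region of the heatmap, while v2 would fall in the right region and be flagged for further inspection.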
in this respect, we further focused our investigation on a list of candidate minor variants, which: (i) are detected in more than one clonal genotype, (ii) are present in at least samples, (iii) are nonsynonymous and (iv) fall in a region of the genome with mutational density lower than the median value (see table of the supplementary material for details on such variants). in the following, we focus on a subset of such variants falling on the spike gene of the sars-cov- genome. considerations on homoplasies falling on the spike gene. the spike protein of sars-cov- plays a critical role in the recognition of the ace receptor and in the ensuing cell membrane fusion process [ ] . we prioritized candidate homoplastic minor variants occurring on the sars-cov- spike gene (s) (see table of the supplementary material). interestingly, out of , namely: g. t>c (p. i>t) and g. g>t (p. g>c), detected in samples in total ( and samples, respectively), clustered in the so-called connector region (cr), bridging between the two heptad repeat regions (hr and hr ) of the s subunit of the spike protein. when the receptor binding domain (rbd) binds to ace receptor on the target cell, it causes a conformational change responsible for the insertion of the fusion peptide (fp) into the target cell membrane. this, in turn, triggers further conformational changes eventually promoting a direct interaction between hr trimer and hr , which occurs upon bending of the flexible cr, in order to form a six-helical hr -hr complex known as the fusion core region (fcr) in close proximity to the target cell plasma membrane, ultimately leading to viral fusion and cell entry [ ] . peptides derived from the hr heptad region of enveloped viruses and able to efficiently bind to the viral hr region inhibit the formation of the fcr and completely suppress viral infection [ ] . therefore, the formation of the fcr is considered to be vital to mediate virus entry in the target cells, promoting viral infectivity. 
of note, the cr is highly conserved across the gammacoronavirus genus, supporting the notion that this region may play a very important but yet unclear functional role (fig g) . although structural and in vitro models will be required in order to extensively characterize the functional effect of these variants, the evidence that two out of three minor variants detected in the spike protein fall in a small domain comprising less than % of the entire spike protein length is intriguing, as it suggests a potential functional role for these mutations. it will be important to track the prevalence of these mutations, as well as of all other candidate convergent variants falling on different regions of the sars-cov- genome, to highlight possible transitions to clonality (see below). we analyzed in depth the mutational landscape of the samples of dataset # . first, the comparison of the number of clonal (vf > %) and minor variants detected in each host (fig. a ) reveals a bimodal distribution of clonal variants (with first mode at and second mode at ), whereas minor variants display a more dispersed long-tailed distribution with median equal to and average ≈ . from the plot, it is also clear that individuals characterized by the same clonal genotype may display a significantly different number of minor variants, with distinct distributions observed across clonal genotypes. importantly, the comparison of the distribution of the number of variants obtained by grouping the samples with respect to collection week ( fig. b-c) allows us to highlight a highly statistically significant increasing trend for clonal variants (mann-kendall trend test on median number of clonal variants p < . ). this result would strongly support both the hypothesis of accumulation of clonal variants in the population and that of a concurrent increase of overall genomic diversity of sars-cov- [ , ] , whereas the relevance of this phenomenon on minor variants is unclear.
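for readers unfamiliar with the mann-kendall trend test mentioned above, the following is a minimal self-contained sketch of the two-sided test with the usual normal approximation and tie correction. the weekly medians are invented for illustration and are not the values of the study; for small samples an exact test would be preferable to the approximation used here.

```python
import math
from collections import Counter

def mann_kendall(values):
    """Two-sided Mann-Kendall trend test (normal approximation, tie-corrected).

    Returns (S statistic, Z score, p-value).
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    ties = Counter(values).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in ties)) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)   # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return s, z, p_value

# Hypothetical median number of clonal variants per collection week.
weekly_medians = [2, 3, 3, 4, 5, 5, 6, 8]
s_stat, z_score, p_value = mann_kendall(weekly_medians)
```

a positive s with a small p-value indicates a monotonic increasing trend, which is how the result on clonal variants is read in the text.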
we then focused on the properties of the snvs detected in the population. surprisingly, the distribution of the median vf for each detected variant (fig. d ) is bimodal, with the large majority of variants showing either a very low or a very high vf, and only a small proportion of variants showing a median vf within the range − %. this behavior is typical of systems where the prevalence of some subpopulations is driven by positive darwinian selection while others are purified [ ] . in order to analyze the two components of this distribution, we further categorized the variants as always clonal (i.e., snvs detected with vf > % in all samples), always minor (i.e., snvs detected with vf % and ≤ % in all samples) and mixed (i.e., snvs detected as clonal in at least one sample and as minor in at least another sample). as one can see in fig. e, . %, . % and % of all snvs are respectively detected as always clonal, always minor and mixed in our dataset. moreover, %, . % and . % of always clonal, always minor and mixed variants, respectively, are nonsynonymous, whereas the large majority of remaining variants are synonymous. these results would suggest that, in most cases, randomly emerging sars-cov- minor variants tend to remain at a low frequency in the population, whereas, in some circumstances, certain variants can undergo frequency increases and even become clonal, due to uncovered mixed transmission events or to selection shifts, as was observed in [ ] for the cases of h n and h n / influenza. interestingly, variants identified as possibly convergent (see above) fall in this category and deserve further investigation (see table of the supplementary material for additional details) . transmission bottleneck analysis. the estimation of transmission bottlenecks might be of specific interest during the current pandemic.
although most available methods require data collected on donor-host couples (see, e.g., [ , ] ), here we employed a strategy akin to [ , ] , which is roughly based on the analysis of the variation of the vf variance of a number of candidate neutral mutations. the intuition is that variance shrinking indicates significant transmission bottlenecks which, accordingly, would result in lower viral diversity transferred from one host to another and, possibly, in purification of certain variants in the population. as the analysis ideally requires the comparison of groups in which infection events have occurred, here we considered groups of samples with distinct clonal genotypes, separately. we then selected a number of variants as neutral markers. the rationale is that transmission phenomena such as bottlenecks are expected to significantly affect the vf variance of neutral markers (please see supplementary material for further details). in more detail, we first split the samples of each clonal genotype, for which a collection date is available, in nonoverlapping groups corresponding to two subsequent time windows, i.e., before and after the th week, . accordingly, snvs were selected as candidate neutral or quasi-neutral markers, namely variants g. t>c, g. a>g and g. g>a. in supplementary figure , one can find the distribution of the variant frequency of the selected markers with respect to the time windows, which highlights moderate variations of the variance for all markers (see also table of the supplementary material). all in all, this result would suggest the presence of mild bottleneck effects, consistent with recent studies involving donor-host data [ ] . we retrieved the raw illumina rna-sequencing data of samples included in dataset # and applied verso to the mutational profiles of samples selected after quality check. clonal variants were employed in the analysis, according to the filters described in the methods section.
remarkably, the output phylogenetic model is consistent with the one obtained for dataset # , despite minor differences (supplementary figure a) . specifically, distinct clonal genotypes are identified by verso step # , of which are identical to those found in the analysis of dataset # (in such cases the same genotype label was maintained). further clonal genotypes are evolutionarily consistent and represent independent branches detected due to the nonoverlapping composition of the dataset, and are labeled with progressive letters from the closest genotype (i.e., g b, g b, g b, g c, g b), while the samples of genotype g b* might be safely assigned to genotype g b, since the absence of mutation g. c>t is likely due to low coverage. if one excludes the remaining clonal genotype gh, which presents inconsistencies due to the presence of the candidate homoplastic variant g. g>t (orf ab, p. l>f, see above), all clonal genotypes display the same ordering in both datasets. this proves the robustness of the results delivered by verso step # even when dealing with data generated from distinct sequencing platforms. by looking at the geo-temporal localization of samples obtained via microreact [ ] (supplementary figure b) , one can see that dataset # includes samples with a significantly different geographical distribution with respect to dataset # . this dataset contains samples from countries, with the large majority collected in the usa ( . %). in more detail, the samples from this country are mostly characterized by clonal genotype g . we further notice that, also for dataset # , mutation g. a>g (s,p. d>g) becomes prevalent in the population at late collection dates. moreover, only samples belonging to previously defined type b are detected in this dataset.
the analysis of the intra-host genomic diversity was also performed for dataset # via verso step # , which would suggest the existence of uncovered infection events and of several infection clusters with distinct properties, even though no contact tracing data are available in this case. overall, this proves the general applicability of the verso framework, which can produce meaningful results when applied to data produced with any sequencing platform. however, in order to minimize the possible impact of data- and platform-specific biases, we suggest performing the verso analysis separately on datasets generated with different protocols.

we finally assessed the computational time required by verso in a variety of simulated scenarios. the results are shown in the supplementary material (supplementary figure ) and demonstrate the scalability of verso also when processing large-scale datasets.

we introduced verso, a comprehensive framework for the high-resolution characterization of viral evolution from sequencing data, which improves over currently available methods for the analysis of consensus sequences. verso exploits the distinct properties of clonal and minor variants to dissect the complex interplay of genomic evolution within hosts and transmission among hosts. on the one hand, the probabilistic framework underlying verso step # delivers highly accurate and robust phylogenetic models from clonal variants, even under conditions of noisy observations and sampling limitations, as proven by extensive simulations and by the application to two large-scale sars-cov- datasets generated from distinct sequencing platforms. on the other hand, the characterization of intra-host genomic diversity provided by verso step # allows one to identify uncovered infection paths, which were in our case validated with contact tracing data, as well as to intercept variants involved in homoplasies.
this may represent a major advancement in the analysis of viral evolution and spread and should be quickly implemented in combination with data-driven epidemiological models, to deliver a high-precision platform for pathogen detection and surveillance [ , ] . this might be particularly relevant for countries which suffered outbreaks of exceptional proportions and for which the limitations and inhomogeneity of diagnostic tests have proved insufficient to define reliable descriptive/predictive models of disease diffusion. for instance, it was hypothesized that the rapid diffusion of covid- might be due to the extremely high number of untested asymptomatic hosts [ ] . more accurate and robust phylogenetic models may improve the assessment of molecular clocks and, accordingly, the estimation of the parameters of epidemiological models such as sir and sis [ , ] , and may help to unravel cryptic transmission paths [ , , , ] . furthermore, the finer grain of the analysis on intra-host genomic similarity from sequencing data might be employed to enhance active surveillance, for instance by facilitating the identification of infection clusters and super-spreaders [ ] . finally, the characterization of variants possibly involved in positive selection processes might be used to drive the experimental research on treatments and vaccines.

verso is a novel framework for the reconstruction of viral evolution models from raw sequencing data of viral genomes. it includes a two-step procedure, which we describe in the following. the first step of verso employs a probabilistic maximum-likelihood framework for the reconstruction of robust phylogenetic trees from binarized mutational profiles of clonal variants (or, alternatively, from consensus sequences).
this step relies on an evolved version of the algorithmic framework introduced in [ ] for the inference of cancer evolution models from single-cell sequencing data, and can be executed independently from step # , in case raw sequencing data are not available. in detail, the method takes as input: a n (samples) × m (variants) binary mutational profile matrix, as defined on clonal snvs only. in this case, an entry in a given sample is equal to (present) if the vf is larger than a certain threshold (in our analyses, equal to %), it is equal to if lower than a distinct threshold (in our analyses, equal to %), and is considered as missing (na) in the other cases, thus modeling possible uncertainty in sequencing data or low coverage. notice that consensus sequences can be processed by verso step # by generating a consistent binarized mutational profile matrix. here, we recall that the variant accumulation hypothesis holds only when considering clonal mutations, which are most likely transmitted from a host to another during the infection, whereas this might not be the case with variants with lower frequency, due to the high recombination rates, as well as to bottlenecks, founder effects and stochasticity (see below). we also note that given the intrinsic challenges associated with a reliable identification of low vf indels, the analysis focuses only on single nucleotide variants. further details on the variant calling pipeline employed in this study are provided in the next subsections. the algorithmic framework verso step # is a probabilistic framework which solves a boolean matrix factorization problem with perfect phylogeny constraints, i.e., by assuming the infinite sites assumption, which subsumes a consistent process of accumulation of clonal variants in the population and does not allow for losses of mutations or convergent variants (i.e., mutations observed in distinct clades). 
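The perfect phylogeny constraint mentioned above can be illustrated on noise-free data. The following is a minimal sketch, assuming a binary samples × variants matrix with an all-zero ancestral genotype: two variants are compatible with a rooted perfect phylogeny only if the samples do not display all three patterns (1,0), (0,1) and (1,1) for that pair (the three-gamete condition). The function name and the toy matrix are hypothetical, and this check is not the inference procedure of VERSO, which is likelihood-based and tolerates errors and missing entries; with missing entries the check is only partial.

```python
from itertools import combinations


def conflicting_pairs(matrix):
    """Return pairs of variant columns violating the three-gamete condition.

    matrix: list of rows (samples), each a list of 0/1 entries, with None
    standing for a missing value (NA). The ancestral state is assumed to be 0.
    """
    n_variants = len(matrix[0]) if matrix else 0
    conflicts = []
    for a, b in combinations(range(n_variants), 2):
        seen = set()
        for row in matrix:
            # Samples with a missing entry for either variant are skipped.
            if row[a] is None or row[b] is None:
                continue
            seen.add((row[a], row[b]))
        # All three patterns together cannot be explained by accumulation alone.
        if {(1, 0), (0, 1), (1, 1)} <= seen:
            conflicts.append((a, b))
    return conflicts


# Toy profiles (hypothetical): variant 1 is nested in variant 0,
# whereas variants 1 and 2 appear both together and separately.
profiles = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
]
print(conflicting_pairs(profiles))  # [(1, 2)]
```

A conflicting pair signals a possible homoplasy, a mutation loss or an observation error, which is why a probabilistic treatment of the data is needed in practice.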
further details on verso assumptions are provided in the supplementary material and in [ , ] . our approach accounts for uncertainty in the data, by employing a maximum likelihood approach (via mcmc search) that allows for the presence of false positives, false negatives and missing data points. as shown in [ ] in a different experimental context, our algorithmic framework ensures robustness and scalability also in case of high rates of errors and missing data, due for instance to sampling limitations, and is robust to mild violations of the infinite sites assumption, e.g., to convergent variants or mutation losses (see the supplementary material for further details on the algorithmic framework, including the probabilistic graphical model depicted in supplementary figure and the summary of notation in table of the supplementary material). the inference returns a set of maximum likelihood variants trees (minimum ) as sampled during the mcmc search, representing the ordering of accumulation of clonal variants, and a set of maximum likelihood attachments of samples to variants. given the variants tree and the maximum likelihood attachments of samples to variants, verso outputs: (i) a phylogenetic model where each leaf corresponds to a sample, whereas internal nodes correspond to accumulating clonal variants, (ii) the corrected clonal genotype of each sample, i.e., the binary mutational profile on clonal variants obtained after removing false positives, false negatives and missing data. the model naturally includes polytomies, which group samples with the same corrected clonal genotype. the length of the branches in the model represents the number of clonal substitutions (which can be normalized with respect to genome length), as in standard phylogenomic models. the verso phylogenetic model is provided as output in newick file format and can be processed and visualized in standard tools for phylogenetic analysis, such as figtree [ ] or dendroscope [ ] .
furthermore, verso allows one to visualize the geo-temporal localization of clonal genotypes via microreact [ ] . violations of the perfect phylogeny constraints (i.e., of the consistent accumulation of clonal variants in the population) [ ] are possible and can be due to homoplasies, i.e., identical variants detected in samples belonging to different clades, or to rare occurrences involving mutation losses (e.g., due to recombination-related deletions or to multiple mutations hitting an already mutated genome location [ ] ), as well as to infrequent transmission phenomena, such as super-infections [ , ] . in this regard, verso allows one to identify mutations likely involved in homoplasies in a similar fashion to the plethora of works on mitochondrial evolution (see, for example [ , , ] ). in detail, given the maximum likelihood phylogenetic tree, verso can estimate the variants that are theoretically expected in each sample. by comparing the theoretical observations with the input data, verso can estimate the rate of false positives (i.e., the variants that are observed in the data but are not predicted by verso), and false negatives (i.e., variants that are not observed, but predicted). variants that show a particularly high level of estimated error rates represent candidate homoplasies and are flagged. once this procedure has been completed, the list of flagged variants can include: (i) mutations falling in highly-mutated regions due to mutational hotspots, (ii) phantom mutations, i.e., systematic artifacts generated during sequencing processes [ ] , or (iii) mutations that have been positively selected in the population, e.g., due to a particular functional advantage.
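The comparison between expected and observed variants described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the VERSO implementation: the function name, the per-variant normalization of the two rates (false positives over samples where the variant is not expected, false negatives over samples where it is expected) and the cut-off `max_error_rate` are all hypothetical choices made for the example.

```python
def flag_candidate_homoplasies(observed, expected, max_error_rate):
    """Flag variants whose estimated error rate exceeds max_error_rate.

    observed: samples x variants matrix of 0/1 entries (None = missing).
    expected: samples x variants matrix of 0/1 entries predicted by the tree.
    Returns a list of (variant_index, false_positive_rate, false_negative_rate).
    """
    n_variants = len(expected[0]) if expected else 0
    flagged = []
    for v in range(n_variants):
        fp = fn = n_absent = n_present = 0
        for obs_row, exp_row in zip(observed, expected):
            if obs_row[v] is None:
                continue  # missing observations carry no evidence
            if exp_row[v] == 1:
                n_present += 1
                fn += obs_row[v] == 0  # predicted but not observed
            else:
                n_absent += 1
                fp += obs_row[v] == 1  # observed but not predicted
        fp_rate = fp / n_absent if n_absent else 0.0
        fn_rate = fn / n_present if n_present else 0.0
        if max(fp_rate, fn_rate) > max_error_rate:
            flagged.append((v, fp_rate, fn_rate))
    return flagged


# Toy example (hypothetical): variant 1 is never predicted by the tree,
# yet it is observed in two of the four samples.
expected = [[1, 0], [1, 0], [0, 0], [0, 0]]
observed = [[1, 0], [1, 1], [0, 1], [0, 0]]
print(flag_candidate_homoplasies(observed, expected, max_error_rate=0.25))
# [(1, 0.5, 0.0)]
```

A flagged variant is only a candidate: as noted above, it may reflect a hotspot, a phantom mutation or positive selection, and further analysis is needed to tell these apart.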
since one might be interested in identifying positively selected mutations, verso allows one to perform a subsequent analysis, which aims at highlighting the mutation-prone regions of the genome, which might be due to mutational hotspots or phantom mutations (see the supplementary material for further details). we finally note that the detection of homoplasies for minor variants requires a different algorithmic procedure, which is detailed in the following.

in the second step, verso takes into account the variant frequency profiles of groups of samples with the same clonal genotype (identified via verso step # ), in order to characterize their intra-host genomic diversity and visualize it on a low-dimensional space. this allows one to highlight patterns of co-occurrence of minor variants, possibly underlying uncovered infection events, as well as homoplasies involving, e.g., positively selected variants. notice that this step requires raw sequencing data and the prior execution of step # . verso step # takes as input a n (samples) × m (variants) variant frequency (vf) profile matrix, in which each entry includes the vf ∈ ( , ) of a given mutation in a certain sample, after filtering out: (i) the clonal variants employed in step # and (ii) the minor variants possibly involved in homoplasies (see below). the variant calling pipeline employed in this work is detailed in the next subsections. while it is sound to binarize clonal variant profiles to reconstruct a phylogenetic tree, it is more appropriate to consider the variant frequency profiles when analyzing intra-host variants, for several reasons. first, variant frequency profiles describe the intra-host genomic diversity of any given host, and this information would be lost during binarization. second, minor variant profiles might be noisy, due to the relatively low abundance and to the technical limitations of sequencing experiments.
accordingly, such data may possibly include artifacts, which can be partially mitigated during the quality-check phase and by including in the analysis only highly-confident variants. however, binarization with arbitrary thresholds might increase the false positive rate, compromising the accuracy of any downstream analysis. third, as specified above, the extent of transmission of minor variants among individuals is still partially obscure. the vf of minor variants is, in fact, highly affected by recombination processes, as well as by complex transmission phenomena, involving stochastic fluctuations, bottlenecks and founder effects, which may lead certain variants to change their vf, not be transmitted or even become clonal in the infected host [ ] . the latter issue also suggests that the hypothesis of accumulation of minor variants during infections may not hold and should be relaxed. for these reasons, verso step # defines a pairwise genomic distance, computed on the variant frequency profiles, to be used in downstream analyses. the intuition is that samples displaying similar patterns of co-occurrence of minor variants might have a similar quasispecies architecture, thus being at a small evolutionary distance. accordingly, this might indicate a direct or indirect infection event. in particular, in this work we employed the bray-curtis dissimilarity, which is defined as follows: given the ordered vf vectors of two samples, i.e. $v_i = \{vf_{i,1}, \ldots, vf_{i,r}\}$ and $v_j = \{vf_{j,1}, \ldots, vf_{j,r}\}$, the pairwise bray-curtis dissimilarity $d(i, j)$ is given by:

$$d(i, j) = \frac{\sum_{k=1}^{r} \left| vf_{i,k} - vf_{j,k} \right|}{\sum_{k=1}^{r} \left( vf_{i,k} + vf_{j,k} \right)}$$

since this measure weights the pairwise vf dissimilarity on each variant with respect to the sum of the vf of all variants detected in both samples, it can be effectively used to compare the intra-host genomic diversity of samples, as proposed for instance in [ ] . however, verso allows one to employ different distance metrics on vf profiles, such as correlation or euclidean distance.
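As a worked illustration of the Bray-Curtis dissimilarity on VF profiles, the sketch below computes the sum of the absolute VF differences over all variants, divided by the sum of the VFs of all variants in both samples. The function names and the toy VF vectors are hypothetical; returning 0.0 when both profiles are entirely zero is a convention chosen here for the example (the ratio is undefined in that case), not something stated in the text.

```python
def bray_curtis(vf_i, vf_j):
    """Bray-Curtis dissimilarity between two ordered VF vectors."""
    if len(vf_i) != len(vf_j):
        raise ValueError("VF vectors must list the same variants in the same order")
    numerator = sum(abs(a - b) for a, b in zip(vf_i, vf_j))
    denominator = sum(a + b for a, b in zip(vf_i, vf_j))
    # Convention assumed for this sketch: two empty profiles are at distance 0.
    return numerator / denominator if denominator > 0 else 0.0


def pairwise_dissimilarity(vf_matrix):
    """Symmetric samples x samples matrix of Bray-Curtis dissimilarities."""
    n = len(vf_matrix)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = bray_curtis(vf_matrix[i], vf_matrix[j])
    return dist


# Toy VF profiles over three minor variants (hypothetical values).
sample_a = [0.10, 0.00, 0.30]
sample_b = [0.10, 0.20, 0.00]
sample_c = [0.10, 0.00, 0.30]

print(round(bray_curtis(sample_a, sample_b), 3))  # 0.714
print(bray_curtis(sample_a, sample_c))            # 0.0
```

In the first comparison the numerator is 0 + 0.2 + 0.3 = 0.5 and the denominator is 0.2 + 0.2 + 0.3 = 0.7, so samples sharing only one of three variants end up far apart, whereas identical profiles are at distance zero.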
as a design choice, in verso the genomic distance is computed among all samples associated with any given clonal genotype, as inferred in step # . the rationale is that, in a statistical inference framework modeling a complex interplay involving heterogeneous dynamical processes, it is crucial to stratify samples into homogeneous groups, to reduce the impact of possible confounding effects [ ] . furthermore, as specified above, due to the distinct properties of clonal and minor variants during transmission, it is reasonable to assume that the event in which certain minor variants and no clonal variants are transmitted from a host to another during the infection is extremely unlikely. accordingly, the clonal variants employed for the reconstruction of the phylogenetic tree in step # are excluded from the computation of the intra-host distance among samples. in order to produce useful knowledge from the genomic distance discussed above and since, in real-world scenarios, this is a typically complex high-dimensional problem, it is sound to employ state-of-the-art strategies for dimensionality reduction and (sample) clustering, as typically done in single-cell analyses [ ] . in this regard, the workflow employed in verso ensures high scalability with large datasets, while also taking advantage of effective analysis and visualization features. in detail, the workflow includes three steps: (i) the computation of the k-nearest neighbour graph (k-nng), which can be executed on the original variant frequency matrix, or after applying principal component analysis (pca), to possibly reduce the effect of noisy observations (when the number of samples and variants is sufficiently high); (ii) the clustering of samples via either louvain or leiden algorithms for community detection [ ] ; (iii) the projection of samples on a low-dimensional space via standard tsne [ ] or umap [ ] plots.
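The first step of this workflow, the k-nearest neighbour graph, can be sketched on a precomputed dissimilarity matrix. The snippet below is a simplified, dependency-free illustration and not the routine used by VERSO, which relies on dedicated single-cell analysis tooling; the function name, the symmetrization rule (an edge is kept if either sample lists the other among its k nearest neighbours) and the toy matrix are assumptions made for the example.

```python
def knn_graph(dist, k):
    """Undirected k-nearest neighbour graph from a square distance matrix.

    Returns a sorted list of edges (i, j) with i < j. An edge is present if
    j is among the k nearest neighbours of i, or vice versa.
    """
    n = len(dist)
    edges = set()
    for i in range(n):
        # Rank the other samples by distance; ties are broken by index.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: (dist[i][j], j))
        for j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy dissimilarity matrix (hypothetical): samples 0-1 and 2-3 have
# similar VF profiles, all other pairs are far apart.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(knn_graph(dist, k=1))  # [(0, 1), (2, 3)]
```

Community detection (e.g., Louvain or Leiden) and the low-dimensional projection would then operate on a graph of this kind; here the two edges already separate the samples into two groups with similar patterns of minor variants.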
as output, verso step # delivers both the partitioning of samples in homogeneous clusters and the visualization in a low-dimensional space, and allows samples to be labeled according to other covariates, such as, e.g., collection date or geographical location. in the map in fig. , for instance, the intra-host genomic diversity of each sample and the genomic distance among samples are projected on the first two umap components, whereas samples that are connected by k-nng edges display similar patterns of co-occurrence of variants. accordingly, the map shows clusters of samples likely affected by infection events, in which (a fraction of) quasispecies might have been transmitted from a host to another. this represents a major novelty introduced by verso and also allows one to effectively visualize the space of variant frequency profiles. to facilitate the usage, verso step # is provided as a python script which employs the scanpy suite of tools [ ] , which is typically used in single-cell analyses and includes a number of highly-effective analysis and visualization features.

additional feature: homoplasy detection on minor variants

also in the case of minor variants, it is important to pinpoint possible homoplasies, which might be due to mutational hotspots, phantom mutations and convergent variants. given the phylogenetic model retrieved via step # , verso allows one to flag the variants that are detected in a number of clonal genotypes exceeding a user-defined threshold. in our case, the threshold is equal to , meaning that all minor variants found in more than one clonal genotype are flagged. such variants are then excluded from the computation of the intra-host genomic distance, prior to the execution of step # . furthermore, the list of flagged variants can be investigated as proposed for step # (see above), in order to possibly identify mutations involved in positive selection scenarios.
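The flagging rule for minor variants can be sketched in a few lines. This is an illustrative sketch, not the VERSO code: the function name, the input layout (one record per detected minor variant, carrying the sample, its clonal genotype and the variant) and the identifiers in the toy data are hypothetical. The default of one genotype mirrors the rule stated above, i.e., a variant is flagged when it is found in more than one clonal genotype.

```python
from collections import defaultdict


def flag_shared_minor_variants(detections, max_genotypes=1):
    """Return minor variants detected in more than max_genotypes clonal genotypes.

    detections: iterable of (sample_id, clonal_genotype, variant_id) tuples,
    one per minor variant detected in a sample.
    """
    genotypes_per_variant = defaultdict(set)
    for _sample, genotype, variant in detections:
        genotypes_per_variant[variant].add(genotype)
    return sorted(variant
                  for variant, genotypes in genotypes_per_variant.items()
                  if len(genotypes) > max_genotypes)


# Toy detections (hypothetical identifiers).
detections = [
    ("s1", "G1", "varA"),
    ("s2", "G1", "varA"),   # same genotype again: not a homoplasy signal
    ("s3", "G2", "varB"),
    ("s4", "G3", "varB"),   # varB appears in two distinct clonal genotypes
    ("s5", "G2", "varC"),
]
print(flag_shared_minor_variants(detections))  # ['varB']
```

The flagged variants would then be removed from the VF profiles before the intra-host distance is computed, and kept aside for inspection as possible hotspots, artifacts or positively selected mutations.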
dataset # (illumina amplicon sequencing) we analyzed samples from distinct individuals obtained from ncbi bioprojects, which, at the time of writing, are all the publicly available datasets including raw illumina amplicon sequencing data. in detail, we selected the following projects: contact tracing data were obtained from the study presented in [ ] . in detail, for samples included in dataset # (ncbi bioproject prjna ), information on households, work institutions and epidemiological linkages are provided. thus, it is possible to identify different contact groups based on institutions regularly frequented by patients and household couples. contact information was employed to assess the relation between the intra-host genomic similarity and the contact dynamics. the results are provided in the main text. [ ] . we remark that one should be extremely careful when considering low-frequency variants, which might possibly result from sequencing artifacts, even in case of high-coverage experiments. in this regard, we note that many approaches can be employed to reduce false variants. for instance, the broad institute recently updated an effective variant calling pipeline for viral genome data [ ] , while new methods for error correction of viral sequencing have been proposed at this widely used website: https://virological.org, which also includes a number of useful up-to-date guidelines and best practices for viral evolution analyses. in our case, we here employed the following significance filters on variants. in particular, we kept only the mutations: ( ) showing a varscan significance p-value < . (fisher's exact test on the read counts supporting reference and variant alleles) and more than reads of support in at least % of the samples, ( ) displaying a variant frequency vf > %. as a result, we selected a list of (on overall snvs) highly-confident snvs for dataset # and (on ) for dataset # . 
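The significance filters listed above can be sketched as a two-level check. This is one possible reading of the filters and not the pipeline used in the study: here a call is considered confident if it passes the p-value, read-support and VF thresholds, and a variant is retained if it has confident calls in at least a minimum fraction of all samples. All names are hypothetical, and the threshold values in the example are placeholders rather than the values employed in the paper.

```python
def is_confident_call(call, max_pvalue, min_reads, min_vf):
    """Check one variant call (a dict with 'pvalue', 'reads' and 'vf' keys)."""
    return (call["pvalue"] < max_pvalue
            and call["reads"] > min_reads
            and call["vf"] > min_vf)


def select_confident_variants(calls_by_variant, n_samples,
                              max_pvalue, min_reads, min_vf, min_fraction):
    """Keep variants with confident calls in at least min_fraction of samples."""
    kept = []
    for variant, calls in calls_by_variant.items():
        n_ok = sum(is_confident_call(c, max_pvalue, min_reads, min_vf)
                   for c in calls)
        if n_ok / n_samples >= min_fraction:
            kept.append(variant)
    return sorted(kept)


# Toy calls over 4 samples (hypothetical identifiers and placeholder values).
calls_by_variant = {
    "varA": [{"pvalue": 1e-6, "reads": 120, "vf": 0.98},
             {"pvalue": 1e-5, "reads": 80, "vf": 0.95},
             {"pvalue": 1e-4, "reads": 40, "vf": 0.30}],
    "varB": [{"pvalue": 1e-6, "reads": 60, "vf": 0.12}],
    "varC": [{"pvalue": 1e-3, "reads": 3, "vf": 0.08},
             {"pvalue": 0.2, "reads": 50, "vf": 0.07}],
}
selected = select_confident_variants(calls_by_variant, n_samples=4,
                                     max_pvalue=0.01, min_reads=5,
                                     min_vf=0.05, min_fraction=0.5)
print(selected)  # ['varA']
```

In this toy case varB is supported by a single confident call and varC fails either the read-support or the p-value check, so only varA survives.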
high-quality variants were then mapped on sars-cov- coding sequences (cdss) via a custom r script, also by highlighting synonymous/nonsynonymous states and amino acid substitutions for the related open reading frame (orf) product. in particular, we translated reference and mutated cdss with the seqinr r package to obtain the relative amino acid sequences, which we compared to assess the effect of each nucleotide variation in terms of amino acid substitution. we finally note that availability of the ct values generated by q-pcr and the related quantification of the amount of viral transcripts would be very useful to characterize samples with high viral load, yet this information is not available for the considered datasets. in order to select high-quality samples, we selected only those exhibiting high coverage and in particular those with at least reads in more than % of the sars-cov- -anc genome. in addition, we filtered out all samples exhibiting more than minor variants (vf ≤ %). we finally excluded samples srr and srr from dataset # , as the first sample displays zero snvs and the second one reports an unfeasible collection date (i.e. th jan. ). after the quality-check filters, samples of dataset # are left for downstream analyses, in which distinct high-quality single-nucleotide variants are observed, and samples are left for dataset # , with high-quality snvs. the phylogenomic analysis via verso step # was performed on datasets # and # by considering only clonal variants (vf > %) detected in at least % of the samples. a grid search comprising different error rates was employed (see table of the supplementary material). samples with the same corrected clonal genotype were grouped in polytomies in the final phylogenetic models. 
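The selection of clonal variants for step #1 combines the binarization rule given in the description of the input (present above one VF threshold, absent below a second threshold, missing otherwise) with the requirement that a clonal variant be detected in a minimum fraction of samples. The following is a minimal sketch of that combination; the function names are hypothetical and the threshold values in the example are placeholders, not those used in the study.

```python
def binarize(vf, clonal_threshold, absent_threshold):
    """Map a VF to 1 (clonal), 0 (absent) or None (missing/uncertain)."""
    if vf is None:
        return None
    if vf > clonal_threshold:
        return 1
    if vf < absent_threshold:
        return 0
    return None  # intermediate VF: treated as missing (NA)


def clonal_profile(vf_matrix, clonal_threshold, absent_threshold, min_prevalence):
    """Binarize a samples x variants VF matrix and keep sufficiently prevalent variants.

    Returns (binary_matrix, kept_variant_indices).
    """
    binary = [[binarize(vf, clonal_threshold, absent_threshold) for vf in row]
              for row in vf_matrix]
    n_samples = len(binary)
    n_variants = len(binary[0]) if binary else 0
    kept = [v for v in range(n_variants)
            if sum(row[v] == 1 for row in binary) / n_samples >= min_prevalence]
    return [[row[v] for v in kept] for row in binary], kept


# Toy VF matrix: 4 samples x 3 variants (placeholder values).
vf_matrix = [
    [0.98, 0.02, 0.50],
    [0.95, 0.00, 0.01],
    [0.99, 0.97, 0.00],
    [0.50, 0.01, 0.03],
]
binary, kept = clonal_profile(vf_matrix, clonal_threshold=0.90,
                              absent_threshold=0.05, min_prevalence=0.5)
print(kept)    # [0]
print(binary)  # [[1], [1], [1], [None]]
```

Only the first variant is clonal in at least half of the samples; in the last sample its intermediate VF is recorded as missing, which is the kind of uncertainty the likelihood-based inference is designed to absorb.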
the analysis of the intra-host genomic diversity via verso step # was performed by considering the vf profiles of all samples, by excluding: (i) the clonal variants employed in the phylogenomic reconstruction via verso step # , (ii) the minor variants involved in homoplasies, i.e., observed in more than one clonal genotype returned by verso step # . missing values (na) were imputed to for downstream analysis. a number of pcs equal to was employed in the pca step, prior to the computation of the k-nearest neighbour graph (k = ) on the bray-curtis dissimilarity of vf profiles. the leiden algorithm was applied with resolution = (see table of the supplementary material for the parameter settings of verso employed in the case studies).

in order to compare the performance of verso step # with competing phylogenomic tools, i.e., iq-tree [ ] and beast [ ] , we performed extensive simulations via msprime [ ] , which simulates a backwards-in-time coalescent model. in particular, we simulated distinct evolutionary processes, with the following parameters: n = total samples, effective population size n e = . (i.e., haploid population), mutational rate m = × − mutations per site per generation and a genome of length l = bases. such parameters were chosen to roughly approximate the mutational rate currently estimated for sars-cov- (i.e., m ≈ − mutations per site per year and ≈ − generation year [ ] ) and to obtain a number of clonal mutations (in the range − ) that is comparable to the one observed in the real-world scenarios (see the case studies). as output, msprime returns a phylogenetic tree representing the genealogy between the samples, the genotype of all samples (i.e., the leaves of the tree) and the location of all mutations.
the genotypes of the samples were then inflated with different levels of noise, with false positive rate α and false negative rate β (see the parameter settings in table of the supplementary material), in order to assess the performance of the methods in conditions of noisy observations and possible sequencing issues. finally, we subsampled all datasets to obtain two distinct sample sizes ( and samples), in order to test the robustness of methods in conditions of sampling limitations. the parameters of the phylogenetic methods employed in the comparative assessment are reported in the supplementary material (table of the supplementary material).

verso is freely available at this link: https://github.com/bimib-disco/verso. verso step # is provided as an open source standalone r tool, whereas step # is provided as a python script. the source code to replicate all the analyses presented in the manuscript, both on simulated and real-world datasets, is available at this link: https://github.com/bimib-disco/verso-utilities. scanpy [ ] is available at this link: https://scanpy.readthedocs.io/en/stable/. the web-based tool for the geo-temporal visualization of samples, microreact [ ] , is available at this link: https://microreact.org/showcase. the tool employed to plot the phylogenomic model returned by verso step # (in newick file format) is figtree [ ] and is available at this link: http://tree.bio.ed.ac.uk/software/figtree/.

supervised the computational analysis. a.g. and r.p. drafted the manuscript, which all authors discussed, reviewed and approved.
a pneumonia outbreak associated with a new coronavirus of probable bat origin a new coronavirus associated with human respiratory disease in china the proximal origin of sars-cov- isolation of sars-cov- -related coronavirus from malayan pangolins genomic surveillance reveals multiple introductions of sars-cov- into northern california we shouldn't worry when a virus mutates during disease outbreaks unifying the epidemiological and evolutionary dynamics of pathogens quantifying influenza virus diversity and transmission in humans global initiative on sharing all influenza data-from vision to reality iq-tree: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies viral phylodynamics establishment and cryptic transmission of zika virus in brazil and the americas the emergence of sars-cov- in europe and north america median-joining networks for inferring intraspecific phylogenies evolutionary inferences from phylogenies: a review of methods mrbayes . : efficient bayesian phylogenetic inference and model choice across a large model space raxml version : a tool for phylogenetic analysis and post-analysis of large phylogenies genomic infectious disease epidemiology in partially sampled and ongoing outbreaks quentin: reconstruction of disease transmissions from viral quasispecies genomic data bayesian reconstruction of transmission within outbreaks using genomic variants hiv-trace (transmission cluster engine): a tool for large scale molecular epidemiology of hiv- and other rapidly evolving pathogens beast . 
: an advanced software platform for bayesian evolutionary analysis early phylogenetic estimate of the effective reproduction number of sars-cov- phylogenetic network analysis of sars-cov- genomes analysis of the hosts and transmission paths of sars-cov- in the covid- outbreak a metric on the space of reduced phylogenetic networks bitphylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies phylogenetic interpretation during outbreaks requires caution regaining perspective on sars-cov- molecular tracing and its implications the quasispecies (extremely heterogeneous) nature of viral rna genome populations: biological relevance-a review mutational and fitness landscapes of an rna virus revealed through population sequencing rapid viral quasispecies evolution: implications for vaccine and drug strategies why do rna viruses recombine? mutational signatures and heterogeneous host response revealed via large-scale characterization of sars-cov- genomic diversity genomic diversity of sars-cov- in coronavirus disease patients virological assessment of hospitalized patients with covid- molecular characterization of sars-cov- from the first case of covid- in italy intra-host site-specific polymorphisms of sars-cov- is consistent across multiple samples and methodologies. 
medrxiv genomic epidemiology of sars-cov- in guangdong province, china shared sars-cov- diversity suggests localised transmission of minority variants tracking the covid- pandemic in australia using genomics mutational dynamics and transmission properties of sars-cov- superspreading events in austria clonal interference and the evolution of rna viruses sars-associated coronavirus quasispecies in individual patients beyond the consensus: dissecting within-host viral population diversity of foot-and-mouth disease virus by using next-generation genome sequencing analysis of intrapatient heterogeneity uncovers the microevolution of middle east respiratory syndrome coronavirus intra-host dynamics of ebola virus during capri: efficient inference of cancer progression models from cross-sectional data cancer evolution: mathematical models and computational inference algorithmic methods to infer the evolutionary trajectories in cancer progression the evolution of tumour phylogenetics: principles and practice exceptional convergent evolution in a virus the fingerprint of phantom mutations in mitochondrial dna data circulating virus load determines the size of bottlenecks in viral populations progressing within a host reconstructing foot-and-mouth disease outbreaks: a methods comparison of transmission network models scanpy: large-scale single-cell gene expression data analysis qure: software for viral quasispecies reconstruction from next-generation sequencing data full-length haplotype reconstruction to infer the structure of heterogeneous virus populations viral quasispecies assembly via maximal clique enumeration qsdpr: viral quasispecies reconstruction via correlation clustering epidemiological data analysis of viral quasispecies in the next-generation sequencing era co-infection and super-infection models in evolutionary epidemiology incidence of co-infections and superinfections in hospitalized patients with covid- : a retrospective cohort study uniform manifold 
this work was partially supported by the elixir italian chapter and the sysbionet project, a ministero dell'istruzione, dell'università e della ricerca initiative for the italian roadmap of european strategy forum on research infrastructures and by the airc-ig grant . partial support was also provided by the cruk/airc accelerator award # , "single-cell cancer evolution in the clinic". we thank giulio caravagna and chiara damiani for helpful discussions. we also thank david posada for interesting suggestions on the preliminary version of the manuscript.
key: cord- -zhgjmt j authors: tang, min; xie, qi; gimple, ryan c.; zhong, zheng; tam, trevor; tian, jing; kidwell, reilly l.; wu, qiulian; prager, briana c.; qiu, zhixin; yu, aaron; zhu, zhe; mesci, pinar; jing, hui; schimelman, jacob; wang, pengrui; lee, derrick; lorenzini, michael h.; dixit, deobrat; zhao, linjie; bhargava, shruti; miller, tyler e.; wan, xueyi; tang, jing; sun, bingjie; cravatt, benjamin f.; muotri, alysson r.; chen, shaochen; rich, jeremy n. title: three-dimensional bioprinted glioblastoma microenvironments model cellular dependencies and immune interactions date: - - journal: cell res doi: . /s - - - sha: doc_id: cord_uid: zhgjmt j brain tumors are dynamic complex ecosystems with multiple cell types. to model the brain tumor microenvironment in a reproducible and scalable system, we developed a rapid three-dimensional ( d) bioprinting method to construct clinically relevant biomimetic tissue models. in recurrent glioblastoma, macrophages/microglia prominently contribute to the tumor mass. to parse the function of macrophages in d, we compared the growth of glioblastoma stem cells (gscs) alone or with astrocytes and neural precursor cells in a hyaluronic acid-rich hydrogel, with or without macrophage.
bioprinted constructs integrating macrophage recapitulate patient-derived transcriptional profiles predictive of patient survival, maintenance of stemness, invasion, and drug resistance. whole-genome crispr screening with bioprinted complex systems identified unique molecular dependencies in gscs, relative to sphere culture. multicellular bioprinted models serve as a scalable and physiologic platform to interrogate drug sensitivity, cellular crosstalk, invasion, context-specific functional dependencies, as well as immunologic interactions in a species-matched neural environment. brain tumors are complex tissues with multicomponent interactions between multiple cell types. precision medicine efforts based solely on genomic alterations and molecular circuitries driving neoplastic cells have translated into relatively limited benefit in clinical practice for brain cancers, including glioblastoma, the most prevalent and lethal primary intrinsic brain tumor. crosstalk between neoplastic cells and the surrounding stroma contributes to tumor initiation, progression, and metastasis. however, most cancer research studies investigate cancer cells in isolation, cultured in non-physiologic adherent conditions containing species-mismatched serum. massive efforts have interrogated functional dependencies of cancer cell lines. [ ] [ ] [ ] [ ] while these studies provide valuable insights into cancer cell dependencies, they lack the capacity to investigate interactions of cancer cells with stromal cells or the microenvironment in an appropriate physiological context. patient-derived xenografts (pdxs) and genetically engineered mouse models are informative and can better recapitulate the genomic and transcriptomic profiles of patient brain tumors than two-dimensional ( d) culture. however, challenges with engraftment, the low throughput nature of animal experiments, and the lack of normal human cellular interactions, limit their broad applications in clinical settings. 
in tumors with significant immune cell involvement, such as glioblastoma, pdxs are limited because immunocompromised animals preclude investigation of immune cells in cancer biology. methods to construct self-organizing three-dimensional (3d) coculture systems, termed organoids, have been developed to interrogate physiological and pathophysiological processes. in cancer research, organoid systems serve as models of colorectal cancer, breast cancer, hepatocellular and cholangiocarcinomas, pancreatic cancers, and glioblastomas, among others. in glioblastoma, we first described organoid systems that recapitulate tumor architecture, microenvironmental gradients, and tumor cellular heterogeneity. additional glioblastoma models utilize human-embryonic stem cell (hesc)-derived cerebral organoids to investigate interactions between glioblastoma stem cells (gscs) and normal brain components including infiltration, microenvironmental stimuli, and response to therapies. however, organoid modeling is labor intensive, relatively low throughput, and highly variable in terms of cellular composition and structure due to the process of self-assembly. further development of tissue engineering approaches informs new 3d culture systems with improved scalability and capacity to tune specific biological parameters, including cellular composition and extracellular matrix stiffness. the development of physiologically relevant brain tumor microenvironments requires careful consideration of the biophysical and biochemical properties of the matrix and cellular composition of specific tumor types, which can be achieved with recent advances in 3d bioprinting and biomaterials designed specifically for the bioprinting process.
biocompatible scaffolds for tumor microenvironments include the naturally occurring extracellular matrix products chitosan-alginate (ca) and hyaluronic acid (ha)-based hydrogels, as well as synthetic polymers, including poly lactide-co-glycolide (plga) and polyethylene-glycol (peg), and polyacrylamide hydrogels. 3d printing with biocompatible materials is emerging to advance the fields of regenerative medicine and tissue modeling, with notable relevance and applicability to cancer research. 3d bioprinting models microenvironmental interactions and drug sensitivities, reciprocal interactions with macrophages, and patient-specific screening tools in microfluidics-based systems. among many 3d printing technologies, digital light processing (dlp)-based 3d bioprinting provides superior scalability and printing speed in addition to versatility and reproducibility. several biomimetic tissue models have been developed using this technology, creating tissue-specific architecture and cellular composition that could be used for functional analyses, metastasis studies, and drug screening. here, we employ a rapid 3d bioprinting system and photocrosslinkable native ecm derivatives to create a biomimetic 3d cancer microenvironment for the highly lethal brain tumor, glioblastoma. the model comprises patient-derived gscs, macrophages, astrocytes, and neural stem cells (nscs) in a ha-rich hydrogel. one major microenvironmental feature of glioblastoma is the prominent infiltration of tumor masses by macrophages and microglia. in progressive or recurrent glioblastoma, macrophages and microglia account for a substantial fraction of the tumor bulk. using genetic depletion, co-implantation, and pharmacologic depletion, macrophages/microglia have been shown to be functionally important for glioblastoma growth, but each of these approaches may have broader effects beyond direct tumor cell-macrophage interactions.
using our rapid 3d bioprinting platform, we can interrogate functional dependencies and multicellular interactions in a physiologically relevant manner.
dlp-based rapid 3d bioprinting generates glioblastoma tissue models
brain tumors are composed of numerous distinct populations of malignant and supporting stromal cells, and these complex cellular interactions are essential for tumor survival, growth, and progression. glioblastomas display high levels of intratumoral heterogeneity, with contributions from astrocytes, neurons, npcs, macrophage/microglia, and vascular components. to move beyond serum-free sphere culture-based models, we utilized a dlp-based rapid 3d bioprinting system to generate 3d tri-culture or tetra-culture glioblastoma tissue models, with a background "normal brain" made up of npcs and astrocytes and a tumor mass generated by gscs, with or without macrophage, using brain-specific extracellular matrix (ecm) materials (fig. a). leveraging this system with exquisite control of cellular constituents in specific locations, we selected macrophage for additional study, as we hypothesized that dlp-based 3d bioprinting could enable precise spatial arrangement of cells and matrix, and selection of any cell type. the key components of the bioprinting system were a digital micromirror device (dmd) chip and a motorized stage where prepolymer cell-material mixtures were sequentially loaded. the dmd chip with approximately × micromirrors controlled the light projection of the brain-shaped patterns onto the printing materials (fig. b). the elliptical pattern corresponded to the core region and the coronal slice pattern corresponded to the peripheral region. each pattern was printed with s of light exposure. in the 3d tri-culture model, a central tumor core composed of gscs was surrounded by a less dense population of astrocytes and npcs.
in the 3d tetra-culture model, we mixed m macrophages with gscs within the central core to mimic the immune cell infiltrated tumor mass (fig. c). the ecm composition of the glioblastoma microenvironment was modeled with gelatin methacrylate (gelma) and glycidyl methacrylate-ha (gmha) hydrogels. cells were encapsulated into a material mixture of % gelma (at % degree of methacrylation) and . % gmha (at % degree of methacrylation), which generated a hydrogel matrix that resembled glioblastoma tissue (supplementary information, fig. s a, b). gelma has good biocompatibility and serves as a stiffness modulator that provides desirable mechanical properties with little interference in biochemical cues. ha is the most abundant ecm component in healthy brain tissue and promotes glioblastoma progression, including regulating glioblastoma invasion through the receptor for hyaluronan-mediated motility (rhamm) and cd , as well as other mechanical and topographical cues. we used a physiologically relevant concentration of ha ( . %) determined from clinical analysis of a diverse population of biopsy specimens from patients with different brain tumors. while ha of a range of molecular weights is present in the brain, low molecular weight ha promotes gsc stemness and resistance. thus, in this study, low molecular weight ha ( kda) was used to synthesize gmha to model the pro-invasive brain tumor microenvironment. the mechanical properties of the model were characterized by the compressive modulus and pore sizes. the stiffness of the acellular hydrogel remained stable over a week of incubation at °c (data not shown). the stiffness of the cell-encapsulated tumor core was . ± . kpa, while that of the less populated peripheral region containing npcs and astrocytes was . ± . kpa. the peripheral region stiffness was designed to match that of healthy brain tissue, reported to be ~ kpa. glioblastoma displays enhanced migration and proliferation in stiffer materials.
the stiffness of the tumor core was modulated with the light exposure time during printing to have a higher modulus than the healthy region. the hydrogel had a porosity of % and an average pore size of μm. with these microscale features, small molecules, such as drug molecules, freely diffuse through the matrix. cells closely interacted with other cells and the matrix (fig. d). at a macro scale, the model had a thickness of mm, and . mm by . mm in width and length, which allowed gradients of oxygen and nutrient diffusion to form within the tissue. cells were precisely printed into two prearranged regions to provide more physiologically relevant features: a non-neoplastic peripheral region composed of npcs and astrocytes surrounding a tumor core composed of either gscs alone or gscs with macrophage (fig. e). following optimization for cell density (supplementary information, fig. s a, b), the tumor core in the 3d tri-culture consisted of . × gscs/ml, while the tetra-culture tumor core contained . × gscs/ml and . × macrophages/ml.
3d bioprinted models recapitulate glioblastoma transcriptional profiles
traditionally grown cell lines have been extensively characterized in glioblastoma, revealing that these conditions fail to replicate patient tumors in cellular phenotypes (e.g., invasion) or transcriptional profiles. while patient-derived glioblastoma cells grown under serum-free conditions enrich for stem-like tumor cells (gscs) that form spheres and more closely replicate transcriptional profiles and invasive potential than standard culture conditions, we previously demonstrated that spheres display differential transcriptional profiles and cellular dependencies in an rna interference screen compared to in vivo xenografts. based on this background, we interrogated the transcriptional profiles from a large cohort of patient-derived gscs grown in serum-free, sphere cell culture that we recently reported.
gscs grown as spheres were transcriptionally distinct from primary glioblastoma surgical resection tissue specimens, when compared through either principal component analysis (pca) or uniform manifold approximation and projection (umap) (fig. a, b). to determine whether the 3d bioprinted culture systems more closely resemble primary glioblastoma tumors, we performed global transcriptional profiling through rna extraction followed by next-generation sequencing (rna-seq) on gscs isolated from the bioprinted models and on gscs in sphere culture (fig. c). upregulation of a core set of glioblastoma tissue-specific genes defined a "glioblastoma tissue" gene signature (fig. d). when compared to gscs grown in sphere culture, the tetra-culture bioprinted model displayed upregulation of the glioblastoma tissue-specific gene set (fig. e), suggesting that the bioprinted model recapitulates transcriptional states present in patient-derived glioblastoma tissues.
fig. 3d bioprinting enables generation of glioblastoma tri-culture and tetra-culture tissue environment model. a schematic diagram of in vitro 3d glioblastoma model containing gscs, macrophages, astrocytes, and neural stem cells (nscs). b schematic diagram of digital micromirror device (dmd) chip-based 3d bioprinting system used to produce the 3d glioblastoma model. c diagram of tri-culture (left) and tetra-culture (right) model system. d (left) scanning electron microscope (sem) images of acellular glycidyl methacrylate-hyaluronic acid and gelatin methacrylate extracellular matrix. (center and right) sem images of the cells encapsulated in the extracellular matrix. scale bars, μm (left), μm (center), and μm (right). e brightfield and immunofluorescence images of the tri-culture and tetra-culture 3d glioblastoma models. gscs are labeled with green fluorescent protein (gfp) while macrophages are labeled with mcherry. nuclei are stained with dapi. scale bars, mm.
gscs in d tetra-culture displayed upregulation of genes specifically expressed in orthotopic intracranial xenografts (fig. f, g) and, to a lesser extent, genes specifically expressed in subcutaneous flank xenografts (supplementary information, fig. s c ) compared to sphere culture. additionally, signatures that distinguish gscs from their differentiated counterparts were upregulated in the tetraculture system compared to sphere culture (fig. h, i) , suggesting that the physiologic tissue environment promotes stem-like transcriptional states. we further interrogated the gene expression profiles that distinguish gscs grown in sphere culture from the d tetraculture bioprinted models (fig. a) . while cells grown in sphere culture displayed enrichment for gene sets involved in ion transport, protein localization, and vesicle membrane function, cells in the tetra-culture d model displayed transcriptional upregulation of cell adhesion, extracellular matrix, cell and structure morphogenesis, angiogenesis, and hypoxia signatures ( fig. s b ). hypoxia response genes, ca , ndrg , angptl , and egln family members, were upregulated in the tetra-culture system, while various ion transporters, including slc a and slc a , were downregulated (fig. d , e). by qpcr, gscs isolated from either d system days after printing displayed elevated levels of the stemness marker olig and decreased levels of the differentiation markers map and tuj compared to their sphere counterparts grown in parallel (fig. f) . additionally, gsc levels of map and tuj were decreased to a greater degree in tetraculture (i.e., with macrophage) compared to tri-culture. we further evaluated the protein expression of stemness, hypoxia, and proliferative markers in the tetra-culture system compared to sphere culture. the hypoxia marker ca was upregulated in the tetra-culture model compared to sphere culture (fig. g) . 
the heightened hypoxia level more closely resembled pathologic in vivo conditions, in which the tumor core had a higher hypoxia expression compared to the peripheral region of neurons and astrocytes. in the 3d culture model, cells also showed increased levels of the proliferative marker ki and increased protein expression of the stemness markers olig and sox (fig. h-j).
macrophages promote hypoxic and invasive signatures in bioprinted models
to understand the relative contributions of each cell type incorporated into bioprinted models, we performed rna-seq on gscs derived from tri-cultures and tetra-cultures. given that thp derived macrophages display expression profiles distinct from those of primary macrophages, we built tetra-cultures containing thp derived macrophages, human induced pluripotent stem cell (hipsc)-derived macrophages generated from an established protocol, and primary human volunteer-derived macrophages. both hipsc-derived macrophages and primary macrophages integrated into the tetra-culture models. umap clustering revealed that the transcriptional outputs of sphere cultured gscs are distinct from those of gscs in bioprinted models (fig. a, b). concordantly, we detected differentially expressed genes between sphere cultured cells and any of the bioprinted models ( - differentially expressed genes), while there were fewer genes that distinguished the bioprinted models ( - differentially expressed genes) (fig. c). bioprinted models were characterized by activation of invasion, extracellular matrix, cell surface interaction, and hypoxia signatures, while gscs in sphere culture expressed cell cycle, dna replication, rna processing, and mitochondrial translation signatures (supplementary information). we next investigated differentially expressed pathways between bioprinted models to interrogate the contributions of cellular components.
tri-culture-derived gscs upregulated extracellular matrix and biological adhesion pathways compared to gscs in sphere culture (supplementary information, fig. s a-e). addition of macrophages further increased activation of hypoxia and glycolytic metabolism signatures, with enrichment for invasiveness signatures (fig. d-h). tetra-cultures constructed with hipsc-derived macrophages expressed higher levels of extracellular matrix, wound healing, and platelet activation signatures and decreased levels of neuron and glial development and differentiation pathways compared to tetra-cultures containing thp -derived macrophages (supplementary information, fig. s a, b). incorporation of primary human macrophages did not affect levels of ki or sox compared to use of thp -derived cells (supplementary information, fig. s c, d). consistent with our previous findings, use of hipsc-derived macrophages reduced gsc expression of map and tuj differentiation markers and increased expression of ca and ndrg hypoxia markers (supplementary information, fig. s e). taken together, gscs upregulate extracellular matrix interaction signatures in response to growth in a bioprinted model. the addition of macrophages further accentuates these gene activation signatures and increases activation of hypoxia and pro-invasive transcriptional profiles.
3d bioprinted tissues model complex cellular interactions and migration
interactions between malignant cells and stromal components shape tumor tissue, with each cell type impacting the other tissue components. to understand these changes, we investigated how macrophages responded to the 3d brain tumor microenvironment by isolating thp -derived macrophages from 3d bioprinted constructs and performing rna-seq (fig. a, b). for the 3d printed tissue, macrophages were mixed with gscs at a : ratio to form the tumor core, while the periphery was formed by astrocytes and npcs using the same composition described previously.
fig. 3d tetra-culture models better recapitulate transcriptional signatures found in glioblastoma tissues than standard sphere culture. a pca of the global transcriptional landscape of glioma stem cells in culture (gscs in culture, n = ) vs primary glioblastoma surgical resection tissues (gbm tissue, n = ) as defined by rna-seq. the top differential genes were used for the analysis. data was derived from mack et al. b umap of the global transcriptional landscape of glioma stem cells in culture (gscs in culture, n = ) vs primary glioblastoma surgical resection tissues (gbm tissue, n = ) as defined by rna-seq. analysis parameters include: sample size of local neighborhood, number of neighbors = ; learning rate = . ; initialization of low dimensional embedding = random; metrics for computation of distance in high dimensional space = manhattan. data was derived from mack et al. c schematic diagram of experimental approach for gsc rna-seq experiments. d volcano plot of transcriptional landscape profiled by rna-seq comparing gscs in sphere culture (n = ) vs glioblastoma primary surgical resection tissues (n = ). the x-axis depicts the log transformed fold change, while the y-axis shows the log transformed p value adjusted for multiple test correction. e gene set enrichment analysis (gsea) of the glioblastoma tissue vs cell culture signature as defined in d when applied to rna-seq data comparing the 3d tetra-culture system with sphere cell culture. f volcano plot of transcriptional landscape profiled by rna-seq comparing gscs in sphere culture (n = biological samples with technical replicates each) vs matched orthotopic intracranial xenograft specimens (n = biological samples with technical replicates each). the x-axis depicts the log transformed fold change, while the y-axis shows the log transformed p value adjusted for multiple test correction. data was derived from miller et al. g gsea of the glioblastoma tissue vs cell culture signature as defined in f when applied to rna-seq data comparing the 3d tetra-culture system with sphere cell culture. h volcano plot of transcriptional landscape profiled by rna-seq comparing gscs in sphere culture (n = biological samples with technical replicates each) vs differentiated glioma cells (dgcs) in sphere culture (n = biological samples with technical replicates each). the x-axis depicts the log transformed fold change, while the y-axis shows the log transformed p value adjusted for multiple test correction. data was derived from suva et al. i gsea of the glioblastoma tissue vs cell culture signature as defined in h when applied to rna-seq data comparing the 3d tetra-culture system with sphere cell culture.
the transcriptional output of macrophages grown in traditional culture displayed enrichment for prc complex targets, amino acid biosynthesis, protein metabolism signatures and ribosomal pathways, while macrophages exposed to gscs in the bioprinted construct showed elevation of pathways involved in leukocyte activation and innate immune response, cytokine signaling and inflammatory responses, and tlr-stimulated signatures (fig. c; supplementary information, fig. s a-d). defense response genes, including ch , pla g , and alox , were upregulated in macrophages derived from the tetra-culture system, while genes involved in amino acid restriction, including il , cd , and vldlr, were downregulated (fig. d, e). m2 macrophage-related markers were upregulated in the 3d tetra-cultures, with cd increased by -fold and il- increased by -fold compared to traditional suspension culture, as measured by qpcr. m1-related markers, including tnf-α and nos , did not increase, demonstrating that the 3d printed microenvironment preferentially polarized macrophages towards the m2 phenotype (fig. f). this is consistent with the m2 polarization of macrophages in glioblastoma tumors.
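the qpcr fold changes reported above can be illustrated with a short calculation. the text does not state how relative expression was quantified, so the sketch below assumes the widely used 2^-ΔΔct approach with roughly 100% amplification efficiency; the function name, the use of a single reference gene, and every ct value are hypothetical and are not taken from the study.

```python
def fold_change_ddct(ct_target_test: float, ct_ref_test: float,
                     ct_target_ctrl: float, ct_ref_ctrl: float) -> float:
    """Relative expression by the 2^-ddCt method (assumed, not confirmed by the source).

    Each Ct is a threshold cycle; lower Ct means more starting template.
    """
    # Normalize the marker gene to the reference gene within each condition.
    delta_test = ct_target_test - ct_ref_test
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    # Compare the test condition (e.g. tetra-culture) with the control
    # condition (e.g. suspension culture).
    delta_delta = delta_test - delta_ctrl
    # One cycle corresponds to a two-fold difference at 100% efficiency.
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values for one marker in macrophages from each condition.
example = fold_change_ddct(ct_target_test=22.0, ct_ref_test=18.0,
                           ct_target_ctrl=25.0, ct_ref_ctrl=18.0)
print(f"fold change vs control: {example:.1f}")
```

a value above 1 indicates higher expression in the test condition than in the control, and a value near 1 indicates no change, which is how the marker comparisons in the text would be read under this assumption.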
gene expression signatures defining peripherally-derived tumor-associated macrophages in glioma were selectively enriched in macrophages derived from tetra-culture models compared to those grown in 2d culture (supplementary information, fig. s ). collectively, macrophages grown in our 3d bioprinted tetra-culture model expressed gene expression signatures consistent with patient-derived tumor-associated macrophages. we interrogated the functional consequences of the addition of immune components to the 3d bioprinted model. in four patient-derived gscs spanning three major glioblastoma transcriptional subtypes (proneural, classical, and mesenchymal), the addition of thp -derived m macrophage increased gsc invasion into the surrounding brain-like parenchyma (fig. g-j). consistent with our gene expression analyses, m macrophage increased the area of invasion by % for cw , % for gsc , % for gsc , and % for gsc . collectively, these results support the tetra-culture model as an effective tool to study cancer cell invasion and the mechanisms by which cellular interactions impinge upon these processes. as numerous stromal compartments, including neural progenitor cells, astrocytes, and neurons, interact with glioblastoma cells within patient tumors, we interrogated the effects of the bioprinted model on neuronal and oligodendrocyte differentiation of the non-neoplastic npcs. in 2d culture, most npcs expressed the proliferative npc marker sox . the high expression and frequency of sox were retained in tri-cultures and tetra-cultures containing macrophages derived from thp cells or primary human macrophages (supplementary information, fig. s a). in 2d culture, npcs expressed the neuronal marker tubb , but retained a progenitor-like cellular morphology. in bioprinted models, npcs adopted a neuronal morphology with the appearance of elongated cellular projections (supplementary information, fig. s b).
expression of map was reduced in npcs in bioprinted models compared to 2d culture (supplementary information, fig. s a). olig staining revealed oligodendrocyte-like cells in tri-cultures (supplementary information, fig. s b). taken together, npcs partially differentiate in our bioprinted system, but are unlikely to form mature functional neurons or oligodendrocytes.
the 3d bioprinted model serves as a platform for drug response modeling
we next investigated the ability of our 3d bioprinted constructs to model drug responses and the capacity for cellular interactions within the 3d bioprinted constructs to affect drug sensitivity of gscs. fluorescent dextran molecules ( kda) modeled drug penetration into 3d bioprinted models. dextran molecules rapidly entered bioprinted constructs when the hydrogel was soaked in a dextran solution, with rapid increases in average fluorescence intensity measured from the hydrogel. the fluorescence intensity plateaued after min of incubation and displayed a uniform spatial intensity across the hydrogel, demonstrating that drug compounds can effectively permeate the 3d bioprinted model (fig. a-c). egfr is commonly amplified, overexpressed, or mutated in glioblastoma, so we evaluated the treatment efficacy of two egfr inhibitors, erlotinib and gefitinib, and the glioblastoma standard-of-care alkylating agent temozolomide in our models. 3d tri-cultures and tetra-cultures were cultured for days before drug treatment. despite activated egfr in glioblastomas, egfr inhibitors have shown little benefit for glioblastoma patients. gscs in either 3d model displayed enhanced resistance to egfr inhibitors and temozolomide compared to sphere culture. inclusion of m macrophage further increased resistance of gscs to egfr inhibitors. glioblastomas are highly lethal cancers for which current therapy is palliative. therefore, we explored the potential utility of 3d bioprinted systems to inform drug responses in glioblastoma.
overlaying gene expression data from the d tetraculture model with drug sensitivity and gene expression data from the cancer cell line encyclopedia (ccle) and the cancer therapeutic response platform (ctrp) enabled prediction of drug sensitivity and resistance in our d tetra-culture model based on transcriptional signatures (fig. f) . [ ] [ ] [ ] consistent with our studies of erlotinib, gefitinib, and temozolomide, high expression of genes upregulated in gscs in the d tetra-culture model was predicted to be associated with drug resistance for the majority of compounds across all cancer cell lines tested (fig. g) or when restricted to brain cancer cell lines (supplementary information, fig. s a ). drugs predicted to be ineffective included gsk-j (jmjd /kdm b inhibitor), cytarabine (nucleotide antimetabolite), and decitabine (dna methyltransferase inhibitor), while drugs predicted to be effective included abiraterone (cyp a inhibitor), fig. gscs grown in d tetra-culture models upregulate transcriptional signatures of cellular interaction, hypoxia, and cancer stem cells. a volcano plot of transcriptional landscape profiled by rna-seq comparing the cw gsc grown in standard sphere culture vs gscs in the d tetra-culture model. the x-axis depicts the log transformed fold change, while the y-axis shows the log transformed p value adjusted for multiple test correction. n = technical replicates per condition. b pathway gene set enrichment connectivity diagram displaying pathways enriched among gene sets upregulated (red) and downregulated (blue) in gscs in the d tetra-culture system vs standard sphere culture. c normalized single sample gene set enrichment analysis (ssgsea) scores of glioblastoma transcriptional subtypes as previously defined for the cw gsc when grown in in standard sphere culture vs gscs in the d tetra-culture model. bars are centered at the mean value and error bars represent standard deviation. 
d mrna expression of representative genes in hypoxia response pathways between standard sphere culture vs gscs in the d tetra-culture model as defined by rna-seq. p values were calculated using deseq with a wald test with benjamini and hochberg correction. ****p < e− . bars are centered at the mean value and error bars represent standard deviation. e mrna expression of representative genes in ion transport pathways between standard sphere culture vs gscs in the d tetra-culture model as defined by rna-seq. p values were calculated using deseq with a wald test with benjamini and hochberg correction. ****p < e− . bars are centered at the mean value and error bars represent standard deviation. f mrna expression of stem cell and differentiation markers between standard sphere culture vs gscs in the d tetra-culture model as defined by quantitative pcr (qpcr). three technical replicates were used and ordinary two-way anova with dunnett multiple comparison test was used for statistical analysis, *p < . ; **p < . ; ***p < . . bars indicate mean, with error bars showing standard deviation. g immunofluorescence staining of ca in cells grown in standard sphere culture (top) vs gscs in the d tetra-culture model (bottom). scale bars, μm. h immunofluorescence staining of ki in cells grown in standard sphere culture (top) vs gscs in the d tetra-culture model (bottom). scale bars, μm. i immunofluorescence staining of olig in cells grown in standard sphere culture (top) vs gscs in the d tetra-culture model (bottom). scale bars, μm. j immunofluorescence staining of sox in cells grown in standard sphere culture (top) vs gscs in the d tetra-culture model (bottom). scale bars, μm. fig. addition of macrophages activates extracellular matrix and invasiveness signatures. a umap analysis of rna-seq data from gscs grown in ( ) sphere culture, ( ) tri-culture, ( ) tetra-culture with thp -derived macrophage, and ( ) tetra-culture with hipsc-derived macrophages. 
b heatmap displaying mrna expression of differentially expressed genes between conditions. c upset plot showing the number of differentially expressed genes between conditions. for conditions containing sphere cultured cells, genes were considered differentially expressed if the log fold change of mrna expression was greater than . (or < − . ) with an adjusted p value of e− . for other conditions, genes were considered differentially expressed if the log fold change of mrna expression was greater than . (or < − . ) with an adjusted p value of e− . d volcano plot of transcriptional landscapes profiled by rna-seq comparing the cw gsc grown in tetra-culture containing thp -derived macrophages vs gscs in the tri-culture model. the x-axis depicts the log transformed fold change, while the y-axis shows the log transformed p value adjusted for multiple test correction. n = technical replicates per condition. e pathway gene set enrichment connectivity diagram displaying pathways enriched among gene sets upregulated (red) and downregulated (orange) in gscs in the d tetra-culture system vs tri-culture system. f gsea of the extracellular matrix structural constituent pathway between tetra-culture and tri-culture models. fdr q value = . . g gsea of the anastassiou multicancer invasiveness pathway between tetra-culture and tri-culture models. fdr q value = . . h gene set enrichment analysis (gsea) of the collagen degradation pathway between tetra-culture and tri-culture models. fdr q value = . . vemurafenib and plx- (raf inhibitors), ml (nrf activator), and ifosfamide (alkylating agent) (fig. g-j). the drug sensitivity predictions were similar, but not entirely overlapping, when a glioblastoma orthotopic xenograft expression signature was used (supplementary information, fig. s b).
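the differential expression criterion stated in the upset plot legend above (a cutoff on the absolute log fold change combined with a cutoff on the adjusted p value) can be illustrated with a minimal python sketch. the original analysis was run with deseq in r; the class name, function names, example genes, and the cutoff values below are hypothetical placeholders, because the actual thresholds are not recoverable from the text.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    """One row of a differential expression table (hypothetical layout)."""
    gene: str
    log_fc: float   # log-transformed fold change between two conditions
    padj: float     # p value adjusted for multiple testing


def is_differentially_expressed(row, log_fc_cutoff, padj_cutoff):
    """Apply both criteria: large absolute fold change and small adjusted p value."""
    return abs(row.log_fc) > log_fc_cutoff and row.padj < padj_cutoff


def split_by_direction(rows, log_fc_cutoff, padj_cutoff):
    """Return (upregulated, downregulated) gene names that pass both cutoffs."""
    passing = [r for r in rows
               if is_differentially_expressed(r, log_fc_cutoff, padj_cutoff)]
    up = [r.gene for r in passing if r.log_fc > 0]
    down = [r.gene for r in passing if r.log_fc < 0]
    return up, down


# Hypothetical example rows; these are not data from the study.
EXAMPLE_ROWS = [
    GeneResult("GENE_A", 2.3, 1e-8),    # passes, upregulated
    GeneResult("GENE_B", -1.7, 1e-6),   # passes, downregulated
    GeneResult("GENE_C", 0.4, 1e-9),    # fold change too small
    GeneResult("GENE_D", 3.0, 0.2),     # adjusted p value too large
]

# Placeholder cutoffs chosen only for illustration.
up_genes, down_genes = split_by_direction(EXAMPLE_ROWS,
                                          log_fc_cutoff=1.0,
                                          padj_cutoff=0.05)
```

using separate cutoffs per comparison, as the legend describes for sphere-containing versus other conditions, only requires calling `split_by_direction` with different arguments.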
investigation of the library of integrated network-based cellular signatures (lincs) dataset showed that compounds predicted to recapitulate the d tetra-culture signature included hypoxia inducible factor activators, caspase activators, and hdac inhibitors, while raf inhibitors and immunosuppressive agents may impair expression of this gene signature (supplementary information, fig. s c ). these findings suggest that interactions with the local microenvironment affect gsc sensitivity to therapeutic compounds and that the d bioprinted tissue model can interrogate these context-dependent effects. further, as the tetra-culture model expresses genes associated with poor sensitivity to a variety of therapeutic compounds, this system may be a more realistic model for drug discovery in glioblastoma. to validate these predictions, we treated gscs with three of the predicted compounds, abiraterone, vemurafenib, and ifosfamide in triculture and tetra-culture bioprinted models. when treated at the sphere culture ic value (supplementary information, fig. s d-f ), gscs in tetra-culture displayed enhanced sensitivity to abiraterone and ifosfamide compared to gscs in tri-culture, while sensitivity to vemurafenib was unchanged ( fig. i-k) . this suggests that abiraterone and ifosfamide may be effective in targeting tetra-culture derived gscs. further validating these findings in an in vivo subcutaneous glioblastoma xenograft model, ifosfamide therapy reduced tumor growth compared to vehicle (supplementary information, figs. s a-c). d bioprinted tissues uncover novel context-dependent essential pathways and serve as a platform for crispr screening given widespread therapeutic resistance in glioblastoma, we leveraged the d bioprinted construct as a discovery platform for glioblastoma dependencies. parallel whole-genome crispr-cas loss-of-function screening was performed in gscs in sphere culture as well as in the d tetra-culture system ( fig. a; supplementary information, fig. s ). 
functional dependencies segregated gscs based on their method of growth ( fig. b; supplementary information, fig. s f ). guide rnas were enriched (indicating that the targeted gene enhances viability when deleted) or depleted (indicating that the targeted gene reduces cell viability when deleted) in each platform (fig. c, d) . genes essential in each context, as well as pan-essential genes common to both platforms, included core pathways involved in translation, ribosome functions, and rna processing, cell cycle regulation, protein localization, and chromosomes and dna repair ( fig. e; supplementary information, fig. s g, h) . gene hits were stratified to identify context-specific dependencies (fig. f) . genes selectively essential in sphere culture were enriched for cell cycle, endoplasmic reticulum, golgi and glycosylation, lipid metabolism, and response to oxygen pathways. gscs grown in the d tetra-culture model were more dependent on transcription factor activity, cell development and differentiation, nf-κb signaling, and immune regulation pathways (fig. g-k) . thus, the d bioprinted model allowed for interrogation of functional dependencies of brain tumor cells in physiological settings and in combination with stromal fractions and revealed a more complex functional dependency network than that observed in sphere culture. to further validate d bioprinted-specific dependencies, we stratified our whole-genome crispr screening results, selecting genes predicted to be essential in d tetra-culture (fig. a, b) . individual gene knockout in luciferase-labeled gscs of pag , znf , atp h, and rnf a with two independent sgrnas reduced gsc viability in both sphere culture and d tetra-culture models (fig. c-m) . additionally, knockout of pag or znf in gscs delayed the onset of neurological signs in orthotopic glioblastoma xenografts compared to gscs treated with a nontargeting sgrna (fig. n-q) . 
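the stratification of screen hits into context-specific dependencies can be sketched as follows. this is a minimal illustration, assuming each gene has one dependency z-score per platform and that a tetra-culture-specific hit is a gene with a sphere culture z-score above one cutoff and a tetra-culture z-score below another; the function name, gene names, scores, and cutoffs are hypothetical, and the study itself derived its scores with mageck-vispr.

```python
def tetra_specific_dependencies(z_sphere, z_tetra, sphere_min, tetra_max):
    """Select genes that look dispensable in sphere culture but essential in tetra-culture.

    z_sphere, z_tetra: dicts mapping gene name -> dependency z-score
    sphere_min: sphere culture z-score must be greater than this cutoff
    tetra_max: tetra-culture z-score must be less than this cutoff
    Returns gene names ordered from the most to the least negative tetra-culture score.
    """
    shared = z_sphere.keys() & z_tetra.keys()  # only genes scored in both screens
    hits = [g for g in shared
            if z_sphere[g] > sphere_min and z_tetra[g] < tetra_max]
    return sorted(hits, key=lambda g: z_tetra[g])


# Hypothetical z-scores; these are not values from the screen.
z_sphere_example = {"G1": -0.2, "G2": -2.5, "G3": 0.1, "G4": -0.5}
z_tetra_example = {"G1": -2.0, "G2": -2.4, "G3": -0.3, "G4": -3.1}

# Placeholder cutoffs chosen only for illustration.
selected = tetra_specific_dependencies(z_sphere_example, z_tetra_example,
                                       sphere_min=-1.0, tetra_max=-1.5)
```

in this toy example, `G2` is excluded because it is also strongly depleted in sphere culture (a pan-essential pattern), and `G3` is excluded because it is not depleted in either platform.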
pag and znf are upregulated at the mrna level in glioblastomas compared to normal brain tissue and high expression is associated with poor patient prognosis in primary glioblastomas from the chinese glioma genome atlas (cgga) dataset, highlighting the clinical relevance of these factors in glioblastoma (supplementary information, fig. s a-d) . taken together, this screening approach has identified novel candidates for future investigation and potential therapeutic development. d bioprinted cultures express transcriptional signatures associated with poor glioblastoma patient prognosis to determine the clinical relevance of the d bioprinted construct, we investigated the transcriptional profiles relative to glioblastoma patients. signatures of genes upregulated either in intracranial orthotopic xenografts or in d tetra-culture compared to sphere culture were elevated in glioblastomas compared to low-grade gliomas in the cancer genome atlas (tcga), cgga, and the rembrandt dataset (fig. a-d) . the d tetra-culture gene signature was elevated in recurrent glioblastomas compared to primary tumors (fig. e) and in the mesenchymal subtype compared to classical or proneural glioblastomas (fig. f) . in the tcga and cgga datasets, the orthotopic xenograft signature and the d tetra-culture signature were associated with poor glioblastoma patient prognosis (fig. g-j) . many genes with individual poor prognostic significance were upregulated in the intracranial xenograft signature, including chi l , postn, and ndrg (fig. k) , while dennd a, maob, and igfbp were upregulated in the d bioprinted cultures (fig. l) . genes with poor prognostic significance were enriched among all genes in the d tetra-culture signature, when compared to a background of all genes (fig. m) . 
thus, d bioprinting enabled investigation of gene pathways associated with more aggressive glioblastomas, suggesting that this model can serve as a more realistic therapeutic discovery platform for the most lethal classes of glioblastoma. fig. macrophages grown in d tetra-culture models upregulate immune activation signatures, increase m polarization, and promote gsc invasion. a schematic diagram of experimental approach for macrophage rna-seq experiments. b volcano plot of transcriptional landscape profiled by rna-seq comparing macrophages grown in standard sphere culture vs macrophages in the d tetra-culture model. the x-axis depicts the log transformed fold change, while the y-axis shows the log transformed p value adjusted for multiple test correction. c pathway gene set enrichment connectivity diagram displaying pathways enriched among gene sets upregulated (red) and downregulated (blue) in macrophages in the d tetra-culture system vs standard sphere culture. d mrna expression of representative genes in defense response and macrophage function pathways between standard sphere culture vs macrophages in the d tetra-culture model as defined by rna-seq. p values were calculated using deseq with a wald test with benjamini and hochberg correction. ****p < e− . bars are centered at the mean value and error bars represent standard deviation. e mrna expression of representative genes in amino acid deprivation pathways between standard sphere culture vs macrophages in the d tetra-culture model as defined by rna-seq. p values were calculated using deseq with a wald test with benjamini and hochberg correction. ****p < e− . bars are centered at the mean value and error bars represent standard deviation. f mrna expression of m and m macrophage polarization markers between standard sphere culture vs macrophages in the d tetra-culture model as defined by qpcr. 
three technical replicates were used and ordinary two-way anova with dunnett multiple comparison test was used for statistical analysis, ***p < . ; ****p < . . bars indicate mean, with error bars showing standard deviation. g fluorescence imaging of cw gscs (green) and macrophages (red) grown in the d tri-culture model without macrophages (top) vs the d tetra-culture model with macrophages (bottom). scale bars, mm. h fluorescence imaging of gscs (green) and macrophages (red) grown in the d tri-culture model without macrophages (top) vs the d tetra-culture model with macrophages (bottom). scale bars, mm. i fluorescence imaging of gsc gscs (green) and macrophages (red) grown in the d tri-culture model without macrophages (top) vs the d tetra-culture model with macrophages (bottom). scale bars, mm. j fluorescence imaging of gscs (green) and macrophages (red) grown in the d tri-culture model without macrophages (top) vs the d tetra-culture model with macrophages (bottom). scale bars, mm. to improve modeling of a highly lethal brain cancer for which current therapies are limited, we utilized a dlp-based d bioprinting system to model glioblastoma, the most common and highly lethal type of brain tumor. studies have reported using d printing to create coculture models of glioblastoma cells with other stromal cells or to fabricate ha-based hydrogels that mimic brain ecm. , , however, most prior models focused on only one aspect of the in vivo situation or used non-human cells, which reduced their applicability to actual clinical settings. to the best of our knowledge, this is the first report of a human cell-based d glioblastoma model that recapitulates the complex tumor microenvironment with inclusion of normal brain, immune components, stromal components, and essential mechanical and biochemical cues from the extracellular matrix.
the tumor microenvironment provides essential signals to guide tumor growth and survival; however, these cues are inefficiently modeled in standard d culture, even in the absence of serum. hypoxic signaling contributes to glioblastoma aggressiveness by remodeling gsc phenotypes. , our d tetra-culture brain tumor model expressed hypoxia response signatures, allowing for investigation of hypoxic signaling in a physiologic environment, unlike standard cell culture systems. critical growth factor signaling elements are provided from neurons, [ ] [ ] [ ] , npcs, ecm components, , and immune fractions, including macrophages. , the perivascular niche provides a variety of signals including wnts, ephrins, and osteopontins to promote glioblastoma invasion, growth, and maintenance of gscs. future studies will be required to integrate vascular components into the d printed model system to further study these important components of the brain tumor microenvironment. the d tetra-culture tissue environment presented here enables controlled, reproducible, and scalable interrogation of these various cellular interactions that drive brain tumor biology. while microenvironmental components supply critical niche factors to sustain the tumor ecosystem, stromal elements are also actively remodeled by malignant cells. here, we observed the role of immune cells in glioblastoma growth, including changes in gene expression, invasive behaviors, and response to treatments. reciprocally, we also find that the d glioblastoma microenvironment promoted polarization of macrophages towards a protumoral m macrophage phenotype, highlighting this bidirectional crosstalk. the bioprinting approach generates a spatially separated tumor region and surrounding non-neoplastic neural tissue with defined cell density which allows the cells to interact in a more realistic manner, providing a highly reproducible platform for the interrogation of cell-cell interactions with several key advantages. 
first, this d glioblastoma tissue model allows for investigation of tumor-immune interactions in a fully human species-matched system, which is not possible in xenograft or genetically engineered mouse models. this may facilitate understanding of human-specific immune interactions and advance the field of neuro-oncoimmunology by providing insights into immunotherapy efficacy. second, combining tumoral and non-neoplastic neural components within one model will propel drug discovery efforts by enabling measurements of therapeutic efficacy, toxicities, and therapeutic index. the scalability and reproducibility of this d bioprinted model also allow for more high-throughput compound screening efforts. our findings suggest that the d bioprinted model displays transcriptional signatures closer to patient-derived glioblastoma tissue, and that local stromal interactions present within our model promote broad therapeutic resistance, enabling compound discovery efforts in a challenging environment. third, the d bioprinted model is amenable to large-scale whole-genome crispr-cas -based screening methods to uncover novel functional dependencies in a physiologic setting. this model extends previous approaches by characterizing context-dependent target essentiality in cancer cells and allowing for investigation of multivalent stromal cell dependencies. in conclusion, we report a controlled, reproducible, and scalable d engineered glioblastoma tissue construct that serves as a more physiologically accurate brain tumor model, facilitates interrogation of the multicellular interactions that drive brain tumor biology, and acts as a platform for discovery of novel functional dependencies. gelma and gmha synthesis and characterization gelma and gmha were synthesized using type a, gel strength gelatin from porcine skin (sigma aldrich cat #: g ) and , da hyaluronic acid (lifecore), respectively, as described previously.
, briefly, for the gelma synthesis of % degree of methacrylation, % (w/v) gelatin was dissolved in . m : carbonate-bicarbonate buffer solution (ph~ ) at °c. methacrylic anhydride was added dropwise at a volume of . ml/(gram gelatin). the reaction was left to run for h at °c. after synthesis, the solutions were dialyzed, frozen overnight at − °c, and lyophilized. freeze-dried gelma and gmha were stored at − °c and reconstituted immediately before printing to stock solutions of % (w/vol) and % (w/vol), respectively. all materials were sterilized by syringe filters before mixing with cells (millipore). the degree of methacrylation of gelma and gmha were quantified using proton nmr (bruker, mhz). cell culture xenografted tumors were dissociated using a papain dissociation system according to the manufacturer's instructions. gscs were then cultured in neurobasal medium supplemented with % b , fig. d bioprinting enables a drug discovery platform and microenvironmental interactions contribute to drug resistance. a (top) schematic diagram of drug diffusion experiment. (bottom) images of fitc-dextran diffusion through the d hydrogel over a time course. scale bars, mm. b average intensity of fitc-dextran signal through the d tetra-culture model over a time course. three replicates were used. bars indicate mean with error bars showing standard deviation. ordinary one-way anova with tukey correction for multiple comparisons was used for statistical analysis. c spatial intensity of fitc-dextran signal through the d tetra-culture model over a time course. d cell viability of the gsc gsc following treatment with the egfr inhibitors, erlotinib and gefitinib, and the alkylating agent temozolomide (tmz) in standard sphere culture conditions, the d tri-culture model, and the d tetra-culture model. three replicates were used, ordinary two-way anova with dunnett multiple test correction was used for statistical analysis. bars indicate mean, while error bars show standard deviation. 
**p < . ; ****p < . . e cell viability of the cw gsc following treatment with the egfr inhibitors, erlotinib and gefitinib, and the alkylating agent tmz in standard sphere culture conditions, the d tri-culture model, and the d tetra-culture model. three replicates were used, ordinary two-way anova with dunnett multiple test correction was used for statistical analysis. bars indicate mean, while error bars show standard deviation. **p < . ; ***p < . ; ****p < . . f schematic diagram of process to determine drug sensitivity based on the d tetra-culture gene expression signature from the ccle and ctrp datasets. [ ] [ ] [ ] g therapeutic efficacy prediction of drugs in all cancer cells in the ctrp dataset based on differentially expressed genes between the d tetra-culture model and gscs grown in sphere culture as defined by rna-seq. h correlation of (top) abiraterone and (bottom) gsk-j sensitivities based on the d tetra-culture signature expression across all cancer cell lines in the ccle dataset. compounds are ranked based on the correlation between the tetra-culture gene expression signature and compound area under the curve (auc). i normalized cell viability of gscs in tri-culture and tetra-culture models following treatment with μm of abiraterone. ***p < . . bar shows mean of six technical replicates and error bars indicate standard deviation. unpaired two-tailed t-test was used for statistical analysis. j normalized cell viability of gscs in tri-culture and tetra-culture models following treatment with μm of vemurafenib. ns, p > . . bar shows mean of six technical replicates and error bars indicate standard deviation. unpaired two-tailed t-test was used for statistical analysis. k normalized cell viability of gscs in tri-culture and tetra-culture models following treatment with μm of ifosfamide. ***p < . . bar shows mean of six technical replicates and error bars indicate standard deviation. unpaired two-tailed t-test was used for statistical analysis. 
% l-glutamine, % sodium pyruvate, % penicillin/streptomycin, ng/ml basic human fibroblast growth factor (bfgf), and ng/ml human epidermal growth factor (egf) for at least h to recover expression of surface antigens. gsc phenotypes were validated by expression of stem cell markers (sox and olig ), functional assays of self-renewal (serial neurosphere passage), and tumor propagation using in vivo limiting dilution. thp- monocytes were cultured in rpmi (gibco) medium supplemented with % heat-inactivated fetal bovine serum (fbs, invitrogen) and % penicillin/streptomycin. to obtain monocyte-derived m macrophages, thp- monocytes were first seeded in well plates at a density of × cells/ml ( ml/well). polarization to m macrophages was induced by ( ) incubating cells in ng/ml phorbol -myristate -acetate (pma, sigma aldrich) for h, ( ) replacing with thp complete medium for h, and then ( ) incubating in ng/ml interleukin (il , peprotech) and ng/ml interleukin (il , peprotech) for h. hnp neural progenitor cells (neuromics) were cultured on matrigel-coated plates using the complete nbm medium for gscs. human astrocytes (thermofisher) were cultured with astrocyte medium (sciencell) supplemented with % penicillin/streptomycin. d bioprinting process before printing, gscs, hnp s, and astrocytes were digested by accutase (stemcell technology), and macrophages were digested with tryple (thermofisher). for the d tetra-culture samples, the cell suspension solution for the tumor core consisted of . × cells/ml gscs and . × cells/ml macrophages (gscs:m = : ). for the d tri-culture samples, the core cell suspension solution consisted of . × cells/ml gscs only (supplementary information, fig. s a, b). the cell suspension solution for the peripheral region for both models consisted of × cells/ml hnp s and × cells/ml astrocytes. all cell suspensions were aliquoted into . ml eppendorf tubes and stored on ice before use.
the prepolymer solution for bioprinting was prepared with % (w/v) gelma, . % (w/v) gmha, and . % (w/v) lithium phenyl( , , -trimethylbenzoyl) phosphinate (lap) (tokyo chemical industry). prepolymer solution was kept at °c in the dark before use. cell suspension was mixed with prepolymer solution at : ratio immediately before printing to maximize viability. the two-step bioprinting process utilized a customized light-based d printing system. components of the system included a digital micromirror device (dmd) chip (texas instruments), a motion controller (newport), a light source (hamamatsu), a printing stage, and a computer with software to coordinate all the other components. the thickness of the printed samples was precisely controlled by the motion controller and the stage. cell-material mixture was loaded onto the printing stage, and the corresponding digital mask was input onto the dmd chip. light was turned on for an optimized amount of exposure time ( s for the core and s for the periphery). the bioprinted d tri-culture/tetra-culture samples were then rinsed with dpbs and cultured in maintenance medium at °c with % co . maintenance medium was made of % of complete nbm medium, % of thp medium, and % of astrocyte medium. hipsc-derived macrophage generation hipsc-derived macrophage differentiation protocol was adapted from yanagimachi et al. and modified from mesci et al. briefly, ipsc cell lines were generated as previously described, by reprogramming fibroblasts from a healthy donor. the ipsc colonies were plated on matrigel-coated (bd biosciences) plates and maintained in mtesr media (stem cell technologies). the protocol of myeloid cell lineage consisted of sequential steps. in the first step, primitive streak cells were induced by bmp addition, which in step , were differentiated into hemangioblast-like hematopoietic precursors (vegf ( ng/ml, peprotech), scf ( ng/ml, gemini) and basic fibroblast growth factor (bfgf), ( ng/ml, life technologies)).
then, in the third step, the hematopoietic precursors were pushed towards myeloid differentiation (flt- ligand ( ng/ml, humanzyme), il- ( ng/ml, gemini), scf ( ng/ml, gemini), thrombopoietin, tpo ( ng/ml), m-csf ( ng/ml)) and finally into the monocytic lineage in step [flt -ligand ( ng/ml), m-csf ( ng/ml), gm-csf ( ng/ml)]. cells produced in suspension in step were recovered, sorted by using anti-cd magnetic microbeads (macs, miltenyi) and then integrated into d bioprinted models as described above. isolation and generation of primary human macrophages human blood was obtained from healthy volunteers from the scripps research institute normal blood donor service. mononuclear cells were isolated by gradient centrifugation using lymphoprep (# stemcell), washed with pbs, and treated with red blood cell lysis buffer. cells were plated to adhere monocytes and cultured in % heat inactivated fbs in rpmi with hepes, glutamax, mm sodium pyruvate, and pen/strep with ng/ml m-csf for days as described by ogasawara et al. unpolarized m macrophages were collected and integrated into d bioprinted models as described above. mechanical testing compressive modulus of the d printed constructs was measured with a microsquisher (cellscale). pillars mm in diameter and mm in height were printed with the same conditions used for the tissue models and incubated overnight at °c. both acellular and cell-encapsulated constructs were tested. the microsquisher utilized stainless steel beams and platens to compress the constructs at % displacement of their height. customized matlab scripts were used to calculate the modulus from the force and displacement data collected by the microsquisher. sem surface patterns of the materials and cell-material interactions on micron-scale were imaged with a scanning electron microscope (zeiss sigma ). acellular samples were snap-frozen in liquid nitrogen and immediately transferred to the freeze drier to dry overnight.
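the modulus calculation described under mechanical testing above can be illustrated with a short sketch. the authors used customized matlab scripts that are not shown in the text; the python version below is only an assumption about one common approach, in which force is converted to engineering stress over the circular cross-section of the pillar, displacement is converted to strain over the pillar height, and the modulus is taken as the least-squares slope of stress against strain. the function name, the units, and the example values are hypothetical.

```python
import math


def compressive_modulus(forces_n, displacements_m, diameter_m, height_m):
    """Estimate a compressive modulus (Pa) from force-displacement data.

    Assumes a cylindrical pillar under uniaxial compression and a linear
    stress-strain response over the supplied data range.
    """
    if len(forces_n) != len(displacements_m):
        raise ValueError("force and displacement series must have equal length")
    if len(forces_n) < 2:
        raise ValueError("at least two data points are required")

    area = math.pi * (diameter_m / 2.0) ** 2          # pillar cross-section
    stress = [f / area for f in forces_n]             # engineering stress, Pa
    strain = [d / height_m for d in displacements_m]  # engineering strain

    # Ordinary least-squares slope of stress versus strain.
    n = len(strain)
    mean_x = sum(strain) / n
    mean_y = sum(stress) / n
    sxx = sum((x - mean_x) ** 2 for x in strain)
    if sxx == 0.0:
        raise ValueError("strain values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(strain, stress))
    return sxy / sxx


# Hypothetical synthetic data generated from a known modulus of 1000 Pa.
_diameter = 1e-3   # m, placeholder pillar diameter
_height = 1e-3     # m, placeholder pillar height
_area = math.pi * (_diameter / 2.0) ** 2
_strains = [0.0, 0.05, 0.10]
example_displacements = [s * _height for s in _strains]
example_forces = [1000.0 * s * _area for s in _strains]

example_modulus = compressive_modulus(example_forces, example_displacements,
                                      _diameter, _height)
```

fitting a slope over the whole series is a simplification; hydrogel data are often fit only over an initial linear strain range, which would be selected before calling the function.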
cell-encapsulated samples were dried based on a chemical dehydration protocol. briefly, samples were fixed using . % glutaraldehyde solution for h at room temperature and then overnight at °c. on the next day, the samples were rinsed with dpbs for three times and soaked in % ethanol, % ethanol, and % ethanol subsequently, each for min. then the solution was replaced with % ethanol for min, and the step was repeated two more times. hexamethyldisilazane (hdms) was mixed with % ethanol at : ratio and : ratio. samples were first transferred to hdms: fig. whole-genome crispr-cas screen reveals context-specific functional dependencies. a schematic diagram of whole-genome crispr-cas loss-of-function screening strategy in standard sphere culture conditions and the d tetra-culture model. b pca of functional dependencies defined by whole genome crispr-cas screening as defined in (a). c volcano plot demonstrating genes that enhance (blue) or inhibit (red) cell proliferation in sphere culture when inactivated by a specific sgrna in a whole genome crispr-cas loss-of-function screen. the x-axis displays the z-score and the y-axis displays the p value as calculated by the mageck-vispr algorithm. d volcano plot demonstrating genes that enhance (blue) or inhibit (red) cell proliferation in the d tetra-culture model when inactivated by a specific sgrna in a whole genome crispr-cas loss-of-function screen. the x-axis displays the z-score and the y-axis displays the p value as calculated by the mageck-vispr algorithm. e pathway gene set enrichment connectivity diagram displaying pathways enriched among functional dependency genes common to both sphere culture and d culture in the tetra-culture model. f plot comparing the functional dependency zscores between sphere culture and d culture in the tetra-culture model. g pathway gene set enrichment connectivity diagram displaying pathways enriched among functional dependency genes that are specific to sphere culture, as defined in f. 
h pathway gene set enrichment connectivity diagram displaying pathways enriched among functional dependency genes that are specific to growth in the d tetra-culture, as defined in f. i volcano plot displaying differential functional dependency scores between sphere culture and the d tetra-culture system as defined by mageck-vispr. j pathway gene set enrichment connectivity diagram displaying pathways enriched among functional dependency genes that are more essential in sphere culture compared to in the d tetra-culture system, as defined in i. k pathway gene set enrichment connectivity diagram displaying pathways enriched among functional dependency genes that are more essential in the d tetra-culture system compared to in sphere culture, as defined in i. etoh ( : ) for min, then hdms:etoh ( : ) for min. then the solution was replaced with % hdms for min, and the step was repeated two more times. the samples were left uncovered in a chemical hood overnight to dry. the freeze-dried or chemically dried samples were coated with iridium by a sputter coater (emitech) prior to sem imaging. immunofluorescence staining and image acquisition of tumor model d bioprinted samples and sphere cultured cells were fixed with % paraformaldehyde (pfa; wako) for min and min, respectively, at room temperature. all samples were blocked and permeabilized using % (w/v) bovine serum albumin (bsa, gemini bio-products) solution with . % triton x- (promega) for h at room temperature on a shaker. samples were then incubated with the respective primary antibody (listed below) overnight at °c. on the next day, samples were rinsed three times on the shaker with dpbs containing . % tween (pbst). samples were incubated with fluorophore-conjugated goat anti-rabbit or goat anti-mouse secondary antibodies ( : ; biotium) and hoechst ( : ; life technologies) counterstain in dpbs with % (w/v) bsa for h at room temperature in the dark.
after incubation, samples were rinsed three times in pbst and stored in dpbs with . % sodium azide (alfa aesar) at °c before imaging. fluorescence images of d samples and their sphere cultured counterparts were taken with a confocal microscope (leica sp ) using consistent settings for each antibody (supplementary information, table s ). fluorescence images of egfp- or mcherry-labeled cells in the d samples were also acquired using the confocal microscope. tile scan merging was completed by the automated program on the leica microscope and the z-stack projection was completed by imagej. quantification of the migration was based on the fluorescence images processed by imagej. rna isolation and rt-pcr egfp-labeled gscs and mcherry-labeled thp s were isolated from d printed tri-culture and tetra-culture samples using flow cytometry (bd facsaria ii). cells isolated from d and sphere cultured cells were treated with trizol reagent (life technologies) before rna extraction. total rna of each sample was extracted using direct-zol rna miniprep kit (zymo) and immediately stored at − °c. to perform rt-pcr, cdna was first obtained by rna reverse transcription using the protoscript® first strand cdna synthesis kit (new england biolabs) with input rna of ng per sample. the primers were purchased from integrated dna technologies. rt-pcr was performed using powerup sybr green master mix (applied biosystems) and detected with quantstudio rt-pcr system. gene expression was determined by the threshold cycle (ct) values normalized against the housekeeping gene (supplementary information, table s ). rna-seq and data analysis rna was purified as described above and subjected to rna-seq. paired-end fastq sequencing reads were trimmed using trim galore version . . (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) using cutadapt version . . transcript quantification was performed using salmon version . .
in the quasi-mapping mode from transcripts derived from human gencode release (grch . ). salmon "quant" files were converted using tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html) and differential expression analysis was performed using deseq in the r programming language. data from gscs and primary glioblastoma surgical resection tissues were derived from mack et al. and were processed using the same analysis pipeline. data from matched gscs grown in serum-free sphere culture and orthotopic intracranial xenografts were derived from miller et al. and were processed using the same analysis pipeline. processed data from matched gscs and differentiated tumor cells were derived from suva et al. and differentially expressed genes were calculated using the limma-voom algorithm in the limma package in the r programming language. pca was performed within the deseq package using the top differentially expressed genes. umap analysis was performed using the umapr package (https://github.com/ropenscilabs/umapr) and uwot (https://cran.r-project.org/web/packages/uwot/index.html). for comparisons of glioblastoma tissue samples with gscs grown in standard sphere culture, analysis parameters include: sample size of local neighborhood, number of neighbors = ; learning rate = . ; initialization of low dimensional embedding = random; metrics for computation of distance in high dimensional space = manhattan. for comparisons of gscs derived from sphere culture or d bioprinted models, analysis parameters include: sample size of local neighborhood, number of neighbors = ; initialization of low dimensional embedding = random; metrics for computation of distance in high dimensional space = cosine. gene set enrichment analysis was performed using the online gsea webportal (http://software.broadinstitute.org/gsea/msigdb/annotate.jsp) and the gsea desktop application fig. pag and znf are potential therapeutic targets in glioblastoma.
a d tetra-culture specific target identification approach. graph showing gene dependency z-score in sphere culture (x-axis) vs tetra-culture (y-axis). red color indicates genes with a sphere culture z-score of > − . and a tetra-culture z-score of < − . . b red genes from (a) ranked based on the dependency significance in tetra-culture models (−log of the p value). c luminescent signal in gscs transfected with a luciferase expression vector (red) or un-transfected cells following treatment with luciferin reagent for min. ***p < . . unpaired, two-tailed t test was used for statistical analysis. d western blot for pag and flag-tagged cas following treatment with two independent sgrnas targeting pag in luciferase-expressing cw cells or a nontargeting control (sgcont). tubulin was used as a loading control. e western blot for znf and flag-tagged cas following treatment with two independent sgrnas targeting znf in luciferase expressing cw cells or a sgcont. tubulin was used as a loading control. f western blot for atp h (atp pd) and flag-tagged cas following treatment with two independent sgrnas targeting atp h in luciferase expressing cw cells or a sgcont. tubulin was used as a loading control. g western blot for rnf a and flag-tagged cas following treatment with two independent sgrnas targeting rnf a in luciferase expressing cw cells or a sgcont. tubulin was used as a loading control. h cell viability of cw luciferase expressing gscs in sphere culture following treatment with two independent sgrnas targeting pag or a sgcont. ****p < . . two-way repeated measures anova with dunnett multiple comparison testing was used for statistical analysis. i cell viability of cw luciferase expressing gscs in sphere culture following treatment with two independent sgrnas targeting znf or a sgcont. ****p < . . two-way repeated measures anova with dunnett multiple comparison testing was used for statistical analysis. 
j cell viability of cw luciferase-expressing gscs in sphere culture following treatment with two independent sgrnas targeting atp h or a sgcont. ****p < . . two-way repeated measures anova with dunnett multiple comparison testing was used for statistical analysis. k cell viability of cw -luciferase expressing gscs in sphere culture following treatment with two independent sgrnas targeting rnf a or a sgcont. ****p < . . two-way repeated measures anova with dunnett multiple comparison testing was used for statistical analysis. l cell viability of cw luciferase expressing gscs in d tetra-culture models after editing with two independent sgrnas targeting pag , znf , or a non-targeting sgrna after seven days. ****p < . . bars show mean and standard deviation of two biological replicates with technical replicates. ordinary one-way anova with dunnett multiple comparison correction was used for statistical analysis. m cell viability of cw luciferase expressing gscs in d tetra-culture models after editing with two independent sgrnas targeting atp h, rnf a, or a nontargeting sgrna after seven days. *p < . ; **p < . . bars show mean and standard deviation of two biological replicates with technical replicates. ordinary one-way anova with dunnett multiple comparison correction was used for statistical analysis. n western blot for pag and flag-tagged cas following treatment with two independent sgrnas targeting pag in cw gscs or a sgcont. tubulin was used as a loading control. o kaplan-meier plot showing mouse survival following orthotopic implantation of gscs edited with one of two sgrnas targeting pag or a sgcont. sgpag . vs sgcont, p = . . sgpag . vs sgcont = . . log-rank test was used for statistical analysis. p western blot for znf and flag-tagged cas following treatment with two independent sgrnas targeting znf in cw gscs or a sgcont. tubulin was used as a loading control. 
q kaplan-meier plot showing mouse survival following orthotopic implantation of gscs edited with one of two sgrnas targeting znf or a sgcont. sgznf . vs sgcont, p = . . sgznf . vs sgcont, p > . . log-rank test was used for statistical analysis. (http://software.broadinstitute.org/gsea/downloads.jsp). , pathway enrichment bubble plots were generated using the bader lab enrichment map application and cytoscape (http://www. cytoscape.org). glioblastoma transcriptional subtypes were calculated using a program written by wang et al. and implemented in r. gene signatures were calculated using the single sample gene set enrichment analysis projection (ssgseaprojection) module on genepattern (https://cloud.genepattern.org). crispr editing crispr editing was performed on cw gscs as well as luciferase-labeled cw gscs (cw -luc). for unlabeled cells, sgrnas were cloned into the lenticrisprv plasmid containing a puromycin selection marker (addgene plasmid # ), while luciferase-labeled cells were edited with sgrnas cloned into the lenticrisprv plasmid containing a hygromycin selection marker (addgene plasmid # ). sgrna sequences were chosen from the human crispr knockout pooled library (brunello) (supplementary information, table s ). western blot analysis cells were collected and lysed in ripa buffer ( mm tris-hcl, ph . ; mm nacl; . % np- ; mm naf with protease inhibitors) and incubated on ice for min. lysates were centrifuged at °c for min at , rpm, and supernatant was collected. the pierce bca protein assay kit (thermo scientific) was utilized for determination of protein concentration. equal amounts of protein samples were mixed with sds laemmli loading buffer, boiled for min, and electrophoresed using nupage bis-tris gels, then transferred onto pvdf membranes. tbs-t supplemented with % non-fat dry milk was used for blocking for a period of h followed by blotting with primary antibodies at °c for h (supplementary information, table s ). 
blots were washed times for min each with tbs-t and then incubated with appropriate secondary antibodies in % non-fat milk in tbs-t for h. for all western immunoblot experiments, blots were imaged using biorad image lab software and subsequently processed using adobe illustrator to create the figures.

molecular diffusion assessment

d hydrogels were printed and incubated in dpbs overnight at °c. fluorescein isothiocyanate (fitc)-dextran with an average molecular weight of da was dissolved in dpbs at a concentration of µg/ml. dpbs was removed and fitc-dextran solutions were added to the wells with d printed hydrogels. hydrogels were incubated in fitc-dextran solution at °c for , , , , , and min; rinsed three times with dpbs; and then imaged using a fluorescence microscope. fluorescence intensities of the hydrogel were measured by imagej. the average intensities and the spatial intensities at each time point were calculated in excel and plotted using prism.

drug response assessment

d tri-culture/tetra-culture samples were printed as described above, with regular gscs substituted with luciferase-labeled gscs. d samples and sphere cultured cells plated on matrigel-coated slides were treated with drugs after days in culture. drug effects were evaluated h later for erlotinib and gefitinib. for temozolomide, medium was replaced with fresh medium containing temozolomide h after the first treatment, and the drug response was evaluated h after the second treatment. luciferase readings were obtained using the promega luciferase assay system (e ) based on the provided protocol and a tecan infinite m plate reader. abiraterone (hy- ), vemurafenib (hy- ), ifosfamide (hy- ), erlotinib (hy- ), and gefitinib (hy- ) from medchemexpress were used to generate dose response curves in vitro.

fig. d bioprinting contributes to upregulation of genes with poor prognostic significance in glioblastoma. 
a heatmap displaying mrna expression signatures of intracranial xenografts (vs sphere cell culture) and d bioprinted tetra-cultures (vs sphere cell culture) as defined by the tcga glioma hg-u a microarray. various clinical metrics, patient information and information on tumor genetics are also displayed. b mrna expression signature of (left) d bioprinted tetra-cultures (vs sphere cell culture) and (right) intracranial xenografts (vs sphere cell culture) in tcga glioma hg-u a microarray. grade ii (n = ), grade iii (n = ), grade iv (n = ). the box-and-whisker plot indicates the lower quartile, median, and upper quartile. error bars represent the %- % confidence interval. ordinary one-way anova with tukey multiple comparison test was used for statistical analysis, ****p < . . c mrna expression signature of d bioprinted tetra-cultures (vs sphere cell culture) in cgga. grade ii (n = ), grade iii (n = ), grade iv (n = ). the box-and-whisker plot indicates the lower quartile, median, and upper quartile. error bars represent the %- % confidence interval. ordinary one-way anova with tukey multiple comparison test was used for statistical analysis, ****p < . . d mrna expression signature of d bioprinted tetra-cultures (vs sphere cell culture) in the rembrandt glioma dataset. grade ii (n = ), grade iii (n = ), grade iv (n = ). the box-and-whisker plot indicates the lower quartile, median, and upper quartile. error bars represent the %- % confidence interval. ordinary one-way anova with tukey multiple comparison test was used for statistical analysis, ****p < . . e mrna expression signature of d bioprinted tetra-cultures (vs sphere cell culture) in the chinese glioma genome atlas (cgga). data presented is restricted to glioblastomas (grade iv glioma). primary (n = ), recurrent (n = ). the box-and-whisker plot indicates the lower quartile, median, and upper quartile. error bars represent the %- % confidence interval. 
ordinary one-way anova with tukey multiple comparison test was used for statistical analysis, ****p < . . f mrna expression signature of d bioprinted tetra-cultures (vs sphere cell culture) in the rembrandt glioma dataset. data presented is restricted to glioblastomas (grade iv glioma). proneural (n = ), mesenchymal (n = ), classical iv (n = ). the box-and-whisker plot indicates the lower quartile, median, and upper quartile. error bars represent the %- % confidence interval. ordinary one-way anova with tukey multiple comparison test was used for statistical analysis, ****p < . . g kaplan-meier survival analysis of glioblastoma patients in the tcga dataset based on the mrna expression signature of intracranial xenografts (vs sphere cell culture). patients were grouped into "high" or "low" signature expression groups based on the median signature expression score. low (n = ), high (n = ). log rank analysis was used for statistical analysis, p = . . h kaplan-meier survival analysis of glioblastoma patients in the tcga dataset based on the mrna expression signature of d bioprinted tetra-cultures (vs sphere cell culture). patients were grouped into "high" or "low" signature expression groups based on the median signature expression score. low (n = ), high (n = ). log rank analysis was used for statistical analysis, p = . . i kaplan-meier survival analysis of glioblastoma patients in the cgga dataset based on the mrna expression signature of intracranial xenografts (vs sphere cell culture). patients in the top / of the expression signature score were grouped into the "high" group, while those in the bottom / of the expression signature score were grouped into the "low" group. low (n = ), high (n = ). log rank analysis was used for statistical analysis, p = . . j kaplan-meier survival analysis of glioblastoma patients in the cgga dataset based on the mrna expression signature of d bioprinted tetra-cultures (vs sphere cell culture). 
patients in the top / of the expression signature score were grouped into the "high" group, while those in the bottom / of the expression signature score were grouped into the "low" group. low (n = ), high (n = ). log rank analysis was used for statistical analysis, p = . . k plot showing genes in the intracranial xenograft signature ranked by (x-axis) the mean survival difference between the "high" expressing group and the "low" expressing group and (y-axis) the statistical significance of the survival difference as calculated by the log-rank test. patients were grouped into "high" or "low" signature expression groups based on the median gene expression. l plot showing genes in the d bioprinted tetra-cultures (vs sphere cell culture) signature ranked by (x-axis) the mean survival difference between the "high" expressing group and the "low" expressing group and (y-axis) the statistical significance of the survival difference as calculated by the log-rank test. patients were grouped into "high" or "low" signature expression groups based on the median gene expression. m the outer pie chart displays the fraction of genes with prognostic significance in the d bioprinted tetra-cultures gene signature as calculated by the log-rank test. patients were grouped into "high" or "low" signature expression groups based on the median gene expression. the inner pie chart displays the number of total prognostically significant genes as a fraction of all genes. the chi-squared test was used for statistical analysis, p < . . sphere culture cell proliferation experiments were conducted by plating cells of interest at a density of cells per well in a -well plate with replicates. cell titer glo (promega) was used to measure cell viability. data is presented as mean ± standard deviation. drug sensitivity prediction therapeutic sensitivity and gene expression data were accessed through the cancer therapeutics response portal (https://portals. broadinstitute.org/ctrp/). 
[ ] [ ] [ ] gene signature scores were calculated for each cell line in the dataset using the single sample gene set enrichment analysis projection (ssgseaprojection) module on genepattern (https://cloud.genepattern.org). the gene signature score was then correlated with area under the curve (auc) values for drug sensitivity for each compound tested. the correlation r-value was plotted and statistical analyses were corrected for multiple testing.

crispr screening and data analysis

whole-genome crispr-cas loss-of-function screening was performed with the human crispr knockout pooled library (brunello), which was a gift from david root and john doench (addgene # ). the library was used following the instructions on the addgene website (https://www.addgene.org/pooled-library/broadgpp-human-knockout-brunello). briefly, the library was stably transduced into gscs by lentiviral infection with a multiplicity of infection (moi) around . - . . after puromycin selection, cells were propagated in either standard sphere cell culture conditions or in a d tetra-culture system. after days, genomic dna was extracted from gscs and the sequencing library was generated using the protocol on the addgene website (https://media.addgene.org/cms/filer_public/ / / f - - a -b c -e a ecf f /broadgpp-sequencing-protocol. pdf). sequencing quality control was performed using fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) and enrichment and dropout were calculated using the mageck-vispr pipeline with mageck-mle.

in vivo tumorigenesis assays

intracranial xenografts were generated by implanting , patient-derived gscs (cw ) following treatment with sgrnas targeting pag or znf or a sgcont into the right cerebral cortex of nsg mice (nod.cg-prkdcscid il rgtm wjl/szj, the jackson laboratory, bar harbor, me, usa) at a depth of . mm under a university of california, san diego institutional animal care and use committee (iacuc) approved protocol. 
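The drug sensitivity prediction step described earlier in this section (correlating a per-cell-line signature score with auc values for each compound, then correcting for multiple testing) can be illustrated with a minimal sketch. The text does not name the correlation statistic or the correction procedure, so the use of a pearson r and the benjamini-hochberg adjustment here are assumptions, and the function names are hypothetical. The p values are taken as given inputs.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values (assumed correction method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical inputs: signature score and drug auc across four cell lines,
# plus one p value per compound tested.
scores = [0.1, 0.4, 0.5, 0.9]
auc = [12.0, 9.5, 9.0, 5.0]
r = pearson_r(scores, auc)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A negative r in this sketch would mean that cell lines with a higher signature score have a lower auc for that compound.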
all murine experiments were performed under an animal protocol approved by the university of california, san diego iacuc. healthy, wild-type male or female mice of nsg background, - weeks old, were randomly selected and used in this study for intracranial injection. mice had not undergone prior treatment or procedures. mice were maintained on a h light/ h dark cycle by animal husbandry staff with no more than mice per cage. experimental animals were housed together. housing conditions and animal status were supervised by a veterinarian. animals were monitored until neurological signs were observed, at which point they were sacrificed. neurological signs or signs of morbidity included hunched posture, gait changes, lethargy and weight loss. survival was plotted using kaplan-meier curves with statistical analysis using a log-rank test. subcutaneous xenografts were established by implanting million luciferase-labeled cw gscs into the right flank of nsg mice, and the mice were maintained as described above. two weeks after implantation, treatment was initiated with mg/kg of ifosfamide (hy- , medchemexpress) dissolved in % safflower oil (spectrum laboratory products) and % dmso or vehicle alone by μl intraperitoneal injection once per day for days. luminescence signal was assessed at days , , , , and after initiation of treatment using bioluminescence imaging following intraperitoneal injection of luciferin reagent. tumor size was normalized to the day time point for each mouse individually.

statistical analysis

statistical analysis parameters are provided in each figure legend. multiple-group comparisons were made by one-way anova with tukey's post-hoc analysis (graphpad prism). p < . was designated as the threshold value for statistical significance. all data were displayed as mean values with error bars representing standard deviation. 
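The per-mouse normalization of tumor luminescence described in the subcutaneous xenograft protocol above can be sketched as follows. This is only an illustration: the reference day is elided in the text, so `baseline_day` is a hypothetical parameter, and the function name and example readings are assumptions, not reported measurements.

```python
def normalize_to_baseline(signal_by_day, baseline_day=0):
    """Divide each luminescence reading for one mouse by its own baseline reading.

    signal_by_day maps a day number to the raw signal for a single mouse.
    baseline_day is an assumed placeholder for the reference time point.
    """
    baseline = signal_by_day[baseline_day]
    if baseline <= 0:
        raise ValueError("baseline signal must be positive")
    return {day: value / baseline for day, value in signal_by_day.items()}


# Hypothetical readings for one mouse.
mouse = {0: 200.0, 7: 100.0, 14: 50.0}
relative = normalize_to_baseline(mouse)
```

Normalizing each animal to its own starting signal removes differences in initial tumor burden, so that treated and vehicle groups are compared on relative change.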
all raw sequencing data and selected processed data are available on geo at the accession number gse (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse ). there are no restrictions on data availability, and all data will be made available upon request directed to the corresponding authors. all biological materials used in this manuscript will be made available upon request to the corresponding authors. human patient-derived gscs may be distributed following completion of a material transfer agreement (mta) with the appropriate institutions if allowed. all computational algorithms utilized in the manuscript have been referenced in the corresponding figure legend and described in the methods section. additional details can be made available upon request.

co-evolution of tumor cells and their microenvironment project drive: a compendium of cancer dependencies and synthetic lethal relationships uncovered by large-scale, deep rnai screening the cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity next-generation characterization of the cancer cell line encyclopedia the landscape of cancer cell line metabolism patient-derived xenograft models: an emerging platform for translational cancer research organogenesis in a dish: modeling development and disease using organoid technologies organoids as an in vitro model of human development and disease sequential cancer mutations in cultured human intestinal stem cells modeling colorectal cancer using crispr-cas -mediated engineering of human intestinal organoids a living biobank of breast cancer organoids captures disease heterogeneity brca-deficient mouse mammary tumor organoids to study cancer-drug resistance human primary liver cancer-derived organoid cultures for disease modeling and drug screening ductal pancreatic cancer modeling and drug screening using human pluripotent stem cell-and patient-derived tumor organoids a three-dimensional organoid culture system 
derived from human glioblastomas recapitulates the hypoxic gradients and cancer stem cell heterogeneity of tumors found in vivo organoids in cancer research organoid cultures derived from patients with advanced prostate cancer modeling patient-derived glioblastoma with cerebral organoids modeling physiological events in d vs. d cell culture the microenvironmental landscape of brain tumors d bioprinting of tissues and organs bioprinting for cancer research d-bioprinted mini-brain: a glioblastoma model to study cellular interactions and therapeutics a bioprinted human-glioblastoma-on-a-chip for the identification of patient-specific responses to chemoradiotherapy chitosan-alginate d scaffolds as a mimic of the glioma tumor microenvironment elucidating the mechanobiology of malignant brain tumors using a brain matrix-mimetic hyaluronic acid hydrogel platform brain-mimetic d culture platforms allow investigation of cooperative effects of extracellular matrix features on therapeutic resistance in glioblastoma engineering tumors with d scaffolds differential response of patient-derived primary glioblastoma cells to environmental stiffness d bioprinting of functional tissue models for personalized drug screening and in vitro disease modeling rapid d bioprinting of decellularized extracellular matrix with regionally varied mechanical properties and biomimetic microarchitecture deterministically patterned biomimetic human ipsc-derived hepatic model via rapid d bioprinting dissecting and rebuilding the glioblastoma microenvironment with engineered materials hyaluronan and hyaluronectin in the extracellular matrix of human brain tumour stroma tumor stem cells derived from glioblastomas cultured in bfgf and egf more closely mirror the phenotype and genotype of primary tumors than do serum-cultured cell lines transcription elongation factors represent in vivo cancer dependencies in glioblastoma chromatin landscapes reveal developmentally encoded transcriptional states that 
define human glioblastoma modeling neuro-immune interactions during zika virus infection possible involvement of the m anti-inflammatory macrophage phenotype in growth of human gliomas glioma grade is associated with the accumulation and activity of cells bearing m monocyte markers single-cell profiling of human gliomas reveals macrophage ontogeny as a basis for regional differences in macrophage activation in the tumor microenvironment decoupling genetics, lineages, and microenvironment in idh-mutant gliomas by single-cell rna-seq neuronal activity promotes glioma growth through neuroligin- secretion electrical and synaptic integration of glioma into neural circuits targeting neuronal activity-regulated neuroligin- dependency in high-grade glioma tumor vascular permeability, accumulation, and penetration of macromolecular drug carriers cbtrus statistical report: primary brain and other central nervous system tumors diagnosed in the united states in - radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma correlating chemical sensitivity and basal gene expression reveals mechanism of action harnessing connectivity in a large-scale small-molecule sensitivity dataset an interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules lincs data portal . 
: next generation access point for perturbation-response signatures regulation of glioma cell phenotype in d matrices by hyaluronic acid hypoxic induction of vasorin regulates notch turnover to maintain glioma stem-like cells an id -dependent mechanism for vhl inactivation in cancer glutamatergic synaptic input to glioma cells drives brain tumour progression synaptic proximity enables nmdar signalling to promote brain metastasis neural precursor-derived pleiotrophin mediates subventricular zone invasion by glioma activation of notch signaling by tenascin-c promotes growth of human brain tumor-initiating cells a tension-mediated glycocalyx-integrin feedback loop promotes mesenchymal-like glioblastoma tumour-associated macrophages secrete pleiotrophin to promote ptprz signalling in glioblastoma stem cells for tumour growth macrophage-associated pgk phosphorylation promotes aerobic glycolysis and tumorigenesis a glial signature and wnt signaling regulate glioma-vascular interactions and tumor microenvironment ephrinb drives perivascular invasion and proliferation of glioblastoma stem-like cells osteopontin-cd signaling in the glioma perivascular niche enhances cancer stem cell phenotypes and promotes aggressive tumor growth cancer stem cells: the architects of the tumor ecosystem photocrosslinked hyaluronic acid hydrogels: natural, biodegradable tissue engineering scaffolds precise tuning of facile one-pot gelatin methacryloyl (gelma) synthesis robust and highly-efficient differentiation of functional monocytic cells from human pluripotent stem cells under serum-and feeder cellfree conditions a model for neural development and treatment of rett syndrome using human induced pluripotent stem cells selective blockade of the lyso-ps lipase abhd stimulates immune responses in vivo salmon provides fast and bias-aware quantification of transcript expression gencode reference annotation for the human and mouse genomes differential analyses for rna-seq: transcript-level 
estimates improve gene-level inferences moderated estimation of fold change and dispersion for rna-seq data with deseq reconstructing and reprogramming the tumor-propagating potential of glioblastoma stem-like cells limma powers differential expression analyses for rnasequencing and microarray studies gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles pgc- alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes enrichment map: a network-based method for gene-set enrichment visualization and interpretation tumor evolution of glioma-intrinsic gene expression subtypes associates with immunological changes in the microenvironment optimized sgrna design to maximize activity and minimize off-target effects of crispr-cas quality control, modeling, and visualization of crispr screens with mageck-vispr mageck enables robust identification of essential genes from genome-scale crispr/cas knockout screens supplementary information accompanies this paper at https://doi.org/ . / s - - - . competing interests: a.r.m. is a co-founder and has equity interest in tismoo, a company dedicated to genetic analysis focusing on therapeutic applications customized for the autism spectrum disorder and other neurological disorders origin genetics. the terms of this arrangement have been reviewed and approved by the university of california, san diego in accordance with its conflict of interest policies. the remaining authors declare no potential conflicts of interest. 
key: cord- -ezrkg dc authors: myerson, jacob w.; patel, priyal n.; habibi, nahal; walsh, landis r.; lee, yi-wei; luther, david c.; ferguson, laura t.; zaleski, michael h.; zamora, marco e.; marcos-contreras, oscar a.; glassman, patrick m.; johnston, ian; hood, elizabeth d.; shuvaeva, tea; gregory, jason v.; kiseleva, raisa y.; nong, jia; rubey, kathryn m.; greineder, colin f.; mitragotri, samir; worthen, george s.; rotello, vincent m.; lahann, joerg; muzykantov, vladimir r.; brenner, jacob s. title: supramolecular organization predicts protein nanoparticle delivery to neutrophils for acute lung inflammation diagnosis and treatment date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: ezrkg dc acute lung inflammation has severe morbidity, as seen in covid- patients. lung inflammation is accompanied or led by massive accumulation of neutrophils in pulmonary capillaries (“margination”). we sought to identify nanostructural properties that predispose nanoparticles to accumulate in pulmonary marginated neutrophils, and therefore to target severely inflamed lungs. we designed a library of nanoparticles and conducted an in vivo screen of biodistributions in naive mice and mice treated with lipopolysaccharides. we found that supramolecular organization of protein in nanoparticles predicts uptake in inflamed lungs. specifically, nanoparticles with agglutinated protein (naps) efficiently home to pulmonary neutrophils, while protein nanoparticles with symmetric structure (e.g. viral capsids) are ignored by pulmonary neutrophils. we validated this finding by engineering protein-conjugated liposomes that recapitulate nap targeting to neutrophils in inflamed lungs. we show that naps can diagnose acute lung injury in spect imaging and that nap-like liposomes can mitigate neutrophil extravasation and pulmonary edema arising in lung inflammation. finally, we demonstrate that ischemic ex vivo human lungs selectively take up naps, illustrating translational potential. 
this work demonstrates that structure-dependent interactions with neutrophils can dramatically alter the biodistribution of nanoparticles, and naps have significant potential in detecting and treating respiratory conditions arising from injury or infections. the covid- pandemic tragically illustrates the dangers of acute inflammation and infection of the lungs, for both individuals and societies. acute alveolar inflammation causes the clinical syndrome known as acute respiratory distress syndrome (ards), in which inflammation prevents the lungs from oxygenating the blood. severe ards is the cause of death in most covid- mortality and was a major cause of death in the influenza epidemic, but ards is common even outside of epidemics, affecting ~ , american patients per year with a ~ - % mortality rate. - ards is caused not just by viral infections, but also by sepsis, pneumonia (viral and bacterial), aspiration, and trauma. , largely because ards patients have poor tolerance of drug side effects, no pharmacological strategy has succeeded as an ards treatment. , [ ] [ ] [ ] therefore, there is an urgent need to develop drug delivery strategies that specifically target inflamed alveoli in ards and minimize systemic side effects. neutrophils are "first responder" cells in acute inflammation, rapidly adhering and activating in large numbers in inflamed vessels and forming populations of "marginated" neutrophils along the vascular lumen. [ ] [ ] [ ] [ ] [ ] [ ] [ ] neutrophils can be activated by a variety of initiating factors, including pathogen-and damage-associated molecular patterns such as bacterial lipopolysaccharides (lps). , after acute inflammatory insults, neutrophils marginate in most organs, but by far most avidly in the lung capillaries. , , , , neutrophils are therefore key cell types in most forms of ards. 
in ards, marginated neutrophils can secrete tissue-damaging substances (proteases, reactive oxygen species) and extravasate into the alveoli, leading to disruption of the endothelial barrier and accumulation of neutrophils and edematous fluid in the air space of the lungs ( figure a ). , , , [ ] [ ] [ ] targeted nanoparticle delivery to marginated neutrophils could provide an ards treatment with minimal side effects, but specific delivery to marginated neutrophils remains an open challenge. antibodies against markers such as ly g have achieved targeting to neutrophils in mice, but also deplete populations of circulating neutrophils. [ ] [ ] [ ] [ ] additionally, while ly g readily marks neutrophils in mice, there is no analogous specific and ubiquitous marker on human neutrophils. therefore, antibody targeting strategies have not been widely adopted for targeted drug delivery to these cells. as another route to neutrophil targeting, two previous studies noted that activated neutrophils take up denatured and agglutinated bovine albumin, concluding that denatured protein was critical in neutrophil-particle interactions. , nanoparticle structural properties such as shape, size, and deformability can define unique targeting behaviors. [ ] [ ] [ ] [ ] [ ] here, we screened a diverse panel of nanoparticles to determine the nanostructural properties that predict uptake in pulmonary marginated neutrophils during acute inflammation. as a high-throughput animal model for ards, we administered lps to mice, causing a massive increase in pulmonary marginated neutrophils. we show that two initial leads in our screen, lysozyme-dextran nanogels (ldngs) and crosslinked albumin nanoparticles (anps), selectively home to marginated neutrophils in inflamed lungs, but not naïve lungs. 
in our subsequent screen of over diverse nanoparticles, we find that protein nanoparticles, all defined by agglutination of protein in amorphous nanostructures (nanoparticles with agglutinated proteins, naps), but not by denatured protein, have specificity for lps-inflamed lungs. we demonstrate that, in contrast to naps, three symmetric protein nanostructures (viruses/nanocages) have biodistributions unaffected by lps injury. we show that polystyrene nanoparticles and five liposome formulations do not accumulate in injured lungs, indicating that nanostructures that are not based on protein are not intrinsically drawn to marginated neutrophils in acute inflammation. we then engineered liposomes (the most clinically relevant nanoparticle drug carriers) as naps, through conjugation to protein modified with hydrophobic cyclooctynes, encouraging protein agglutination on the liposome surface by hydrophobic interactions. we thus show that supramolecular organization of proteins, rather than chemical composition, best predicts uptake in marginated neutrophils in acutely inflamed lungs. we then demonstrate proof of concept for naps as diagnostic and therapeutic tools for ards. we show that: a) in-labeled naps provide diagnostic imaging contrast that distinguishes inflammatory lung injury from cardiogenic pulmonary edema; b) nap-liposomes can significantly ameliorate edema in a mouse model of severe ards; c) naps, but not crystalline protein nanostructures, accumulate in ex vivo human lungs rejected for transplant due to ards-like conditions. collectively, our results demonstrate that supramolecular organization of protein, namely protein agglutination, predicts strong, intrinsic nanoparticle tropism for marginated neutrophils. this finding indicates that naps, encompassing a wide range of nanoparticles based on or incorporating protein, have biodistributions that are responsive to inflammation. 
naps could be useful beyond ards, since marginated neutrophils play a pathogenic role in a diverse array of inflammatory diseases, including infections, heart attack, and stroke. [ ] [ ] [ ] [ ] [ ] but our findings provide a clear path forward for using naps to improve diagnosis and treatment of ards. to quantify the increase in pulmonary marginated neutrophils after inflammatory lung injury, radiolabeled clone a anti-ly g antibody (specific for mouse neutrophils) was administered intravenously (iv) to determine the location and concentration of neutrophils in mice. iv injection of lps subjected mice to a model of mild ards. accumulation of anti-ly g antibody in the lungs was dramatically affected by iv lps, with . % of injected antibody adhering in lps-injured lungs, compared to . % of injected antibody in naïve control lungs ( figure b) . agreeing with previous studies addressing the role of neutrophils in systemic inflammation, biodistributions of anti-ly g antibody indicated that systemic lps injury profoundly increased the concentration of neutrophils in the lungs. , , , single cell suspensions prepared from mouse lungs were probed by flow cytometry to further characterize pulmonary neutrophils in naïve mice and in mice following lps-induced inflammation. to identify intravascular populations of leukocytes, mice received iv fluorescent cd antibody five minutes prior to sacrifice. single cell suspensions prepared from iv cd -stained lungs were then stained with anti-ly g antibody to identify neutrophils. a second stain of single cell suspensions with cd antibody indicated the total population of leukocytes in the lungs, distinct from the intravascular population indicated by iv cd . flow cytometry showed greater concentrations of neutrophils in lps-injured lungs, compared to naïve lungs ( figure c , counts above horizontal threshold indicate positive staining for neutrophils, figure d , rightmost peak indicates positive staining for neutrophils). 
comparison of ly g stain to total cd -positive cells indicated . % of leukocytes in the lungs were ly g-positive after lps injury, compared to . % in the naïve control ( figure d , center panel). comparison of ly g stain to iv cd stain indicated that the majority of neutrophils were intravascular, in both naïve and lps-injured mice. in naïve mice, . % of neutrophils were intravascular and in lps-injured mice, . % of neutrophils were intravascular ( figure d , right panel). the presence of large populations of intravascular neutrophils following inflammatory injury is consistent with previously published observations. , , , , histological analysis confirmed results obtained with flow cytometry and radiolabeled anti-ly g biodistributions. staining of lung sections indicated increased concentration of neutrophils in the lungs following iv lps injury ( figure e , left panels). co-registration of neutrophil staining with tissue autofluorescence (indicating tissue architecture) broadly supported the finding that pulmonary neutrophils reside in the vasculature ( figure e , right panels). previous work has traced the neutrophil response to bacteria in the lungs, determining that pulmonary neutrophils pursue and engulf active bacteria following either intravenous infection or infection of the airspace in the lungs. , , we injected heat-inactivated, oxidized, and fixed e. coli in naïve and iv-lps-injured mice. with the bacteria stripped of their functional behavior, e. coli did not accumulate in the lungs of naïve control mice ( . % of initial dose in the lungs, blue bars in figure f ). however, pre-treatment with lps to recapitulate the inflammatory response to infection led to enhanced accumulation of the deactivated e. coli in the lungs ( . % of initial dose in the lungs, red bars in figure f ). with e. coli structure maintained but e. coli function removed, the inactivated bacteria were taken up more avidly in lungs primed by an inflammatory injury. 
to identify nanostructural parameters that correlate with nanoparticle uptake in inflamed lungs, we conducted an in vivo screen of a diverse array of nanoparticle drug carriers. the screen was based on the method used above for tracing inactivated bacteria: inject radiolabeled nanoparticles into mice and measure biodistributions, comparing naïve with iv-lps mice. to validate that the radiotracing screen would measure uptake in pulmonary marginated neutrophils, we more fully characterized the in vivo behavior of two early hits in the screen. lysozyme-dextran nanogels (ldngs, ngs) and poly(ethylene glycol) (peg)-crosslinked albumin nps have been characterized as targeted drug delivery agents in previous work. [ ] [ ] [ ] here, ldngs ( . ± . nm diameter, . ± . pdi, supplementary figure a ) and peg-crosslinked human albumin nps ( . ± . nm diameter, . ± . pdi, supplementary figure b) were administered in naïve and iv-lps-injured mice. neither np was functionalized with antibodies or other affinity tags. the protein component of each particle was labeled with i for tracing in biodistributions, and assessed minutes after iv administration of nps. both absolute ldng lung uptake and ratio of lung uptake to liver uptake registered a ~ -fold increase between naïve control and lps-injured animals (figure a , supplementary table ) . specificity for lps-injured lungs was recapitulated with peg-crosslinked human albumin nps. albumin nps accumulated in naïve lungs at . % injected dose per gram organ weight (%id/g), and in lps-injured lungs at . %id/g, accounting for a -fold increase in lung uptake after intravenous lps insult ( figure b , supplementary table ) . single cell suspensions were prepared from lungs after administration of fluorescent ldngs or peg-crosslinked albumin nps. flow cytometric analysis of cells prepared from lungs after np administration enabled identification of cell types with which nps associated. 
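the biodistribution metrics used throughout this section (percent injected dose per gram, organ-to-organ localization ratios such as lungs:liver, and injured-versus-naïve fold change) reduce to simple arithmetic. the following is a minimal python sketch, assuming radioactivity is read out as raw counts for the injected dose and for each harvested organ; the function names and all numbers are hypothetical and are not taken from this study.

```python
def percent_id_per_gram(organ_counts, injected_counts, organ_mass_g):
    """Percent of the injected dose recovered per gram of organ (%ID/g)."""
    if injected_counts <= 0 or organ_mass_g <= 0:
        raise ValueError("injected counts and organ mass must be positive")
    return 100.0 * organ_counts / injected_counts / organ_mass_g


def localization_ratio(target_id_per_g, reference_id_per_g):
    """Ratio of uptake in a target organ to a reference organ (e.g. lungs:liver)."""
    if reference_id_per_g <= 0:
        raise ValueError("reference uptake must be positive")
    return target_id_per_g / reference_id_per_g


def fold_change(injured_value, naive_value):
    """Fold change of a metric between injured and naive animals."""
    if naive_value <= 0:
        raise ValueError("naive value must be positive")
    return injured_value / naive_value


# Illustrative counts only (hypothetical, not measured data).
naive_lung = percent_id_per_gram(2_000, 1_000_000, 0.2)
injured_lung = percent_id_per_gram(20_000, 1_000_000, 0.2)
lung_fold = fold_change(injured_lung, naive_lung)
```

reporting both the absolute %id/g and the lungs:liver ratio, as the text does, guards against a shift that merely reflects altered clearance by the liver rather than increased retention in the lungs.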
firstly, the total number of cells containing ldngs or albumin nps increased between naïve and lps-injured lungs. in naïve control lungs, . ly g stain for neutrophils indicated that the bulk of ldng and albumin np accumulation in lps-injured lungs could be accounted for by uptake in neutrophils. in figure c and d, counts above the horizontal threshold indicate neutrophils and counts to the right of the vertical threshold indicate cells containing ldngs ( figure c ) or albumin nps ( figure d ). in iv-lps-injured lungs, ldng and albumin np uptake was dominated by neutrophils ( figure c , figure d , upper right quadrants indicate np-positive neutrophils). in lps-injured lungs, the majority of neutrophils, > % of cells, contained significant quantities of nanoparticles, compared to < % in naïve lungs. likewise, the majority of nanoparticle uptake in the lungs (> %) was accounted for by nanoparticle uptake in neutrophils ( figure e , f, g, h, supplementary table ) . for np uptake not accounted for by neutrophils, cd staining indicated that the remaining np uptake was attributable to other leukocytes. co-localization of albumin np fluorescence with cd stain showed that . % of albumin np uptake was localized to leukocytes in naïve lungs and . % of albumin np uptake was localized to leukocytes in injured lungs (supplementary figure c, supplementary figure d ). for ldngs, localization to neutrophils in injured lungs was confirmed via histology. ly g staining of lps-injured lung sections confirmed colocalization of fluorescent nanogels with neutrophils in the lung vasculature ( figure i ). slices in confocal images of lung sections indicated that ldngs were inside neutrophils ( figure j ). intravital imaging of injured lungs allowed real-time visualization of ldng uptake in leukocytes in injured lungs. ldng fluorescent signal accumulated over minutes and reliably colocalized with cd staining for leukocytes ( figure k , supplementary movie ). 
ldng pharmacokinetics were evaluated in naïve and iv-lps-injured mice (supplementary figure ) . in both naïve and injured mice, bare ldngs were rapidly cleared from the blood with a distribution half-life of ~ minutes. in naïve mice, transient retention of ldngs in the lungs ( . %id/g at five minutes after injection) leveled off over one hour. in iv-lps-treated mice, ldng concentration in the lungs reached a peak value at minutes after injection, as measured either by absolute levels of lung uptake or by lungs:blood localization ratio. ldng biodistributions were also assessed in mice undergoing alternative forms of lps-induced inflammation. intratracheal (it) instillation of lps led to concentration of ldngs in the lungs at . %id/g. liver and spleen ldng uptake was also reduced following it lps injury, leading to a -fold increase in the lungs:liver ldng localization ratio induced by it lps injury (supplementary figure ) . as with iv lps injury, it lps administration leads to neutrophil-mediated vascular injury focused in the lungs. mice were administered lps via footpad injection to provide a model of systemic inflammation originating in lymphatic drainage. ldng uptake in the lungs and in the legs was enhanced by footpad lps administration. at hours after footpad lps administration, ldngs concentrated in the lungs at . %id/g, an -fold increase over naïve. at hours, ldngs concentrated in the lungs at . %id/g (supplementary figure a) . total ldng accumulation in the legs accounted for . % of initial dose (%id) in naïve mice, . %id in mice hours after footpad lps injection, and . %id at hours after footpad injection (supplementary figure b) , indicating ldngs can concentrate in inflamed vasculature outside the lungs. previous work has indicated that nps based on denatured albumin accumulate in neutrophils in inflamed lungs and at sites of acute vascular injury, whereas nps coated with native albumin do not. 
, we have characterized lysozyme-dextran nanogels and crosslinked human albumin nps with circular dichroism (cd) spectroscopy to compare secondary structure of proteins in the nps to secondary structure of the native component proteins (supplementary figure a-b) . identical cd spectra were recorded for ldngs vs. lysozyme and for albumin nps vs. human albumin. deconvolution of the cd spectra via neural network algorithm trained against a library of cd spectra for known structures verified that secondary structure composition of lysozyme and albumin was unchanged by incorporation of the proteins in the nps. free protein and protein nps were also probed with -anilino- naphthalenesulfonic acid (ansa), previously established as a tool for determining the extent to which hydrophobic domains are exposed on proteins. consistent with known structures of the two proteins, ansa staining indicated few available hydrophobic domains on lysozyme and substantial hydrophobic exposure on albumin (supplementary figure c -d, blue curves). ldngs had increased hydrophobic accessibility vs. native lysozyme whereas albumin nps had reduced hydrophobic accessibility compared with native albumin. therefore, our data indicate that lysozyme and albumin are not denatured in ldngs and albumin nps, but the nps composed of the two proteins present a balance of hydrophobic and hydrophilic surfaces differing from the native proteins. the previous section demonstrates that two different nanoparticles based on protein, shown not to be denatured in cd spectroscopy studies, have uptake in lps-inflamed lungs driven by uptake in marginated neutrophils. we next undertook a broader study considering how aspects of np structure including size, composition, surface chemistry, and structural organization impact np uptake in lps-injured lungs. 
as examples of different types of protein nps, variants of ldngs (representing nps based on hydrophobic interactions between proteins), crosslinked protein nps, and nps based on electrostatic interactions between proteins were traced in naïve control and iv-lps-injured mice. as examples of nps based on site-specific protein interactions (rather than site-indiscriminate interactions leading to crosslinking, gelation, or charge-based protein nps), we also traced viruses and ferritin nanocages in naïve and lps-treated mice. liposomes and polystyrene nps were studied as examples of lipid and polymeric nanostructures.
nanoparticles based on hydrophobic protein interactions
ldng size was varied by modifying lysozyme-dextran composition of the nps and ph at which particles were formed. figure , all sizes of ldngs accumulated in lps-injured lungs at higher concentrations than in naïve lungs, with accumulation in injured lungs reaching ~ % of initial dose for all types of ldngs (supplementary table ). variations in size and composition of ldngs therefore did not affect ldng specificity for lps-injured lungs. expanding on data with peg-nhs ester-crosslinked human serum albumin particles, we varied the geometry and protein composition of nps based on peg-nhs protein crosslinking. human serum albumin nanorods (aspect ratio : ), bovine serum albumin nps ( . table ). lysozyme nps accumulated in naïve lungs at a uniquely high concentration of . %id/g, compared to . %id/g in inflamed lungs. degree of uptake in injured lungs, along with injured vs. naïve contrast, did vary with protein np composition. however, acute inflammatory injury resulted in a minimum three-fold increase in lung uptake for all examined crosslinked protein nps, excluding crosslinked lysozyme, which still accumulated in injured lungs at a high concentration ( . % of initial dose). 
we traced recently-developed poly(glutamate) tagged green fluorescent protein (e-gfp) nps, representing a third class of protein np based on electrostatic interactions between proteins and carrier polymer or metallic particles. negatively-charged e-gfp was paired to arginine-presenting gold nanoparticles ( . ± . nm diameter, pdi . ± . ) or to poly(oxanorborneneimide) (poni) functionalized with guanidino and tyrosyl side chains ( . ± . nm diameter, pdi . ± . ) (supplementary figure d) . for biodistribution experiments with poni/e-gfp hybrid nps, tyrosine-bearing poni was labeled with i and e-gfp was labeled with i, allowing simultaneous tracing of each component of the hybrid nps. the two e-gfp nps, with structure based on charge interactions, had specificity for iv lps-injured lungs. comparing uptake in lps-injured lungs to naïve lungs, we observe an lps:naïve ratio of . for poni/e-gfp nps as traced by the poni component, . for poni/e-gfp nps as traced by the e-gfp component, and . for au/e-gfp nps ( figure c , supplementary figure ). poni/e-gfp particles, specifically, accumulated in lps-injured lungs at . % initial dose as measured by poni tracing and . % initial dose as measured by gfp tracing, indicating effective co-delivery in the inflamed organ. acute inflammatory injury therefore resulted in a two- to three-fold increase in pulmonary uptake of nps constructed via electrostatic protein interactions.
nanoparticles based on symmetric protein organization
adeno-associated virus (aav), adenovirus, and horse spleen ferritin nanocages were employed as examples of protein-based nps with highly symmetrical structure (see supplementary figure d for dls confirmation of structure). [ ] [ ] [ ] for each of these highly ordered protein nps, iv lps injury had no significant effect on biodistribution and levels of uptake in the injured lungs were minimal ( figure d table ). 
therefore, highly ordered protein nps traced in our studies did not have tropism for the lungs after acute inflammatory injury. liposomes and polystyrene nps were studied as example nps that are not structurally based on proteins. dota chelate-containing lipids were incorporated into bare liposomes, allowing labeling with in tracer for biodistribution studies. carboxylate polystyrene nps were coupled to trace amounts of i-labeled igg via edci-mediated carboxy-amine coupling. liposomes had a diameter of . ± . nm (pdi . ± . ) and igg-polystyrene nps had a diameter of . ± . nm (pdi . ± . ) (supplementary figure c -d). liposomes accumulated in inflamed lungs at a concentration of . %id/g, accounting for no significant change against naïve lungs. lps injury actually induced a reduction in the lungs:liver metric, from . for naïve mice to . for lps-injured mice. polystyrene nps accumulated in inflamed lungs at . %id/g ( . % initial dose), so iv lps injury did in fact induce increased levels of np uptake in the lungs, from a concentration of . %id/g in the naïve lungs ( figure e, supplementary figure ). however, neither bare liposomes nor polystyrene nps were drawn to lps-injured lungs in significant concentrations. notably, isolated proteins did not themselves home to lps-inflamed lungs. we traced radiolabeled albumin, lysozyme, and transferrin in naïve control and iv lps-injured mice (supplementary figure , supplementary table ). in injured mice, albumin, lysozyme, and transferrin localized to the lungs at low concentrations and no significant differences were recorded when comparing naïve to lps-injured lung uptake. the data presented in figure and supplementary figures - indicate that a variety of protein-based nanostructures have tropism for acute inflammatory injury in the lungs. 
nps based on agglutination of proteins in non-site-specific interactions (naps, figure a -c, supplementary figures - ) all exhibited either significant increases in lung uptake after lps injury or high levels of lung uptake in both naïve control and lps-injured animals. nanostructures based on highly symmetrical protein organization had no specific tropism for inflamed lungs ( figure d ). representative nanostructures not based on proteins, bare liposomes and polystyrene beads, did not home to inflamed lungs ( figure e ). we next engineered naps from liposomes, a nanoparticle shown above to have no intrinsic neutrophil tropism. our methods for engineering nap-like liposomes serve to validate the finding that supramolecular organization of protein in nanoparticles predicts neutrophil tropism. liposomes were functionalized with rat igg conjugated via sata-maleimide chemistry (sata-igg liposomes) or via recently demonstrated copper-free click chemistry methods. briefly, click chemistry methods entailed nhs-ester conjugation of an excess of strained alkyne (dibenzocyclooctyne, dbco) to igg, followed by reaction of the dbco-functionalized igg with liposomes containing peg-azide-terminated lipids (dbco-igg liposomes, figure a ). dbco-igg liposomes had a diameter of . ± . nm and a pdi of . ± . and sata-igg liposomes had a diameter of . ± . nm and a pdi of . ± . (supplementary figure c) . in mice subjected to iv-lps, sata-igg liposomes accumulated in the lungs at a concentration of . %id/g ( figure b , yellow bars). dbco-igg liposomes, by contrast, concentrated in the lungs at . %id/g, corresponding to . % of initial dose and roughly matching the accumulation of nm ldngs in the inflamed lungs ( figure b , brown bars). for comparison, bare liposomes, as in figure e , concentrated in the inflamed lungs at . %id/g ( figure b , green bars). for dbco-igg liposomes, the inflamed vs. naïve lung uptake accounted for a twelve-fold change. 
dbco-igg liposomes specifically accumulated in injured lungs, whereas sata-igg liposomes and bare liposomes did not (supplementary figure , supplementary table ) . it lps instillation also led to elevated concentrations of dbco-igg liposomes in the lungs. biodistributions of the dbco-igg liposomes indicated a pulmonary concentration of . %id/g at hour after it lps, . %id/g at hours after it lps, and . %id/g at hours after it lps (supplementary figure ). even at early time points after direct pulmonary lps insult, dbco-igg liposomes accumulated in the inflamed lungs. results in figure b were obtained by introducing a -fold molar excess of nhs-ester-dbco to rat igg before dbco-igg conjugation to liposomes (dbco( x)-igg liposomes). optical density quantification of dbco indicated ~ dbco per igg following reaction of dbco and igg at : molar ratio (supplementary figure ) . to test the hypothesis that dbco functions as a tag that modifies dbco-igg liposomes for neutrophil affinity in settings of inflammation, we varied the concentration of dbco on igg prepared for conjugation to azide liposomes. dbco was added to igg at -fold, five-fold, and . -fold molar excesses. a -fold molar excess resulted in ~ dbco per igg, a -fold molar excess resulted in ~ dbco per igg, and a . -fold molar excess resulted in ~ dbco per igg (supplementary figure ) . igg with different dbco loading concentrations was conjugated to azide liposomes. dbco-igg liposomes had similar sizes across all dbco concentrations (supplementary figure c) , with diameters of ~ nm and pdis < . . the different types of dbco-igg liposomes were each traced in iv-lps injured mice. titrating the quantity of dbco on dbco-igg liposomes indicated that liposome accumulation in the lungs of injured mice was dependent on dbco concentration on the liposome surface. concentration of dbco-igg liposomes in inflamed lungs attenuated with decreasing dbco concentration on igg (supplementary table , figure c ). 
therefore, only igg with high concentrations of dbco served as a tag for modifying the surface of liposomes for specificity to pulmonary injury. flow cytometry verified the specificity of dbco-igg liposomes for neutrophils in injured lungs ( figure d -e). as with ldngs and albumin nps in figure c -h, single cell suspensions were prepared from lps-inflamed and naïve control lungs after circulation of fluorescent dbco-igg liposomes. confirming the results of biodistribution studies, . % of cells were liposome-positive in naïve lungs, compared to . % of all cells in lps-inflamed lungs (supplementary figure a-b) . dbco-igg liposomes predominantly accumulated in pulmonary neutrophils after iv lps. there were more neutrophils in the injured lungs and a greater fraction of neutrophils took up dbco-igg liposomes in the injured lungs, as compared to the naïve control ( figure d -e). approximately one half of neutrophils in iv lps-injured lungs contained liposomes. dbco-igg liposomes were also highly specific for neutrophils in inflamed lungs, with ~ % of liposome-positive cells in the injured lungs being neutrophils (supplementary table ). the remaining dbco-igg liposome uptake in the lungs was accounted for by other cd -positive cells (supplementary figure c -e). . % of liposome uptake colocalized with cd -positive cells in lps-injured lungs and . % of liposome uptake in the naïve lungs was associated with cd -positive cells. accordingly, less than % of liposome uptake was associated with endothelial cells (supplementary figure f -g). dbco( x)-igg itself did not have specificity for inflamed lungs (supplementary figure ). uptake of dbco( x)-igg in naïve and injured lungs was statistically identical and the biodistribution of the modified igg resembled published results with unmodified igg. 
these results verify that dbco-igg modifies the structure of immunoliposomes, but does not function as a standard affinity tag by acting as a surface motif with intrinsic affinity for neutrophils. indeed, cd spectroscopic and ansa structural characterization of dbco-modified igg and dbco-igg liposomes resembled results obtained for ldngs and crosslinked albumin nps. igg secondary structure, as assessed by cd spectroscopy, was unchanged by dbco modification (supplementary figure a) . deconvolution of cd spectra via neural network algorithm indicated identical structural compositions for dbco( x)-igg, dbco( x)-igg, dbco( x)-igg, dbco( . x)-igg, and unmodified igg, showing that igg was not denatured by conjugation to dbco. ansa was used to probe accessible hydrophobic domains on dbco( x)-igg and dbco( x)-igg liposomes (supplementary figure b) . ansa fluorescence indicated more hydrophobic domains available on dbco( x)-igg liposomes than on dbco( x)-igg itself, resembling results for lysozyme and ldngs. therefore, addition of a hydrophobic moiety to protein on the surface of liposomes led to uptake of the liposomes in pulmonary marginated neutrophils after inflammatory insult. this result indicates that hydrophobic interactions between proteins on the surface of functionalized liposomes, like the protein interactions in naps, predict liposome tropism for marginated neutrophils in inflamed lungs. including nps from our four classes of protein-based nps, two non-protein nps (bare liposomes and polystyrene nps), and five types of igg-coated liposomes, we traced nanoparticles in naïve and inflamed mice. direct assessment of naïve-to-inflamed shifts in lung uptake led us to identify naps with specificity for inflamed lungs. to verify this assessment and derive additional patterns in the broader data set, we undertook linear discriminant and principal components analyses of the biodistribution data for our nanoparticles, along with three isolated proteins. 
grouping the nanoparticles and three proteins according to the classes defined in figure and supplementary figures - , we completed a linear discriminant analysis of the naïve-to-inflamed shift for particle retention in the lungs, blood, liver, and spleen (supplementary figure a) . data for particle uptake in each organ was normalized by subtracting and then dividing by the mean uptake over all particles. the first two eigenvectors, dominated by splenic uptake and a combination of liver and lung uptake, respectively, accounted for % of variation in the data. the resulting projection of the data along the first two linear discriminant analysis eigenvectors was analyzed by k-means clustering to confirm the classes of nanoparticle with specificity for the inflamed lungs (supplementary figure b) . indeed, division of the data into two clusters supported the delineation of the nanoparticles with specificity for inflamed lungs. naps, nanoparticles based on protein gelation, crosslinking, and charge association, all aligned in one cluster. as an exception, dbco( x)-igg liposomes were considered as a unique class of particle and the linear discriminant analysis indicated that the inflammation-specific liposomes had in vivo behavior resembling that of ldngs or poni-gfp nanoparticles. this analysis of the liposome biodistributions supports the classification of dbco( x)-igg liposomes as naps. igg-coated polystyrene nanoparticles and dbco( x)-igg liposomes were part of the k-means cluster without inflammation specificity, but data for these two particles resided close to the voronoi boundary distinguishing the two clusters. principal component analysis comparing normalized nanoparticle uptake in inflamed lungs to normalized retention in liver, spleen, and blood provided a reductive metric to compare the distinct in vivo behavior of nanoparticles in the classes identified by linear discriminant analysis. 
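the normalization described above for the linear discriminant analysis (subtract the mean uptake over all particles, then divide by that mean) and the subsequent division of projected data into two k-means clusters can be sketched as follows. this is a minimal illustration in python using only the standard library, not the analysis code used in this study: it does not perform the linear discriminant projection itself, the function names are hypothetical, and the two-dimensional points stand in for projections along the first two eigenvectors.

```python
from statistics import mean


def normalize_by_mean(values):
    """Subtract the mean over all particles, then divide by that mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean uptake is zero; normalization is undefined")
    return [(v - m) / m for v in values]


def two_means(points, iterations=50):
    """Plain k-means with k = 2 on 2-D points; returns a 0/1 label per point.

    Centroids are initialised at the lexicographically smallest and largest
    points, so the result is deterministic for a given input.
    """
    centroids = [min(points), max(points)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min((0, 1), key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = (mean(x for x, _ in members),
                                mean(y for _, y in members))
    return labels


# Hypothetical values for illustration only.
normalized = normalize_by_mean([1.0, 2.0, 3.0])
clusters = two_means([(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)])
```

dividing by the mean puts organs with very different absolute uptake on a comparable scale, which keeps any single organ from dominating the eigenvectors purely because of its magnitude.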
most variation in the biodistribution data was accounted for by an eigenvector closely aligned to variation in pulmonary uptake (supplementary figure a) . data was projected along that first eigenvector and magnitude of the projection was determined for each nanoparticle (supplementary figure b) . first eigenvector projection values were then grouped according to the classes examined above via linear discriminant analysis. only the classes in the inflammation-specific k-means cluster had positive average first eigenvector projections. all other particle classes had average first eigenvector projections indistinguishable from isolated protein (supplementary figure b) . principal component and linear discriminant analyses of our compiled biodistributions confirmed: a) the identification of naps as nanoparticles with distinct tropism for inflamed lungs and b) the alignment of dbco( x)-igg liposome in vivo behavior with that of other naps. computerized tomography (ct) imaging is a standard diagnostic tool for ards. ct images can identify the presence of edematous fluid in the lungs, but ct cannot distinguish between the two major types of pulmonary edema: non-inflammatory cardiogenic pulmonary edema (cpe) and ards-associated edema. we sought to use naps to distinguish inflammatory lung injury from cpe in diagnostic imaging experiments. we induced cpe in mice via prolonged iv propranolol infusion. edema was confirmed via ct imaging of inflated lungs ex vivo and in situ. three-dimensional reconstructions of chest ct images were partitioned to distinguish airspace and low-density tissue, as in normal lungs (white, yellow, and light orange signal in figure a ), from high-density tissue and edema (red and black/transparent signal in figure a ). 
quantification of ct attenuation and gaps in the reconstructed three-dimensional lung images indicated profuse edema in lungs afflicted with model cardiogenic pulmonary edema ( figure a ). nm ldngs were traced in mice with induced cardiogenic pulmonary edema. ldngs accumulated in the edematous lungs at . %id/g concentration, statistically indistinguishable from lung uptake in naïve mice and an order of magnitude lower than the level of lung uptake in mice treated with iv lps ( figure c ). naïve and iv lps-injured mice were dosed with ldngs labeled with in via chelate conjugation to lysozyme. in uptake in naïve and lps-injured lungs was visualized with ex vivo spect-ct imaging to indicate capacity of ldngs for imaging-based diagnosis of inflammatory lung injury ( figure d ). in signal was colocalized with anatomical ct images for reconstructions in figure d . in spect signal was detectable in lps-injured lungs, but in spect signal was at background level in naïve lungs (supplementary movies and ). reduced spect signal in the liver of lps-injured mice, in agreement with biodistribution data, was also evident in coregistration of spect imaging with full body skeletal ct imaging (supplementary movies and ). therefore, naps with tropism for marginated neutrophils have the ability to detect and assess ards-like inflammation via spect-ct imaging. since those same naps do not accumulate in lungs afflicted with cpe, naps have potential for differential diagnosis of acute lung inflammation against cpe. in recent work, we demonstrated that human donor lungs rejected for transplant due to ards-like phenotypes can be perfused with nanoparticle solutions. these perfusion experiments evaluate the tendency of nanoparticles to distribute to human lungs ex vivo. we used this perfusion method to evaluate nap retention in inflamed human lungs. first, fluorescent ldngs were added to single cell suspensions prepared from human lungs. 
µg, µg, or µg of ldngs were incubated with x cells in suspension for hour at room temperature. after three washes to remove unbound ldngs, cells were stained for cd and analyzed with flow cytometry ( figure a -b). the majority of ldng uptake in the single cell suspensions was attributable to cd positive cells. ldngs accumulated in the human leukocytes, extracted from inflamed lungs, in a dose-dependent manner, with . % of leukocytes containing ldngs at a loading dose of µg. therefore, our prototype nap was retained in leukocytes from human lungs. to test ldng tropism for inflamed intact human lungs, fluorescent or i-labeled ldngs were infused via arterial catheter into ex vivo human lungs excluded from transplant. immediately prior to ldng administration, tissue dye was infused via the same arterial catheter to stain regions of the lungs directly perfused by the catheterized branch of the pulmonary artery ( figure c ). after infusion of ldngs, phosphate buffered saline infusion was used to rinse away unbound particles. perfused regions of the lungs were dissected and divided into ~ g segments, then sorted into regions deemed to have high, medium, or low levels of tissue dye staining. for lungs receiving fluorescent ldngs, well-perfused and poorly-perfused regions were selected for sectioning and fluorescent imaging. fluorescent signal from ldngs was clearly detectable in sections of well-perfused tissue, but not poorly-perfused tissue ( figure d ). in experiments with i-labeled ldngs, i-labeled ferritin was concurrently infused (i.e. a mix of ferritin and ldngs was infused) as an internal control particle shown to have no tropism for injured mouse lungs. with ldngs and ferritin infused into the same lungs via the same branch of the pulmonary artery, ldngs retained in the lungs at . % initial dose and ferritin retained at . % initial dose ( figure e ). 
ldng accumulation in human lungs was focused in regions of the lungs with high levels of perfusion stain, with concentrations of . %id/g in the "high" perfusion regions, compared to . %id/g in the "medium" perfusion regions. ferritin accumulation was more diffuse, with . %id/g in the "high" perfusion regions, compared to . %id/g in the "medium" perfusion regions (supplementary figure ) . ldngs, a prototype nap shown to home to neutrophils in acutely inflamed mouse lungs, specifically accumulated in perfused regions of inflamed human lungs, but ferritin nanocages, a particle with no tropism for neutrophils, concentrated at much lower levels in injured human lungs. our data thus indicate that nap tropism for neutrophils in inflamed mouse lungs may be recapitulated in human lungs. previous studies indicate that nanoparticles can interfere with neutrophil adhesion in inflamed vasculature. we designed studies to evaluate whether or not naps mitigate the neutrophil-mediated effects of lung inflammation. namely, we administered ldngs, dbco( x)-igg liposomes, or bare liposomes in mice subjected to model ards and determined whether or not the nanoparticles prevented lung edema induced by inflammation. mice were treated with nebulized lps as a high-throughput model for severe ards. to evaluate physiological effects of the model injury, bronchoalveolar lavage (bal) fluid was harvested from mice at hours after exposure to lps. in three separate experiments, nebulized lps induced elevated concentrations of neutrophils, cd -positive cells, and protein in the bal fluid. in naïve mice, cd -positive cells concentrated at . x cells per ml bal and neutrophils concentrated at . x cells per ml bal. after lps injury, cd -positive cells and neutrophils concentrated at . x and . x cells per ml bal, respectively. in naïve mice, protein concentrated in the bal fluid at . mg/ml and in lps-injured mice, protein concentrated in the bal at . mg/ml ( figure , white and grey bars). 
vascular disruption after nebulized lps treatment thus led to accumulation of protein-rich edema in the alveolar space. dbco( x)-igg liposomes, ldngs, and bare liposomes were compared for effects on vascular permeability in model ards. nps were administered as an iv bolus ( mg per kg body weight) two hours after nebulized lps administration. as in untreated mice, bal fluid was harvested and analyzed at hours after exposure to nebulized lps. bare liposomes or ldngs did not have significant effects on vascular injury induced by nebulized lps, as measured by either leukocyte or protein concentration in bal fluid ( figure , red and green bars). dbco( x)-igg liposomes, however, had a significant salient effect on both protein leakage and cellular infiltration in the bal ( figure , brown bars). with dbco( x)-igg liposomes administered two hours after nebulized lps, cd -positive cells and neutrophils in bal were reduced to concentrations of . x and . x cells per ml, respectively. protein concentration in the bal was reduced to . mg/ml by dbco( x)-igg liposome treatment. as measured by protection against cellular or protein leakage, relative to untreated mice, dbco( x)-igg liposomes provided . % protection against leukocyte leakage, . % protection against neutrophil leakage, and . % protection against protein leakage. dbco( x)-igg liposomes, without any drug, altered the course of inflammatory lung injury to limit protein and leukocyte edema in the alveoli. our results with dbco( x)-igg liposomes indicate that some naps can interfere with neutrophil extravasation into the alveoli and thus limit edema following inflammatory injury. however, our results with ldngs show that tropism for marginated neutrophils is not alone sufficient to limit the neutrophil-mediated effects of inflammatory lung injury. neutrophils concentrate in the pulmonary vasculature during either systemic or pulmonary inflammation. , , , , these marginated neutrophils can recognize and engulf bacteria. 
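the percent protection values reported above can be illustrated with a short calculation. the source states only that protection was measured relative to untreated mice and does not give the formula, so the normalization below (fraction of the injury-induced increase over the naïve baseline that treatment prevented) is an assumption, and the function name and example values are hypothetical.

```python
def percent_protection(naive: float, injured: float, treated: float) -> float:
    """Percent of the injury-induced increase that was prevented by treatment.

    Assumed definition (not stated in the source):
        100 * (injured - treated) / (injured - naive)
    All three inputs are BAL readouts in the same units, e.g. protein in mg/ml
    or cells per ml.
    """
    injury_increase = injured - naive
    if injury_increase <= 0:
        raise ValueError("injured value must exceed the naive baseline")
    return 100.0 * (injured - treated) / injury_increase


# Hypothetical readouts, for illustration only (not data from the study).
example = percent_protection(naive=1.0, injured=5.0, treated=2.0)
print(f"{example:.1f}% protection")
```

under this definition, 100% means the treated readout returned to the naïve baseline and 0% means no change relative to untreated injured mice.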
therefore, neutrophils surveil the vasculature for potentially pathogenic foreign species, with the pulmonary vasculature serving as a "surveillance hub" in the case of systemic or pulmonary infection and inflammation. our results with e. coli are noteworthy in this context: when e. coli are stripped of functional properties by heat treatment, oxidation, and fixation, but maintain their structure, uptake of the bacteria in the lungs only occurs after systemically prompting neutrophils with an inflammatory signal, lps. inflammation thus leads to pulmonary uptake of the e. coli-shaped particles. in large part, the overall outcome of this study is an accounting of nanoparticle structural properties that lead to recognition by "surveilling" neutrophils in the inflamed lungs, analogously to e. coli recognition by pulmonary neutrophils. including different liposomal formulations, nanoparticles were screened in our biodistribution studies comparing pulmonary nanoparticle uptake in naïve and lps-inflamed mice. thirteen different nanoparticles exhibited specificity for inflamed lungs over naïve lungs, with flow cytometry data indicating that at least three of those nanoparticle species specifically and avidly gather in neutrophils. the thirteen nanoparticles with specificity for the inflamed lungs have a range of properties. seven different proteins were used in the inflammation-specific particles. the particles have sizes ranging from ~ nm to ~ nm, include both spheres and rods, and have a range of zeta potentials. however, our analyses classify the inflammation-specific nanoparticles as: (1) nanoparticles with structure based on hydrophobic interactions between proteins; (2) nanoparticles with structure based on non-site-specific protein crosslinking; (3) nanoparticles based on charge interactions between proteins.
put broadly, these three classes can all be grouped as structures based on protein agglutination, without regard for site-specific interactions or symmetry in the resulting protein superstructure. we define the term nanoparticles with agglutinated proteins (naps) to indicate that particles with tropism for pulmonary marginated neutrophils during inflammation share commonalities in supramolecular organization. we identify naps as a broad class, rather than a single particle type. accordingly, we have presented diverse nap designs, implying a diversity of potential nap-based strategies for targeted treatment and diagnosis of ards and other inflammatory disorders in which marginated neutrophils play a role (e.g. local infections or thrombotic disorders). , , , , the diversity of naps will allow versatile options for engineering neutrophil-specific drug delivery strategies to accommodate different pathologies. in contrast to naps, three particles (adenovirus, aavs, and ferritin) characterized by highly symmetric arrangement of protein subunits into a protein superstructure [ ] [ ] [ ] did not accumulate in the inflamed neutrophil-rich lungs. these three particles have evolved structures that lead to prolonged circulation or evasion of innate immunity in mammals. [ ] [ ] [ ] [ ] it is conceivable that neutrophils more effectively recognize less patterned and more variable protein arrangements that may better parallel the wide variety of structures presented by the staggering diversity of microbes against which neutrophils defend. , to support our conclusions regarding supramolecular organization and neutrophil tropism, we re-engineered liposomes, particles with no intrinsic neutrophil tropism, to behave like naps. protein arrangement on the surface of dbco-igg liposomes was predicted to recapitulate protein agglutination seen in naps based on hydrophobic interactions. 
introduction of dbco to igg entails conjugation of a highly hydrophobic moiety to hydrophilic residues on the igg. replacing dbco with the less hydrophobic modifying group used in sata-maleimide conjugation abrogates the inflammation specificity observed with dbco-igg liposomes. likewise, titrating down the amount of dbco on the igg, thus limiting the hydrophobic groups on the protein, also ratchets down the targeting behavior of the dbco-igg liposomes. our data therefore points towards hydrophobic interactions between proteins on the liposome surface being a determinant in liposome uptake in neutrophils in the inflamed lungs. essentially, the dbco-igg liposomes may reproduce the hydrophobic interaction structural motif seen in naps produced by protein gelation (i.e. ldngs). nap-liposomes may be particularly attractive for future clinical translatability. liposomes are prominent among fda-approved nanoparticle drug carriers. further, even without cargo drugs, nap-liposomes conferred significant therapeutic effects in a mouse model of severe ards. ldngs, despite high levels of uptake in inflamed lungs, did not have the same therapeutic effect as the nap-liposomes. this result suggests that the composition of the liposomes may be important for their therapeutic effect. among possible mechanisms for the therapeutic effect, we note that lipid rafts are major signaling hubs in neutrophils. , the lipid content of the nap-liposomes (particularly the cholesterol content) may modulate neutrophil lipid rafts dependent on cholesterol. we have also observed that neutrophil content in the inflamed alveoli is markedly reduced by nap-liposomes. in this context, we note published work demonstrating that certain nanoparticles, in a still undetermined manner dependent on particle composition, can drive redistribution of neutrophils from the lungs to the liver. 
as a major corollary, our findings indicate many protein-based or protein-incorporating nanoparticles developed for therapeutic applications may accumulate in inflamed lungs, even when those nanoparticles were designed to accumulate elsewhere. the variety of protein nanostructures accumulating in inflamed lungs in our data includes particles that have been investigated as targeted drug delivery vehicles where marginated neutrophils are not the intended site of accumulation. the patterns in our data indicate that future studies may reveal additional nanoparticles that accumulate in the lungs following inflammatory insult. this study therefore serves as evidence that inflammatory challenges may prompt profound off-target changes in the biodistributions of nanomaterials, including dramatic shunting of nanoparticles and any associated drug payload to the lungs. the nanoparticle targeting profiles documented in naïve or tumor model studies may be overturned by, for instance, bacterial infection in a patient receiving the nanoparticle. in conclusion, supramolecular organization in nanoparticle structure predicts nanoparticle uptake in pulmonary marginated neutrophils during acute inflammation. specifically, nanoparticles with agglutinated protein (naps) accumulate in marginated neutrophils, while nanoparticles with more symmetric protein organization do not. nap tropism for neutrophils allowed us to develop naps as diagnostics and therapeutics for ards, and even to demonstrate nap uptake in inflamed human lungs. future work may more deeply explore therapeutic effects of naps in ards and other diseases in which neutrophils play key roles. this study also motivates future testing of supramolecular organization as a variable in in vivo behavior of nanoparticles, including screens of tropism for other pathologies and cell types.
these studies could in turn guide engineering of new particles with intrinsic cell tropisms, as with our engineering of nap-liposomes with neutrophil tropism. these "targeting" behaviors, requiring no affinity moieties, may apply to a wide variety of nanomaterials. but our current findings with neutrophil-tropic naps indicate that many protein-based and protein-coated nanoparticles could be untapped resources for treatment and diagnosis of devastating inflammatory disorders like ards. lysozyme-dextran nanogels (ldngs) were synthesized as previously described. , kda rhodamine-dextran or fitc-dextran (sigma) and lysozyme from hen egg white (sigma) were dissolved in deionized and filtered water at a : or : mol:mol ratio, and ph was adjusted to . before lyophilizing the solution. for maillard reaction between lysozyme and dextran, the lyophilized product was heated for hours at °c, with % humidity maintained via saturated kbr solution in the heating vessel. dextran-lysozyme conjugates were dissolved in deionized and filtered water to a concentration of mg/ml, and ph was adjusted to . or . . solutions were stirred at °c for minutes. diameter of ldngs was evaluated with dynamic light scattering (dls, malvern) after heat gelation. particle suspensions were stored at °c. crosslinked protein nanoparticles and nanorods were prepared using previously reported electrohydrodynamic jetting techniques. the protein nanoparticles were prepared using bovine serum albumin, human serum albumin, human lysozyme, human transferrin, or human hemoglobin (all proteins were purchased from sigma). protein nanorods were prepared using chemically modified human serum albumin. for electrohydrodynamic jetting, protein solutions were prepared by dissolving the protein of interest at a . w/v% (or . w/v% for protein nanorods) concentration in a solvent mixture of di water and ethylene glycol with : (v/v) ratio. 
the homobifunctional amine-reactive crosslinker, o,o′-bis[ -(n-succinimidylsuccinylamino)ethyl]polyethylene glycol with molecular weight of kda (nhs-peg-nhs, sigma) was mixed with the protein solution at w/w%. protein nanoparticles were kept at °c for days for completion of the crosslinking reaction. the as-prepared protein nanoparticles were collected in pbs buffer and their size distribution was analyzed using dynamic light scattering (dls, malvern). glutamic acid residues (e -tag) were inserted at the c-terminus of enhanced green fluorescent protein (egfp) through restriction cloning and site-directed mutagenesis as previously reported. proteins were expressed in an e. coli bl strain using a standard protein expression protocol. briefly, protein expression was carried out in xyt media with an induction condition of mm iptg and °c for h. at this point, the cells were harvested, and the pellets were lysed using % triton-x- ( min, °c)/dnase-i treatment ( minutes). proteins were purified using hispur cobalt columns. after elution, proteins were preserved in pbs buffer. the purity of native proteins was determined using % sds−page gel. polymers (poni) were synthesized by ring-opening metathesis polymerization using third generation grubbs' catalyst as previously described. in brief, solutions in dichloromethane of guanidinium-functionalized monomer and grubbs' catalyst were placed under freeze-thaw cycles for degassing. after warming the solutions to room temperature, the degassed monomer solutions were added to degassed catalyst solutions and allowed to stir for minutes. the polymerization reaction was terminated by the addition of excess ethyl vinyl ether. the reaction mixture was stirred for another min. the resultant polymers were precipitated from excess hexane or anhydrous diethyl ether, filtered, washed and dried under vacuum to yield a light-yellow powder.
polymers were characterized by h nmr and gel permeation chromatography (gpc) to assess chemical compositions and molecular weight distributions, respectively. subsequent to deprotection of boc functionalities, polymer was dissolved in the dcm with the addition of tfa at : ratio. the reaction was allowed to stir for hours and dried under vacuum. excess tfa was removed by azeotropic distillation with methanol. afterwards, the resultant polymers were re-dissolved in dcm and precipitated in anhydrous diethyl ether, filtered, washed and dried. polymers were then dissolved in water and transferred to biotech ce dialysis tubing membranes with a g/mol cutoff and dialyzed against ro water ( − days). the polymers were then lyophilized dried to yield a light white powder. poni polymer/e-tag protein nanocomposites (ppncs) were prepared in polypropylene microcentrifuge tubes (fisher) through a simple mixing procedure. . nmol of kda poni was incubated with . nmol of egfp at room temperature for minutes prior to dilution to µl in sterile pbs and subsequent injection. similarly, . nmol of arginine-tagged gold nanoparticles, prepared as described, were combined with . nmol of egfp to prepare egfp/gold nanoparticle complexes. azide-functionalized liposomes were prepared by thin film hydration techniques, as previously described. the lipid film was composed of mol% dppc ( , dipalmitoyl-sn-glycero- -phosphocholine), mol% cholesterol, and mol% azide-peg -dspe (all lipids from avanti). . mol% top fluor pc ( -palmitoyl- -(dipyrrometheneboron difluoride) undecanoyl-sn-glycero- -phosphocholine) was added to prepare fluorescent liposomes. . mol% dtpa-pe ( , -distearoyl-sn-glycero- phosphoethanolamine-n-diethylenetriaminepentaacetic acid) was added to prepare liposomes with capacity for radiolabeling with in. lipid solutions in chloroform, at a total lipid concentration of mm, were dried under nitrogen gas, then lyophilized for hours to remove residual solvent. 
dried lipid films were hydrated with dulbecco's phosphate buffered saline (pbs). lipid suspensions were passed through freeze-thaw cycles using liquid n / °c water bath then extruded through nm cutoff track-etched polycarbonate filters in cycles. particle size was assessed by dls after extrusion and after each subsequent particle modification. liposome concentration following extrusion was assessed with nanosight nanoparticle tracking analysis (malvern). for conjugation to liposomes, rat igg was modified with dibenzylcyclooctyne-peg -nhs ester (dbco, jena bioscience). igg solutions (pbs) were adjusted to ph . with m nahco buffer and reacted with dbco for hour at room temperature at molar ratios of . : , : , : , or : dbco:igg. unreacted dbco was removed after reaction via centrifugal filtration against kda cutoff filters (amicon). dbco-modified igg was incubated with azide liposomes at igg per liposome overnight at room temperature. unreacted antibody was removed via size exclusion chromatography, and purified liposomes were concentrated to original volume against centrifugal filters (amicon). maleimide liposomes were also prepared via lipid film hydration. lipid films comprised % dppc, % cholesterol, and % mpb-pe ( , -dioleoyl-sn-glycero- phosphoethanolamine-n-[ -(p-maleimidophenyl) butyramide]), with lipids prepared, dried, resuspended, and extruded as described above for azide liposomes. igg was prepared for conjugation to maleimide liposomes by one-hour reaction of sata (n-succinimidyl s-acetylthioacetate) per igg at room temperature in . mm edta in pbs. unreacted sata was removed from igg by passage through kda cutoff gel filtration columns. sata-conjugated igg was deprotected by one-hour room temperature incubation in . m hydroxylamine in . mm edta in pbs. excess hydroxylamine was removed and buffer was exchanged for . mm edta in pbs via kda cutoff gel filtration column.
sata-conjugated and deprotected igg was added to liposomes at igg per liposomes for overnight reaction at °c. excess igg was removed by size exclusion column purification, as above for azide liposomes. nm carboxylate nanoparticles (phosphorex) were exchanged into mm mes buffer at ph . via gel filtration column. n-hydroxysulfosuccinimide (sulfo-nhs) was added to the particles at . mg/ml, prior to incubation for minutes at room temperature. edci was then added to the particles at . mg/ml, prior to incubation for minutes at room temperature. igg was added to the particle mixture at igg per nanoparticle, prior to incubation for hours at room temperature while vortexing. for radiotracing, i-labeled igg was added to the reaction at % of total igg mass. the igg/particle mixture was diluted with -fold volume excess of ph . mes buffer and the diluted mixture was centrifuged at xg for minutes. supernatant was discarded and pbs with . % bsa was added at desired volume before resuspending the particles via probe sonication (three pulses, % amplitude). particle size was assessed via dls after resuspension, and particles were used immediately after dls assessment. top e. coli were grown overnight in terrific broth with ampicillin. bacteria were heat-inactivated by -minute incubation at °c, then fixed by overnight incubation in % paraformaldehyde. after fixation, bacteria were pelleted by centrifugation at xg for minutes. pelleted bacteria were washed three times in pbs, prior to resuspension by pipetting. bacterial concentration was verified by optical density at nm, prior to radiolabeling as described for nanoparticles below. bacteria were administered to mice ( . x colony forming units in a µl suspension per mouse). protein, horse spleen ferritin nanocages (sigma), or adeno-associated virus (empty capsids, serotype ) were prepared in pbs at concentrations between and mg/ml in volumes between and µl.
films of oxidizing agent were prepared in borosilicate tubes by drying µl of . mg/ml iodogen (perkin-elmer, chloroform solution) under nitrogen gas. alternatively, iodobeads (perkin-elmer) were added to borosilicate tubes (one per reaction). protein solutions were added to coated or bead-containing tubes, before addition of na / i at µci per µg of protein. protein was incubated with radioiodine at room temperature for minutes under parafilm in a ventilated hood. iodide-protein reactions were terminated by purifying protein solutions through a kda cutoff gel column (zeba). additional passages through gel filtration columns or against centrifugal filters (amicon, kda cutoff) were employed to remove free iodine, assuring that > % of radioactivity was associated with protein. lysozyme-dextran nanogels, crosslinked protein nanoparticles, e. coli, or adenovirus were similarly iodinated. at least µl of particle suspension was added to a borosilicate tube containing two iodobeads, prior to addition of µci of na i per µl of suspension. particles were incubated with radioiodine and iodobeads for minutes at room temperature, with gentle shaking every minutes. to remove free iodine, particle suspensions were moved to a centrifuge tube, diluted in ~ ml of buffer and centrifuged to pellet the particles ( xg/ minutes for nanogels, xg/ minutes for crosslinked protein particles, xg/ minutes for adenovirus, and xg/ minutes for e. coli). supernatant was removed and wash/centrifugation cycles were repeated to assure > % of radioactivity was associated with particles. particles were resuspended by probe sonication (three pulses, % amplitude) for nanogels or crosslinked protein nanoparticles or pipetting for adenovirus or e. coli. nanoparticle labeling with in in labeling of nanoparticles followed previously described methods, with adaptation for new particles.
all radiolabeling chelation reactions were performed using metal free conditions to prevent contaminating metals from interfering with chelation of in by dtpa or dota. metals were removed from buffers using chelex metal affinity resin (bio-rad laboratories, hercules, ca). lysozyme-dextran nanogels were prepared for chelation to in by conjugation to s- -( -isothiocyanatobenzyl)- , , , -tetraazacyclododecane tetraacetic acid (p-scn-bn-dota, macrocyclics). nanogels were moved to metal free ph . m nahco buffer by three-fold centrifugation ( xg for minutes) and pellet washing with metal free buffer. p-scn-bn-dota was added to nanogels at : mass:mass ratio, prior to reaction for minutes at room temperature. free p-scn-bn-dota was removed by three-fold centrifugal filtration against kda cutoff centrifugal filters, with resuspension of nanogels in metal-free ph citrate buffer after each centrifugation. dota-conjugated nanogels or dtpa-containing liposomes in ph citrate buffer were combined with incl for one-hour chelation at °c. nanoparticle/ incl mixtures were treated with free dtpa ( mm final concentration) to remove in not incorporated in nanoparticles. efficiency of in incorporation in nanoparticles was assessed by thin film chromatography (aluminum/silica strips, sigma) with µm edta mobile phase. chromatography strips were divided between origin and mobile front and the two portions of the strip were analyzed in a gamma counter to assess nanoparticle-associated (origin) vs. free (mobile front) in. free in was separated from nanoparticles by centrifugal filtration and nanoparticles were resuspended in pbs (liposomes) or saline (nanogels). for spect/ct imaging experiments (see spect/ct imaging methods below) with nanogels, µci of in-labeled nanogels, used within one day in labeling as described above, were administered to each mouse. for tracing in-labeled liposomes in biodistribution studies, liposomes were labeled with µci in per µmol of lipid.
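the chromatography readout described above (gamma counts at the strip origin for nanoparticle-associated label versus counts at the mobile front for free label) reduces to a simple ratio. the sketch below is a minimal illustration; the function name, the optional background subtraction, and the example counts are assumptions rather than details from the source.

```python
def labeling_efficiency(origin_counts: float, front_counts: float,
                        background_counts: float = 0.0) -> float:
    """Percent of radiolabel retained at the strip origin.

    origin_counts: gamma counts from the origin half (nanoparticle-associated).
    front_counts:  gamma counts from the mobile-front half (free label).
    background_counts: optional per-sample counter background (assumption;
                       the source does not mention background correction).
    """
    origin = max(origin_counts - background_counts, 0.0)
    front = max(front_counts - background_counts, 0.0)
    total = origin + front
    if total == 0:
        raise ValueError("no counts above background on either strip half")
    return 100.0 * origin / total


# Hypothetical counts, for illustration only.
print(labeling_efficiency(origin_counts=950, front_counts=50))
```

the same ratio applies whichever chelator (dota or dtpa) carries the label, since only the position of the counts on the strip is used.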
nanoparticle or protein biodistributions were tested by injecting radiolabeled nanoparticles or protein (suspended to µl in pbs or . % saline at a dose of . mg/kg with tracer quantities of radiolabeled material) in c bl/ male mice from jackson laboratories. biodistributions in naïve mice were compared to biodistributions in several injury models. biodistribution data were collected at minutes after nanoparticle or protein injection, unless otherwise stated, as in pharmacokinetics studies. briefly, blood was collected by vena cava draw and mice were sacrificed via terminal exsanguination and cervical dislocation. organs were harvested and rinsed in saline, and blood and organs were examined for nanoparticle or protein retention in a gamma counter (perkin-elmer). nanoparticle or protein retention in harvested organs was compared to measured radioactivity in injected doses. for calculations of nanoparticle or protein concentration in organs, quantity of retained radioactivity was normalized to organ weights. mice subject to intravenous lps injury were anesthetized with % isoflurane before administration of lps from e. coli strain b at mg/kg in µl pbs via retroorbital injection. after five hours, mice were anesthetized with ketamine-xylazine ( mg/kg ketamine, mg/kg xylazine, intramuscular administration) and administered radiolabeled nanoparticles or protein via jugular vein injection to determine biodistributions as described above. for mice subject to intratracheal (it) lps injury, b lps was administered to mice (anesthetized with ketamine/xylazine) at mg/kg in µl of pbs via tracheal catheter, followed by µl of air. biodistributions of lysozyme-dextran nanogels in it-lps-injured mice were assessed as above hours after lps administration. biodistributions of liposomes in it-lps-injured mice were assessed at , , or hours after lps administration. mice subject to footpad lps administration were provided b lps at mg/kg in µl pbs via footpad injection. 
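the biodistribution readout described above (organ radioactivity compared to the injected dose, then normalized to organ weight) corresponds to the %id/g values used throughout the results. a minimal sketch of that arithmetic follows; the function name and example numbers are hypothetical, and decay or background corrections are omitted because the source does not describe them.

```python
def percent_id_per_gram(organ_counts: float, injected_dose_counts: float,
                        organ_weight_g: float) -> float:
    """Percent of injected dose per gram of tissue (%ID/g).

    organ_counts:         gamma counts measured in the harvested organ.
    injected_dose_counts: gamma counts of the full injected dose, measured
                          under the same counting conditions.
    organ_weight_g:       wet weight of the organ in grams.
    """
    if injected_dose_counts <= 0:
        raise ValueError("injected dose counts must be positive")
    if organ_weight_g <= 0:
        raise ValueError("organ weight must be positive")
    percent_injected_dose = 100.0 * organ_counts / injected_dose_counts
    return percent_injected_dose / organ_weight_g


# Hypothetical example: 2% of the dose in a 0.2 g organ.
print(percent_id_per_gram(organ_counts=2_000,
                          injected_dose_counts=100_000,
                          organ_weight_g=0.2))
```

normalizing to weight lets small organs such as mouse lungs be compared against larger ones such as liver on a concentration basis.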
biodistributions of lysozyme-dextran nanogels were obtained at or hours after footpad lps administration. lysozyme-dextran nanogel biodistributions were also traced in a mouse model of cardiogenic pulmonary edema. to establish edema, mice were anesthetized with ketamine/xylazine and administered propranolol in saline ( µg/ml) via jugular vein catheter at µl/min over minutes. lysozyme-dextran nanogel biodistributions were subsequently assessed as above. single cell suspensions were prepared from lungs for flow cytometric analysis of cell type composition of the lungs and/or nanoparticle distribution among different cell types in the lungs. c bl/ male mice were anesthetized with ketamine/xylazine ( mg/kg ketamine, mg/kg xylazine, intramuscular administration) prior to installation of tracheal catheter secured by suture. after sacrifice by terminal exsanguination via the vena cava, lungs were perfused by right ventricle injection of ~ ml of cold pbs. the lungs were then infused via the tracheal catheter with ml of a digestive enzyme solution consisting of u/ml dispase, . mg/ml collagenase type i, and mg/ml of dnase i in cold pbs. immediately after infusion, the trachea was sutured shut while removing the tracheal catheter. the lungs with intact trachea were removed via thoracotomy and kept on ice prior to manual disaggregation. disaggregated lung tissue was aspirated in ml of digestive enzyme solution and incubated at °c for minutes, with vortexing every minutes. after addition of ml of fetal calf serum, tissue suspensions were strained through µm filters and centrifuged at xg for minutes. after removal of supernatant, the pelleted material was resuspended in ml of cold ack lysing buffer. the resulting suspensions were strained through µm filter and incubated for minutes on ice. the suspensions were centrifuged at xg for minutes and the resulting pellets were rinsed in ml of facs buffer ( % fetal calf serum and mm edta in pbs). 
after centrifugation at xg for minutes, the rinsed cell pellets were resuspended in % pfa in ml facs buffer for minutes incubation. the fixed cell suspensions were centrifuged at xg for minutes and resuspended in ml of facs buffer. for analysis of intravascular leukocyte populations in naïve and inflamed lungs, mice received an intravenous injection of fitc-conjugated anti-cd antibody five minutes prior to sacrifice and preparation of single cell suspensions as described above. populations of intravascular vs. extravascular leukocytes were assessed by subsequent stain of fixed cell suspensions with percp-conjugated anti-cd antibody and/or apc-conjugated clone a anti-ly g antibody. to accomplish staining of fixed cells, µl aliquots of the cell suspensions described above were pelleted at xg for minutes, then resuspended in labeled antibody diluted in facs buffer ( : dilution for apc-conjugated anti-ly g antibody and : dilution for percp-conjugated anti-cd antibody). samples were incubated with staining antibodies for minutes at room temperature in the dark, diluted with ml of facs buffer, and pelleted at xg for minutes. stained pellets were resuspended in µl of facs buffer prior to immediate flow cytometric analysis on a bd accuri flow cytometer. all flow cytometry data were gated to remove debris and exclude doublets. control samples with no stain, obtained from naïve and iv-lps-injured mice, established gates for negative/positive staining with fitc, percp, and apc. single stain controls allowed automatic generation of compensation matrices in fcs express software. comparison of percp anti-cd signal with fitc anti-cd signal indicated intravascular vs. extravascular leukocytes. comparison of apc anti-ly g signal with fitc anti-cd signal indicated intravascular vs. extravascular neutrophils, with percp and apc co-staining verifying identification of cells as neutrophils.
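the gating logic described above can be summarized as a per-cell classification once each channel has been reduced to positive or negative by the control-derived gates. the sketch below is only an illustration of that logic: the boolean inputs, function name, and category labels are assumptions, and real analysis would operate on compensated event data in the cytometry software.

```python
from collections import Counter


def classify_cell(fitc_iv_pos: bool, percp_pos: bool, apc_pos: bool) -> str:
    """Assign a cell to a compartment and type from three gated channels.

    fitc_iv_pos: positive for the intravenously injected FITC antibody
                 (marks cells exposed to the circulation).
    percp_pos:   positive for the PerCP leukocyte stain applied after fixation.
    apc_pos:     positive for the APC neutrophil stain applied after fixation.
    Cells lacking the PerCP leukocyte stain are not classified further here.
    """
    if not percp_pos:
        return "non-leukocyte"
    location = "intravascular" if fitc_iv_pos else "extravascular"
    # PerCP and APC co-staining verifies a cell as a neutrophil.
    cell_type = "neutrophil" if apc_pos else "leukocyte"
    return f"{location} {cell_type}"


# Hypothetical gated events: (FITC iv label, PerCP, APC).
events = [(True, True, True), (True, True, False),
          (False, True, True), (False, False, False)]
print(Counter(classify_cell(*event) for event in events))
```

counting the resulting labels gives the intravascular and extravascular fractions for leukocytes and neutrophils in each sample.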
similar staining and analysis protocols enabled identification of nanoparticle distribution among different cell types in the lungs. to enable fluorescent tracing, lysozyme-dextran nanogels contained fitc-dextran, dbco-igg liposomes contained green fluorescent top fluor pc lipid, and crosslinked albumin nanoparticles were labeled with nhs ester alexa fluor . alexa fluor labeling of albumin nanoparticles was accomplished by incubation of the nhs ester fluorophore with nanoparticles at : mass:mass fluorophore:nanoparticle ratio for two hours on ice. excess fluorophore was removed from nanoparticles by -fold centrifugation at xg for minutes followed by washing with pbs. nanoparticles were administered at . mg/kg via jugular vein injection and circulated for minutes, prior to preparation of single cell suspensions from lungs as above. fixed single cell suspensions were stained with apc-conjugated anti-ly g or percp-conjugated anti-cd as above. additional suspensions were stained with : dilution of apc-conjugated anti-cd , in lieu of anti-ly g, to identify endothelial cells. association of nanoparticles with cell types was identified by coincidence of green fluorescent signal with anti-cd , anti-ly g, or anti-cd signal. as described previously, thirty minutes after injection of µci of in-labeled nanogels, anesthetized mice were sacrificed by cervical dislocation. mice were placed into a milabs u-spect (utrecht, netherlands) scanner bed. a region covering the entire body was scanned for min using listmode acquisition. the animal was then moved, while maintaining position, to a milabs u-ct (utrecht, netherlands) for a full-body ct scan using default acquisition parameters ( µa, kvp, ms exposure, . ° step with projections). for naïve mice and mice imaged after cardiogenic pulmonary edema, ct data were acquired as above without spect data. the spect data were reconstructed using reconstruction software provided by the manufacturer, with µm voxels.
the ct data were reconstructed using reconstruction software provided by the manufacturer, with µm voxels. spect and ct data, in nifti format, were opened with imagej software (fiji package). background signal was removed from spect images by thresholding limits determined by applying renyi entropic filtering, as implemented in imagej, to a spect image slice containing ng-associated in in the liver. background-subtracted pseudo-color spect images were overlaid on ct images and axial slices depicting lungs were selected for display, with ct thresholding set to emphasize negative contrast in the airspace of the lungs. imagej's built-in d modeling plugin was used to co-register background-subtracted pseudo-color spect images with ct images in three-dimensional reconstructions. ct image thresholding was set in the d modeling tool to depict skeletal structure alongside spect signal. for three-dimensional reconstructions of lung ct images, thresholding was set, as above, for contrast emphasizing the airspace of the lungs, with thresholding values standardized between different ct images (i.e. identical values were used for naïve and edematous lungs). images were cropped in a cylinder to exclude the airspace outside of the animal, then contrast was inverted, allowing airspace to register bright ct signal and denser tissue to register as dark background. three-dimensional reconstructions of the lung ct data, and co-registrations of spect data with lung ct data, were generated as above with imagej's d plugin applied to ct data cropped and partitioned for lung contrast. quantification of ct attenuation employed imagej's measurement tool iteratively over axial slices, with measurement fields of view manually set to contain lungs and exclude surrounding tissue. mice were exposed to nebulized lps in a 'whole-body' exposure chamber, with separate compartments for each mouse (mpc- aero; braintree scientific).
to maintain adequate hydration, mice were injected with ml of sterile saline warmed to °c, intraperitoneally, immediately before exposure to lps. lps (l - mg, sigma aldrich) was reconstituted in pbs to mg/ml and stored at - °c until use. immediately before nebulization, lps was thawed and diluted to mg/ml with pbs. lps was aerosolized via a jet nebulizer connected to the exposure chamber (neb-med h, braintree scientific, inc.). ml of mg/ml lps was used to induce the injury. nebulization was performed until all liquid was nebulized (~ minutes). liposomes or saline sham were administered via retro-orbital injections of µl of suspension ( mg/kg liposome dose) at hours after lps exposure. mice were anesthetized with % isoflurane to facilitate injections. bronchoalveolar lavage (bal) fluid was collected hours after lps exposure, as previously described. briefly, mice were anesthetized with ketamine-xylazine ( mg/kg ketamine, mg/kg xylazine, intramuscular administration). the trachea was isolated and a tracheostomy was performed with a -gauge catheter. the mice were euthanized via exsanguination. . ml of cold bal buffer ( . mm edta in pbs) was injected into the lungs over ~ min via the tracheostomy and then aspirated from the lungs over ~ min. injections/aspirations were performed three times for a total of . ml of fluid added to the lungs. recovered bal fluid typically amounted to ~ . ml. bal samples were centrifuged at xg for minutes. the supernatant was collected and stored at - °c for further analysis. protein concentration was measured using bio-rad dc protein assay, per manufacturer's instructions. the cell pellet was fixed for flow cytometry as follows. µl of . % pfa in pbs was added to each sample. samples were incubated in the dark at room temperature for minutes, then ml of bal buffer was added. samples were centrifuged at xg for min, the supernatant was aspirated, and ml of facs buffer ( % fetal calf serum and mm edta in pbs) was added.
at this point, samples were stored at °c for up to week prior to flow cytometry analysis. to stain for flow cytometry, samples were centrifuged at xg for min, the supernatant was aspirated, and µl of staining buffer was added. staining buffer used was a : dilution of stock antibody solution (apc anti-mouse cd ; alexa fluor anti-mouse ly g, biolegend) into facs buffer. samples were incubated with staining antibody for minutes at room temperature in the dark. to terminate staining, ml of facs buffer was added, samples were centrifuged at xg for minutes, and supernatant was aspirated. cells were resuspended in µl of facs buffer and immediately analyzed via flow cytometry. flow cytometric analysis was completed with a bd accuri flow cytometer as follows: sample volume was set to µl and flow rate was set to 'fast'. unstained and single-stained controls were used to set gates. forward scatter (pulse area) vs. side scatter (pulse area) plots were used to gate out non-cellular debris. forward scatter (pulse area) vs. forward scatter (pulse height) plots were used to gate out doublets. the appropriate fluorescent channels were used to determine stained vs. unstained cells. the gates were placed using unstained control samples. single-stain controls were tested and showed there was no overlap/bleed-through between the fluorophores. final analysis indicated the quantity of leukocytes (cd -positive cells) and neutrophils (ly g-positive cells) in bal samples. human lungs were obtained after organ harvest from transplant donors whose lungs were in advance deemed unsuitable for transplantation. the lungs were harvested by the organ procurement team and kept at °c until the experiment, which was done within hours of organ harvest. the lungs were inflated with low pressure oxygen and oxygen flow was maintained at . l/min to maintain gentle inflation. 
pulmonary artery subsegmental branches were endovascularly cannulated, then tested for retrograde flow by perfusing for minutes with steen solution containing a small amount of green tissue dye at cm h o pressure. the pulmonary veins through which efflux of perfusate emerged were noted, allowing collection of solutions after passage through the lungs. a ml mixture of i-labeled lysozyme-dextran nanogels and i-labeled ferritin nanocages was injected through the arterial catheter. ~ ml of % bsa in pbs was passed through the same catheter to rinse unbound nanoparticles. a solution of green tissue dye was subsequently injected through the same catheter. the cannulated lung lobe was dissected into ~ g segments, which were evaluated for density of tissue dye staining. segments were weighed, divided into 'high', 'medium', 'low', and 'null' levels of dye staining, and measured for i and i signal in a gamma counter. for experiments with cell suspensions derived from human lungs (chosen for research use according to the above standards), single cell suspensions were generously provided by the laboratory of edward e. morrisey at the university of pennsylvania. aliquots of , cells were pelleted at xg for minutes and resuspended in µl pbs containing different quantities of lysozyme-dextran nanogels synthesized with fitc-labeled dextran. cells and nanogels were incubated at room temperature for minutes before two-fold pelleting at xg with ml pbs washes. cells were re-suspended in µl facs buffer for staining with apc-conjugated anti-human cd , applied by -minute incubation with a : dilution of the antibody stock. cells were pelleted at xg for minutes and resuspended in µl pbs for immediate analysis with flow cytometry (bd accuri). negative/positive nanogel or anti-cd signal was established by comparison to unstained cells. single-stained controls indicated no spectral overlap between fitc-nanogel fluorescence and anti-cd apc fluorescence.
proteins were prepared in deionized and filtered water at concentrations of . mg/ml for human albumin, . mg/ml for hen lysozyme, and . mg/ml for igg. crosslinked albumin nanoparticles, lysozyme-dextran nanogels, and igg-coated liposome suspensions were prepared such that albumin, lysozyme, and igg concentrations in the suspensions matched the concentrations of the corresponding protein solutions. protein and nanoparticle solutions were analyzed in quartz cuvettes with mm path length in an aviv circular dichroism spectrometer. the instrument was equilibrated in nitrogen at °c for minutes prior to use and samples were analyzed with sweeps between and nm in nm increments. each data point was obtained after a . s settling time, with a s averaging time. cdnn software deconvolved cd data (expressed in millidegrees) via neural network algorithm assessing alignment of spectra with library-determined spectra for helices, antiparallel sheets, parallel sheets, beta turns, and random coil. -anilino- -naphthalenesulfonic acid (ansa) at . mg/ml was mixed with lysozyme, human albumin, or igg at . mg/ml in pbs. for nanoparticle analysis, nanoparticle solutions were prepared such that albumin, lysozyme, and igg concentrations in the suspensions matched the . mg/ml concentration of the protein solutions. protein or nanoparticles and ansa were reacted at room temperature for minutes. excess ansa was removed from solutions by centrifugations against kda cutoff centrifugal filters (amicon). after resuspension to original volume, ansa-stained protein/nanoparticle solutions/suspension were examined for fluorescence (excitation nm, emission - nm) and absorbance ( - nm) maxima corresponding to ansa. for imaging neutrophil content in naïve and iv-lps-injured lungs, mice were intravenously injected with rat anti-mouse anti-ly g antibody (clone a ) and sacrificed minutes later. lungs were embedded in m medium, flash frozen, and sectioned in µm slices. 
sections were stained with percp-conjugated anti-rat secondary antibody and neutrophil-associated fluorescence was observed with epifluorescence microscopy. similar procedures enabled histological imaging of lysozyme-dextran nanogels in iv-lps-injured lungs. nanogels synthesized with rhodamine-dextran were administered intravenously in injured mice minutes prior to sacrifice. lungs were sectioned as above and stained with clone a anti-ly g antibody, followed by brilliant violet-conjugated anti-rat secondary antibody. sections of human lungs were obtained after ex vivo administration (see nanoparticle administration in human lungs above) of lysozyme-dextran nanogels synthesized with rhodamine-dextran. regions of tissue delineated as perfused and nonperfused, as determined by arterial administration of tissue dye as above, were harvested, embedded in m medium, flash frozen, and sectioned in µm slices. epifluorescence imaging indicated rhodamine fluorescence from nanogels, coregistered to autofluorescence indicating tissue architecture. a mouse was anesthetized with ketamine/xylazine five hours after intravenous administration of mg/kg b lps. a jugular vein catheter was fixed in place for injection of lysozyme-dextran nanogels, anti-cd antibody, and fluorescent dextran during imaging. in preparation for exposure of the lungs, a patch of skin on the back of the mouse, around the juncture between the ribcage and the diaphragm, was denuded. while the mouse was maintained on mechanical ventilation, an incision at the juncture between the ribs and the diaphragm, towards the posterior, exposed a portion of the lungs. a coverslip affixed to a rubber o-ring was sealed to the incision by vacuum. the exposed portion of the mouse lung was placed in focus under the objective by locating autofluorescence signal in the "fitc" channel. with ms exposure, fluorescent images from channels corresponding to violet, green, near red, and far red fluorescence were sequentially acquired.
a mixture of rhodamine-dextran nanogels ( . mg/kg), brilliant violet-conjugated anti-cd antibody ( . mg/kg), and alexa fluor labeled kda dextran ( mg/kg) for vascular contrast was administered via jugular vein catheter and images were recorded for minutes. images were recorded in slidebook software and opened in imagej (fiji distribution) for composition in movies with coregistration of the four fluorescent channels. all animal studies were carried out in strict accordance with the guide for the care and use of laboratory animals as adopted by the national institutes of health and approved by the university of pennsylvania institutional animal care and use committee (iacuc). male c bl/ j mice, - weeks old, were purchased from jackson laboratories. mice were maintained at - °c and on a / hour dark/light cycle with food and water ad libitum. ex vivo human lungs were donated from an organ procurement agency, gift of life, after determination that the lungs were not suitable for transplantation into a recipient, and therefore would have been discarded if they were not used for our study. gift of life obtained the relevant permissions for research use of the discarded lungs, and in conjunction with the university of pennsylvania's institutional review board ensured that all relevant ethical standards were met. error bars indicate standard error of the mean throughout. significance was determined through paired t-test for comparison of two samples and anova for group comparisons. linear discriminant analysis and principal components analysis were completed in gnu octave scripts (adapted from https://www.bytefish.de/blog/pca_lda_with_gnu_octave/, and made available in full in the supplementary materials). findings in this study contributed to united states provisional patent application number / . raw imaging, flow cytometry, gamma counter, and spectroscopy data supporting the findings of this study are available from the corresponding author upon reasonable request.
all other data supporting the findings of this study are available within the paper and its supplementary information files.
red: anti-ly g stain. green: tissue autofluorescence. (f) biodistributions of heat-inactivated, fixed, and ilabeled e. coli in naïve (n= ) and iv-lps-injured (n= ) mice tissue autofluorescence). (k) single frame from real-time intravital imaging of ldng (red) uptake in leukocytes (green) in iv-lps-inflamed pulmonary vasculature (blue, alexa fluor -dextran) biodistributions in iv-lps-injured mice for azide-functionalized liposomes conjugated to igg loaded with . , , , and dbco molecules per igg (bars further to the right correspond to more dbco per igg). (d) mouse lungs flow cytometry data indicating ly g anti-neutrophil staining density vs. levels of dbco( x)-igg liposome uptake.
(e) flow cytometry data verifying increased dbco( x)-igg liposome uptake in and specificity for neutrophils following lps insult (inset: verification of increased concentration of neutrophils in the lungs following lps key: cord- -vasuu k authors: shannon, ashleigh; selisko, barbara; le, nhung-thi-tuyet; huchting, johanna; touret, franck; piorkowski, géraldine; fattorini, véronique; ferron, françois; decroly, etienne; meier, chris; coutard, bruno; peersen, olve; canard, bruno title: rapid incorporation of favipiravir by the fast and permissive viral rna polymerase complex results in sars-cov- lethal mutagenesis date: - - journal: nat commun doi: . /s - - -z sha: doc_id: cord_uid: vasuu k the ongoing coronavirus disease (covid- ) pandemic, caused by severe acute respiratory syndrome coronavirus- (sars-cov- ), has emphasized the urgent need for antiviral therapeutics. the viral rna-dependent-rna-polymerase (rdrp) is a promising target with polymerase inhibitors successfully used for the treatment of several viral diseases. we demonstrate here that favipiravir predominantly exerts an antiviral effect through lethal mutagenesis. the sars-cov rdrp complex is at least -fold more active than any other viral rdrp known. it possesses both unusually high nucleotide incorporation rates and high-error rates allowing facile insertion of favipiravir into viral rna, provoking c-to-u and g-to-a transitions in the already low cytosine content sars-cov- genome. the coronavirus rdrp complex represents an achilles heel for sars-cov, supporting nucleoside analogues as promising candidates for the treatment of covid- . coronaviruses (cov) are large genome, positive-strand rna viruses of the order nidovirales that have recently attracted global attention due to the ongoing covid- pandemic. despite significant efforts to control its spread, sars-cov- has caused substantial health and economic burden, emphasising the immediate need for antiviral treatments.
as with all positive strand rna viruses, an rdrp lies at the core of the viral replication machinery and for covs this is the nsp protein. the pivotal role of nsp in the viral life-cycle, lack of host homologues and high level of sequence and structural conservation makes it an optimal target for therapeutics. however, there has been remarkably little biochemical characterisation of nsp and a lack of fundamental data to guide the design of antiviral therapeutics and study their mechanism of action (moa). a promising class of rdrp inhibitors are nucleoside analogues (nas), small molecule drugs that are metabolised intracellularly into their active ribonucleoside ′-triphosphate (rtp) forms and incorporated into the nascent viral rna by error-prone viral rdrps. this can disrupt rna synthesis directly via chain termination, or can lead to the accumulation of deleterious mutations in the viral genome. for covs, the situation is complicated by the post-replicative repair capacity provided by the nsp exonuclease (exon) that is essential for maintaining the integrity of their large~ kb genomes [ ] [ ] [ ] . nsp has been shown to remove certain nas after insertion by the rdrp into the nascent rna, thus reducing their antiviral effects [ ] [ ] [ ] . despite this, several nas currently being used for the treatment of other viral infections have been identified as potential anti-cov candidates [ ] [ ] [ ] . among these is the purine base analogue t- (favipiravir and avigan) that has broad-spectrum activity against a number of rna viruses and is currently licensed in japan for use in the treatment of influenza virus . clinical trials are currently ongoing in china, italy, and the uk for the treatment of covid- , although its precise moa against covs has not been shown. here we show that a recombinant sars-cov rna polymerase complex has an unusually high polymerisation rate and a very low nucleotide insertion fidelity. 
this enzyme readily incorporates t- -ribose- ′-phosphate into viral rna in vitro, and cell culture-based infectious virus studies show an increase in mutations in the presence of favipiravir. these results indicate favipiravir can have antiviral effects through alteration of the sars-cov- genome even in the presence of an active exon. results t- inhibits sars-cov- through lethal mutagenesis. we infected vero cells with sars-cov- in the presence or absence of µm t- (supplementary fig. a, b) and performed deep sequencing of viral rna. a -fold (p < . ) increase in total mutation frequencies is observed in viral populations grown in the presence of the drug as compared to the no-drug samples (fig. ). similar to previous findings with influenza, coxsackie b and ebola viruses, a -fold increase in g-to-a and c-to-u transition mutations is observed, consistent with t- acting predominantly as a guanosine analogue. the increase in the diversity of the virus variant population suggests that once incorporated into viral rna, t- is acting as a mutagen capable of escaping the cov repair machinery. interestingly, the sars-cov- genome has an already low cytosine content of . % and t- treatment may therefore place additional pressure on its nucleotide content. associated with this increase in mutation frequency, t- has an antiviral effect on sars-cov- , as illustrated by a reduction in virus-induced cytopathic effect, viral rna copy number and infectious particle yield. altogether these observations show that the mutagenic effect induced by t- is, at least in part, responsible for the inhibition of the replication. the highly active sars-cov polymerase shows distinct processivity modes. to determine the efficacy and moa of t- against sars-cov we first characterised nsp primer-dependent activity using traditional annealed primer-template (pt) and self-priming hairpin (hp) rnas that may confer additional stability on the elongation complex (supplementary fig. c).
consistent with prior findings, nsp alone is essentially inactive, and rna synthesis requires the presence of nsp and cofactors whose stimulatory effect is enhanced by linking them as a nsp l fusion protein. structures of nsp show a four-component complex with a nsp /nsp heterodimer and an additional nsp monomer, and accordingly we found that addition of supplementary nsp to the nsp :nsp l complex further increases activity (supplementary fig. ). the resulting nsp : l : complex is highly active on both pt and hp rnas, with reactions containing . µm of each substrate and µm nsp showing comparable initiation rates, with a rapid burst phase resulting in > % primer consumption followed by remaining primer use over a period of a few minutes (fig. a, b and supplementary fig. ). interestingly however, the apparent processivity for the two substrates differs substantially. when provided with an annealed pt pair, intermediate products account for ~ % of the total lane intensity across all timepoints suggesting distributive polymerase activity (fig. c and supplementary fig. ). in contrast, extension of a hp substrate has few intermediate products, indicating a more processive elongation mode. this pattern is consistently observed across rna substrates of different lengths, showing that the distributive pt mode does not convert into a processive state within a -nucleotide long template. interestingly, a recently resolved structure shows that the two copies of nsp form long helical 'sliding pole' extensions that contact the duplex rna product ~ basepairs from the active site. in light of this, we believe the lower processivity of pt may indicate a less stable elongation complex for rna duplexes that are too short to form these contacts. on the other hand, short hairpin substrates are inherently more stably folded and are immune to complete strand separation. this added structural rigidity likely explains the more processive replication on these substrates.
notably, the nidovirales rna replication/ transcription scheme involves precise recombination-like events to generate subgenomic rnas through a discontinuous mechanism along the~ , -nt genome . the different processivities we observe may be connected to these peculiar rna synthesis events. differences in rna secondary structures may significantly alter complex stability, subsequently controlling the dissociation and re-association at specific regions of the rna template. the sars-cov polymerase readily incorporates t- /t- as purine analogues. we sought to determine the extent to which the rdrp complex could incorporate the na inhibitors t- and t- (a non-fluorinated t- -related analogue, fig. a ) into rna. t- shows improved potency against influenza virus in certain cell lines and is reportedly more stable than t- in the rtp form used in enzyme assays [ ] [ ] [ ] . the moa for these compounds is currently controversial. t- has been shown to act through lethal mutagenesis for several viruses, predominantly by competing with guanosine to cause transition mutations , [ ] [ ] [ ] [ ] . however, two separate studies support an antiviral effect mediated by chain termination, with incorporation of either a single or two consecutive t- molecules blocking further extension by influenza polymerase , . for the nsp complex, omission of atp and/or gtp from elongation reactions results in rapid incorporation of t- and t- at multiple sites with both substrates (fig. b, c) . in reactions with µm t- -rtp, multiple incorporation events are seen within s, while t- -rtp is less efficient, potentially attributable to the higher lability of this compound. neither compound is incorporated in the place of cytosine or uracil, clearly showing they function as purine analogues (supplementary fig. ). 
for the more processive hp complex, efficient incorporation and elongation occurs opposite both uracil and cytosine, resulting in rapid accumulation of full-length products, advocating for lethal mutagenesis as the moa for these compounds (fig. c). interestingly, the pt substrate reveals a striking difference in the moa depending on whether the analogues are incorporated in place of guanine or adenine. opposite uracil, both are rapidly incorporated but further extension is slow and inefficient. furthermore, synthesis is seen to stop following multiple consecutive incorporations (fig. c), suggesting these analogues may somewhat destabilise the complex and promote template dissociation, particularly for the less processive pt substrate. in contrast, opposite cytosine there is a stall before each analogue incorporation step, but elongation past the analogue is rapid, even at consecutive sites. more full-length product is observed in these no-gtp experiments, suggesting both compounds are more efficient as guanosine analogues. a similar observation has been made for the poliovirus rdrp that efficiently incorporated and bypassed t- opposite cytosine but was prone to pausing opposite uridine. these pause events were attributed to rdrp backtracking, which was therefore assigned as the primary cause of the inhibitory activity. our results indicate that for nsp , the t- /t- moa is dictated by the structural and functional properties of both the polymerase and the rna. while the presence of abortive synthesis products suggests that chain termination may contribute to the antiviral effect, this notably only occurred following several consecutive analogue incorporations, a situation that is relatively unlikely during viral replication. based on the speed and frequency of analogue incorporation under multiple sequence contexts, we conclude that t- /t- predominantly act as mutagens, consistent with our infectious virus results.
the reactions performed without atp also showed detectable amounts of gtp:u mismatch products, allowing us to make a direct comparison of the t- incorporation rate with that of a natural gtp:u mismatch (fig. ). reactions using µm gtp and µm t- -rtp show ~ -fold more t- :u produced relative to gtp:u. considering the concentration difference, t- incorporation may be as much as ~ -fold more efficient than the native gtp mismatch, the most common naturally occurring transition mutation. such high levels of incorporation may exceed the capacity of the cov error-correcting mechanisms, consistent with our infectious virus data, although it remains to be determined if nsp can excise t- /t- , as is the case for ribavirin and -fluorouracil. the sars-cov polymerase is the fastest viral rdrp known. we carried out pre-steady state rapid-quench experiments to further understand the molecular basis of nsp elongation rates and fidelity. we initially attempted to form stalled elongation complexes, as is commonly done for the structurally related picornaviral rdrps, but nsp complexes showed half-lives of only ~ min across a range of conditions and cofactor stoichiometries (supplementary fig. ). the significant rapid burst phase, however, allowed characterisation of elongation rates under single-turnover conditions using millisecond timescale edta quench-flow (fig. and supplementary fig. ). experiments were performed at three ntp concentrations based on estimated steady-state k d values of - µm. ctp was omitted to allow elongation by only and nucleotides on the pt and hp substrates, respectively, while avoiding end-effects that can slow elongation rates. at . µm ntp we observe multiple incorporation steps within a mere - ms and formation of the + / products by ms (fig. ). analysis of the data using a minimal model of identical nucleotide incorporation steps yields elongation rates of ± . and ± s − for the pt and hp substrates at µm ntp at °c (supplementary table ).
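the minimal model of identical incorporation steps mentioned above can be sketched as follows. assuming each nucleotide addition is irreversible and shares one rate constant k, the fraction of primers extended by n nucleotides at time t is the poisson term (kt)^n e^(-kt)/n!, with the final product collecting the remainder. the function names, the grid-search fit, and any numbers passed to these functions are illustrative assumptions, not the authors' fitting procedure or data.

```python
import math


def species_fractions(k, t, n_steps):
    """Fractions of primer extended by 0..n_steps nucleotides at time t.

    Assumes n_steps irreversible additions that all share the rate
    constant k (per second), with every primer unextended at t = 0.
    """
    kt = k * t
    # Intermediates that have made exactly n additions follow a Poisson term.
    fractions = [math.exp(-kt) * kt ** n / math.factorial(n) for n in range(n_steps)]
    # The final product holds everything that has completed all steps.
    fractions.append(max(0.0, 1.0 - sum(fractions)))
    return fractions


def fit_rate(times, observed, n_steps, candidates):
    """Return the candidate k with the smallest summed squared error.

    `observed` holds one list of species fractions per time point, for
    example normalised band intensities from a quench-flow gel.
    """
    def squared_error(k):
        total = 0.0
        for t, obs in zip(times, observed):
            model = species_fractions(k, t, n_steps)
            total += sum((m - o) ** 2 for m, o in zip(model, obs))
        return total

    return min(candidates, key=squared_error)
```

a grid search keeps the sketch dependency-free; a nonlinear least-squares routine would normally replace it and would also supply the standard errors that the text reports alongside each rate.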
fitting the rate concentration dependence yields maximal elongation rates of ± s − on annealed primer-templates and ± s − on hairpin templates (fig. d) . table lists rates ± standard error (s.e) for both an overall average rate and for each individual incorporation step. source data are provided as a source data file. these data reveal that the sars-cov nsp is the fastest viral rdrp known, with rates significantly faster than the - s − observed for picornaviral polymerases at room temperature [ ] [ ] [ ] and - s − for hepatitis c and dengue virus polymerases at and °c , . based on its structure, nsp is expected to use the palm domain based active site closure mechanism that is unique to viral rdrps and which exhibits a - -fold rate increase between °c and °c , , suggesting nsp can elongate at - s − at physiological temperatures. the sars-cov polymerase exhibits low nucleotide insertion fidelity. such a fast viral rdrp is consistent with the need to rapidly replicate~ , -nt-long rna genomes, but raises questions as to how fidelity of nucleotide incorporation is impacted. in our data, we consistently observe nucleotide misincorporations and subsequent elongation on templates where nsp should stall due to the lack of ctp ( supplementary fig. b, ) . in contrast, rdrps that form highly stable elongation complexes show limited or no readthrough products under comparable conditions , , . the efficiency of the nsp bypass is utp concentration dependent ( supplementary fig. ) , indicative of uracil misincorporation opposite a templating guanosine (utp:g). the utp:g elongation product is also observed in the pt quench-flow data at reaction times as short as ms, yielding a misincorporation rate of~ . s − (supplementary table ). this is only - -fold lower than the cognate utp:a rate measured in the same experiment, and more than one order of magnitude less accurate than the generally admitted − - − error rate of viral rdrps . 
structural markers of large nidovirus rdrp active sites. the molecular basis for fast and low fidelity replication by nsp is not yet known, but a comparison of rdrp structures reveals that a key ntp interaction is absent in cov enzymes (fig. ). viral rdrps use an electrostatic interaction with an arginine residue in motif f to position the ntp during catalysis . for most positivestrand rna virus polymerases, this arginine is stabilised by a salt bridge to a glutamic acid residue, also from motif f. notably, the cov nsp has an alanine in place of this glutamate (a ) and as a result, the arginine (r ) is not rigidly anchored above the active site (fig. a, b) , . this flexibility could allow catalysis to occur with a relaxed triphosphate positioning, decreasing fidelity by lowering the requirement for strict watson-crick base pairing in the active site. these unique features have likely played a central role in genome expansion and stability by providing a fast rna synthesis machinery whose inaccuracy is mitigated by the presence of an rna repair exonuclease. our data demonstrate that nucleoside analogues are pertinent candidates for the treatment of covid- . favipiravir, with its already defined safety profile and mode of action, may well find a place as an anti-rdrp component in combination therapies targeting coronaviruses. nucleotide substrates. t- -ribose-tp (rtp), the ′-triphosphate of the nonfluorinated derivative of t- -ribose, was synthesised following the iterative route from mono-via di-to triphosphate with minor modifications ( supplementary fig. ). to overcome low solubility of t- -ribonucleotides in organic solvent, the ′-and ′-acetyl groups were kept until the final tp-synthesis and were then cleaved by treatment with base. this reaction sequence resulted in an increased yield of up to % (for di-) and % (for triphosphate), nearly double of previously reported. t- -rtp was obtained from toronto research chemicals. 
lyophilised aliquots were resuspended in te buffer ph . and the stability was verified before each experiment by measuring absorbance at nm and nm for t- -rtp and t- -rtp respectively. stability was additionally checked through hplc analysis of both compounds fresh and after dilution in the assay reaction buffer and left at room temperature for min. no loss of the active triphosphate form was observed. other ntps were purchased from ge healthcare. synthetic oligonucleotides. primer-template (pt) pairs were purchased from biomers (hplc grade). rna oligonucleotides st ( -mer) and ls ( -mer), corresponding to the ′-end of the sars-cov genome (excluding the poly a tail) were used as templates, annealed to a ′ cy labelled sp primer ( -mer) corresponding to the ′-end of the anti-genome. for simplicity, these were named pt / and pt / throughout the manuscript. annealing was performed by denaturing primer:template pairs (molar ratio of : . ) in mm kcl at °c for min, then cooling slowly to room temperature over several hours. hairpin rnas were synthesised by integrated dna technologies (coralville, ia), resuspended at μm concentration in mm nacl, mm mgcl , mm tris (ph . ) and heated to °c for min before snap cooling on ice. expression and purification of sars-cov proteins. all sars-cov proteins used in this study were expressed in escherichia coli (e. coli) under the control of t promoters (primers used for cloning shown in supplementary table ). cofactors nsp l and nsp alone were expressed from pqe vectors with c-terminal and n-terminal hexa-histidine tags respectively. tev cleavage site sequences were included for his-tag removal following expression. the nsp l fusion protein was generated by inserting a gsgsgs linker between nsp -and nsp -coding sequences. cofactors were expressed in neb express c (new england biolabs) cells carrying the prare laci (novagen) plasmid in the presence of ampicillin ( µm/ml) and chloramphenicol ( µg/ml).
protein expression was induced with µm iptg once the od = . - . , and expressed overnight at °c. cells were lysed by sonication in a lysis buffer containing mm tris-hcl ph , mm nacl, mm imidazole, supplemented with mm mgso , . mg/ml lysozyme, µg/ml dnase and mm pmsf. protein was purified first through affinity chromatography by batch binding to hispur cobalt resin (thermo scientific) followed by elution with lysis buffer supplemented with mm imidazole. eluted protein was concentrated and dialysed overnight in the presence of histidine labelled tev protease ( : w/w ratio to tev:protein) for removal of the protein tag. cleaved protein was void volume purified through a second cobalt column and subjected to size exclusion chromatography (ge, superdex s ) in gel filtration buffer ( mm hepes ph , mm nacl, mm mgcl and mm tcep). concentrated aliquots of protein were flash-frozen in liquid nitrogen and stored at − °c. a synthetic, codon-optimised sars-cov nsp gene (supplementary table ) bearing c-terminal xhis-tag preceded by a tev protease cleavage site was expressed from a pj vector (dna . ) in e. coli strain bl /pg-tf (takara). cells were grown at °c in the presence of ampicillin and chloramphenicol until od reached . cultures were induced with µm iptg and protein expressed at °c overnight. purification was performed as above in lysis buffer supplemented with % chaps. two additional wash steps were performed prior to elution, with buffer supplemented with mm imidazole and mm arginine for the first and second washes respectively. polymerase was eluted using lysis buffer with mm imidazole and concentrated protein was purified through gel filtration chromatography (ge, superdex s ) in the same buffer as for nsp l . collected fractions were concentrated and supplemented with % glycerol final concentration and stored at − °c. steady-state elongation complex reactions. 
nsp , nsp l and nsp were mixed just prior to each experiment at a : : molar ratio (unless otherwise stated) and preincubated on ice for mins. the protein complex was subsequently mixed with the rna pre-mix ( mm hepes ph . , mm nacl and mm mgcl ) containing either a single rna substrate or both hp and pt rnas at equimolar ratios. reactions were initiated with µm (final concentration) of all four ntps, or without ctp for partial elongation reactions. final reaction concentrations were µm nsp , . µm each rna. reactions were quenched at indicated timepoints with x volume of fbd stop solution (formamide, mm edta). to verify that activity was not due to either nsp l or nsp , which have been shown to harbour primase-like noncanonical rdrp activities [ ] [ ] [ ] [ ] control assays were run in the absence of nsp . the ntp k d was estimated using the same conditions, but with final equimolar concentration of atp, gtp and utp ranging from . to µm ( -fold dilution series). the time course of product formation was fit to single exponential equation for each concentration of ntp to give the observed rate constant (k obs ). observed rates were subsequently plotted against ntp concentration, and the data was fit via hyperbolic regression to give the equilibrium dissociation constant (k d ) and the maximum rate constant for incorporation of ntps (k pol ). formation of stalled elongation complex. attempts to form a stalled elongation complex were performed with hp - rna using a constant nsp concentration of . µm with varied nsp :nsp l :nsp ratios ( supplementary fig. ). protein, hp - rna ( . µm) and µm final concentrations of gtp and atp were incubated for min at room temperature to allow formation of a stalled + nucleotide elongation complex. reactions were diluted : in high salt to prevent rna rebinding and chased at various timepoints ( - min) with µm all ntps. chase reactions were quenched after s in fbd. 
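the steady-state analysis described in this section (observed rates plotted against ntp concentration and fit to a hyperbola to obtain k d and k pol) can be sketched in python as follows. this is a minimal illustration, not the authors' fitting procedure: it swaps the nonlinear hyperbolic regression for a hanes-woolf linearisation solved by ordinary least squares, and the function name, the concentrations and the rate constants used to generate the data are all hypothetical.

```python
def fit_hyperbola(ntp_conc, k_obs):
    """Estimate (k_pol, K_d) for k_obs = k_pol * [NTP] / (K_d + [NTP]).

    Assumption: uses the Hanes-Woolf linearisation
        [NTP] / k_obs = [NTP] / k_pol + K_d / k_pol
    with ordinary least squares, as a stand-in for the nonlinear
    hyperbolic regression mentioned in the text.
    """
    xs = list(ntp_conc)
    ys = [s / k for s, k in zip(ntp_conc, k_obs)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # equals 1 / k_pol
    intercept = mean_y - slope * mean_x  # equals K_d / k_pol
    return 1.0 / slope, intercept / slope


# Hypothetical, noise-free data generated with k_pol = 100 s^-1 and K_d = 50 uM.
concs = [5.0, 10.0, 25.0, 50.0, 100.0, 200.0]
rates = [100.0 * s / (50.0 + s) for s in concs]
k_pol, k_d = fit_hyperbola(concs, rates)
print(f"k_pol = {k_pol:.2f} s^-1, K_d = {k_d:.2f} uM")
```

with real, noisy data a direct nonlinear fit is usually preferable, because the linearisation weights low-concentration points differently; the sketch only shows how the two parameters relate to the measured k obs values.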
stability half-life was calculated from the ratio of full-length (fl) product produced relative to the total amount of + -lock + fl product at each timepoint. half-life was obtained by fitting the data through a single exponential. pre-steady state quenched-flow kinetics. experiments were carried out in a bio-logic qfm- rapid chemical quench-flow apparatus that controls reaction time by flow rate through a . μl chamber between the reaction mixer and the quencher mixer. the nsp -nsp l -nsp complex was assembled by first preincubating the proteins on ice for min at a : : molar ratio, then adding this to sp / and hp - hairpin rna in reaction buffer ( mm hepes ph . product analysis. quenched reactions were mixed : with fbd loading dye and heated for min at °c, and cooled on ice for min before analysis on - % polyacrylamide, m urea tbe gels. gels were run at a constant w using sequi-gen gt systems from bio-rad or vertical electrophoresis systems from cbs scientific and visualised using an amersham™ typhoon™ biomolecular imager (ge healthcare). the intensity of each band was quantified using the imagequant software (ge healthcare)/image gauge (fuji) and/or using imagej as implemented in the fiji package, with background subtraction. product yield was determined by dividing the intensity of the product by total intensity of the product + remaining primer and multiplying by the input concentration of rna. for the rapid quench data, the programme peakfit (systat software) was used to fit gel lane profiles to a set of gaussian peaks and the fractional area contained within each peak was multiplied by the rna concentration in the experiment to calculate the amount of each elongation species as a function of reaction time. pre-steady state kinetic data analysis. rapid quench data were analysed using the programme kintek explorer to model the reactions as a series of seven (pt substrate) or eight (hp substrate) irreversible nucleotide addition steps. 
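to make the "single average rate" version of this model concrete, the sketch below computes the expected fraction of each elongation species for a chain of irreversible additions that all share one rate constant. it is an illustration under stated assumptions, not the kintek explorer model itself: it ignores the inactivation/reactivation equilibrium discussed in this section, and the function name, rate constant and time point are hypothetical.

```python
import math


def species_fractions(k, t, n_steps):
    """Fractions of primer (+0), intermediates (+1 .. +n-1) and final
    product (+n) at time t for n_steps sequential irreversible additions
    that share a single rate constant k.

    With identical rates, the number of completed steps follows a Poisson
    distribution with mean k*t, truncated at the final product.
    """
    kt = k * t
    fractions = [math.exp(-kt) * kt ** i / math.factorial(i) for i in range(n_steps)]
    # Everything that has completed all n_steps accumulates as final product.
    fractions.append(max(0.0, 1.0 - sum(fractions)))
    return fractions


# Hypothetical values: seven steps (as for the pt substrate), k = 100 s^-1, t = 20 ms.
fracs = species_fractions(k=100.0, t=0.020, n_steps=7)
for i, f in enumerate(fracs):
    print(f"+{i}: {f:.4f}")
```

fitting would then amount to adjusting k until these predicted fractions match the band intensities quantified at each quench time.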
data from each ntp concentration series were fit independently (not globally) to obtain observed rates using either a model with (i) a single average rate or (ii) individual rates for each nucleotide addition step (supplementary fig. ). modelling the remaining intermediate species attributed to low processivity elongation using a formal rna dissociation step and a rebinding step at rates comparable to nonburst phase primer utilisation was not successful, suggesting the nsp -rna complex exists in some form of inactive state. low processivity elongation was instead accommodated in the kinetic model by adding a reversible inactivation/reactivation equilibrium step for each elongation product, allowing intermediate products to depart from the immediate processive pathway toward full-length product formation, but then rejoin the pathway for further elongation. the rate constants for these steps were shared among all intermediate species as there was insufficient data to fit them individually. the single average rate models were fit using only data from primer loss and the full-length product (+ or + ), but not the intermediate species. note that for the pt template we observed detectable amounts of additional + and + bands at the higher . and µm ntp concentrations and therefore included them in the model. this is presumably due to a utp:g mismatch to yield the + product in the absence of ctp, followed by a cognate atp:u addition that is slow because it is priming on a mismatched base pair. for the hp substrate, we were unable to definitively identify the gel migration bands for the + and + species and therefore did not model individual rates for these two elongation events. data from the hp rnas are challenging to analyse because the high thermodynamic stability of the folded rna structure can make it difficult to fully denature the helix during electrophoresis.
this leads to a mixture of species that transition from denatured single-stranded rna for the initial species to more compact and faster migrating duplexes for longer elongation products. we were therefore unable to reliably analyse quench flow data from longer hairpin products. ntp-analogue incorporation. enzyme mix ( µm nsp , nsp l and µm nsp ) in complex buffer ( mm hepes ph . , nacl, mm tcep and mm mgcl ) was incubated min on ice and then diluted with reaction buffer ( mm hepes ph . , mm nacl and mm mgcl ) to µm nsp ( nsp l and ). the resulting enzyme complex was mixed with an equal volume of . µm primer/ template (pt) with or without . µm hairpin (hp) carrying cy or -fam fluorescent labels, respectively, at their ′ ends in reaction buffer, and incubated for min at °c or °c, as given in figure in vitro infection assays. cell line: veroe (atcc crl- ) cells were grown in minimal essential medium (life technologies) with . % heat-inactivated fetal calf serum (fcs), at °c with % co with % penicillin/streptomycin (ps, u ml − and µg ml − respectively; life technologies) and supplemented with % non-essential amino acids (life technologies). virus strain: sars-cov- strain bavpat was obtained from pr. drosten through eva global (https://www.european-virus-archive.com/). virus stocks were prepared using standard methods . all experiments were conducted in a bsl laboratory. antiviral experiments: for ec and cc determination, day prior to infection, × veroe cells were seeded in µl assay medium (containing . % fcs) in -well plates. the next day, seven -fold serial dilutions of t- ( µm to . µm in triplicate) were added to the cells ( µl/well, in assay medium). four virus control wells were supplemented with µl of assay medium. after min, µl of a virus mix diluted in medium was added to the wells. 
the amount of virus working stock used was calibrated prior to the assay based on replication kinetics so that the replication growth is still in the exponential growth phase for the readout. four cell control wells (i.e. with no virus) were supplemented with µl of assay medium. plates were incubated for days at °c prior to quantification of the viral genome by real-time rt-pcr. rna extraction and viral rna quantification were performed. the and % effective concentrations (ec , ec ; compound concentration required to inhibit viral rna replication by and %) were determined using logarithmic interpolation. for the evaluation of cc (the concentration that reduces the total cell number by %), the same culture conditions were set as for the determination of the ec , without addition of the virus, then cell viability was measured using celltiter blue® (promega). cell supernatant media were discarded and celltiter-blue® reagent (promega) was added following the manufacturer's instructions. plates were incubated for h prior to recording fluorescence ( / nm) with a tecan infinite pro machine. from the measured od , cc was determined using logarithmic interpolation. for ec determination using cpe inhibition, cells and t- were prepared as described above. eight virus control wells were supplemented with µl of assay medium and eight cell control wells were supplemented with µl. after min, µl of a virus mix diluted in . % fcs-containing medium was added to the wells at moi . . three days after infection, cpe were assessed using celltiter-blue® reagent (promega). for the infectivity test cells and compound were prepared as described above (ec determination) with only three -fold serial dilutions of t- ( µm to µm, in triplicate). three virus control wells were supplemented with µl of assay medium.
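the logarithmic interpolation used above for the effective and cytotoxic concentrations is not spelled out in the text. one common reading, sketched here in python, is to find the two tested concentrations that bracket the target response and interpolate linearly on a log10 concentration axis. the function name, the two-fold dilution series and the inhibition values are hypothetical and serve only to show the arithmetic.

```python
import math


def interpolate_ec(concs, inhibition, target=50.0):
    """Concentration giving `target` % inhibition, interpolated linearly
    on a log10 concentration axis between the two bracketing data points.

    concs: ascending concentrations; inhibition: matching % inhibition.
    Returns None when the target is not bracketed by the data.
    """
    points = list(zip(concs, inhibition))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo != y_hi and min(y_lo, y_hi) <= target <= max(y_lo, y_hi):
            frac = (target - y_lo) / (y_hi - y_lo)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    return None


# Hypothetical dilution series (uM) and % inhibition of viral RNA replication.
concs = [10.0, 20.0, 40.0, 80.0, 160.0]
inhib = [5.0, 20.0, 40.0, 60.0, 95.0]
ec50 = interpolate_ec(concs, inhib, 50.0)
ec90 = interpolate_ec(concs, inhib, 90.0)
print(f"EC50 = {ec50:.1f} uM, EC90 = {ec90:.1f} uM")
```

the same function applies to cc determination if the response is expressed as % reduction in cell viability rather than % inhibition of viral rna.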
the experiment was conducted as described for the ec determination. at day the supernatant was collected and each triplicate was titrated by measuring the % tissue culture infectivity dose (tcid ); briefly, three replicates were infected with μl of -fold serial dilutions of the previous supernatant, and incubated for days. cpe was measured by celltiter-blue® reagent (promega) and tcid was calculated and expressed as tcid /ml. antiviral experiments data were analysed using graphpad prism software (graph pad software). graphical representations were also performed on graphpad prism software. sequence analysis. eight overlapping amplicons were produced from the extracted viral rna using the superscript iv one-step rt-pcr system (thermo fisher scientific) and specific primers (supplementary table ). pcr products were pooled at equimolar proportions. after qubit quantification using qubit® dsdna hs assay kit and qubit . fluorometer (thermofisher scientific), amplicons were fragmented by sonication into ~ bp-long fragments. libraries were built by adding barcodes for sample identification to the fragmented dna using the ab library builder system (thermofisher scientific). to pool the barcoded samples at equimolar ratio, a quantification step by real-time pcr using the ion library taqman™ quantitation kit (thermo fisher scientific) was performed. an emulsion pcr of the pools and loading on chip was performed using the automated ion chef instrument (thermofisher). sequencing was performed using the s ion torrent technology v . (thermo fisher scientific) following manufacturer's instructions. the consensus sequence was obtained after trimming of reads (reads with quality score < . , and length < bp were removed and the first and last nucleotides were removed from the reads) and mapping of the reads to a reference (determined following blast of de novo contigs) using clc genomics workbench software v. (qiagen).
a de novo contig was also produced to ensure that the consensus sequence was not affected by the reference sequence. quasi species with frequency over . % were studied. sequencing data generated for this study are available (sra accession number: prjna ; genbank accession numbers are mt and mt ). in parallel with the viral genomes, a cloned sars-cov- dna fragment (region - , on mt genome) was treated with the same amplification and sequencing procedure to evaluate the mutation frequency induced by the sequencing steps. no sub-population was observed from the cloned dna, implying that sub-populations observed in the virus samples reflect the sequence diversity induced upstream of the amplification process. reporting summary. further information on research design is available in the nature research reporting summary linked to this article. source data are provided with this paper. other data are available from the corresponding authors upon reasonable request.
received: june ; accepted: august ; unique and conserved features of genome and proteome of sars-coronavirus, an early split-off from the coronavirus group lineage rna ′-end mismatch excision by the severe acute respiratory syndrome coronavirus nonstructural protein nsp /nsp exoribonuclease complex discovery of an rna virus ′→ ′ exoribonuclease that is critically involved in coronavirus rna synthesis high fidelity of murine hepatitis virus replication is decreased in nsp exoribonuclease mutants understanding the mechanism of the broad-spectrum antiviral activity of favipiravir (t- ): key role of the f motif of the viral polymerase structural and molecular basis of mismatch correction and ribavirin excision from coronavirus rna the antiviral compound remdesivir potently inhibits rna-dependent rna polymerase from middle east respiratory syndrome coronavirus remdesivir is a direct-acting antiviral that inhibits rnadependent rna polymerase from severe acute respiratory syndrome coronavirus with high potency nucleoside analogues for the treatment of coronavirus infections favipiravir as a potential countermeasure against neglected and emerging rna viruses determining the mutation bias of favipiravir in influenza virus using next-generation sequencing antiviral efficacy of favipiravir against ebola virus: a translational study in cynomolgus macaques the rna polymerase activity of sars-coronavirus nsp is primer dependent expression, purification, and characterization of sars coronavirus rna polymerase one severe acute respiratory syndrome coronavirus protein complex integrates processive rna polymerase and exonuclease activities structure of the sars-cov nsp polymerase bound to nsp and nsp co-factors structure of the rna-dependent rna polymerase from covid- virus structure of replicating sars-cov- polymerase continuous and discontinuous rna synthesis in coronaviruses synthesis of t- -ribonucleoside and t- -ribonucleotide and studies of chemical stability cell 
line-dependent activation and antiviral activity of t- , the non-fluorinated analogue of t- (favipiravir) prodrugs of the phosphoribosylated forms of hydroxypyrazinecarboxamide pseudobase t- and its de-fluoro analogue t- as potent influenza virus inhibitors t- (favipiravir) induces lethal mutagenesis in influenza a h n viruses in vitro extinction of west nile virus by favipiravir through lethal mutagenesis lethal mutagenesis of hepatitis c virus induced by favipiravir favipiravir elicits antiviral mutagenesis during virus replication in vivo mechanism of action of t- ribosyl triphosphate against influenza virus rna polymerase the ambiguous basepairing and high substrate efficiency of t- (favipiravir) ribofuranosyl -triphosphate towards influenza a virus polymerase signatures of nucleotide analog incorporation by an rnadependent rna polymerase revealed using high-throughput magnetic tweezers coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics polymerase ( dpol): assembly of stable, elongation-competent complexes by using a symmetrical primer-template substrate (sym/sub) structural basis for active site closure by the poliovirus rna-dependent rna polymerase a nucleobase-binding pocket in a viral rna-dependent rna polymerase contributes to elongation complex stability a quantitative stopped-flow fluorescence assay for measuring polymerase elongation rates temperature controlled high-throughput magnetic tweezers show striking difference in activation energies of replicating viral rna-dependent rna polymerases assembly, purification, and pre-steady-state kinetic analysis of active rna-dependent rna polymerase elongation complex characterization of the elongation complex of dengue virus rna polymerase: assembly, kinetics of nucleotide incorporation, and fidelity structure-function relationships underlying the replication fidelity of viral rna-dependent rna polymerases viral mutation 
rates a comprehensive superposition of viral polymerase structures nonstructural proteins and of feline coronavirus form a : heterotrimer that exhibits primer-independent rna polymerase activity the sars-coronavirus nsp +nsp complex is a unique multimeric rna polymerase capable of both de novo initiation and primer extension a second, non-canonical rna-dependent rna polymerase in sars coronavirus identification and characterization of a human coronavirus e nonstructural protein -associated rna ′-terminal adenylyltransferase activity global kinetic explorer: a new computer program for dynamic simulation and fitting of kinetic data in vitro screening of a fda approved chemical library reveals potential inhibitors of sars-cov- replication phylogenetically based establishment of a dengue virus panel, representing all available genotypes, as a tool in dengue drug discovery mutations in the chikungunya virus non-structural proteins cause resistance to favipiravir (t- ), a broad-spectrum antiviral we thank magali gilles and karine alvarez for excellent technical support, as well as prs. c. drosten and f. drexler for providing the sars-cov- through eva-global (european union's horizon programme, ga ). this work was supported by the fondation pour la recherche médicale (aide aux équipes), the score project h sc -phe-coronavirus- (grant# ) to bca, reacting covid- initiative (research and action targeting emerging infectious diseases) with the support of the ministry of solidarity and health and the ministry of higher education to bca, ed and bco, national institutes of health grant ai to op, and a grant from dzif (german center for infection research) to j.h. and c.m. the authors declare no competing interests. supplementary information is available for this paper at https://doi.org/ . /s - - -z. correspondence and requests for materials should be addressed to o.p.
or b.c. peer review information nature communications thanks aartjan te velthuis and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. peer reviewer reports are available. reprints and permission information is available at http://www.nature.com/reprints publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. open access this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons.org/ licenses/by/ . /. key: cord- - de r h authors: vandewege, michael w; sotero-caio, cibele g; phillips, caleb d title: positive selection and gene expression analyses from salivary glands reveal discrete adaptations within the ecologically diverse bat family phyllostomidae date: - - journal: genome biol evol doi: . /gbe/evaa sha: doc_id: cord_uid: de r h the leaf-nosed bats (phyllostomidae) are outliers among chiropterans with respect to the unusually high diversity of dietary strategies within the family. salivary glands, owing to their functions and high ultrastructural variability among lineages, are proposed to have played an important role during the phyllostomid radiation.
to identify genes underlying salivary gland functional diversification, we sequenced submandibular gland transcriptomes from phyllostomid species representative of divergent dietary strategies. from the assembled transcriptomes, we performed an array of selection tests and gene expression analyses to identify signatures of adaptation. overall, we identified an enrichment of immunity-related gene ontology terms among genes evolving under positive selection. lineage-specific selection tests revealed several endomembrane system genes under selection in the vampire bat. many genes that respond to insulin were under selection and differentially expressed genes pointed to modifications of amino acid synthesis pathways in plant-visitors. results indicate salivary glands have diversified in various ways across a functionally diverse clade of mammals in response to niche specializations.
the bat family phyllostomidae is one of the most ecologically diverse families of mammals. salivary glands are hypothesized to facilitate adaptation to novel diets because their secreted products are the first to come in contact with food and pathogens. we sequenced expressed transcripts from phyllostomid salivary glands and found strong signals of selection among immune-related genes. selection and gene expression signals among specific lineages were less clear but pointed to modifications of the endomembrane transport system and metabolic pathways. although we could not strongly link gene evolution to dietary adaptations, results indicated diversification in response to niche specializations.
bats (order chiroptera) make up % of mammalian diversity and several groups have been alluded to as examples of extreme phenotypic and genomic changes leading to species diversity (ray et al. ; pritham and feschotte ; dumont et al. ; hayden et al. ; platt et al. ; phillips and baker ; sotero-caio et al. ). in particular, the leaf-nosed bats (phyllostomidae) include > extant species, representing the most ecologically diverse mammal family with their wide range of dietary strategies practiced (cirranello et al. ). morphometric and evolutionary models suggest the ecological and phenotypic diversity among lineages and species are associated with strong selection on dietary specialization features (monteiro and nogueira ; rojas et al. ; dumont et al. ; rossoni et al. ; hedrick et al. ). bats exhibit several characteristics rendering them interesting to examine, most obviously their ability to fly and echolocate. furthermore, bats act as vectors for zoonotic diseases including sars coronavirus, ebola, nipah virus, and rabies (calisher et al. ; smith and wang ; olival et al. ). a lesser examined characteristic thought to be directly linked to their dietary multiplicity and adaptation is the anatomical diversity of their salivary glands (tandler et al. ). in mammals, there are typically three major pairs of salivary glands: the submandibular, parotid, and sublingual glands, and all use intracellular processes that involve the synthesis, modification, packaging, and secretion of proteins in membrane-bound granules. these glands are considered part of the digestive system as they secrete digestive enzymes, however, their products perform additional functions including antimicrobial resistance and biochemical communication (dobrosielski-vergona ; talley et al. ; bloss et al. ; safi and kerth ; vandewege et al. ). tandler and phillips ( ) have shown that among species of bats, secretory products were correlated with diet, especially in insectivorous species. further, phillips et al. ( ) found evidence that salivary glands play a role in lipid metabolism in myotis lucifugus. because of this wide variation and their direct link to immunity, diet, and reproduction, submandibular glands (smgs) and their products are hypothesized to play an important role in the adaptive radiation of mammals (phillips and tandler ).
linking genetic variation and gene products to selection and adaptation is still a challenge in evolutionary biology. more recently, the implementation of next-generation sequencing has facilitated the search for signatures of adaptation through dna and/or rna comparative analyses. indeed, sequencing transcriptomes offers an in-depth, data-rich method to identify selective pressures that have influenced the evolution of tissue-specific genes. of interest are genes that have undergone positive selection to adapt physiological, immunological, and ecological processes to new environments (daugherty and malik ; hawkins et al. ) . selection analyses among bat genes have generally been limited to few species for specific purposes (shen et al. ; zhang et al. ) . hawkins et al. ( ) analyzed orthologs from different bat genomes and transcriptomes and found most genes under selection were related to immunity and collagen production. here, phyllostomidae was examined specifically because of their extensive and relatively rapid radiation to new feeding strategies. we sequenced the smg transcriptomes of nine phyllostomid bats representing different subfamilies and different diets, and through analysis of orthologs characterized how selection on coding sequence and expression differences have shaped smgs. nine species from seven out of the recognized subfamilies were chosen to maximize representation of the phylogenetic and dietary diversity of phyllostomidae ( fig. ). we also included two insectivorous bats, m. lucifugus from vespertilionidae and pteronotus parnellii from mormoopidae, as outgroups. tissues were extracted and frozen in liquid nitrogen within min after euthanasia. additional details from tissue loans provided by the natural science research laboratory (nsrl) of the museum of texas tech university can be found in supplementary table s , supplementary material online. 
rna isolation, sequencing, and assembly
rna was extracted from smgs of each bat using trizol (invitrogen, carlsbad, ca, usa) following manufacturer protocols. oligo-dt magnetic beads were used to enrich for mrna strands with poly-a tails, and a strand-specific paired-end cdna library was prepared using a scriptseq kit (epicentre, madison, wi, usa). libraries were sequenced on an illumina platform (see supplementary table s , supplementary material online). for each species, paired-end reads were filtered for quality using trimmomatic . (bolger et al. ). putative open reading frames and translated peptide sequences were identified using transdecoder (haas et al. ), and the resulting peptide sequences were processed through the trinotate pipeline to identify functional properties and gene ontology (go) annotations associated with biological processes, molecular functions, and cellular components. to summarize, peptide sequences were queried against the swissprot database (dimmer et al. ) using blastp (altschul et al. ) and the pfam (finn et al. ) database using hmmer . . (eddy ). peptides were also scanned for a signal peptide using signalp (petersen et al. ), and transmembrane domains using tmhmm . (krogh et al. ).
orthology assignment is still a major challenge in bioinformatics and evolutionary biology. here, we developed a process to filter out similar transcripts produced by trinity. the first step was to assume similar sequences would have similar trinotate annotations. we parsed the trinotate output to identify the best sequence representative for each unique swissprot annotation. to choose the best gene representative among multiple coding sequences (cdss) with the same swissprot annotation, we multiplied the length of the cds by the percent identity of the swissprot hit. this metric correlated with e value, but effectively acted as a bit score when e values were identical. the cds with the highest metric was chosen to represent the annotation.
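the representative-selection rule described above (cds length multiplied by percent identity, highest value wins per swissprot annotation) can be sketched as follows. this is a minimal illustration, not the authors' script: the `CdsHit` record, its field names, and the example identifiers are hypothetical, and the tie-breaking behavior (the first cds seen is kept) is an assumption the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdsHit:
    """One annotated CDS (hypothetical record; field names are assumptions)."""
    cds_id: str              # transcript/CDS identifier
    annotation: str          # SwissProt annotation of the best hit ("" if none)
    cds_length: int          # CDS length in nucleotides
    percent_identity: float  # percent identity of the SwissProt hit


def score(hit):
    """Metric from the text: CDS length multiplied by percent identity."""
    return hit.cds_length * hit.percent_identity


def pick_representatives(hits):
    """Return {annotation: best CdsHit}, dropping CDSs without an annotation."""
    best = {}
    for hit in hits:
        if not hit.annotation:
            continue  # unannotated CDSs are removed
        current = best.get(hit.annotation)
        # Strict '>' keeps the first CDS seen on ties (an assumption).
        if current is None or score(hit) > score(current):
            best[hit.annotation] = hit
    return best


hits = [
    CdsHit("c1", "GENE_A", 900, 80.0),   # score 72000
    CdsHit("c2", "GENE_A", 1200, 75.0),  # score 90000
    CdsHit("c3", "", 3000, 99.0),        # no annotation, dropped
    CdsHit("c4", "GENE_B", 600, 95.0),
]
representatives = pick_representatives(hits)
```

in this toy input, `c2` represents `GENE_A` because its longer cds outweighs its slightly lower identity, which is the trade-off the metric is meant to capture.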
if a cds did not have a swissprot annotation, it was removed. we then combined the best sequences from all species and ran this data set through the orthomcl (li et al. ) pipeline to identify orthologous groups. only single gene ortholog groups were used in downstream selection tests. poor ortholog assignment can influence sequence relationships in a phylogeny, and because the relationships among phyllostomids are robust, a reasonable test of ortholog assignment would be to reconstruct a phylogenetic tree and determine if the resulting tree reflects previously described relationships. therefore, we reconstructed a phylogenetic tree from randomly sampled single gene orthologs shared among all individuals. each orthologous group was translated, aligned using linsi parameters in mafft (katoh and standley ), and reverse translated to construct a codon alignment. resulting alignments were concatenated and we used raxml (stamatakis ) to find the best tree from the unpartitioned data set using the ml and rapid bootstrapping algorithm, a gtrgamma model of nucleotide substitution, and bootstrap replicates. single gene orthologous groups that were found in seven or more phyllostomids were tested for evidence of selection using the maximum likelihood approach described by goldman and yang ( ). codon alignments were constructed as above and the best tree for each gene was estimated using raxml. we used codeml in paml (yang ) to estimate the role of selection on gene evolution by comparing the rate of nonsynonymous substitution per nonsynonymous site (dN) to the rate of synonymous substitution per synonymous site (dS). the dN/dS ratio can be used as a sensitive measure of selective pressure; however, in most cases, the overall dN/dS is < and only a few amino acid sites are evolving quickly. therefore, to determine whether a gene was evolving adaptively, we calculated the likelihood of models that allow dN/dS to vary among codon sites (m a, m a, m , and m ).
for all genes, we used likelihood ratio tests (lrts) to compare nested models that allow and disallow codon site dN/dS to be > (m a v m a, m v m ), and to test for significant differences between nested models (yang ). if both models that allow dN/dS rates to be > (m a and m ) were significantly better fits to the data (lrt, p < . ), we inferred these genes were evolving under positive selection. we performed a false discovery rate (fdr) correction on the p values resulting from the m a v m a and m v m lrts. fdrs were estimated using the qvalue function in the "qvalue" r package (storey and tibshirani ; storey ). we also conducted branch-site selection tests in codeml (yang and nielsen ; zhang et al. ) for ortholog groups that were present among all species. alignments were constructed as previously described, but we used the species tree generated above (fig. ). the branch-site test for positive selection divides branches of a phylogenetic tree into foreground and background branches. the null model (model a) restricts positive selection among codons in both foreground and background branches. the alternative (model b) allows positive selection to occur among codons in foreground branches. likelihoods between model a and model b were compared using an lrt. the fdr was also estimated from the distribution of p values. we conducted independent branch-site tests where the d. rotundus branch was the foreground, a second test with trachops cirrhosus as foreground, and a third test that included the entire plant-visitors clade as foreground (fig. ). we mapped quality filtered paired-end rnaseq reads back to reference sequences using bowtie . . (langmead et al. ) and default parameters of rsem (li and dewey ). only ortholog groups that were present in four or more species were included in the reference sequences. ortholog groups with < mapped reads across all samples were removed prior to analyses.
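the likelihood ratio tests and multiple-testing correction described above can be sketched with the standard library alone. several things here are assumptions rather than details from the text: the degrees of freedom (two is the conventional choice when comparing nested codon site models, one is a common choice for the branch-site test), the example log-likelihoods, and the use of the benjamini-hochberg procedure. the paper estimated fdr with storey's q-value method in r, so benjamini-hochberg is a simpler stand-in, not the same estimator.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability, closed forms for df = 1 or 2 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 are supported in this sketch")


def lrt_pvalue(lnl_null, lnl_alt, df=2):
    """Likelihood ratio test: 2 * (lnL_alt - lnL_null) compared to chi-square."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return chi2_sf(max(stat, 0.0), df)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log-likelihoods (null, alternative) for three genes.
fits = [(-1005.0, -1000.0), (-2000.0, -1999.5), (-1500.0, -1500.0)]
raw_p = [lrt_pvalue(null, alt, df=2) for null, alt in fits]
adj_p = benjamini_hochberg(raw_p)
```

a gene would then be called positively selected only if the adjusted p values from both site-model comparisons fall below the chosen threshold, mirroring the requirement in the text that both tests be significant.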
normalization of raw read counts was performed using the estimatesizefactors and estimatedispersions functions in deseq v . (love et al. ). patterns of species variation in expression levels were assessed by performing a phylogenetically corrected pca using the phytools r package (revell ) based on the blind variance stabilizing transformed data. we also conducted a differential expression analysis from the normalized count data between plant-visitors versus others (see fig. ) and determined statistical significance when the adjusted p value was < . . we repeated this analysis examining only nectarivores. for all single gene orthologs tested for selection, we counted the number of reoccurring go terms. we then used the go term proportion to summarize the general function of genes tested in the smgs of phyllostomids using revigo (supek et al. ), which nests redundant and similar terms from long go term lists by semantic clustering. we repeated the same analysis using go terms described from orthologs under positive selection. to test for go term enrichment for the genes under selection, panther (mi et al. ) was applied to identify statistically overrepresented go terms from the list of genes under positive selection using the genes tested as the reference list. we used fisher's exact test with fdr correction, and statistical significance was determined at an adjusted p value < . . we estimated whether a protein was membrane bound, a receptor, immune related, or secreted by searching for specific go terms. if "secretion," "extracellular space," or "extracellular region" was present among go terms, we annotated the gene as secreted. if "immune," "defense," or "antimicrobial" was present, the gene was annotated as immune. if "membrane" was present we called the gene membrane bound and if "receptor" was present, we called the gene a receptor (supplementary table s , supplementary material online). we also used david . (huang et al.
) to map kegg pathways to differentially expressed genes. to investigate putative proteins underlying the evolution of smgs, we performed rnaseq on chiroptera species: nine phyllostomids with varying dietary strategies, and two outgroups representing the bat ancestral state of strict insectivory ( fig. ). among the samples, between , and , transcripts were assembled per species. over % of the paired-reads mapped back to each assembly indicating most reads were used. among the transcripts, we predicted between , and , orfs, and among the orfs, we found between , and , unique hits to the swissprot database (supplementary table s , supplementary material online). the most similar isoform to the reference sequence was chosen to represent the annotation. swissprot annotations were not used a priori for ortholog clustering. however, if after clustering there were multiple swissprot annotations in a single gene ortholog group, the most common annotation was used to describe the ortholog group. in all, orthomcl grouped , annotations into , orthologous groups that included between and members (supplementary fig. s a , supplementary material online). among the ortholog groups, , were single gene families represented in four or more species and , were found in all species (supplementary fig. s b , supplementary material online). among these , groups, , ( %) had the same swissprot annotation among orthologs, suggesting overall consistency between orthomcl clustering and trinotate annotations. in the remaining groups, there was one dominant gene annotation and the outliers were likely a result of improper classification by trinotate, given blast hits are often closely related paralogs and not true orthologs, in which case annotations were manually corrected. 
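the rule of describing an ortholog group by its most common swissprot annotation, and the check of how many groups carry a single annotation throughout, can be illustrated with a short sketch. the function name and the example labels are hypothetical, and how ties were resolved is not stated in the text; here the annotation encountered first wins a tie.

```python
from collections import Counter


def consensus_annotation(annotations):
    """Return (most common annotation, fraction of members carrying it)."""
    if not annotations:
        raise ValueError("an ortholog group needs at least one member")
    counts = Counter(annotations)
    # most_common orders ties by first appearance (an assumption for ties).
    label, n = counts.most_common(1)[0]
    return label, n / len(annotations)


# Hypothetical ortholog groups mapped to their members' annotations.
groups = {
    "og1": ["GENE_A", "GENE_A", "GENE_A"],
    "og2": ["GENE_B", "GENE_C", "GENE_B", "GENE_B"],
}
summary = {name: consensus_annotation(members) for name, members in groups.items()}
fully_consistent = [name for name, (_, frac) in summary.items() if frac == 1.0]
```

groups with a fraction below 1.0, such as `og2` here, are the ones that would be flagged for the kind of manual review the text describes.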
to additionally test if the ortholog predictions were generally accurate and dna sequences largely reflected species relationships, we generated a phylogenetic tree from the concatenated alignments of randomly sampled orthologs that were present in all species. although individual gene trees may not reflect the species tree, we expected that a consensus would emerge from a large volume of dna sequences if orthologs were accurately predicted and aligned. we found that the resulting concatenated tree accurately reflected previous species trees generated by dumont et al. ( ), baker et al. ( ), and rojas et al. ( ), with high bootstrap support at each node (fig. ). among the , orthologs, , were single gene orthologs present in seven or more phyllostomids, and , had alignments with enough shared sites to produce output from codeml. after correcting for fdr, we found genes where models of evolution allowing positive selection were a significantly better fit to the data than neutrality in both m a and m tests (supplementary table s , supplementary material online). to contrast background salivary gene functions with those under selection, we summarized the go terms of the , tested genes using the revigo server (fig. a). overall, most of these gene products were localized to the cytosol, but many were also positioned in the plasma membrane and extracellular region. common biological processes identified were related to transcription regulation, cell death, and protein synthesis and transport. by contrast, the biological process and cellular component terms of genes under selection were fairly different from those obtained in our global assessment of protein function (fig. b). these proteins had more terms related to the extracellular exosome or plasma membrane and were involved in the immune response (fig. b). interestingly, no terms related to diet or metabolism (i.e., carbohydrate or lipid metabolism) were associated with genes under positive selection.
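the keyword rules given in the methods for labeling a gene as secreted, immune, membrane bound, or a receptor from its go terms can be sketched as below. the keyword lists are taken from the text; the function name, the case-insensitive substring matching, and the example go terms are assumptions made for illustration.

```python
# Keyword lists follow the rules stated in the methods.
RULES = {
    "secreted": ("secretion", "extracellular space", "extracellular region"),
    "immune": ("immune", "defense", "antimicrobial"),
    "membrane bound": ("membrane",),
    "receptor": ("receptor",),
}


def classify(go_terms):
    """Return the set of labels whose keywords occur in any GO term."""
    lowered = [term.lower() for term in go_terms]
    labels = set()
    for label, keywords in RULES.items():
        # Substring match, so "transmembrane" also counts as "membrane".
        if any(keyword in term for term in lowered for keyword in keywords):
            labels.add(label)
    return labels


example = classify(["innate immune response", "extracellular space"])
```

because a gene can carry several go terms, one gene may receive more than one label, which matches the report of genes annotated with terms for both membranes and secretion.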
we used panther's overrepresentation test to determine whether any go terms were significantly overrepresented among genes under selection. there were no molecular function terms enriched among the genes under selection (supplementary fig. s , supplementary material online). the most enriched biological process term was innate immune response, but all enriched biological process terms were related to the defense against other organisms (fig. c). consistent with receptor-like proteins under selection, terms associated with membrane surfaces were enriched in the cellular component category (fig. c). upon survey of the multiple go terms assigned to each gene, we found that eight of the genes were involved in the immune response (supplementary table s , supplementary material online). moreover, % of immunity-related loci had secretory annotations, although a binomial test did not suggest secretion among immune proteins was enriched (p = . ). of the remaining genes, genes were membrane bound, two were secreted, and two had terms for both membranes and secretion. site tests can inform about positive selection on a protein among all species, but they cannot inform about episodic selection on specific lineages. therefore, we conducted branch-site tests using the species tree as a reference on the most distinctive lineages, namely the plant-visitors, d. rotundus, and t. cirrhosus. we identified , single gene orthologs that were present among all species and , had enough shared sites to be successfully tested in codeml. after correcting for fdr, we found , , and genes under positive selection in the plant-visitors, d. rotundus, and t. cirrhosus, respectively (supplementary table s , supplementary material online), making up between . % and . % of genes tested. bpia (bpi fold-containing family a member ), an immune response gene, was the only gene found to be under selection in both the plant-visitors and d.
rotundus, and it approached statistical significance in t. cirrhosus (supplementary table s ). we measured the expression of , orthologs among all species. we performed a phylogenetically corrected pca on the variance stabilizing transformed expression data and found that all species except d. rotundus, t. cirrhosus, and hsunycteris thomasi had relatively correlated expression profiles where most of the insectivores and plant-visitors clustered together (fig. a). as there were no biological replicates and few dietary replicates among our samples, the most robust approach to identifying relevant differentially expressed genes was to compare the plant-visitors to the remaining species to identify expression differences that could explain how plant-visitors have adapted to a diet with a different macromolecular profile. sixteen genes were differentially expressed in plant-visitors, eight were upregulated, and eight were downregulated (fig. b), and no go terms or kegg pathways were significantly enriched among these genes. however, the most common pathways associated with these differentially expressed genes were metabolic pathways (supplementary table s , supplementary material online). we performed another differential expression analysis examining just the nectarivores. interestingly, out of differentially expressed genes were downregulated in nectarivores (fig. c); many of these genes were also involved in metabolic pathways (supplementary table s , supplementary material online). salivary glands secrete products that have important biological roles in diet/digestion, oral health, and communication, and they display the most varied cellular ultrastructure in distinct taxa. the great morphological variation observed in the ultrastructure of phyllostomid bat salivary gland granules led phillips and tandler ( ) to speculate that adaptation to novel niches drives salivary gland evolution.
smg acinar cells are responsive to a variety of extracellular signals which can affect gene expression, protein synthesis, and protein modifications, and this responsiveness can be driven by the density and distribution of cell surface receptors. salivary proteins are among the first to directly encounter food and introduced pathogens, which lends them biological importance in the context of adaptive evolution. however, their role in the adaptation of mammals to novel niches has been largely underinvestigated. here, we sequenced the smg transcriptomes of nine phyllostomid bats with varying feeding strategies to illuminate the role of salivary glands in this adaptive radiation. interestingly, the percentage of genes under selection expressed in phyllostomid salivary glands was not necessarily greater than that of genes under selection among bats as a whole (hawkins et al. ). further, links to dietary changes were not strongly apparent, but signatures of selection did reveal modification of smg functions. among the sequenced transcriptomes, we were able to annotate between , and , expressed proteins, which were successfully clustered as , single gene orthologs. about , were present in seven or more phyllostomids and tested for selection among codon sites. using codon selection models, we found only a small proportion ( . %) of genes demonstrated signatures of positive selection. consistently, hawkins et al. ( ) found the same frequency of genes under selection from the transcriptomes and genomes of bat species using multiple tissue sources (e.g., spleen, trigeminal ganglia, inner ear, and embryo transcriptomes). this suggests that overall, the genes expressed in smgs of phyllostomids are not necessarily subjected to stronger selective pressures in coding regions than genes expressed in other tissues among bats. salivary glands are known to function in immunity through the localized function of immune cells (fábián et al.
) and immunity was the only enriched term among genes under selection ( fig. c ). eight of the proteins inferred to be affected by positive selection were linked to immunity and defense, and five of these had a secretory component (supplementary table s , supplementary material online). these eight genes had common functions associated with the innate and humoral immune response and stimulate the activation of immune cells to viral infected sites. for example, bpia (bpi fold-containing family a member ) and bpia are secreted and known to localize to upper airways and prevent biofilm formation by gram-negative bacteria (liu et al. ; prokopovic et al. ) . a high exposure to pathogens unique to bats could explain this result, but positive selection among immune genes is not uncommon in animals (roux et al. ; xiao et al. ; van der lee et al. ; hawkins et al. ) . defense and immunity genes typically have a high evolutionary rate attributed to a continuous arms race between pathogens and host immune response (jiggens and kim ; kosiol et al. ; viljakainen et al. ), and our data provide insight about how positive selection has shaped bat oral tract defense against introduced pathogens. an unexpected facet of our results was a general paucity of genes under selection with roles related to metabolism among site tests. therefore, to identify lineage-specific trends, we used branch-site models and tested for episodic selection in groups that exhibit the most divergence from the ancestral insectivorous trait: the common vampire bat d. rotundus, the frog-eating bat t. cirrhosus, and all plant-visitors. there was a similar frequency of genes under positive selection as site tests, between . % and . %, and no go terms were enriched among these genes. most go terms were not redundant among the three tests, except for associations with the golgi body, which is expected given the importance of proper secretory function of salivary glands. 
the golgi body is the central organizer of the membrane trafficking system. it is where proteins are sorted into vesicles and trafficked out of the cell. however, it is noteworthy that the vampire bat displayed an overrepresentation of golgi-related ontology terms and genes under selection, when compared with the other two tested groups. apparent adaptation or specialization of the golgi body and endomembrane system was present in the vampire bat d. rotundus. the common vampire bat belongs to the subfamily desmodontinae, the only obligate sanguivores among amniotes, which may be the most ecologically divergent among the phyllostomids. francischetti et al. ( ) previously performed a transcriptomic analysis of vampire bat salivary glands and found a diversity of anticoagulants, anti-inflammatory proteins, and neural disruptors hypothesized to enhance the efficiency of parasitizing animals for blood meals. thus, unlike in other phyllostomids, secreted proteins may not only aid digestion but also be effective adaptations to blood-feeding with minimal death risk for the bat and its prey. branch-site tests in d. rotundus identified genes under selection (supplementary table s , supplementary material online). notable go terms among these genes pointed to the regulation of protein trafficking, that is, golgi organization, protein n-linked glycosylation, trans-golgi network, and positive regulation of secretion (supplementary fig. s , supplementary material online). the genes linked to these terms were golgi phosphoprotein -like (glp l), vesicle-associated membrane protein-associated a (vapa), and nsfl cofactor p (nsf c). vapa can mediate vesicle transport from the endoplasmic reticulum (er) to the golgi body (lehto et al. ). glp l is generally expressed in secretory tissues, localizes to golgi, is required for efficient anterograde trafficking, and knocking out glp l causes golgi dispersal and impairs secretion (ng et al. ).
nsf c seems to be relevant for golgi reassembly after mitosis (pécheur et al. ). lastly, a gene under selection in the n-glycan biosynthesis pathway was pmm (phosphomannomutase ). pmm specifically catalyzes the isomerization of mannose -phosphate to mannose -phosphate. deleterious mutations in pmm tend to cause defects in protein glycosylation and subsequent congenital disorders (matthijs et al. ). vampire bats diverged from an insectivorous ancestor and subsequently underwent a rapid transition from insectivory to sanguivory (datzmann et al. ; dumont et al. ). given our results, it is plausible the endomembrane system was modified during this transition. moreover, given that some golgi body-related genes appeared under selection in all branch-site tests, this organelle likely played some role in the adaptive radiation of phyllostomids. another interesting result from branch-site tests was a response to insulin stimulus in the plant-visitors. two genes under selection, ubiquitin-conjugating enzyme e b (ube b) and casp and fadd-like apoptosis regulator (cflar), were specifically associated with this term. cflar regulates downstream genes involved in lipid metabolism, glucose uptake, and oxidative stress. in mice, cflar was shown to reverse nonalcoholic steatohepatitis liver disease (wang et al. ) that can be caused by insulin resistance and high blood sugar among other metabolic factors (liu et al. ). the exact function of ube b is less clear but appears to be linked to muscle atrophy. ube b becomes more expressed in fasting rats and atrophying muscle cells, but becomes suppressed in response to insulin (wing and banville ; polge et al. ). the plant-visitors make up a collection of frugivorous and nectarivorous species that ingest high amounts of sugars at once (laska ). interestingly, the phyllostomid great fruit-eating bat (artibeus lituratus) exhibits no difference in serum insulin levels between fasted and fed states (protzek et al. ).
moreover, serum insulin levels observed in a. lituratus were higher than those observed in mice (fujiwara et al. ) and obese humans (yassine et al. ). we speculate that physiological changes, that is, a constantly high concentration of insulin, may have caused a selective response in insulin-associated pathways. selection pressures may not only occur at the sequence level but also at the level of expression regulation. therefore, we paired selection tests with gene expression analyses. the distinct natural histories represented by t. cirrhosus and d. rotundus were reflected in the pca. in fact, the only way t. cirrhosus stood out in this study was in the expression pca. trachops cirrhosus specializes on frogs (tuttle and ryan ), which is a unique feeding strategy in phyllostomids, but data from the vespertilionid nyctalus lasiopterus suggest the transition from insects to carnivory may not require any major adaptations (ibáñez et al. ). consistently, selection pressures, given branch-site tests, were weakest in t. cirrhosus (supplementary table s , supplementary material online). the most remarkable observation from overall gene expression profiles was a general correlation among insectivores and plant-visitors. even the distantly related m. lucifugus and p. parnellii were included in this cluster (fig. a). hsunycteris thomasi slightly deviated from the other plant-visitors, although this deviation was not as extreme as that of t. cirrhosus and d. rotundus. morphological and genetic data suggest nectar-feeding arose independently in glossophaginae and lonchophyllinae (datzmann et al. ). this independence may also have been captured by our pca. given the overall similarity of expression profiles between plant-visitors and insectivores, few genes were differentially expressed (fig. b), but more genes were differentially expressed when we examined just the nectarivores (fig. c), and many of these proteins were in amino acid metabolic pathways.
the strength of differential expression among genes in amino acid pathways was not enough for these pathways to be significantly enriched, but given that carbohydrates are in higher abundance in plant-visitor diets and amino acid synthesis is linked to the products of glycolysis and the citric acid cycle, it is plausible this group has modified these pathways as adaptive responses. we examined the evolutionary history of genes expressed in smgs to test for links between genetic variation and signatures of selection. given the ecological variation of these species, we expected to find strong selection signals and greater expression diversity. indeed, we identified a strong selection signal among immune-related genes in phyllostomids, but lineage-specific adaptations were less clear, that is, fewer genes under selection among a wide array of pathways and few differentially expressed genes. from our data, we inferred modifications of the endomembrane system with a focus on the golgi body, most apparent in the vampire bat. further, in the plant-visitors, selection on insulin-responsive genes and modifications of metabolic pathways signal that unique, lineage-specific adaptations have occurred in phyllostomids with diverse feeding strategies. supplementary data are available at genome biology and evolution online. we would like to thank robert baker and carleton phillips for previous work that inspired this research. this work would not have been possible without the previous efforts of ttu students, faculty, and staff toward sample collection and curation. therefore, we are thankful for nicté ordóñez-garza, maria sagot, heath garner, robert bradley, and the texas tech university natural science research laboratory (ttu nsrl) for tissue loans. we also acknowledge the high performance computing center (hpcc) at texas tech university at lubbock for providing hpc resources. url: http://cmsdev.ttu.edu/hpcc. c.g.s.c. 
was supported with a fellowship (pde) by conselho nacional de desenvolvimento científico e tecnológico (cnpq), and by a postdoctoral fellowship (pnpd) from coordenação de aperfeiçoamento de pessoal de nível superior (capes), brazil during the development of this study. these funding agencies had no role in any experimental aspect of this study.
gapped blast and psi-blast: a new generation of protein database search programs
higher level classification of phyllostomid bats with a summary of dna synapomorphies
potential use of chemical cues for colony-mate recognition in the big brown bat, eptesicus fuscus
trimmomatic: a flexible trimmer for illumina sequence data
bats: important reservoir hosts of emerging viruses
morphological diagnoses of higher-level phyllostomid taxa (chiroptera: phyllostomidae)
evolution of nectarivory in phyllostomid bats (phyllostomidae gray, , chiroptera: mammalia)
rules of engagement: molecular insights from host-virus arms races
the uniprot-go annotation database in
biology of the salivary glands
morphological innovation, diversification and invasion of a new adaptive zone
accelerated profile hmm searches
salivary defense proteins: their network and role in innate and acquired oral immunity
the pfam protein families database
the "vampirome": transcriptome and proteome analysis of the principal and accessory submaxillary glands of the vampire bat desmodus rotundus, a vector of human rabies
insulin hypersensitivity in mice lacking the v b vasopressin receptor
a codon-based model of nucleotide substitution for protein-coding dna sequences
full-length transcriptome assembly from rna-seq data without a reference genome
de novo transcript sequence reconstruction from rna-seq using the trinity platform for reference generation and analysis
a metaanalysis of bat phylogenetics and positive selection based on genomes and transcriptomes from species
a cluster of olfactory receptor genes linked to frugivory in bats
morphological diversification under high integration in a hyper diverse mammal clade
systematic and integrative analysis of large gene lists using david bioinformatics resources
bat predation on nocturnally migrating birds
a screen for immunity genes evolving under positive selection in drosophila
mafft multiple sequence alignment software version : improvements in performance and usability
patterns of positive selection in six mammalian genomes
predicting transmembrane protein topology with a hidden markov model: application to complete genomes
ultrafast and memory-efficient alignment of short dna sequences to the human genome
food transit times and carbohydrate use in three phyllostomid bat species
targeting of osbp-related protein (orp ) to endoplasmic reticulum and plasma membrane is controlled by multiple determinants
rsem: accurate transcript quantification from rna-seq data with or without a reference genome
orthomcl: identification of ortholog groups for eukaryotic genomes
splunc /bpifa contributes to pulmonary host defense against klebsiella pneumoniae respiratory infection
silibinin ameliorates hepatic lipid accumulation and oxidative stress in mice with non-alcoholic steatohepatitis by regulating cflar-jnk pathway
moderated estimation of fold change and dispersion for rna-seq data with deseq
mutations in pmm , a phosphomannomutase gene on chromosome p , in carbohydrate-deficient glycoprotein type syndrome (jaeken syndrome)
panther version : expanded protein families and functions, and analysis tools
evolutionary patterns and processes in the radiation of phyllostomid bats
golph l antagonizes golph to determine golgi morphology
host and viral traits predict zoonotic spillover from mammals
phospholipid species act as modulators in p /p -mediated fusion of golgi membranes
signalp . : discriminating signal peptides from transmembrane regions
secretory gene recruitments in vampire bat salivary adaptation and potential convergences with sanguivorous leeches
dietary and flight energetic adaptations in a salivary gland transcriptome of an insectivorous bat
salivary glands, cellular evolution, and adaptive radiation in mammals
plasticity and patterns of evolution in mammalian salivary glands: comparative immunohistochemistry of lysozyme in bats
large numbers of novel mirnas originate from dna transposons and are coincident with a large species radiation in bats
role of e -ub-conjugating enzymes during skeletal muscle atrophy
massive amplification of rolling-circle transposons in the lineage of the bat myotis lucifugus
isolation, biochemical characterization and anti-bacterial activity of bpifa protein
insulin and glucose sensitivity, insulin secretion and β-cell distribution in endocrine pancreas of the fruit bat artibeus lituratus
bats with hats: evidence for recent dna transposon activity in genus myotis
phytools: an r package for phylogenetic comparative biology (and other things)
when did plants become important to leaf-nosed bats? diversification of feeding habits in the family phyllostomidae
bats (chiroptera: noctilionoidea) challenge a recent origin of extant neotropical diversity
intense natural selection preceded the invasion of new adaptive zones during the radiation of new world leaf-nosed bats
patterns of positive selection in seven ant genomes
secretions of the interaural gland contain information about individuality and colony membership in the bechstein's bat
adaptive evolution of energy metabolism genes and the origin of flight in bats
bats and their virome: an important source of emerging viruses capable of infecting humans
integration of molecular cytogenetics, dated molecular phylogeny, and model-based predictions to understand the extreme chromosome reorganization in the neotropical genus tonatia (chiroptera: phyllostomidae)
raxml version : a tool for phylogenetic analysis and post-analysis of large phylogenies
qvalue: q-value estimation for false discovery rate control
statistical significance for genomewide studies
revigo summarizes and visualizes long lists of gene ontology terms
female preference for male saliva: implications for sexual isolation of mus musculus subspecies
secretion by striated ducts of mammalian major salivary glands: review from an ultrastructural, functional, and evolutionary perspective
microstructure of mammalian salivary glands and its relationship to diet
genome-scale detection of positive selection in nine primates predicts human-virus evolutionary conflicts
evolution of the abpa subunit of androgen-binding protein expressed in the submaxillary glands in new and old world rodent taxa
rapid evolution of immune proteins in social insects
targeting casp and fadd-like apoptosis regulator ameliorates nonalcoholic steatohepatitis in mice and nonhuman primates
-kda ubiquitin-conjugating enzyme: structure of the rat gene and regulation upon fasting and by insulin
transcriptome analysis revealed positive selection of immune-related genes in tilapia
likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution paml : phylogenetic analysis by maximum likelihood codon-substitution models for detecting molecular adaptation at individual sites along specific lineages effects of exercise and caloric restriction on insulin resistance and cardiometabolic risk factors in older obese adults -a randomized clinical trial comparative analysis of bat genomes provides insight into the evolution of flight and immunity evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level associate editor: naruya saitou key: cord- - op m pu authors: wu, zhiqiang; yang, li; ren, xianwen; he, guimei; zhang, junpeng; yang, jian; qian, zhaohui; dong, jie; sun, lilian; zhu, yafang; du, jiang; yang, fan; zhang, shuyi; jin, qi title: deciphering the bat virome catalog to better understand the ecological diversity of bat viruses and the bat origin of emerging infectious diseases date: - - journal: the isme journal doi: . /ismej. . sha: doc_id: cord_uid: op m pu studies have demonstrated that ~ %– % of emerging infectious diseases (eids) in humans originated from wild life. bats are natural reservoirs of a large variety of viruses, including many important zoonotic viruses that cause severe diseases in humans and domestic animals. however, the understanding of the viral population and the ecological diversity residing in bat populations is unclear, which complicates the determination of the origins of certain eids. here, using bats as a typical wildlife reservoir model, virome analysis was conducted based on pharyngeal and anal swab samples of bat individuals of major bat species throughout china. the purpose of this study was to survey the ecological and biological diversities of viruses residing in these bat species, to investigate the presence of potential bat-borne zoonotic viruses and to evaluate the impacts of these viruses on public health. 
the data obtained in this study revealed an overview of the viral community present in these bat samples. many novel bat viruses were reported for the first time and some bat viruses closely related to known human or animal pathogens were identified. this genetic evidence provides new clues in the search for the origin or evolution pattern of certain viruses, such as coronaviruses and noroviruses. these data offer meaningful ecological information for predicting and tracing wildlife-originated eids.

emerging infectious diseases (eids) pose a great threat to global public health. approximately %- % of human eids originate from wildlife, as shown by the typical examples of hemorrhagic fever, avian influenza and henipavirus-related lethal neurologic and respiratory diseases that originated from rodents, wild birds and bats (jones et al., ; wolfe et al., ; lloyd-smith et al., ; marsh and wang, ; smith and wang, ). the primary issues associated with the prevention and control of eids are how to quickly identify the pathogen, determine where it originated and control the chain of transmission. these issues, along with the limited knowledge of the viral population and ecological diversity of wildlife, complicate the study of eids. therefore, meaningful information afforded by the understanding of the viral community present in wildlife, as well as the prevalence, genetic diversity and geographical distribution of these viruses, is very important for the prevention and control of wildlife-borne eids (daszak et al., ).

bats are mammals with a wide geographical distribution, extensive species diversity, unique behaviors (characteristic flight patterns, long life spans and gregarious roosting and mobility behaviors) and intimate interactions with humans and livestock (calisher et al., ).
they are natural reservoirs of a large variety of viruses, including many important zoonotic viruses that cause severe diseases in humans and domestic animals, including henipaviruses, marburg virus and ebola virus (luis et al., ; quan et al., ; o'shea et al., ) . the severe acute respiratory syndrome (sars) outbreak in , which resulted in nearly cases and deaths worldwide, was suspected to have originated in bats and then spread to humans (li et al., ) . all of these examples reveal that zoonotic viruses carried by bats can be transmitted directly or via certain intermediate hosts from bats to humans or domestic animals with high virulence. as most bat-borne pathogens are transmitted by four routes (airborne, droplet, oral-fecal and contact transmission) from the respiratory tracts, oral cavities or enteric canals of bats to other species, it is particularly important to determine the viral communities present at these locations. in this study, bat individuals of representative species across china were sampled by pharyngeal and anal swabbing, to assess the variety of viruses residing in bat species. metagenomic analysis was then conducted to screen the viromes of these samples. here we outline the viral spectrum within these bat samples and the basic ecological and genetic characteristics of these novel bat viruses. the identification of novel bat viruses in this study also provides genetic evidence for cross-species transmission between bats or between bats and other mammals. these data offer new clues for tracing the sources of important viral pathogens such as sars coronavirus (sars-cov) and middle east respiratory syndrome cov (mers-cov). pharyngeal and anal swab samples were collected separately and then immersed in virus sampling tubes (yocon, beijing, china) containing maintenance medium and were temporarily stored at − °c. the samples were then transported to the laboratory and stored at − °c. 
viral nucleic acid library construction, next-generation sequencing and taxonomic assignments

a tube with either a pharyngeal or anal swab sample in maintenance medium was vigorously vortexed to re-suspend the sample in solution. samples from the same bat species and from the same site were then pooled. the pooled samples were processed with a viral particle-protected nucleic acid purification method and then amplified by sequence-independent reverse transcriptase-pcr as described previously (wu et al., ). briefly, the samples were centrifuged at g for min at °c. supernatant from each sample was filtered through a . -μm polyvinylidene difluoride filter (millipore, darmstadt, germany), to remove eukaryotic and bacterial-sized particles. the filtered samples were then centrifuged at g for h at °c. the pellets were re-suspended in hank's balanced salt solution. to remove naked dna and rna, the re-suspended pellet was digested in a cocktail of dnase and rnase enzymes, including u of turbo dnase (ambion, austin, tx, usa), u of benzonase (novagen, darmstadt, germany), and u of rnase one (promega, madison, wi, usa) at °c for h in × dnase buffer (ambion). the viral nucleic acids were then isolated using a qiaamp minelute virus spin kit (qiagen, valencia, ca, usa). viral first-strand cdna was synthesized using the primer k- n ( ′-gaccatctagcgacctccac-nnnnnnnn- ′) and a superscript iii system (invitrogen, carlsbad, ca, usa). to convert first-strand cdna into double-stranded dna, the cdna was incubated at °c for h in the presence of u of klenow fragment (neb, ipswich, ma, usa) in × neb buffer (final volume of μl). sequence-independent pcr amplification was conducted using μm primer k ( ′-gaccatctagcgacctccac- ′) and . u phusion dna polymerase (neb). the pcr products were analyzed by agarose gel electrophoresis. a dna smear larger than bp was excised and extracted with a minelute gel extraction kit (qiagen).
the amplified viral nucleic acid libraries were then analyzed using an illumina ga ii sequencer (illumina, san diego, ca, usa) for a single read of bp in length. the raw sequence reads were filtered using previously described criteria to obtain valid sequences. each read was evaluated for viral origin by conducting alignments with the ncbi non-redundant nucleotide database (nt) and protein database (nr) using blastn and blastx (with parameters -e e- -f t). reads with no hits in nt or nr were further assembled using the velvet software (v . . , pittsburgh supercomputing center, pittsburgh, pa, usa) and the contigs were again aligned with nt and nr to identify any viruses that were present. the taxonomies of the aligned reads with the best blast scores (e-value < − ) from all lanes were parsed and exported with megan (metagenome analyzer). besides the alignment-first analytical strategy, we also tested an assembly-first strategy to analyze the sequence data. the reads were first assembled by metagenomics assemblers (for example, metavelvet and idba-ud) (namiki et al., ; peng et al., ) and the output contigs were then aligned to nt and nr. as the percentage of viral reads was small in the whole data sets and most of the assembled contigs were thus of bacterial and host origins, the assembly-first strategy generally had lower sensitivity than the alignment-first strategy in this study.

identification of the prevalence and positive rate of each virus

according to the molecular clues provided by metagenomic analyses, the sequence reads classified into the same virus family or genus by megan were extracted and then assembled with seqman program (lasergene, dnastar, madison, wi, usa). a draft genome with several or a large number of single-nt polymorphisms of each virus was obtained.
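the alignment-first classification described above (blast hits filtered by e-value, best-scoring hit kept per read, unmatched reads passed on to assembly) can be sketched as follows. this is a minimal illustration and not the authors' pipeline: it assumes the standard 12-column tabular blast output, and the function names `best_hits` and `unmatched_reads` and the `max_evalue` parameter are hypothetical. the e-value cutoff used in the study is not recoverable from the text, so it is left as an argument.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical helper: keep the best-scoring BLAST hit per read.
# Assumes 12-column tabular BLAST output:
# query, subject, %identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, e-value, bit score.
def best_hits(blast_lines: Iterable[str],
              max_evalue: float) -> Dict[str, Tuple[str, float, float]]:
    """Return {read_id: (subject_id, evalue, bitscore)} for hits below max_evalue."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue  # hit is not significant under the chosen cutoff
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


def unmatched_reads(all_read_ids: Iterable[str],
                    hits: Dict[str, Tuple[str, float, float]]) -> List[str]:
    """Reads with no significant hit; candidates for assembly and re-alignment."""
    return [read_id for read_id in all_read_ids if read_id not in hits]
```

reads returned by `unmatched_reads` correspond to the "no hits in nt or nr" set that the text sends to assembly before a second alignment round; taxonomic binning of the retained hits (done with megan in the study) is outside the scope of this sketch.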
based on the partial genomic sequences of the viruses obtained by the assembly, we designed specific nested primers for pcr or reverse transcriptase-pcr to screen for each virus in individual samples from each bat species (the primer sequences for each virus are available in supplementary table s ).

genome sequencing of each virus in positive samples by pcr

the accurate locations of the reads and the relative distances between reads of the same virus were determined based on the alignment results exported using megan . representative positive samples for each virus were selected for genome sequencing as viral quasi-species. the reads with accurate genomic locations were then used for reads-based pcr to identify partial genomes. based on the partial genomic sequences obtained by specific nested pcr, the remaining genomic sequences were determined using inverse pcr, genome walking, and ′- and ′-rapid amplification of cdna ends.

genome sequences and phylogenetic and evolutionary analyses

all genome sequences have been submitted to genbank. the accession numbers for all sequences are listed in supplementary table s . the ga ii sequence data have been deposited into the ncbi sequence reads archive under the accession number sra . routine sequence alignments were performed using the megalign program (lasergene) or t-coffee with manual curation. mega . (phoenix, az, usa) was used to align the nt and amino acid (aa) sequences using the muscle package with the default parameters. the best substitution model was then evaluated using the model selection package. finally, a maximum-likelihood method with an appropriate model was used to conduct phylogenetic analyses with bootstrap replicates.

supplementary table s .
bat viromes in china z wu et al
main bat species, the most common genera and families of both frugivorous and insectivorous bats that reside in urban, rural and wild areas throughout china.
all pharyngeal and anal swab samples were classified and combined into pools and then subjected to virome analysis. finally, a total of gb of nt data ( valid reads, bp in length) was obtained. in total, reads were best matched with viral proteins available in the ncbi nr database (~ . % of the total sequence reads). the number of virus-associated reads in each lane varied between and . in total, families of phages, insect viruses, fungal viruses, mammalian viruses and plant viruses were parsed. after excluding bat habit-related non-mammalian viruses, including insect viruses (mainly of the families baculoviridae, iflaviridae, dicistroviridae and tetraviridae; subfamily densovirinae), fungal viruses (mainly of the families chrysoviridae, hypoviridae, partitiviridae and totiviridae), phages (the order caudovirales and families inoviridae and microviridae) and plant viruses (matched reads of the non-mammalian virus families are provided in supplementary table s ), an overview of the reads of families of mammalian viruses in each pooled sample is shown in figure a and supplementary table s . in addition, an overview of the classification from family to genus of the identified bat viruses is shown in figure b. the reads related to the family parvoviridae comprised the largest proportion of viruses as shown in figure , most of which were classified into the subfamily densovirinae. the dominating abundance of insect densoviruses was associated with the insectivorous habits of bats. the existence and prevalence of virus strains in of the families were confirmed. in total, positive results were confirmed by pcr screening (positive rates are shown in supplementary table s ) and viruses of representative positive samples were selected for genomic sequencing as quasi-species of these viruses (supplementary table s ; the full-length sequenced viruses are labeled in red).
the most widely distributed families of mammalian viruses were herpesviridae, papillomaviridae, retroviridae, adenoviridae and astroviridae. the diverse reads related to these families occupied~ % of the total viral sequence reads (supplementary table s ). the assessment of diverse reads of bat herpesviruses, bat papillomaviruses, bat retroviruses, bat astroviruses and bat adenoviruses (supplementary figure s ) , and full-length sequenced representative strains of these viruses revealed that most of these viruses were distinct from each other within each family ( %- % nt identities). in addition to the above families, many reads related to the families circoviridae, paramyxoviridae, coronaviridae, caliciviridae, polyomaviridae, rhabdoviridae, hepeviridae, bunyaviridae, reoviridae, flaviviridae and picornaviridae, and the subfamily parvovirinae, exhibited low nt or aa sequence identities with known viruses. although sequence reads related to the families orthomyxoviridae and hepadnaviridae were occasionally present in some of the samples, we failed to amplify any sequences of viruses in these families, which may have been a result of very low viral loads. although samples from frugivorous bats of two species (rousettus leschenaultia and cynopterus sphinx) were collected for virome analysis, only a few reads related to herpesvirus, papillomavirus and retrovirus were found. this finding revealed that the virome of frugivorous bats is far less abundant than that of insectivorous bats in this study. bat main single-stranded rna viruses (coronaviridae, paramyxoviridae and picornaviridae) the family coronaviridae contains two subfamilies, coronavirinae and torovirinae. the subfamily coronavirinae includes four approved genera, alphacoronavirus, betacoronavirus, deltacoronavirus and gammacoronavirus. it is a group of large enveloped viruses with a positive single-stranded rna genome (~ to kb in length). 
six α- and β-covs (hku , oc , nl , e, sars-cov and mers-cov) are human pathogens that cause mild-to-severe disease (corman et al., b; li et al., ; cui et al., ; o'shea et al., ). here, novel bat covs (btcovs) were identified separately in bat species of genera from provinces (supplementary table s and supplementary figure s ). covs from seven bat species are reported for the first time. phylogenetic trees were constructed based on the deduced rna-dependent rna polymerase (nsp ) (figure a) and spike (s) (figure b) proteins. the sequence identities of these btcovs are shown in supplementary table s . eleven btcovs were assigned to a group with lineage-b beta-covs and three btcovs were assigned to a group with lineage-c beta-covs. the btvs-betacov/sc identified in vespertilio superans bats had a closer genetic relationship with mers-cov than with other btcovs (supplementary figure s ), as well as with neocov identified in african neoromicia capensis bats (corman et al., a). one btcov, bthp-betacov/zj , represented a separate clade related to lineage b. the overall nt identity of this cov genome with lineage-b covs was only % and the rna-dependent rna polymerase and s proteins shared only % and % aa identities with those of lineage-b covs. this cov contained an unusual putative s-related orf between orf ab and the s gene (supplementary figure s and supplementary table s ) (quan et al., ). a signal peptide ( - aa) and a transmembrane region ( - aa) were identified in the s-related orf , suggesting that this protein might be a surface protein. the remaining btcovs were all assigned to the genus alphacoronavirus and formed many novel separate clades.
for btmr-alphacov/sax and btnv-alphacov/sc , although the phylogenetic tree constructed based on rna-dependent rna polymerase indicated that these two viruses were clustered with hku , the phylogenetic tree based on the s protein indicated that these two viruses represented two separate clades far from other covs, suggesting that recombination may have occurred in their genomes. although the orf ab, e, m and n genes of btrf-alphacov/hub and btms-alphacov/gs shared very high sequence identities (higher than %), the s genes of these two viruses shared only % nt identity. a similar phenomenon was observed between btrf-alphacov/yn and hku . the results of sample-by-sample screening of bat samples from the same gathering place revealed that btcov strains of the same species identified in the same cave had significantly diverse features. some key gene segments, such as the s and orf genes, presented great diversity with low sequence identities (supplementary figure s ), indicating that these two genes are hypervariable regions within the btcov genome. similar to recently reported sars-like covs (sl-covs) (wiv , rs and lyra ) (ge et al., ; he et al., ), two btcovs, btrs-betacov/yn and btrs-betacov/gx , identified in rhinolophus sinicus from the yunnan and guangxi provinces shared the highest similarities with sars-covs in the backbone (including orf ab, e, m and n genes), orf , orf a, orf b and orf genes compared with other bat lineage-b β-covs (supplementary tables s -s and supplementary figures s a and b and s ). furthermore, the s genes of lineage-b β-covs from r. sinicus had much higher genetic diversity and were scattered among the phylogenetic clades of sars-covs and lineage-b covs from other bat species (supplementary figure s c). co-infections of cov strains of sublineages and of group in miniopterus fuliginosus were detected in two anal specimens collected in guangdong and henan.
the covs of sublineage with highly similar backbone sequences presented differing degrees of variation in the s region. recombination was confirmed by similarity plots, bootscan analysis and detection of putative breakpoints around the s regions in the genomes of lineage covs in m. fuliginosus and m. pusillus (supplementary table s and supplementary figure s ).

the family paramyxoviridae is a group of large enveloped viruses with negative-sense single-stranded rna genomes (~ to kb in length) that are responsible for a variety of mild-to-severe human and animal diseases (mayo, ; smith and wang, ). bat paramyxoviruses (btparavs) were separately identified in bat species from provinces of china (supplementary table s and supplementary figure s ). the full-length sequences of viruses (btmf-parav/ah , btml-parav/qh and btha-parav/gd ) were almost completely determined (supplementary figure s and supplementary table s ) and other viruses were partially or completely sequenced in the l gene. twelve of the novel btparavs identified in the different bat species could be clustered together and formed a separate phylogenetic clade. alternatively, these btparavs could be classified into a potential separate genus, shaanvirus, with btmf-parav/ah and btml-parav/qh as prototypes. the genomic organizations of these two viruses were similar to those of three previously reported members of the genus jeilongvirus. however, the genomes of btms-parav/anhui and btml-parav/qh were shorter in length than those of the three rodent viruses and the sequence identities were low (supplementary table s ). the remaining two novel btparavs could be clustered together with rubulaviruses as new species (figure c). the central domain of the n protein contained three conserved motifs common to all paramyxoviruses, and the six conserved domains within the l proteins of the order mononegavirales (lau et al., ) could be found in all three full-length sequenced viruses.
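for the recombination analysis of the miniopterus covs reported earlier in this section, the idea behind a similarity plot can be illustrated with a short sketch. this is a simplified stand-in and not the software used in the study: it computes plain percent identity in sliding windows along two already-aligned sequences, whereas dedicated tools typically apply a substitution model, and bootscan analysis additionally relies on bootstrapped trees per window. the function name `window_identity` and the `window` and `step` parameters are hypothetical.

```python
from typing import List, Tuple

# Simplified similarity-plot sketch: percent identity per sliding window
# between a query and a reference taken from the same multiple alignment.
def window_identity(query: str, reference: str,
                    window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window midpoint, percent identity) pairs along the alignment."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    points: List[Tuple[int, float]] = []
    for start in range(0, len(query) - window + 1, step):
        matches = 0
        compared = 0
        for a, b in zip(query[start:start + window],
                        reference[start:start + window]):
            if a == "-" or b == "-":
                continue  # ignore alignment columns containing a gap
            compared += 1
            if a.upper() == b.upper():
                matches += 1
        identity = 100.0 * matches / compared if compared else float("nan")
        points.append((start + window // 2, identity))
    return points
```

a sharp drop in identity to one reference, paired with a rise in identity to another over the same windows, is the kind of pattern that would flag a putative breakpoint, such as those reported around the s region.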
picornaviruses (picovs) of the family picornaviridae are small, non-enveloped, positive single-stranded rna viruses with a genome of - kb in size. the members of the family picornaviridae cause mucocutaneous, encephalic, cardiac, hepatic, neurological and respiratory diseases in a wide variety of vertebrate hosts (tracy et al., ; wang et al., ). nineteen bat picovs (btpicovs) were identified in bat species from provinces (supplementary table s and supplementary figure s ). phylogenetic analysis of the rna-dependent rna polymerase genes was conducted (figure d). seven btpicovs could be clustered with three previously reported btpicovs (clade ) and could then be divided into separate sub-clades according to their host genera. clade contained two sub-clades formed by four btpicovs of two bat genera. clade was formed by two btpicovs identified in different bat genera and clustered in a sister relationship with the genus sapelovirus. clade contained two btpicovs identified in the same bat genus that were closely related to the genus kobuvirus. the previously reported miniopterus schreibersii picov- was closely related to the genera cardiovirus and senecavirus. two btpicovs, btrf-picov- /yn and btmf-picov- /sax , showed lower aa identities with other known picovs. different btpicovs identified from the same bat genus, such as rhinolophus or miniopterus, in different locations showed very close genetic relationships. the aa identities of the predicted rna-dependent rna polymerase proteins of these novel btpicovs were low compared with known picovs (supplementary table s ). the predicted p , p and p regions and cleavage sites of these viruses showed typical features of picovs (supplementary table s ).

the family circoviridae is a group of viruses with small, non-enveloped, circular single-strand dna genomes of . - kb in length (fauquet and fargette, ). porcine circovirus (cv)- is the main swine pathogen (chae, ).
thirty-four novel bat cvs (btcvs) were identified in bat species (supplementary table s and supplementary figure s ). the genome sizes of these viruses varied from to nts. seven new clades of btcvs in the genera circovirus and cyclovirus were found. clade was formed by two viruses clustered in a sister relationship with two pathogenic viruses, porcine and dog cvs (chae, ; li et al., ), at the same root with a short branch length. three btcvs were closely related to human cycloviruses. seven btcvs were assigned to the proposed genus cyclovirus and formed four separate clades, three of which were closely related to human cycloviruses. a separate clade, clade , was constructed from two btcvs and was closely related to the genus cyclovirus. in addition to these viruses, novel btcvs branched out of the root of cvs and cycloviruses revealed the presence of new genera different from the known genera (figure a).

viruses of the family parvoviridae comprise a group of small, non-enveloped viruses with linear positive-sense single-stranded dna (~ kb genomes) that infect vertebrate animals and cause mild-to-severe diseases (brown, ). bat parvoviruses (btpvs) and bat bocaviruses were identified in nine bat species from eight provinces (supplementary table s and supplementary figure s ). in addition to viruses in the genera bocavirus, parvovirus and amdovirus, four btpvs were most similar to the recently reported human bufavirus members and the bufavirus-related wuharv parvovirus, with similar ns and vp proteins (higher than % aa identities) (phan et al., ; yahiro et al., ). bthp-pv/gd , in the genus parvovirus, was very closely related to a recently identified rat pv, with % aa identity (figure b). the aa identities among the predicted vp proteins of these btpvs and bat bocaviruses, and other known members of the subfamily parvovirinae, are shown in supplementary table s .
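the nt and aa identity values quoted throughout these results are pairwise comparisons over aligned sequences. a minimal sketch of such a calculation is given below. it makes one assumption that the text does not state: columns where either sequence has a gap are excluded from the denominator (other tools count gaps differently, which changes the percentages). the names `percent_identity` and `identity_matrix` are hypothetical, and the toy sequences in the usage note are invented for illustration only.

```python
from itertools import combinations
from typing import Dict, Tuple

def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences, ignoring gapped columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns do not count
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else float("nan")


def identity_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise identities for every unordered pair of named sequences."""
    return {
        (name_a, name_b): percent_identity(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }
```

for example, calling `identity_matrix({"a": "MKT-LV", "b": "MKTALV", "c": "MRT-LI"})` compares five ungapped columns for each pair; the resulting table is the kind of summary that the supplementary identity tables report for real viral proteins.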
bat adeno-associated viruses have been described previously (li et al., b); however, considering the nonpathogenic nature of adeno-associated viruses, we did not perform further verification of these viruses.

other rare bat viruses

six bat caliciviruses (btcalvs), four bat polyomaviruses, one bat hepatitis e virus, one bat rhabdovirus, one bat bunyavirus, one bat orthoreovirus and one bat rotavirus were identified. phylogenetic

in previous studies, zoonotic viruses in more than virus families have been identified in bats around the world (chen et al., ; o'shea et al., ). two bat virome analyses conducted by li et al. ( a) and donaldson et al. ( ) have revealed the presence of covs, herpesviruses, picovs, cvs, adenoviruses, adeno-associated viruses and astroviruses in some bat species of north america. one bat virome analysis conducted by ge et al. ( ) mainly described insect viruses in some bat species of china. one bat virome analysis conducted by ng et al. ( ) has revealed the presence of a novel rhabdovirus in big brown bats, and one bat virome analysis conducted by he et al. ( ) has described the spectrum of viruses harbored by several bat species in myanmar. different from these previous reports, this study was the first to characterize the pharyngeal and anal virome of representative bat samples in china. we did not perform additional verification of non-mammalian viruses because of the association of the abundance of these viruses (not initially harbored in bats) with their life habits. this report suggests that bats harbor a large spectrum of mammalian viruses. except for a few viruses, such as btcalvs, btcvs and btpvs, which are closely related to known viruses, most of the bat viruses identified here that were widely distributed . these findings reveal that these three bat genera may act as major reservoirs for diverse mammalian viruses in china.
notably, all bats collected in this study were considered to be apparently healthy and showed no overt signs of disease, further confirming that bats can tolerate diverse viruses through their unique metabolic and immune systems (o'shea et al., ). this study extends the host range for members of each viral family and reveals unique ecological and evolutionary characteristics of bat-borne viruses. the diverse btcovs were grouped into several novel evolutionary clades that significantly differed from those of all known α- and β-covs, providing additional evidence to support investigations of the evolution of bat-originated covs. with regard to btparavs, a previous study has revealed that bats host major mammalian paravs in the genera rubulavirus, morbillivirus, henipavirus and the subfamily pneumovirinae (drexler et al., ). however, in this study, except for viruses assigned to the genus rubulavirus, the remaining viruses formed a new genus distant from the known genera and the identified btparavs showed no direct relationship with the known human or animal pathogens of the family paramyxoviridae. these results suggest an entirely different distribution of btparavs in china than previously reported. although the classifications of bat herpesviruses, bat papillomaviruses, bat retroviruses, bat astroviruses, bat adenoviruses, btpicovs, btcvs, btpvs and bat bocaviruses were extended according to the current virus taxonomy file released by the international committee on taxonomy of viruses, the large number of novel viruses grouped into the various evolutionary clades identified in this study further expands the taxa to include many new viral genera and species. many new clades formed by btcvs distinct from all known members of the genera circovirus and cyclovirus could be candidates for many new genera. viruses related to henipaviruses, ebola virus, rabies virus and pathogenic bunyaviruses were not detected in the chinese bat species examined in this study.
diverse herpesviruses and papillomaviruses identified countrywide support the hypothesis that these dna viruses from different bat species are located in different phylogenetic positions within each family without strict host or geographic specificity (garcia-perez et al., ) ; however, many other dna or rna viruses, such as btcovs, btparavs, btpicovs and btpvs or bat bocaviruses, identified from the same or different bat species from different locations shared high sequence identities and close genetic relationships. these phenomena indicate that certain bat-originated dna and rna viruses have the potential for intra-or cross-species transmission concomitant with the migration, co-roosting and intra-or inter-species contact of their bat hosts. the identification of some viruses, such as certain rat pv-related btpv, norovirus-related btcalvs, human or swine cv-related btcvs and bat rotavirus in the rotavirus a group, also provides a new understanding of the evolution of these viruses in different mammalian hosts and possible transmission events that occur between bats and other hosts. furthermore, btcovs had more distinctive features than the other bat viruses. highly diverse s genes or orf s were present in particular covs carried by bats of the same species from different locations or even the same gathering place. considering the diversity of covs, co-infections may create opportunities for recombination and the emergence of new covs that are able to adapt to new hosts. these findings may explain why tracing the potential cov-related eids in insectivorous bats is often complicated by the presence of diverse key genomic segments and no virus with an identical genome sequence related to the pathogens causing human or animal eids has been identified in insectivorous bats. 
instead, the origin of the ebola virus and henipaviruses could be relatively easily confirmed by the identification of identical viruses in frugivorous bats (calisher et al., ; leroy et al., ; smith and wang, ) . only lineage-b β-covs of six bat species (r. ferrumequinum, r. sinicus, r. pusillus, r. macrotis, r. affinis and chaerephon plicata) from china (sl-covs) are closely related to sars-covs, as similar orf ab, e, m and n genes, and the presence of a unique structural orfs (including orf , a, b and ) have been identified (supplementary figure s ) (holmes and enjuanes, ; li et al., ; woo et al., ; quan et al., ; woo et al., ; yang et al., ) . recently, a functionally similar s gene has been identified in sl-covs of r. sinicus and r. affinis (wiv , rs and lyra ) with less sequence identity to the s gene of sars-cov, but which is capable of using the human ace as a receptor for virus entry (ge et al., ; he et al., ) . however, knowledge gaps exist between bat sl-covs and sars-covs with regard to the s gene and the unique structural orf that prevent the determination of which bat virus species is the direct ancestor of sars-cov. in this study, two btcovs (btrs-beta-cov/yn and btrs-betacov/gx ) in chinese horseshoe bats (r. sinicus) provided bat-originated unique structural orfs that were nearly identical to the original sars-cov isolated during the earliest phase of the sars pandemic (after transmission to humans, this region of sars-cov experiences ongoing adaptive evolution in humans, with gradual deletion ( )), providing some information to fill the knowledge gap with regard to the origin of human sars-covs. 
in addition, the higher similarities of the backbones of these two btcovs to sars-covs, and the highly diverse s genes present in only r. sinicus imply that frequent recombination events occur among sl-covs of r. sinicus and other hosts, suggesting that the transmission of sl-covs from r. sinicus to other mammals is a result of the viruses obtaining novel s genes. these data indicate that human sars-covs most likely originated through zoonotic transfer, either directly or indirectly, from chinese horseshoe bats via a complicated adaptation process and a series of rare recombination events. although the transmission of mers-cov has been confirmed by the detection of identical mers-cov sequences in dromedary camels and humans (azhar et al., ; chu et al., ; meyer et al., ), the zoonotic transmission of this virus from dromedaries to humans is still considered to be rare (hemida et al., ) and the wildlife source of mers-covs remains unknown. the data obtained here and a recently reported lineage-c btcov, neocov, identified by another group (corman et al., a; yang et al., a) provide new clues about the sources and pathways of human- and camel-derived mers-covs in bats of the family vespertilionidae. bat species of this family may have important roles in mers-cov evolution (yang et al., b). furthermore, the diverse s genes of these mers-related covs may provide an opportunity for their recombination to ultimately generate new covs. in conclusion, the understanding of the viral community characteristics, genetics and ecological distribution of bat viruses could enable the rapid identification of novel viruses with variant genomes and could thus facilitate the tracing of eids in bats. furthermore, this strategy could be extended to other wildlife or livestock worldwide, ultimately increasing knowledge of the viral population and ecological community, thus minimizing the impact of potential wildlife-originated eids on public health by providing meaningful basic data. the authors declare no conflict of interest.
a strategy to estimate unknown viral diversity in mammals; evidence for camel-to-human transmission of mers coronavirus; the expanding range of parvoviruses which infect humans; bats: important reservoir hosts of emerging viruses; a review of porcine circovirus -associated syndromes and diseases; dbatvir: the database of bat-associated viruses; molecular evolution of the sars coronavirus during the course of the sars epidemic in china; mers coronaviruses in dromedary camels; rooting the phylogenetic tree of mers-coronavirus by characterization of a conspecific virus from an african bat; characterization of a novel betacoronavirus related to middle east respiratory syndrome coronavirus in european hedgehogs; evolutionary relationships between bat coronaviruses and their hosts; emerging infectious diseases of wildlife-threats to biodiversity and human health; metagenomic analysis of the viromes of three north american bat species: viral diversity among different bat species that share a common habitat; bats host major mammalian paramyxoviruses; international committee on taxonomy of viruses and the , unassigned species; novel papillomaviruses in free-ranging iberian bats: no virus-host co-evolution, no strict host specificity, and hints for recombination; metagenomic analysis of viruses from the bat fecal samples reveals many novel viruses in insectivorous bats in china; isolation and characterization of a bat sars-like coronavirus that uses the ace receptor; virome profiling of bats from myanmar by metagenomic analysis of tissue samples reveals more novel mammalian viruses; identification of diverse alphacoronaviruses and genomic characterization of a novel severe acute respiratory syndrome-like coronavirus from bats in china; lack of middle east respiratory syndrome coronavirus transmission from infected camels; virology. the sars coronavirus: a postgenomic era; global trends in emerging infectious diseases; identification and complete genome analysis of three novel paramyxoviruses, tuhoko virus , and , in fruit bats from china; fruit bats as reservoirs of ebola virus; bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses; circovirus in tissues of dogs with vasculitis and hemorrhage; bats are natural reservoirs of sars-like coronaviruses; prevalence and genetic diversity of adenoassociated viruses in bats from china; epidemic dynamics at the human-animal interface; a comparison of bats and rodents as reservoirs of zoonotic viruses: are bats special?; hendra and nipah viruses: why are they so deadly?; a summary of taxonomic changes recently approved by ictv; antibodies against mers coronavirus in dromedary camels; metavelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads; distinct lineage of vesiculovirus from big brown bats, united states; bat flight and zoonotic viruses; idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth; acute diarrhea in west african children: diverse enteric viruses and a novel parvovirus genus; identification of a severe acute respiratory syndrome coronavirus-like virus in a leaf-nosed bat in nigeria; bats are a major natural reservoir for hepaciviruses and pegiviruses; bats and their virome: an important source of emerging viruses capable of infecting humans; evolution of virulence in picornaviruses; hepatitis a virus and the origins of picornaviruses; origins of major human infectious diseases; coronavirus diversity, phylogeny and interspecies jumping; discovery of seven novel mammalian and avian coronaviruses in the genus deltacoronavirus supports bat coronaviruses as the gene source of alphacoronavirus and betacoronavirus and avian coronaviruses as the gene source of gammacoronavirus and deltacoronavirus; virome analysis for identification of novel mammalian viruses in bat species from chinese provinces; novel henipa-like virus, mojiang paramyxovirus, in rats; novel human bufavirus genotype in children with severe diarrhea, bhutan; unbiased parallel detection of viral pathogens in clinical samples by use of a metagenomic approach; novel sars-like betacoronaviruses in bats; mers-related betacoronavirus in vespertilio superans bats; receptor usage and cell entry of bat coronavirus hku provide insight into bat-to-human transmission of mers coronavirus
key: cord- -vvtv b authors: nikaido, masato; kondo, shinji; zhang, zicong; wu, jiaqi; nishihara, hidenori; niimura, yoshihito; suzuki, shunta; touhara, kazushige; suzuki, yutaka; noguchi, hideki; minakuchi, yohei; toyoda, atsushi; fujiyama, asao; sugano, sumio; yoneda, misako; kai, chieko title: comparative genomic analyses illuminate the distinct evolution of megabats within chiroptera date: - - journal: dna res doi: . /dnares/dsaa sha: doc_id: cord_uid: vvtv b the revision of the sub-order microchiroptera is one of the most intriguing outcomes in recent mammalian molecular phylogeny. the unexpected sister–taxon relationship between rhinolophoid microbats and megabats, with the exclusion of other microbats, suggests that megabats arose in a relatively short period of time from a microbat-like ancestor. in order to understand the genetic mechanism underlying adaptive evolution in megabats, we determined the whole-genome sequences of two rousette megabats, leschenault’s rousette (rousettus leschenaultii) and the egyptian fruit bat (r. aegyptiacus). the sequences were compared with those of other mammals, including nine bats, available in the database. 
we identified that megabat genomes are distinct in that they have extremely low activity of sine retrotranspositions, expansion of two chemosensory gene families, including the trace amine receptor (taar) and olfactory receptor (or), and elevation of the dn/ds ratio in genes for immunity and protein catabolism. the adaptive signatures discovered in the genomes of megabats may provide crucial insight into their distinct evolution, including key processes such as virus resistance, loss of echolocation, and frugivorous feeding. bats belong to the order chiroptera and have the ability of powered flight. accounting for one-fifth of all mammals in terms of the number of species, bats are one of the most successful groups of mammals. it is of primary interest for biologists to identify the processes and mechanisms of dynamic adaptation in bats. traditionally, morphological and paleontological analyses placed the order chiroptera within the superorder archonta (primates, dermoptera, chiroptera, and scandentia). however, dna sequencing data has challenged the validity of the archonta, and alternatively proposed the inclusion of bats into laurasiatheria (cetartiodactyla, perissodactyla, carnivora, pholidota, chiroptera and eulipotyphla). [ ] [ ] [ ] [ ] although laurasiatheria is now considered to be a natural assemblage, the phylogenetic position of bats within laurasiatheria remains to be resolved. , the paraphyly of microbats is also under debate. traditionally, morphological studies proposed the sub-division of the order chiroptera into two suborders: microchiroptera (microbats) and megachiroptera (megabats or old-world fruit bats). microbats use ultrasonic echolocation for flight and for foraging in the night, whereas megabats do not echolocate, and primarily use vision to fly and feed on fruits and/or nectars. megabats are also neuroanatomically distinct from microbats, as megabats have a developed visual system. 
molecular data suggests that five lineages of microbats, including rhinopomatidae, rhinolophidae, hipposideridae, craseonycteridae, and megadermatidae, are more closely related to megabats than to other microbats. therefore, the five lineages of rhinolophoid microbats and megabats were re-classified as 'yinpterochiroptera' and the remaining microbats as 'yangochiroptera'. , , thus, recent molecular studies suggest that several adaptive characteristics specific to megabats have emerged within a short period of time from a microbat-like ancestor. genome-wide analyses have been used to identify the unique evolution of bats in several studies. seim et al.'s study determined the genome sequence of one microbat (brandt's bat) and found the signatures for adaptive evolution in genes related to physiology and longevity. zhang et al. determined the genome sequences of one microbat (david's myotis) and one megabat (black flying fox) and found that genes for flight and immunity evolved due to positive selection. parker et al. identified the genomes of three microbats, including the greater horseshoe bat, the greater false vampire bat, and parnell's mustached bat, and one megabat, the straw-coloured fruit bat. in comparing the genomes of these bats with those of other mammals, this study identified that genes related to hearing/deafness showed convergent evolution among echolocating mammals. pavlovich et al. recently determined the whole genome of the egyptian fruit bat (r. aegyptiacus), which is a natural reservoir for the marburg virus, and revealed that the genes for immunity were expanded and diversified, suggesting an antiviral mechanism that is used to control viral infection. 
in particular, because bats are natural hosts of zoonotic viruses, including henipaviruses, filoviruses, and coronaviruses, which are emerging viruses with high fatality rates, comparative genomic studies in bats may provide an effective solution against the current global pandemic of coronavirus disease- . in this study, we determined the genome sequences of two rousette megabats, leschenault's rousette (rousettus leschenaultii) and the egyptian fruit bat (r. aegyptiacus). we assessed the genomic signatures for the process of natural selection that facilitates the dynamic and adaptive evolution of megabats. in particular, the main aim of determining the whole-genome sequence of the egyptian fruit bat, in addition to the previous study, was to obtain higher-quality genome data, which facilitates more accurate and comprehensive gene annotations, especially for multi-gene families. in addition, the genome sequence of leschenault's rousette, which belongs to the same genus as the egyptian fruit bat, is of interest for identifying genomic differences between closely related bat species. these genome sequences were compared with those of mammals, including six microbats and three megabats, available in the database. we used genome-wide phylogenetic analyses, followed by candidate gene analyses focussed on retroposons and chemosensory multi-gene families for taste, olfaction, and pheromone detection. we also performed global positive selection analyses. as a result, the inter-relationships among laurasiatheria were consistently reconstructed, with the order eulipotyphla diverging first, followed by the divergence of chiroptera and the remaining groups, including cetartiodactyla, perissodactyla, pholidota, and carnivora. the reciprocal monophyly of yinpterochiroptera and yangochiroptera was also shown with reliable statistical support. 
we revealed several notable distinct features in megabat genomes, including extremely low activity of sine retrotranspositions and the expansion of the genes for the trace amine receptor (taar) and olfactory receptor (or). additionally, the signatures for positive or relaxed selection were observed in genes for immunity and protein catabolism. thus, our comparative genomic analyses may illuminate the genetic mechanisms underlying the dynamic adaptation of megabats during diversification in the order chiroptera. egyptian fruit bats (r. aegyptiacus) and leschenault's rousettes (r. leschenaultii), both of which were provided by ueno zoo, were maintained under controlled conditions using an air conditioner and moisture chamber. the animals were kept in steel cages and fed fruit and water at the same time every day. all experiments were performed in accordance with the animal experimentation guidelines of the university of tokyo and were approved by the institutional animal care and use committee of the university of tokyo. for egyptian fruit bats, we prepared kidney-derived primary cultured cells. a pregnant egyptian fruit bat was deeply anesthetized with isoflurane, the uteri were surgically removed, and the animal was euthanized by bleeding. the kidney from the fetus was fragmented with scissors and treated with tryple (gibco). the fragmented kidney was then cultured in dmem containing % fetal calf serum to obtain primary cultured cells. genomic dna was extracted from the frozen spleen tissue or cultured kidney cells of two individuals of egyptian fruit bat, and from frozen kidney tissue of one individual of leschenault's rousette, using a blood & cell culture dna kit (qiagen, hilden, germany), according to the manufacturer's protocol with minor laboratory customizations; details are available upon request. the dna samples (> kb) were sequenced as described below after quality and quantity checks. 
to construct paired-end sequencing libraries, the genomic dna was fragmented using a covaris s focussed-ultrasonicator (covaris, woburn, ma, usa). the paired-end libraries were constructed using the truseq dna pcr-free library prep kit (illumina, san diego, ca, usa). mate pair libraries were prepared from genomic dna using the nextera mate pair sample preparation kit (illumina, san diego, ca, usa). all libraries were sequenced on an illumina-hiseq system using rapid-mode chemistry with paired-end sequencing. prior to assembly, data pre-processing was performed. first, the adapter sequences were trimmed using the fastq-clipper ea-utils v . . , setting the parameters to '-p -m -l '. second, we filtered the reads mapped to the mitochondrial genome using bwa-aln v . . with default parameters. finally, we performed base error correction using soapec v . with the parameters '-k -l '. we then assembled the reads using platanus v . . with default parameters. contamination candidates were removed by mapping to escherichia coli and phix genomes using blastn v . . , setting the parameters to '-e e- '. the statistics of the genome assemblies and the information of sequence libraries are summarized in supplementary tables s - and s - . in order to test the quality of the reference assembly in the egyptian fruit bat, we additionally constructed a fosmid library, which was end-sequenced using abi xl sequencers. the protein-coding genes in the genomes of egyptian fruit bat and leschenault's rousette were identified based on the alignment with annotated gene sequences of mammals (cat, dog, horse, cow, hedgehog, human, macaque, mouse, rat, black flying fox, little brown bat, brandt's bat, david's myotis, and large flying fox; supplementary table s ) that are available in the database. the sequences for each gene of the mammals were aligned to the two bat genomes by using blat to identify approximate gene loci. 
the blat alignments of the gene sequences to the genomes were refined by the exonerate software to estimate the exon-intron boundaries. in addition to the homology-based identification, rna-seq-based transcript reconstruction and ab initio gene prediction were performed to identify the protein-coding genes. rna of primary culture cells from the kidney of the egyptian fruit bat was extracted by using trizol reagent (thermo fisher). a total of , , paired-end reads of mrna (illumina-hiseq, bp) were aligned to the genomes using tophat. in total, , , and , , paired-end reads could be mapped to the genome sequences of r. aegyptiacus and r. leschenaultii, respectively. transcript structures were reconstructed using augustus based on the tophat alignment of the illumina reads to the bat genomes. the expression levels of the reconstructed genes were computed using cufflinks based on the tophat alignment of the illumina reads to the genomes. a total of , genes were expressed with fragments per kilobase of transcript per million reads mapped (fpkm) ≥ in the kidney-derived primary cultured cells. examples are shown in supplementary fig. s . ab initio genes were obtained by using genscan and snap. the genomic sequences were cut into seven-megabase-long fragments, and genscan was run on each fragment. the genes identified were assigned to gene loci based on the overlap of exons on the same strand, and the redundancies of the transcripts were removed. only transcripts annotated with the start codon (atg) and introns flanked by canonical splice dinucleotide pairs (gt-ag, gc-ag, and at-ac) were retained. a total of , and , transcripts were annotated over , and , gene loci, respectively, on the genomes of the egyptian fruit bat and leschenault's rousette. the completeness of the gene determination was evaluated by using busco. 
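the retention rule described above (a start codon plus canonical splice dinucleotide pairs) can be sketched in a few lines of python. this is a minimal illustration, not the authors' pipeline: the function names and the input layout (a coding sequence string plus a list of intron sequences) are assumptions made for the example.

```python
# Canonical (donor, acceptor) dinucleotide pairs named in the text.
CANONICAL_SPLICE_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_is_canonical(intron_seq):
    """Return True if the intron begins and ends with a canonical pair."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        return False  # too short to carry both dinucleotides
    return (seq[:2], seq[-2:]) in CANONICAL_SPLICE_PAIRS


def keep_transcript(cds_seq, intron_seqs):
    """Hypothetical filter: keep a transcript only if its coding sequence
    starts with ATG and every intron has canonical splice dinucleotides."""
    if not cds_seq.upper().startswith("ATG"):
        return False
    return all(intron_is_canonical(intron) for intron in intron_seqs)
```

a single-exon transcript has no introns, so it passes the splice check trivially and is judged on its start codon alone.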
similarly, we assessed protein-coding genes on the genomes of four other bat species, the straw-coloured fruit bat, the greater false vampire bat, the greater horseshoe bat, and the straw-coloured fruit bat. due to the fragmented nature of these genome assemblies (n : - kb), however, we did not use the thresholds of initial codon and splice sites as used in the annotation of the genomes of the egyptian fruit bat and leschenault's rousette. we identified the longest orf in each transcript mapped by exonerate by using transdecoder (https://github.com/transdecoder/transdecoder/blob/master/transdecoder.longorfs) and used it as the gene annotation. we identified , - , transcripts in , - , gene loci on these genomes (supplementary table s ). the ratio of complete genes among the annotated genes evaluated by busco was . - . %. in addition to this annotation, the tandemly duplicated receptor genes, including ors, taste receptors (t rs and t rs), vomeronasal receptors (v rs and v rs), formyl peptide receptors (fprs), and taars, were annotated separately. olfactory receptors were identified by the method described previously. the other receptor genes were identified using another protocol. in short, we obtained protein sequences of mammalian t rs, t rs, fprs, v rs, v rs, and taars from the ncbi refseq database (https://www.ncbi.nlm.nih.gov/refseq/). redundant sequences sharing more than % identity, as identified by cd-hit, were removed to establish representative query sequences. for t rs, we used only the transmembrane regions as query sequences. we used the ncbi conserved domains database to annotate the -transmembrane domains of t rs. using the query sequences, we performed a tblastn search against the whole-genome sequence assemblies available in genbank (https://www.ncbi.nlm.nih.gov/genbank/). the taxonomic classification and the accession numbers of the whole-genome sequences are summarized in supplementary table s . 
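to make the idea of taking the longest orf per transcript concrete, the following python sketch scans the three forward reading frames for the longest atg-to-stop stretch. it is a simplified stand-in and does not reproduce transdecoder's actual algorithm (which, for instance, also considers the reverse strand and partial orfs); the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG...stop ORF on the forward strand, or ''.

    Simplified illustration: forward strand only, complete ORFs only.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:pos + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close this ORF and look for the next ATG
    return best
```

for example, `longest_orf("CCATGAAATTTTAGGG")` finds the orf in the third frame and returns `"ATGAAATTTTAG"`.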
the exon-intron structure of each sequence, which was obtained by tblastn, was predicted with the exonerate program using translated query sequences as protein models. the resulting hit sequences were classified into 'intact', 'truncated', and 'pseudo-genes'. due to an assembly issue, the 'truncated' genes included poly 'n' sequences. in order to estimate the gene copy numbers, in these analyses, we treated the 'truncated' genes as 'putatively intact'. the pseudo-genes include inactivating mutations in the coding region. the resulting genes were assessed to determine whether they encode the chemosensory receptors of interest using blastx searches and ghostz against the uniref database (https://www.uniprot.org/help/uniref). we used the framework for annotating translatable exons (fate), which is available on github (https://github.com/hikoyu/fate), for the automation of the procedures described above. we constructed a phylogenetic tree based on the single-copy orthologous gene sets of mammals, as previously reported by wu et al., to elucidate the phylogenetic relationships of megabats with other mammals. briefly, the nucleotide sequences of the , protein-coding genes of the two megabat species and other mammalian species (supplementary table s ) were aligned using the prank software v. at the codon level. sites that are shared by < % of the species were removed from the alignment. among the , genes, , genes were listed for all species, and were used for the analyses. the gene tree was constructed using raxml software, v . . using the gtr+Γ+i model with , bootstrap replicates for each , gene. we collected the best tree for all , genes, which were used to infer the coalescent species tree with branch length by astral-iii. the node support of the species tree was obtained by , replicates of bootstrapping. branch length shown in the tree indicates the branch length in coalescent units. 
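the site-filtering step mentioned above (removing alignment sites shared by too few species) can be illustrated as follows. this is a sketch under stated assumptions: the occupancy threshold is not recoverable from the text, so `min_fraction` is a free parameter; gaps are assumed to be coded as `-` or `?`; and columns are filtered one by one, whereas a codon-level alignment would more plausibly be filtered in triplets.

```python
def filter_sparse_columns(alignment, min_fraction, missing="-?"):
    """Drop alignment columns present in fewer than min_fraction of species.

    alignment: dict mapping species name -> aligned sequence (equal lengths).
    min_fraction: hypothetical occupancy threshold between 0 and 1.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for col in range(length):
        present = sum(1 for s in seqs if s[col] not in missing)
        if present / len(seqs) >= min_fraction:
            keep.append(col)
    return {name: "".join(alignment[name][c] for c in keep) for name in names}
```

with four species and `min_fraction=0.5`, a column is kept when at least two species carry a residue there.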
we used the genome of leschenault's rousette for the identification of tes, based on two approaches, including de novo characterization of tes and identification of homologous copies of known tes in another megabat, the large flying fox (supplementary table s ). in the first approach, repeatmodeler ver. . . (http://www.repeatmasker.org/repeatmodeler.html) was used to obtain a collection of repetitive sequences. for each of the preliminary consensus sequences, we conducted a local nucleotide blast search (r = , g = , e = , with an e-value cutoff of − ) and collected - copies along with their -kp flanking sequences. the copy sequences were aligned using mafft ver. . the alignment was manually modified using mega . and a consensus sequence was re-constructed. the consensus sequence was used for the next round of the blast search, as described above, to obtain additional copies. this procedure was repeated until a full-length consensus sequence was completed. the full-length tes were characterized and classified based on the sequence structure, including terminal inverted repeats and long terminal repeats (ltrs), coding proteins such as transposase and reverse transcriptase, and by comparison with known elements using repeatmasker ver. . . (http://repeatmasker.org), censor, and rtclass . for the second approach, a te library of another megabat, the large flying fox, including te families which were obtained from repbase, was used as a query for a homology search against leschenault's rousette. the local nucleotide blast search, alignment of the copy sequences, and reconstruction of the consensus sequence were conducted as described above. a similar blast search was also conducted using te families of a microbat (the little brown bat, supplementary table s ) library; however, no additional novel tes were found except in the results from the two approaches listed above. 
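the consensus-reconstruction step in the iterative procedure above can be sketched with a plain majority rule over aligned copies. note that the text describes manual curation of the alignment, so this automated rule is only a stand-in for illustration, and the function name is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_copies, gap="-"):
    """Build a consensus from equal-length aligned TE copies.

    For each column the most frequent character wins; columns where the
    gap character is most frequent are dropped from the consensus.
    """
    if not aligned_copies:
        return ""
    length = len(aligned_copies[0])
    if any(len(s) != length for s in aligned_copies):
        raise ValueError("all aligned copies must have the same length")
    consensus = []
    for col in range(length):
        counts = Counter(s[col].upper() for s in aligned_copies)
        base, _ = counts.most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)
```

in an iterative workflow of the kind described, the returned consensus would serve as the query for the next search round until it stops growing.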
all of the newly characterized te (sub) families were designated in conformity with the repbase classification. the repeat contents of the two rousettus genomes were estimated using repeatmasker with the sensitive option (-s) of cross-match search using the rousettus repeat library which we developed here. the te contents, such as the number of copies and length, were summarized based on their divergence (%) from the consensus sequence at the family/subfamily levels by using in-house perl scripts. the te contents of other species were summarized based on the repeatmasker output (http://www.repeatmasker.org/genomic datasets/rmgenomicdatasets.html). orthologous genes under relaxed selection on megabat lineages were identified from the aligned , single-copy genes. on every alignment, we used the codeml branch model in paml . to detect the elevation of the dn/ds ratio (the non-synonymous substitution rate to the synonymous substitution rate) on stem and crown megabat branches. the species tree shown in fig. was used as a guide tree in the analysis. likelihood ratio tests and inspections of the p-value were used to compare likelihoods between two models: (i) that assumed the megabat lineages as foreground branches; and (ii) that assumed the dn/ds ratio was not altered in all branches (null hypothesis), to evaluate the significance of the elevation of the dn/ds ratio for megabat branches. we performed further analyses for the genes of interest using the codeml branch-site models for analysing the positive selection on each site. in the branch-site test, we tested stem and crown megabats as the foreground branches and used microbats and outgroup species, including human, macaque, mouse, rat, cat, dog, chinese pangolin, sunda pangolin, bottlenose dolphin, cow, horse, hedgehog, asian musk shrew, and common shrew, as background branches. 
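the likelihood ratio test used to compare the two codeml models can be written out explicitly. the sketch below assumes that the test statistic is twice the log-likelihood difference and that it is compared against a chi-square distribution with one degree of freedom (the two branch models differ by one dn/ds parameter); the degrees of freedom and the function name are assumptions of this example, not details given in the text.

```python
import math


def lrt_pvalue_df1(lnl_null, lnl_alt):
    """Likelihood ratio test with one degree of freedom.

    lnl_null, lnl_alt: maximized log-likelihoods of the null and the
    alternative model. Returns (statistic, p_value).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        # The alternative fits no better than the null.
        return 0.0, 1.0
    # Chi-square survival function for df = 1: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

for instance, log-likelihoods of -1000.0 (null) and -998.0 (alternative) give a statistic of 4.0 and a p-value of about 0.0455.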
for the branch-site test, we used two models for the analysis, including one model of a null hypothesis that assumes that the gene was under two types of selective pressures (purifying selection and neutral selection), and one model that used an alternative hypothesis to assume the gene was under three categories of selective pressures, including positive selection on the megabat branches. the likelihood ratio test comparing the likelihoods of these two models was used to evaluate the significance of the alternative model. to assess the functionality of positively selected sites, protein structure deposited in protein data bank (pdb) was used. the protein structures were depicted using the open-source version of pymol. we constructed draft genomes of the egyptian fruit bat and leschenault's rousette by assembling short read data into contigs and scaffolding them using platanus v . . . the genome of the egyptian fruit bat is composed of . gbp with , scaffolds (n = . mbp) and the genome of leschenault's rousette is composed of . gbp with , scaffolds (n = . mbp) (supplementary tables s - and s - ). the high qualities of the two genomes are demonstrated by the ratios of complete genes, which are . and . %, respectively, as evaluated by busco (supplementary table s ). the quality of both genomes in terms of the continuity of the scaffolds and the rate of n is high enough to facilitate genome-wide evolutionary analyses and characterization of multi-gene families. in addition, independent genome assemblies and gene annotations of the two individuals of egyptian fruit bat determined in the previous study and this study may be utilized as an initial step towards the identification of the genotypic, transcriptomic, and phenotypic variation of this species in the future research. figure shows the maximum likelihood phylogenetic tree with the time scale for mammals, including bats (five megabats and six microbats) based on , single-copy orthologous gene sets. 
four species of euarchontoglires, including humans, macaca, mouse, and rat, were used as outgroups. the tree successfully highlights the evolutionary history of laurasiatherian mammals in that eulipotyphla diverged first among them. in this phylogenetic tree, chiroptera diverged after eulipotyphla; however, the bootstrap probability (bp) supporting this node was not so high ( . ). in addition, the grouping of pegasoferae (chiroptera, perissodactyla, and carnivora), which was originally proposed by the insertion of retroposons and supported by several genome-wide analyses, was not supported. given that the bps for the inter-relationships of cetartiodactyla, perissodactyla, (carnivora + pholidota), and chiroptera were relatively low ( . , . ) and the branch lengths were markedly short, it is highly likely that the initial divergence of laurasiatherian mammals occurred rapidly during evolution. such rapid speciation events may hamper reconstruction of the consistent tree topology for these groups. importantly, as it was shown in the previous studies, the reciprocal monophyly of yangochiroptera and yinpterochiroptera was successfully supported in this analysis, suggesting that the megabats are nested in microbat lineages. although it is difficult to estimate the ancestral state in the megabat ancestors due to the rarity of the fossil record, the phylogenetic tree suggests that several distinct characteristics in megabats, including the well-developed visual system, frugivorous diet, and the absence of echolocation, evolved in a short period of time during evolution from a 'microbat-like' ancestor. we next focussed on assessing the signatures for such adaptive evolution in these groups based on the genome-wide comparative analyses. in both the leschenault's rousette and egyptian fruit bat genomes, tes account for ~ % of the genome, including sines ( . %), lines ( %), ltr retrotransposons ( . %), and dna transposons ( . %) (supplementary table s and fig. s ). 
it is notable that the proportions of tes in megabats, including the two rousettus species and the large flying fox, are considerably lower than the levels in other mammals, such as humans, where nearly half of the genome is covered by tes (fig. a and supplementary fig. s ). consistent with previous observations, it is also interesting that the proportion of tes is generally correlated with the genome size in mammals (supplementary fig. s ). co-variation between an accumulation of tes and dna loss by large segmental deletions is considered a major contributing factor in determining the genome size. therefore, the smaller genome sizes in the megabats may be due, at least in part, to a lower activity of tes. indeed, our analysis revealed that the number of young (recently retrotransposed) te copies in the megabat genomes is very small (fig. b and supplementary fig. s ). in general, young tes constitute a few percent of all tes in mammalian genomes, consistent with previous studies; for example, in the microbat myotis lucifugus, the number of tes showing < % divergence from the consensus sequence is , ( . % of all te copies; supplementary fig. s ). however, the copy numbers of young tes are only , ( . %) and , ( . %) for the rousettus species and large flying fox, respectively (fig. b and supplementary fig. s ). the small proportion of young tes is partly accounted for by the low frequency of retrotransposition events in megabat-specific sines (fig. ). in general, different types of sine families are distributed in each mammalian clade, such as an order, sub-order, or family. in megabats, the only known active sines are the s rrna-derived meg sines. it should be noted that rousettus genomes contain no more than , copies of the meg-related sines, which cover . % of the genome. in contrast, clade-specific sines are, in general, retrotranspositionally highly active, with - copies present in each mammalian genome (fig. a).
the large flying fox (pteropus vampyrus) also has only , copies of meg-related sines. based on the wide distribution of meg sines in megabats, including rousettus, macroglossus, eonycteris, and cynopterus, the origin of meg can be traced back to the common ancestor of megabats, which existed at least million years ago. it is possible that such a low retrotranspositional activity of the sines found in rousettus and pteropus is observed widely among megabats. it has been demonstrated that flying vertebrates, including bats, have substantially lost tes and have smaller genome sizes in association with cellular metabolic constraints. , the small proportion of meg sines in the megabats may also be a result of the constraint related to their powered flight. another notable te family is line- (l ), as it has been reported that the retrotranspositional activity of l has been lost in megabats. it is unlikely that the extinction of l resulted from the quiescence of l itself, because a synthesized sequence of the reconstructed megabat l is capable of retrotransposition in human hela cells. in addition, we identified that in addition to l , all types of tes have the least activity in megabats among the mammals investigated (fig. b) . this low activity of young tes may be due to an unknown megabat-specific mechanism for te repression or a result of extensive dna loss during the past tens of millions of years. one of the possible mechanisms by which te activity may be tightly repressed is an antiviral immune system in megabats. suggesting that the egyptian fruit bat may possess a novel mode of antiviral defense, several antiviral-related genes are known to have expanded in this bat. for example, ribonuclease l, an interferoninducible endoribonuclease that cleaves viral rnas, evolved under relaxed selective constraint in bats. ribonuclease l is also known to restrict retrotransposition of human l and mouse iap elements in human cells. 
in addition, several other factors that restrict retrotransposition in humans and mice are known to be involved in an antiviral immune system. thus, it is possible that a unique antiviral mechanism against exogenous parasites (i.e. viruses) is secondarily used for the restriction of the endogenous retroelements. as general mobilization of sines in mammals relies on the l machinery, the restriction of megabat l could limit the meg sine activity. the low activity of tes may partly contribute to the small genome size ( supplementary fig. s ), which could also be advantageous with respect to cell size and metabolic constraints in megabats as well as other flying vertebrates , . therefore, the unusual characteristics of the tes, likely shared among megabats, are an important example to study the molecular mechanisms underlying restriction of retrotransposition. such future studies may shed light on the reason why bats have such compact genomes. it also remains unknown why ves sines in microbats are active, whereas the genome size is relatively small among mammals (fig. ) . the difference in the sine activity between megabats and microbats may be affected by a possibly distinct antiviral immune system between the two groups, given that expansion of some antiviral-related genes occurred specifically in megabats. most of the chemosensory receptors are encoded by multi-gene families, allowing animals to detect highly diversified chemicals in the environment. the previously published studies have shown that the collections of the chemosensory receptor genes are flexible and highly variable among mammals, including the ors, taste receptors (t rs and t rs), vomeronasal receptors (v rs and v rs), fprs, and taars. the number of certain chemosensory receptor gene families has been shown to have a strong correlation with the degree of dependence on these ligand chemicals for survival. 
, , several studies have revealed that bats lost several chemosensory receptor genes, such as t r for umami, and v rs for pheromone(s) that may be due to the specific sensory adaptation in the ancestor of these groups. it is possible that megabats re-allocated the diversity in chemosensory receptor genes as a sensory trade-off, given that megabats have experienced the secondary loss of echolocation ability, which is one of the most specialized senses in bats. to examine this possibility, we comprehensively characterized the chemosensory receptor genes and compared their diversity by focussing on whether or not the repertoires in megabats show notable differences from those in microbats. our comparative genomic analyses of chemosensory receptor genes in the genomes of mammals revealed that the copy number of the intact genes and pseudo-genes show a certain variation among bat species. in t rs, the absence of t r , the umami receptor, in all of the bats that we analysed is consistent with the findings of the previous studies. all megabats possess two t rs (t r and t r ), whereas microbats are somewhat variable, in that they can possess no (greater false vampire bat), one (little brown bat), or two (brandt's bat, greater horseshoe bat) t rs (fig. a and supplementary table s ). it is noteworthy that all megabats possess t r , which is the sweet receptor, suggesting the importance of sweet taste for their frugivorous lifestyle. no intact t rs in the greater false vampire bat could be explained by their specific adaptation for a carnivorous diet, which resembles the blood-feeding activity of the vampire bat (desmodus rotundus), which also lost t rs. , as for t rs, which are bitter taste receptors, the copy numbers are relatively smaller in megabats than those in microbats ( fig. a and supplementary table s ). the smaller number of t rs in megabats can also be explained by their frugivorous diet, as compared with that of microbats, which are mostly insectivores. 
indeed, the repertoires of t rs in primates have a strong correlation with their diet, suggesting the importance of t rs for feeding adaptation in mammals. we identified little variation between megabats and microbats in fprs, which are expressed in the sensory neurons of the vomeronasal organ and mediate innate avoidance behaviours (fig. a supplementary table s ). suggesting that fpr-mediated chemodetection is not directly linked with the difference in their habitats, mega-and microbats both possess two to eight fprs. however, a previous study, by comparing the orthologous sequences among a broad range of mammals, found the signatures for the operation of positive selection in fprs. therefore, to examine the possible contribution of fprs to the adaptive evolution of megabats, more detailed investigation is necessary by focussing on the dn/ds values among orthologous fpr sequences of many bat species, which are lacking at present. there was an extensive reduction in v rs, which are known to be expressed in vno neurons of mammals and detect various pheromones, [ ] [ ] [ ] in both megabats and microbats ( fig. a and supplementary table s ) . especially, only one v r was found in the genomes of megabats. the reduction of v rs revealed in this study is consistent with the findings of the previously published studies. the inactivation of trpc s , and ancv rs, , which is responsible for vno function, suggested the degeneration of vnos in most bat lineages including megabats. although most bats do not possess intact v rs, parnell's mustached bat possesses four intact v rs (fig. a and supplementary table s ) , which is consistent with the presence of the vno in this species. in addition, recent study has suggested that there are a substantial number of v rs in distantly related groups of phyllostomids and miniopterids, which possess an intact vno, suggesting that they retained v r-mediated chemical communication. 
, therefore, the ancestor of all extant bats is expected to possess an intact vno, as well as a certain number of v rs, that were independently degenerated after the divergence of each family, including megabats (pteropodidae). namely, the loss of echolocation and the degeneration of the vno occurred spontaneously in the ancestor of megabats. v rs are expressed in the basal region of the vno neurons , , and peptide pheromones were detected in mice. , however, intact v rs have been identified only in a limited number of mammals, such as rodents, mouse lemurs, and opossum. our comprehensive analysis failed to find intact putative v rs in the genomes of all bats and most of other mammals. this result suggests that, before the acquisition of the echolocation ability, the v r-mediated pheromone detection system has already been lost in the common ancestor of all extant bat lineages. it is noteworthy that the hedgehog and the horse possess seven and one intact v rs, respectively ( fig. a and supplementary table s ). this provides the first description of intact v rs in the genomes of laurasiatherian mammals. more detailed analyses may provide insight into the v r-mediated pheromone detection system in these species. one of the most intriguing results in the chemosensory receptor genes was obtained from taars. trace amine receptors have been believed to function as receptors for trace amines, for example, tyramine and octopamine in the brain. however, a recent study revealed that taars may be expressed primarily or exclusively in the moe, and are responsible for detecting volatile amines, including ethological odors that evoke innate animal behavioural responses. in this study, we revealed that the number of taars was increased in the common ancestor of megabats. in particular, the number of taars, which were identified to be from five to seven copies in microbats, increased to more than copies in megabats. 
notably, leschenault's rousette possesses putatively intact ( intact and truncated) taars, which is the largest number identified among mammals (fig. a; supplementary tables s and s ). the phylogenetic analyses of intact taars for the mammals clearly demonstrated that the expansion of the genes in the two rousettus bats, the egyptian fruit bat and leschenault's rousette, occurred in subfamilies seven and eight in a species-specific manner (fig. b; supplementary fig. s and table s ). eyun et al. also reported a high copy number of taars in one megabat, the large flying fox; however, the repertoire was quite different from that of these two rousettus bats (fig. a; supplementary tables s and s ). although taars were expanded in subfamilies seven and eight in the two rousettus species, they were expanded only in subfamily seven in the java fruit bat. the number of intact genes, as well as the pseudo-genes, was highly variable among the megabats, suggesting that birth and death of taars were quite active.
figure caption (fragment): the number of intact, truncated, and pseudo-genes is indicated in blue, yellow, and red, respectively. we treated the truncated genes as 'putatively intact'. the dotted lines show the variation in the number of intact + 'putatively intact' genes among mammals. it should be noted that the number of taars is obviously higher in megabats than in microbats. (b) phylogenetic tree of intact taars in mammals. only the intact genes were included in the tree. the taars of the egyptian fruit bat and leschenault's rousette are indicated by the square (green) and triangle (blue). it is obvious that the taars of subfamilies seven and eight were expanded in two rousettus bats. zebrafish taar c in the ncbi database was used as an outgroup. mouse taar - in the ncbi database was used as an indicator for each taar subfamily. accession codes for these database-derived genes are available in supplementary fig. s .
phylogenetic, as well as copy number, analyses suggest that taars have made a large contribution to some process of adaptive evolution and diversification of megabats. interestingly, pavlovich et al. revealed the expansion of mhc genes in the genome of the egyptian fruit bat, suggesting novel modes of antiviral defense. thus, the mhc genes and taars were both expanded in megabats. santos et al. reported that taars may be a key mediator in mhc-dependent mating choices in the sac-winged bat (saccopteryx bilineata). based on these findings, it is possible that the megabats use diversified taars for mate choice, by taking advantage of mhc-related molecules that are also diversified. functional experiments investigating taars and mating in megabats may provide insight into the possible link between taars and mhc genes. ors, which are expressed in the moe, have undergone extensive expansion and contraction that may be associated with environmental adaptations. in ors, we also revealed a notable increase of the genes in megabats, which is more evident in the two rousettus bats, the egyptian fruit bat and leschenault's rousette (fig. a and supplementary table s ). although the copy numbers of putatively intact (intact and truncated) ors span from to in microbats, those of megabats range from to . the increase in the number of ors in megabats may be a signature of reallocation in response to the loss of the echolocation ability in the megabat ancestor. hayden et al. identified convergent or patterns linked to frugivorous diet in megabats and new world fruit-eating microbats (phyllostomids). given that the increase in the ors is more extensive, these patterns of ors may be linked not only to the frugivorous diet, but also to some other roles, such as predator avoidance and social communication.
by extensively analysing the copy-number variations of chemosensory receptor genes between megabats and microbats, we revealed obvious differences in taars and ors, both of which are expressed in the moe. it is possible that the contraction of vno-mediated chemo-detection and echolocation in megabats led to the expansion of chemo-detection genes expressed in the moe. in addition, it is noteworthy that the repertoires of taars and ors were obviously differentiated even between the two closely related species belonging to rousettus, suggesting that birth and death of these genes are quite active in this genus (fig. a and b; supplementary tables s and s ). these results raise the possibility that the two rousettus bats are particularly dependent on olfaction through taars and ors. in addition to the candidate approach, which focussed on retroposons and chemosensory receptor genes, we also performed global analyses on the protein-coding genes of megabats. the elevation of dn/ds ratios was examined for the , single-copy orthologous genes using the branch model of codeml implemented in paml . . the likelihood ratio tests identified a significant elevation of dn/ds ratios (p < . ) in genes (supplementary table s ). as shown by the enrichment analyses for the resultant genes using webgestalt, the elevation of the dn/ds ratios in megabats was remarkable in genes related to the immune system and protein catabolism (table and supplementary table s ). the elevation of the dn/ds ratios in immune system genes has been reported in several comparative genomic analyses on mammals, including the pangolin, microbat, and megabat. notably, microbats and pangolins have recently begun to attract attention as possible host reservoirs of sars-related coronaviruses responsible for the current outbreak of coronavirus disease- (covid- ). pavlovich et al.
revealed the episodic evolution of immune response genes in egyptian rousette, a natural reservoir of marburg virus, by showing an unusual expansion of ngk , cd , mhc, and ifn gene families. we revealed the episodic evolution by showing the elevation of dn/ds ratios in many immune response genes in megabat lineages (table ). the tolerance for zoonotic viruses without overt pathology in bats is consistent with the episodic evolution in immune response genes. namely, co-evolution of viruses and the immune system in these species may be facilitated by adaptive evolution. further molecular biological and physiological investigations of these candidate genes are of primary importance in elucidating how bats tolerate infections by various zoonotic viruses.
figure . positively selected sites in hadh on megabat lineages. (a) in protein metabolism, hadh is involved in the degradation of ile, val, lys, tyr and transforms these factors into acetyl-coa or succinyl-coa for the tca cycle (https://www.genome.jp/dbget-bin/www_bget?hsa: ). (b) the sequence alignment between the positively selected sites in hadh in the megabat lineages and microbats and human hadh. the codon alignment of all hadh sequences used in this study is available in supplementary alignment file s . the sites were identified by the branch-site model on paml. positively selected sites are highlighted in yellow (p > %) and red (p > %). (c) positively selected residues on megabat lineages are mapped on the human hadh dimer (pdb: f y). the a chain is presented as a spherical model (yellow and red). the hadh dimer a chain is shown as a cartoon model (white) and the b chain is shown as a surface model (gray). the ligands of hadh, nad, and acetoacetyl-coa are shown as a stick model (blue and orange, respectively).
interestingly, the elevation of the dn/ds ratio of protein catabolism was also reported in the tyrosine aminotransferase gene (tat) in megabats. to further investigate the evolution of the protein catabolism pathway in megabats, we focussed on another representative gene, -hydroxyacyl-coa dehydrogenase (hadh), in which the elevation of the dn/ds ratio was significant in the branch model (table ; supplementary tables s and s ). hadh is involved in the degradation of ile, val, lys, and tyr to convert them into energy via the citric acid (tca) cycle (fig. a). the branch-site test for hadh (fig. and supplementary table s ) revealed that seven sites were positively selected with a posterior probability (p) of > %, including three sites with a p of > % (fig. b). the likelihood for the operation of positive selection was not significant, as only a few sites were detected as positively selected ( %, supplementary table s ). we then mapped the positively selected sites on the human hadh dimer structure (pdb: f y, fig. c). although the positively selected sites were not located on the ligand (nad and caa) binding sites, it was of interest that four sites (r , e , a , and l ) were located on the dimer interface (fig. c). the mutations on these four residues change electric charges or polarities, such as r y, e n, a s, and l s, suggesting that dimer formation is likely to be disrupted and enzyme catalysis impaired. shen et al. identified the significantly low activity of tat in megabats and proposed that the elevation of the dn/ds ratio in tat may reflect the relaxation of purifying selection in response to their frugivorous diet. megabats may utilize the ingested proteins for the synthesis of new proteins, rather than for energy production through catabolism, as their diets, which include fruits and nectar, are rich in carbohydrates but poor in protein. accordingly, it is possible that the megabats are less dependent on the protein catabolism pathway.
in this study, we provide additional and comprehensive evidence suggesting that the evolutionary constraints on genes for protein catabolism were relaxed due to the adaptation to frugivorous diets. in summary, our comparative genomic analyses revealed several distinct signatures of adaptive evolution in megabats. (i) the activity of tes is considerably lower than in other mammals, which is possibly related to a defense mechanism against viruses. the small size of the genomes, which may be due to the low activity of tes, could be advantageous in association with the cellular metabolic constraints of flying organisms. (ii) taars and ors, which function in the neurons of the moe, show specific expansions, implying an important contribution of olfaction to their adaptation processes. (iii) positive selection in genes for immunity may suggest the coevolution of the immune system and viruses, providing crucial insights into the mechanism of asymptomatic infection of bats, as host reservoirs, by zoonotic viruses. (iv) positive selection in genes for protein catabolism is consistent with frugivorous feeding, which is one of the adaptive characters specific to megabats.
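the likelihood ratio tests used above for the branch and branch-site models can be illustrated with a minimal sketch. it assumes that the log-likelihoods of the null and alternative models have already been obtained (for example from codeml output) and that the test statistic is compared with a chi-square distribution; the function names, the default of one degree of freedom, and the log-likelihood values in the usage note are hypothetical and are not taken from the study.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        terms = (half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * sum(terms)
    # Odd df: complementary error function plus a finite correction sum.
    extra = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra


def likelihood_ratio_test(lnl_null, lnl_alt, df=1):
    """Return (statistic, p-value) for two nested models.

    lnl_null and lnl_alt are log-likelihoods; df is the difference in the
    number of free parameters (assumed to be 1 here for illustration).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf(stat, df)
```

for instance, with hypothetical log-likelihoods of -1000.0 (null) and -997.0 (alternative), `likelihood_ratio_test(-1000.0, -997.0)` gives a statistic of 6.0 and a p-value of about 0.014, so the alternative model would be preferred at the 5% level.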
bats-biology and behavior mammalian phytogeny: shaking the tree complete mitochondrial genome of a neotropical fruit bat, artibeus jamaicensis, and a new hypothesis of the relationships of bats to other eutherian mammals monophyletic origin of the order chiroptera and its phylogenetic position among mammalia, as inferred from the complete sequence of the mitochondrial dna of a japanese megabat, the ryukyu flying fox molecular evidence regarding the origin of echolocation and flight in bats parallel adaptive radiations in two major clades of placental mammals pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions phylogenomic analysis resolves the interordinal relationships and rapid diversification of the laurasiatherian mammals phylogenetic relationships of icaronycteris, archaeonycteris, hassianycteris, and palaeochiropteryx to extant bat lineages, with comments on the evolution of echolocation and foraging strategies in microchiroptera primate-like retinotectal decussation in an echolocating megabat, rousettus aegyptiacus integrated fossil and molecular data reconstruct bat echolocation maximum likelihood analysis of the complete mitochondrial genomes of eutherians and a reevaluation of the phylogeny of bats and insectivores genome analysis reveals insights into physiology and longevity of the brandt's bat myotis brandtii comparative analysis of bat genomes provides insight into the evolution of flight and immunity genome-wide signatures of convergent evolution in echolocating mammals the egyptian rousette genome reveals unexpected features of bat antiviral immunity a pneumonia outbreak associated with a new coronavirus of probable bat origin command-line tools for processing biological sequencing data fast and accurate short read alignment with burrows-wheeler transform soapdenovo : an empirically improved memory-efficient short-read de novo assembler efficient de novo assembly of highly heterozygous genomes from 
whole-genome shotgun short reads basic local alignment search tool blat-the blast-like alignment tool automated generation of heuristics for biological sequence comparison tophat: discovering splice junctions with rna-seq using native and syntenically mapped cdna alignments to improve de novo gene finding transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation prediction of complete gene structures in human genomic dna gene finding in novel genomes busco: assessing genome assembly and annotation completeness with single-copy orthologs identification of olfactory receptor genes from mammalian genome sequences acceleration of olfactory receptor gene loss in primate evolution: possible link to anatomical change in sensory systems and dietary transition evolution of vomeronasal receptor (v r) genes in the common marmoset (callithrix jacchus) cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences faster sequence homology searches by clustering subsequences rates of molecular evolution suggest natural history of life history traits and a post-k-pg nocturnal bottleneck of placentals phylogeny-aware alignment with prank raxml version : a tool for phylogenetic analysis and post-analysis of large phylogenies astral-iii: polynomial time species tree reconstruction from partially resolved gene trees gene tree discordance, phylogenetic inference and the multispecies coalescent mafft multiple sequence alignment software version : improvements in performance and usability mega : molecular evolutionary genetic analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods annotation, submission and screening of repetitive elements in repbase: repbasesubmitter and censor simple and fast classification of non-ltr retrotransposons based on phylogeny of their rt domain protein sequences repbase update, a database of repetitive elements in 
eukaryotic genomes paml : a program package for phylogenetic analysis by maximum likelihood the pymol molecular graphics system a genomic approach to examine the complex evolution of laurasiatherian mammals a molecular phylogeny for bats illuminates biogeography and the fossil record characterization of the mitochondrial genome of rousettus leschenaulti dynamics of genome size evolution in birds and mammals transposable elements as genetic accelerators of evolution: contribution to genome size, gene regulatory network rewiring, and morphological innovation pinpointing the vesper bat transposon revolution using the miniopterus natalensis genome s rrna-derived and trna-derived sines in fruit bats origin of avian genome size and structure in non-avian dinosaurs palaeogenomics of pterosaurs and the evolution of small genome size in flying vertebrates loss of line- activity in the megabats reviving the dead: history and reactivation of an extinct l viral encounters with , -oligoadenylate synthetase and rnase l during the interferon antiviral response rnase l restricts the mobility of engineered retrotransposons in cultured human cells restricting retrotransposons: a review line-mediated retrotransposition of marked alu sequences the evolution of animal chemosensory receptor gene repertoires: roles of chance and necessity dramatic variation of the vomeronasal pheromone receptor gene repertoire among five orders of placental and marsupial mammals a cluster of olfactory receptor genes linked to frugivory in bats genomic and genetic evidence for the loss of umami taste in bats extreme variability among mammalian v r gene families prenatal development supports a single origin of laryngeal echolocation in bats evolution of the sweet taste receptor gene tas r in bats frequent expansions of the bitter taste receptor gene repertoire during evolution of mammals in the euarchontoglires clade formyl peptide receptor-like proteins are a novel family of vomeronasal chemosensors 
adaptive evolution of formyl peptide receptors in mammals molecular organization of vomeronasal chemoreception from genes to social communication: molecular sensing by the vomeronasal organ evolution of v r pheromone receptor genes in vertebrates: diversity and commonality widespread losses of vomeronasal signal transduction in bats trpc pseudogenization dynamics in bats reveal ancestral vomeronasal signaling, then pervasive loss a single pheromone receptor gene conserved across million years of vertebrate evolution inactivation of ancv r as a predictive signature for the loss of vomeronasal system in mammals vomeronasal organ in bats and primates: extremes of structural variability and its phylogenetic implications expressed vomeronasal type- receptors (v rs) in bats uncover conserved sequences underlying social chemical signaling a novel family of putative pheromone receptors in mammals with a topographically organized and sexually dimorphic distribution a multigene family encoding a diverse array of putative pheromone receptors in mammals the male mouse pheromone esp enhances female sexual receptive behaviour through a specific vomeronasal receptor sexual rejection via a vomeronasal receptor-triggered limbic circuit first evidence for functional vomeronasal receptor genes in primates comparative genomic analysis identifies an evolutionary shift of vomeronasal receptor gene repertoires in the vertebrate transition from water to land a renaissance in trace amines inspired by a novel gpcr family a second class of chemosensory receptors in the olfactory epithelium trace amine-associated receptors: ligands, neural circuits, and behaviors molecular evolution and functional divergence of trace amine-associated receptors mhc-dependent mate choice is linked to a trace-amine-associated receptor gene in a mammal webgestalt, a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit pangolin genomes and the evolution of mammalian scales 
and immunity isolation of sars-cov- -related coronavirus from malayan pangolins identifying sars-cov- -related coronaviruses in malayan pangolins adaptive evolution in the glucose transporter gene slc a in old world fruit bats (family: pteropodidae) the authors thank mr. yujiro kawabe for the illustration of the animals in fig. . computations were partially performed on the nig supercomputer at rois national institute of genetics. all nucleotide sequence reads and the genome assembly have been deposited in the ddbj sequence read archive (sra) for egyptian fruit bat (dra ) and for leschenault's rousette (dra ). the sequences of fosmid library of egyptian fruit bat were also deposited in the database (ddbj accession nos. ga -ga ). the raw data for the rna-seq analyses of the kidney-derived primary cultured cells of the egyptian fruit bat has been deposited in ddbj sra (dra ). none declared. the supplementary data are available at dnares online. key: cord- - tg up authors: zheng, fan; zhang, she; churas, christopher; pratt, dexter; bahar, ivet; ideker, trey title: identifying persistent structures in multiscale ‘omics data date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: tg up in any ‘omics study, the scale of analysis can dramatically affect the outcome. for instance, when clustering single-cell transcriptomes, is the analysis tuned to discover broad or specific cell types? likewise, protein communities revealed from protein networks can vary widely in sizes depending on the method. here we use the concept of “persistent homology”, drawn from mathematical topology, to identify robust structures in data at all scales simultaneously. application to mouse single-cell transcriptomes significantly expands the catalog of identified cell types, while analysis of sars-cov- protein interactions suggests hijacking of wnt. the method, hidef, is available via python and cytoscape. 
significant patterns in data often become apparent only when looking at the right scale. for example, single-cell rna sequencing data can be clustered coarsely to identify broad categories of cells (e.g. mesoderm, ectoderm), or analyzed more finely to delineate highly specific subtypes (e.g. pancreas islet β-cells, thymus epithelium) [ ] [ ] [ ] . likewise, protein-protein interaction networks can reveal groups of proteins spanning a wide range of spatial dimensions, from protein dimers (e.g. leucine zippers) to larger complexes of dozens or hundreds of subunits (e.g. proteasome, nuclear pore) to entire organelles (e.g. centriole, mitochondria) [ ] . many different approaches have been devised or applied to detect structures in biological data, including standard clustering, network community detection, and low-dimensional data projection [ ] [ ] [ ] , some of which can be tuned for sensitivity to objects of a certain size or scale (so-called 'resolution parameters') [ , ] . even tunable algorithms, however, face the dilemma that the particular scale(s) at which the significant biological structures arise are usually unknown in advance. guidelines for detecting robust patterns across scales come from the field of topological data analysis, which studies the geometric "shape" of data using tools from algebraic topology and pure mathematics [ ] . a fundamental concept in this field is "persistent homology" [ ] , the idea that the core structures intrinsic to a dataset are those that persist across different scales. recently, this concept has begun to be applied to analysis of 'omics data and particularly biological networks [ , ] . here, we sought to integrate concepts from persistent homology with existing algorithms for network community detection, resulting in a fast and practical multiscale approach we call the hierarchical community decoding framework (hidef). hidef works in three phases to analyze the structure of a biological dataset (methods). 
to begin, the dataset is formulated as a similarity network, depicting a set of biological entities (e.g. genes, proteins, cells, patients, or species) and pairwise connections among these entities (representing similarities in their data profiles). the goal of the first phase is to detect network communities, i.e. groups of densely connected biological entities. communities are identified continually as the spatial resolution is scanned, producing a comprehensive pool of candidates across all scales of analysis (fig. a) . in the second phase, candidate communities arising at different resolutions are pairwise aligned to identify those that have been redundantly identified and are thus persistent (fig. b) . in the third phase, persistent communities are analyzed to identify cases where a community is fully or partially contained within another (typically larger) community, resulting in a hierarchical assembly of nested and overlapping biological structures ( fig. c,d) . hidef is implemented as a python package and can be accessed interactively in the cytoscape network analysis and visualization environment [ ] (availability of data and materials). we first explored the idea of measuring community persistence via analysis of synthetic datasets [ ] in which communities were simulated and embedded in the similarity network at two different scales (supplementary fig. a; methods) . notably, the communities determined to be most persistent by hidef were found to accurately recapitulate the simulated communities at the two scales (supplementary fig. b-g) . in contrast, applying community detection algorithms at a fixed resolution had limited capability to capture both scales of simulated structures simultaneously (supplementary fig. ; methods) . we next evaluated whether persistent community detection improves the characterization of cell types. 
we applied hidef to detect robust nested communities within cell-cell similarity networks based on the mrna expression profiles of , single cells gathered across the organs and tissues of mice (obtained from two datasets in the tabula muris project [ ] ; methods). these cells had been annotated with a controlled vocabulary of cell types from the cell ontology (co) [ ] , via analyses of cell-type-specific expression markers [ ] . we used groups of cells sharing the same annotations to define a panel of reference cell types and measured the degree to which each reference cell type could be recapitulated by a hidef community of cells (methods). we compared these results to toomanycells [ ] and conos [ ] , two recently developed methods that generate nested communities of single cells in divisive and agglomerative manners, respectively (methods). reference cell types tended to better match communities generated by hidef than those of other approaches, with % ( / ) having a highly overlapping community (jaccard index > . ) in the hidef hierarchy ( fig. a,b, supplementary fig. a,b) . this favorable performance was observed consistently when adjusting hidef parameters to formulate a simple hierarchy, containing only the strongest structures, or a more complex hierarchy including additional communities that are less persistent but still significant (fig. c, supplementary fig. c) . the top-level communities in the hidef hierarchy corresponded to broad cell lineages such as "t cell", "b cell", and "epidermal cell". finer-grained communities mapped to more specific known subtypes (fig. d) or, more frequently, putative new subtypes within a lineage. for example, "epidermal cell" was split into two distinct epidermal tissue locations, skin and tongue; further splits suggested the presence of still more specific uncharacterized cell types (fig. e) . hidef communities also captured known cell types that were not apparent from d visual embeddings (supplementary fig. 
a,b) , and also suggested new cell-type combinations. for example, astrocytes were joined with two communities of neuronal cells to create a distinct cell type not observed in the hierarchies of toomanycells [ ] , conos [ ] , or a two-dimensional data projection with umap [ ] (fig. f, supplementary fig. c ). this community may correspond to the grouping of a presynaptic neuron, postsynaptic neuron, and a surrounding astrocyte within a so-called "tripartite synapse" [ ] . next, we applied hidef to analyze protein-protein interaction networks, with the goal of characterizing protein complexes and higher-order protein assemblies spanning spatial scales. we benchmarked this task by the agreement between hidef communities and the gene ontology (go) [ ] , a database that manually assigns proteins to cellular components, processes, or functions based on curation of literature (methods). application to protein-protein interaction networks from budding yeast and human found that hidef captured knowledge in go more significantly than previous pipelines proposed for this task, including the nexo approach to hierarchical community detection [ ] and standard hierarchical clustering of pairwise protein distances calculated by three recent network embedding approaches [ ] [ ] [ ] (fig. a, fig. ) . we also applied hidef to analyze a collection of human protein interaction networks [ , ] . we found significant differences in the distributions of community sizes across these networks, loosely correlating with the different measurement approaches used to generate each network. for example, bioplex . , a network characterizing biophysical protein-protein interactions by affinity-purification mass-spectrometry (ap-ms) [ ] , was dominated by small communities of - proteins, whereas a network based on mrna coexpression [ ] tended towards larger-scale communities of > proteins. 
in the middle of this spectrum, the string network, which integrated biophysical protein interactions and gene co-expression with a variety of other features [ ] , contained both small and large communities (fig. c) . in agreement with the observation above, the hierarchy of bioplex had a relatively shallow shape in comparison to that of string (and other integrated networks including giant and pcnet [ , ] ), in which communities across many scales formed a deep hierarchy (fig. d ,e; availability of data and materials). in contrast to clustering frameworks, hidef recognizes when a community is contained by multiple parent communities, which in the context of protein-protein networks suggests that the community participates in diverse pleiotropic biological functions. for example, a community corresponding to the mapk (erk) pathway participated in multiple larger communities, including ras and rsk pathways, sodium channels, and actin capping, consistent with the central roles of mapk signaling in these distinct biological processes [ ] (supplementary fig. ) . the hierarchies of protein communities identified from each of these networks have been made available as a resource in the ndex database [ ] (availability of data and materials). to explore multiscale data analysis in the context of an urgent public health issue, we considered a recent application of ap-ms that characterized interactions between the sars-cov- viral subunits and human host proteins [ ] . we used network propagation to select a subnetwork of the bioplex . human protein interactome [ ] proximal to these proteins ( proteins and , interactions) and applied hidef to identify its community structure (methods). among the persistent communities identified (fig. f) , we noted one consisting of human transducin-like enhancer (tle) family proteins, tle , tle , and tle , which interacted with sars-cov nsp , a highly conserved rna synthesis protein in corona and other nidoviruses (fig. g) [ ] . 
tle proteins are well-known inhibitors of the wnt signaling pathway [ ] . inhibition of wnt, in turn, has been shown to reduce coronavirus replication [ ] and has recently been proposed as a covid- treatment [ ] . if interactions between nsp and tle proteins can be shown to facilitate activation of wnt, tles may be of potential interest as drug targets. community persistence provides a basic metric for distilling biological structure from data, which can be tuned to select only the strongest structures or to include weaker patterns that are less persistent but still significant. this concept applies to diverse biological subfields, as demonstrated here for single-cell transcriptomics and protein interaction mapping. while these subfields currently employ very different analysis tools which largely evolve separately, it is perhaps high time to seek out core concepts and broader fundamentals around which to unify some of the ongoing development efforts. to that effect, the methods explored here are widely applicable to analyzing the multiscale organization of many other biological systems, including those related to chromosome organization, the microbiome and the brain. consider an undirected network graph $G$, representing a set of biological objects (vertices) and a set of similarity relations between these objects (edges). examples of interest include networks of cells, where edges represent pairwise cell-cell similarity in transcriptional profiles characterized by single-cell rna-seq, or networks of proteins, where edges represent pairwise protein-protein biophysical interactions. we seek to group these objects into communities (subsets of objects) that appear at different scales and identify approximate containment relationships among these communities, so as to obtain a hierarchical representation of the network structure. the workflow is implemented in three phases. phase i identifies communities in $G$ at each of a series of spatial resolutions $\gamma$. 
phase ii identifies which of these communities are persistent by way of a pan-resolution community graph $G_C$, in which vertices represent communities, including those identified at each resolution, and each edge links pairs of similar communities arising at different resolutions. persistent communities correspond to large components in $G_C$. phase iii constructs a final hierarchical structure $H$ that represents containment and partial containment relationships (directed edges) among the persistent communities (vertices). community detection methods generally seek to maximize a quantity known as the network modularity, as a function of community assignment of all objects [ ] . a resolution parameter integrated into the modularity function can be used to tune the scale of the communities identified [ , , ] , with larger/smaller scale communities having more/fewer vertices on average (fig. a) . of the several types of resolution parameter that have been proposed, we adopted that of the reichardt-bornholdt configuration model [ ] , which defines the generalized modularity as:

$$Q(G, \vec{c}, \gamma) = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \gamma \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

where $\vec{c}$ defines a mapping from objects in $G$ to community labels; $k_i$ is the degree of vertex $i$; $m$ is the total number of edges in $G$; $\gamma$ is the resolution parameter; $\delta(c_i, c_j)$ indicates that vertices $i$ and $j$ are assigned to the same community by $\vec{c}$; and $A$ is the adjacency matrix of $G$. to determine two values satisfying the above formula are defined as -proximal. the sampling step, which was practically set to . to sufficiently capture the interesting structures in the data; it is conceptually similar to the nyquist sampling frequency in signal processing [ ] . we used $\gamma_{min}$ = . , which we found always resulted in the theoretical minimum number of communities, equal to the number of connected components in $G$. we used $\gamma_{max}$ = for single-cell data ( fig. 
to identify persistent communities, we define the pairwise similarity between any two communities $a$ and $b$ as the jaccard similarity of their sets of objects, $s(a)$ and $s(b)$:

$$J(a, b) = \frac{|s(a) \cap s(b)|}{|s(a) \cup s(b)|}$$

we initialize a hierarchical structure represented by $H$, a directed acyclic graph (dag) in which each vertex represents a persistent community. a root vertex is added to represent the community of all objects. the containment relationship between two vertices, $v$ and $w$, is quantified by the containment index (ci):

$$CI(v, w) = \frac{|s(v) \cap s(w)|}{|s(w)|}$$

which measures the fraction of objects in $w$ shared with $v$. an edge is added from $v$ to $w$ in $H$ if $CI(v, w)$ is larger than a threshold $\sigma$ ($w$ is $\sigma$-contained by $v$). since $J(v, w) < \tau$ for all $v$, $w$ (a property established by the procedure for connecting similar communities in phase ii, where $\tau$ denotes the similarity threshold used there), setting $\sigma \geq 2\tau/(1 + \tau)$ guarantees $H$ to be acyclic. in practice we used a relaxed threshold $\sigma = \tau$, which we found generally maintains the acyclic property but includes additional containment relations. in the (in our experience rare) event that cycles are generated in $H$, i.e. $CI(v, w) \geq \sigma$ and $CI(w, v) \geq \sigma$, we add a new community to $H$, the union of $v$ and $w$, and remove $v$ and $w$ from $H$. finally, redundant relations are removed by obtaining a transitive reduction [ ] of $H$, which represents the hierarchy returned by hidef describing the organization of communities. the biological objects assigned to each community are expanded to include all objects assigned to its descendants. throughout this study, we used the parameters = . , = , = . note that since is a threshold of minimum persistence, the results under a larger value of ′ can be produced by simply removing communities with persistence lower than ′ (figs. c, a- fig. ). different combinations of parameters and typically do not significantly change the performance of hidef in the benchmark tests on protein-protein interaction networks (supplementary fig. ), except that certain parameters (e.g. = . ) are less robust to network perturbation (i.e. randomly deleting edges from networks). 
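as a rough illustration of the containment step described above, the following python sketch computes the jaccard similarity and the containment index for communities stored as sets, adds a directed edge whenever the containment index reaches a threshold, and prunes redundant edges with a simple transitive reduction. the function names, the toy communities, and the threshold value of 0.75 are illustrative assumptions; none of them is taken from the hidef package, and the sketch does not handle the cycle-merging case.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity of two collections of objects."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment_index(parent, child):
    """Fraction of the child's objects that are shared with the parent."""
    parent, child = set(parent), set(child)
    return len(parent & child) / len(child) if child else 0.0


def containment_edges(communities, threshold):
    """Return directed (parent, child) edges whose containment index
    reaches the threshold. `communities` maps a name to a set of objects."""
    edges = set()
    for v, w in permutations(communities, 2):
        if containment_index(communities[v], communities[w]) >= threshold:
            edges.add((v, w))
    return edges


def transitive_reduction(edges):
    """Drop every edge (u, w) for which w is still reachable from u
    through some other path. Assumes the edge set is acyclic."""
    children = {}
    for u, w in edges:
        children.setdefault(u, set()).add(w)

    def reachable(start, target, skipped_edge):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in children.get(node, ()):
                if (node, nxt) == skipped_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(u, w) for u, w in edges if not reachable(u, w, (u, w))}


# Toy example (hypothetical data): a root, one mid-level community
# with a nested sub-community, and one separate community.
example_communities = {
    "root": {1, 2, 3, 4, 5, 6},
    "a": {1, 2, 3, 4},
    "b": {1, 2},
    "c": {5, 6},
}
example_edges = containment_edges(example_communities, threshold=0.75)
example_hierarchy = transitive_reduction(example_edges)
```

in this toy case the direct edge from the root to the nested community "b" is removed by the reduction, because "b" is already reached through "a"; this is the sense in which redundant relations are dropped.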
we found that combining hidef with node embedding resolved this issue and further improved the performance and robustness (supplementary fig. ; see sections below). simulated network data were generated using the lancichinetti-fortunato-radicchi (lfr) method [ ] (supplementary figs. , ) . we used an available implementation (lfr benchmark graphs package at http://www.santofortunato.net/resources) to generate benchmark networks with two levels of embedded communities, a coarse-grained (macro) level and a fine-grained (micro) level. within each level, a vertex was exclusively assigned to one community. two parameters, c and f, were used to define the fractions of edges violating the simulated community structures at the two levels. all other edges were restricted to occur between vertices assigned to the same community (supplementary fig. a) . we fixed other parameters of the lfr method to values explored by previous studies [ ] . some community detection algorithms include iterations of local optimization and vertex aggregation, a process that, like hidef, also defines a hierarchy of communities, albeit as a tree rather than a dag. we demonstrated that without scanning multiple resolutions, this process alone was insufficient to detect the simulated communities at all scales (supplementary fig. ) . we used louvain and infomap [ , ] , which have stable implementations and have shown strong performance in previous community detection studies [ ] . for louvain, we optimized the and other parameters to default. in general, these settings generated trees with two levels of communities. note that infomap sometimes determined that the input network was nonhierarchical, in which cases the coarse-and fine-grained communities were identical by definition. mouse single-cell rna-seq data ( fig. ; supplementary fig. identical analyses were applied to the facs and the droplet datasets respectively, yielding a hierarchy of and communities respectively (fig. d) . scanpy . . 
[ ] was used to create tsne or umap embeddings and associated two-dimensional visualizations [ ] as baselines for comparison (fig. e,f; supplementary fig. a,b) . through previous analysis of the single-cell rna data, all cells in these datasets had been annotated with matching cell-type classes in the cell ontology (co) [ ] . before comparing these annotations with the communities detected by hidef, we expanded the set of annotations of each cell according to the co structure, to ensure the set also included all of the ancestor cell types of the type that was annotated. for example, co has the relationship "[keratinocyte] (is_a) [epidermal_cell]", and thus all cells annotated as "keratinocyte" are also annotated as "epidermal cell". the co was obtained from http://www.obofoundry.org/ontology/cl.html and processed by the data driven ontology toolkit (ddot) [ ] retaining "is_a" relationships only. we compared hidef to toomanycells [ ] and conos [ ] as baseline methods. the former is a divisive method which iteratively applies bipartite spectral clustering to the cell population until the modularity of the partition is below a threshold; the latter uses the walktrap algorithm to agglomeratively construct the cell-type hierarchy [ ] . we chose to compare with these methods because their ability to identify multiscale communities was either the main advertised feature or had been shown to be a major strength. toomanycells (version . . . ) was run with the parameter "min-modularity" set to . as recommended in the original paper [ ] , with other settings set to default. this process generated dendrograms (binary trees) with communities. the walktrap algorithm was run from the conos package (version . . ) with the parameter "step" set to as recommended in the original paper [ ] , yielding a dendrogram. 
the greedymodularitycut method in the conos package was used to select n fusions in the original dendrogram, resulting in a reduced dendrogram with n+ communities (including n internal and n+ leaf nodes). here we used n = , generating a hierarchy with communities (fig. c) . the communities in each hierarchy were ranked to analyze the relationships between celltype recovery and model complexity (fig. c, supplementary fig. c) . hidef communities were ranked by their persistence; conos and toomanycells communities were ranked according to the modularity scores those methods associate with each branch-point in their dendrograms. conos/walktrap uses a score based on the gain of modularity in merging two communities, whereas toomanycells uses the modularity of each binary partition. we obtained a total of human protein interaction networks gathered previously by survey studies [ , ] , along with one integrated network from budding yeast (s. cerevisiae) that had been used in a previous community detection pipeline, nexo [ ] . this collection contained two versions of the string interaction database, with the second removing edges from text mining (labeled string-t versus string, respectively; fig. ). benchmark experiments for the recovery of the gene ontology (go) were performed with string and the yeast network ( fig. a,b, supplementary fig. ) . the reference go for yeast proteins was obtained from http://nexo.ucsd.edu/. a reference go for human proteins was downloaded from http://geneontology.org/ via an api provided by the ddot package [ ] . hidef was directly applied to all of the above benchmark networks. the nexo communities were obtained from http://nexo.ucsd.edu/, with a robustness score assigned to each community. to benchmark communities created by hierarchical clustering, we first calculated three versions of pairwise protein distances (hc. - ; fig. a,b; supplementary fig. ) using mashup, dsd and deepnf [ ] [ ] [ ] . 
mashup was used to embed each protein as a vector, with and dimensions for yeast and human, as recommended in the original paper. a pairwise distance was computed for each pair of proteins as the cosine distance between the two vectors. similarly, deepnf was used to embed each protein into a -dimensional vector by default. dsd generates pairwise distances by default. given these pairwise distances, upgma clustering was applied to generate binary hierarchical trees. following the procedure given in the nexo and mashup papers [ , ], communities with < proteins were discarded. since all methods had slight differences in the resulting number of communities, communities from each method were sorted in decreasing order of score, enabling comparison of results across the same numbers of top-ranked communities. hidef communities were ranked by persistence. nexo communities were ranked by the robustness value assigned to each community in the original paper [ ] . to rank each community c of hierarchical clustering (branch in the dendrogram), a one-sided mann-whitney u-test was used to test for significant differences between two sets of protein pairwise distances: (set ) all pairs consisting of a protein in c and a protein in the sibling community of c; (set ) all pairs consisting of a protein in each of the two children communities of c. the communities were sorted by the one-sided p-value of significance that distances in set are greater than those in set . we adopted the average f1-score metric [ ] to evaluate the overall performance of multiscale structure identification, focusing on the recovery of reference communities. given a set of reference communities $C^*$ and a set of computationally detected communities $\vec{C}$, the score was defined as:

$$\bar{F}_1 = \frac{1}{|C^*|} \sum_{C_i \in C^*} F_1\left(C_i, \vec{C}_{g(i)}\right)$$

where $g(i)$ indexes the best match of $C_i$ in $\vec{C}$, defined as follows:

$$g(i) = \arg\max_{j} F_1\left(C_i, \vec{C}_j\right)$$

and $F_1(C_i, \vec{C}_j)$ is the harmonic mean of precision($C_i$, $\vec{C}_j$) and recall($C_i$, $\vec{C}_j$). 
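the average f1-score described above can be sketched in a few lines of python. this is a minimal illustration, not the xmeasures implementation: it assumes communities are plain sets of identifiers, scores only the recovery of reference communities (the one-directional form stated in the text), and uses hypothetical function names and toy data.

```python
def f1(reference, detected):
    """Harmonic mean of precision and recall between two communities."""
    reference, detected = set(reference), set(detected)
    overlap = len(reference & detected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def average_f1(references, detected):
    """Mean, over reference communities, of the F1 score of each
    reference community with its best-matching detected community."""
    if not references:
        return 0.0
    best_scores = [
        max((f1(ref, det) for det in detected), default=0.0)
        for ref in references
    ]
    return sum(best_scores) / len(best_scores)


# Toy example (hypothetical data): two reference communities and
# three detected communities, one of which matches nothing.
toy_references = [{1, 2, 3, 4}, {5, 6}]
toy_detected = [{1, 2, 3}, {5, 6, 7}, {8}]
toy_score = average_f1(toy_references, toy_detected)
```

because each reference community is scored only against its best match, unmatched detected communities (such as the singleton in the toy data) do not lower the score, which is why the text compares methods at equal numbers of top-ranked communities.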
the calculations were conducted by the xmeasures package (https://github.com/exascaleinfolab/xmeasures) [ ] . hidef was directly applied to the original networks in most of our analyses of protein-protein interaction networks, and compared with the results of hierarchical clustering following the network embedding techniques [ , ] . we sought to explore whether combining the strengths of network embedding and hidef could further improve the performance and robustness to parameter choices (supplementary fig. ) . we borrowed the idea of the shared nearest neighbor (snn) graph that we had been using in the analyses of single-cell data. we wrote a custom script that passes the -dimensional node embeddings of the string network as input to the seurat findneighbors function [ ] . the parameters of this function were left at their defaults. the output snn graph has . × edges, which is of the same order of magnitude as the original network ( . × edges). we then applied hidef to this snn graph with different combinations of parameters ( supplementary fig. ) . human proteins identified to interact with sars-cov- viral protein subunits were obtained from a recent study [ ] . this list was expanded to include additional human proteins connected to two or more of the virus-interacting human proteins in the new bioplex . network [ ] . these operations resulted in a network of proteins and , interactions. hidef was applied to this network with the same parameter settings as for other protein-protein interaction networks (see previous methods sections), and enrichment analysis was performed via g:profiler [ ] (fig. f,g) . not applicable. not applicable. these models include the hierarchy of murine cell types (fig. ) , the hierarchies of yeast and human protein communities identified through protein network analysis, and the hierarchy of human protein complexes targeted by sars-cov (fig. ) . t.i. is cofounder of data cure, is on the scientific advisory board, and has an equity interest. t.i. . 
a yeast network [ ] and the human string network [ ] were used as the inputs of a and b, respectively. hc. - represent upgma hierarchical clustering of pairwise distances generated by mashup, dsd, and deepnf [ ] [ ] [ ] , respectively. c, distributions of community sizes (x-axis, number of proteins) for three human protein networks: bioplex . [ ] , coexpr-geo [ ] , and string [ ] . supplementary figure . exploring simulated networks. a, the lfr generative model [ ] was used to simulate networks with vertices and average degree (methods). the simulation included two layers of communities, "coarse" ( - communities, - vertices per community) and "fine" ( - companion plots to panels (b-d). points represent identified communities, delineated by size (y axis) and persistence (x axis). blue/gray point colors indicate a match/non-match to a true community in the simulated network (jaccard similarity > . ). note that when noise is low (e), the highest persistence communities correctly recover simulated communities with near-perfect accuracy, e.g. for persistence threshold > . hidef is compared with the louvain and infomap algorithms [ , ] , with louvain and infomap fixed at their default single resolutions (methods). the three plots (a-c) compare the performance of the three algorithms in recovering simulated communities at different settings of the coarse/fine mixing parameters (see supplementary fig. clustering following any of three protein pairwise distance functions (mashup, dsd, and deepnf) [ ] [ ] [ ] . using the performance analysis depicted in fig. b , the area under curve (auc) was computed for different sets of hidef parameters (p, ). this auc was compared to that of the best baseline tool, hc. (i.e. hierarchical clustering of pairwise distances generated by deepnf [ ] ) to generate an equal number of communities (methods). note the ratio hidef auc / hc. 
auc is usually higher than , indicating the favorable performance of hidef except for very high values of the t parameter. as per fig. b , the analysis was undertaken using the string network and the go cellular component branch. b, similar analysis with subsampling of network edges (in which a random % of network edges are removed prior to community detection at each resolution). higher persistence (y axis) than a given threshold (x axis). e-f, scatterplots of community size (y axis) versus persistence (x axis). the left column characterizes the single-cell transcriptomics data (fig. , supplementary fig. ) . the right column (panel b, d, f) characterizes the yeast and human protein-protein interaction datasets ( fig. a-b) . the human cell atlas data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis integrating single-cell transcriptomic data across different conditions, technologies, and species molecules into cells: specifying spatial architecture data clustering: a review community detection in networks: a user guide visualizing data using t-sne analysis of the structure of complex networks at different resolution levels van dooren p: significant scales in community structure persistent homology-a survey a topological paradigm for hippocampal spatial map formation using persistent homology homological scaffolds of brain functional networks cytoscape: a software environment for integrated models of biomolecular interaction networks benchmark graphs for testing community detection algorithms organ collection and p, library preparation and s, computational data a, cell type a, writing g, supplemental text writing g, principal i: single-cell transcriptomics of mouse organs creates a tabula muris the cell ontology : enhanced content, modularization, and ontology interoperability toomanycells identifies and visualizes relationships of single-cell clades joint analysis of heterogeneous single-cell rna-seq dataset 
collections dimensionality reduction for visualizing single-cell data using umap tripartite synapses: astrocytes process and control synaptic information gene ontology: tool for the unification of biology. the gene ontology consortium a gene ontology inferred from molecular networks compact integration of multi-network topology for functional analysis of genes going the distance for protein function prediction: a new distance metric for protein interaction networks deepnf: deep network fusion for protein function prediction systematic evaluation of molecular networks for discovery of disease genes assessment of network module identification across complex diseases architecture of the human interactome defines protein communities and disease networks a next generation connectivity map: l platform and the first , , profiles string v : protein-protein association networks with increased coverage, supporting functional discovery in genomewide experimental datasets understanding multicellular function and disease with human tissue-specific networks activation and function of the mapks and their substrates, the mapk-activated protein kinases ndex . 
: a clearinghouse for research on cancer pathways a sars-cov- protein interaction map reveals targets for drug repurposing dual proteome-scale networks reveal cellspecific remodeling of the human interactome the nonstructural proteins directing coronavirus rna synthesis and processing molecular functions of the tle tetramerization domain in wnt target gene repression inhibition of severe acute respiratory syndrome coronavirus replication by niclosamide broad spectrum antiviral agent niclosamide and its therapeutic potential finding and evaluating community structure in networks statistical mechanics of community detection introduction to digital signal processing the transitive reduction of a directed graph fast unfolding of communities in large networks maps of random walks on complex networks reveal community structure scanpy: large-scale single-cell gene expression data analysis ddot: a swiss army knife for investigating data-driven biological ontologies computing communities in large networks using random walks overlapping community detection at scale: a nonnegative matrix factorization approach accuracy evaluation of overlapping and multiresolution clustering algorithms on large datasets profiler: a web server for functional enrichment analysis and conversions of gene lists ( update) the reactome pathway knowledgebase we are grateful for the helpful discussions with drs. jianzhu ma, karen mei, and daniel carlin. reactome [ ] . key: cord- -j ewxk q authors: lin, jing-wen; sodenkamp, jan; cunningham, deirdre; deroost, katrien; tshitenge, tshibuayi christine; mclaughlin, sarah; lamb, tracey j.; spencer-dene, bradley; hosking, caroline; ramesar, jai; janse, chris j.; graham, christine; o’garra, anne; langhorne, jean title: signatures of malaria-associated pathology revealed by high-resolution whole-blood transcriptomics in a rodent model of malaria date: - - journal: sci rep doi: . 
/srep sha: doc_id: cord_uid: j ewxk q the influence of parasite genetic factors on immune responses and development of severe pathology of malaria is largely unknown. in this study, we performed genome-wide transcriptomic profiling of mouse whole blood during blood-stage infections of two strains of the rodent malaria parasite plasmodium chabaudi that differ in virulence. we identified several transcriptomic signatures associated with the virulent infection, including signatures for platelet aggregation, stronger and prolonged anemia and lung inflammation. the first two signatures were detected prior to pathology. the anemia signature indicated deregulation of host erythropoiesis, and the lung inflammation signature was linked to increased neutrophil infiltration, more cell death and greater parasite sequestration in the lungs. this comparative whole-blood transcriptomics profiling of virulent and avirulent malaria shows the validity of this approach to inform severity of the infection and provide insight into pathogenic mechanisms. scientific reports | : | doi: . /srep offers a feasible alternative as it is one of the "highways" of the immune system via which naïve, and activated or primed immune cells travel between lymphoid organs and the tissues affected by the infection. by profiling global transcriptomes of whole blood, insights can be obtained into the complex changes in systemic or even local host responses brought about by an infection, and thus inform more targeted mechanistic studies. to investigate the use of genome-wide transcriptomic profiling of the whole blood in identifying pathology signatures in malarial infection, and to gain insights into the mechanisms underlying pathology, we used the well-established p. chabaudi chabaudi rodent malaria model to study malarial immunology and pathology , . using two strains of p. c. 
chabaudi, as and cb, that differ in virulence in c bl/ mice, we performed high-resolution comparative whole-blood transcriptomic analysis throughout the acute phase of the blood-stage infection, and identified several transcriptomic signatures associated with severe malarial pathology before the onset of pathology or disease. the virulent cb strain of p. c. chabaudi induces more severe pathology in the acute phase of a blood-stage infection compared with the avirulent as strain. infection of c bl/ mice was initiated by intraperitoneal inoculation of infected red blood cells (irbc) of p. c. chabaudi as or cb. infection with the cb strain gave rise to a more severe infection, resulting in % (range - %) of mice reaching the humane end points (more than % weight loss, persistent laboured breathing and severe hypothermia), while all as infected mice survived the infection without showing severe pathologies (fig. a).
the total numbers of infected red blood cells (irbc) per ml of blood (b) and the parasitemia (percentage of infected irbc) (c) in the mice infected with as or cb parasites. (d,e) the change in rbc numbers during the infection in as or cb infected mice (d) and the hemoglobin (hgb) concentration in the blood of the infected mice (e). (f,g) the percentage change in temperature (f) and weight (g) during the infection in as or cb infected mice. data were pooled from mice in independent experiments. graphs show mean with sem, mann-whitney u test was performed (* p < . , **p < . , ***p < . ).
as and cb infected mice showed comparable irbc loads (parasitemia multiplied by total rbc numbers), despite the fact that higher peak parasitemias were observed in the acute cb infection (fig. b,c).
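the irbc load used in this comparison is simple arithmetic: parasitemia (a percentage) multiplied by the total rbc count. the python sketch below only illustrates that calculation. the function name and all numbers are hypothetical, not measurements from this study, and were chosen to show how a higher parasitemia can coincide with a comparable load when fewer rbc remain in circulation.

```python
def irbc_load(parasitemia_percent, rbc_per_ml):
    """Infected RBC per ml: parasitemia (given in percent) times the total RBC count."""
    if not 0.0 <= parasitemia_percent <= 100.0:
        raise ValueError("parasitemia must be a percentage between 0 and 100")
    if rbc_per_ml < 0:
        raise ValueError("RBC count cannot be negative")
    return parasitemia_percent / 100.0 * rbc_per_ml


# Hypothetical values, for illustration only.
load_high_parasitemia = irbc_load(30.0, 4.0e9)  # higher parasitemia, fewer RBC left
load_low_parasitemia = irbc_load(20.0, 6.0e9)   # lower parasitemia, more RBC left

print(load_high_parasitemia, load_low_parasitemia)
```

in this made-up case both loads come out at about 1.2e9 irbc per ml, which is the kind of situation the text describes: similar loads with different parasitemias.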
a more severe rbc loss with a significantly lower hemoglobin concentration was observed in cb infection at days post infection (dpi) compared to that in as infected mice, agreeing with previous observations, which showed more severe anemia in cb infected balb/c mice . moreover, the rbc loss in cb infected mice is longer lasting, even after the peak of infection at dpi (fig. d,e) . in addition, at dpi, cb infected mice showed greater temperature and weight loss (fig. f,g) . virulent cb and avirulent as strains of p. c. chabaudi induce distinct responses in host whole blood transcriptome. to investigate whether as and cb parasites induce different host responses that might contribute to the differences in severity of the blood-stage infection we carried out genome-wide transcriptomic analyses of whole blood during the acute phase of infection. peripheral blood was collected into tempus tubes via cardiac puncture from c bl/ mice infected with as or cb at , , , , , and dpi. blood samples collected from age-matched uninfected animals at day and day were used as naïve controls to exclude transcriptional changes due to time. the total rna was extracted, depleted of globin mrna and analysed using illumina mouse wg- v . beadarrays. spearman's rank correlation coefficient (r s ) analysis of unfiltered transcripts normalised across the median of all samples, revealed high levels of similarity amongst naïve and , dpi samples in both as and cb infections (fig. ai ,ii, r s ranging from . to . ), while from dpi onwards the whole blood transcriptomes diverge significantly from the earlier time points (r s ranging from . to − . compared to naïve controls). when comparing as and cb infections at each of the time points, lower correlation values were observed between - dpi (fig. aiii , r s = . , . , . , respectively). this indicates that as and cb infections induce different host responses at these times of infection. 
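the sample-to-sample comparison described here relies on spearman's rank correlation, which is the pearson correlation of the ranked values. the sketch below is a plain python illustration of that definition, not the authors' pipeline (the methods state that genespring was used). the function names and the toy expression profiles are assumptions made for the example.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's r_s: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Toy normalised intensities for five transcripts (hypothetical).
naive = [1.0, 2.0, 3.0, 4.0, 5.0]
early = [1.1, 2.3, 2.9, 4.2, 4.8]   # same ordering as naive
late = [4.9, 4.1, 3.2, 1.8, 0.7]    # reversed ordering

r_early = spearman(naive, early)
r_late = spearman(naive, late)
print(r_early, r_late)
```

a profile that preserves the ordering of the naïve sample gives r_s close to 1, and a reversed ordering gives r_s close to −1, which mirrors the divergence from early to later time points reported above.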
as infected mice and those cb infected mice that survived this phase of infection showed similar profiles at dpi (fig. aiii , r s = . ). differential expression analysis (anova unequal variance with post-hoc hsd test, fdr < . ) was performed on transcripts normalised to the median of their respective naïve controls (supplementary table ). at any of the post-infection time points examined, there were differentially expressed transcripts with a greater than -fold change in the infected mice compared to the naïve mice (fig. b, supplementary fig. a). consistent with correlation analyses of unfiltered transcripts (fig. a), only a few transcripts were differentially expressed at and dpi (fig. b). strikingly, at dpi, a large majority of transcripts were down-regulated compared to naïve controls in both as and cb infections. however, the number was greater in cb infected mice compared to as (as out of transcripts, cb out of transcripts) (fig. b). the number of down-regulated transcripts increased steadily from to dpi in as infection (fig. b). while down-regulation of transcripts was also observed at - dpi in the cb infection, the level of down-regulation was significantly lower at dpi, and a sudden up-regulation of transcripts occurred only in cb infected mice at this time point (fig. b , supplementary fig. a ). this was a critical time point when infected animals were either recovering from acute infection or suffering from severe pathologies leading to death (fig. ), indicating that these up-regulated transcripts may relate to the manifestation of severe pathology in the cb infection. indeed, this higher level of transcriptional up-regulation was further confirmed in cb infected mice that had reached humane end points at dpi ( supplementary fig. b ). this up- or down-regulation of gene expression in whole blood is due to both leukocyte population changes and transcript regulation during the infection ( supplementary fig. c ).
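the selection step above combines a false discovery rate cut-off with a fold-change cut-off. the python sketch below illustrates that combination with the benjamini-hochberg adjustment named in the methods. it assumes per-transcript p values are already available (the anova and post-hoc test are not reimplemented), and the thresholds and data are placeholders, because the actual cut-off values are not legible in this text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def select_transcripts(p_values, log2_fold_changes, fdr, min_abs_log2_fc):
    """Indices of transcripts passing both the FDR and the fold-change filter."""
    adjusted = benjamini_hochberg(p_values)
    return [
        i
        for i, (q, lfc) in enumerate(zip(adjusted, log2_fold_changes))
        if q < fdr and abs(lfc) >= min_abs_log2_fc
    ]


# Placeholder inputs: four transcripts, infected versus naive (hypothetical).
p_vals = [0.001, 0.04, 0.03, 0.5]
log2_fc = [1.5, -2.0, 0.2, 3.0]

adjusted_p = benjamini_hochberg(p_vals)
selected = select_transcripts(p_vals, log2_fc, fdr=0.1, min_abs_log2_fc=1.0)
print(adjusted_p, selected)
```

with these placeholder inputs the first two transcripts pass both filters. the third fails the fold-change filter and the fourth fails the fdr filter, so a large change alone, or a small p value alone, is not sufficient.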
in the blood associated with severe pathology. a hierarchical clustering of the differentially expressed transcripts of whole blood revealed clusters of transcripts comprised of genes that were significantly more up-regulated in cb infected mice between and dpi than in the blood of as infected mice ( supplementary fig. , genes in fig. a) , during which severe clinical signs were manifested (severe hypothermia, weight loss and host death) (fig. ). ingenuity pathway analysis (ipa) diseases and function analysis showed that some of these genes were significantly enriched in 'functions' of inflammation and myeloid cell movement (fig. b) . interestingly, genes (fig. a , indicated in red) were enriched in 'disease' of severe acute respiratory syndrome (sars) (fig. b) . these genes have been shown to be amongst the top up-regulated genes in transcriptomic analysis of pbmcs (peripheral blood mononuclear cells) isolated from sars patients suffering from severe lung inflammation (fig. c) . importantly, in of the cb infected mice that had reached humane end points at dpi, these genes showed an even higher level of up-regulation compared to naïve controls (fig. c) , indicating possible lung pathology in these cb infected mice. to identify further pathology-related blood transcriptomic signatures, we included the microarray data of blood isolated from the cb-infected mice that had reached the humane end points, and followed the same differential expression analysis as described above, yielding a total of differentially expressed transcripts (supplementary table ). a self-organising map (som) clustering method was used to generate clusters of co-expressing transcripts ( supplementary fig. , supplementary table ) . the cluster expression level was defined as the average fold-change of all transcripts within each module compared to that of naïve controls (fig. d) . the clusters were annotated with ipa diseases and function analysis and manually curated. 
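the cluster expression level defined above is an average over the transcripts assigned to each cluster. the python sketch below shows that averaging step only. it assumes each transcript already has a fold change relative to naïve controls and a cluster label from the som step. the transcript identifiers, cluster names and values are hypothetical, and whether the authors averaged on a log or a linear scale is not stated here.

```python
from collections import defaultdict
from statistics import mean


def module_expression(fold_changes, module_of):
    """Average the per-transcript fold changes within each module (cluster)."""
    grouped = defaultdict(list)
    for transcript, fold_change in fold_changes.items():
        grouped[module_of[transcript]].append(fold_change)
    return {module: mean(values) for module, values in grouped.items()}


# Hypothetical fold changes versus naive controls for one time point.
fold_changes = {"t1": 2.0, "t2": 4.0, "t3": -1.0}
module_of = {"t1": "cluster_a", "t2": "cluster_a", "t3": "cluster_b"}

levels = module_expression(fold_changes, module_of)
print(levels)
```

repeating this calculation per strain and per time point would give one value per cluster and condition, which is the layout a cluster-by-condition heat map needs.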
several clusters, c , c , c , c , c and c showed greater up-regulation in cb infection during - dpi, indicating their association with lethality/severe pathology in cb infection (fig. d , indicated in red). modular analysis identified an early platelet aggregation signature only associated with the virulent p. c. chabaudi cb infection. clusters c and c , were exclusively up-regulated in the cb infection, especially in the mice that had reached humane end points at dpi (fig. d ). c showed a maximum average of . -fold up-regulation and c showed a maximum average of . -fold up-regulation in cb infected mice that were dying from the infection. twenty-seven genes within these clusters are associated with platelet aggregation/pro-coagulation, a common feature of severe malaria infection (fig. a) . they are significantly enriched in canonical pathways integrin signalling (−log(p value) . , z score ) and actin cytoskeleton signalling (−log (p value) . , z score ). importantly, this set of genes was significantly up-regulated in all cb infected mice as early as dpi (fig. b) , at the time of which very few genes were differentially expressed compared to naïve controls (fig. b) . modular analysis identified a more pronounced and longer-lasting anemia signature in the virulent p. c. chabaudi cb infection. two clusters, c and c contained genes related to anemia (fig. a ). for most of these genes, their down-regulation is associated with anemia, and up-regulation is associated with alleviation of anemia ( supplementary fig. ). the anemia signature was already present at dpi in both infections ( fig. a ), prior to clinical observation of rbc and hemoglobin loss at dpi (fig. ). the mean normalised intensity of these genes was significantly down-regulated compared to naïve controls, and cb infected mice had even significantly lower values (fig. b) . the peak activation z scores calculated by ipa for this anemia signature of cb infection was . 
-fold higher compared with that of as (fig. c , . vs . , dpi), agreeing with the more severe rbc loss (area under the curve analysis) in cb than as infection ( . -fold, . vs . ) (fig. d). interestingly, at dpi, out of the genes within this signature were already up-regulated in as infections compared to naïve controls; by contrast, the majority ( ) of these genes were still down-regulated in the blood of cb infected mice (fig. a, supplementary fig. ). the activation z score was − . in as infected mice, indicating anemia alleviation at this time point, compared to that of . in cb infected mice (fig. c ), which indicates a longer-lasting anemia. this is consistent with the more severe rbc and hemoglobin loss in cb infection at dpi compared to as (fig. d,e). moreover, in support of these data, ipa canonical pathway analysis of the heme biosynthesis ii pathway also showed that genes in this pathway were more down-regulated or less up-regulated in cb than in as infected mice at dpi compared to naïve controls (fig. d ). together, these data suggest a deregulated or stressed erythropoietic response in cb infected mice, leading to a stronger and prolonged severe anemia (fig. d ,e). the lungs of p. c. chabaudi cb infected mice. in addition to the genes we identified from hierarchical clustering, the som analysis revealed a further genes in clusters c and c related to sars, and the majority of these genes were up-regulated only in cb infected mice that reached humane end points at dpi ( supplementary fig. a ). we therefore investigated whether this lung inflammation signature was reflected by more severe lung pathology in cb infected animals. lungs isolated from systemically perfused cb infected animals were of a darker coloration than lungs of uninfected or as infected mice (fig. a). examination of bronchoalveolar lavage fluid showed significantly higher levels of igm in the lungs of mice infected with as or cb parasites compared to naïve mice (fig. b).
a) were even more highly up-regulated compared to naïve controls in the blood extracted from cb infected mice that had reached humane end points at dpi compared to randomly selected mice infected with as or cb parasites at dpi. fold changes in sars patients were from ref. . (d) the twenty-four modules identified by using self-organising map (som) method were presented with the average fold change of all differentially expressed transcripts within each module compared to naïve controls. red represents positive mean fold change above zero (white) and blue indicates negative mean fold change. the clusters were annotated with ipa diseases and function analysis with manual curation. clusters indicated in red showed greater up-regulation in cb infection during - dpi.
the hematoxylin and eosin (h&e) stained sections of perfused lungs from both as and cb infected mice showed signs of leukocyte infiltration. in the lungs of some cb infected mice, patches of dense leukocyte accumulation between epithelial walls were observed in close proximity to hemozoin (hz) crystals ( supplementary fig. b). we also observed more cell death in the lung tissues of cb infected mice (greater than two fold) compared to as infected mice by tunel staining (fig. c). flow cytometry analysis confirmed leukocyte infiltration in the lungs of both as and cb infected mice compared to that in naïve mice (fig. d , gating strategy in supplementary fig. ). we further characterised the cell populations within the infiltrating leukocytes in the lungs of infected animals. although cd + t cells, cd + b cells and cd − cd − innate cell numbers all increased in both infections, they did not significantly differ between as and cb infections; however, there was a trend towards higher t cell numbers in the as infection, and more innate cells in the cb infection ( supplementary fig. b,c).
within the innate cell populations, both the percentage and the cell numbers of ly g + cd b + neutrophils were significantly increased in lungs of cb infected mice compared to as infected mice (fig. e), whereas other myeloid cell populations did not significantly differ between the two infections ( supplementary fig. d ). this observation of neutrophil infiltration is consistent with the ipa 'disease and function' analysis on the sars signature, which was up-regulated more in cb infection compared to as infection, indicating myeloid cell movement (fig. b), especially neutrophils (z score . ). one of the top up-regulated genes, s a (mrp , myeloid-related protein ), indicates that neutrophils may be involved in the lung pathology of the cb infection. we therefore investigated whether the up-regulation of mrp transcript in the blood is associated with higher protein level and neutrophil infiltration in the lungs. we analysed the concentration of mrp protein in the serum at dpi. while a high level of mrp protein was detected in sera of all ( ) cb infected mice, it was below the detection limit by elisa in the sera from out of as infected mice and all ( ) naïve mice (fig. f). we next measured the amount of mrp in the whole lung lysates, and found that cb infected mice contained more than twice the amount of mrp protein compared to that of lungs of as infected mice at dpi; moreover, this upregulation was observed as early as dpi (fig. g). ifng, il , kc (cxcl ) and lix (cxcl ) were higher in the lungs of cb infected mice, indicating a heightened proinflammatory response in the cb infection ( supplementary fig. a ). immunohistochemical staining of mrp on lung sections showed more mrp + cells present in the lungs of cb infected mice compared to naïve and as infected mice ( supplementary fig. b), and flow cytometry analysis confirmed a greater than two fold increase of mrp hi ly g + neutrophils in cb compared to as infected mice (fig. h).
together, these data show that the blood transcriptomic signature of lung inflammation is linked to an mrp -associated neutrophil response in the lungs of mice infected with the more virulent cb strain of p. c. chabaudi compared with the avirulent as strain. amount of hz accumulated in the lungs of cb infected mice compared to as infected mice (fig. a) . this observation suggested a higher level of sequestration/accumulation of irbc in the lungs of cb infected mice. to investigate the level of sequestration of irbcs in p. c. chabaudi as and cb infected mice, we generated transgenic parasites, pccasluc p and pcccbluc p , expressing luciferase constitutively throughout the plasmodium life cycle under the control of eef a promoter (supplementary fig. ). at day , and dpi, the total parasite load was determined by measuring luciferase activity from μ l tail blood when the parasites were at late trophozoite stage , and the level of sequestration in different organs was investigated during schizogony. consistent with the peripheral load of irbc (fig. b) , there were no significant differences in luciferase activity between pccasluc p and pcccbluc p infected mice at dpi either in peripheral blood or by whole body imaging (fig. b) . after intensive systemic perfusion, the luciferase activities in isolated organs were measured and relative ratio of sequestration was calculated as the level of luciferase activity per organ (total flux per second) relative to the total parasite load measured in peripheral blood before schizogony (relative light unit, rlu) (fig. c) . consistent with previous findings , both as and cb irbc, sequester/accumulate mainly in the spleen, liver and lungs, with no significant signals observed in the kidney or brain. the relative levels of sequestration/accumulation in the spleen and liver were similar between as and cb infections. 
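the relative ratio of sequestration defined above divides each organ's luciferase signal by the parasite load measured in peripheral blood before schizogony. the python sketch below illustrates only that normalisation. the function name and all numbers are hypothetical, and the units simply follow the text (total flux per second for organs, relative light units for blood).

```python
def relative_sequestration(organ_flux, blood_rlu):
    """Organ luciferase signal divided by the peripheral blood parasite load.

    organ_flux: mapping of organ name to total flux per second (perfused organ).
    blood_rlu: luciferase activity of tail blood taken before schizogony (RLU).
    """
    if blood_rlu <= 0:
        raise ValueError("blood parasite load must be positive")
    return {organ: flux / blood_rlu for organ, flux in organ_flux.items()}


# Hypothetical readings for a single mouse.
organ_flux = {"spleen": 5.0e6, "liver": 2.0e6, "lung": 1.0e6}
ratios = relative_sequestration(organ_flux, blood_rlu=2.0e4)
print(ratios)
```

dividing by the blood signal is what makes organs comparable between mice or strains whose overall parasite loads differ.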
by contrast, a significantly higher level of sequestration in the lungs occurred in the cb infection at dpi, and the trend was still maintained at dpi (fig. c). the higher level of sequestration/accumulation of schizonts in the lungs is consistent with the observation of greater amounts of hz in the lungs of cb infected mice compared to as infected mice (fig. a ).
(c) bar charts showing the relative ratio of sequestration in different organs, which was quantified as the level of luciferase activities in the perfused ex vivo organs relative to the total parasite load measured in peripheral blood at late trophozoite stage (b, left). all data in (b,c) were pooled from independent experiments (n = - in total). in all bar charts, median values are shown and each dot represents an individual mouse. mann-whitney u test was performed, p values are provided when significant difference was observed.
host genetics and immune status play important parts in the outcome of an infection with plasmodium , . however, there is an increasing amount of evidence showing that genetic diversity of the parasite also contributes to the varying severity of malarial disease. in this study we used a top-down systems analysis of peripheral blood to investigate whether transcriptomic signatures could be identified that would indicate or predict severity of acute blood-stage malaria caused by strains of p. c. chabaudi of differing virulence. using high-resolution profiling of the whole blood transcriptomics over multiple time points during the acute phase of infection, and data-driven modular analysis, we investigated the involvement of biological processes rather than specific genes, and uncovered several transcriptomic signatures related to severe pathologies in the virulent cb infection. these include distinct signatures for platelet aggregation, anemia and lung inflammation, which can be seen at different time points and distinguished the two infections.
this analysis also revealed several signatures common between avirulent as and virulent cb infections, but they occurred at different time points or were of different magnitude. this highlights the value of studying pathological factors in the host induced by parasites over the course of the infection and not at a single time point. the platelet aggregation signature was highly up-regulated in all cb infected mice that had reached humane end points. this was the earliest pathology signature identified in this study and, similar to the anemia signature, was detected before the onset of severe disease. this set of genes was up-regulated as early as days post infection in all cb infected mice regardless of eventual survival. it has been shown that in severe p. falciparum infections, platelets mediate irbc clumping and adhesion , . these observations of association between infection severity and platelet aggregation suggest that similar mechanisms underlie pathology in both the p. c. chabaudi model of malaria and in human infections, and the experimental model may be useful to explore the underlying mechanisms. it would be of great interest to analyse the platelet aggregation signature in human malarial infections and investigate whether this transcriptomic signature could be used as an early marker to predict development of severe pathology. the anemia signature identified was present in the whole blood transcriptome ahead of the clinical onset of rbc loss in both avirulent as and virulent cb infections, but it was stronger and lasted longer in the cb infection. this transcriptomic signature predicted the more severe and longer-lasting anemia we have observed in cb infections . both the anemia signature and the heme biosynthesis ii pathway analysis indicate a deregulated or stressed host erythropoietic response in the more severe cb infection. in addition to the platelet aggregation and anemia signatures, we identified a lung inflammatory signature in cb infected mice.
although sequestration of p. c. chabaudi as parasites in lungs has been documented , lung damage has not been previously reported for this experimental model. we confirmed that this sars-related lung inflammation signature in the blood was indeed associated with a more severe pulmonary neutrophilic infiltration and more cell death in the lungs in cb infections. furthermore, it was linked to a higher level of sequestration of cb irbc in this organ. both p. falciparum and p. vivax can sequester within the pulmonary microvasculature and cause lethal malaria-associated acute respiratory distress syndrome (ma-ards) . members of the pfemp family (p. falciparum erythrocyte membrane protein- ) of variant surface-expressed parasite proteins have been shown to be parasite ligands mediating parasite cytoadherence . although pfemp is lacking in other plasmodium parasites, another multigene family, pir (plasmodium interspersed repeat), is present in most, if not all, species of plasmodium; and there is evidence that some pirs in p. vivax bind to icam- endothelial receptor in vitro . it is possible that differential pirs expression between p. c. chabaudi as and cb is responsible for this differential pulmonary sequestration ability. however, it is also possible that the as parasite is removed more effectively from the lung than cb due to the higher inflammation caused by cb infection. the higher level of cb pulmonary sequestration leaves greater amounts of hemozoin compared with the as parasite. there is evidence that hz can directly induce pulmonary proinflammatory responses . in addition, it has been shown that parasite-derived microparticles can induce macrophage activation in a tlr (toll-like receptor )-myd dependent manner . in our study, cb infection induced a higher level of inflammation (ifng, il- and mrp ) in the lungs of infected mice. mrp (s a ) is one of the top up-regulated genes identified in the lung inflammation signature.
together with s a (mrp ), mrp / forms a heterodimer complex that has previously been shown to be a potent chemotactic factor for myeloid cells, especially neutrophils . mrp / are tlr ligands and are recognized as damage-associated molecular pattern molecules (damp) involved in many inflammatory diseases and infections . for example, in tuberculosis and influenza infection, mrp / is shown to exacerbate pro-inflammatory responses, cell-death and pathogenesis , . of relevance here, mrp protein is significantly increased in p. falciparum and p. vivax infected patients , . interestingly, in our p. c. chabaudi mouse model, mrp was detectable in all mice infected with the virulent cb strain; by contrast, it was detectable in only % of mice infected with the avirulent as strain. moreover, when mrp was detected in as infected mice, it was present at significantly lower level than that in cb infected mice. this coincided with a significantly higher number of mrp hi ly g + neutrophils in the lungs of cb infected mice. it is possible that mrp + cells respond to the microparticles upon rupture of sequestered cb schizonts, leading to proinflammatory response and recruiting more mrp + neutrophils. our analysis offers evidence that different parasite strains, exhibiting different sequestration tendencies, can lead to different levels of lung inflammation and damage. deciphering the complex host immune responses during acute malaria is extremely challenging. here we demonstrate that whole blood transcriptomic signatures can help to reveal severe malaria-associated pathologies, often preceding clinical observations. our data demonstrate the potential in searching further transcriptomic signatures in human malaria for severity diagnosis and prognosis. furthermore, these blood signatures can also provide crucial information about the pathogenic processes taking place in organs or tissues during infection, as demonstrated here with the neutrophil-related lung inflammation signature. 
this unbiased modular analysis of blood transcriptomic data also offers a promising method to search for protective mechanisms in mouse and human malarial infections. this is particularly important for p. vivax infections of humans, because of its greater genetic diversity , and the recent surge in reports of severe and fatal p. vivax malaria , . mice. female c bl/ aged - weeks from the spf unit at the francis crick institute mill hill laboratory were housed under reverse light conditions (light . - . , dark . - . gmt) at - °c, and had continuous access to mouse breeder diet and water. core body temperature was measured with an infrared surface thermometer (fluke); body weight was calculated relative to a baseline measurement taken before infection; and erythrocyte density was determined on a vetscan hm haematology system (abaxis). this study was carried out in accordance with the uk animals (scientific procedures) act (home office licence / and / ), and was approved by the francis crick institute ethical committee. walliker, university of edinburgh, uk and subsequently passaged through mice by injection of infected red blood cells (irbc) at the mrc national institute for medical research, uk and cryopreserved as described . for experimental work, infections were initiated by intraperitoneal (i.p.) injection of irbc derived from cryopreserved stocks. the course of infection was monitored on giemsa-stained thin blood films by enumerating the percentage of rbc infected with asexual parasites (parasitemia). the limit of detection for patent parasitemia was . % infected erythrocytes. mice were culled upon reaching humane end points by showing the following signs: emaciation (more than % weight loss), persistent labored breathing, severe hypothermia (body temperature below °c), inability to remain upright when conscious or lack of natural functions, or continuous convulsions lasting more than min. p. c. 
chabaudi as and cb expressing luciferase under the control of the constitutive promoter eef a were generated by transfection with the construct ppc-luc p, targeting the neutral p p locus (pchas_ or pchcb_ ). transfection and cloning of transgenic p. chabaudi parasites were performed as described previously , and integration was verified by southern blot analysis of chromosomes separated by pulsed field gel (pfg) as described . the construct ppc-luc p was modified from ppc-luccam by replacing the p. chabaudi ssu targeting region with p targeting region, chab - , generated by gene synthesis (genewiz llc, nj, usa). female c bl/ mice aged between - weeks were intraperitoneally infected with irbc of p. c. chabaudi as or cb. at , , , , and days post infection, . ml of blood was collected via cardiac puncture into ml tempus rna stabilising solution (applied biosystems). naïve samples were also collected at day (the day of infection) and day (the end of the experiment) and used as controls. samples were snap frozen on dry ice and stored at − °c until rna isolation. total blood rna was extracted using perfectpure rna blood kit ( prime) and globinclear kit (ambion) was used to remove globin mrna according to the manufacturer's instructions. crna samples were prepared from μ g globin reduced blood rna using illumina totalprep rna amplification kit (ambion) and hybridized to illumina mouse wg- v . beadarrays according to the manufacturer's protocols. at each step, the quantity and quality of the rna samples were verified using nanodrop spectrophotometer (thermo fisher scientific) and caliper labchip gx (caliper life sciences). microarray data analysis. illumina beadstudio/genomestudio software was used to subtract background and scale average samples' signal intensity, and genespring gx . software (agilent technologies) was used to perform further normalization and analyses as described previously .
first, all signal intensity values less than ten fluorescent units were set to equal ten, log transformed and per chip normalised using th percentile shift algorithm. transcripts were further normalized to the median across all samples or to the median of control samples. transcripts were first selected if they were present (cut off . ) in ≥ % of all samples, and further filtered with a minimum of -fold expression up- or down-change compared with the median intensity across all samples. all microarray data are deposited in geo under accession number gse . genespring software was used to perform statistical tests, anova unequal variance with post-hoc tukey's hsd (honest significant difference) test, followed by benjamini-hochberg multiple test correction (fdr < . ); fold change was further performed on the combined list of transcripts differentially expressed either in as or cb infection, with -fold cut off compared to naïve controls. the set of transcripts was defined as differentially expressed transcripts and was used for further analyses. hierarchical clustering of samples at different infected time points compared to naïve was performed using pearson uncentred correlation with an average-linkage-clustering algorithm that organizes all transcripts according to their trend of expression across all samples. the hierarchical clustering on the transcripts differentially expressed in as or cb (supplementary table ) across the acute phase of infection was performed using euclidean distance metric with ward's linkage rule. for the pathological modules (in fig. ), prediction was performed on the transcripts differentially expressed in as or cb infection including the samples collected at dpi (supplementary table ) using the self-organising map algorithm in genespring. euclidean distance was used for similarity measurement and the maximum number of iterations was set at e . the initial learning rate was set at . 
and initial neighbourhood radius , the number of grids was tested from × , till less than % of clusters have similarities above %, at a final grid of × ( clusters). ingenuity pathways analysis (ipa) (qiagen) was used to identify enriched diseases and functions, and networks. the significance of the association between the dataset and each analysis was measured using fisher's exact test, z score cut-off and/or p value cut-off . . this program was also used to map the canonical pathways and overlay them with expression data from the dataset. module annotation was determined using disease and function analysis in ipa. to obtain bronchoalveolar lavage fluid (balf), mice were terminally anaesthetised, and lungs were cannulated and inflated with μl pbs. the liquid was retrieved and spun at g for min at °c. supernatants were obtained and kept at − °c till further analysis. igm levels in balf were quantified using a sandwich elisa (southern biotech). detection of mrp and cytokines in sera and lung lysate. mouse serum was collected via cardiac puncture, clotted at room temperature for min and collected by centrifuging twice at , g for min at °c. lung proteins were extracted in ripa lysis buffer with protease inhibitor cocktails (sigma) and homogenized with polytron homogenizer (kinematica) on ice. protein levels were quantified by pierce bca protein assay (thermo scientific) as per the manufacturer's instructions. all plates were read on a safire ii plate reader (tecan). mouse s a elisa kit (r&d systems) was used to determine the level of s a /mrp in serum and lung lysate samples following manufacturer's instructions. cytokine concentrations were determined using cytometric bead array (biolegend) following the manufacturer's manual. histology and immunohistochemical analyses. the lungs were extensively perfused with ml pbs and then inflated by injection of ml of % neutral buffered formalin (nbf) through the tracheal cannula.
tissue was then fixed overnight in % nbf, and transferred into % ethanol until embedded in paraffin and sectioned. each lung specimen was stained with haematoxylin and eosin (h&e). for each mouse, the number of hemozoin crystals was quantified from randomly selected fields on h&e stained slides under leica light microscopy ( × ). immunohistochemical staining was performed to examine the expression of mrp on paraffin-embedded lung sections with anti-mrp antibody (clone b ). tunel staining was performed using apoptag® fluorescein in situ apoptosis detection kit (merck millipore) following the manufacturer's protocol. imaging of slides was performed on a vs slide scanner (olympus) with a vc camera, a uplsapo lens, at a magnification of × or × . images were processed and analysed using olyvia image viewer . (olympus) and fiji . and tunel-positive cell numbers were quantified in an area of nm using olyvia image viewer . . in vivo imaging and luciferase assay. mice were infected intraperitoneally with rbc infected with pccasluc p or pcccbluc p parasites; and at each time point μl of heparinized tail blood was collected before sequestration . bioluminescence was assessed with the luciferase assay system (promega) according to the manufacturer's protocol and quantified with the tecan safire plate reader and magellan software (tecan). under these conditions, bioluminescence intensity is proportional to the amount of parasites in this blood volume , which reflects the total parasite load before sequestration. at the time of maximum sequestration ( . - . h gmt, reverse light) , d-luciferin ( mg/kg, caliper life sciences) was injected subcutaneously min before whole-body and organ imaging. mice were terminally anaesthetized and systemically perfused by intracardiac injection of ml pbs .
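Because blood bioluminescence is taken as proportional to parasite load, the organ readings in this section are expressed relative to it (organ luciferase activity divided by the blood reading). A minimal Python sketch of that ratio follows; the function name, argument names and data layout are hypothetical and are not taken from the original analysis.

```python
def normalise_organ_flux(organ_flux, blood_luciferase):
    """Divide each organ's bioluminescence by the blood parasite-load reading.

    organ_flux: dict mapping organ name -> total flux measured for that organ.
    blood_luciferase: luciferase activity of the tail-blood sample, used here
    as the proxy for total parasite load (illustrative assumption on units).
    """
    if blood_luciferase <= 0:
        raise ValueError("blood luciferase reading must be positive")
    return {organ: flux / blood_luciferase for organ, flux in organ_flux.items()}
```

The resulting ratios are unitless only if both readings are on comparable scales; otherwise they should be read as relative values for comparing mice with different parasite burdens.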
the brain, lungs, liver, spleen, left kidney and gut were removed immediately and luciferase assessed using in vivo imaging system ivis lumina (xenogen), with a cm field of view, a binning factor of , and an exposure time of s. bioluminescence (total flux per second) was quantified with the software living image (xenogen) by adjusting a region of interest to the shape of each organ. to account for the influence of total parasite load on the number of parasites sequestered in the organs, bioluminescence in the organs was normalized to total parasite load. luciferase activities measured in the organs were divided by parasite load in μ l blood (see above), allowing comparison between mice with different parasite burdens.
were made in graphpad prism, each dot represents an individual biological replicate and p-values were derived from mann whitney u test or multiple t-test. we would like to thank thibaut brugat, audrey vandomme and barbara capuccini for critical reading of the manuscript. we are grateful to the brf at mill hill for animal husbandry, to the high-throughput screening (hts), flow cytometry, experimental histopathology and microscopy facilities at mill hill for excellent technical support. this work is supported by the francis crick institute, which receives its funding from the uk medical research council (grant u ), cancer research uk, and the wellcome trust (grant wt ma).