key: cord-0316284-e8dd0bw1 authors: Müller, Nicola F.; Wüthrich, Daniel; Goldman, Nina; Sailer, Nadine; Saalfrank, Claudia; Brunner, Myrta; Augustin, Noémi; Seth-Smith, Helena MB; Hollenstein, Yvonne; Syedbasha, Mohammedyaseen; Lang, Daniela; Neher, Richard A.; Dubuis, Olivier; Naegele, Michael; Buser, Andreas; Nickel, Christian H.; Ritz, Nicole; Zeller, Andreas; Lang, Brian M.; Hadfield, James; Bedford, Trevor; Battegay, Manuel; Schneider-Sliwa, Rita; Egli, Adrian; Stadler, Tanja title: Characterising the epidemic spread of Influenza A/H3N2 within a city through phylogenetics date: 2020-04-28 journal: bioRxiv DOI: 10.1101/2020.04.27.052225 sha: 0f24d0303ff578e21348ed633876175ad11c82a3 doc_id: 316284 cord_uid: e8dd0bw1 Infecting large portions of the global population, seasonal influenza is a major burden on societies around the globe. While the global source sink dynamics of the different seasonal influenza viruses have been studied intensively, it’s local spread remains less clear. In order to improve our understanding of how influenza is transmitted on a city scale, we collected an extremely densely sampled set of influenza sequences alongside patient metadata. To do so, we sequenced influenza viruses isolated from patients of two different hospitals, as well as private practitioners in Basel, Switzerland during the 2016/2017 influenza season. The genetic sequences reveal that repeated introductions into the city drove the influenza season. We then reconstruct how the effective reproduction number changed over the course of the season. We find trends in transmission dynamics correlated positively with trends in temperature, but not relative humidity nor school holidays. Alongside the genetic sequence data that allows us to see how individual cases are connected, we gathered patient information, such as the age or household status. Zooming into the local transmission outbreaks suggests that the elderly were to a large extent infected within their own transmission network, while school children likely drove the spread within the remaining transmission network. These patterns will be valuable to plan interventions combating the spread of respiratory diseases within cities given that similar patterns are observed for other influenza seasons and cities. Author summary As shown with the current SARS-CoV-2 pandemic, respiratory diseases can quickly spread around the globe. While it can be hugely important to understand how diseases spread around the globe, local spread is most often the main driver of novel infections of respiratory diseases such as SARS-CoV-2 or influenza. We here use genetic sequence data alongside patient information to better understand what the drives the local spread of influenza by looking at the 2016/2017 influenza season in Basel, Switzerland as an example. The genetic sequence data allows us to reconstruct the how the transmission dynamics changed over the course of the season, which we correlate to changes, but not humidity or school holidays. Additionally, the genetic sequence data allows us to see how individual cases are connected. Using patient information, such as age and household status our analyses suggest that the elderly mainly transmit within their own transmission network. Additionally, they suggest that school aged children, but not pre-school aged children are important drivers of the local spread of influenza. With a large fraction of the population being infected annually and up to 650,000 2 deaths per year, seasonal influenza causes a major burden on societies around the globe 3 (https://www.who.int/news-room/fact-sheets/detail/influenza-(seasonal)). Through 4 rapid evolution, influenza strains evade host immunity, allowing them to reinfect large 5 fractions of a population every year. In order to prevent infections, limited public health 6 resources have to be streamlined as efficiently as possible [1] . The planning of 7 interventions is dependent upon knowledge of the dynamics of epidemic spread of 8 influenza viruses in a city environment, which includes understanding the drivers of the 9 spread of seasonal influenza between individuals. Incidence and prevalence data can be 10 used to some extent to infer such dynamics. However, they lack the information about 11 how individual cases are epidemiologically related. cities constitute highly heterogeneous societies with various different living 29 arrangements and vastly different social and age groups. This means that lessons learnt 30 about influenza transmission from college campuses are not necessarily transferable to 31 cities. Children have, for example, been described to make up a larger portion of the 32 infected population at the beginning of a season compared to the end over 3 seasons in 33 Houston [14] . 34 In an effort to fill that gap, we studied the local spread of influenza and the factors 35 contributing to it in the city of Basel, Switzerland during the 2016/2017 Influenza 36 season which was dominated by Influenza A/H3N2. To do so, influenza samples 37 together with the age and residential address were collected from around 669 patients 38 from the University Hospital (USB), the Children's Hospital of Basel (UKBB) and 39 patients of private practices from around the city. Around 200 of these patients also 40 provided additional information through filling out a survey. The survey asked 41 questions about family status, financial status, and demographics. Details on the data 42 collection is provided in [15] . We assess here the importance of introductions of 43 influenza into a city for seeding a seasonal epidemic, the overall dynamics of 44 transmission throughout the season, and the extent to which different age groups and 45 family statuses drive the spread of seasonal influenza in a city. 46 Methods and Material 47 Data collection and sequencing 48 We collected all data in the 2016/17 influenza season as described in [15] . Sequencing 49 was performed as described in [16] . Raw Illumina reads were trimmed with 50 Trimmomatic 0.36 [17] . Alignment of paired-end reads was done by using bowtie 51 2.2.3 [18] , using strain A/New York/18/2014 as a reference. The aligned reads were 52 sorted by using samtools 1.2 [19] . Variants were called and filtered by using lofreq 53 2.1.2 [20] . Variant calling was done for sites with a coverage of at least 100. Sites with a 54 coverage of less than 100 were assumed to be unknown and were denoted as N, that is 55 any possible nucleotide (Details on the exclusion of sequences are described in the 56 supplement). Exact input specification can be found at 57 https://github.com/nicfel/FluBaselPhylo/tree/master/Sequences. The consensus 58 sequences from this study were deposited in GenBank (numbers MN299375-MN304713). 59 Timed phylogenetic tree based on the HA segment 60 We combined the Basel sequences with all sequences (as of the 17th of Juli 2018) from 61 https://www.gisaid.org sampled between January 1st 2016 and December 31st 2017 for 62 which at least the segments HA, NA and MP were available. We first aligned all 63 consensus sequences using muscle v3.8.3129. We then built an initial phylogeny from 64 the HA segment alone by using RaxML 8.2.1 [21] and obtained branch lengths in 65 calendar time via timetree [22] using the nextstrain pipeline [23] . 66 Initial clustering based on nucleotide differences 67 We then calculated the average nucleotide difference between any of the sequences and 68 sequences from Basel. In order to split the dataset into manageable pieces, we first into two groups if the two closest related sequences of each group exceeds this distance. 75 Based on this initial grouping, we added sequences that were not from Basel to each 76 cluster if they were at maximum 0.0025/2 mutations per position away from any of the 77 sequences from Basel. The factor of 2 is only to reduce the number of non-Basel 78 sequences in each of these initial clusters. Phylogenetic trees of initial clusters 80 We next estimated rates of evolution for each genomic segment using the SRD06 81 model [24] and a strict clock model from 200 full genome sequences sampled in 82 California, New York and Europe between 2010 and 2015 in Beast 2.5 [25] . These 83 sequences were downloaded from fludb.org, were not used otherwise and are an 84 independent dataset. We allowed each segment to have its own phylogeny in order to 85 avoid reassortment to bias the estimates of evolutionary rates. Each of the segments, as 86 well as the first two and third codon position was allowed to have its own rate scaler. 87 We ran 10 independent MCMC chains each for 10 8 iterations and then combined them 88 after a burn-in of 10%. These estimated evolutionary rates are long-term rates of 89 influenza A/H3N2. Since the effects of selection over short time periods are smaller 90 compared to longer time periods. The evolutionary rates can be expected to be faster 91 for shorter time windows [26] . We therefore expect the pairwise distances estimated for 92 our data from the 2016/17 outbreak using these rates to be an overestimate of the 93 actual divergence times. The xml and log files for the analysis can be found here 94 https://github.com/nicfel/FluBaselPhylo/tree/master/EvolutionaryRates. 95 We next reconstructed the phylogenetic trees of all initial clusters by using the full 96 genomes of all samples in the initial clusters. We fixed the evolutionary rates to be 97 equal to the mean evolutionary rates as estimated previously, with the mean 98 evolutionary rate being 2.9 * 10 − 3 per site and year. As a population prior, we used a 99 constant coalescent model with an effective population size being shared among all 100 initial clusters. We then estimated a distribution of phylogenies for each initial cluster, 101 assuming that all segments share the same phylogeny. If reassortment happened, it 102 would increase the distance between samples. Due to using fixed evolutionary rates as 103 estimated in the previous analysis, reassortment will not bias evolutionary rates. Hence, 104 reassortment events will increase the pairwise distance between isolates separated by 105 reassortment, but will not bias the distance between isolates that are not. Local cluster identification 107 To identify sets of sequences from Basel that were likely transmitted locally, we used the 108 phylogenetic tree distributions for each initial cluster and reconstructed the ancestral 109 states using parsimony. We made some modifications to the standard algorithm for 110 ancestral state reconstruction. To reflect our prior belief that Basel is unlikely to act as 111 a relevant source of influenza on a global scale, we classified internal nodes that are not 112 exclusively classified to be in Basel as not in Basel. Since the flu season is only a few 113 months long, we additionally assumed that lineages are unlikely to persist in Basel for 114 more than 0.1 years without being sampled. To reflect that assumption, we classified 115 internal nodes that are more than 0.1 years from a sample from Basel to be either in a 116 location other than Basel, or to be in an unknown location. We then defined sequences 117 to be in the same local cluster if all their ancestors are inferred to be in Basel. We get 118 these local clusters for each iteration of the MCMC. (e.g. [27] ), these approaches themselves make strict assumptions that are violated when 123 studying the spread of diseases on a city scale. Also, it is further unknown how well 124 they perform when migration between individual locations is very strong. From the grouping of sequences into local clusters as described above, sequences can 126 be classified into different local clusters over the course of the MCMC. For the 127 estimation of effective reproduction rates we however require each sequence to be in a 128 distinct local clusters. To do so, we randomly picked an iteration of the MCMC and 129 then chose the local clusters present in that iteration. In order to avoid sensitivity of 130 the results to the iteration we chose, we repeated each analysis 10 times with randomly 131 chosen iterations. Estimation of the effective reproduction number and sampling 133 probability 134 We then estimated the effective reproduction number through time as well as the 135 sampling proportions and phylogenies from all these local clusters jointly using 136 BDSKY [28] . We assumed the effective reproduction number to be piecewise constant 137 in intervals of 2 days and allowed it to change every 2 days. We then assumed the 138 difference between the log effective reproduction number in interval t (log R ef f (t)) and 139 in interval t-1 (log R ef f (t-1)) to be distributed around N(0,σ), with σ being estimated 140 in the MCMC [29] . Additionally, we assume the log R ef f at the most recent time Defining connectedness between individuals 151 We define two individuals to be connected if their pairwise phylogenetic distance is less 152 than 0.1 years. If we assume two individuals to be isolated at the same time, this cutoff 153 would correspond to a common ancestor that was at most 18.25 days ago. Considering 154 that the evolutionary rates we used to perform these inferences are long term rates and 155 therefore lower than the actual short term rates [26] , we expect that the cutoff values Connectedness across age and family status groups 162 We estimate the average number of connections members from each of the six categorial 163 age/family status groups have, according to the above definition of connectedness. To 164 do so, we model the number of connections an individual from a group has as a negative 165 binomial distribution. This allows us to model the number of connections an individual 166 from a group has, while taking the variance of the relationship between the group label 167 and the number of connections into account. This is in contrast to, for example, the 168 poisson or geometric models. We assess overall model fit with an ANOVA and then 169 April 26, 2020 5/23 perform Tukey contrasts, comparing all pairs of age groupings [30] . We correct for 170 multiple testing by using Schaffer's method, which is similarly conservative to bonferroni, 171 but takes into account the dependencies enforced in a linear modelling framework [31] . 172 Age mixing patterns 173 To identify mixing patterns between the six categorial age/family status groups, we 174 again use the definition of connection of two patients. 175 We use two different approaches to estimate how different groups are connected to 176 each other. First, we use multinomial logistic regression to estimate the probability that 177 a member from one group is connected to a member from another group. As weights, 178 we use the inverse number of samples from each group. This implicitly assumes that 179 individuals from each group have the same probability of being infected. Children 180 however might have higher rates of infection, and we therefore expect this weighting to 181 underestimate the role of children and to overestimate the role of adults 182 Second, we use a permutation approach. Between any two groups a and b, we 183 compute the probability of them being associated with one another as follows: For each 184 combination of groups a and b, we count the number of pairs that are associated with 185 one another. We then randomly permute the age labels 10 6 times. For each permutation, 186 we calculate if the number of pairs between these groups is greater or smaller than what 187 we observed. From these values, we then compute the probability that age groups a and 188 b are positively (P + ab ) or negatively (P − ab ) associated with one another as: suggests that with additional sampling effort, we would have captured more 216 introductions. With its own international airport, the airport of Zürich nearby, and a 217 major rail hub, Basel is well connected to the rest of Europe and the world. As such, 218 people working in Basel often do not live in the city and commute daily from elsewhere 219 in Switzerland, Germany and France. Basel is a tourist destination and often hosts 220 international conferences, attracting people from all over the world. This connectedness 221 likely drives these introductions of influenza into the city. Quantification of the overall local epidemic following the 223 introduction into the city 224 After introduction into the city, we next study how influenza is transmitted locally. We next investigated potential factors determining the changes in R ef f . The number 238 of influenza cases over the years show a strong seasonality, with the majority of cases 239 occurring in the winter months in both the northern and southern hemisphere [7] . 240 Relative humidity and temperature have been described to drive influenza 241 transmission [33] . Additionally, the effect of school closures on the spread of pandemic 242 influenza has been discussed [34, 35] . Thus we investigated potential correlations of 243 R ef f with temperature,relative humidity and school days (i.e. days when children go to 244 school). As we only studied one season, these correlations have to be interpreted with 245 caution, and analyses of other seasons are needed to confirm the potential correlations. 246 Neither humidity nor school days showed a significant correlation: relative humidity Viral shedding of viruses has been shown previously to be increased at lower 255 temperatures in animal models [33] and higher absolute humidity has been shown to 256 favor transmission on a population level [36] . We here observe lower effective 257 reproduction numbers at lower temperatures. The correlation of the effective 258 reproduction number with the temperature however is not necessarily causal, as it for 259 example could be due to social behaviour being different at lower temperatures. Additionally, the computed p-values could be inflated due to unaccounted 261 autocorrelation. Along with the R ef f , we co-estimated the sampling probability, that is the 263 probability of an infected individual being sampled. Since we followed the same 264 procedure for inclusion of patients throughout the epidemic season [15] , we assumed 265 that this probability is constant throughout the influenza season. We estimated the 266 sampling probability to be between 3% and 5% (see figures 1d and 5). In contrast to 267 the R ef f estimates, this value is more sensitive to the procedure of clustering of 268 sequences into sets of locally transmitted sequences (see figure 5) . Additionally, the 269 prior probability on the effective reproduction number, as well as the assumed becoming 270 un-infectious rate can influence this estimate [37] . After having determined the importance of introductions of influenza into the city and 285 the overall rate at which influenza is spread in the city, we next studied the effect that 286 age and family status has on the overall spread. Figures 2a and b show the patient age 287 distribution within our samples. In order to study the role of age and family status in 288 spreading influenza, we next subdivided our Basel patients into four different age 289 groups, preschoolers (<7 years old), school-aged children (7 to 17 years), adults (18 to 290 65 years) and the elderly (>65 years old). We further categorized adults into three 291 subgroups corresponding to family status: adults for whom we know that they live in 292 the same household as children, adults for whom we know that they do not and adults 293 for whom we do not have this information. We thus have overall six different categories 294 of patient groups. For each individual infected with influenza in each of these categorical groups, we defining two viral isolates to be connected is therefore done independently of the above 303 used definition of local transmission clusters. Using pairwise distances allows us to use 304 distance in the transmission chain whereas from the same cluster only says if two 305 sequences originated from the same introduction. 306 We find that school-aged children are on average connected to more individuals, 307 than preschoolers (see figure 2c ). This difference is statistically significant after multiple 308 hypothesis testing at a cutoff value of 0.1 years, but not other cutoff values. We further, 309 but not significantly, find that adults that reported to live in the same household as 310 April 26, 2020 8/23 The intervals are different from the age groups used in the analysis due to the availability of age data for the population of Basel. B Proportion of different age groups in this study compared to in the city of Basel. C Model-based confidence intervals for the difference in average number of connections between each pair of age groups from a negative binomial model. Upper and lower bounds represent 95% confidence intervals for the average fold-difference in connection number between two groups corrected for multiple hypothesis testing using the Tukey method. This means that confidence intervals that do not include 1 are statistically significant. We see that the average connection number for elderly patients is twice the average connection number for preschoolers, adults without children, and adults with unknown status. These values are estimated for a cutoff values of 0.1 years. Estimates for different cutoff values are shown in figure 10 . Also, the estimates of the mean number of connections in each group are shown in figure 11 . children are on average connected to more patients than those that do not live in the 311 same household as children. For the elderly, we find that they have significantly more show tendencies to be connected more in average than the latter three groups, though 319 the data is not informative enough, respectively we do not have enough data, to provide 320 strong evidence for that. That school aged children and elderly are connected to more individuals than the 322 other groups can have different explanations. The most obvious one is that individuals 323 of these groups are more likely to participate in transmission events, either as a donor or 324 recipient; alternatively, strong mixing within a group and a higher probability of visiting 325 a doctor upon infection and therefore a higher sampling probability could act as an 326 explanation (if members from any group are equally likely to transmit or receive 327 influenza to and from members of any other group, higher sampling probability would 328 increase to number of connections of every group and not just one). In order to assess the potential reasons, we investigated how strongly or weakly 330 different age groups are connected with each other. We did so using two different Here we show the mixing patterns between the different categorial patient groups. We define pairs of patients to be connected if their pairwise phylogenetic distance was below 0.1 years. Results for other thresholds are shown in Figures 12 and 13 . A Probability that an individual from the group in each row is connected to an individual from the group in a column. These probabilities were calculated by using the inverse number of samples from each group as weights. Upper and lower bounds correspond to 95% confidence intervals around the estimated probability. B The color of each tile in the heatmap corresponds to the p-value for either positive (red) or negative (blue) associations. These p-values are bonferroni corrected for the number of comparisons (42). We estimate these p-values by randomly permuting the group to patient labels and then comparing the number of pairs of interactions we observe in the data vs. when randomly permuting. For both approaches, we find that mostly school-aged children are associated with 339 other school-aged children, and the elderly are associated with other elderly people. We 340 additionally find higher association between children of any age and adults living in the 341 same household as children than children of any age and adults without children (see 342 figure 3 ). Adults living in the same household as children, on the other hand, are 343 estimated to have low association to other adults without children and much higher 344 association to children, whereas adults without children are mostly associated to other 345 adults without children (see figure 3a) . Increased sampling of the elderly relative to the other age groups is likely to occur, 347 since the elderly are more likely to visit a doctor when in case of infections with 348 influenza [38] . Thus, strong mixing within group and high sampling, might explain the 349 increased connectivity of the elderly. The second group that we found to have many connections to other patients where 351 school-aged children. When looking to which groups these were connected, we found 352 them to be associated with other school-aged children. However, they are unlikely to be 353 sampled more than preschoolers [39] , and we do not see evidence for oversampling of 354 this group compared to preschoolers (see figure 2b ). We therefore interpret our results 355 as school-aged children being involved in more transmission events compared to the 356 other patient groups, including preschoolers. Furthermore, adults living in the same 357 household as children might get mainly infected by the children and not by other adults, 358 which does however not mean that adults do not play a crucial role in introducing novel 359 lineages into the city. Indeed, children have been previously reported to be a strong 360 driver of influenza transmission [14] . These interpretations however are based on the analysis of influenza isolates from 362 one season and city and will therefore need to be repeated in different seasons and cities 363 to get a more complete understanding of the transmission patterns of influenza across 364 age groups. genetic datasets of influenza sequences to date. Additionally, we connected the genetic 376 information to patient information such a residential address and age for all and more 377 personal information for a lot of the patients, providing unparalleled resolution to study 378 how influenza spreads locally. The 2016/17 season for which we collected data was 379 dominated by influenza A/H3N2. Based on this data, we observe that hundreds of 380 introductions initiate the seasonal influenza epidemic in the studied city of Basel, that 381 the overall spread varies throughout the season, and that children seem to drive the 382 local outbreaks, while elderly have their own transmission chains. Overall, the dataset included sequencing runs from 842 unique isolates. For some of 398 these isolated, sequencing was performed more than once. In this case, the isolate that 399 fulfilled the quality criteria detailed below better was used. These isolates, in some 400 cases, were isolated from the same patient, but at different time points. Bases were 401 called to be neutral if the coverage for a site was less than 100. If a segment of a 402 sequence had more than 20% neutral bases on more than 5 segments, the sequence was 403 not used for further analysis. If the HA segment was not part of the 3 segments that 404 were had less than 20% neutral bases, the sequence was discarded as well. Cost-effectiveness study on influenza prevention in Hong Kong Unifying the epidemiological and evolutionary dynamics of pathogens The epidemic behavior of the hepatitis C virus Bayesian coalescent inference of past population dynamics from molecular sequences. Molecular biology and evolution Phylodynamics with migration: a computational framework to quantify population structure from genomic data. Molecular biology and evolution The structured coalescent and its approximations The global circulation of seasonal influenza A (H3N2) viruses Global migration dynamics underlie evolution and persistence of human influenza A (H3N2) Temporally structured metapopulation dynamics and persistence of influenza A H3N2 virus in humans Unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza H3N2 Global circulation patterns of seasonal influenza viruses vary with antigenic drift Extensive geographical mixing of 2009 human H1N1 influenza A virus in a single university community Stochastic processes constrain the within and between host evolution of influenza virus Interpandemic influenza in the Houston area Identification of influenza urban transmission patterns by geographical, epidemiological and whole genome sequencing data: protocol for an observational study Evaluation of two workflows for whole genome sequencing-based typing of influenza A viruses Trimmomatic: a flexible trimmer for Illumina sequence data Fast gapped-read alignment with Bowtie 2 The sequence alignment/map format and SAMtools LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies TreeTime: Maximum-likelihood phylodynamic analysis Nextstrain: real-time tracking of pathogen evolution Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Molecular biology and evolution BEAST 2: a software platform for Bayesian evolutionary analysis The evolution of Ebola virus: Insights from the 2013-2016 epidemic Efficient Bayesian inference under the structured coalescent Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV) Improving Bayesian population dynamics inference: a coalescent-based model for multiple loci Design and analysis of experiments Implementing Shaffer?s multiple comparison procedure for a large number of groups Estimates of the reproduction number for seasonal, pandemic, and zoonotic influenza: a systematic review of the literature Influenza virus transmission is dependent on relative humidity and temperature Strategies for mitigating an influenza pandemic Estimating the impact of school closure on influenza transmission from Sentinel data Global environmental drivers of influenza Estimating the basic reproductive number from viral sequence data. Molecular biology and evolution Influenza-associated hospitalizations in the United States Mortality, morbidity, and hospitalisations due to influenza lower respiratory tract infections, 2017: an analysis for the Global Burden of Disease Study