key: cord-0586902-9dlemjdh authors: Schweinberger, Michael; Bomiriya, Rashmi P.; Babkin, Sergii title: A semiparametric Bayesian approach to epidemics, with application to the spread of the coronavirus MERS in South Korea in 2015 date: 2021-07-01 journal: nan DOI: nan sha: 1e5a6c8ee6ccf096a768834a740a1613b66610f8 doc_id: 586902 cord_uid: 9dlemjdh We consider incomplete observations of stochastic processes governing the spread of infectious diseases through finite populations by way of contact. We propose a flexible semiparametric modeling framework with at least three advantages. First, it enables researchers to study the structure of a population contact network and its impact on the spread of infectious diseases. Second, it can accommodate short- and long-tailed degree distributions and detect potential superspreaders, who represent an important public health concern. Third, it addresses the important issue of incomplete data. Starting from first principles, we show when the incomplete-data generating process is ignorable for the purpose of Bayesian inference for the parameters of the population model. We demonstrate the semiparametric modeling framework by simulations and an application to the partially observed MERS epidemic in South Korea in 2015. We conclude with an extended discussion of open questions and directions for future research. The spread of infectious diseases through populations by way of contact represents an important public health concern. The spread of COVID-19 and other viruses underscores the importance of understanding how infectious diseases spread by way of contact and how the spread of infectious diseases can be curbed. Indeed, COVID-19 is not the first virus to spread around the globe, and it will not be the last: Epidemics have been documented since at least the Middle Ages (e.g., the plague) and are primed to become more frequent rather than less frequent in the interconnected world of the twenty-first century (as the recent spread of COVID-19, MERS, SARS, Ebola, and Zika demonstrates). We follow a network-based approach to modeling the spread of infectious diseases. A network-based approach is appealing, because the network of contacts in a population determines how infectious diseases can spread (e.g., Keeling and Eames 2005; Danon et al. 2011; Welch, Bansal, and Hunter 2011) . One of the advantages of a network-based approach is that heterogeneity in the number of contacts can be captured (e.g., Danon et al. 2011) along with other features of population contact networks (e.g., Welch et al. 2011) . Indeed, conventional models of epidemics-including classic and lattice-based Susceptible-Infectious-Recovered (SIR) and Susceptible-Exposed-Infectious-Recovered (SEIR) models (e.g., Andersson and Britton 2000; Danon et al. 2011 )-can be considered to be degenerate versions of network-based models, in the sense that such models postulate that with probability 1 the population contact network is of a known form: e.g., with probability 1 each population member is in contact with every other population member. Worse, the postulated form of the population contact network may not resemble real-world contact networks. A second advantage is that a network-based approach helps study the structure of a population contact network and its generating mechanism, helping generalize findings to similar populations. Motivated by the shortcomings of classic and lattice-based SIR and SEIR models of epidemics, Britton and O'Neill (2002) , Hunter (2011, 2012) , Groendyke and Welch (2018) , Bu, Aiello, Xu, and Volfovsky (2021) , and others have explored a networkbased approach to epidemics. While a network-based approach is more appealing than classic and lattice-based SIR and SEIR models, existing network-based models of epidemics have shortcomings. As we discuss in Section 3, one of the more important shortcomings is that existing network-based approaches are either not flexible models of degree distributions or induce short-tailed degree distributions, that is, the population does not contain population members who have far more contacts than the bulk of the population members. Short-tailed degree distributions are problematic, because degree distributions of real-world contact networks are thought to be long-tailed (e.g., Laumann, Gagnon, Michael, and Michaels 1994; Albert and Barabàsi 2002; Handcock 2003a,b, 2004) and the population members in the upper tail of the degree distribution represent an important public health concern: Population members with many contacts can infect many others and are hence potential superspreaders. Indeed, there is circumstantial evidence to suggest that superspreaders have played a role in the SARS epidemic in 2002-2003, the MERS epidemic in 2015, and the ongoing COVID-19 pandemic. We introduce a flexible semiparametric modeling framework based on infinite mixture models and Dirichlet process priors (Ferguson 1973; Teh 2010) , with at least three advantages. First, it shares with existing network-based approaches the advantage that it enables researchers to study the structure of a population contact network and its impact on the spread of infectious diseases. Second, in contrast to existing network-based approaches, it is a flexible model of short-and long-tailed degree distributions and can detect potential superspreaders. Third, it addresses the important issue of incomplete data and can deal with a wide range of missing data and sample data. Starting from first principles, we show when the incomplete-data generating process is ignorable for the purpose of Bayesian inference for the parameters of the population model. In addition, we stress the importance of collecting network data with a view to reducing the posterior uncertainty about the population contact network and its generating mechanism along with possible sources of infections. We demonstrate the proposed framework by simulations and an application to the partially observed MERS epidemic in South Korea in 2015 (Ki 2015) . The MERS epidemic was driven by the coronavirus MERS, which is related to the coronaviruses SARS and COVID-19. We detect three to five potential superspreaders, who may have had a great impact on the outcome of the outbreak. The proposed semiparametric modeling framework, based on infinite mixture distributions and Dirichlet process priors (Ferguson 1973; Teh 2010) , extends to infinite populations. That said, we assume that the number of population members N is finite and embrace a superpopulation approach to statistical inference along the lines of Hartley and Sielken Jr. (1975) and Schweinberger, Krivitsky, Butts, and Stewart (2020) , motivated by applications. The assumption of finite N is motivated by the fact that in epidemiological applications the number of population members N cannot be infinite. For example, when the population of interest consists of all animals or all humans on earth, the size of the population is bounded above by real-world constraints such as geography and the scarcity of natural resources: Planet earth cannot host infinite populations of animals or humans. Since the population of interest is finite, the natural objective of statistical inference is to learn the stochastic process that generated the population contact network and allows infectious diseases to spread through the population of interest, with a view to understanding and predicting epidemics in the population of interest and similar populations. In other words, it is natural to embrace a superpopulation approach to statistical inference, as discussed by Hartley and Sielken Jr. (1975) and Schweinberger et al. (2020) . The properties of statistical procedures for superpopulation inference can be understood by developing a non-asymptotic statistical theory that relies on concentration inequalities and other non-asymptotic tools that have been embraced in high-dimensional statistics (see, e.g., Wainwright 2019). We are not aware of non-asymptotic statistical theory for stochastic models of epidemics, although there are asymptotic results in probability theory (e.g., Reinert 1995; Britton, Lindholm, and Turova 2011; Barbour and Reinert 2013; Pang and Pardoux 2020; Ball 2021 ) and statistical theory (e.g., Britton 1998 Britton , 2001 based on N → ∞ asymptotics. Developing non-asymptotic statistical theory for stochastic models of epidemics constitutes an interesting direction for future research, but is beyond the scope of our paper. The remainder of the paper is structured as follows. A network-based stochastic model of epidemics is reviewed in Section 2. We discuss shortcomings of parametric population models in Section 3 and introduce semiparametric population models in Section 4. In Section 5, we argue that collecting complete data is all but impossible, and stress the importance of collecting network data. Principled Bayesian inference based on incomplete data is discussed in Section 6. We present simulation results in Section 7 and an application to the partially observed MERS epidemic in South Korea in Section 8. We conclude with an extended discussion of open questions and directions for future research in Section 9. We introduce a network-based stochastic model of epidemics. We first describe a generic data-generating process in Section 2.1 and then review parametric population models in Section 2.2. Figure 1 : Data-generating process: Conditional on contacts (undirected lines) among population members (circles), infectious population members (red) spread an infectious disease by contact (directed lines) to susceptible population members (white), which are exposed (gray) before turning infectious (red). . . . . . . We consider a population with N < ∞ population members, who may be connected by contacts. In the simplest case, contacts among population members are time-invariant and are either absent or present. We discuss in Section 9.6 possible extensions to time-evolving population contact networks. A basic data-generating process is shown in Figure 1 and can be described as follows: • Generate a population contact network. • Conditional on the population contact network, generate an epidemic. The population contact network is generated by a random graph model as described in Section 2.2 (parametric) and Section 4 (semiparametric). Conditional on the population contact network, an infectious disease spreads through the population by way of contact, governed by a continuous-time stochastic process such as the SIR and SEIR model (Andersson and Britton 2000; Britton and O'Neill 2002; Groendyke et al. 2011 Groendyke et al. , 2012 . We focus on the network-based SEIR model, which can be sketched as follows (Britton and O'Neill 2002; Groendyke et al. 2011 Groendyke et al. , 2012 . In the simplest case, the initial state of the stochastic process consists of a population with one infected population member and N − 1 susceptible population members. Infected population members pass through three states: the exposed state; the infectious state; and the removed state. In the exposed state, population members are infected, but cannot infect others. In the infectious state, population members can infect susceptible population members by way of contact, with transmissions being independent across contacts. In the final state-the removed state-population members have either recovered and are immune to re-infection or have died, and hence cannot infect others. The epidemic continues until all infected population members are removed from the population. All of the described events-the event that an infectious population member infects a susceptible population member and the transition of an infected population member from the exposed to the infectious state and from the infectious state to the removed state-are independent and occur at random times. The waiting times until these events occur follow Exponential or Gamma distributions. More specific assumptions about the distributions of waiting times and the population contact network are detailed in Section 2.2 and in the monographs of Andersson and Britton (2000) and Britton and Pardoux (2019) . Consider an epidemic that started at time 0 and ceased by time 0 < t < ∞. We assume that the identities of infected population members are known and denoted by 1, , where Y i,j = 1 if population members i and j are in contact during the epidemic and Y i,j = 0 otherwise. Since contacts are undirected and self-contacts are meaningless, we assume that Y i,j = Y j,i and Y i,i = 0 hold with probability 1. The transmissions are denoted by The starting times of the exposure, infectious, and removal periods of infected population members are denoted by We write X = (E, I, R, T ) and, in a mild abuse of language, we refer to X as an epidemic. The complete-data likelihood function, given complete observations x and y of the epidemic X and the population contact network Y , is of the form L(η, θ; x, y) ∝ L(η E ; x) L(η I ; x) L(β; x, y) L(θ; y), where η = (η E , η I , β) ∈ Ω η ⊆ R d 1 (d 1 ≥ 1) and θ ∈ Ω θ ⊆ R d 2 (d 2 ≥ 1) are the parameter vectors of the population model generating the epidemic X and the population contact network Y , respectively. We describe each component of the complete-data likelihood function in turn, along with its parameters. The components L(η E ; x) and L(η I ; x) of the likelihood function are of the form where p(. | E i , η E ) and p(. | I i , η I ) are densities with suitable support parameterized by η E and η I , respectively: e.g., the densities may be Gamma densities, with η E ∈ R + × R + and η I ∈ R + × R + being scale and shape parameters of Gamma densities. Under the assumption that the waiting times until infectious population members infect susceptible population members are independent Exponential(β) random variables with rate of infection β ∈ R + , the component L(β; x, y) of the likelihood function is given by Here, 1 I i 0 denotes the concentration parameter and G denotes the base distribution of the Dirichlet process prior, while δ θ h denotes a point mass at Draws from a Dirichlet process prior can be generated by generating • the first draw from G; • the i-th draw with a probability proportional to α from G, and otherwise drawing one of the existing draws θ 1 , . . . , θ i−1 at random. The described construction of Dirichlet process priors reveals that sampling the degree parameters θ 1 , . . . , θ N from a Dirichlet process prior implies that some of the degree parameters are resampled. As a consequence, some population members share the same degree parameters. Thus, Dirichlet process priors induce a partition of the population into subpopulations, where subpopulations share the same degree parameters. In applications to real-world data, we can make probabilistic statements about which population members belong to subpopulations with high degree parameters based on the posterior distribution. A short demonstration is provided by the application to the partially observed MERS epidemic in South Korea presented in Section 8.4. In addition to detecting potential superspreaders, the model can accommodate short-and long-tailed degree distributions. To demonstrate that Dirichlet process priors can accommodate short-and long-tailed degree distributions, we consider a population of size N = 1,000 and generate three sets of degree parameters θ 1 , . . . , θ 1000 from the Dirichlet process prior with concentration parameter α = 5 and base distribution N (−5, 25). Figure 2 shows kernel density plots of the three sets of degree parameters θ 1 , . . . , θ 1000 along with the expected degrees of population members. The expected degree of population member i is defined as , i = 1, . . . , N. Figure 2 demonstrates that the distribution of the expected degrees µ 1 (θ), . . . , µ 1000 (θ) can be short-or long-tailed, depending on the degree parameters θ = (θ 1 , . . . , θ 1000 ). The first set of degree parameters θ 1 , . . . , θ 1000 generated from the Dirichlet process prior consists of three subsets of degree parameters, all of them negative. The resulting distribution of the expected degrees resembles a steep mountain with a high peak in a neighborhood of 0 and a short upper tail. In fact, 90% of all population members have expected degrees of less than 5, and the highest expected degree is less than 39, which is much lower than the highest possible degree of 999 in a population of size 1,000. The second set of generated degree parameters θ 1 , . . . , θ 1000 consists of many negative degree parameters and some positive degree parameters between 0 and 5. Since the log odds of the probability of a contact between two population members i and j is θ i + θ j , population members with positive degree parameters can have high to very high expected degrees. Figure 2 shows that the population consists of at least three subpopulations: population members with expected degrees of less than 100; population members with expected degrees between 100 and 200; and population members with expected degree of more than 300. The highest expected degree is about 722. The resulting distribution of the expected degrees is both multimodal and long-tailed. The third set of generated degree parameters θ 1 , . . . , θ 1000 resembles the second set of generated degree parameters, in that the distribution of the expected degrees is multimodal and long-tailed. That said, the third set of draws is less extreme than the second set of draws: e.g., the highest expected degree is about 259 rather than 722. The examples presented above demonstrate that Dirichlet process priors with Gaussian base distributions can accomodate both short-and long-tailed degree distributions, despite the fact that Gaussians are symmetric and unimodal distributions with light tails. Other, non-Gaussian base distributions can be chosen. As a consequence, the proposed semiparametric population model is flexible and can accomodate a wide range of degree distributions with countless forms and shapes, including short-and long-tailed degree distributions. In practice, complete observations of population contact networks and epidemics are rare. While population-level data (e.g., counts of the total number of infected, recovered or deceased population members) may be disseminated by public health authorities and can be collected by scraping websites and other channels of communication, collecting individuallevel data (e.g., the contacts, exposure, infectious, and removal times of infected population members) requires substantial investments in terms of time and resources, making it all but impossible to collect all relevant data. We first discuss possible reasons for incomplete data in Section 5.1 and then stress the importance of collecting network data in Section 5.2. There are many reasons for the fact that available data are, more often than not, incomplete. Some of the possible reasons are: • Epidemics are rare events that occur at random times and in random places, and when such rare events do occur, public health officials and scientists may not be well-prepared to collect relevant data without advance notice. • Ethical and legal considerations can make the collection of data on individual population members challenging, if not impossible: e.g., if there was universal cell phone coverage and all population members carried cell phones at all times, collecting data on contacts among population members would be straightforward by monitoring the locations of cell phones. However, collecting such data would violate laws that protect the privacy of population members. • Epidemics are not limited to urban areas with excellent infrastructure and ready access to public resources, but may occur in remote corners of the planet: e.g., the most recent outbreaks of Ebola started in remote areas of Africa. Worse, some areas with outbreaks may be war-torn. As a result, researchers may not be able to collect data by visiting areas with outbreaks without exposing themselves and others to unacceptable risks. • In addition, there are more mundane reasons for incomplete data, such as -design-based mechanisms: e.g., motivated by financial constraints, researchers may sample population members, which implies that a sampling design determines which data are collected; -out-of-design mechanisms: e.g., population members refuse to share data when the data are considered sensitive. Data can be incomplete because transmissions, exposure, infectious, or removal times of infected population members are unobserved, but many available data sets share one fundamental weakness: There are no data on contacts among population members. The lack of network data is all the more striking, because collecting network data would help reduce the posterior uncertainty about (a) the population contact network Y , which imposes hard constraints on how an infectious disease can spread, because Y i,j = 0 (no contact) implies T i,j = T j,i = 0 (no transmission) with probability 1; (b) the parameter θ of the population model that generated the population contact network Y ; (c) possible sources of infections. Advantages (a) and (b) may not be too surprising, but advantage (c) may be less obvious. To demonstrate that sampling contacts can reduce the uncertainty about possible sources of infections, it is instructive to inspect the full conditional probability of the event that a population member i infected a population member j, assuming that both of them were infected during the course of the epidemic. If ϕ(i infected j) denotes the prior probability of the event that population member i infected population member j, then the conditional probability of the event that population member i infected population member j, given everything else, takes the form . If y 1,j , . . . , y M,j are observed, then y 1,j , . . . , y M,j are fixed and impose hard constraints on who could have infected j: If i was not in contact with j (that is, y i,j = 0), i could not have infected j. By contrast, if y 1,j , . . . , y M,j are unobserved, then y 1,j , . . . , y M,j are not fixed and need to be inferred, increasing the uncertainty about the possible sources of infection. In other words, observed contacts help narrow down the possible sources of infections and, in so doing, help reduce the uncertainty about possible sources of infections. We describe two sampling designs for generating samples of contacts along with epidemiological data: ego-centric sampling and link-tracing. Some background on ego-centric sampling and link-tracing of contacts (but not epidemiological data) can be found in Thompson and Frank (2000) , Handcock and Gile (2010) , and Krivitsky and Morris (2017) . In the literature on network sampling (Frank 1988 ; Thompson and Frank 2000; Gile and Handcock 2006; Handcock and Gile 2010) , popular forms of link-tracing are snowball sampling (Goodman 1961) and respondent-driven sampling (Heckathorn 1997; Salganik and Heckathorn 2004; Gile and Handcock 2010; Gile 2011; Kurant, Markopoulou, and Thiran 2011) . Some of them do not generate probability samples in the strict sense of the word, but generate approximate probability samples (e.g., Gile 2011) . A recent review of these and other network sampling designs can be found in Schweinberger et al. (2020) . We adapt these ideas to sampling contacts along with epidemiological data. An ego-centric sample of contacts and epidemiological data can be generated as follows: (a) Generate a probability sample of population members. (b) For each sampled population member, collect data on the contacts of the population member and, should the population member be infected, data on possible sources of infection along with the exposure, infectious, and removal times of the population member. A probability sample of population members can be generated by any sampling design for sampling from finite populations (e.g., Thompson 2012 ). An interesting extension of ego-centric sampling is link-tracing. Link-tracing exploits the observed contacts of sampled population members to include additional population members into the sample. A k-wave link-tracing sample of contacts and epidemiological data can be generated as follows: (1) Wave l = 0: Generate an ego-centric sample. (2) Wave l = 1, . . . , k: (a) Add the population members who are connected to the population members of wave l − 1 to the sample. (b) For each added population member, collect all relevant data. Ego-centric sampling can be considered to be a special case of k-wave link-tracing with k = 0. We use ego-entric sampling in the simulation study in Section 7. We discuss Bayesian inference for the parameters η and θ of the population model based on incomplete data. Since interest centers on the population model, it is natural to ask: Under which conditions is the process that determines which data are observed ignorable for the purpose of Bayesian inference for the parameters η and θ of the population model? To answer the question, we start from first principles. We first separate the complete-data generating process from the incomplete-data generating process: • The complete-data generating process is the process that generates the complete data, that is, the process that generates a realization (x, y) of (X, Y ). • The incomplete-data generating process is the process that determines which subset of the complete data (x, y) is observed. A failure to separate these processes can lead to misleading conclusions, as pointed out by Rubin (1976) , Dawid and Dickey (1977) , Koskinen, Robins, and Pattison (2010) , Gile (2010, 2017) , Crane (2018) , and Schweinberger et al. (2020) . We therefore proceed as follows. We first separate the complete-data generating process from the incomplete-data generating process in Section 6.1 and then discuss Bayesian inference based on incomplete data in Section 6.2. We then discuss Bayesian computing in Sections 6.3, 6.4, and 6.5. To prepare the ground for principled Bayesian inference, we separate the complete-data generating process from the incomplete-data generating process, adapting the generic ideas of Rubin (1976) to stochastic models of epidemics. To do so, let are observed, respectively. The sequence of indicators A is considered to be a random variable, with a distribution parameterized by ϑ ∈ Ω ϑ ⊆ R q (q ≥ 1): e.g., ϑ ∈ [0, 1] N may be a vector of sample inclusion probabilities, where ϑ i ∈ [0, 1] is the probability that population member i ∈ {1, . . . , N } is sampled and data on the contacts of population member i are collected. The parameter ϑ may be known or unknown. We focus on the more challenging case where ϑ is unknown. The observed and unobserved subset of the complete data (x, y) are denoted by x obs and x mis and y obs and y mis , respectively, where x = (x obs , x mis ) and y = (y obs , y mis ). In Bayesian fashion, we build a joint probability model for all knowns and unknowns and condition on all knowns. Let p(a, x, y, ϑ, η, θ) = p(a, x, y | ϑ, η, θ) p(ϑ, η, θ) be the joint probability density of a, x, y, ϑ, η, θ, where is the prior probability density of ϑ, η, θ and is the conditional probability density of a, x, y given ϑ, η, θ; note that p(y | θ) ≡ p θ (y). It is worth noting that all of these probability densities are with respect to a suitable dominating measure, but we do not wish to delve into measure-theoretic details, which would distract from the main ideas. To determine when the incomplete-data generating process is ignorable for the purpose of Bayesian inference for the parameters η and θ of the population model, we introduce the following definition of a likelihood-ignorable incomplete-data generating process. Definition: likelihood-ignorable incomplete-data generating process. Assume that the parameters ϑ, η, θ are variation-independent in the sense that the parameter space of (ϑ, η, θ) is given by a product space of the form Ω ϑ × Ω η × Ω θ and that the parameters of the population model η and θ and the parameter of the incomplete-data generating process ϑ are independent under the prior, If the probability of observing data does not depend on the values of the unobserved data, p(a | x, y, ϑ) = p(a | x obs , y obs , ϑ) for all a, x, y and all ϑ ∈ Ω ϑ , then the incomplete-data generating process is called likelihood-ignorable, and otherwise nonignorable. We provide two examples of likelihood-ignorable and non-ignorable incomplete-data generating processes. Example: likelihood-ignorable. All infected population members visit hospitals, which record data on contacts, transmissions, exposure, infectious, and removal times of infected population members. To reduce the posterior uncertainty about the population contact network and its generating mechanism, investigators generate a probability sample of noninfected population members from the subpopulation of all non-infected population members and collect data on the contacts of sampled population members, rather than limiting the collection of data to infected population members visiting hospitals. A possible sampling design for generating samples of contacts and epidemiological data is link-tracing, as described in Section 5.2. Link-tracing exploits observed contacts of population members to increase the size of the sample, which implies that the sample inclusion probabilities depend on observed contacts but do not depend on unobserved contacts: p(a | x, y, ϑ) = p(a | x obs , y obs , ϑ) for all a, x, y and all ϑ ∈ Ω ϑ . Therefore, link-tracing sampling designs for generating samples of contacts and epidemiological data are likelihood-ignorable, as are ego-centric sampling designs. Example: non-likelihood-ignorable. Suppose that there exists a constant δ > 0 such that infected population members i with mild symptoms and fast recovery (R i −I i ≤ δ) do not visit hospitals, whereas infected population members i with severe symptoms and slow recovery (R i − I i > δ) do visit hospitals. Hospitals collect data on infected population members who visit them, but no data are collected on other population members. Since the incomplete-data generating process excludes all infected population members with mild symptoms and fast recovery, it cannot be ignored. Statistical analyses ignoring it may give rise to misleading conclusions about the rate of infection β and other parameters of the population model, and may generate misleading predictions of future epidemics in the population of interest and similar populations. Separating the complete-data generating process from the incomplete-data generating process paves the way for principled Bayesian inference based on incomplete data. The following result shows that, as long as the incomplete-data generating process is likelihood-ignorable, Bayesian inference for the parameters η and θ of the population model can be based on the marginal posterior p(η, θ | a, x obs , y obs ). In other words, the incompletedata generating process can be ignored for the purpose of Bayesian inference for the parameters η and θ of the population model, because the marginal posterior p(η, θ | a, x obs , y obs ) can be computed without requiring knowledge about the incomplete-data generating process. Proposition 1. If the incomplete-data generating process is likelihood-ignorable and the prior p(η, θ) is proper, the parameter ϑ of the incomplete-data generating process and the parameters η and θ of the population model are independent under the posterior, p(ϑ, η, θ | a, x obs , y obs ) ∝ p(ϑ | a, x obs , y obs ) p(η, θ | a, x obs , y obs ), and Bayesian inference for the parameters η and θ of the population model can be based on the marginal posterior p(η, θ | a, x obs , y obs ), which can be computed without requiring knowledge about the incomplete-data generating process. It is worth noting that the unobserved contacts of non-infected population members can be summed out, both in the numerator and the denominator of the ratio in (1), because the contacts of population members are independent conditional on θ by construction of the semiparametric population model described in Section 4. In other words, there is no need to use computational methods (e.g., Markov chain Monte Carlo methods) to generate model-based imputations of unobserved contacts of non-infected population members, which saves computing time. A proof of Proposition 1 can be found in the supplement. The case of Britton and O'Neill (2002) and Groendyke et al. (2011 Groendyke et al. ( , 2012 , who considered Bayesian inference from observed infectious and removal times, is a special case of the incomplete-data framework considered here, with x obs = {I, R} and y obs = {}. In general, when x or y are partially observed, Bayesian inference can be based on the marginal posterior of η and θ given the observed data, provided the incomplete-data generating process is likelihood-ignorable. To facilitate Markov chain Monte Carlo sampling from the posterior distribution, we approximate Dirichlet process priors by truncated Dirichlet process priors along the lines of Ishwaran and James (2001) . The truncation of Dirichlet process priors takes advantage of the stick-breaking construction of Dirichlet process priors (Sethuraman 1994; Ishwaran and James 2001; Teh 2010) . A stick-breaking construction of a Dirichlet process prior with base distribution G and concentration parameter α proceeds as follows. First, we sample parameters from the base distribution G: γ k iid ∼ G, k = 1, 2, . . . We then construct mixing proportions π 1 , π 2 , . . . by first sampling V k | α iid ∼ Beta(1, α), k = 1, 2, . . . and then setting (1 − V j ), k = 2, 3, . . . Last, but not least, we construct an infinite mixture distribution with mixing proportions π 1 , π 2 , . . . and point masses δ γ 1 , δ γ 2 , . . . as follows: The distribution P ∞ is then a draw from the Dirichlet process prior with concentration parameter α and base distribution G. A Dirichlet process prior can be truncated by choosing a positive integer K and setting V K = 1, which implies that π K+1 = 0, π K+2 = 0, . . . A finite mixture distribution with mixing proportions π 1 , π 2 , . . . , π K and point masses δ γ 1 , δ γ 2 , . . . , δ γ K can then be constructed as follows: The distribution P K can be regarded as a draw from the Dirichlet process prior with concentration parameter α and base distribution G truncated at K. If K is large, the truncated Dirichlet process prior is expected to be a good approximation of the Dirichlet process prior. Some theoretical guidance regarding the choice of K can be found in Ishwaran and James (2001) . In practice, K can be chosen by • selecting a large positive integer, such as K = N ; • exploiting domain knowledge; • choosing a value of K that leads to acceptable in-or out-of-sample performance. The truncated stick-breaking construction of π implies that π is generalized Dirichlet distributed, which is conjugate to multinomial sampling (Ishwaran and James 2001) and facilitates Markov chain Monte Carlo sampling from the posterior distribution. Since the concentration parameter α and the parameters µ and σ 2 of the base distribution N (µ, σ 2 ) are unknown, it is natural to express the uncertainty about α, µ, and σ 2 by assuming that α, µ, and σ 2 have hyperpriors. We assume that the hyperpriors of the hyperparameters α, µ, and 1/σ 2 are Gamma, Gaussian, and Gamma distributions, respectively, which are conjugate priors and facilitate Markov chain Monte Carlo sampling from the posterior distribution. A Bayesian Markov chain Monte Carlo algorithm for sampling from the posterior distribution is described in the supplement. We list here the main steps of the algorithm, without providing details, and discuss its computing time. Details can be found in the supplement. To list the main steps of the algorithm, let • Z i,k = 1 if population member i was assigned degree parameter γ k and Z i,k = 0 otherwise (i = 1, . . . , N , k = 1, . . . , K); • Z i = (Z i,1 , . . . , Z i,K ) and Z −i = (Z 1 , . . . , Z i−1 , Z i+1 , . . . , Z N ) (i = 1, . . . , N ), and Z = (Z 1 , . . . , Z N ); • γ = (γ 1 , . . . , γ K ); • π = (π 1 , . . . , π K ). As a consequence, the degree parameter θ i of population member i can be expressed as A Markov chain Monte Carlo algorithm for sampling from the posterior distribution can then be constructed by combining the following Markov chain Monte Carlo steps by means of cycling or mixing (Tierney 1994; Liu 2008) , assuming all unknown quantities have been initialized: 1. Impute the missing data: • Sample X mis | X obs = x obs , Y = y, η. • Sample Y mis | X = x, Y obs = y obs , Z = z, γ, η. 2. Sample the parameters of the population model: • Set θ i = Z i γ (i = 1, . . . , N ). • Sample η | X = x, Y = y. • Sample α | π. • Sample π | Z = z, α. • Sample µ | σ 2 , γ. • Sample σ 2 | µ, γ. Most of the Markov chain Monte Carlo steps involve Gibbs sampling from full conditional distributions (e.g., Beta, Gamma, and Gaussian distributions), while the others are Metropolis-Hastings steps. More details are provided in the supplement. The computing time of the algorithm is a function of • the number of subpopulations K, which satisfies K ≤ N ; • the number of infected population members M , which satisfies M ≤ N and in large populations satisfies M N (unless a non-negligible fraction of the population is infected); • the sampling design and the number of population members n sampled out of the N population members for the purpose of collecting data on contacts along with epidemiological data, which satisfies n ≤ N and in large populations satisfies n N (unless a non-negligible fraction of the population is sampled); • the sparsity of the population contact network. As a specific example, consider ego-centric sampling, as described in Section 5.2. An egocentric sampling design samples n out of the N population members and, for each sampled population member, collects data on the contacts of the sampled population member with the N − 1 other population members, in addition to data on the transmissions, exposure, infectious, and removal times of infected population members. As a result, the computing time of each iteration of the Bayesian Markov chain Monte Carlo algorithm is O(K n N ), because updates of the K degree parameters γ 1 , . . . , γ K involve computations of up to n (N − 1) probabilities of the form Having said that, the computing time of O(K n N ) in the case of ego-centric sampling is based on worst-case scenarios and can be reduced to O(K n) in special cases: e.g., when the population contact network is sparse in the sense that many population members have few contacts, it is possible to reduce the computing time by taking advantage of sparsity. In fact, many real-world networks are sparse: While population members can create up to N − 1 contacts, creating physical contacts that enable disease transmission requires geographical proximity and time. As a consequence, it is plausible that the expected degrees of many if not all population members are bounded above by a finite constant (Dunbar 1992; Rohe, Chatterjee, and Yu 2011; Krivitsky, Handcock, and Morris 2011; Lovász 2012; Amini, Chen, Bickel, and Levina 2013; Krivitsky and Kolaczyk 2015; Butts 2019) . In such cases, the resulting population contact network is sparse, and one could reduce the worst-case computing time of O(K n N ) to O(K n). Some ideas of how to exploit sparsity for the purpose of reducing computing time can be found in, e.g., Raftery, Niu, Hoff, and Yeung (2012) and Vu, Hunter, and Schweinberger (2013) . Markov chain Monte Carlo samples from the posterior distribution may show evidence of label-switching, that is, the labeling of subpopulations may have switched from iteration to iteration of the Markov chain Monte Carlo algorithm. The label-switching problem is rooted in the fact that the likelihood function is invariant to permutations of the labels of subpopulations. While the prior is not invariant to permutations of the labels of subpopulations, the prior is dominated by the likelihood function when the data is informative. As a consequence, the labels of subpopulations may switch from iteration to iteration of the Markov chain Monte Carlo algorithm. Label-switching can give rise to misleading conclusions about parameters that depend on the labeling of the subpopulations, including the indicators Z 1 , . . . , Z N and the degree parameters γ 1 , . . . , γ K and θ 1 = Z 1 γ, . . . , θ N = Z N γ. The label-switching problem is a well-known problem in the literature concerned with Bayesian Markov chain Monte Carlo estimation of finite mixture models and related models. We follow the Bayesian decision-theoretic approach of Stephens (2000) to undoing the label-switching, by minimizing the posterior expectation of a well-chosen loss function. A loss function and relabeling algorithm are stated in Schweinberger and Handcock (2015) and implemented in R package hergm (Schweinberger and Luna 2018) . We use them in Sections 7 and 8 to undo the label-switching in all Markov chain Monte Carlo samples generated by the Bayesian Markov chain Monte Carlo algorithm. We explore the frequentist properties of Bayesian point estimators and the reduction in statistical error due to collecting network data by using simulations. We consider a population of size 187 consisting of K = 3 subpopulations labeled 1, 2, 3. The three subpopulations consist of low-, moderate-, and high-degree population members. We assign population members i to subpopulations 1, 2, 3 by sampling Z i iid ∼ Multinomial(1; π = (.4, .3, .3)) (i = 1, . . . , N ). We then generate a population contact network from the population model described in Section 4 with degree parameters θ i = Z i γ (i = 1, . . . , N ). Conditional on the population contact network, an epidemic is generated by the stochastic model described in Section 2, assuming that I i − E i and R i − I i are independent Gamma(η E,1 , η E,2 ) and Gamma(η I,1 , η I,2 ) random variables, respectively (i = 1, . . . , M ). The data-generating values of the parameters are specified in Sections 7.1 and 7.2. Unless stated otherwise, we assume that the exposure, infectious, and removal times E, I, R are observed, whereas the transmissions T are unobserved, as are the population contact network Y and the indicators Z. Table 1 : Simulation results. Number of times 95% posterior credible intervals include the data-generating values of the parameters β, η E,1 , η E,2 , η I,1 , η I,2 , γ 1 , γ 2 , and γ 3 in %, using a Dirichlet process prior truncated at K = 3 and K = 5. We generate 1,000 population contact networks and epidemics as described above. The data-generating values of the parameters β, η E,1 , η E,2 , η I,1 , η I,2 , γ 1 , γ 2 , and γ 3 are shown in Table 1 . For each data set, we truncate the Dirichlet process prior at K = 3 and K = 5 as described in Section 6.3, and estimate the data-generating model by the Bayesian Markov chain Monte Carlo algorithm described in Section 6.4. Table 1 sheds light on the frequentist coverage properties of 95% posterior credible intervals. The simulation results indicate that the frequentist coverage properties of posterior credible intervals are excellent in the case of the epidemiological parameters β, η E,1 , η E,2 , η I,1 , and η I,2 , but less so in the case of the network parameters γ 1 , γ 2 , and γ 3 . These results underscore the challenge of estimating network parameters without observing network data. Section 7.2 demonstrates that the statistical error can be reduced by collecting network data. To demonstrate that collecting network data can reduce the posterior uncertainty about the parameters of the population model, we consider a population consisting of K = 3 subpopulations. The K = 3 subpopulations correspond to • a low-degree subpopulation of size 127 with degree parameter γ 1 = −3.5; • a moderate-degree subpopulation of size 50 with degree parameter γ 2 = −1.5; • a high-degree subpopulation of size 10 with degree parameter γ 3 = .5. We generate 1,000 ego-centric samples of sizes n = 25, 50, 75, 100, 125, 150, 187 from the population of size N = 187. We then estimate the population model from each sample of contacts along with observations of the exposure, infectious, and removal times of infected population members. In addition, we estimate the population model without observations of contacts, which corresponds to a sample size of n = 0, using observations of the exposure, infectious, and removal times of infected population members. To assess how much the posterior uncertainty about the parameters of the population model is reduced by sampling contacts, we use the mean squared error (MSE) of the posterior median and mean of the parameters. The MSE of the posterior median and mean of the rate of infection β and the degree parameters γ 1 , γ 2 , and γ 3 is plotted in Figure 3 against the sample size n = 0, 25, 50, 75, 100, 125, 150, 187. By construction of the model, estimators of the epidemiological parameters η E,1 , η E,2 , η I,1 , and η I,2 are not expected to be sensitive to n-which determines how much information is available about the network parameters-and the MSE of the posterior median and mean of η E,1 , η E,2 , η I,1 , and η I,2 are indeed not sensitive to n (not shown). However, Figure 3 demonstrates that samples of contacts do reduce the MSE of the posterior median and mean of β, γ 1 , γ 2 , and γ 3 : The MSE turns out to be highest when n = 0, and rapidly decreases as n increases. These observations underscore the importance of collecting data on contacts or functions of contacts, such as degrees. We showcase the statistical framework introduced in Sections 4, 5, and 6 by applying it to the partially observed MERS epidemic in South Korea in 2015 (Ki 2015) . The MERS outbreak in South Korea was driven by the coronavirus MERS, which is related to the coronaviruses SARS and COVID-19. We retrieved the data from the website http://english.mw.go.kr of the South Korean Ministry of Health and Welfare on September 22, 2015. The website has been removed since, but the data can be obtained from the authors. The first MERS case in South Korea was confirmed on May 20, 2015 and the last case was confirmed on July 4, 2015. By September 22, 2015, 186 cases had been confirmed, of which 144 had recovered while 35 had died, all of whom are considered to be removed from the population. 7 cases had not removed by September 22, 2015, despite the fact that the last case of MERS was confirmed on July 4, 2015. We assume that the removal times of those 7 cases are unobserved and that the outbreak ceased by September 22, 2015, because the data base has not been updated by the South Korean Ministry of Health and Welfare since July 2015. The data consist of the infectious times and removal times of the 186 infected population members, with 7 missing removal times. The exposure times are unobserved and inferred along with the 7 missing removal times. The MERS data set does not include direct observations of transmissions or contacts, but there are two sources of indirect observations: • The assessments of doctors of who infected whom. • The observed infectious and removal times, which reveal when infected population members were infectious and, in so doing, help narrow down the possible sources of infections. Both of these sources help inform who infected whom and who was in contact with whom, because an infection implies a contact. We use the semiparametric population model introduced in Sections 2 and 4, assuming that I i − E i and R i − I i are independent Gamma(η E,1 , η E,2 ) and Gamma(η I,1 , η I,2 ) random variables, respectively (i = 1, . . . , 186). The priors of the epidemiological parameters are given by η E,1 ∼ Uniform(4, 8), η E,2 ∼ Uniform(.75, 3), η I,1 ∼ Uniform(1.5, 8), η I,2 ∼ Uniform(2.5, 7.5), and β ∼ Uniform (.1, 8) , which cover a range of plausible values (see, e.g., the discussions of Groendyke et al. 2011 Groendyke et al. , 2012 . The degree parameters θ 1 , . . . , θ 186 are assumed to have been generated by a Dirichlet process prior with concentration parameter α and base distribution N (µ, σ 2 ). To express the uncertainty about the concentration parameter α and the parameters µ and σ 2 of the base distribution, we assume that the hyperparameters α, µ, and 1/σ 2 have Gamma(5, 1), N (0, 1), and Gamma(1, 10) hyperpriors, respectively. To facilitate Markov chain Monte Carlo sampling from the posterior distribution, we truncate the Dirichlet process prior at K = 3 as described in Section 6.3. We choose K = 3 because we expect two subpopulations, one corresponding to potential superspreaders and one corresponding to all other population members, and K = 3 is a convenient upper bound on the number of subpopulations. We consider two specifications of the prior probabilities of who infected whom, because the posterior is sensitive to the prior probabilities of who infected whom. The reason is that the MERS data set does not contain direct observations of transmissions or contacts, and the information on transmissions is therefore limited to two sources of indirect observations: the assessments of doctors of who infected whom and the observed infectious and removal times, as discussed in Section 8.1. While the observed infectious and removal times help narrow down the possible sources of infections, there may be many possible sources of infections left. As a consequence, it is not surprising that the posterior is sensitive to the choice of prior probabilities of who infected whom. We demonstrate the sensitivity of the posterior to the choice of prior in Section 8.4 by using two specifications of prior probabilities: (a) If the doctors assessed that population member i infected population member j, we specify ϕ(i infected j) = 1 and ϕ(h infected j) = 0 for all infected population members h ∈ {1, . . . , 186} \ {i, j}. If the doctors did not specify who infected j and there are M j ∈ {1, . . . , 185} infected population members h satisfying I h < E j < R h , we specify ϕ(i infected j) = 1 / M j for all infected population members i satisfying I i < E j < R i and ϕ(h infected j) = 0 for all other infected population members h. (b) If there are M j ∈ {1, . . . , 185} infected population members h satisfying I h < E j < R h , we specify ϕ(i infected j) = 1 / M j for all infected population members i satisfying I i < E j < R i and ϕ(h infected j) = 0 for all other infected population members h. Last, but not least, there is no evidence to suggest that the incomplete-data generating process is non-ignorable, therefore we estimate the population model under the assumption that the incomplete-data generating process is ignorable. It is worth noting that the assumption of ignorability is a strong assumption, but it is convenient and defensible unless there is strong evidence to the contrary. We discuss how to deal with non-ignorable incomplete-data generating processes in Section 9.4. We sample from the posterior distribution by using the Bayesian Markov chain Monte Carlo algorithm described in Section 6.4. The Markov chain Monte Carlo algorithm required 302 minutes. Out of the 2,000,000 sample points generated by the Markov chain Monte Carlo algorithm, we discard the first 200,000 sample points as a burn-in and retain every 100th sample point of the remaining 1,800,000 sample points, giving rise to a Markov chain Monte Carlo sample of size 18,000. The resulting Markov chain Monte Carlo sample shows evidence of label-switching. We undo the label-switching as described in Section 6.5. Trace plots of selected parameters can be found in Figure 4 . The trace plots do not show signs of non-convergence and suggest that the label-switching has been undone. The semiparametric population model introduced in Section 4 induces a partition of the population into subpopulations. We therefore start by investigating the local structure of the population, by determining which population members belong to which subpopulations. Figure 5 shows the posterior classification probabilities of population members along with the transmissions of MERS according to the assessments of doctors. According to doctors, 181 out of the 186 infected population members are suspected of having infected 0, 1, or 2 other population members, while 5 population members are believed to have infected more than 2 other population members: 1 (infected 30 others); 14 (infected 70 others); 15 (infected 6 others); 16 (infected 23 others); and 76 (infected 10 others). In total, these 5 population members are believed to have infected 139 out of the 186 infected population members. The posterior classification probabilities in Figure 5 suggest that three subpopulations can be distinguished: • A subpopulation consisting of population members 1, 14, and 16, which we represent by the color red in Figure 5 . Population member 1 was the first confirmed MERS case in South Korea and is believed to have infected population members 14 and 16. These three population members are suspected of being responsible for the three largest clusters of infections, infecting a total of 123 out of the 186 infected population members, and may therefore be considered to be primary drivers of the MERS outbreak. • A subpopulation consisting of population members 15 and 76, which we represent by the color orange in Figure 5 . Population member 76 (infected 10 others) belongs with high posterior probability to the orange-colored subpopulation, whereas there is more uncertainty about the classification of population member 15 (infected 6 others). Both of them are thought to be responsible for clusters of infections-albeit smaller clusters of infections than population members 1, 14, and 16-and may therefore be considered to be secondary drivers of the MERS outbreak. • A subpopulation consisting of all other population members, which we represent by the color gray in Figure 5 . To gain insight into the propensities of population members to form contacts within and between these subpopulations, we inspect the marginal posterior densities of the degree parameters γ 1 , γ 2 , and γ 3 . The degree parameters γ 1 , γ 2 , and γ 3 correspond to the red-, orange-, and gray-colored subpopulations in Figure 5 , respectively. Figure 6 shows Markov chain Monte Carlo approximations of the marginal posterior densities of γ 1 , γ 2 , and γ 3 . The 95% posterior credible intervals of γ 1 , γ 2 , and γ 3 are given by [.140, .959 Degree Count tween two population members i and j is θ i + θ j = Z i γ + Z j γ, members of the red-colored subpopulation have a high propensity to be in contact with other population members, in particular other members of the red-and orange-colored subpopulations, in addition to members of the gray-colored subpopulation. It is worth noting that the posterior uncertainty-as measured by the lengths of the 95% posterior credible intervals-is highest for γ 2 and lowest for γ 3 , which makes sense because no more than 1 or 2 population members belong to the orange-colored subpopulation with degree parameter γ 2 (with a high posterior probability), whereas 181 population members belong to the gray-colored subpopulation with degree parameter γ 3 (with a high posterior probability). Taken together, these results suggest that there were three to five potential superspreaders who may have had a great impact on the outcome of the MERS outbreak in South Korea. These observations underscore the importance of detecting potential superspreaders. We proceed with posterior predictions of the degree distribution of the population contact network. While the population contact network and its degree distribution are unobserved, posterior predictions of the degree distribution can be generated, provided draws from the posterior distribution are available. Probabilistic statements about the degrees of population members based on the posterior distribution are informed by the observed infectious and removal times along with the assessments of doctors of who infected whom. Both of these sources of information help inform who was in contact with whom, because • the observed infectious and removal times reveal when infected population members were infectious, which helps narrow down the possible sources of infections; • an infection implies a contact. The posterior predictions of the degree distribution shown in Figure 7 suggest that the degree distribution is long-tailed: The bulk of population members has no more than 10 contacts, but some population members have as many as 80 contacts. As pointed out before, population members with many contacts can infect many other population members and therefore represent an important public health concern. Last, but not least, it may be of interest to inspect posterior predictions of the epidemic itself. We focus on the maximum of the "epidemic curve," that is, the maximum number of infectious population members during the height of the MERS outbreak. The observed number is 129, that is, the number of infectious population members reached 129 at the height of the MERS outbreak. Figure 8 reveals that the posterior predictive distribution of the maximum number of infectious population members is bimodal. The two modes of the posterior predictive distribution correspond to two scenarios. In the first scenario, the first infected population member is isolated in the sense that it has no contacts. As a result, no other population member is infected, giving rise to the mode at 1. In the second scenario, the first infected population member is not isolated. Then MERS can spread throughout the population, resulting in the mode at the observed number of 129. These results are based on prior specification (a) described in Section 8.2, that is, the prior probabilities of who infected whom are specified in accordance with the assessments of doctors. If the assessments of doctors are ignored and prior specification (b) in Section 8.2 is used, then-conditional on the event that the first infected population member is not isolated-the predicted maximum number of infectious population members is 122.5 on average, which is 5% lower than the observed number of 129. By contrast, when the assessments of doctors are taken into account and prior specification (a) in Section 8.2 is used, then-conditional on the event that the first infected population member is not isolated-the predicted maximum number of infectious population members is 127.9 on average, which is 1% lower than the observed number of 129. In other words, ignoring the assessments of doctors and choosing prior specification (b) leads to underpredictions of the maximum of the "epidemic curve," whereas utilizing the assessments of doctors and choosing prior specification (a) leads to predictions that match, on average, the observed maximum of the "epidemic curve" rather well. These observations underscore that • posterior predictions of the epidemic are sensitive to the choice of prior of who infected whom, at least in the absence of observations of transmissions or contacts; • data on transmissions and contacts should be collected to reduce the posterior uncertainty about quantities of interest and the sensitivity of posterior predictions of the epidemic to the choice of prior. We have introduced a semiparametric population model that can accommodate both shortand long-tailed degree distributions and detect potential superspreaders, in addition to dealing with a wide range of missing and sample data. Having said that, many open questions remain. Some of them are rooted in the lack of data, while others stem from computational and statistical challenges arising from the lack of data and the complexity of the models. We review a selection of open questions and directions for future research below. To analyze the MERS outbreak in South Korea, we applied the proposed semiparametric modeling framework to the 186 infected population members. In so doing, we made the implicit assumption that the population of interest consists of those 186 infected population members. There is no denying that such an assumption is unappealing. In fact, the assumption was motivated by convenience rather than substantive considerations, including the challenge of determining the population of interest: e.g., does the population of interest consist of all residents of South Korea, all residents of East Asia, or the whole world? As pointed out before, collecting complete data on population contact networks and epidemics is all but impossible. As a consequence, public health officials and researchers face a recurring question in the event of epidemics (whether outbreaks of coronaviruses such as COVID-19, MERS, or SARS, Ebola viruses, or other viruses): Which data to collect? We believe that, to learn the structure of a population contact networks and its impact on the spread of an infectious diseases, investigators should attempt to collect contact and epidemiological data on all infected population members and collect samples of contacts from non-infected population members by likelihood-ignorable sampling designs. We discuss some of the challenges arising in practice along with possible solutions. First, while it is challenging to collect data on transmissions, it is advisable to collect data that help reduce the posterior uncertainty about epidemiological parameters. One possible source of data are viral genetic sequence data, among other possible data sources. We refer interested readers to Bouckaert et al. (2019) for a recent review of possible data sources. Second, it is prudent to sample from population contact networks to reduce the posterior uncertainty about network parameters. The two likelihood-ignorable sampling designs discussed above can be used to do so, though both require a sampling frame. If no sampling frame is available, an alternative would be respondent-driven sampling (e.g., Gile and Handcock 2010; Gile 2011) , which is a form of link-tracing without a sampling frame. Location data collected by mobile phones and other electronic sources would be alternative sources of data, but raise data privacy issues (Fienberg and Slavković 2010) . In the past decade, substantial progress has been made on data privacy in the statistical literature: see, e.g., the work of Karwa and Slavković (2016) on data privacy in scenarios where network data are generated by the β-model (albeit without Dirichlet process priors and without epidemics). Studying data privacy for epidemics would be an important direction of future research. The lack of data has computational implications. Statistical algorithms for likelihood-based statistical inference (e.g., EM algorithms, Dempster, Laird, and Rubin 1977) have to integrate over the unobserved data, hence the computing time tends to increase with the amount of missing data. How to develop scalable statistical algorithms, with statistical guarantees, is an open question. One idea would be to develop a two-step estimation algorithm, first estimating the network parameters and then estimating the epidemiological parameters, leveraging computational advances in the statistical analysis of network data (e.g., Raftery et al. 2012; Salter-Townshend and Murphy 2013) and epidemiological data (Bouckaert et al. 2019) . A two-step estimation algorithm may require data on contacts, however, because without data on contacts the posterior correlations of the network parameters and the rate of infection can be high (Groendyke et al. 2011) , in which case two-step estimation algorithms may not work well. A related problem is how to update posteriors in reasonable time as more data on infections and contacts come in. We have considered here ignorable incomplete-data generating processes. If the incompletedata generating process is non-ignorable, then either the incomplete-data generating process must be modeled or it must be demonstrated that Bayesian inference for the parameters of the population model is insensitive to the incomplete-data generating process. Both approaches require insight into the incomplete-data generating process and may require additional model assumptions, some of which may be untestable. We have focused here on the degrees of population members as network features, but there are many other important network features: e.g., if population contact networks exhibit closure (e.g., transitive closure, Wasserman and Faust 1994; Kolaczyk 2009 ), then infectious diseases may spread rapidly within subpopulations but may spread only slowly through the whole population, which has potential policy implications. There are two broad approaches to capturing closure in population contact networks: latent space models (Hoff, Raftery, and Handcock 2002; Handcock, Raftery, and Tantrum 2007; Smith, Asta, and Calder 2019) and related latent variable models (e.g., Salter-Townshend, White, Gollini, and Murphy 2012; Rastelli, Friel, and Raftery 2016; Fosdick and Hoff 2015; Hoff 2021) ; and exponential-family models of random graphs (Frank and Strauss 1986; Snijders, Pattison, Robins, and Handcock 2006; Lusher, Koskinen, and Robins 2013) . Both classes of models can accommodate the degree terms we have used (see, e.g., Krivitsky, Handcock, Raftery, and Hoff 2009; Thiemichen, Friel, Caimo, and Kauermann 2016) along with additional terms that reward closure in population contact networks. However, both of them come at costs in terms of computing time (Bhamidi, Bresler, and Sly 2011; Chatterjee and Diaconis 2013) , although the computing time depends on the class of models under consideration (see, e.g., Karwa, Petrović, and Bajić 2016) . In addition, Welch (2011) pointed out that closure in population contact networks may not be detectable unless data on the population contact network are collected. This, too, underscores the importance of collecting network data. We have assumed that the population contact network is time-invariant, motivated by the lack of data on the population contact network and the desire to keep the model as simple We sample from the posterior distribution by combining the following Markov chain Monte Carlo steps by means of cycling or mixing (Tierney 1994; Liu 2008) . Concentration parameter α | π. Assuming the hyperprior of concentration parameter α is Gamma (A 1 , B 1 ) , we can update α by sampling α | π ∼ Gamma(A 1 + K − 1, B 1 − log π K ). Mean parameter µ | σ 2 , γ. Assuming the hyperprior of mean parameter µ is N (O, S 2 ), we can update µ by sampling Precision parameter 1/σ 2 | µ, γ. Assuming the hyperprior of precision parameter 1/σ 2 is Gamma(A 2 , B 2 ), we can update 1/σ 2 by sampling Mixing proportions π | Z = z, α. Let N k be the number of population members in subpopulation k. Then π can be updated by sampling V k | Z = z, α ind ∼ Beta 1 + N k , α + K j=k+1 N j , k = 1, . . . , K − 1, then setting V K = 1 and constructing π as follows: (1 − V j ), k = 2, . . . , K. Indicators Z i | Y = y, Z −i = z −i , γ, π. We update indicators Z i by sampling Z i | Y = y, Z −i = z −i , γ, π ∼ Multinomial(1; π i,1 , . . . , π i,K ), where π i,k = π k N j: j =i p(y i,j | Z ik = 1, Y = y, Z −i = z −i , γ, π) K l=1 π l N j: j =i p(y i,j | Z il = 1, Y = y, Z −i = z −i , γ, π) . Degree parameters γ | Y = y, Z = z. We update γ by Metropolis-Hastings steps, where proposals are generated from random-walk, independence, or autoregressive proposal distributions (Tierney 1994) . Degree parameters θ. The degree parameter θ i of population member i is updated by setting θ i = Z i γ, i = 1, . . . , N. Epidemiological parameters η | X = x. We use Gibbs and Metropolis-Hastings steps for updating η along the lines of Groendyke et al. (2011 Groendyke et al. ( , 2012 . Unobserved contacts Y i,j | X = x, Z i = z i , Z j = z j , γ, η. If Y i,j is unobserved, we can update Y i,j by sampling Here, P θ i ,θ j (Y i,j = y i,j ) is defined by where λ i,j (θ) = θ i + θ j . Unobserved transmissions T | E, I, R, Y = y. If population member j was infected and the source of infection is unknown, we can update the source of infection by sampling from its full conditional distribution. If ϕ(i infected j) denotes the prior probability of the event that population member i infected population member j, the conditional probability of the event that population member i infected population member j, given everything else, takes the form P(i infected j | E, I, R, Y = y) = y i,j 1 I i