title: Bayesian clustering of multiple zero-inflated outcomes
authors: Beatrice Franzolini, Andrea Cremaschi, Willem van den Boom, Maria De Iorio
date: 2022-05-10

Abstract. Several applications involving counts present a large proportion of zeros (excess-of-zeros data). A popular model for such data is the Hurdle model, which explicitly models the probability of a zero count, while assuming a sampling distribution on the positive integers. We consider data from multiple count processes. In this context, it is of interest to study the patterns of counts and to cluster the subjects accordingly. We introduce a novel Bayesian nonparametric approach to cluster multiple, possibly related, zero-inflated processes. We propose a joint model for zero-inflated counts, specifying a Hurdle model for each process with a shifted Negative Binomial sampling distribution. Conditionally on the model parameters, the different processes are assumed independent, leading to a substantial reduction in the number of parameters as compared to traditional multivariate approaches. The subject-specific probabilities of zero-inflation and the parameters of the sampling distribution are flexibly modelled via an enriched finite mixture with random number of components. This induces a two-level clustering of the subjects based on the zero/non-zero patterns (outer clustering) and on the sampling distribution (inner clustering). Posterior inference is performed through tailored MCMC schemes. We demonstrate the proposed approach on an application involving the use of the messaging service WhatsApp.

1. Introduction

Count data presenting an excess of zeros are commonly encountered in applications, arising in several settings such as healthcare, medicine or sociology. In this scenario, the observations carry structural information about the data-generating process, i.e., an inflation of zeros. The analysis of zero-inflated data requires the specification of models beyond standard count distributions, such as the Poisson or Negative Binomial. Commonly used models are the Zero-Inflated (Lambert, 1992), the Hurdle (Mullahy, 1986) and the Zero-Altered (Heilbron, 1994) models. The first class assumes the existence of a probability mass at zero and a distribution over $\mathbb{N}_0 = \{0, 1, 2, \dots\}$. This type of model explicitly differentiates between zeros originating from a common underlying process, such as the utilisation of a service, which are described by the sampling distribution on $\mathbb{N}_0$, and those arising from a structural phenomenon, such as ineligibility to use the service, which are modelled by the point mass. Very popular Zero-Inflated models are the Zero-Inflated Poisson (ZIP) and the Zero-Inflated Negative Binomial (ZINB) models, where the sampling distribution is chosen to be a Poisson and a Negative Binomial, respectively. These models make it possible to detect inflation in the number of zeros and to depart from standard distributional assumptions on the moments of the sampling distribution. For instance, the ZIP model allows the mean and variance of the distribution to differ from each other (as opposed to a standard Poisson distribution), while the ZINB additionally captures overdispersion in the data. Hurdle models are a very popular choice of distributions for modelling zero-inflated counts.
Differently from Zero-Inflated models, Hurdle models handle zeros and positive observations separately, assuming for the latter a sampling distribution with support on $\mathbb{N} = \mathbb{N}_0 \setminus \{0\}$. The distribution of the count data is thus given by

$$ P(Y_i = y) = (1 - p_i)\,\mathbb{1}(y = 0) + p_i\, g(y)\,\mathbb{1}(y > 0), \qquad (1) $$

where $p_i$ and $g$ now capture two distinct features of the data. Hurdle models present appealing features that can make them preferable to Zero-Inflated models. Firstly, Hurdle distributions allow for both inflation and deflation of zero counts. Indeed, under a Zero-Inflated model, the probability of observing a zero is always greater than the corresponding probability under the sampling distribution, thus making it impossible to capture deflation in the number of zeros (Min and Agresti, 2005). Secondly, and more importantly for our work, the probability of zero counts in Hurdle models is independent of the parameters controlling the distribution of the non-zero counts. This feature improves interpretability and facilitates parameter estimation. Note that the Zero-Altered model proposed by Heilbron (1994) is a modified Hurdle model in which the two parts are connected by specifying a direct link between the model parameters.

Univariate models for zero-inflated data can be extended to multivariate settings, where several variables presenting an excess of zeros are recorded, e.g. in applications involving questionnaire or microbiome data analysis. In this context, a multivariate extension of the ZIP model has been proposed by Li et al. (1999), through a finite mixture with ZIP marginals. In general, in this construction the number of parameters increases rapidly as the number $d$ of zero-inflated processes increases. Building on the latter, Liu and Tian (2015) propose a different specification of the multivariate ZIP involving a smaller number of parameters and with better distributional properties. Alternatively, in a Bayesian setting, Lee et al. (2020) model the binary variables indicating whether an observation is positive or not via a multivariate probit model. In several applications, knowledge relative to the grouping of the subjects is also available, thus providing additional information that can be exploited in the model (Choo-Wosoba et al., 2018). Moreover, the clustering structure can be estimated by assuming a prior distribution on the partition of the subjects, e.g. via the popular Dirichlet process (Li et al., 2017), or a mixture of finite mixtures as proposed by Hu et al. (2022).

The focus of this work is clustering of individuals based on multiple, possibly related, zero-inflated processes. To this end, we propose a Bayesian nonparametric approach for the joint modelling of zero-inflated count data. In particular, we specify a Hurdle model for each individual process, with a shifted Negative Binomial sampling distribution on the positive integers. Let $n$ denote the sample size and $d$ the number of processes under study. The subject-specific probabilities of zero-inflation $p_{ij}$, for the $i$-th individual and the $j$-th process, $i = 1, \dots, n$, $j = 1, \dots, d$, and the parameter vector of the sampling distribution $\mu_{ij}$ are flexibly modelled via an enriched mixture with random number of components, borrowing ideas from the Bayesian nonparametric literature on the Dirichlet process. One of the main novelties of our work is to combine a recent representation of finite mixture models with random number of components presented in Argiento and De Iorio (2019) with a finite extension of the enriched nonparametric prior proposed by Wade et al.
(2011) to achieve a two-level clustering of the subjects, where at the outer level individuals are clustered based on the pattern of zero/non-zero observations, while within each outer cluster they are grouped at a finer level (which we refer to as the inner level) according to the distribution of the non-zero counts. Figure 1 provides an illustration of the nested clustering structure.

[Figure 1: Example of the two-level clustering induced by the enriched mixture with random number of components. The observations are first clustered based on their zero/non-zero patterns, indicated in the figure in blue and red, respectively. Within each outer cluster, subjects are grouped based on the sampling distribution of the non-zero observations. The inner clustering structure is here depicted via a multimodal discrete distribution, representing a finite mixture.]

Enriched priors in Bayesian nonparametrics generalise concepts developed by Consonni and Veronese (2001), who propose a general methodology for the construction of enriched conjugate families for the parametric natural exponential families. The idea underlying this approach is to decompose the joint prior distribution for a vector of parameters indexing a multivariate exponential family into tractable conditional distributions. In particular, distributions belonging to the multivariate natural exponential family satisfy the conditional reducibility property, which allows reparameterising the distribution in terms of a parameter vector whose components are variation and likelihood independent. It is then possible to construct an enriched standard conjugate family on the parameter vector, closed under i.i.d. sampling, which breaks the global inference procedure down into several independent subcomponents. Such a parameterisation achieves greater flexibility in prior specification relative to the standard conjugate one, while still allowing for efficient computations (see, for example, Consonni et al., 2004). An example of this class of parametric priors is the enriched Dirichlet distribution (Connor and Mosimann, 1969). In a Bayesian nonparametric framework, Wade et al. (2011) first propose an enrichment of the Dirichlet process (Ferguson, 1973) that is more flexible with respect to the precision parameter but still conjugate, by defining a joint random probability measure on the measurable product space $(\mathcal{X}, \mathcal{Y})$ in terms of the marginal and conditional distributions, $P_X$ and $P_{Y|X}$, and assigning independent Dirichlet process priors to each of these terms. The enriched Dirichlet process allows for a nested clustering structure that is particularly appealing in our setting, as well as for a finer control of the dependence structure between $X$ and $Y$. This construction has also been employed in nonparametric regression problems to model the joint distribution of the response and the covariates (Wade et al., 2014; Gadd et al., 2019), as well as in longitudinal data analysis (Zeldow et al., 2021) and causal inference (Roy et al., 2018). Recently, Rigon et al. (2022) propose the enriched Pitman-Yor process, which leads to more robust clustering estimation. In this work, we consider the joint distribution of $d$ zero-inflated processes, where the $d$-dimensional vectors of probabilities $(p_{i1}, \dots, p_{id})$ correspond to $X$, while the parameters of the sampling distributions $\mu_{ij}$ correspond to $Y$.
The enrichment of the prior is achieved by modelling both $P_X$ and $P_{Y|X}$ through a mixture with random number of components (see, for instance, Miller and Harrison, 2018). We exploit the recent construction by Argiento and De Iorio (2019) based on Normalised Independent Finite Point Processes, which allows for a wider choice of prior distributions for the unnormalised weights of the mixture. Therefore, the proposed model offers more flexibility, while preserving computational tractability.

The motivating application for the proposed model is the analysis of multiple count data collected from a questionnaire on the frequency of use of the messaging service WhatsApp (ClinicalTrials.gov, 2021). In particular, the questionnaire concerns the sharing of COVID-19-related information via WhatsApp messages, either directly or by forwarding, over the course of a week. For each subject, responses to the same seven questions are recorded over seven consecutive days, providing information on a subject's WhatsApp use. In this setup, the multiple count processes correspond to the seven questions, all of which display an excess of zeros (see Figure 7 in Appendix B).

The manuscript is organised as follows. Section 2 introduces the nonparametric model for multiple zero-inflated outcomes, while Section 3 describes the Markov chain Monte Carlo (MCMC) algorithm designed for posterior inference. We demonstrate the model on the WhatsApp application in Section 4. We conclude the paper in Section 5.

2. The model

Let $Y_{ij}$ be the count of subject $i = 1, \dots, n$ for outcome $j = 1, \dots, d$ and let $\boldsymbol{Y}_i = (Y_{i1}, \dots, Y_{id})$ be the $d$-dimensional vector of observations for subject $i$. To take into account the zero-inflated nature of the data, we assume a Hurdle model for each outcome $j$. Each observed count $Y_{ij}$ is equal to zero with probability $1 - p_{ij}$, while with probability $p_{ij}$ it is distributed according to a probability mass function (pmf) $g(\cdot \mid \mu_{ij})$ with support on $\mathbb{N}$. Assuming conditional independence among the responses, the likelihood for a subject is given by

$$ f(\boldsymbol{y}_i \mid \boldsymbol{p}_i, \boldsymbol{\mu}_i) = \prod_{j=1}^{d} \left[ (1 - p_{ij})\, \mathbb{1}(y_{ij} = 0) + p_{ij}\, g(y_{ij} \mid \mu_{ij})\, \mathbb{1}(y_{ij} > 0) \right]. \qquad (2) $$

In what follows, we set $g$ to be a shifted Negative Binomial distribution with parameters $\mu_{ij} = (r_{ij}, \theta_{ij})$ and pmf

$$ g(y \mid r_{ij}, \theta_{ij}) = \binom{y + r_{ij} - 2}{y - 1}\, \theta_{ij}^{r_{ij}}\, (1 - \theta_{ij})^{y - 1}, \qquad y \in \mathbb{N}, \qquad (3) $$

where $r_{ij} \in \mathbb{N}$ and $\theta_{ij} \in (0, 1)$, for $i = 1, \dots, n$ and $j = 1, \dots, d$. Different parametric choices for $g$ are possible (e.g. a shifted Poisson), and nonparametric alternatives could also be employed. Note that the conditional independence assumption among the multiple processes leads to a significant reduction in the number of parameters as compared to multivariate zero-inflated models.

In this work, we propose an enriched extension of the Normalised Independent Finite Point Process (Norm-IFPP) of Argiento and De Iorio (2019) and specify a joint prior for $(\boldsymbol{p}_i, \boldsymbol{\mu}_i)$ as conditionally dependent nonparametric processes. This allows us to account for inter-individual heterogeneity, overdispersion and outliers, and induces a data-driven nested clustering of the observations. Each subject is first assigned to an outer cluster, and then clustered again at an inner level, providing increased interpretability. Differently from previous work on Bayesian nonparametric enriched processes, we opt for a finite mixture with random number of components, where the weights are obtained through the normalisation of a finite point process. Finite mixture models with random number of components have received increasing attention in recent years (see, for example, Nobile, 2004; Malsiner-Walli et al., 2016; Miller and Harrison, 2018).
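To make the likelihood concrete, the following R sketch evaluates the per-subject log-likelihood of Eq. (2) under the shifted Negative Binomial kernel of Eq. (3). This is a minimal illustration written for this exposition, not the authors' code, and the parameter values in the example call are arbitrary.

```r
# Shifted Negative Binomial kernel g(y | r, theta) of Eq. (3):
# Y - 1 ~ NB(r, theta), so dnbinom evaluates P(Y = y) for y = 1, 2, ...
g_shifted_nb <- function(y, r, theta) dnbinom(y - 1, size = r, prob = theta)

# Hurdle log-likelihood for one subject under conditional independence
# across the d processes (Eq. (2)); y, p, r, theta are vectors of length d.
loglik_subject <- function(y, p, r, theta) {
  sum(ifelse(y == 0,
             log(1 - p),
             log(p) + log(g_shifted_nb(y, r, theta))))
}

# Example with d = 3 processes and arbitrary parameter values
loglik_subject(y = c(0, 3, 1), p = c(0.2, 0.7, 0.5),
               r = c(2, 2, 1), theta = c(0.4, 0.4, 0.6))
```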
The construction of Argiento and De Iorio (2019) allows for the specification of a wide range of distributions for the weights of the mixture. They show that a finite mixture model is equivalent to a realisation of a stochastic process with random dimension and infinite-dimensional support, leading to flexible distributions for the weights of the mixture, given by the normalisation of a finite point process. This approach allows for efficient computations via a conditional algorithm, as compared to the labour-intensive reversible jump algorithms common in mixture models.

In the proposed framework, we treat the probabilities $\boldsymbol{p}_i$ as the parameters determining the outer-level clustering, while the parameters $\boldsymbol{\mu}_i$ characterising the sampling distributions are shared within each inner-level mixture component, conditionally on the values of $\boldsymbol{p}_i$. Therefore, subjects who share the same values of the non-zero count probabilities $\boldsymbol{p}_i$ may still belong to different inner clusters according to the value of $\boldsymbol{\mu}_i$, although they belong to the same outer cluster. The conditional dependence structure between the outer and inner levels is the following. Let

$$ \bar{y}_{ij} = \mathbb{1}(y_{ij} > 0), \qquad i = 1, \dots, n, \quad j = 1, \dots, d, \qquad (4) $$

denote the zero/non-zero indicators, and let $c_i$ and $z_i$ denote the mixture components to which subject $i$ is allocated at the outer and inner level, respectively. Then:

Outer mixture:
$$ \bar{y}_{ij} \mid c_i \overset{\text{ind}}{\sim} \text{Bernoulli}(p_{c_i j}), \qquad P(c_i = m \mid \boldsymbol{w}, M) = w_m, \qquad w_m = \frac{\Delta_m}{\sum_{l=1}^{M} \Delta_l}, $$
$$ \Delta_m \overset{\text{iid}}{\sim} \text{Gamma}(\gamma_M, 1), \qquad M \sim \text{Poi}_0(\Lambda), \qquad p_{mj} \overset{\text{iid}}{\sim} \text{Beta}(\alpha, \beta); \qquad (5) $$

Inner mixture:
$$ y_{ij} \mid y_{ij} > 0,\; c_i = m,\; z_i = s \overset{\text{ind}}{\sim} g(\cdot \mid r_{msj}, \theta_{msj}), \qquad P(z_i = s \mid c_i = m, \boldsymbol{v}_m, S_m) = v_{ms}, $$
$$ v_{ms} = \frac{\Delta_{ms}}{\sum_{l=1}^{S_m} \Delta_{ml}}, \qquad \Delta_{ms} \overset{\text{iid}}{\sim} \text{Gamma}(\gamma_S, 1), \qquad S_m \overset{\text{iid}}{\sim} \text{Poi}_0(\Lambda_S), $$
$$ r_{msj} \overset{\text{iid}}{\sim} \text{Geometric}(\zeta), \qquad \theta_{msj} \overset{\text{iid}}{\sim} \text{Beta}(\alpha_\theta, \beta_\theta); \qquad (6) $$

where the kernel $f(y_{ij} \mid p_{ij}, r_{ij}, \theta_{ij})$ implied by (4)-(6) is defined via the conditionally independent Hurdle models in Eq. (2)-(3), $\text{Beta}(\alpha, \beta)$ indicates the Beta distribution with mean $\alpha/(\alpha+\beta)$ and variance $\alpha\beta/((\alpha + \beta)^2 (\alpha + \beta + 1))$, $\text{Gamma}(\alpha, \beta)$ the Gamma distribution with mean $\alpha/\beta$ and variance $\alpha/\beta^2$, $\text{Geometric}(\zeta)$ the Geometric distribution with mean $1/\zeta$, and $\text{Poi}_0(\Lambda)$ the shifted Poisson distribution, such that if $X \sim \text{Poi}_0(\Lambda)$ then $X - 1$ has a Poisson distribution with mean $\Lambda$. We denote by $p_m$, $r_{ms}$ and $\theta_{ms}$ the component-specific parameters, which are assumed a priori independent. The outer mixture is a mixture of multivariate Bernoulli distributions, and coincides with the widely used Latent Class model (Lazarsfeld and Henry, 1968). Moreover, being conditionally independent of the actual values of the non-zero observations, it offers further computational advantages, as shown in Section 3. The allocation variables $c_i$ and $z_i$ indicate to which component of the mixture each subject is assigned at the outer and inner level, respectively. Moreover, $M$ and $S_m$ indicate the random number of components at the outer and inner level of the enriched Norm-IFPP, respectively. Note that the choice of a Gamma distribution for the unnormalised weights of the mixture leads to the standard Dirichlet distribution for the normalised weights. In this setting, the computations are greatly simplified by the introduction of a latent variable, conditionally on which the unnormalised weights are independent. See Argiento and De Iorio (2019) for details.

Instead of using allocation variables, the model can be rewritten in joint form, incorporating both the inner and the outer mixture. Let $\psi_{msj} = (p_{mj}, r_{msj}, \theta_{msj})$ and let $\psi_{ms} = (\psi_{ms1}, \dots, \psi_{msd})$ be the vector of parameters specifying the location of the $s$-th inner mixture component within the $m$-th outer component, with $s \in \{1, \dots, S_m\}$ and $m \in \{1, \dots, M\}$. We then specify the following model (equivalent to model (5)-(6)):

$$ \boldsymbol{y}_i \mid M, \{S_m\}, \boldsymbol{w}, \{\boldsymbol{v}_m\}, \{\psi_{ms}\} \overset{\text{iid}}{\sim} \sum_{m=1}^{M} w_m \sum_{s=1}^{S_m} v_{ms} \prod_{j=1}^{d} f(y_{ij} \mid \psi_{msj}), $$
$$ \boldsymbol{w} \mid M \sim \text{Dirichlet}_M(\gamma_M, \dots, \gamma_M), \qquad \boldsymbol{v}_m \mid S_m \sim \text{Dirichlet}_{S_m}(\gamma_S, \dots, \gamma_S), \qquad (7) $$

for $i = 1, \dots, n$, $s = 1, \dots, S_m$, $m = 1, \dots, M$, with the priors on $M$, $S_m$ and $\psi_{msj}$ as in (5)-(6). Here $\text{Dirichlet}_M(\gamma_M, \dots, \gamma_M)$ denotes the symmetric Dirichlet distribution defined on the $(M - 1)$-dimensional simplex with mean $1/M$, which is the distribution of the normalised mixture weights. Model (7) induces a partition of the subject indices $\{1, \dots, n\}$ at an outer and an inner level.
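As a way to visualise the generative mechanism, the following R sketch simulates data from an enriched mixture of the form (7). It is a simplified rendering written for illustration, not the authors' implementation; all hyperparameter values (and the sample sizes) are arbitrary choices.

```r
# Simulate from the enriched finite mixture: outer components carry the
# zero/non-zero probabilities p, inner components carry the shifted
# Negative Binomial parameters (r, theta).
set.seed(1)
n <- 8; d <- 3
M  <- 1 + rpois(1, 2)                             # M ~ Poi_0(Lambda)
w  <- rgamma(M, shape = 1); w <- w / sum(w)       # normalised Gamma weights
p  <- matrix(rbeta(M * d, 1, 1), M, d)            # p_mj ~ Beta
S  <- 1 + rpois(M, 1)                             # S_m ~ Poi_0(Lambda_S)
v  <- lapply(S, function(s) { q <- rgamma(s, shape = 1); q / sum(q) })
r  <- lapply(S, function(s) matrix(1 + rgeom(s * d, 0.5), s, d))  # r_msj
th <- lapply(S, function(s) matrix(rbeta(s * d, 2, 2), s, d))     # theta_msj
y  <- matrix(0L, n, d)
for (i in 1:n) {
  m  <- sample.int(M, 1, prob = w)                # outer allocation c_i
  s  <- sample.int(S[m], 1, prob = v[[m]])        # inner allocation z_i
  nz <- rbinom(d, 1, p[m, ]) == 1                 # indicators of Eq. (4)
  y[i, nz] <- 1 + rnbinom(sum(nz), size = r[[m]][s, nz], prob = th[[m]][s, nz])
}
y
```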
In particular, when two subjects $i$ and $l$ are assigned to the same cluster of the outer-level mixture, the probabilities of observing a zero are the same, i.e. $\boldsymbol{p}_i = \boldsymbol{p}_l$. However, the vectors of parameters $\boldsymbol{\mu}_i$ and $\boldsymbol{\mu}_l$ characterising the sampling distribution might differ and, consequently, the two subjects might be assigned to different clusters at the inner level. This is reflected in the vectors of parameters $(\boldsymbol{p}_i, \boldsymbol{\mu}_i)$ and $(\boldsymbol{p}_l, \boldsymbol{\mu}_l)$, which might share only the components corresponding to the probabilities of zero outcomes.

3. Posterior inference

Posterior inference can be performed through both a conditional and a marginal algorithm, derived by extending the algorithms of Argiento and De Iorio (2019) to the enriched setup. The conditional algorithm is described in Algorithm 1, while Algorithm 2 presents the marginal one.

Algorithm 1: Conditional algorithm
Input: $(y_{ij})_{ij}$ and parameter initialisation
Output: posterior distribution of the cluster allocations and of the other parameters
for i in 1:n do
    Sample $c_i$ and $z_i$ from their joint full conditional
end for
Compute $K$, the number of allocated components at the outer level
Relabel the outer-level clusters so that the first $K$ components of the mixture are allocated
Sample the latent variable $\bar{u}$ from $\text{Gamma}(n, \sum_{m=1}^{M} \Delta_m)$
Set $M = K + x$, where $x$ is sampled from its full conditional
Sample the unnormalised weights of the outer mixture from $P[\Delta_m \in dq \mid \text{rest}] \propto q^{n_m} e^{-\bar{u} q}\, h_{\text{out}}(q)\, dq$ for $m = 1, \dots, M$, where $n_m$ is the cardinality of outer-level cluster $m$ and $n_m = 0$ for $m > K$
for m in 1:K do
    Sample $p_m$ from the full conditional
    Compute $K_m$, the number of allocated components at the inner level
    Relabel the inner-level clusters so that the first $K_m$ components are allocated
    Sample the latent variable $u_m$ from $\text{Gamma}(n_m, \sum_{s=1}^{S_m} \Delta_{ms})$
    Set $S_m = K_m + x$, where $x$ is sampled from its full conditional
    Sample the unnormalised weights of the $m$-th inner mixture from $P[\Delta_{ms} \in dq \mid \text{rest}] \propto q^{n_{ms}} e^{-u_m q}\, h_{\text{in}}(q)\, dq$ for $s = 1, \dots, S_m$, where $n_{ms}$ is the cardinality of inner-level cluster $s$ and $n_{ms} = 0$ for $s > K_m$
    for s in 1:K_m do
        Sample $(r_{ms}, \theta_{ms})$ from the full conditional
    end for
    for s in (K_m + 1):S_m do
        Sample $r_{ms}$ from the prior
        Sample $\theta_{ms}$ from the prior
    end for
end for
for m in (K + 1):M do
    Sample $p_m$ and $S_m$ from the prior
    for s in 1:S_m do
        Sample $\Delta_{ms}$ from the prior
        Sample $r_{ms}$ from the prior
        Sample $\theta_{ms}$ from the prior
    end for
end for

Algorithm 2: Marginal algorithm
Input: $(y_{ij})_{ij}$ and parameter initialisation
Output: posterior distribution of the cluster allocations and of the other parameters
for i in 1:n do
    Sample $c_i$ and $z_i$ from their full conditional, which assigns subject $i$ either to an existing (occupied) cluster (subscript "old") or to a new, unoccupied component (subscript "new"); here $n_m$ and $n_{ms}$ are the cardinalities of the outer and inner clusters after removing the $i$-th observation, $C_m^{-i} = C_m \setminus \{i\}$ and $C_m^{+i} = C_m \cup \{i\}$, and similarly for $C_{ms}^{+i}$ and $C_{ms}^{-i}$
end for
Sample the latent variables $\bar{U}$ and $U_1, \dots, U_K$ from their full conditionals
for m in 1:K do
    Sample $p_m$ from its full conditional
    for s in 1:K_m do
        Sample $\theta_{ms}$ and $r_{ms}$ from the full conditional
    end for
end for

The conditional algorithm is very flexible and allows for different prior distributions on the weights of the two mixtures, as well as on $M$ and $S_m$ (see Argiento and De Iorio, 2019, for details). In Algorithm 2, we use the notation $q_M$ and $q_S$ to denote the priors on $M$ and $S_m$, respectively, and we set them both equal to a shifted Poisson for the application in Section 4. Furthermore, $h_{\text{out}}$ and $h_{\text{in}}$ denote the prior distributions on the unnormalised weights of the outer and inner mixture (in our case, Gamma distributions), and $\psi_{\text{out}}(u)$ and $\psi_{\text{in}}(u)$ denote the corresponding Laplace transforms of $h_{\text{out}}$ and $h_{\text{in}}$ (in our case, $\psi_{\text{out}}(u) = (u + 1)^{-\gamma_M}$ and $\psi_{\text{in}}(u) = (u + 1)^{-\gamma_S}$).
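For concreteness, the following R sketch renders the weight-update step of Algorithm 1 for the outer mixture in the Gamma case, where the full conditional of an unnormalised weight is conjugate: combining $q^{n_m} e^{-\bar{u} q}$ with $h_{\text{out}} = \text{Gamma}(\gamma_M, 1)$ gives a $\text{Gamma}(\gamma_M + n_m, 1 + \bar{u})$ full conditional. This is our own illustrative rendering of the step, with hypothetical argument names, not the authors' code.

```r
# One Gibbs step for the latent variable u-bar and the unnormalised outer
# weights, assuming h_out = Gamma(gamma_M, 1). n_m: vector of outer-cluster
# sizes (zeros for unallocated components); Delta: current weights.
update_outer_weights <- function(n_m, Delta, gamma_M) {
  n    <- sum(n_m)
  ubar <- rgamma(1, shape = n, rate = sum(Delta))       # latent variable
  # Delta_m | rest ~ Gamma(gamma_M + n_m, 1 + ubar), also valid when n_m = 0
  Delta_new <- rgamma(length(n_m), shape = gamma_M + n_m, rate = 1 + ubar)
  list(ubar = ubar, Delta = Delta_new)
}

# Example: three allocated components of sizes 10, 5, 2 and one empty one
update_outer_weights(n_m = c(10, 5, 2, 0), Delta = rep(1, 4), gamma_M = 1)
```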
To implement the marginal algorithm, we need to derive the marginal likelihood of the data, conditionally on the cluster membership. The likelihood in Eq. (7) can be written in terms of the allocation variables as

$$ f(\boldsymbol{y}_i \mid c_i, z_i, \{\psi_{ms}\}) = \prod_{j=1}^{d} f(y_{ij} \mid \psi_{c_i z_i j}). $$

Recall that $c_i$ and $z_i$ denote the labels of the clusters to which the $i$-th subject belongs in the outer and the inner clustering, respectively. The marginal likelihood of the data conditionally on the cluster allocation is obtained by marginalising with respect to the prior distributions defined in (5) and (6). For a vector of counts $\boldsymbol{y}$, we obtain

$$ m(\boldsymbol{y} \mid \boldsymbol{c}, \boldsymbol{z}) = \prod_{m=1}^{K} \prod_{j=1}^{d} \Big[ M_{\text{B}}\big(\boldsymbol{y}^*_{jC_m}\big) \prod_{s=1}^{K_m} M_{\text{NB}}\big(\boldsymbol{y}^*_{jC_{ms}}\big) \Big], $$

where $\boldsymbol{y}^*_{jC_m}$ is the vector of observations $y_{ij}$ such that $c_i = m$, for $j = 1, \dots, d$, and, similarly, $\boldsymbol{y}^*_{jC_{ms}}$ is the vector of observations $y_{ij}$ such that $c_i = m$ and $z_i = s$. Here $M_{\text{B}}$ is the marginal likelihood of the zero/non-zero indicators under the Beta prior,

$$ M_{\text{B}}(\boldsymbol{y}) = \frac{B(\alpha + n_1, \beta + n_0)}{B(\alpha, \beta)}, $$

where $B(\cdot, \cdot)$ denotes the Beta function, $n_1 = \sum_i \bar{y}_i$, $n_0 = \sum_i (1 - \bar{y}_i)$, and $\bar{y}_i$ is defined as in Eq. (4); $M_{\text{NB}}$ is the marginal likelihood of the non-zero counts under the Geometric-Beta priors,

$$ M_{\text{NB}}(\boldsymbol{y}) = \sum_{r=1}^{\infty} \zeta (1 - \zeta)^{r-1} \prod_{i} \binom{y_i + r - 2}{y_i - 1} \frac{B\big(\alpha_\theta + r\, n_1,\; \beta_\theta + \sum_i (y_i - 1)\big)}{B(\alpha_\theta, \beta_\theta)}, $$

where the product and the last two summations run over the non-zero elements of the vector $\boldsymbol{y}$. Here $K$ and $K_m$ are the numbers of clusters at the outer and inner level, respectively. Note that by cluster we mean an occupied component (i.e. a mixture component to which at least one observation has been assigned), with $K \leq M$ and $K_m \leq S_m$, $m = 1, \dots, M$. When implementing the marginal algorithm, after updating the latent variables $U_{\text{in}}$ and $U_{\text{out}}$, we could add an extra step involving a shuffle of the nested partition structure, as suggested by Wade et al. (2014), to improve mixing. Finally, the infinite sum in $M_{\text{NB}}(\boldsymbol{y})$ is truncated to 1000 terms, which yields a good approximation, as the term $(1 - \zeta)^{r-1}$ decays exponentially to zero.

4. Application: WhatsApp use during the COVID-19 pandemic

We apply our model to a dataset on WhatsApp use during COVID-19 (ClinicalTrials.gov, 2021). The data consist of a questionnaire filled out by participants living in India. Each subject answers the same $d = 7$ questions for $T = 7$ consecutive days, on the number of: ($j = 1$) COVID-19 messages forwarded; ($j = 2$) WhatsApp groups to which COVID-19 messages were forwarded; ($j = 3$) people to whom COVID-19 messages were forwarded; ($j = 4$) unique forwarded messages received in personal chats; ($j = 5$) people from whom forwarded messages were received; ($j = 6$) personal chats that discussed COVID-19; ($j = 7$) WhatsApp groups that mentioned COVID-19. The table in Appendix A provides the list of the questions, as well as a brief description. In what follows, the first replicate ($t = 1$) corresponds to Sunday for all subjects, $t = 2$ to Monday, and so on, up to $T = 7$ corresponding to Saturday. The questionnaire responses were collected in June and July 2021, during India's infection wave of the Delta variant of the SARS-CoV-2 virus that causes coronavirus disease 2019 (COVID-19).

From the initial 1156 respondents, we remove two subjects for whom no answers are available, resulting in a final sample size of $n = 1154$. Moreover, 19% of the observations are missing. We also treat counts higher than 400 as missing data: such counts are very rare (7 observations out of 56,546) and lie far outside the range of the majority of the data. We handle missing data using a two-step procedure. Firstly, whenever possible, we recover missing zeros using deterministic imputation based on the respondent's answers to other sections of the questionnaire. For instance, if the answer to the question "did you send any message of this kind today?" is "no" and the answer to the question "how many?" is missing, we can reasonably assume that the answer to the latter question is zero. In this way, we recover 0.5% of the missing observations.
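A minimal R sketch of this deterministic zero-recovery rule is given below; the column names sent_any and count are hypothetical, since the questionnaire's actual variable names are not reported here.

```r
# If the gate question ("did you send any message of this kind today?") was
# answered "no" while the count ("how many?") is missing, impute a zero.
recover_zeros <- function(df) {
  idx <- is.na(df$count) & df$sent_any == "no"
  df$count[idx] <- 0L
  df
}

# Toy example
df <- data.frame(sent_any = c("no", "yes", "no"), count = c(NA, 4, 2))
recover_zeros(df)
```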
Secondly, the remaining missing values are imputed using random forest imputation, as implemented in the R package mice (van Buuren and Groothuis-Oudshoorn, 2011). Figure 7 in Appendix B displays the data after imputation.

To account for the fact that $T$ repeated observations are available for each subject and process, we need to slightly modify model (7). We do so by assuming that the different time points are independent of each other, so that repeated observations can be straightforwardly included in the proposed model. Let $Y_{ijt}$ denote the count for the $i$-th subject and the $j$-th process at time $t$, $i = 1, \dots, n$, $j = 1, \dots, d$ and $t = 1, \dots, T$. We assume that the $Y_{ijt}$ are conditionally independent, given the parameters of the model. Thus, the likelihood contribution of each subject $i$ is given by $\prod_{t=1}^{T} \prod_{j=1}^{d} f(y_{ijt} \mid \psi_{msj})$. We highlight that we are clustering individuals based on the pattern of all their observations, at each time point $t$ and for each process $j$. Finally, we note that, thanks to the probabilistic structure of the Hurdle model for zero-inflated data, $\boldsymbol{p}_i$ and the sampling distribution $g(\cdot \mid \boldsymbol{\mu}_i)$ reflect two distinct features of the respondents' behaviour: $\boldsymbol{p}_i$ represents the probability of engaging in some COVID-19 related WhatsApp activity, while $g(\cdot \mid \boldsymbol{\mu}_i)$ captures the behaviour of those subjects who have actually engaged in the activity.

Posterior inference is performed through the conditional algorithm described in Algorithm 1. We run the algorithm for 15000 MCMC iterations, discarding the first 5000 as burn-in. Figure 2 shows that, at the outer level, the posterior distributions of both the number of components and the number of clusters have a mode at three. As point estimate of the cluster allocation, we report the configuration that minimises the posterior expectation of Binder's loss function (Binder, 1978) under equal misclassification costs, which is a common choice in the applied Bayesian nonparametrics literature (Lau and Green, 2007). Briefly, this posterior expected loss measures, over all possible pairs of subjects, the discrepancy between the posterior probabilities of co-clustering and the co-clustering induced by the estimated allocation. We refer to the resulting cluster allocation as the Binder estimate.

The Binder estimate of the outer clustering contains three clusters, whose characteristics are summarised in Figures 3 and 4. The largest cluster corresponds to WhatsApp users who on most days report a zero count for all $d = 7$ questions. The individuals in the other two clusters use WhatsApp more frequently when it comes to forwarding COVID-19 messages ($j = 1, 2$), receiving forwarded messages ($j = 3, 4, 5$) and having COVID-19 mentioned in their WhatsApp groups ($j = 7$). The main feature distinguishing Cluster 2 from Cluster 3 in terms of the probabilities $\boldsymbol{p}_i$ of non-zero counts is that, on most days, Cluster 2, unlike Cluster 3, also discusses COVID-19 in personal chats (question $j = 6$).

Figures 5 and 6 display the main characteristics of the inner clusters. We are interested in the posterior distribution of the number of inner clusters per outer cluster, as well as in the inner clustering within each outer cluster. To this end, we run the MCMC algorithm fixing the outer clustering allocation to its Binder estimate, thus obtaining the conditional posterior distribution of the inner clustering. The results reveal substantial variability in the distribution of non-zero counts within outer Clusters 1 and 2 (see Figure 5, bottom panel).
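As an illustration of how a Binder estimate such as the one used above can be computed from MCMC output, one option is the mcclust package. This is a generic post-processing sketch rather than the paper's code; c_draws is an assumed (iterations x subjects) matrix of sampled outer allocation labels, coded as integers.

```r
# Posterior similarity matrix (pairwise co-clustering probabilities) and
# point estimate of the partition minimising the posterior expected
# Binder loss under equal misclassification costs.
library(mcclust)
psm <- comp.psm(c_draws)          # co-clustering probabilities from draws
binder_est <- minbinder(psm)$cl   # estimated cluster labels
table(binder_est)                 # cluster sizes
```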
Notably, around a quarter of the individuals in outer Cluster 2, as captured by its inner Cluster 2, forward COVID-19 messages to many more people (question $j = 3$) than the subjects in inner Cluster 1 of outer Cluster 2. Figure 4 also supports the fact that outer Cluster 2 engages with WhatsApp in a much more persistent manner than the other outer clusters. These results highlight that a sizeable minority of WhatsApp users has a relatively large propensity to spread COVID-19 messages during a critical phase of the pandemic. This is in line with a similar survey in Singapore (Tan et al., 2021) and with findings on "superspreaders" on other social media.

[Figure 6: Heatmaps of the posterior co-clustering probabilities for the inner clusters, per outer cluster ((a) outer cluster 1; (b) outer cluster 2; (c) outer cluster 3). Results are obtained conditioning on the Binder estimate of the outer cluster allocation. Observations are reordered based on the co-clustering probability profiles, through hierarchical clustering.]

5. Conclusion

In this work, we propose a Bayesian nonparametric model for multiple zero-inflated count data, building on the well-established Hurdle model and exploiting the flexibility of finite mixture models with random number of components. The main contribution of this work is the construction of an enriched finite mixture with random number of components, which allows for a two-level (nested) clustering of the subjects based on their pattern of counts across different processes. This structure enhances the interpretability of the results and has the potential to better capture important features of the data. We design a conditional and a marginal MCMC sampling scheme to perform posterior inference. The proposed methodology has wide applicability, since excess-of-zeros count data arise in many fields. Our motivating application involves answers to a questionnaire on the use of WhatsApp in India during the COVID-19 pandemic. Our analysis identifies a two-level clustering of the subjects: the outer cluster allocation reflects daily probabilities of engaging in different WhatsApp activities, while the inner level informs on the number of messages, conditionally on the fact that the subject is indeed receiving/sending messages on WhatsApp. Any two subjects are clustered together if they show a similar pattern across the multiple responses. We find three well-distinguished respondent behaviours corresponding to the three outer clusters: (i) subjects with low probability of daily utilisation; (ii) subjects with high probability of sending/receiving all types of messages; and (iii) subjects with high probability for all considered messages except for non-forwarded messages in personal chats. Interestingly, the inner-level clustering and the outer-cluster specific estimates of the sampling distribution $g$ highlight similarities between outer Clusters 1 and 3, where subjects tend to send/receive fewer messages compared to outer Cluster 2. Moreover, we are able to identify those subjects with a high propensity to spread COVID-19 messages during the critical phase of the pandemic, and for these subjects we do not find notable differences in terms of the types of messages sent or received. Our results are in line with the existing literature on the topic. Future work involves the development of more complex clustering hierarchies, as well as techniques able to identify the processes that most inform the clustering structure.
Funding: This work was partially supported by the NUS Centre for Trusted Internet and Community [grant number CTIC-RP-20-09].

Acknowledgements: We thank Dr. Jean Liu and the Synergy Lab at Yale-NUS College for providing the data.

References

Argiento, R. and De Iorio, M. (2019). Is infinity that far? A Bayesian nonparametric perspective of finite mixture models.
Binder, D. A. (1978). Bayesian cluster analysis.
Choo-Wosoba, H. et al. (2018). A Bayesian approach for analyzing zero-inflated clustered count data with dispersion.
ClinicalTrials.gov (2021). WhatsApp in India during the COVID-19 pandemic. Identifier NCT04918849.
Connor, R. J. and Mosimann, J. E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution.
Consonni, G. and Veronese, P. (2001). Conditionally reducible natural exponential families and enriched conjugate priors.
Consonni, G., Veronese, P. and Gutiérrez-Peña, E. (2004). Reference priors for exponential families with simple quadratic variance function.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems.
Gadd, C. et al. (2019). Enriched mixtures of Gaussian process experts.
Heilbron, D. C. (1994). Zero-altered and other regression models for count data with added zeros.
Hu, G. et al. (2022). Zero-inflated Poisson model with clustered regression coefficients: Application to heterogeneity learning of field goal attempts of professional basketball players.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing.
Lau, J. W. and Green, P. J. (2007). Bayesian model-based clustering procedures.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis.
Lee, K. H. et al. (2020). Bayesian variable selection for multivariate zero-inflated models: Application to microbiome count data.
Li, C.-S. et al. (1999). Multivariate zero-inflated Poisson models and their applications.
Li, Q. et al. (2017). A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints. Statistical Analysis and Data Mining.
Liu, Y. and Tian, G.-L. (2015). Type I multivariate zero-inflated Poisson distribution with applications.
Malsiner-Walli, G., Frühwirth-Schnatter, S. and Grün, B. (2016). Model-based clustering based on sparse finite Gaussian mixtures.
Miller, J. W. and Harrison, M. T. (2018). Mixture models with a prior on the number of components.
Min, Y. and Agresti, A. (2005). Random effect models for repeated measures of zero-inflated count data.
Mullahy, J. (1986). Specification and testing of some modified count data models.
Nobile, A. (2004). On the posterior distribution of the number of components in a finite mixture.
Rigon, T., Petrone, S. and Scarpa, B. (2022). Enriched Pitman-Yor processes.
Roy, J. et al. (2018). Bayesian nonparametric generative models for causal inference with missing at random covariates.
Tan, E. Y. Q. et al. (2021). Tracking private WhatsApp discourse about COVID-19 in Singapore: Longitudinal infodemiology study.
van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R.
Wade, S., Walker, S. G. and Petrone, S. (2014). Improving prediction from Dirichlet process mixtures via enrichment.
Wade, S., Mongelluzzo, S. and Petrone, S. (2011). An enriched conjugate prior for Bayesian nonparametric inference.
Zeldow, B. et al. (2021). Functional clustering methods for longitudinal data with application to electronic health records.