key: cord-0619977-tp0a0h1z authors: Chambers, Mark; Drovandi, Christopher title: Many-levelled continuation ratio models for frequency of alcohol and drug use data date: 2022-05-24 journal: nan DOI: nan sha: 3ba33c5eb926360cfb7787922ede2bfcc2688e73 doc_id: 619977 cord_uid: tp0a0h1z Studies of alcohol and drug use are often interested in the number of days that people use the substance of interest over an interval, such as 28 days before a survey date. Although count models are often used for this purpose, they are not strictly appropriate for this type of data because the response variable is bounded above. Furthermore, if some peoples' substance use behaviors are characterized by various weekly patterns of use, summaries of substance days-of-use used over longer periods can exhibit multiple modes. These characteristics of substance days-of-use data are not easily fitted with conventional parametric model families. We propose a continuation ratio ordinal model for substance days-of-use data. Instead of grouping the set of possible response values into a small set of ordinal categories, each possible value is assigned its own category. This allows the exact numeric distribution implied by the predicted ordinal response to be recovered. We demonstrate the proposed model using survey data reporting days of alcohol use over 28-day intervals. We show the continuation ratio model is better able to capture the complexity in the drinking days dataset compared to binomial, hurdle-negative binomial and beta-binomial models. Data measuring frequency of substance use are often collected with questions such as "How many days of the last 28 did you use . . . ". Valid answers to this question are limited to the 29 integer values between 0 and 28. In some cases it is sufficient to partition the full set of possible values into a smaller number of ordered frequency classes and treat the outcome as ordinal [16, 24] . However, often models are fitted to the numeric responses. In this case, the choice of an appropriate response distribution warrants careful thought. Substance days-of-use is often modelled as count data using Poisson or, more commonly, negative binomial models [5, 23] or their zero-inflated or hurdle variants [12, 17] . However, the number of days of substance use over a fixed interval is a bounded count since values greater than the length of the interval in days are not possible [1, 6] .The set of possible observed values are the same as would be modelled by a binomial model where each observation represents the sum of 28 Bernoulli trials. However, the appropriateness of a binomial model for substance use data is also questionable because, in the application described here, the binomial assumption would require that use of the substance on each of the 28 days was independent of whether it was used on any and all of the other 27 days. If, over a given 28-day interval, a person may be consistently more or less inclined to use the substance than explained by covariates, observed substance days-of-use will exhibit more variance than expected according to a binomial distribution. Therefore, the ability of the beta-binomial model to accommodate overdispersion [33] likely makes it a better candidate than the binomial for substance days-of-use data. Another feature sometimes observed in days-of-use data over a 28-day period is a preponderance of observations consistent with repeated weekly patterns of use. For example, if some people regularly consumed alcohol on weekends, but rarely on other days, modes in days-of-use at four and eight days out of 28 might be observed. Although the binomial and beta binomial models have the same support as the data, they cannot capture this multi-modality. We propose the use of continuation ratio models [10, 29] , a type of sequential ordinal model, for substance days-of-use data. Sequential ordinal models can be appropriate when the attainment of a particular level requires first attaining all lower levels [3, 21, 30] . The use of an ordinal model allows the distribution of the response variable to be constrained to plausible values, but also allows the odds of progressing beyond each threshold of use to differ. This second feature permits fitting distributions for substance days-of-use with multiple modes. We suggest considering a separate category for each observed value of substance days-of-use. Specifying a separate level for each observed value is less arbitrary than allocating the 29 possible response values into some smaller number of ordinal categories. More importantly, allocating a unique ordinal level for each possible response value facilitates the calculation of numeric summaries such as posterior means, quantiles and variances without needing to resort to crude approximations [2] . We demonstrate the approach using reported alcohol drinking days by individuals over four 28-day intervals during the COVID-19 pandemic in Australia. We compare results obtained with the proposed continuation ratio model with equivalent models assuming hurdle-negative binomial, binomial and beta-binomial response distributions. Online surveys were used to collect data on substance use of Australians during the COVID-19 pandemic for the Australians' Drug use: Adapting to Pandemic Threats (ADAPT) study. People were eligible to be recruited to the study if they were at least 18 years of age and had used illicit drugs regularly (i.e., at least once a month) in 2019 (i.e., 'pre-COVID'). Participants had the option to complete a one-off baseline survey, or to consent to subsequent follow-up surveys (2 months, 6 months and 12 months post baseline). this paper uses data from the latter group (n=452). Participants were reimbursed $15AUD (GiftPay voucher sent via email) for each survey that they completed, and also went in the draw to win one of three $100 AUD GiftPay vouchers. Data were collected on use of a range of substances, price and ease of obtaining illicit substances as well as how respondents had been impacted by the pandemic at each wave of the survey. Ethical approval was granted by the UNSW Sydney Human Research Ethics Committee (HC200264). The outcome variable modelled is the self-reported number of days that alcohol was used over the 28 days prior to each survey date. We model use of alcohol data generated from responses to two questions. Abstinence from alcohol use over the 28 days was deduced from the question "When did you LAST use alcohol?". Respondents that indicated they had never used alcohol, or they last used more than four weeks before the survey date were assigned a value of zero alcohol drinking days. Non-zero values were assigned, based on responses to the question "In the past four weeks (28 days) how many days have you drunk alcohol?". Responses were selected from a dropdown menu with options including "1 (once)", "2 (once a fortnight)", "3", "4 (once a week)", "5", and so on. That is, some options were described as the result of a specific pattern of drinking, while others were not. For the continuation ratio model, the response variable was defined as an ordinal type with a separate category for each of the 29 possible values from zero to 28. An integer-valued version of the same variable was used when other model families were fitted. The data are longitudinal with surveys from 452 people initially considered in Wave 1. Responses from 436 of these participants had complete covariate data and were used to condition models. Then, 302, 258 and 250 completed surveys from the original sampled population were used from Waves 2 to 4 respectively. The decline in sample size with survey wave was mainly due to loss to follow up. Few surveys were submitted missing values of the explanatory variables. The outcome variable was regressed on five explanatory variables. These were survey wave, isolation, gender, rurality and state. Survey wave was treated as a categorical variable with four levels, Wave 1 was May and June 2020, Wave 2 was July to September 2020, Wave 3 was predominantly November and December 2020, and Wave 4 was between May and July 2021. Isolation was defined as a dichotomous variable and coded as "yes" if the participant reported home-isolation or 14-day quarantine during the period of interest, otherwise the variable was coded as "no". Gender was coded as a factor with 3 levels, male, female and non-binary. Rurality was a dichotomous variable, coded as either city or rural/regional. There were few respondents from any of the states of Tasmania, the Northern Territory and the Australian Capital Territory. Therefore, we combined respondents from Tasmania with Victoria, those from the Northern Territory with South Australia and those from the Australian Capital Territory with New South Wales. A person-level random intercept was specified to allow for differences among people in overall alcohol consumption after adjusting for the other covariates. We fitted Bayesian regression models to number of alcohol drinking days over four waves of the ADAPT study during the COVID-19 pandemic. In the first instance, we assumed the probability that person i consumes alcohol on D ij = 0, . . . , 28 days during wave j is described by a continuation ratio ordinal model with 29 levels. The brms package [8] was used to conveniently code and run Bayesian models in R version 4.1.2 [25] by invoking the Stan Bayesian model fitting software [28] . The goodness of fit of the ordinal posterior distribution of the continuation ratio model is compared with models assuming alternative response distributions regressed on the same explanatory variables using posterior predictive checks [14] and the leave one out information criterion (LOO-IC) [31] . Dependence of linear predictors on explanatory variables Differences among people and survey waves in number of drinking days are assumed to be explained by differences in the values of distributional parameters of the specified response distribution family. Distributional parameters are related to a linear combination of the explanatory variables via a monotonic link function as commonly applied in generalized linear mixed models (GLMMs). Let x ij be a row vector of explanatory variables for COVID-19 isolation status, wave, state, rurality, person i on wave j and β be a column vector of coefficients. Then the linear predictor, η ij , is specified as where b i is a person-level random intercept. Distributional parameters from each fitted model are regressed against the explanatory variables as described in Equation 1. It should be recognized that while x ij is the same in all models, the parameters β and b i differ between models and distributional parameters. In GLMMs, the dependence of an outcome variable of interest on predictors is usually modelled explicitly through a single location distributional parameter such as the mean of the assumed response distribution. The values of other distributional parameters are assumed to be the same for all responses. So called models for "location, scale and shape" [26] are a generalization of this approach. Similarly, the brms package allows all distributional parameters of the specified response distribution to be modelled as functions of explanatory variables. For both the hurdle-negative binomial and beta-binomials that we fitted, we specified dependence on explanatory variables for two distributional parameters. In these cases, correlation between the random effects, b i , for each person, i, in the regression models for the two distributional parameters was modelled [9] . The continuation ratio model is a sequential ordinal model in that each response is modelled as the result of a sequence of binary outcomes. We describe here, the simplest case in which, for each person, the probability of each potential progression to another day's use in a 28-day interval is controlled by the same value, η ij . A threshold parameter, θ r , quantifies a degree of restraint from progressing to at least r drinking days given the person consumes alcohol on at least r − 1 days. Initially, setting r = 1, each person is assumed to participate in a decision to drink or not drink alcohol on at least one day. People participate in a second decision to drink on at least two days if and only if they pass the threshold of drinking on a first day, and so on until the participant reaches their total drinking days. A latent variable motivation for sequential ordinal models is described in Tutz [29] . We assumed the default logit link function for the continuation ratio model. Therefore, letting D ij be the number of alcohol drinking days by person i during wave j, the probability of progressing to at least r drinking days given at least r − 1 is given by It can be seen from Equation 2 that under the logistic model, the threshold parameters θ r provide multiplicative effects, exp (−θ r ), on the odds of each potential progression from r − 1 to r drinking days. Although conceptualized as a sequential process, continuation ratio models can be fitted as a single logistic regression model as described by Armstrong and Sloan [4] . However, with the brms package, the continuation ratio model is fitted with the cratio model family, obviating the need to 4/13 restructure the fitted data. The full probability distribution of the C-Ratio (η ij , θ 1 , . . . , θ 28 ) model is given in the Appendix. The i and j subscripts in the preceding model notation and hereafter indicate distributional parameters that were regressed on the explanatory variables. Corresponding hurdle-negative binomial, binomial and beta-binomial models were also fitted to the alcohol drinking days data for comparison with the continuation ratio model. Distributional parameters of all models were regressed against the same set of explanatory variables. Probability mass functions, expressed in terms of the distributional parameters described below, are provided in the Appendix. Hurdle-negative binomial According to the hurdle-negative binomial model, each person, i, has some probability, ψ ij , of abstaining completely from drinking alcohol on wave j, with ψ ij assumed to be described by a model similar to Equation 1 with a logit link function. The distribution of non-zero drinking days is assumed to be described by a truncated negative binomial with location parameter, µ ij , the mean of the (untruncated) negative binomial, is also defined similar to Equation 1, but with a log link function. We assume the overdispersion or scale parameter, α, of the negative binomial takes the same value for all people and waves. Normally, the hurdle component and the truncated negative binomial component are two distinct models that can be fitted separately [34] . Here, we model correlation between the person-level random effects of the ψ i· and µ i· . According to the hurdle-negative binomial model, alcohol drinking days, D ij are realisations of a Hurdle-NB (ψ ij , µ ij , α) distribution as defined in the Appendix. Binomial According to the fitted binomial model, the expected probability, π ij , that person i drinks alcohol on a given day in wave j is assumed to be defined by Equation 1 via a logit link function. Then, the number of days person i consumed alcohol over 28 days during wave j, D ij , is assumed to be the realization of a Binomial (28, π ij ) distribution. The beta-binomial model is parameterized in terms of an underlying (binomial) probability of drinking, π ij , and an overdispersion parameter, φ ij . The dependence of π ij and φ ij on explanatory variables is parameterized as described in Equation 1 with logit and log links respectively, except intra-person correlation between the random effects for each distributional parameter is modelled. The beta-binomial distribution is not included among native (inbuilt) brms model families. However, package documentation gives instructions on how it can be user-defined [7] . We denote the fitted beta-binomial model Beta-Bin (28, π ij , φ ij ) where the (beta-binomial) probability that person i drinks alcohol on a given day in wave alcohol on a given day in wave j, π * ij , is a random variable with a beta distribution, parameterized in terms of π ij and φ ij (see Appendix). We used horseshoe priors [11] for population-level (i.e. fixed effects) coefficient parameters to provide a level of protection against overfitting. For other parameters we used brms default priors. These are t distributions with three degrees of freedom and a scale parameter of 2.5 for both the ordinal level thresholds, θ r , and for the standard deviations of the random effects distributions, σ b . For models with multiple distributional parameters, the Lewandowski-Kurowicka-Joe (LKJ) prior distribution [19] 5/13 was assumed for the random effect correlation matrices. The LKJ prior is the brms default in this case. Posterior distributions of model parameters were approximated with four chains of 5000 iterations each generated by Stan's No U-turn Sampler. The first 500 iterations from each chain were discarded as burn-in or warm-up after which every fifth iteration was kept, resulting in a total of 3,600 samples from each conditioned model on which to base posterior inference. Convergence of the Monte Carlo algorithms was assessed by visual inspection of mixing of the four chains in trace plots and the rank-basedR statistic convergence diagnostic [32] . The alcohol drinking days data exhibits periodic modes every four days, consistent with repeated drinking patterns of once-a-week, twice-a-week, and so on (Fig. 1) . Zero alcohol use was reported on approximately 17 percent of 28-day intervals across the four waves combined. The observed empirical cumulative distribution function (ECDF) for drinking days across the four waves is depicted by the (identical) black line in each panel of Figure 2 . The multi-modal distribution of reported drinking days, evident in Figure 1 , manifests as large steps in the observed ECDF at multiples of four days, relative to smaller steps in the ECDF observed at other days of alcohol use (Fig. 2) . The thin blue lines in Figure 2 are ECDFs resulting from 25 draws from the posterior predictive distribution (see Gelman et al. [15] Chapter 6) of each model. Comparing the observed and posterior predictive ECDFs for each model indicates the binomial model predicts both fewer responses near zero and fewer responses near 28 than are reported. This suggests that the reported drinking days are overdispersed compared with what would be expected if drinking days had a binomial distribution. A problem with the hurdle-negative binomial model is that a small number of the posterior predicted values exceed 28 days use on each MCMC iteration. Predictions from the hurdle-negative binomial model for Figure 2 were truncated at 40 drinking days for ease of presentation. The posterior predictive distribution of the beta-binomial model reproduces the coarse scale distribution in reported drinking days but is unable to capture the heterogeneity in adjacent step sizes in the ECDF evident in observed drinking days. In contrast to the other models, the continuation ratio model closely reproduces the heterogeneity in step size in the ECDF. The difference in the posterior predictive distributions of the beta-binomial and ordinal models is more clearly shown in rootograms (Fig. 3) . Although the continuation ratio model treats the response as ordinal, the full implied numeric distribution can be recovered from its posterior distribution. All four models predicted mean alcohol drinking days by survey wave reasonably well (not shown). However, there were differences in posterior predictive standard deviation (Fig. 4) . The standard deviation of the hurdle-negative binomial posterior predictive distribution was more variable than the other models. The binomial posterior predictive distribution has lower standard deviation than the observed data, further highlighting overdispersion in the observed data relative to the binomial model assumption. The beta-binomial and continuation ratio models both give posterior predictions for each survey wave with standard deviations consistent with the observed data. It would be expected that a sequential model with 28 thresholds would fit cumulative days use well overall (Fig. 2) . A more rigorous test of this model is to check whether the posterior distribution of the thresholds for progression of alcohol drinking days are reasonable for different values of explanatory variables. Separate ECDFs for the four survey waves (Fig. 5 ) reveal complete abstinence from drinking over the 28-day intervals increased from 14% in Wave 1 to 24% in Wave 4. The posterior predicted distribution of the ECDF for each wave is reasonably consistent with the observed data. Although the same set of explanatory variables was used for each model, the four models do not have the same number of parameters. For example, the continuation ratio model includes the 28 threshold parameters that are not included in the other models. On the other hand, the hurdle-negative binomial and the beta-binomial have two sub-models with full sets of parameters, including person-level random effects, estimated for each. Since the continuation ratio model includes a separate ordinal category for each numeric response, we can compare its goodness of fit with the other three models using information criteria. According to LOO-IC (Table 1) , the continuation ratio model fits the alcohol drinking days data clearly better than the other models, accounting for differences in the number of parameters between models. In applications such as modelling substance use, it will usually be advantageous if clear inferences can be made based on the fitted model. As the best model, we base posterior inference on the continuation ratio model. Exponentiating the posterior distribution of the parameters, β, from Equation 1 gives posterior odds ratios for extending the number of alcohol drinking days levels of categorical variables relative to specified reference categories. For the ADAPT study, primary interest lay in frequency of use of alcohol and other drugs at the four survey waves as well as the effect of home isolation and quarantine. The posterior distribution of the odds ratio for isolation versus no isolation was mostly greater than unity, suggesting the target population was probably somewhat more inclined to extend their alcohol drinking days when in isolation or 9/13 quarantine than when not. Based on the posterior median estimate, the odds of extending drinking days from a given level of drinking in isolation or quarantine was around 15% higher than when not in isolation [Posterior median odds ratios 1.15, 90% credible interval (CI) 0.99, 1.40]. Conversely, we infer that the odds of extending alcohol drinking days most likely declined after Wave 1. Most notably, odds of extending alcohol drinking days in Wave 4 is estimated to have been around 35% lower than Wave 1 (Posterior median odds ratio 0.65, 90% CI 0.51, 0.81) (Fig. 6) . We have demonstrated the suitability of the continuation ratio model for fitting substance days-of-use data and its superiority compared with the negative binomial, binomial and beta-binomial distributions for the fitted alcohol drinking days data. The beta-binomial and hurdle-negative binomial can each be fitted with alternative numbers of distributional parameters regressed on explanatory variables from those included in Table 1 . We also fitted a beta-binomial with one distributional parameter, Beta-Bin (28, π ij , φ) and hurdle-negative binomials with one and three regressed distributional parameters, Hurdle-NB (ψ, µ ij , α) and Hurdle-NB (ψ ij , µ ij , α ij ). None of these alternative models were competitive with the continuation ratio model according to the LOO-IC measure. The continuation ratio model fitted to alcohol drinking days specified explanatory variable dependence through a single distributional parameter. Provided the fitted model adequately describes the data generating process, variable dependence on a single distributional parameter greatly simplifies inference. For example, Allison [3] expresses a preference for negative binomial over zero-inflated Poisson and zero-inflated negative binomial models for count data because zero-inflated components make the latter more difficult to interpret (see also [27] ). Although odds ratios from sequential ordinal models need to be interpreted slightly differently from most logistic models, including cumulative link ordinal models, they do have a clear interpretation in the context of substance days-of-use data. The advantage of ordinal models will be greatest when data are multimodal. It seems likely that the wording of available responses on the questionnaire in the ADAPT study exaggerated the true extent of multimodality in alcohol drinking days by prompting respondents to preferentially choose responses corresponding to multiples of four. Nevertheless, less pronounced multimodality of the kind observed is plausible. Cannabis use data modelled by Wagner et al. [33] (their Figure 2a) appear to exhibit multiple modes. Unfortunately, Wagner et al. model total substance days-of-use combined over several 28-day intervals for each person, potentially obscuring multimodality in their data. Also, Kowal & Wu [18] describe characteristics of mental health stress observations similar to substance days-of-use data considered here. Many-levelled continuation ratio models could be used in other applications where outcomes are measured in number of days observed over a fixed interval, such as migraines, interrupted sleep, and exercise [13, 20, 22] . Analysis of Ordinal Categorical Data Applications of the Bayesian bootstrap in finite population inference Logistic regression using SAS: Theory and application. SAS institute Ordinal regression models for epidemiologic data Parental and peer influences on the risk of adolescent drug use The analysis of bounded count data in criminology Define custom response distributions with brms brms: An R package for Bayesian multilevel models using Stan Bayesian item response modeling in R with brms and Stan Ordinal regression models in psychology: A tutorial Handling sparsity via the horseshoe Parental alcohol involvement and adolescent alcohol expectancies predict alcohol involvement in male adolescents Topiramate reduces headache days in chronic migraine: A randomized, double-blind, placebo-controlled study bayesplot: Plotting for Bayesian Models Bayesian Data Analysis Decrease in prevalence but increase in frequency of non-marijuana drug use following the onset of the COVID-19 pandemic in a large cohort of young men who have sex with men and young transgender women Effects of major depression on crack use and arrests among women in drug court Semiparametric count data regression for self-reported mental health Generating random correlation matrices based on vines and extended onion method Initiation and maintenance of exercise behavior in older women: predictors from the social learning model Forward and backward continuation ratio models for ordinal response variables Association between perceived insufficient sleep, frequent mental distress, obesity and chronic diseases among us adults, 2009 behavioral risk factor surveillance system Young people and alcohol: an econometric analysis Investigation of factors that affect the frequency of alcohol use of employees in Turkey R: A Language and Environment for Statistical Computing Generalized additive models for location, scale and shape Threshold models in a methadone programme evaluation Stan Development Team. Stan Modeling Language Users Guide and Reference Manual Sequential models in categorical regression Ordinal regression: A review and a taxonomy of models Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC Rank-normalization, folding, and localization: An improved (formula presented) for assessing convergence of mcmc (with discussion)* † The importance of distribution-choice in modeling substance use data: A comparison of negative binomial, beta binomial, and zero-inflated distributions Modelling the abundance of rare species: statistical models for counts with extra zeros We thank the people that participated in the ADAPT survey. Rachel Sutherland advised on the operational details of the study. CD acknowledges support from an Australian Research Council Future Fellowship (FT210100260). Table A1 . Probability mass functions of fitted models with parameters as described in the text. The quantity N in the continuation ratio, binomial and beta-binomial models is the length of the interval in days, which was 28 in the case of the ADAPT data fitted. Probability Mass Functionwhere Γ denotes the gamma function.