key: cord-0159631-4a1bcx4m authors: Aliverti, Emanuele; Mazzuco, Stefano; Scarpa, Bruno title: Dynamic modeling of mortality via mixtures of skewed distribution functions date: 2021-02-02 journal: nan DOI: nan sha: 357657009226a2a4164a4e23915ab3eca9f3a8ec doc_id: 159631 cord_uid: 4a1bcx4m There has been growing interest on forecasting mortality. In this article, we propose a novel dynamic Bayesian approach for modeling and forecasting the age-at-death distribution, focusing on a three-components mixture of a Dirac mass, a Gaussian distribution and a Skew-Normal distribution. According to the specified model, the age-at-death distribution is characterized via seven parameters corresponding to the main aspects of infant, adult and old-age mortality. The proposed approach focuses on coherent modeling of multiple countries, and following a Bayesian approach to inference we allow to borrow information across populations and to shrink parameters towards a common mean level, implicitly penalizing diverging scenarios. Dynamic modeling across years is induced trough an hierarchical dynamic prior distribution that allows to characterize the temporal evolution of each mortality component and to forecast the age-at-death distribution. Empirical results on multiple countries indicate that the proposed approach outperforms popular methods for forecasting mortality, providing interpretable insights on the evolution of mortality. In the last decades, changes in life expectancy and population growths have stimulated growing interest in sophisticated models for mortality (e.g., Booth, 2020; Zanotto et al., 2020) . These approaches characterize the observed age-specific regularities of the mortality function and their trends over time, providing a parsimonious representation of the components of mortality and allowing to provide forecasts. Estimates of past mortality trends provide insights on the global health of a population, describe relevant aspects of its demographic structure and facilitate comparisons across countries. Similarly, probabilistic forecasts are routinely employed by researchers, practitioners and organizations to aid and support welfare planning strategies such as pension funds, insurances and health systems, among many others (e.g., Lutz and Samir, 2010) . Human mortality has been generally characterized by several quantities, with the hazard function, the survival function and the probability density function being the most popular ones (e.g., Klein and Moeschberger, 2006) . Although each representation can be directly obtained from the others, their interpretations are complementary and highlight more clearly different aspects of mortality (e.g., Basellini and Camarda, 2019) . In this article we will focus on modeling the distribution of the ages at death, generally referred to as "age-at-death distribution" and corresponding to the density function that characterizes mortality (Keyfitz, 2005) . In the literature on modeling and forecasting mortality, this quantity has received significantly less attention than other quantities such as mortality rates (e.g., Lee and Carter, 1992; Li and Lee, 2005) . However, as outlined in Basellini and Camarda (2020) and Pascariu et al. (2019) , analysis of the age-at-death distribution can be incredibly useful, since it provides a direct evaluation of longevity and lifespan variability that cannot be directly inferred from other quantities (see also Cheung et al., 2005; Canudas-Romo, 2010) . To further see that, Figure 1 illustrates the age-specific distribution of deaths from four illustrative countries, over a time period that ranges from 1970 to 2015. These data highlight crucial aspects of the temporal evolution of mortality, such as the global shift of the curve to older ages-due to the increase in life expectancy-and the drop of perinatal and infant mortality. Figure 1 also suggests that such an evolution has characterized countries differently, resulting in faster changes in some countries (e.g., Italy) and slower in others (e.g., Sweden). Furthermore, the age-specific distribution of deaths illustrates that several mortality trends are converging, resulting in similar patterns in terms of modal age at death, infant mortality and compression of old age mortality. The first approaches for modeling the distribution of deaths date back to the 19-th century (e.g., Gompertz, 1825; Makeham, 1860; Pearson, 1897) , and have been generalized in the following century (Siler, 1983; Heligman and Pollard, 1980) . The main interest of such methods has been on characterizing mortality patterns across ages, focusing on a given country during a specific year. According to these approaches, the mortality curve can be characterized using different parametric specifications, accounting for distinct aspects of the mortality curve; see also Dellaportas et al. (2001) and Mazzuco et al. (2018) . One notable limitation of these approaches is the focus on single-country modeling; when interest is on forecasting mortality for multiple countries, independent country-specific analysis are required. This strategy ignores the fact that in most developed nations similar trends are observed, and this abundance of information can be appropriately included in the model to improve the quality of the estimates. Joint modeling of multiple nations, in fact, is expected to improve the estimation of country-specific parameters and temporal trends, allowing to share information across different countries. More recently, researchers have focused on explicitly modeling the temporal evolution of the ageat-death distribution, forecasting the parameters using time-series specifications (Pascariu et al., 2019; Basellini and Camarda, 2019) . These methods borrow ideas from death-rates forecasting, such as the milestone approach introduced by Lee and Carter (1992) and sequent developments (e.g., Hyndman and Ullah, 2007) . Potentially, these models can be both used to describe the evolution of mortality in the past years and to provide forecasts, although the focus of applications is more frequently on forecasting only. However, separate forecasts often lead to unreasonable results, with country-specific or sex-specific predictions that diverge substantially across groups. Such divergent forecasts across subpopulations are implausible, since industrialized countries have been showing global convergence trends in terms of mortality levels, and this tendency is expected to endure (Wilson, 2001; Bergeron-Boucher et al., 2017; Oeppen, 2008) . Following the breakthrough approach of Li and Lee (2005) , different authors have focused on "coherent" mortality forecasts, defined as predictions that do not diverge across groups. Such a strategy has constantly received attention in the past years, being successfully implemented in several models for death rates; for example, generalizing compositional-data techniques (Bergeron-Boucher et al., 2017) and functional-data models (Hyndman et al., 2013) for mortality. However, a critical issue with these methods concerns with the choice of a population standard to be used as a reference. This choice is substantially arbitrary, and a common consensus on which sub-population should be used is still lacking, since results are often sensitive to different references (Booth, 2020; Kjaergaard et al., 2016) . Motivated by the above discussion, in this article we propose a novel approach for modeling and forecasting the age-at-death distributions of multiple countries. We will focus on the age-distribution of deaths, combining the life tables quantities with the total number of deaths. This procedure allows to naturally take into account the population size of each country and its age-structure, which are important determinants of the age-at-death distribution. The proposed methods can also be applied, without modifications, directly to the life table quantities, where we can assume that data are provided on artificial populations with fixed population size. However, this would discard an important source of uncertainty, which corresponds to the population size of a specific nation. Indeed, imposing a fixed same sample size weights each nation equally, implying that a relatively small country (e.g. Sweden) would have the same influence as a much bigger country (e.g., Germany or United Kingdom). Therefore, we prefer to use the actual sample size of each county to have a proper quantification of uncertainty and genuinely account for different population structures. We propose a method based on a mixture distribution function of a Dirac mass, a Gaussian distribution and a Skew-Normal distribution, in order to characterize infant, adult and old-age mortality with an interpretable set of parameters. Following a Bayesian approach to inference, we specify hierarchical priors for the dynamic parameters and the country-specific coefficients, relying on informative common prior specifications to improve the borrowing of information across nations. Compared to the available techniques, our method allows to model directly the temporal evolution and the country-specific variations across multiple sub-populations, avoiding sequential procedures and providing estimates shrunk towards common levels. Under the proposed methods, coherent predictions are naturally obtained without selecting a reference population or an external standard, but instead estimating these quantities from the available data. We will refer to the proposed dynamic model based on a skewed distribution function as dysm in the sequel. Our interest is on providing a flexible representation of the age-at-death distribution in multiple countries, dynamically across different years or time periods. We will characterize this quantity through a continuous probability density function (pdf) f (x; ϑ jt ) which parametrizes the probability of dying at age x -based on the individuals current living at age x -in country j during year t via a set of countryand time-specific parameters ϑ jt , with x = 0, . . . , 110, j = 1, . . . , p countries and t = 1, . . . , T years. Although f (x; ·) is often specified as a continuous function, in practice we focus on characterizing the age-at-death distribution on a discrete grid, corresponding to the probability p x jt of dying at a specific age x = 0, . . . , 110. Such an aim can be addressed discretizing f (x; ϑ jt ) over a set of disjoint intervals (x − 1/2, x + 1/2] as where F denotes the cumulative distribution function (cdf) associated with f . This characterization of probabilities via a discretization of an underlying pdf automatically guarantees that the vector p jt = p jt (ϑ jt ) = [p 0 jt , . . . , p 110 jt ] with elements as in Equation 1 is a proper probability vector with p x jt ∈ [0, 1] and x p x jt = 1, therefore avoiding further restrictions (e.g., Oeppen, 2008) ; see the left panel of Figure 2 for a graphical representation of the discretization process. We denote as D jt = [D 0 jt , . . . , D 110 jt ] the number of deaths and as p jt = [p 0 jt , . . . , p 110 jt ] the death probabilities, where elements D x jt and p x jt denote, respectively, the number of deaths and the probability of dying at age x -based on the current individuals living at age x -in country j during year t. Given the total number of deaths n jt = x D x jt and the probabilities outlined in Equation (1), we model the vector D jt as a multinomial distribution with (D jt | n jt , p jt ) ∼ multinom(n jt , p jt ), j = 1, . . . , p, t = 1, . . . , T. (2) A crucial aspect of this approach involves the specification f (x; ·), which induces a model on the vector of probabilities p jt . In practice, different parametric models for f (x; ·) are available in the literature, with the Gompertz (1825) and Siler (1983) models being popular options. The age-at-death distribution f (x; ·) could be also modeled semi-parametrically, using for example splines or other basis functions (Hyndman et al., 2013; Basellini and Camarda, 2020) . However, when interest is explicitly on characterizing relevant aspects of mortality -such as the impact of infant mortality or the shape of old-age mortality -parametric models provide a much clearer interpretation and deeper insights on the structure of the populations under investigation. In addition, mortality can be naturally divided into different components, accounting for deaths at different phases of life (e.g., Lexis, 1879; Pearson, 1897) . Therefore, it is desirable to rely on a specification which preserves such a decomposition, for example using mixtures of distributions (Carriere, 1992; Mazzuco et al., 2018) . In this work, we draw inspiration from Mazzuco et al. (2018) and Zanotto et al. (2020) and rely on a mixture of three components to characterize mortality; see also Carriere (1992) for related arguments. Specifically, in order to facilitate model's interpretation, we decompose mortality as the contribution of infant mortality, adult mortality and old-age mortality. Infant mortality is accounted through a single Dirac mass at age 0, while adult mortality is modeled via a Gaussian distribution with mean µ jt ∈ R and standard deviation σ jt ∈ R + . Old-age mortality, instead, is characterized using a Skew-Normal distribution (e.g., Azzalini and Capitanio, 2013) with location parameter ξ jt ∈ R, scale parameter ω jt ∈ R + and shape parameter α jt ∈ R. The impact of each component to the overall mortality function is accounted assigning mixture weights π h jt ∈ [0, 1] such that π 0 jt + π 1 jt + π 2 jt = 1 in order to guarantee that the resulting expression remains a proper pdf. This choice leads to the following with ϑ jt = (π 1 jt , π 2 jt , µ jt , σ jt , ξ jt , ω jt , α jt ), 1[·] denoting the indicator function and φ(·), Φ(·) denoting the pdf and the cdf of a standard Gaussian, respectively; we omit the parameter π 0 jt from ϑ jt since π 0 jt = 1 − π 1 jt − π 2 jt due to the constraint on the mixture weights; see the right panel of Figure 2 for a graphical representation of the proposed model for the age-at-death distribution. Equation (3) allows to characterize the age-at-death distribution with flexibility, while also retaining interpretability of the mortality structure. The first component of Equation (3) accounts for infant and perinatal mortality, characterizing deaths in the first year of life. As outlined, for example, in Figure 1 , perinatal mortality is notably decreased during recent years, mainly due to technological advances (e.g., Wilson, 2001; Zanotto et al., 2020) ; the mixture weight π 0 jt characterizes the impact of this component in the overall mortality. The use of a Dirac mass is particularly adequate for developed countries, where infant mortality occurs during the first year of life and can be properly modeled with a single mass, avoiding more elaborate specifications and additional parameters. However, dysm can be naturally adapted to countries with high infant-mortality levels, replacing the Dirac mass with an half-normal distribution (Zanotto et al., 2020) and leveraging on the same model architecture; in Section 3.4, we compare such a specification with the proposed model. The Gaussian component, instead, is assigned to a relative weight π 1 jt and characterizes adult mortality via the location parameter µ jt and the scale parameters σ jt . In addition, these quantities correspond also to the moments of the Gaussian component, and therefore they can be interpreted directly as the mean and standard deviation of the age at adult mortality. The Skew-Normal component, lastly, accounts for the asymmetric shape of old-age mortality, which has been frequently observed in developed countries (e.g., Mazzuco et al., 2018; Pascariu et al., 2019) . Since most deaths occur at old ages, this component is expected to be associated to the largest mixture weight π 2 jt . Inference on the structure of old-age mortality rely on such mixture weight and on the Skew-Normal parameters ξ jt , ω jt and α jt , providing insights on the evolution of old-age mortality in terms of location, scale and shape of such a component. In practice, it can be convenient to focus on the mean and standard deviation of old-age mortality, which are more interpretable measures of the shift and compression of the age-at-death distribution. According to the proposed parametrization, these quantities can be conveniently expressed as with δ jt = α jt / 1 + α jt (e.g., Azzalini and Capitanio, 2013, sec. 2.1.4). Alternatively, it is also possible to parameterize directly the Skew-Normal distribution in terms of its cumulants (e.g., Azzalini and Capitanio, 2013, sec. 3.1.4). However, for simplicity in implementation and exposition, we prefer to focus on the outlined parameters, and eventually post-process the estimates to focus on the functionals of interest. The Multinomial model outlined in Expression 2, with probabilities given by Equations (1) and (3), provides a flexible specification for the distribution of the number of deaths occurred in a single country j during a specific year t. In order to account for the temporal variation of the age-at-death distribution, we model explicitly the dynamic evolution of the parameters ϑ jt through an hierarchical dynamic model. This strategy will allow us to highlight what aspects of mortality have changed in the past years, characterizing the rates of such changes and facilitating comparison across different countries. In addition, the inclusion of a dynamic component facilitates the developments of mortality forecasts extrapolating parameters in time, obtaining predictions for the future age-at-death distributions. We specify the hierarchical model in an unconstrained domain, transforming the parameters of Equation (3). This choice facilitates computation and implementations, allowing to rely on Gaussian dynamic models in the transformed space and avoiding restrictions on the support of the parameters. For this purpose, we introduce the reparametrized vectorθ jt , with elements (θ jt 1 , . . . ,θ jt 7 ) corresponding to In Equation (5), the mixture weights (π 1 jt , π 2 jt ) are mapped into (θ jt 1 ,θ jt 2 ) leveraging a cumulative logit transformation, while σ jt and ω jt -corresponding to adult and old-age mortality scale parameters -are transformed intoθ jt 4 andθ jt 6 , respectively, via a logarithmic transformations. Note that eachθ jt k ∈ R for k = 1, . . . , 7, and that the vector ϑ jt effectively corresponds to a one-to-one reparametrization of ϑ jt . We model the dynamic evolution of the elementsθ jt k leveraging a random-walk model with drift. This choice leads to the following specification. In Equation (6),θ j0 k ∈ R denotes the initial condition ofθ jt k , and it is assigned to a Gaussian distribution with mean m k ∈ R and standard deviation s k ∈ R + . These values correspond to the initial information for the dynamic hierarchical model, and characterize a probabilistic representation of our beliefs before the data are observed (e.g., West and Harrison, 1997) . The drift parameter β j k ∈ R measures the expected difference across consecutive values ofθ jt k and accounts for possible increasing or decreasing linear trends in the evolution of the parameters. The inclusion of this component is particularly relevant in our problem, where we expected to observe clear and important trends in the evolution of mortality, accounted by the parameters charactering the age-at-death distribution. Conditionally on the value at time (t − 1), eachθ jt k is recursively distributed as a Gaussian distribution with mean β j k +θ j t−1 k , and standard deviation η j k ∈ R + . Overall, we can interpret dysm as a Multinomial state-space model, where Equation (6) controls the evolution of the latent states via random-walks with Gaussian innovations and, conditional on the states, the observed vectors of deaths are modeled as independent Multinomial variables with probabilities given by Equations (1) and (3); see Figure 3 for a graphical representation of the proposed hierarchical dynamic model. Although this dynamic model might seem oversimplistic, we found that such a specification leads to good results in a large variety of applications involving developed countries; see Section 3 for further details. More elaborate extensions are possible, for example including a moving-average component, increasing the lag of the autoregressive terms or joint modeling multiple components ofθ jt k via multivariate random-walks. However, a random walk with drift on the latent states often characterizes well many phenomena without over-parameterizing the dynamic component, and is also general enough to be directly used, without further modifications, in a large variety of settings; see Durbin and Koopman (2012) and West and Harrison (1997) for related arguments and Section 3.4 for a comparison with alternative specifications. Additionally, we highlight that Equation (6) is valid for any equally-spaced time grid, and therefore the proposed approach can be also applied to different time scales. In fact, our description focuses on yearly data for simplicity in exposition, but this requirement is not necessary for developing the proposed model in practice. Several important settings involve analysis of mortality across different time scales, such as monthly data or multi-year periods; for example, when interest is on modeling short-term variations due to extraordinary events (e.g., covid-19 pandemic) or large-scale trends, respectively. The general structure of dysm facilitates its use in these different settings, without modifying the model specification. As discussed in Section 1, a crucial aspect of mortality forecasting involves the penalization of diverging scenarios via coherent modeling. This can be obtained explicitly including a common trend across countries (e.g., Li and Lee, 2005; Li, 2013) , or choosing a reference population (e.g., Hyndman et al., 2013; Basellini and Camarda, 2019) ; see also Booth (2020) for a recent discussion. However, these choices are arbitrary and significantly affect the final results (e.g., Kjaergaard et al., 2016) . Therefore, we follow a different path and induce coherent predictions via a Bayesian approach to inference, regularizing estimates toward a common mean trend. More specifically, we introduce a common prior specification for the country-specific dynamic hierarchical model defined in Equation (6), for each coefficient k = 1, . . . , 7. Including the prior forθ j0 k , we let where m β k , s β k , a k and b k denote fixed hyper-parameters; see Figure 4 for an illustration of the hierarchical dependence structure induced by the common prior specification. The specification of a common prior distribution for the country-specific parameters allows to share information across different nations, improving estimates for the age-at-death distributions. In addition, the introduction of a common prior shrinks estimates toward a common mean, providing an implicit regularization on the parameters. Such a shrinkage is what makes dysm a "coherent" model, using a data-driven reference mortality level without the need of a user-defined benchmark. The large abundance of research on the evolution of mortality, and the clear interpretation of the parameters in Equation (3), facilitate informative elicitation for the hyper-parameters of Equation (7). For example, European adult mortality is expected to lie between 30 and 70 years with high probability (e.g., Lewer et al., 2020) ; therefore, we can assumeθ j0 3 ∼ N (50, 10) for the initial condition ofθ jt 3 , independently for j = 1, . . . , p. Similarly, informative specifications for the prior distribution on the drift parameter β j k allow to include information on the direction and intensity of the temporal trends. For instance, we could center the trend β j 3 around m β 3 = 2, resulting in an expected increasing trend for the age at adult mortality of 2 years between consecutive time points. In Section 3.4, we compare such informative prior specification with a non-informative elicitation, confirming the importance of the proposed approach to induce shrinkage across countries and obtain coherent predictions. There is a large literature on simulation methods for non-Gaussian state-space models; see, for example, Durbin and Koopman (2012, Chapter 13 .3), Prado and West (2010, Chapter 6) or Chopin and Papaspiliopoulos (2020) for recent developments. A major complexity in our methods derive from the computation of the age-at-death distribution, which links ϑ jt to p jt via Equation (1). Therefore, we conduct posterior inference relying on the r package nimble , that allows to specify the main structure of dysm, compile it in c++, and implement an adaptive Metropoliswithin-Gibbs mcmc algorithm to simulate from the posterior distribution of the model's parameters. Specifically, the parameters of Equation (7) are updated using Gibbs steps, while the remaining are updated with multivariate Metropolis steps, due to the lack of conjugacy induced by Equation (1). Pseudo-code illustrating the main steps to conduct posterior inference is reported in the Supplementary materials. Forecasts for future time points t > T can be simulated recursively from the posterior predictive distribution, continuing the conditional representation of Equation (6) for each draw of the mcmc output. In order to improve the computational performance, we included further modifications to efficiently compute the cdf of the mixture density function in Expression (3) countries and T = 20 time points requires roughly 2 minutes per 1000 iterations on an 8-core intel I7-7700HQ, 2.8 ghz processor running Linux, which is satisfactory for our purposes. The focus of this Section is to evaluate the ability of dysm to characterize the age-at-death distributions of different European countries. In Section 3.2, we are interested in comparing its performance with some state-of-the-art methods for mortality forecast, focusing on rolling windows scenarios. In In order to quantitatively evaluate the performance of dysm in estimating and forecasting mortality, We compare dysm with different popular approaches for mortality forecasting: the Lee-Carter model (Lee and Carter, 1992) , the coherent Li-Lee model (Li and Lee, 2005) , the functional mortality model of Hyndman-Ullah (Hyndman and Ullah, 2007) , the compositional Oeppen model (Oeppen, 2008) , and the mem5 model (Pascariu et al., 2019) . All methods are conveniently available in a common interface within the r package MortalityForecast (Pascariu, 2020) . Since these methods focus on modeling single population data, we perform the analysis on each country separately. In order to compare the approaches on the same basis, results are evaluated in terms of the quality of the fitted and predicted age-at-death distributions and death rates. Under dysm, such estimates are obtained post-processing the mcmc output, computing the posterior distribution of Equation (1) the remaining values of m k to 0. We also let s k = 10 for all k to induce sufficient variability in such initial prior guesses and cover the temporal window under investigation. Lastly, we specified a shared standard Gaussian for the drift parameters β j k (setting m β k = 0 and s β k = 1) and a weakly-informative Inverse-Gamma on the variances of the dynamic innovations η 2 j k , setting a k = 0.01, b k = 0.01 for all k. In each scenario, posterior inference for dysm relies on 100000 iterations collected after a burn-in of 5000, thinning one iteration every 5. Convergence was assessed in terms of mixing, autocorrelation of the chains and Geweke diagnostics, which resulted satisfactory in all the windows considered, with an Effective Sample Size larger than 18000 out of 20000 stored iterations for all the parameters. Performance is evaluated in terms of Mean-Squared-Error (mse) and Mean-Absolute-Error (mae) between the fitted and predicted age-at-death distributions or death rates and the ground truth. Since the elements of the age-at-death distribution are generally very small numbers, we report results in relative terms, dividing the performance of each approach by the performance obtained by dysm in each scenario and country. Recalling that both mse and mae are measure of discrepancy, results larger than 1 indicate that predictions under dysm achieve an error which is lower than the competitor, and therefore we can conclude that dysm performs better; the opposite holds for values lower than 1. Results referring to in-sample evaluation are reported in Table 1 , computing the median, first and third quartiles of the relative performances across countries and rolling windows scenarios. Current results suggest that dysm outperforms the competitors both for the male and female populations, with a performance that is better than the other approaches in all settings. These results confirm the ability of dysm to characterize with flexibility age-at-death distributions across multiple countries and time periods, providing estimates that are consistent with the observed data and more accurate than state-ofthe art methods. When dysm is used for fitting death rates, a relatively worse performance is observed, particularly in terms of mae. This result can be explained considering that old-ages (80 − 100) are assigned more weight in the computation of death rates, compared to other quantities that characterize mortality (e.g., Keyfitz, 2005) . Therefore, dysm performs particularly well in characterizing premature and adult mortality, and this leads to large improvements for modeling the age-at-death distribution, in particular for male populations where the share of adult mortality is relatively large. Out-of sample performances are instead reported in Table 2 . Results provide further evidence in favor of the ability of dysm, which obtains out-of-sample forecasts more accurate than the competitors. This important results suggest that, overall, dysm is the most accurate methods for modeling and forecasting the age-at-death distribution in the 12 European countries under investigation. Performances stratified by countries are reported in the Supplementary materials. In Section 3.2, we have demonstrated that dysm outperforms the competitors in terms of the quality of the fitted and predicted age-at-death distributions. In this Section we argue that the proposed method has concrete practical benefits in terms of mortality modeling also from an interpretative point of view. Indeed, posterior inference on the parameters characterizing dysm and their functionals provides insights on the structure of mortality and its future evolution, highlighting aspects that have larger impact in such changes and explaining differences across countries. To illustrate this, we focus on the same European countries as in the previous section, modeling the more recent period 1980 − 2017 and providing forecasts until 2037. Posterior simulation for this application relies on the same settings as in Section 3.2. Results in terms of mixing, autocorrelations, Geweke diagnostic and effective sample size were satisfactory for all the parameters considered, with an effective sample size larger than 19000 out of 20000 iterations for all the parameter. We conduct inference on the set of parameters (π 0 jt , π 1 jt , π 2 jt , µ jt , σ jt , α jt ,μ SN jt ,σ SN jt ) to further facilitate interpretation. Results are reported in Figure 5 , illustrating the posterior medians and 90% credible intervals for the estimated parameters for male and female populations separately. Current empirical findings are consistent with known aspects of the temporal evolution of mortality of the past decades. For example, the impact of infant mortality-measured via the parameters π 0j tis notably decreased in all countries in the past years, dropping from values around 0.01 to 0.005. Although both male and female populations share such decreasing trends, we also observe that infant mortality is larger for males than females (e.g., Drevenstedt et al., 2008) . Another clear aspect involves the evolution of the mean age of old-age mortality, captured via the Skew-Normal meanμ SN jt , which has constantly increased in all countries. The overall trend is similar between males and females, with woman reporting older ages on average. Similarly, the decreasing trend ofσ SN jt is consistent with the empirical evidence on compression of old-age mortality. Under dysm, estimates and credible intervals for any complex functional of the model's parameters are easily obtained. More importantly, prediction intervals are determined taking into account the uncertainty of the entire inferential process, which is not the case for other coherent models, where all the uncertainty related to choice of reference is generally not accounted for. Some interesting functionals include death rates and life expectancies, which can be obtained from the age-at-death distribution via deterministic transformations; dysm allows to conduct coherent inference also on this quantities, rigorously characterizing uncertainty post-processing the mcmc output. As an illustrative example, Estimates are obtaining post-processing the mcmc for the age-at-death distribution. Solid and dashed lines denote posterior medians and 90% credible intervals, respectively. In this section we evaluate the robustness of the parametric assumptions underlying dysm, focusing on the application outlined in Section 3.3. We investigate different specifications for the mixture components characterizing dysm and alternative prior specifications. Specifically, we modify dysm focusing on • modeling old-age mortality with a rescaled Beta distribution, supported in the interval [75−110], instead of a Skew-Normal distribution; • modeling infant mortality with an Half-Normal distribution, as in Zanotto et al. (2020) , instead of a single Dirac mass at age 0; • removing the Gaussian component accounting for adult mortality, setting π 1 jt = 0; • characterizing the joint evolution ofθ jt with a multivariate random walk, instead of independent innovations; • characterizing the evolution ofθ jt via random-walks with Student-t innovations with 5 degrees of freedom, instead of Gaussian distributions; • replacing informative priors distributions in Equation (7) with flat uniform priors. Table 3 , and suggest that the current implementation for dysm is the best performing model in terms of marginal likelihood and other measures. For example, modeling old-age mortality with a scaled Beta or characterizing infant mortality with an Half-normal distribution do not improve the fit of the current specification for dysm; similar results are obtained with more complex specifications for the dynamic prior distribution or non-informative elicitation, confirming the importance of the proposed specifications. Interestingly, current results indicate that the specification without adult mortality performs better for female than for male mortality, confirming that the inclusion of a separate component devoted to adult mortality is fundamental for modeling male mortality, but also provides concrete benefits for female mortality. In this article we have introduced a dynamic model based on a Skewed distribution function (dysm) for modeling and forecasting the age-at-death distribution of multiple countries across time. The proposed method automatically regularize diverging scenarios, and allows to obtain predictions which are coherent across sub-populations. In addition, this strategy leads to a borrowing of information which improved the quality of the fit and of the forecasts, compared with state-of-the-art methods for mortality forecasting. Another important advantage of the proposed method is that it provides coherent forecasts without the need of choosing a reference population, which is instead naturally derived from data. The choice of reference population, in fact, is still an open issue of coherent forecast models and has received notable attention from scholars in recent years (Booth, 2020; Kjaergaard et al., 2016) . We have focused directly on the age-distribution for the number of deaths, considering the total number of deaths as a given quantity and relying on a Multinomial likelihood. This approach allows to focus directly on the distributions of deaths across different ages, accounting for the heterogeneity in the overall mortality trends. For future developments, it might be useful to include a further dynamic level in the hierarchy and model the process of death counts n j t . Such an aim can be achieved, for example, using an hierarchical Poisson or Negative-Binomial dynamic model. In this section we illustrate the main steps to perform posterior inference under dysm. Following Section 2.1 of the paper, the model can be formulated as (D jt | n jt , p jt ) ∼ multinom(n jt , p jt ), j = 1, . . . , p, t = 1, . . . , T. with D jt = [D 0 jt , . . . , D 110 jt ] denoting the number of deaths and as p jt = [p 0 jt , . . . , p 110 jt ] the probabilities for the age-at-death distribution, where the elements D x jt denote the number of deaths at age x, in country j during year t. The vector of probabilities is induced discretizing a density function f as which is specified as a mixture density function with For computational convenience, the model in parametrized in terms of uncosntrained parameters as ϑ jt = (θ jt 1 , . . . ,θ jt 7 ) := log π 1 jt 1 − π 1 jt , log π 2 jt 1 − π 1 jt − π 2 jt , µ jt , log(σ jt ), ξ jt , log(ω jt ), α jt , and the prior distributions are specified as ϑ j0 k ∼ N (m k , s 2 k ) (θ jt k |θ j t−1 k , β j k , η 2 j k ) ∼ N (β j k +θ j t−1 k , η 2 j k ), t = 1, . . . , T, to induce an auto-regressive specification across time. Posterior sampling is performed with the R package nimble (de Valpine et al., 2017; , which relies on an adaptive Metropolis-Hasting for non-conjugate step (updates ofθ jt k ) and a Gibbs Sampler for conjugate updates (β jk and η 2 jk ). The Metropolis-Hastings steps in nimble follow Haario et al. (2001) , leveraging Gaussian random-walks with an adaptation of the variance of the proposal; refer to Haario et al. (2001) and for further details. Specifically, the algorithm to conduct posterior inference under dysm follows the following steps. At each iteration b in 1, . . . , B, the Metropolis-within-Gibbs algorithm updates each multivariate parameterθ t k = (θ t 1k , . . .θ t pk ) recursively for t = 2, . . . , T as follows: 1. Propose a candidate valueθ t k from a Gaussian random-walk proposal g(θ t k ,θ 2. Accept the proposed value with probability min{1, q(θ t k )g(θ t k ,θ where q(θ t k ) denotes the unnormalized posterior density evaluated atθ t k . Considering the likelihood contribution for all countries, this quantity is proportional to the multinomial likelihood outlined in Equation (1), multiplied by the density of the auto-regressive Gaussian priors, conditionally on the value at time t − 1: 3. If the proposed value is accepted, setθ Note that the likelihood function depends on the model parameters via p x jt = p x jt (θ tj 1 , . . . ,θ tj 7 ), computed according to Equation (2). Since f (x; ·) in Equation (2) is a mixture density function, its cumulative distribution function (cdf) F (x, ·) corresponds to a mixture of cdfs, with same mixing weights. The parameters characterizing the random-walk components of Equation (5) are updated consid-ering the joint distribution for (θ j1 k , . . . ,θ jT k ), which correspond to a multivariate Gaussian with (θ j1 k , . . . ,θ jT k ) ∼ N (β j k 1 T , η 2 j k Ω −1 ) and Ω sparse T × T precision matrix with main diagonal equal to (1, 2, . . . , 2, 1), first minor diagonals equal to −1 and remaining elements equal to 0; see, for example, Lindsey (2004, Section 9.2.1) for a derivation of the precision matrix of an ar(1) process. This simple expression allows to recast the problem into conjugate Gaussian updates for β j k and into conjugate Inverse-Gamma updates for η 2 j k , with full conditional distributions equal to where L is such that L L = Ω −1 . In this section, we expand the results reported in Section 4 of the paper, illustrating the country-specific performance of dysm against the competitors. In-sample performances are reported in Tables S1 and S2, while out-of-sample metrics are reported in Tables S3 and S4 . Considering, for example, in-sample performance reported in Table S1 -S2, we observe that dysm performs slightly worse than some competitors only in the female Danish and Dutch populations. These results might be due to the specific structures of such populations, which deviate from the overall patterns and might be subject to an excessive amount of shrinkage by dysm; refer also to (Lindahl-Jacobsen et al., 2016; Bergeron-Boucher et al., 2020) for details on the Danish case. When out-of-sample performance is considered, only results for the male Spanish and Swedish populations indicate that dysm is slightly less accurate than some competitors in these countries. These results might be due to some specific characteristics of these countries, which notably affected mortality and life expectancy and are notably different from the overall trend, resulting in worse performances when coherency across predictions is imposed. Table S4 : Out-of-sample relative performance. Median across rolling windows, stratified by country (fra, gbr, ita, nld, nor, swe). First and third quartiles in squared brackets. The skew-normal and related families Modelling and forecasting adult age-at-death distributions A three-component approach to model and forecast ageat-death distributions Coherent forecasts of mortality with compositional data analysis Coherent mortality forecasting with standards: Low mortality serves as a guide Three measures of longevity: Time trends and record values Parametric models for life tables Three dimensions of the survival curve: Horizontalization, verticalization, and longevity extension An introduction to sequential Monte Carlo nimble: mcmc, particle filtering, and programmable hierarchical modeling Bayesian analysis of mortality data The rise and fall of excess male infant mortality Time series analysis by state space methods On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies The age pattern of mortality Coherent mortality forecasting: the product-ratio method with functional time series models Robust forecasting of mortality and fertility rates: a functional data approach demography: Forecasting Mortality, Fertility, Migration and Population Data Applied mathematical demography The importance of the reference populations for coherent mortality forecasting models Survival analysis: techniques for censored and truncated data Modeling and forecasting us mortality Premature mortality attributable to socioeconomic inequality in england between 2003 and 2018: an observational study Sur la durée normale de la vie humaine et sur la théorie de la stabilité des rapports statistiques A Poisson common factor model for projecting mortality and life expectancy jointly for females and males Coherent mortality forecasts for a group of populations: An extension of the Lee-Carter method Dimensions of global population projections: what do we know about future population trends and structures? On the law of mortality and the construction of annuity tables A mortality model based on a mixture distribution function Coherent forecasting of multiple-decrement life tables: a test using Japanese cause of death data Tables for computing bivariate normal probabilities MortalityForecast: Standard Tools to Compare and Evaluate Various Mortality Forecasting Methods The maximum entropy mortality model: forecasting mortality using statistical moments Fast and accurate calculation of Owen T-function The Chances of Death, and Other Studies in Evolution Time series: modeling, computation, and inference Estimating the integrated likelihood via posterior simulation using the harmonic mean identity Parameters of mortality in human populations with widely varying life spans Bayesian forecasting and dynamic models On the scale of global demographic convergence 1950-2000. Population and Development Review A mixture-function mortality model: illustration of the evolution of premature mortality An adaptive metropolis algorithm Statistical analysis of stochastic processes in time Rise, stagnation, and rise of danish women's life expectancy Programming with models: Writing statistical algorithms for general model structures with nimble Alternative forecasts of danish life expectancy nimble: mcmc, particle filtering, and programmable hierarchical modeling Finally, in this section we illustrate in details the approaches used in Section 3.4 of the paper. The competing approaches rely on the same Multinomial specification outlined in Equations (1) and (2), while the underlying probability density function in Equation (3) and the prior distributions in Equation (5) are specified differently. Using a more general notation, we generalize Equation (3) aswhere g 0 , g 1 and g 2 denote the probability density function characterizing the mixure components.• The first competing approach relies on the same specification as dysm, but replaces g 2 with the density of a rescaled Beta-Distribution in the interval 75 − 100. This choiche leads to• The second specification relies on an half-normal distribution for infant mortality. This leads to• The third specification removes the component devoted to adult mortality. This setting can be obtained simply setting π 1 jt = 0.The remaining competitors are obtained with modifications on the prior distribution of Equation (5).Student-t innovations are obtained replacing the Gaussian distributions in Equation (5) with Student-t with ν = 5 degrees of freedom. Similarly, multivariate random walks can be obtained considering the vectorθ jt ∼ N 7 (θ j t−1 , Σ), with Σ full 7 × 7 covariance matrix assigned to an Inverse-Wishart prior.