key: cord-0130324-eb9jgtb8 authors: Han, Zixuan; Li, Tao; You, Jinhong title: Individual Heterogeneity Learning in Distributional Data Response Additive Models date: 2021-05-27 journal: nan DOI: nan sha: 8621bd0fbb22c7f46041d19a618da48dad663618 doc_id: 130324 cord_uid: eb9jgtb8 In many complex applications, data heterogeneity and homogeneity exist simultaneously. Ignoring either one will result in incorrect statistical inference. In addition, coping with complex data that are non-Euclidean becomes more common. To address these issues we consider a distributional data response additive model in which the response is a distributional density function and the individual effect curves are homogeneous within a group but heterogeneous across groups, the covariates capturing the variation share common additive bivariate functions. A transformation approach is first utilized to map density functions into a linear space. We then apply the B-spline series approximating method to estimate the unknown subject-specific and additive bivariate functions, and identify the latent group structures by hierarchical agglomerative clustering (HAC) algorithm. Our method is demonstrated to identify the true latent group structures with probability approaching one. To improve the efficiency, we further construct the backfitted local linear estimators for grouped structures and additive bivariate functions in post-grouping model. We establish the asymptotic properties of the resultant estimators including the convergence rates, asymptotic distributions and the post-grouping oracle efficiency. The performance of the proposed method is illustrated by simulation studies and empirical analysis with some interesting results. With the development of modern technology, data are increasingly being measured and recorded at several discrete times or a continuous time interval, which are called functional data, and have become a prevailing type of data (Ramsay, 1982) . Functional data analysis provides statistical methodology to deal this kind of data, among which functional regression is widely used to model the relationship of responses and predictors. A great quantity of researches focus on this field, which can be divided into three classes according to whether the responses or predictors are functional or scalar data as follows. The first case is that both responses and predictors are functional data, see (Malfait and Ramsay, 2003; Yao et al., 2005; He et al., 2010; Jiang and Wang, 2011; Cheng et al., 2016) ; the second one is functional responses with scalar predictors, see (Ferre and Yao, 2003, As a specific case of functional data, data consist of samples of distributions or densities appear in various research domains increasingly often. Examples giving rise to such data are distributions of cross-sectional or intraday stock return (Sen and Ma, 2015; Kokoszka et al., 2019) , mortality densities , distribution of intra-hub connectivity in neuroimaging (Saha et al., 2016; . Compared with conventional functional data, distribution or density function takes data as a whole to present its internal structure without the limitation of sample order and the information dimension. In recent years, along with the application of such data, the attention has turn to the complex regression models in which the random distributions or probability densities serve as responses or predictors. In this article, we focus on the model with density functions as responses coupled with scalar predictors, viz., distribution-on-scalar regression model. Considering the density functions as elements of a Hilbert space, they do not constitute a linear subspace due to the inherent constraints of being nonnegative and integrated to one. To deal with this constrain, one way is to adopt the geometric approach by choosing a suitable metric. With the the infinite-dimension version of Aitchison geometry, Talská et al. (2018) defined a density-onscalar linear regression model in Bayes Hilbert spaces. Chen et al. (2020) proposed distribution-ondistribution regression by adopting Wasserstein metric and tangent bundle of the Wasserstein space. Except this, some other models are within the broad framework of non-Euclidean data. For instance, proposed the Fréchet linear regression in a general metric space equipped with the Wasserstein metric. Jeon and Park (2020) developed a unified nonparametric additive regression models with Hilbertian responses where density-valued variables constitute Bayes-Hilbert spaces equipped with a suitable inner product. Another way of dealing with the nonlinear constrain of density functions is to map them into Hilbert space by transformation method. Petersen and Muller (2016) proposed a continuous and invertible transformation, e.g., the log quantile density transformation (LQD) to map probability densities to an unrestricted space of square integrable functions for further modeling and statistical inferences. In conventional statistical analysis, data is generally assumed to be homogeneous. However, this assumption might be inappropriate in many practical applications when the data are collected from objects with different characteristics or in different situations. Ample empirical studies show that inter-class individual homogeneity and intra-class heterogeneity may exist simultaneously, while ignoring the individual heterogeneity during the analysis may lead to incorrect statistical results, and ignoring the homogeneity will result in inefficient statistical inference. Therefore, the density functions within a heterogeneous population should be clustered into several homogeneous groups by some classification measures. A mature literature proposes various methods to identify the latent group structures. Vogt and Linton (2017) applied a distance-based clustering algorithm to the kernel estimation of nonparametric regression function. Vogt and Linton (2018) extended it to a multiscale statistic to avoid the selection of specific bandwidth. Su et al. (2016) developed a so-called classifier-Lasso (C-Lasso) shrinkage method to the linear panel structure model. As a extension, Su et al. (2019) developed a penalized-sieve-estimation-based C-Lasso procedure to heterogeneous time-varying panel data. Chen (2019) proposed a kernel based hierarchical agglomerative clustering (HAC) algorithm with less restrictive assumptions to the same kind data compared with former method. Relevant approaches discussed above are contributed in the context of functional or panel data. The focus of this article is to identify and estimate the latent group structures in additive regression model with density responses. The heterogeneity in the density responses are indeed found when we explore the COVID-19 data. On February 11, 2020, the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was officially named as 'COVID-19' by the World Health Organization (WHO). With the increase in the infection range and transmission speed, the WHO declared the COVID-19 a pandemic on March 11, 2020. At that time, there were about 37371 confirmed cases and 1130 deaths reported in about 114 countries. By December 15, 2020, according to the official repository of WHO (https://covid19.who.int), there have been 73,315,291 confirmed cases reported in about 271 countries and regions, leading to about 1,627,480 deaths. The tremendous increase on these numbers indicate that the COVID-19 is impacting the whole world drastically. It should be noted that due to the medical conditions and adopted prevention policies, along with some other factors both subjective and objective, the epidemic situation in different countries may vary to some extent. To explore the epidemic in each country relative to the global situation with influential factors, we take the density of relative daily mortality rate over a period 240 days in each country as response, where the daily mortality rate is defined as the ratio of deaths per day to the total population of each country, and the relative one is defined as the ratio of daily mortality rate in each country to the total mortality rate of all countries in the world. The predictors considered for explanation are 'aging' (the percentage of population ages 65 and above), 'beds' (the number of hospital beds per 1000 people), 'physicians' (the number of physicians per 1000 people), 'nurses' (the number of nurses per 1000 people) , 'GDP' (the GDP per capita in the US dollar) , and 'diabetes' (the percentage of population with diabetes). Since the outbreak time varies in different countries, we set the date on which each country first reported at least 30 deaths as the origin of time scale. Totally 149 countries are taken into account and the densities of daily mortality rate for these 149 countries are presented by Figure 1 . The existence of different shapes of the densities among countries intrigues us to consider the subject-specific functions and impose latent grouped structures representing heterogeneity in the functional additive model proposed by Han et al. (2020) , i.e. where z i is the density of the relative daily mortality rate over a period 240 days in country i, Ψ(·) is the LQD transformation, X i = (X i,1 , ..., X i,6 ) τ is the covariate vector of country i. Specifically, g i,0 (u) = m k|K,0 (u) if country i is sharing the kind k pattern, and m ·|K,0 (u) is one of K group-specific functions. (2019), we apply HAC method to the estimation of individual functions g i,0 (u), based on which the density functions of relative daily mortality rate can be clustered into four groups with different patterns of epidemic situations, i.e., K = 4, which is shown in Figure 2 . The figure is also served as sufficient and necessary evidence that the functional additive model implemented for the analysis of COVID-19 data should consider the individual heterogeneity among countries. In order to incorporate both inter-class homogeneity and intra-class heterogeneity in the data, we extend the additive functional regression for densities as responses proposed by Han et al. (2020) to the case of density functions with heterogeneity: and where {G 1 , ..., G K } is a partition of the index set {1, 2, · · · , n}, which means that K k=1 G k = {1, 2, ..., n}, and G i ∩ G j = ∅, ||m i|K,0 (·) − m j|K,0 (·)|| 2 = 0 for any i = j. We assume that the number of groups K and the membership in each individual group are unknown. In the above model, z i (u) ∈ F are the random densities, each coupled with the p−dimensional covariates x i = (x i,1 , · · · , x i,p ) τ , with common support S x . Without loss of generality, we take S x = [0, 1]. Denote Ψ(·) as the log quantile density transformation, and f i = Ψ(z i ). g i,0 (·) is the subject-specific function and g l (·, x l ) is the bivariate additive components. For the purpose of the identification, we assume that g l (·, x l ) are: E[g l (u, x i,l )] = 0, for u ∈ [0, 1], i = 1, ..., n. {ε i } n i=1 are independent processes with E(ε i (u)|x i ) = 0 and Cov(ε i (u)|x i ) = σ 2 i (u). Obviously, our proposed model (1) reveals that it is a natural extension of functional additive model. Specifically, if the subject-specific functions are homogeneous, viz., K = 1, g i,0 (·) ≡ g 0 (·), i = 1, · · · , n, then the model (1) becomes the additive function regression for densities as responses proposed by Han et al. (2020) Actually, Han et al. (2020) has conducted great work and gained excellent academic achievements. It proposed a novel additive functional regression model to accommodate random densities as responses and multivariate predictors with the increasing popularity of analyzing data in the form of distributional functions, and established relevant theoretical properties and statistical inference. Motivated by this model, we are interested to identify the subgroups when analyzing data consisting of densities with heterogeneity. In real applications, since only a random sample generated by the density is available, we first estimate each density by the modified kernel density approach proposed by Petersen and Muller (2016) along with the LQD transformation. The identification and estimation methodology of the latent group structures are accomplished in three steps. First, we utilize B-spline series approximating approach to estimate subject-specific function g i,0 (u) and bivariate additive components g l (u, x l ). Then, we apply HAC method to identify the membership of each latent group. Finally, backfitted local linear regression is applied to improve the efficiency of group-specific function m k|K,0 (u) and g l (u, x l ). Theoretical results concerning the resultant estimations including the uniform convergence rate of the initial estimation, the consistency of the estimation of group number and partitions, both the uniform convergence rate and the asymptotic normality of the group-specific functions and the post-clustering estimation of additive component are also established. The rest of this article is organized as follows. In Section 2, the modified kernel estimation and the LQD transformation for density functions are introduced as the preliminary works. Section 3 presents the procedure for identification and estimation of latent group structures and additive components in the model. Theoretical results are included in Section 4. The Monte Carlo simulations are conducted to illustrate the efficiency of proposed procedure in Section 5. We also demonstrate the application of proposed method by analyzing the COVID-19 and GDP data in Section 6. A discussion follows in Section 7. Detailed proofs of theoretical results and additional numeric results are in the supplementary materials. In this section, some preliminaries required for the model will be presented. Let F be the class of univariate continuous probability density function z(u), with a common support S and satisfying R u 2 z(u)du < ∞. Without loss of generality, we take S = [0, 1]. For each z(x) ∈ F, let F (x) = x −∞ z(u)du, Q(u) be the corresponding quantile function, i.e, Q = F −1 , and q(u) be the quantile density function defined as the derivative of quantile function, i.e. q(u) = Q (u) = d du F −1 (u) for u ∈ [0, 1]. To map each z(u) ∈ F into the linear space L 2 ([0, 1]), we utilize the log quantile density transformation (Petersen and Muller, 2016) that is defined as A challenge for fitting the regression model with density response is that in real applications, z i (u), thus f i (u), is unobservable, and can only be estimated from the random sample generated by Without loss of generality, we assume T i = T . Due to the boundary effects of conventional kernel density estimators, Petersen and Muller (2016) proposed a modified kernel density estimatorẑ i (u) as follows, i.e. as u ∈ (1 − h, 1] and 1 otherwise. Bandwidth h < 1/2, K is of bounded variation and symmetric of 0, satisfying the conditions that 1 0 K(u)du > 0, R |u|K(u)du, R K 2 (u)du and R |u|K 2 (u)du are finite. Different from the conventional kernel density estimator, modified kernel density estimator z i (u) possesses the consistency property, viz., Based on the estimation of density functions, to fit model (1), we takeẑ i (u) as the substitution of z i (u) and denotef i (u) = Ψ(ẑ i )(u), then the model (1) is written as where is the random error resulting from the estimation of f i (u). In this section, we provide the methodology for identification of the latent group structures and estimation of additive components in the proposed model through a three-step procedure. The identification and estimation procedure is provided in Algorithm 1. Algorithm 1 Identifying Heterogeneity in Additive Model with Density Responses .., n. 5: HAC Algorithm: Estimated group sets index {Ĝ 1 , ...,Ĝ K }. • Begin with n groups, where each subject is a group. • Merge the two groups with smallest distance into one. • Recalculate the distance between new groups after each grouping. • Repeat the previous two steps until the number of groups achieves K. 6: Refined Estimation: Backfitted local linear estimationm k|K,0 (u),g ( u, x l ), k = 1, ..., K, l = 1, ..., p. Spline series approximating method is commonly used to estimate unknown nonparametric functions, with detailed practical guidance in De Boor (1978) and Stone (1994) . Let {B 1 (u), B 1 (u), ..., B N 0 +1 (u)} be the set of B-spline basis functions of order q with L 0 interior knots, where N 0 + 1 = L 0 + q. Let {B 1,l (x l ), ..., B N l +1,l (x l )} be the set of B-spline basis functions of order q for x l (l = 1, · · · , p) with L l interior knots, where N l + 1 = L l + q. Then, defined the normalized spline basis of x l (l = 1, · · · , p) as Based on the basis functions, define the tensor product of B-spline basis as The spline approximation of subject-specific and bivariate components are given by Denote g i (u, x) = g i,0 (u) + p l=1 g l (u, x l ), therefore, the model (1) can be written as f i (u) = g i (u, x) + ε i (u), and we can approximate g i (u, x) by Then, the corresponding estimations arê Based on the initial estimations of subject-specific functions (4), the classic hierarchical agglomerative clustering (HAC) algorithm is applied for identifying the latent groups {G 1 , ..., G K } given the group number K. To tackle this problem, we need a metric to measure the distance between pair of functions g i,0 (·) and g j,0 (·). For instance, Chen et al. (2019) used L 1 -distance to measure the similarity of the coefficient functions in nonlinear models. Vogt and Linton (2017) combined L 2distance with k-means procedure to obtain the group structure. Vogt and Linton (2018) developed the multiscale techniques based on L ∞ for clustering. Similarly as Cardot et al. (1999) , in this article we work with the general L q -distance which can be estimated asd Assume that the number of groups K is given. The HAC algorithm for finding out the latent group structures among the individual subject-specific functions can be summarized as follows. First, begin with n groups with each subject as a group. Second, calculate the distance matrix M n×n = (d i,j ), find the smallest element except for the main diagonal elements and merge the corresponding groups into one. Next, recalculate the distance between new groups and the distance matrix with reduced size after each grouping. The distance between two groups is defined as the furthest distance between any two functions with one in a group and the other one in another group. Finally, repeat the previous two steps until the number of groups achieves K. Given the true group membership, the model (1) can be written in group-structure form To estimate the group-specific functions m k|K,0 (u) and bivariate additive component functions g l (u, x l ), we apply the backfitted local linear algorithm proposed by Fan (1993) . Define By plugging in the initial estimationĝ l (u, x i,l ), we obtain the estimation of f c i,0 (u) aŝ For each given u t ∈ [0, 1], m k|K,0 (u t ) can be approximated by the first-order Talyor expansion Define the weighted squared function as where K(·) is a nonnegative kernel function with bandwidth h 0 andĜ k denotes the estimated membership of group k. Then by minimizing Q 0 (a k,0 , b k,0 ), the estimation of m k|K,0 (u) asm k|K,0 (u) = a k,0 is derived asm where With the estimation of group-specific functions, we can have the estimation of f c i,l (u, where I(·) is the indicator function. For each given u ∈ [0, 1] and X i,l ∈ [0, 1], we approximate g l (u, X i,l ) by Then the pointwise estimator of g l (u, x l ) asg l (u, x l ) =â l can be obtained by minimizing Since the pointwise estimation of g l (u, x l ) for each u ∈ [0, 1] may not be smooth due to the dependence on the estimation of density responses f i , then an additional local smoothing step is implemented in the direction of u to rectify the smoothness. To do so, for each u ∈ [0, 1], we define the following function where W(·) is a nonnegative kernel function with bandwidth h l,u . By minimizing Q l (a l , b l ), the refined estimator of g l (u, x l ) can be obtained as where w l,i = K( The identification and estimation of the latent group structures discussed above is based on the prior that the number of groups K is known, but it is not true for practical problem. A major challenge in clustering is to estimate the group when the number is unknown. Following Chen (2019), the information criterion is applied for the selection of group number. Denote and ρ is a tuning parameter whose value may rely on n. The number of latent groups is estimated by minimizing the criterion IC(K), i.e., whereK is pre-specified as the maximal number of groups. In the simulation studies, we take two choices of ρ into account. They are where nK = min{|ĜK|, k = 1, ...,K} and |Ĝ k | denotes the cardinality of the groupĜ k . They correspond to two information criterions, the generalized Bayesian information criterion (GBIC) with ρ = ρ 1 and the generalized Akaike information criterion(GAIC) with ρ = ρ 2 . Other information criterions can be found in Chen et al. (2019) and Chen (2019). In this article, the leave-one-out cross-validation method is implemented for the bandwidth selection. Let h = (h 0 , h l,u , h l,x ) τ , we select the bandwidth h given the number of latent group K by minimizing the following mean squared error whereĜ k is the estimated group membership. For each t = 1, ..., T ,m k|K,0 (u t ) and g (−t,h) l (u t , X i,l ) are the estimations with bandwidth h of m k|K,0 (u t ) and g l (u t , X i,l ) obtained by using observations except the t-th observation respectively. Throughout this paper, for any fixed interval [a, b], we denote the space of l-th order smooth function as C (l) [a, b] = {g|g (l) ∈ c[a, b]}, and the class of Lipschitz continuous functions for some fixed constant C > 0 as Lip ([a, b] Meanwhile, let S l and S x denote the support of x l and x, respectively. Obviously, S x = p l=1 S l . The necessary assumptions for the asymptotic results are listed as follows. (A1) For any z ∈ F, z is differentiable, and there exists a constant M > 1, such that ||z|| ∞ , ||1/z|| ∞ , ||z || ∞ are all bounded by M . (A2) (a) The kernel density K is Lipschitz-continuous, bounded and symmetric about 0. Furthermore, K ∈ Lip([−1, 1], L k ) for some constant L k > 0. (b) The kernel density K satisfies the conditions that 1 0 K(u)du > 0, R |u|K(u)du, R K 2 (u)du and R |u|K 2 (u)du are finite. The kernel density W also satisfies the above assumptions. (A3) (a) For each 1 ≤ i ≤ n, the process {x i , ε i (u)} is stationary and α-mixing dependent for u ∈ [0, 1], with the mixing coefficient decaying to zero at a geometric rate, i.e., there exists constants A < ∞ and β > 4, such that α(k) ≤ Ak −β for all k ≥ 1. (b) The covariate variables x i,l , 1 ≤ l ≤ p, and the errors ε i (u) satisfy the following moment conditions that for some s > 2, For each i = 1, ..., n, the covariance function Cov(ε i (s), ε i (t)|x i ) = Σ i (s, t) has finite non-decreasing eigenvalues λ 1 ≤ ... ≤ λ max , satisfying j λ j < ∞. (A4) The latent group functions m k|K,0 (·), 1 ≤ k ≤ K, have continuous second order derivatives on the support interval, namely, m k|K,0 (·) ∈ C (2) [0, 1], and m k|K,0 (·) ∈ Lip([0, 1], L 0 ) for some constant L 0 > 0. Meanwhile, the additive component functions g l (u, x l ), 1 ≤ l ≤ p are continuous functions on [0, 1] × [a l , b l ] and twice continuously partial differentiable with respective to u and x l , where [a l , b l ] is a compact subset of S l . (A5) (a) The density function of covariate x, f (x), is continuous and bounded, with continuous derivatives of marginal densities f l,u (x l ) at each u ∈ [0, 1]. (b) For u ∈ [0, 1] and x ∈ S x , the joint density of u and x, f (u, x), as well as the joint density of u and x l , f l (u, x l ), are continuous and partially differentiable with respect to u and x, with continuous second order partial derivatives. (A6) (a) Let δ = min 1≤k =l≤K min i∈G k ,j∈G l d i,j , where d i,j is the (i, j) element of distance matrix discussed before, then h 2 +(T h) −1/2 = o p (δ). (b) There exists a positive constant ξ, with 0 < ξ < 1, such thatmin 1≤k≤K |G k | ≥ ξ · n. (A7) N 0 ∼ T 1/5 , N l ∼ T 1/6 , h 0 ∼ (nT ) −1/5 , h l,u , h l,x ∼ n −1/5 , 1 ≤ l ≤ p, as n, T → ∞. Remark 1. Assumption (A1) is basic and essential to derive the consistency of densities after transformation. The conditions in (A2) on the kernel function K(·) are mild and can be satisfied by commonly used kernel functions such as uniform and Epanechnikov kernel. Assumption (A3) (a) relaxes the dependence of error process and covariates spaces to the α-mixing dependence which is one of the weakest dependence conditions. The moment conditions in (A3) (b) is crucial to derive the uniform convergence and other asymptotic properties based on the kernel function. The smoothness conditions of component functions in (A4) and (A5) are greatly relaxed. Assumption (A6) (a) indicates that δ can converge to zero at an appropriate rate, and (b) are useful in proving the consistency of estimated group numberK via information criterion proposed before. (A7) are common conditions applied in kernel smoothing to satisfy the optimal convergence rates. We first derive the uniform consistency of initial estimations of g i,0 (u) and g l (u, x i,l ) which is presented by Theorem 1. Theorem 1. Assume that (A1)∼(A4) and (A7) hold,ĝ i,0 (u) andĝ l (u, x i,l ) are the initial estimations of g i,0 (u) and g l (u, x i,l ) respectively, which are defined by (4), i = 1, ..., n, ,l = 1, ..., p. Then as T → ∞ and n → ∞, it holds that Theorem 2 and 3 claim that the number of groups K and the membership of group structures {G 1 , ..., G K } can be correctly identified with probability 1. Theorem 2. Assume that (A1)∼(A5) hold and the number of latent groups K is known. Denote {G 1 , ..., G K } the true group structure of g i,0 (·), i = 1, ..., n, and {Ĝ 1 , ...,Ĝ K } the corresponding estimation. Then as T → ∞, it holds that P ({Ĝ 1 , ...,Ĝ K } = {G 1 , ..., G K }) → 1. Theorem 3. Suppose that (A1)∼ (A7) hold. If K is the number of latent groups, andK is the estimation of K via information criterion, then P (K = K) → 1, as T → ∞. Theorem 4. Assume that (A1)∼(A7) hold and the number of latent groups K is known. Then, as n → ∞ and T → ∞ , it holds that (i) Theorem 4 characterizes the uniform convergence of the oracle post-clustering estimation of group-specific functions and additive components. Subsequently, to establish the asymptotical normality ofm k|K,0 (u) andg l (u, x l ), we define Theorem 5. Assume that (A1)∼(A7) hold.K is the estimation of the number of latent groups, then Then, as T n → ∞, it holds for all u ∈ (0, 1) and x l ∈ [0, 1], that (i) In this section, we conduct the simulation study to demonstrate the performance of the proposed identification and estimation procedure for the model (1). In this simulation, we consider the case that the group number K = 3 and covariates number p = 2. The regression model (1) with the latent group structures (2) is where µ i (u|X = x) = g i,0 (u) + g 1 (u, x i,1 ) + g 2 (u, x i,2 ), with the latent group structures m 3,0 (u) = 6[2u − 6u 2 + 4u 3 + 0.05], i ∈ G 3 , and the additive component functions g 1 (u, x 1 ) = sin(2πu)(2x 1 − 1), g 2 (u, x 2 ) = sin(2πu) sin(2πx 2 ), for u, x 1 , x 2 ∈ [0, 1]. The groups are defined as G 1 = {1, 2, ..., n 1 }, G 2 = {n 1 + 1, n 1 + 2, ..., n 1 + n 2 }, G 3 = {n 1 + n 2 + 1, n 1 + n 2 + 2, ..., n 1 + n 2 + n 3 }, and the cardinalities of each group is set to be n 1 = 0.3n, n 2 = 0.3n, n 3 = 0.4n. The covariates x i,1 , x i,2 are generated by normal random vectors with mean zero and covariance matrix Σ = 1 0.5 0.5 1 . The random error ε(u) = 1 sin(πu) + 2 sin(2πu), where 1 ∼ N (0, 0.1 2 ), 2 ∼ N (0, 0.05 2 ), and 1 are independent of 2 . We note that the conditional mean functions µ(u|X = x) are the log-quantile transformation of the random density z(u|x). More specifically, the inverse of log-quantile transformation is Thus the conditional distribution function F (·|x) and quantile function Q(·|x) satisfy To generate the observations of response, for each 1 ≤ i ≤ n, let u i,1 , ..., u i,T i ∼ Uniform(0, 1), which are independent of X i , the observations of f i at time points , where z i are the random response densities. Without loss of generality, we assume T i = T . We set the sample size n = 50, 100, the number of observations from each random density is T = 50, 100. For each setting, the simulation repeated 200 times. Figures 3 and A.1 depict the average performance of pre-and post-clustering estimations for group structure terms g i,0 (u) and bivariate additive components g l (u, x l ), respectively, through 200 Monte Carlo runs with sample sizes n = 100 and the number of time points T = 100. In each figure, the true functions, pre-clustering and post-clustering estimations are presented from the left to right panels. From Figure 3 we can see that although the pre-clustering estimations capture the basic shapes of true densities, but there exists certain differences between the true curves and pre-clustering estimations. The minimizer and maximizer for each estimated curve are close to the true points but the extreme values between estimation and true curve do exist. On the other hand, post-clustering estimations more accurately depict the true density curves since not only they catch the shape of curves but also the minimizer, maximizer and extreme values for each estimated curve are almost same as the true density curve. Similar patterns can be found from the estimations of bivariate additive functions in Figure A. 1, meaning the efficiency of proposed identification procedure under this setting. Denote C = {G 1 , ..., G K } the true clusters andĈ = {Ĝ 1 , ...,ĜK} the estimated clusters. To evaluate the performance of the clustering algorithm, we take two measures into account. One is the traditional measure for clustering, the purity, which is defined as The other one is the normalized mutual information (NMI) which is a common measure for the similarity between clusterings (Ke et al., 2015) . Here, we define NMI betweenĈ and the true clusters C, i.e., where I(Ĉ, C) is the mutual information betweenĈ and C which is defined as is the entropy ofĈ and H(C) is defined analogously. Since both Purity and NMI do not depend on the ordering of clusters, they are proper measures for the efficiency of the clustering algorithm. It is obvious that the closer the both values are to 1, the more efficient the algorithm is, namely, the closer the estimated clusters are to the true clusters. To evaluate the efficiency of estimation procedure, we compare the performance of three estimators. one is (ĝ i,0 (u),ĝ l (u, x l )), the pre-clustering estimator obtained without considering the group structures, second is (m k|K,0 (u),g l (u, x l )) with given K, the oracle estimator obtained by giving the number of true groups, and the last one is (m k|K,0 (u),g l (u, x l )), post-clustering obtained from data in each estimated group. The root mean squared errors (RMSEs) are used to examine the performance of the estimations. For instance, the RMSE of the pre-clustering estimator are defined as The RMSEs of the other estimators are defined analogously. Table 1 , 2 and A.1 present the results under different settings, which include account of the estimated value of K, average and standard deviation of NMIs, Purities and RMSEs of the estimations. Firstly, about the performance of the clustering algorithm, we can see that the overall performances of GAIC and GBIC are quite similar. Besides, under either GAIC or GBIC, the account thatK equals the true value K, NMI, and purity rises with the increase of n or T . When T = 100 and n = 100, the value K is one hundred percent truly estimated, and the NMI and purity is close to 1. Secondly, about the performance of the estimators, it is clear from Table 2 and A.1 that the mean and standard deviation of RMSE for all estimators decrease with the increase of sample size n and number of time points T . Under all scenarios, the oracle and post-clustering estimators outperform the pre-clustering estimators. The RMSEs of post-clustering estimators are quite similar as oracle estimators, and they are getting closer along with the increase of n and T . Especially when n = 100, T = 100, they are very close to each other. Moreover, the performance of post-clustering estimators under GAIC and GBIC criterions are almost same. In this section, we apply the proposed methodology to two social studies. As introduced in Section 1, we are interested in exploring the relationship between the trend of epidemic situation in each country and some socio-economic factors. The data set of COVID-19 consist of the number of deaths per day from January 22, 2020, to December 15, 2020, in 190 countries and regions, as obtained from the Coronavirus Resource Center at Johns Hopkins University (https://coronavirus.jhu.edu/map.html). Due to the staggered time at which the pandemic reached individual countries, we consider a time period of 240 days, where the time 0 is the earliest day on which at least 30 deaths were reported. We take the density of the daily mortality rate, defined as the ratio of deaths per day to the total population of each country, duration a period 240 days as the response. To present the effect of epidemic trend in each country on the overall global situation, we replace the original daily mortality rate with the relative one with respect to all countries during the period. To satisfy the requirements of the proposed method, we choose countries with the marginal distributions of covariates were compactly supported and bounded in a domain defined by the maximum and minimum of observations respectively. Then, these processes finally generate a sample of n = 149 countries. The latest updated data for year 2019 of six predictors as mentioned in Section 1 are collected from the World Bank (https://data.worldbank.org/indicator). Since the raw data are relative mortality rates over daily bins, smoothing method is implemented first to construct densities. To do so, we employ the modified local linear kernel smoothers proposed by Muller et al. (1997) to generate smoothing curves. The density responses z i , i = 1, ..., 149 are then estimated and depicted over time by Figure 1 . The obvious difference in the shapes of the densities allow us to impose the latent group fixed effect in the functional additive model: where f i (t) = Ψ(z i ) is the LQD transformation of density z i . We first use B-spline method to obtain the initial estimationĝ i,0 (t) andĝ l (t, x l ), and then utilize the HAC algorithm to classify g i,0 , i = 1, ..., n, with the number of clustering groups determined by the information criterion introduced in Section 3.4. Figure A. 2 displays the values of GAIC and GBIC under the condition of different group numbers, indicating that the optimal group number is four. The members in each group are listed in Table A.2. After clustering, the backfitted local linear method is employed to construct the refined estimationsm k|4,0 (t) andg l (t, x l ). The estimation of latent group structures are presented by Figure 4 , and the corresponding densities within each group are displayed by Figure 2 . Firstly, the trend in Group 4 is quite different with the other three, as the relative daily mortality rate increase over the time, meaning the greater proportion of mortality rate in those countries relative to the global situation during this time period. In Group 3, the daily mortality rate at the beginning is pretty high, but it dramatically declines over the time, viz., the epidemic in those countries is relatively well controlled. The trend in Group 1 also decreases over the time, but not so sharp as Group 3. Compared with the other three groups, the daily mortality rate in Group 2 fluctuates mildly. Except for the estimation of latent groups structure, we are also interested in the influence of the selected socio-economic variables on the daily mortality rate. To quantify the contribution of each individual component function, we utilize an empirical version of the fraction of variance explained (FVE) criterion (Han et al., 2020) . Specifically, the empirical FVE of the l-th covariate , with v i,0 = Ψ −1 (g i,0 (·)), andṽ i,l = Ψ −1 (g i,0 (·)+g l (·, x i,l )). Then, we can find the best model by backward elimination, removing the predictor with the smallest FVE among the included predictors at each step successively. The procedure stops when the mean squared error (MSE) increases after one predictor being removed, given by is the fitted density of z i in the d-th step, andṽ (0) i = Ψ −1 (g i,0 (·) + p l=1g l (·, x i,l )). The MSE and FVE will be both recalculated every time a predictor is deleted. We start the backward elimination procedure at p = 6, and the result shows that the variable 'aging', 'physicians' and 'GDP' are selected in the final model, with 'beds', 'nurses' and 'diabetes' being removed. Figure 5 demonstrates the effects of the three predictors 'aging', 'physicians' and 'GDP' by heap maps, with FVE 45.52%, 59.42% and 73.98%, respectively. The heat maps of 'physicians' illustrates that the influence of 'physicians' on the relative daily mortality rate changes over the time. Besides, its expressed modes are opposite for the small and large number of physicians per 1000 people. For the country with large number of physicians, the function gets to the maximum value at the early time and then minimum value at the later time, while for the country with small numbers, the pattern is opposite. Similar or opposite expressed mode can be found in the heat map of 'aging' or 'GDP'. In fact, it is not surprising to see the impacts of these predictors on the daily mortality rate. For one thing, countries with more sufficient medical supplies, more adequate medical staff, and higher domestic economic level tend to provide more effective medical treatment to reduce the incidence of death and get better control of the epidemic. For another, a handful of relevant researches reported that more various variants of the new corona virus with faster reproduction and higher infectiousness have been found in many countries and regions, which is more challenging for the immune system of the elderly. To evaluate the overall performance of the proposed estimations, we select three different countries from each group and draw the observed and fitted density curves in Figure A. 3. On the whole, the estimated densities can well fit the observed density curves. Meanwhile, we compare the RMSE of the pre-and the post-clustering estimators of the fitted densities, defined as RM SE(ṽ i ) = 1 2 } 1/2 , whereṽ i = Ψ −1 (g i,0 (·) + p l=1g l (·, x i,l )), g i,0 (·) andg l (·, x i,l ) are taken as pre-and post-clustering estimators respectively. The result shows that the RMSE of the pre-clustering estimations is 0.6972, while the one of the post-clustering is 0.3751, indicating that it is essential to consider the heterogeneity in the relative daily mortality rate of each country, and the identification and clustering method indeed improves the efficiency of proposed model on the analysis of COVID-19 data. GDP per capita is an effective tool to understand the macroeconomic operation of a country or region. It is one of the most important macroeconomic indicators and often used to measure the economic development. In this empirical application, we consider the model: where education i is the the literacy rate of the i-th country, namely, the percentage of educated population age 15 and above , population i is the total population, oriGDP i is the per capital GDP of the original year, avgGDP i is the average per capita GDP over 50 years. The data are obtained from the World Bank (https://data.worldbank.org) over the period 1970 to 2019. By deleting the countries with missing data, we finally have a sample of n = 123 countries each with T = 50 time points observations. With the similar procedure as Section 6.1, we obtain the estimated density of relative per capita GDP for each county to explore the economic level of each country relative to the world, which is shown in Figure 6 . It is clear that obvious difference exists in the curves among various countries. Following the proposed identification and estimation procedure, based on the initial estimationŝ g i,0 (t) andĝ l (t, x l ), the results of HAC algorithm with GBIC and GAIC shown in Figure A .4 indicate that g i,0 should be classified into three groups. The memberships in each estimated group are presented in Table A .3. Finally, post-clustering estimationsm k|3,0 (t) and corresponding densities within each group are displayed in Figure 7 . First of all, relative per capita GDP in Group 3 increases over time, showing the promotion of economic influence on global development, while Group 1 has the opposite trend of decrease, indicating the decline of status in global economy. Compared with the other two groups, the effect of development in Group 2 is relatively stable. Once again backward elimination procedure with FVE criterion is implemented for model selection. The variable 'oriGDP', the per capita GDP of the original year, is removed and three predictors are left for the final additive model-'education', 'population' and 'average GDP'. Figure 8 illustrates the impacts of these predictors via heat map, with FVE 50.75%, 15.41% and 30.42%, respectively. The heat map of 'education' illustrated the influence on the relative per capita GDP over the time. For the country with high literacy rate, the function gets to the maximum value Figure 7 : The estimation of latent group structuresm k|3,0 (t) (the first row) and the densities of relative per capita GDP in each group (the second row). at the early time and then minimum value at the later time. For the country with low rate, the function presents opposite trend. Meanwhile, the other two predictors shows similar or opposite patterns. To evaluate the overall performance of the proposed estimations, we select three different countries from each group and draw the observed and fitted density curves in Figure A .5. It is clear from the figure that on the whole, the estimated densities can well fit the observed density curves. Moreover, the RMSE of the pre-clustering estimations is 0.6385, while the one of the post-clustering estimations is 0.2971, indicating the necessary to consider the heterogeneity and effectiveness of the identification and clustering method of proposed model on the analysis of per capita GDP data. With the abundance of complex data, the nature of heterogeneity and homogeneity of individual effects existing simultaneously in the data becomes very common. On the other hand, dealing with the task of analyzing complex data that are non-Euclidean and specifically do not lie in a vector or functional space for example, the widely used density functions are increasingly popular. To address these issues, we consider the extension of a distributional data response additive model proposed by Han et al. (2020) with heterogeneous sub populations in which the response is a distributional density function and the individual effect curves are homogeneous within a group but heterogeneous across groups, the covariates capture the common variation and share the common additive bivariate functions across subgroups. Based on the pioneer work by Petersen and Muller (2016) , a transformation approach is first applied to map density functions into a linear space. We then take the B-spline series approximating method to estimate the unknown subject-specific and additive bivariate components. The latent group structures are identified by the well known hierarchical agglomerative clustering (HAC) algorithm. We show that our method is able to identify the true latent group structures with probability approaching one. We further construct the backfitted local linear estimators for both grouped individual effect curves and the additive bivariate functions for the post-grouping model in order to improve the efficiency of the initial estimators. The asymptotic properties of the resultant estimators including the convergence rates, asymptotic distributions and the post-grouping oracle efficiency are established. The performance of the identification and clustering method is illustrated by simulation studies and two real data analysis, presenting the efficiency and validity of proposed model compared with the regular additive functional regression model without considering heterogeneity. Many challenging problems can be addressed in future researches. In the past decades, the volume of data increases exponentially and often exceeds the available computational resources. For the massive data, functional data analysis techniques under the conditions of limited and finite samples are no longer applicable, and how to identify the latent group structures under such circumstances is a big issue for us. Meanwhile, in this paper, we consider the identification for latent group structures with only scalar covariates. In many empirical applications, functional covariates may be more preferable to present the varying correlations between predictors and responses, see (Cheng et al., 2016; Luo et al., 2016; Li et al., 2017) . For this scenario, novel methods should be developed to involve both scalar and functional covariates. In addition, density functions are transformed into a linear function space via a continuous and invertible map due to the constraints for further modeling in this paper. Recently, some nice work has been done with modeling the density function response directly instead of any other transformations. See, for example, Talská et al. (2018) ; ; Chen et al. (2020) ; Jeon and Park (2020) . More relevant researches concerning this may be pursued as well in the future. Functional linear model Estimating latent group structures in time-varying coefficient panel data models Nonparametric homogeneity pursuit in functionalcoefficient models. Working paper Wasserstein regression Efficient estimation in semivarying coefficient models for longitudinal/clustered data A practical guide to splines Local linear regression smoothers and their minimax efficiencies Functional sliced inverse regression analysis Smoothed functional inverse regression Methodology and convergence rates for functional linear regression Additive functional regression for densities as responses Functional linear regression via canonical analysis Minimax adaptive tests for the functional linear model Additive regression with hilbertian response. The Annals of Statistics Functional single index models for longitudinal data Homogeneity pursuit Forecasting of density functions with an application to cross-sectional and intraday returns A functional varying-coefficient single-index model for functional response data Single-index varying coefficient model for functional responses The historical functional linear model From lifetables to hazard rates: the transformation approach Quantifying and visualizing intraregional connectivity in resting-state functional magnetic resonance imaging with correlation densities Functional data analysis for density functions by transformation to a hilbert space Fréchet regression for random objects with euclidean predictors When the data are functions Demarcate: Density-based magnetic resonance image clustering for assessing tumor heterogeneity in cancer Forecasting density function: Application in finance The use of polynomial splines and their tensor products in multivariate function estimation Identifying latent structures in panel data Sieve estimation of time-varying panel data models with latent structures Compositional regression with functional response Classification of nonparametric regression functions in longitudinal data models Multiscale clustering of nonparametric regression curves Functional linear regression analysis for longitudinal data Multivariate varying coefficient model for functional responses