title: Comparison of Canonical Correlation and Partial Least Squares analyses of simulated and empirical data
authors: McIntosh, Anthony R
date: 2021-07-14

In this paper, we compared the general forms of CCA and PLS on three simulated and two empirical datasets, all having large sample sizes. We took successively smaller subsamples of these data to evaluate sensitivity, reliability, and reproducibility. In null data having no correlation within or between blocks, both methods showed equivalent false positive rates across sample sizes. Both methods also showed equivalent detection in data with weak but reliable effects until sample sizes dropped below n=50. In the case of strong effects, both methods showed similar performance unless the correlations of items within one data block were high. For PLS, the results were reproducible across sample sizes for strong effects, except at the smallest sample sizes. In contrast, the reproducibility for CCA declined when the within-block correlations were high. This was ameliorated if a principal components analysis (PCA) was performed first and the component scores used to calculate the cross-block matrix. The outcome of our examination gives three messages. First, for data with reasonable within- and between-block structure, CCA and PLS give comparable results. Second, high correlations within either block can compromise the reliability of CCA results. This known issue of CCA can be remedied with PCA before the cross-block calculation, although this assumes that the PCA structure is stable for a given sample. Third, null hypothesis testing does not guarantee that the results are reproducible, even with large sample sizes. This final outcome suggests that both statistical significance and reproducibility be assessed for any data.

With the trend toward the collection of large datasets, there has been a parallel increase in interest in multivariate statistical methods to extract the complicated patterns that relate data features. In the neuroimaging domain, data sets such as the UKBioBank, Imagene and Enigma, and Human Connectome Project provide a formidable challenge in the diversity of data, spanning genetic, demographic, and various phenotypic measures, and neuroimaging data on brain structure and function. One goal for these initiatives is to identify the dimensions in one data set that predict or explain dimensions in the other - for example, how does brain structure relate to lifestyle and cognitive measures? While univariate methods can be used for a general first impression, they are typically underpowered and cannot take advantage of the added sensitivity of extracting combinations of variables that link data sets [1, 2]. For example, a distributed pattern of brain activity may be correlated with a battery of neuropsychological measures but not easily discernible when each brain area or neuropsychological measure is considered alone. These combinations of variables are generally referred to as Latent Variables (LV), reflecting the fact that they represent a dimension in the data that is inferred from a collection of measures (e.g., 'memory' is inferred from a collection of memory tests; the default mode network is inferred from a composite of core regions). The first multivariate method focused on the extraction of LVs relating two data blocks is canonical correlation analysis (CCA).
Introduced by Hotelling [3], the method extracts LVs that maximize the correlation between data blocks, X and Y. Each LV contains weights for the variables within-block that indicate the relative contribution of each variable to the LV. In the most basic form, the LVs are mutually orthogonal, with each accounting for successively less of the correlation between data sets. The weights for the variables within each LV are adjusted for the mutual intra-block correlations, thus indicating the unique contribution each variable makes to a given LV. In this way, CCA minimizes redundancy across variables within blocks.

A more recent method, called Partial Least Squares (PLS), has a similar goal to CCA in deriving LVs that reflect the relationship between two data blocks. The original version of the method was developed by Wold [4], with links to the multi-table methods of Tucker [5]. There have been two popular versions, PLS-R and PLS-C, where PLS-R is focused primarily on directed predictions of X on Y, while PLS-C is non-directed and most similar to CCA [6]. Thus we will focus on PLS-C for this paper. PLS-C operates by optimizing the covariance between data tables, extracting mutually orthogonal LVs that account for maximum covariance. The weights for the variables within each LV reflect the contribution of each variable to the LV without adjusting for the mutual intra-block correlations, which is an important difference from CCA.

CCA and PLS have been applied to neuroimaging datasets [6-10]. There have been some attempts to compare the performance of the two methods, with somewhat uncertain conclusions [11, 12]. Part of the inconsistency seems to stem from how the two methods are implemented, whether the data examples are empirical or simulated, and the assumptions for the simulations. Our goal here is to focus on the most general forms of CCA and PLS-C, using empirical and simulated data to characterize where the two methods agree and where they depart. The code for all analyses is also open, allowing the interested reader to evaluate our results and extend them. We start by reviewing the theory behind both methods and then the general methods for the analysis and assessment of significance and reliability. We begin with a null data set having no reliable cross-block relations, then two empirical datasets with differing magnitudes of cross-block relations. We finish with another simulated dataset with many variables in one block, which is meant to capture the case in Big Data sets such as neuroimaging or genetics. This final dataset provides clear evidence of divergence and convergence of PLS and CCA.

As noted above, CCA is one of the most general forms of multivariate analysis and is the core of the General Linear Model (GLM). Let us assume there are two data blocks in matrices X and Y, where both conform to a multivariate normal distribution. The rows of these matrices have n observations, and the columns have p variables for X and q variables for Y. The within-block correlation matrices are calculated as:

Rxx = X^t * X / (n-1)    (1)
Ryy = Y^t * Y / (n-1)    (2)

where matrices X and Y are centred and scaled by their column-wise standard deviation (i.e., z-score transform). Superscript "t" is the matrix transpose, "-1" is the matrix inverse, and '*' is matrix multiplication. The cross-block correlation matrices Rxy and Ryx are:

Rxy = X^t * Y / (n-1)    (3)
Ryx = Rxy^t              (4)

One implicit assumption is that X and Y contain some underlying relationships among variables such that the correlation matrices Rxx and Ryy are not identity.
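As an illustration of how these correlation blocks are formed, a minimal numpy sketch follows. This is not the paper's released analysis code; the function and variable names are placeholders chosen here for illustration, and the formulas follow equations (1)-(3) above.

```python
import numpy as np

def correlation_blocks(X, Y):
    """Illustrative only (not the paper's code): z-score each block and
    return the within- and cross-block correlation matrices of eqs. (1)-(3)."""
    n = X.shape[0]
    # z-score transform: centre and scale by the column-wise standard deviation
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    Rxx = Xz.T @ Xz / (n - 1)   # p x p within-block correlations, eq. (1)
    Ryy = Yz.T @ Yz / (n - 1)   # q x q within-block correlations, eq. (2)
    Rxy = Xz.T @ Yz / (n - 1)   # p x q cross-block correlations, eq. (3)
    return Rxx, Ryy, Rxy
```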
The second, more explicit, assumption is that the cross-block correlation Rxy exists and is the focus of the statistical test. For CCA, before the decomposition of the cross-block matrix is done, it is adjusted for the within-block correlations:

Ω = Rxx^(-1/2) * Rxy * Ryy^(-1/2)    (5)

Most often, Ω is used for singular value decomposition (SVD):

Ω = U * S * V^t    (6)

The diagonal matrix S contains singular values that are the canonical correlations for each LV. Matrices U and V contain the singular vectors that correspond to the weights for each variable in X and Y, respectively. The first columns of each have the weights for the first LV, with the corresponding first element of S having the canonical correlation for that LV. Subsequent LVs will have lower canonical correlations, accounting for increasingly less of Ω. (Note that Ω is sometimes calculated as Rxx^(-1) * Rxy * Ryy^(-1); in this case, the singular values of the decomposition are the squared canonical correlations.)

For PLS, the main difference is that the decomposition is performed on Rxy directly without adjustment:

Rxy = U * S * V^t    (7)

In this case, the singular values in matrix S are covariances rather than correlations, even though Rxy may contain correlations. This is because the scaling of Rxy for CCA is equivalent to the scaling that happens in the conversion of a covariance to a correlation. To illustrate, consider the case where X and Y are vectors that are mean-centred only; the covariance Cxy is:

Cxy = X^t * Y / (n-1)    (8)

and the respective covariances of X and Y alone (i.e., the variances) are:

Cxx = X^t * X / (n-1),  Cyy = Y^t * Y / (n-1)    (9)

and the correlation Rxy is:

Rxy = Cxy / (Cxx^(1/2) * Cyy^(1/2))    (10)

which is similar to the calculation done for Ω in equation (5). Hence, a correlation coefficient is a scaled covariance. This reinforces the point that CCA optimizes the correlation between X and Y, whereas PLS optimizes the covariance, because of the differences in scaling the cross-block matrix rather than the contents within.

A second, and more important, point is that the cross-block scaling limits the data structure that CCA can handle. If either Rxx or Ryy is rank deficient, the calculation in equation (5) is not possible. In a less severe case, when there are high correlations among variables within X or Y, the scaling in equation (5) will alter Rxy so that the CCA and PLS results diverge.

If there are high correlations within-block, one can calculate the canonical structure coefficients, which are the zero-order correlations of each item within-block with the canonical variate. This is done by first calculating the standardized canonical weights and then rescaling by the within-block correlation. For the X matrix, the standardized canonical weights, A, would be:

A = Rxx^(-1/2) * U    (11)

and the canonical structure coefficients SA:

SA = Rxx * A    (12)

When there are high within-block correlations, A and SA will diverge; hence it is common that both are presented in CCA publications [13]. It is also the case that SA will be more similar to the PLS weights in U for the same Rxy matrix.

In CCA, the challenge of high within-block correlations can also be addressed by performing a dimensionality reduction with Principal Components Analysis (PCA) on one or both data blocks first and using the component scores as input data. The component scores are mutually orthogonal, and thus Rxx and/or Ryy would be identity (I). The operation in equation (5) is then effectively dividing Rxy by one (an identity matrix), so the differences between CCA and PLS are reduced or eliminated. The use of PCA/ICA as a dimensionality reduction step prior to statistical testing has been a long-standing practice [14-17], so it will be evaluated here.
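To make the difference between the two decompositions concrete, here is a minimal numpy sketch. It is not the paper's released code; the function and symbol names are illustrative only, and the steps follow equations (5)-(7) and (11)-(12) as written above.

```python
import numpy as np

def _inv_sqrt(R, tol=1e-10):
    """Inverse square root of a symmetric correlation matrix via its
    eigendecomposition; fails when R is (near) rank deficient, mirroring
    the condition under which equation (5) is not possible."""
    vals, vecs = np.linalg.eigh(R)
    if vals.min() < tol:
        raise np.linalg.LinAlgError("within-block correlation matrix is rank deficient")
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def pls_svd(Rxy):
    """PLS-C: decompose the cross-block matrix directly (eq. 7);
    the singular values s are cross-block covariances."""
    U, s, Vt = np.linalg.svd(Rxy, full_matrices=False)
    return U, s, Vt.T

def cca_svd(Rxx, Ryy, Rxy):
    """CCA: adjust the cross-block matrix for the within-block correlations
    before the SVD (eqs. 5-6); the singular values s are canonical correlations."""
    Rxx_isqrt = _inv_sqrt(Rxx)
    omega = Rxx_isqrt @ Rxy @ _inv_sqrt(Ryy)
    U, s, Vt = np.linalg.svd(omega, full_matrices=False)
    A = Rxx_isqrt @ U    # standardized canonical weights, eq. (11)
    SA = Rxx @ A         # canonical structure coefficients, eq. (12)
    return A, SA, s, Vt.T
```

Written this way, the PCA remedy is easy to see: if the inputs are component scores, Rxx and Ryy are identity, the adjustment in equation (5) changes nothing, and the two functions return the same singular vectors and values.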
The problem of high within-block correlations for CCA is ubiquitous in GLM approaches to statistical analysis, where the parameter estimate for a given variable is typically adjusted to reflect its unique contribution over and above the variance shared with other variables. Whether this is the desired outcome depends on the theoretical orientation of the investigator. As we shall see below, the issue can also affect the reliability of parameters in CCA.

Our simulations assessed the following features of CCA and PLS, all evaluated as a function of sample size:

1) False positive rate
2) Detection of an effect
3) Reproducibility of the covariances or canonical correlations (i.e., singular values S) and the singular vector weights, assessed both against the original population and within a given sample size
4) The effects of analysis using PCA scores rather than the original variables

For each simulation, the following was done. First, standard CCA and PLS-C were run on the full data, and statistical assessment was done both with parametric statistics for CCA and with nonparametric resampling statistics (permutations and bootstrap), as described below. This analysis established the "population" ground truth. Next, random subsamples were drawn from the population 500 times, and CCA and PLS were run for each subsample, along with the nonparametric statistical assessments. The sample sizes were 500, 250, 100, 50, and 20. In cases where n