key: cord-0270903-1dghrsuw authors: Salas, Lucas A; Zhang, Ze; Koestler, Devin C; Butler, Rondi A; Hansen, Helen M; Molinaro, Annette M; Wiencke, John K; Kelsey, Karl T; Christensen, Brock C title: Enhanced cell deconvolution of peripheral blood using DNA methylation for high-resolution immune profiling date: 2021-04-12 journal: bioRxiv DOI: 10.1101/2021.04.11.439377 sha: 17d51cf8ca75b8513cd9cd42fb6ae8390b8816c1 doc_id: 270903 cord_uid: 1dghrsuw DNA methylation microarrays can be employed to interrogate cell-type composition in complex tissues. Here, we expand reference-based deconvolution of blood DNA methylation to include 12 leukocyte subtypes (neutrophils, eosinophils, basophils, monocytes, B cells, CD4+ and CD8+ naïve and memory cells, natural killer, and T regulatory cells). Including derived variables, our method provides up to 56 immune profile variables. The IDOL (IDentifying Optimal Libraries) algorithm was used to identify libraries for deconvolution of DNA methylation data both for current and retrospective platforms. The accuracy of deconvolution estimates obtained using our enhanced libraries was validated using artificial mixtures, and whole-blood DNA methylation with known cellular composition from flow cytometry. We applied our libraries to deconvolve cancer, aging, and autoimmune disease datasets. In conclusion, these libraries enable a detailed representation of immune-cell profiles in blood using only DNA and facilitate a standardized, thorough investigation of the immune system in human health and disease. Advances in DNA methylation microarrays have allowed a greater understanding of how DNA 41 methylation is affected by environmental exposures and altered in chronic diseases. 1,2 Peripheral blood 42 is one of the most common biological matrices for those investigations. Blood DNA methylation profiles 43 include information from multiple cell lineages and, in some cases, cell states. Every cell lineage has 44 unique DNA methylation patterning regulating cell-specific gene expression, and some methods 45 leverage DNA methylation to understand underlying cell heterogeneity 1-7 . The reference-based 46 approach relies on the notion that the principal source of signal variability in a heterogeneous sample 47 (such as blood) reflects the signals' proportions in the different cell components 8 . Constrained 48 projection/quadratic programming (CP/QP) employs purified cell types as reference samples to generate 49 a reference library of differentially methylated sites among cell types and yields highly accurate 50 estimates of the underlying cell composition in mixed cell populations (e.g., peripheral blood) 9 . 51 Previously established statistical deconvolution frameworks such as CP/QP, support vector machine 52 (CIBERSORT), and robust partial regression (EpiDISH) have similar accuracy and precision in 53 deconvolution estimates 10 . Marker selection methods for library creation use automatic procedures to 54 discern library markers 11 or iterative approaches for selecting sets of markers (IDOL) that maximize the 55 accuracy of deconvolution estimates 12 . To enhance the utility of cell-type deconvolution, reference 56 library improvements and expansions of existing libraries to include additional cell types are needed to 57 broaden the scope of DNA methylation-based immune phenotyping. 13-16 58 Continued methodological advancements are highly dependent on the quality and genome coverage of 59 a reference library. In the original description of CP/QP for methylation-based deconvolution, Houseman 60 et al. developed a library based on an early microarray platform, the Illumina HumanMethylation27k 61 microarray 9 . When the Illumina HumanMethylation450k technology was released, Jaffe et al. applied 62 the Houseman method with the reference data developed by Reinius et al. 11, 17 . They accurately 63 sources of variability were assessed, including known SNPs tracing genetic ancestry (Supplementary Fig. 112 2). We excluded probes potentially tracking to polymorphisms, cross-reactive areas, or CpHs, probes 113 tracking to sex chromosomes, and those whose signal intensities were equal or below to the background 114 probes (see Online Methods for details). After filtering, 675,992 high-quality probes were retained for 115 analysis and deconvolution library construction. The first 20 principal components showed that the main 116 sources of variability corresponded to the cell-type identity and the slide beadchip, and not sex, age, or 117 other phenotype variables (Supplementary Fig. 3) . 118 We next applied the IDOL algorithm to the 56 samples collected across all 12 cell types using the 119 675,992 high-quality CpGs to select the optimal library for accurate deconvolution of these 12 cell types. 120 To test our library's performance, we used artificial mixtures (n=12) of DNA from purified cell types 121 representing varying proportions of the 12 cell-types in the IDOL analysis and measured DNA 122 methylation in these samples (Figure 2a, Supplementary Table 2 ). Using a discovery selection pool 123 totaling 3,535 CpGs representing candidate markers of differentially methylated CpGs across the 124 interrogated cell-types, we implemented the IDOL algorithm with 500 iterations to select a set of 125 libraries ranging in size from 250 to 3000 CpGs, in increments of 50 CpGs (detailed in the methods). The 126 coefficient of determination (R 2 ) and the root mean square error (RMSE) were calculated based on a 127 comparison of deconvolution estimates obtained using each library versus the known proportions of the 128 12 cell types in the artificial mixture samples (Supplementary Table 3 ). The optimal library, EPIC IDOL-129 Ext, consisted of 1,200 CpGs and was observed to have an average R 2 of 1 across all cell types and the 130 lowest average RMSE (0.226) among the interrogated library sizes. The training and testing datasets 131 showed high R 2 and low RMSE across the different interrogated cell-types (Figure 2b) . Probes in the EPIC 132 IDOL-Ext library tracked mostly to open sea regions of low CpG density (76%), followed by north and 133 south CpG island shore regions (14%) (see Table 1 ). The vast majority of probes were tracked to DNAse 134 hypersensitivity sites-DHS (73%). Notably, the library probes were also highly distinct as only 4% 135 overlapped with the EPIC IDOL-6 19 . 136 As only 459 (38%) of the 1,200 probes in the EPIC IDOL-Ext library are common to both the EPIC and 137 450k array platforms, we developed a library for the legacy IlluminaHumanMethylation450k array 138 platform, repeating the above-described optimization process after constraining the selection pool for 139 the candidate list of CpGs to those only present on the 450k array. The resulting library, 450k IDOL-Ext, 140 contains 1,500 probes; the library details are included in Table 1 . Consistent with a different distribution 141 of genomic contexts for probes on the 450k array, the 1,500 probe library had a lower proportion of 142 open sea (52%) and enhancer-related probes (5%). In total, 330 probes were shared between the 450k 143 Finally, we compared the EPIC IDOL-Ext, and 450k IDOL-Ext versus the pickCompProbes EPIC library 145 obtained using functions in the minfi Bioconductor package 11 . The pickCompProbes automatic selection 146 method builds the library picking the top 50 most hyper-and hypomethylated CpGs per cell-type, 147 totaling 1,200 probes (same size as our EPIC IDOL-Ext library), summarized in Table 1 . The overlap of 148 probes from the pickCompProbes library with the EPIC IDOL-Ext library was only 147 probes (12%), 149 though the probes' genomic context distribution was similar between libraries. When analyzing the cell-150 type by cell-type estimation performance across all three libraries, there was consistency in monocytes, 151 neutrophils, and NK cells. However, more variability was observed for estimates obtained from the 152 library derived by pickCompProbes. When assessing the deconvolution accuracy of the pickCompProbes 153 library, the RMSE was severely biased for eosinophils and T-cell subtypes. Specifically, CD4 and CD8 154 naïve T-cell distributions were biased with an underrepresentation of CD8nv (RMSE: 5.81%) and the 155 overestimation of CD4nv (RMSE: 9.25%). The pickCompProbes library also had unreliable results for 156 CD4mem vs. Treg compartments and Eos, Supplementary Fig. 4 . In contrast, both the EPIC IDOL-Ext and 157 450k IDOL-Ext libraries were highly accurate in the training and testing datasets, Figure 2 and 158 Supplementary Fig. 4 . The heatmaps summarizing the markers in the three libraries are shown in Figure 159 3. The complete set of markers information is available as Supplementary Tables 4 (EPIC IDOL-Ext), 5 160 (450k IDOL-Ext), and 6 (pickCompProbes). 161 The EPIC IDOL-Ext library was validated using samples with blood cell counts from flow cytometry (FCM) 162 and by using an independent set of artificial mixtures from the Gene Expression Omnibus (GEO) (Figure 163 4). In healthy adult samples (Figure 4a , GSE110530), we observed a strong correlation between the 164 deconvolution estimates and FCM measurements, with a maximum root mean square error-RMSE of 165 3.29% for CD8T cells. Using independent artificial mixtures (Figure 4b , GSE110554), the correlation was 166 close to 1, and the maximum RMSE was 1.67 for the neutrophils (granulocytes). The 450k IDOL-Ext 167 library was also validated using the GSE77797 with FCM (Figure 4c ) and artificial mixture information 168 (Figure 4d) . 169 As a proof-of-concept, we explored how the estimates could help reconstruct cell counts using FCM 170 information (Supplementary Fig. 5 ). Additional validation sets were analyzed, including glioma patient 171 blood and the cytometric information for the Reinius dataset 17 . A group of 72 glioma patients was used 172 to evaluate the information from T-cell subsets. For this, cytometric information was obtained using a 173 two-stage characterization of T cell subtypes. In the first stage, T-cells were measured and separated 174 into CD4+ and CD8+. Next, using a second tube, samples were characterized as naïve or memory. 175 Although the samples were measured in the same subjects, the proportions were estimated based on 176 two independent measurements and were mathematically derived based on both experiments' average 177 counts. We observed the second-largest difference for the CD8mem with an RSME of 5.44% and the 178 largest difference with the unmeasured remainder cells (RMSE of 12.48%) (Supplementary Fig. 6a) . The 179 450k IDOL-Ext was also validated using the GSE35069 datasets (Reinius) with FCM information 180 ( Supplementary Fig. 6b) . We observed considerable variation (~10%) for the granulocytes, monocytes, 181 and CD4T cells in this dataset 17 . Differences were more prominent for the PBMC samples' observed 182 values and in two of the six WBC samples, particularly for the granulocytes reported. 183 Finally, we explored whether these libraries could be applied for the deconvolution of umbilical cord 184 blood samples. We used the GSE68456 dataset (450k) with known FCM information for seven cell-types, 185 including the erythroblasts (nucleated red blood cells). As this subtype was not present in our library, 186 the fraction was added to the myeloid granulocyte components for comparison. This fraction showed 187 the largest variability in this dataset with an RMSE of 15.57% (Supplementary Fig. 7a ). In contrast, when 188 using umbilical cord blood artificial mixtures excluding the nucleated red blood cells, the largest RMSE 189 was observed for granulocytes being less than 4% (Supplementary Fig. 7b) . to control blood samples (n=355) (Wilcoxon rank-sum P <0.01, Supplementary Fig. 9 ). Predicted 204 immune cell proportions in breast cancer patients before and after receiving chemotherapy or a 205 combination of chemo/radiation therapy were explored. Patients receiving radiation therapy only (n=74) 206 showed a significant relative increase of the neutrophil proportion and a mirror decrease of several 207 lymphoid lineages (memory B cell, naïve B cell, naïve CD4 cell, and NK) after treatment (Wilcoxon rank-208 sum P <0.01, Supplementary Fig. 10a ). Patients receiving radiation therapy and chemotherapy (n=70) 209 exhibited significant increases in eosinophil, monocyte, and regulatory T cell proportions and decreases 210 of memory B cells after treatment (Wilcoxon rank-sum P <0.01, Supplementary Fig. 10b) . Finally, we 211 evaluated the changes in immune cell proportions in six COVID-19 patients with and without remission 212 versus six healthy controls in Supplementary Fig. 11 . Because of the limited sample size, no statistically 213 significant differences were observed, but, as expected, the median of neutrophils was higher in 214 patients versus controls. In contrast, all the median lymphocyte subpopulations were lower in the 215 infected patient than those who recovered from the disease. The median monocytes were lower in 216 those that remitted. 217 We investigated blood methylation data associated with subject-to-subject variation in non-pathological 218 conditions using data from monozygotic versus dizygotic twins and subjects at different ages 219 (Supplementary Table 8 ). To characterize immune cell variation between twins, predicted immune cell 220 proportions in monozygotic twins (n=852) and dizygotic twins (n=612) were estimated. Significant 221 differences in all the immune cells were observed in monozygotic twins and dizygotic twins (Paired t-test 222 P<0.01). Larger differences between dizygotic twins compared to monozygotic twins were observed in 223 memory B cell, naïve B cell, memory CD4 cell, naïve CD4 cell, memory CD8 cell, naïve CD8 cell, 224 eosinophil, monocyte, and NK (Wilcoxon rank-sum P <0.01, Supplementary Fig. 12) The inclusion of six additional cell-types and, in particular, of the naïve compartments is a major step 266 towards a universal deconvolution library applied to cord or adult blood data. Particularly intriguing for 267 umbilical cord blood, nucleated red blood cells interfered with more precise estimations than FCM 268 information, yet artificial mixtures showed the library capturing leukocyte components accurately. As our library. This is to observe how homogeneous they were using Jaffe's procedure 11 . The corresponding 336 cell-type proportion was retrieved and designated as "DNA methylation purity." We only included in this 337 library samples with DNA methylation purity levels higher than 85% (range 85.7% to 100%). Next, we 338 applied a stringent out-of-band p-detection value (pOOBHA) > 0.05 and set those that could not be 339 distinguished from the background probes as missing values. Then we selected only those probes with 340 complete information for all the samples in the library. No imputation was performed in this context as 341 the signals could be dependent on the specific cell-type. Further, as we only observed a minimal 342 variation in probes tracing to non-CpG or CpH probes, all the CpH beta values were under 0.08, they 343 were filtered. Finally, as both females and male samples were present, we discarded probes tracking the 344 X and Y chromosomes. According to Zhou et al., those that showed known polymorphisms or cross-345 reactivity were also excluded. 40 . Our set for library discovery included 675,992 complete high-quality 346 probes.The EPIC IDOL-Ext L-DMR library is available in GitHub 347 (https://github.com/immunomethylomics/FlowSorted.BloodExtended.EPIC), and will be available 348 through Bioconductor with the name FlowSorted.BloodExtended.EPIC. The extended blood 349 deconvolution can be performed using the FlowSorted.Blood.EPIC Bioconductor library. The package 350 contains an RGChannelSet R object processed using SeSaMe in which probes showing channel switching 351 were corrected and SNPs derived from Infinium Type I probes were added using the total signal 352 intensities. The object is unfiltered and contains 56 samples and the 12 artificial mixtures information on 353 1,008,711 probes corresponding to 866,091 CpGs using the latest annotation released by Illumina 354 (MethylationEPIC_v-1-0_B5). The reader needs to note that the cells were purified using an 355 immunomagnetic procedure; the name "FlowSorted" is kept for historical reasons and downstream 356 integration with previous minfi pipelines and similar algorithms. 357 For a complete description of the IDOL algorithm, please refer to the original application in Koestler et 359 al. 12 . In brief, the IDOL algorithm utilizes a training dataset (ground truth) consisting of samples with 360 DNA methylation data in which the measured fraction of each of the underlying cell types is known 361 (here corresponding to artificial mixtures with pre-specified proportions) as a means to identify a set of 362 probes confirming an optimal reference library for cell mixture deconvolution. A series of t-tests 363 compared the mean CpG-specific methylation between each leukocyte cell-type vs. the mean 364 methylation across all the other cell-types identified the probes discriminating CpGs (e.g., leukocyte 365 differential methylated regions or L-DMRs) for each specific cell-type of the 12 included in this 366 application. CpGs were then rank-ordered using their t-statistics, and the L/2 CpGs with the largest and 367 smallest t-statistic for each K cell type were identified and pooled. The tuning parameter L was set to L= 368 150 in our application, consistent with Koestler et al. 12 144 peripheral blood samples from breast cancer patients before and after receiving chemotherapy and 407 radiation or isolated radiation therapy treatment, and six COVID-19 patients (with 18 samples) and six 408 healthy controls. Illumina Infinium DNA methylation IDAT files were retrieved from GEO and 409 ArrayExpress for application data sets. SeSAMe package from Bioconductor was used to process the 410 data. Four thousand eight hundred and seventy-two samples in total were eventually contained in the 411 application data sets. We estimated the proportion of immune cells in those data sets using the 412 appropriate extended immune cell deconvolution library for the array (EPIC or 450k). Aging and environmental exposures alter tissue-specific DNA methylation 430 dependent upon CpG island context A cell-type deconvolution meta-analysis of whole blood EWAS reveals lineage-432 specific smoking-associated DNA methylation changes Methylation-derived neutrophil-to-lymphocyte ratio and lung cancer risk in 434 heavy smokers. Cancer Prevention Research canprevres DNA Methylation-Derived Neutrophil-to-Lymphocyte Ratio: An Epigenetic 437 Tool to Explore Cancer Inflammation and Outcomes. Cancer epidemiology, biomarkers & 438 prevention : a publication of the Immunomethylomics: A Novel Cancer Risk Prediction Tool. Annals of 441 the Immunomethylomic approach to explore the blood neutrophil lymphocyte 443 ratio (NLR) in glioma survival Cell-type deconvolution in epigenome-wide association studies: 445 a review and recommendations A comparison of reference-based 447 algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies DNA methylation arrays as surrogate measures of cell mixture distribution Identification of differentially 452 methylated cell types in epigenome-wide association studies Accounting for cellular heterogeneity is critical in epigenome-wide 455 association studies Improving cell mixture deconvolution by identifying optimal DNA 457 methylation libraries (IDOL) Enlarged leukocyte referent libraries can explain additional variance in blood-based 459 epigenome-wide association studies Estimation of Cell-Type Composition Including T and B Cell Subtypes for Whole 461 Blood Methylation Microarray Data DNA methylation age of human tissues and cell types Training a model for estimating leukocyte composition using whole-blood DNA 465 methylation and cell counts as reference Differential DNA methylation in purified human blood cells: Implications for 467 cell lineage and studies on disease susceptibility Differential methylation between ethnic sub-groups reflects the effect of 469 genetic ancestry and environmental exposures An optimized library for reference-based deconvolution of whole-blood 471 biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray DNA methylation of cord blood cell types: Applications for mixed cell birth 474 studies Cell type specific DNA methylation in cord blood: A 450K-reference data set and 476 cell count-based validation of estimated cell type composition Characterizing the hypomethylated DNA 478 methylation profile of nucleated red blood cells from cord blood Cell type-specific DNA methylation in neonatal cord tissue and cord blood: a 850K-481 reference panel and comparison of cell types Systematic evaluation and validation of reference and library selection methods 483 for deconvolution of cord blood DNA methylation data RNA-Seq Signatures Normalized by mRNA Abundance Allow Absolute 485 Deconvolution of Human Immune Cell Types Grb2 regulates B-cell 487 maturation, B-cell memory responses and inhibits B-cell Ca2+ signalling The immunoglobulin tail tyrosine motif upgrades memory-type BCRs by 490 incorporating a Grb2-Btk signalling module Mutations in CHD7 in patients with CHARGE syndrome cause T-B + natural 492 killer cell + severe combined immune deficiency and may cause Omenn-like syndrome Chemokine Receptor CCR5 on Virus Epitope-Specific Memory and Effector CD8 + T Cells The stromal cell antigen CD248 (endosialin) is expressed on naive CD8+ human 498 T cells and regulates proliferation Role of eosinophil peroxidase in the origins of protein 500 oxidation in asthma Lysosomal trafficking regulator Lyst links membrane trafficking to toll-like 502 receptor-mediated inflammatory responses KIR2DL4 (CD158d) Genotype 505 Influences Expression and Function in NK Cells CTLA4/PKCη signaling pathway for suppressing tumor immunity Deciphering the differentiation trajectory 510 from hematopoietic stem cells to mast cells A single-cell hematopoietic landscape resolves 8 lineage trajectories and 512 defects in Kit mutant mice Minfi: A flexible and comprehensive Bioconductor package for the analysis of 514 Infinium DNA methylation microarrays ENmix: a novel background correction method for Illumina 516 SeSAMe: Reducing artifactual detection of DNA 518 methylation by Infinium BeadChips in genomic deletions Comprehensive characterization, annotation and innovative use 520 of Infinium DNA methylation BeadChip probes missMethyl: an R package for analyzing data from 522 Illumina's HumanMethylation450 platform Molecular signatures database (MSigDB) 3.0