A mammalian methylation array for profiling methylation levels at conserved sequences


A mammalian methylation array for profiling methylation levels at conserved sequences 

Adriana Arneson 1,2, Amin Haghani 3, Michael J. Thompson 4, Matteo Pellegrini 4, Soo Bin 

Kwon1,2, Ha Vu1,2, Caesar Z. Li 5, Ake T. Lu 3, Bret Barnes 6, Kasper D. Hansen 7,8, Wanding 

Zhou 9, Charles E. Breeze 10, Jason Ernst 1,2,11-15#, Steve Horvath 3,5# 

Affiliations 

1 Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095, 

USA  

2 Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, 

California, USA; 

3 Dept. of Human Genetics, David Geffen School of Medicine, University of California Los 

Angeles, Los Angeles, CA 90095, USA; 

4 Molecular, Cell and Developmental Biology, University of California Los Angeles, Los Angeles, 

CA 90095, USA; 

5 Dept. of Biostatistics, Fielding School of Public Health, University of California Los Angeles, 

Los Angeles, CA 90095, USA; 

6 Illumina, Inc, 5200 Illumina Way, San Diego, CA 92122, USA; 

7 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, 

Maryland, USA; 

8 Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, 

USA; 

9 Van Andel Research Institute, Grand Rapids, Michigan, USA; 

10 Altius Institute for Biomedical Sciences, Seattle, WA, USA; 

11 Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at University 

of California, Los Angeles, Los Angeles, California, USA; 

12 Computer Science Department, University of California, Los Angeles, Los Angeles, California, 

USA; 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


2 
 

13 Department of Computational Medicine, University of California, Los Angeles, Los Angeles, 

California, USA. 

14 Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, 

California, USA; 

15 Molecular Biology Institute, University of California, Los Angeles, Los Angeles, California, USA. 

 
# Joint senior authorship 

Correspondence: shorvath@mednet.ucla.edu and jason.ernst@ucla.edu 

 
SUMMARY 

Infinium methylation arrays are widely used to robustly measure methylation of DNA in humans. 

However, such arrays are not available for the vast majority of non-human mammals. Moreover, 

even if species-specific arrays were available, probe differences between them would confound 

cross-species comparisons. To address these challenges, we developed the Mammalian 

Methylation Array, a single custom Infinium array that measures cytosine methylation levels of 

over 35 thousand CpG sites that are well conserved across species within the mammalian class. 

By design, the probes on the array tolerate cross-species mutations. To design the array, we 

developed the Conserved Methylation Array Probe Selector (CMAPS) algorithm, which takes as 

input a multi-species sequence alignment and probe design constraints. A greedy search 

algorithm was used to identify oligonucleotide sequences (probes) with high coverage across 

different mammalian species. We annotate the probes on the array with respect to genes in 159 

different species and provide details on the sequence context including CpG island status and 

chromatin states. Our calibration experiments demonstrate the high fidelity of this array in 

humans, rats, and mice. The mammalian methylation array has several strengths: it applies to all 

mammalian species even those that have not yet been sequenced, it provides deep coverage of 

specific cytosines facilitating the development of highly robust epigenetic biomarkers, and it 

covers highly conserved CpGs which greatly increases the probability that biological insights 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


3 
 

gained in one species will readily translate to others. The mammalian methylation array is 

expected to find many applications in preclinical studies, comparative biology, and epigenetic 

studies of aging and development. 

 
Introduction 

Methylation of DNA by the attachment of a methyl group to cytosines is one of the most widely 

studied epigenetic modifications in vertebrates, due to its implications in regulating gene 

expression across many biological processes including disease (Ooi et al., 2007; Robertson, 

2005; Smith and Meissner, 2013). A variety of different assays have been proposed for measuring 

DNA methylation including microarray based methylation arrays (Bibikova et al., 2009, 2011) and 

sequencing based assays such as whole genome bisulfite sequencing (WGBS)(Cokus et al., 

2008; Lister et al., 2009) and reduced representation bisulfite sequencing (RRBS)(Meissner et 

al., 2005). Despite the availability of sequencing based assays, array based technology remains 

widely used for measuring DNA methylation due to its low-cost and high reproducibility and 

reliability(Pidsley et al., 2016).  

The first human methylation array (Illumina Infinium 27K) was introduced by Illumina Inc 

in 2009 (Bibikova et al., 2009), which were followed by the 450K(Bibikova et al., 2011) and EPIC 

arrays with larger coverage(Pidsley et al., 2016). More recently, Illumina released a mouse 

methylation array (Infinium Mouse Methylation BeadChip) that profiles over 285k markers across 

diverse murine strains. It will probably not be economical to develop similar methylation arrays for 

less frequently studied mammalian species (e.g. elephants or marine mammals) due to 

insufficient demand. Moreover, even if costs were no impediment, species-specific arrays would 

likely be sub-optimal in comparative studies across different species as the measurement 

platforms would be different.  

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


4 
 

To address these challenges, we developed a single mammalian methylation array 

designed to be used to measure DNA methylation across mammals. The array targets CpGs for 

which the CpG and flanking sequence are highly conserved across many mammals so that the 

methylation of many of these CpGs can be measured in each mammal. The design repurposes 

the degenerate base technology (originally used by Illumina Infinium probes to tolerate within-

human variation) to tolerate cross-species mutations across mammalian species. To select the 

specific probe sequences including tolerated mutations that appear on the array we developed 

the Conserved Methylation Array Probe Selector (CMAPS). CMAPS takes as input a multiple 

sequence alignment to a reference genome and a set of probe design constraints, and selects a 

set of probe sequences including tolerated mutations, which can be used to query methylation in 

many species. We apply CMAPS to select over 35 thousand CpGs for the mammalian methylation 

array, which we complemented with close to two thousand known human biomarker CpGs. We 

characterize the CpGs on the mammalian methylation array with various genomic annotations. 

Further, we use calibration data to evaluate the fidelity of individual probes in humans, mice, and 

rats. CMAPS has led to the design of the mammalian methylation array, which will facilitate the 

study of cytosine methylation at conserved loci across all mammal species.  

Results 

Designing the Mammalian Methylation Array 

The CMAPS algorithm is designed to select a set of Illumina Infinium array probes such 

that for a target set of species many probes are expected to work in each species (Methods). 

Array probes are sequences of length 50bp flanking a target CpG based on the human reference 

genome. Selecting sequences present in the human reference genome increases the likelihood 

that measurements in other species will transfer to human. The mammalian methylation array 

adapts the degenerate base technology for tolerating human SNPs so that probes can tolerate a 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


5 
 

limited number of cross-species mutations. The CMAPS algorithm is provided as input a multiple-

species sequence alignment to a reference genome. CMAP uses these inputs to then select the 

CpGs to target on the array. As part of selecting the CpGs, CMAP also selects the probe 

sequence design to target them including the specific set of degenerate bases. For designing the 

mammal methylation array, CMAPS was applied to the subset of 62 mammals within a 100-way 

alignment of 99 vertebrate genomes with human genome(Haeussler et al., 2019), but we note the 

CMAPS method is general. 

In designing a probe for a CpG, CMAPS considers multiple different options. One option 

is the type of probe. Illumina’s current methylation array technology allows up to two types of 

probes: Infinium I and Infinium II. The latter is newer technology requiring only one silica bead to 

query the methylation of a CpG, while the former requires two beads. By only requiring one bead 

Infinium II probes allow under fixed array capacity limits interrogating more CpGs, though Infinium 

I probes are better able to query CpGs in CpG rich regions (Bibikova et al., 2011). Another option 

for each of these two types of probes is whether the probe is on the forward or reverse genomic 

strand, giving four total combinations of options for probe type and strand for each CpG. In 

addition, CMAPS has options for the position and nucleotides identity of tolerated mutation across 

correspond to degenerate bases. The array degenerate base technology allows for potentially up 

to three degenerate bases per probe sequence, which are positions that can be designed to 

tolerate variation in the sequence being interrogated. For some probes fewer than three 

degenerate bases could be designed, which was determined based on a design score computed 

by Illumina for each probe and in the case of Infinium II probes also the number of CpGs within 

the probe sequence. CMAPS uses a greedy algorithm to select the tolerated mutations for each 

combination of probe type and strand. The algorithm aims to maximize the number of species in 

the alignment the probe is expected to work based on just local alignment information that is 

without considering how uniquely mappable the probe is across the genome. A probe for a CpG 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


6 
 

is expected to work in a non-human species based on local alignment information if there are no 

differences in the alignment between the human genome sequence and the other species 

excluding those accounted for by the probe’s degenerate bases (Figure 1a, Methods). For each 

CpG site in the human genome, CMAPS retained for further consideration the Infinium I probe 

out of the two options (forward or reverse of the CpG) which had the greater number of species 

for which the probe was expected to work, and likewise for Infinium II.  

We next applied a series of rules to identify a reduced subset of candidate probes. First, 

we included all 36,133 Infinium II probes that were expected to work in mouse (based on the 

mm10 genome), which maximizes the expected array utility for one of the most widely used model 

organisms. For the remaining set of CpG not selected in the previous step, we sorted them in 

descending order of the number of species for which an Infinium II probe was expected to work. 

We then added the top 16,867 CpG sites for a total of 53,000 CpG sites. Next, we ranked the 

CpGs targeted on the Illumina EPIC array (Pidsley et al., 2016) in descending order of the number 

of species for which a probe targeting the CpG is expected to work. For this the probe was 

required to be of the same probe type and strand as on the EPIC array, but used the degenerate 

bases picked by the CMAPS algorithm. The probe was allowed to differ in terms of degenerate 

base positions, as EPIC probes typically do not account for degenerate bases across species. 

For this we selected the top 3,000 CpG sites ranked sites that had not already been picked based 

on the earlier criteria. 

Lastly, we sorted the CpG sites in descending order of number of species they can target 

and picked the top 4,000 CpGs targeted by Infinium I probes that had not already been included. 

The Infinium I probes were selected to allow querying CpG dense regions such as CpG islands, 

as CpGs do not count towards the limited number of positions of variation as for Infinium II probes. 

This resulted in a set targeting 60,000 CpGs (Figure 1b).  

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


7 
 

For some of these 60,000 CpGs, the sequence of the probe targeting it can map to multiple 

locations in a genome, which could result in a confounded signal coming from multiple CpG sites. 

This issue is compounded by individual probes corresponding to multiple sequences reflecting 

different possible combinations of the degenerate bases. To identify a subset of probes less 

susceptible to such confounders, for 16 high quality genomes, we computed for each probe how 

many of its versions map uniquely in that genome (see Methods). We then filtered CpGs down 

by requiring all versions of a probe targeting it map uniquely in at least 80% of the species they 

are expected to target out of the 16 high quality genomes, unless the probe is expected to target 

at least 40 mammals from the alignment, in which case the mapping criterion was discarded. This 

reduced the set of candidate CpGs to 35,989 CpGs.  

We added probes targeting 1986 CpGs to the mammalian methylation array based on 

their utility for human biomarker studies (Supplementary Data). These probes, which were 

previously implemented in human Illumina Infinium arrays (EPIC, 450K, 27K), were selected due 

to their utility for human biomarker studies estimating age, blood cell counts, or the proportion of 

neurons in brain tissue(Guintivano et al., 2013; Hannum et al., 2013; Horvath, 2013; Horvath and 

Levine, 2015; Horvath et al., 2018; Houseman et al., 2012; Levine et al., 2018). The final 

manufactured mammalian methylation array measures cytosine levels of 37,492 cytosines: 

37,488 of these cytosines are followed by a guanine (CpGs) and 4 are followed by another 

nucleotide (non-CpGs). The probe identifiers (cg numbers) of 86 of these cytosines ends with 

either ".1" or ".2", i.e. these are duplicate probes for 43 genomic locations.  

 
A detailed analysis of the Infinium probe context of the mammalian array and relation to human 

and mouse arrays is presented in Supplementary Figure S1. The mammalian methylation array 

focus on highly conserved regions led to a an array that is distinct from other currently available 

Infinium arrays that focus on specific species. For example, the mammalian array only shares 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


8 
 

3107 probes with the Illumina MouseMethylation array and only 7111 CpGs with the Illumina EPIC 

array. 

 
Mappability analysis 

All 37488 CpGs profiled on the mammalian methylation array apply to humans, but only a 

subset of these CpGs applies to other species. When conducting analyses in a specific species 

it can thus be desirable to restrict analyses to the subset of CpG that apply in that species. One 

approach for doing this is simply omit CpGs whose detection p-values from normalization 

methods (Methods) are insignificant. This approach has the advantage of being applicable to 

species that have not yet been sequenced. Mapping sequences to genomes has the added 

benefit of providing a candidate position of the sequence in the target genome from which other 

information about the CpG can be inferred such as the nearest gene or CpG island status. 

We have mapped the array CpGs to 159 species, which also provides a candidate position 

from which a gene for the CpG can be associated. As expected, the closer a species is to humans, 

the more CpGs map to the genome of this species. Over 30k CpGs on the array map to most 

placental mammalian genomes (eutherians, Figure 2a, Supplementary Data). Roughly 15K 

CpGs map to most non-placental mammalian genomes (marsupials), such as kangaroos or 

opossums. Far fewer CpGs map to egg laying mammalian genomes (monotremes), such as 

platypus (Figure 2). A CpG that is adjacent to a given gene in humans may not map to a position 

adjacent the corresponding (orthologous) gene in another species. Between 15k to 22k CpGs 

(over 70%) were assigned to human orthologous species based on their mapped position in most 

phylogenetic orders (rodents, bats, carnivores, Figure 2b,c and Supplementary Data). These 

numbers surrounding orthologous genes are probably overly conservative (i.e. lower than the true 

numbers) because we found the majority of CpGs (about 58%) that do not map to orthologous 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


9 
 

genes in the non-human species are located in intergenic regions outside of promoters (Methods), 

which suggests that one of the gene assignments was inaccurate.  

 
Chromosome and gene region coverage of array 

We analyzed the chromosome and gene region coverage of the mammalian methylation 

array for human and mouse. The mammalian methylation has substantial coverage of all 

chromosomes (human, 235-3938; and mouse, 687-3179 probes per chromosome), with the 

exception of chrY that only has 2 probes in both species (Supplementary Figure S2a). When we 

assign the probes to the closest gene neighbor, around 80% of the probes are proximal to a gene 

in both of these species (Supplementary Figure S2b). The remaining 20% of probes are neither 

aligned to a promoter nor a gene body. The distribution of gene region and the distances to 

transcriptional start sites are comparable between human and mouse (Supplementary Figure 

S2b). CpGs on the mammalian array cover 6871 human and 5659 mouse genes when each 

CpGs is assigned uniquely to its closest gene neighbor (Supplementary Figure S2c). The gene 

coverage is uneven: while on average a gene is covered by 2 CpGs some genes are covered by 

as many as 150 CpGs. In mouse, 73% of CpGs (21,664) were assigned to a human orthologous 

genes (Supplementary Figure S2d), suggesting many CpG measurements from the array in 

mice will be informative to humans (and vice versa). 

 
Gene sets represented in mammalian array 

We analyzed gene set enrichments of all genes that are represented on the mammalian 

array using GREAT(McLean et al., 2010). Significant gene sets covered implicated gene sets that 

were found to play a role in development, growth, transcriptional regulation, metabolism, cancer, 

mortality, aging, and survival (Supplementary Figure S3). We also used the TissueEnrich(Jain 

and Tuteja, 2019) software to analyze gene expression (Methods). The majority of mammalian 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


10 
 

methylation array probes (~65%) are adjacent to genes that are expressed in all considered 

human and mouse tissue (Supplementary Figure S4a,b). However, the mammalian array also 

contains CpGs that are adjacent to genes that are expressed in a tissue-specific manner, notably 

testis and cerebral cortex (Supplementary Figure S4c).  

 
CpG island and methylation status 

 We analyzed the CpG island and DNA methylation properties of CpGs on the mammalian 

array. In general, an average of 5563 (18%) of probes in the mammalian array are located in CpG 

island depending on the species (Figure 3a). We used a CpG island detection algorithm (gCluster 

software (Li et al., 2020)) that additionally provided several species-level quantitative measures 

for each CpG island including the length, GC content, and CpG density that we provide as a 

resource (Supplementary Data). We also analyzed the DNA methylation levels in human for 

fractional methylation called from whole genome bisulfite sequencing data across 37 human 

tissues(Roadmap Epigenomics Consortium et al., 2015)5 (Supplementary Figure 5). This 

confirmed that the mammalian methylation array target CpGs across a wide range of fractional 

methylation levels. 

 
Chromatin state annotation of array probes 

We analyzed the overlap of human CpG’s targeted on the mammal methylation array with 

chromatin states for 127 cell and tissues. The CpGs cover all available chromatin states including 

different types of promoters (including bivalent promoters), regions repressed by polycomb group 

proteins, transcription start and end site, and enhancer regions (Figure 3b). Among enhancers, 

CpG’s had greater overlap with brain and neurosphere than other tissue groups. In addition to 

analyzing the array CpG’s overlap for cell and tissue specific chromatin states, we also analyzed 

them for a universal chromatin state annotation, which provides a single annotation to the genome 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


11 
 

per position based on data from more than 100 cell and tissue types (Vu and Ernst, 2020) 

(Supplementary Figure S6). This revealed the greatest enrichment for bivalent promoter states 

and also strong enrichment for other promoter states and a state associated with polycomb 

repression.  

While the mammalian methylation array was specifically designed to profile CpGs in highly 

conserved stretches of DNA based on sequence conservation, we assessed whether there was 

also evidence of conservation at the functional genomics level using human-mouse LECIF scores 

(Kwon and Ernst, 2020). The human-mouse LECIF quantifies evidence of conservation between 

human and mouse at the functional genomics level using chromatin state and other functional 

genomic annotations. In general, probes on the array had higher LECIF score than regions that 

align between human and mouse in general (Figure 3c).  

Mammalian array study of calibration data 

To validate the accuracy of the mammalian methylation array we applied it to synthetic 

DNA methylation samples for three species: human (n=10 arrays), mouse (n=20), and rat (n=15), 

where the methylation levels were known. The DNA samples from human, mouse and rat were 

engineered such that the fractional methylation at all CpG sites in their genomes approximately 

0%, 25%, 50%, 75% and 100% (Methods). The calibration data thus allow us to define a 

benchmark annotation measure “ProportionMethylated” (with ordinal values 0, 0.25, 0.5, 0.75, 1). 

The distribution of the intensity of the probes in each human sample is roughly centered around 

the benchmark measure (ProportionMethylated) (Figure 4a). However, as expected, the 

distributions in the mouse and rat samples of all the probes show somewhat different patterns in 

these two species compared to the human samples likely because many probes in the design of 

our array do not map to these genomes (Figure 4b-c). We also evaluate these for each species 

after removing the probes that were not designed to map to that species, and normalizing the 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


12 
 

array data using the SeSaMe package, which defines beta (relative intensity) values for each 

probe (Zhou et al., 2018). After this procedure, we see sharper peaks close to 0 and 1, though 

the quantification of absolute methylation levels are somewhat degraded around the beta value 

0.75 as we move away from humans (Figure 4d-f).  

Additionally, for each species, DNA methylation levels of each CpG we computed the 

correlation with the benchmark variable "ProportionMethylated" across the arrays. High positive 

correlations would be evidence for the accuracy of the array, which is indeed what we observe. 

CpGs that map to the human, mouse, and rat genome have a median Pearson correlation of 

r=0.986 with an interquartile range of [0.96,0.99], r=0.959 with IQR=[0.92,0.98], and r=0.956 with 

IQR=[0.91,0.98] with the benchmark variable ProportionMethylated in the respective species. The 

numbers of CpGs on the mammalian array that pass a given correlation threshold (irrespective of 

the mappability to a given species) are reported in Table 1. We also compare the SeSaMe 

normalization with the "noob" normalization that is implemented in the minfi R package (Aryee et 

al., 2014; Triche et al., 2013) (Table 1). We find that SeSaMe slightly outperforms minfi when it 

comes to the number of CpGs that exceed a given correlation threshold with 

ProportionMethylated.  

Comparison with the human EPIC methylation array study in calibration data 

We compared the mammalian methylation to the human EPIC methylation array, which 

profiles 866k CpGs in the human genome, for non-human samples. Some of the EPIC array 

probes are expected to apply to the mouse and rat genomes as well (Needhamsen et al., 2017). 

To facilitate a comparison between the mammalian methylation array and the human EPIC array 

for non-human samples we applied the latter to calibration data from mouse (n=15 arrays) and 

rat (n=10). The same engineered DNA data methylation data were analyzed on the human EPIC 

array as on the mammalian methylation array above. In particular, we were able to correlate each 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


13 
 

CpG on the EPIC array with a benchmark measure (ProportionMethylated) in mice and rats 

(Table 1). Only 2356 (out of 866k) CpGs on the human EPIC exceed a correlation of 0.90 with 

ProportionMethylated in mice. By contrast, 24050 CpGs on the mammalian array exceed the 

same correlation threshold in mice. Similarly, the mammalian array outperforms the EPIC array 

in rats: only 6159 CpGs on the EPIC array exceed a correlation of 0.90 with ProportionMethylated 

compared with 22427 CpGs on the mammalian array. The results are similar for the correlation 

thresholds of 0.85 and 0.95 (Table 1). 

The EPIC array contains 5574 CpGs that were also prioritized by the CMAPS algorithm 

based on high levels of conservation, excluding the 1986 CpGs from human biomarker studies. 

Out of these 5574 shared CpGs, 4341 and 3948 CpGs map to the mouse and rat genome, 

respectively. While human EPIC probes target the same CpG, the corresponding mammalian 

probe is typically different from EPIC probe due to differences in probe type (type I versus type II 

probe), DNA strand, or the handling of mutations across species degenerate bass. In the following 

comparison, we limited the analysis to the 4341 and 3948 probes when analyzing calibration data 

from mice or rats, respectively. We find that the mammalian array probes are better calibrated 

than the corresponding EPIC array probes when applied to mouse and rat calibration data 

according to two different analysis that focus on shared CpGs between the two platforms. First, 

the mammalian array outperforms the EPIC array when considering mean methylation levels 

across the shared CpGs (Figure 5). Second, when correlating each of the shared CpGs with the 

benchmark value ProportionMethylated we observe median correlation of 0.72 for both mice and 

rat calibration data generated on the EPIC array. For the same probes we observe median 

correlations of 0.94 and 0.93 for mice and rat calibration data generated on the mammalian array 

(SeSaMe normalization), respectively. We are distributing the methylation data and results from 

our calibration data analysis in three species (Supplementary Data). These calibration results will 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


14 
 

allow users to select cytosines whose methylation have a high correlation with the benchmark 

data in human, mice or rat. 

DISCUSSION  

The mammalian methylation array, which was enabled by the CMAPS algorithm for 

selecting conserved probes, is applicable to all mammals and hence drives down the cost per 

chip through economies of scale. The mammalian methylation array has unique strengths: it 

applies to all mammalian species even those that have not yet been sequenced, it provides deep 

coverage of specific cytosines which is a prerequisite for developing robust epigenetic 

biomarkers, and its focus on highly conserved CpGs increases the chances that findings in one 

species will translate to those in another species. We expect that the mammalian methylation 

array is particularly well suited for DNA methylation based biomarker studies in mammals. 

Our calibration data demonstrate that the array largely leads to high quality measurements 

in three species: human, mouse and rat. Our calibration data shows that the mammalian 

methylation array greatly outperforms the human EPIC chip when it comes to high fidelity 

measurement applications to mice and rats. The array thus should be preferable for most non-

human applications unless high-fidelity measurements are not needed in which case the larger 

content of the EPIC array may make it preferable.  

The mammalian methylation array has several limitations. First, only a fraction of genes 

in a given species are represented by CpGs. Second, it focuses on CpGs in highly conserved 

stretches of DNA and hence does not cover parts that are specific to a given species. Third, it 

provides worse coverage in more distal species, particularly in marsupials than in placental 

mammals (eutherians). Finally, the calibration data suggests there are some shifts in the absolute 

methylation levels detected for intermediate methylation levels, but the relative order is preserved. 

The correct relative ordering of beta values is of primary importance in most statistical tests and 

analyses. 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


15 
 

Several software tools have been adapted for use with the mammalian methylation array 

that range from normalization to higher level gene enrichment analysis. Software tools for 

generating normalized data include SeSaMe and the minfi R package (Aryee et al., 2014; Zhou 

et al., 2018). The eFORGE software (Breeze et al., 2019), which has been adapted for the use 

with the mammalian array, facilitates chromatin state analysis and transcription factor binding site 

analysis. Many researchers will be interested in genome coordinates of the mammalian CpGs in 

different species. Toward this end, we provide genome coordinates in 159 species. This list of 

species will increase as more high quality genomes become available. Detailed gene annotations 

in many species are available including details on gene region (e.g. exon, promoter, 5 prime 

untranslated region) and CpG island status (Supplementary Data). For human and mice we 

provide chromatin state annotations (Ernst and Kellis, 2012; Gorkin et al., 2020; Roadmap 

Epigenomics Consortium et al., 2015; Vu and Ernst, 2020) and the LECIF score on evidence of 

conservation at the functional genomics level between human and mouse(Kwon and Ernst, 2020).  

In other articles, we will describe the application of the mammalian methylation array to 

many different mammalian species. These upcoming studies will demonstrate that the 

mammalian methylation array is useful for many applications that involve mammalian species. 

 
Methods 

Conserved Methylation Array Probe Selector (CMAPS) 

Given a multi-species sequence alignment and reference genome, for each CG site and each of 

the four different possible probe designs, CMAPS computes an estimate of the number of species 

from the alignment that could be targeted if the use of degenerate base technology is optimized 

for tolerated mutations. The four probe designs involve each combination of probe type (Infinium 

I vs. Infinium II), and whether the probe sequence is on the forward or reverse DNA strand. For 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


16 
 

each probe option, CMAPS conducts a greedy search to select tolerated mutations, including 

position and allele that maximize species coverage for the probe. The maximum number of 

degenerate bases that can be included in a probe is a function of a design score provided by 

Illumina Inc. For Infinium II probes only, CpGs present in the probe sequence count as if they are 

a degenerate base. More specifically, the algorithm for determining the number of species and 

selecting the mutations to handle performs the following steps for each probe design: 

1. Let M be the maximum number of degenerate bases that can be designed into a specific 

probe, based on the design score 

2. For each species s in the alignment, let Ms be the number of mismatches in the alignment 

between that species and the human reference sequence of the probe 

a. If Ms > M or the species does not have the target CpG, continue to next species 

b. If Ms <= M,  

i. For each mismatch in species s, add each degenerate position to a multiset 

P  

ii. add the species to a set F of feasible species to target with this probe 

3. For all |P| choose M combinations of possible degenerate positions: 

a. For each unique position in the combination  

i. For each possible alternate nucleotide count the number of species in F 

that contain that alternate nucleotide 

ii. Pick the top k alternate nucleotides based on the count in i., where k is the 

number of occurrences of the current position in S 

b. Compute the number of species that match the human reference when accounting 

for the degenerate substitutions handled in a 

4. Select the combination of positions in S that maximizes 3.b 

 
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


17 
 

Our procedure for selecting the specific targeted CpG and probe designs are described in the 

main text. We note that 30 of the CpGs selected for the mammalian methylation array based on 

the conservation criteria (using the sequence alignment) overlap with the 1986 human biomarker 

CpGs. The design of the probes targeting them could differ however. The probe names of different 

probes targeting the same CpG are distinguished by extensions ".1" and ".2". For example 

cg00350702.1 and cg00350702.2 target the same cytosine but use different probe chemistry. The 

array contains four probes that measure cytosines that are not followed by a guanine selected by 

human biomarkers, which are indicated with a "ch" instead of a "cg".  

The CMAPS algorithm was applied with human hg19 as the reference genome and using 

the Multiz alignment of 99 vertebrates with the hg19 human genome downloaded from the UCSC 

Genome Browser (Haeussler et al., 2019; Rosenbloom et al., 2015). For the purpose of designing 

the mammalian array, only the 62 mammalian species in this alignment were considered and 16 

for the mappability analysis. However, the current version of the mappability analysis provides 

genome coordinates for 159 species. 

The mammalian methylation array includes an additional 62 human SNP markers (whose 

probe names start with "rs" for human studies), which can be used to detect plate map errors 

when dealing with multiple tissue samples collected from the same person. Finally, the 

mammalian array also adopted a standard suite of probes from the Illumina EPIC array for 

measuring bisulfite conversion efficiency in humans.  

 
Mapping probes to genomic coordinates 

We used two different approaches for mapping probes to genomes. The first approach (BSbolt 

software) was primarily used in designing the array. Subsequently, we adopted a second 

mappability approach (QUASR software) that allowed us to map more probes to more species.  

Mappability Approach 1: BSbolt 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


18 
 

For version 1 of our mappability analysis (i.e. for designing the array), we applied the BSbolt 

mapping approach to 16 high quality genomes from: Baboon (papHam1), Cat (felCat5), Chimp 

(panTro4), Cow (bosTau7), Dog(canFam3), Gibbon(nomLeu3), Green Monkey (chlSab1), Horse, 

(equCab2), Human (hg19), Macacque (macFas5), Marmoset(calJac3), Mouse (mm10), Rabbit 

(oryCun2), Rat (rn5), Rhesus Monkey (rheMac3), Sheep (oviAri3). 

We utilized the BSBolt software (Farrell et al., 2020) package from 

https://github.com/NuttyLogic/BSBolt 

to perform the alignments. For each species’ genome sequence, BSBolt creates an ‘in silico’ 

bisulfite-treated version of the genome. As many of the currently available genomes are in a low 

quality assembly state (e.g. thousands of contigs or scaffolds), we used the utility “Threader” 

(which can be found in BSBolt’s forebear BSseeker2(Guo et al., 2013) as a standalone 

executable) to reformat these fasta files into concatenated and padded pseudo-chromosomes. 

The set of nucleotide sequences of the designed probes, which includes degenerate base 

positions, was explicitly expanded into a larger set of nucleotide sequence representing every 

possible combination of those degenerate bases. For Infinium I probes, which have both a 

methylated and an unmethylated version of the probe sequence, only the methylated version was 

used as BSBolt’s version of the genome treats all CG sites as methylated. The initial 37K probe 

sequences resulted in a set of 184,352 sequences to be aligned against the various species 

genomes. We then ran BSBolt with parameters Align -M 0 –DB [path to bisulfite-

treated genome] -BT2 bowtie2 -BT2-p 4 -BT2-k 8 -BT2-L 20 -F1 [Probe 

Sequence File] –O [Alignment Output File] –S to align the enlarged set of probe 

sequences to each prepared genome. 

As we were not interested in the final BSBolt style output, we made a small modification to the 

code to retain its temporary output of alignment results in "sam" format. From these files, we 

collected only alignments where the entire length of the probe perfectly matched to the genome 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://github.com/NuttyLogic/BSBolt
https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


19 
 

sequence (i.e. the CIGAR string ‘50M’ and flag XM=0”). Then, for each genome we collapsed all 

the sequence variant alignments for each probeID down to a list of loci for that genome and for 

that probe.  

 
Mappability Approach 2: QUASR 

For version 2 of our mappability analysis, we aligned the probe sequences to all available 

mammalian genomes in ENSEMBL and NCBI Refseq databases using the QUASR package 

(Gaidatzis et al., 2015). The fasta sequence files for each genome were downloaded from these 

public databases. The alignment assumed that the DNA has been subjected to a bisulfite 

conversion treatment. For each species’ genome sequence, QUASR creates an in-silico-bisulfite-

treated version of the genome. The probes were aligned to these bisulfite treated genome 

sequences, which does not consider C-T as a mismatch. The alignment was ran with QUASR (a 

wrapper for Bowtie2) with parameters -k 2 --strata --best -v 3 and bisulfite = 

"undir” to align the enlarged set of probe sequences to each prepared genome. From these 

files, we collected the best candidate unique alignment to the genome. Additionally, the estimated 

CpG coordinates at the end of each probe was used to extract the sequence from each genome 

fasta files and exclude any probes with mismatches in the target CpG location. 

 
Genomic loci annotations 

Gene annotations (gff3) for each genome considered were also downloaded from the 

same sources as the genome. Following the alignment, the CpGs were annotated to genes based 

on the distance to the closest transcriptional start site using the Chipseeker package(Yu et al., 

2015). Genomic location of each CpG was categorized as either intergenic region, 3’ UTR, 5’ 

UTR, promoter (minus 10 kb to plus 100 bp from the nearest TSS), exon, or intron. The unique 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


20 
 

region assignment is prioritized as follows: exons, promoters, introns, 5’ UTR, 3' UTR, and 

intergenic. 

Additional genomic annotations, including human ortholog ENSEMBL ID, were extracted 

from the BioMart ENSEMBL database(Yates et al., 2020). The candidate gene for each probe 

was compared with human orthologous ENSEMBL ID to examine the similarity of the alignment 

with the human. For each probe, we examined if the assigned species ENSEMBL ID is identical 

to human-to-other-species-orthologous ENSEMBL ID in human mappability file. Orthologous 

comparison with human was done for genomes that could be matched to human genome by 

“targetSpecies_homolog_associated_gene_name" in Biomart using getLDS() function. 

Cell and tissue specific chromatin state annotations were based on the 25-state 

ChromHMM model based on imputed data for 12-marks (Ernst and Kellis, 2015; Roadmap 

Epigenomics Consortium et al., 2015). The chromatin state annotations from a ChromHMM model 

that was not specific to a single cell or tissue type were from (Vu and Ernst, 2020). We also 

provide in the annotation files of the array ChromHMM chromatin state annotations for mouse 

from (Gorkin et al., 2020). The human-mouse LECIF score was from (Kwon and Ernst, 2020). 

 
CpG island annotation 

 We called CpG islands using the “gCluster” algorithm(Gómez-Martín et al., 2018). This 

algorithm uses clustering methods to identify the sequences that have high G+C content and CpG 

density with the default parameters. Besides CpG island status, this algorithm calculated several 

other attributes including length, GC content, and CpG density for each defined island. The 

outcome of this algorithm was a BED file that was used to annotate the probes using the “annotatr” 

package in R by checking the overlap of the aligned probes and CpG island genomic coordinates.  

 
Human DNA methylation distribution 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


21 
 

 We downloaded the fraction methylated values based on whole genome bisulfite 

sequencing data from 37 different cells and tissues types from the Roadmap Epigenomics 

Consortium 

(http://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/FractionalMethylation.t

ar.gz)(Roadmap Epigenomics Consortium et al., 2015). For each CpG, we averaged the fractional 

methylation values across the Roadmap samples.  

 
GREAT analysis 

We applied the GREAT analysis software tool(McLean et al., 2010) to conduct gene set 

enrichments for genes near CpGs on the array in human and mouse. The GREAT software 

performs both a binomial test (over genomic regions) and a hypergeometric test over genes when 

using a whole genome background. We performed the enrichment based on default settings 

(Proximal: 5.0 kb upstream, 1.0 kb downstream, plus Distal: up to 1,000 kb) for gene sets 

associated with GO terms, MSigDB, PANTHER and KEGG pathway. To avoid large numbers of 

multiple comparisons, we restricted the analysis to the gene sets with between 10 and 3,000 

genes. We report nominal P values and two adjustments for multiple comparisons: Bonferroni 

correction and the Benjamini-Hochberg false discovery rate. 

 
Tissue enrichment analysis 

The enrichment of tissue specific genes was done by TissueEnrich R package(Jain and Tuteja, 

2019) using teEnrichment() function limited to human protein atlas(Uhlén et al., 2015) and mouse 

ENCODE(Yue et al., 2014) databases.  

 
Normalization methods 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

http://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/FractionalMethylation.tar.gz
http://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/FractionalMethylation.tar.gz
https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


22 
 

R software scripts implementing normalization methods can be accessed through our webpage 

(see the section on Data availability). Two software scripts are currently available for extracting 

beta values from raw signal intensities, based on Minfi and SeSAMe, respectively. Both methods 

use the noob method (Triche et al., 2013) for background subtraction. The two scripts evaluate 

each probe's hybridization and extension performance using normalization control probes and 

Infinium-I probe out-of-band measurements (the pOOBAH method (Zhou et al. 2018), 

respectively. Users can use the detection p-values for each CpG to filter out non-significant 

methylation readouts from probes unlikely to work in the target species. 

 
Calibration data 

We generated methylation data on two different platforms: the mammalian methylation array 

(HorvathMammalMethylChip40) and the human EPIC methylation array. 

The DNA samples from each species were enzymatically manipulated so that they would exhibit 

0%, 25%, 50%, 75% and 100% percent methylation at each CpG location, respectively. We 

purchased premixed DNA standards from EpigenDx Inc (products 80-8060H-PreMixHuman, 80-

8060M-PreMixMouse, and Standard80-8060R-PreMixRat Premixed Calibration Standard). 

The variable “ProportionMethylated” (with ordinal values 0, 0.25, 0.5, 0.75, 1) can be interpreted 

as a benchmark for each CpG that maps to the respective genome. Thus, the DNA methylation 

levels of each CpG are expected to have a high positive correlation with ProportionMethylated 

across the arrays measurement from a given species. The mammalian array was applied to 

synthetic DNA data from 3 species: human (n=10 mammalian arrays), mouse (n=20), and rat 

(n=15).  

Similarly, the human EPIC array was applied to calibration data from of mouse (n=15 EPIC arrays) 

and rat (n=10). Thus, we applied 3 EPIC arrays and 2 EPIC arrays per value (0, 0.25, 0.5, 0.75, 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


23 
 

1) of ProportionMethylated in our mouse and rat studies, respectively. The EPIC array data were 

normalized using the noob method (R function preprocessNoob in minfi). 

 
Data availability 

The mammalian methylation array (HorvathMammalMethylChip40) is registered at the NCBI 

Gene Expression Omnibus (GEO) as platform GPL28271 . The chip manifest file, calibration data, 

supplementary data, and R software scripts are or will be available from available 

https://github.com/shorvath/MammalianMethylationConsortium/ or the Gene Expression 

Omnibus. 

 
Acknowledgements and Funding 

This work was supported by the Paul G. Allen Frontiers Group (SH) and NSF CAREER award 

#1254200, National Institutes of Health (DP1DA044371) and a JCCC-BSCRC Ablon Scholars 

Award (JE). 

Conflict of Interest Statement 

The Regents of the University of California is the sole owner of a provisional patent application 

directed at this invention for which AA, JE and SH are named inventors. SH is a founder of the 

non-profit Epigenetic Clock Development Foundation, which plans to license several patents from 

his employer UC Regents, and distributes the mammalian methylation array. Bret Barnes is an 

employee for Illumina Inc which manufactures the mammalian methylation array. The other 

authors declare no conflicts of interest. 

  
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://github.com/shorvath/MammalianMethylationConsortium/
https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


24 
 

No. CpGs whose correlation with the 
ProportionMethylation > threshold 

Species Threshold Mammal+Sesame Mammal+Minfi EPIC+Minfi 

Mouse 0.85 27,868 26,944 4,550 

Mouse 0.90 24,050 22,207 2,356 

Mouse 0.95 16,444 12,797 604 

Rat 0.85 26,425 25,779 17,650 

Rat 0.90 22,427 20,989 6,159 

Rat 0.95 15,101 12,848 819 

Human 0.85 36,438 35,761 NA 

Human 0.90 34,547 33,402 NA 

Human 0.95 30,327 28,445 NA 

Table 1. Correlating DNA methylation levels with calibration data. We evaluated the 
Mammalian Methylation Array with two different software methods for normalization: SeSaMe and 
Minfi (noob normalization). The EPIC array data were only normalized with the noob normalization 
method in Minfi. As indicated in the first column, the DNA samples came from three species: 
human (n=10 arrays), mouse (n=20), and rat (n=15). For each species, the “artificial” 
chromosomes exhibited on average 0%, 25%, 50%, 75% and 100% percent methylation at each 
CpG location. Thus, the variable “ProportionMethylated” (with ordinal values 0, 0.25, 0.5 ,0.75, 1) 
can be considered as benchmark/gold standard. The table reports the number of CpGs for which 
the Pearson correlation with the ProportionMethylation was greater than the correlation threshold 
(second column). 
 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


25 
 

Figures 

 
b 

 
Figure 1. Overview of mammalian methylation array design process. 
(a) Toy example of multiple sequence alignment at a CpG site considered by the CMAPS 
algorithm. The orange coloring highlights the CpG being targeted. Positions where other species 
have alignment that matches the human sequence are in dark blue; positions where other species 
have alignment that does not match the human sequence are in neon yellow; positions where 
other species have no alignment are in grey. (b) Flowchart detailing the selection of probes on 
the array by the CMAPS algorithm. A small fraction of probes designed were dropped during the 
manufacturing process. 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


26 
 

Figure 2. CpG and gene coverage of probes on the mammalian methylation array across 
different phylogenetic orders.  
(a) Probe localization based on the QUASR package (Gaidatzis et al., 2015). The rows 
correspond to different phylogenetic orders. The phylogenetic orders are ordered based on the 
phylogenetic tree and increasing distance to human. The boxplots report the median number of 
mapped probes across species from the given phylogenetic order. (b) The number of probes 
mapped to human orthologous genes for a subset of genomes (Methods). (c) Percentage of the 
probes associated with human orthologous genes among mapped probes in these species. 
 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


27 
 

Figure 3. CpG island and chromatin state analysis of mammalian methylation probes. 
We characterize the CpGs located on the mammalian methylation array regarding (a) CpG island 
status in different phylogenetic orders, (b) chromatin state analysis, and (c) LECIF score of 
evidence of human-mouse conservation at the functional genomics level. (a) The boxplots report 
the median number (and interquartile range) of CpGs that map to CpG islands in mammalian 
species of a given phylogenetic order (x-axis). The notch around the median depicts the 95% 
confidence interval. (b) The heatmap visualizes the ChromHMM chromatin state annotations of 
the location of the CpGs on the array (rows) in different human tissues (columns)(Ernst and Kellis, 
2012, 2015). The colors correspond to 25 human chromatin states as detailed in the right panel. 
The probes in the left panel heatmap are ordered by the chromatin state with the maximum 
median frequency across 127 human cell and tissue types. The right panel indicates the 
distribution of chromatin states in each tissue type represented on the mammalian methylation 
array. (c) Comparison of distribution of LECIF score for probes on the array and aligning bases 
between human and mouse. The LECIF score has been binned as shown on the x-axis, and the 
fraction of probes or aligning bases with scores in that bin are shown on the y-axis. 
 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


28 
 

Figure 4. Distribution of probe intensities within sample, colored by the expected 
percentage of methylation at each site. 

(a-c) Distribution of beta values (relative intensity) of all probes on the array before normalization 
for (a) human samples, (b) mouse samples, and (c) rat samples. (d-f) Distribution of probe 
intensity after Sesame normalization and restricting probes to those that CMAPS designed to (d) 
the human genome in human samples, (e) the mouse genome in mouse samples, and (f) the rat 
genome in rat samples. 
 
 
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


29 
 

Figure 5. Calibration data: mean methylation across probes shared between the human 
EPIC array and the mammalian array. The mammalian methylation array contained 5574 
probes targeting the same CpG that can also be found on the human EPIC array that were not 
included based on being human biomarkers. However, the mammalian array probes were 
engineered differently than EPIC probes so that they would more likely work across mammals. 
By applying both array types to calibration data, we are able to compare the calibration of the 
overlapping probes in mice (a,b) and rats (c,d). Upper panels (a,b) and lower panels (c,d) present 
the results for the mammalian array and the EPIC array, respectively. The benchmark measure 
(ProportionMethylated, x-axis) versus the mean value across roughly 4341 CpGs that map to 
mice (a,c) and roughly 3948 CpGs that map to rats (b,d). The mean methylation (y-axis) was 
formed across a subset of CpGs that i) are present on the human EPIC array, ii) present on the 
mammalian array, and iii) apply to the respective species according to the mappability analysis 
genome coordinate file.  

 
0.0 0.2 0.4 0.6 0.8 1.0

0
.2

0
.4

0
.6

0
.8

Mouse,MammalArray,Sesame cor=0.96, p=2.2e-11

ProportionMethylated

m
e
a
n
M

e
th

.I
n
te

rs
e
c
tM

a
m

m
a
l.
E

P
IC

.M
a
p
s
T

o
M

o
u
s
e

a

0.0 0.2 0.4 0.6 0.8 1.0

0
.2

0
.4

0
.6

0
.8

Rat,MammalArray,Sesame cor=0.96, p=1.5e-08

ProportionMethylated

m
e
a
n
M

e
th

.I
n
te

rs
e
c
tM

a
m

m
a
l.
E

P
IC

.M
a
p
s
T

o
R

a
t

b

0.0 0.2 0.4 0.6 0.8 1.0

0
.2

0
.4

0
.6

0
.8

Mouse DNA, EPIC array cor=0.79, p=0.0013

ProportionMethylated

m
e
a
n
M

e
th

.E
P

IC
.P

ro
b
e
s
T

h
a
tM

a
p
T

o
M

o
u
s
e

c

0.0 0.2 0.4 0.6 0.8 1.0

0
.3

0
.4

0
.5

0
.6

0
.7

0
.8

Rat DNA, EPIC array cor=0.79, p=0.11

ProportionMethylated

m
e
a
n
M

e
th

.E
P

IC
.P

ro
b
e
s
T

h
a
tM

a
p
T

o
R

a
t

d

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


30 
 

Supplementary Figures 

   
Supplementary Figure S1: Comparison of probe context between the Illumina EPIC, 450K 
and the Mammalian Methylation array: (a) Analysis of CpG and non-CpG (CH) probes, (b) color 
channel assignment, (c) type I and type II probes, and (d) next base reveals similar percentages 
across probes from these three array platforms. Color channel assignment and probe basepair 
context are important for DNA methylation array analysis and the similarity between these 
different arrays can facilitate extension of published analysis and normalization methods. Analysis 
of type I and type II probes shows a slightly lower percentage of type I probes for the mammalian 
methylation array. Type I probes assay DNA methylation using one color channel and two bead 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


31 
 

types, i.e. one unmethylated bead type and one methylated bead type. Conversely, type II probes 
assay DNA methylation using one bead type and two color channels indicating methylated and 
unmethylated cytosines. Adjustment for DNA methylation signal detected by these different probe 
types is one of the most important steps in DNA methylation array normalization, and a sufficient 
number of type I probes were included in the Mammalian Methylation array to facilitate the 
extension of published data normalization methods. (e) Comparison of shared and non-shared 
probes between the Mammalian Methylation array and MouseMethylation array loci reveals 3107 
shared probes.  (f) Comparison of shared and non-shared probes between the EPIC, 450k and 
the Mammalian methylation array. Comparative analysis was performed using Illumina probe IDs, 
which are unique to each probe. Intersection of IDs between arrays reveals over 5,000 probes 
that are common to all platforms (center). These probes can be used to follow up published human 
epigenome-wide association study (EWAS) results in model organisms such as mouse (Mus 
musculus) or rat (Rattus norvegicus), or across a range of other species, including all primates 
and other mammals. 
 
  
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


32 
 

Supplementary Figure S2. Chromosome and gene region analysis of mammalian 
methylation probes in humans and mice. The analysis is based on mapping probes on the 
mammalian methylation array to the human (hg19) and mouse (mm10) genome using QUASR 
package(Gaidatzis et al., 2015). (a) The number of probes per human and mouse chromosome. 
(b) The left panel reports the percentage of probes that are located in different gene regions 
(promoters, 5' UTR, 3' UTR, introns, exons) in humans and mice. The right panel reports the 
distribution of the probes relative to the nearest transcriptional start site. (c) Histogram of CpG 
number in different gene regions in human and mouse genomes (as defined in the legend of panel 
d). (d) Alignment to orthologous genes between humans and mice. The colors indicate the 
mapped gene region in the mouse genome. The unique region assignment are prioritized as 
follows: exons, promoters, introns, 5' UTR, 3' UTR. 
  
 
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


33 
 

Summary Figure S3. GREAT gene set enrichment analysis of all probes on the mammalian 
methylation array. The figure shows the top enriched pathway based on gene-level enrichment 
analysis for genes proximal to probes using GREAT7. The two columns correspond to enrichment 
analysis for human (hg19) and mouse (mm10) genomes, respectively, using the whole genome 
as background. The top five enriched datasets from each category (Canonical pathways, 
diseases, gene ontology, human and mouse phenotypes, and upstream regulators) were selected 
and further filtered for significance at p < 10-5. The category is indicated by the shape, the number 
of genes by the size of the shape, and the significance of the enrichment is indicated by the color 
scale. 
 
 
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


34 
 

Supplementary Figure S4. Human and mouse tissue-specific probes on mammalian 
methylation array. Characterization of the tissue specificity of CpG probes on the mammalian 
methylation array using the human protein atlas(Uhlén et al., 2015) and mouse ENCODE gene 
expression data(Yue et al., 2014). The left and right panels report results for human and mouse 
genomes, respectively. Each probe is mapped to the closest gene while other genes in the 
flanking region are ignored in this analysis. The number of genes (a) and the number of CpG 
probes (b) versus a categorical measure of tissue specificity. The categories on the y-axis have 
the following definitions. The following categories are defined in the TissueEnrich software 
"Tissue Enriched" labels genes with an expression level greater than 1 (TPM or FPKM) that also 
have at least five-fold higher expression levels in a particular tissue compared to all other tissues. 
"Group Enriched" labels genes with an expression level greater than 1 (TPM or FPKM) that also 
have at least five-fold higher expression levels in a group of 2-7 tissues compared to all other 
tissues, and that are not considered Tissue Enriched. "Tissue Enhanced" labels genes with an 
expression level greater than 1 (TPM or FPKM) that also have at least five-fold higher expression 
levels in a particular tissue compared to the average levels in all other tissues, and that are not 
considered Tissue Enriched or Group Enriched. (c) The number of tissue-enriched genes 
represented on mammalian array vs background in human and mouse transcriptome.  

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


35 
 

Supplementary Figure S5. Distribution of DNA methylation levels. Distribution of average 

fractional methylation across 37 cell and tissue types(Roadmap Epigenomics Consortium et al., 

2015) at CpG sites on the array (blue) and all sites in the genome (red).  

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


36 
 

Supplementary Figure S6: Mammalian Methylation Array enrichment for Universal 
Chromatin State Annotations. (Left) Distribution of probe overlap with a universal chromatin 
state annotation by the stacked modeling approach of ChromHMM applied to data from more than 
100 cell or tissue types(Vu and Ernst, 2020). (Right) The same as left, but showing the fold 
enrichments of the state relative to a uniform background. The strongest enrichment is seen for 
some bivalent promoter states. A full characterization of the states can be found in (Vu and Ernst, 
2020). 
 
  
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


37 
 

References 

Aryee, M.J., Jaffe, A.E., Corrada-Bravo, H., Ladd-Acosta, C., Feinberg, A.P., Hansen, K.D., and 
Irizarry, R.A. (2014). Minfi: a flexible and comprehensive Bioconductor package for the analysis 
of Infinium DNA methylation microarrays. Bioinformatics 30, 1363–1369. 

Bibikova, M., Le, J., Barnes, B., Saedinia-Melnyk, S., Zhou, L., Shen, R., and Gunderson, K.L. 
(2009). Genome-wide DNA methylation profiling using Infinium® assay. Epigenomics 1, 177–
200. 

Bibikova, M., Barnes, B., Tsan, C., Ho, V., Klotzle, B., Le, J.M., Delano, D., Zhang, L., Schroth, 
G.P., Gunderson, K.L., et al. (2011). High density DNA methylation array with single CpG site 
resolution. Genomics 98, 288–295. 

Breeze, C.E., Reynolds, A.P., van Dongen, J., Dunham, I., Lazar, J., Neph, S., Vierstra, J., 
Bourque, G., Teschendorff, A.E., Stamatoyannopoulos, J.A., et al. (2019). eFORGE v2.0: 
updated analysis of cell type-specific signal in epigenomic data. Bioinformatics 35, 4767–4769. 

Cokus, S.J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C.D., Pradhan, S., 
Nelson, S.F., Pellegrini, M., and Jacobsen, S.E. (2008). Shotgun bisulphite sequencing of the 
Arabidopsis genome reveals DNA methylation patterning. Nature 452, 215–219. 

Ernst, J., and Kellis, M. (2012). ChromHMM: automating chromatin-state discovery and 
characterization. Nat. Methods 9, 215–216. 

Ernst, J., and Kellis, M. (2015). Large-scale imputation of epigenomic datasets for systematic 
annotation of diverse human tissues. Nat. Biotechnol. 33, 364–376. 

Farrell, C., Thompson, M., Tosevska, A., Oyetunde, A., and Pellegrini, M. (2020). BiSulfite Bolt: 
A BiSulfite Sequencing Analysis Platform. BioRxiv 2020.10.06.328559. 

Gaidatzis, D., Lerch, A., Hahne, F., and Stadler, M.B. (2015). QuasR: quantification and 
annotation of short reads in R. Bioinformatics 31, 1130–1132. 

Gómez-Martín, C., Lebrón, R., Oliver, J.L., and Hackenberg, M. (2018). Prediction of CpG 
Islands as an Intrinsic Clustering Property Found in Many Eukaryotic DNA Sequences and Its 
Relation to DNA Methylation. Methods Mol. Biol. Clifton NJ 1766, 31–47. 

Gorkin, D.U., Barozzi, I., Zhao, Y., Zhang, Y., Huang, H., Lee, A.Y., Li, B., Chiou, J., Wildberg, 
A., Ding, B., et al. (2020). An atlas of dynamic chromatin landscapes in mouse fetal 
development. Nature 583, 744–751. 

Guintivano, J., Aryee, M.J., and Kaminsky, Z.A. (2013). A cell epigenotype specific model for 
the correction of brain cellular heterogeneity bias and its application to age, brain region and 
major depression. Epigenetics 8, 290–302. 

Guo, W., Fiziev, P., Yan, W., Cokus, S., Sun, X., Zhang, M.Q., Chen, P.-Y., and Pellegrini, M. 
(2013). BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics 
14, 774. 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


38 
 

Haeussler, M., Zweig, A.S., Tyner, C., Speir, M.L., Rosenbloom, K.R., Raney, B.J., Lee, C.M., 
Lee, B.T., Hinrichs, A.S., Gonzalez, J.N., et al. (2019). The UCSC Genome Browser database: 
2019 update. Nucleic Acids Res. 47, D853–D858. 

Hannum, G., Guinney, J., Zhao, L., Zhang, L., Hughes, G., Sadda, S., Klotzle, B., Bibikova, M., 
Fan, J.-B., Gao, Y., et al. (2013). Genome-wide methylation profiles reveal quantitative views of 
human aging rates. Mol. Cell 49, 359–367. 

Horvath, S. (2013). DNA methylation age of human tissues and cell types. Genome Biol. 14, 
R115. 

Horvath, S., and Levine, A.J. (2015). HIV-1 Infection Accelerates Age According to the 
Epigenetic Clock. J. Infect. Dis. 212, 1563–1573. 

Horvath, S., Oshima, J., Martin, G.M., Lu, A.T., Quach, A., Cohen, H., Felton, S., Matsuyama, 
M., Lowe, D., Kabacik, S., et al. (2018). Epigenetic clock for skin and blood cells applied to 
Hutchinson Gilford Progeria Syndrome and ex vivo studies. Aging 10, 1758–1775. 

Houseman, E.A., Accomando, W.P., Koestler, D.C., Christensen, B.C., Marsit, C.J., Nelson, 
H.H., Wiencke, J.K., and Kelsey, K.T. (2012). DNA methylation arrays as surrogate measures of 
cell mixture distribution. BMC Bioinformatics 13, 86. 

Jain, A., and Tuteja, G. (2019). TissueEnrich: Tissue-specific gene enrichment analysis. 
Bioinforma. Oxf. Engl. 35, 1966–1967. 

Kwon, S.B., and Ernst, J. (2020). Learning a genome-wide score of human-mouse conservation 
at the functional genomics level. BioRxiv 2020.09.08.288092. 

Levine, M.E., Lu, A.T., Quach, A., Chen, B.H., Assimes, T.L., Bandinelli, S., Hou, L., Baccarelli, 
A.A., Stewart, J.D., Li, Y., et al. (2018). An epigenetic biomarker of aging for lifespan and 
healthspan. Aging 10, 573–591. 

Li, X., Chen, F., and Chen, Y. (2020). Gcluster: a simple-to-use tool for visualizing and 
comparing genome contexts for numerous genomes. Bioinforma. Oxf. Engl. 36, 3871–3873. 

Lister, R., Pelizzola, M., Dowen, R.H., Hawkins, R.D., Hon, G., Tonti-Filippini, J., Nery, J.R., 
Lee, L., Ye, Z., Ngo, Q.-M., et al. (2009). Human DNA methylomes at base resolution show 
widespread epigenomic differences. Nature 462, 315–322. 

McLean, C.Y., Bristor, D., Hiller, M., Clarke, S.L., Schaar, B.T., Lowe, C.B., Wenger, A.M., and 
Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nat. 
Biotechnol. 28, 495–501. 

Meissner, A., Gnirke, A., Bell, G.W., Ramsahoye, B., Lander, E.S., and Jaenisch, R. (2005). 
Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation 
analysis. Nucleic Acids Res. 33, 5868–5877. 

Needhamsen, M., Ewing, E., Lund, H., Gomez-Cabrero, D., Harris, R.A., Kular, L., and Jagodic, 
M. (2017). Usability of human Infinium MethylationEPIC BeadChip for mouse DNA methylation 
studies. BMC Bioinformatics 18, 486. 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/


39 
 

Ooi, S.K.T., Qiu, C., Bernstein, E., Li, K., Jia, D., Yang, Z., Erdjument-Bromage, H., Tempst, P., 
Lin, S.-P., Allis, C.D., et al. (2007). DNMT3L connects unmethylated lysine 4 of histone H3 to de 
novo methylation of DNA. Nature 448, 714–717. 

Pidsley, R., Zotenko, E., Peters, T.J., Lawrence, M.G., Risbridger, G.P., Molloy, P., Van Djik, S., 
Muhlhausler, B., Stirzaker, C., and Clark, S.J. (2016). Critical evaluation of the Illumina 
MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome 
Biol. 17, 208. 

Roadmap Epigenomics Consortium, Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., 
Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., et al. (2015). Integrative analysis of 
111 reference human epigenomes. Nature 518, 317–330. 

Robertson, K.D. (2005). DNA methylation and human disease. Nat. Rev. Genet. 6, 597–610. 

Rosenbloom, K.R., Armstrong, J., Barber, G.P., Casper, J., Clawson, H., Diekhans, M., Dreszer, 
T.R., Fujita, P.A., Guruvadoo, L., Haeussler, M., et al. (2015). The UCSC Genome Browser 
database: 2015 update. Nucleic Acids Res. 43, D670–D681. 

Smith, Z.D., and Meissner, A. (2013). DNA methylation: roles in mammalian development. Nat. 
Rev. Genet. 14, 204–220. 

Triche, T.J., Weisenberger, D.J., Van Den Berg, D., Laird, P.W., and Siegmund, K.D. (2013). 
Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res. 41, 
e90. 

Uhlén, M., Fagerberg, L., Hallström, B.M., Lindskog, C., Oksvold, P., Mardinoglu, A., Sivertsson, 
Å., Kampf, C., Sjöstedt, E., Asplund, A., et al. (2015). Proteomics. Tissue-based map of the 
human proteome. Science 347, 1260419. 

Vu, H., and Ernst, J. (2020). Universal annotation of the human genome through integration of 
over a thousand epigenomic datasets. BioRxiv 2020.11.17.387134. 

Yates, A.D., Achuthan, P., Akanni, W., Allen, J., Allen, J., Alvarez-Jarreta, J., Amode, M.R., 
Armean, I.M., Azov, A.G., Bennett, R., et al. (2020). Ensembl 2020. Nucleic Acids Res. 48, 
D682–D688. 

Yu, G., Wang, L.-G., and He, Q.-Y. (2015). ChIPseeker: an R/Bioconductor package for ChIP 
peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383. 

Yue, F., Cheng, Y., Breschi, A., Vierstra, J., Wu, W., Ryba, T., Sandstrom, R., Ma, Z., Davis, C., 
Pope, B.D., et al. (2014). A comparative encyclopedia of DNA elements in the mouse genome. 
Nature 515, 355–364. 

Zhou, W., Triche, T.J., Jr, Laird, P.W., and Shen, H. (2018). SeSAMe: reducing artifactual 
detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res. 
46, e123–e123. 

 
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425637doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.07.425637
http://creativecommons.org/licenses/by-nc/4.0/