key: cord-0331590-czghzeoj authors: Éliás, Szabolcs; Schmidt, Angelika; Gomez-Cabrero, David; Tegnér, Jesper title: Gene regulatory network of human GM-CSF secreting T helper cells date: 2020-08-17 journal: bioRxiv DOI: 10.1101/555433 sha: 9175464c6b6db80cd2ed6e8efae0c8af3d2b79f2 doc_id: 331590 cord_uid: czghzeoj GM-CSF produced by autoreactive CD4 positive T helper cells is involved in the pathogenesis of autoimmune diseases, such as Multiple Sclerosis. However, the molecular regulators that establish and maintain the features of GM-CSF positive CD4 T cells are unknown. In order to identify these regulators, we isolated human GM-CSF producing CD4 T cells from human peripheral blood by using a cytokine capture assay. We compared these cells to the corresponding GM-CSF negative fraction, and furthermore, we studied naïve CD4 T cells, memory CD4 T cells and bulk CD4 T cells from the same individuals as additional control cell populations. As a result, we provide a rich resource of integrated chromatin accessibility (ATAC-seq) and transcriptome (RNA-seq) data from these primary human CD4 T cell subsets, and we show that the identified signatures are associated with human autoimmune disease, especially Multiple Sclerosis. By combining information about mRNA expression, DNA accessibility and predicted transcription factor binding, we reconstructed directed gene regulatory networks connecting transcription factors to their targets, which comprise putative key regulators of human GM-CSF positive CD4 T cells as well as memory CD4 T cells. Our results suggest potential therapeutic targets to be investigated in the future in human autoimmune disease. and furthermore, we studied naïve CD4 T cells, memory CD4 T cells and bulk CD4 T cells from the 23 same individuals as additional control cell populations. As a result, we provide a rich resource of 24 integrated chromatin accessibility (ATAC-seq) and transcriptome (RNA-seq) data from these primary 25 human CD4 T cell subsets, and we show that the identified signatures are associated with human 26 autoimmune disease, especially Multiple Sclerosis. By combining information about mRNA 27 expression, DNA accessibility and predicted transcription factor binding, we reconstructed directed 28 gene regulatory networks connecting transcription factors to their targets, which comprise putative key 29 regulators of human GM-CSF positive CD4 T cells as well as memory CD4 T cells. Our results 30 suggest potential therapeutic targets to be investigated in the future in human autoimmune disease. specifically the GM-CSF produced by autoreactive T cells was necessary for EAE induction, while T 47 cell-produced IFN-γ and IL-17 were dispensable [7] [8] [9] . In line with these results, GM-CSF expression 48 by Th cells was required for neuroinflammation in EAE, and even in the presence of IFN-γ and IL-17A 49 producing Th cells, pathogenicity was lost upon loss of GM-CSF [10] . Importantly, in humans the 50 fraction of GM-CSF positive (and IFN-γ positive) cells within CD4 T cells was elevated in MS patients' 51 cerebrospinal fluid compared to controls, while IL-17A positive cell fractions were not strikingly 52 different in these reports [11, 12] . The fraction of GM-CSF positive and IFN-γ positive cells was also 53 increased in peripheral blood of MS patients in one report [13] , but not in another [11] . Similarly, 54 enhanced fractions of GM-CSF producing CD4 T cells have been observed in synovial fluid of 55 patients with juvenile arthritis along with the well-known enhanced GM-CSF levels in synovial fluid 56 [14, 15] . Interestingly, an expanded Th subset producing GM-CSF was found in blood and CNS of MS 57 patients and could be diminished by disease-modifying therapy [16] , suggesting GM-CSF producing 58 Th cells to be an attractive therapeutic target. Of note, targeting GM-CSF in MS or arthritis is subject 59 to several ongoing clinical studies, highlighting the importance of this cytokine in these diseases [15] . 60 RNA-seq alignment were created using the gencode version 25 annotation file. RNA-seq alignment 251 was run with STAR's built-in adapter trimming option ('--clip3pAdapterSeq 252 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC 253 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA') and its built-in counting option ('--254 quantMode'). Only genes with more than 1 count per million in at least 3 samples were included in the 255 downstream RNA-seq analysis. 256 Generation of consensus peak-set. In order to generate a set of comparable features (genomic 257 regions) for read-counting and quantifying differential accessibility from the ATAC-seq data, a set of 258 consensus peaks was generated in two subsequent steps. (1) Generation of consensus peaks on the 259 technical replicate level: First, the peaks that appeared in at least two technical replicates (out of a 260 total of three, except one donor with a total of two technical replicates) with at least 75% reciprocal 261 overlap were selected. Then these selected regions were partitioned into disjoint non-empty subsets 262 so that each element is contained in precisely one subset. Only the partitions appearing in at least two 263 replicates were retained, and afterwards adjacent regions were merged. A bed file resulting from 264 these steps is herein referred to as a sample. (2) Next, all bed files containing each set of technical 265 replicate level consensus regions were concatenated and the presence of each region was counted 266 within each experimental sample; one occurrence corresponds to one donor (biological replicate) and 267 one cell type (experimental condition). Only the regions appearing in at least four samples were kept 268 (n=5 is the number of biological replicates in the smallest group regarding experimental condition). 269 Regions having a distance of 42 bases or less between them were subsequently merged (the number 270 42 corresponds to the sequencing read length in bases). Afterwards, reads were counted using the 271 featureCounts tool [42] with the criterion that at least half of the read had to overlap with a feature to 272 be assigned. 273 Calculation of differential expression and differential accessibility. Differential expression and 274 accessibility was calculated using the edgeR [43] library version 3.18.1 from Bioconductor. Donors 275 (biological replicates) and cell types (experimental conditions) were used as explanatory variables in 276 the generalized linear models. ATAC-seq data were normalized to length and GC content by 277 conditional quantile normalization (CQN) [44] . Comparisons were made between GM-CSF positive 278 CD4 T cells versus GM-CSF negative CD4 T cells, or between memory versus naïve CD4 T cells (see 279 Figure 1A ). The cutoff to call differentially expressed genes (DEGs) or differentially accessible regions 280 (DARs) was FDR<0.05 and >25% fold change (in the direction of either up-or down-regulation that is, 281 either 1. 25 Network reconstruction. A directed network was reconstructed by combining information from the 291 ATAC-seq and RNA-seq data in two subsequent steps. (1) Identifying source nodes: Peaks were 292 ranked based on their combined measure of significance and direction of differential accessibility [-293 9 log 10 (FDR) x sign(log 2 (fold change))]. Peaks containing footprints were scanned for TF binding motifs 294 using the TRANSFAC database [47] . An enrichment score was calculated to identify TFs with binding 295 sites enriched in differentially accessible peaks, using tools similar to gene set enrichment analysis 296 (GSEA) [48] . In detail, random sampling was performed on the ranked list of peaks to assess how 297 probable it is to observe at least the same enrichment by chance (P value) [49] . After multiple test 298 correction, TFs with FDR<0.05 were selected and the normalized enrichment score (NES) was 299 obtained. Only those source nodes (TFs) that were detectably expressed on RNA level (according to 300 a minimal RNA-seq filtering rule) were considered. Of these, most fell into the class of highly 301 expressed genes (HEGs) according to [50] . (2) Identifying target nodes: Target nodes are defined as 302 peaks with an assigned gene. The selection criteria were that (i) the peak contains a footprint with a 303 binding motif of the source node (TF) and (ii) the peak and/or the assigned gene has to be 304 differentially accessible or differentially expressed respectively. 305 To assign an importance measure to the source nodes in networks generated as above, the 306 PageRank [51] of the network nodes was calculated after inverting the directionality of all edges in the 307 network (only for the purpose of this computation). After this computation, the nodes with high 308 PageRank values (higher than the 99th percentile of all node values within a given network) were 309 selected from both the GM-CSF and memory network, and afterwards their values were investigated 310 in each of the two networks. Further, ex vivo re-stimulation with strong artificial stimuli such as PMA and ionomycin -which is 321 usually necessary to reach sufficient signal strength for detection by intracellular cytokine staining -322 drastically alters the transcriptome of T cells. Hence, we aimed to isolate GM-CSF producing cells ex 323 vivo in an as much as possible unmanipulated state by GM-CSF secretion assay, "capturing" and 324 isolating those cells that actively secrete GM-CSF (experimental setup, see Figure 1A ), here defined 325 as GM-CSF positive cells. The capture assay was performed starting from highly purified CD4 T cells 326 derived from human peripheral blood (purity 97.3±0.6 %, mean±SEM, Supplementary Figure S1A ). As 327 controls, we used the respective GM-CSF negative fraction from the isolation procedure, as well as 328 the bulk CD4 T cells before any capture assay procedure. The latter should, given the low fraction of 329 GM-CSF positive cells, be very similar to the GM-CSF negative fraction and hence allows for 330 estimation of the effects arising from the capture assay procedure. The purity of GM-CSF positive and 331 GM-CSF negative fractions was assessed by flow cytometry ( Figure 1B ) and the yield of isolated GM-332 CSF positive cells was 1.2±0.6 % (mean±SD) of CD4 T cells. To measure the transcriptome and DNA 333 accessibility from limited cell numbers, we employed highly sensitive next-generation sequencing 334 (NGS) methods (RNA-seq and ATAC-seq respectively). Since cytokine secreting cells may differ from 335 naïve cells due to a memory-like phenotype, and a large fraction of CD4 T cells are naïve 336 (mean±SEM 42.3±5.8 % in the donors here; see Supplementary Figure S1A), we further profiled 337 highly purified naïve and memory CD4 T cells from the same donors ( Figure 1A We studied the above-described five different CD4 T cell populations by RNA-seq and ATAC-seq. To 348 minimize potential batch effects due to technical factors, the library preparations and sequencing runs 349 were designed in such a way that donors, cell populations, and (for ATAC-seq) technical triplicates 350 were distributed in a balanced fashion. It is also worth noting that only donors of the same gender 351 were studied here (male, age 35.7±8.5 years, mean±SD), which may be important since it was 352 recently shown that gender was the largest source of variation explaining chromatin accessibility in 353 primary human CD4 T cells measured by ATAC-seq [36] . That study further discovered novel 354 elements escaping X chromosome inactivation and affecting immune genes [36] . To assess which 355 factors explained most of the variability between the samples under study here, we performed 356 principle component analysis (PCA). Indeed, for both data types there was a grouping of the samples 357 based on the cell subset, outweighing donor or experimental variation (Figure 2A , B) and confirming 358 the quality of our samples and data. Notably, for RNA data, the cell populations were generally more 359 distinct from each other than for DNA accessibility data. However, the GM-CSF positive and 360 corresponding GM-CSF negative fraction appeared relatively similar to each other in the PCA 361 performed on RNA data, while PCA results from ATAC-seq data were closer to the expected pattern 362 that is, bulk CD4 T cells appearing "between" GM-CSF positive and GM-CSF negative populations 363 ( Figure 2A, B) . The difference between RNA-seq and ATAC-seq data with respect to separation of 364 GM-CSF positive and negative cells may indicate that the capture assay procedure imposes distinct 365 changes on the transcriptome, highlighting the importance of using correspondingly treated controls to 366 determine differential expression. In contrast, changes in DNA accessibility appeared more robust 367 towards changes due to the experimental procedure at least within the experimental time frame under 368 study, although the distinction of the other groups was generally less apparent with ATAC-seq data. 369 Because the first two PCs only explained about 50% of the variation in the data, we also used another 370 dimensionality reduction method to explore the sample-to-sample relationships, namely t-distributed 371 stochastic neighbor embedding (t-SNE) [52] . t-SNE, in contrast to PCA, is a non-linear dimensionality 372 reduction algorithm, and it is suited for capturing local and global relationships at the same time. The 373 t-SNE results ( Figure 2C , D) generally confirmed the results of the PCA analysis (Figure 2A , B) that is, 374 the groups (cell types) being more distinct in RNA data than ATAC-seq data, with the exception of 375 GM-CSF positive and GM-CSF negative cells. 376 We defined significantly differentially accessible DNA regions (DARs) and significantly differentially 377 expressed genes (DEGs) in GM-CSF positive CD4 T cells or in memory CD4 T cells. To do so, we 378 specifically analyzed the signatures of GM-CSF positive versus GM-CSF negative CD4 T cells, as 379 well as the profiles of memory versus naïve CD4 T cells ( Figure 1A and Figure 3A , B). We used 380 generalized linear models based on the negative binomial distribution (edgeR) to determine 381 differential expression and accessibility respectively, and we called DEGs and DARs based on 382 combined FDR and fold change cutoffs. We called 16571 DARs in GM-CSF positive CD4 T cells 383 (compared to corresponding GM-CSF negative cells; Figure 3A ) and 13705 DARs in memory CD4 T 384 cells (compared to naïve CD4 T cells; Figure 3A their molecular patterns in the other cell types under study. We next extracted those memory-specific 394 DARs that were assigned to a gene from a list of genes known to be involved in T cell memory as 395 compiled by Polansky and colleagues [30] . Several of the DARs in memory cells were assigned to 396 such known memory-associated genes, most of those in the promoter region ( Figure 3C ). 397 Furthermore, we assessed a selected subset of memory related genes that were shown to be up-or 398 down-regulated on RNA level in memory T cells [30] . The majority of these genes were DEG in 399 memory cells in our data, notably up-or down-regulated almost exclusively (36 of 37 studied genes; 400 97%) in the expected direction ( Figure 3D ), validating our data. in the CyTOF and RNA-seq data ( Figure 4C ). According to a binomial distribution, the probability of 467 observing this or greater concordance between the two datasets by chance is 7.6%. This needs to be 468 acknowledged considering that those markers that do not concur might be regulated by protein 469 internalization from the surface, such as well-known for CD3 [58] that was down-regulated in CyTOF 470 data but barely affected on RNA level. Also, the total abundance of certain proteins may be regulated 471 on post-transcriptional level, as transcript levels cannot always predict protein abundance [59, 60] . 14 supporting the relevance of our data. The data also showed enrichment for several other diseases 484 related to the immune system, infection or metabolism, although it should be noted that some of these 485 contained only few elements (genes) and may thus be less relevant than MS or rheumatoid arthritis, 486 which comprised a large number of genes ( Figure 5A and Supplementary Table S1A ). When 487 performing the same enrichment analysis for the ranked gene list from memory versus naïve CD4 T 488 cells, a large number of diseases were significantly enriched including many diseases involving the 489 immune system, as expected (Supplementary Table S1B) . Since chromatin accessibility can directly affect gene expression, we next combined the ATAC-seq 523 and RNA-seq data, aiming to identify key TFs that may bind to open chromatin regions and hence 524 affect the expression of their target genes. The methods for calling consensus peaks are not trivial. 525 Therefore, we first confirmed that the genes assigned to the consensus peaks defined in this study 526 were enriched for several pathways involved in immune regulation including T cell receptor signaling 527 and Th subset differentiation, as well as for immune-related diseases including MS, RA and other 528 autoimmune disease (Supplementary Table S2A, B) . Furthermore, the distribution of the consensus 529 peaks (open chromatin regions) defined in our data showed an enrichment for being located in CpG 530 islands, promoters, 5' UTRs, exons and protein-coding regions ( Figure 6A ), suggesting that the 531 ATAC-seq data generated here should be well suited to identify TF binding and the expression of 532 corresponding target genes. We first studied the relationship of RNA and chromatin data on a global 533 level: We considered the genes that were both detected on RNA level and also had an ATAC-seq 534 peak assigned to them. We ranked these genes using the -log 10 (FDR)  sign(log(fold change)) 535 function separately for the RNA-seq and ATAC-seq data. Next, we visualized the correlation between 536 these two ranks for each contrast ( Figure 6B ). Considering only the genes significantly changing 537 (FDR<0.05) in the given contrast and assigning them to up-and down-regulated categories, we 538 detected more than random coincidence in the direction of the change between the RNA-seq and 539 ATAC-seq data (using Fisher's exact test with Monte Carlo simulation; Figure 6B ). Increased 540 openness of the chromatin was associated with increased expression of the corresponding gene's 541 RNA for the majority of genes. Less accessible chromatin also coincided with low expression of the 542 corresponding gene in many cases, although a substantial fraction of lowly accessible regions also 543 displayed high expression of the corresponding gene ( Figure 6C ). 544 545 To identify potential TFs that may establish the gene signatures of GM-CSF positive (versus GM-CSF 547 negative) and memory (versus naïve) CD4 T cells, we scanned the consensus peaks for footprints 548 and subsequently we scanned the identified footprints for TF binding motifs. For motif scanning we 549 used the TRANSFAC database [47] that contains experimentally validated binding sites, consensus 550 binding sequences (positional weight matrices) and regulated genes of eukaryotic TFs. Confirming 551 the methodology used to identify footprints, they were enriched in differentially accessible regions as 552 expected for the corresponding cell type comparisons (Supplementary Figure S3) . Next, we defined 553 the TFs whose binding sites were most enriched in peaks of GM-CSF positive or memory cells 554 respectively (ranked based on differentially accessibility), and identified about 20 TFs each that 555 passed the significance threshold (FDR<0.05) for enrichment ( Figure 7A ). These lists contained 556 several TFs with a well-known role in T cells, such as SATB1, YY1, ETS-and EGR-family TFs, 557 among other factors with a less defined role. Notably there was little overlap between the key TFs in 558 GM-CSF positive and memory cells, suggesting that our strategy may have identified key factors to 559 specifically define the GM-CSF positive T cell phenotype. The majority of these TFs were highly 560 expressed on RNA level ( Figure 7B ), falling within the class of highly expressed genes (HEGs) that 561 were suggested to be more functional than lowly expressed genes [50] . Interestingly however, most of 562 these TFs were not differentially expressed themselves in the cell population comparisons under 563 study ( Figure 7C ), suggesting that they may be regulated on post-transcriptional levels such as 564 protein phosphorylation and intracellular localization which is well-known for many TFs. Hence, with a 565 strategy exploring solely the transcriptome (or even proteome) without integrating ATAC-seq data, 566 several key TFs would likely be missed, while our integrative strategy combining RNA-seq and ATAC-567 seq data successfully identified such TFs from limited amounts of primary human T cells. Having identified key TFs in GM-CSF positive and memory CD4 T cells, we were interested whether 572 these factors regulated certain groups of target genes and whether several TFs may act together in a 573 concerted fashion. Some TFs regulated a large number of target genes, and clusters of TFs were 574 grouping together ( Figure 8A, B) . Exploring the co-binding between TFs in more detail showed that 575 certain groups of TFs bound together to the same targets ( Figure 8C, D) . These included the 576 TCF3:LEF1 pair that was always co-binding the same regions, as evident from the Memory/Naïve 577 CD4 T cell contrast ( Figure 8B, D) . TCF/LEF family proteins act downstream of the Wnt pathway and 578 it is well-known that they often display overlapping expression patterns and functional redundancy 579 [65]. In accordance with our data on human memory and naïve CD4 T cells, binding motifs for TCF 580 family members were recently also determined to be depleted in murine memory CD8 T cells as well 581 as human memory CD4 T cells using ATAC-seq [35, 39] . 582 Finally, by connecting the TFs to the genes assigned to their target regions, we generated a directed 583 gene regulatory network representing GM-CSF positive (versus negative) and memory (versus naïve) 584 CD4 T cells respectively (Figure 9 ). The source nodes were represented by the key TFs as selected 585 above (key TFs from Figure 7A , B) and target nodes were selected when the region was either 586 differentially accessible and/or the mRNA of the assigned gene was differentially expressed in the 587 respective cell population comparisons. Both the GM-CSF as well as the memory comparisons led to 588 similarly sized networks however, only some of the target and source nodes were shared between 589 both networks ( Figure 9A, B) . (directly or indirectly). Some of the key TFs that were shared between both networks also had a 597 relatively high PageRank in both networks, such as the TFs E2F3, SPIB and GABPA ( Figure 9A, B) , 598 and may be general regulators of T cell activation/differentiation. Importantly, we also identified key 599 TFs with a relatively high PageRank in the GM-CSF T cell regulatory network whose PageRank value 600 in the memory/naïve network was 0. These TFs, namely ZNF35, SP2, SATB1, YY1, TEF and ZNF333 601 may, thus, be novel regulators specifically controlling the molecular signatures of GM-CSF positive 602 CD4 T cells ( Figure 9A, B) . Notably, the authors [66] stated that their approach may "miss other disease related" genes and that 620 "follow-up studies, such as RNA-sequencing analysis on TH17 and TH1/17 subsets isolated from MS 621 patients, may help to identify more disease-related genes" which we propose should be extended to profiles, GM-CSF producing cells were more related to IFN-γ than IL-17 producing T cells [10] . With 630 an elegant fate-mapping system, that study [10] could also demonstrate that murine GM-CSF 631 producing murine T cells display a stable epigenetic imprint. While such fate mapping systems are not 632 possible with human cells, and also, capturing multiple cytokine-producing Th subsets in parallel from 633 the same donor without artificially raising their fraction by PMA and ionomycin was not feasible in our 634 study due to the limited cell numbers, our study provides data from unmanipulated ex vivo isolated 635 human cells that may be most relevant for human diseases. Together, our data on naïve, memory, vs. vs. Naïve CD4 vs. vs. Functional and Phenotypic Plasticity of CD4+ T Cell Subsets Granulocyte 701 macrophage colony-stimulating factor: a new putative therapeutic target in multiple sclerosis Mice with a 704 disrupted IFN-gamma gene are susceptible to the induction of experimental autoimmune 705 encephalomyelitis (EAE) do not contribute vitally to autoimmune neuro-inflammation in mice Production by Autoreactive T Cells Is Required for the Activation of Microglial Cells and the 711 Onset of Experimental Autoimmune Encephalomyelitis RORγt drives 714 production of the cytokine GM-CSF in helper T cells, which is essential for the effector phase 715 of autoimmune neuroinflammation STAT5 programs a distinct 717 subset of GM-CSF-producing T helper cells that is essential for autoimmune 718 neuroinflammation CSF Expression Identifies a Discrete Subset of Inflammation-Driving T Helper Cells Regulated 721 by Cytokines IL-23 and IL-1β IL-17 and GM-724 CSF expression are antagonistically regulated by human T helper cells Th cells in the cerebrospinal fluid of persons with multiple sclerosis are dominated by 728 pathogenic non-classic Th1 cells and GM-CSF-only-secreting Th cells IL2RA polymorphism controls GM-CSF production in human TH cells Brief Report: T Cell Expression of Granulocyte-Macrophage Colony-Stimulating Factor in 735 Juvenile Arthritis Is Contingent Upon Th17 Plasticity Targeting GM-CSF in inflammatory diseases GM-CSF and 740 CXCR4 define a T helper cell signature in multiple sclerosis High levels of circulating GM-743 CSF+CD4+ T cells are predictive of poor outcomes in sepsis patients: a prospective cohort 744 study Pathogenic T-cells and inflammatory 746 monocytes incite inflammatory storms in severe COVID-19 patients T cell responses in patients with COVID-19 Current State of the Science GM-CSF-based treatments in COVID-753 19: reconciling opposing therapeutic approaches Induction and molecular 756 signature of pathogenic TH17 cells Induction of pathogenic T17 cells 758 by inducible salt-sensing kinase SGK1 The encephalitogenicity of TH17 760 cells is dependent on IL-1-and IL-23-induced production of the cytokine GM-CSF Autoimmunity beyond Th17: GM-CSF producing T cells Functional 765 inflammatory profiles distinguish myelin-reactive T cells from patients with multiple sclerosis TGF-β Affects the Differentiation of 768 Transposition of native 771 chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins 772 and nucleosome position Molecular regulation of effector and memory T cell 774 differentiation Cells Supports a Linear Differentiation Model and Highlights 777 Molecular Regulators of Memory Development The epigenetic 780 landscape of T cell exhaustion Changes in Chromatin Accessibility Occur in CD8+ T Cells Responding to Viral Infection Exhaustion-786 associated regulatory regions in CD8(+) tumor-infiltrating T cells Epigenetic landscapes 791 reveal transcription factors that regulate CD8+ T cell differentiation Individuality and variation of 794 personal regulomes in primary human T cells CD4+CD28+KIR+CD11a(hi) T cells correlate with disease activity and are characterized by a 798 23 pro-inflammatory epigenetic and transcriptional profile in lupus patients Guidance of 801 regulatory T cell development by Satb1-dependent super-enhancer establishment ATAC-seq for precision immune profiling Genetic determinants of 807 co-accessible chromatin regions in activated T cells across humans Simple combinations of 810 lineage-determining transcription factors prime cis-regulatory elements required for 811 macrophage and B cell identities featureCounts: an efficient general purpose program for assigning 813 sequence reads to genomic features edgeR: a Bioconductor package for differential 816 expression analysis of digital gene expression data Removing technical variability in RNA-seq data using 819 conditional quantile normalization Wellington: a novel method for the 822 accurate identification of digital genomic footprints from DNase-seq data An expansive 825 human regulatory lexicon encoded in transcription factor footprints TRANSFAC and 828 its module TRANSCompel: transcriptional gene regulation in eukaryotes Gene set 831 enrichment analysis: a knowledge-based approach for interpreting genome-wide expression 832 in the memory/naïve cell comparison (B) . Black indicates that the key TF binds to the respective 998 region. Rows and columns are clustered using binary distance and complete linkage. (C, D) Heatmap 999 representation of the binary distance matrix of the key TFs using the binary vector of binding ("bound" 1000 or "not bound") for distance calculation. The red color scale represents the distance; smaller distance 1001 represents more co-binding. Rows and columns are clustered using Euclidean distance and complete Supplementary Figure S3 Supplementary Figure S3 . Enrichment of footprints across peaks ranked based on differential accessibility. Peaks were ranked based on their differential accessibility in the indicated contrast using the -log10(FDR) × sign(log2(Fold Change)). Peaks are represented on the horizontal axis and the enrichment score of the set of footprints identified in the indicated cell population is represented on the vertical axis.