key: cord-0276037-qbjenza9
authors: Johnson, R. D.; Ding, Y.; Venkateswaran, V.; Bhattacharya, A.; Chiu, A.; Schwarz, T.; Freund, M.; Zhan, L.; Burch, K. S.; Caggiano, C.; Hill, B.; Rakocz, N.; Balliu, B.; Sul, J. H.; Zaitlen, N.; Arboleda, V. A.; Halperin, E.; Sankararaman, S.; Butte, M. J.; UCLA Precision Health Data Discovery Repository Working Group,; UCLA Precision Health ATLAS Working Group,; Lajonchere, C.; Geschwind, D. H.; Pasaniuc, B.
title: Leveraging genomic diversity for discovery in an EHR-linked biobank: the UCLA ATLAS Community Health Initiative
date: 2021-09-23
journal: nan
DOI: 10.1101/2021.09.22.21263987
sha: 8bf664135c2b5a534087c98e920b8827d4448a58
doc_id: 276037
cord_uid: qbjenza9

Large medical centers located in urban areas such as Los Angeles care for a diverse patient population and offer the potential to study the interplay between genomic ancestry and social determinants of health within a single medical system. Here, we introduce the UCLA ATLAS Community Health Initiative, a biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients. We leverage the unique genomic diversity of the patient population in ATLAS to explore the interplay between self-reported race/ethnicity and genetic ancestry within a disease context using phenotypes extracted from the EHR. First, we identify an extensive amount of continental and subcontinental genomic diversity within the ATLAS data that is consistent with the global diversity of Los Angeles; this includes clusters of ATLAS individuals corresponding to individuals with Korean, Japanese, Filipino, and Middle Eastern genomic ancestries. Most importantly, we find that common diseases and traits stratify across genomic ancestry clusters, thus suggesting their utility in understanding disease biology across diverse individuals. Next, we showcase the power of genetic data linked with EHR to perform ancestry-specific genome- and phenome-wide scans to identify genetic factors for a variety of EHR-derived phenotypes (phecodes). For example, we find ancestry-specific associations for liver disease, and link the genetic variants with neurological and neoplastic phenotypes primarily within individuals of admixed ancestries. Overall, our results underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping efforts linked with EHR-based phenotyping.

Linking electronic health records (EHRs) to patient genomic data within biobanks in a de-identified fashion has the potential to significantly advance genomic discoveries and precision medicine efforts (e.g., population screening, identifying drug targets) [1]-[4]. However, the underrepresentation of minoritized populations in biomedical research [5]-[11] raises concerns that advancements in precision medicine may widen disparities in access to high-quality health care [12]-[14]. For example, European-ancestry individuals constitute approximately 16% of the global population, yet account for almost 80% of all genome-wide association study (GWAS) participants [13]. As a direct result of this imbalance, existing methods to predict disease risk from genetics (e.g., polygenic risk scores) are vastly inaccurate in individuals of non-European ancestry [13], [15], thus forming a barrier for advancing genomic medicine to benefit patients of all ancestries.

The UCLA Health medical system is located in Los Angeles, one of the most ethnically diverse cities in the world. There is no ethnic majority: 48.5% of Los Angeles residents self-identify as Hispanic or Latino, 11.6% as Asian, and 8.9% as Black or African American; additionally, 37% of Los Angeles residents are neither U.S. nationals nor U.S. citizens at birth [16]. Therefore, the UCLA Health patient population and the availability of digital health data captured in EHRs from a single medical system present a unique opportunity to increase the inclusion of underrepresented minorities in biomedical research. We introduce the UCLA ATLAS Community Health Initiative (or ATLAS for brevity), a biobank embedded within the UCLA Health medical system composed of de-identified, EHR-linked genomic data from a diverse patient population. The initiative aims to collect data from over 150,000 individuals; the biobank currently consists of 26,414 individuals, each genotyped at 673,148 variants genome-wide using the Illumina Global Screening Array (GSA) [17]. The EHR contains a de-identified extract of medical records (billing codes, laboratory values, etc.) as well as demographic information such as self-reported race and ethnicity. It is important to note that self-reported race and ethnicity (SIRE) represent social constructs that capture shared values, cultural norms, and behaviors of subgroups [18]; they are distinct concepts from genetic ancestry, which refers to the history of one's genome with little to no relation to cultural aspects of identity. This difference is even more relevant for individuals self-describing as multi-racial (and/or admixed), where genetic ancestry bears little correlation to SIRE [19], [20].
Understanding the interplay of genetic factors (such as genetic ancestry) with social determinants of health (as inferred from self-reports) is still mired in the confounding overlaps between race, socioeconomic status, and disease, but serves as a critical step in mapping and predicting disease risk across individuals of all ancestries, thus enabling equitable genomic medicine to individuals of all ancestries.

In this work, we leverage the unique genomic diversity of the patient population in ATLAS to explore the interplay between self-reported race/ethnicity and genetic ancestry within a disease context using phenotypes extracted from the EHR within a single medical system. We cluster individuals by genetic ancestry within the EHR-linked biobank, systematically construct phenotypes from EHR, and compute disease associations using multi-ethnic pipelines for both genome-wide and phenome-wide association studies. We find that genetic ancestry and self-reported ancestry yield distinct subpopulations, thus emphasizing the distinction between genetic ancestry and self-reported race and ethnicities. We leverage genetic and self-reported data to find extensive variation of sub-continental ancestry within ATLAS across European, Asian, and American ancestries. For example, we find clusters of individuals with recent Filipino, Japanese, and Korean ancestry. Such sub-continental clusters also stratify individuals according to disease groups, thus emphasizing their utility in biomedical research. We perform genome-wide and phenome-wide association studies to recapitulate known genomic risk regions; as an example, focusing on chronic nonalcoholic liver disease, we recapitulate the 22q13.31 locus and perform a phenome-wide association study across 1,330 EHR-derived phenotypes at the lead SNP, rs2294915, across multiple populations. We describe genetic associations for liver-related phenotypes in multiple ancestry groups as well as associations with neurological and neoplastic phenotypes that are found exclusively in the Admixed American group. These results underscore that large-scale genetic analyses and deep phenotyping in diverse populations have substantial medical relevance for population health.

The UCLA Health patient population is diverse, with 63.3% self-reporting their race as White or Caucasian, 6.7% as Black or African American, 10.5% as Asian, 0.6% as American Indian or Alaska Native, 0.3% as Pacific Islander, and 18.6% as one of the additional races listed in detail in the Supplementary Materials (Figure 1A, Supplementary Table S1, S3). Additionally, 15.8% of individuals self-report their ethnicity as Hispanic or Latino; the remaining individuals self-report as non-Hispanic/Latino (Figure 1A, Supplementary Table S2, S3). We investigated genetically inferred ancestry through principal component analysis (PCA) [21], [22] to identify population clusters according to the five continental ancestry groups (Supplementary Table S4). Although we broadly find that SIRE is concordant with the inferred continental genetic ancestry, we find marked differences between genetically defined ancestry groups and SIRE, further emphasizing that genetic ancestry is a distinct concept from self-reported race and ethnicity. For example, we find that >10% of individuals within the European genetic ancestry group do not identify as Non-Hispanic/Latino - White/Caucasian (NH-WC) SIRE; 10% of individuals within the African genetic ancestry group do not self-report as Non-Hispanic/Latino - Black/African American (NH-AfAm); and >25% of the Admixed American genetic ancestry group do not identify as Hispanic/Latino - Other Race (HL-Other) or Hispanic/Latino - White/Caucasian (HL-WC) (Supplementary Table S5).

Further underscoring the distinction between genetic ancestry and SIRE, we reveal extensive genetic heterogeneity both between and within SIREs across orthogonal spectra from PCA (Figure 2A and 2B).

For example, most individuals who self-report as NH-AfAm lie along a cline between the African and European genetic ancestry clusters. We also observe that the individuals with inferred African ancestry from PCA form a considerably smaller cluster than the group of individuals in the NH-AfAm SIRE in ATLAS. Within ATLAS, we find that 1,426 individuals self-identify as NH-AfAm, but only 1,233 individuals are grouped into the African genetic ancestry cluster. This difference is likely because many individuals in ATLAS identify as African American, which suggests genetic admixture between African and European ancestry in this group. Conversely, there are fewer individuals in the Non-Hispanic/Latino - Asian (NH-Asian) SIRE (N=2,469) than those grouped into the East Asian and South Asian ancestry clusters (N=2,611). A similar trend follows for the NH-WC SIRE (N=14,328) and the European ancestry cluster (N=14,800). The majority of individuals who are included in the genetic ancestry clusters, but not the corresponding SIREs, had either unknown SIRE information or reported their race as 'Other Race', demonstrating how genetic ancestry inference can be advantageous when self-reported information is not known or individuals' race/ethnicity are not represented in patient questionnaires. However, 14% of individuals still have unclassifiable genetic ancestry (Supplementary Table S4), either because they are clustered into multiple ancestry groups or into no ancestry group at all. The latter could be due to extensive admixture in their genomes or the absence of relevant ancestral groups in the chosen reference panels.

Labeling individuals by self-reported preferred language, we observe trends that are consistent with both SIRE and continental genetic ancestry (Figure 2C). For example, 96.1% of individuals who report Spanish as their primary language were estimated to have Admixed American genetic ancestry. Additionally, 98.5% of individuals who report Japanese, Korean, Tagalog, Vietnamese, Mandarin Chinese, or Cantonese as their primary language were inferred to have East Asian genetic ancestry. We also see clusters of individuals who speak Armenian, Arabic, and Farsi/Persian; we find that 26.2% of the individuals who speak these languages could not be classified into one of the five continental ancestry groups within 1000 Genomes. This discrepancy is likely because the 1000 Genomes reference panel does not contain samples from regions where these languages are primarily spoken. These findings showcase the limitation of current reference panels of genetic diversity and demonstrate the value of characterizing individuals using both genetic ancestry and self-reported information.

Next, we inferred the ancestry of individuals within the ATLAS East Asian ancestry group (EAS) and Figure 3A, there are two notable clusters that do not match any of the East Asian subcontinental populations within 1000 Genomes. Projecting individuals' self-identified race over the PCs shows that the majority of individuals in these two clusters identify as 'Asian: Korean' and 'Asian: Filipino', respectively (Supplementary Figure 5A). This pattern is similarly reflected by the self-identified preferred languages, where many of these individuals speak Korean and Tagalog. This clustering not only characterizes the fine-scale genetic and ethnic diversity of ATLAS, but also emphasizes how the concepts of genetic ancestry and self-reported constructs, such as primary spoken language, can be combined to identify and label distinct genetic clusters that would not have been characterized based on a single criterion alone.

Next, we identify clusters of individuals with subcontinental genetic ancestry of European descent, but due to limitations in reference panels, we were unable to describe the origins of the majority of the observed genetic variation within the ATLAS European continental ancestry cluster (Figure 3B). Comparing self-reported race and ethnicity information does not delineate any subgroups since most individuals are within the NH-WC SIRE (Supplementary Figure S6A). Instead, we overlay individuals' self-reported preferred language over the projected PCs and observe clusters of individuals whose preferred languages are Arabic, Armenian, and Farsi/Persian; notably, the primary populations that speak these languages are not present in the current 1000 Genomes reference panel (Supplementary Figure S6B). Although not definitive about ancestral origins, these results suggest that individuals in these clusters may have cultural ties and/or genetic origins relating to the Middle East. We also observe two distinct clusters of individuals who speak Farsi/Persian (labeled as 'Farsi, Persian I' and 'Farsi, Persian II'), suggesting that although these groups may share cultural ties, the groups could have varying ancestral origins.

We perform a similar analysis for the Admixed American cluster of individuals. We are able to cluster 

A complementary method to principal components for inferring fine-scale ancestry is identical-by-descent (IBD) analysis [24]-[26]. Using pairwise IBD estimates for all individuals in ATLAS and reference population information from the 1000 Genomes Project [23], Simons Genome Diversity Project [27], and Human Genome Diversity Project [28], we describe fine-scale populations based on total pairwise IBD (Figure 4; see Methods). Each subgroup is annotated according to a combination of genetic ancestry from reference populations as well as self-reported race, ethnicity, and language information. Many subgroups have similar characteristics to those defined from PCA-based clustering, such as the Filipino and Dai Chinese clusters. We can also characterize subgroups not identified through the previous PCA analysis. For example, PCA-based clustering was only able to distinguish clusters at the level of continental African ancestry, whereas IBD clustering identified West African, East African, and Ethiopian subgroups.

In contrast, Japanese and Korean individuals form a single subgroup when estimated by the IBD clustering approach, whereas PCA-based clustering delineated these individuals into two separate groups. Note that the granularity of both IBD and PCA-based clustering depends on the clustering algorithm used, and here we report at only a single level of resolution. For further discussion of PCA and IBD for fine-scale population analyses, see Belbin et al. 2021 [29]. Our results show that fine-scale population identification is specific to each genetic ancestry inference method and that combining multiple methods can maximize the number of identified subgroups.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted September 23, 2021.

Many individuals do not fall within a single genetic ancestry cluster, but instead lie on the spectrum between multiple ancestry groups. We can characterize this variation through genetic admixture, the exchange of genetic information across two or more populations [30]. We estimate genetic ancestry proportions using k=4, 5, or 6 ancestral populations and visualize groups of individuals by SIRE (see Methods; Supplementary Figure S9). For the following analyses, we use k=4 ancestral populations where the clusters correspond to European, African, East Asian, and Native American ancestry. Among individuals in the HL-Other SIRE, the estimated average proportions are 49% European ancestry, 6% African ancestry, and 44% Native American ancestry (Supplementary Table S6). We also observe that the HL-Other and HL-WC (White or Caucasian) SIREs have approximately the same admixture profile, with European ancestry proportions of 49% and 58%, African ancestry proportions of 6% and 5%, and Native American ancestry proportions of 44% and 36%, respectively. However, there is also a large amount of variation within SIREs, where, for example, individuals who identify as Hispanic or Latino ethnicity are estimated to have European ancestry percentages ranging from nearly 0% to almost 100%.
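The contrast between group-level averages and within-group variation can be illustrated with a minimal sketch. The data below are hypothetical toy values (not ATLAS data), and the function and label names are assumptions made for illustration; the per-individual proportions are assumed to come from an admixture-style estimate with k=4 components.

```python
from collections import defaultdict

# Component order for each per-individual proportion vector (assumed labels).
ANCESTRIES = ("EUR", "AFR", "EAS", "NAT")

def mean_proportions(rows):
    """Average ancestry proportions within each SIRE group."""
    sums = defaultdict(lambda: [0.0] * len(ANCESTRIES))
    counts = defaultdict(int)
    for sire, props in rows:
        if abs(sum(props) - 1.0) > 1e-6:
            raise ValueError("ancestry proportions must sum to 1")
        for i, p in enumerate(props):
            sums[sire][i] += p
        counts[sire] += 1
    return {
        sire: dict(zip(ANCESTRIES, (s / counts[sire] for s in totals)))
        for sire, totals in sums.items()
    }

# Hypothetical individuals: two people in the same SIRE with very different
# European ancestry proportions still average to an intermediate value.
toy = [
    ("HL-Other", (0.10, 0.05, 0.00, 0.85)),
    ("HL-Other", (0.90, 0.05, 0.00, 0.05)),
    ("HL-WC", (0.60, 0.05, 0.00, 0.35)),
]
result = mean_proportions(toy)
```

In this toy example the HL-Other group mean for European ancestry is about 0.5 even though neither individual is close to that value, which is why the range within a SIRE is reported alongside the average.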

Understanding how disease prevalence varies across populations is integral to understanding how the interplay of genetic factors and social determinants of health contributes to disease risk. We investigate 1,330 EHR-derived phenotypes (phecodes) [31] spanning a wide range of disease categories (see Methods) and identify 1,401 total significant phecode-ancestry associations (p < 3.8e-5) across the 5 continental ancestry groups after adjusting for age and sex (Supplementary Table S7). Overall, there are 659 phenotypes that show cross-ancestry differences, where a phenotype is significantly associated with a particular ancestry group compared to the rest of the population. From this set, the highest numbers of phecodes are from the circulatory (N=84), endocrine/metabolic (N=74), and digestive (N=80) system-related groups. Specifically, we recapitulate many known associations such as liver and intrahepatic bile duct cancer (p=6.97e-35) within the East Asian ancestry population [32]-[34], skin cancer (p=2.02e-162) in the European ancestry population [35], [36], hereditary hemolytic anemia (p=2.4e-22) [37] and primary open-angle glaucoma (POAG) (p=5.33e-12) [38], [39] within the African ancestry population, as well as both alcoholic liver damage (p=2.0e-47) and cirrhosis of liver without mention of alcohol (p=4.84e-70) [40]-[43] in the Admixed American population (Figure 5). Next, as an example, we analyze phecodes spanning different traits where we observe a significantly higher prevalence for at least one continental ancestry group per trait. For example, we observe that both schizophrenia (freq=0.02, SE=0.004) and sickle cell anemia (freq=0.03, SE=0.005) have their highest prevalence in the African ancestry group in ATLAS, which is consistent with previous findings [44], [45].

We also observe substantial disease risk heterogeneity across subgroups of the same continental ancestry.

We compute the prevalence for the same set of diseases across subgroups within the East Asian ancestry group (Korean, Japanese, Filipino, Chinese, and Vietnamese) in ATLAS and compare this with the aggregated East Asian ancestry group. The estimated prevalence of type 2 diabetes in the East Asian ancestry group is 0.26 (SE=0.009). However, analysis of specific subgroups shows a significant increase in the prevalence of type 2 diabetes for individuals in the Japanese (freq=0.33, SE=0.03) and Filipino (freq=0.32, SE=0.02) subgroups compared to the Chinese subgroup (freq=0.21, SE=0.01). These results indicate that genetically grouping individuals across sub-continental ancestries yields a meaningful interpretation of disease risk across groups of individuals.
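As a rough sanity check on the subgroup comparison, the reported frequencies and standard errors can be combined into an approximate z statistic. This is only a sketch: the text does not state which test was used, and the calculation below assumes the subgroup estimates are independent and approximately normal. The function names are hypothetical.

```python
import math

def z_from_estimates(p1, se1, p2, se2):
    """Approximate z statistic for the difference of two independent estimates."""
    return (p1 - p2) / math.sqrt(se1 ** 2 + se2 ** 2)

def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Type 2 diabetes prevalence (freq, SE) as reported in the text.
chinese = (0.21, 0.01)
japanese = (0.33, 0.03)
filipino = (0.32, 0.02)

z_jpn = z_from_estimates(*japanese, *chinese)   # roughly 3.8
z_fil = z_from_estimates(*filipino, *chinese)   # roughly 4.9
p_jpn = two_sided_p(z_jpn)
p_fil = two_sided_p(z_fil)
```

Under these assumptions both differences are several standard errors from zero, which is in line with the increase described in the text.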

We also investigated disease prevalence within admixed individuals, where variation in genetic ancestry across individuals in the population allows for the correlation of disease risk with the proportion of genetic ancestry from any given continental group. Within each SIRE group, we perform an association test between the proportions of inferred ancestry estimated from ADMIXTURE [46] and each phenotype (see Methods; Supplementary Table S8). After correcting for the number of tested phenotypes, we find numerous significant phenotype-ancestry associations: 113 associations within the HL-Other SIRE, 62 within the NH-WC SIRE, and 48 within the NH-Asian SIRE. However, we do not find any significant associations within the NH-AfAm SIRE, which could be due to the smaller sample size. Across SIREs, both the top associated phenotype categories as well as the direction of the associations greatly vary. Out (Figure 6A), which is consistent with previous studies [47], [48]; we additionally find a similar trend for 'other chronic nonalcoholic liver disease' (Figure 6B). These results suggest that not only are some disease statuses associated with SIRE and continental genetic ancestry, but the specific ancestry proportions may also correlate with disease risk.

EHR-linked biobanks also offer the opportunity to investigate genetic associations with traits across the genome. These efforts impose special challenges, such as adjusting for population stratification and cryptic relatedness in health systems that serve entire families, as well as extracting phenotypes from EHR, namely due to inconsistencies in mapping diagnosis codes (ICD codes) to phenotypes and difficulties in defining appropriate controls for specific phenotypes. Here, we implemented the phecode system (v1.2) [31], [49] within a GWAS pipeline that accounts for population stratification (see Methods). As an example, we present results for phecode 571.5. In the EUR study, we find three SNPs that pass genome-wide significance (p < 5e-8), and 70 SNPs reach significance in the AMR study (Supplementary Table S9). All genome-wide significant SNPs from both studies fall within the 22q13.31 locus, which contains the PNPLA3 gene. This gene has been extensively studied for its role in the risk of various liver diseases such as nonalcoholic fatty liver disease [50], [51]. The lead SNP from both analyses, rs2294915, is an intronic variant in the PNPLA3 gene and has MAF=0.45 in the AMR group but only MAF=0.24 in the EUR group. A nearby SNP, though not directly tested due to quality control filtering, is rs738409, a missense variant in PNPLA3 that has been well documented for its role in the susceptibility to several types of liver disease [52]. Using measurements of LD from the 1000 Genomes reference panel, we find that rs2294915 is in high LD with rs738409 in the AMR analysis (R²=0.94) as well as in the EUR analysis, although to a slightly lesser extent (R²=0.85) [53].
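For readers unfamiliar with the LD measure quoted above, the following sketch shows the standard r² calculation between two biallelic sites from phased haplotypes. The haplotypes are hypothetical toy data, not 1000 Genomes data, and the function name is an assumption; in practice such values are typically obtained from dedicated tools rather than computed by hand.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic sites from phased haplotypes.

    Each haplotype is a pair (a, b), with 1 for the alternate allele and
    0 for the reference allele at site A and site B respectively.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom

# Hypothetical phased haplotypes: mostly concordant, one discordant.
toy = [(1, 1)] * 4 + [(0, 0)] * 5 + [(1, 0)] * 1
r2 = ld_r2(toy)

# Perfectly correlated sites give r^2 = 1 (up to floating-point error).
perfect = ld_r2([(1, 1)] * 3 + [(0, 0)] * 7)
```

An r² close to 1 means the tested SNP is a good proxy for the untested one, which is the sense in which rs2294915 tags rs738409 here.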

Next, we leverage GWAS for all existing phecodes to investigate the association of the lead variant, rs2294915, across all 1,330 EHR-derived phenotypes (i.e. a phenome-wide association study: PheWAS).

After adjusting for both genome-wide significance and the number of phenotypes (p < 3.8e-11), we find that only the liver-related phenotypes within the AMR study reach significance (Figure 8). Additionally, multiple neoplastic and neurological phenotypes, which are comorbidities with severe liver disease [50], [54]-[56], are nominally significant only in the AMR study after adjusting for the number of tested phenotypes (p < 3.8e-5). These findings suggest possible differential genetic architecture across these two populations, as well as variation even at the phenotype level, reflecting possible genetic or environmental modifiers of important comorbidities.
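The two thresholds quoted above appear consistent with a Bonferroni correction over the 1,330 tested phenotypes. The short calculation below reproduces them; the base error rate of 0.05 is an assumption (the text does not state it explicitly), while the phenotype count and the genome-wide threshold of 5e-8 come from the text.

```python
N_PHENOTYPES = 1330   # EHR-derived phecodes tested (from the text)
ALPHA = 0.05          # assumed family-wise error rate
GENOME_WIDE = 5e-8    # genome-wide significance threshold (from the text)

def bonferroni(alpha, n_tests):
    """Per-test threshold that controls the family-wise error rate at alpha."""
    return alpha / n_tests

phenome_threshold = bonferroni(ALPHA, N_PHENOTYPES)
combined_threshold = bonferroni(GENOME_WIDE, N_PHENOTYPES)

print(f"{phenome_threshold:.1e}")   # 3.8e-05
print(f"{combined_threshold:.1e}")  # 3.8e-11
```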

In this work, we introduce the ATLAS Community Health Initiative, a biobank embedded within the UCLA Health medical system composed of de-identified EHR-linked genomic data from a diverse patient population. The UCLA Health system serves Los Angeles County, leading to a study population of great demographic, genetic, and phenotypic diversity. We investigate ancestry at both the continental and the subcontinental population level and find that genetic ancestry and self-reported demographic information yield distinct subpopulations in the ATLAS biobank. We present a collection of results cataloguing the associations between genetic ancestry and EHR-derived phenotypes, where we find that disease status is associated not only with continental genetic ancestry but also with the specific admixture profile describing an individual. We use multi-ethnic pipelines to recapitulate known associations for chronic nonalcoholic liver disease at the 22q13.31 locus and perform a phenome-wide association study at the lead SNP, where we find associations with neurological and neoplastic phenotypes exclusively in the Admixed American ancestry group. As the sample size increases, the ATLAS Community Health Initiative will enable rigorous genetic and epidemiological studies to further understand the role of genetic ancestry in disease etiology, with a specific aim to accelerate genomic medicine in diverse populations. Already, the ATLAS biobank accounted for 73.4% of the Admixed American samples utilized in the primary analysis from the COVID-19 Host Genetics Initiative [57].

As the field moves forward with increased collaboration between the genetics and healthcare communities, it is of utmost importance to also be aware of potential pitfalls that may occur when translating research findings into actual clinical populations. Currently, many clinical protocols are deeply ingrained with racial bias, however benign the original intent [58]-[62]. Many of these flawed policies stemmed from erroneously linking race, a social rather than biological construct, with disease risk without presenting any biological justification. Although race and genetic ancestry are correlated [63], [64], our work shows that populations constructed from these two concepts are not analogous. We encourage protocol decisions that are rooted in actual biological phenomena, such as genetic markers, providing transparent, immutable criteria. For example, Benign Ethnic Neutropenia (BEN) is observed predominantly in African Americans, but is specifically and strongly associated with the variant at rs2814778 [65], [66]. Recent studies have suggested that genotype screening at rs2814778 could aid in the interpretation of neutropenia in African Americans, avoid unnecessary invasive procedures, and increase the inclusion of these individuals in various treatments [67]. Additionally, the Kidney Donor Risk Index (KDRI) equation uses race as a risk factor [68], but it has recently been proposed to use the presence of APOL1 variants as a factor instead [69]. Discovered after the creation of the KDRI, the presence of these variants was shown to be associated with shorter allograft longevity [70]. Despite this finding, the original KDRI score is still commonly used.
In order to remedy and not perpetuate current healthcare inequalities, we underscore the importance of favoring transparent clinical protocols with clear biological justification instead of race-adjusted formulas that leverage convenience at the expense of potential inequities.

There are various limitations within our study, and we describe a few of these in detail as follows. First, the phenotypes are based on ICD codes, and due to the nature of billing codes, this form of labeling does not constitute a formal patient diagnosis and may include individuals who do not have the specific disease. This uncertainty in phenotyping likely limits the power of our study to find disease associations. For further investigation into specific phenotypes, we recommend refining each phenotype definition based on additional disease-specific factors and metrics. Additionally, although ICD codes are an international standard, there are still deviations between different institutions in how specific diagnoses are recorded. This adds further heterogeneity in phenotyping and could present future challenges when replicating studies or porting algorithms to other institutions. Second, due to the de-identified nature of the data, we lack information that could help us better describe the fine-scale population groups. For example, location of birth, zip code, and family history have been shown to be useful descriptors for determining subgroups of genetic ancestry [29]. This geographic information could also be used as a proxy for various environmental exposures such as pollution. Additional socioeconomic information, such as income and availability of health insurance, could likely account for a portion of observed associations as well as provide more insight into the socioeconomic determinants of health. Third, our findings within the African and South Asian ancestry populations are limited due to the smaller sample sizes. As sample sizes increase, we hope to further refine population substructure within these initial continental ancestry groups and have the power to detect novel disease associations that have previously been obscured by a lack of statistical power.

We conclude by discussing directions for future work. Although we investigate admixed populations, such as African American and Hispanic/Latino populations, admixed individuals who do not fall under these groups are excluded from downstream analyses due to concerns over population structure. In the future, we hope to incorporate methods and pipelines that properly control for population structure in all types of admixed populations. Additionally, we plan to compute polygenic risk scores (PRS) across all 5 continental ancestry groups. PRS have already shown modest clinical utility for diseases such as breast cancer [71] and cardiovascular disease [72], but accurate prediction across populations has proven difficult [13]. The genetic diversity within the ATLAS Community Health Initiative biobank, partnered with the longitudinal clinical data, provides a unique resource to further explore the role of ancestry in PRS prediction. Furthermore, as the size of the biobank grows and more data is collected over time, we hope to explore even more individualized health solutions and interventions.

Currently, there are more than 27,000 genotyped participants with their de-identified EHR linked through the DDR. Patients' participation is voluntary and their privacy is protected by de-identifying the samples.

Participants self-report their race and ethnicity via two distinct multiple-choice fields, one for race and one for ethnicity (see Supplementary Tables S1 and S2 for the full list). Only one selection from each category can be chosen as a patient's primary race and ethnicity. We group together race/ethnicity pairings to form 21 'self-reported race/ethnicity' (SIRE) groupings (see Supplementary Table S3). Patients also report their 'Preferred Language' from multiple-choice fields.

Genotype quality control. Biosamples collected through the UCLA ATLAS Community Health Initiative in the form of blood samples were de-identified and then processed for DNA extraction and genotyping.

ATLAS participants (N=26,439) were genotyped on a custom Illumina Global Screening Array (GSA) [17] that included a standard GWAS backbone and an additional set of pathogenic variants selected from ClinVar [73]. We filtered out poor-quality markers by removing variants with >5% missingness (M=9,318 variants removed) and then removed strand-ambiguous SNPs (M=7,686). We excluded samples with missingness >5% (N=3 individuals removed) and kept one individual from each set of twins or duplicated individuals (N=22 individuals removed). All quality control steps were conducted using PLINK 1.9 [74], and duplicate individuals/twins were identified using KING [75]. These steps resulted in a total of 673,148 variants and 26,414 individuals.
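To make the two variant-level filters concrete, the following is a minimal Python sketch of a missingness check and a strand-ambiguity check. It is not the authors' pipeline, which ran these steps in PLINK 1.9; the function names, the use of `None` for a missing genotype call, and the input layout are assumptions made for illustration.

```python
# Illustrative sketch only: the study performed these filters with PLINK 1.9.
# Function names and the data layout are hypothetical.

# A/T and C/G SNPs read the same on both strands, so strand cannot be resolved.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """Return True for A/T and C/G SNPs, whose alleles are complementary."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def missing_rate(calls: list) -> float:
    """Fraction of genotype calls that are missing (encoded here as None)."""
    return sum(call is None for call in calls) / len(calls)


def passes_variant_qc(allele1: str, allele2: str, calls: list,
                      max_missing: float = 0.05) -> bool:
    """Keep a variant if missingness is at most 5% and it is not strand-ambiguous."""
    if missing_rate(calls) > max_missing:
        return False
    return not is_strand_ambiguous(allele1, allele2)
```

The same `missing_rate` helper could be applied per sample rather than per variant to mirror the >5% sample-missingness filter.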

With the M=673,148 variants that passed quality control, we used KING [75] to compute pairwise kinship coefficients and determine family relationships. We identified a set of unrelated individuals (N=25,842) by retaining individuals with kinship coefficients <0.0884 ('king --unrelated'). Additionally, we identified 22 duplicate individuals or twins, 213 parent-offspring relatives, and 117 first-degree relatives. This level of relatedness is expected since families will often be members of the same healthcare system.
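As an illustration of how a kinship coefficient maps to a relationship category, the sketch below classifies a single pairwise estimate. The cutoffs are the inference ranges given in the KING documentation; the function names are hypothetical, and the sketch only labels pairs. It does not reproduce how 'king --unrelated' selects a set of unrelated individuals.

```python
# Illustrative sketch: cutoffs follow the relationship-inference ranges in the
# KING documentation; function names are hypothetical.

def relationship_from_kinship(phi: float) -> str:
    """Map an estimated kinship coefficient to a relationship category."""
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi >= 0.177:
        return "first-degree"
    if phi >= 0.0884:
        return "second-degree"
    if phi >= 0.0442:
        return "third-degree"
    return "unrelated"


def is_closely_related(phi: float, cutoff: float = 0.0884) -> bool:
    """True if a pair meets or exceeds the cutoff used in the text (0.0884)."""
    return phi >= cutoff
```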

After performing array-level genotype quality control, the PLINK-formatted genotypes were converted to VCF format and uploaded to the Michigan Imputation Server [76]. At the variant level, the server removes a variant if it has an allele other than A, C, G, or T; is monomorphic; is a duplicate; has an allele mismatch between the reference panel and the provided data; is an insertion-deletion; or has a SNP call rate below 90%. The server also corrects any strand flips or allele switches needed to match the reference panel. The server then phases the data using Eagle v2.4 [77], and imputation is performed against the TOPMed Freeze5 imputation panel [78] using minimac4 [79]. In summary, the explicit parameters used on the server are "TOPMed Freeze5" for the reference panel, "GRCh38/hg38" for the array build, "off" for the rsq filter, "Eagle v2.4" for phasing, no QC frequency check, and "quality [...]

Additionally, we did not compute explicit PC thresholds for the European subcontinental clusters.

GWAS quality control per ancestry. We limited our analyses to N=25,842 unrelated individuals and then performed additional quality control steps within each continental ancestry group for GWAS (European, African, Admixed American, East Asian, South Asian). SNPs that violated Hardy-Weinberg equilibrium (HWE) with p<1e-12 were excluded. Individuals whose heterozygosity rate fell more than 3 standard deviations from the ancestry-specific mean were also excluded. Analyses were restricted to SNPs that were common (MAF>1%) within each ancestry group.

[...] Project [27], and the Human Genome Diversity Project [28]. In total, 418,195 SNPs were kept for IBD analysis after removing SNPs with missingness >=10% and retaining those with MAF>1%. The merged dataset was then statistically phased using Shapeit4 [81]. IBD was called with iLASH using default parameters [82]. For downstream analysis, IBD segments were summed between individuals to create an adjacency matrix, where each row represented a pair of individuals, and each column represented the total genome-wide IBD between those two individuals. Using KING [75], the adjacency matrix was filtered to remove rows representing pairs of individuals who were third-degree relatives or closer. Communities are annotated using the presence of reference individuals in a cluster and EHR characteristics, such as preferred language and self-reported race/ethnicity.

Genetic admixture analysis. We inferred the proportion of genetic ancestry using the ADMIXTURE software [46] under the unsupervised clustering mode with the number of clusters k=4, 5, 6. For each SIRE, we compare the ancestry proportions from the clusters. For k=4, we assign the cluster with the majority of NH-WC individuals as European ancestry, the cluster with the majority of NH-AfAm individuals as African ancestry, the cluster with the majority of NH-Asian individuals as East Asian ancestry, and the cluster with the highest number of HL-Other and HL-WC individuals as Native American ancestry.
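The step in which IBD segments are summed between individuals can be sketched as follows. This is a minimal illustration and not the authors' code: the `(id1, id2, length_cM)` tuple layout is an assumed simplification rather than the iLASH output format, and `drop_related_pairs` stands in for the KING-based filtering with a hypothetical pre-computed set of related pairs.

```python
from collections import defaultdict

# Illustrative sketch: the segment layout (id1, id2, length in cM) is an
# assumption, not the iLASH output format.

def total_ibd_per_pair(segments):
    """Sum IBD segment lengths for each unordered pair of individuals."""
    totals = defaultdict(float)
    for id1, id2, length_cm in segments:
        pair = tuple(sorted((id1, id2)))  # treat (A, B) and (B, A) as one pair
        totals[pair] += length_cm
    return dict(totals)


def drop_related_pairs(totals, related_pairs):
    """Remove pairs flagged as close relatives (hypothetical pre-computed set)."""
    related = {tuple(sorted(pair)) for pair in related_pairs}
    return {pair: total for pair, total in totals.items() if pair not in related}
```

The resulting pair-to-total mapping is one way to hold the weighted edges of the IBD network on which communities are then detected.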

Phecodes. We aggregated billing (ICD9/ICD10) codes into more clinically informative groupings known as phecodes [31]. We derived phecodes from ICD codes in the EHR using the mappings described in the PheWAS catalogue (Phecode Map 1.2) [83]. To define case/control phenotypes, we treated individuals with at least one occurrence of a given phecode as cases and all other individuals as controls. We restricted our analyses to phecodes that had >100 cases in ATLAS, yielding a total of 1,330 phenotypes.
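The case/control definition above can be illustrated with a short Python sketch. The dictionary-based mapping is a simplification assumed here (in a real phecode map, one ICD code may map to more than one phecode), and the function names and input layout are hypothetical rather than taken from the study's pipeline.

```python
# Illustrative sketch: a one-to-one ICD-to-phecode dictionary is a simplifying
# assumption; names and data layout are hypothetical.

def build_case_sets(patient_icd_codes, icd_to_phecode):
    """Return {phecode: set of patient IDs with at least one mapped ICD code}."""
    cases = {}
    for patient_id, codes in patient_icd_codes.items():
        for code in codes:
            phecode = icd_to_phecode.get(code)
            if phecode is not None:
                cases.setdefault(phecode, set()).add(patient_id)
    return cases


def filter_by_case_count(cases, min_cases=100):
    """Keep phecodes with more than min_cases cases (>100 in the text)."""
    return {phecode: ids for phecode, ids in cases.items() if len(ids) > min_cases}


def is_case(patient_id, phecode, cases):
    """Everyone not recorded as a case for a phecode is treated as a control."""
    return patient_id in cases.get(phecode, set())
```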

Association between phecodes and genetic ancestry. We tested for marginal association between each phecode and each continental genetic ancestry group using a logistic regression model, adjusting for age and sex. Statistical significance was determined after correcting for the number of tested phecodes (p<3.8e-5).

We performed a marginal regression between each of the ancestry proportions estimated from ADMIXTURE with k=4 (European, African, East Asian, and Native American ancestry) and 1,300 EHR-derived phenotypes (phecodes) within each of the 7 ATLAS SIRE groups (NH-WC, NH-AfAm, HL-Other, HL-WC, NH-Asian, NH-PI, NH-AmIn). Additional details on computing admixture proportions can be found under the section Genetic admixture analysis. We additionally adjusted for age and sex in the regression. Only traits with >10 cases per SIRE were tested. Significance was determined after adjusting for the number of tested phenotypes (p<3.8e-5).

GWAS for 'Other chronic nonalcoholic liver disease'. We tested for association between all imputed autosomal variants and 'Other chronic nonalcoholic liver disease' within the European (N-Case=2,275; N-Controls=14,155) and Admixed American (N-Case=919; N-Controls=3,262) continental ancestry groups.

Using PLINK 1.9, we performed a marginal association test at each SNP using a logistic regression model ('plink --logistic beta') where we adjusted for age, sex, and PCs 1-10. Quantile-quantile plots and genomic inflation factors (EUR λGC = 1.02; AMR λGC = 1.01) provide evidence that both analyses are well-calibrated (Supplementary Figure S10).
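For readers unfamiliar with the genomic inflation factor, the sketch below shows one common way to compute λGC from a list of association p-values: each p-value is converted to a 1-degree-of-freedom chi-square statistic, and the median statistic is divided by the median of the null chi-square distribution (about 0.4549). This is an assumed, standard definition and not necessarily the exact procedure used for the values reported above; the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Estimate lambda_GC from two-sided p-values in the interval (0, 1]."""
    norm = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2); z**2 is the
    # matching 1-df chi-square statistic.
    chi2_stats = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

Under the null, p-values are uniform and the estimate is close to 1; values well above 1 suggest inflation from, for example, residual population structure.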

We tested for association between each typed SNP and 1,330 phecodes. Due to the number of tests, we only performed associations at genotyped SNPs. To determine significance, we used a stringent threshold that corrects for both the number of tested phenotypes and genome-wide significance (p<3.8e-11) and a less stringent threshold that corrects only for genome-wide significance (p<5e-8).
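The significance thresholds quoted in this section are consistent with a Bonferroni correction over 1,330 phecodes, as the short sketch below shows. The baseline significance level of 0.05 is an assumption here (it is not stated explicitly in this section), and the function name is hypothetical.

```python
# Illustrative sketch: alpha = 0.05 is an assumed baseline significance level.

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


N_PHECODES = 1330

# Phenome-wide threshold: correct 0.05 for the number of tested phecodes.
phenome_wide = bonferroni_threshold(0.05, N_PHECODES)

# Stringent PheWAS threshold: correct the genome-wide level (5e-8) for the
# number of tested phecodes.
genome_and_phenome_wide = bonferroni_threshold(5e-8, N_PHECODES)

print(f"{phenome_wide:.1e}", f"{genome_and_phenome_wide:.1e}")
```

Rounded to two significant figures, these reproduce the p<3.8e-5 and p<3.8e-11 thresholds used in the text.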

Genetic ancestry

This version posted September 23, 2021.

Cluster annotation labels were determined using a combination of known genetic ancestry samples from 1000 Genomes and self-reported race, ethnicity, and language information from the EHR.

medRxiv preprint doi: https://doi.org/10.1101/2021.09.22.21263987

Figure 4: IBD sharing between ATLAS participants. InfoMap community membership is indicated by color for all communities with greater than 100 individuals (20 communities total) and individuals with a degree greater than 30. Community membership indicates elevated shared IBD within that community. Community identity is labelled adjacent to the network plot in the corresponding color.

Individuals who self-report as "Hispanic/Latino - Other Race" (HL-Oth) (N=2,206) and have had a diagnosis of (A) "Other chronic nonalcoholic liver disease" or (B) type 2 diabetes are binned by their proportions of European, African, and Native American ancestry estimated using ADMIXTURE. Bins are defined by the proportion of each ancestral population in increments of 0.20. Within each bin, we plot the prevalence of the diagnoses with error bars of +/-1.96 standard errors (SE) around the computed frequencies.
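The binning described in this caption can be sketched in a few lines of Python. The sketch assumes a binomial standard error, sqrt(p(1-p)/n), for the prevalence within each bin; the caption does not state how the standard errors were computed, so this choice, the function names, and the `(proportion, is_case)` input layout are all assumptions.

```python
import math

# Illustrative sketch: the binomial standard error is an assumption; names and
# the input layout are hypothetical.

def bin_index(proportion: float, n_bins: int = 5) -> int:
    """Assign an ancestry proportion in [0, 1] to one of n_bins equal-width bins."""
    # The small epsilon guards against floating-point error at bin edges;
    # a proportion of exactly 1.0 falls in the last bin.
    return min(int(proportion * n_bins + 1e-9), n_bins - 1)


def prevalence_by_bin(records, n_bins: int = 5):
    """records: iterable of (ancestry_proportion, is_case) pairs.

    Returns {bin: (prevalence, lower, upper)} using +/-1.96 binomial SE.
    """
    counts = [[0, 0] for _ in range(n_bins)]  # [number of cases, bin size]
    for proportion, is_case in records:
        b = bin_index(proportion, n_bins)
        counts[b][0] += int(is_case)
        counts[b][1] += 1
    summary = {}
    for b, (n_cases, n_total) in enumerate(counts):
        if n_total == 0:
            continue  # skip empty bins
        p = n_cases / n_total
        se = math.sqrt(p * (1 - p) / n_total)
        summary[b] = (p, p - 1.96 * se, p + 1.96 * se)
    return summary
```

With five bins the bin width is 0.20, matching the increments in the caption.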

European (bottom) ancestry groups across 1,330 phecode phenotypes. The red dotted line denotes p=3.8e-11, the significance threshold after adjusting the genome-wide significance threshold for 1,330 tested phenotypes. The red dashed line denotes p=3.8e-5, the significance threshold after correcting for only the 1,330 tested phenotypes.

Electronic health records and polygenic risk scores for predicting disease risk

Phenotypic signatures in clinical data enable systematic identification of patients for genetic testing

Phenotype risk scores identify patients with unrecognized mendelian disease patterns

Personalized Medicine and the Power of Electronic Health Records

Representation of American blacks in clinical trials of new drugs

Participation in cancer clinical trials: race-, sex-, and age-based disparities

The Missing Diversity in Human Genetic Studies

Volunteer Participation in the Health eHeart Study: A Comparison with the US Population

Diversity of Enrollment in Prostate Cancer Clinical Trials: Current Status and Future Directions

Assessment of the Inclusion of Racial/Ethnic Minority, Female, and Older Individuals in Vaccine Clinical Trials

Machine Learning and Health Care Disparities in Dermatology

Clinical use of current polygenic risk scores may exacerbate health disparities

The effect of race and sex on physicians' recommendations for cardiac catheterization

Polygenic risk score for schizophrenia is more strongly associated with ancestry than with schizophrenia

QuickFacts: Los Angeles city, California

Infinium Global Screening Array-24 Kit | Population-scale genetics

Racial identity among Hispanics: implications for health and well-being

Hidden in Plain Sight - Reconsidering the Use of Race Correction in Clinical Algorithms

Race and Genetic Ancestry in Medicine - A Time for Reckoning with Racism

Principal Component Analysis and Factor Analysis

Genes mirror geography within Europe

A global reference for human genetic variation

The Variance of Identity-by-Descent Sharing in the Wright-Fisher Model

Identity inference of genomic data using long-range familial searches

Length Distributions of Identity by Descent Reveal Fine-Scale Demographic History

The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

Insights into human genetic variation and population history from 929 diverse genomes

Toward a fine-scale population health monitoring system

A genetic atlas of human admixture history

PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations

The burden of liver cancer in Asians and Pacific Islanders in the Greater San Francisco Bay Area

Cancer health disparities among Asian Americans: what we do and what we need to do

SEER Statistics

Incidence Estimate of Nonmelanoma Skin Cancer (Keratinocyte Carcinomas) in the U

Global Prevalence of Glaucoma and Projections of Glaucoma Burden through 2040: A Systematic Review and Meta-Analysis

Prevalence of primary open angle glaucoma in the last 20 years: a meta-analysis and systematic review

Prevalence of Nonalcoholic Fatty Liver Disease in the United States: The Third National Health and Nutrition Examination Survey

Ethnicity and nonalcoholic fatty liver disease

Genetic predisposition and increasing dietary fructose exposure: The perfect storm for fatty liver disease in Hispanics in the

Chronic Liver Disease in the Hispanic Population of the United States

Race, schizophrenia, and admission to state psychiatric hospitals

Disorders of Hemoglobin: Genetics, Pathophysiology, and Clinical Management

Fast model-based estimation of ancestry in unrelated individuals

African ancestry and its correlation to type 2 diabetes in African Americans: a genetic admixture analysis in three U.S. population cohorts

Influence of genetic ancestry and socioeconomic status on type 2 diabetes in the diverse Colombian populations of Chocó and Antioquia

Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data

Association of PNPLA3 with non-alcoholic fatty liver disease in a minority cohort: the Insulin Resistance Atherosclerosis Family Study

PNPLA3 gene in liver diseases

The rs738409 (I148M) variant of the PNPLA3 gene and cirrhosis: a meta-analysis

LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants

Neurological and neuropsychiatric syndromes associated with liver disease

Neurologic Manifestations of Chronic Liver Disease and Liver Cirrhosis

Cancer and liver cirrhosis: implications on prognosis and management

Mapping the human genetic architecture of COVID-19

Hidden in Plain Sight - Reconsidering the Use of Race Correction in Clinical Algorithms

Reconsidering the Consequences of Using Race to Estimate Kidney Function

Race, Ancestry, and Medical Research

The Case for Removing Race From the American Academy of Pediatrics Clinical Practice Guideline for Urinary Tract Infection in Infants and Young Children With Fever

Dissecting racial bias in an algorithm used to manage the health of populations

Categorization of humans in biomedical research: genes, race and disease

Implications of biogeography of human populations for 'race' and medicine

Reduced Neutrophil Count in People of African Descent Is Due To a Regulatory Variant in the Duffy Antigen Receptor for Chemokines Gene

Benign Ethnic Neutropenia

Association Between a Common, Benign Genotype and Unnecessary Bone Marrow Biopsies Among African American Patients

A comprehensive risk quantification score for deceased donor kidneys: The kidney donor risk index

Effect of Replacing Race With Apolipoprotein L1 Genotype in Calculation of Kidney Donor Risk Index

APOL1 Kidney Risk Alleles: Population Genetics and Disease Associations

The role of polygenic risk and susceptibility genes in breast cancer over the course of life

Polygenic Risk Scores and Coronary Artery Disease

ClinVar: Improving access to variant interpretations and supporting evidence

Second-generation PLINK: rising to the challenge of larger and richer datasets

Robust relationship inference in genome-wide association studies

Next-generation genotype imputation service and methods

Reference-based phasing using the Haplotype Reference Consortium panel

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

minimac2: faster genotype imputation

FlashPCA2: principal component analysis of Biobank-scale genotype datasets

Accurate, scalable and integrative haplotype estimation

Rapid detection of identity-by-descent tracts for mega-scale datasets