key: cord-0284149-ggln0sjd authors: Qi, M.; Cahan, O.; Foreman, M. A.; Gruen, D. M.; Das, A. K.; Bennett, K. P. title: Quantifying representativeness in randomized clinical trials using machine learning fairness metrics date: 2021-06-28 journal: nan DOI: 10.1101/2021.06.23.21259272 sha: 1a4c2a129e8bef923dd01719f78be4609720f718 doc_id: 284149 cord_uid: ggln0sjd Objective We formulate population representativeness of randomized clinical trials (RCTs) as a machine learning (ML) fairness problem, derive new representation metrics, and deploy them in visualization tools which help users identify subpopulations that are underrepresented in RCT cohorts with respect to national, community-based or health system target populations. Materials and Methods We represent RCT cohort enrollment as random binary classification fairness problems, and then show how ML fairness metrics based on enrollment fraction can be efficiently calculated using easily computed rates of subpopulations in RCT cohorts and target populations. We propose standardized versions of these metrics and deploy them in an interactive tool to analyze three RCTs with respect to type-2 diabetes and hypertension target populations in the National Health and Nutrition Examination Survey (NHANES). Results We demonstrate how the proposed metrics and associated statistics enable users to rapidly examine representativeness of all subpopulations in the RCT defined by a set of categorical traits (e.g., sex, race, ethnicity, smoker status, and blood pressure) with respect to target populations. Discussion The normalized metrics provide an intuitive standardized scale for evaluating representation across subgroups, which may have vastly different enrollment fractions and rates in RCT study cohorts. The metrics are beneficial complements to other approaches (e.g., enrollment fractions and GIST) used to identify generalizability and health equity of RCTs. 
Conclusion By quantifying the gaps between RCT and target populations, the proposed methods can support generalizability evaluation of existing RCT cohorts, enrollment target decisions for new RCTs, and monitoring of RCT recruitment, ultimately contributing to more equitable public health outcomes. Inequitable representation and evaluation of diverse subgroups in randomized clinical trials (RCTs) and other clinical research may generate unfair and avoidable differences in population health outcomes [1] [2] [3] [4] . In an analysis of trials conducted by Pfizer between 2011 and 2020, scientists found an urgent need for solutions to enhance diverse representation across all populations within clinical research 5 . Health inequity attracted great public attention during the COVID-19 pandemic [6] [7] [8] . For example, race and ethnicity are identified factors associated with risk for COVID-19 infection and mortality [9] [10] [11] . The adequate enrollment of participants with diverse race and ethnicity is required in clinical trials to ensure valid treatment effect conclusions and to support reliable generalizability of clinical trial results across subpopulations. A well-designed RCT is considered the most reliable way to estimate cause-effect relationships between treatments and outcomes 12, 13 . The randomization process, which makes RCTs gold standards of treatment effectiveness, contains two random assignments, one from target population to trial cohort and the other from trial cohort to different experimental groups 14, 15 . The first random assignment is critical to the applicability and generalizability of clinical findings [16] [17] [18] but has received much less attention than the second one. Figure 1 demonstrates that if a latent patient trait guides the patient assignment into the study and affects the outcome, then the study generalizability to other reference populations may be limited from a causal inference perspective. All rights reserved. 
No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 28, 2021. ; https://doi.org/10.1101/2021.06.23.21259272 doi: medRxiv preprint
Given the target population (i.e., the broad group of people to which RCT results are intended to apply) of a study on any disease, metrics have been developed to determine whether a subpopulation (i.e., a subset of the target population whose members share one or more common traits and can thus be distinguished from the rest) is underrepresented in the study cohort. Examples of traits include demographics, socioeconomic status, and clinical characteristics. Enrollment fraction (EF) is a widely used measure of participation disparities in clinical trials. For a given disease, it is defined as the number of trial participants divided by the estimated US cases in each subgroup [19, 20]. EF is usually a very small number by definition, and computing it requires the total size of the target population; as a result, underrepresentation and discrimination due to subgroup membership are hard to compare across subgroups and hard to calculate in a numerically stable way when expressed through EF. We prove that our proposed metrics based on EF can be obtained through easily calculated and more intuitive rates. Researchers typically assess subgroup representativeness by comparing subgroup EFs with that of a traditionally advantaged reference subgroup (e.g., non-Hispanic White individuals). Our method evaluates possible underrepresentation of all subgroups at the same time.
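One way to see why rates can stand in for enrollment fractions is a small numerical sketch. It is an illustration under stated assumptions, not the proof given in the supplementary materials: the subgroup labels and all counts below are hypothetical, and the helper names are invented for this example. The sketch compares a subgroup's EF relative to the overall EF, computed once from raw counts and once from the subgroup's share of the trial cohort and of the target population.

```python
# Hypothetical counts for two subgroups; not taken from any trial or survey.
trial_counts = {"A": 300, "B": 700}                 # trial participants
target_counts = {"A": 4_000_000, "B": 6_000_000}    # estimated cases

def relative_ef_from_counts(group):
    """Subgroup EF divided by overall EF, using raw counts."""
    ef_group = trial_counts[group] / target_counts[group]
    ef_overall = sum(trial_counts.values()) / sum(target_counts.values())
    return ef_group / ef_overall

def relative_ef_from_rates(group):
    """Same quantity from the subgroup's rate in the cohort and in the target."""
    observed_rate = trial_counts[group] / sum(trial_counts.values())
    ideal_rate = target_counts[group] / sum(target_counts.values())
    return observed_rate / ideal_rate

for group in trial_counts:
    print(group, relative_ef_from_counts(group), relative_ef_from_rates(group))
```

The two columns agree up to floating-point rounding, so the tiny EF values and the absolute size of the target population never need to be handled directly.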
Complex but valuable representativeness metrics such as GIST 1.0 and 2.0 [21,22] calculate the generalizability of clinical studies by comparing eligible populations with target populations based on electronic healthcare records, and they evaluate the restrictiveness of trial eligibility criteria. Our proposed metrics also compare rates between trial cohorts and target populations, but handle multiple traits in the cohort differently. GIST calculates a measure for each trait and then obtains a final score from the univariate trait measures. We instead calculate representation metrics for all possible subgroups created by the multiple traits and then focus on visualizations and statistical methods that enable users to effectively identify significantly underrepresented subgroups with respect to the target populations. By indicating the representativeness of all possible subgroups, our approach could eventually be combined with GIST-type approaches to help illuminate the "black box" of sample selection and trial generalizability in a single trial and across multiple trials.
Machine learning (ML) fairness metrics have been developed to quantify and mitigate bias in ML and AI models [23] [24] [25]. To improve on existing RCT representativeness measurements, we treat assignment to the RCT as a random binary classification problem and develop standardized metrics for RCTs by mapping variations of ML fairness metrics to the RCT context. ML fairness metrics quantify potential bias towards protected groups in trained ML classification model outcomes.
Our metrics, instead of comparing positive and negative classes based on model outcomes, focus on the trial-subject data generation process within the RCT. Our novel insight is to regard subject assignment to an RCT as a random classification function and then create variants of ML fairness metrics. Our metrics capture how well the actual assignment of subjects to an RCT cohort matches a truly random assignment. The statistical properties of the hypothetical random assignment from a target population can be estimated using community-based or nationally representative datasets of individual characteristics, such as the National Health and Nutrition Examination Survey (NHANES) [26], or from electronic medical records (EMR).
Our main goal is to eliminate or reduce inequitable representation in the subject enrollment stage by measuring and identifying equity gaps that persist across different subpopulations. Our method augments the Consolidated Standards of Reporting Trials (CONSORT) statement [27,28] and its extension CONSORT-Equity [29], which aims to avoid biased results from incomplete or nontransparent research reports that could mislead decision-making in healthcare. Our method supports incorporating representativeness evaluation before, during, and after any RCT. Additionally, it can help Institutional Review Boards (IRBs) better evaluate equity in trial-design stages and assist FDA regulators in ensuring a fair distribution of clinical benefits to both the study sample and the general population.
Our proposed representativeness metrics are expected to identify subgroups that are insufficiently recruited into and represented in the clinical trial cohort using study summary data only, ensuring privacy, security, and confidentiality of health information. These metrics can then be used by clinicians, clinical researchers, and health policy advocates to assess potential gaps in the applicability of clinical trials in real-world settings. The contributions discussed in this paper are: (1) Formulating the problem of representativeness evaluation in RCTs as a comparison between a truly random assignment function and the actual assignment observed in the clinical trial cohort; (2) Deriving new metrics for representativeness of RCTs based on ML fairness metrics; (3) Utilizing the proposed metrics to measure subject representation of RCT cohorts with respect to a target population; (4) Identifying needs, gaps, and barriers to equitable representation of various subgroups in RCTs; (5) Designing a tool (an R-Shiny App) to automatically evaluate trial representativeness through on-demand subject stratification and distribute reports containing visualizations and explanations for different users.
We establish a general mapping from RCT to ML fairness and then derive metrics to evaluate the population representation of RCTs based on ML fairness measures [30] [31] [32] [33] [34] [35] [36]. We provide a visual representation of results with associated statistical tests to transparently communicate the quantitative results to diverse user groups. Table 1 provides a glossary of fairness and representativeness terms used throughout the manuscript.
In an ML prediction model, given a feature vector x of a subject drawn from a distribution, a binary classifier predicts whether the subject is positive (y′ = 1) or negative (y′ = 0). The true outcome is y ∈ {0,1}. Within RCTs, the feature vector x holds the protected attributes or subject traits; the binary classifier assigns subjects into the study cohort, where y′ = 1 means a subject is recruited while y′ = 0 means not recruited or excluded. y is the true random assignment result of the subject into the study from the whole target population. For RCT representativeness evaluation, each available individual (i.e., a person who has the studied disease) is defined by the tuple ((x, y′), y).
ML fairness metrics are concerned with guaranteeing similar results across different subgroups [37]. We assume that the ideal RCT achieves statistical parity [38].
In ML fairness, the disparate impact measure is the ratio of the positive rates of the unprotected and protected groups [39]. Disparate impact adopts the "80 percent rule" suggested by the US Equal Employment Opportunity Commission (EEOC) [40] to decide when the result is unfair: the rule requires the selection rate of a subgroup to be at least 80% of the selection rate of the other subgroups. As shown in the following theorem, when applied to the RCT, disparate impact reduces to an intuitive quantity based on the enrollment odds of a protected group in the RCT cohort and in the target population.
Based on the ideal RCT assumptions above, the disparate impact metric is equivalent to the ratio of the enrollment odds of subjects of the protected group in the observed cohort to the odds of those subjects in the ideal cohort. See supplementary materials for proof. Since log odds are easier to interpret, we propose the following metric for RCTs: Log Disparity(g(x)) = log( [P(g(x) = 1 | y′ = 1) / (1 − P(g(x) = 1 | y′ = 1))] / [P(g(x) = 1 | y = 1) / (1 − P(g(x) = 1 | y = 1))] ), i.e., the log of the ratio between the observed and ideal enrollment odds of the subgroup.
In the log disparity metric, a value of 0 indicates perfect clinical equity. A value smaller than the negative lower threshold, −τ₁, implies potential underrepresentation of a subgroup, while a value greater than τ₁ implies potential overrepresentation. We further add an upper threshold, τ₂. A value less than −τ₂ implies that the subgroup is highly underrepresented; similarly, a value greater than τ₂ implies that it is highly overrepresented. Values between −τ₁ and τ₁ mean equitable representation. Our metric thresholds are selected based on guidance from the literature [23,41-43], but other thresholds chosen under different criteria are allowed as inputs. We use a significance level of 0.05, a lower threshold of −log(0.8), and an upper threshold of −log(0.6).
The ML fairness Equal Opportunity metric [44], which requires subgroups to have the same true positive rates, can also be applied to RCTs. Let the ideal RCT assumptions hold and let g(x) be a binomial random variable; then the ML fairness Equal Opportunity metric admits an equivalent form for RCTs.
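The log disparity metric and its thresholds, as described above (this example concerns log disparity, not the Equal Opportunity form), can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the metric is the natural log of the ratio between the observed and ideal enrollment odds, as the theorem statement suggests, it omits the significance test that the paper applies first, and the function names and example rates are hypothetical.

```python
import math

TAU_LOWER = -math.log(0.8)  # lower threshold stated in the text
TAU_UPPER = -math.log(0.6)  # upper threshold stated in the text

def log_disparity(observed_rate, ideal_rate):
    """Log of (observed enrollment odds / ideal enrollment odds)."""
    for rate in (observed_rate, ideal_rate):
        if not 0.0 < rate < 1.0:
            # Absent subgroups (rate 0) need separate handling.
            raise ValueError("rates must lie strictly between 0 and 1")
    odds_observed = observed_rate / (1.0 - observed_rate)
    odds_ideal = ideal_rate / (1.0 - ideal_rate)
    return math.log(odds_observed / odds_ideal)

def classify(value, tau_lower=TAU_LOWER, tau_upper=TAU_UPPER):
    """Map a log disparity value to the representation levels in the text."""
    if value < -tau_upper:
        return "highly underrepresented"
    if value < -tau_lower:
        return "underrepresented"
    if value <= tau_lower:
        return "equitable"
    if value <= tau_upper:
        return "overrepresented"
    return "highly overrepresented"

# Hypothetical subgroup: 30% of the cohort versus 50% of the target population.
print(classify(log_disparity(0.30, 0.50)))
```

Because the thresholds are themselves logs, switching the logarithm base rescales the metric and the thresholds together and leaves the resulting category unchanged.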
2) Both metrics have a common interpretation for subgroups with very different background rates: 0 means that demographic parity holds, <0 means the subgroup is underrepresented, and >0 means the subgroup is overrepresented.
3) Statistical tests, which take into account the RCT study size and estimation errors of the ideal assignment rate, quantify the significance of observed disparities for each subgroup.
We use a one-proportion two-tailed z-test to determine whether the observed rate deviates significantly from the ideal population rate. We use the Benjamini-Hochberg procedure to correct for multiple comparisons across all subgroups. If the difference between observed and ideal rates is not statistically significant, the subgroup is treated as representative; otherwise, we use the metrics to quantify the subgroup's representativeness. Other statistical tests could be used. See supplement for details.
We assess the proposed methodologies on three real-world RCTs, ACCORD [46], ALLHAT [47], and SPRINT [48], obtained from BioLINCC, with the ideal subgroup assignment rate calculated from individuals with matched disease conditions in NHANES. According to participants' baseline characteristics typically summarized in Table 1s The observed rates of the subgroup are calculated from the RCT data. For each study, we construct all possible subgroups that can be instantiated as g(x). We define 29 univariate, 109 bivariate, and 306 multivariate subgroups based on nine protected attributes. In general, any baseline subject attributes can be selected as protected attributes in our approach.
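The significance screen described above (a one-proportion two-tailed z-test per subgroup, followed by Benjamini-Hochberg correction) can be sketched with the Python standard library alone. This is a simplified illustration, not the procedure from the supplement: it treats the ideal rate as a fixed value and so ignores its estimation error, and the function names, cohort size, and rates are hypothetical.

```python
import math

def one_proportion_z_test(observed_rate, ideal_rate, n):
    """Two-tailed z-test of an observed proportion against a fixed ideal rate."""
    standard_error = math.sqrt(ideal_rate * (1.0 - ideal_rate) / n)
    z = (observed_rate - ideal_rate) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[index] = True
    return rejected

# Hypothetical (observed rate, ideal rate) pairs for a cohort of 1,000 subjects.
subgroups = {"S1": (0.30, 0.50), "S2": (0.21, 0.20), "S3": (0.05, 0.08)}
p_values = [one_proportion_z_test(obs, ideal, 1000)[1]
            for obs, ideal in subgroups.values()]
for name, significant in zip(subgroups, benjamini_hochberg(p_values)):
    print(name, "significant" if significant else "treated as representative")
```

Only subgroups flagged as significant would then be scored with the representativeness metrics; the rest are treated as representative, following the rule stated in the text.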
In our experiment, ideal rates from target populations (P(g(x) = 1 | y = 1)) are calculated from NHANES 2015-2016 using the R survey package [49], which accounts for potential bias from complex survey designs. The NHANES population selected varies based on study objectives and the desired target population. To evaluate ACCORD [46], we estimate ideal rates of subgroups of diabetic individuals in the US using subjects who report having diabetes in NHANES, and we use subjects who report having hypertension in NHANES as the target population to evaluate ALLHAT [47] and SPRINT [48]. These criteria could be modified to consider study inclusion and exclusion criteria depending on the goals of the analysis. Since users may have better target population data that match their studies, user-provided target population datasets and multiple target files are allowed. For example, clinicians who focus on their local communities could use the community or health-system population as the target to evaluate the equity of RCTs, whereas for researchers who work on a global disease, the target population may be better estimated from global population datasets.
To demonstrate the proposed metric, we created a visualization using different colors to represent different representativeness levels in RCTs. For compact presentation, we focus on the log disparity metric. Figure 2 illustrates how the log disparity function applies to the relatively common subgroups Female and Female Non-Hispanic Black in ACCORD. As shown in Figure 2A, for women with type-2 diabetes, the ideal rate from NHANES is 0.445 while the observed RCT rate is 0.386.
The observed female-subject rate falls into the light orange region, which reveals the underrepresentation of female subjects. In Figure 2B, when the subgroup of interest is changed to non-Hispanic black female participants, the ideal rate decreases to 0.079 and the observed rate becomes 0.095. The subgroup of interest now falls into the teal region, which means that non-Hispanic black female participants are equitably represented in ACCORD. This indicates the influence of the protected attribute race/ethnicity on the representativeness evaluation. By comparing Figures 2A and 2B, we can observe that the metric functions change as the ideal rate changes.
The representativeness of the 29 univariate subgroups for the three RCTs is shown in Tables 2 and 3. Dark red represents subgroups absent from the RCT; light orange and orange red indicate that subgroups are underrepresented or highly underrepresented in the RCT relative to the target population; light and dark blue specify the potentially overrepresented or highly overrepresented subgroups; teal shows that the subgroup is either equitably represented or has no significant difference; grey indicates that no individuals with the selected protected attributes exist in the estimated target population; black indicates a subgroup absent from both the estimated target population and the RCT.
[Table fragment] Non-Hispanic White: 0.23, -0.73, -0.30
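The Figure 2 readings can be checked by hand. The worked calculation below is a sketch that assumes log disparity is the natural log of the ratio of observed to ideal enrollment odds and that the thresholds are −log(0.8) ≈ 0.223 and −log(0.6) ≈ 0.511; it covers only the metric value, not the significance test applied beforehand. The symbols p_obs and p_ideal are shorthand introduced here for the observed and ideal subgroup rates.

```latex
\[
\mathrm{LD} = \log\!\left(\frac{p_{\mathrm{obs}}/(1-p_{\mathrm{obs}})}
                               {p_{\mathrm{ideal}}/(1-p_{\mathrm{ideal}})}\right)
\]
% Figure 2A: female subjects in ACCORD
\[
\mathrm{LD}_{\mathrm{female}}
 = \log\!\left(\frac{0.386/0.614}{0.445/0.555}\right)
 \approx \log(0.784) \approx -0.243,
 \qquad -0.511 < -0.243 < -0.223
 \;\Rightarrow\; \text{underrepresented}
\]
% Figure 2B: non-Hispanic black female subjects in ACCORD
\[
\mathrm{LD}_{\mathrm{NHB\ female}}
 = \log\!\left(\frac{0.095/0.905}{0.079/0.921}\right)
 \approx \log(1.224) \approx 0.202,
 \qquad |0.202| < 0.223
 \;\Rightarrow\; \text{equitable}
\]
```

Both results agree with the regions reported in the text: light orange (underrepresented) for female subjects and teal (equitable) for non-Hispanic black female subjects.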
Compared with the summary statistics in the published literature (i.e., about 47% of subjects are women in ALLHAT and 36% of subjects are women in SPRINT [51-54]), ALLHAT captures the gender distribution among real-world hypertensive participants while SPRINT fails to enroll enough female participants.
The color change across categories of an attribute highlights interesting trends in subject representation. Among the three studies, only two attributes achieved equitable representation across all subgroups: gender in ALLHAT and TC in SPRINT. From the tables, we observe that current smokers, young participants, non-Hispanic Asian subjects, and subjects with SBP under 130 mm Hg or FG between 5.6 and 6.9 mmol/L are frequently underrepresented. This indicates that some subgroups in the target population are missing or inadequately represented in the RCTs. Decision-making for a subject aged 40, for example, based on the SPRINT study would require additional evidence beyond this study. Also, participants with lower education levels tend to be more underrepresented in SPRINT, while participants with higher education levels tend to be more underrepresented in ALLHAT. This suggests that potential social determinant confounders may exist in the RCTs.
We note that, across all three studies, non-Hispanic black participants are overrepresented, perhaps reflecting efforts to ensure minority participation or reflecting study locations. In both hypertension RCTs, Asian subjects may have been insufficiently enrolled. This underrepresentation may also reflect study choices or locations. These trends have to be validated by analysis of more RCTs.
For subgroups defined by multiple attributes, sunburst plots better visualize how subgroup representation changes as additional protected attributes are added, as shown in Figure 3. For each type of protected attribute (i.e., demographic characteristics, risk factors, and lab results), separate sunburst charts are generated since their matched populations from NHANES are different.
The sunburst plots explicitly address diversity, equity, and inclusion of clinical studies with respect to the target population. For instance, Figure 3B identifies the missing evidence in subgroups including any female and non-Hispanic male subjects aged under 45. This lack of subject diversity may lead to results similar to those shown for the effectiveness of Actemra on COVID-19 patients, in which the study results flipped after including more marginalized participants. Furthermore, our visualization automatically checks whether the inclusion and exclusion criteria are met. Consistent with the SPRINT criteria, subjects with SBP under 130 mm Hg were successfully excluded, but the lab results show that subjects with potentially impaired glucose or diabetes were still enrolled.
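The ring structure of such a sunburst can be thought of as the set of nested prefixes of an ordered attribute list: the inner ring holds the levels of the first attribute, the next ring holds every combination of the first two, and so on. The sketch below enumerates those segments. It is an illustration only; the attribute names, levels, and the function `sunburst_paths` are hypothetical and do not come from the authors' R-Shiny tool.

```python
from itertools import product

# Hypothetical ordered attributes (inner ring first) with example levels.
attributes = {
    "Gender": ["Female", "Male"],
    "Age": ["<45", "45-64", ">64"],
}

def sunburst_paths(ordered_attributes):
    """Yield each ring segment as a tuple of (attribute, level) pairs."""
    names = list(ordered_attributes)
    for depth in range(1, len(names) + 1):
        prefix = names[:depth]
        level_lists = [ordered_attributes[name] for name in prefix]
        for levels in product(*level_lists):
            yield tuple(zip(prefix, levels))

for path in sunburst_paths(attributes):
    print(" / ".join(f"{name}={level}" for name, level in path))
```

Each yielded path identifies one subgroup whose observed and ideal rates would be compared; reordering the dictionary keys changes which attribute forms the inner ring.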
An advantage of the proposed metrics is that they provide a standardized scale for judging trial representativeness for subgroups with vastly different expected rates in the trial; for example, the ideal rates of participation in the type-2 diabetes trial estimated from NHANES for the subgroups of female subjects, female subjects aged over 64, Hispanic female subjects aged over 64, and Hispanic female subjects aged over 64 with a high school degree are 0.445, 0.172, 0.025, and 0.006, respectively. Evaluating differences between simple rates for many subpopulations would be more challenging.
To facilitate visualization of measured performance on clinical trials, we have incorporated a comprehensive set of fairness metrics into our prototype representativeness visualization tool, built with R Shiny, to enable researchers and clinicians to rapidly visualize and assess all potential misrepresentation in a given RCT for all possible subgroups. In our application, the number and order of the attributes for the sunburst can be changed by users; for example, instead of Figure 3B, users can visualize the representativeness of subgroups for Age with further divisions by Gender and then Race/Ethnicity. With these metrics, users can rapidly determine conditions for patients if the matching background information is available to obtain the ideal rate of each subgroup. Furthermore, our approach can be used as a frame of reference to guide clinicians and policy-makers in making decisions with legitimate reasons and evidence. We let users dynamically control different conditions, including subgroup characteristics, metric types, and metric cutoffs, under which they will make their own decisions.
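The standardized-scale point made above can be made concrete by asking, for each of the four ideal rates quoted in the text, which observed rates would still count as equitable. The sketch below inverts the log disparity formula for that purpose. It assumes the metric is the natural log of the observed-to-ideal odds ratio with the lower threshold −log(0.8), and the function name `equitable_band` is introduced here for illustration.

```python
import math

def equitable_band(ideal_rate, tau=-math.log(0.8)):
    """Observed-rate interval whose log disparity lies within [-tau, tau]."""
    ideal_odds = ideal_rate / (1.0 - ideal_rate)
    low_odds = ideal_odds * math.exp(-tau)
    high_odds = ideal_odds * math.exp(tau)
    # Convert odds back to rates: rate = odds / (1 + odds).
    return low_odds / (1.0 + low_odds), high_odds / (1.0 + high_odds)

# Ideal rates quoted in the text for four nested ACCORD subgroups.
for ideal in (0.445, 0.172, 0.025, 0.006):
    low, high = equitable_band(ideal)
    print(f"ideal={ideal:.3f}  equitable observed rate in [{low:.4f}, {high:.4f}]")
```

The width of the band in raw rate units shrinks sharply as the ideal rate falls, which is why a fixed difference in simple rates is hard to interpret across subgroups while a single threshold on the log disparity scale applies to all of them.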
The technical challenges we encountered include determining how to appropriately treat continuous variables such as age and how to account for inclusion and exclusion criteria when mapping RCT cohorts to the NHANES sample population. Currently, we discretize all continuous variables, with alternative approaches left as future work. It may be desirable to further refine the target populations to adjust for missing and underrepresented subgroups due to RCT inclusion and exclusion criteria. We plan to validate our metrics by applying them to more trials and comparing the results with other metrics such as GIST 2.0. It may also be useful to create a method combining the proposed metrics with GIST to enable detailed subpopulation analyses of inclusion and exclusion criteria and analysis of multiple trials.
Quantifying representation is important for scientific rigor and for building true equity into research designs and methods. Health equity is not just a clinical issue; it is also a socioeconomic concern with broad consequences [55] [56] [57]. We developed metrics and methods to evaluate how equitably subgroups are represented in RCTs. Unlike most existing studies, which focus on one protected attribute at a time (e.g., race) for a single disease (e.g., type-2 diabetes), our proposed approach can simultaneously analyze clinical trials designed for several diseases, such as hypertension and type-2 diabetes, and can additionally report the representativeness of subgroups defined by multiple attributes, including age and race/ethnicity.
Our next steps are to utilize these metrics to monitor existing RCTs, help design new RCTs, and provide tools for disseminating findings to different user groups, such as patients, clinicians, data scientists, and policy-makers, who will put the discoveries to use to advance health equity.
[Figure 3 caption, partially recovered] ... BMI, and smoking status respectively. D. Representativeness of SPRINT RCT subgroups in sunburst plot with inner to outer rings defined by lab results total cholesterol and fasting glucose respectively.
This manuscript was prepared using ACCORD, ALLHAT, and SPRINT Research Materials obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the ACCORD, ALLHAT, SPRINT or the NHLBI. All methods were carried out following the NHLBI approved research plan: Equity in Clinical Trials, and all procedures were carried out in accordance with the applicable guidelines and regulations from the NHLBI Research Materials Distribution Agreement. The procedures were approved by The Rensselaer IRB as IRB Review Not Required. Informed consent was obtained from all subjects by NHLBI. Data from research participants who refused to permit the sharing of their data are deleted from the repository data set.
Supplementary material is available in another document.