key: cord-0293092-n9z9dxf1
authors: Kerawala, R.; Mayo, K. R.; King, P. F.; Vogel, J. M.; Program, The All of Us Research
title: Analysis of the All of Us Research Program Researcher Workbench Workspaces
date: 2022-05-16
journal: nan
DOI: 10.1101/2022.05.12.22274998
sha: 149432d0237ca5b89fee1e89abc5d96a5fa59f14
doc_id: 293092
cord_uid: n9z9dxf1

The All of Us Research Program, through the All of Us Research Hub platform's Researcher Workbench, provides researchers and citizen scientists with access to a broad dataset of surveys, electronic health records, physical measurements, genetic data, and Fitbit device wearable data. All of Us has a goal of recruiting a minimum of one million participants and aims to capture the diversity of individuals in the United States. The All of Us Research Hub platform includes a Research Projects Directory on its website, displaying researcher-provided descriptions of each active workspace on the Researcher Workbench. Inspired by the initial work completed by All of Us investigators in 2019, these workspace descriptions were analyzed with a new methodology. Genetic, methods and validation studies, and educational research purposes were associated with disease-focused research, as compared to non-disease-focused research. Of all the population categories of interest, only race and ethnicity were associated with disease-focused research. Further categorization of the disease-focused workspaces revealed the top five disease categories: cardiovascular disease, brain and mental health disorders, cancer and benign tumors, diabetes, and immunology-related conditions. Athena OHDSI catalog terms were sorted and helped classify the workspaces by each disease category. Subcategory distribution for the cancer, genetic, and cardiovascular disease-related conditions was examined as well.
This framework has the potential to be used for continued longitudinal analysis of the workspaces and for continued learning about the importance of disease-focused research in public health. Additionally, because workspace descriptions are created at project initiation, they can provide a leading-edge indication of researcher interest in the All of Us Research Program data.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. This version posted May 16, 2022. https://doi.org/10.1101/2022.05.12.22274998 doi: medRxiv preprint

In addition, we present descriptive statistics on which diseases were most commonly studied during the first year that data were available for researcher use, in support of the AoURP goal to understand and facilitate research that will be of importance to participants.

The objective for this initial analysis is to examine how information from the workspace [...] relevant study topics, such as research purposes and study populations of interest. As the datasets were cumulative, the methodology was refined after the initial analysis. Thus, development, refinement, and dissemination of these methods for identifying research activity information from these workspace descriptions was one of our primary goals. In addition, institutional affiliation was investigated to identify where particular workspace descriptions are
created geographically, and whether there were any patterns associated with affiliation to specific institutions. Genetic research workspaces were also examined to investigate researcher interest in the use of genetic data that the AoURP would eventually provide. In [...]

It was anticipated that analysis of workspace description data could help inform future research projects and present both qualitative and quantitative descriptions of research activity that may be of interest not only to participants, but also to funding agencies and to the public. In particular, we hypothesize that research purpose and populations of interest data can elucidate the way the AoURP data was being used at the time of this report, which could inform what types of data various researchers are interested in and the potential applications of that data.

In this analysis, the DRC provided descriptions for active researcher workspaces from three time points, April 21st, 2021, June 7th, 2021, and June 30th, 2021, as .csv files designated cumulative datasets A, B, and C, respectively. Using the initial analysis of dataset A, we quantified workspace creation and modification times and preferred data types, and created the disease categories manually, as identified in Table 1. These categories were based on a previous AoURP publication to maintain consistency, and adapted using the most common workspace focus areas and major disease areas [6]. Using the Athena OHDSI catalog to [...]
[...] cancer workspaces was also examined, given the significant number of workspace descriptions that sought to investigate these topics, as found in the initial analysis.

Using R, filters were applied for the various parameters in sorting workspace data, including deduplication and determining aggregate counts. For each of the three datasets, these counts are summarized in Table 2. Disease-focused research (DFR) was determined by filtering workspaces that had "checked" the disease research box at the time that particular workspace was created. This [...]

Over time, the most common disease categories of the DFR workspaces were [...]

Population and public health is the most common research purpose across workspace types. However, genetic and social behavioral workspaces are more common in DFR, whereas educational and methods development or validation workspaces are more common in non-DFR. Interestingly, the top 5 disease categories studied were cardiovascular disease, brain and mental health disorders, cancer, reproductive disease, and immunology, even when subsetting to workspaces doing genetic research. This suggests that a majority of researchers in this initial cohort (Dataset C) are interested in studying these major disease categories, regardless of whether they intend to leverage genomic data in their analyses. Focusing on cardiovascular disease and cancer workspaces, these workspace descriptions could be further classified based on particular Athena OHDSI catalog terms. One of these subcategories was "not primary focus", indicating that while a term was present somewhere in the workspace description, this disease focus was not the primary goal of a workspace research project.
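The term-based classification described above, including the "not primary focus" subcategory, can be illustrated with a minimal Python sketch. This is not the authors' method (they worked in R with manually curated categories). The term lists, the `classify` function, and the rule that the category with the most term matches counts as the primary focus are all assumptions made for illustration.

```python
import re

# Hypothetical term lists. The study sorted Athena OHDSI catalog terms into
# manually defined disease categories; these few terms are placeholders.
CATEGORY_TERMS = {
    "cardiovascular disease": ["cardiovascular", "hypertension", "heart failure"],
    "cancer": ["cancer", "tumor", "carcinoma"],
}


def count_term_hits(description, terms):
    """Count case-insensitive matches of any term that start at a word boundary."""
    text = description.lower()
    return sum(len(re.findall(r"\b" + re.escape(term), text)) for term in terms)


def classify(description):
    """Label each matched category as 'primary focus' or 'not primary focus'.

    Assumed heuristic: the category (or categories) with the most term
    matches is treated as primary; any other matched category is not.
    """
    hits = {cat: count_term_hits(description, terms)
            for cat, terms in CATEGORY_TERMS.items()}
    hits = {cat: n for cat, n in hits.items() if n > 0}
    if not hits:
        return {}
    top = max(hits.values())
    return {cat: ("primary focus" if n == top else "not primary focus")
            for cat, n in hits.items()}


example = ("We study hypertension and heart failure outcomes; "
           "cancer history is included as a covariate.")
print(classify(example))
```

A match-count rule like this is only a rough proxy for the intent of a description, which is one reason a manual review step or a more careful text-analysis method may still be needed.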
As shown in the supplemental figures, the non-primary focus subcategories in cardiovascular disease and cancer workspaces were substantial in count. Apart from this finding, the manual selection of disease categories was, for the most part, successful using the Athena OHDSI catalog (Figure S2, Figure S3).

Additionally, focusing on current, active workspaces limits the ability to perform retrospective analysis on a larger scale by excluding some of the original demonstration and practice workspaces. Another limitation or possible source of bias is the manual categorization of which conditions are included in the disease categories. We attempted to correct for some of this bias through use of the Athena OHDSI catalog for the disease category selection. Using R also makes the analysis more systematic and replicable, ensuring that the same methods could be applied to each of the datasets. Emphasizing researcher transparency and retrieving early workspace data may help mitigate some of these limitations. However, the scale is still limited by the novelty of the Researcher Workbench as a tool, which opened for beta testing in May 2020, and by the compact timeframe for data inclusion of April to June 2021.

Future work will be able to advance these findings with more systematic approaches and, with time, the ability to perform more comprehensive, longitudinal analyses. Additionally, later analyses of the workspace descriptions can continue to standardize the curation of disease categories by using natural language processing or artificial intelligence methods to eliminate some of the inaccuracies or human biases in creating these categories.
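To make the point about replicability concrete, the deduplication and disease-focused filtering steps described in the methods could be written once and applied unchanged to each cumulative dataset. The following is a minimal Python sketch rather than the authors' R code. The column names `workspace_id` and `disease_focused`, the `summarize` function, and the toy rows are hypothetical and do not reflect the real .csv schema or data.

```python
import csv
import io


def summarize(csv_text):
    """Deduplicate workspaces by ID, then count disease-focused (DFR) ones."""
    unique = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A later row with the same ID overwrites the earlier duplicate.
        unique[row["workspace_id"]] = row
    total = len(unique)
    dfr = sum(1 for row in unique.values()
              if row["disease_focused"].strip().lower() == "true")
    return {"total": total, "dfr": dfr, "non_dfr": total - dfr}


# Toy stand-ins for two cumulative snapshots (made-up rows, not study data).
DATASET_A = """workspace_id,disease_focused
w1,true
w2,false
w1,true
"""
DATASET_B = DATASET_A + "w3,true\n"

# The same function runs on every snapshot, which is what makes the
# aggregate counts comparable across time points.
for name, text in [("A", DATASET_A), ("B", DATASET_B)]:
    print(name, summarize(text))
```

Because the snapshots are cumulative, later datasets contain the earlier workspaces, so deduplicating by identifier within each snapshot keeps a workspace from being counted twice.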
Study of changes in [...]

All of Us Research Program | Scripps Research National Institutes of Health (NIH) - All of Us Framework for Access to All of Us Data Resources v1.1 - All of Us Research Projects Directory. National Institutes of Health (NIH) - All of Us Diversity and inclusion for the All of Us research program: A scoping review The All of Us Research Program. The "All of Us All of Us Research Program Tribal Consultation Final Report Recruitment and Retention of Older People in Clinical Research: A Systematic Literature Review

Over time, the top 5 disease categories were cardiovascular disease and related ailments, followed by brain and mental health disorders, cancer and tumors, reproduction, and immunology. Overall, the same categories were studied in genetic workspaces compared to all of the workspaces. None were statistically significant over time (Fisher's Exact test [...]

Disease Categories-Cancer Subtypes Distribution. Notably, "cancer" was included in a workspace as a non-primary focus most often within the cancer category. This was followed by miscellaneous cancer types, cancer overall, and cancer in combination with other diseases.

Disease Categories-Cardiovascular Disease and Related Disorders Distribution. Notably, cardiovascular disease in conjunction with other diseases was a primary focus in the majority of workspaces from this category. Other major subcategories included CVD, other cardiovascular-related disorders, and non-primary focus on cardiovascular diseases.
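For readers unfamiliar with the Fisher's exact test named in the caption above, the following minimal Python sketch computes a two-sided p-value for the simplest case, a 2x2 table (for example, a research purpose cross-tabulated against disease-focused status). It is an illustration only, not the authors' R analysis. The function name and the example counts are assumptions, and comparing several categories over several time points would need the more general r x c version of the test.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small tolerance guards against float rounding.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Made-up counts for illustration only (not taken from the study):
# rows = purpose checked / not checked, columns = DFR / non-DFR.
print(fisher_exact_two_sided(12, 5, 7, 14))
```

The exact test is a common choice when some cell counts are small, which can happen once workspaces are split by category and time point.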