key: cord-0979021-v0tkxliu
authors: Oskotsky, Tomiko; Bajaj, Ruchika; Burchard, Jillian; Cavazos, Taylor; Chen, Ina; Connell, Will; Eaneff, Stephanie; Grant, Tianna; Kanungo, Ishan; Lindquist, Karla; Myers-Turnbull, Douglas; Naing, Zun Zar Chi; Tang, Alice; Vora, Bianca; Wang, Jon; Karim, Isha; Swadling, Claire; Yang, Janice; Sirota, Marina
title: Nurturing diversity and inclusion in AI in Biomedicine through a virtual summer program for high school students
date: 2021-03-08
journal: bioRxiv
DOI: 10.1101/2021.03.06.434213
sha: 08e1638daa8ea1c35c3736dce0f7309c680bda81
doc_id: 979021
cord_uid: v0tkxliu

Artificial Intelligence (AI) has the power to improve our lives through a wide variety of applications, many of which fall into the healthcare space; however, a lack of diversity is contributing to flawed systems that perpetuate gender and racial biases, and limit how broadly AI can help people. The UCSF AI4ALL program was established in 2019 to address this issue by promoting diversity and inclusion in AI. The program targets high school students from underrepresented backgrounds in AI and gives them a chance to learn about AI with a focus on biomedicine. In 2020, the UCSF AI4ALL three-week program was held entirely online due to the COVID-19 pandemic. Thus, students participated virtually to gain experience with AI, interact with diverse role models in AI, and learn about advancing health through AI. Specifically, they attended lectures in coding and AI, received an in-depth research experience through hands-on projects exploring COVID-19, and engaged in mentoring and personal development sessions with faculty, researchers, industry professionals, and undergraduate and graduate students, many of whom were women and from underrepresented racial and ethnic backgrounds. At the conclusion of the program, the students presented the results of their research projects at our final symposium.
Comparison of pre- and post-program survey responses from students demonstrated that after the program, significantly more students were familiar with how to work with data and how to evaluate and apply machine learning algorithms. There was also a nominally significant increase in the students’ knowing people in AI from historically underrepresented groups, feeling confident in discussing AI, and being aware of careers in AI. We found that we were able to engage young students in AI via our online training program and nurture greater inclusion in AI.

applied to the results.

Of the 89 high school students who applied to our program, we accepted 38, of whom 29 enrolled in and completed the program. All 29 students were female and were rising sophomores (21%), juniors (45%), or seniors (34%) in high school. Most of the students were from California (79%), although several were from other states. The racial backgrounds of the students included Asian, inclusive of those from the Indian subcontinent and the Philippines (79%); Native Hawaiian or Other Pacific Islander/Original Peoples (3%); and Hispanic or Latino (7%); 14% declined to state. Twenty-one percent will be first-generation college students (Table 1).

Senior / 12th grade student       10   34%
Junior / 11th grade student       13   45%
Sophomore / 10th grade student     6   21%
Freshman / 9th grade student       0    0%
Yes                                4   14%

1st week: Lessons in Python and Machine Learning

In the first week of the program, students spent the afternoons learning about machine learning concepts and programming in Python. Seven UCSF graduate students served as instructors and teaching assistants (TAs) during the first week. iPython notebooks with the in-class exercises were shared the evening before each class to give students an opportunity to practice on their own before the solutions were reviewed in class.
In the first two days, students covered the basics of programming, data management, and data visualization to prepare them to code in Python and to work with data in a Google Colab environment for their projects. Topics included programming basics (data types, logic, loops, functions), data structures, common Python packages, plotting with matplotlib, and using sklearn. During the lesson, students were placed in breakout rooms with teaching assistants to review coding exercises and practice programming activities together.

algorithm that can aid in identifying how many resources a given country or state will need. Students were then presented with high-level information on several ML techniques used for

Students selected an 80% and 20% split for their training and testing data, respectively. Each student first trained their model using their individual virus data. Then, they trained the model using all their virus data to predict the interaction between each SARS-CoV-2 protein and each human protein from the first PPI dataframe they built. The students fine-tuned the algorithm's parameters to improve the model's performance. To visualize the algorithm's optimal performance, each student built a confusion matrix for the SVM predicting virus-human protein interaction (Fig 2a-e) and extracted feature importance in a bar plot (Fig 2f). Additionally,

such as SARS-CoV-2 viral load, gender, and age. In this unsupervised analysis, the students experienced how sample outliers can skew variance and cause inflation of PCA components. In

Additionally, students were asked to go above and beyond to apply their findings to translational applications.
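The train/test workflow described for the protein-interaction project (an 80%/20% split, an SVM classifier, a confusion matrix, and a feature-importance ranking) can be sketched in Python with scikit-learn. This is only an illustrative sketch under stated assumptions: the synthetic dataset stands in for the students' protein-pair features, and the linear kernel, the value of `C`, and the use of absolute coefficients as an importance measure are choices made here for illustration. The source does not state which kernel or importance measure the students used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a protein-pair feature table (hypothetical data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 80% training / 20% testing split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Assumption: a linear kernel, chosen because it exposes per-feature weights.
model = SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# Assumption: absolute coefficient size as a rough feature-importance score.
importance = abs(model.coef_[0])
ranking = importance.argsort()[::-1]  # feature indices, most important first
print(ranking)
```

The off-diagonal cells of the confusion matrix count the false positives and false negatives, which is what makes it a natural starting point for weighing the cost of each kind of error.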
For example, students were asked to critically evaluate the cost of false negatives (spreading COVID-19, not receiving treatment on time, worse outcomes) and false positives (waste of limited resources) with respect to patients and outcomes, and to apply this evaluation to the decision of a model. Students were also asked to perform covariate analyses to determine feature importance and to relate the results back to their understanding of clinical relevance and application (Fig 5b,c). One finding that the group reported was that leukocytes were strongly negatively correlated with COVID test results (Fig 5d). Lastly, the group presented their findings, their recommendations for future plans, and the limitations and biases in the data (i.e., single location, limited follow-up, missing data) to the entire group.

Table 1). Students of the 2020 virtual program were also no less likely to recommend the AI4ALL program to peers than the students who attended the 2019 in-person program (Mann-Whitney U

who are from historically underrepresented groups, their confidence in discussing AI, and their awareness of careers in AI. While the format of the 2020 program differed from 2019, with the 2020 program taking place online instead of in person due to the pandemic, students' survey responses from both years were comparable.

Despite the success of our virtual training program, there were some limitations to having a program take place entirely online, including the lack of in-person interactions and the need for a reliable internet connection. Nevertheless, the ability to engage young students in AI and the opportunity to contribute to diverse representation in this field make holding our program in any format worthwhile.

We have learned that it is possible to deliver an AI curriculum virtually to young high school students in a way that provides them with an engaging and impactful experience.
Through our virtual program, we were able to connect with students from around the country and involve teaching assistants and faculty from outside the Bay Area and from other institutions. We were also able to give students who are located far from AI training programs a chance to become involved, bringing the goal of increasing diversity in AI a little closer to reality.

Author Contributions

TO and MS designed and co-directed the program, performed analysis of program survey data, and outlined and wrote the manuscript. JW, IK, and JB led and described Project 1. ZN and IK led and described Project 2. IC, TG, and JY led and described Project 3. WC, RB, and CS led and described Project 4. AT and BV led and described Project 5. JB, TC, WC, SE, TG, KL, AT, and

Mihika Rayan

Corresponding author: Correspondence to Marina Sirota (marina.sirota@ucsf