key: cord-0021269-ovv1t218
authors: Taylor, Mason D.; Mendenhall, Bryn; Woods, Calvin S.; Rasband, Madeline E.; Vallejo, Milene C.; Bailey, Elizabeth G.; Payne, Samuel H.
title: Online Tools for Teaching Cancer Bioinformatics
date: 2021-08-31
journal: nan
DOI: 10.1128/jmbe.00167-21
sha: 57dccd9d9d92d9459847ce680c1a2eea8aa55bb1
doc_id: 21269
cord_uid: ovv1t218

The rise of deep molecular characterization with omics data as a standard in biological sciences has highlighted a need for expanded instruction in bioinformatics curricula. Many large biology data sets are publicly available and offer an incredible opportunity for educators to help students explore biological phenomena with computational tools, including data manipulation, visualization, and statistical assessment. However, logistical barriers to data access and integration often complicate their use in undergraduate education. Here, we present a cancer bioinformatics module that is designed to overcome these barriers through six lessons containing authentic, biologically motivated computational exercises that demonstrate how modern omics data are used in precision oncology. Upper-division undergraduate students develop advanced Python programming and data analysis skills with real-world oncology data that integrate proteomics and genomics. The module is publicly available and open source at https://paynelab.github.io/biograder/bio462. These hands-on activities include explanatory text, code demonstrations, and practice problems and are ready to implement in bioinformatics courses.

As data become a dominant feature of biology (1), university curricula must adapt and prepare their students for a data-centric future (2). Many life science undergraduates do not receive bioinformatics training, and a majority of professional biologists gain bioinformatics skills without institutional training (3). This is often because scientific computing courses are designed for computer science and engineering majors (4). Beyond simple introductory courses, there is a need to develop advanced curricula which truly integrate computational and biological perspectives (reviewed in reference 5) as either life science electives or a formal bioinformatics degree (6, 7). A key design strategy is identifying authentic, biologically motivated computational courses, including hands-on exercises with real data (8).

As an emerging discipline, bioinformatics is evolving rapidly. Early coursework and textbooks for bioinformatics focused on algorithms, e.g., Rosalind (http://rosalind.info/). However, the rise of deep genomic and proteomic profiles has highlighted a need for instruction in exploratory data science and statistics. Massive public data sets offer an incredible opportunity to integrate molecular and cellular biology with computation (9), and published classroom activities exist for a variety of topics, including DNA assembly (10), metagenomics (11), bacteriophage genomics (12), and viral genomics (13). Unfortunately, current topics largely focus on genomics and not a broader genotype-to-phenotype view of biology. Additionally, many public data sets have practical barriers that complicate classroom use, such as data accessibility and package installation (14). Here, we present a cancer bioinformatics module with six exercises to teach an integrated proteogenomic view of cancer. The exercises contain explanatory text, code demonstrations, and practice problems encoded in publicly available Google Colab notebooks intended for online or traditional instruction.

Improvements in cancer diagnosis and treatment will come from analyzing DNA, RNA, and protein data, along with traditional clinical information (15). We present a module for teaching computational cancer biology to demonstrate how modern omics data are used in precision oncology. Each lesson explores fundamental cancer concepts and supporting molecular data (Table 1). In lesson one, students are introduced to the proteogenomic data set (16-19). As DNA mutation is the root cause of cancer, lessons two through four describe three distinct kinds of mutations and their impacts on the genome. Lessons five and six use transcriptomics and proteomics data to identify differential gene expression and activated pathways. Each lesson also teaches computational tools and analytical techniques essential for modern cancer research. The first lesson introduces pandas DataFrames and manipulation of data matrices. Subsequent lessons expand the students' skill set with DataFrames, application programming interfaces (APIs), statistical methods, graphing, and network exploration.
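To make the DataFrame skills described above concrete, the following is a minimal sketch of the kind of data-matrix manipulation a student might practice: joining a clinical table to a protein abundance matrix, splitting samples by type, and ranking genes by a Welch t statistic. It is not taken from the lessons themselves. The sample IDs, abundance values, the `sample_type` column, and the `welch_t` helper are all invented for illustration, and the small synthetic tables only stand in for the much larger tables that a data-access API would return.

```python
import pandas as pd

# Synthetic samples-by-genes abundance matrix (values invented for illustration).
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
proteomics = pd.DataFrame(
    {
        "TP53": [0.2, 0.4, 0.3, -1.1, -0.9, -1.3],
        "EGFR": [1.5, 1.8, 1.2, 0.1, -0.2, 0.0],
        "ACTB": [0.0, 0.1, -0.1, 0.05, -0.05, 0.0],
    },
    index=samples,
)

# Synthetic clinical table sharing the same sample index (hypothetical column name).
clinical = pd.DataFrame(
    {"sample_type": ["Tumor", "Tumor", "Tumor", "Normal", "Normal", "Normal"]},
    index=samples,
)

# Join the two tables on their shared sample index.
joined = clinical.join(proteomics, how="inner")

# Boolean masks select rows; the column list keeps only the abundance values.
genes = list(proteomics.columns)
tumor = joined.loc[joined["sample_type"] == "Tumor", genes]
normal = joined.loc[joined["sample_type"] == "Normal", genes]


def welch_t(a: pd.DataFrame, b: pd.DataFrame) -> pd.Series:
    """Column-wise Welch t statistic (unequal variances) for two sample groups."""
    standard_error = (a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)) ** 0.5
    return (a.mean() - b.mean()) / standard_error


summary = pd.DataFrame(
    {
        "mean_diff": tumor.mean() - normal.mean(),
        "welch_t": welch_t(tumor, normal),
    }
)

# Rank genes by the magnitude of the t statistic, largest first.
order = summary["welch_t"].abs().sort_values(ascending=False).index
summary = summary.reindex(order)
print(summary)
```

The sketch operates on whole columns at once rather than looping over samples, which is the matrix-oriented style that DataFrames encourage. A real analysis would also convert the statistic to a P value and correct for multiple testing, both of which are omitted here.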

The module is designed to overcome both practical and philosophical barriers to student engagement by integrating three computational tools. First, lessons are coded in Google Colab, a live Python coding environment accessible via a Web browser (https://colab.research.google.com/). Importantly, Colab does not require any software installation and works with all operating systems, including mobile platforms. Thus, students have access to a functional computing environment on the first day of class, without requiring information technology (IT) support. Second, real cancer data sets are streamed via the cptac Python API (20), providing access to genomic, transcriptomic, proteomic, and clinical data. Mundane and frustrating tasks like locating files or loading data are automated, thus saving time for students to focus on cancer-relevant computation. Third, our module uses an autograder, which contains homework answers and hints used to give students immediate feedback and provide help on integrated practice problems.

The module is designed for upper-division undergraduates. For biological preparation, we suggest that students have completed courses in molecular biology and genetics; for computational preparation, we suggest that students have completed courses in Python programming and introductory statistics. The first lesson, called Introduction to Cancer Datasets, is a skills check. A prepared student should be able to complete it in about 30 to 45 min. This lesson is the same as the final homework in the author's (S.H.P.) Introduction to Bioinformatics course, and students with one semester of programming would be prepared computationally. If students have difficulty with this first assignment, then it is not advised to continue with the module.

We created this module for those wishing to teach computational cancer biology or expand their offering of bioinformatics topics to undergraduate life science majors. The module and links to the six lessons are available at https://paynelab.github.io/biograder/bio462. Each lesson (except for Lesson 1: Introduction to Cancer Datasets) is designed to contain a week's worth of instruction material, with approximately 6 h of computational exercises. Although each lesson is independent, the later lessons build on the computational skills from earlier lessons. Therefore, the module is intended to be used as a set.

Lessons are delivered through Google Colab notebooks and interweave descriptive text and software code. Notebooks begin with a section describing the topic and relevant literature, which could be used in classroom lectures or preparatory reading. The explanatory text motivates students with the description of a cancer biology problem and then proceeds with demonstrative software examples that teach students core computing concepts. These are followed by practice problems where students write their own code. Each practice problem is graded with the autograder, and students can ask for hints on how to approach the question. As with any course, it is advised that a teaching assistant with a bioinformatics background be available to guide students through the rigorous lessons. Instructors interested in the answer key may contact the authors.

Student evaluation of the lessons and module contained both supportive and critical feedback. In general, students were enthusiastic about the authentic integration of computing and cancer biology. A significant positive of the lessons was the links to articles that prompted a deeper exploration of biological concepts and a reference for algorithm use or syntax. The most common challenge was learning to work with data matrices; for students who learned programming through the object-oriented lens, computational exploration of DataFrames was an adjustment. Student feedback on specific exercises prompted substantial revision in wording, hints, and primer examples.

This module is purely computational and does not include any laboratory components. The data used in the modules are publicly available and do not include any personally identifiable information.

Cancer bioinformatics is a growing area, and proteogenomic analyses of large cancer cohorts are rapidly changing cancer diagnosis and treatment. Many of these rich data sets are publicly available, enabling the creation of engaging and impactful curricula. We created a module that introduces undergraduate students to proteogenomic cancer data and teaches bioinformatics skills in data analysis and interpretation. By utilizing a unique combination of technologies, the module avoids common obstacles to large-scale and authentic bioinformatics exercises in a classroom setting.

1. Unmet needs for analyzing biological big data: a survey of 704 NSF principal investigators

2. Bioinformatics core competencies for undergraduate life sciences education

3. A global perspective on evolving bioinformatics and data science training needs

4. Teaching computational thinking through bioinformatics to biology students

5. A survey of scholarly literature describing the field of bioinformatics education and bioinformatics educational research

6. A curriculum for bioinformatics: the time is ripe

7. Bioinformatics curriculum guidelines: toward a definition of core competencies

8. Bioinformatics and the undergraduate curriculum

9. Using the Cancer Genome Atlas as an inquiry tool in the undergraduate classroom

10. Hands-on assembly of DNA sequencing reads as a gateway to bioinformatics

11. Incorporating genomics and bioinformatics across the life sciences curriculum

12. A broadly implementable research course in phage discovery and genomics for first-year undergraduate students

13. Genome analysis of SARS-CoV-2 case study: an undergraduate online learning activity to introduce bioinformatics, BLAST, and the power of genome databases

14. Barriers to accessing public cancer genomic data

15. The next horizon in precision oncology: proteogenomics to inform cancer diagnosis and treatment

16. Integrated proteogenomic characterization of clear cell renal cell carcinoma

17. Proteogenomic characterization reveals therapeutic vulnerabilities in lung adenocarcinoma

18. Proteogenomic characterization of endometrial carcinoma

19. Clinical Proteomic Tumor Analysis Consortium, et al. 2021. Proteogenomic and metabolomic characterization of human glioblastoma

20. Simplified and unified access to cancer proteogenomic data

We thank members of the Payne laboratory and global collaborators who tested these lessons and provided feedback. This work was supported by a National Cancer Institute (NCI) CPTAC award (grant number U24 CA210972) and by the Simmons Center for Cancer Research. We declare no conflicts of interest.