key: cord-0329457-f9p2ec3f
authors: de Araújo, Gilderlanio Santana
title: From bioinformatics user to bioinformatics engineer: a report
date: 2020-08-10
journal: bioRxiv
DOI: 10.1101/2020.08.03.225979
sha: 60d43635eff6ffb21f92b2740f8c155517151efc
doc_id: 329457
cord_uid: f9p2ec3f

Teaching computer programming is not a simple task, and it is challenging to introduce programming concepts in graduate programs of other fields. Little effort has been made to engage students in computational development after programming training. There is an emerging need to establish bioinformatics and programming language subjects in genetics and molecular biology graduate programs, where students are immersed in a sea of genomic and transcriptomic data that demands proficient computational treatment. I report an empirical guideline for introducing programming languages and recommend Python as a first language for graduate programs in which students come from genetics and molecular biology backgrounds. Including the development of programming solutions related to graduate students' research activities may improve programming skills and engagement. These results suggest that the applied approach enhances learning, taking graduate students from introductory concepts to autonomy in highly advanced programming concepts. This guide should be extended to other research programs.

empirical software engineering tasks, such as definition of the scope of the candidates' research project and functional requirement elicitation. I adopted a "divide and conquer" paradigm for script programming and pattern recognition of their data and processes, which suits biological application development well. Practical, interactive, and personalized activities, carried out in real time in the context of each candidate, help consolidate programming language concepts and autonomy for life science candidates, which, in practice, puts them on the path to becoming bioinformatics engineers.

Python is a hybrid programming language that allows scripting in functional and object-oriented paradigms (https://www.python.org/doc/). Given its resources, Python is considered a production-ready language. It provides clear syntax and semantics and takes advantage of mandatory code indentation, which improves readability and refactoring. Python prioritizes the developer experience, leading many software engineers to choose it for its potential for productivity, learning curve cost, and computational support.

The course "Programming for Bioinformatics with Python" was designed to provide basic knowledge of high-level programming languages for graduate students in genetics and molecular biology at the Federal University of Pará.

This curricular component is a way of improving computational skills and encouraging autonomy in algorithm implementation for processing and analyzing biological data within the context of the master's, doctoral, or postdoctoral research projects of candidates with little or no knowledge of programming languages.

The course was offered twice, from March to April 2019 and during the same months in 2020. The first

Unlike some courses that use a terminal or simple text-based editors, we conducted practical programming classes using PyCharm Community (https://www.jetbrains.com/pt-br/), a professional integrated development environment (IDE) dedicated to Python programming.

A Linux Ubuntu environment was used for running scripts from the command line.

(https://seaborn.pydata.org), which are intended to assist routines for data manipulation and visualization.

In parallel with the Python programming topics, the course was designed to carry out students' research tasks. First, at the beginning of the course, all students were asked to provide a summary of their graduate projects. Second, problems in computational biology were identified within the scope of each research project. Then, we were able to draw an overview of computational solutions and to elicit and select functional requirements for bioinformatics problems, considering the time and scope. We adopted a divide-and-conquer strategy to implement solutions within the course time.

In the classes on data types and data structures, we explored how "omics" data could be modeled. For example, a biological sequence can be simply represented by a string, or even by a more robust object such as Bio.Seq from the BioPython library.
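As an illustration of the string representation and of the divide-and-conquer strategy, the following is a minimal sketch that assumes a DNA sequence stored as a plain Python string. The example sequence and the helper names (clean_sequence, gc_content, reverse_complement) are hypothetical and are not taken from the course material. Only built-in string methods are used, so the sketch does not depend on BioPython.

```python
# Hypothetical raw record, as it might be read from a text file
# (surrounding whitespace and mixed case are deliberate).
raw_record = "  atggccattgtaATGGGCCGCTGA\n"


def clean_sequence(raw: str) -> str:
    """Remove surrounding whitespace and normalise to upper case."""
    return raw.strip().upper()


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in the sequence."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def reverse_complement(sequence: str) -> str:
    """Complement each base (A<->T, C<->G) and reverse the result."""
    complement = str.maketrans("ACGT", "TGCA")
    return sequence.translate(complement)[::-1]


# Each small function solves one sub-problem; the steps are then combined.
dna = clean_sequence(raw_record)
first_start = dna.find("ATG")      # index of the first start codon, or -1
fragments = dna.split("ATG")       # pieces of the sequence between start codons
gc = gc_content(dna)
rc = reverse_complement(dna)

print(f"length={len(dna)} first ATG at {first_start} GC={gc:.2f}")
print(f"reverse complement: {rc}")
```

Splitting the task into small functions keeps each step easy to test on its own, and the same steps could later be rewritten around a richer sequence object if a project requires it.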

The Seq object provides methods similar to those implemented for strings, such as count, find, split and strip. In addition, the Seq object has an alphabet as an attribute, which can be instantiated from

being predominant the education in Biology (n = 6). Only one of the students reported previous training in a field related to technology and data processing. The entire distribution of students by area is shown in Figure 2A.

in the course. This aspect highlights the level of autonomy achieved by students in developing their own solutions.

With the described approach, new perspectives on training graduate students in subjects related to programming languages were conceived. These students are now able to deal with bioinformatics problems that require analysis of large-scale data, such as genome sequences and transcriptomic data.

The course methodology consequently demystifies the use of programming languages and presents itself as a unique opportunity for applying computing knowledge to achieve quick solutions.

tories. This aspect has been the motivational element that gives students a real perception of the applicability of their own scripts or computational pipelines. This fact is consistent with the high percentage of students who still use Python after the end of the course.

In this way, I believe that this report contributes to consolidating new teaching methodologies, including applied classes on high-level programming languages such as Python for bioinformatics in the era of "omics" sciences.

Acknowledgements

Thanks to all graduate students who answered the online questionnaire and continue doing science and scripting in Python, even in pandemic situations. The outcomes of that questionnaire provided helpful feedback to improve programming language training.
