key: cord-0058571-rstbik9c
authors: da Silva Filho, Reginaldo Inojosa; de Azevedo da Rocha, Ricardo Luis; Oliveira, Claudio Santos
title: Formal Language Model for Transcriptome and Proteome Data Integration
date: 2020-08-26
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58814-4_60
sha: 2aab58dce452ba481c6f1ff8f39cde2b733cf114
doc_id: 58571
cord_uid: rstbik9c

We present in this article, for the area of structural and functional genetics, a preliminary theoretical model based on the representation of transcriptomes and proteomes as families of formal languages. In this model, the phenomenon of translation is described as an artificial-language transduction process (of the kind found in programming-language compilers), which makes it possible to unify transcriptomic and proteomic data.

Computer simulation is a fast and flexible virtual process that can significantly reduce the demands on resources and time that would otherwise make the study of some systems unfeasible [16]. It does not replace in vivo experimentation, but it provides important information for achieving scientific and technological goals in critical areas of biology and health [5, 13].

Our purpose, in this initial work, is to present a model (the basis for constructing a computational framework) for later simulation in the area of gene expression. More specifically, the model maps the elements of transcriptomes to those of proteomes (as captured by omics techniques), in order to allow data integration between these two areas and the study of their interactions.

Transcriptomics and proteomics are independent areas; however, there are efforts to integrate them [8, 15, 21]. Our objective is to create a model that integrates both under the same form of representation: formal languages. More specifically, we represent transcriptomes and proteomes as families of formal languages, and model the phenomenon of translation as a transduction process between artificial languages.

We seek the simplicity necessary to provide an immediate computational implementation, without making it an ad hoc solution. We are inspired by a series of existing approaches to the modeling of genetic processes [1], such as the π-calculus and the work of Searls [7, 19]. In our model, we present the biological substrate of the relationship between transcriptomes and proteomes from the point of view of language processing [9]. Biological information processing is based on the DNA and RNA alphabets of 4 symbols each, and the protein alphabet of 20 symbols. The biological translation mechanism is represented by the process of transduction between the "strings of symbols" [20] formed over such alphabets (RNAs and proteins). Thus, by analogy with programming, we can say that a "source code" (transcriptome) is converted into an "object code" (proteome), which allows the "computational architecture" (organism) to execute the instructions contained therein (events that lead to the manifestation of phenotypic characteristics).

The work is organized as follows. Section 3 presents the definitions of both the formal proteome and the formal transcriptome, as well as the main abstractions taken from biology, such as the concept of organism and the set of time intervals that serve as a basis for constructing these definitions. Section 4 presents the different types of transformations that can be applied to the transcriptome. Such transformations take into account only formal characteristics; initially, we do not analyze their biological implications. Among these transformations, we highlight the modeling of the translation phenomenon. Finally, Sect. 5 discusses the results and the next steps to be followed.

In this section, we give several definitions and notations required for the discussion in the present article. We assume that the reader is familiar with basic genetic concepts and with formal language theory. The basic computational concepts used in this work (mostly concerning automata theory), as well as the pertinent notation, are summarized in Table 1 of Annex; for more details see [11].

The first product of genome expression is the transcriptome [6]. It is the complete set of coding transcripts that are active in a given cell, organism, tissue, lineage or organelle in a given time interval, constituting the profile (or large-scale pattern) of all mRNAs present in the cell in a given period [12]. The second product of the organism's gene expression is its proteome, which is the complete set of all proteins obtained by the process of translation in a given organ, tissue, lineage or cell organelle in a given period of time [4, 12].

The translation process involves ribosomes, mRNAs, tRNAs and amino acids. In the cytoplasm, amino acids must join with their respective tRNAs. During translation, the tRNA molecules bind to the ribosome. The ribosome, in turn, processes the entire coding sequence of the mRNA. As this reading takes place, a series of tRNA connections to and disconnections from the ribosome occur, which together bring in the amino acids that compose the resulting protein.

In the mRNA coding sequence, each group of three nucleotides corresponds to a given amino acid. Such a group is called a codon, and it is chemically recognized by the tRNA carrying the corresponding amino acid. Since four nucleotides make up an RNA, there are 64 different codons (Table 1), three of which do not encode amino acids but serve as stop signals for translation. Since there are only 20 amino acids (Table 1 of Annex, first column), different codons can encode the same amino acid. Because of this, the genetic code is said to be degenerate [17].
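The counts in this paragraph can be checked with a short Python sketch. It assumes the standard genetic code, written in the conventional compact form (one amino-acid letter per codon, in UCAG order, with `*` for a stop codon); the names `BASES`, `AMINO_ACIDS` and `CODON_TABLE` are illustrative and not part of the model.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"  # RNA nucleotides, in the conventional table order
# Standard genetic code (assumed): one letter per codon, '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Enumerate all 4^3 codons in UCAG order and pair each with its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
# Number of codons per amino acid (the degeneracy of the code).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE), stop_codons, len(degeneracy))
print(degeneracy["L"], degeneracy["M"])  # leucine vs. methionine
```

Under this table, leucine is encoded by six codons while methionine is encoded by one, which is the degeneracy described above.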

The insertion and connection of amino acids occur in the same order in which their respective codons are recognized, a phenomenon known as the principle of collinearity [14]. Thus, the linear sequence of nucleotides determines the primary structure of a protein, building up the resulting protein sequence piece by piece until the codon that serves as the termination signal is reached.
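Collinearity can be illustrated with a minimal sketch that reads a coding sequence codon by codon, in order, and stops at the first termination codon. It assumes the standard genetic code and ignores start-codon detection, reading frames and untranslated regions; `translate` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code (assumed), in UCAG order; '*' marks a stop codon.
CODON_TABLE = dict(zip(
    map("".join, product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))


def translate(coding_sequence: str) -> str:
    """Map codons to amino acids left to right, stopping at a stop codon."""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(coding_sequence) - 2, 3):
        amino_acid = CODON_TABLE[coding_sequence[i:i + 3]]
        if amino_acid == "*":
            break  # termination signal: nothing after it is translated
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUUAUAA"))
print(translate("AUGUUGUAGGCC"))
```

The order of the amino acids in the output mirrors the order of the codons in the input, and the second call shows that codons after the termination signal do not contribute to the protein.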

Given two distinct alphabets $\Sigma_1$ and $\Sigma_2$, the function $R : \Sigma_1^{*} \to \Sigma_2^{*}$ is a transduction [2, p. 65], whose graph is described as:

The function $R$ can be performed by a machine called a transducer [2, p. 77], defined as:

where $Q$ is a set of states, $\Sigma_1$ is the alphabet of the input strings, $\Sigma_2$ is the alphabet of the output strings, $q_0$ is the initial state, and $F \subseteq Q$ is the set of final states.

For a transducer R, a computation is a sequence of consecutive transitions

where:

with $\delta_i \in \partial$ and $1 \le i \le n$. A computation can also be represented by the notation:

where the word $u = \alpha_1 \alpha_2 \ldots \alpha_n \in \Sigma_1^{*}$ is the input of the computation and $v = \beta_1 \beta_2 \ldots \beta_n \in \Sigma_2^{*}$ is the output word, for $p, s \in Q$.
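For reference, the standard textbook forms of the objects named above can be sketched as follows. This is an assumption-laden sketch in the spirit of [2], not necessarily the authors' exact notation: the ordering of the tuple, the use of $p_i$ for the intermediate states, and the arrow notation are all chosen here for illustration (`\xrightarrow` requires `amsmath`).

```latex
% Graph of the transduction R, viewed as a function
\[
  \{\, (u, R(u)) : u \in \Sigma_1^{*} \,\} \subseteq \Sigma_1^{*} \times \Sigma_2^{*}
\]
% Transducer: the tuple order is an assumption; \partial is a finite set of transitions
\[
  \langle Q, \Sigma_1, \Sigma_2, q_0, F, \partial \rangle,
  \qquad
  \partial \subseteq Q \times \Sigma_1^{*} \times \Sigma_2^{*} \times Q
\]
% Computation: consecutive transitions, each ending where the next begins
\[
  c = \delta_1 \delta_2 \cdots \delta_n,
  \qquad
  \delta_i = (p_{i-1}, \alpha_i, \beta_i, p_i), \quad 1 \le i \le n
\]
% Compact notation, with p = p_0 and s = p_n
\[
  p \xrightarrow{\; u / v \;} s
\]
```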

A computation is successful if $p = q_0$ and $s \in F$ [3, p. 18]. Thus, for an input word $u$, the result $R(u)$ calculated by the transducer is the concatenation of the outputs produced along the computation. In this way, we have $R(u) = \beta_1 \beta_2 \ldots \beta_n = v$.
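The notions of computation and successful computation can be made concrete with a small Python sketch. It represents each transition as a tuple (source state, input word, output word, target state) and treats a computation as a list of such transitions. The state names `q0`, `q1`, `qf` and the three toy transitions are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Transition(NamedTuple):
    source: str  # state the transition leaves
    read: str    # input word (alpha_i), possibly empty
    write: str   # output word (beta_i), possibly empty
    target: str  # state the transition enters


def is_consecutive(path: List[Transition]) -> bool:
    """Each transition must start in the state where the previous one ended."""
    return all(a.target == b.source for a, b in zip(path, path[1:]))


def run(path: List[Transition], initial: str,
        finals: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (u, v) if `path` is a successful computation, otherwise None."""
    if not path or not is_consecutive(path):
        return None
    if path[0].source != initial or path[-1].target not in finals:
        return None
    u = "".join(t.read for t in path)   # input word: concatenated alpha_i
    v = "".join(t.write for t in path)  # output word: concatenated beta_i
    return u, v


# Toy transitions (hypothetical): read a codon, write an amino-acid letter.
t1 = Transition("q0", "AUG", "M", "q1")
t2 = Transition("q1", "UGG", "W", "q1")
t3 = Transition("q1", "UAA", "", "qf")  # stop codon: empty output

print(run([t1, t2, t3], "q0", {"qf"}))
print(run([t2, t3], "q0", {"qf"}))  # does not start in the initial state
```

The first call describes a successful computation whose output is the concatenation of the words written along the path; the second is rejected because it does not start in the initial state.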

Although formal (artificial) languages do not encompass the complexity of human language, they are useful not only for the study of linguistic themes but also for several computational purposes [18]. In general, from a linguistic point of view, an occurrence experienced by someone (real or not) is expressed in a language (natural or artificial) and then transformed into a record, which can be passed on and transformed over time. Something similar is done to define the abstractions for transcriptomes and proteomes: we describe a biological phenomenon in terms of a (formal) language, establishing some assumptions and relaxing some restrictions relative to the physical phenomenon. This builds a functional (in the mathematical sense) articulation of possible relationships, which can be "expressed" as a computer program and executed in different forms, circumstances and modes (based on experimental or hypothetical data). This abstract construction begins with the description of the prerequisites that support the model, described in the two items below:

Table 1 of Annex. A string $p \in \Sigma_{AM}^{m}$ represents the primary structure of a protein.

The natural numbers $m$ and $n$ in $\Sigma_{AM}^{m}$ and $\Sigma_{RNA}^{n}$ are restrictions that prevent the occurrence of protein or mRNA strings of infinite size.

Thus, the Formal Transcriptome ($T$) and the Formal Proteome ($\Pi$) can be defined by the functions:

whose codomains $L_{\Sigma_{RNA}}$ and $L_{\Sigma_{AM}}$ are families of indexed, non-empty languages, so that for $max = card(I) \times card(C)$, they are specified as:

where each member language, in both $L_{\Sigma_{RNA}}$ and $L_{\Sigma_{AM}}$, may contain repeated words.
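One possible in-memory representation of such a family is sketched below, under several stated assumptions: the index sets $I$ and $C$ are treated as opaque sets of labels, each member language is stored as a list so that repeated words are preserved, and $n$ is read as an upper bound on word length. The names `Family` and `check_family` and the labels `t0`, `t1`, `c0` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Index = Tuple[str, str]          # (label from I, label from C), both hypothetical
Family = Dict[Index, List[str]]  # lists, so a language may hold repeated words


def check_family(family: Family, intervals: Set[str], components: Set[str],
                 alphabet: str, max_len: int) -> bool:
    """Check the constraints stated in the text for an indexed family."""
    # At most card(I) * card(C) member languages.
    if len(family) > len(intervals) * len(components):
        return False
    for (i, c), words in family.items():
        # Indices must come from I x C and every language must be non-empty.
        if i not in intervals or c not in components or not words:
            return False
        for word in words:
            # Words are bounded in length and drawn from the given alphabet.
            if len(word) > max_len or not set(word) <= set(alphabet):
                return False
    return True


I = {"t0", "t1"}
C = {"c0"}
T = {
    ("t0", "c0"): ["AUGGCCUAA", "AUGGCCUAA"],  # the same transcript twice
    ("t1", "c0"): ["AUGUGGUGA"],
}
print(check_family(T, I, C, alphabet="ACGU", max_len=12))
```

Using lists rather than sets is a design choice that keeps repeated words, so that multiple copies of the same mRNA in one interval remain distinguishable by count.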

The modeling presented in this section will allow the definition of a transformation on formal transcriptomes and proteomes.

The transformation on formal transcriptomes is related to considerations about what the phenomenon of translation computes. First, we have to capture the main characteristics of the translation process in the form of postulates. They are:

Table 1 of Annex, the presence of codons (which we will call $k_i$) in the coding section of the mRNAs allows it to be described as:

Now we can state the proposition that is the center of the modeling proposed in this article. 

By delimiting (by means of formal languages) the aspects of both transcriptomes and proteomes, we stipulate the criteria needed to show that, although they are heterogeneous in constitution, both are governed by an internal uniformity, captured by the formal structure that describes them in the developed model.

With reasonable assumptions about the nature of the translation phenomenon, abstractly presented as a language processor, we arrive at the conclusion described by the equation $M(T) = \Pi$, which relates transcriptomic and proteomic information.
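The relation $M(T) = \Pi$ can be sketched in Python by applying a word-level translation to every word of every member language of a toy transcriptome. This is an illustrative sketch only: it assumes the standard genetic code, represents a family as a dictionary of lists, and uses hypothetical index labels; the function `M` below is a stand-in for the morphism of the model, not its definition.

```python
from itertools import product
from typing import Dict, List, Tuple

# Standard genetic code (assumed), in UCAG order; '*' marks a stop codon.
CODON_TABLE = dict(zip(
    map("".join, product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

Family = Dict[Tuple[str, str], List[str]]


def translate(mrna: str) -> str:
    """Word-level translation: codon by codon, up to the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def M(transcriptome: Family) -> Family:
    """Lift `translate` to a family: same indices, translated words."""
    return {index: [translate(w) for w in words]
            for index, words in transcriptome.items()}


T = {
    ("t0", "c0"): ["AUGGCCUAA", "AUGGCCUAA"],
    ("t1", "c0"): ["AUGUGGUGA"],
}
PI = M(T)
print(PI)
```

The indexing of the family is preserved, so each member language of the toy proteome corresponds to the member language of the transcriptome with the same index, and repeated transcripts yield repeated proteins.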

Thus, based on transcriptomic data, we can infer information about proteomic aspects of the organism through in silico experiments. This goes beyond a simple database architecture: it enters the field of inference and allows the automatic generation of a transcriptome-proteome mapping, even in situations with information gaps.

Throughout this preliminary article, we have focused on the formal linguistic aspects resulting from computational modeling, placing their biological meanings in the background and verifying the properties that result from symbolic relations.

Continuing this idea, we can imagine a mapping that is "inverse" to the one resulting from the translation phenomenon, that is, a morphism $M(\Pi) = T$, which, although not related to any biological phenomenon, could be useful in computationally simulated analyses. It would allow inverse problems to be posed for questions of genetic expression, that is, to answer the question "given a desired phenotypic characteristic, which transcripts would produce it?" It is important to note that in such a morphism $M$, a single input can have two or more outputs. The implications of this fact and the resulting extensions of the model, as well as the influence of other factors (such as probability and heuristics), are themes for future work.
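The one-to-many nature of the inverse mapping follows from the degeneracy of the genetic code and can be illustrated with a short sketch. It assumes the standard genetic code and, for simplicity, omits the terminating stop codon (which would multiply every count by three); `back_translate` and `count_preimages` are hypothetical names.

```python
from itertools import product
from math import prod
from typing import Dict, Iterator, List

# Standard genetic code (assumed), in UCAG order; '*' marks a stop codon.
CODON_TABLE = dict(zip(
    map("".join, product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

# Invert the table: amino acid -> all codons that encode it.
CODONS_FOR: Dict[str, List[str]] = {}
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid != "*":
        CODONS_FOR.setdefault(amino_acid, []).append(codon)


def back_translate(protein: str) -> Iterator[str]:
    """Yield every coding sequence (without stop codon) for `protein`."""
    for choice in product(*(CODONS_FOR[aa] for aa in protein)):
        yield "".join(choice)


def count_preimages(protein: str) -> int:
    """Number of coding sequences, without enumerating them."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


print(list(back_translate("MF")))
print(count_preimages("MW"), count_preimages("ML"), count_preimages("MLSR"))
```

The number of candidate transcripts grows multiplicatively with protein length, which is one reason why additional criteria such as probabilities or heuristics would be needed to select among them.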

[1] Computational modeling, formal analysis, and tools for systems biology

[2] Transductions and Context-Free Languages

[3] Codes and Automata

[4] Proteome informatics

[5] Systems biology in drug discovery

[6] Transcriptome Analysis: Introduction and Examples from the Neurosciences

[7] A grammar inference approach for predicting kinase specific phosphorylation sites

[8] Integration of deep transcriptome and proteome analyses of salicylic acid regulation high temperature stress in Ulva prolifera

[9] The impact of formal reasoning in computational biology

[10] Genetics: from Genes to Genomes

[11] Introduction to Automata Theory, Languages, and Computation

[12] The Dictionary of Genomics

[13] Systems biology: a brief overview

[14] Concepts of Genetics

[15] Integrating transcriptome and proteome profiling: strategies and applications

[16] Building Software for Simulation: Theory and Algorithms, with Application in C++

[17] Genetics: A Conceptual Approach

[18] Handbook of Formal Languages

[19] The language of genes

[20] On compensation loops in genomic duplications

[21] Integration of transcriptomics, proteomics and metabolomics data to reveal the biological mechanisms of abrin injury in human lung epithelial cells