key: cord-0849004-feq9grdc
authors: Isa Irawan, Mohammad; Mukhlash, Imam; Rizky, Abduh; Ririsati Dewi, Alfiana
title: Application of Needleman-Wunch Algorithm to identify mutation in DNA sequences of Corona virus
date: 2019-05-01
journal: J Phys Conf Ser
DOI: 10.1088/1742-6596/1218/1/012031
sha: bd3bd3ed5d72a3ce42d110a8de6620b7b0f9f1bf
doc_id: 849004
cord_uid: feq9grdc

—Corona virus is a virus capable of mutating very quickly and many other viruses that arise due to mutations of this virus. To find out the location of corona virus mutations of one type to another, DNA sequences can be aligned using the Needleman-Wunsch algorithm. Corona virus data was taken from Genbank National Center for Biotechnology Information from 1985-1992. The Needleman-Wunsch algorithm is a global alignment algorithm in which alignment is performed to all sequences with the complexity of O (mn) and is capable of producing optimal alignment. An important step in this reserach is the sequence alignment of two corona viruses using the Needleman-Wunsch algorithm. Second, identification of the location of mutations from the DNA of the virus. The result of this reserach is an alignment and the location of mutations of both sequences. The results of identification DNA mutation can be used to find out other viruses mutation corona virus as well as can be used in the field of health as a reference for the manufacture of drugs for corona virus mutation outcome.

Computational biology is the field of science that focuses on the preparation of a mathematical model in solving and analyzing biological sequence problems. Computing in the field of biology is known as bioinformatics. Bioinformatics is the science of the combination of biological science and informatics for data storage, retrieval, data manipulation, and the distribution of information associated with biological macromolecules, such as DNA, RNA, and proteins. Bioinformatics is more often used for microbiological computing and focuses on analyzing biological sequences data.

DNA or Deoxyribonucleic Acid are biomolecules in the form of nucleic acids (found in the nucleus of cells), which function to store genetic information in an organism. Double-stranded DNA joins hydrogen bonds between bases in two strands. This base is Adenin (A), Cytosine (C), Guanine (G), and Thymine (T). DNA is a method that can prove whether an organism has a family relationship or not with other organisms. One of the introduction of an organism in bioinformatics is sequence alignment, which is the process of composing /aligning a sequence with one or more other sequences so that the sequence equations are clear or the level of similarity is obtained [1] .

Mutation is a change in the genetic material of a creature that occurs unplanned, random, and is the basis for a heritable source of living organisms. From To find out whether this corona virus is a relative or not, a method is needed that can align the corona virus DNA with one another. These problems can be solved using dynamic programming. Dynamic programming in DNA sequence alignment has two types of techniques, namely global and local alignment. Some algorithms used in local alignment include Smith-Waterman, FASTA, BLAST, and many algorithms that are being developed. As for global alignment, the Needleman-Wunsch algorithm is still often used and also developed to be more efficient [3] .

In this study, parallel alignment of DNA sequences is done globally because in this alignment, alignment is carried out from the end of the sequence to the other end of the sequence of the DNA character. The algorithm used in this study is the Needleman-Wunsch algorithm. This algorithm was originally created by Saul Needleman and Christian Wunsch in 1970 [4] . This algorithm is a dynamic programming implementation that is used to determine the level of similarity or the compatibility of two texts. The way this algorithm works is that DNA sequences are aligned by matching and shifting, so as to obtain the maximum global or overall level of similarity (Global Alignment) of the two DNA sequences with complexity O(mn). By looking at the results and the process of the Needleman-Wunsch algorithm in DNA alignment, this study discusses how the Needleman-Wunsch algorithm can be used to align the DNA sequences of the corona virus so that they can determine the presence of mutations in the corona virus. Corona virus DNA sequences data used is data in NCBI GenBank.

Several previous studies that underlie this research, among others, research conducted by Vijay Naidu and Ajit Narayanan [5] in 2016 the Needleman-Wunsch algorithm can be used to align two polymorphic malware viruses. Another study conducted by Mikhael Avner Malendes and Hendra Bunyamin [6] in 2017, concluded that the Needleman-Wunsch Algorithm had superior performance to align the sequence for both small and large data.

In this, research on sequence alignment on corona viruses using the needleman-wunsch algorithm is carried out which will then identify mutations in the sequence and the virus. The program implementation is done using the Python programming language with Anaconda software and Jupyter Notebook 5.0.0 application.

There are 4 mutation classifications, namely [16]: 1. Type 1 A mutation caused by changes in nucleotides, for example "a" changes to "g". 2. Type 2 A mutation that occurs because there are parts of the nucleotide that change the order of its position, for example the "check" section changes the order to "guacc".

A mutation caused by the insertion of a new segment into the sequence, for example the insertion of "aa" in the middle of the "gguugg" segment will change the seggment to "gguaaugg".

A mutation that occurs due to the elimination of nucleotide segments in the sequence, for example the removal of nucleotide "ag" from the segment "acaguua" so that the segment changes to "acuua".

Mutation is a change in structure in the genetic material of a creature that occurs randomly and is the basis for the source of a variety of living organisms that are heritable.

Because type 1 and type 2 mutations do not change the position of all nucleotides, these mutations are called substitution mutations. While type 3 and type 4 mutations are called transfer mutations because they can change the position of nucleotides. 

(1) With A, B, C as sequence. ܽ , ܾ , ܿ states the basic units of the sequence in position to-i, where these elements are obtained from the set

Sequence alignment is the process of composing / aligning a sequence with one or more other sequences so that the sequence equations are clear or the level of similarity is obtained [1] .

Here is an example of the alignment of two different short DNA sequences

Sign | states the existence of a match or match between the two sequences. DNA sequence alignment has two types of techniques, namely global and local alignment. Global alignment is the alignment performed for the whole sequence, some of the algorithms used in local alignment include Smith-Waterman, FASTA, BLAST, and many algorithms that are being developed. Whereas local alignment is alignment done to several or part of the sequence, the algorithm commonly used is the Needleman-Wunsch algorithm.

For example sequences ‫ܣ‬ = ܽ ଵ , ܽ ଶ , … , ܽ and ‫ܤ‬ = ܾ ଵ , ܾ ଶ , … , ܾ , then create a score matrix of size (n+1) x (m+1). Where n is the number of rows stating the length of the first sequence, and m is the number of columns stating the length of the second sequence. Then fill in the first row and the first column of the value matrix with the value of the gap penalty. Gap penalty is the value obtained when comparing the residues in a sequence with blank characters (gaps) in other sequences. b. Charging Matrix Suppose the value matrix is called the matrix S, then the formula for the elements of the matrix S is

where:

x ‫ܽ(ݏ‬ , ܾ ) = the substitution matrix element in residue i in sequence a and residual j in sequence b x ݀ = gap penalty in virtual symbol With assumption gap linear model, is that

Sequence Alignment Sequence in bioinformatics can be described using the following notation:

The Needleman-Wunsch algorithm is an implementation of dynamic programs (dynamic programming). The Needleman-Wunsch algorithm is used to determine the level of similarity or compatibility of two texts. This algorithm is also used to find alignments that have optimal values on global alignment in two sequences. This algorithm was created by Saul Needleman and Christian Wunsch in 1970 [7] . 

Then the S value matrix is obtained as the following table: In the example above, the traceback step starts from S(7,6) Score current = 8 (matrix score i,j) Score diagonal = 3+5 = 8 Score left = 3-3 = 0 Score up = 0-3 = -3 Because score current = Score diagonal = 8, then i-1 = 7-1 = 6 j-1 = 6-1 = 5 so, the next cell is S(6,5). For S(6,5):

Score current = 3 Score diagonal = 6-3 = 3 Score left = 6-3 = 3 Score up = 3-3 = 0 Because score current = Score diagonal= 3, then i-1 = 6-1 = 5 j -1= 5-1=4 then, the next cell is S(5,4). From the example above, the alignment results are obtained as below: Sequence 1:

See from the results of the sequence alignment above, shows that both sequences have mutations in the 7th nucleotide, that is, T in the first sequence changes to C on the second sequence. The score of the alignment of both sequences is 8 and the homology is 50%

Before identifying the location of mutations from two DNA sequences, the process of aligning two DNA sequences is done first using the Needleman-Wunsch algorithm. Here the corona virus DNA sequence data will be used with the following details:

DNA ATGTTGGCACAGTTACTGTTAGCAGTGACTCTTTTGTCTGCTTTAGGTGAATGTAGTATAGTAGGTGAAAATTACACATACTATTACCAGAGTCAGTTTAGACCGCCTAATGGCTGGCA  TAAACATGGTGGAGCCTATCTTGTAGTTAATGAAACTGATATATCCTATGATGCTGCGTCTTGTACTGTGGGTACAATAAAAGGCGGCATTGTCATTAATGAGAGTGCTATATCTTTTG  TTACTAAAACACCTATTGCTTGGTCAGCTCAAGGCGTTTGCACTACATATTGTAATTATTCCAGCCTATATGTGTTTGTAACCCATTGTGGGGGCCGTGGACATAATAGTTGTATTATA  AATACAAATCGCATAGGCGAGATTGTTTTAGGTGTTAAATCCTTTTCTGGTAACTGGATTTATAATCGCACTATACAGGCTACTGGTCCGTATAGTAAATTTACAGCCTGGCAATGTCT  TGCTAATTTTACCAGTGTGTTTCTAAATGGCAACCTTGTGTATAGTTCTAACTTTACGGAGGATGTTGCAGCGGCTGGTGTTTATGCTAAAACAGTCAATGGTCTAAAACGTAGAATTA  TGAAGGACACTGATGTTTTGGCATATTTTGTAAATGGCACTGCTGTTGAAGTGATTGTTTGTGATGACAACCCTAAAGGTAGGTTAGCATGTCAGTATAATACAGGAAATTTTACTGA  TGGGTTATACCCTTTCGTAAGTAATAATGTAGTTAATGATAGTGTTGTTGTTTATGATGTTATTAGTACTACAACTTATGGTAAACTTAACAACATTACTTTCCATAATGAAACTAGTG  CACCACCTGCAGGTTCTAATGTTGCTAATTTTATTAAATATCAGACGCATGTGTTGCCTGAAGGTTTTGTTAGGCTTAATTTTTCTTTCTTGTCTACTTACAGGTATCAGGAGTCTGATT  TTACTTATGGTTCTTATCATAAGGCTTGTAATTTTAGACTAGAAAGTATTAATAATGGTTTAATGTTTAATACTTTAAGTGTTTCTATTAGCTATGGACCACTTAAGGGTTCTTGTAAGC  AGTCAGTGTTTAATCATAAAGCAACGTGCTGTTATGCCTATAAATATCCCACTAATGGGGTTCAAGAGTGTAAGGGTGTTTATAATGGAGAACGCAATACTAAATTTGAATGCGGGCT  TCTTGTATTTATAGACAAGACTGATGGTTCACGCATAATAACTGCAGAAAAACCACCTGTTTATACTACTAATTTTACTAATAATATTGTTGTTGGTAAGTGTGTTAATTATAATATTT  ATGGTAGGTATGGCCAAGGCGTCATTAGTAATGTAACTACTGAAGCATTTGGTTTTTTAGAGGGAGATGGTTTGGTCATCTTGGACACTGCTGGTTCTATAGATATTTTTGTTGTTAGG  GATGGTCCATTCACACATTATTACAAGATTAATCCTTGTAATGATGTAAATCAACAATATGTAGTGTCAGGAGGAAATATAGTTGGTCTTCTCACATCTAGTAATGAGACTGGCTCTA  TTCAGTTAGAAGATCAGTTTTATATTAAACTCACTAATAGCACCCGTAGGCATCGGAGA   ATGTGGGCATCGTTACTGTCAGTAGTGACTCTTTTGTTTGCTTTAAGTGAATGTAGTATAGTAGGTGAAAATTACACATACTATTACCAGAGTCAGTTTAGGCCGCCTAATGGCTGGCA  TAAACATGGTGGAGCCTATCTTGTAACCAATGAAACTGACATATCCTATAATGGTGTGTCTTGTACTGTGGGTACAATAAAAGGCGGCATTGTCATTAATGAGAGTGCTATATCTTTT  GTTACAAAAACACCCATTGCTTGGTCAGCCAACGGCGTTTGCACTACATATTGTAATTACTCCAGCTTATATGTGTTTGTTACCCATTGTGGGGGCAGCGGACATACTAGTTGTATTAA  AATACAAATCGCATAGGCGAGATTGTTTTAGGTGTTAAAGACTTTTCTGGTAACTGGATTTATAATCGTACTATAAAGGCTATTGGTCCGTATAGTAAATTTACAGCCTGGCAATGTCT  TGCTAATTTTACCAGTGTGTTTCTAAACGGCAACCTTGTGTATAGTTCTAACTTTACGGAGGATGTTGCAGCGGGTGGTGTTTATGCTAAAAGCGTCAATGGTCTAAAACGTAGAATTA  TGAAGGACACTGATGTTTTGGCATATTTTGTAAATGGCACTGCTGTTGAAGTGATTGTTTGTGATGACAGCCCTAGAGGTAGGTTAGCATGTCAGTATAATACAGGAAATTTTACTGA  TGGGTTATACCCTTTCGTAAGTTACAATGTAGTTAATAATAGTGTTGTTGTTTATGAGGTTATTAGTACTACAACTTATGGTAAACTTAACAACATTACTTTTCATAATGAAACTGGTG  CACCACCTGCAGGTTCTAATGTTGCTAATTTTATTAAATATCAGACGCATGTGGTGCCTGAAGGTTTTGTTAGGCTCAATTTTTCTTTCTTGTCTACTTACAGGTATCAGGAGTCTGATT  TTACTTATGGTTCTTATCATAAGGCTTGTAATTTTAGACTAGAAAGTATTAATAATGGTTTAATGTTTAATACTTTAAGTGTTTCTATTAGCTATGGACCACTTAAGGGTTCTTGTAAGC  AGTCAGTATTTAATCGTAAAGCAACATGCTGTTATGCCTATAAATATCCCACTAATGGGGTTCAAGAGTGTAAGGGTGTTTATAATGGAGAACGCAATACTAAATTTGAATGCGGGCT  TCTTGTATTTATAGACAAGACTGATGGTTCACGCATAATAACTGCAGAAAAACCACCTGTTTATACTACTAATTTTACTAATAATATTGTTGTTGGTAAGTGTGTTAATTATAATATTT  ATGGCAGGTATGGCCAAGGCGTCATTAGTAATATAACTACTGAAGCATTTGGATTTTTACAGGGAGATGGTTTGGTCATCTTGGACACTGCTGGTTCTATAGATATTTTTTCTGTTAAG  GATGGGCCACTCACACATTATTACAAAATTAATCCTTGTAATGATGTAAATCAACAATATGTAGTGTCAGGAGGAAATATAGTTGGTCTTCTCACATCTAGTAATGAGACTGGCTCTA Figure 3 , the results of homology of 10 types of viruses from one time to another over 70 % (meaning that the viral DNA has similarities) but the first type of virus with the last type experien ced a difference of 39% with 117 mutations (on the score to 2 to 6) and 138 (in scores 1 and 7). This shows that the corona virus DNA in the type of infectious bronchitis virus in 1985 and 1992 has und ergone a mutation. 2. The homology level uses 7 variations of match values, different mismatches and gaps produce almo st the same homology level (difference below 1), this shows that the determination of match values, mismatch and gap does not affect the level of homology. 3. Based on Figure 4 , 7 out of 10 experiments (2nd, 3rd, 4th, 5th, 6th, 8th, and 9th trials) gave the max imum score at score 6, the score with match = 14, mismatch = -8 and gap = 8. 7 variations of match values, different mismatches and gaps produce different maximum scores. This shows that determin ing the match value, mismatch and gap affect the maximum score. 4. Based on Figure 5 , variations in match values, mismatches and gaps 2, 4, 5, and 6 produce the same mutations. However, in variations in match values, mismatches and gaps 1, 3 and 7 produce differen t mutations. This shows that the determination of match values, mismatch and gap affects mutations . 

To test the truth of the Needleman-Wunsch algorithm, sequence sequencing is done in 2 conditions, namely 1. 2 different sequences but the same length 2. 2 sequences of the same arrangement and length The following are the results of alignment sequences with the conditions as above: a. 2 different sequences but the same length as: Mutation: None (meaning sequence 1 and sequence 2 are the same sequence) From the two alignments performed, it can be concluded that the Needleman-Wunsch algorithm is correct and can be used to align two sequences.

Based on the analysis of the results of program testing, it can be concluded that the Needleman-Wunsch algorithm can be applied to identify mutations in the DNA sequence of the corona virus. The test results were carried out on 10 corona virus DNA types of Infectious bronchitis from 1985 to 1992 in sequence. The first type of virus with the last type experienced a difference of 39% and 117 mutations (variation of match, mismatch and gap scores to 2,4,5 and 6) and 138 mutations (variation of match score,

Re-testing of 10 Infectious bronchitis virus corona type DNA viruses from 1985 to 1992 with a link as input. This test is done with 3 match values, mismatch and gap

Match= 10, mismatch= -6, and gap= 6

Match= 14, mismatch= -8, and gap= 8

Essential Bioinformatics. Cambridge [2] Center of Diseases Control. Middle East Respiratory Syndrome (MERS)

Rekayasa Perangkat Lunak. Yogyakarta: Penerbit ANDI

Algorithms in Bioinformatics. Verlag

Needleman-Wunsch And Smith-Waterman Algorithms For Identifying Viral Polymorphic Malware Variants

Perbandingan Needleman-Wunsch dan Lempel-Ziv dalam Teknik Global Sequence Alignment: Keunggulan Faktorisasi Sempurna

A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Two Proteins

Biologi. Facil-Grafindo: Bandung

Veterinary Virology Second Edition

Sequence analysis of Leukemia DNA

346 (changes in nucleotide C to A) 399 (changes in nucleotide G to T) 400 (changes in nucleotide A to C) 428 (changes in nucleotide T to C) 435 (changes in nucleotide A to C) 442 (changes in nucleotide T to C) 506 (changes in nucleotide C to T) 553 (changes in nucleotide G to C) 668 (changes in nucleotide G to A) 674 (changes in nucleotide G to A) 739 (changes in nucleotide T to A) 741 (changes in nucleotide C to T) 754 (changes in nucleotide A to G) 774 (changes in nucleotide G to T) 819 (changes in nucleotide T to C) 832 (changes in nucleotide G to A) 889 (changes in nucleotide G to T) 912 (changes in nucleotide C to T) 1083 (changes in nucleotide A to G) 1091 (changes in nucleotide G to A) 1101 (changes in nucleotide A to G) 1317 (changes in nucleotide C to T) 1345 (changes in nucleotide A to G) 1365 (changes in nucleotide A to T) 1372 (changes in nucleotide C to G) 1431 (changes in nucleotide A to G) 1438 (changes in nucleotide G to T) 1442 (changes in nucleotide C to T) 1459 (changes in nucleotide A to G) 1594 (changes in nucleotide T to C) 1604 (changes in nucleotide A to C)