key: cord-0034393-gc9hevy5 authors: Cho, Yeun-Jin; Kim, Hyeoncheol title: Cleavage Site Analysis Using Rule Extraction from Neural Networks date: 2005 journal: Advances in Natural Computation DOI: 10.1007/11539087_132 sha: 6bfdfbb0b4918f250b1398bf2fb9b418ce9ac722 doc_id: 34393 cord_uid: gc9hevy5

In this paper, we demonstrate that the machine learning approach of rule extraction from a trained neural network can be successfully applied to SARS-coronavirus cleavage site analysis. The extracted rules predict cleavage sites better than consensus patterns. Empirical results are also presented.

The first cases of severe acute respiratory syndrome (SARS) were identified in Guangdong Province, China, in November 2002, and this life-threatening disease has since spread to Hong Kong, Singapore, Vietnam, Canada, the USA and several European countries [20]. By late June 2003, the World Health Organization (WHO) had recorded more than 8400 cases of SARS and more than 800 SARS-related deaths, and a global alert for the illness was issued due to the severity of the disease [25]. A growing body of evidence has convincingly shown that SARS is caused by a novel coronavirus, called SARS-coronavirus or SARS-CoV [14, 19], which has been implicated as the causative agent of the worldwide outbreak of SARS during the first six months of 2003 [16, 24]. Currently, the complete genomes of 11 strains of SARS-CoV isolated from SARS patients have been sequenced, and more complete genome sequences are expected [13]. It is also known that the cleavage of the SARS-CoV polyproteins by a special proteinase, the so-called SARS coronavirus main proteinase (CoV Mpro, also known as the 3C-like or 3CL proteinase), is a key step in the replication of SARS-CoV [18].
The importance of the 3CL proteinase cleavage sites not only suggests that this proteinase is a culprit of SARS, but also makes it an attractive target for developing drugs directly against the new disease [3, 10, 23]. Several machine learning approaches, including artificial neural networks, have been applied to proteinase cleavage site analysis [1, 3, 4, 7, 15]. Although neural network models have been used successfully for this analysis [1, 7], one major weakness of neural networks is their lack of explanation capability. The learned knowledge is hidden in a black box: it can be used to predict, but not to explain domain knowledge in an explicit format. In recent years, there have been studies on rule extraction from feed-forward neural networks [1, 5, 6, 7, 8, 12, 17, 21, 22]. The extracted rules allow human users to explain how patterns are classified and may provide better insights into the domain. Rule extraction is therefore used in various data mining applications.

In this paper, we investigate SARS-CoV cleavage site analysis using feedforward neural networks. We also demonstrate how to extract prediction rules for cleavage sites using the approach of rule extraction from neural networks. Experimental results compared to other approaches are also shown.

Kiemer et al. used feedforward neural networks for SARS-CoV cleavage site analysis [11]. They showed that the neural network outperforms three consensus patterns in terms of classification performance.

In this paper, we use a decompositional approach to rule extraction. Decompositional approaches to rule extraction from a trained neural network (i.e., a feed-forward multi-layered neural network) involve the following phases:
1. Intermediate rules are extracted at the level of individual units within the network. At each non-input unit of a trained network, n incoming connection weights and a threshold are given.
Rule extraction at a unit searches for the set of combinations of incoming binary attributes that are valid and maximally general (i.e., each combination is as small as possible).
2. The intermediate rules from each unit are aggregated to form the composite rule base for the neural network. This step rewrites the rules to eliminate symbols that refer to hidden units but are not predefined in the domain. In the process, redundancies, subsumptions, and inconsistencies are removed.

There have been many studies on the efficient extraction of valid and general rules. One of the issues is the time complexity of the rule extraction procedure. Rule extraction is computationally expensive because the rule search space grows exponentially with the number of input attributes. If a node has n incoming nodes, there are 3^n possible combinations, since each binary attribute can appear in a rule as true, as false, or not at all. Kim [12] introduced a computationally efficient algorithm called OAS (Ordered-Attribute Search). In this paper, OAS is used to extract the one or two best rules from each node.

Twenty-four genomic sequences of coronaviruses and their annotation information were downloaded from the GenBank database [2]. We configured neural networks with 160 input nodes, 2 hidden nodes and 1 output node and trained them with the training sets. The classification performance of the neural networks is shown in Table 2. We used the OAS algorithm to extract rules from the trained neural networks [12]. The five extracted rules generally outperform the consensus rules. Their coverage is reasonably high and their accuracy is very high. The rule 'L@p2' in the consensus patterns actually subsumes 11 other rules in Table 3. While its coverage is high (i.e., 55.6%), its accuracy is low compared to the others. The rules that we extracted also contain 'L' at position p2, but the rule 'L@p2' itself was excluded by our 90% rule extraction threshold.

For SARS-CoV cleavage site analysis, we used the approach of rule extraction from neural networks.
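The unit-level search in phase 1 of the decompositional approach can be illustrated with a small sketch. The Python code below is not the OAS algorithm of [12]. It is a brute-force enumeration, written under the assumption that a unit fires when the weighted sum of its binary inputs exceeds its threshold. The function names (`is_valid`, `extract_unit_rules`, `count_candidates`) and the example weights are hypothetical and chosen only for illustration.

```python
from itertools import combinations, product


def is_valid(rule, weights, threshold):
    """Return True if the unit fires for every setting of the free inputs.

    `rule` maps an input index to a required value (1 or 0). Inputs not in
    the rule are free, so they are assumed to take their least helpful value.
    """
    worst = 0.0
    for i, w in enumerate(weights):
        if i in rule:
            worst += w * rule[i]   # input fixed by the rule
        else:
            worst += min(w, 0.0)   # free input: worst case for activation
    return worst > threshold


def extract_unit_rules(weights, threshold):
    """Enumerate valid, maximally general rules for one unit (brute force).

    Candidates are visited smallest first, so any rule that contains an
    already accepted rule is subsumed and skipped. In the worst case all
    3**n candidates are visited, which is why heuristics are needed.
    """
    n = len(weights)
    found = []
    for size in range(n + 1):
        for idxs in combinations(range(n), size):
            for vals in product((1, 0), repeat=size):
                rule = dict(zip(idxs, vals))
                # Skip rules subsumed by a smaller rule found earlier.
                if any(g.items() <= rule.items() for g in found):
                    continue
                if is_valid(rule, weights, threshold):
                    found.append(rule)
    return found


def count_candidates(n):
    """Each attribute is required true, required false, or absent."""
    return 3 ** n


if __name__ == "__main__":
    # Hypothetical unit with three binary inputs.
    print(extract_unit_rules([3.0, -2.0, 1.0], 2.5))
    print(count_candidates(3))
```

For the hypothetical unit above, the only valid and maximally general rule requires the first input to be on and the second to be off. With the 160 input nodes used in this paper, the same enumeration would face 3^160 candidates, which is why a heuristic search is used instead of exhaustive enumeration.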
We trained 3-layered feedforward neural networks on genomic sequences of coronaviruses, and then extracted IF-THEN rules from the neural networks. The performance of the extracted rules was compared with that of consensus patterns. The results are promising. Rule mining using a neural network classifier can be a useful tool for cleavage site analysis.

Survey and critique of techniques for extracting rules from trained artificial neural networks
GenBank: update
Cleavage site analysis in picornaviral polyproteins: discovering cellular targets by neural networks
ZCURVE-CoV: a new system to recognize protein coding genes in coronavirus genomes, and its applications in analyzing SARS-CoV genomes
Neural Networks in Computer Intelligence
Rule generation from neural networks
Introduction to knowledge-based neural networks. Knowledge-Based Systems
Abstraction and Representation of Hidden Knowledge in an Adapted Neural Network. Unpublished
Prediction of proteinase cleavage sites in polyproteins of coronaviruses and its applications in analyzing SARS-CoV genomes
Mutation analysis of 20 SARS virus genome sequences: evidence for negative selection in replicase ORF1b and spike gene
Coronavirus 3CL-pro proteinase cleavage sites: Possible relevance to SARS virus pathology
Computationally Efficient Heuristics for If-Then Rule Extraction from Feed-Forward Neural Networks
Initial SARS Coronavirus Genome Sequence Analysis Using a Bioinformatics Platform.
APBC2004
Mining viral protease data to extract cleavage knowledge
Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
Understanding neural networks via rule extraction
Dissection Study on the Severe Acute Respiratory Syndrome 3C-like Protease Reveals the Critical Role of the Extra Domain in Dimerization of the Enzyme
Diagnosis of Severe Acute Respiratory Syndrome (SARS) by Detection of SARS Coronavirus Nucleocapsid Antibodies in an Antigen-Capturing Enzyme-Linked Immunosorbent Assay
SARS - Beginning to Understand a New Virus
Symbolic interpretation of artificial neural networks
Extracting refined rules from knowledge-based neural networks
Data Mining in the Bioinformatics Domain. Proceedings of the 26th VLDB Conference
Genetic Variation of SARS Coronavirus in Beijing Hospital. Emerging Infectious Diseases
Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling