key: cord-0606520-lx28t9yw authors: Ghosh, Dibyendu; Chakraborty, Srija; Kodamana, Hariprasad; Chakraborty, Supriya title: Application of Machine Learning in understanding plant virus pathogenesis: Trends and perspectives on emergence, diagnosis, host-virus interplay and management date: 2021-12-03 journal: nan DOI: nan sha: dd075d3cfa748851b59531878b3c2c9b8ccf16dd doc_id: 606520 cord_uid: lx28t9yw The inclusion of high-throughput technologies in biology has generated massive amounts of biological data in recent years. Transforming these huge volumes of data into knowledge is now the primary challenge in computational biology. Traditional methods of data analysis have failed to carry out this task. Hence, researchers are turning to machine learning-based approaches for the analysis of high-dimensional big data. In machine learning, once a model is trained with a training dataset, it can be applied to an independent testing dataset. Deep learning algorithms are now further promoting the application of machine learning in several fields of biology, including plant virology. Considering the significant progress in the application of machine learning to plant virology, this review provides an introductory note on machine learning and comprehensively discusses the trends and prospects of machine learning in the diagnosis of viral diseases, the understanding of host-virus interplay and the emergence of plant viruses. Over the years, extensive research has been carried out in various fields of biology to understand the science behind a plethora of complex biological phenomena. The study of problems such as traits in plants and plant viral diseases leads to the generation of massive datasets. Progress in technology has rendered data generation a simple task.
Cost-effective technologies such as Next Generation Sequencing (NGS) have made it easier to gather data on gene expression, chromosome conformation, genetic variation, traits and diseases of animals and plants, producing massive datasets with multiple characteristics [1]. However, the resulting data explosion, especially in the field of omics, has made the handling of large datasets a major concern. Traditional statistical methods of data analysis are no longer effective or efficient in this context [2]. Furthermore, biological phenomena comprise various aspects, which generate more than one data type. This calls for an integrated analysis of the different types of data, but the noisiness of heterogeneous biological data makes this a difficult task [3]. Data dimensionality is another major impediment: omics data, for instance, is generally highly resolved and hence high-dimensional. Moreover, sample size in biological studies is limited in most cases. Together, these factors may lead to issues including overfitting, multicollinearity and data sparsity [4]. To overcome these barriers, attempts are being made to incorporate machine learning (ML) and deep learning (DL) tools into the analysis of such datasets. ML tools identify patterns in the data using different statistical methods. The ML paradigm can be used to derive models for classification, pattern recognition and prediction based on existing data. DL algorithms extract high-level features from huge datasets (such as a collection of genomic sequences or images), recognize the hidden patterns and then use them to create trained prediction models [5]. These trained models can then be applied to data from other sources for tasks such as prediction and classification.
These techniques can tackle tough problems by finding structure in seemingly random data, even when the data is too large and complex for human comprehension [6]. Hence ML, and especially DL, can analyse enormous datasets in an efficient, cost-effective, accurate and high-throughput manner [7]. In the context of ML, there are two primary methods for training models: supervised and unsupervised learning. Both have potential for use in biology. In supervised learning, the given collection of features, or attributes, of a system under investigation is labeled [8]. Two recurring problems in the supervised learning framework are regression and classification. Classification assigns objects to classes on the basis of the properties of their features. In biology, one example of such training (involving object-to-class mapping) is the mapping of gene expression profiles to their respective diseases. The algorithm may also return a "doubt value", a parameter that shows how uncertain the algorithm is about assigning the object to one of the many possible cases, or flag an "outlier", which shows how unlike the previously seen objects the object is, making classification difficult. Some of the widely used supervised models are linear/nonlinear regression, support vector machines (SVM), Gaussian processes and neural nets [9]. In unsupervised learning, the objects under study carry no predefined labels [10]. The goal of these models is to recognize similarities among the objects by exploring the data. These similarities define clusters (groups of data objects). The basic concept of unsupervised learning is thus to discover natural patterns and form groupings of the data. Overall, in supervised learning the data is pre-labeled and the algorithm learns how to use the labels to associate the objects with the classes.
On the other hand, in unsupervised learning the data is unlabeled, and the algorithm learns to create labels itself by clustering the objects. Widely used unsupervised methods include principal component analysis, k-means clustering, Gaussian mixture models, density-based spatial clustering of applications with noise (DBSCAN) and hierarchical clustering [9]. In certain cases, a method known as semi-supervised learning has proven quite useful. One example is the classification of protein sequences, where only a few sequences are labeled (belonging to a known class) while many others belong to unknown classes. The semi-supervised algorithm transfers the class labels from the labeled objects to the unlabeled objects that lie nearby in the feature space [11]. The basic steps for creating a machine learning model for the study of biological data are shown in Figure 1. After the data (labeled or unlabeled) is gathered, it is split into two sets, one for training and one for testing. If the data samples are corrupted with noise and outliers, they need to undergo preprocessing and augmentation before the split. Next, the model is trained using the training dataset. The model can either be created from scratch, or a pre-trained model can be adjusted to the dataset collected. Once the trained model is ready, the testing data is passed through it to check the accuracy with which the objects are classified into the different labels [12]. When working with neural networks, we essentially attempt to draw inferences in a manner analogous to the human brain by building an Artificial Neural Network (ANN) [13]. An ANN resembles a biological neural network. The artificial neurons used here are mathematical models that carry out three main functions: multiplication, addition and activation. The goal is to build layers of neurons, each of which produces a suitable response to any input provided to it.
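As an illustration of the three neuron functions named above, the following minimal Python sketch builds one artificial neuron and stacks neurons into layers. It is not taken from the reviewed literature: the sigmoid activation, the function names and all weights, biases and inputs are assumptions chosen only for demonstration.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice); maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One artificial neuron: multiply, add, then activate."""
    weighted = [x * w for x, w in zip(inputs, weights)]  # multiplication
    total = sum(weighted) + bias                         # addition
    return sigmoid(total)                                # activation


def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all receive the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical toy network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
x = [0.5, -1.0, 2.0]
hidden = layer(x, [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]], [0.0, 0.1])
output = layer(hidden, [[0.7, -0.6]], [0.05])
print(output)
```

In a trained network the weights and biases would be learned from the training dataset rather than written by hand; here they are fixed so that only the flow of a single input through the layers is visible.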
The neurons of each layer multiply their inputs by the corresponding weights. The weighted sum is then passed through the activation function and finally transferred to the next layer of neurons. Once the input layer is fired up, the signal moves along the subsequent layers of neurons (hidden layers), firing up the respective neurons until the final output layer is reached [14]. A schematic representation of a neural network is shown in Figure 2. In neural networks, the direction of information flow is determined by the internodal connections. On this basis, neural networks fall into two classes: those with unidirectional flow (cascade-forward and feed-forward networks) and those with bidirectional flow (feed-backward, or recurrent, networks) [15]. In feed-forward networks, information flows between the layers in one direction. Cascade forward is similar except the input to the next layer is weighted. In recurrent networks, information flows in both directions. All the nodes are interconnected, including self-connections. Unfortunately, these networks are complex, bulky, difficult to operate and require a large amount of computational resources. Many other neural network architectures are under development, including self-organizing networks, convolutional neural networks (CNN), variational autoencoders (VAE) and generative adversarial networks (GAN) [16, 17]. Various parameters are used to evaluate the classification performance of the model developed. Some of the important ones are accuracy, sensitivity (recall), specificity, precision (positive predictive value), negative predictive value and F1 score [18]. These judge the performance of a model by calculating various ratios involving true positives, false positives, true negatives and false negatives. All four counts can be arranged in a single confusion matrix, which is then studied to judge the model's performance [19].
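The performance measures listed above can be written directly as ratios of the four confusion-matrix counts. The sketch below uses the standard textbook definitions for a binary classifier; the function name and the example counts (a hypothetical diseased-versus-healthy leaf classifier evaluated on 100 test images) are assumptions for illustration and do not come from any study cited here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification ratios from confusion-matrix counts.

    tp/fp/tn/fn = true positives, false positives, true negatives, false negatives.
    """
    recall = tp / (tp + fn)        # sensitivity: share of actual positives found
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),   # share of actual negatives found
        "precision": precision,
        "npv": tn / (tn + fn),           # negative predictive value
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 45 diseased and 55 healthy leaves in the test set.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against zero denominators, which arise when a class is absent from the test set or never predicted; a practical implementation would need to handle those cases explicitly.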
In addition, overfitting and underfitting are widely encountered when employing ML models and are therefore being investigated by researchers [20]. Overfitting occurs when the model fits the noise in the data rather than the signal: the error on the validation data increases while the error on the training data decreases [21]. Underfitting is the reverse scenario, in which the model is not capable of capturing the variability in the data [22]. Many methods are being developed to prevent such imperfect fits. Penalty methods, early stopping, batch normalization and dropout are a few of them [23]. This is where the benefits of DL techniques for feature extraction become apparent: DL enables automatic extraction of features, for instance through convolutional neural networks, rather than the handcrafted features used in traditional ML algorithms. DL has substantially improved the reliability of plant stress phenotyping by accommodating a large sample size for training and testing [39]. A major constraint of this method is the vast variation in environmental conditions between the field and the lab. While temperature, humidity and light intensity are kept constant in the lab, all of these variables change constantly in the field, influencing the captured images [41]. Hence, it is recommended to train a model on field images, since it has been demonstrated that a classifier trained on field images can also classify lab-based images with precision [42]. The lack of a large collection of field-based images of specific plant diseases is another key challenge for accurate and reliable diagnosis. Transfer learning is a recent advancement in ML that enables data scientists to adopt a previously well-trained model to solve a similar kind of problem [39].
For example, a model trained for chilli leaf curl disease detection may be used for detecting leaf curl symptoms caused by viruses in tomato (Figure 3). There are several approaches to adopting a pre-trained model; one can select and fine-tune the architecture and/or parameters of a model depending upon the type of dataset. Table 1 summarizes the development of ML-assisted diagnosis of plant viral diseases over the last few years. The recent trend of studying the plant virome through metagenomics has unveiled the diversity of plant viruses. Large numbers of phylogenetically related and unrelated virus species have been found in diseased samples [43, 44]. The explosion of virome data generated through NGS calls for urgent structuring and analysis of sequence data in order to obtain an accurate picture of viral diversity. Although significant progress has been made in the case of animal viruses, limited efforts have so far been recorded in the field of plant virology [45, 46]. V-pipe provides a bioinformatics pipeline for analyzing the genomic diversity of human immunodeficiency virus (HIV) from sequencing data [47]. As RNA viruses use error-prone polymerases during replication, the chance of mutations in their genome sequences remains quite high. Mutation in the viral genome eventually leads to the emergence of new virulent viral strains [48]. A neural network-based model can predict probable point mutations in an RNA sequence. This approach has been successfully explored in the case of Newcastle disease virus [49], and an optimized form may be very useful for predicting mutations in plant viral genomes (Figure 3). Besides RNA viruses, DNA viruses also possess significant genetic variation, and events like recombination and genome reassortment play a crucial role in mediating the emergence of new viral forms [50, 51].
The identification of novel viruses and satellite molecules through metagenomic approaches emphasizes the importance of precise taxonomic classification followed by demarcation of these new species. Silva and collaborators have developed Fangorn Forest, an ML-based method for the classification of geminiviruses. Among the three algorithms tested, random forest (RF) proved to be the best at classifying genes and genera of this largest plant virus family [52]. Being obligate parasites, viruses rely on the cellular machinery of plants for every aspect of pathogenesis, including replication, gene expression and movement [53]. Plants elicit a robust antiviral immune response to restrict viral invasion [54]. Viruses, in turn, encode effector proteins which disarm plant defense signaling. This continuing tug of war fuels the co-evolution of both virus and host [55]. Hence, understanding the interplay between plants and viruses is crucial for in-depth dissection of viral pathogenesis. Although plants have evolved a variety of tools and tactics to prevent virus multiplication, the resistance (R) protein-mediated immune response and gene silencing are the most well-known features of their antiviral defense [54]. The majority of canonical R proteins contain nucleotide-binding site leucine-rich repeats (NBS-LRR), which mediate direct or indirect recognition of virus-encoded effector proteins, resulting in the activation of effector-triggered immunity (ETI). Very few R genes imparting an immune response against viruses have been identified and characterised to date, which limits our knowledge of the detailed mechanism of dominant resistance in plant-virus interactions [56]. NBSPred, a high-throughput bioinformatics tool built on a support vector machine, precisely identifies NBS-LRR-containing R proteins from genome, transcriptome and proteome data [57].
Receptor-like kinases (RLK) are crucial players in the immune perception of phytopathogens, many of them acting as pattern recognition receptors (PRRs) which induce pattern-triggered immunity (PTI) [58]. However, several plant viruses target RLKs to promote viral pathogenesis by modulating the host-virus interaction [61] [62] [63] [64]. ML helps biologists to predict gene regulatory networks (GRNs) from high-throughput transcriptome data [65], which may lead to the identification of several regulatory nodes of plant immune signalling (Figure 3). On the other hand, viruses encode few but multitasking effector proteins which facilitate viral pathogenesis. Examining the subcellular localization of these effector proteins is important for understanding their mechanism of action. Furthermore, viruses also redirect the subcellular localization of several host proteins to disrupt their assigned functions [66]. ML-assisted online tools such as LOCALIZER and MU-LOC enable precise and accurate analysis of the subcellular localization of effector proteins and host factors using only the amino acid sequences of the proteins as input (Figure 3) [67, 68]. The application of ML to the successful prediction of fungal effector proteins has given phytopathology research an extra edge [69]. In the case of viruses, some effector proteins, known as viral suppressors of RNA silencing (VSRs), have evolved to block antiviral gene silencing. VSRs amplify the negative impact of viral diseases by promoting synergistic associations among different plant viruses [70]. Jagga et al. developed a bioinformatics platform, pVsupPred, for the prediction of VSRs encoded by plant-infecting viruses. They used four classifier models, LibSVM, J48, Naïve Bayes and RF, of which the RF algorithm emerged as the best, with an overall accuracy of 86.11% [71].
Later, in another study, the sequential minimal optimization (SMO) algorithm was optimised to achieve an overall accuracy of 95.3% in the identification of plant virus-encoded VSRs (Figure 3) [72]. Another significant facet of plant-virus interaction is the virus-induced alteration of microRNA (miRNA) homeostasis, which impacts the transcriptome profile of infected cells. Hence, it is important to accurately identify the targets of specific miRNAs regulating plant immunity and viral pathogenesis [73]. The advent of ML in bioinformatics has significantly eased this difficult job. Supervised ML approaches including graphical models, kernel machines and evolutionary algorithms are widely used to identify specific miRNA targets in eukaryotes [74]. Further, a new category of DL models known as graph neural networks (GNN) is emerging as a promising tool in bioinformatics. Biological networks based on small RNA-disease associations can be constructed as graphs with nodes and edges. GNNs can operate on such graph data and learn more representative features, which can be used efficiently for inference [75]. Finally, the best way to understand the functional aspects of a protein is to visualize its accurate structure. Only a very small proportion of plant proteins involved in immune signalling have been structurally characterised so far. In addition, the structures of plant viral proteins are also largely unresolved. The labour-intensive process of protein crystallization is the major bottleneck here. However, Jumper and collaborators have revolutionised protein structure prediction by launching AlphaFold2, a neural network-based structural bioinformatics platform, which can predict a protein structure with accuracy close to that of experimental methods even when no similar protein structure is available [76].
ML-guided docking studies have efficiently screened chemical inhibitors of the spike (S) protein encoded by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [77]. Similarly, structure prediction of plant viral proteins and prediction of their chemical inhibitors, followed by successful delivery, would be a novel and effective virus management strategy (Figure 3). Although the last decade has witnessed a sharp increase in the application of ML to complex biological problems [78], its use in plant virology is still at a very nascent stage. Several reports have highlighted the role of ML in the precise diagnosis of plant viral diseases [30] [31] [32] [33] [34] [35] [36]. Plant virologists can foresee the tremendous scope of ML in addressing virus evolution, emergence, plant-virus interplay and, above all, management strategies. Moreover, several specific issues need to be explored. which enables the identification of prokaryotic virus sequences from mixed metagenomic data [79]. A similar approach can be employed for plant virome studies. Fifthly, ML is now widely utilized in genomic selection for rapid and better prediction of superior genotypes for breeding purposes [80, 81]. Progress in ML-assisted genomic prediction will help breeders in developing elite virus-tolerant/resistant varieties. Finally, a collaborative effort between plant virologists and big data analysts is of prime importance for the fruitful application of ML to the understanding of plant virus pathogenesis and the subsequent development of antiviral strategies.
Figure 2. Shown here is a standard representation of an artificial neural network. This network comprises three basic layers: the input layer, the single hidden layer and the output layer. It is assumed that the input layer has n independent variables, each of which, when activated, gives a certain output. Depending on this output, the subsequent neurons of the hidden layer are activated.
After incorporating the correct bias function, the hidden layer function goes on to activate the output layer, which goes on to produce the final result after considering the bias. Opportunities and obstacles for deep learning in biology and medicine Machine learning and complex biological data Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion The curse(s) of dimensionality Statistics versus machine learning Deep learning for biology Deep learning Machine learning for high-throughput stress phenotyping in plants Machine learning and Its applications to biology Applications and potentials of artificial neural networks in plant tissue culture. In Plant tissue culture engineering Semi-supervised protein classification using cluster kernels Recent advances of deep learning in bioinformatics and computational Biology Human-level control through deep reinforcement learning Introduction to the artificial neural networks Machine learning techniques in plant biology Evolving artificial neural networks A novel radial basis function neural network for discriminant analysis Performance analysis of deep learning CNN models for variety classification in hazelnut Identification of plantleaf diseases using CNN and transfer-learning approach A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay statistics: Stopped training and other remedies for overfitting Process mining: a two-step approach to balance between underfitting and overfitting Avoiding overfitting in multilayer perceptrons with feeling-of Deep Learning for plant stress phenotyping: trends and future perspectives Identification of proteins of Tobacco mosaic virus by using a method of feature extraction Plant disease detection by imaging sensors -parallels and specific demands for precision agriculture and plant phenotyping Deep learning models for plant disease detection and 
diagnosis Plant virus metagenomics: biodiversity and ecology Plant virus metagenomics: what we know and why we need to know more evolution: Machine learning methods for predicting human-adaptive influenza A viruses based on viral nucleotide compositions Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data Experimental evolution of plant RNA viruses Biology S: The prediction of virus mutation using neural networks and rough set techniques Complexity of begomovirus and betasatellite populations associated with chilli leaf curl disease in India Capsicuminfecting begomoviruses as global pathogens: host-virus interplay, pathogenesis, and management Fangorn Forest (F2): a machine learning approach to classify genes and genera in the family Geminiviridae Plant immune responses against viruses: how does a virus cause disease? The Plant cell Plant immunity against viruses: antiviral immune receptors in focus The tug-of-war between plants and viruses: great progress and many remaining questions Defended to the Nines: 25 years of resistance gene cloning identifies nine mechanisms for R protein function NBSPred: a support vector machine-based high-throughput pipeline for plant resistance protein NBSLRR prediction Receptor kinases in plant-pathogen interactions: more than pattern recognition Molecular dialogues between viruses and receptorlike kinases in plants. 
Molecular plant pathology Bioinformatics analysis of the receptor-like kinase (RLK) superfamily Transcriptome analysis of two cultivars of tobacco in response to Cucumber mosaic virus infection Comparative transcriptome analysis in Triticum aestivum infecting wheat dwarf virus reveals the effects of viral infection on phytohormone and photosynthesis metabolism pathways Nuclear proteome of virus-infected and healthy potato leaves Comparative metabolomics and transcriptomics of plant response to Tomato yellow leaf curl virus infection in resistant and susceptible tomato cultivars Statistical and machine learning approaches to predict gene regulatory networks from transcriptome datasets Changes in subcellular localization of host proteins induced by plant viruses MU-LOC: a machine-learning method for predicting mitochondrially localized proteins in plants LOCALIZER: subcellular localization prediction of both plant and effector proteins in the plant cell EffectorP: predicting fungal effector proteins from secretomes using machine learning Impact of viral silencing suppressors on plant viral synergism: a global agro-economic concern Supervised learning classification models for prediction of plant virus encoded RNA silencing suppressors Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors Roles of small RNAs in virus-plant interactions Supervised learning methods for microRNA studies Graph neural networks and their current applications in bioinformatics. 
2021 Highly accurate protein structure prediction with AlphaFold Screening of therapeutic agents for COVID-19 using machine learning and ensemble docking studies Machine learning for big data analytics in plants VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data Advances and Challenges in Genomic Selection for Disease Resistance Genomic selection in plant breeding: methods, models, and perspectives Not applicable. DG, SRC, HK and SC conceived the work; DG and SRC collected information, analysed and wrote the manuscript, DG, SRC, HK and SC edited the manuscript; SC arranged the funding. Not applicable. Not applicable. All authors have agreed for publication in the journal. The authors declare that they have no competing interests.