key: cord-0031824-zz5xoync
authors: Wang, Jing; Zhang, Qinglong; Han, Junshan; Zhao, Yanpeng; Zhao, Caiyun; Yan, Bowei; Dai, Chong; Wu, Lianlian; Wen, Yuqi; Zhang, Yixin; Leng, Dongjin; Wang, Zhongming; Yang, Xiaoxi; He, Song; Bo, Xiaochen
title: Computational methods, databases and tools for synthetic lethality prediction
date: 2022-03-29
journal: Brief Bioinform
DOI: 10.1093/bib/bbac106
sha: 02fc0fd98c194c23d9f009d790c3bda9c815e1d6
doc_id: 31824
cord_uid: zz5xoync

Synthetic lethality (SL) occurs between two genes when the inactivation of either gene alone has no effect on cell survival but the inactivation of both genes results in cell death. SL-based therapy has become one of the most promising targeted cancer therapies in the last decade as PARP inhibitors have achieved great success in the clinic. The key to exploiting SL-based cancer therapy is the identification of robust SL pairs. Although many wet-lab-based methods have been developed to screen SL pairs, known SL pairs account for less than 0.1% of all potential pairs because of the large number of human gene combinations. Computational prediction methods complement wet-lab-based methods by effectively reducing the search space of SL pairs. In this paper, we review the recent applications of computational methods and commonly used databases for SL prediction. First, we introduce the concept of SL and its screening methods. Second, various SL-related data resources are summarized. Then, computational methods for SL prediction, including statistical-based methods, network-based methods, classical machine learning methods and deep learning methods, are summarized. In particular, we elaborate on the negative sampling methods applied in these models. Next, representative tools for SL prediction are introduced. Finally, the challenges and future work for SL prediction are discussed.

Synthetic lethality (SL) was originally defined as the setting in which abnormal expression of either of two genes alone has little effect on cell viability but abnormalities in the expression of both genes concurrently lead to cell death [1]. Broadly, 'SL' can be categorized into two classes: (i) SL, which occurs between a gene with a loss-of-function mutation (gene A) and its partner gene (gene B); and (ii) synthetic dosage lethality (SDL), which occurs between an overexpressed gene (gene A) and its partner gene (gene B) [2] (Figure 1). In cancer, SL is significant for two reasons: (i) SL provides an approach for targeted therapy. Abnormalities of gene A can be regarded as cancer-specific biomarkers, and pharmacological inhibition of gene B leads to the selective killing of cancer cells [3]. (ii) SL expands the space of druggable targets. SL points the way to indirectly targeting genes that are not classically 'druggable,' owing to their molecular structure or because they carry loss-of-function mutations [1, 4]. The poly(ADP-ribose) polymerase inhibitor (PARPi) is the first successful clinical example based on SL [1, 5-7], and SL-based therapy has been regarded as one of the most effective anticancer treatments in the last decade [8]. The encouraging results of PARPi led to an increasing number of drug candidates focusing on SL interactions. We list some recent clinical trials related to SL interactions in Table 1 [8, 9].

Despite the attractive concept of SL-based therapeutics, only PARPi has progressed to the clinic so far. A major hurdle might be the identification of clinically relevant, robust SL pairs [10]. Potential SL gene pairs are mainly identified by two kinds of methods: laboratory-based methods and computational methods. The most common laboratory-based methods include yeast screening, drug screening, RNA interference (RNAi) screening and clustered regularly interspaced short palindromic repeat (CRISPR) screening [8]. The potential of yeast screening is limited because only a small portion of yeast genes (∼2000) have human orthologs [11]. Drug screening tests drugs on various cell lines with specific mutations to identify SL gene pairs. SL gene pairs identified by drug screening are easier to translate to the clinic. However, the effect and specificity of drug inhibition tend to be lower than those of gene knockdown [8], and the SL gene pairs found are limited to druggable gene targets. With the advent of RNAi and CRISPR/Cas9 technology, it is now possible to screen human cells for SL gene pairs. However, because of the large number of gene combinations (∼200 million in a mammalian cell) [12], it is impractical to screen all potential SL pairs by these laboratory-based methods.

To overcome the abovementioned disadvantages, a variety of computational methods have been proposed, which can reduce the search space of SL gene pairs. These methods can be divided into four categories: statistics-based methods, network-based methods, classic machine learning-based methods and deep learning-based methods. Statistics-based methods rely on certain hypotheses to predict SL gene pairs. For instance, Jerby-Arnon et al. [13] developed a data-driven model called data mining synthetic lethality identification pipeline (DAISY) for SL prediction based on the assumption that SL genes tend to be co-expressed but are seldom co-inactivated. Network-based methods identify SL gene pairs by constructing protein-protein interaction (PPI) networks [14-17], signaling networks [18, 19] or metabolic networks [20-22]. With the rapid development of machine learning, various algorithms have been applied to SL prediction, including random forest (RF) [23-27], matrix factorization [28-30] and so on. Deep learning-based methods, especially graph neural networks (GNN) [31-33], have recently emerged as useful methods for identifying SL gene pairs.

The rest of this review is organized as follows. The next section introduces SL-related databases, including label databases, feature databases and other related databases. The third section summarizes the computational methods for SL prediction. After that, negative sampling methods applied in these computational methods are explained in the fourth section. The subsequent section introduces available tools to predict SL interactions. Finally, challenges and future work are discussed in the last section.

Due to the development of high-throughput screening technologies, a large amount of SL data has been generated. Many databases have been developed to gather SL pairs, which are listed in Table 2. Among these databases, SynLethDB [34] is a unique comprehensive database for SL. Other databases are based on yeast screening, RNAi Other databases provide data of large-scale single-gene knockout, cancer genomics and mutations, and orthology analysis. SL interactions can be identified from the first two kinds of databases and can be inferred with the help of orthology analysis databases. The statistics of these databases are listed in Table 4. A brief introduction of each feature database and other database is given in the supplementary file.

SynLethDB

SynLethDB [34] is a comprehensive database for SL and it has two versions so far. SynLethDB 1.0 was released in 2015 and SynLethDB 2.0 was released as an update in 2020. SL pairs are collected from multiple sources, including manual curation from the literature, three SL-related databases (BioGRID [35-37], Syn-lethality [38] and GenomeRNAi [39]), bispecific shRNA screening (DECIPHER), computational predictions (DAISY [13]) and text mining data for five species (human, mouse, fruit fly, worm and yeast). A brief introduction of the integrated databases is provided in the supplementary file.

SynLethDB provides a webserver to calculate the confidence score for each SL pair by integrating individual scores derived from different evidence sources. In addition, the latest version adds SynLethKG. It is a comprehensive knowledge graph (KG) of SL including 11 types of biomedical entities and 27 types of relationships, representing features and relationships between genes, cancers and drugs.

The CellMap

The CellMap [40] is a web-based database of genetic interaction for Saccharomyces cerevisiae released in 2016. Through constructing over 23 million double mutants, ∼350 000 positive and ∼550 000 negative genetic interactions are identified. Three different interaction maps are constructed: nonessential × nonessential (N×N), essential × nonessential (E×N) and essential × essential (E×E) genetic network. The number of SL pairs between nonessential genes (∼10 000) is estimated by applying an extreme negative interaction score threshold (<−0.35) to the N × N dataset.
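The thresholding step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout `(gene_a, gene_b, score)`, the gene names and the scores are hypothetical, and only the cutoff of −0.35 comes from the text.

```python
# Extreme negative-interaction cutoff quoted in the text for the N x N dataset.
SL_THRESHOLD = -0.35

def select_sl_pairs(interactions, threshold=SL_THRESHOLD):
    """Keep gene pairs whose interaction score is strictly below the threshold."""
    return [(a, b) for a, b, score in interactions if score < threshold]

# Hypothetical toy records (gene_a, gene_b, interaction_score); values are made up.
toy_interactions = [
    ("YFG1", "YFG2", -0.62),
    ("YFG1", "YFG3", -0.10),
    ("YFG2", "YFG4", 0.21),
    ("YFG3", "YFG4", -0.35),  # exactly at the cutoff, so not selected
]

sl_pairs = select_sl_pairs(toy_interactions)
```

Using a strict inequality mirrors the "<−0.35" wording; a real analysis would read the scores from the downloaded interaction tables rather than from an inline list.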

SL gene pairs can be experimentally obtained by the comparison of single and dual mutants in the same assay. Compared to double mutant yeast strains, which can be developed through high-throughput mating methodologies, it is more challenging to develop human cell lines with double mutations [10, 41] . Early human double perturbation screening used combinatorial siRNA knockdowns or siRNA knockdown under specific mutations to obtain genetic interaction data. SL pairs can be extracted from these data based on a specific indicator.

Laufer et al. [42] performed 51 680 combinatorial RNAi experiments and identified genetic interactions for one or more of 11 phenotypes between 2376 gene pairs in human colon cancer cells.

Vizeacoumar et al. [43] identified negative genetic interaction partners of five specific driver-mutated genes across a set of isogenic cancer cell lines through pooled shRNA screening. A total of 826 genetic interactions are tested and 200 negative genetic interactions (24.2%) are confirmed. They generate a genetic interaction network consisting of 2014 nodes and 2617 edges.

With the advances in CRISPR technology, it is now possible to systematically map SL networks in human cancer cells using combinatorial CRISPR screening.

Shen et al. [44] developed a high-throughput CRISPR screening approach for targeting single genes and gene pairs. They screened all possible pairs of 73 cancer genes in three human cell lines and identified a total of 152 SL gene pairs.

Horlbeck et al. [12] systematically screened 222 784 gene pairs from two human cancer cell lines through the CRISPR interference method, and constructed a large genetic interaction (GI) map.

Najm et al. [45] developed a dual-Cas9 platform to screen genetic interactions across six human cell lines and examined SL interactions among them.

Zhao et al. [46] probed metabolic gene networks through combinatorial CRISPR screening developed by Shen et al. [44] . They interrogated a set of 51 genes in A549 and HeLa cells, which are involved in glycolysis and pentose phosphate pathways.

Zamanighomi et al. developed GEMINI (see the section Tools and applications) to identify sensitive lethal and sensitive recovery interactions from combinatorial CRISPR screening. Wan et al. [41] used GEMINI to identify SL interactions from combinatorial CRISPR experiments in three cell lines. They provide gene pairs with both SL relationships and L1000 gene expression profiles.

Benstead-Hume et al. [25] extracted various features from PPI networks for use in an RF classifier to predict SL and SDL pairs both within and across five species. All predicted pairs are available in the Slorth database, released in 2019.

Han et al. [48] developed an algorithm to identify potential SL interactions for specific cancer types from The Cancer Genome Atlas (TCGA; see the Supplementary data) and functional screening data. As a result, 10 637 SL interactions were detected. They integrated SL interactions predicted by other studies and constructed the Cancer Genetic Interaction database (CGIdb).

Srivas et al. [49] explored ∼169 000 potential interactions between tumor suppressor gene (TSG) orthologs and druggable genes in yeast. Guided by the strongest signals, they screened thousands of TSG-drug pairs in HeLa cells and constructed conserved SL interaction networks.

The increasing volume of biological data and the rapid development of computer technology have paved the way for computational methods for SL prediction. The principle behind computational methods is to use biological knowledge that has been shown to characterize known SL interactions, thus providing valuable insights for identifying further SL interactions among genes of interest [50]. In general, computational methods can be divided into (i) statistical-based methods, (ii) network-based methods, (iii) classic machine learning (ML) methods and (iv) deep learning methods. Because these methods rest on different principles, they have their own merits and demerits, which are listed in Table 5. A summary of the studies covered in this review is shown in Table 6 and their performance scores are summarized in Table 7.

This section focuses on statistical methods for the SL prediction task. Drawing on knowledge of systems biology, statistical-based methods fit existing SL data under particular assumptions. The assumptions are based on prior biological knowledge, such as the observation that SL genes are frequently coexpressed, have similar functions or exhibit mutual exclusivity with respect to specific genetic events. Models based on these assumptions are usually explainable, as they can reveal statistical regularities between gene pairs at the phylogenetic level to some extent, but their accuracy depends greatly on the prior statistical assumptions.

Earlier studies mainly focused on identifying SL pairs in yeast, due to the limited access to human SL pairs. For instance, yeast SL pairs can be predicted by the maximum likelihood estimation (MLE) method using domain genetic interaction probabilities [51] or genetic interactions of significant short polypeptide clusters [52]. Furthermore, SL gene pairs of humans or other species can be predicted through yeast orthology mapping [16, 53-55]. However, orthology mapping has two major limitations. First, only a small portion of yeast genes have human orthologs, as humans are evolutionarily distant from yeast. Second, SL relationships may develop independently across species [14].

With the rapid accumulation of human genome data, the global prediction of human SL interactions has begun to be investigated. Among the proposed approaches, DAISY is the most representative. DAISY is a data-driven computational pipeline based on large amounts of cancer genomic data, proposed by Jerby-Arnon et al. [13] in 2014. It identifies SL interactions in cancer through three statistical procedures applied in parallel (Figure 3A): (i) Genomic survival of the fittest. This is based on the observation that the coinactivation of SL pairs leads to cell death. Therefore, SL pairs can be selected by identifying gene coinactivation events that occur substantially less often than expected. (ii) shRNA-based functional examination. This is based on the fact that knocking out a gene is lethal to cells when its SL partner gene is inactive. It can be implemented by an integrated analysis of shRNA essentiality screens, somatic copy number alterations (SCNA) and transcriptomic profiles. (iii) Pairwise gene coexpression. SL pairs are likely to be involved in closely associated biological processes and hence tend to be co-expressed [56, 57]. A cancer genome-wide SL interaction network is then constructed from the SL gene pairs identified by all three procedures. DAISY successfully identifies SL pairs by capturing the results obtained from large-scale genomic data and shRNA screens, but these data are at times noisy and inaccurate.
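To make two of these ideas concrete, the following minimal sketch tests whether coinactivation is rarer than expected and whether two genes are coexpressed. It is not the DAISY implementation: the lower-tail hypergeometric test, the Pearson correlation, the cutoffs `ALPHA` and `MIN_CORRELATION`, and all counts and expression values are assumptions chosen for illustration, and the shRNA-based procedure is omitted.

```python
from math import comb, sqrt

def coinactivation_depletion_pvalue(n_samples, n_a, n_b, n_both):
    """Lower-tail hypergeometric probability of observing at most n_both
    samples with both genes inactive, given n_a and n_b inactive samples."""
    lowest = max(0, n_a + n_b - n_samples)
    favourable = sum(
        comb(n_a, k) * comb(n_samples - n_a, n_b - k)
        for k in range(lowest, n_both + 1)
    )
    return favourable / comb(n_samples, n_b)

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

ALPHA = 0.01            # assumed significance cutoff, not taken from the paper
MIN_CORRELATION = 0.5   # assumed coexpression cutoff, not taken from the paper

def is_candidate(n_samples, n_a, n_b, n_both, expr_a, expr_b):
    """Flag a pair that is both depleted for coinactivation and coexpressed."""
    depleted = coinactivation_depletion_pvalue(n_samples, n_a, n_b, n_both) < ALPHA
    coexpressed = pearson(expr_a, expr_b) > MIN_CORRELATION
    return depleted and coexpressed

# Toy numbers: 100 tumors, each gene inactive in 30, both inactive in only 2.
p_value = coinactivation_depletion_pvalue(100, 30, 30, 2)
candidate = is_candidate(100, 30, 30, 2,
                         [1.0, 2.1, 2.9, 4.2], [0.9, 2.0, 3.2, 3.9])
```

Under independence about nine of the 100 toy tumors would carry both inactivations, so observing only two yields a small lower-tail probability.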

Other studies have also developed valuable statistical inference methods and proposed further assumptions for SL prediction. For example, gene pairs altered in a mutually exclusive pattern are likely to be SL pairs [58, 59]; coinactivation of SL pairs may be associated with prolonged patient survival [60]; and SL pairs tend to have high phylogenetic similarity [60].

Wang et al. [61] identified differentially expressed genes between tumors with and without functional p53 mutations by univariate F-test or t-test. The genes that exhibit higher relative expression in p53-mutated tumors were further selected as candidate SL partner genes for p53. Chang et al. [62] selected lung adenocarcinoma-dependent genes by comparing gene expression in lung adenocarcinoma versus nontumorous tissues, and then associated them with five clinical factors to obtain predicted SL pairs. Feng et al. [63] developed an integrated computational pipeline based on ISLE (identification of clinically relevant SL) [60], which determines SL partner genes of GNAQ according to four conditions: molecular condition (differentially overexpressed genes), clinical condition (genes associated with poor prognosis), phenotypic condition (more essential genes) and druggable condition. Recently, Yang et al. [64] inferred SL gene pairs in liver cancer based on DAISY and ISLE, using five inference analyses (functional similarity, differential gene expression, pairwise gene coexpression, pairwise survival and rank aggregation). Sinha et al. [65] proposed a computational pipeline called Mining Synthetic Lethals (MiSL) to identify mutation-specific SL pairs for specific cancers. Their basic assumption is that SL partner genes of a mutated gene tend to be amplified more frequently or deleted at a lower frequency in primary tumor samples containing the mutated gene.
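The differential-expression step used by Wang et al. [61] can be illustrated with a two-sample t statistic. The sketch below is an assumption-laden toy: it uses Welch's unequal-variance form of the t statistic, hypothetical gene names and expression values, and an arbitrary cutoff `T_CUTOFF` instead of a proper p-value with multiple-testing correction.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two samples."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical expression values in p53-mutated versus p53-wild-type tumors.
expression = {
    "GENE_X": {"mutant": [8.1, 7.9, 8.4, 8.2], "wild_type": [5.0, 5.3, 4.8, 5.1]},
    "GENE_Y": {"mutant": [6.0, 6.2, 5.9, 6.1], "wild_type": [6.1, 5.9, 6.0, 6.2]},
}

T_CUTOFF = 4.0  # arbitrary illustrative cutoff, not taken from the paper

t_scores = {gene: welch_t(values["mutant"], values["wild_type"])
            for gene, values in expression.items()}

# Genes with clearly higher expression in mutated tumors become candidates.
candidates = [gene for gene, t in t_scores.items() if t > T_CUTOFF]
```

Only a one-sided comparison is made here because the text selects genes with higher relative expression in the mutated tumors.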

This section focuses on the network-based methods for SL prediction. Compared with statistical methods, network-based methods provide a more comprehensive understanding of genes in the entire biological network and improve our understanding of the mechanisms of SL. Currently, network-based methods predict SL pairs through constructing biological networks (PPI networks, signaling networks or metabolic networks), then analyzing the topological characteristics of genes in biological networks and assessing the network changes in response to knocking out gene pairs.

Kranthi et al. [15] pointed out that the connectivity of a protein in the PPI network and the structure of the network are related to its functional characteristics. In general, protein nodes with high degrees are usually functionally essential, and their loss would lead to lethality. Based on this, they developed graph information centrality measures in biological systems to identify SL gene pairs, modifying the information centrality method so that two nodes are knocked out. However, this method does not take into account the efficiency changes caused by knocking out a single node, although the network changes may at times be caused by knocking out one gene alone [18]. Jacunski et al. [14] evaluated connectivity homology by calculating network parameters in the PPI network and designed an SL prediction model based on connectivity homology. Ku et al. [17] identified functionally distinct KRAS SL subnetworks or modules in the PPI network using the MCODE clustering algorithm, all of which can be traced back to a specific pathway or protein complex.
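The knockout idea can be sketched by measuring how much a network's global efficiency (the mean inverse shortest-path length) drops when one or two nodes are removed. This is a simplified stand-in, not the exact information centrality measure of Kranthi et al. [15]; the four-node toy graph and the function names are hypothetical.

```python
from collections import deque

def global_efficiency(adj, removed=frozenset()):
    """Mean inverse shortest-path length over all ordered node pairs.
    Removed nodes stay in the denominator but lose all their edges."""
    n = len(adj)
    total = 0.0
    for source in adj:
        if source in removed:
            continue
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search on the unweighted graph
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in removed and neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total += sum(1.0 / d for d in distance.values() if d > 0)
    return total / (n * (n - 1))

def efficiency_drop(adj, knocked_out):
    """Relative loss of global efficiency after knocking out the given nodes."""
    baseline = global_efficiency(adj)
    return (baseline - global_efficiency(adj, frozenset(knocked_out))) / baseline

# Toy network: S and T are linked through two redundant intermediates A and B.
toy_network = {
    "S": {"A", "B"},
    "A": {"S", "T"},
    "B": {"S", "T"},
    "T": {"A", "B"},
}

single_drop = efficiency_drop(toy_network, {"A"})
double_drop = efficiency_drop(toy_network, {"A", "B"})
```

Comparing `double_drop` with the single-knockout drops addresses the limitation noted above: in the toy network, removing A alone halves the efficiency, whereas removing A and B together disconnects the network entirely.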

Zhang et al. [19] predicted SL gene pairs by combining a data-driven method with pathway information from signaling networks to mimic the influence of single-gene and double-gene knockdown on cell viability. Gene pairs are considered potential SL pairs when double-gene knockdown significantly increases the likelihood of cell death, whereas single-gene knockdown does not. Liu et al. [18] constructed a human cancer signaling network (HCSN) by calculating the shortest path between noncancer gene and cancer gene pairs. They then screened SL pairs from the HCSN by three procedures: a network-based method (according to the distance between cancer genes and noncancer genes), a frequency-based method and a function-based method. Because this method screens SL pairs with a multistep strategy, it may yield more reliable candidates.

Apaolaza et al. [20] developed a genetic minimal cut set (gMCS)-based method to predict SL interactions and revealed a potential mechanism explaining how specific gene knockouts disrupt cell growth. gMCS refers to minimal sets of reactions, the removal of which will invalidate the function of specific metabolic tasks. Megchelenbrink et al. [21] presented a network modeling method called identifying dosage lethality effects (IDLE). IDLE predicts enzymatic SDLs from a genome-scale model of metabolism (GSMM). For each pair of enzymes (A, B) in the human GSMM, they predicted SDL by measuring the growth reduction caused by changing the enzyme flux of A and B. IDLE identifies SDLs in clinical settings, but it does not integrate additional data sources such as patient-specific omics data. In addition, Pratapa et al. [22] developed Fast-SL, an algorithm to rapidly identify SL gene sets in metabolic networks. The algorithm overcomes the computational complexity encountered in previous methods by iteratively narrowing the search space for SLs, thus substantially reducing the computational time.

However, network-based methods can only integrate one or more interaction networks among genes. Relationships between genes and other entities such as patients cannot be directly modeled [30]. In addition, they cannot use other data that contain SL-related information, such as sequence and functional properties of genes. Moreover, they do not use the existing SL samples, so the underlying patterns of known SL pairs are not exploited.

This section mainly introduces some classic ML methods for SL prediction tasks. Compared with the network-based methods, ML methods can effectively integrate multidimensional data and achieve feature learning through parameter fitting, providing more comprehensive information for SL prediction. Classical ML methods attempt to reveal the patterns of observed samples that cannot be acquired through principle analysis, in order to achieve reliable prediction of unknown data (Figure 2). There are two main types of classical ML methods:

The principle of SVM is to create an optimal decision boundary that maximizes the distance between two classes [66] . Paladugu et al. [67] proposed an SVM model that uses topological properties of two genes in a PPI network as features for SL prediction in yeast.

A DT model creates a tree-like structure for classification, where each internal node corresponds to a test of a feature and each leaf node refers to a classification result [68]. Yin et al. [69] predicted SL interactions in breast cancer based on DT. Two features, mutation coverage and copy number variation (CNV), are classified and optimized using experimentally validated SL pairs and are then used to predict SL interactions with the DT.

The k-NN algorithm is a nonparametric method [70] that classifies unknown samples by a plurality vote of their neighbors. Wu et al. [71] proposed a k-NN model to achieve similarity-based classification of gene pairs. The basic hypothesis of this model is that unknown gene pairs that exhibit high levels of similarity to known SL pairs are more likely to be potential SL pairs.
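A minimal sketch of plurality-vote k-NN classification on gene-pair feature vectors follows. The two-dimensional features, the Euclidean distance and the value of `k` are assumptions for illustration only; the similarity measure used by Wu et al. [71] is not described here.

```python
from collections import Counter
from math import dist

def knn_predict(train_features, train_labels, query, k=3):
    """Label a query by plurality vote among its k nearest training points."""
    ranked = sorted(range(len(train_features)),
                    key=lambda i: dist(train_features[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical gene-pair features, e.g. two similarity scores per pair.
train_features = [(0.90, 0.80), (0.80, 0.90), (0.85, 0.70),
                  (0.10, 0.20), (0.20, 0.10), (0.15, 0.30)]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = known SL pair, 0 = assumed non-SL pair

prediction_near_sl = knn_predict(train_features, train_labels, (0.80, 0.80))
prediction_far_from_sl = knn_predict(train_features, train_labels, (0.20, 0.20))
```

An odd `k` avoids ties in this two-class setting.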

Ensemble classifiers make predictions by integrating the results of several independently trained weak models on the same samples. The integrated model typically outperforms the separate models. How to choose the independent weak models and how to integrate their learning results are the main challenges of this approach. Pandey et al. [72] defined a large number of features for characterizing SL interactions from diverse data sources. They then designed an integrated multi-network and multi-classifier (MNMC) framework composed of six different classifiers to predict yeast SL gene pairs. Wu et al. [73] also developed an ensemble algorithm (MetaSL) that integrates RF, DT, SVM and other ML classifiers based on a variety of biological features. Compared with MNMC [72], MetaSL assigns different weights to different classifiers according to their performance in the training process. Thus, the prediction results are based on a weighted consensus of the participating classifiers. However, a limitation of this study is that interdependence exists among the input features.
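A weighted consensus of this kind can be sketched as a weighted average of per-classifier SL probabilities. The exact weighting scheme of MetaSL is not given in the text, so the weights below (imagined validation scores), the classifier names and the probabilities are all hypothetical.

```python
def weighted_consensus(probabilities, weights):
    """Weighted average of per-classifier SL probabilities for one gene pair."""
    total_weight = sum(weights[name] for name in probabilities)
    return sum(weights[name] * p for name, p in probabilities.items()) / total_weight

# Hypothetical per-classifier weights, e.g. derived from validation performance.
classifier_weights = {"rf": 0.9, "svm": 0.7, "dt": 0.6}

# Hypothetical SL probabilities predicted by each classifier for one gene pair.
pair_probabilities = {"rf": 0.8, "svm": 0.4, "dt": 0.3}

consensus = weighted_consensus(pair_probabilities, classifier_weights)
unweighted = sum(pair_probabilities.values()) / len(pair_probabilities)
```

Because the best-performing classifier gives the highest probability in this toy case, the weighted consensus lies above the plain average.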

RF [74] is itself an ensemble classifier in which all of the integrated classifiers are DTs. RF achieves strong predictive power by combining the simplicity of DTs with the flexibility and power of ensemble classifiers. In addition, it can cope with high-dimensional data (data containing many features) without feature selection, as it randomly selects a subset of features.

Das et al. [23] developed an RF-based R package, DiscoverSL, to predict SL interactions in cancers using multiomics cancer data. Li et al. [24] encoded genes as enrichment scores based on GO terms and KEGG annotation, so that a gene pair is represented by numerous features derived from the enrichment scores of its two genes. They then used SL label data to build an RF-based prediction model with optimized functional features. In particular, the maximum relevance and minimum redundancy method [75] is used to generate a ranked feature list, and an incremental feature selection method is applied to select the most appropriate number of features. Benstead-Hume et al. [25] also extracted graph features from the PPI network and used an RF model to predict SL gene pairs. Considering that paralog pairs share functional similarities and are more likely to be SL pairs, De Kegel et al. [26] developed an RF classifier to predict paralog SL pairs. Specifically, they applied TreeExplainer [76] to compute the influence of each feature on a specific prediction, so the classifier is able to make interpretable predictions. Benfatto et al. [27] developed an algorithm called PAn-canceR Inferred Synthetic lethalities (PARIS) that can address the importance of individual gene deficiency in explaining gene dependencies in multiple cancer cells. The core of the PARIS algorithm lies in the feature selection step, in which RF assigns importance scores to each mutation and expression feature based on CRISPR screening data across multiple cancer cell lines.

The classic ML methods described above are based on a supervised learning framework that requires both positive and negative training samples. However, SL prediction tasks lack real negative samples: most negatives are randomly selected from unknown samples, which may introduce false negatives. Matrix factorization methods effectively avoid this defect by capturing the underlying mechanisms of SL samples and integrating relevant information. Matrix factorization aims to decompose an input matrix into the product of two low-rank matrices; the missing entries of the matrix are then filled with values obtained through model training.

Huang et al. [28] designed a graph regularized self-representative matrix factorization (GRSMF) model, which uses the linear representation of the rows and columns of matrix X to decompose the matrix itself. Moreover, the authors integrate a GO similarity matrix as a graph regularization term to address the sparsity of the input data and improve the prediction accuracy. Compared with conventional matrix factorization, GRSMF is data-adaptive and avoids the need to determine the dimension of the latent space. To further differentiate the importance weights of SL pairs and unknown pairs, Liu et al. [29] proposed a logistic matrix factorization model, called SL2MF (Figure 3C), to learn latent representations of genes. The combination of the latent vectors of two genes determines the probability that they form an SL pair. Moreover, they apply neighborhood regularization to constrain the latent vectors, based on the hypothesis that genes with similar GO or PPI properties should be factorized into similar latent vectors. In addition, conventional matrix factorization methods have limited capability on complicated heterogeneous data. To address this issue, Liany et al. [30] improved the collective matrix factorization (CMF) method through three measures. The first two measures rely on a transformation (principal components analysis and graph features). The third measure extends the model by using matrix-specific weights. This modified model solves the problem that conventional CMF cannot learn a unique representation of each entity when multiple input matrices contain the same entity types.
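The idea of weighting known SL pairs more heavily than unknown pairs in a logistic matrix factorization can be sketched as follows. This is a toy in the spirit of SL2MF rather than the authors' model: the neighborhood regularization from GO and PPI similarity is omitted, and the symmetric inner-product form, the weight `pos_weight`, the plain L2 penalty, the learning rate and the six-gene example are all assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_logistic_mf(n_genes, sl_pairs, dim=2, pos_weight=5.0,
                      reg=0.05, lr=0.05, epochs=300, seed=0):
    """Full-batch gradient ascent on a weighted logistic log-likelihood.
    Known SL pairs count pos_weight times; unknown pairs count once as 0."""
    rng = random.Random(seed)
    latent = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_genes)]
    positives = {frozenset(pair) for pair in sl_pairs}
    for _ in range(epochs):
        # Start each gradient from the L2 penalty term.
        grads = [[-reg * value for value in latent[i]] for i in range(n_genes)]
        for i in range(n_genes):
            for j in range(n_genes):
                if i == j:
                    continue
                p = sigmoid(dot(latent[i], latent[j]))
                y = 1.0 if frozenset((i, j)) in positives else 0.0
                coeff = pos_weight * y * (1.0 - p) - (1.0 - y) * p
                for d in range(dim):
                    grads[i][d] += coeff * latent[j][d]
        for i in range(n_genes):
            for d in range(dim):
                latent[i][d] += lr * grads[i][d]
    return latent

def sl_probability(latent, i, j):
    """Predicted probability that genes i and j form an SL pair."""
    return sigmoid(dot(latent[i], latent[j]))

# Toy data: genes 0, 1 and 2 are mutually SL; genes 3 to 5 have no known partner.
toy_pairs = [(0, 1), (0, 2), (1, 2)]
latent = train_logistic_mf(6, toy_pairs)
```

After training, `sl_probability(latent, 0, 1)` should exceed `sl_probability(latent, 0, 3)`, since the first pair is a weighted positive and the second is treated as unknown.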

This section discusses the application of deep learning in SL prediction. Deep learning is a subset of ML methods. Compared to classical ML methods that extract features manually based on knowledge, deep network structures can better capture nonlinear and complex relationships between inputs and outputs, allowing them to identify complex patterns behind the data. Interdependent relationships always exist in biological entities and processes, which are often inherently noisy and occur at multiple scales. Therefore, biological data can be well suited for deep learning.

Neural networks are the most commonly employed models in deep learning, as they fit complex nonlinear problems well [111]. Given that most SL pairs are cell-line specific, Wan et al. [41] developed a semisupervised neural network method called EXP2SL to identify SL pairs. For a pair of genes, they use cell-line shRNA perturbation gene expression profiles (LINCS L1000 project) to construct 978-dimensional features as inputs to the encoding layers and predict potential SL pairs. EXP2SL is the first model to predict cell-line-specific SL pairs, and it makes full use of unlabeled SL data.

Currently, there are plenty of deep learning methods, among which the following two types are the most frequently and effectively used models in SL prediction: GNN and KG embedding.

GNN can efficiently capture the structure of a graph and model complex relations between neighboring nodes in the graph. Three archetypes of GNN have been adopted in SL prediction: graph convolutional network (GCN), graph attention network (GAT) and graph auto-encoder (GAE).

GCN is an extension of the convolutional neural network to graph structures. Compared with the matrix factorization methods mentioned in the previous section, GCN can capture the information of neighboring nodes in the graph. Cai et al. [31] applied GCN to SL prediction and proposed a model called dual-dropout GCN (DDGCN) (Figure 3D). DDGCN can aggregate the information of neighboring genes in a graph by convolution operators. Furthermore, the authors adopt a dual dropout regularization technique [77] during training to avoid overfitting caused by the sparse SL data. However, there are two limitations in this study. First, DDGCN only uses the information of the known SL pairs and lacks information on other features. Second, DDGCN does not assign different weights to different neighbors.
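The neighbor aggregation performed by a GCN layer can be sketched with the widely used propagation rule ReLU(D^(-1/2) (A + I) D^(-1/2) X W), where D is the degree matrix of A + I. This is the generic layer, not DDGCN itself: dual dropout is omitted, and the three-gene path graph, one-hot features and identity weight matrix are hypothetical.

```python
from math import sqrt

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adjacency, features, weight):
    """One graph convolution with self-loops and symmetric normalization."""
    n = len(adjacency)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    normalized = [[a_hat[i][j] / sqrt(degree[i] * degree[j]) for j in range(n)]
                  for i in range(n)]
    aggregated = matmul(normalized, features)    # mix neighbor features
    transformed = matmul(aggregated, weight)     # linear feature transform
    return [[max(0.0, value) for value in row] for row in transformed]  # ReLU

# Toy SL graph: gene 1 is linked to genes 0 and 2 (a three-node path).
toy_adjacency = [[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]]
one_hot_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

hidden = gcn_layer(toy_adjacency, one_hot_features, identity_weight)
```

With one-hot features and an identity weight, the output is simply the normalized adjacency matrix, which makes the fixed, degree-based neighbor weights explicit.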

GAT [78], which is able to assign different weights to each neighbor node, adopts an attention mechanism to counter the shortcomings of GCN. Long et al. [32] developed a Graph Contextual Attention Network model called GCATSL that effectively integrates multiple biological data for SL prediction. After constructing multiple gene feature graphs from different data sources as the model's inputs, a dual attention mechanism (node-level and feature-level) is designed for each feature to capture the importance of local and global neighbors and learn their representations. A multilayer perceptron is further used to aggregate the extracted features with the original features.
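How attention assigns different weights to different neighbors can be sketched with the standard single-head GAT scoring rule: a LeakyReLU applied to a learned linear score of the concatenated node pair, followed by a softmax over the neighbors. This is not the dual attention of GCATSL; the vectors `a_src` and `a_dst` (the two halves of the attention vector), the slope and the embeddings are hypothetical, and the node features are assumed to be already linearly transformed.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_weights(h_center, h_neighbors, a_src, a_dst, slope=0.2):
    """Softmax-normalized attention coefficients of one node over its neighbors."""
    def leaky_relu(x):
        return x if x > 0 else slope * x
    scores = [leaky_relu(dot(a_src, h_center) + dot(a_dst, h_j))
              for h_j in h_neighbors]
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    normalizer = sum(exps)
    return [value / normalizer for value in exps]

def aggregate(h_neighbors, weights):
    """Attention-weighted sum of neighbor embeddings."""
    dim = len(h_neighbors[0])
    return [sum(w * h[d] for w, h in zip(weights, h_neighbors)) for d in range(dim)]

# Hypothetical embeddings for one gene and its three neighbors.
center = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alpha = attention_weights(center, neighbors, a_src=[0.5, 0.5], a_dst=[1.0, 0.0])
updated = aggregate(neighbors, alpha)
```

In contrast to the fixed degree-based weights of a GCN, the coefficients here depend on the neighbor embeddings, so informative neighbors can contribute more.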

The graph auto-encoder (GAE) extends the idea of the autoencoder to graphs. The node embeddings in the graph are obtained through an encoder-decoder structure. In general, GAE uses GCN as the encoder. After the topology and node information of the graph are fed into the encoder, the inner product is adopted as the decoder to reconstruct the original graph. Hao et al. [33] combined GCN with the autoencoder to construct a multiview graph autoencoder (SLMGAE) that uses a variety of data for SL prediction. SLMGAE takes the SL graph as the main view and graphs of other data (PPI, GO, etc.) as support views. Multiple GAEs are applied to graph reconstruction, with GCN used as the encoder. SLMGAE is able to integrate various data sources of genes in a GNN-based framework and to differentiate the data sources by an attention mechanism.
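The inner-product decoder mentioned above can be sketched as a sigmoid applied to the dot product of two node embeddings. The embeddings below are hypothetical stand-ins for encoder outputs, and the multiview attention of SLMGAE is not represented.

```python
import math

def decode_adjacency(embeddings):
    """Reconstruct edge probabilities as sigmoid(z_i . z_j) for all node pairs."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    n = len(embeddings)
    return [[sigmoid(sum(a * b for a, b in zip(embeddings[i], embeddings[j])))
             for j in range(n)] for i in range(n)]

# Hypothetical two-dimensional gene embeddings produced by some encoder.
toy_embeddings = [[1.0, 0.5], [0.9, 0.6], [-1.0, -0.4]]
reconstructed = decode_adjacency(toy_embeddings)
```

Genes with similar embeddings (genes 0 and 1 here) receive a high reconstructed edge probability, while genes with opposing embeddings (genes 0 and 2) receive a low one.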

The network embedding-based methods mentioned above integrate the information of multiple or heterogeneous biological networks but, in essence, do not treat different relationship types in a unified way. KG, a knowledge-rich heterogeneous network composed of interconnected entities and their properties, is well suited to this problem. KG embedding maps the entities and their relationships into a continuous low-dimensional vector space [79], which facilitates computation while retaining structural information. Given the complexity and diversity of biological information, KG performs well in biological tasks such as SL prediction.

Wang et al. [80] constructed a KG-based algorithm (KG4SL) for SL prediction (Figure 3E). The algorithm consists of three modules. First, a gene-specific weighted subgraph is generated for each gene. Second, the representation of each gene is updated by aggregating its neighbors' representations in its weighted subgraph. Third, the SL score is calculated as the inner product of the two genes' aggregated representations. However, this method may not fully integrate the neighborhood topological structure when generating a gene-specific weighted subgraph, because some nodes have large degrees. In addition, some neighbors might be uninformative and promiscuous during message passing. Zhang et al. [2] developed the Synthetic Lethality Knowledge Graph (SLKG), which integrates three types of entities (genes, drugs and diseases) and four types of relationships. Drug repositioning is achieved by defining three core scoring functions: SLScore (SDLScore), DrugScore and CancerScore. SLScore is calculated by integrating different lines of SL evidence.

In general, the four types of methods above each have their own characteristics. Statistics-based methods and network-based methods are usually interpretable for novel predictions and do not require known SL samples. Statistics-based methods rest on statistical assumptions about the biological data, so their performance depends on the accuracy of those assumptions and the quality of the data. For example, on the assumption that SL gene pairs tend to be coexpressed and are seldom coinactivated, DAISY [13] identifies SL gene pairs from large-scale genomic data. Network-based methods rely on deep knowledge of a single or heterogeneous biological network, often accompanied by creative concepts for identifying the nodes that play an important role in the network. For instance, Kranthi et al. [15] developed a graph information centrality measure to identify SL gene pairs from the human cancer protein interaction network. ML methods are trained on known SL samples. Classic ML methods tend to outperform deep learning methods on small and medium-sized data sets (from hundreds to tens of thousands of samples) because they have fewer hyperparameters. Deep learning can achieve better predictions on large data sets, because its complex network structure and large number of parameters allow it to extract high-level features from the data. However, owing to the end-to-end learning process, the intermediate steps of deep learning form a black box: performance is good but interpretability is lacking [81]. This leads to considerable uncertainty and unreliability when such models are applied in biological or medical practice.

SL computational models normally follow a supervised learning framework. Experimental data are composed of positive and negative samples. Positive samples can be extracted from databases in Table 2 . How to prepare negative samples is one of the challenges for SL prediction.

Randomly selecting gene pairs with no known SL relationship is a commonly used method for negative sampling [32, 80]. This approach is relatively simple and yields enough negative samples. However, as shown in Figure 4, it may pick up positive samples that have not yet been identified. Such mislabeled data would degrade the performance of the model.
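A minimal sketch of random negative sampling is given below. The gene identifiers and positive pairs are placeholders rather than real data, and the function name is hypothetical. Enumerating all candidate pairs is only practical for small gene sets; at genome scale, rejection sampling would be a more realistic choice.

```python
import random
from itertools import combinations

def sample_random_negatives(genes, positive_pairs, n_samples, seed=0):
    """Draw unordered gene pairs that are not among the known SL pairs."""
    known = {frozenset(pair) for pair in positive_pairs}
    candidates = [pair for pair in combinations(sorted(genes), 2)
                  if frozenset(pair) not in known]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(candidates, n_samples)

genes = ["g1", "g2", "g3", "g4", "g5"]      # placeholder gene identifiers
positives = [("g1", "g2"), ("g4", "g3")]    # placeholder known SL pairs
negatives = sample_random_negatives(genes, positives, n_samples=3)
print(negatives)
```

The sampled pairs are only unlabeled, not verified non-SL, which is exactly the mislabeling risk described above.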

Another negative sampling method is extracting gene pairs from GI databases with certain GI scores as negative samples [33, 41] . For instance, Hao et al. [33] extracted negative SL samples with GI scores around 0 and positive SL samples with GI scores below −3. This negative sampling method avoids introducing potential positive samples, but the number of negative samples is relatively small.
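The score-based labeling can be sketched as a simple thresholding step. The cutoff of −3 for positives follows the description above; the width of the band "around 0" is not specified there, so `neutral_band=0.5` is an arbitrary assumption, and the pairs and scores are made-up placeholders.

```python
def label_by_gi_score(gi_scores, positive_cutoff=-3.0, neutral_band=0.5):
    """Split gene pairs into SL positives and negatives by GI score.

    Pairs scoring below positive_cutoff are positives; pairs whose absolute
    score is within neutral_band (an assumed tolerance) are negatives;
    all other pairs are left unlabeled.
    """
    positives, negatives = [], []
    for pair, score in gi_scores.items():
        if score < positive_cutoff:
            positives.append(pair)
        elif abs(score) <= neutral_band:
            negatives.append(pair)
    return positives, negatives

# Made-up GI scores for placeholder gene pairs.
gi_scores = {
    ("g1", "g2"): -4.2,
    ("g1", "g3"): 0.1,
    ("g2", "g3"): -1.5,
    ("g2", "g4"): -0.3,
}
positives, negatives = label_by_gi_score(gi_scores)
print(positives, negatives)
```

Pairs with intermediate scores are discarded, which is why this strategy yields relatively few negative samples.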

A number of easy-to-use tools have been developed to predict or identify SL interactions from various data. This section briefly introduces representative tools. Further details of these tools are listed in Table 8.

G2G is a web server for predicting human SL interactions published by Almozlino et al. [82]. The web server predicts the phenotypes of paired gene deletions with an improved algorithm based on RF. After a user submits a source gene and a target gene, the phenotype for that gene pair is computed. Alternatively, users can submit a single gene, and G2G returns all predicted interacting genes according to their neighbor relationships in the PPI network.

Synthetic Lethality Bio Discovery Portal (SL-BioDP) is a comprehensive web tool for predicting SL interactions [23] in hallmark cancer pathways by mining genetic and chemical interactions in cancer. The web tool was developed by Deng et al. [83] in 2019 on the basis of the earlier statistical approach DiscoverSL (see the section 'Statistical-based methods').

Users can search the web tool in three modes: 'GENES' (including 623 commonly mutated cancer genes), 'CANCER' (including 18 histology types) and 'DRUG'. In addition, the 'INFERRED DRUG SYNERGY' mode provides potential synergistic drug combinations.

Magen et al. [84] define 'survival-associated pairwise gene expression states' (SPAGEs) as pairs of genes whose co-expression levels are related to cell survival. They present a data-driven pipeline named SPAGE finder that identifies 71 946 SPAGEs from TCGA data spanning 12 distinct cancer types, a small portion of which are SL pairs. They also provide a web server that visualizes the SPAGEs identified in the original manuscript and allows users to enter or upload a file of comma-separated gene names, which is rendered on the left panel.

Synthetic Lethality using Gene expression and Genomics (SynLeGG) is a web server developed by Wappett et al. [85] in 2021. SynLeGG utilizes MultiSEp algorithm to partition gene expression to discover SL-related characteristics. It predicts genetic dependency relationships including SL spanning 30 tissues and 783 cancer cell lines.

GEMINI [47] is an R package based on the variational Bayesian method to identify genetic interactions from combinatorial CRISPR perturbation studies. Scoring systems related to the individual and combined effects are defined to identify SL interactions.

Traditional genetically targeted cancer therapies normally focus on targeting gene products that are mutated or overexpressed in specific cancer types. However, from a drug discovery perspective, loss-of-function mutations are much harder to target, and the same is true for several undruggable overexpressed genes. Fortunately, SL provides an avenue for treating these targets, as it facilitates the indirect targeting of nondruggable genes through the identification of a second, druggable target that interacts with the primary gene [10]. Despite a marked increase in the identification of SL gene pairs, relatively few SL drug candidates have entered the clinic, and the field remains largely in its infancy [4]. Computational methods hold great promise in this field, but some challenges remain. In this section, we discuss these challenges and possible future work, covering biological issues as well as data and algorithm issues.

Expand the concept of SL

SL is conventionally defined as an interaction between two genes. As the understanding of SL has deepened, some studies have expanded this concept.

Most current studies focus on identifying SL interactions between two genes. However, biological genetic interactions are complex, and it is imperative to identify interactions among multiple genes. Kuzmin et al. [86] scored trigenic interactions in ∼200 000 yeast triple mutants and identified 3196 trigenic negative interactions. The global trigenic interaction network is estimated to be nearly 100-fold larger than the digenic network. Pratapa et al. [22] developed Fast-SL to identify higher-order SL interactions, including triplets and quadruplets. Predicting SL interactions among multiple genes may be one of the challenges in the future.
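To give a sense of why higher-order interactions are hard to screen exhaustively, the short calculation below compares the number of candidate gene pairs with the number of candidate gene triplets. The figure of 20 000 genes is a round-number assumption for illustration only, and the counts describe the size of the search space, not the number of true interactions.

```python
from math import comb

n_genes = 20_000  # round-number assumption, not a figure from the text

pairs = comb(n_genes, 2)     # candidate digenic combinations
triplets = comb(n_genes, 3)  # candidate trigenic combinations

print(f"candidate pairs:    {pairs:,}")
print(f"candidate triplets: {triplets:,}")
print(f"triplets per pair:  {triplets / pairs:,.0f}")
```

The ratio grows linearly with the number of genes, since C(n, 3) / C(n, 2) = (n − 2) / 3.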

Ryan et al. [87] consider that SL can be divided into hard SL and soft SL. Conventional SL is called hard SL. Soft SL exists between gene A and gene B but can be rescued by other genes. These reverse effects are called synthetic rescue (SR) or synthetic viability (SV). Gu et al. [88] identified candidate SR (SV) pairs by applying a statistical-based method and demonstrated that SR (SV) enables the prediction of drug resistance. Integrating SL with SR (SV) may improve the reproducibility of SL prediction, so future work on SL prediction should take SR (SV) into consideration.

Conventionally, mutated genes are used to distinguish cancer cells from normal cells, and pharmacological inhibition of their partner genes is commonly adopted for SL-based cancer therapies. However, this concept can be extended. Akimov et al. [89] point out that the main determinant of any SL interaction is the phenotype alteration caused by a specific mutation or molecular perturbation. Therefore, considering the polygenic nature of the phenotype, they propose that phenotype might be a more robust differentiating context for SL interactions. The SL interaction between the WRN gene and the microsatellite instability phenotype is an example of phenotype-centric SL [90, 91]. Identifying more phenotype-centric SLs is a meaningful direction for future work.

The integrated signaling system is critical for cell survival. Within it, various pathways interact with each other to sustain survival, and disrupting signals involved in multiple pathways is one form of SL [92]. Viewing SL from the perspective of signals gives a deeper understanding of its biological mechanism. In this regard, SL interactions can be expanded from genes to any signals, such as epigenetic regulators [93]. Integrating different types of signals to predict SL interactions would be crucial for future research.

At present, the main application of SL is still the discovery of new anticancer targets. However, some studies indicate that SL could be applied more widely. These studies are described in this section and may serve as a reference for researchers in this field.

SL has been successfully applied to identify anticancer targets but has found limited use in other diseases. Some studies have explored non-oncological diseases, such as bacterial infection [94] [95] [96] [97], malaria [98] and virus infection [99]. Computational methods may assist the further application of SL to more non-oncological diseases in the future.

SL is in essence a kind of genetic interaction, and the analysis of SL can provide mechanistic insight into genes. Lippert [102]. In the future, as more SL pairs are discovered, more biological mechanisms of genes will be revealed.

Although various promising computational methods have been developed to identify SL interactions, drug repositioning based on SL has seldom been explored. After all, the ultimate goal of identifying novel SL pairs is to develop novel targeted tumor therapies.

Recently, Zhang et al. [2] developed SLKG, a comprehensive KG aimed at providing a computational basis for tumor therapies based on SL. They demonstrated that SLKG can identify optimal repurposing drugs and drug combinations. Further efforts are needed to build on these pioneering studies and achieve the clinical translation of SL.

Some researchers have explored wider applications of SL. It has been reported [103, 104] that SL interactions may offer a new approach to cancer chemoprevention, but this approach is still largely in its infancy. Additionally, Lee et al. [105] developed a precision oncology framework to predict patients' response to cancer therapy based on SL and SR interactions.

Data quality

The quality of training data is crucial to SL prediction. However, high false-positive and false-negative rates are often observed in SL data generated by high-throughput screenings. In addition, positive SL samples could be negative under certain conditions. All of these issues could lead to label inaccuracy and inconsistency. Besides, the performance of current models for predicting SL interactions is difficult to assess because there is no gold standard source of human SL pairs. Therefore, preprocessing SL data before applying them in computational models and establishing a gold standard source of SL pairs are both necessary.

Owing to technological limitations, known SL pairs account for less than 0.1% of all potential pairs [31], which leads to two issues concerning training data. The first issue is data sparsity. When such sparse data are used in ML models, overfitting tends to occur. Cai et al. [31] proposed dual forms of dropout in their DDGCN model to avoid overfitting. In future work, cooperation between biological and computational researchers should identify more SL gene pairs and so alleviate data sparsity. Moreover, computational models better suited to sparse data should be developed.
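For readers unfamiliar with dropout, the sketch below shows the standard (inverted) form of the technique [77] on a plain list of activations. It is a generic illustration with made-up values and does not reproduce the dual-dropout scheme of DDGCN, whose details are not given here.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the rest by 1/(1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    # Rescaling keeps the expected value of each activation unchanged.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)               # fixed seed for reproducibility
activations = [1.0, 2.0, 3.0, 4.0]   # made-up activations
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

Dropout is applied only during training; at prediction time the activations are used unchanged.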

The second issue is imbalanced samples. The performance of the model deteriorates as the imbalance between the two classes increases. To address this problem, appropriate evaluation metrics should be adopted. The area under the precision-recall curve is a more effective metric than the area under the receiver operating characteristic curve for highly skewed tasks [31, 84]. The Matthews correlation coefficient [106] has also been used successfully in an SL prediction study [24] in which the samples are highly imbalanced. In addition, Li et al. [24] generated pseudopositive SL samples with the synthetic minority oversampling technique, which is designed to generate a predefined number of new samples from the samples of the minority class [107]. These studies bring us one step closer to resolving the issue of imbalanced samples and may inspire researchers to explore more innovative solutions in future research, such as computational models that fit imbalanced SL data better.
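The effect of class imbalance on metric choice can be illustrated with a small calculation. The confusion-matrix counts below are made up (10 positives among 1000 samples) and are not taken from any of the cited studies; returning 0 when the Matthews correlation coefficient is undefined is a common convention adopted here as an assumption.

```python
import math

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Made-up results on 1000 samples containing 10 positives.
all_negative = dict(tp=0, fp=0, tn=990, fn=10)  # predicts "not SL" for every pair
partial = dict(tp=5, fp=5, tn=985, fn=5)        # recovers half of the positives

for name, counts in [("all-negative", all_negative), ("partial", partial)]:
    print(name, round(accuracy(**counts), 3), round(mcc(**counts), 3))
```

Both classifiers reach the same accuracy, yet only the coefficient separates the trivial all-negative predictor from the one that actually finds SL pairs.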

The mechanism behind SL is complex and cannot be generalized. Li et al. [9] propose a novel SL classification based on the specificity of its biological mechanism, which contains organelle level, pathway level, gene level and conditional SL. Conducting feature selection according to its biological mechanism before training a model is essential. In this way, informative features could be selected, which would help to explore more efficient computational models.

Most ML approaches have not reached clinical practice owing to a lack of interpretability [108]. These models are regarded as 'black boxes' that optimize prediction accuracy without capturing the biological mechanisms behind the predicted results [81]. To resolve these difficulties, model interpretation has become a fast-growing subfield of ML [109]. Several efforts have been made on this issue for genetic prediction [110, 111]. For SL prediction, interpretable models have not yet been reported. More attention should be paid to developing interpretable models for SL prediction in the future.

Identification of SL gene pairs is imperative as it can provide novel targets for targeted therapy. However, the search space of gene combinations is too large to be investigated experimentally. Computational methods have been developed to complement experimental approaches and reduce the search space of SL gene pairs. This review provides a comprehensive overview of computational methods, databases and tools for SL prediction. It introduces six types of label databases, three types of feature databases, three types of other related databases and six tools for SL prediction. Moreover, four types of computational methods are summarized with a detailed description of their strengths and weaknesses. In addition, we highlight several challenges in this field, some of which may inspire future research.

• Computational methods for SL can accelerate the discovery of novel SL-based targeted cancer therapies.

• This study reviews six types of label databases, three types of feature databases, three types of other related databases and six tools for SL. The related information and links of all databases are provided.

• Computational methods including statistical-based methods, network-based methods, classic machine learning methods and deep learning methods are introduced, and their merits and demerits are discussed.

• The challenges include biological issues and data and algorithm issues. Expanding the concept of SL and expanding the application of SL are discussed in the section of biological issues. In addition, data quality, sparse data and imbalanced samples, lack of informative features and lack of interpretability require further exploration in future studies.

Supplementary data are available online at https://academic.oup.com/bib.

Synthetic lethality as an engine for cancer drug target discovery

The tumor therapy landscape of synthetic lethality

Synthetic lethal therapies for cancer: what's next after PARP inhibitors?

Synthetic lethality in cancer therapeutics: the next generation

Specific killing of BRCA2-deficient tumours with inhibitors of poly(ADP-ribose) polymerase

Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy

PARP inhibitors: synthetic lethality in the clinic

Advances in synthetic lethality for cancer therapy: cellular mechanism and clinical translation

Development of synthetic lethality in cancer: molecular and cellular classification

Synthetic lethality and cancer

A road map to personalizing targeted cancer therapies using synthetic lethality

Mapping the genetic landscape of human cells

Predicting cancer-specific vulnerability via data-driven detection of synthetic lethality

Connectivity homology enables inter-species network models of synthetic lethality

Identification of synthetic lethal pairs in biological systems through network information centrality

Humanized yeast genetic interaction mapping predicts synthetic lethal interactions of FBXW7 in breast cancer

Integration of multiple biological contexts reveals principles of synthetic lethality that affect reproducibility

Synthetic lethality-based identification of targets for anticancer drugs in the human Signaling network

Predicting essential genes and synthetic lethality via influence propagation in signaling pathways of cancer cell fates

An in-silico approach to predict and exploit synthetic lethality in cancer metabolism

Synthetic dosage lethality in the human metabolic network is highly predictive of tumor growth and cancer patient survival

Fast-SL: an efficient algorithm to identify synthetic lethal sets in metabolic networks

DiscoverSL: an R package for multi-omic data driven prediction of synthetic lethality in cancers

Identification of synthetic lethality based on a functional network by using machine learning algorithms

Predicting synthetic lethal interactions using conserved patterns in protein interaction networks

Comprehensive prediction of robust synthetic lethality between paralog pairs in cancer cell lines

Uncovering cancer vulnerabilities by machine learning prediction of synthetic lethality

Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization

SL(2)MF: predicting synthetic lethality in human cancers via logistic matrix factorization

Predicting synthetic lethal interactions using heterogeneous data sources

Dual-dropout graph convolutional network for predicting synthetic lethality in human cancers

Graph contextualized attention network for predicting synthetic lethality in human cancers

Prediction of synthetic lethal interactions in human cancers using multi-view graph autoencoder

SynLethDB: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets

The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions

The BioGRID interaction database: 2019 update

BioGRID: a general repository for interaction datasets

Syn-lethality: an integrative knowledge base of synthetic lethality towards discovery of selective anticancer therapies

GenomeRNAi: a database for cell-based and in vivo RNAi phenotypes

A global genetic interaction network maps a wiring diagram of cellular function

EXP2SL: a machine learning framework for cell-line-specific synthetic lethality prediction

Mapping genetic interactions in human cancer cells with RNAi and multiparametric phenotyping

A negative genetic interaction map in isogenic cancer cell lines reveals cancer cell vulnerabilities

Combinatorial CRISPR-Cas9 screens for de novo mapping of genetic interactions

Orthologous CRISPR-Cas9 enzymes for combinatorial genetic screens

Combinatorial CRISPR-Cas9 metabolic screens reveal critical redox control points dependent on the KEAP1-NRF2 regulatory Axis

GEMINI: a variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screens

Genetic interaction-based biomarkers identification for drug resistance and sensitivity in cancer cells

A network of conserved synthetic lethal interactions for exploration of precision cancer therapy

A survey on computational models for predicting protein-protein interactions

Understanding and predicting synthetic lethal genetic interactions in Saccharomyces cerevisiae using domain genetic interactions

Predicting synthetic lethal genetic interactions in Saccharomyces cerevisiae using short polypeptide clusters

Human synthetic lethal inference as potential anti-cancer target gene detection

Proposal for a new therapy for drug-resistant malaria using Plasmodium synthetic lethality inference

A comparative genomic approach for identifying synthetic lethal interactions in human cancer

The genetic landscape of a cell

Systematic interpretation of genetic interactions using protein networks

Inferring synthetic lethal interactions from mutual exclusivity of genetic events in cancer

Link synthetic lethality to drug sensitivity of cancer cells

Harnessing synthetic lethality to predict the response to cancer treatment

Identification of potential synthetic lethal genes to p53 using a computational biology approach

Uncovering synthetic lethal interactions for therapeutic targets and predictive markers in lung adenocarcinoma

A platform of synthetic lethal gene interaction Networks reveals that the GNAQ uveal melanoma oncogene controls the hippo pathway through FAK

Mapping the landscape of synthetic lethal interactions in liver cancer

Systematic discovery of mutation-specific synthetic lethals by mining pan-cancer human primary tumor data

Deep learning applied to hyperspectral endoscopy for online spectral classification

Mining protein networks for synthetic genetic interactions

Decision tree and ensemble learning algorithms with their applications in bioinformatics

Predicting Synthetic Lethal Genetic Interactions in Breast Cancer using Decision Tree

An introduction to kernel and nearest-neighbor nonparametric regression

Synthetic lethal interactions prediction based on multiple similarity measures fusion

An integrative multi-network and multi-classifier approach to predict genetic interactions

In silico prediction of synthetic lethality by meta-analysis of genetic interactions, functions, and pathways in yeast and human cancer

Random decision forests

Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy

From local explanations to global understanding with explainable AI for trees

Dropout: a simple way to prevent neural Networks from overfitting

Graph attention networks

Predicting polypharmacy side-effects using knowledge graph embeddings

KG4SL: knowledge graph neural network for synthetic lethality prediction in human cancers

Opportunities and obstacles for deep learning in biology and medicine

G2G: a web-server for the prediction of human synthetic lethal interactions

SL-BioDP: multi-cancer interactive tool for prediction of synthetic lethality and response to cancer treatment

Beyond synthetic lethality: charting the landscape of pairwise gene expression states associated with survival in cancer

SynLeGG: analysis and visualization of multiomics data for discovery of cancer 'Achilles Heels' and gene function relationships

Systematic analysis of complex genetic interactions

Synthetic lethality and cancer -penetrance as the major barrier

A landscape of synthetic viable interactions in cancer

Re-defining synthetic lethality by phenotypic profiling for precision oncology

Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens

WRN helicase is a synthetic lethal target in microsatellite unstable cancers

Synthetic dysmobility screen unveils an integrated STK40-YAP-MAPK system driving cell migration

Epigenetic synthetic lethality approaches in cancer therapy

Systems biology-guided identification of synthetic lethal gene pairs and its potential use to discover antibiotic combinations

Exploiting the synthetic lethality between terminal respiratory oxidases to kill mycobacterium tuberculosis and clear host infection

A synthetic lethal approach for compound and target identification in Staphylococcus aureus

Revisiting the beta-lactams for tuberculosis therapy with a compound-compound synthetic lethality approach

Using yeast synthetic lethality to inform drug combination for malaria

Crippling life support for SARS-CoV-2 and other viruses through synthetic lethality

Gene function prediction from synthetic lethality networks via ranking on demand

Essential plasticity and redundancy of metabolism unveiled by synthetic lethality analysis

Synthetic lethality across normal tissues is strongly associated with cancer risk, onset, and tumor suppressor specificity

Lung-cancer chemoprevention by induction of synthetic lethality in mutant KRAS premalignant cells in vitro and in vivo

Hereditary cancer syndromes as model systems for chemopreventive agent development

Synthetic lethality-mediated precision oncology via the tumor transcriptome

Machine learning approach to gene essentiality prediction: a review

SMOTE: synthetic minority over-sampling technique

Predicting drug response and Synergy using a deep learning model of human cancer cells

Definitions, methods, and applications in interpretable machine learning

Using deep learning to model the hierarchical structure and function of a cell

Translation of genotype to phenotype by a hierarchy of cell subsystems

UniProt: the universal protein knowledgebase in 2021

The gene ontology resource: 20 years and still GOing strong

KEGG: new perspectives on genomes, pathways, diseases and drugs

The molecular signatures database (MSigDB) hallmark gene set collection

The comparative Toxicogenomics database: update 2019

A next generation connectivity map: L1000 platform and the first 1 000 000 profiles

PhyloGene server for identification and visualization of co-evolving proteins using normalized phylogenetic profiles

CORUM: the comprehensive resource of mammalian protein complexes-2019

The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets

Human protein reference database-2009 update

HIPPIE v2.0: enhancing meaningfulness and reliability of protein-protein interaction networks

Defining a cancer dependency map

The cancer cell line Encyclopedia enables predictive modelling of anticancer drug sensitivity

The COSMIC (catalogue of somatic mutations in cancer) database and website

Inparanoid: a comprehensive database of eukaryotic orthologs

OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups

Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups

SGD: saccharomyces genome database

Ranking novel cancer driving synthetic lethal gene pairs using TCGA data

Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1

A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene

Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae

A robust toolkit for functional profiling of the yeast genome

Global mapping of the yeast genetic interaction network

This work is supported by the National Natural Science Foundation of China (http://www.nsfc.gov.cn; no. 62103436) to Song He.