key: cord-0029682-2xxmzfbo
authors: Cai, Lijun; Gao, Mingyu; Ren, Xuanbai; Fu, Xiangzheng; Xu, Junlin; Wang, Peng; Chen, Yifan
title: MILNP: Plant lncRNA–miRNA Interaction Prediction Based on Improved Linear Neighborhood Similarity and Label Propagation
date: 2022-03-25
journal: Front Plant Sci
DOI: 10.3389/fpls.2022.861886
sha: b2e03bbc31cb814d974e590b92cf0cbfe764786c
doc_id: 29682
cord_uid: 2xxmzfbo

Knowledge of the interactions between long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) is the basis of understanding various biological activities and designing new drugs. Computational methods for predicting lncRNA–miRNA interactions are scarce for plants, and existing methods suffer from various limitations that affect their prediction accuracy and applicability. Research on plant lncRNA–miRNA interactions is still in its infancy. In this paper, we propose an accurate predictor, MILNP, for predicting plant lncRNA–miRNA interactions based on an improved linear neighborhood similarity measure and the linear neighborhood propagation algorithm. Specifically, we propose a novel similarity measure based on linear neighborhood similarity from multiple similarity profiles of lncRNAs and miRNAs and derive more precise neighborhood ranges so as to escape the limits of the existing methods. We then simultaneously update the lncRNA–miRNA interactions predicted from both similarity matrices based on label propagation. We comprehensively evaluate MILNP on the latest plant lncRNA–miRNA interaction benchmark datasets. The results demonstrate that MILNP outperforms the most up-to-date methods. Moreover, MILNP can be applied to isolated plant lncRNAs (or miRNAs). Case studies suggest that MILNP can identify novel plant lncRNA–miRNA interactions, which are confirmed by classical tools. The implementation is available at https://github.com/HerSwain/gra/tree/MILNP.

An increasing number of studies have shown that non-coding RNAs (ncRNAs), especially long non-coding RNAs (lncRNAs) and microRNAs (miRNAs), act in various biological processes (Amin et al., 2019). miRNAs with a sequence length of approximately 22 nucleotides control post-transcriptional gene expression (DeVeale et al., 2021). lncRNAs, usually with a sequence length greater than 200 nucleotides, are widely engaged in essential regulatory processes (Ard et al., 2014; Chen et al., 2017; Fang et al., 2020; Statello et al., 2020; Goodall and Wickramasinghe, 2021). lncRNAs control the expression of miRNAs to influence the expression of their target genes: lncRNAs compete with mRNA for miRNAs, thereby regulating miRNA-mediated target inhibition (Geisler and Coller, 2013). For example, in lumbar intervertebral disk degeneration, lncRNAs may act as competing endogenous RNAs (ceRNAs) that bind competitively to miRNAs through their miRNA response elements, thereby regulating the expression of miRNA-targeted mRNAs. miRNAs and lncRNAs interact with each other to exert higher levels of post-transcriptional regulation.

As computer technology advances rapidly, numerous methods are employed to study miRNAs, lncRNAs, and proteins, as well as their interactions (Fu et al., 2019, 2020; Cai et al., 2020a,b, 2021; Dai et al., 2021; Li P. et al., 2021; Liu et al., 2021; Rahaman et al., 2021; Song et al., 2021; Tan et al., 2021; Zhang C. L. et al., 2021; Zhang et al., 2022). With regard to miRNAs, a miRNA that is positively selected during human evolution is identified to regulate energy expenditure, and the relevance of this positively selected locus to metabolic disorders may explain the link between this locus and metabolic diseases (Stower, 2020). With regard to lncRNAs, the lncRNA GCMA activated by SP1 acts as a competing endogenous RNA in gastric cancer via competition for miR-124 and miR-34a to promote tumor metastasis (Tian et al., 2020). With regard to interactions between lncRNAs and proteins, lncRNA DIGIT regulates endoderm differentiation by promoting the formation of phase-separated condensates of the bromodomain and extraterminal domain protein BRD3 (Daneshvar et al., 2020). With regard to interactions between lncRNAs and miRNAs, targeting lnc-MGC inhibits host lnc-MGC expression while suppressing the expression of key cluster miRNAs in the kidneys and preventing early diabetic nephropathy (Allison, 2016). Studies such as these are abundant and have made important contributions.

Although methods for predicting lncRNA-miRNA interactions exist, most of them do not address plants (Jiang et al., 2018, 2019; Ayachit et al., 2020; Banerjee et al., 2020; Ma et al., 2020; Qazi et al., 2020; Shen et al., 2020; Aglawe et al., 2021). The confirmed plant lncRNA-miRNA interactions are very limited and have been barely covered. For instance, NPInter4.0 (Teng et al., 2020) documents extensive functional interactions between ncRNAs and molecules of over 30 species, yet only two of them are plants. Of the 71 RNA-RNA interactions recorded for the two plants, only one is a miRNA-lncRNA interaction. The mechanisms of plant miRNA-lncRNA interactions therefore remain elusive. Also, lncRNAs are characterized by low sequence conservation, especially among distantly related species. The lncRNA molecules of different species or the same species may vary in terms of amino acid and nucleotide fragments during biological evolution, which entails that predictions obtained from animal studies are not guaranteed to be applicable in plants (Noviello et al., 2018). As a result, conclusions about the mechanism of plant lncRNA-miRNA interactions cannot be completely copied from animals and must be explored independently.

Studies on lncRNA-miRNA interactions generally fall under two categories, namely, bioinformatics-based machine learning methods and similarity network-based methods (Peng et al., 2017; Zeng et al., 2017, 2018, 2019; Zhao et al., 2020; Chen et al., 2021; Singh et al., 2021; Wang et al., 2021; Zhou et al., 2021; Zhu et al., 2021). The former extracts biological features and trains models to obtain dichotomous results (i.e., the output is whether a lncRNA and a miRNA interact) (Intell, 2019; Li J. et al., 2021). By comparison, the latter computes single or multiple correlation similarity matrices to obtain the final predictions (Wang et al., 2014). The works that use machine learning, even deep learning methods, do succeed. However, machine learning is flawed in two respects (Peng et al., 2018; Zhang et al., 2019b). First, it relies on data features. Certain lncRNAs or miRNAs may not have expression profiles or target genes, in which case machine-learning methods are not applicable. Second, for isolated lncRNAs or miRNAs that have no known interactions with miRNAs or lncRNAs at all, such methods have difficulty forecasting any unknown interaction. By contrast, similarity network-based approaches can address such imperfections. Constructing similarity networks does not necessarily depend on specific data features (Zhang et al., 2019b), and such networks can yield predictions for isolated lncRNAs and miRNAs solely on the basis of sequence information. Linear neighborhood similarity, which selects the most appropriate neighborhoods for linear reconstruction, is a new similarity measurement perspective that is currently gaining momentum in bioinformatics (Zhang et al., 2018a, 2019a; Xie et al., 2020; Zhang W. et al., 2021; Jia and Luan, 2022; Zhu et al., 2022). Examples include LPLNP (Zhang et al., 2018a) and MPLPLNP (Jia and Luan, 2022) for predicting lncRNA-protein interactions, and FLNSNLI and LPLNS for predicting miRNA-disease associations. To the best of our knowledge, no similarity network-based method is available to date for predicting lncRNA-miRNA interactions in plants. Owing to the imperfections of machine-learning methods and the necessity to independently detect plant lncRNA-miRNA interactions, novel and effective methods must be constructed.

In this study, we hypothesize that highly similar lncRNAs will have similar interaction or non-interaction patterns with miRNAs, and likewise for highly similar miRNAs. Under this assumption, a multi-source information-based linear neighborhood propagation method (MILNP) is proposed. The similarity is calculated through our improved linear neighborhood similarity (ILNS) algorithm, which obtains a more accurate neighborhood range than the original linear neighborhood similarity. First, multidimensional features are separately extracted from the sequences of lncRNAs and miRNAs to calculate sequence similarity, whereas interaction profile similarity is obtained using their interactions. These two similarities are then combined to obtain the integrated similarity. Label propagation based on the integrated similarity is used to calculate the prediction matrix of lncRNAs and the prediction matrix of miRNAs individually. Finally, the two prediction matrices are combined in a weighted sum to obtain the final prediction. The contributions of this work are as follows.

• We proposed a novel similarity measurement, ILNS, for calculating multiple similarity profiles.
• We constructed MILNP based on ILNS to predict lncRNA-miRNA interactions and discovered new interactions in plants.
• MILNP achieved high-accuracy predictions in multiple experiments, showing superiority over the existing methods and reliability in finding new interactions.

The original data used herein are derived from a previous study that investigated plant lncRNA-miRNA interactions. miRNA sequences are downloaded from miRBase22.1 (Kozomara et al., 2019), whereas lncRNA sequences are downloaded from GreeNC1.12 (Paytuvi-Gallart et al., 2019) and CANTATAdb2.0 (Szcześniak et al., 2019). Datasets of lncRNA-miRNA interactions from Arabidopsis thaliana, Glycine max, and Medicago truncatula are chosen, with 2,500 positive samples taken from the positive dataset of each species, for a total of 7,500 positive samples. Similarly, 2,500 negative samples are taken from the negative dataset of each species, also for a total of 7,500 negative samples. These positive and negative samples are intermixed as the training-validation set to avoid imbalance in the sample distribution. Each sample in the dataset consists of five parts: the symbol, the miRNA name, the lncRNA name, the sequence yielded from combining the miRNA sequence with the lncRNA sequence, and the sample label (0 for the absence of an interaction, and 1 for the presence of an interaction). However, such a format is inappropriate for the method we applied herein. The processing is as follows. First, the original sequence binding files are separated by name, sequence, and label to obtain the name of the miRNA, the name of the lncRNA, the binding sequences of the miRNA and lncRNA, and the labels. Second, all miRNA sequences are found according to the miRNA name order by checking against the reference documents from miRBase22.1 (Kozomara et al., 2019), and the lncRNA sequence of each line is extracted from the binding file, which happens to follow the lncRNA name order. Third, all miRNAs and lncRNAs (originally 15,000 lines each) are de-duplicated to obtain 1,340 unique miRNAs and 7,963 unique lncRNAs. Finally, the serial numbers of the remaining miRNAs and lncRNAs are determined, and the miRNA-lncRNA interaction matrix is built in accordance with the label file. A rough workflow is shown in Figure 1.
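The de-duplication and matrix-building steps can be illustrated with a minimal sketch. The record layout (miRNA name, lncRNA name, label), the function name `build_interaction_matrix`, and the toy names are assumptions made for illustration; the actual file format of the benchmark is not reproduced here.

```python
import numpy as np

def build_interaction_matrix(samples):
    """Build a lncRNA x miRNA interaction matrix from labeled pairs.

    samples: list of (mirna_name, lncrna_name, label) with label in {0, 1}.
    Returns (lncrna_names, mirna_names, Y), where Y[i, j] = 1 if lncRNA i and
    miRNA j appear together in a positive sample, else 0.
    """
    # De-duplicate names while keeping first-seen order
    # (dicts preserve insertion order).
    mirnas = list(dict.fromkeys(m for m, _, _ in samples))
    lncrnas = list(dict.fromkeys(l for _, l, _ in samples))
    mi_index = {name: j for j, name in enumerate(mirnas)}
    lnc_index = {name: i for i, name in enumerate(lncrnas)}

    Y = np.zeros((len(lncrnas), len(mirnas)), dtype=int)
    for m, l, label in samples:
        if label == 1:
            Y[lnc_index[l], mi_index[m]] = 1
    return lncrnas, mirnas, Y

# Toy records with hypothetical names (not taken from the benchmark).
toy = [("miR-a", "lnc-1", 1), ("miR-b", "lnc-1", 0),
       ("miR-a", "lnc-2", 0), ("miR-b", "lnc-2", 1),
       ("miR-a", "lnc-1", 1)]  # duplicate line, collapsed by de-duplication
lnc_names, mi_names, Y = build_interaction_matrix(toy)
```

Rows index lncRNAs and columns index miRNAs, which matches the l × m orientation of the interaction matrix Y used later in the method.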

Given that multiple pieces of information are required to calculate the interactions when we adopt the linear neighborhood similarity method, we also utilize these features that represent sequence information in addition to the interaction matrix.

Because orphaned miRNAs and lncRNAs exist, sequence information is more widely available than interaction information. We collate the k-mer frequency (Ahmed et al., 2020), GC content, number of base pairs, and minimum free energy (MFE) (Negri et al., 2018) of miRNAs and lncRNAs from the de-duplicated files and the original files.
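As an illustration of the sequence-derived features, the sketch below computes k-mer frequencies for k = 1, 2, 3 and GC content. The function names and the normalization by the number of windows are assumptions; MFE and base-pair counts require an RNA folding tool and are not computed here.

```python
from itertools import product

def kmer_frequencies(seq, ks=(1, 2, 3), alphabet="ACGU"):
    """Return k-mer frequencies for each k in ks, concatenated in a fixed order."""
    seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product(alphabet, repeat=k)]  # 4**k k-mers
        counts = dict.fromkeys(kmers, 0)
        n_windows = max(len(seq) - k + 1, 1)
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if window in counts:  # skip windows with ambiguous bases such as N
                counts[window] += 1
        features.extend(counts[kmer] / n_windows for kmer in kmers)
    return features

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

vec = kmer_frequencies("AUGGCUAGCU")  # 4 + 16 + 64 = 84 values
gc = gc_content("AUGGCUAGCU")
```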

The data are summarized in Table 1 .

Improved Linear Neighborhood Similarity Measure

FIGURE 1 | Dataset processing.

Each feature vector x i is regarded as a data point, and we assume that the current data point can be reconstructed from the attributes of other data points. Adjacent data points are usually viewed as possessing similar properties. Hence, the neighbors can be selected as the contributing force for the reconstruction of the point, whereas other irrelevant data points participate in the calculation but are assigned a weight of 0. For x i , the common calculation method for selecting neighbors is the Euclidean distance. Considering that various features, such as GC content and MFE, characterize the components of the data point in these dimensions and that similar data points are assumed to eventually form multidimensional vectors with similar directions, the Cosine distance is first used to select neighbors N 1 (x i ) (a total of n 1 ), and the Euclidean distance is then used to select nearer neighbors N 2 (x i ) (a total of n 2 ) among the Cosine-distance neighbors. The latter is a subset of the former, ensuring that neighbors are nearer in both direction and position. The order of the two can be swapped since the final selected neighborhood is the same. The percentages of neighbors are expressed by K 1 and K 2 , where K 1 = n 1 /m and K 2 = n 2 /n 1 . To minimize the reconstruction error for m data points, we propose the objective function:

where M is an m × n matrix in the feature space, and both C 1 and C 2 are indicator matrices that separately indicate whether they are neighbors on the basis of the Cosine distance and nearer neighbors on the basis of the Euclidean distance. G and W are both weight matrices. Here, µ is a weight parameter, and e is an m × 1 column vector with all elements being 1. ||•|| F is to obtain the Frobenius norm of a matrix. ||•|| 2 is to obtain the 2-norm of a vector. The first term of the objective function is to get the optimal weight matrix to minimize the reconstruction error of all data points, whereas the second term is to reduce overfitting during the reconstruction. For the weight matrices G and W, the elements are non-negative.
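The two-stage neighbor selection that produces the indicator matrices C 1 and C 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `two_stage_neighbors`, the rounding of n 1 and n 2 , and the exclusion of a point from its own neighborhood are choices made here, not details confirmed by the text.

```python
import numpy as np

def two_stage_neighbors(X, k1, k2):
    """Return indicator matrices (C1, C2) for a feature matrix X (m x n).

    C1[i, j] = 1 if x_j is among the n1 cosine-nearest points to x_i.
    C2[i, j] = 1 if x_j is among the n2 Euclidean-nearest of those n1 points.
    """
    m = X.shape[0]
    n1 = min(m - 1, max(1, int(round(k1 * m))))  # K1 = n1 / m
    n2 = max(1, int(round(k2 * n1)))             # K2 = n2 / n1

    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # avoid division by zero
    unit = X / norms
    cos_dist = 1.0 - unit @ unit.T
    euc_dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    C1 = np.zeros((m, m), dtype=int)
    C2 = np.zeros((m, m), dtype=int)
    for i in range(m):
        cd = cos_dist[i].copy()
        cd[i] = np.inf                           # a point is not its own neighbor
        first = np.argsort(cd, kind="stable")[:n1]
        C1[i, first] = 1
        # Rank the cosine neighbors by Euclidean distance; keep the n2 nearest.
        order = np.argsort(euc_dist[i, first], kind="stable")[:n2]
        C2[i, first[order]] = 1
    return C1, C2
```

By construction every row of C2 is a subset of the corresponding row of C1, which mirrors the statement that N 2 (x i ) is a subset of N 1 (x i ).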

Given that the first neighbors are required to find the second ones, the objective function must be decomposed:

The Lagrange multiplier method is used to solve Equation (2), which then has the following form:

where λ 1 and λ 2 are Lagrange factors. According to the Karush-Kuhn-Tucker condition (Kjeldsen, 2000) , the following conditions must be satisfied to determine the optimal value:

The partial derivative of L is determined with respect to G:

If λ 1 T = 0, then there will be

In that case,

If λ 1 T ≠ 0, then there will be G = 0. Thus, G ij = 0. Given the relevance of data point-based reconstruction, which requires that G ij = 0 when x j ∉ N 1 (x i ), the solution is

Thus far, the iterative form with unknown parameters has been obtained. We write an equivalent form of Equation (2) to determine the following parameter:

where Gr i is the gram matrix. If x j ∈N 1 (x i ) and x k ∈N 1 (x i ), then

. Otherwise, Gr j,k = 0. Solving Equation (9) using the Lagrange multiplier method yields

By taking the partial derivatives of g and λ, λ i can be obtained: where the reconstruction error is close to none, i.e., 1 / 2 ϑ T i Gr i ϑ i ≈ 0. According to the Lagrange multiplier method, e T ϑ i − 1 = 0. Thus, λ i = µ. If λ = µ × e, G can be represented as

The similarity matrix based on the Cosine distance is then acquired through iteration until convergence. Similarly, W is obtained with respect to G:

The final similarity matrix can be acquired by iterating until convergence or the maximum rounds.
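The iteration can be illustrated with a hedged sketch. The update below is the multiplicative rule commonly used in earlier linear neighborhood similarity work, W ← W ⊙ A / (W A) with A = X Xᵀ + µ e eᵀ; it is assumed here rather than taken from the equations of this paper, which are not reproduced in this text. It handles a single indicator matrix C, so under this assumption it would be run once with C 1 to obtain G and once with C 2 to obtain W. The penalty form in `reconstruction_objective` (a soft row-sum constraint) is likewise an assumption.

```python
import numpy as np

def reconstruction_objective(X, W, mu):
    """0.5 * ||X - W X||_F^2 + 0.5 * mu * ||W e - e||^2 (assumed penalty form)."""
    e = np.ones(X.shape[0])
    return (0.5 * np.linalg.norm(X - W @ X) ** 2
            + 0.5 * mu * np.linalg.norm(W @ e - e) ** 2)

def linear_neighborhood_weights(X, C, mu=1.0, n_iter=200, seed=0):
    """Non-negative reconstruction weights restricted to the neighborhoods in C.

    X must be non-negative (e.g. frequencies) for the multiplicative update
    to keep the weights non-negative.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    W = rng.random((m, m)) * C            # random start, zero outside neighborhoods
    A = X @ X.T + mu * np.ones((m, m))    # X X^T + mu * e e^T
    eps = 1e-12                           # guards against division by zero
    for _ in range(n_iter):
        W = W * A / (W @ A + eps)         # multiplicative update
        W = W * C                         # keep non-neighbors at exactly zero
    return W
```

A fixed number of rounds is used for simplicity; stopping when the change in W falls below a tolerance would correspond to iterating until convergence.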

Given a set of l lncRNAs, l 1 ,..., l i ,..., l l , and a set of m miRNAs, m 1 ,..., m j ,..., m m , whose interaction is represented by a matrix Y of l × m, if there exists an interaction between lncRNA l i and miRNA m j , then Y ij = 1; otherwise, Y ij = 0.

Four features of lncRNAs and miRNAs are extracted (110 features in total). The frequencies of the 4^k possible contiguous subsequences of length k are calculated for the l lncRNAs. Herein, we use k = 1, 2, 3 to obtain the lncRNA-related feature vector. The sequence similarity between pairs of the l lncRNAs is calculated to yield an l × l similarity matrix, which is denoted as S_lncS. In the same manner, the m × m similarity matrix of the m miRNAs is denoted as S_miS. The interaction profiles of lncRNAs and miRNAs are derived from the interaction matrix. For lncRNA l i , the interaction profile indicates whether it interacts with each miRNA, matching the i-th row of Y, i.e., Y(i,:). Similarly, for miRNA m j , the interaction profile matches the j-th column of Y, i.e., Y(:, j). The pairwise similarity between the interaction profiles of the l lncRNAs is calculated as the l × l matrix S_lncP, whereas the pairwise similarity between the interaction profiles of the m miRNAs is computed as the m × m matrix S_miP.

Label propagation (Kato et al., 2009 ) assigns labels to previously unlabeled data points. During label assignment, the labels of labeled data points are propagated to unlabeled data points. The core idea of the label propagation algorithm is that similar nodes should have similar labels. It involves two stages, namely, calculating the similarity matrix and propagating the labels. The edge from node i to node j represents the similarity of these nodes. All edge weights constitute a weight matrix, where the higher the similarity the larger the weight. Herein, ILNS is adopted to construct the similarity matrix and calculate the Cosine-distance neighbors and the Euclidean-distance neighbors of each node until convergence. Nearer neighbors of each data point are fixed to a certain proportion, and the weights of others are 0. The weight matrix is actually a sparse matrix. The labels are propagated through the edges between the nodes. The larger the weight of the edge, the more similar the nodes are to each other and the easier to propagate the labels (Zang and Zhang, 2012; Zhang et al., 2016) . For m data points x 1 ,..., x m , an m × m probability transfer matrix P is defined as

where P ij represents the transition probability and w ij is the weight. The propagation involves three steps. First, a unique label is allocated to each node, i.e., label one for node one and label i for node i, where the labels are different from each other. Second, for node j, all nodes are traversed to discover its neighboring nodes and obtain their labels, and the label with the most occurrences is selected. If more than one label attains the largest number of occurrences, then one is randomly selected to replace the current label. Finally, if the label of node j no longer changes after this round of relabeling or the pre-set number of rounds is reached, then the iteration is stopped. Otherwise, step 2 is repeated.
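A transition matrix and a propagation loop can be sketched as follows. Two assumptions are made: the row normalization P ij = w ij / Σ k w ik stands in for the defining formula, which is not reproduced in this text, and the loop implements the soft update with an absorption probability α (as described with the parameter settings) rather than the majority-vote relabeling walked through above. The function names are illustrative.

```python
import numpy as np

def transition_matrix(W):
    """Row-normalize a non-negative weight matrix so each row sums to 1."""
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0         # leave all-zero rows as zeros
    return W / row_sums

def propagate_labels(P, Y, alpha=0.7, tol=1e-9, max_iter=1000):
    """Iterate F <- alpha * P F + (1 - alpha) * Y until the change is below tol."""
    Y = Y.astype(float)
    F = Y.copy()
    for _ in range(max_iter):
        F_next = alpha * (P @ F) + (1 - alpha) * Y
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next
    return F
```

With a row-stochastic P and α < 1 the iteration is a contraction, so it approaches the fixed point (1 − α)(I − αP)⁻¹Y.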

The sequence and interaction profiles of lncRNAs and miRNAs are captured to develop our model. The workflow of MILNP is shown in Figure 2. The specific steps are as follows:

Step 1: the sequence feature similarity S_lncS and S_miS are calculated using the ILNS algorithm.

Step 2: the interaction profiles of all lncRNAs and all miRNAs are exported according to the interaction matrix of lncRNAs and miRNAs.

Step 3: the interaction profile similarity S_lncP and S_miP are calculated using the ILNS algorithm.

Step 4: S_lncS and S_lncP are combined to obtain lncRNA integrated similarity, and S_miS and S_miP are combined to obtain miRNA integrated similarity.

Step 5: for the two integration similarities, the linear neighborhood propagation method is used to generate the prediction of lncRNA and the prediction of miRNA.

Step 6: the weighted sum of the two prediction matrices is calculated, and the final interaction prediction matrix is determined.
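Steps 4 to 6 can be sketched as below, assuming the four similarity matrices are already available. The plain average used to integrate the two similarities in Step 4 is an assumption (the text does not specify how they are combined), as are the function names and the use of the closed-form solution of the propagation. The similarity matrices are assumed to be row-normalized so that I − αS is invertible.

```python
import numpy as np

def lnp_scores(S, Y, alpha):
    """Closed-form fixed point of F = alpha * S F + (1 - alpha) * Y."""
    n = S.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y.astype(float))

def milnp_predict(S_lncS, S_lncP, S_miS, S_miP, Y, alpha=0.7, beta=0.35):
    """Combine lncRNA-side and miRNA-side propagation into one l x m score matrix."""
    # Step 4: integrated similarities (assumption: simple average).
    S_lnc = 0.5 * (S_lncS + S_lncP)
    S_mi = 0.5 * (S_miS + S_miP)
    # Step 5: propagate over lncRNAs (rows of Y) and miRNAs (columns of Y).
    SL = lnp_scores(S_lnc, Y, alpha)        # l x m
    SM = lnp_scores(S_mi, Y.T, alpha).T     # l x m after transposing back
    # Step 6: weighted sum with trade-off parameter beta.
    return beta * SL + (1 - beta) * SM
```

The default values of `alpha` and `beta` follow the best parameters reported in the parameter analysis.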

The criteria for measuring prediction models are the area under the curve (AUC), the area under the precision-recall curve (AUPR), recall (REC), specificity (SPE), and accuracy (ACC). AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate and is suitable for observing model performance when the positive and negative sample sizes are balanced. The formulae of REC, SPE, and ACC are as follows.
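The formulae themselves are not reproduced in this text. The sketch below uses the standard confusion-matrix definitions, REC = TP / (TP + FN), SPE = TN / (TN + FP), and ACC = (TP + TN) / (TP + TN + FP + FN), which are assumed to match the ones intended; the function name is illustrative.

```python
def rec_spe_acc(y_true, y_pred):
    """Recall, specificity, and accuracy from binary labels (1 = interaction).

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    rec = tp / (tp + fn)                    # true positive rate
    spe = tn / (tn + fp)                    # true negative rate
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall fraction correct
    return rec, spe, acc

# Toy labels: 3 TP, 1 FN, 3 TN, 1 FP.
rec, spe, acc = rec_spe_acc([1, 1, 1, 0, 0, 0, 0, 1],
                            [1, 1, 0, 0, 0, 1, 0, 1])
```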

The performance is evaluated via fivefold cross-validation. To achieve more reliable outcomes, each fivefold cross-validation is repeated for 20 rounds so that a sufficient number of runs is reached. K-fold cross-validation is frequently used to estimate model performance: the data are divided into K equal parts, one of which acts as test data while the others act as training data. A distinct test set is selected each time, and the rest serves as a training set. Finally, the results of the K experiments are averaged.
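A repeated K-fold split can be sketched as follows. How held-out pairs are masked in the interaction matrix is not specified in the text, so this sketch only generates the index splits; the function name and the seeding are assumptions.

```python
import numpy as np

def repeated_kfold_indices(n_samples, k=5, repeats=20, seed=0):
    """Yield (round, fold, train_idx, test_idx) for repeated K-fold validation."""
    rng = np.random.default_rng(seed)
    for r in range(repeats):
        perm = rng.permutation(n_samples)   # fresh shuffle for every round
        folds = np.array_split(perm, k)     # K nearly equal parts
        for f in range(k):
            test_idx = folds[f]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != f])
            yield r, f, train_idx, test_idx
```

Each round uses every sample exactly once as test data, and the scores of the K folds (and of the 20 rounds) would then be averaged.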

A total of four parameters are tuned in this work.

In the computation of ILNS, Cosine distance-based neighbors (with the ratio K 1 ) and Euclidean distance-based neighbors (with the ratio K 2 ) are computed. K 1 is set to {0.1, 0.2,..., 0.9}, whereas K 2 is set to {0.1, 0.2,..., 1}, and their step size is 0.1. The purpose of this arrangement is to ensure that K 2 can cover all neighbors generated by K 1 , regardless of the size of n 1 . During label propagation, the parameter α is the probability of label absorption, i.e., for node x j the probability of absorbing the label of its nearest neighbor node x i is α. The value of α is set within the range {0.1, 0.2,..., 0.9}, and the step size is 0.1. After the lncRNA prediction matrix SL and the miRNA prediction matrix SM are computed, β serves as the trade-off parameter, i.e., the final prediction matrix is calculated as β × SL + (1 − β) × SM. The value of β is within {0.0, 0.05,..., 1.0}, and the step size is 0.05.
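The parameter ranges above define a grid that can be enumerated directly. In the sketch below, `score_fn` is a hypothetical callable (for example, the mean cross-validated AUC of a model built with the given parameters); it is not part of the published implementation.

```python
from itertools import product

# Parameter grids as listed in the text (rounded to avoid float drift).
K1_grid = [round(0.1 * i, 1) for i in range(1, 10)]      # 0.1 ... 0.9
K2_grid = [round(0.1 * i, 1) for i in range(1, 11)]      # 0.1 ... 1.0
alpha_grid = [round(0.1 * i, 1) for i in range(1, 10)]   # 0.1 ... 0.9
beta_grid = [round(0.05 * i, 2) for i in range(0, 21)]   # 0.0 ... 1.0

grid = list(product(K1_grid, K2_grid, alpha_grid, beta_grid))

def grid_search(score_fn):
    """Return the (K1, K2, alpha, beta) tuple with the highest score."""
    return max(grid, key=lambda params: score_fn(*params))
```

The full grid has 9 × 10 × 9 × 21 = 17,010 combinations, which is why a staged search (fixing some parameters while varying others) is a practical alternative.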

The settings of the parameters are shown in Table 2 . Subsequent experiments are conducted with the most optimal parameter combinations.

The effects of the different parameters are visualized in Figure 3. First, α and β are fixed. Theoretically, the neighbors are selected twice to find a more accurate batch faster. However, closer analysis reveals that a change in K 2 is meaningful only within a certain range of K 1 ; otherwise, even when K 2 is 100%, it will not be very helpful. With K 2 fixed at 1.0 and K 1 varied from 0.1 to 0.9, we find that K 1 = 0.9 is the optimal value, as shown in Figure 3A. AUC initially decreases and then increases with K 1 and reaches its lowest point at 0.4. The AUC values are always above 0.975. K 1 = 0.9 is then fixed, and K 2 is changed. As K 2 becomes larger, AUC tends to increase globally and reaches the maximum at 0.9 (Figure 3B). The most pronounced increase is evident from 0.7 to 0.8, clearly demonstrating that 0.8 is a cut-off point. Finally, the impact of α and β is investigated. The contour and concentration plots in Figures 3C,D, respectively, demonstrate the variations in AUC with α and β. The gradient from blue to yellow indicates the increase in AUC values. Herein, the scenario with β = 0 is dropped to reveal subtle variations in the other cases, owing to our prior observation that the AUC value is as low as 0.4 in the case of β = 0, thereby forming a cliff-like change from the others. Both plots show that the best results are achieved at α of 0.2-0.8 and β of 0.15-0.9, where the AUC values are extremely close to 0.98. The yellowest areas appear locally as a result of drawing tool error, but this does not affect the fact that the results are roughly consistent. Figures 3E,F present the grid plots of AUC with respect to variations in α and β, respectively. Figure 3E is the case with β = 0 removed, corresponding to the plots in Figures 3C,D. Figure 3F includes all cases and validates the previous interpretation that including β = 0 makes the changes among the other cases difficult to distinguish. Across all ranges, the best parameters are K 1 = 0.9, K 2 = 1.0, α = 0.7, and β = 0.35, at which the AUC is 0.9797. This indicates that the performance of our model is attributable to the Cosine distance, which determines a more accurate neighborhood than applying the Euclidean distance only.

To demonstrate the superiority of our model, we calculate a similarity network with a single information source and build linear neighborhood propagation models to compare with MILNP. The results are presented in Table 3. The optimal values obtained are set in bold typeface. We construct these models based on the optimal combination of parameters. Except for the difference in information sources, all the other processes are guaranteed to be the same. The model using only sequence similarity with ILNS as the core algorithm is named MILNP-I. The model using only interaction profile similarity with ILNS as the core algorithm is named MILNP-II. Overall, both of them are inferior to MILNP with integrated information. MILNP-II is generally very close to our model, indicating that the interaction information is a key contributor throughout the prediction. Thus, with specific optimization, MILNP-II could be credible for predicting isolated lncRNAs or miRNAs. The performance of MILNP-I is not as good as that of MILNP-II, but its AUC is still acceptable. We also applied the method of Zhang et al. (2018b) to obtain new models, named the SLNPM-series, as shown in Table 3. With the information source controlled, a comparison of SLNPM-II and MILNP-II reveals that ILNS is more accurate. All these results clearly validate the superiority of MILNP in terms of information integration and similarity calculation.

As far as we know, very few studies have investigated lncRNA-miRNA interactions in plants. We select PmliPred and CIRNN as reference methods. Both methods predict plant lncRNA-miRNA interactions. PmliPred builds a prediction model by using a machine learning approach combined with a deep learning approach, and the final prediction results are made from fuzzy decisions of the two components. We use the publicly available source code on GitHub to implement PmliPred. CIRNN builds integrated deep learning models with both a CNN and an IndRNN, where the former is used to automatically extract gene sequence functional features, whereas the latter is utilized to obtain sequence feature representations and dependencies. We replicate it following the detailed description of CIRNN. The results of the comparison are summarized in Table 4. The optimal values obtained are set in bold typeface. The performance of the two methods is clearly good, but not as good as that of our model. In particular, the AUC and ACC values show a relatively large gap. We notice that both methods rely on deep learning.

Thus, we implement similar models on their basis to measure their effectiveness. We construct a CNN-Gated Recurrent Unit (GRU) combinatorial model. Sequential features are extracted from the original data by the CNN and compressed into a one-dimensional vector in the flatten layer to be input into the GRU. This process is well-suited for processing sequential information.

We consider GRU instead of others because GRU has fewer parameters and reduces overfitting. We first use a three-layer CNN and a single-layer GRU mixed with RF to obtain CNNRF1. We then add another layer of CNN on top of that to obtain CNNRF2. Although MILNP is simple and built by deriving mathematical formulas layer by layer in a top-down manner, its results are satisfactory. For the baseline SLNPM (Zhang et al., 2018b), which was originally created for animal prediction, we run it on our dataset and observe that our method is slightly better. This is because we focus on improving the computational procedure for linear neighborhood similarity by adding the spatial direction restriction. The optimal parameter combination shows that such performance is attributable to the Cosine distance, which determines a more accurate neighborhood than the approach of the baseline.

Top-rank prediction is an important way of visualizing the performance of the models. We examine the top-rank predictions from 200 to 2,000 and identify the percentage of interactions that are truly correct. As shown in Figure 4, an average of 186 true interactions is found among the top 200 predictions, whereas 1,380 true interactions are found among the top 2,000 predictions. The results demonstrate the good performance of our model. Furthermore, we predict the interactions of isolated lncRNAs and miRNAs with MILNP. For an isolated lncRNA or miRNA, only sequence-dependent information can be used. As separate cases, gma-miR395a and lcl| Gmax_Glyma.18G279100.1 are taken as examples. We validate the predictions for the selected miRNA and lncRNA using RNAhybrid2.1.2 (Rehmsmeier, 2004). All predictions are sorted in descending order of probability. For miRNA gma-miR395a, 4 of the top 10 are correctly predicted, as shown in Table 5, thereby confirming the predictive power of MILNP. However, the list reveals that the fifth and eighth detected lncRNAs in the prediction of miRNA "gma-miR395a" belong to Medicago truncatula, as evidenced by their nomenclature. The situation is worth contemplating. On the one hand, this indicates that the selected samples may affect the performance of MILNP. On the other hand, it inspires us to further explore cross-species linkages and assume that the remaining unconfirmed interactions are possible. Likewise, we make predictions for lncRNA lcl| Gmax_Glyma.18G279100.1. To our surprise, five of the results are identified by the tool as having interactions, a result that is very encouraging. We conjecture that the remaining ones predicted by our model are also possible. As demonstrated by the comparison of the two sets of predictions, the fact that sample selection has a great influence on the prediction results should not be ignored. 
The association between the selected sample and other samples also affects the results. For those that have a similarity with many samples, the prediction results may be more accurate. If a sample has little similarity with other samples, then predicting its potential interactions will be difficult.
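The top-rank evaluation described at the start of this passage can be sketched as a count of true interactions among the k highest-scoring lncRNA-miRNA pairs. The function name and the toy matrices are illustrative assumptions; in practice the scores would come from held-out pairs.

```python
import numpy as np

def top_k_hits(scores, truth, k):
    """Count true interactions among the k highest-scoring pairs.

    scores: l x m matrix of predicted scores.
    truth:  l x m binary matrix of known interactions.
    """
    flat_scores = np.asarray(scores, dtype=float).ravel()
    flat_truth = np.asarray(truth).ravel()
    # Sort pair indices by descending score; stable sort keeps ties in order.
    order = np.argsort(-flat_scores, kind="stable")[:k]
    return int(flat_truth[order].sum())

toy_scores = [[0.9, 0.1], [0.8, 0.3]]
toy_truth = [[1, 0], [0, 1]]
hits_top2 = top_k_hits(toy_scores, toy_truth, 2)
hits_top3 = top_k_hits(toy_scores, toy_truth, 3)
```

Dividing the hit count by k gives the percentage of correct predictions at each rank cut-off.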

The different results also prompt us to reconsider the results of previous experiments. We find that, although MILNP achieves good AUC and ACC, its PRE is relatively low, which may be attributed to the model itself and the distribution of the dataset. Some very similar samples may have confused the model and induced it to arrive at a wrong judgment. Nevertheless, such predictions could also hint at possible associations that have not yet been recorded. This assumption warrants further biological laboratory validation.

lncRNA-miRNA interactions are important because they influence various biological processes. Most studies on these interactions have focused on animals. Although experimental results derived from studies of plants are not as easy to verify as those obtained from animals, current research is not merely conjecture. Great improvements have been made by scientists after proposing bold assumptions and providing carefully evaluated proofs. Herein, we study plant interactions and propose a linear neighborhood propagation model based on combinatorial information. We validated it on datasets of three plants and obtained relatively good results. More importantly, we proposed a novel method for measuring similarity from mathematical fundamentals. We used the combined information of molecular sequences and interactions to construct a similarity network in which neighbors are guaranteed to be nearer in both spatial location and direction. We achieved the final prediction by label propagation. A series of experiments showed the outstanding performance of our model, demonstrating the superiority of the combinatorial information. We also predicted interactions for isolated lncRNAs and miRNAs without any known interactions and validated the predictions with existing tools. Our model possesses good generalization properties and can be used to discover new interaction relationships. Our multi-source information-based linear neighborhood propagation method is a novel method for predicting plant lncRNA-miRNA interactions. However, the entire study required a large time investment of about 3 months. Hence, in a follow-up study, we will tune the parameters to make the model more efficient. We will also consider deep learning methods on this basis and combine the results that we may obtain.

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

MG wrote the first draft of the manuscript. XF wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

We would like to thank Xiangxiang Zeng for kind suggestions and discussions that have helped improve the presentation of this manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2022 Cai, Gao, Ren, Fu, Xu, Wang and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.