key: cord-0035144-2wny0xhm authors: Peng, Sheng-Lung; Tsay, Yu-Wei; Wang, Tai-Chun; Tang, Chuan Yi title: Probe Selection with Fault Tolerance date: 2008 journal: Advanced Intelligent Computing Theories and Applications DOI: 10.1007/978-3-540-87442-3_27 sha: cf1504b5799eb683f4e4c15a3b2aa5789d4a5781 doc_id: 35144 cord_uid: 2wny0xhm Microarray techniques play an important role for testing some reactions of diseases which are caused by viruses. Probes in microarray are one kind of the most important materials. Usually, scientists use a unique probe for marking a special target sequence. Thus, for identifying n different viruses, we need n different probes. Recently, some researchers study non-unique probes to identify viruses by using less number of probes. In this case, a virus can be identified by a combination of some probes. In this paper, we study the problem of finding a set of probes that can identify all the given targets. We consider the k-fault tolerance selection of probes. That is, if any k probes fail, then we still can identify each target. We propose a practical algorithm for this k-fault tolerance probe selection problem. Some experiments are studied on SARS, H5N1, and so on. Research for the human nosography and immunology is important today [1] . SARS, AIV, cervical cancer, acquired immunodeficiency syndrome (AIDS), and so on, are all infected from viruses. Many researchers use microarray for clinical experiments [4, 8] . A microarray usually uses parallel analysis for an expression of thousands of genes. Because it can be formed by different types (e.g., oligonucleotide and cDNA), the applications of microarray are very wide [5] . Recent works or softwares try to find unique probes for the target sequences [2, 11, 12] . That is, each probe can only hybridize with one target sequence. In this case, for n different targets, we need n different probes to identify these targets. If we allow one probe fault, then for the premise of unique sequences, each target will need two unique sequences as probes. It is not hard to see that the number of probes will be doubled. It is very impractical. Thus, some researchers focus on non-unique probes [9, 10] . Since each of non-unique probes can hybridize with more than one target, fewer probes are needed. In this case, a target is identified by a combination of selected probes. Klau et al. proposed an integer linear programming algorithm to find a minimum set of non-unique probes which can identify all the input targets [6, 7] . In this paper, we propose a simple greedy algorithm to find a minimal set of non-unique probes for recognizing the input targets. A probe is a single-strand DNA or RNA molecule with a specific sequence labeled either radioactively or immunologically. It is used to detect the complementary sequence by hybridization for identifying unknown species or gene functions. When designing probes, we need to get a single-strand DNA or RNA sequence. Then we use primers to help us produce the complement subsequences of the ss-DNA (RNA). Each of them is between 25 and 80 bp. To obtain probes, we need to consider the temperature, GC contents rate, DNA secondary structure, and cross-hybridization. Most known probes are stored in NCBI GeneBank. There are several softwares can help us to find probes, such as MAPS [2] , GoArray [11] , OligoArray [12] , and so on. Figure 1 shows an example. For unique probes, each target needs one unique probe to identify itself. In the non-unique probes, we use less probes to identify all the targets. Table 1 shows a relation table according to Figure 1 . In this table, if target T i can hybridize to probe P j , then the (i, j) entry is 1. According to this table, it is clear that we need four probes to identify all the targets by using unique probes. But for nun-unique probes, two probes, namely, P 1 and P 3 , are sufficient. That is an advantage of using non-unique probes. Furthermore, in large transcript families, such as alternative splice variants of a gene, or in a large family of closely homologous genes, it is often impossible to find a unique probe of 25 bp as a signature for a specific variant [10] . So we consider about non-unique probes. A microarray is an array containing thousands of probes. These probes are printed on a solid surface. Sometimes it is a glass or nylon membrane. The sensitive of a microarray will be influenced by the reaction of target's concentration, temperature, and reaction time. Moreover, we also need to compute Table 1 . The relation matrixes for Figure 1 the diffusion coefficient, adhesive force coefficient, the uniform level of reaction targets, and so on. Some researches show that a long oligonucleotide is more sensitive than short one ( [3] ). However, the design problem is very complicated. Chung et al. try to use a hierarchical method to design probes [3] . For exon chip, the length of the probes on the chip can be different. But for GeneChip, its length has the limitation of 25 bp. Sometimes probes will fail because of the pollution during producing a microarray. So we need to add extra probes in order to make sure if some probe fails, then there still has another probe can take care of it. Table 2 shows the case that if any probe fails, then the probe set still can identify the four targets. In this section, we will introduce our greedy algorithm for finding a set of nonunique probes with k-fault tolerance. In our method, we use the software Promide proposed by Rahmann to find all the possible probes [9] . The input file is in FASTA format. After finding the Table 3 . An example of target-probe relation table Table 3 Target pairs (T1, T2) (T1, T3) (T1, T4) (T2, T3) (T2, T4) (T3, T4) probes, we construct a relation table between probes and targets. Table 3 shows an example. For each target, if a probe can bind the target, we then assign the value 1 in the corresponding entry. Thus, we know which probe can affect which target. For distinguishing the targets, we translate the relation table into the distinguishable table (see Table 4 ). In the case that different probes have the same distinguishable ability, we only keep one. In Table 4 , T 1 can hybridize to probe P 1 , but T 2 cannot. That is, P 1 can distinguish T 1 and T 2 . Thus, we record the value 1 in the (P 1 -(T 1 , T 2 )) entry. Take Table 4 as an example. In order to illustrate the relationship between probes and target pairs, we use the distinguishable table to draw a relation graph (see Figure 2 ). In this graph, each edge denotes that the probe can differentiate the two targets in the target pair. By using non-unique probes, each target pair has more than two incident edges. That is, if we remove any one node from the probe set, any node in the target pairs still has at least one incident edge. It means that each target pair still can be distinguished by one probe. Based on this fact, it is not hard to see that a probe set is k-fault tolerance if every target pair in the graph has at least k + 1 edges. Figure 3 shows an example of 1-fault tolerance. In this graph, every target pair has at least two edges and any failure probe will not affect the distinguishable ability. Output variables coverArray It is an array of size n. Each entry stores the number of target pairs that can be distinguished by the corresponding probe. ftArray It is an array of size m. Initially, every entry is 0. If every entry is at least k + 1, then the found probe set is k-fault tolerance. identifyArray It is an array of size m. Each entry stores the number of probes that can be used to identify the target pair. P robeSet A probe set outputted by our algorithm. After constructing the relation graph, we start to find a probe set. We give some definition first for the variables which are used in our algorithm in Table 5 . Our algorithm is shown in Algorithm 1. It can be divided into three parts. The first part is from step 7 to step 10. This part helps us to find the smallest number in the identif yArray. By our experience, if the target pair that can be distinguished by few probes, then considering it first will obtain a good solution. After finding a candidate target pair in the first part, we then go to the second part from step 13 to step 16. In this part, we find the largest number in the coverArray. That is, we select the probe that can distinguish the maximum number of target pairs. The final part of our algorithm is to update all the related information after we select a probe. If the requirement of the fault tolerance does not fulfill, we then repeat the process again until we fulfill the requirement. The algorithm first finds the minimum number of identif yArray (e.g., identif yArray [1] ). Then it finds the maximum number in coverArray for target pair 1. Since probe 1 and probe 6 have the same value, we choose probe 1. After choosing probe 1, we update the remaining information. The selected probes are shown in Table 6 , i.e., {P 1 , P 3 , P 6 }. It is not hard to check that any failure of selected probes does not loss the distinguishing ability for the input targets. In this section, we show the result by using our algorithm to find a set of nonunique probes for some viruses (e.g., SARS, HIV, avian's influenza A virus, and so on) with the ability of k-fault tolerance. Our virus data were downloaded from Table 7 shows the result. The number k in k-FT (k from 0 to 4) is the fault tolerant number. In our experiment, if the data are highly conservative such as SARS, then it is easy for Promide to find many possible non-unique probes. On the other hand, if the data are not highly conservative, Promide will find a lot of unique probes. In this case, our algorithm obtains a lot of probes like Retroviruses. The production of microarrays is costly. An important factor is the number of probes used in the microarray. Besides, the precision of tests for microarrays is another factor. These motivate this research. By the experimental result, our method shows that we can reduce the number of probes used for identifying several groups of viruses. Further, our method can be easily extended to k-fault tolerance, i.e., if there are k failure probes, the remaining probes can still be used to recognize the input viruses. In fact, our probe selection algorithm is an approximation algorithm. If the input size is not too large, we may use a branch-and-bound algorithm to find an optimal solution. Second, for the tool Promide, it is unable in dealing with an oversize sequences. Thus, it is necessary to find or design a new tool for large scale input size. Finally, while the cost of building microarrays is down, fault tolerance becomes an important issue. Our method obtains only an approximated result. Designing new algorithms considering fault tolerance will be a new direction. Cellular and Molecular Immunology, 5 th edn Maps: a microarray project system for gene expression experiment information and data validation Design of long oligonucleotide probes for functional gene detection in a microbial community Applications of microarray technology in breast cancer research. The Institute of Microarray standard data set and figures of merit for comparing data processing methods and experiment designs Optimal robust non-unique probe selection using integer linear programming Integer linear programming approaches for non-unique probe selection Medical applications of microarray technologies: a regulatory science perspective Rapid large-scale oligonucleotide selection for microarrays Non-unique probe selection by matrix condition optimization Goarrays: highly dynamic and efficient microarray probe design Oligoarray 2.0: design of oligonucleotide probes for dna microarrays using a thermodynamic approach