key: cord-0946190-93k9mxn5
authors: Lin, Baochuan; Malanoski, Anthony P.
title: Resequencing Arrays for Diagnostics of Respiratory Pathogens
date: 2009-01-16
journal: DNA Microarrays for Biomedical Research
DOI: 10.1007/978-1-59745-538-1_15
sha: c8863bd543b53a71b7f748031ae00dacb8270884
doc_id: 946190
cord_uid: 93k9mxn5

Microarray technology has revolutionized the detection and analysis of microbial pathogens. The success of this technology is evident from the variety of microarrays that have been developed for this purpose, which differ in probe density and in the time required for assay completion. Among these, high-density resequencing microarrays have demonstrated great potential for detecting bacterial and viral pathogens and virulence markers. Resequencing microarrays use closely overlapping probe sets to determine a target organism's nucleotide sequence. Hybridization to a series of perfect-match probes provides confirmatory presence/absence information, while hybridization to mismatched probes reveals strain-specific single nucleotide polymorphism (SNP) data. This approach provides sequence information for the diagnostic regions of detected organisms that is considerably more informative than that provided by other microarray techniques.

High-density resequencing microarrays were developed to detect single nucleotide polymorphisms (SNPs) and so produce detailed genetic sequence reads. A resequencing microarray comprises "probe sets": high-density arrangements of short, highly specific oligonucleotide probes (lengths of 25 and 29 bases are currently used) in which each base in a reference sequence is queried by four probes. One probe is an exact match to the reference sequence, and the other three represent the same section of the reference sequence with the central base position replaced by one of the three possible SNP variants.
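The probe-set layout just described can be illustrated with a minimal sketch. It assumes 25-base probes on the sense strand only and simply skips the first and last 12 bases of the reference, which have no full-length window; the function name `probe_set` and these edge-handling choices are illustrative assumptions, not the manufacturer's tiling procedure.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def probe_set(reference: str, probe_len: int = 25) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, central_base, probe) for every interrogated base.

    For each reference position with a full window around it, four probes
    are produced: one perfect match and three that differ only at the
    central base. Hypothetical helper for illustration only.
    """
    half = probe_len // 2
    ref = reference.upper()
    for pos in range(half, len(ref) - half):
        window = ref[pos - half: pos + half + 1]
        for base in BASES:
            # Substitute the central base; when base == ref[pos] the probe
            # is the perfect-match probe for this position.
            yield pos, base, window[:half] + base + window[half + 1:]


if __name__ == "__main__":
    toy_reference = "ACGT" * 6 + "ACG"  # 27-base toy sequence, not real data
    for pos, base, probe in probe_set(toy_reference):
        tag = "match   " if base == toy_reference[pos] else "mismatch"
        print(pos, base, tag, probe)
```

Under these assumptions, a reference of length n yields 4 × (n − 24) sense-strand probes.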
In practical use, the number of probes is doubled so that, for each base, both the forward (sense) and reverse (antisense) directions are contained in a probe set. It is possible to completely "resequence", that is, to resolve every base in the sequence of an unknown sample, when that sequence is the reference sequence itself or any other sequence that differs from the reference by one mismatch or fewer per 25 base pairs (bp) (1). At higher mismatch rates, a large number of bases can still be identified, but an increasing number of base calls will fail as the differences increase. This array-based format, combined with specific PCR, has proven ideal for SNP genotyping and phylogenetic analysis (2-6). Initial work demonstrated the advantages of using a resequencing array with many short reference sequences to detect multiple bacterial and viral pathogens (7-11). To take full advantage of the sequential base resolution capability of resequencing microarrays, similarity searches of DNA databases have been incorporated into the analysis. This allows finely detailed discrimination of closely related pathogens and tracking of mutations within the targeted pathogen, even with only partial base-call resolution in a reference sequence (8, 10-12).

The effective use of resequencing microarrays for respiratory pathogen detection, or for any large collection of organisms, relies on the integration of several components. The overall design of the resequencing microarray and the selection of primers for amplification must occur first. This consists of several tasks: first, selecting the organisms, the desired level of discrimination for each organism, and whether specific nucleic acid markers must be tested for; second, determining from known sequence data the sequence regions from which to choose reference sequences; third, selecting the reference sequences and checking for possible conflicts; and fourth, selecting primers.
The order of several of these steps can be interchanged, and refinement consists of repeating several of these steps after making changes. Once the microarray is fabricated, an amplification method is required to achieve the sensitivity needed for diagnostic and surveillance applications, so that any of the target pathogens can be detected directly from collected samples. Finally, because so many potential organism detection events must be dealt with, a standardized algorithm is applied to determine whether pathogens are detected and to report the maximum level of detail possible using the resolved base sequence information from the multiple-pathogen resequencing microarrays.

1. IQ-Ex (0.2 pg/mL) control template, control forward 1.0 kb PCR primers (20 µM), control reverse primers (20 µM), and oligonucleotide control reagent (3.2 nM). These are part of the GeneChip Resequencing Assay Kit (Affymetrix Inc., Santa Clara, CA). These reagents can be stored at -20°C for up to 6 months. Set up a 100 µL PCR for the 1.0 kb IQ-EX containing 20 mM Tris-HCl (pH 8.4), 50 mM KCl, 2.5 mM MgCl2, 200 µM dNTPs, 1 U of Platinum Taq DNA polymerase (Invitrogen Life Technologies, Carlsbad, CA), 3 µL each of control forward 1.0 kb PCR primers (20 µM) and control reverse primers (20 µM), and 5 µL of IQ-Ex (0.2 pg/ml) control template. The amplification reaction is carried out with an initial denaturation at 95°C for 10 min, followed by 30 cycles of 94°C for 30 s, 68°C for 30 s, and 72°C for 60 s, and a final extension at 72°C for 5 min (Note 1).

The success of using a resequencing microarray for multi-organism detection, as in broad-spectrum detection of various respiratory pathogens, relies on resolving two issues before the assay can be applied to samples. First, the chip must be designed by selecting appropriate reference sequences to answer the questions that will be asked.
The second consideration is multiplex primer selection, because the assay outlined here (rapid analysis of samples with large amounts of background nucleic acid material) requires the use of specific or semi-specific primers.

Selection of partial genomic sequences from pathogens (reference or target sequences) for placement on a resequencing microarray, to provide direct sequence-based identification of multiple pathogens, depends on what specific knowledge is required for the various pathogens. For example, the respiratory pathogen microarray v.1 (RPM v.1) chip design includes 57 target genes, which are partial sequences from the genes containing diagnostic regions of each pathogen (i.e., E1A, hexon, and fiber for human adenoviruses (HAdVs); hemagglutinin, neuraminidase, and the matrix genes for influenza A viruses). The targets for HAdVs and influenza are long enough that RPM v.1 not only allows identification but also produces strain-specific sequence data at the same time. The remaining respiratory pathogens only required detection, so fewer and shorter partial sequences were selected, allowing resequencing of 29.7 kb of sequence to provide at least species-level identification of 26 distinct organisms (10) (Note 8).

For the RPM v.1 chip, selection of partial sequences to generate probe sets on the microarray was based on the same rules used in selecting probes for long-oligonucleotide spotted microarrays, even though such rules do not account for the strengths and weaknesses of a resequencing microarray. Overall, the detection and discrimination performance of such sequences was good. In fact, probes that on a spotted array would only discern to a particular level, such as serotype, will give at least the same level of discrimination on a resequencing array and often provide more detailed discrimination, such as strain differentiation.
Using these selection rules can, however, lead to wasted space on the resequencing microarray: in some cases where two probes were required on a spotted microarray, the information from only one is sufficient to provide equivalent or greater detection and discrimination on a resequencing microarray. Design methods have since been refined to reduce redundancy and to better incorporate the advantages of resequencing arrays into probe selection. Once the sequences were selected, the sequence file was sent to Affymetrix for fabrication (Note 9).

The gene-specific primer pairs for all targets on the RPM v.1 chip (8) were designed according to the following criteria to meet the minimum amplification efficiency requirement.

1. From our work, we have established a gross predictor that hybridization will occur for an organism on the array: at least 70% of the bases must match between the sequence used on the microarray and the organism's sequence when aligned (BLAST) (12). A list of sequences that may potentially hybridize to the reference sequence is constructed using a BLAST query. Primers are selected from consensus sequences of well-conserved regions flanking the reference sequence from the list. All potential primers that are 18-24 bases in length, have ~50% GC content and no repetitive sequences, and have an annealing temperature in the range of 55 to 60°C without potential for self-annealing and hairpin formation are considered. This list is further filtered to ensure uniqueness with respect to the other pathogens and the human genome by using a full search of the GenBank database with the BLAST program. This ensured that the potential primers for an organism have a number of mismatches with these two groups of sequences and would not mis-prime on a sequence region not of interest in the assay (Note 10). Once selected, all primers in the same primer cocktail are checked for potential hybridization to other primers to reduce the potential for primer-dimer formation.
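The sequence-level primer criteria can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' pipeline: the GC tolerance (±10%), the definition of "repetitive" as a run of five or more identical bases, and the reading of the 8-base primer-dimer criterion stated in the text as 8 contiguous complementary bases in antiparallel orientation are all choices made here. Annealing temperature, hairpin prediction, and the BLAST uniqueness search are deliberately left out.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def passes_basic_filters(primer: str, gc_tolerance: float = 0.10,
                         max_run: int = 4) -> bool:
    """Length 18-24, GC near 50%, no long single-base runs.

    gc_tolerance and max_run are assumed thresholds for illustration.
    """
    p = primer.upper()
    if not 18 <= len(p) <= 24:
        return False
    if abs(gc_fraction(p) - 0.5) > gc_tolerance:
        return False
    # Reject runs longer than max_run identical bases (e.g. AAAAA).
    if re.search(r"(.)\1{%d,}" % max_run, p):
        return False
    return True


def may_form_dimer(primer_a: str, primer_b: str, k: int = 8) -> bool:
    """True if any k contiguous bases of primer_a can pair with primer_b."""
    a = primer_a.upper()
    rc_b = reverse_complement(primer_b)
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(rc_b[i:i + k] in kmers_a for i in range(len(rc_b) - k + 1))


if __name__ == "__main__":
    candidate = "ACGTTGCAAGCTAGCTTGAC"  # made-up sequence, not from Table 15.1
    print(passes_basic_filters(candidate))
    print(may_form_dimer(candidate, reverse_complement(candidate)))
```

In a multiplex setting, `may_form_dimer` would be applied to every pair of primers in the same cocktail, and any primer involved in a flagged pair would be redesigned.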
Primers that could conceivably form primer dimers (8 or more contiguous base matches between the primers) are replaced with new ones until all potential primer dimers are removed. In addition, we adapted a method developed by Shuber et al. and Brownie et al. (13, 14) to further suppress primer-dimer formation by adding a linker sequence of 22 bp (primer L) to the 5'-end of the primers used (Note 11).

3. To minimize the possibility of intra-primer interactions, the number of primers in a mix is kept to no more than 100. For RPM v.1, the primers were divided into two independent reactions to satisfy this requirement. Fine-tuning adjustments to both mixtures (swapping primers that amplified poorly for new ones) were carried out to ensure that all target genes from the 26 targeted pathogens (West Nile Virus is included on the array but not in this amplification scheme) would amplify sufficiently to generate detectable hybridization. Primer sequences are listed in Tables 15.1 and 15.2 (8).

1. Mix 150 µL of the fluid sample (nasal wash or throat swab in storage medium) with 150 µL of 2X T&C lysis solution premixed with 1 µL of 50 mg/mL proteinase K, and vortex thoroughly. Incubate the sample mixture at 65°C for 15 min, vortexing every 5 min. After incubation, place the sample on ice for 3-5 min.
2. Add 150 µL of MPC protein precipitation reagent to the sample mixture and vortex vigorously for 10 s. At this point, the sample mixture should appear cloudy. Pellet the debris by centrifugation for 10 min at 13,000 rpm at room temperature in a microcentrifuge. If the pellet is clear, small, or loose, add an additional 25 µL of MPC protein precipitation reagent, mix, and spin again.
3. Transfer the supernatant to a clean 1.5 mL tube and discard the pellet (Note 12). Add 500 µL of isopropanol, then invert the tube several times to mix thoroughly. Pellet the DNA by centrifugation at 4°C for 10 min at 13,000 rpm in a microcentrifuge.
Pour off the isopropanol, being careful not to lose the DNA pellet. Rinse twice with 75% ethanol; centrifuge briefly if the pellet is dislodged. Remove all residual ethanol with a pipette and air-dry the pellet for 5-10 min. Resuspend the total nucleic acids in 25 µL of nuclease-free water and store at -20°C until further use (Note 13).

Two Arabidopsis thaliana plant genes, corresponding to NAC1 and TIM, were chosen as internal controls for reverse transcription

2. Pipette 700 µL of the mixture into a QIAquick spin column in a 2 mL collection tube and centrifuge for 30-60 s at 13,000 rpm in a microcentrifuge at room temperature. Discard the flow-through and repeat the process if the mixture volume is larger than 700 µL.
3. Wash the spin column with 750 µL of PE buffer (with the indicated amount of ethanol added), and centrifuge for 30-60 s at 13,000 rpm in a microcentrifuge at room temperature. Discard the flow-through, then place the column back into the collection tube. Centrifuge for 60 s at 13,000 rpm at room temperature to remove residual ethanol.
4. Elute the DNA by placing the spin column in a new 1.5 mL tube, adding 50 µL of EB buffer (10 mM Tris-HCl, pH 8.5) to the center of the spin column, and centrifuging for 60 s at 13,000 rpm in a microcentrifuge at room temperature (Note 17). Purify the IQ-EX PCR products as described in step 1. Determine the concentration of the IQ-EX using UV spectrophotometry.
6. Add 7.6 µL of the fragmentation solution to 35 µL of eluted DNA and 3 µg of IQ-EX PCR product in a final volume of 35 µL. Incubate the reaction mixture at 37°C for 5 min (Note 18), then inactivate the enzyme at 95°C for 15 min. Store at 4°C after incubation. At this point, the sample can be stored at 4°C for up to 1 week before labeling.
7. Add 17.4 µL of the labeling solution to 35 µL of fragmented DNA and IQ-EX PCR product from step 3.
Incubate the reaction mixture at 37°C for 30 min (Note 18), and then inactivate the enzyme at 95°C for 15 min. Store at 4°C. Use the fragmented and labeled IQ-EX PCR product to prepare the hybridization buffer.
8. Add 160 µL of hybridization buffer to 60 µL of fragmented and labeled PCR products. At this point, the sample can be stored at -20°C for up to 1 month before hybridization.
9. Add 200 µL of pre-hybridization buffer to each chip, and incubate the chip in the hybridization oven at 49°C and 60 rpm for 15 min. In the meantime, denature the samples from step 5 at 95°C for 5 min, and then equilibrate the tubes at 49°C for 5 min (Note 19).
10. Remove the arrays from the hybridization oven, and remove and discard the pre-hybridization buffer. Add 200 µL of denatured sample. A small bubble should be visible inside the chip; it serves as the mixing mechanism for the microarray, so make sure it is present. Incubate the chip in the hybridization oven at 49°C and 60 rpm for 4-16 h (Note 20). Remove the hybridization mixture from the array, and fill the array with 250 µL of array holding buffer. At this point, the array can be stored at 4°C for up to 3 h before washing and staining (Note 21).
11. Prime the GeneChip fluidic stations (Affymetrix) with Wash A and Wash B. Register a new experiment in GeneChip Operating Software Service (GCOS). Load the array and the SAPE and antibody stain solutions into the designated fluidic module. Start the washing and staining protocol "DNAArray_WS5_450". Remove the array when the protocol is complete, and check that no bubble is present at this time. If there are visible bubbles, manually fill the array with array holding buffer using a pipette. Apply two tough spots to each of the two septa on the back of the array. The array can be stored at 4°C for up to 24 h before scanning (Note 22). Flush the fluidic stations with DI water and shut them down.
12. Turn on the GeneChip Scanner 3000 at least 10 min before use.
If the array was stored at 4°C, allow it to warm to room temperature before scanning. Insert the array into the scanner. Use GCOS to start scanning the array by selecting the corresponding experiment. GCOS will process the image file of the scanned microarray and create cell intensity data. An example of the results is shown in Fig. 15.1.
13. Use GeneChip Sequence Analysis Software (GSEQ) to analyze the cell intensity data and generate the base calls (Note 23). Export the sequence information to a FASTA file.

1. Sending entire FASTA files to be searched by BLAST is wasteful of time and potentially misleading (Note 24). The set of resolved bases resulting from hybridization is instead subjected to a filtering process. Each reference sequence is examined by itself and split into possible subsequences (SubSeqs) suitable for a BLAST search. SubSeqs are found by locating seed locations within the sequence that have at least 18 of 20 bases resolved. One of these locations is then extended for as long as the total called-base percentage stays above 40%, unless a contiguous stretch of at least 18 or 19 bases of N calls is encountered. The section is marked as its own SubSeq if its length is at least 30 bases, and the remaining seed locations are examined in a similar manner.
2. BLAST is used to perform a similarity search of the NCBI nr database using the SubSeqs as the queries. The BLAST program used is NCBI blastall -p blastn with a defined set of parameters. Masking of low-complexity regions is performed for the seeding phase; however, such regions are included in the actual scoring. The default gap penalty and nucleotide match score are used. The nucleotide mismatch penalty parameter, -q, is set to -1 rather than the default. The results of any BLAST query with an expected value <0.0001 are returned in tabular format from the blastall program. If any SubSeq has an expected value of 10^-6 or less, it is considered positive for identification of whatever organism is reported in the return.
3. A SubSeq might return many records from the database with the same score. When more than one return is tied for the best score, the identified organism is whatever taxonomic classification encompasses all of the tied best-scoring returns (Note 25). If only a single return has the best score, it is considered the closest specific strain to the organism in the sample.
4. Different SubSeqs for the same reference sequence are required to result in a single pathogen identification for that reference sequence. If one SubSeq has a significantly better score and a more detailed identification than the others, that is taken as the identification for the reference sequence. If all SubSeqs have similar scores, then the taxonomic classification that is consistent across all of them is considered the best identification that can be made. A final examination can be made of the results from reference sequences that targeted the same organism to ensure that they are reporting consistent results. It is not required that all reference sequences identify an organism, nor is it strictly required that they make exactly the same identification. This process can be very laborious and time-consuming considering the large number of reference sequences, and automation of this process is possible (Note 26).

9. During the fabrication process at Affymetrix, the chip design group performs a design clarification process that checks for ambiguous, repetitive, and homologous sequences and suggests their removal. Upon completion of this step, the masks required to produce the chips are fabricated and checked for quality. Finally, the arrays are manufactured. The clarification process can take as little as 2 weeks if little feedback is required but can run longer, and the production of chips depends on scheduling constraints to meet delivery to all of the manufacturer's customers.
10. Primers selected should have at least three base mismatches with human genome sequences to avoid non-specific amplification.
11. Short linker primers can also be used, but the linker primers must be unique, have a higher melting temperature, and be unrelated to the target pathogens and to background genome sequences that the samples may contain.
12. Try to avoid lipid and small, white, powdery protein substances in the supernatant.
13. If the pellet is air-dried for too long, add nuclease-free water to the pellet and store at 4°C overnight to ensure complete resuspension of the nucleic acids.
14. An internal control is not absolutely necessary for the reaction, but it is useful for ensuring that there is no false-negative result due to the RT or PCR step. Genes other than NAC1 and TIM can also be used as internal controls, provided the custom-designed chip has reference sequences that hybridize with the control genes.
15. Do not use 50x dNTPs (ACGU, Sigma) at this step.
16. Two-step PCR is used here to shorten the cycling time; alternatively, a three-step PCR can be used as long as the annealing temperature is raised enough to switch to linker-only priming.
17. A smaller amount of elution buffer can be used; EB buffer can be used to bring up the volume for the next step.
18. Affymetrix recommends longer incubation times for both fragmentation (30 min) and labeling (2 h). However, our experience suggests that the shorter incubation times are sufficient.
19. It is recommended that the chip be warmed to room temperature before performing pre-hybridization. Our experience suggests that this is not a critical step.
20. It is recommended that hybridization be carried out for 16 h in order to reach equilibrium. Our experience indicates that a 4 h hybridization can generate sufficient base calls for pathogen identification.
21. Wash A can also be used to fill the chip at this stage. Chips can be stored for up to 24 h before washing.
22. Chips can still be scanned after storage for more than 24 h and less than 1 week, although the fluorescence signal will be weaker.
24. Resequencing arrays provide positional information, which allows for the use of similarity searches; however, because of how the information is obtained, they can potentially bias for or against variants with insertions or deletions, depending on the reference sequence selected. Splitting regions that are separated by large sections of N calls reduces this bias.
25. The organism identified may not necessarily be the primary organism the reference sequence is intended to identify but may instead be a near-neighbor species.
26. The algorithm as described was built into a new software program, Computer-Implemented Biological Sequence-based Identifier system, version 2 (CIBSI 2.0), to automate the pathogen identification process for the RPM v.1 array (12). Besides determining what each reference sequence detects, CIBSI 2.0 also determines whether the identifications from separate targets support a common organism identification and whether detected organisms belong to the target set that the assay was designed to detect or are close genetic near neighbors. Target pathogens are the organisms the assay was specifically designed to detect.
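
The SubSeq filtering and tie-handling rules described in the analysis steps can be sketched as follows. This is a minimal illustration under stated assumptions and is not the CIBSI 2.0 implementation: seeds are extended to the right only, an N run of 18 is used as the stopping threshold, leading and trailing N calls are trimmed, and BLAST hits are represented as hypothetical `(lineage, score, e_value)` tuples with made-up lineages.

```python
from typing import List, Sequence, Tuple


def find_subseqs(calls: str, seed_len: int = 20, seed_min_called: int = 18,
                 min_called_frac: float = 0.40, max_n_run: int = 18,
                 min_len: int = 30) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of candidate SubSeqs."""
    called = [c != "N" for c in calls.upper()]
    n = len(called)
    subseqs: List[Tuple[int, int]] = []
    i = 0
    while i + seed_len <= n:
        if sum(called[i:i + seed_len]) < seed_min_called:
            i += 1
            continue
        start, end = i, i + seed_len
        n_called = sum(called[start:end])
        run = 0  # length of the current run of N calls at the right edge
        for c in reversed(called[start:end]):
            if c:
                break
            run += 1
        while end < n:  # extend one base at a time (rightward: assumption)
            run = 0 if called[end] else run + 1
            if run >= max_n_run:
                break
            if (n_called + called[end]) / (end + 1 - start) <= min_called_frac:
                break
            n_called += called[end]
            end += 1
        while end > start and not called[end - 1]:  # trim trailing N calls
            end -= 1
        resume_at = end
        while start < end and not called[start]:  # trim leading N calls
            start += 1
        if end - start >= min_len:
            subseqs.append((start, end))
        i = resume_at  # continue with the remaining seed locations
    return subseqs


def tied_best_lineages(hits: Sequence[Tuple[Sequence[str], float, float]],
                       max_evalue: float = 1e-6) -> List[Sequence[str]]:
    """Lineages of all hits tied for the best score among positive hits."""
    positive = [h for h in hits if h[2] <= max_evalue]
    if not positive:
        return []
    best = max(h[1] for h in positive)
    return [h[0] for h in positive if h[1] == best]


def common_taxon(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Deepest classification shared by every lineage (root-first lists)."""
    shared: List[str] = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


if __name__ == "__main__":
    toy_calls = "A" * 40 + "N" * 25 + "C" * 35 + "N" * 10 + "G" * 10
    print(find_subseqs(toy_calls))
    toy_hits = [(("Family X", "Species B", "Strain 1"), 200.0, 1e-50),
                (("Family X", "Species B", "Strain 2"), 200.0, 1e-50),
                (("Family X", "Species C", "Strain 9"), 150.0, 1e-30)]
    print(common_taxon(tied_best_lineages(toy_hits)))
```

In this toy example, the 25-base stretch of N calls splits the sequence into two SubSeqs, while the shorter 10-base stretch is kept inside the second one. The two tied best hits agree only down to the species level, so that level is reported.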
1. Resequencing and mutational analysis using oligonucleotide microarrays
2. Simultaneous genotyping and species identification using hybridization pattern recognition analysis of generic Mycobacterium DNA arrays
3. Extensive polymorphisms observed in HIV-1 clade B protease gene using high-density oligonucleotide arrays
4. Tracking the evolution of the SARS coronavirus using high-throughput, high-density resequencing arrays
5. High-density microarray of small-subunit ribosomal DNA probes
6. Sequence-specific identification of 18 pathogenic microorganisms using microarray technology
7. Use of resequencing oligonucleotide microarrays for identification of Streptococcus pyogenes and associated antibiotic resistance determinants
8. Using a resequencing microarray as a multiple respiratory pathogen detection assay
9. Application of broad-spectrum, sequence-based pathogen identification in an urban population
10. Broad-spectrum respiratory tract pathogen identification using resequencing DNA microarrays
11. Identifying influenza viruses with resequencing microarrays
12. Automated identification of multiple micro-organisms from resequencing DNA microarrays
13. The elimination of primer-dimer accumulation in PCR
14. A simplified procedure for developing multiplex PCRs
15. Assessing unmodified 70-mer oligonucleotide probe performance on glass-slide microarrays