Identification of high-efficiency 3′GG gRNA motifs in indexed FASTA files with ngg2

Elisha D. Roberson
Departments of Medicine & Genetics, Division of Rheumatology, Washington University in Saint Louis, Saint Louis, MO, United States of America

Submitted 9 April 2015
Accepted 29 October 2015
Published 18 November 2015
Corresponding author: Elisha D. Roberson, eroberso@dom.wustl.edu
Academic editor: Kjiersten Fagnan
DOI 10.7717/peerj-cs.33
Copyright 2015 Roberson. Distributed under Creative Commons CC-BY 4.0 (open access)

ABSTRACT
CRISPR/Cas9 is emerging as one of the most-used methods of genome modification in organisms ranging from bacteria to human cells. However, the efficiency of editing varies tremendously from site to site. A recent report identified a novel motif, called the 3′GG motif, which substantially increases the efficiency of editing at all sites tested in C. elegans. The authors of that report also highlighted that previously published gRNAs with high editing efficiency had this motif. I designed a Python command-line tool, ngg2, to identify 3′GG gRNA sites from indexed FASTA files. As a proof-of-concept, I screened for these motifs in six model genomes: Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus, and Homo sapiens. I also scanned the genomes of pig (Sus scrofa) and African elephant (Loxodonta africana) to demonstrate the utility in non-model organisms. I identified more than 60 million single-match 3′GG motifs in these genomes. Greater than 61% of all protein coding genes in the reference genomes had at least one unique 3′GG gRNA site overlapping an exon. In particular, more than 96% of mouse and 93% of human protein coding genes have at least one unique, overlapping 3′GG gRNA.
These identified sites can be used as a starting point in gRNA selection, and the ngg2 tool provides an important ability to identify 3′GG editing sites in any species with an available genome sequence.

Subjects Bioinformatics, Computational Biology, Data Science, Databases
Keywords gRNA, Motif discovery, Python, Open-source, CRISPR/Cas9, 3′GG

How to cite this article Roberson (2015), Identification of high-efficiency 3′GG gRNA motifs in indexed FASTA files with ngg2. PeerJ Comput. Sci. 1:e33; DOI 10.7717/peerj-cs.33

INTRODUCTION
Genome engineering allows for the targeted deletion of a locus, or for its modification by homology-directed repair. Currently, one of the most popular methods for genome manipulation is the clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR associated protein 9 (Cas9) system adapted from Streptococcus pyogenes. The S. pyogenes CRISPR/Cas system was initially thought to represent a novel DNA repair mechanism, but was eventually found to provide heritable bacterial immunity to invading exogenous DNA, such as plasmids and bacteriophages (Barrangou et al., 2007; Makarova et al., 2006). During endogenous CRISPR/Cas9 function, foreign DNA integrates into the CRISPR locus. The bacterial cell then expresses the pre-CRISPR RNA (pre-crRNA) and a trans-activating crRNA (tracrRNA) that pair to form a complex that is cleaved by RNase III (Deltcheva et al., 2011). The resulting RNA is a hybrid of the pre-crRNA and the tracrRNA, and includes a 20 bp guide RNA (gRNA) sequence.
The gRNA is incorporated into Cas9 and can then guide the cleavage of a complementary DNA sequence by the nuclease activity of the Cas9 protein. The topic of CRISPR-Cas genome editing has been reviewed extensively elsewhere (Doudna & Charpentier, 2014; Hsu, Lander & Zhang, 2014; Jiang & Doudna, 2015; Mali, Esvelt & Church, 2013).

Codon-optimized versions of Cas9 are available for a wide range of organisms, and one can easily be synthesized if it is not already available. Transfecting cells with a Cas9 plasmid along with a fused crRNA-tracrRNA hybrid construct called a single-guide RNA (sgRNA) allows for temporary activity of Cas9. Alternatively, cells can be transfected with Cas9 protein preloaded with a gRNA to reduce off-target effects (Kim et al., 2014). Keeping a stock of plasmids with a sgRNA backbone minus the gRNA site makes it easy to quickly generate new sgRNA plasmids by site-directed mutagenesis. The Cas9 protein loaded with the sgRNA will bind to complementary genomic loci, but will only cut them if a protospacer adjacent motif (PAM) site immediately follows the complementary sequence (Mojica et al., 2009). The PAM site for the commonly used Streptococcus pyogenes type-II CRISPR is an NGG motif. Therefore, a S. pyogenes Cas9 gRNA site can be defined as N20NGG. It is important to note that constitutively expressed sgRNAs typically use a U6 snRNA promoter that strongly prefers a G starting base. For U6 compatibility, sequences starting with A, C, or T may be used if they are cloned into a sgRNA vector with an appended G base, resulting in a 21 bp gRNA (Farboud & Meyer, 2015; Ran et al., 2013b), or if the gRNA is incorporated into a tRNA poly-cistron to take advantage of tRNA processing cleavage (Xie, Minkenberg & Yang, 2015). I will refer to the subset of gRNA sites that contain a starting G base (GN19NGG) as canonical 3′GG gRNA sites.
The rate of editing using the CRISPR/Cas9 system is far higher than that of homologous recombination, but higher efficiency is still desirable. The introduction of a longer stem in part of the sgRNA stem-loop structure and the flip of a single A in a polyA tract of a separate sgRNA stem-loop, called the flip + extension (F + E) sgRNA design, resulted in increased Cas9 editing efficiency (Chen et al., 2013). Recently, another improvement that increases efficiency was reported. gRNA sites with a GG motif adjacent to the PAM site, called 3′GG gRNAs, have far higher activity than equivalent gRNA sites in the same region (Farboud & Meyer, 2015). These sites take the form of N18GGNGG. The 3′GG motif efficiency in species other than C. elegans is unknown.

Tools already exist to identify S. pyogenes Cas9 gRNA targets via a web interface, either in an input DNA sequence or in common model organisms (Gratz et al., 2014; Heigwer, Kerr & Boutros, 2014; Liu et al., 2015; Montague et al., 2014; Naito et al., 2015; Stemmer et al., 2015; Xiao et al., 2014). However, there are limitations to these methods. Searching a whole genome for gRNA sites is not feasible via a web interface unless the genome is exceptionally small. Most model organisms are already supported, but this leaves individuals working on less commonly studied species without a resource. In this manuscript, I report a Python command-line tool, ngg2, for identification of 3′GG gRNA motifs from indexed FASTA genome files. As a proof of concept, I report all 3′GG gRNA motifs in 6 model species plus two additional mammalian genomes, identifying more than 88 million sites, of which more than 60 million are unique matches within the reference genome for that species. More than 83% of all protein coding genes in 7/8 species have at least one unique 3′GG gRNA overlapping them for potential editing.
MATERIALS & METHODS

ngg2 motif identification
I designed ngg2 using Python with compiled regular expressions for the 3′GG gRNA plus PAM motif. The use of compiled regular expressions makes the search quite efficient even for relatively large genomes. The tool relies on the Python base functions and some external dependencies, such as the regex and pyfaidx packages. ngg2 uses the FASTA index via pyfaidx (Shirley et al., 2015) to directly seek the genomic target without reading the entire file. The default mode scrapes the entire FASTA input for 3′GG gRNA sites, but individual contigs or contig regions can be specified instead. ngg2 identifies these sites on the sense and antisense strands independently for each chromosome, facilitating multiprocessing to decrease computation time. ngg2 buffers all detected gRNA sites in memory, and then determines uniqueness by storing the gRNA sites in a dictionary. This means that all unique sites will be appropriately flagged, but near matches, i.e., sites with single-base mismatches, will not. Identifying near matches is beyond the scope of this tool, but the output could be pipelined with other tools, or further extended with BioPython, for that purpose. The output can be extended to include non-canonical sites starting with any base. ngg2 output includes the contig name, start and end positions, the gRNA sequence, the PAM sequence, whether the site starts with a G, and whether the gRNA sequence was unique in the searched region. For a whole-genome search the uniqueness flag is directly useful, but note that when only a small region is searched, the flag reports uniqueness within that region, not within the genome. The source code for ngg2 is available from GitHub.
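The search strategy described above can be illustrated with a minimal sketch. This is not the ngg2 source code: it assumes an in-memory sequence string rather than an indexed FASTA file, it uses the standard-library re module with a lookahead (instead of the regex package) so that overlapping sites are reported, and the names find_sites and revcomp are hypothetical.

```python
import re
from collections import Counter

# 3'GG gRNA (N18 followed by GG) plus an NGG PAM on the sense strand.
# The zero-width lookahead allows overlapping sites to be reported.
SENSE = re.compile(r"(?=([ACGT]{18}GG)([ACGT]GG))")
# The same motif located on the antisense strand appears on the reference
# strand as its reverse complement: CCN (PAM) then CC + N18 (gRNA).
ANTISENSE = re.compile(r"(?=(CC[ACGT])(CC[ACGT]{18}))")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(contig, seq, canonical_only=False):
    """Find N18GGNGG sites on both strands and flag exact-match uniqueness.

    Coordinates are 1-based and inclusive on the reference strand.
    Uniqueness is assessed only within the sequence that was searched.
    """
    seq = seq.upper()
    sites = []
    for match in SENSE.finditer(seq):
        sites.append({"contig": contig, "start": match.start() + 1,
                      "end": match.start() + 23, "strand": "+",
                      "grna": match.group(1), "pam": match.group(2)})
    for match in ANTISENSE.finditer(seq):
        sites.append({"contig": contig, "start": match.start() + 1,
                      "end": match.start() + 23, "strand": "-",
                      "grna": revcomp(match.group(2)),
                      "pam": revcomp(match.group(1))})
    if canonical_only:
        # Canonical sites start with G for U6 promoter compatibility.
        sites = [s for s in sites if s["grna"].startswith("G")]
    # A dictionary of gRNA counts identifies exact duplicates only;
    # near matches (e.g., single-base mismatches) are not detected.
    counts = Counter(s["grna"] for s in sites)
    for s in sites:
        s["starts_with_g"] = s["grna"].startswith("G")
        s["unique"] = counts[s["grna"]] == 1
    return sites
```

Because each contig and strand is scanned independently, a sketch like this could be distributed across processes by contig and strand; the uniqueness pass would then run once over the pooled results.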
Multi-species site identification
I used ngg2 to identify all 3′GG gRNA motifs in six commonly studied organisms and two others: Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus, Homo sapiens, Sus scrofa, and Loxodonta africana. To enable reproducibility, I used a GNU Make script to download genomes and GTF gene annotations, calculate genome GC content, and annotate genes in R. The Makefile downloads the top-level or primary assembly genomes from Ensembl Release 79, runs ngg2 on all contigs for each FASTA file, and calculates GC content for each genome. I based the GC content of each genome on non-N base content. After identifying gRNA sites, I used R, particularly relying on the plyr, dplyr, tidyr, magrittr, GenomicRanges, and GenomicFeatures packages, to identify the overlap of each gRNA with gene exons and tabulate the number of genes overlapping at least one gRNA (Lawrence et al., 2013; R Core Team, 2014). A gRNA was considered to overlap a gene if at least one base of gRNA sequence overlapped at least one base of exonic sequence. The best case puts the cut site within the exon body and should certainly disrupt the gene. The worst case, a 1 bp overlap with the cut site in an intron, should still generate indels big enough to extend into the exon or to delete a canonical splice site. I calculated all summary statistics and generated ggplot2 figures using RStudio (v0.98.1102) Markdown with knitr (Xie, 2013).

RESULTS

3′GG gRNA sites are common in each species
Overall, I identified greater than 88 million 3′GG gRNA sites in the tested genomes (Table 1). Some of these gRNA sequences were not unique in a given genome, leaving more than 60 million unique 3′GG sites. Approximately 16 million of the 60 million unique sites were canonical G-starting motifs. The sites identified in each species, with the gRNA sequence, PAM sequence, genome coordinates, annotated overlapping genes, and number of perfect genome matches, are available for download (Roberson, 2015). The R scripts, Python files, and Make files are also available in a public repository for reproducibility.

Table 1 Count of gRNA classes in each species. All N18GGNGG motifs are included in the ‘All gRNAs’ section, while only canonical gRNAs starting with a G are in the ‘Canonical gRNAs’ section. The ‘All’ class accumulates all matching motifs for that section, while the ‘Unique’ class counts only sites with one exact match in the reference genome.

| Species | All gRNAs: All | All gRNAs: Unique | Canonical gRNAs: All | Canonical gRNAs: Unique |
|---|---|---|---|---|
| S. cerevisiae | 44,757 | 41,462 | 9,938 | 9,717 |
| C. elegans | 379,955 | 333,752 | 85,887 | 82,696 |
| D. melanogaster | 929,164 | 815,501 | 243,705 | 238,460 |
| D. rerio | 5,815,459 | 3,110,150 | 835,035 | 744,702 |
| M. musculus | 19,368,938 | 13,925,626 | 3,856,020 | 3,660,550 |
| S. scrofa | 18,711,809 | 12,716,221 | 4,145,116 | 3,558,512 |
| H. sapiens | 23,022,656 | 14,782,453 | 4,172,179 | 3,954,608 |
| L. africana | 20,276,122 | 14,929,328 | 4,075,522 | 3,893,752 |
| Total | 88,548,860 | 60,654,493 | 17,423,402 | 16,142,997 |

The genomes I analyzed had vastly different sizes, ranging from approximately 12 Mb for yeast to greater than 3 Gb for humans and elephants, and as a result had dramatically different numbers of 3′GG gRNA sites per genome. Therefore, I also assessed the site density per megabase of reference genome size (Table 2). Unique sites with a G starting base averaged a density of 1,218 sites/Mb, or 1 site per 821 bp. All unique sites averaged 4,210 sites/Mb, or 1 unique 3′GG gRNA site per 238 bp. D. rerio had the lowest density at 527 unique G-start sites/Mb, while D. melanogaster had the highest density at 1,659 unique G-start sites/Mb.
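The conversion between site density and average spacing quoted above is simple arithmetic, and a short sketch makes it explicit. The function names are hypothetical, and the per-species densities are the "All gRNAs, Unique" values reported in Table 2.

```python
def bp_per_site(sites_per_mb):
    """Average spacing between sites (bp) for a density given in sites/Mb."""
    return 1e6 / sites_per_mb


def implied_genome_mb(n_sites, sites_per_mb):
    """Genome size (Mb) implied by a site count and its density in sites/Mb."""
    return n_sites / sites_per_mb


# Unique 3'GG gRNA densities (sites/Mb) for all motifs, from Table 2.
unique_all_density = {
    "S. cerevisiae": 3410.52, "C. elegans": 3327.99,
    "D. melanogaster": 5674.00, "D. rerio": 2201.93,
    "M. musculus": 5099.33, "S. scrofa": 4527.72,
    "H. sapiens": 4768.92, "L. africana": 4670.14,
}

# Unweighted mean across species, as an illustration of the averaging.
mean_unique_all = sum(unique_all_density.values()) / len(unique_all_density)

print(round(mean_unique_all))                      # mean density, sites/Mb
print(round(bp_per_site(4210)))                    # spacing for all unique sites
print(round(bp_per_site(1218)))                    # spacing for unique G-start sites
print(round(implied_genome_mb(41462, 3410.52), 2)) # yeast genome size, Mb
```

Under these assumptions the unweighted mean of the per-species densities reproduces the 4,210 sites/Mb figure, and the yeast count and density together imply a genome of roughly 12 Mb, consistent with the size range stated above.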
The low density of unique sites in zebrafish may be due to genome complexity from previous duplication events.

Table 2 3′GG gRNA sites per megabase genome size. Reference genome size was determined from the species FASTA index. The number of unique 3′GG gRNA sites in the genomes is encouraging, with an average across all species of one unique site per kb of genome.

| Species | All gRNAs: All | All gRNAs: Unique | Canonical gRNAs: All | Canonical gRNAs: Unique |
|---|---|---|---|---|
| S. cerevisiae | 3,681.55 | 3,410.52 | 817.46 | 799.29 |
| C. elegans | 3,788.70 | 3,327.99 | 856.42 | 824.60 |
| D. melanogaster | 6,464.83 | 5,674.00 | 1,695.62 | 1,659.13 |
| D. rerio | 4,117.24 | 2,201.93 | 591.19 | 527.24 |
| M. musculus | 7,092.58 | 5,099.33 | 1,412.01 | 1,340.43 |
| S. scrofa | 6,662.50 | 4,527.72 | 1,475.90 | 1,267.04 |
| H. sapiens | 7,427.26 | 4,768.92 | 1,345.97 | 1,275.78 |
| L. africana | 6,342.71 | 4,670.14 | 1,274.89 | 1,218.03 |

Table 3 Run times with one and multiple CPUs. Profiling was performed using Python v2.7.3 using 1 or 10 processors on a server with Intel i7-3930K processors and 32 GB of RAM. Canonical gRNAs were searched for benchmark purposes. When possible, it is clearly advantageous to use multiple processors to accelerate gRNA searches.

| Species | Block: 1 CPU (s) | Block: 10 CPU (s) | Block: Delta | Exhaustive: 1 CPU (s) | Exhaustive: 10 CPU (s) | Exhaustive: Delta |
|---|---|---|---|---|---|---|
| Saccharomyces cerevisiae | 0.9 | 0.3 | −71% | 1.2 | 0.4 | −68% |
| Caenorhabditis elegans | 6.4 | 1.4 | −78% | 8.1 | 2.1 | −74% |
| Drosophila melanogaster | 67.8 | 12.7 | −81% | 71.7 | 13.6 | −81% |
| Danio rerio | 99.3 | 20.3 | −80% | 138.2 | 26.8 | −81% |
| Mus musculus | 186.0 | 47.7 | −74% | 284.1 | 66.6 | −77% |
| Sus scrofa | 536.4 | 111.1 | −79% | 633.2 | 126.7 | −80% |
| Homo sapiens | 207.4 | 53.9 | −74% | 306.2 | 71.6 | −77% |
| Loxodonta africana | 293.4 | 64.8 | −78% | 398.3 | 79.9 | −80% |

I profiled the performance of canonical G-start gRNA searches in each of the tested genomes for both block and exhaustive scans using both 1 and 10 CPUs (Table 3). The parallelization in this program is by contig and strand, so the maximum utilized number of threads would be twice the number of contigs.
Using 10 CPUs reduced runtimes by approximately 70–80% in all cases. It is worth noting that exhaustively scraping the human genome for canonical sites took only 71.6 s with 10 CPUs, and even the longest search, for Sus scrofa, took only 126.7 s using 10 CPUs.

Little strand bias observed for canonical 3′GG gRNA sites
The strand of each gRNA site with respect to the reference was included in the ngg2 output files. For each organism, I considered every gRNA site as an independent Bernoulli trial with a 50% probability of a “Sense” strand designation as a successful trial outcome (Table 4). 5/8 species showed strand bias for all gRNA sites (C. elegans, D. melanogaster, D. rerio, H. sapiens, L. africana). Only C. elegans and H. sapiens demonstrated strand bias significantly different from the expected ratio for canonical 3′GG sites. While the difference in strand selection is significant, it may be unimportant to editing site selection. Wildtype Cas9 cleaves both DNA strands simultaneously, and therefore the strand of the target sequence does not matter. Strategies that employ dual nickases to reduce off-target effects could be affected by such bias, as they require two separate gRNA sites on opposite strands (Ran et al., 2013a). The difference observed is less than 0.6% from the expected 50% ratio, and whether this functionally affects the ability to choose paired 3′GG gRNAs remains to be seen.

Table 4 Strand bias for gRNA sites. The gRNA type is either all 3′GG sites or only canonical G-starting gRNA sites. The estimate column is the estimated rate of positive strand selection observed. The p-value column tests whether the Bernoulli trial estimates differ significantly from a 50/50 strand selection, and the adjusted p-value is based on a Benjamini–Hochberg false-discovery rate correction.

| gRNA type | Species | Estimate | p-value | Adjusted p-value |
|---|---|---|---|---|
| All | Saccharomyces cerevisiae | 0.500 | 9.02E−01 | 1.00E+00 |
| All | Caenorhabditis elegans | 0.494 | 9.09E−12 | 1.36E−10 |
| All | Drosophila melanogaster | 0.498 | 8.86E−06 | 9.75E−05 |
| All | Danio rerio | 0.501 | 6.22E−04 | 6.22E−03 |
| All | Mus musculus | 0.500 | 6.52E−01 | 1.00E+00 |
| All | Homo sapiens | 0.501 | 9.59E−19 | 1.53E−17 |
| All | Loxodonta africana | 0.499 | 4.02E−06 | 4.83E−05 |
| All | Sus scrofa | 0.500 | 4.88E−01 | 1.00E+00 |
| Canonical | Saccharomyces cerevisiae | 0.501 | 8.00E−01 | 1.00E+00 |
| Canonical | Caenorhabditis elegans | 0.490 | 1.50E−10 | 2.10E−09 |
| Canonical | Drosophila melanogaster | 0.500 | 6.09E−01 | 1.00E+00 |
| Canonical | Danio rerio | 0.501 | 9.30E−02 | 7.44E−01 |
| Canonical | Mus musculus | 0.500 | 4.57E−02 | 4.11E−01 |
| Canonical | Homo sapiens | 0.501 | 2.01E−06 | 2.62E−05 |
| Canonical | Loxodonta africana | 0.500 | 9.11E−01 | 1.00E+00 |
| Canonical | Sus scrofa | 0.500 | 4.45E−01 | 1.00E+00 |

CGG & GGG PAM sites are underrepresented
I visualized the distribution of the four PAM sites (AGG, CGG, GGG, TGG) as a stacked bar chart of each site's proportion of the total identified sites in each species (Fig. 1). In general, the AGG and TGG sites represented the majority of 3′GG gRNA sites in all species. I tested whether PAM site distribution differed from chance based on the GC content of the reference genome. For each species, I considered each PAM site a Bernoulli trial, and defined success as either CGG or GGG site identity. The probability of success was set equal to the estimated genome-wide GC content calculated from the reference genome, excluding N bases (Table 5). None of the tested genomes met the expected GC success rate. The rate of picking a CGG or GGG PAM was less than the genome GC content in S. cerevisiae, M. musculus, Sus scrofa, Loxodonta africana, and H. sapiens. In particular, the estimate for M. musculus, H. sapiens, and Loxodonta africana was >10% different from the genome GC fraction. This is not necessarily unexpected. The CGG PAM site includes a 5′ CpG dinucleotide that is generally underrepresented due to the relatively high frequency of methyl-cytosine deamination to thymine. C. elegans, D. melanogaster, and D. rerio were the exceptions, with CGG and GGG PAM selection greater than the expected frequency. However, the C. elegans result may not be unexpected, as it lacks DNA methylation and would not necessarily gain an advantage from limiting CpG dinucleotides.

Figure 1 PAM site usage across tested species. Each species has four potential protospacer adjacent motifs (PAM) possible for identified gRNA sites. The stacked bar chart shows the fraction of all PAM sites each motif occupies. The CGG motif, which includes a CpG dinucleotide, is the least prevalent motif in the zebrafish, mouse, human, elephant, and pig genomes.

Most protein coding genes overlap at least one unique 3′GG gRNA
A common use of genome engineering is to knock out or otherwise modify the function of a protein coding gene. The efficiency of such edits is critical, as just introducing frame-shifting mutations can require screening a large number of single-cell clones or derived animals to identify a successful edit. As part of this study, I annotated, for each gRNA in the 8 species, whether there was any overlap with a gene. Conversely, I also annotated each gene with a count of how many sites in each of the four classes (all sites, all unique sites, canonical sites, and unique canonical sites) overlap it. No less than 89% of any species’ genes overlap at least one 3′GG gRNA, and no less than 61% overlap at least one unique 3′GG gRNA (Table 6). This catalog of potential sites demonstrates that most protein coding genes can be targeted by at least one 3′GG gRNA site to achieve high editing efficiency.
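The overlap rule used for the gene annotation (at least one base of gRNA sequence shared with at least one base of exonic sequence) can be sketched as follows. The published analysis used the GenomicRanges and GenomicFeatures packages in R; this Python version is only an illustration with a naive pairwise comparison, and the function names and toy coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-based, inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def fraction_of_genes_hit(exons_by_gene, grna_sites):
    """Fraction of genes with at least one exon overlapped by a gRNA site.

    exons_by_gene: dict mapping gene name -> list of (contig, start, end)
    grna_sites:    list of (contig, start, end) for the gRNA sequences
    """
    hit = 0
    for exons in exons_by_gene.values():
        # A gene counts once, no matter how many sites overlap its exons.
        if any(e_contig == g_contig and overlaps(e_start, e_end, g_start, g_end)
               for e_contig, e_start, e_end in exons
               for g_contig, g_start, g_end in grna_sites):
            hit += 1
    return hit / len(exons_by_gene)


# Toy, made-up coordinates for illustration only.
toy_exons = {
    "geneA": [("chr1", 100, 200)],   # overlapped by exactly 1 bp (base 100)
    "geneB": [("chr1", 500, 600)],   # nearest site starts at 601: no overlap
    "geneC": [("chr2", 100, 200)],   # same coordinates, different contig
}
toy_sites = [("chr1", 78, 100), ("chr1", 601, 623)]

print(fraction_of_genes_hit(toy_exons, toy_sites))
```

Restricting grna_sites to unique or canonical sites before the call would give the corresponding per-class fractions. For whole genomes, an interval tree or a sorted sweep would be needed in place of the pairwise loop.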
Table 5 PAM site frequency compared to genome GC content. The average genome GC content and the estimated chance of picking a GC PAM site (CGG or GGG) are shown for each species. GC content was calculated from the downloaded reference files.

| gRNA type | Species | Genome GC | Estimate | p-value | Adjusted p-value |
|---|---|---|---|---|---|
| All | Saccharomyces cerevisiae | 0.382 | 0.298 | 2.30E−301 | 1.10E−300 |
| All | Caenorhabditis elegans | 0.354 | 0.422 | 1.98E−323 | 3.01E−322 |
| All | Drosophila melanogaster | 0.420 | 0.452 | 3.46E−323 | 4.79E−322 |
| All | Danio rerio | 0.367 | 0.373 | 7.90E−218 | 3.20E−217 |
| All | Mus musculus | 0.417 | 0.298 | 1.58E−322 | 1.40E−321 |
| All | Homo sapiens | 0.409 | 0.251 | 1.68E−322 | 1.40E−321 |
| All | Loxodonta africana | 0.408 | 0.289 | 1.58E−322 | 1.40E−321 |
| All | Sus scrofa | 0.417 | 0.352 | 1.58E−322 | 1.40E−321 |
| Canonical | Saccharomyces cerevisiae | 0.382 | 0.284 | 1.00E−98 | 2.10E−98 |
| Canonical | Caenorhabditis elegans | 0.354 | 0.420 | 9.88E−324 | 1.58E−322 |
| Canonical | Drosophila melanogaster | 0.420 | 0.438 | 4.70E−81 | 4.70E−81 |
| Canonical | Danio rerio | 0.367 | 0.376 | 1.60E−102 | 4.80E−102 |
| Canonical | Mus musculus | 0.417 | 0.299 | 8.40E−323 | 1.10E−321 |
| Canonical | Homo sapiens | 0.409 | 0.250 | 8.40E−323 | 1.10E−321 |
| Canonical | Loxodonta africana | 0.408 | 0.288 | 8.40E−323 | 1.10E−321 |
| Canonical | Sus scrofa | 0.417 | 0.344 | 8.40E−323 | 1.10E−321 |

Table 6 Fraction of genes overlapping at least one gRNA. Ensembl GTF files were used to annotate overlap of gRNA sites with known genes. A gene was called as potentially cut if at least one gRNA overlapped at least 1 base with an exon of that gene. Most genes in the 8 species have at least one unique cut per gene.

| Species | All motifs: All | All motifs: Unique | Canonical motifs: All | Canonical motifs: Unique |
|---|---|---|---|---|
| S. cerevisiae | 0.93 | 0.90 | 0.65 | 0.62 |
| C. elegans | 0.96 | 0.83 | 0.81 | 0.68 |
| D. melanogaster | 0.99 | 0.97 | 0.91 | 0.89 |
| D. rerio | 0.89 | 0.61 | 0.74 | 0.42 |
| M. musculus | 0.99 | 0.96 | 0.90 | 0.84 |
| S. scrofa | 0.99 | 0.86 | 0.92 | 0.76 |
| H. sapiens | 0.98 | 0.93 | 0.92 | 0.84 |
| L. africana | 0.91 | 0.87 | 0.61 | 0.59 |

DISCUSSION
In this manuscript, I have described a new tool for identifying 3′GG gRNA sites and presented a catalog of potential editing sites in 8 species. Importantly, many genomic loci can be targeted by unique 3′GG gRNA sites. The efficiency of 3′GG gRNA sites in species other than C. elegans has yet to be established, but is worth further study. This tool reports the uniqueness of identified sites, but BLAST searching of potential gRNA sequences is warranted to identify near-match sites. It is also important to consider the target genome’s specific genotypes when designing a gRNA. In particular, a site whose PAM is altered away from NGG by a variant will not be cleaved by Cas9 even if the gRNA is an exact match.

The accuracy of editing can be improved by using two gRNAs and a mutant Cas9 nickase. I observed significant, but low-effect, strand bias in these genomes. This may lead to some loci not being compatible with paired 3′GG gRNA sites. When possible, choosing paired 3′GG gRNA sites should be strongly considered. Efficiencies of less than 10% were increased to 50% or greater by using the 3′GG strategy (Farboud & Meyer, 2015). As such, using paired 3′GG gRNAs with a nickase may give the best of both worlds, with both high accuracy and high efficiency.

It is important to note that ngg2 will operate on any indexed FASTA file. Many gRNA site finding tools are limited to catalogs of gRNA sites in model organisms. This tool fills an important gap for individuals working outside of commonly used species, as demonstrated by the use of ngg2 on the genomes of S. scrofa and L. africana. The provided gRNA site survey and associated tool, ngg2, represent a valuable resource for designing genomic modification strategies.

ACKNOWLEDGEMENTS
I wish to thank Dr. Li Cao for her helpful comments during the preparation of this manuscript, and Dr. Matthew Shirley for his suggested use of pyfaidx.
ADDITIONAL INFORMATION AND DECLARATIONS

Funding
A portion of effort spent on designing this software was supported under NIH P30 AR048335 as an activity of the Human Genomics and Bioinformatics Facility in the Washington University Rheumatic Disease Core Center. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the author: NIH: P30 AR048335.

Competing Interests
I have no competing interests related to this manuscript or tool.

Author Contributions
• Elisha D. Roberson conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper.

Data Availability
The following information was supplied regarding data availability:
https://github.com/RobersonLab/ngg2
https://github.com/RobersonLab/2015_ngg2_manuscript
http://dx.doi.org/10.6084/m9.figshare.1515944
REFERENCES
Barrangou R, Fremaux C, Deveau H, Richards M, Boyaval P, Moineau S, Romero DA, Horvath P. 2007. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315:1709–1712 DOI 10.1126/science.1138140.
Chen B, Gilbert LA, Cimini BA, Schnitzbauer J, Zhang W, Li G-W, Park J, Blackburn EH, Weissman JS, Qi LS, Huang B. 2013. Dynamic imaging of genomic loci in living human cells by an optimized CRISPR/Cas system. Cell 155:1479–1491 DOI 10.1016/j.cell.2013.12.001.
Deltcheva E, Chylinski K, Sharma CM, Gonzales K, Chao Y, Pirzada ZA, Eckert MR, Vogel J, Charpentier E. 2011. CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 471:602–607 DOI 10.1038/nature09886.
Doudna JA, Charpentier E. 2014. The new frontier of genome engineering with CRISPR-Cas9. Science 346:1258096 DOI 10.1126/science.1258096.
Farboud B, Meyer BJ. 2015. Dramatic enhancement of genome editing by CRISPR/Cas9 through improved guide RNA design. Genetics 199:959–971 DOI 10.1534/genetics.115.175166.
Gratz SJ, Ukken FP, Rubinstein CD, Thiede G, Donohue LK, Cummings AM, O’Connor-Giles KM. 2014. Highly specific and efficient CRISPR/Cas9-Catalyzed homology-directed repair in Drosophila. Genetics 196:961–971 DOI 10.1534/genetics.113.160713.
Heigwer F, Kerr G, Boutros M. 2014. E-CRISP: fast CRISPR target site identification. Nature Methods 11:122–123 DOI 10.1038/nmeth.2812.
Hsu PD, Lander ES, Zhang F. 2014. Development and applications of CRISPR-Cas9 for genome engineering. Cell 157:1262–1278 DOI 10.1016/j.cell.2014.05.010.
Jiang F, Doudna JA. 2015. The structural biology of CRISPR-Cas systems. Current Opinion in Structural Biology 30:100–111 DOI 10.1016/j.sbi.2015.02.002.
Kim S, Kim D, Cho SW, Kim J, Kim J-S. 2014. Highly efficient RNA-guided genome editing in human cells via delivery of purified Cas9 ribonucleoproteins. Genome Research 24:1012–1019 DOI 10.1101/gr.171322.113.
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ. 2013. Software for computing and annotating genomic ranges. PLoS Computational Biology 9:e1003118 DOI 10.1371/journal.pcbi.1003118.
Liu H, Wei Z, Dominguez A, Li Y, Wang X, Qi LS. 2015. CRISPR-ERA: a comprehensive design tool for CRISPR-mediated gene editing, repression and activation. Bioinformatics 31(22):3676–3678 DOI 10.1093/bioinformatics/btv423.
Makarova K, Grishin N, Shabalina S, Wolf Y, Koonin E. 2006. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biology Direct 1:7 DOI 10.1186/1745-6150-1-7.
Mali P, Esvelt KM, Church GM. 2013. Cas9 as a versatile tool for engineering biology. Nature Methods 10:957–963 DOI 10.1038/nmeth.2649.
Mojica FJM, Díez-Villaseñor C, García-Martínez J, Almendros C. 2009. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155:733–740 DOI 10.1099/mic.0.023960-0.
Montague TG, Cruz JM, Gagnon JA, Church GM, Valen E. 2014. CHOPCHOP: a CRISPR/Cas9 and TALEN web tool for genome editing. Nucleic Acids Research 42:W401–W407 DOI 10.1093/nar/gku410.
Naito Y, Hino K, Bono H, Ui-Tei K. 2015. CRISPRdirect: software for designing CRISPR/Cas guide RNA with reduced off-target sites. Bioinformatics 31:1120–1123 DOI 10.1093/bioinformatics/btu743.
R Core Team. 2014. R: a language and environment for statistical computing. 3.1.2 edition. R Foundation for Statistical Computing.
Ran FA, Hsu PD, Lin C-Y, Gootenberg JS, Konermann S, Trevino AE, Scott DA, Inoue A, Matoba S, Zhang Y, Zhang F. 2013a. Double nicking by RNA-Guided CRISPR Cas9 for enhanced genome editing specificity. Cell 154:1380–1389 DOI 10.1016/j.cell.2013.08.021.
Ran FA, Hsu PD, Wright J, Agarwala V, Scott DA, Zhang F. 2013b. Genome engineering using the CRISPR-Cas9 system. Nature Protocols 8:2281–2308 DOI 10.1038/nprot.2013.143.
Roberson E. 2015. Homo sapiens cuts per gene annotated for 3 prime GG motif gRNAS - exhaustive scan. Available at http://dx.doi.org/10.6084/m9.figshare.1515944.
Shirley M, Ma Z, Pedersen B, Wheelan S. 2015. Efficient “pythonic” access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196 DOI 10.7717/peerj.1196.
Stemmer M, Thumberger T, Del Sol Keyer M, Wittbrodt J, Mateo JL. 2015. CCTop: an intuitive, flexible and reliable CRISPR/Cas9 target prediction tool. PLoS ONE 10:e0124633 DOI 10.1371/journal.pone.0124633.
Xiao A, Cheng Z, Kong L, Zhu Z, Lin S, Gao G, Zhang B. 2014. CasOT: a genome-wide Cas9/gRNA off-target searching tool. Bioinformatics 30:1180–1182 DOI 10.1093/bioinformatics/btt764.
Xie Y. 2013. Dynamic documents with R and knitr. Boca Raton: Chapman and Hall/CRC.
Xie K, Minkenberg B, Yang Y. 2015. Boosting CRISPR/Cas9 multiplex editing capability with the endogenous tRNA-processing system. Proceedings of the National Academy of Sciences of the United States of America 112:3570–3575 DOI 10.1073/pnas.1420294112.