URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.

Science and Technology Sources on the Internet

Guide to Selected Bioinformatics Internet Resources

Christy Hightower
Engineering Librarian
Science and Engineering Library
University of California Santa Cruz
christyh@cats.ucsc.edu

Introduction to Bioinformatics
Scope of this Guide
Definitions, Glossaries, and Dictionaries
News/Keeping Current
Sequence and other Non-Bibliographic Databases
Software
Comprehensive Web Sites
Bibliographic Databases
Technical Reports and Preprints
Major Conferences and Symposia
Important Organizations
Guides, Tutorials and Primers
Recommended Reading
References

Introduction to Bioinformatics

The tremendous interest in bioinformatics, a new discipline at the intersection of molecular biology and computer science, is fueled by the excitement surrounding the sequencing of the human genome and the promise of a new era in which genomic research dramatically improves the human condition. Advances in detection and treatment of disease and the production of genetically engineered foods are among the most often mentioned benefits. Bioinformatics is a fertile new area for programmers. As the eminent computer scientist Donald Knuth is often quoted as saying: "Biology easily has 500 years of exciting problems to work on" (Doernberg 1993).

The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as:

"Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline...There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."

Damian Counsell's Bioinformatics FAQ (2001) puts it more simply. "I would say most biologists talk about 'doing bioinformatics' when they use computers to store, retrieve, analyze or predict the composition or the structure of biomolecules. As computers become more powerful you could probably add simulate to this list of bioinformatics verbs. 'Biomolecules' include your genetic material---nucleic acids---and the products of your genes: proteins."

While the terms bioinformatics and computational biology are often used interchangeably, medical informatics is another field entirely. "Medical informatics generally deals with 'gross' data, that is information from super-cellular systems, right up to the population level, while bioinformatics tends to be concerned with information about cellular and biomolecular structures and systems." (Counsell 2001)

For more information, see the Definitions, Glossaries and Dictionaries and the Recommended Reading sections of this guide.

Scope of this Guide

Because of the potential for this field to sweep a great deal of both computer science and molecular biology under its wing, this guide is by necessity very selective rather than comprehensive. There is a focus on human rather than plant or animal data sources, and the ethical, business, political and legal aspects of bioinformatics and genomics are completely ignored (except for their appearance in the news sites). The resources selected are aimed at the college and research level. Furthermore, due to the large number of databases and web-based resources on the subject, only the best or best known in each category were chosen. (For an idea of the size of the problem, consider that the January 1, 2002 issue of Nucleic Acids Research lists 335 molecular biology databases that might be considered relevant to bioinformatics.) And although not intentional, the author's American academic perspective may have colored the selection of data sources. Bioinformatics is a particularly international subject, with a notably high degree of information sharing among researchers in different countries (not to mention a strong tradition of making this information freely available to the public). The Human Genome Project was a particularly good example of this multinational collaboration. In fact, the same data are often available from similar but slightly differing databases located in different countries. For example, GenBank (at the National Center for Biotechnology Information), together with the DNA Data Bank of Japan and the European Molecular Biology Laboratory (EMBL), comprise the International Nucleotide Sequence Database Collaboration. These three organizations exchange data on a daily basis. While this sharing is highly admirable from a scientific standpoint, it does add to the sense of information overload and confusion for non-specialist librarians who approach this subject.

To find the resources listed in this webliography, the author read the books and articles listed in the Recommended Reading section of this guide, and consulted with graduate students and faculty in the bioinformatics program at the University of California Santa Cruz and with other academic librarians with interests in the field. The resources that bioinformatics faculty web pages point to were reviewed, as were the search results from prominent search engines such as Google using the most likely keywords. Many of the resources listed by the Comprehensive Web Sites themselves were also assessed. The annual list of molecular biology databases from the journal Nucleic Acids Research was reviewed. In November 2001, in order to better understand the NCBI databases in particular, the author also attended the Medical Library Association's day-long "Molecular Biology Information Resources" continuing education course (see http://www.ncbi.nlm.nih.gov/Class/MLACourse/), which was taught by a specialist from the National Library of Medicine.

Definitions, Glossaries, and Dictionaries

Definitions

A quick review of the basic genetic terms and concepts will help in understanding the sequence databases. The NCBI Genetics Review site is highly recommended reading since it provides a particularly good overview of the concepts as well as listing some good references for additional information ({http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/}). The following terms are central to understanding bioinformatics:

Nucleotide:: One of the structural components, or building blocks, of DNA and RNA. A nucleotide consists of a base (one of four chemicals: adenine, thymine [uracil instead of thymine for RNA], guanine, and cytosine) plus a molecule of sugar [ribose for RNA, deoxyribose for DNA] and one of phosphoric acid (from the National Human Genome Research Institute (NHGRI) Glossary of Genetic Terms {http://www.genome.gov/glossary.cfm}).
Gene:: A length of DNA which codes for a particular protein, or in certain cases a functional or structural RNA molecule (from PhRMA Genomics Lexicon {http://genomics.phrma.org/lexicon/}). Less than 5% of the human genome codes for genes. The rest are non-coding sequences which may have other functions.
Genome:: The complete gene complement of an organism, contained in a set of chromosomes (in eukaryotes), in a single chromosome (in bacteria), or in a DNA or RNA molecule (in viruses) (from Academic Press Dictionary of Science and Technology {http://www.harcourt.com/dictionary/}).
Genomics:: Operationally defined as investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion (from What is Genomics? {http://www.genomecenter.ucdavis.edu/what.html}). Genetics looks at single genes, one at a time, as a snapshot. Genomics is trying to look at all the genes as a dynamic system, over time, and determine how they interact and influence biological pathways and physiology, in a much more global sense (from Basic Genetics & Genomics http://www.genomicglossaries.com/content/Basic_Genetic_Glossaries.asp).
Proteome:: The complement of proteins expressed by an organism, tissue or cell type (from Proteomes and Proteomics {http://www.mrc-dunn.cam.ac.uk/pages/proteomes.html}). The concept of the proteome is fundamentally different to that of the genome: while the genome is virtually static and can be well defined for an organism, the proteome continually changes in response to external and internal events (from Thinking Big: Proteome Studies in a Post-Genome Era {http://www.abrf.org/ABRFNews/1996/December1996/Proteome.html}).
Proteomics:: The study of the full set of proteins encoded by a genome (from the Human Genome Project Information Glossary - {http://www.ornl.gov/sci/techresources/Human_Genome/glossary/}). The characterisation of patterns of gene expression at the protein level or the link between proteins and genomes. Proteomics encompasses many different approaches to protein study, from bioinformatics of protein content of genomes to large scale direct protein analysis of complicated protein mixtures, and the definition of a protein's properties, their interactions and modifications (from Proteomes and Proteomics {http://www.mrc-dunn.cam.ac.uk/pages/proteomes.html}).
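
To make the relationship between nucleotides, genes and proteins in the definitions above more concrete, here is a minimal Python sketch. It is an illustration only and is not taken from any of the glossaries cited: it assumes the standard genetic code, assumes the input is the coding strand of a DNA sequence read from its first base, and uses a made-up nine-base sequence. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks a stop codon. (Assumption: the standard code applies.)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA form of a DNA coding strand (uracil replaces thymine)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a DNA coding strand into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# A made-up sequence: start codon (ATG), one more codon, then a stop codon.
example_dna = "ATGGCCTAA"
print(transcribe(example_dna))
print(translate(example_dna))
```

Real genes are far longer and, in eukaryotes, are interrupted by non-coding regions, so this sketch only conveys the basic idea that three nucleotides specify one amino acid.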

Glossaries and Dictionaries

Science Magazine: Functional Genomics Resources: "Finding the right word: A guide to some useful online glossaries" Post-genomics, biotech and bioinformatics - {http://www.sciencemag.org/site/feature/plus/sfg/resources/index.xhtml}: An excellent selective list, ranked by the site's editors, of the ten "best" online glossaries. See also glossaries on related topics at this site.
Access Excellence Graphics Gallery - {http://www.accessexcellence.org/AB/GG/}: "Graphics Gallery is a series of labeled diagrams with explanations representing the important processes of biotechnology. Each diagram is followed by a summary of information, providing a context for the process illustrated."
Genomics Glossary - http://www.genomicglossaries.com/: Actually a collection of several glossaries and taxonomies, including a Bioinformatics Glossary at http://www.genomicglossaries.com/content/Bioinformatics_gloss.asp. The Scout Report and Science Magazine give this resource very high praise, but this author found the site to be cluttered and difficult to navigate, although the content is very good.
Human Genome Project Information Glossary - {http://www.ornl.gov/sci/techresources/Human_Genome/glossary/}: A useful glossary of genetics terms from the DOE Human Genome Program that you can both browse and search.
National Human Genome Research Institute (NHGRI) Glossary of Genetic Terms - {http://www.genome.gov/glossary.cfm}: This is sometimes called the "talking glossary" since audio clips allow you to hear definitions and longer explanations given by an expert. Try it with the word "nucleotide." Illustrations are also sometimes available.
PhRMA Genomics Lexicon - {http://genomics.phrma.org/lexicon/}: This extensive glossary is sponsored by the Pharmaceutical Research and Manufacturers of America. Also provides links to other dictionaries and glossaries.

News/Keeping Current

Sequence and Other Non-Bibliographic Databases

Introduction
Database Directories and Lists
Nucleotide Sequences
Genome Databases
Protein Sequences
Protein Structure

Introduction

Sequence and other non-bibliographic databases are the central, most important type of information resource in this field. The multiplicity of databases makes selection confusing, and the databases themselves can be challenging to understand and navigate. Nomenclature is not standard. Data formats/metadata schemes are not standard. Databases struggle with data redundancy and charges that they contain a lot of "junk." There are a lot of interrelated pieces of information surrounding a gene (genome location, structure, sequence, expression information, chemistry, etc.) or a protein, which lead to somewhat complicated database structures and links to related databases which may or may not be intuitive. There are the additional requirements that 3D structures place on metadata and the increasing volume of sequence data pouring into these databases (take a look at the exponential growth of GenBank at http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html). A great deal of this growth is in the form of direct submissions that may or may not have been peer reviewed. Whether peer reviewed or not, the data are subject to frequent changes and updates as new information becomes available. There are many approaches to solving these problems, which means multiple data structures, multiple search interfaces and many specialized databases.

As mentioned earlier, there are hundreds of databases that might be considered relevant to bioinformatics. There are specialized databases for each species, and separate databases for different types of information (nucleic acid sequences, protein sequences, protein structures, biochemical and biophysical information, etc.). There is also a great redundancy of databases, with multiple databases covering nearly the same information for the same organisms. This situation arose in part from many researchers developing their own databases in their own formats over the years, and from databases developing in parallel in Europe, Japan and the United States. The situation is further complicated by the existence of several versions or mirrors of the same database on different servers (each with varying degrees of currency or completeness), and by the sharing of records between databases. For example, the Entrez search system draws data from SWISS-PROT but only includes SWISS-PROT records for proteins that are based upon nucleotide sequence data that meet the criteria for inclusion in GenBank. A search of SWISS-PROT through another interface may retrieve more records as well as more detail in each record.

The following list of databases is intended to orient the reader to the major databases. To become a proficient searcher in each database takes considerable training, which is beyond the scope of this guide. This database list is highly selective, including only a few representatives of each type. Emphasis is placed on the larger, better known databases, on the free public databases, and on those that cover human data. Grouping databases by type is a common and useful way of organizing them, but many databases provide more than one type of information to the user so bear in mind that this classification is not precise.

A review of the basic genetic terms and concepts is highly recommended before approaching the sequence databases. See the Definitions, Glossaries, and Dictionaries and the Guides, Tutorials and Primers sections of this guide for recommended sources.

Database Directories and Lists

Nucleotide Sequences

GenBank - http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html, and Entrez Nucleotides Database - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

GenBank is the nucleotide sequence database built and distributed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. As of this writing, GenBank contains more than 13 billion bases from over 100,000 species, and is growing exponentially (see http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html). The data are obtained through direct submission of sequence data from individual laboratories, from large-scale sequencing projects, and from the US Patent and Trademark Office. A little more than half of the total sequences in the database are from Homo sapiens.

There are two ways to search GenBank: a text-based query can be submitted through the Entrez system at {http://www.ncbi.nlm.nih.gov/Entrez/index.html}, or a sequence query can be submitted through the BLAST family of programs (see http://www.ncbi.nlm.nih.gov/BLAST/). To search GenBank through the Entrez system you would select the Nucleotides database from the menu. The Entrez Nucleotides Database is a collection of sequences from several sources, including GenBank, RefSeq, and the Protein Databank, so you don't actually search GenBank exclusively. Searches of the Entrez Nucleotides database query the text and numeric fields in the record, such as the accession number, definition, keyword, gene name, and organism fields, to name just a few. So, for example, you could enter the terms Bacillus anthracis and you would be presented with many records that contain and describe nucleotide or protein sequences related to the anthrax bacterium. The accession number is very handy, because it is a unique and persistent identifier for the GenBank entry as a whole and doesn't change even if there is a later change or update to the sequence or annotation. Nucleotide sequence records in the Nucleotides database are linked to the PubMed citation of the article in which the sequences were published. Protein sequence records are linked to the nucleotide sequence from which the protein was translated. To become an effective searcher of this database takes study. For starters, take the Nucleotides database online tutorial that starts at {http://www.ncbi.nlm.nih.gov/Database/tut1.html}, and consult the other resources available from the NCBI Education Page at {http://www.ncbi.nlm.nih.gov/Education/}. See also the Recommended Reading section of this guide.
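
For readers who would like to see what a fielded text query looks like outside the web form, the following Python sketch builds a search URL for the Bacillus anthracis example. It relies on assumptions that go beyond this guide: it uses NCBI's E-utilities "esearch" service, which is a programmatic interface not covered here, and the function name and the limit of 20 results are purely illustrative. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities "esearch" endpoint, which is separate
# from the Entrez web pages described in this guide.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_query(organism, database="nucleotide", max_results=20):
    """Build a text-search URL that restricts the term to the organism field."""
    term = f"{organism}[Organism]"  # field tag limits the match to one field
    params = {"db": database, "term": term, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_entrez_query("Bacillus anthracis")
print(url)
```

Restricting the term to a single field is one way to cut down on the many loosely related records that an unqualified search for the same words can return.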

If you have obtained a record through a text-based Entrez Nucleotides Database search you can read the nucleotide sequence in the record. However, most researchers wish to submit a nucleotide sequence of interest to find the sequences that are most similar to theirs. This is done using the BLAST (Basic Local Alignment Search Tool) programs. You select the BLAST program you wish to use depending upon the type of comparison you are doing (nucleotide to nucleotide, or nucleotide to protein sequence, etc.) and then you select the database to run the query in (any of several nucleotide or protein databases). Many NCBI databases accept BLAST searches, as do many of the other databases covered elsewhere in this guide. The result is a detailed report that summarizes your query, provides a graphical overview of database matches, indicates the statistical significance of the matches and describes each significant alignment. From here you can link to the full database record for the individual matches. You can learn more about BLAST searching from the NCBI BLAST educational page at {http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html} (read the online tutorial).
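
The idea of a "local alignment" can be illustrated with a short Python sketch. This is not how BLAST is implemented: BLAST uses a much faster heuristic, while the code below is a plain version of the classic Smith-Waterman dynamic programming method, which finds the best-scoring pair of matching regions between two sequences. The scoring values (2 for a match, -1 for a mismatch, -2 for a gap) and the sequences are arbitrary choices made for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman dynamic programming, keeping only two rows of the
    score matrix. A cell never drops below zero, which is what makes
    the alignment "local": a poor region simply restarts the score.
    """
    previous_row = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current_row = [0]
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                              # start a new alignment here
                previous_row[j - 1] + pair_score,  # align a[i-1] with b[j-1]
                previous_row[j] + gap,          # gap in sequence b
                current_row[j - 1] + gap,       # gap in sequence a
            )
            current_row.append(score)
            best = max(best, score)
        previous_row = current_row
    return best


# Rank a tiny, made-up "database" of sequences against a query.
query = "TTACGTTT"
database = {"seq1": "GGACGTGG", "seq2": "CCCCCCCC", "seq3": "TTACGTTT"}
ranked = sorted(
    database,
    key=lambda name: local_alignment_score(query, database[name]),
    reverse=True,
)
print(ranked)
```

A real BLAST report goes further than a raw score: as noted above, it also estimates the statistical significance of each match, which this sketch does not attempt.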

EMBL Nucleotide Sequence Database - http://www.ebi.ac.uk/embl/

"The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis."

From the home page you can submit simple text searches to the EMBL Nucleotide Sequence Database, or to the Protein Databank (what you search when you select protein structures from the menu) or to a protein sequence database called Swall. For more complex searches, they recommend accessing the databases through the Sequence Retrieval System (SRS) server (http://srs.ebi.ac.uk/). SRS is a database querying / navigation system, similar in function to the Entrez system. It allows you to simultaneously search across several databases and to display the results in many ways. SRS can be used to access a large number of databases, including EMBL, SWISS-PROT and the Protein Databank, depending upon the configuration of the particular SRS server you are using. The structure and content of an EMBL Nucleotide record is very similar to that of an NCBI Entrez Nucleotide database record.

Genome Databases

Entrez Genome - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome

"The whole genomes of over 800 organisms can be found in Entrez Genomes. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses and organelles." Text searches can be done from the main page. Data can also be accessed alphabetically by species {http://www.ncbi.nlm.nih.gov:80/PMGifs/Genomes/allorg.html}), or hierarchically by drilling down through a taxonomic list to a graphical overview for the genome of that organism, then to specific chromosomes, then on to specific genes. At each level are maps, pre-computed summaries, analysis appropriate to that level, and links to related records from a variety of other Entrez databases. BLAST searches of some genomes are also possible.

Very useful pages for some of the most commonly studied species (e.g., human, mouse, fruit fly, malarial parasite) can be found on the Genomic Biology page under "organism-specific resources" (http://www.ncbi.nlm.nih.gov/Genomes/). These pages are so detailed that each could be classified as a comprehensive web site in itself. Each one brings together links to the genomic data, useful tools, related data sources and news about the genome of that species. The Human Genome Guide (http://www.ncbi.nlm.nih.gov/genome/guide/human/) is particularly rich.

Human Genome Browser from UCSC - http://genome.ucsc.edu/

"The sequence of the human genome is too big to see at all at once; few people want to look at raw DNA sequence anyway. The alternative is the Human Genome Browser for a quick display of any requested portion of the genome at any scale, along with more than two dozen tracks of information (genes, ESTs, CpG islands, assembly gaps, chromosomal band, ...) associated with the complete human genome sequence... Clicking on a displayed feature opens a second window providing protein sequence, coordinates and accession numbers, as appropriate. Clicking in the corner of the display calls up raw DNA sequence corresponding to the display window boundaries. This look-up feature is far more convenient than manual retrieval of a precise coordinate range from GenBank entries."

The Genome Database (GDB) - {http://www.gdb.org/}

The Genome Database is the official central repository for genomic mapping data resulting from the Human Genome Initiative. The database contains three types of data: (1) regions of the human genome, including genes, clones, and ESTs, (2) maps of the human genome, including cytogenetic maps, linkage maps, radiation hybrid maps, content contig maps, and integrated maps (these maps can be displayed graphically via the Web), and (3) variations within the human genome including mutations and polymorphisms, plus allele frequency data. There are options to browse genes by chromosome, genes by symbol name, and genetic diseases by chromosome. There are multiple ways to search, including text-based searches for people, citations, segment names or accession numbers, and sequence searching via BLAST.

KEGG: Kyoto Encyclopedia of Genes and Genomes - {http://www.genome.jp/kegg/}

This database often appears in Google search results, so let's put it in context. Despite the name, this is actually a biochemical pathway database and gene catalog, not an encyclopedia in the book sense. "The primary objective of KEGG is to computerize the current knowledge of molecular interactions; namely, metabolic pathways, regulatory pathways, and molecular assemblies. At the same time, KEGG maintains gene catalogs for all the organisms that have been sequenced and links each gene product to a component on the pathway. Because we need an additional catalog of building blocks, KEGG also organizes a database of all chemical compounds in living cells and links each compound to a pathway component."

Protein Sequences

SWISS-PROT - {http://web.expasy.org/groups/swissprot/}

"SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases." "The data in Swiss-Prot are derived from translations of DNA sequences from the EMBL Nucleotide Sequence Database, adapted from the Protein Identification Resource (PIR) collection, extracted from the literature and directly submitted by researchers. It contains high-quality annotations, is non-redundant, and cross-referenced to several other databases, notably the EMBL nucleotide sequence database, PROSITE pattern database and PDB."

From the home page, a quick text search can be done by accession or ID number, description, gene name, or organism. By searching SWISS-PROT through the Sequence Retrieval System (SRS) more sophisticated searches can be performed and the format of the results can be customized. Access to SWISS-PROT (directly or via SRS) and links to many other proteomics resources are available from the ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) at {http://us.expasy.org/}. The SWISS-PROT records are quite detailed. Be advised that other databases or search systems that import SWISS-PROT data may not always provide access to the entire SWISS-PROT record.

Entrez Protein Database - http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?dB=Protein

"The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to PIR, SWISS-PROT, PRF, and the Protein Data Bank (PDB) (sequences from solved structures)." The native SWISS-PROT records usually contain more detailed annotations than will be obtained from Entrez Protein Database records derived from SWISS-PROT records. In typical Entrez fashion, results from a search of the Protein database link to PubMed, to the taxonomy database, to related sequences, and in some cases to pre-computed BLAST search results (look for BLink links).

Protein Information Resource - International Protein Sequence Database (PIR-PSD) - http://pir.georgetown.edu/

In 1988 the Protein Information Resource (PIR), which is affiliated with Georgetown University Medical Center, established a cooperative effort with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID) to collect, publish and distribute the PIR-International Protein Sequence Database (PIR-PSD). They describe the database as "a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain". Text searches can be done in the title, species, author, citation, keyword, superfamily, feature and gene name fields. Gapped-BLAST sequence similarity searches are also an option. Note that both SWISS-PROT and the Entrez Protein database contain data adapted from the PIR.

Protein Structure

Protein Data Bank (PDB) - http://www.rcsb.org/pdb/

The PDB was established at Brookhaven National Laboratory in 1971, making it the first public bioinformatics database. The PDB is now operated by the Research Collaboratory for Structural Bioinformatics (RCSB), which is a collaborative effort of the San Diego Supercomputer Center, Rutgers University, and the National Institute of Standards and Technology (NIST). The PDB is a repository of experimentally determined three-dimensional structures of biological macromolecules (proteins, enzymes, nucleic acids, protein-nucleic acid complexes, and viruses) derived from x-ray crystallography and NMR experiments (see {http://www.rcsb.org/pdb/experimental_methods.html} for a helpful overview of these methods). Depositing structures obtained from theoretical models is discouraged. Data are deposited by the international user community and maintained by the RCSB PDB staff. Approximately 50-100 new structures are deposited each week. A variety of information associated with each structure is available, including "sequence details, atomic coordinates, crystallization conditions, 3-D structure neighbors computed using various methods, derived geometric data, structure factors, 3-D images, and a variety of links to other resources."

There are three ways to search the PDB. The SearchLite interface accepts text queries using Boolean operators, and searches text fields such as the author, compound, molecule class, and keywords fields. The SearchFields interface is an advanced search option that allows you to choose specific fields in which to search and to apply various limits. It also allows you to customize the format of the results. The third search method requires leaving the PDB site, going to the NCBI Entrez site and performing an NCBI BLAST sequence search with "pdb" selected as the target database. See the notes on protein BLAST searching at {http://www.ncbi.nlm.nih.gov/blast/html/BLASThomehelp.html#AABLAST}.

Since this is such an old database, historic inconsistencies in the way data are reported within PDB records may lead to unexpected or incomplete results when searching, particularly for text-based information. Certain keywords, like alpha, are not properly searchable. For example, looking for alpha hemolysin fails to find anything, but a search on hemolysin alone results in ten hits, including 7AHL, which is alpha hemolysin. The PDB file format itself also has numerous flaws, but remains the most widely accepted format for structural data. The database producers are aware of these problems and are working to solve them. Several software packages can be used to view PDB files in 3D, including the RasMol and Chime browser plug-ins and Deep-View. For more information see the PDB Query Tutorial at {http://www.rcsb.org/pdbstatic/tutorials/LargeBeta.swf} and the PDB Documentation and Information page at {http://www.rcsb.org/pdb/info.html#General_Information}. See also the entry for the MMDB below, which is a subset of the PDB with some added features.
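
To give a sense of what the PDB file format looks like, the following Python sketch reads the atomic coordinates from a single ATOM line. The PDB format is a fixed-column text format, so each value is found by its character position rather than by a separator. The column positions used here follow the published PDB format description as this editor understands it, not anything stated in this guide, and the sample line (including its coordinates) is invented for illustration. The function name is hypothetical.

```python
# One made-up ATOM line, assembled piece by piece so that the fixed
# column positions are visible. Column numbers are 1-based.
SAMPLE_ATOM_LINE = (
    "ATOM  "      # columns 1-6:   record type
    "    1"       # columns 7-11:  atom serial number
    " "           # column 12:     blank
    " N  "        # columns 13-16: atom name
    " "           # column 17:     alternate location indicator
    "ALA"         # columns 18-20: residue name
    " "           # column 21:     blank
    "A"           # column 22:     chain identifier
    "   1"        # columns 23-26: residue sequence number
    " "           # column 27:     insertion code
    "   "         # columns 28-30: blank
    "  11.104"    # columns 31-38: x coordinate
    "   6.134"    # columns 39-46: y coordinate
    "  -6.504"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    "  0.00"      # columns 61-66: temperature factor
    "          "  # columns 67-76: blank
    " N"          # columns 77-78: element symbol
)


def parse_atom_line(line):
    """Extract a few fields from one ATOM or HETATM line by column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE_ATOM_LINE)
print(atom)
```

Viewers such as those mentioned above read thousands of lines like this one to draw a structure. The reliance on exact character positions is also one reason why small inconsistencies in older records can cause trouble.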

MMDB: Molecular Modeling DataBase - http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

The MMDB is NCBI's structure database. It is a subset of three-dimensional structures obtained from the Protein Databank (PDB), excluding theoretical models. MMDB adds value through the addition of explicit chemical graph information and through the cross-linking of structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. The explicit bond information makes for more consistent interpretation of the coordinate data by visualization software. MMDB can provide data for three different structure viewers: Cn3D, a viewer developed by the NCBI; RasMol; and MAGE. All three are available for a variety of platforms (Windows, MacOS, UNIX). After installing the software, the 3-dimensional structure can be viewed by clicking the button labeled View/Save Structure close to the bottom of each structure summary.

The structure database may be queried directly, using accession numbers or text terms such as author names, protein names, species names or publication dates. The result will yield "Structure Query" pages, providing access to entries which matched the keywords. From the Structure Summary pages of an individual matching entry one may access amino acid and nucleic acid sequences, retrieve PubMed documents, get taxonomy information, and launch the software to view the 3D image.

The MMDB documentation also notes that:

"The structure database is considerably smaller than Entrez's protein or nucleotide databases, but a large fraction of all known protein sequences have homologs in this set, and one may often learn more about a protein by examining 3-D structures of its homologs. Protein sequences from MMDB are extracted and available in the Entrez protein sequence database. They are linked to the 3-D structures, therefore it is possible to determine whether a protein sequence in Entrez has homologs amongst known structures by examining its Related Sequences or Protein Neighbors and checking whether this set has any Structure Links."

Software Directories

Bioinformatics Software Resource (BISR) - {http://bioinfo.nist.gov/BISR/}: A catalog and clearinghouse of links to bioinformatics and computational biology software and resources. Over 400 packages are currently available, more than 70% of the software is free, and a variety of operating systems are supported. This database is maintained by the Chemical Science and Technology Laboratory of the National Institute of Standards and Technology (NIST).
Database Searching, Browsing and Analysis Tools - http://www.ebi.ac.uk/Tools/index.html: A list of software tools (programs) you can use via the web to submit queries to the sequence databases and to analyze the results of those queries. This list is from the European Bioinformatics Institute. See also the ExPASy Proteomics Tools list below.
ExPASy Proteomics Tools - {http://www.expasy.org/links.html}: Tools for proteomics that may be used over the web, covering such categories as protein identification and characterization, similarity searches, secondary structure prediction, and sequence alignment.
Genamics SoftwareSeek - http://genamics.com/software/index.htm: A repository and database of over 1200 free and commercial tools for use in molecular biology and biochemistry. Windows, MS-DOS, Mac, Unix and Linux platforms are supported, as well as online tools that run through your Internet browser. You may browse by category (such as DNA sequence analysis, molecular modeling, or protein structure prediction) or you may search by platform, program name or keyword.
Freshmeat Open Source Software Repository - http://freshmeat.net/: This database of UNIX and cross-platform open source software is a good source for molecular modeling and visualization programs and contains a smattering of bioinformatics applications. Each entry provides a history of the project's releases (very useful for spotting stale code) and a popularity ranking. See also Open Source Software Promoters.

Tools in Specific Programming Languages

Bioinformatics makes use of a number of programming and markup languages, including C++, Perl, Java, Python, Ruby, Lisp and XML. Worth noting here is the development of the various Bio*.org projects that now cluster under the umbrella group called the Open Bioinformatics Foundation (http://www.open-bio.org/), which was incorporated in October 2001. Each of these projects is an international association of developers of open source tools (software programs or program modules) for bioinformatics, genomics and life science research written in a particular language. Each association attempts to archive, mirror or provide pointers to any and all biology-related code in its specific language that is freely available for download. BioPerl.org was the first of these projects.

BioPerl - {http://bioperl.org/}
BioPython - {http://biopython.org//wiki/Biopython}
BioJava - {http://biojava.org/wiki/Main_Page}
BioDAS - {http://www.biodas.org/wiki/Main_Page}
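To give a sense of the small, reusable routines that such toolkits collect, here is a minimal sketch in plain Python. It does not use the API of BioPython or of any other Bio* project. The function names `reverse_complement` and `gc_content` are hypothetical and chosen only for illustration.

```python
# Translation table mapping each DNA base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the fraction of bases in `seq` that are G or C (0.0 for empty input)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


if __name__ == "__main__":
    dna = "ATGCCGTA"
    print(reverse_complement(dna))
    print(gc_content(dna))
```

The Bio* libraries provide tested and far more complete versions of operations like these, along with parsers for the major sequence database formats.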

Open Source Software Promoters

Just as bioinformatics researchers have been remarkably open (some would say advanced) in making their sequence data freely available to the public, many bioinformatics programmers want to do the same with their program source code. In this regard they join many other computer scientists and programmers in supporting open source software. OpenSource.org (http://www.opensource.org/) puts the benefits of the open model this way: "When programmers can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, people fix bugs. And this can happen at a speed that, if one is used to the slow pace of conventional software development, seems astonishing." In addition to the Open Bioinformatics Foundation ({http://www.open-bio.org/wiki/Main_Page}) mentioned in the previous section, Bioinformatics.org: The Open Lab (http://bioinformatics.org/) and OpenInformatics.org ({http://www.openinformatics.org/}) are two other organizations of note dedicated to promoting open source software among bioinformatics and life science researchers.

Comprehensive Web Sites

Bibliographic Databases

PubMed - {http://www.ncbi.nlm.nih.gov/pubmed}

For medical bibliographic citations this is the place to go. PubMed is the public interface to the medical literature database (MEDLINE) produced by the National Library of Medicine. PubMed provides access to over 11 million MEDLINE citations for articles and conference papers back to the mid-1960s. There are links to many sites providing full text articles (some for free) and PubMed citations link to Entrez nucleotide, protein and structure records when available. Unfortunately, PubMed currently supports searching by Chemical Abstracts Service (CAS) Registry Numbers (RNs) only in a very limited way: the dictionary of RNs supported in PubMed is small and does not currently extend to sequences found in other parts of the Entrez system. The PubMed interface is a rich and somewhat complicated one that requires some study to use efficiently. PubMed with its Entrez links provides an almost one-stop-shopping experience, and is an amazingly rich resource for medical and genetics data.

INSPEC - {http://www.iee.org/publish/inspec/about/} [subscription required]

For citations to computer science literature, start with INSPEC (for noncommercial computer science articles freely available on the Internet, see ResearchIndex below). Produced by the Institution of Electrical Engineers (IEE), INSPEC is the leading bibliographic information service providing access to conference papers and journal articles in computer science and information technology as well as electrical engineering and physics. The database covers literature from 1969-present, and is available from a variety of database vendors, most of which will also provide links to your library's online journals.

Chemical Abstracts and the Registry File - http://www.cas.org/ [subscription required]

For access to the chemical literature this is the place to start. Produced by CAS (Chemical Abstracts Service), Chemical Abstracts is the leading bibliographic information service providing citations to conference papers, journal articles, patents and other documents pertinent to chemistry (and it is to our advantage that they define chemistry very broadly). In December of 2001 CAS extended the coverage of Chemical Abstracts to literature from 1907 to the present, which is a boon to researchers since the chemical literature becomes obsolete slowly, if at all.

"Substance identification is a special strength of CAS, which is widely known for the CAS Chemical Registry, the largest substance identification system in existence. When a chemical substance is newly encountered in the literature processed by CAS, its molecular structure diagram, systematic chemical name, molecular formula, and other identifying information are added to the Registry and assigned a unique CAS Registry Number." This is relevant to bioinformatics because currently about 45% of the Registry File consists of protein and nucleic acid sequences. CAS gets its sequence information from the chemistry journals, patents and other documents that CAS routinely covers as well as sequences from the GenBank database. 13% of the sequences in the Registry File are unique, while 87% overlap with GenBank. The RefSeq sequences generated from GenBank are not part of the CAS Registry. While most sequences reported in the Protein Data Bank are in the Registry, the Registry does not provide access to the 3D data that is available in the PDB.

The ability to perform nucleic acid and amino acid sequence similarity searching is highly desirable. BLAST similarity searching is currently possible in some versions of the Registry File (it is available in SciFinder 2001 and via STN on the Web) but unfortunately not in SciFinder Scholar, which is the version most commonly subscribed to by academic libraries. On STN, once a Registry sequence search is completed (via BLAST or a text based search) there are multiple files which can then be very easily searched with CAS Registry Numbers, e.g., BIOSIS, AGRICOLA, USPATFULL, CAplus (Chemical Abstracts), and others. (Other vendors support Registry Number searching in many of these databases as well, though they lack the Registry File itself.) Chemical Abstracts can also be searched directly via the usual bibliographic fields (author, title, etc.). If you are in need of access to the patent literature, or if you are studying protein chemistry, pharmacogenomics, or small molecules then the CAS databases should be high on your priority list. But without access to BLAST searches in the Registry File the academic bioinformatics community will probably continue to rely heavily on other databases.
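For readers unfamiliar with what a sequence similarity search computes, the following is a minimal sketch of local alignment scoring in the style of the Smith-Waterman algorithm. It is not BLAST, which is a much faster heuristic designed for searching large databases. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name `local_alignment_score` are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences `a` and `b`.

    Uses dynamic programming with a linear gap penalty, keeping only the
    previous row of the score matrix to save memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    query = "ACGT"
    database = ["GGACGTGG", "TTTTTTTT", "ACGA"]  # made-up sequences
    # Rank database entries by similarity to the query, best first.
    for seq in sorted(database, key=lambda s: local_alignment_score(query, s), reverse=True):
        print(seq, local_alignment_score(query, seq))
```

A similarity search ranks every database sequence by a score of this kind, so that related sequences surface even when they are not identical to the query.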

BIOSIS Previews - {http://wokinfo.com/products_tools/specialized/bci/} [subscription required]

Start here for plant biology and other non-medical biology articles and conference papers from 1969 to the present (only 34% of the journals in BIOSIS overlap with MEDLINE). Since 1993 a "sequence data" field has been available but rather than containing actual nucleotide or amino acid sequences this field contains the accession number for the sequence from databases such as GenBank, EMBL and SwissProt, if the author included this number in the article text. A very small percentage of records actually use this field. BIOSIS is available from a number of database vendors, most of which will also provide links from the citation to your library's online journals.

ISI Web of Science - {http://apps.webofknowledge.com/} [subscription required]

This is a large, powerful and costly citation database. It is an index to scientific, commercially published journal articles from 1975 to the present that also allows you to search for citations to a particular article. You look up the reference to a work that you have identified to find other more recent journal articles that have cited it. Cited reference searching is a unique way to trace ideas and subjects from past research into the present day. The database is searchable by author, keyword, and cited reference. Computer scientists and biologists are quite interested in citation data. Web of Science doesn't index conferences as a primary literature source, which is a disadvantage in bioinformatics where conferences are so important. See also ResearchIndex below.
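Cited reference searching amounts to inverting each article's reference list. The following is a minimal sketch of that idea and not a description of how Web of Science is implemented. The function name `build_cited_by_index` and the article identifiers are hypothetical.

```python
from collections import defaultdict


def build_cited_by_index(references: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert a mapping of article -> works it cites into work -> articles citing it."""
    cited_by: dict[str, set[str]] = defaultdict(set)
    for article, cited_works in references.items():
        for work in cited_works:
            cited_by[work].add(article)
    return cited_by


if __name__ == "__main__":
    # Hypothetical reference lists keyed by made-up article identifiers.
    references = {
        "paper_2001_a": ["paper_1990_x", "paper_1995_y"],
        "paper_2002_b": ["paper_1990_x"],
    }
    index = build_cited_by_index(references)
    # Newer articles that cite the 1990 paper.
    print(sorted(index["paper_1990_x"]))
```

Starting from one known older work, the inverted index leads forward in time to the later articles that built on it.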

ResearchIndex (formerly CiteSeer) - {http://citeseer.ist.psu.edu/}

This is a free, full-text index to the freely available research articles on the web. "Although availability varies greatly by discipline, over a million research articles are freely available on the web. Some journals and conferences provide free access online, others allow authors to post articles on the web, and others allow authors to purchase the right to post their articles on the web" (Lawrence 2001). This index is popular with computer scientists because a great deal of their literature is available this way, and because ResearchIndex also provides citation analysis. While Web of Science doesn't index conference papers (which are a mainstay in computer science), ResearchIndex does (if the proceedings are on the web for free). It also offers reference linking, extraction of citation context, related document detection and the BibTeX entry for each article.

Technical Reports and Preprints

GenomeBiology.com: Preprint Depository - {http://genomebiology.com/preprint/}: GenomeBiology.com is an online journal from the same publishing group that brings you BioMed Central (http://www.biomedcentral.com/), of which the free online journal BMC Bioinformatics is a part (http://www.biomedcentral.com/1471-2105/). GenomeBiology.com provides free access to its peer-reviewed articles and preprints, although it charges a subscription fee to access its reviews, reports, news and commentaries. The preprints in this depository are not peer reviewed. The only screening process is to ensure relevance of the preprint to GenomeBiology.com's scope and to avoid abusive, libelous or indecent articles.
NCSTRL - Networked Computer Science Technical Reference Library - {http://csetechrep.ucsd.edu/Dienst/htdocs/Welcome.html}: NCSTRL (pronounced "ancestral") is an international collection of technical reports from a selection of participating computer science and computer engineering departments, industrial and government research laboratories made available for noncommercial and educational use. Searchable by keyword, author, or title.
PrePrint Network from the Department of Energy - {http://www.osti.gov/preprints/}: The Department of Energy funds a great deal of bioinformatics research at US universities. They are particularly interested in protein structure, DNA repair of radiation damage, and bioremediation of polluted sites. The Preprint Network is the gateway to preprints in disciplines of interest to the DOE, including bioinformatics. The Network is a metasearch engine that searches across a number of preprint and technical report collections, including the Networked Computer Science Technical Reference Library (NCSTRL), among others. They also offer an update service that will e-mail you when new resources are added in your area of interest. "The Preprint Network is one leg of a triad of electronic products for the science information consumer. We also offer PubSCIENCE ({http://www.osti.gov/pubscience}), a gateway to journal literature, and the DOE Information Bridge (http://www.osti.gov/bridge), an on-line access route to full-text technical report literature of the Department of Energy."

Major Conferences & Symposia

Important Organizations

American Crystallographic Association (ACA) - {http://www.amercrystalassn.org/}
Bioinformatics.org - http://bioinformatics.org/
The Center for Information Biology and DNA Data Bank of Japan - {http://www.ddbj.nig.ac.jp/}
European Bioinformatics Institute (EBI) - http://www.ebi.ac.uk/
European Molecular Biology Laboratory (EMBL) - {http://www.embl.de/}
Federation of American Societies for Experimental Biology (FASEB) - http://www.faseb.org/
The Human Genome Organisation (HUGO) - {http://www.hugo-international.org/}
International Society for Computational Biology (ISCB) - http://iscb.org/
National Center for Biotechnology Information (NCBI) - http://www.ncbi.nlm.nih.gov:80/
National Center for Genome Resources - http://www.ncgr.org/
National Human Genome Research Institute (NHGRI) - {http://www.nhgri.nih.gov/}
National Library of Medicine (NLM) - http://www.nlm.nih.gov/
Swiss Institute of Bioinformatics (SIB) - http://www.isb-sib.ch/

Guides, Tutorials and Primers

Bioinformatics Frequently Asked Questions - http://bioinformatics.org/FAQ/: This is a scholarly yet pragmatic FAQ (filled with what the author calls "blunt opinions") that is rich in useful information and advice. It is written by Damian Counsell of the Institute of Cancer Research, UK, though he cautions readers that the FAQ doesn't represent the ICR's views. It begins with a wonderful overview of the field that helps put all the major pieces (definitions, programs, databases) into perspective. He goes on to answer questions about finding resources in the field, questions about careers and jobs, and many practical questions like "how can I align two sequences?," "how can I predict the function of a gene," and "how do I write this up?" It is still a work in progress and the lists of books are now a bit dated, but nevertheless the FAQ is highly recommended reading.
The Bioinformatics Resource (TBR): Tutorials - {http://www.hgmp.mrc.ac.uk/CCP11/directory_tutorials.jsp?Rp=20}: This brand new database (launched January 25, 2002) covers a wide range of topics, contains substantial numbers of records, and is both searchable by keyword and browsable by topic. Sixty-three tutorials are currently cataloged. The list is very heavily weighted towards university course web pages, yet there are some real gems in here. TBR is the website of the CCP11 project (Collaborative Computational Project 11). CCP11 was established to foster bioinformatics in the UK research community, which explains the high number of UK resources listed.
Crystallography 101 - {http://www-structure.llnl.gov/Xray/101index.html}: Crystallography is important for the study of protein structures and bioinformatics is much concerned with the prediction and modeling of protein folding and structure. This is a substantial and well written tutorial on the subject by Bernhard Rupp, Professor of Molecular Structural Biology and Head of the Macromolecular Crystallography Group at the Lawrence Livermore National Laboratory.
MIT Biology Hypertextbook - {http://esg-www.mit.edu:8001/esgbio/7001main.html}: This is the often cited, extensive and well illustrated basic introductory molecular biology text that is used as a supplement to courses at MIT. It is arranged in chapters covering all the basic topics (such as cell biology, enzyme biochemistry, recombinant DNA), and includes a searchable index and practice problems.
NCBI: Education Page - {http://www.ncbi.nlm.nih.gov/Education}: Online education materials from the National Center for Biotechnology Information. Includes online tutorials for the BLAST search program and some of the Entrez Databases (PubMed, Nucleotides, Structures), as well as a useful essay on similarity searching and glossary of terms related to sequence searching.
NCBI: Medical Library Association's CE Course Manual: Molecular Biology Information Resources - http://www.ncbi.nlm.nih.gov/Class/MLACourse/: This is Renata McCarthy's manual from the excellent full-day continuing education course for librarians on molecular biology information resources. The online notes, links and examples are very helpful even without taking the class in person, which is fortunate, since this course will no longer be offered in locations throughout the U.S. Due to the complexity of the material the course is being expanded to three days and will be offered several times per year only at the National Library of Medicine, starting in early 2002. Once the revised course is available, this page will contain a link to the new course web page, which will include a schedule of course dates and registration information.
Protein Data Bank (PDB): Education Resources - {http://www.rcsb.org/pdb/static.do?p=general_information/news_publications/newsletters/educationcorner.html}: A nicely organized directory of high quality educational sites related to proteins and nucleic acids, as well as pointers to tutorials on using the PDB itself. There is a section called "protein documentaries" that lists multimedia sites (VRML, RealPlayer and/or Chime plug-ins required) and an excellent selection of molecular modeling resources in the section called "Other Educational Resources." Also worth visiting is the link to "Links" in the upper right under "Other Information Resources" that takes you to their "Macromolecular Structure Related Resources" page which is a comprehensive web directory of its own.
Science Magazine: Functional Genomics Educational Resources - {http://www.sciencemag.org/site/feature/plus/sfg/education/index.xhtml}: This site has a lot to recommend it. A "film festival" section provides RealTime movies and webcasts of press conferences and lectures. The glossary section has already been recommended earlier in this guide. There is an annotated list of "Ten Great Educational Websites" (which are very cool, though most seem to be aimed at the high school level) plus an education site of the month. And not to be missed are the three sites in the "A Little Base (Pair) Humor" section: Cartoonists' views of the Human Genome Project from Slate magazine, the DNA-O-Gram which allows you to send a nucleotide-encoded message to a friend, and Swiss-Jokes -- "The infamous random sampler of helvetian humor from ExPASy."

References

Counsell, Damian. 2001. Bioinformatics FAQ. [Online]. Available: http://bioinformatics.org/faq/ [January 16, 2002].

Doernberg, D. 1993. Computer Literacy Interview With Donald Knuth. [Online]. Available: {http://www1.fatbrain.com/interviews/knuth_interview.html} [November 19, 2001].

Lawrence, S. 2001. Online or Invisible? [Online]. Available: {http://citeseer.ist.psu.edu/online-nature01/} [November 19, 2001].

National Center for Biotechnology Information (NCBI). 2001. NCBI Education Site. [Online]. Available: http://www.ncbi.nlm.nih.gov/Education/ [November 19, 2001].

Nucleic Acids Research. 2002. 30(1). [Online]. Available: {http://nar.oxfordjournals.org/content/vol30/issue1/} [January 25, 2002].

Issues in Science and Technology Librarianship		Winter 2002
DOI:10.5062/F4959FJK