BLAST (Basic Local Alignment Search Tool) & FASTA
BACKGROUND INFORMATION: The three BLAST programs that one will commonly use are BLASTN, BLASTP and BLASTX. BLASTN will compare your DNA sequence with all the DNA sequences in the nucleotide collection (nt/nr). BLASTP will compare your protein sequence with all the protein sequences in the nonredundant protein database (nr). In BLASTX your nucleotide sequence will be translated in all six reading frames and the products compared with the nr protein database. Several online tutorials are available, including BLAST QuickStart and Basic Web BLAST from NCBI, and a YouTube video.
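To make the "six reading frames" idea concrete, here is a minimal Python sketch of the translation step that precedes a BLASTX-style comparison. It is an illustration only, not NCBI's implementation: the function names are made up for this example, the standard genetic code is assumed, and any codon containing a character other than A, C, G or T is reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(dna):
    """Map frame labels (+1..+3, -1..-3) to their conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, template in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six translated products would then be compared with the protein database in the same way as a BLASTP query.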
BLAST Homepage - (NCBI)
Nucleotide BLAST (BLASTN) N.B. The default database is the "nucleotide collection (nt/nr)." NCBI has comparatively recently added the ability to conduct Batch BLAST searches.
Protein BLAST (BLASTP) N.B. This program is also coupled with a motif search. If you suspect that your protein may only show weak sequence similarity to other proteins, I would suggest clicking on the
Translated BLAST (BLASTX)
TBLASTX searches translated nucleotide databases using a translated nucleotide query, while TBLASTN searches translated nucleotide databases using a protein query. These are useful resources if you are interested in homologs in unfinished genomes. Under "Databases" select "genomic survey sequences", "High throughput genomic sequences" or "whole-genome shotgun reads".
Blast with Microbial Genomes (BLASTN, TBLASTN, TBLASTX etc.). Permits one to compare a nucleic acid or protein sequence against finished archaeal and bacterial genomes.
1. Depending upon the time of day, your results may appear almost immediately, or your search may be delayed or not accepted at all. Be prepared for plenty of results. You may only want to print the first few pages (e.g. 1-5). Alternatively, under "Algorithm Parameters" change the "Maximum targets" from 100 (default) to 10 or 50.
2. For PSI-BLAST and other searches, I frequently enter information in the "Entrez Query" section, e.g. Escherichia coli[organism] or Viruses[organism], to see "hits" specifically to E. coli or viruses/bacteriophages (see here for details).
3. It is advisable to always select "
EMB BLAST - (European Molecular Biology network - Swiss node). Very convenient since it permits one to specifically search databases such as prokaryote, bacteriophage, fungal, & 16S rRNA using BLASTN, and specific bacterial genomes or SwissProt using BLASTX or BLASTN.
BlastStation-Free in the Cloud (TM Software, Inc.; founder: Takashi Miyajima) is a web-based 64-bit local BlastStation running on a cloud computer. It supports megablast, blastn, blastp, and blastx searches, and allows easy database creation from your FASTA or FASTQ file, which can be compressed in .gz, .Z, or .zip format. Search results are presented as a graphical display and as a summary table. The latter can be exported in CSV format, while the hit sequences can be exported in FASTA format. Also available for download in Mac or PC format.
ParAlign (CMBN Bioinformatics Group, University of Oslo, Norway) - employs a heuristic method for sequence alignment. In essence, ParAlign is about as sensitive as Smith-Waterman but runs at the speed of BLAST. Nice graphics.
GTOP Sequence Homology Search (Laboratory for Gene-Product Informatics, National Institute of Genetics, Japan) - offers BLASTP search capability against individual Archaea, Bacteria, Eukaryota, and viruses.
Actinobacteriophage Database (Graham Hatfull, U.S.A.) - allows BLASTN and BLASTP analyses against a growing list of phages that infect bacterial hosts within the phylum Actinobacteria.
HHpred Homology detection & structure prediction by HMM-HMM comparison - is a method for database searching and structure prediction that is as easy to use as BLAST but is much more sensitive in finding remote homologs. HHpred is the first server that is based on the pairwise comparison of profile hidden Markov models (HMMs). Whereas most conventional sequence search methods search sequence databases such as UniProt or the NR, HHpred searches alignment databases, like Pfam or SMART. This greatly simplifies the list of hits to a number of sequence families instead of a clutter of single sequences. HHpred accepts a single query sequence or a multiple alignment as input. (Reference: Söding J et al. 2005. Nucl. Acids Res. 33 (Web Server issue): W244-W248).
For more sophisticated studies you might want to employ:
PSI-BLAST or PHI-BLAST or DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST) search - (NCBI) Position-Specific Iterative BLAST creates a profile after the initial search.
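The "profile" that PSI-BLAST builds after the initial search can be pictured as a position-specific scoring matrix. The Python sketch below is a deliberately simplified illustration and not the NCBI algorithm: it assumes an ungapped alignment of equal-length hits, a uniform background frequency, and a flat pseudocount, whereas the real program uses sequence weighting and more elaborate pseudocounts. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length aligned sequences.

    Assumptions for this sketch: uniform background (1/20 per residue)
    and the same pseudocount added to every residue in every column.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(s[col] for s in aligned if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the column scores for a sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    profile = build_pssm(["MKV", "MKI", "MRV"])
    print(round(score(profile, "MKV"), 2), round(score(profile, "AAA"), 2))
```

In an iterative search the profile, rather than the original query, is used for the next round, which is why residues conserved among the first-round hits carry more weight in later rounds.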
BLAST 2 - (NCBI) BLAST two sequences against one another. N.B. This utilizes BLASTN, P, X as well as TBLASTN and TBLASTX.
Gene Context Tool III - is an incredible tool for visualizing the genome context of a gene or group of genes (synteny). In the following diagram an RpoN (Sigma54) protein was analyzed. (Reference: R. Ciria et al. (2004) Bioinformatics 20: 2307-2308).
Cinteny - Server for Synteny Identification and Analysis of Genome Rearrangement using reversal distance as a measure. You may create a project and upload your own data by following the links below or work with pre-loaded data by selecting the genomes below (Reference: Sinha, A.U. & Meller, J. 2007. BMC Bioinformatics 8: 82)
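For readers unfamiliar with reversal distance, the following Python sketch shows the underlying operation on a signed gene order. It does not compute the reversal distance itself and is not Cinteny's code; it only applies a single reversal and counts breakpoints, under the assumption that genes are numbered 1..n with a sign for orientation and that the target order is 1, 2, ..., n. Function names are hypothetical.

```python
def apply_reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip each gene's sign."""
    return perm[:i] + [-g for g in reversed(perm[i:j + 1])] + perm[j + 1:]


def count_breakpoints(perm):
    """Count adjacent pairs that are not consecutive in the identity order.

    The permutation is framed with 0 and n+1 so that the two ends count too.
    """
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)


if __name__ == "__main__":
    genome = [1, -3, -2, 4]
    print(count_breakpoints(genome))
    print(apply_reversal(genome, 1, 2))
```

Because one reversal can remove at most two breakpoints, half the breakpoint count gives a simple lower bound on the number of reversals required.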
Other search engines include:
Fasta Protein Similarity Search - (EBI) This tool provides sequence similarity searching against protein databases using the FASTA suite of programs. FASTA provides a heuristic search with a protein query. FASTX and FASTY translate a DNA query. Optimal searches are available with SSEARCH (local), GGSEARCH (global) and GLSEARCH (global query, local database).
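The "optimal" local search offered by SSEARCH is a Smith-Waterman alignment. As a rough illustration of what that means, here is a minimal Python sketch that returns only the best local alignment score. The scoring scheme (match +2, mismatch -1, linear gap -2) is an assumption chosen for readability; real searches use substitution matrices such as BLOSUM and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses two rows of the dynamic-programming matrix; every cell is
    floored at zero, which is what makes the alignment local.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

A global method such as GGSEARCH differs mainly in dropping the zero floor and scoring the alignment from end to end; heuristic programs such as FASTA and BLAST avoid filling the whole matrix, trading some sensitivity for speed.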
TC-BLAST (Saier Laboratory Bioinformatics Group, University of California, San Diego, U.S.A.) - scans the transport protein database (TC-DB), producing alignments and phylogenetic trees. The TC-DB details a comprehensive classification system for membrane transport proteins known as the Transporter Classification (TC) system.
MEROPS BLAST - permits one to screen protein sequences against an extensive database of characterized peptidases (Reference: Rawlings, N.D et al. 2002. Nucleic Acids Res. 30: 343-346).
COMPASS - is a profile-based method for the detection of remote sequence similarity and the prediction of protein structure. The server features three major developments: (i) improved statistical accuracy; (ii) increased speed from parallel implementation; and (iii) new functional features facilitating structure prediction. These features include visualization tools that allow the user to quickly and effectively analyze specific local structural region predictions suggested by COMPASS alignments. (Reference: R.I. Sadreyev et al. 2009. Nucl. Acids Res. 37 (Web Server issue): W90-W94).
SANSparallel: interactive homology search against Uniprot - the webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. (Reference: P. Somervuo & L. Holm. 2015. Nucl. Acids Res. 43 (W1): W24-W29).
Detect bacterial toxins through text and homology searches:
BTXpred Server Prediction of Bacterial Toxins - the aim of BTXpred is to predict bacterial toxins and their function from primary amino acid sequence using SVM, HMM and PSI-BLAST. Bacterial toxins play a vital role in causing disease and are responsible for the majority of symptoms and lesions during infection. Makes a polyprotein from FASTA-formatted sequences. (Reference: Saha, S., & Raghava, G.P. 2007. In Silico Biol. 7(4-5):405-12).
DBETH Database of Bacterial ExoToxins for Human - is a database of sequences, structures, interaction networks and analytical results for 229 exotoxins from 26 different human pathogenic bacterial genera. All toxins are classified into 24 different Toxin classes. The aim of DBETH is to provide a comprehensive database for human pathogenic bacterial exotoxins. (Reference: Chakraborty, A. et al. 2012. Nucl. Acids Res. 40(Database issue): D615-620).
MvirDB LLNL Virulence Database - integrates DNA and protein sequence information from Tox-Prot, SCORPION, the PRINTS virulence factors, VFDB, TVFac, Islander, ARGO and a subset of VIDA. Entries in MvirDB are hyperlinked back to their original sources. A blast tool allows the user to blast against all DNA or protein sequences in MvirDB, and a browser tool allows the user to search the database to retrieve virulence factor descriptions, sequences, and classifications, and to download sequences of interest. MvirDB has an automated weekly update mechanism. Each protein sequence in MvirDB is annotated using a fully automated protein annotation system and is linked to that system's browser tool (Reference: Zhou, C.E. et al. 2007. Nucl. Acids Res. 35(Database issue): D391-394).
COG analysis - Clusters of Orthologous Groups - COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages and likely corresponds to an ancient conserved domain (CloVR). Sites which offer this analysis include:
WebMGA (Reference: S. Wu et al. 2011. BMC Genomics 12:444), RAST (Reference: Aziz RK et al. 2008. BMC Genomics 9:75), BASys (Bacterial Annotation System; Reference: Van Domselaar GH et al. 2005. Nucleic Acids Res. 33(Web Server issue):W455-459) and JGI IMG (Integrated Microbial Genomes; Reference: Markowitz VM et al. 2014. Nucl. Acids Res. 42: D560-D567).
Discover EggNOG 4.1 - A database of orthologous groups and functional annotation that derives Nonsupervised Orthologous Groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. (Reference: Powell S et al. 2014. Nucleic Acids Res. 42 (D1): D231-D239).
OrthoMCL - is another algorithm for grouping proteins into ortholog groups based on their sequence similarity. The process usually takes between 6 and 72 hours. (Reference: Fischer S et al. 2011. Curr Protoc Bioinformatics; Chapter 6:Unit 6.12.1-19).
KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways. (Reference: Moriya Y et al. 2007. Nucleic Acids Res. 35(Web Server issue):W182-185).
PSP - Prokaryotic Selection Pressure - is an easy-to-use web tool for rapid identification of orthologous genes under positive selection from a set of multiple, closely related prokaryotic genomes. It provides several useful functions for in-depth analysis of evolutionary selection: retrieving the orthologous groups, removing the effect of gene recombination, generating codon-delimited alignments, building phylogenetic trees and estimating ω (the dN/dS ratio) under different models. It also facilitates efficient exploration of the identified orthologous genes under positive selection at the metabolic-pathway level by enrichment of KEGG Orthology and/or COG. (Reference: Su, F. et al. 2013. BMC Genomics 14:924).
arCOGs (Archaeal Clusters of Orthologous Genes) - can be used to classify genes and provide improved functional annotation specific to archaeal genomes. (Reference: Makarova KS et al. 2007. Biology Direct 2:33).
Unique search engine:
MineBlast - performs BLASTP searches in UniProt to identify names and synonyms based on homologous proteins and subsequently queries PubMed, using combined search terms in order to find and present relevant literature. This tool only allows max. 100 queries per user per day. (Reference: G. Dieterich et al. 2005. Bioinformatics 21: 3450-3451).
Comparison of homology between two small genomes:
Kablammo helps you create interactive visualizations of BLAST results from your web browser. Find your most interesting alignments, list detailed parameters for each, and export a publication-ready vector image. Incredibly easy to use - here are the results for a BLASTN comparison of Escherichia phages T1 (query) and ADB-2. (Reference: Wintersinger JA et al. Bioinformatics 31:1305-1306).
SCAN2 (Softberry.com) provides one with a colour-coded graphical alignment of genome length DNAs in Java. In the top panel regions of high sequence identity are presented in red. By highlighting the gray, yellow, green, black boxes one can select specific regions for examination of the sequence alignment. For additional information on the output see here. This site appears to work best with Internet Explorer.
Advanced PipMaker (Schwartz et al. Genome Research Vol. 10, Issue 4, 577-586, April 2000) aligns two DNA sequences and returns a percent identity plot of that alignment, together with a traditional textual form of the alignment. You might want to download Laj (Penn State - Bioinformatics Group, U.S.A.) for viewing and manipulating the output from pairwise alignment programs such as PipMaker representations of the alignments.
JDotter: A Java Dot Plot Viewer (Viral Bioinformatics Resource Center, University of Victoria, Canada) - a dot matrix plotter for Java. Produces similar diagrams to the above-mentioned programs, but with better control on output. Also available here.
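For those who have not met a dot matrix plot before, the principle is simple enough to sketch in a few lines of Python. This is a toy illustration, not JDotter's algorithm: it marks a dot wherever an exact word of a fixed length (the hypothetical `window` parameter) occurs in both sequences, whereas real dot plot programs score windows with a substitution matrix and apply a threshold.

```python
def dot_plot(seq_x, seq_y, window=3):
    """Return (x, y) start positions where the two sequences share an exact word."""
    hits = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                hits.append((i, j))
    return hits


def render(seq_x, seq_y, window=3):
    """Draw the plot as text: one row per position in seq_y, '*' marks a hit."""
    hits = set(dot_plot(seq_x, seq_y, window))
    width = len(seq_x) - window + 1
    rows = []
    for j in range(len(seq_y) - window + 1):
        rows.append("".join("*" if (i, j) in hits else "." for i in range(width)))
    return "\n".join(rows)


if __name__ == "__main__":
    print(render("ACGTACGT", "ACGTTACG", window=3))
```

Regions of similarity show up as diagonal runs of dots; repeats appear as parallel diagonals, and inversions would require comparing against the reverse complement as well.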
multi-zPicture: multiple sequence alignment tool (Comparative Genomics Center, Lawrence Livermore National Laboratory, U.S.A.) - provides nice dotplot graphs and dynamic visualizations. If simple gene locations are provided in the form (e.g. > 2000 5000 RNA_polymerase; indicates that the RNA polymerase gene is found on the plus strand between bases 2000 and 5000) this data will be added to the dynamic visualization. zPicture alignments can be automatically submitted to rVista to identify conserved transcription factor binding sites.
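If you need to prepare or check gene location lines of the kind shown above, a small parser helps. The Python sketch below handles the one form given in the example ("> 2000 5000 RNA_polymerase" for a plus-strand gene). Treating "<" as the minus-strand marker is my assumption and should be checked against the zPicture documentation; the `Gene` type and function name are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    strand: str  # "+" or "-"
    start: int
    end: int
    name: str


def parse_gene_line(line):
    """Parse a line such as '> 2000 5000 RNA_polymerase' into a Gene.

    Assumption: '>' marks the plus strand (as in the example) and
    '<' marks the minus strand.
    """
    marker, start, end, name = line.split(maxsplit=3)
    if marker not in (">", "<"):
        raise ValueError(f"unexpected strand marker: {marker!r}")
    return Gene("+" if marker == ">" else "-", int(start), int(end), name.strip())


if __name__ == "__main__":
    print(parse_gene_line("> 2000 5000 RNA_polymerase"))
```

The same function can be run over every line of an annotation file to catch malformed coordinates before submitting the data.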
GeneOrder 3.0 (D. Seto, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) is ideal for comparing small GenBank genomes (up to 2 Mb). Each gene from the Query sequence is compared to all of the genes from the Reference sequence using BLASTP. There are two display formats: graphical and tabular. Currently the graph is an applet and must be saved as a "SCREEN SHOT". If your data is not present in GenBank use this site. The upgrade to this program is GeneOrder 4.0, which will compare genomes up to 8 Mb (Reference: Mahadevan P. & Seto D. 2010. BMC Research Notes 3:41).
CoreGenes (D. Seto, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) is designed to analyze two to five genomes simultaneously, generating a table of related genes - orthologs and putative orthologs. These entries are linked to their GenBank data. It has a limit of 0.35 Mb, while the newer version CoreGenes 2.0 extends the limit to approx. 2.0Mb. If your data is not present in GenBank use this site. The following diagram is from an analysis of coliphages T3, T7, Yersinia phage phi-YeO3-12 and Roseophage S10I.
CoreGenes 3.0 - is the latest member in the CoreGenes family of tools. It determines unique genes contained in a pair of proteomes. (Caveat: currently supports only a single pair of genomes.) This has proved extremely useful in determining unique genes in comparisons between large Myoviridae. CoreGenes 3.5 is the batch CoreGenes server.
I have used this suite of programs extensively in the classification of bacterial viruses.