Downloading selected sequences from GenBank


A. Whole genomes
This can be accomplished in several ways:


1. Downloading a single file -


i. On the NCBI home page choose “Nucleotide” or “Genome” and paste in the accession number.  Alternatively, type in the name and search.  In the latter case, it is sometimes useful to use the Advanced Search Builder.

ii. Find the item that you are interested in and click on the format in which you want to view it (“FASTA” or “GenBank”).

iii. There are two main ways of downloading the data: (a) “Copy” and “Paste” it into Notepad or a similar text editor and save it in the appropriate file format; or (b) choose “send to:”, “Choose Destination”, “File” and select either GenBank(full) or FASTA from the “Format” list. Click on “Create File” to generate and download a “sequence.gb” or “sequence.fasta” file, respectively. N.B. The same can be done from the FASTA document in NCBI.


2. Downloading multiple files -


i. On the NCBI home page choose “Nucleotide” or “Genome” and paste in the required accession numbers (there is a limit of 100).  Alternatively, type in the name and search.  In the latter case, it is sometimes useful to use the Advanced Search Builder.

ii. You can use the “send to:” command as in A.1.iii. to save all the sequences.

iii. Alternatively, you can click on the boxes associated with the desired records to download a selected subset of data.
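
For more than a handful of records, the downloads in section A can also be scripted. The following is a minimal sketch, not part of the original protocol, which assumes NCBI's E-utilities "efetch" service and uses only the Python standard library. The function names and the accession numbers in the usage comment are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an efetch URL for one or more accession numbers.

    db:      "nuccore" for nucleotide records, "protein" for proteins.
    rettype: "fasta" for FASTA, "gb" for a GenBank flat file,
             "gbwithparts" for the full GenBank record with sequence.
    """
    params = {
        "db": db,
        "id": ",".join(accessions),
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download(accessions, out_path, db="nuccore", rettype="fasta"):
    """Fetch the records and save them to out_path as one file."""
    with urlopen(efetch_url(accessions, db, rettype)) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)


# Usage (placeholder accession numbers, requires Internet access):
# download(["AB123456", "CD789012"], "sequence.fasta")
# download(["AB123456"], "sequence.gb", rettype="gbwithparts")
```

NCBI asks that scripts keep to a few requests per second, so large lists are best sent in batches with a short pause between them.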


B. Selected proteins


i. You can select a specific protein for downloading in much the same way as described above for a GenBank flatfile (*.gbk) or fasta-formatted nucleotide sequence. Alternatively, you can go to the Protein database and make your selection there.

ii. You may want to download the results of a BLASTp search for subsequent phylogenetic analysis. This can be accomplished within the BLASTp search results by choosing “Select All” or selecting the boxes for the sequences you are interested in.

iii. Select “Download” and “FASTA (complete sequence)” and “Continue.” The downloaded file will be called seqdump.txt.

iv. Open the latter in Notepad or related text editor.

v. The data will look like this:

(a) single record:
>ANJ65251.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Rexella]
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDVKFGIDLLVQMALHKRCDLQTLVGTLRHHCESAQEVVNNILKCAEADLVDYNVSLGIFIVRCTISNDV
QEELDRFQYPLPMVVEPKKITNNKQSGYLLNNKSIILKDNHHEDDVCLDHINRLNKIKFRINFDTARMVKNEWRNLDKRKEGETQADFMKRKKAFEKYDSTARDVMEVLHKVS
DTFHLTHSYDKRLRTYAQGYHVNYQGTAWNKAVIEFAEEEVTNG

(b) Multi-hit record:
>YP_009286151.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Frozen] >ANJ65154.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Frozen]
>ANJ65337.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Gutmeister]
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDVKFGIDLLVQMALHKRCDLQTLVGTLRHHCESAQEVVNNILKCAEADLVDYNVSLGIFIVRCTISNDV
QEELDRFQYPLPMVVEPKKITNNKQSGYLLNNKSIILKDNHHEDDVCLDHINRLNKIKFRINFDTARMVKNEWRNLDKRKEGETQADFIKRKKAFEKYDSTARDVMEVLHKVSD
TFHLTHSYDKRLRTYAQGYHVNYQGTAWNKAVIEFAEEEVTNG


vi. Use your text editor to duplicate the sequence of the latter (multi-hit) record under each of its headers and remove nonessential text, so that you end up with a poly-fasta document in the following format:
> ANJ65251
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
> ANJ65154
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
> ANJ65337
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc

OR

> Erwinia phage vB_EamP_Rexella
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
> Erwinia phage vB_EamP_Frozen
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
> Erwinia phage vB_EamP_Gutmeister
MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
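
For large BLASTp downloads, the manual editing in step vi can be approximated with a short script. The following is a minimal sketch, not part of the original protocol, which assumes that seqdump.txt follows the layout shown in step v: identical proteins share one sequence, with their headers either joined on one line by " >" or placed on consecutive lines. The function names are hypothetical.

```python
import re


def parse_seqdump(text):
    """Return a list of (headers, sequence) pairs, one per BLAST record."""
    records, headers, seq_lines = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_lines:  # a header after sequence lines starts a new record
                records.append((headers, "".join(seq_lines)))
                headers, seq_lines = [], []
            # One line may carry several headers separated by " >".
            headers.extend(h.strip() for h in re.split(r"\s>", line[1:]) if h.strip())
        else:
            seq_lines.append(line)
    if headers:
        records.append((headers, "".join(seq_lines)))
    return records


def label(header, by="accession"):
    """Reduce a header to its accession (no version) or its organism name."""
    if by == "organism":
        match = re.search(r"\[([^\]]+)\]\s*$", header)
        if match:
            return match.group(1)
    return header.split()[0].split(".")[0]


def to_polyfasta(text, by="accession", width=60):
    """Write one FASTA record per header, repeating the shared sequence."""
    out = []
    for headers, seq in parse_seqdump(text):
        seen = set()
        for header in headers:
            name = label(header, by)
            if name in seen:  # e.g. the same organism listed twice in one record
                continue
            seen.add(name)
            out.append(">" + name)
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Usage:
# with open("seqdump.txt") as handle:
#     print(to_polyfasta(handle.read(), by="organism"))
```

Note that with by="accession" every header is kept, including RefSeq entries such as YP_009286151 that duplicate a GenBank entry; remove those by hand if they are not wanted.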

C. Using BioEdit to edit file format


1. Run BioEdit and open “sequence.gb”

2. The GenBank DEFINITION line will appear in the left column, and the associated sequence in the right column.

3. One can change this to give the ACCESSION number: under “Edit” choose “Select all sequences”, and then under “Sequence” select “rename” and “with ACCESSION.”  N.B. If you get more than one accession number, this may indicate that one of the sequences has been replaced in GenBank.

4. In either case, place the cursor in the left column and under “Edit” choose “Select All Sequences”.

5. To Export to Excel: Under “Edit” choose “Export” and “tab-delimited text” and save the file as *.tab.  Then open the file in Excel (with the file type set to all file formats) and choose “Delimited”, “Next” and “Finish”.  This will give you the names in column A and the sequences in column B.

6. To Export as Individual Sequence Files: Under “File” choose “Export” and “split into individual fasta files.”  Save these in a unique folder.  Please note that if the names are too similar, as in Acinetobacter phage YMC11/12/R1215 and Acinetobacter phage YMC11/12/R2315, files may not be generated.
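
Where BioEdit is not available, steps 3 to 5 can be approximated with a script. The following is a minimal sketch, not part of the original protocol, which reads a GenBank flat file such as “sequence.gb” with a deliberately simple parser (it relies only on the DEFINITION, ACCESSION, ORIGIN and "//" lines) and writes accession numbers and sequences as tab-delimited text. The function names are hypothetical.

```python
import csv
import io


def parse_genbank(text):
    """Yield (accession, definition, sequence) for each record in the file."""
    accession, definition, seq, section = "", [], [], None
    for line in text.splitlines():
        if line.startswith("//"):  # end of one record
            yield accession, " ".join(definition), "".join(seq).upper()
            accession, definition, seq, section = "", [], [], None
        elif section == "ORIGIN":
            # Sequence lines: keep letters, drop position numbers and spaces.
            seq.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            section = "ORIGIN"
        elif line.startswith("DEFINITION"):
            section = "DEFINITION"
            definition.append(line[len("DEFINITION"):].strip())
        elif line.startswith("ACCESSION"):
            section = "ACCESSION"
            parts = line.split()
            # Only the first (primary) accession number is kept.
            accession = parts[1] if len(parts) > 1 else ""
        elif line.startswith(" ") and section == "DEFINITION":
            definition.append(line.strip())  # wrapped DEFINITION line
        elif line[:1] != " ":
            section = None  # any other top-level keyword


def to_tab_delimited(records):
    """Return 'accession<TAB>sequence' lines, ready to open in Excel."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for accession, _definition, sequence in records:
        writer.writerow([accession, sequence])
    return buffer.getvalue()


# Usage:
# with open("sequence.gb") as handle:
#     table = to_tab_delimited(parse_genbank(handle.read()))
# with open("sequence.tab", "w") as handle:
#     handle.write(table)
```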

D. Splitting poly-fasta protein files using EMBOSS Explorer seqretsplit

1. EMBOSS seqretsplit can be accessed here or here.

2. Upload your seqdump.txt or sequence.fasta file, and leave "Output sequence format" as the default, "Pearson FASTA".

3. When the results come up in your Internet browser, search for the fasta symbol (>) and right click to download the separate files.  These will be identified by their accession numbers.  Unfortunately, the file names are written in lower case (ab123456.fasta and not AB123456.fasta).
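
If the web service is unavailable, or the lower-case file names are a problem, a poly-fasta file can also be split locally. The following is a minimal sketch, not part of the original protocol. It keeps the case of the accession number, replaces characters such as "/" that are not safe in file names, and adds a numeric suffix when two names would otherwise collide. The function names and the use_full_header option are hypothetical.

```python
import re
from pathlib import Path


def read_fasta(text):
    """Return a list of (header, sequence) pairs from poly-fasta text."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []  # tolerates "> name" as well
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def safe_name(header, use_full_header=False):
    """Turn a header into a file-name stem, keeping upper and lower case."""
    words = header.split()
    if not words:
        return "unnamed"
    stem = "_".join(words) if use_full_header else words[0]
    return re.sub(r"[^A-Za-z0-9._-]", "_", stem)


def split_fasta(text, out_dir, use_full_header=False):
    """Write each record to its own .fasta file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, used = [], set()
    for header, seq in read_fasta(text):
        base = safe_name(header, use_full_header)
        name, n = base, 1
        while name.lower() in used:  # avoid clashes, ignoring case
            n += 1
            name = f"{base}_{n}"
        used.add(name.lower())
        path = out_dir / f"{name}.fasta"
        path.write_text(f">{header}\n{seq}\n")
        paths.append(path)
    return paths


# Usage:
# with open("seqdump.txt") as handle:
#     split_fasta(handle.read(), "split_files")
```

By default the first word of each header (normally the accession number) names the file; set use_full_header=True when the headers are organism names rather than accession numbers.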

E. Extract protein sequences from GenBank flatfiles

1. Use the Rocap Genbank/EMBL to FASTA Conversion Tool to convert a GenBank flat file (*.gbk) into a fasta-formatted amino acid sequence file (*.faa).
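
As an offline alternative to the conversion tool, the protein sequences can be pulled from the CDS features of a flat file with a short script. The following is a minimal sketch, not a description of how the Rocap tool works. It assumes the standard GenBank feature table layout (feature keys starting in column 6, qualifiers below them) and reads only the /protein_id, /product and /translation qualifiers. The function names are hypothetical.

```python
import re


def extract_proteins(gb_text):
    """Return (protein_id, product, sequence) for every translated CDS."""
    proteins, key, block, in_features = [], None, [], False

    def flush():
        # Store the feature collected so far if it is a CDS with a translation.
        if key != "CDS":
            return
        text = "\n".join(block)
        translation = re.search(r'/translation="([^"]*)"', text)
        if not translation:
            return
        pid = re.search(r'/protein_id="([^"]*)"', text)
        product = re.search(r'/product="([^"]*)"', text)
        name = pid.group(1) if pid else f"CDS_{len(proteins) + 1}"
        desc = re.sub(r"\s+", " ", product.group(1)) if product else ""
        proteins.append((name, desc, re.sub(r"\s+", "", translation.group(1))))

    for line in gb_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif in_features and line[:1].strip():  # ORIGIN, "//", etc.
            flush()
            key, block, in_features = None, [], False
        elif in_features and line[5:6].strip():  # a new feature key
            flush()
            key, block = line.split()[0], []
        elif in_features:
            block.append(line.strip())  # qualifier or continuation line
    flush()
    return proteins


def to_faa(proteins, width=60):
    """Format the proteins as fasta-formatted amino acid text (*.faa)."""
    out = []
    for name, desc, seq in proteins:
        out.append(f">{name} {desc}".rstrip())
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Usage:
# with open("sequence.gb") as handle:
#     faa = to_faa(extract_proteins(handle.read()))
# with open("sequence.faa", "w") as handle:
#     handle.write(faa)
```

CDS features without a /translation qualifier (for example pseudogenes) are skipped, so the flat file needs to have been downloaded with its annotations, i.e. as GenBank(full).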