03 - Parsing GenBank files

Introduction

Thus far we've looked at GenBank files by eye and in Artemis. Now we'll do this with a Python script.

To recap, we've used the NCBI Entrez Programming Utilities via Biopython's Bio.Entrez to download the Pectobacterium atrosepticum genome in GenBank format as NC_004547.gbk.

Also, you have been provided a FASTA file glycoside_hydrolases_nt.fasta which we've said should contain genes from this genome, named using the old locus tags:

  • ECA0662 6-phospho-beta-glucosidase
  • ECA1451 6-phospho-beta-glucosidase
  • ECA1871 6-phospho-beta-glucosidase
  • ECA2166 6-phospho-beta-glucosidase
  • ECA3646 beta-glucosidase
  • ECA4387 6-phospho-beta-glucosidase
  • ECA4407 6-phospho-beta-glucosidase
  • ECA4432 6-phospho-beta-glucosidase

What we're going to do now is recreate that FASTA file by loading the GenBank file into Python, and pull out the nucleotide coding sequence for these genes using the CDS feature with the desired locus tag. We'll also translate them into proteins.

Parsing a GenBank file and finding a feature

We'll use Biopython to parse each genome, which gives all the features as a list. We'll then loop over the list of features to find the desired CDS features:

In [1]:
# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO

def get_cds_feature_with_qualifier_value(seq_record, name, value):
    """Function to look for CDS feature by annotation value in sequence record.
    
    e.g. You can use this for finding features by locus tag, gene ID, or protein ID.
    """
    # Loop over the features
    for feature in genome_record.features:
        if feature.type == "CDS" and value in feature.qualifiers.get(name, []):
            return feature
    # Could not find it
    return None

genome_record = SeqIO.read("NC_004547.gbk", "genbank")
cds_feature = get_cds_feature_with_qualifier_value(genome_record, "old_locus_tag", "ECA0662")
print(cds_feature)
type: CDS
location: [736846:738235](-)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['COORDINATES: similar to AA sequence:RefSeq:WP_010282995.1']
    Key: locus_tag, Value: ['ECA_RS03295']
    Key: note, Value: ['Derived by automated computational analysis using gene prediction method: Protein Homology.']
    Key: old_locus_tag, Value: ['ECA0662']
    Key: product, Value: ['6-phospho-beta-glucosidase']
    Key: protein_id, Value: ['WP_039289952.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MKAFPDGFLWGGSVAANQVEGAWNEDGKGVSTSDLQLKGVHGPVTERDESISCIKDRAIDFYHQYPQDIQLFAEMGFKVLRTSIAWTRIYPEGNEAEPCEAGLAFYDHLFDEMAKHHIQPLITLSHYEMPYGLVKKLGGWGNRAVIDHFEKYARTVFSRYKDKVKHWLTFNEINMALHSPFTGIGLSGEPSKQDIYQAIHHQLVASARVVKACHDIIPDAKIGNMLLGAVRYPMTCHPKDVLEAQNKNREWLFFGDVQVRGTYPAWIQRYFRENDVELTITAQDKDDLSHTVDFVSFSYYMSGCATFEPEKYQSSRGNIIRLIPNPHLEASEWGWQIDPEGLRFLLNELYDRYQKPLFIVENGLGARDVVEEDGSIHDSYRIDYLRRHLIQVREAIDDGVELLGYTSWGPIDLVSAGTAQMSKRYGFIHVDRDDEGKGTLERRRKDSFYWYQDVISSNGKSL']

Using your text editor you can compare this to the matching section in the GenBank file NC_004547.gbk were you should find:

     CDS             complement(736847..738235)
                     /locus_tag="ECA_RS03295"
                     /old_locus_tag="ECA0662"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_010282995.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="6-phospho-beta-glucosidase"
                     /protein_id="WP_039289952.1"
                     /translation="MKAFPDGFLWGGSVAANQVEGAWNEDGKGVSTSDLQLKGVHGPV
                     TERDESISCIKDRAIDFYHQYPQDIQLFAEMGFKVLRTSIAWTRIYPEGNEAEPCEAG
                     LAFYDHLFDEMAKHHIQPLITLSHYEMPYGLVKKLGGWGNRAVIDHFEKYARTVFSRY
                     KDKVKHWLTFNEINMALHSPFTGIGLSGEPSKQDIYQAIHHQLVASARVVKACHDIIP
                     DAKIGNMLLGAVRYPMTCHPKDVLEAQNKNREWLFFGDVQVRGTYPAWIQRYFRENDV
                     ELTITAQDKDDLSHTVDFVSFSYYMSGCATFEPEKYQSSRGNIIRLIPNPHLEASEWG
                     WQIDPEGLRFLLNELYDRYQKPLFIVENGLGARDVVEEDGSIHDSYRIDYLRRHLIQV
                     REAIDDGVELLGYTSWGPIDLVSAGTAQMSKRYGFIHVDRDDEGKGTLERRRKDSFYW
                     YQDVISSNGKSL"

Note that because of how we downloaded NC_004547.gbk using NCBI Entrez, the CDS features include the protein translation. Depending on the settings, the NCBI website does not always show the translations here.

Locations in GenBank format

The first line of a GenBank (or EMBL) feature gives the co-ordinates. Simple cases on the forward strand would look like 46146..47009 or 1063081..1063944 meaning start..end using inclusive one-based counting.

Here we have complement(736847..738235) where complement(end..start) is for features on the reverse strand. Here the start of the gene is at position 738235 while the stop codon should be the three bases ending at position 736847. If you still have Artemis open, you can check this.

More complicated CDS locations using join(...) are common in eurkaryotes to describe slicing.

The way that Python (and many other programming languages) slices strings or arrays of data counts from zero but excludes the end point, which is why in Biopython these locations seem to begin one base earlier.

In [2]:
print(cds_feature.location)
[736846:738235](-)

Rather than messing about with the start/end coordinates ourselves (and worrying about counting from zero or counting from one), we can get Biopython to apply the location information from the feature to extract the described region of the genome sequence.

In [3]:
gene_sequence = cds_feature.extract(genome_record.seq)
print("CDS nucleotide sequence:")
print(gene_sequence)
print("Start codon is %s" % gene_sequence[:3])  # Python's way to get first three letters
print("Stop codon is %s" % gene_sequence[-3:])  # Python trick for last three letters
CDS nucleotide sequence:
ATGAAAGCATTCCCCGACGGATTTTTATGGGGCGGTTCAGTCGCAGCAAATCAGGTTGAAGGGGCATGGAATGAAGACGGCAAAGGCGTGTCGACCTCCGATCTTCAGCTAAAGGGCGTGCATGGTCCGGTGACAGAACGCGATGAGAGCATTAGCTGCATCAAAGATCGGGCAATCGATTTTTATCATCAATATCCGCAGGATATACAGCTATTCGCCGAAATGGGCTTCAAGGTGTTACGCACCTCCATCGCCTGGACGCGTATTTATCCCGAAGGCAATGAAGCAGAACCCTGCGAAGCGGGTCTGGCCTTTTACGATCATCTGTTTGATGAAATGGCAAAGCATCATATTCAGCCGCTGATTACGCTGTCGCACTATGAAATGCCGTACGGTCTGGTGAAAAAGTTGGGAGGCTGGGGCAACCGCGCCGTCATCGACCATTTTGAGAAATATGCCCGTACCGTCTTTAGCCGCTACAAAGACAAGGTGAAGCACTGGCTGACCTTCAATGAAATCAACATGGCGCTGCACTCGCCTTTTACGGGTATCGGGCTAAGCGGCGAACCCTCAAAACAGGATATTTATCAGGCCATCCACCATCAGTTGGTTGCCAGTGCACGCGTGGTGAAAGCCTGTCACGACATCATCCCTGATGCCAAAATCGGCAATATGCTGCTGGGCGCGGTGCGCTACCCCATGACCTGTCATCCGAAAGACGTACTGGAAGCGCAGAACAAGAATCGGGAATGGCTGTTCTTCGGTGACGTGCAAGTTCGCGGCACCTATCCGGCGTGGATTCAGCGTTATTTCCGGGAAAATGATGTCGAACTCACGATTACCGCGCAGGACAAAGACGATCTGAGCCACACCGTAGACTTTGTTTCATTCAGCTATTACATGAGTGGCTGTGCGACCTTCGAACCAGAAAAATACCAGTCTTCACGCGGCAACATCATCCGCCTGATTCCTAACCCGCATCTTGAAGCGTCCGAATGGGGCTGGCAGATCGACCCCGAAGGCCTGCGTTTCCTGTTGAATGAGCTGTATGACCGCTATCAGAAGCCGCTGTTCATCGTGGAAAACGGGTTAGGCGCACGCGACGTGGTGGAAGAGGATGGCAGCATTCACGACAGCTACCGCATTGATTATCTGCGTCGTCATCTGATTCAAGTACGTGAAGCCATCGACGATGGCGTCGAACTGCTTGGCTACACCAGTTGGGGGCCGATCGATCTAGTCAGCGCAGGCACGGCGCAGATGTCAAAACGCTACGGCTTTATTCATGTCGATCGTGATGATGAAGGCAAAGGCACGTTGGAACGTCGGCGCAAAGACAGCTTCTACTGGTATCAGGATGTGATTAGCAGTAACGGGAAATCGTTGTAA
Start codon is ATG
Stop codon is TAA

Notice that getting Biopython to extract the sequence has taken care of the reverse complement for us - this nucleotide sequence starts ATG (a start codon) and ends TAA (a stop codon).

Translating sequences

Once we have the nucleotides, it is just one line to get Biopython to translate it. By default this would use the "Standard" genetic code (for humans etc), so we should explicitly specify we want to use the bacterial table. The GenBank annotation tells us we should use NCBI translation table 11 - see the NCBI's list of genetic codes.

In [4]:
protein_sequence = gene_sequence.translate(table=11, cds=True)
print("Translated into amino acids:")
print(protein_sequence)
Translated into amino acids:
MKAFPDGFLWGGSVAANQVEGAWNEDGKGVSTSDLQLKGVHGPVTERDESISCIKDRAIDFYHQYPQDIQLFAEMGFKVLRTSIAWTRIYPEGNEAEPCEAGLAFYDHLFDEMAKHHIQPLITLSHYEMPYGLVKKLGGWGNRAVIDHFEKYARTVFSRYKDKVKHWLTFNEINMALHSPFTGIGLSGEPSKQDIYQAIHHQLVASARVVKACHDIIPDAKIGNMLLGAVRYPMTCHPKDVLEAQNKNREWLFFGDVQVRGTYPAWIQRYFRENDVELTITAQDKDDLSHTVDFVSFSYYMSGCATFEPEKYQSSRGNIIRLIPNPHLEASEWGWQIDPEGLRFLLNELYDRYQKPLFIVENGLGARDVVEEDGSIHDSYRIDYLRRHLIQVREAIDDGVELLGYTSWGPIDLVSAGTAQMSKRYGFIHVDRDDEGKGTLERRRKDSFYWYQDVISSNGKSL

Here we also told Biopython to interpret this as a complete CDS, meaning it checks there is a whole number of codons (the sequence is a multiple of three in length), verifies the last codon is a stop codon, and also ensures even if an alternative start codon is used it becomes a methione (M). In this case the start codon is the typical ATG as we saw above.

Let's double check that our translation matches the one given in the annotation:

In [5]:
print(protein_sequence == cds_feature.qualifiers["translation"][0])
True

In the same way, for all eight genes of interest, let's find the CDS feature, extract the nucleotide sequence, translate it, and print this on screen.

In [6]:
# This assumed you've already loaded the GenBank file as genome_record,
# and have the function get_cds_feature_with_qualifier_value defined.

old_tags = ["ECA0662", "ECA1451", "ECA1871", "ECA2166",
            "ECA3646", "ECA4387", "ECA4407", "ECA4432"]

for tag in old_tags:
    print("Looking at " + tag)
    cds_feature = get_cds_feature_with_qualifier_value(genome_record, "old_locus_tag", tag)
    ########################################################################
    # Fill in the code to extract the nucleotide sequence and translate it #
    ########################################################################
Looking at ECA0662
Looking at ECA1451
Looking at ECA1871
Looking at ECA2166
Looking at ECA3646
Looking at ECA4387
Looking at ECA4407
Looking at ECA4432

Saving FASTA sequences

The following code takes the ideas we've introduced above, and combines them with more advanced Python to save the nucleotide and protein sequences as two FASTA files.

In [7]:
# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO

def get_cds_feature_with_qualifier_value(seq_record, name, value):
    """Function to look for CDS feature by annotation value in sequence record.
    
    e.g. You can use this for finding features by locus tag, gene ID, or protein ID.
    """
    # Loop over the features
    for feature in genome_record.features:
        if feature.type == "CDS" and value in feature.qualifiers.get(name, []):
            return feature
    # Could not find it
    return None

genome_record = SeqIO.read("NC_004547.gbk", "genbank")

old_tags = ["ECA0662", "ECA1451", "ECA1871", "ECA2166",
            "ECA3646", "ECA4387", "ECA4407", "ECA4432"]

with open("nucleotides.fasta", "w") as nt_output, open("proteins.fasta", "w") as aa_output:
    for tag in old_tags:
        print("Looking at " + tag)
        cds_feature = get_cds_feature_with_qualifier_value(genome_record, "old_locus_tag", tag)
        gene_sequence = cds_feature.extract(genome_record.seq)
        protein_sequence = gene_sequence.translate(table=11, cds=True)
    
        # This is asking Python to halt if the translation does not match:
        assert protein_sequence == cds_feature.qualifiers["translation"][0]
 
        # Output FASTA records - note \n means insert a new line.
        # This is a little lazy as it won't line wrap the sequence:
        nt_output.write(">%s\n%s\n" % (tag, gene_sequence))
        aa_output.write(">%s\n%s\n" % (tag, gene_sequence))

print("Done")
Looking at ECA0662
Looking at ECA1451
Looking at ECA1871
Looking at ECA2166
Looking at ECA3646
Looking at ECA4387
Looking at ECA4407
Looking at ECA4432
Done