To recap, we've used the NCBI Entrez Programming Utilities via Biopython's Bio.Entrez
to download the Pectobacterium atrosepticum genome in GenBank format as NC_004547.gbk
.
Also, you have been provided a FASTA file glycoside_hydrolases_nt.fasta
which we've said should contain genes from this genome, named using the old locus tags:
What we're going to do now is recreate that FASTA file by loading the GenBank file into Python, and pull out the nucleotide coding sequence for these genes using the CDS feature with the desired locus tag. We'll also translate them into proteins.
We'll use Biopython to parse each genome, which gives all the features as a list. We'll then loop over the list of features to find the desired CDS features:
# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO
def get_cds_feature_with_qualifier_value(seq_record, name, value):
"""Function to look for CDS feature by annotation value in sequence record.
e.g. You can use this for finding features by locus tag, gene ID, or protein ID.
"""
# Loop over the features
for feature in genome_record.features:
if feature.type == "CDS" and value in feature.qualifiers.get(name, []):
return feature
# Could not find it
return None
genome_record = SeqIO.read("NC_004547.gbk", "genbank")
cds_feature = get_cds_feature_with_qualifier_value(genome_record, "old_locus_tag", "ECA0662")
print(cds_feature)
Using your text editor you can compare this to the matching section in the GenBank file NC_004547.gbk
were you should find:
CDS complement(736847..738235)
/locus_tag="ECA_RS03295"
/old_locus_tag="ECA0662"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_010282995.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="6-phospho-beta-glucosidase"
/protein_id="WP_039289952.1"
/translation="MKAFPDGFLWGGSVAANQVEGAWNEDGKGVSTSDLQLKGVHGPV
TERDESISCIKDRAIDFYHQYPQDIQLFAEMGFKVLRTSIAWTRIYPEGNEAEPCEAG
LAFYDHLFDEMAKHHIQPLITLSHYEMPYGLVKKLGGWGNRAVIDHFEKYARTVFSRY
KDKVKHWLTFNEINMALHSPFTGIGLSGEPSKQDIYQAIHHQLVASARVVKACHDIIP
DAKIGNMLLGAVRYPMTCHPKDVLEAQNKNREWLFFGDVQVRGTYPAWIQRYFRENDV
ELTITAQDKDDLSHTVDFVSFSYYMSGCATFEPEKYQSSRGNIIRLIPNPHLEASEWG
WQIDPEGLRFLLNELYDRYQKPLFIVENGLGARDVVEEDGSIHDSYRIDYLRRHLIQV
REAIDDGVELLGYTSWGPIDLVSAGTAQMSKRYGFIHVDRDDEGKGTLERRRKDSFYW
YQDVISSNGKSL"
Note that because of how we downloaded NC_004547.gbk
using NCBI Entrez, the CDS features include the protein translation. Depending on the settings, the NCBI website does not always show the translations here.
The first line of a GenBank (or EMBL) feature gives the co-ordinates. Simple cases on the forward strand would look like 46146..47009
or 1063081..1063944
meaning start..end
using inclusive one-based counting.
Here we have complement(736847..738235)
where complement(end..start)
is for features on the reverse strand. Here the start of the gene is at position 738235
while the stop codon should be the three bases ending at position 736847
. If you still have Artemis open, you can check this.
More complicated CDS locations using join(...)
are common in eurkaryotes to describe slicing.
The way that Python (and many other programming languages) slices strings or arrays of data counts from zero but excludes the end point, which is why in Biopython these locations seem to begin one base earlier.
print(cds_feature.location)
Rather than messing about with the start/end coordinates ourselves (and worrying about counting from zero or counting from one), we can get Biopython to apply the location information from the feature to extract the described region of the genome sequence.
gene_sequence = cds_feature.extract(genome_record.seq)
print("CDS nucleotide sequence:")
print(gene_sequence)
print("Start codon is %s" % gene_sequence[:3]) # Python's way to get first three letters
print("Stop codon is %s" % gene_sequence[-3:]) # Python trick for last three letters
Notice that getting Biopython to extract the sequence has taken care of the reverse complement for us - this nucleotide sequence starts ATG
(a start codon) and ends TAA
(a stop codon).
Once we have the nucleotides, it is just one line to get Biopython to translate it. By default this would use the "Standard" genetic code (for humans etc), so we should explicitly specify we want to use the bacterial table. The GenBank annotation tells us we should use NCBI translation table 11 - see the NCBI's list of genetic codes.
protein_sequence = gene_sequence.translate(table=11, cds=True)
print("Translated into amino acids:")
print(protein_sequence)
Here we also told Biopython to interpret this as a complete CDS, meaning it checks there is a whole number of codons (the sequence is a multiple of three in length), verifies the last codon is a stop codon, and also ensures even if an alternative start codon is used it becomes a methione (M
). In this case the start codon is the typical ATG
as we saw above.
Let's double check that our translation matches the one given in the annotation:
print(protein_sequence == cds_feature.qualifiers["translation"][0])
# This assumed you've already loaded the GenBank file as genome_record,
# and have the function get_cds_feature_with_qualifier_value defined.
old_tags = ["ECA0662", "ECA1451", "ECA1871", "ECA2166",
"ECA3646", "ECA4387", "ECA4407", "ECA4432"]
for tag in old_tags:
print("Looking at " + tag)
cds_feature = get_cds_feature_with_qualifier_value(genome_record, "old_locus_tag", tag)
########################################################################
# Fill in the code to extract the nucleotide sequence and translate it #
########################################################################
The following code takes the ideas we've introduced above, and combines them with more advanced Python to save the nucleotide and protein sequences as two FASTA files.
# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO
def get_cds_feature_with_qualifier_value(seq_record, name, value):
"""Function to look for CDS feature by annotation value in sequence record.
e.g. You can use this for finding features by locus tag, gene ID, or protein ID.
"""
# Loop over the features
for feature in genome_record.features:
if feature.type == "CDS" and value in feature.qualifiers.get(name, []):
return feature
# Could not find it
return None
genome_record = SeqIO.read("NC_004547.gbk", "genbank")
old_tags = ["ECA0662", "ECA1451", "ECA1871", "ECA2166",
"ECA3646", "ECA4387", "ECA4407", "ECA4432"]
with open("nucleotides.fasta", "w") as nt_output, open("proteins.fasta", "w") as aa_output:
for tag in old_tags:
print("Looking at " + tag)
cds_feature = get_cds_feature_with_qualifier_value(genome_record, "old_locus_tag", tag)
gene_sequence = cds_feature.extract(genome_record.seq)
protein_sequence = gene_sequence.translate(table=11, cds=True)
# This is asking Python to halt if the translation does not match:
assert protein_sequence == cds_feature.qualifiers["translation"][0]
# Output FASTA records - note \n means insert a new line.
# This is a little lazy as it won't line wrap the sequence:
nt_output.write(">%s\n%s\n" % (tag, gene_sequence))
aa_output.write(">%s\n%s\n" % (tag, gene_sequence))
print("Done")