01 - FASTA format

Introduction

We've come up with a little example to motivate the specific sample data we will be using.

In the course of this workshop we're going to be looking at two forms of a lipase protein from the bacterium Proteus mirabilis: the natural wild-type and an engineered form of this enzyme.

To prepare for this, we first need to introduce some widely used sequence file formats. These are used for storing nucleotide and amino acid sequence data, up to and including entire genome sequences.

FASTA format

What is FASTA format?

The FASTA format (named after an early bioinformatics tool of the same name) uses a special > marker line to indicate the start of each sequence. This > header line should begin with an identifier, and then - optionally - a space and description (all on one line). The subsequent lines until the next > marker are the associated sequence data, usually wrapped to make them easier to read.
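To make that layout concrete, here is a minimal sketch of a reader that follows the description above. The function name read_fasta and the two example records are made up for illustration; this is not how Biopython does it, and it skips the error checking a real parser would need.

```python
import io


def read_fasta(handle):
    """Yield (identifier, description, sequence) for each record in FASTA text."""
    identifier = None
    description = ""
    chunks = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header line: hand back the previous record, if any
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            # The identifier is everything up to the first space,
            # the optional description is the rest of the line
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence data may be wrapped over many lines, so collect the pieces
            chunks.append(line.strip())
    # Do not forget the final record in the file
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# A made-up two-record example held in memory rather than in a file
example = io.StringIO(
    ">seq1 a made-up example record\n"
    "ACGTAC\n"
    "GTAC\n"
    ">seq2\n"
    "TTGACA\n"
)
records = list(read_fasta(example))
for identifier, description, sequence in records:
    print("%s (%s) length %i" % (identifier, description, len(sequence)))
```

Note how the wrapped lines of seq1 are joined back into a single sequence, and how seq2 has an identifier but no description.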

In a new terminal window, please change to the workshop's data directory using:

$ cd ~/2018-03-06-ibioic/01-introduction/data

If you list the *.fasta files, you should see:

$ ls *.fasta
engineered_nt.fasta     glycoside_hydrolases_nt.fasta wildtype_nt.fasta

The wildtype_nt.fasta file should look like this when viewed with the less command. Within less, press space to see the next page of text, and the letter q to quit.

$ less wildtype_nt.fasta
>wildtype lipase protein from Proteus mirabilis
ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAACGAGATCGTG
GGCTTCCCCTACTTCTACGGCATCGCCGACGCCCTGAGGCAGGACGGCCACCAGGTGTTC
ACCGCCAGCCTGAGCGCCTTCAACAGCAACGAGGTGAGGGGCAAGCAGCTGTGGCAGTTC
GTGCAGACCCTGCTGCAGGAGACCCAGGCCAAGAAGGTGAACTTCATCGGCCACAGCCAG
GGCCCCCTGGCCTGCAGGTACGTGGCCGCCAACTACCCCGACAGCGTGGCCAGCGTGACC
AGCATCAACGGCGTGAACCACGGCAGCGAGATCGCCGACCTGTACAGGAGGATCATGAGG
AAGGACAGCATCCCCGAGTACATCGTGGAGAAGGTGCTGAACGCCTTCGGCACCATCATC
AGCACCTTCAGCGGCCACAGGGGCGACCCCCAGGACGCCATCGCCGCCCTGGAGAGCCTG
ACCACCGAGCAGGTGACCGAGTTCAACAACAAGTACCCCCAGGCCCTGCCCAAGACCCCC
GGCGGCGAGGGCGACGAGATCGTGAACGGCGTGCACTACTACTGCTTCGGCAGCTACATC
CAGGGCCTGATCGCCGGCGAGAAGGGCAACCTGCTGGACCCCACCCACGCCGCCATGAGG
GTGCTGAACACCTTCTTCACCGAGAAGCAGAACGACGGCCTGGTGGGCAGGAGCAGCATG
AGGCTGGGCAAGCTGATCAAGGACGACTACGCCCAGGACCACATCGACATGGTGAACCAG
GTGGCCGGCCTGGTGGGCTACAACGAGGACATCGTGGCCATCTACACCCAGCACGCCAAG
TACCTGGCCAGCAAGCAGCTG

The engineered_nt.fasta file should look like this:

$ less engineered_nt.fasta
>engineered lipase protein from Proteus mirabilis
ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAGCGAGATCGTG
GGCTTCCCCTACTTCTACGGCATCGCCGACGCCCTGACCCAGGACGGCCACCAGGTGTTC
ACCGCCAGCCTGAGCGCCTTCAACAGCAACGAGGTGAGGGGCAAGCAGCTGTGGCAGTTC
GTGCAGACCATCCTGCAGGAGACCCAGACCAAGAAGGTGAACTTCATCGGCCACAGCCAG
GGCCCCCTGGCCTGCAGGTACGTGGCCGCCAACTACCCCGACAGCGTGGCCAGCGTGACC
AGCATCAACGGCGTGAACCACGGCAGCGAGATCGCCGACCTGTACAGGAGGATCATCAGG
AAGGACAGCATCCCCGAGTACATCGTGGAGAAGGTGCTGAACGCCTTCGGCACCATCATC
AGCACCTTCAGCGGCCACAGGGGCGACCCCCAGGACGCCATCGCCGCCCTGGAGAGCCTG
ACCACCGAGCAGGTGACCGAGTTCAACAACAAGTACCCCCAGGCCCTGCCCAAGACCCCC
TGCGGCGAGGGCGACGAGATCGTGAACGGCGTGCACTACTACTGCTTCGGCAGCTACATC
CAGGAGCTGATCGCCGGCGAGAACGGCAACCTGCTGGACCCCACCCACGCCGCCATGAGG
GTGCTGAACACCCTGTTCACCGAGAAGCAGAACGACGGCCTGGTGGGCAGGTGCAGCATG
AGGCTGGGCAAGCTGATCAAGGACGACTACGCCCAGGACCACTTCGACATGGTGAACCAG
GTGGCCGGCCTGGTGAGCTACAACGAGAACATCGTGGCCATCTACACCCTGCACGCCAAG
TACCTGGCCAGCAAGCAGCTG

Here we have two short FASTA files, each just 16 lines long, and each containing a single nucleotide sequence. By eye, the two sequences look almost identical.
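Spotting the differences by eye is error prone, so here is a small sketch of how they could be listed with plain Python. The helper name find_differences is made up for this example, and it assumes both sequences are the same length and differ only by substitutions (no insertions or deletions). The two strings are the first sequence line of each file as shown above.

```python
def find_differences(seq_a, seq_b):
    """Return (position, letter_a, letter_b) for each mismatch, counting from one."""
    if len(seq_a) != len(seq_b):
        raise ValueError("This simple comparison needs equal length sequences")
    return [
        (index + 1, a, b)
        for index, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# First sequence line from wildtype_nt.fasta and engineered_nt.fasta
wildtype_line = "ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAACGAGATCGTG"
engineered_line = "ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAGCGAGATCGTG"

for position, wt_letter, eng_letter in find_differences(wildtype_line, engineered_line):
    print("position %i: %s -> %s" % (position, wt_letter, eng_letter))
```

The same function could be given the two complete sequences once they have been loaded from the files.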

FASTA files can contain much longer sequences - like whole chromosomes.

FASTA files often contain multiple sequences - like all the proteins from a bacterium, all the gene coding sequences from a genome, or any hand-compiled set of nucleotide sequences of interest. Have a look at the third file, glycoside_hydrolases_nt.fasta, for comparison:

$ less glycoside_hydrolases_nt.fasta
>ECA0662 6-phospho-beta-glucosidase
ATGAAAGCATTCCCCGACGGATTTTTATGGGGCGGTTCAGTCGCAGCAAATCAGGTTGAA
GGGGCATGGAATGAAGACGGCAAAGGCGTGTCGACCTCCGATCTTCAGCTAAAGGGCGTG
CATGGTCCGGTGACAGAACGCGATGAGAGCATTAGCTGCATCAAAGATCGGGCAATCGAT
...

You should find this contains eight nucleotide sequences. They come from the genome of the bacterium Pectobacterium atrosepticum (originally known as Erwinia carotovora), accession NC_004547.2, which we'll look at soon.
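Since every record starts with a > header line, one quick way to check the number of sequences is to count those lines. The following is a minimal sketch using only standard Python; the function name count_fasta_records and the three-record example are made up for illustration.

```python
import io


def count_fasta_records(handle):
    """Count the lines starting with '>' in an open text file handle."""
    return sum(1 for line in handle if line.startswith(">"))


# A made-up three-record example held in memory rather than in a file
example = io.StringIO(">a first\nACGT\n>b second\nGGCC\nTTAA\n>c\nAC\n")
record_count = count_fasta_records(example)
print("%i records" % record_count)
```

To try this on the real data you could pass in a handle from open("glycoside_hydrolases_nt.fasta") instead, assuming you are in the data directory.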

Most bioinformatics tools for working on sequence data will accept FASTA format input.

Parsing FASTA format

Because the FASTA file format is relatively simple, some Python for Bioinformatics courses will take you through writing your own parser code. Instead we're going to use Biopython and cover some basic Python at the same time.

Biopython is a Python package - a collection of functions and other useful programming elements that is written and maintained by others, but made freely available for you to use in your own work.

In [1]:
# Text after a hash (#) is a comment in Python

# Loads the sequence input/output code from Biopython
from Bio import SeqIO

# This is a relative path: starting from the folder containing this
# notebook, the FASTA file is in the sub-directory data:
filename = "data/glycoside_hydrolases_nt.fasta"

# Using Biopython's SeqIO.parse(...) function with two arguments,
# the input filename and the file format, here "fasta" 
for record in SeqIO.parse(filename, "fasta"):
    # Python for loops use indentation, traditionally four spaces
    # These percentage signs are a common way for inserting values
    # into strings, %s for another string, %i for an integer number:
    print("%s length %i" % (record.id, len(record.seq)))

print("Done")
ECA0662 length 1389
ECA1451 length 1425
ECA1871 length 1395
ECA2166 length 1431
ECA3646 length 1437
ECA4387 length 1473
ECA4407 length 1398
ECA4432 length 1443
Done
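The percentage sign formatting used in the loop can also be tried on its own. As a small sketch, here it is next to two other ways of building the same string in Python 3; the identifier and length are taken from the first line of the output above.

```python
identifier = "ECA0662"
length = 1389

# Percent style, as used in the loop: %s for a string, %i for an integer
percent_style = "%s length %i" % (identifier, length)

# The str.format method fills in each {} placeholder in order
format_style = "{} length {}".format(identifier, length)

# An f-string (Python 3.6 onwards) puts the variable names inside the string
fstring_style = f"{identifier} length {length}"

print(percent_style)
print(format_style)
print(fstring_style)
```

All three build the same text, so which to use is mostly a matter of taste; this workshop material uses the percent style.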