BLAST+
Programmatically with Biopython¶BLAST+
using Python and Biopython in the Jupyter notebookBLAST+
command line object in Biopython, and its similarities with running BLAST+
at the terminalBLAST+
output into a Python variableBLAST+
outputThis notebook presents examples of methods for using BLAST
programmatically, with a local installation of BLAST
.
It can be very convenient to use a web BLAST in the browser - such as the NCBI interface, or a project-specific instance (like this one at the PGSC) - for BLAST
searches, but there can be limitations to this approach.
It may not be practical to submit a large number of simultaneous queries via a web form, either because the interface prevents this, or because it is tiresome to point-and-click over and over again. It may be that the interface does not make it easy to change custom options that you might want to modify to help refine your query. It could be the case that the web database does not contain sequences that you are interested in searching against (if, for example, some of the sequences are proprietary), or it might not be constrained to a relevant set of organisms, so the search might take much longer than it needs to for your purposes. If you need to repeat a query, it can be awkward to get the same settings every time, and it is possible to forget options from search to search; also the returned result may not describe how or when the search was run.
The code that we develop in this notebook will be adapted for use in the next notebook: 04-programming_web_blast
, but we focus on a local installation now to understand the main principles.
To interact with the local installation of BLAST
, we will use the free Biopython
programming tools. These provide an interface to interact with BLAST
, run jobs, and to read in the output files. To use these tools, we need to import them. We will need tools to perform specific functions:
os
module.BLAST
search results as dataframes/tables for analysis, we will use the pandas
package.seaborn
visualisation package.BLAST
, we will use the Bio.Blast.Applications
module from Biopython
.We import these tools, and some standard library packages for working with files (os
) below.
# Show plots as part of the notebook (this is a Jupyter-specific operation)
%matplotlib inline
import matplotlib.pyplot as plt
# Standard library packages
import os
# Import Pandas and Seaborn
import pandas as pd
import seaborn as sns
# Import Biopython tools for running local BLASTX
from Bio.Blast.Applications import NcbiblastxCommandline
You will use Python/biopython
in the code blocks below to first perform the BLASTX
search, then parse the results into a pandas
dataframe, and finally plot some summary statistics using seaborn
.
BLASTX
¶To create the command-line, you need to provide the same information as if you were running BLAST
at the terminal: the location of the query sequence file, the location of the database, and any arguments that modify the type of BLAST
search we are running.
Firstly, you will define two variables that contain the paths to the input data, and the location you want to place the BLAST
output file. Then you will define variables that contain paths to: the input query sequence file; the database you're searching against; and the file containing BLAST
output
#Â Define paths to input and output directories
datadir = os.path.join('data', 'kitasatospora') # input
outdir = os.path.join('output', 'kitasatospora') #Â output
os.makedirs(outdir, exist_ok=True) # create output directory if it doesn't exist
# Define paths to input and output files
query = os.path.join(datadir, 'k_sp_CB01950_penicillin.fasta') # query sequence(s)
db = os.path.join(datadir, 'kitasatospora_proteins.faa') # BLAST database
blastout = os.path.join(outdir, 'AMK19_00175_blastx_kitasatospora.tab') # BLAST output
# Get help with how to construct the command-line
help(NcbiblastxCommandline)
The information above tells how to pass the paths to the query sequence, database, and how to specify other values to control BLASTX
, e.g.:
cline = NcbiblastxCommandline(query="m_cold.fasta", db="nr", evalue=0.001)
You will provide the locations of the query sequence (the penicillin-binding protein), the database you're searching against (the proteins from five other Kitosatospora species), and a location to write the output.
# Create command-line for BLASTX
cmd_blastx = NcbiblastxCommandline(query=query, out=blastout, outfmt=6, db=db)
The cmd_blastx
object now contains instructions that are equivalent to running BLASTX
at the command-line. We can even get it to print out a command-line that we could copy-and-paste into the terminal, to run the search:
# Get a working command-line
print(cmd_blastx)
You don't need to use the terminal at all, though. You can run the BLASTX
search from Python, by calling the cmd_blastx
object, with:
cmd_blastx()
# Run BLASTX, and catch STDOUT/STDERR
# !! Do not execute cell if skipping computation !!
stdout, stderr = cmd_blastx()
# Check STDOUT, STDERR
print("STDOUT: %s" % stdout)
print("STDERR: %s" % stderr)
If everything has worked, there should be no information in either STDOUT
or STDERR
. You should, however now see a file named AMK19_00175_blastx_kitasatospora.tab
in the output/kitasatospora
directory. This file contains your BLASTX
search results, and we shall import and inspect these in the next section.
BLASTX
results¶We have already defined a variable called blastout
that holds the BLASTX
search output, so we can use this when we load the data.
# !! If you are skipping computational steps, uncomment the line below !!
#blastout = os.path.join('prepped', 'kitasatospora', 'AMK19_00175_blastx_kitasatospora.tab') # BLAST output
# Read BLASTX output
results = pd.read_csv(blastout, sep="\t", header=None)
Jupyter notebooks integrate well with pandas
dataframes, and it is straightforward to see the first few rows of the results table, by using the dataframe's head()
method:
# Inspect results table
results.head()
You should see that the table, like the output file, contains 11 columns. It also contains an additional index
(on the left), which uniquely labels each row of the table.
# Define column headers
headers = ['query', 'subject',
'pc_identity', 'aln_length', 'mismatches', 'gaps_opened',
'query_start', 'query_end', 'subject_start', 'subject_end',
'e_value', 'bitscore']
# Assign headers
results.columns = headers
# Inspect modified table
results.head()
Now the results are a little more readable. You can also use these column names directly to work with the information in them.
You can, for example, obtain a summary of the information in the table with the dataframe's .describe()
method:
# Show a summary of the results table data
results.describe()
You can also extract and work with specific columns, by naming them:
# Show all subject matches
print(results['subject'])
# Create a new column describing how long the alignment is on the query sequence
qaln_length = abs(results['query_end'] - results['query_start']) + 1
print(qaln_length)
# Add qaln_length to the results table as a new column
results['qaln_length'] = qaln_length
results.head()
Dataframes also have a .plot.<plot_type>()
method, which lets us plot information from the table directly.
For example, to generate a scatterplot, we can use:
results.plot.scatter(<X_AXIS>, <Y_AXIS>)
where we replace <X_AXIS>
and <Y_AXIS>
with the names of the columns we want to see on those axes, as below.
# Create a scatterplot
results.plot.scatter('pc_identity', 'e_value')
plt.title("E value vs %identity"); # add a title to the plot
There is a second BLASTX
query file in the directory data/kitasatospora
, called lantibiotic.fasta
. It describes the CDS for a suspected lantibiotic synthesis protein. To begin the analysis with this sequence, can you do the following?
# SOLUTION - EXERCISE 01
# !! Do not execute this cell if skipping computational step !!
# We can reuse the directories and db, but need to define new input/output filenames
query = os.path.join(datadir, 'lantibiotic.fasta') # query sequence(s)
blastout = os.path.join(outdir, 'lantibiotic_blastx_kitasatospora.tab') # BLAST output
# Create command-line for BLASTX
cmd_blastx = NcbiblastxCommandline(query=query, out=blastout, outfmt=6, db=db)
# Run BLASTX, and catch STDOUT/STDERR
stdout, stderr = cmd_blastx()
# Check STDOUT, STDERR
print("STDOUT: %s" % stdout)
print("STDERR: %s" % stderr)
# !! Uncomment the line below, if skipping computational step !!
# blastout = os.path.join('prepped', 'kitasatospora', 'lantibiotic_blastx_kitasatospora.tab')
# Read BLASTX output, and reuse the column headers defined earlier
results = pd.read_csv(blastout, sep="\t", header=None)
results.columns = headers
# Create a scatterplot
results.plot.scatter('bitscore', 'pc_identity')
plt.title("%identity vs bitscore"); # add a title to the plot