KEGG
The KEGG browser interface, while able to integrate searches across comprehensive and quite disparate datasets, does not always present the most convenient interface to extract that information (such as downloading FASTA sequences for an entry). As with all browser-based interfaces, it can also be tedious and time-consuming to point-and-click your way through a large number of searches.
As with all programmatic searches, there are a number of advantages to an automated approach:
- Where it is not practical to submit a large number of simultaneous queries via a web form (because it is tiresome to point-and-click over and over again), they can be handled programmatically instead.
- You have the opportunity to change custom options to refine your query, compared to the website interface.
- If you need to repeat a query, a programmatic approach makes it trivial to use the same settings every time.
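The first and third points can be sketched as a small loop that applies the same query function, with the same settings, to each item in a list. This is only an illustration: run_queries is a hypothetical helper rather than part of Biopython, and the one-second pause between requests is an arbitrary assumption, chosen to keep the load on a remote service modest.

```python
import time


def run_queries(queries, fetch, delay=1.0):
    """Apply `fetch` to each query in turn, pausing `delay` seconds between calls."""
    results = {}
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(delay)  # space the requests out
        results[query] = fetch(query)
    return results


# Stand-in fetch function so the sketch runs without a network connection
print(run_queries(["first", "second"], lambda q: q.upper(), delay=0))
```

In this notebook, fetch could be any function that takes a query string and returns the text of the response, such as one wrapping the Bio.KEGG.REST functions introduced below.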
The Biopython interface to KEGG has several other advantages that we will not cover in this lesson. In particular, it allows a much greater range of image manipulations for the pathway maps that KEGG returns.
Be warned also that the conditions of service include:
"This service should not be used for bulk data downloads".
# Show plots as part of the notebook
%matplotlib inline
# Show images and HTML inline
from IPython.display import HTML, Image
# Standard library packages
import io
import os
# Import Biopython modules to interact with KEGG
from Bio import SeqIO
from Bio.KEGG import REST
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
# Import Pandas, so we can use dataframes
import pandas as pd
In the cell below, we define a couple of useful functions that convert some returned output into Pandas dataframe form, and display .pdf images directly in the notebook.
# A bit of code that will help us display the PDF output
def PDF(filename):
    return HTML('<iframe src=%s width=700 height=350></iframe>' % filename)

# Some code to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)
KEGG query
To run a query at KEGG, you call one of the functions in Bio.KEGG.REST, and catch the result in a variable. The available functions are:
- kegg_conv() - convert identifiers from KEGG to those for other databases
- kegg_find() - find KEGG entries with matching query data
- kegg_get() - retrieve data for a specific entry from KEGG
- kegg_info() - get information about a KEGG database
- kegg_link() - find entries in KEGG using a database cross-reference
- kegg_list() - list entries in a database
The generic form of using these functions to get information from KEGG and place the output in the variable myvar is:
myvar = REST.<function>(<query>, <arg1>, <arg2>, ...).read()
where <function> is one of the functions above, <query> is a string containing your query for KEGG, and <arg1>, <arg2> and so on are arguments that may be required for some of the functions.
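Behind the scenes, each of these functions requests a URL from the KEGG REST service, built from the operation name and its arguments. The sketch below only assembles such URLs as strings, without contacting KEGG. Note that kegg_url is a hypothetical helper, and the base address is an assumption about where the service lives, so treat the result as illustrative rather than as Biopython's own implementation.

```python
# Assumed base address of the KEGG REST service
KEGG_BASE_URL = "https://rest.kegg.jp"


def kegg_url(operation, *args):
    """Build a URL of the form <base>/<operation>/<arg1>/<arg2>/..."""
    parts = [operation] + [str(arg) for arg in args if arg is not None]
    return KEGG_BASE_URL + "/" + "/".join(parts)


print(kegg_url("info", "kegg"))     # https://rest.kegg.jp/info/kegg
print(kegg_url("list", "pathway"))  # https://rest.kegg.jp/list/pathway
```

Seeing the request as a URL can help when debugging, because the same address can be pasted into a browser to inspect the raw response.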
You will use some of these functions in the notebook cells below to get information from KEGG.
kegg_info()
For instance, to get information about the KEGG databases as a whole, you can use kegg_info("kegg") to get a handle from KEGG describing the databases, then .read() the handle and catch the resulting text in a variable:
result = REST.kegg_info("kegg").read()
We could try to convert this text to a Pandas dataframe with the to_df() function defined above:
to_df(result)
or print it to output directly with the print() function:
print(result)
# Perform the query
result = REST.kegg_info("kegg").read()
# Print the result
print(result)
# Convert result to dataframe
# NOTE: kegg_info() requests do not produce a suitable data format for dataframe representation
#to_df(result)
This gives us an overview of the available resources similar to that on the KEGG landing page. However, the kegg_info() function is a little more powerful, as it can find information about specific databases:
# Print information about the PATHWAY database
result = REST.kegg_info("pathway").read()
print(result)
and even about specific organisms (identified with their three-letter code):
# Print information about Kitasatospora setae
result = REST.kegg_info("ksk").read()
print(result)
kegg_list()
For example, to list all the entries in the PATHWAY database, you could use:
# Get all entries in the PATHWAY database as a dataframe
result = REST.kegg_list("pathway").read()
to_df(result)
and to restrict the results only to those pathways that are present in K. setae, you can filter the database results with a query string ksk, as the second argument:
# Get all entries in the PATHWAY database for K. setae as a dataframe
result = REST.kegg_list("pathway", "ksk").read()
to_df(result)
If, instead of specifying one of the top-level KEGG databases, you specify an organism code, KEGG will return a list of gene entries for that organism:
# Get all genes from K. setae as a dataframe
result = REST.kegg_list("ksk").read()
to_df(result)
kegg_find()
For instance, to query the GENES database with the entry accession KSE_17560 you could use:
# Find a specific entry with a precise search term
result = REST.kegg_find("genes", "KSE_17560").read()
to_df(result)
With the query above, KEGG returns information for the exact entry we've requested. But we can also use less precise search terms, and combine them with the + symbol. For example, to search for shiga toxin we would use the query:
"shiga+toxin"
# Find all shiga toxin genes
result = REST.kegg_find("genes", "shiga+toxin").read()
to_df(result)
We can restrict this search to specific organisms, such as Escherichia coli O111 H-11128 (EHEC), by supplying its three-letter code (here, eoi) as the database to be searched:
# Find all shiga toxin genes in eoi
result = REST.kegg_find("eoi", "shiga+toxin").read()
to_df(result)
The kegg_find() query string can also search in specific fields of the entry. The format for this is:
"<query_value>/<field>"
So, to search for all compounds with a molecular weight between 300 and 310 mass units, you can use the code:
# Find all compounds with mass between 300 and 310 units
result = REST.kegg_find("compound", "300-310/mol_weight").read()
to_df(result)
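The two query-string conventions used above (joining search words with + and appending /<field> to a value) are easy to mistype when queries are built from variables. As a minimal sketch, they can be wrapped in small helpers; keyword_query, field_query and range_query are hypothetical names, not Biopython functions, and they only build strings.

```python
def keyword_query(*words):
    """Join search words with '+', the separator used in the queries above."""
    return "+".join(word.strip() for word in words)


def field_query(value, field):
    """Build a '<query_value>/<field>' query string."""
    return f"{value}/{field}"


def range_query(low, high, field):
    """Build a range query on a field, such as a molecular weight window."""
    return field_query(f"{low}-{high}", field)


print(keyword_query("shiga", "toxin"))       # shiga+toxin
print(range_query(300, 310, "mol_weight"))   # 300-310/mol_weight
```

The returned strings can then be passed as the second argument to kegg_find(), in the same way as the literal strings in the cells above.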
kegg_get()
Most functions you've seen so far will return two columns of data: the first column being the entry accession, and the second column being a description of that entry, or the requested value.
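If you want to work with those two columns without building a dataframe, a minimal sketch using only the standard library is shown here. It assumes the result text is tab-separated with one entry per line; parse_two_columns is a hypothetical helper, and the sample text is a made-up placeholder rather than real KEGG output.

```python
def parse_two_columns(text):
    """Split tab-separated text into (accession, description) pairs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        accession, _, description = line.partition("\t")
        rows.append((accession, description))
    return rows


# Made-up placeholder text in the assumed two-column layout
sample = "db:ENTRY1\tfirst description\ndb:ENTRY2\tsecond description\n"
for accession, description in parse_two_columns(sample):
    print(accession, "->", description)
```

The accession in the first element of each pair is the value you would pass on to a follow-up query for that entry.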
For example, the first compound in our search for molecular weights in the range 300-310 above has entry accession cpd:C00051. We can recover this entry as follows:
# Get the entry information for cpd:C00051
result = REST.kegg_get("cpd:C00051").read()
print(result)
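The entry is returned as plain text, so picking out a single field means parsing it yourself. The sketch below groups the lines of one entry by field name. It rests on an assumption about the layout: that field names occupy the first 12 columns, that continuation lines leave those columns blank, and that the entry ends with a /// line. parse_flat_entry is a hypothetical helper, and the sample entry is a made-up placeholder, not real KEGG output.

```python
def parse_flat_entry(text):
    """Group the lines of one flat-file entry by their field name."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):
            break  # assumed end-of-entry marker
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            current = key  # a new field starts on this line
        if current is None:
            continue
        fields.setdefault(current, [])
        if value:
            fields[current].append(value)
    return fields


# Made-up placeholder entry in the assumed layout
sample = (
    "ENTRY".ljust(12) + "X00001\n"
    + "NAME".ljust(12) + "first name;\n"
    + " " * 12 + "second name\n"
    + "///\n"
)
print(parse_flat_entry(sample))
```

Indented sub-field names, where an entry has them, are treated here as fields in their own right, which is a simplification.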
KEGG provides a number of different entry types, which cannot all be recovered in exactly the same ways. For instance, the COMPOUND entries typically have an associated molecular structure image, which can be recovered with kegg_get() by specifying the format to be "image":
# Display molecular structure for cpd:C00051
result = REST.kegg_get("cpd:C00051", "image").read()
Image(result)
GENE entries are sequences, so can be recovered as their database entries (default), or as FASTA format nucleotide and/or protein sequences:
# Get entry information for KSE_17560
result = REST.kegg_get("ksk:KSE_17560").read()
print(result)
# Get coding sequence for KSE_17560
result = REST.kegg_get("ksk:KSE_17560", "ntseq").read()
print(result)
# Get protein sequence for KSE_17560
result = REST.kegg_get("ksk:KSE_17560", "aaseq").read()
print(result)
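The sequence results above are FASTA-formatted text. If you only need the header and the sequence as plain strings, a minimal sketch without extra dependencies is shown here; read_single_fasta is a hypothetical helper, and it assumes the text holds exactly one record. For anything more involved, the Bio.SeqIO module imported at the top of this notebook is the better tool (for example, SeqIO.read(io.StringIO(result), "fasta") should return a full sequence record).

```python
def read_single_fasta(text):
    """Return (header, sequence) from FASTA text holding exactly one record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("text does not look like a FASTA record")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("expected a single FASTA record")
    # Drop the leading '>' and join the wrapped sequence lines
    return lines[0][1:], "".join(lines[1:])


# Made-up placeholder record, not a real KEGG sequence
header, sequence = read_single_fasta(">example description\nATGAAA\nCCCTAA\n")
print(header, len(sequence))
```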
To specify one of the generic pathway maps, you can combine the map prefix with the pathway number to make the query mapNNNNN as in the cells below.
# Get map of fatty-acid biosynthesis
result = REST.kegg_get("map00061", "image").read()
Image(result)
# Get map of central metabolism
result = REST.kegg_get("map01100", "image").read()
Image(result)
If you want to retrieve the pathway map corresponding to a particular organism, then you can replace the prefix map with the three-letter code for that organism, as in the examples below for Kitasatospora where map is replaced with ksk:
# Get map of fatty-acid biosynthesis in Kitasatospora
result = REST.kegg_get("ksk00061", "image").read()
Image(result)
# Get map of central metabolism in Kitasatospora
result = REST.kegg_get("ksk01100", "image").read()
Image(result)
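The prefix swap used in these cells can be written as a one-line string operation, which is handy when looping over several pathways or organisms. This is a minimal sketch: organism_pathway is a hypothetical helper, and the check that identifiers look like map followed by five digits is an assumption based on the mapNNNNN form described above.

```python
def organism_pathway(map_id, organism):
    """Swap the generic 'map' prefix for an organism code."""
    looks_generic = (
        map_id.startswith("map") and len(map_id) == 8 and map_id[3:].isdigit()
    )
    if not looks_generic:
        raise ValueError(f"expected an identifier like 'map00061', got {map_id!r}")
    return organism + map_id[3:]


print(organism_pathway("map00061", "ksk"))  # ksk00061
print(organism_pathway("map01100", "ksk"))  # ksk01100
```

Not every generic pathway exists for every organism, so a query built this way may still return nothing.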
KEGG provides copious information about pathways in the accompanying database entries, which can be obtained by not providing a download format:
# Get data for fatty-acid biosynthesis in Kitasatospora
result = REST.kegg_get("ksk00061").read()
print(result)
As you can see from the database entry for ksk00061 above, the pathway is composed of many GENE and COMPOUND entries, but the returned data format is not easy to work with to extract that data.
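kegg_link()
The kegg_link() function finds entries in a target database that are cross-referenced to a given entry. Its general form is:
result = REST.kegg_link(<database>, <entry>).read()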
For instance, to identify the COMPOUND entries represented in the map00061 pathway, you would compose the query:
result = REST.kegg_link("compound", "map00061").read()
as below:
# Get compounds involved with fatty-acid biosynthesis (generic pathway map00061)
result = REST.kegg_link("compound", "map00061").read()
to_df(result)
You can use any of the databases in KEGG with this function, though not all may give you a result for any given query.
You can use this function to query generic pathways against the very useful reference databases of KEGG:
- ko: KEGG orthologues - a collection of functional orthologues
- ec: EC numbers - a collection of Enzyme Commission classifications
- rn: REACTION entries - descriptions of chemical interconversions
For example, to identify reactions that are involved in the fatty-acid synthesis pathway, and then get the database entry for one of these, you could use the queries in the cells below:
# Get reactions involved with fatty-acid biosynthesis
result = REST.kegg_link("rn", "map00061").read()
to_df(result)
# Get reaction R00742
result = REST.kegg_get("R00742").read()
print(result)
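Chaining a link query into follow-up entry queries, as in the two cells above, means collecting the linked identifiers first. A minimal sketch of that step is shown here, assuming the link result is tab-separated text with the source entry in the first column and the linked entry in the second. group_links is a hypothetical helper, and the sample text is a made-up placeholder, not real KEGG output.

```python
def group_links(text):
    """Map each source entry to the list of target entries linked to it."""
    links = {}
    for line in text.splitlines():
        source, sep, target = line.partition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        links.setdefault(source.strip(), []).append(target.strip())
    return links


# Made-up placeholder text in the assumed two-column layout
sample = "path:EXAMPLE\trn:AAA\npath:EXAMPLE\trn:BBB\n"
for source, targets in group_links(sample).items():
    print(source, targets)
```

Each target identifier in the resulting lists could then be passed to kegg_get() in turn, keeping the number of requests modest.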
The UniProt record Q05655 describes a human protein kinase. Using KEGG, can you discover: