02 - Programming for UniProt (browser/notebook)

Table of Contents

  1. Introduction
  2. Python imports
  3. Running a remote UniProt query
    1. Connecting to UniProt
    2. Constructing a query
    3. Perform the query
    4. EXAMPLE: Putting it together
  4. Advanced queries
    1. key:value queries
    2. Exercise 01
    3. Combining queries
    4. Exercise 02
    5. Combining terms with Boolean logic
  5. Processing query results
    1. Tabular
    2. Converting to a dataframe
    3. Excel
    4. FASTA sequence
    5. pandas dataframe
    6. Exercise 03

Introduction

The UniProt browser interface is very powerful, but you will have noticed from the previous exercises that even the most complex queries can be converted into a single string that describes the search being made of the UniProt databases. Using the browser interface, this string is generated for you, and placed into the search field at the top of the UniProt webpage every time you run a search.

Pointing and clicking through a large number of browser-based searches is tedious and time-consuming. By combining the UniProt webservice, the search strings you've already seen, and a Python module called bioservices, we can compose and run as many searches as we like using a small amount of code, and pull the results of those searches down to our local machines.

This notebook presents examples of using UniProt programmatically, via a webservice. You will control the searches using Python code in the notebook itself.

There are a number of advantages to this approach:

  • It is easy to set up repeatable searches for many sequences, or collections of sequences
  • It is easy to read in the search results and conduct downstream analyses that add value to your search

Submitting a large number of queries through a web form is impractical, because pointing and clicking over and over again is tiresome; such batches can be handled programmatically instead. A programmatic approach also lets you change custom options to refine your query in ways the website interface may not expose, and makes it trivial to apply the same settings every time you repeat a query.

Python imports

To use the Python programming language to query UniProt, we have to import helpful packages (collections of Python code that perform specialised tasks).
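The later examples in this notebook use the names io, pd, sns and UniProt, so a plausible set of imports is sketched in the comments below. This is an assumption inferred from those names, not a definitive list. The runnable part uses only the standard library, and simply reports any of the third-party packages that are not yet installed.

```python
import importlib.util

# Third-party packages the later examples appear to rely on (an assumption
# inferred from the names pd, sns and UniProt used in this notebook).
REQUIRED = ["bioservices", "pandas", "seaborn"]

# find_spec() returns None when a top-level package is not installed
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Install these packages before continuing:", ", ".join(missing))
else:
    print("All required packages are available")

# Once everything is installed, the imports themselves would look like:
#   import io                        # standard library: in-memory text "files"
#   import pandas as pd              # dataframes
#   import seaborn as sns            # statistical plots
#   from bioservices import UniProt  # UniProt webservice client
```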

Running a remote UniProt query

There are three key steps to running a remote UniProt query with bioservices:
  1. Make a link to the UniProt webservice
  2. Construct a query string
  3. Send the query to UniProt, and catch the result in a variable

Once the search result is caught and contained in a variable, that variable can be processed in any way you like, written to a file, or ignored.

Connecting to UniProt

To open a connection to UniProt, you make an instance of the UniProt() class from bioservices. This can be made to be persistent so that, once a single connection to the database is created, you can interact with it over and over again to make multiple queries.

To make a persistent instance, you can assign UniProt() to a variable:
service = UniProt() # it is good practice to have a meaningful variable name

Constructing a query

UniProt allows for the construction of complex searches by combining fields. A full discussion is beyond the scope of this lesson, but you will have seen in the preceding notebook that the searches you constructed by pointing and clicking on the UniProt website were converted into text in the search field at the top.

To describe the format briefly: there is a set of defined keywords (or keys) that indicate the specific type of data you want to search in (such as host, annotation, or sequence length). Each key is combined with the particular value you want to search for (such as mouse, or 40674) in a key:value pair, separated by a colon, such as host:mouse or ec:3.2.1.23.

If you provide a string, instead of a key:value pair, UniProt will search in all fields for your search term.

Programmatically, we construct the query as a string, e.g.

query = "Q9AJE3"  # this query means we want to look in all fields for Q9AJE3
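Because a query is only a string, key:value terms can be assembled with ordinary string formatting. The helper below is a minimal sketch; field_query is a hypothetical name introduced here for illustration, and it does not check that the key is one UniProt recognises.

```python
def field_query(key, value):
    """Build a key:value search term from a field name and a value."""
    return f"{key}:{value}"


# Terms taken from the examples in the text above
print(field_query("host", "mouse"))    # host:mouse
print(field_query("ec", "3.2.1.23"))   # ec:3.2.1.23

# A bare string with no key is searched for in all fields
query = "Q9AJE3"
```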

Perform the query

To send the query to UniProt, you will use the .search() method of your active instance of the UniProt() class.

If you have assigned your instance to the variable service (as above), then you can run the query string as a remote search with the line:
result = service.search(query)  # Run a query and catch the output in result

In the line above, the output of the search (i.e. your result) is stored in a new variable (created when the search is complete) called result. It is good practice to make variable names short and descriptive - this makes your code easier to read.

EXAMPLE: Putting it together

The code in the cell below uses the example code above to create an instance of the UniProt() class, and uses it to submit a pre-stored query to the UniProt service, then catch the result in a variable called result. The print() statement then shows us what the result returned by the service looks like.
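A minimal sketch of such a cell is given here, assuming bioservices is installed and a network connection is available. The function run_example is a hypothetical wrapper introduced for illustration; the lines that contact UniProt are left as comments so that the function can be read and tried separately.

```python
def run_example(service, query="Q9AJE3"):
    """Submit one query over an existing connection and show the result."""
    result = service.search(query)   # send the query, catch the output
    print(result)                    # show what the service returned
    return result


# In the notebook (contacts the remote UniProt service):
#   from bioservices import UniProt
#   service = UniProt()              # make a persistent connection
#   result = run_example(service)    # search all fields for Q9AJE3
```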

The UniProt() instance defined in the cell above is persistent, so you can reuse it to make another query, as in the cell below:
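One way to sketch this reuse is a small helper that sends several query strings through the same connection. The name run_queries is hypothetical, and the helper assumes only that the service object has the .search() method described above.

```python
def run_queries(service, queries):
    """Run several searches over one connection; return {query: result}."""
    results = {}
    for query in queries:
        # The same service object is used for every search
        results[query] = service.search(query)
    return results


# In the notebook (contacts the remote UniProt service):
#   results = run_queries(service, ["Q9AJE3", "Q01844"])
#   print(results["Q01844"])
```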

Advanced queries

The examples above built queries that were simple strings. They did not exploit the key:value search structure, or combine search terms. In this section, you will explore some queries that use the UniProt query fields, and combine them into powerful, filtering searches.

key:value queries

As noted above (and at http://www.uniprot.org/help/query-fields) particular values of specific data can be requested by using key:value pairs to restrict searches to named fields in the UniProt database.

As a first example, you will note that the result returned for the query "Q01844" has multiple entries. Only one of these is the sequence with accession value equal to "Q01844", but the other entries make reference to this sequence somewhere in their database record. If we want to restrict our result only to the particular entry "Q01844", we can specify the field we want to search as accession, and build the following query:

query = "accession:Q01844"  # specify a search on the accession field

Note that we can use the same variable name query as earlier (this overwrites the previous value in query). The code below runs the search and shows the output:

By using this and other key:value constructions, we can refine our searches to give us only the entries we're interested in.
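A quick way to see the effect of such a restriction is to count the entries in each result. The sketch below assumes the default tabular output, with one header line followed by one line per entry; count_entries is a hypothetical helper name.

```python
def count_entries(result):
    """Count entries in tabular search output (one header line assumed)."""
    lines = [line for line in result.splitlines() if line.strip()]
    # Everything after the header line is an entry
    return max(len(lines) - 1, 0)


# In the notebook (contacts the remote UniProt service):
#   print(count_entries(service.search("Q01844")))            # all fields
#   print(count_entries(service.search("accession:Q01844")))  # accession only
```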

Exercise 01 (10min)

Using key:value searches, can you find and download sets of entries for proteins that satisfy the following requirements? (HINT: the list of UniProt query fields at http://www.uniprot.org/help/query-fields may be helpful):

  • Have publications authored by someone with the surname Broadhurst
  • Have protein length between 9000aa and 9010aa
  • Derive from the taipan snake
  • Have been found in the wing

Combining queries

Combining terms in a UniProt query can be as straightforward as putting them in the same string, separated by a space.

For example:

query = "organism:rabbit tissue:eye"

will search for all entries deriving from rabbits that are found in the eye.

Exercise 02 (10min)

Using key:value searches, can you find and download sets of entries for proteins that satisfy the following requirements? (HINT: the list of UniProt query fields at http://www.uniprot.org/help/query-fields may be helpful):

  • Found in sheep spleen
  • Have "rxlr" in their name, have a publication with author name Pritchard, and are between 70aa and 80aa in length
  • Derive from a quokka and have had their annotations manually reviewed
  • Are found in cell membranes of horse heart tissue, and have had their annotations manually reviewed

Combining terms with Boolean logic

Boolean logic allows you to combine search terms with each other in arbitrary ways using three operators, specifying whether:

  • both terms must be present (AND)
  • either term may be present (OR)
  • a term must not be present (NOT)

Searches are read from left to right, but the logic of a search can be controlled by placing the combinations you want to resolve first in parentheses (()). Combining these operators can build some extremely powerful searches. For example, to get all proteins from horses and sheep, identified in the ovary, and having length greater than 200aa, you could use the query:

query = "tissue:ovary AND (organism:sheep OR organism:horse) NOT length:[1 TO 200]"
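When a search has several parts, building it from smaller pieces can make the logic easier to check. The sketch below reproduces the query above; any_of and length_between are hypothetical helper names introduced for illustration, and the range syntax is copied from the example rather than from the UniProt documentation.

```python
def any_of(*terms):
    """Join terms with OR, wrapped in parentheses so they resolve first."""
    return "(" + " OR ".join(terms) + ")"


def length_between(low, high):
    """Build a length range term in the form used in the text above."""
    return f"length:[{low} TO {high}]"


species = any_of("organism:sheep", "organism:horse")
query = f"tissue:ovary AND {species} NOT {length_between(1, 200)}"
print(query)
```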

Processing query results

So far you have worked with the default output from bioservices, although you know from the previous notebook that UniProt can provide output in a number of useful formats for searches in the browser.

The default output is tabular, and gives a good idea of the nature and content of the entries you recover. In this section, you will see some ways to download search results in alternative formats, which can be useful for analysis.

All the output format options are controlled in a similar way, using the frmt=<format> argument when you conduct your search - with <format> being one of the allowed terms (see the bioservices documentation for a full list).

Tabular

The default datatype is the most flexible datatype for download: tabular.

This can be specified explicitly with the tab format:

result = service.search(query, frmt="tab")

By default, the columns that are returned are: Entry, Entry name, Status, Protein names, Gene names, Organism, and Length. But these can be modified by passing the columns=<list> argument, where the <list> is a comma-separated list of column names. For example:

columnlist = "id,entry name,length,organism,mass,domains,domain,pathway"
result = service.search(query, frmt="tab", columns=columnlist)

The list of allowed column names can be found by inspecting the content of the variable service._valid_columns.
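The two ideas above can be combined in a small wrapper that accepts a Python list of column names, checks them against service._valid_columns when that attribute is present, and joins them into the comma-separated string that the columns argument expects. This is a sketch only: search_table is a hypothetical name, and the "tab" format and column names follow the text above, so they may differ between bioservices releases.

```python
def search_table(service, query, columns):
    """Run a tabular search, returning only the named columns."""
    # _valid_columns is the attribute named in the text; skip the check if
    # this version of the service object does not provide it
    valid = getattr(service, "_valid_columns", None)
    if valid is not None:
        unknown = [col for col in columns if col not in valid]
        if unknown:
            raise ValueError(f"Unrecognised column names: {unknown}")
    # The columns argument is a single comma-separated string
    return service.search(query, frmt="tab", columns=",".join(columns))


# In the notebook (contacts the remote UniProt service):
#   result = search_table(service, query, ["id", "entry name", "length"])
```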

Converting to a dataframe

The pandas module allows us to process tabular data into dataframes, just like in R.

To do this, we wrap the downloaded results in io.StringIO(), which lets pandas read the string as though it were a file:

df = pd.read_table(io.StringIO(result))

Doing this will produce a pandas dataframe that can be manipulated and analysed just like any other dataframe. We can, for instance, view a histogram of sequence lengths from the table above:
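As a hedged sketch, the function below produces a text version of that histogram by counting how many sequence lengths fall into equal-width bins. It assumes the tabular result contains the default Length column; length_histogram is a hypothetical name. In a notebook with matplotlib installed, df["Length"].hist() would draw the same information as a plot.

```python
import io

import pandas as pd


def length_histogram(result, bins=10):
    """Count sequence lengths per equal-width bin from tabular output."""
    # StringIO wraps the string so pandas can read it as if it were a file
    df = pd.read_csv(io.StringIO(result), sep="\t")
    # value_counts(bins=...) groups numeric values into intervals;
    # sort=False keeps the bins in order of length rather than of count
    return df["Length"].value_counts(bins=bins, sort=False)


# In the notebook, with result from a tabular search:
#   print(length_histogram(result))
```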

Excel

You can download Excel spreadsheets directly from UniProt, just as with the browser interface.
result = service.search(query, frmt="xls")

You can't use the Excel output directly in your code without some file manipulation, and you'll have to save it to a file, as in the example below. Also, the downloaded format is not guaranteed to be current for your version of Excel, and the application may ask to repair it. But, if you want Excel output to share with/display to others, you can get it programmatically.

NOTE: the downloaded format is actually `.xlsx`, rather than the `.xls` implied by the format name.
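Saving the download could look like the sketch below. It assumes the spreadsheet arrives as binary data, and falls back to encoding the result if a string is returned instead; save_result and the output filename are hypothetical.

```python
from pathlib import Path


def save_result(result, filename):
    """Write downloaded search output to a file and return its path."""
    path = Path(filename)
    # Spreadsheet downloads are binary; encode only if a string was returned
    data = result if isinstance(result, bytes) else str(result).encode("utf-8")
    path.write_bytes(data)
    return path


# In the notebook (contacts the remote UniProt service):
#   result = service.search(query, frmt="xls")
#   save_result(result, "uniprot_results.xlsx")   # hypothetical filename
```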

FASTA sequence

If you're interested only in the FASTA format sequence for an entry, you can use the fasta option with frmt to recover the sequences directly, as in the example below:
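The FASTA output is plain text, so it can be split into individual records without extra packages. The parser below is a minimal sketch that assumes each record starts with a ">" header line followed by one or more sequence lines; parse_fasta is a hypothetical name, and a dedicated library would be a more robust choice for real analyses.

```python
def parse_fasta(text):
    """Split FASTA-format text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# In the notebook (contacts the remote UniProt service):
#   result = service.search(query, frmt="fasta")
#   sequences = parse_fasta(result)
#   print(len(sequences), "sequences downloaded")
```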

pandas dataframe

In addition to the conversion of tabular output to a pandas dataframe above, you can ask the UniProt() instance to return a pandas dataframe directly, with the .get_df() method.
result = service.get_df("tissue:venom (organism:viper OR organism:mamba)", limit=None)

However, this is slow compared to the other methods above, and can take a long time for queries with thousands of results.

This dataframe works like any other dataframe. You can get a complete list of returned columns:

Or, for instance, the number of rows and columns in the results:

and use the convenient features of a dataframe, such as built-in plotting:

and grouping/subsetting:
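The operations listed above can be sketched in one helper. The column names passed to it are assumptions: the dataframe returned by .get_df() may label its columns differently from the default tabular output, so check the printed column list first. summarise is a hypothetical name.

```python
def summarise(df, group_col, value_col):
    """Print the column names and shape, then return per-group means."""
    print(list(df.columns))   # complete list of returned columns
    print(df.shape)           # (number of rows, number of columns)
    # Group rows by one column and average another within each group
    return df.groupby(group_col)[value_col].mean()


# In the notebook, with result from service.get_df(...):
#   means = summarise(result, "Organism", "Length")   # assumed column names
#   result["Length"].hist()                           # built-in plotting
```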

Exercise 03 (10min)

Can you use bioservices, UniProt and pandas to:


  • download a dataframe for all proteins that have "rxlr" in their name
  • render a violin plot (sns.violinplot()) that shows the distribution of protein lengths grouped according to the evidence for the protein