SI_Holmes_etal_2020

Supplementary information for Holmes et al. (2020)

This repository contains files detailing the process of fitting the model of the enrichment array experiment described in Holmes et al. (2020). These files are intended to enable independent reproduction, exploration and extension of the analysis reported in the paper.

We have used the Jupyter notebook environment to facilitate literate programming, and to encourage exploration of and experimentation with the code and model. These notebooks have sections of explanatory text that are punctuated by code snippets. In the Jupyter notebook environment, all of these code snippets are editable and runnable.

Table of Contents

  1. Files and Directories
  2. Quick Start
  3. Replicating the manuscript model

Files and Directories

How to get help for the code/analyses in this repository

Please raise any issues at the GitHub issues page for this repository: https://github.com/widdowquinn/SI_Holmes_etal_2020/issues

Quick Start

We have provided three ways to work with these analysis notebooks:

1. Read the analysis notebooks on GitHub

You do not need to start a MyBinder instance, or start your own local Jupyter instance, to read these notebooks. We provide static versions of the Jupyter analysis notebooks at the following links:

2. Interact with the analysis notebooks on MyBinder

To encourage exploration and reproduction, we have tried, so far as is possible, to make these notebooks compatible with MyBinder, so that you can run them in the cloud without having to install software on your own machine. To use the notebooks in this way, click on the Binder button below.

Binder

3. Interact with the analysis notebooks on your own machine

To reproduce our work on your own machine, we recommend using a conda environment to ensure compatible dependencies and to replicate the environment used for the analysis. This environment keeps the installation of Python packages separate from your system's Python installation (if there is one), so that these analyses can be run without interfering with any previously-installed setup.

Create and start the virtual environment

NOTE: You will need to have conda installed on your system.

$ conda create -n holmes_et_al_2020 python=3.6
$ source activate holmes_et_al_2020
$ conda install --yes --file requirements.txt
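To check that the environment has been created correctly before running any of the (long) model fits, a quick sanity check along the following lines may be useful. This assumes that requirements.txt includes jupyter and pystan, which the notebooks and the Stan model fits below rely on; adjust the package names to whatever the requirements file actually lists:

$ python --version                                        # should report Python 3.6.x
$ python -c "import pystan; print(pystan.__version__)"    # Stan interface used for model fitting
$ jupyter --version                                       # notebook server used for the analyses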

Start the notebook

From the top level directory of this repository, start the Jupyter notebook server by issuing the command:

jupyter notebook

A new browser window or tab should open, containing the Jupyter homepage, which will show a listing of files and directories in the top level of the repository.

Opening the notebooks

From the Jupyter homepage in your browser window, click on the link to the notebooks/ subdirectory. Then click on the link for the notebook you wish to use. The selected notebook should then open in a new browser tab.

When they were committed to the repository, the notebooks contained output from the original runs, so they can be read and understood without needing to rerun the models. If you would like to rerun/reproduce/modify these outputs, we recommend restarting the kernel and clearing all output before beginning. This can be done by clicking on Kernel -> Restart & Clear Output, in the notebook window.
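If you would rather clear the stored outputs from the command line than through the notebook menu, nbconvert's ClearOutputPreprocessor can do this in place. The exact flags may vary with your installed nbconvert version, so treat this as a sketch:

jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace notebooks/*.ipynb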

Replicating the manuscript model

To replicate the manuscript model from scratch: start the virtual environment (in MyBinder, or on your own machine), then run the notebooks and scripts as described below (remembering to use Kernel -> Restart & Clear Output in each notebook to remove the original/existing outputs):

Data processing, QA and normalisation

Fitting the model on the full dataset

NOTE: the complete fit takes between 5 and 9 hours on my laptop (2013 MacBook Pro, 2.8GHz i7, 16GB RAM).

10-fold crossvalidation

The crossvalidation dataset construction and model fit were conducted in the multiplexing directory, using Python scripts rather than Jupyter notebooks. To reproduce the dataset construction and fits, first change directory to multiplexing:

cd multiplexing

then build the input datasets with the multiplex_data.py script:

./multiplex_data.py -v -d ../notebooks/datasets/normalised_array_data.tab \
                    -k 10 -o 10-fold_CV --seed 123456789 \
                    -l 10-fold_CV_multiplex.log

This will create a new directory called 10-fold_CV, containing one new subdirectory for each training/test split of the input dataset.
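As a quick check that the splits were built, you can list the new directory; the exact subdirectory names are chosen by multiplex_data.py, so the main thing to confirm is that one subdirectory exists per fold:

ls 10-fold_CV/    # expect one subdirectory per training/test split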

Next, use the run_multiplex_models.py script to fit the Stan model to each of the multiplexed training/test sets.

./run_multiplex_models.py -v -i 10-fold_CV --script ./run_model.py \
                          --seed 123456789 \
                          -l 10-fold_CV_run_models.log

NOTE: the run_multiplex_models.py script depends on the pysge module for submission of jobs to our local cluster. pysge is not included in the requirements.txt file, so the script will fail at this point. However, the command lines that the script writes to its log file can be copied and executed on any system available to you. If you happen to be running on a cluster with SGE scheduling, installing pysge into the virtual environment will enable the cluster to be used to fit the multiplexed models.
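Since the command lines are recorded in the log, one way to retrieve them for manual execution is to search the log for invocations of the fitting script. The exact log format is defined by run_multiplex_models.py, so the pattern below is an assumption and may need adjusting:

grep "run_model.py" 10-fold_CV_run_models.log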

Finally, use the join_multiplexed_data.py script to combine prediction output from each of the 10 test sets into a single .tab file. This will contain predictions for each of the probes from the input dataset, using the model fit to the remaining training data.

./join_multiplexed_data.py -v -i 10-fold_CV -o 10-fold_CV.tab \
                           -l 10-fold_CV_join_data.log

The combined data produced in this way can then be used as input for the notebook 03-model_validation.ipynb.

NOTE: PRNG seeds

All random processes in the model building and fitting can take a seed value for the pseudorandom number generator. For replication of the values in the paper, this seed should be set to 123456789 for all processes: