Computational Biology of Potato Pathogens

06/06/2017

1. Who, me?

My Background

I'm not really a computer scientist…

1992-1996 BSc(Hons) Forensic and Analytical Chemistry
University of Strathclyde
1995 Product Development Chemist
Mobil Oil
1996 Body Fluid Analysis
Glasgow Royal Infirmary
1996-1999 PhD Computational Biology
University of Strathclyde

1999-2003 PostDoc Systems Biology
University of Wales, Aberystwyth
2003-2011 Computational Biologist
Scottish Crop Research Institute
2005-2013 BA(Hons) Mathematics
Open University
2011-present Computational Biologist
The James Hutton Institute

Earlier work

Learning on the job…

Sequence/structure evolution of snake venom toxins
Drug target site discovery algorithm

Systems Biology
Modelling yeast metabolism
Directed evolution

Pritchard et al. (1999) J. Mol. Biol. doi:10.1007/978-1-62703-986-4_4: ET of snake venom toxins
Pritchard et al. (2000) J. Theor. Biol. doi:10.1006/jtbi.1999.1043: Hopfield network model of protein evolution
Pritchard et al. (2001) Prot. Eng. doi:10.1093/protein/14.8.549: Covariation analysis
Pritchard & Kell (2002) Eur. J. Biochem. doi:10.1046/j.1432-1033.2002.03055.x: Yeast glycolysis model

2. The James Hutton Institute

The James Hutton Institute

formed from Scottish Crop Research Institute (SCRI) and Macaulay Land Use Research Institute (MLURI) in 2011
main sites: Dundee, Aberdeen
also: Glenshaugh, Balruddery, Hartwood
oversees Biomathematics and Biostatistics Scotland (BioSS)
University of Dundee Division of Plant Sciences based at Dundee site
Scottish Government policy advice

Vision and Mission

Vision

"To be at the forefront of innovative and transformative science for sustainable management of land, crop and natural resources that supports thriving communities."

Mission

"To conduct excellent science and engage in new ways of working across disciplines, with business, policy and society, that guide contemporary thought and challenge conventional wisdom, ensure trust and deliver the best outcomes for all."

Group Objective

"To deliver greater food and environmental security through science connecting land and people"

ICS Hutton

@huttonics

3. Plant-pathogen interactions

A Global Challenge

feed 9.2bn people (86% in developing world) by 2050
double food production in 50yr
food losses to pathogens:
- 10-25% of planted crops
- 10% of post-harvest crops
- (enough to feed >2bn people at 2,000 kcal/day)
pathogens moving polewards
severe threats to ecosystem services

Fisher et al. (2012) Nature doi:10.1038/nature10947: Emerging fungal threats
"Prediction for Biological Hazards" https://www.gov.uk/government/publications/biological-hazards-prediction
Bebber et al. (2014) Global Ecol. Biogeogr. doi:10.1111/geb.12214: Global pest/pathogen distribution
Bebber et al. (2014) New Phytol. doi:10.1111/nph.12722: Global pest/pathogen distribution

Challenges in Scotland

food security
- burden, cost of crop disease
- emerging pathogens (imports, climate change)
environmental sustainability
- pesticide minimisation, withdrawal
- durable resistance via breeding (and/or GM)
£308m cereals
£258m other crops (£171m potatoes)
£264m horticulture

Outputs from Scottish Farms, 2015-16

Key Potato Pathogens

Soft rot enterobacteria

Pectobacterium spp., Dickeya spp., Erwinia spp.
many species, host ranges
plant cell wall degrading enzymes
rots plants in field, tubers in storage
€30m/yr losses in Netherlands
P. atrosepticum main cause of blackleg in Scotland
D. solani recently emerged, very aggressive

Potato late blight

Phytophthora infestans
Global problem
estimated $6bn/yr cost worldwide
UK costs ≈£55m/yr in losses, ≈£70m/yr in control
requires frequent pesticide application (incl. for organics)
(related spp. cause extensive damage to trees: larch, oak, juniper)

Science and Advice for Scottish Agriculture (SASA)

How do pathogens infect?

Dodds & Rathjen (2010) Nat. Rev. Genet. doi:10.1038/nrg2812 - plant-pathogen interaction mechanisms

How do pathogens infect?

The mechanics are complex: systems-level approaches required

Block et al. (2008) Curr. Op. Plant Biol. doi:10.1016/j.pbi.2008.06.007 - plant-bacterium interactions

Models Guide Thinking

Prevailing interaction model: "Zig-Zag"

1. plant detects pathogen (PTI)
2. pathogen produces effector (ETS)
3. plant detects effector (ETI)
model is evolutionary
not a dynamic interaction
not a specific timescale
not a specific biological scale
qualitative only

Pritchard & Birch (2011) Plant Sci. doi:10.1016/j.plantsci.2010.12.008: Systems biology of plant-pathogen interactions

Dynamic (Toy) Model

Quantitative, specific timescales and molecular interactions

Pritchard & Birch (2014) Mol. Plant Pathol. doi:10.1111/mpp.12210: Dynamic model of plant-pathogen interaction

4. Genomes I (Phytophthora and potato)

P. infestans genome

published 2009, focus on pathogen effectors

effector complement affects host range, aggression
P. sojae/P. ramorum sequences available for comparison
two-speed genome (expansion) enhances diversity in effectors

Haas et al. (2009) Nature doi:10.1038/nature08358 - Phytophthora genome

Effector classification

is a supervised (machine) learning problem

feature extraction
model construction
testing/validation
prediction

Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4 - Classifier statistics

RxLR effectors

effectors are modular (address and payload)

build binary classifier to identify effectors
- highly diverse payloads
- RxLR motif characteristic sequence
- RxLR/address motif matches pattern (HMM)

classifier trained on ≈30 examples
identifies ≈400 on the genome
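
As a rough illustration of the motif-matching step only, the sketch below scans a protein sequence for an RxLR motif (R, any residue, L, R) starting inside a window downstream of the N-terminus. The function name `find_rxlr` and the window bounds are assumptions made for this example; the classifier described here also relies on an HMM and on the address/payload structure, neither of which is modelled.

```python
import re

# Zero-width lookahead so that overlapping RxLR motifs are all reported.
RXLR_PATTERN = re.compile(r"(?=R.LR)")


def find_rxlr(protein, start=30, end=60):
    """Return 0-based start positions of RxLR motifs beginning in [start, end).

    The default window is an illustrative assumption, not a value
    taken from the slides.
    """
    return [
        match.start()
        for match in RXLR_PATTERN.finditer(protein)
        if start <= match.start() < end
    ]


# Toy sequences (not real effectors): a motif inside and outside the window.
candidate = "M" * 35 + "RSLR" + "A" * 20
non_candidate = "RSLR" + "A" * 50
print(find_rxlr(candidate), find_rxlr(non_candidate))
```

A regular expression alone is a weak classifier: short motifs occur by chance, so in practice it would be one feature among several rather than the whole model.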

Whisson et al. (2007) Nature doi:10.1038/nature06203 - RxLR identification
Haas et al. (2009) Nature doi:10.1038/nature08358 - Phytophthora genome
Boutemy et al. (2011) J. Biol. Chem. doi:10.1074/jbc.M111.262303 - RxLR structure

Genome-scale predictions

Whisson et al. (2007) Nature doi:10.1038/nature06203 - RxLR identification
Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4 - Classifier statistics

Potato genome

published 2011, focus on R-genes (NB-LRR)

effector-detecting proteins (ETI): modular, several subclasses
domain-based model to predict and classify

Jupe et al. (2012) BMC Genomics doi:10.1186/1471-2164-13-75 - NB-LRR predictions

NB-LRR predictions

≈488 candidates identified, 366 placed on genome

53 positives (NB-LRR), nucleotide-binding negatives $\rightarrow$ Sn=1, FPR=0
MEME $\rightarrow$ 20 characteristic domains; MAST search of genome
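
For reference, the two statistics quoted above (Sn and FPR) can be computed from confusion-matrix counts as in this minimal sketch. The 53 positives come from the slide; the number of negatives (100) is a placeholder, since the slide does not state it.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positives that the classifier recovers (Sn)."""
    return true_pos / (true_pos + false_neg)


def false_positive_rate(false_pos, true_neg):
    """Fraction of real negatives wrongly called positive (FPR)."""
    return false_pos / (false_pos + true_neg)


# 53 known NB-LRRs all recovered; no negatives called positive.
# The negative count (100) is hypothetical.
print(sensitivity(53, 0), false_positive_rate(0, 100))
```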

Jupe et al. (2012) BMC Genomics doi:10.1186/1471-2164-13-75 - NB-LRR predictions

R-gene enrichment

commercial gene enrichment bead 'array', based on NB-LRR model

"physical BLAST search"
design 'bait' from model
capture gDNA with 'bait'
sequence, assemble/map to genome

identified 338 additional candidate NB-LRRs

engineering durable resistance
R-gene 'stacking' to evade adaptable pathogens

Jupe et al. (2013) Plant J. doi:10.1111/tpj.12307 - NB-LRR enrichment sequencing

5. Genomes II (bacteria)

riboSeed

NUI Galway
Teagasc
Nick Waters
Fiona Brennan
Florence Abram
Ashleigh Holmes
part of larger project on environmental E. coli

Short Reads and Repeats

most genome sequencing is (for now) short read Illumina
reads shorter than repeat sequence $\implies$ repeats not resolved

16S Metabarcoding

16S central to microbial ecology (microbiome)

universal, essential structural RNA - low variation
all 16S metabarcoding methods rely on reference databases

bacteria commonly have several repeated very similar copies
16S does not assemble well: most published genomes collapse 16S

rDNA differences

rDNA flanking regions differ between clusters within a genome, but are conserved between related genomes

identify clusters from related bacteria
treat clusters in isolation
'pin' unique reads to the corresponding reference cluster
distribute common reads among all clusters

riboSeed Algorithm Sketch

reference \(\leftarrow\) closely-related genome
while iters < N
    map reads to reference
    clusters \(\leftarrow\) reference rDNA clusters
    pseudocontigs \(\leftarrow\) {}
    for c in clusters
        extract reads mapping to cluster and flanking region
        hybrid assemble extracted reads with cluster
        add assembly to pseudocontigs
    reference \(\leftarrow\) joined pseudocontigs
    iters++
use pseudocontigs as 'trusted long reads' in hybrid assembly
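
The control flow of the sketch above can be written as a short Python function. This is not riboSeed's actual API: the function name and the four callables (`find_clusters`, `extract_reads`, `assemble`, `join`) are hypothetical stand-ins for the external annotation, mapping and assembly tools, passed in so that the loop itself can be run and tested.

```python
def seed_pseudocontigs(reads, reference, n_iters,
                       find_clusters, extract_reads, assemble, join):
    """Iteratively re-assemble each rDNA cluster; return the last pseudocontigs.

    All four callables are hypothetical placeholders for external tools.
    """
    pseudocontigs = []
    for _ in range(n_iters):
        pseudocontigs = []
        for cluster in find_clusters(reference):
            # Reads that map to this cluster and its flanking region only.
            subset = extract_reads(reads, reference, cluster)
            # Assemble the subset, using the cluster sequence as a guide.
            pseudocontigs.append(assemble(subset, cluster))
        # The joined pseudocontigs become the reference for the next pass.
        reference = join(pseudocontigs)
    return pseudocontigs


# Toy stand-ins so the loop can be exercised without any real tools.
toy_reads = ["aX", "bX", "aY"]
result = seed_pseudocontigs(
    toy_reads, "ref", 2,
    find_clusters=lambda ref: ["a", "b"],
    extract_reads=lambda reads, ref, c: [r for r in reads if r.startswith(c)],
    assemble=lambda subset, c: f"{c}:{len(subset)}",
    join=lambda contigs: "|".join(contigs),
)
print(result)
```

The final hybrid assembly, with the pseudocontigs supplied as 'trusted long reads', is left to the caller.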

riboSeed Graphical Overview

Simulated Reads

artificial chromosome: spaced E. coli rDNA clusters

de novo assembles 0/7 clusters
de fere novo assembles 4/7 clusters (E. coli reference)
de fere novo assembles 1/7 clusters (Klebsiella reference)

Reference Genome Choice

most rDNA clusters assembled when ref. mutation rate $\leq$ 0.03

Real Data Performance

hybrid (PacBio/Illumina assembly reference)

Illumina-only de novo asm: 1/4; de fere novo asm: 4/4

Closed scaffold of Illumina-only Staphylococcus aureus UAM-1 draft genome

future directions

HMM profile for homologous rDNA, rather than reference genome 'bait'
assemble SRA/ENA public datasets

riboSeed

riboSeed on GitHub: https://nickp60.github.io/riboSeed/

6. Diagnostics

Diagnostics & Classification

Scottish Government-funded
Sonia Humphris
Ian Toth
Emma Campbell

SRE Global Distribution

SRE Taxonomy

enterobacterial taxonomy is difficult to resolve, in general

Historical classification mostly phenotypic, polyphasic

all SRE originally Erwinia (1950s - bucket classification)
now three distinct genera (Erwinia, Dickeya, Pectobacterium)
binomial nomenclature not designed for large amounts of genome data, or metadata curation

old names hold over in literature, collections
name discontinuities affect analyses, databases

Czajkowski et al. (2015) Ann. Appl. Biol. doi:10.1111/aab.12166 - SRE diagnostics

Historical Taxonomy

Erwinia revised ≈1953 (Pectobacterium), ≈2003-2005 (Dickeya)

Gardan et al. (2003) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.02423-0 - SRE reclassification
Samson et al. (2005) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.02791-0 - SRE reclassification

Legislation is Taxonomy-Based

European and Mediterranean Plant Protection Organisation (EPPO)

member states should regulate D. dianthicola and E. amylovora as quarantine pests (A2 list)

Seed Potatoes (Scotland) Amendment Regulations (2010)

zero-tolerance policy for all Dickeya spp. on potatoes in Scotland, to ensure production of 'clean' (disease-free) seed potatoes for export

consortium for control and epidemiology

Taxonomy For Legislation and Policy

Easy to incorporate into legislation (binary classification)

Assumes taxonomy is assigned precisely and correctly
Assumes taxonomy is a proxy for risk

Essentially a data structure problem!

Historical phenotypic classification is semi-arbitrary (≈binary tree)

Is a species concept appropriate for bacteria? (No)
Is disease risk only transferred parent to offspring? (No: HGT)
Is there 1:1 mapping from genome/phenotype to disease? (No)
Are all currently known bacteria correctly classified? (No)

Toth et al. (2006) Annu. Rev. Phytopathol. doi:10.1146/annurev.phyto.44.070505.143444 - horizontal transfer of pathogenicity
Deans et al. (2015) PLoS Biol. doi:10.1371/journal.pbio.1002033 - no 1:1 map from taxonomy to disease
Pritchard et al. (2015) Anal. Methods doi:10.1039/c5ay02550h: Bacterial pathogen classification for policy

Dickeya diagnostics

Having recently sequenced 25 Dickeya genomes, we were asked to develop new diagnostics

To legislate effectively, must discriminate and identify the pathogen

MLST/16S/qPCR schemes exist, trained on 'old' classifications
- limited resolution, considerable natural variation
MLST/16S/qPCR are small parts (<5kbp) of a larger genome (≈5Mbp)
- good choice for one group, maybe not for another
qPCR is cheaper than WGS (for now)
no qPCR primers existed to distinguish among Dickeya spp.

Pritchard et al. (2013) Plant Path. doi:10.1111/j.1365-3059.2012.02678.x - Dickeya diagnostics
Czajkowski et al. (2015) Ann. Appl. Biol. doi:10.1111/aab.12166 - SRE diagnostics

Whole-Genome Diagnostics Design

Bulk-predict primers on all genomes (Primer3)
Predict cross-amplification in silico - intensive, parallelised
Evaluate in vitro against panel of unseen isolates
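
The in silico cross-amplification step can be illustrated with a deliberately naive check: a primer pair 'amplifies' a template if both primers match exactly, in the right orientation, within a maximum amplicon length. This is a sketch under strong assumptions (exact matching, uppercase ACGT only, hypothetical function names and a placeholder 500 bp limit); a real pipeline would tolerate mismatches and use a dedicated tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def amplifies(forward, reverse, template, max_len=500):
    """True if the pair gives an exact-match amplicon of at most max_len."""
    reverse_site = revcomp(reverse)
    for strand in (template, revcomp(template)):
        start = strand.find(forward)
        while start != -1:
            # The reverse primer binds the opposite strand, so its reverse
            # complement must appear downstream of the forward primer here.
            site = strand.find(reverse_site, start + len(forward))
            if site != -1 and site + len(reverse) - start <= max_len:
                return True
            start = strand.find(forward, start + 1)
    return False


def is_diagnostic(forward, reverse, targets, non_targets):
    """True if the pair amplifies every target and no non-target."""
    return (all(amplifies(forward, reverse, t) for t in targets)
            and not any(amplifies(forward, reverse, n) for n in non_targets))


# Toy templates: the non-target differs by one base in the reverse site.
target = "GGAAACCCTTTTTGGGTTACC"
non_target = "GGAAACCCTTTTTGGGTCACC"
print(is_diagnostic("AAACCC", "TAACCC", [target], [non_target]))
```

Each primer pair must be checked against every genome, which is why this step is computationally intensive and worth parallelising.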

Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498 - bulk diagnostic qPCR primer design (E. coli)

Classification is a Problem!

The first design run could not predict diagnostic primers!

incorrect classifications in public databases 'poisoned' the training set

Pritchard et al. (2013) Plant Path. doi:10.1111/j.1365-3059.2012.02678.x - bulk diagnostic qPCR primer design (SRE)

Consequences of Misclassification

Real-world impacts of misclassification

False positives (type I errors):
- clean samples rejected: economic cost
- farms quarantined/closed: economic/societal cost
False negatives (type II errors):
- (irreversible) introduction of infectious material
- potential for novel host jumps and spread

≈18% of genomes in public databases misclassified by species

accurate classifications essential for diagnostics training
reclassification of genomes in public databases necessary

Pritchard et al. (2016) Anal. Methods doi:10.1039/c5ay02550h - SRE classification
Varghese et al. (2015) Nucl. Acids Res. doi:10.1093/nar/gkv657 - Misclassification in public DBs

Successful Design

Designed primers that discriminate at species level across Dickeya

since used across Europe (fields condemned)

Also designed primers that discriminate RxLR variants (population surveys)

Pritchard et al. (2013) Plant Path. doi:10.1111/j.1365-3059.2012.02678.x - bulk diagnostic qPCR primer design (SRE)

Precise Classification

Designed primers at subserotype level for E. coli O104:H4 outbreak

needs two primers for discrimination
distinguishes historical O104:H4 from 2011 outbreak

Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498 - bulk diagnostic qPCR primer design (E. coli)

A New Outbreak Workflow

Genomics has transformed outbreak detection and prediction

http://www.globalmicrobialidentifier.org/ - Global Microbial Identifier Initiative

7. Whole-genome classification

DNA-DNA hybridisation

"Gold standard" whole-genome classification since 1960s

"70% identity" ≈ same species
denature DNA from two organisms
allow DNA to anneal, measure temperature change
replace with whole-genome comparisons?

Average Nucleotide Identity (ANIm)

Whole-genome sequence replacement for DDH

align genomes
calculate mean %identity of all homologous regions
"70% identity" (DDH) ≈ 95% identity (ANIm)

insensitive to dataset composition (unlike clustering)
approximate limiting case of MLST/MLSA/multigene comparisons
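
A minimal sketch of the calculation, assuming the whole-genome alignment has already been reduced to non-overlapping aligned blocks, each described by its length and its number of mismatching positions. The length-weighted mean and the function names are assumptions for illustration, not a description of any particular tool's internals.

```python
def ani_and_coverage(blocks, genome_length):
    """Return (identity, coverage) as fractions between 0 and 1.

    blocks: list of (aligned_length, mismatches) tuples for
    non-overlapping aligned regions of one genome against another.
    """
    blocks = list(blocks)
    aligned = sum(length for length, _ in blocks)
    if aligned == 0:
        return 0.0, 0.0
    errors = sum(mismatches for _, mismatches in blocks)
    # Length-weighted mean identity across all aligned regions.
    identity = 1 - errors / aligned
    # Fraction of the genome that took part in any alignment.
    coverage = aligned / genome_length
    return identity, coverage


def same_species(identity, threshold=0.95):
    """Apply the ~95% ANIm species threshold quoted above."""
    return identity >= threshold


# Toy numbers: two aligned blocks from a 5,000 bp 'genome'.
identity, coverage = ani_and_coverage([(1000, 20), (3000, 60)], 5000)
print(identity, coverage, same_species(identity))
```

Reporting coverage alongside identity matters: a high identity over a small aligned fraction says little about the genomes as a whole.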

Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0 - ANI method
Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106 - ANIm method, JSpecies tool

`pyani`

Python package and scripts for ANI

available on PyPI
ANIm, ANIb etc.
calculates, visualises, (soon) classifies
parallelises under SGE/OGE

http://widdowquinn.github.io/pyani/
Pritchard et al. (2015) Anal. Methods doi:10.1039/C5AY02550H - pyani used on SRE

ANIm: Dickeya

ANIm %ID indicates reclassification of Dickeya.

red blocks on diagonal indicate >95% identity
9 species-level groups
2 novel species
Correctly places species misidentified in GenBank

Pritchard et al. (2015) Anal. Methods doi:10.1039/C5AY02550H - proposed Dickeya reclassification

ANIm: Pectobacterium

ANIm %ID indicates reclassification of Pectobacterium.

red blocks on diagonal indicate >95% identity
10 species-level groups
4 novel species
P. carotovorum split
P. wasabiae split

Pritchard et al. (2015) Anal. Methods doi:10.1039/C5AY02550H - proposed Pectobacterium reclassification
Faure et al. (2016) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijsem.0.001524 - Reclassification of P. wasabiae

ANI Criticisms

95% identity threshold is arbitrary
not a phylogenetic relationship ('just' sequence similarity)
not a functional interpretation of disease risk

ANI considers 'homologous' regions (variable definition)
is this misled by lateral gene transfer?
is homology a good proxy for disease risk?

ANIm: Pectobacterium

ANIm %coverage: all OK for Pectobacterium spp.

red blocks indicate >50% coverage
all isolates align over >50% of genome

Pritchard et al. (2015) Anal. Methods doi:10.1039/C5AY02550H - ANI applied to Pectobacterium

ANIm: Dickeya

ANIm %coverage highlights an issue

red blocks indicate >50% coverage
not all isolates align over 50% of genome
two outlier species: different genera?

Pritchard et al. (2015) Anal. Methods doi:10.1039/C5AY02550H - ANI applied to Dickeya

Whole Genome Classification

Increasing interest in whole-genome classification

genome ≈ almost all hereditary information
binomial nomenclature doesn't reflect bacterial evolution well
- network, not a tree
- large within-species genomic variation (pangenomes)
- sensitive to input dataset (does not scale)

McInerney et al. (2017) Nature Microbiol. doi:10.1038/nmicrobiol.2017.40 - bacterial pangenomes
Baltrus (2016) Trends Microbiol. doi:10.1016/j.tim.2016.02.004 - proposed WGS reclassification

ANI Results Define Graphs

ANIm of all sequenced SRE genomes. Edges > 50% coverage

three main groups (genera)

Cliques

cliques (complete subgraphs: every pair of nodes is connected) are 'natural' clusterings

clique membership varies with ANI %identity
clique membership (at given %ID) is permanent and scales

at some %identity values, all graph components are cliques
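
The last point can be checked directly: keep only edges at or above a given %identity, find the connected components, and test whether every component is a clique. The sketch below uses plain Python and hypothetical genome labels and identity values; it assumes the pairwise ANI results are available as a dictionary keyed by genome pairs.

```python
from itertools import combinations


def all_components_are_cliques(identities, threshold):
    """identities: dict mapping (genome_a, genome_b) -> ANI %identity."""
    nodes = sorted({genome for pair in identities for genome in pair})
    adjacency = {node: set() for node in nodes}
    for (a, b), identity in identities.items():
        if identity >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    for node in nodes:
        if node in seen:
            continue
        # Depth-first search to collect this connected component.
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current not in component:
                component.add(current)
                stack.extend(adjacency[current] - component)
        seen |= component
        # A clique needs an edge between every pair of its members.
        if any(b not in adjacency[a] for a, b in combinations(component, 2)):
            return False
    return True


# Hypothetical pairwise identities for five genomes.
toy = {("A", "B"): 98, ("B", "C"): 96, ("A", "C"): 93, ("D", "E"): 99}
print([all_components_are_cliques(toy, t) for t in (90, 95, 97)])
```

In this toy example the components are all cliques at 90% and 97% but not at 95%, where A and C are linked only through B.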

Network Deconstruction

Reclassification: Pectobacterium

Faure et al. (2016) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijsem.0.001524 - Reclassification of P. wasabiae

Reclassification: Dickeya

Reclassification: Erwinia

JHI Collections

JHI holds historical pathogen samples from 1950s onwards

sequence historical samples
sequence current/recent outbreak samples
environmental variation
historical changes (evolution, introduction events)

Sequenced ≈50 P. atrosepticum isolates from infections (2009-2015)

sourced from SASA
Illumina sequencing
prokka, roary, QUAST, parSNP

SNP distribution

SNPs widespread across all P. atrosepticum genomes

`parSNP` Clades

P. atrosepticum divisible into four clades

clades contain isolates found outwith UK/EU (from GenBank)

Distribution

all four clades widespread, no obvious geographical pattern

Pangenomes

15% of P. atrosepticum genes are 'accessory'
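
One way to arrive at a figure like this is from a gene presence/absence table, counting as 'accessory' any gene family missing from at least one genome. The strict 'present in all genomes' definition of core and the function name are assumptions for this sketch; pangenome tools often use a softer threshold, and the slide does not say which definition was used.

```python
def accessory_fraction(presence):
    """presence: dict mapping gene family -> set of genomes carrying it.

    Returns the fraction of gene families absent from at least one genome.
    """
    if not presence:
        return 0.0
    genomes = set().union(*presence.values())
    core = sum(1 for carriers in presence.values() if carriers == genomes)
    return 1 - core / len(presence)


# Hypothetical table: two core and two accessory gene families.
toy = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "c"},
    "g3": {"a"},
    "g4": {"b", "c"},
}
print(accessory_fraction(toy))
```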

Accessory genes

accessory gene tree not congruent with the SNP tree

8. Acknowledgements

Without Whom…
